Update maintenance documentation

Philip Hazel 2022-04-25 15:07:14 +01:00
parent f65df06305
commit 104fe2fead
2 changed files with 56 additions and 61 deletions

HACKING

@ -8,8 +8,8 @@ library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file. the pcre2test documentation and the comment at the head of the RunTest file.
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid releases carried on the 8.xx series, up to the final 8.45 release. PCRE2
confusion with PCRE1. releases started at 10.00 to avoid confusion with PCRE1.
Historical note 1 Historical note 1
@ -38,8 +38,8 @@ Historical note 2
By contrast, the code originally written by Henry Spencer (which was By contrast, the code originally written by Henry Spencer (which was
subsequently heavily modified for Perl) compiles the expression twice: once in subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version probably doesn't do this any more; I'm talking about real. (The Perl version may or may not still do this; I'm talking about the
the original library.) The execution function operates by backtracking and original library.) The execution function operates by backtracking and
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology. Friedl's terminology.
@ -151,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
advance to check for such values. When auto-callouts are enabled, the generous advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one (which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the system stack, but at the end. A default parsed pattern vector is defined on the system stack, to
if this is not big enough, heap memory is used. minimize memory handling, but if this is not big enough, heap memory is used.
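The stack-first, heap-fallback arrangement described above can be sketched roughly as follows. This is only an illustration of the general idiom, not the PCRE2 code: the names ex_with_vector and EX_STACK_ELEMS, and the size 1024, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical default size; the real value is not given in the text. */
#define EX_STACK_ELEMS 1024

/* Use a vector on the system stack when it is big enough, otherwise fall
   back to heap memory. Returns 0 on success, -1 if allocation fails.
   *used_heap is set to 1 when the heap was needed. */
int ex_with_vector(size_t needed, int *used_heap)
{
  uint32_t stack_vector[EX_STACK_ELEMS];
  uint32_t *vector = stack_vector;
  size_t i;

  assert(used_heap != NULL);
  *used_heap = 0;

  if (needed > EX_STACK_ELEMS)
    {
    vector = malloc(needed * sizeof(uint32_t));
    if (vector == NULL) return -1;
    *used_heap = 1;
    }

  /* Stand-in for the real work of filling the parsed pattern. */
  for (i = 0; i < needed; i++) vector[i] = 0;

  /* Only the heap block needs releasing. */
  if (vector != stack_vector) free(vector);
  return 0;
}
```

The point of the design is that short patterns, which are the common case, never touch the allocator at all.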
As before, the actual compiling function is run twice, the first time to As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It determine the amount of memory needed for the final compiled pattern. It
@ -187,7 +187,7 @@ META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT [^] negative empty class - ditto META_CLASS_EMPTY_NOT [^] negative empty class - ditto
META_CLASS_END ] end of non-empty class META_CLASS_END ] end of non-empty class
META_CLASS_NOT [^ start non-empty negative class META_CLASS_NOT [^ start non-empty negative class
META_COMMIT (*COMMIT) META_COMMIT (*COMMIT) - no argument (see below for with argument)
META_COND_ASSERT (?(?assertion) META_COND_ASSERT (?(?assertion)
META_DOLLAR $ metacharacter META_DOLLAR $ metacharacter
META_DOT . metacharacter META_DOT . metacharacter
@ -201,14 +201,14 @@ META_NOCAPTURE (?: no capture parens
META_PLUS + META_PLUS +
META_PLUS_PLUS ++ META_PLUS_PLUS ++
META_PLUS_QUERY +? META_PLUS_QUERY +?
META_PRUNE (*PRUNE) - no argument META_PRUNE (*PRUNE) - no argument (see below for with argument)
META_QUERY ? META_QUERY ?
META_QUERY_PLUS ?+ META_QUERY_PLUS ?+
META_QUERY_QUERY ?? META_QUERY_QUERY ??
META_RANGE_ESCAPED hyphen in class range with at least one escape META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument META_SKIP (*SKIP) - no argument (see below for with argument)
META_THEN (*THEN) - no argument META_THEN (*THEN) - no argument (see below for with argument)
The two RANGE values occur only in character classes. They are positioned The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC between two literals that define the start and end of the range. In an EBCDIC
@ -229,7 +229,8 @@ If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the length of its branch, for which OP_REVERSE must be generated. is the length of its branch, for which OP_REVERSE must be generated.
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element. their data in the lower 16 bits of the element. META_RECURSE is followed by an
offset, for use in error messages.
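To make "data in the lower 16 bits of the element" concrete, here is a small sketch of packing and unpacking a 32-bit element. It assumes that the META code occupies the upper 16 bits; the macro names and the code value are made up for the example and are not taken from the PCRE2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical code value and accessors, for illustration only. */
#define EX_META_CAPTURE  0x80080000u
#define EX_META_CODE(x)  ((x) & 0xffff0000u)
#define EX_META_DATA(x)  ((x) & 0x0000ffffu)

/* Combine a META code with a capture group number. The group number must
   fit in 16 bits because it shares the element with the code. */
uint32_t ex_make_element(uint32_t code, uint32_t group)
{
  assert(EX_META_DATA(code) == 0);
  assert(group <= 0xffffu);
  return code | group;
}
```

With these definitions, ex_make_element(EX_META_CAPTURE, 12) gives an element from which EX_META_DATA() recovers 12 and EX_META_CODE() recovers EX_META_CAPTURE.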
META_BACKREF is followed by an offset if the back reference group number is 10 META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose or more. The offsets of the first occurrences of references to groups whose
@ -238,8 +239,6 @@ occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error occurs pattern elements for items such as \3. The offset is used when an error occurs
because the reference is to a non-existent group. because the reference is to a non-existent group.
META_RECURSE is always followed by an offset, for use in error messages.
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
element contains the 16-bit type and data property values, packed together. element contains the 16-bit type and data property values, packed together.
ESC_g and ESC_k are used only for named references - numerical ones are turned ESC_g and ESC_k are used only for named references - numerical ones are turned
@ -291,9 +290,9 @@ META_LOOKBEHIND (?<= start of lookbehind
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
META_LOOKBEHINDNOT (?<! start of negative lookbehind META_LOOKBEHINDNOT (?<! start of negative lookbehind
The following are followed by two elements, the minimum and maximum. Repeat The following are followed by two elements, the minimum and maximum. The
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is maximum value is limited to 65535 (MAX_REPEAT). A maximum value of "unlimited"
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT: is represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
META_MINMAX {n,m} repeat META_MINMAX {n,m} repeat
META_MINMAX_PLUS {n,m}+ repeat META_MINMAX_PLUS {n,m}+ repeat
@ -347,11 +346,11 @@ support is not available for this kind of matching.
Changeable options Changeable options
------------------ ------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
others) may be changed in the middle of patterns by items such as (?i). Their some others may be changed in the middle of patterns by items such as (?i).
processing is handled entirely at compile time by generating different opcodes Their processing is handled entirely at compile time by generating different
for the different settings. The runtime functions do not need to keep track of opcodes for the different settings. The runtime functions do not need to keep
an option's state. track of an option's state.
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled are tracked and processed during the parsing pre-pass. The others are handled
@ -437,7 +436,7 @@ Backtracking control verbs
-------------------------- --------------------------
Verbs with no arguments generate opcodes with no following data (as listed Verbs with no arguments generate opcodes with no following data (as listed
in the section above). in the section above).
(*MARK:NAME) generates OP_MARK followed by the mark name, preceded by a (*MARK:NAME) generates OP_MARK followed by the mark name, preceded by a
length in one code unit, and followed by a binary zero. The name length is length in one code unit, and followed by a binary zero. The name length is
@ -468,8 +467,8 @@ Caseless matching (positive or negative) of characters that have more than two
case-equivalent code points (which is possible only in UTF mode) is handled by case-equivalent code points (which is possible only in UTF mode) is handled by
compiling a Unicode property item (see below), with the pseudo-property compiling a Unicode property item (see below), with the pseudo-property
PT_CLIST. The value of this property is an offset in a vector called PT_CLIST. The value of this property is an offset in a vector called
"ucd_caseless_sets" which identifies the start of a short list of equivalent "ucd_caseless_sets" which identifies the start of a short list of case
characters, terminated by the value NOTACHAR (0xffffffff). equivalent characters, terminated by the value NOTACHAR (0xffffffff).
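A list of this shape can be scanned as in the sketch below. The table here is a tiny hypothetical stand-in for "ucd_caseless_sets" holding two well-known sets (K, k, KELVIN SIGN and S, s, LONG S); the real table's layout and contents may differ, and the function name is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NOTACHAR 0xffffffffu

/* Hypothetical stand-in table: lists are stored end to end, each one
   terminated by EX_NOTACHAR. */
static const uint32_t ex_caseless_sets[] = {
  0x004b, 0x006b, 0x212a, EX_NOTACHAR,   /* offset 0: K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, EX_NOTACHAR    /* offset 4: S, s, LONG S */
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
int ex_in_caseless_set(size_t offset, uint32_t c)
{
  const uint32_t *p;
  assert(offset < sizeof(ex_caseless_sets) / sizeof(ex_caseless_sets[0]));
  for (p = ex_caseless_sets + offset; *p != EX_NOTACHAR; p++)
    if (*p == c) return 1;
  return 0;
}
```

Because the property value is just an offset, one compiled item covers every member of the set without needing a separate opcode per character.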
Repeating single characters Repeating single characters
@ -546,9 +545,9 @@ Each is followed by two code units that encode the desired property as a type
and a value. The types are a set of #defines of the form PT_xxx, and the values and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file. are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular The value is relevant only for PT_GC (General Category), PT_PC (Particular
Category), PT_SC (Script), PT_BIDICL (Bidi Class), and the pseudo-property Category), PT_SC (Script), PT_BIDICL (Bidi Class), PT_BOOL (Boolean property),
PT_CLIST, which is used to identify a list of case-equivalent characters when and the pseudo-property PT_CLIST, which is used to identify a list of
there are three or more. case-equivalent characters when there are three or more (see above).
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@ -666,9 +665,9 @@ a count that immediately follows the offset.
There are several opcodes that mark the end of a subpattern group. OP_KET is There are several opcodes that mark the end of a subpattern group. OP_KET is
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
OP_KETRMAX are used for indefinite repetitions, minimally or maximally OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively, and OP_KETRPOS for possessive repetitions (see below for more respectively, and OP_KETRPOS for possessive repetitions (see below for more
details). All four are followed by a LINK_SIZE value giving (as a positive details). All four are followed by a LINK_SIZE value giving (as a positive
number) the offset back to the matching bracket opcode. number) the offset back to the matching opening bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@ -719,7 +718,7 @@ Assertions
Forward assertions are also just like other subpatterns, but starting with one Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK, OP_ASSERT_NOT. Backward assertions use the opcodes OP_ASSERTBACK,
OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the OP_ASSERTBACK_NA, and OP_ASSERTBACK_NOT, and the first opcode inside the
assertion is OP_REVERSE, followed by a count of the number of characters to assertion is OP_REVERSE, followed by a count of the number of characters to
move back the pointer in the subject string. In ASCII or UTF-32 mode, the count move back the pointer in the subject string. In ASCII or UTF-32 mode, the count
@ -828,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
opcode are the correct length, in order to catch updating errors. opcode are the correct length, in order to catch updating errors.
Philip Hazel Philip Hazel
December 2021 April 2022

View File

@ -78,9 +78,9 @@ utf8.c
A short, freestanding C program for converting a Unicode code point into a A short, freestanding C program for converting a Unicode code point into a
sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes. hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
If its argument is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it
treats them as a UTF-8 character and outputs the equivalent code point in treats them as a UTF-8 string and outputs the equivalent code points in hex.
hex. See comments at its head for details. See comments at its head for details.
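For readers who want to check the example by hand, the code point to UTF-8 direction can be sketched as below. This is not the code of utf8.c; it is a minimal version limited to the standard range up to 0x10ffff, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 and return the number of bytes
   written to out (1 to 4). Surrogates and values above 0x10ffff are
   rejected by the assertion; that restriction is a choice made for this
   sketch. */
int ex_utf8_encode(uint32_t cp, uint8_t out[4])
{
  assert(cp <= 0x10ffffu && (cp < 0xd800u || cp > 0xdfffu));

  if (cp < 0x80u)
    {
    out[0] = (uint8_t)cp;
    return 1;
    }
  if (cp < 0x800u)
    {
    out[0] = (uint8_t)(0xc0u | (cp >> 6));
    out[1] = (uint8_t)(0x80u | (cp & 0x3fu));
    return 2;
    }
  if (cp < 0x10000u)
    {
    out[0] = (uint8_t)(0xe0u | (cp >> 12));
    out[1] = (uint8_t)(0x80u | ((cp >> 6) & 0x3fu));
    out[2] = (uint8_t)(0x80u | (cp & 0x3fu));
    return 3;
    }
  out[0] = (uint8_t)(0xf0u | (cp >> 18));
  out[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3fu));
  out[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3fu));
  out[3] = (uint8_t)(0x80u | (cp & 0x3fu));
  return 4;
}
```

Encoding 0x1234 this way yields the three bytes e1 88 b4, and 0x12 yields the single byte 12, which is why the string 12e188b4 decodes to the two code points 0x12 and 0x1234.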
Updating to a new Unicode release Updating to a new Unicode release
@ -94,8 +94,9 @@ directory.
Note: Previously, it was necessary to update lists of scripts and their Note: Previously, it was necessary to update lists of scripts and their
abbreviations by hand before running the Python scripts. This is no longer abbreviations by hand before running the Python scripts. This is no longer
necessary because the scripts have been upgraded to extract this information necessary because the scripts have been upgraded to extract this information
themselves. Also, there used to be explicit lists of script in two of the man themselves. Also, there used to be explicit lists of scripts in two of the man
pages. This is no longer the case. pages. This is no longer the case; the pcre2test program can now output a list
of supported scripts.
You can give an output file name as an argument to the following scripts, but You can give an output file name as an argument to the following scripts, but
by default: by default:
@ -129,8 +130,8 @@ files should eventually be installed in the main testdata directory.
Preparing for a PCRE2 release Preparing for a PCRE2 release
============================= =============================
This section contains a checklist of things that I consult before building a This section contains a checklist of things that I do before building a new
distribution for a new release. release.
. Ensure that the version number and version date are correct in configure.ac. . Ensure that the version number and version date are correct in configure.ac.
@ -139,17 +140,16 @@ distribution for a new release.
. If new build options or new source files have been added, ensure that they . If new build options or new source files have been added, ensure that they
are added to the CMake files as well as to the autoconf files. The relevant are added to the CMake files as well as to the autoconf files. The relevant
files are CMakeLists.txt and config-cmake.h.in. After making a release files are CMakeLists.txt and config-cmake.h.in. After making a release, test
tarball, test it out with CMake if there have been changes here. it out with CMake if there have been changes here.
. Run ./autogen.sh to ensure everything is up-to-date. . Run ./autogen.sh to ensure everything is up-to-date.
. Compile and test with many different config options, and combinations of . Compile and test with many different config options, and combinations of
options. Also, test with valgrind by running "RunTest valgrind" and options. Also, test with valgrind by running "RunTest valgrind" and
"RunGrepTest valgrind" (which takes quite a long time). The script "RunGrepTest valgrind". The script maint/ManyConfigTests now encapsulates
maint/ManyConfigTests now encapsulates this testing. It runs tests with this testing. It runs tests with different configurations, and it also runs
different configurations, and it also runs some of them with valgrind, all of some of them with valgrind, all of which can take quite some time.
which can take quite some time.
. Run tests in both 32-bit and 64-bit environments if possible. I can no longer . Run tests in both 32-bit and 64-bit environments if possible. I can no longer
run 32-bit tests. run 32-bit tests.
@ -164,7 +164,8 @@ distribution for a new release.
-fsanitize=signed-integer-overflow -fsanitize=signed-integer-overflow
. Do a test build using CMake. Remove src/config.h first, lest it override the . Do a test build using CMake. Remove src/config.h first, lest it override the
version that CMake creates. Do NOT use parallel make. version that CMake creates. Also do a CMake unity build to check that it
still works: [c]cmake -DCMAKE_UNITY_BUILD=ON sets up a unity build.
. Run perltest.sh on the test data for tests 1 and 4. The output should match . Run perltest.sh on the test data for tests 1 and 4. The output should match
the PCRE2 test output, apart from the version identification at the start of the PCRE2 test output, apart from the version identification at the start of
@ -183,11 +184,12 @@ distribution for a new release.
systems. For example, on Solaris it is helpful to test using Sun's cc systems. For example, on Solaris it is helpful to test using Sun's cc
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size 64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
for test 2. Since I retired I can no longer do much of this, but instead I for test 2. Since I retired I can no longer do much of this. There are
rely on putting out release candidates for testing by the community. automated tests under Ubuntu, Alpine, and Windows that are now set up as
GitHub actions. Check that they are running clean.
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing . The buildbots at http://buildfarm.opencsw.org/ do some automated testing
of PCRE2 and should be checked before putting out a release. of PCRE2 and should also be checked before putting out a release.
Updating version info for libtool Updating version info for libtool
@ -243,10 +245,11 @@ it reports them and then aborts. Otherwise it removes trailing spaces from
sources and refreshes the HTML documentation. Update the GitHub repository with sources and refreshes the HTML documentation. Update the GitHub repository with
"git push". "git push".
Once PrepareRelease has run clean, run "make distcheck" to create the tarball Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. I then sign these files. Double-check with "git status" that and the zipball. I then sign these files. Double-check with "git status" that
the repository is fully up-to-date, then create a new tag on GitHub. Upload the the repository is fully up-to-date, then create a new tag and a release on
tarball, zipball, and the signatures as "assets" of the GitHub release. GitHub. Upload the tarballs, zipball, and the signatures as "assets" of the
GitHub release.
When the new release is out, don't forget to tell webmaster@pcre.org and the When the new release is out, don't forget to tell webmaster@pcre.org and the
mailing list. mailing list.
@ -365,8 +368,6 @@ years.
See Unicode TR 29. The last two are very much aimed at natural language. See Unicode TR 29. The last two are very much aimed at natural language.
. (?[...]) extended classes: big project.
. Allow a callout to specify a number of characters to skip. This can be done . Allow a callout to specify a number of characters to skip. This can be done
compatibly via an extra callout field. compatibly via an extra callout field.
@ -436,13 +437,8 @@ years.
with lookarounds for \b and \B. Ideally the setting should last till the end with lookarounds for \b and \B. Ideally the setting should last till the end
of the group, which means remembering all previous settings; maybe a fixed of the group, which means remembering all previous settings; maybe a fixed
amount of stack would do - how deep would anyone want to nest these things? amount of stack would do - how deep would anyone want to nest these things?
See GitHub issue #13 for a compendium of character class issues. See GitHub issue #13 for a compendium of character class issues, including
(?[...]) extended classes.
. Recognize the short script names. They are already listed in maint/
Multistage2.py because they are needed for scanning the script extensions
file.
. Use script extensions for \p?
. A user suggested something like --with-build-info to set a build information . A user suggested something like --with-build-info to set a build information
string that could be retrieved by pcre2_config(). However, there's no string that could be retrieved by pcre2_config(). However, there's no
@ -461,4 +457,4 @@ years.
Philip Hazel Philip Hazel
Email local part: Philip.Hazel Email local part: Philip.Hazel
Email domain: gmail.com Email domain: gmail.com
Last updated: 10 January 2022 Last updated: 25 April 2022