Update maintenance documentation
parent f65df06305
commit 104fe2fead
55
HACKING
|
@ -8,8 +8,8 @@ library is referred to as PCRE1 below. For information about testing PCRE2, see
|
||||||
the pcre2test documentation and the comment at the head of the RunTest file.
|
the pcre2test documentation and the comment at the head of the RunTest file.
|
||||||
|
|
||||||
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
|
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
|
||||||
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
|
releases carried on the 8.xx series, up to the final 8.45 release. PCRE2
|
||||||
confusion with PCRE1.
|
releases started at 10.00 to avoid confusion with PCRE1.
|
||||||
|
|
||||||
|
|
||||||
Historical note 1
|
Historical note 1
|
||||||
|
@ -38,8 +38,8 @@ Historical note 2
|
||||||
By contrast, the code originally written by Henry Spencer (which was
|
By contrast, the code originally written by Henry Spencer (which was
|
||||||
subsequently heavily modified for Perl) compiles the expression twice: once in
|
subsequently heavily modified for Perl) compiles the expression twice: once in
|
||||||
a dummy mode in order to find out how much store will be needed, and then for
|
a dummy mode in order to find out how much store will be needed, and then for
|
||||||
real. (The Perl version probably doesn't do this any more; I'm talking about
|
real. (The Perl version may or may not still do this; I'm talking about the
|
||||||
the original library.) The execution function operates by backtracking and
|
original library.) The execution function operates by backtracking and
|
||||||
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
|
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
|
||||||
matches individual wild portions of the pattern. This is an "NFA algorithm" in
|
matches individual wild portions of the pattern. This is an "NFA algorithm" in
|
||||||
Friedl's terminology.
|
Friedl's terminology.
|
||||||
|
@ -151,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
|
||||||
advance to check for such values. When auto-callouts are enabled, the generous
|
advance to check for such values. When auto-callouts are enabled, the generous
|
||||||
assumption is made that there will be a callout for each pattern code unit
|
assumption is made that there will be a callout for each pattern code unit
|
||||||
(which of course is only actually true if all code units are literals) plus one
|
(which of course is only actually true if all code units are literals) plus one
|
||||||
at the end. There is a default parsed pattern vector on the system stack, but
|
at the end. A default parsed pattern vector is defined on the system stack, to
|
||||||
if this is not big enough, heap memory is used.
|
minimize memory handling, but if this is not big enough, heap memory is used.
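
The stack-first, heap-fallback arrangement described above can be sketched roughly as follows. This is only an illustration: the function name, the EX_ prefix, and the vector size are assumptions made for the example and are not taken from the PCRE2 sources.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical size; the real default is set in the PCRE2 sources. */
#define EX_STACK_VECTOR_ELEMENTS 1024

/* Pick the buffer for the parsed pattern. If the caller's stack vector is
   big enough it is used as is; otherwise a heap block is allocated.
   *on_heap is set to 1 when the caller must free() the result.
   Returns NULL if the heap allocation fails or the size would overflow. */
uint32_t *ex_choose_parsed_vector(uint32_t *stack_vector,
                                  size_t stack_elements,
                                  size_t needed_elements,
                                  int *on_heap)
{
    if (needed_elements <= stack_elements) {
        *on_heap = 0;
        return stack_vector;
    }
    *on_heap = 1;
    if (needed_elements > SIZE_MAX / sizeof(uint32_t))
        return NULL;
    return malloc(needed_elements * sizeof(uint32_t));
}
```

A caller would declare the vector as a local array, call this once the required length is known, and free the result only when on_heap is set.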
|
||||||
|
|
||||||
As before, the actual compiling function is run twice, the first time to
|
As before, the actual compiling function is run twice, the first time to
|
||||||
determine the amount of memory needed for the final compiled pattern. It
|
determine the amount of memory needed for the final compiled pattern. It
|
||||||
|
@ -187,7 +187,7 @@ META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
|
||||||
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
|
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
|
||||||
META_CLASS_END ] end of non-empty class
|
META_CLASS_END ] end of non-empty class
|
||||||
META_CLASS_NOT [^ start non-empty negative class
|
META_CLASS_NOT [^ start non-empty negative class
|
||||||
META_COMMIT (*COMMIT)
|
META_COMMIT (*COMMIT) - no argument (see below for with argument)
|
||||||
META_COND_ASSERT (?(?assertion)
|
META_COND_ASSERT (?(?assertion)
|
||||||
META_DOLLAR $ metacharacter
|
META_DOLLAR $ metacharacter
|
||||||
META_DOT . metacharacter
|
META_DOT . metacharacter
|
||||||
|
@ -201,14 +201,14 @@ META_NOCAPTURE (?: no capture parens
|
||||||
META_PLUS +
|
META_PLUS +
|
||||||
META_PLUS_PLUS ++
|
META_PLUS_PLUS ++
|
||||||
META_PLUS_QUERY +?
|
META_PLUS_QUERY +?
|
||||||
META_PRUNE (*PRUNE) - no argument
|
META_PRUNE (*PRUNE) - no argument (see below for with argument)
|
||||||
META_QUERY ?
|
META_QUERY ?
|
||||||
META_QUERY_PLUS ?+
|
META_QUERY_PLUS ?+
|
||||||
META_QUERY_QUERY ??
|
META_QUERY_QUERY ??
|
||||||
META_RANGE_ESCAPED hyphen in class range with at least one escape
|
META_RANGE_ESCAPED hyphen in class range with at least one escape
|
||||||
META_RANGE_LITERAL hyphen in class range defined literally
|
META_RANGE_LITERAL hyphen in class range defined literally
|
||||||
META_SKIP (*SKIP) - no argument
|
META_SKIP (*SKIP) - no argument (see below for with argument)
|
||||||
META_THEN (*THEN) - no argument
|
META_THEN (*THEN) - no argument (see below for with argument)
|
||||||
|
|
||||||
The two RANGE values occur only in character classes. They are positioned
|
The two RANGE values occur only in character classes. They are positioned
|
||||||
between two literals that define the start and end of the range. In an EBCDIC
|
between two literals that define the start and end of the range. In an EBCDIC
|
||||||
|
@ -229,7 +229,8 @@ If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
|
||||||
is the length of its branch, for which OP_REVERSE must be generated.
|
is the length of its branch, for which OP_REVERSE must be generated.
|
||||||
|
|
||||||
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
|
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
|
||||||
their data in the lower 16 bits of the element.
|
their data in the lower 16 bits of the element. META_RECURSE is followed by an
|
||||||
|
offset, for use in error messages.
|
||||||
|
|
||||||
META_BACKREF is followed by an offset if the back reference group number is 10
|
META_BACKREF is followed by an offset if the back reference group number is 10
|
||||||
or more. The offsets of the first occurrences of references to groups whose
|
or more. The offsets of the first occurrences of references to groups whose
|
||||||
|
@ -238,8 +239,6 @@ occurrence is useful). On 64-bit systems this avoids using more than two parsed
|
||||||
pattern elements for items such as \3. The offset is used when an error occurs
|
pattern elements for items such as \3. The offset is used when an error occurs
|
||||||
because the reference is to a non-existent group.
|
because the reference is to a non-existent group.
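
As an illustration of keeping the group number in the lower 16 bits of a parsed pattern element, here is a minimal sketch. The EX_ names and the two constant values are invented for the example; only the 16-bit split comes from the text above.

```c
#include <stdint.h>

/* Illustrative values only; the real META_ constants may differ. */
#define EX_META_BACKREF 0x80030000u
#define EX_META_CAPTURE 0x80080000u

/* Combine a meta code (upper 16 bits) with a group number (lower 16 bits). */
uint32_t ex_meta_make(uint32_t meta_code, uint32_t group)
{
    return meta_code | (group & 0x0000ffffu);
}

/* Extract the meta code part of an element. */
uint32_t ex_meta_code(uint32_t element)
{
    return element & 0xffff0000u;
}

/* Extract the data part (here, the capture group number). */
uint32_t ex_meta_data(uint32_t element)
{
    return element & 0x0000ffffu;
}
```

With this layout an item such as \3 fits in a single element, and any offset needed for error messages goes in the element or elements that follow.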
|
||||||
|
|
||||||
META_RECURSE is always followed by an offset, for use in error messages.
|
|
||||||
|
|
||||||
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
|
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
|
||||||
element contains the 16-bit type and data property values, packed together.
|
element contains the 16-bit type and data property values, packed together.
|
||||||
ESC_g and ESC_k are used only for named references - numerical ones are turned
|
ESC_g and ESC_k are used only for named references - numerical ones are turned
|
||||||
|
@ -291,9 +290,9 @@ META_LOOKBEHIND (?<= start of lookbehind
|
||||||
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
|
META_LOOKBEHIND_NA (*naplb: start of non-atomic lookbehind
|
||||||
META_LOOKBEHINDNOT (?<! start of negative lookbehind
|
META_LOOKBEHINDNOT (?<! start of negative lookbehind
|
||||||
|
|
||||||
The following are followed by two elements, the minimum and maximum. Repeat
|
The following are followed by two elements, the minimum and maximum. The
|
||||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
maximum value is limited to 65535 (MAX_REPEAT). A maximum value of "unlimited"
|
||||||
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
is represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
||||||
|
|
||||||
META_MINMAX {n,m} repeat
|
META_MINMAX {n,m} repeat
|
||||||
META_MINMAX_PLUS {n,m}+ repeat
|
META_MINMAX_PLUS {n,m}+ repeat
|
||||||
|
@ -347,11 +346,11 @@ support is not available for this kind of matching.
|
||||||
Changeable options
|
Changeable options
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
|
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
|
||||||
others) may be changed in the middle of patterns by items such as (?i). Their
|
some others may be changed in the middle of patterns by items such as (?i).
|
||||||
processing is handled entirely at compile time by generating different opcodes
|
Their processing is handled entirely at compile time by generating different
|
||||||
for the different settings. The runtime functions do not need to keep track of
|
opcodes for the different settings. The runtime functions do not need to keep
|
||||||
an option's state.
|
track of an option's state.
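
The idea of resolving an option at compile time by choosing a different opcode can be sketched as below. The enum, the flag value, and the function name are assumptions for illustration and do not reproduce the real compile function.

```c
#include <stdint.h>

/* Hypothetical opcodes for a literal character, caseful and caseless. */
enum ex_opcode { EX_OP_CHAR, EX_OP_CHARI };

/* Hypothetical option bit standing in for PCRE2_CASELESS. */
#define EX_CASELESS 0x00000001u

/* Choose the opcode for a literal from the options in force at that point
   in the pattern. Once this is done, the matcher never needs to know
   whether (?i) was set; the opcode itself carries the information. */
enum ex_opcode ex_literal_opcode(uint32_t options)
{
    return (options & EX_CASELESS) ? EX_OP_CHARI : EX_OP_CHAR;
}
```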
|
||||||
|
|
||||||
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
|
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
|
||||||
are tracked and processed during the parsing pre-pass. The others are handled
|
are tracked and processed during the parsing pre-pass. The others are handled
|
||||||
|
@ -468,8 +467,8 @@ Caseless matching (positive or negative) of characters that have more than two
|
||||||
case-equivalent code points (which is possible only in UTF mode) is handled by
|
case-equivalent code points (which is possible only in UTF mode) is handled by
|
||||||
compiling a Unicode property item (see below), with the pseudo-property
|
compiling a Unicode property item (see below), with the pseudo-property
|
||||||
PT_CLIST. The value of this property is an offset in a vector called
|
PT_CLIST. The value of this property is an offset in a vector called
|
||||||
"ucd_caseless_sets" which identifies the start of a short list of equivalent
|
"ucd_caseless_sets" which identifies the start of a short list of case
|
||||||
characters, terminated by the value NOTACHAR (0xffffffff).
|
equivalent characters, terminated by the value NOTACHAR (0xffffffff).
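
A minimal sketch of scanning such a terminated list follows. The table here is a small hypothetical stand-in, not the real "ucd_caseless_sets" data, and the function name is invented for the example.

```c
#include <stdint.h>

#define EX_NOTACHAR 0xffffffffu

/* Stand-in table: an empty list at offset 0, then K / k / KELVIN SIGN at
   offset 1, then S / s / LATIN SMALL LETTER LONG S at offset 5. */
static const uint32_t ex_caseless_sets[] = {
    EX_NOTACHAR,
    0x004b, 0x006b, 0x212a, EX_NOTACHAR,
    0x0053, 0x0073, 0x017f, EX_NOTACHAR
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
int ex_in_caseless_set(uint32_t offset, uint32_t c)
{
    const uint32_t *p = ex_caseless_sets + offset;
    while (*p != EX_NOTACHAR) {
        if (*p == c)
            return 1;
        p++;
    }
    return 0;
}
```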
|
||||||
|
|
||||||
|
|
||||||
Repeating single characters
|
Repeating single characters
|
||||||
|
@ -546,9 +545,9 @@ Each is followed by two code units that encode the desired property as a type
|
||||||
and a value. The types are a set of #defines of the form PT_xxx, and the values
|
and a value. The types are a set of #defines of the form PT_xxx, and the values
|
||||||
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
|
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
|
||||||
The value is relevant only for PT_GC (General Category), PT_PC (Particular
|
The value is relevant only for PT_GC (General Category), PT_PC (Particular
|
||||||
Category), PT_SC (Script), PT_BIDICL (Bidi Class), and the pseudo-property
|
Category), PT_SC (Script), PT_BIDICL (Bidi Class), PT_BOOL (Boolean property),
|
||||||
PT_CLIST, which is used to identify a list of case-equivalent characters when
|
and the pseudo-property PT_CLIST, which is used to identify a list of
|
||||||
there are three or more.
|
case-equivalent characters when there are three or more (see above).
|
||||||
|
|
||||||
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
|
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
|
||||||
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
|
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
|
||||||
|
@ -668,7 +667,7 @@ used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
|
||||||
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
|
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
|
||||||
respectively, and OP_KETRPOS for possessive repetitions (see below for more
|
respectively, and OP_KETRPOS for possessive repetitions (see below for more
|
||||||
details). All four are followed by a LINK_SIZE value giving (as a positive
|
details). All four are followed by a LINK_SIZE value giving (as a positive
|
||||||
number) the offset back to the matching bracket opcode.
|
number) the offset back to the matching opening bracket opcode.
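
To make the "offset back" concrete, here is a sketch that assumes 8-bit code units and a LINK_SIZE of 2, with the offset stored most significant unit first. The helper names and the opcode byte values used when trying it out are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_LINK_SIZE 2   /* assumed for this sketch */

/* Store a link value in EX_LINK_SIZE code units, high-order unit first. */
void ex_put_link(uint8_t *code, size_t n, unsigned int d)
{
    code[n] = (uint8_t)(d >> 8);
    code[n + 1] = (uint8_t)(d & 255);
}

/* Read a link value back. */
unsigned int ex_get_link(const uint8_t *code, size_t n)
{
    return ((unsigned int)code[n] << 8) | code[n + 1];
}

/* Given a pointer to a closing KET opcode, return a pointer to the
   matching opening bracket by stepping back by the stored offset. */
const uint8_t *ex_find_open(const uint8_t *ket)
{
    return ket - ex_get_link(ket, 1);
}
```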
|
||||||
|
|
||||||
If a subpattern is quantified such that it is permitted to match zero times, it
|
If a subpattern is quantified such that it is permitted to match zero times, it
|
||||||
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
||||||
|
@ -828,4 +827,4 @@ not a real opcode, but is used to check at compile time that tables indexed by
|
||||||
opcode are the correct length, in order to catch updating errors.
|
opcode are the correct length, in order to catch updating errors.
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
December 2021
|
April 2022
|
||||||
|
|
56
maint/README
|
@ -78,9 +78,9 @@ utf8.c
|
||||||
A short, freestanding C program for converting a Unicode code point into a
|
A short, freestanding C program for converting a Unicode code point into a
|
||||||
sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
|
sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
|
||||||
hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
|
hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
|
||||||
If its argument is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
|
If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it
|
||||||
treats them as a UTF-8 character and outputs the equivalent code point in
|
treats them as a UTF-8 string and outputs the equivalent code points in hex.
|
||||||
hex. See comments at its head for details.
|
See comments at its head for details.
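
For reference, the code point to UTF-8 direction can be sketched as follows. This is not the code in utf8.c; the function name is invented, and rejecting surrogates and values above 0x10ffff is a choice made for this example.

```c
#include <stdint.h>

/* Encode code point c as UTF-8 into buf, which must hold at least 4 bytes.
   Returns the number of bytes written, or 0 if c is a surrogate or is
   greater than 0x10ffff. */
int ex_utf8_encode(uint32_t c, uint8_t *buf)
{
    if (c < 0x80) {
        buf[0] = (uint8_t)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (uint8_t)(0xc0 | (c >> 6));
        buf[1] = (uint8_t)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;
        buf[0] = (uint8_t)(0xe0 | (c >> 12));
        buf[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        buf[2] = (uint8_t)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {
        buf[0] = (uint8_t)(0xf0 | (c >> 18));
        buf[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
        buf[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        buf[3] = (uint8_t)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}
```

For 0x1234 this gives the three bytes e1 88 b4, which agrees with the e188b4 example in the text.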
|
||||||
|
|
||||||
|
|
||||||
Updating to a new Unicode release
|
Updating to a new Unicode release
|
||||||
|
@ -94,8 +94,9 @@ directory.
|
||||||
Note: Previously, it was necessary to update lists of scripts and their
|
Note: Previously, it was necessary to update lists of scripts and their
|
||||||
abbreviations by hand before running the Python scripts. This is no longer
|
abbreviations by hand before running the Python scripts. This is no longer
|
||||||
necessary because the scripts have been upgraded to extract this information
|
necessary because the scripts have been upgraded to extract this information
|
||||||
themselves. Also, there used to be explicit lists of script in two of the man
|
themselves. Also, there used to be explicit lists of scripts in two of the man
|
||||||
pages. This is no longer the case.
|
pages. This is no longer the case; the pcre2test program can now output a list
|
||||||
|
of supported scripts.
|
||||||
|
|
||||||
You can give an output file name as an argument to the following scripts, but
|
You can give an output file name as an argument to the following scripts, but
|
||||||
by default:
|
by default:
|
||||||
|
@ -129,8 +130,8 @@ files should eventually be installed in the main testdata directory.
|
||||||
Preparing for a PCRE2 release
|
Preparing for a PCRE2 release
|
||||||
=============================
|
=============================
|
||||||
|
|
||||||
This section contains a checklist of things that I consult before building a
|
This section contains a checklist of things that I do before building a new
|
||||||
distribution for a new release.
|
release.
|
||||||
|
|
||||||
. Ensure that the version number and version date are correct in configure.ac.
|
. Ensure that the version number and version date are correct in configure.ac.
|
||||||
|
|
||||||
|
@ -139,17 +140,16 @@ distribution for a new release.
|
||||||
|
|
||||||
. If new build options or new source files have been added, ensure that they
|
. If new build options or new source files have been added, ensure that they
|
||||||
are added to the CMake files as well as to the autoconf files. The relevant
|
are added to the CMake files as well as to the autoconf files. The relevant
|
||||||
files are CMakeLists.txt and config-cmake.h.in. After making a release
|
files are CMakeLists.txt and config-cmake.h.in. After making a release, test
|
||||||
tarball, test it out with CMake if there have been changes here.
|
it out with CMake if there have been changes here.
|
||||||
|
|
||||||
. Run ./autogen.sh to ensure everything is up-to-date.
|
. Run ./autogen.sh to ensure everything is up-to-date.
|
||||||
|
|
||||||
. Compile and test with many different config options, and combinations of
|
. Compile and test with many different config options, and combinations of
|
||||||
options. Also, test with valgrind by running "RunTest valgrind" and
|
options. Also, test with valgrind by running "RunTest valgrind" and
|
||||||
"RunGrepTest valgrind" (which takes quite a long time). The script
|
"RunGrepTest valgrind". The script maint/ManyConfigTests now encapsulates
|
||||||
maint/ManyConfigTests now encapsulates this testing. It runs tests with
|
this testing. It runs tests with different configurations, and it also runs
|
||||||
different configurations, and it also runs some of them with valgrind, all of
|
some of them with valgrind, all of which can take quite some time.
|
||||||
which can take quite some time.
|
|
||||||
|
|
||||||
. Run tests in both 32-bit and 64-bit environments if possible. I can no longer
|
. Run tests in both 32-bit and 64-bit environments if possible. I can no longer
|
||||||
run 32-bit tests.
|
run 32-bit tests.
|
||||||
|
@ -164,7 +164,8 @@ distribution for a new release.
|
||||||
-fsanitize=signed-integer-overflow
|
-fsanitize=signed-integer-overflow
|
||||||
|
|
||||||
. Do a test build using CMake. Remove src/config.h first, lest it override the
|
. Do a test build using CMake. Remove src/config.h first, lest it override the
|
||||||
version that CMake creates. Do NOT use parallel make.
|
version that CMake creates. Also do a CMake unity build to check that it
|
||||||
|
still works: [c]cmake -DCMAKE_UNITY_BUILD=ON sets up a unity build.
|
||||||
|
|
||||||
. Run perltest.sh on the test data for tests 1 and 4. The output should match
|
. Run perltest.sh on the test data for tests 1 and 4. The output should match
|
||||||
the PCRE2 test output, apart from the version identification at the start of
|
the PCRE2 test output, apart from the version identification at the start of
|
||||||
|
@ -183,11 +184,12 @@ distribution for a new release.
|
||||||
systems. For example, on Solaris it is helpful to test using Sun's cc
|
systems. For example, on Solaris it is helpful to test using Sun's cc
|
||||||
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
|
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
|
||||||
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
|
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
|
||||||
for test 2. Since I retired I can no longer do much of this, but instead I
|
for test 2. Since I retired I can no longer do much of this. There are
|
||||||
rely on putting out release candidates for testing by the community.
|
automated tests under Ubuntu, Alpine, and Windows that are now set up as
|
||||||
|
GitHub actions. Check that they are running clean.
|
||||||
|
|
||||||
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
|
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
|
||||||
of PCRE2 and should be checked before putting out a release.
|
of PCRE2 and should also be checked before putting out a release.
|
||||||
|
|
||||||
|
|
||||||
Updating version info for libtool
|
Updating version info for libtool
|
||||||
|
@ -243,10 +245,11 @@ it reports them and then aborts. Otherwise it removes trailing spaces from
|
||||||
sources and refreshes the HTML documentation. Update the GitHub repository with
|
sources and refreshes the HTML documentation. Update the GitHub repository with
|
||||||
"git push".
|
"git push".
|
||||||
|
|
||||||
Once PrepareRelease has run clean, run "make distcheck" to create the tarball
|
Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
|
||||||
and the zipball. I then sign these files. Double-check with "git status" that
|
and the zipball. I then sign these files. Double-check with "git status" that
|
||||||
the repository is fully up-to-date, then create a new tag on GitHub. Upload the
|
the repository is fully up-to-date, then create a new tag and a release on
|
||||||
tarball, zipball, and the signatures as "assets" of the GitHub release.
|
GitHub. Upload the tarballs, zipball, and the signatures as "assets" of the
|
||||||
|
GitHub release.
|
||||||
|
|
||||||
When the new release is out, don't forget to tell webmaster@pcre.org and the
|
When the new release is out, don't forget to tell webmaster@pcre.org and the
|
||||||
mailing list.
|
mailing list.
|
||||||
|
@ -365,8 +368,6 @@ years.
|
||||||
|
|
||||||
See Unicode TR 29. The last two are very much aimed at natural language.
|
See Unicode TR 29. The last two are very much aimed at natural language.
|
||||||
|
|
||||||
. (?[...]) extended classes: big project.
|
|
||||||
|
|
||||||
. Allow a callout to specify a number of characters to skip. This can be done
|
. Allow a callout to specify a number of characters to skip. This can be done
|
||||||
compatibly via an extra callout field.
|
compatibly via an extra callout field.
|
||||||
|
|
||||||
|
@ -436,13 +437,8 @@ years.
|
||||||
with lookarounds for \b and \B. Ideally the setting should last till the end
|
with lookarounds for \b and \B. Ideally the setting should last till the end
|
||||||
of the group, which means remembering all previous settings; maybe a fixed
|
of the group, which means remembering all previous settings; maybe a fixed
|
||||||
amount of stack would do - how deep would anyone want to nest these things?
|
amount of stack would do - how deep would anyone want to nest these things?
|
||||||
See GitHub issue #13 for a compendium of character class issues.
|
See GitHub issue #13 for a compendium of character class issues, including
|
||||||
|
(?[...]) extended classes.
|
||||||
. Recognize the short script names. They are already listed in maint/
|
|
||||||
Multistage2.py because they are needed for scanning the script extensions
|
|
||||||
file.
|
|
||||||
|
|
||||||
. Use script extensions for \p?
|
|
||||||
|
|
||||||
. A user suggested something like --with-build-info to set a build information
|
. A user suggested something like --with-build-info to set a build information
|
||||||
string that could be retrieved by pcre2_config(). However, there's no
|
string that could be retrieved by pcre2_config(). However, there's no
|
||||||
|
@ -461,4 +457,4 @@ years.
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: Philip.Hazel
|
Email local part: Philip.Hazel
|
||||||
Email domain: gmail.com
|
Email domain: gmail.com
|
||||||
Last updated: 10 January 2022
|
Last updated: 25 April 2022
|
||||||
|
|