2014-05-13 13:20:03 +02:00
|
|
|
MAINTENANCE README FOR PCRE2
|
|
|
|
============================
|
|
|
|
|
|
|
|
The files in the "maint" directory of the PCRE2 source contain data, scripts,
|
|
|
|
and programs that are used for the maintenance of PCRE2, but which do not form
|
|
|
|
part of the PCRE2 distribution tarballs. This document describes these files
|
|
|
|
and also contains some notes for maintainers. Its contents are:
|
|
|
|
|
|
|
|
Files in the maint directory
|
|
|
|
Updating to a new Unicode release
|
|
|
|
Preparing for a PCRE2 release
|
|
|
|
Making a PCRE2 release
|
|
|
|
Long-term ideas (wish list)
|
|
|
|
|
|
|
|
|
|
|
|
Files in the maint directory
|
|
|
|
============================
|
|
|
|
|
|
|
|
GenerateUtt.py A Python script to generate part of the pcre2_tables.c file
|
|
|
|
that contains Unicode script names in a long string with
|
|
|
|
offsets, which is tedious to maintain by hand.
|
|
|
|
|
|
|
|
ManyConfigTests A shell script that runs "configure, make, test" a number of
|
|
|
|
times with different configuration settings.
|
|
|
|
|
|
|
|
MultiStage2.py A Python script that generates the file pcre2_ucd.c from three
|
|
|
|
Unicode data tables, which are themselves downloaded from the
|
|
|
|
Unicode web site. Run this script in the "maint" directory.
|
|
|
|
The generated file contains the tables for a 2-stage lookup
|
|
|
|
of Unicode properties.
|
|
|
|
|
|
|
|
pcre2_chartables.c.non-standard
|
|
|
|
This is a set of character tables that came from a Windows
|
|
|
|
system. It has characters greater than 128 that are set as
|
|
|
|
spaces, amongst other things. I kept it so that it can be
|
|
|
|
used for testing from time to time.
|
|
|
|
|
|
|
|
README This file.
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
Unicode.tables The files in this directory (CaseFolding.txt,
|
2014-05-13 13:20:03 +02:00
|
|
|
DerivedGeneralCategory.txt, GraphemeBreakProperty.txt,
|
|
|
|
Scripts.txt and UnicodeData.txt) were downloaded from the
|
|
|
|
Unicode web site. They contain information about Unicode
|
2014-11-26 17:51:53 +01:00
|
|
|
characters and scripts.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
ucptest.c A short C program for testing the Unicode property macros
|
|
|
|
that do lookups in the pcre2_ucd.c data, mainly useful after
|
|
|
|
rebuilding the Unicode property table. Compile and run this in
|
|
|
|
the "maint" directory (see comments at its head).
|
|
|
|
|
|
|
|
ucptestdata A directory containing two files, testinput1 and testoutput1,
|
|
|
|
to use in conjunction with the ucptest program.
|
|
|
|
|
|
|
|
utf8.c A short, freestanding C program for converting a Unicode code
|
|
|
|
point into a sequence of bytes in the UTF-8 encoding, and vice
|
|
|
|
versa. If its argument is a hex number such as 0x1234, it
|
|
|
|
outputs a list of the equivalent UTF-8 bytes. If its argument
|
|
|
|
is sequence of concatenated UTF-8 bytes (e.g. e188b4) it
|
|
|
|
treats them as a UTF-8 character and outputs the equivalent
|
|
|
|
code point in hex.
|
|
|
|
|
|
|
|
|
|
|
|
Updating to a new Unicode release
|
|
|
|
=================================
|
|
|
|
|
|
|
|
When there is a new release of Unicode, the files in Unicode.tables must be
|
|
|
|
refreshed from the web site. If the new version of Unicode adds new character
|
2014-11-18 19:32:12 +01:00
|
|
|
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
|
2014-05-13 13:20:03 +02:00
|
|
|
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
|
|
|
|
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
|
|
|
|
run to generate the tricky tables for inclusion in pcre2_tables.c.
|
|
|
|
|
|
|
|
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
|
|
|
|
the cause is usually a missing (or misspelt) name in the list of scripts. I
|
|
|
|
couldn't find a straightforward list of scripts on the Unicode site, but
|
2014-11-18 19:32:12 +01:00
|
|
|
there's a useful Wikipedia page that lists them, and notes the Unicode version
|
2014-05-13 13:20:03 +02:00
|
|
|
in which they were introduced:
|
|
|
|
|
|
|
|
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
|
|
|
|
|
|
|
|
The ucptest program can be compiled and used to check that the new tables in
|
|
|
|
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
|
|
|
|
number of test characters. The source file ucptest.c must be updated whenever
|
|
|
|
new Unicode script names are added.
|
|
|
|
|
|
|
|
Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain
|
|
|
|
lists of Unicode script names.
|
|
|
|
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
Preparing for a PCRE2 release
|
|
|
|
=============================
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
This section contains a checklist of things that I consult before building a
|
|
|
|
distribution for a new release.
|
|
|
|
|
|
|
|
. Ensure that the version number and version date are correct in configure.ac.
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Update the library version numbers in configure.ac according to the rules
|
2014-05-13 13:20:03 +02:00
|
|
|
given below.
|
|
|
|
|
2015-12-15 13:07:41 +01:00
|
|
|
. If new build options or new source files have been added, ensure that they
|
|
|
|
are added to the CMake files as well as to the autoconf files. The relevant
|
|
|
|
files are CMakeLists.txt and config-cmake.h.in. After making a release
|
|
|
|
tarball, test it out with CMake if there have been changes here.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. Run ./autogen.sh to ensure everything is up-to-date.
|
|
|
|
|
|
|
|
. Compile and test with many different config options, and combinations of
|
|
|
|
options. Also, test with valgrind by running "RunTest valgrind" and
|
|
|
|
"RunGrepTest valgrind" (which takes quite a long time). The script
|
|
|
|
maint/ManyConfigTests now encapsulates this testing. It runs tests with
|
|
|
|
different configurations, and it also runs some of them with valgrind, all of
|
|
|
|
which can take quite some time.
|
|
|
|
|
2015-12-15 13:07:41 +01:00
|
|
|
. Run tests in both 32-bit and 64-bit environments if possible.
|
|
|
|
|
2017-05-20 18:25:29 +02:00
|
|
|
. Run tests with two or more different compilers (e.g. clang and gcc), and
|
|
|
|
make use of -fsanitize=address and friends where possible. For gcc,
|
|
|
|
-fsanitize=undefined -std=gnu99 picks up undefined behaviour at runtime, but
|
|
|
|
needs -fno-sanitize=shift to get rid of warnings for shifts of negative
|
|
|
|
numbers in the JIT compiler. For clang, -fsanitize=address,undefined,integer
|
|
|
|
can be used but -fno-sanitize=alignment,shift,unsigned-integer-overflow must
|
|
|
|
be added when compiling with JIT. Another useful clang option is
|
|
|
|
-fsanitize=signed-integer-overflow
|
2015-12-15 13:07:41 +01:00
|
|
|
|
|
|
|
. Do a test build using CMake.
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Run perltest.sh on the test data for tests 1 and 4. The output should match
|
2014-10-25 17:51:01 +02:00
|
|
|
the PCRE2 test output, apart from the version identification at the start of
|
|
|
|
each test. The other tests are not Perl-compatible (they use various
|
|
|
|
PCRE2-specific features or options).
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. It is possible to test with the emulated memmove() function by undefining
|
|
|
|
HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often. You
|
|
|
|
may see a number of "pcre2_memmove defined but not used" warnings for the
|
|
|
|
modules in which there is no call to memmove(). These can be ignored.
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Documentation: check AUTHORS, ChangeLog (check version and date), LICENCE,
|
2014-05-13 13:20:03 +02:00
|
|
|
NEWS (check version and date), NON-AUTOTOOLS-BUILD, and README. Many of these
|
|
|
|
won't need changing, but over the long term things do change.
|
|
|
|
|
|
|
|
. I used to test new releases myself on a number of different operating
|
2015-12-15 13:07:41 +01:00
|
|
|
systems. For example, on Solaris it is helpful to test using Sun's cc
|
|
|
|
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
|
|
|
|
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
|
|
|
|
for test 2. Since I retired I can no longer do this, but instead I rely on
|
|
|
|
putting out release candidates for folks on the pcre-dev list to test.
|
2014-11-26 17:51:53 +01:00
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
|
2014-11-26 17:51:53 +01:00
|
|
|
of PCRE2 and should be checked before putting out a release.
|
|
|
|
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
Updating version info for libtool
|
|
|
|
=================================
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
This set of rules for updating library version information came from a web page
|
2014-05-13 13:20:03 +02:00
|
|
|
whose URL I have forgotten. The version information consists of three parts:
|
|
|
|
(current, revision, age).
|
|
|
|
|
|
|
|
1. Start with version information of 0:0:0 for each libtool library.
|
|
|
|
|
|
|
|
2. Update the version information only immediately before a public release of
|
|
|
|
your software. More frequent updates are unnecessary, and only guarantee
|
|
|
|
that the current interface number gets larger faster.
|
|
|
|
|
|
|
|
3. If the library source code has changed at all since the last update, then
|
|
|
|
increment revision; c:r:a becomes c:r+1:a.
|
|
|
|
|
|
|
|
4. If any interfaces have been added, removed, or changed since the last
|
|
|
|
update, increment current, and set revision to 0.
|
|
|
|
|
|
|
|
5. If any interfaces have been added since the last public release, then
|
|
|
|
increment age.
|
|
|
|
|
|
|
|
6. If any interfaces have been removed or changed since the last public
|
|
|
|
release, then set age to 0.
|
|
|
|
|
|
|
|
The following explanation may help in understanding the above rules a bit
|
|
|
|
better. Consider that there are three possible kinds of reaction from users to
|
|
|
|
changes in a shared library:
|
|
|
|
|
|
|
|
1. Programs using the previous version may use the new version as a drop-in
|
|
|
|
replacement, and programs using the new version can also work with the
|
|
|
|
previous one. In other words, no recompiling nor relinking is needed. In
|
|
|
|
this case, increment revision only, don't touch current or age.
|
|
|
|
|
|
|
|
2. Programs using the previous version may use the new version as a drop-in
|
|
|
|
replacement, but programs using the new version may use APIs not present in
|
|
|
|
the previous one. In other words, a program linking against the new version
|
|
|
|
may fail if linked against the old version at run time. In this case, set
|
|
|
|
revision to 0, increment current and age.
|
|
|
|
|
|
|
|
3. Programs may need to be changed, recompiled, relinked in order to use the
|
|
|
|
new version. Increment current, set revision and age to 0.
|
|
|
|
|
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
Making a PCRE2 release
|
|
|
|
======================
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
Run PrepareRelease and commit the files that it changes (by removing trailing
|
|
|
|
spaces). The first thing this script does is to run CheckMan on the man pages;
|
|
|
|
if it finds any markup errors, it reports them and then aborts.
|
|
|
|
|
|
|
|
Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
|
|
|
|
and the zipball. Double-check with "svn status", then create an SVN tagged
|
|
|
|
copy:
|
|
|
|
|
|
|
|
svn copy svn://vcs.exim.org/pcre2/code/trunk \
|
2014-11-26 17:51:53 +01:00
|
|
|
svn://vcs.exim.org/pcre2/code/tags/pcre2-10.xx
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
When the new release is out, don't forget to tell webmaster@pcre.org and the
|
2015-12-15 13:07:41 +01:00
|
|
|
mailing list. Also, update the list of version numbers in Bugzilla
|
|
|
|
(administration > products > PCRE > Edit versions).
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
|
|
|
|
Future ideas (wish list)
|
|
|
|
========================
|
|
|
|
|
|
|
|
This section records a list of ideas so that they do not get forgotten. They
|
|
|
|
vary enormously in their usefulness and potential for implementation. Some are
|
2014-11-18 19:32:12 +01:00
|
|
|
very sensible; some are rather wacky. Some have been on this list for years.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. Optimization
|
|
|
|
|
|
|
|
There are always ideas for new optimizations so as to speed up pattern
|
|
|
|
matching. Most of them try to save work by recognizing a non-match without
|
|
|
|
having to scan all the possibilities. These are some that I've recorded:
|
|
|
|
|
|
|
|
* /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
|
|
|
|
slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
|
|
|
|
OTOH, this is pathological - the user could easily fix it.
|
|
|
|
|
|
|
|
* Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
|
|
|
|
to have little effect, and maybe makes things worse.
|
|
|
|
|
|
|
|
* "Ends with literal string" - note that a single character doesn't gain much
|
2014-10-25 17:51:01 +02:00
|
|
|
over the existing "required code unit" feature that just remembers one code
|
|
|
|
unit.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-11-18 19:32:12 +01:00
|
|
|
* Remember an initial string rather than just 1 code unit.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
* A required code unit from alternatives - not just the last unit, but an
|
|
|
|
earlier one if common to all alternatives.
|
|
|
|
|
2014-11-18 19:32:12 +01:00
|
|
|
* Friedl contains other ideas.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
* The code does not set initial code unit flags for Unicode property types
|
|
|
|
such as \p; I don't know how much benefit there would be for, for example,
|
|
|
|
setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
|
|
|
|
pattern starts with \p{N}.
|
|
|
|
|
|
|
|
. If Perl gets to a consistent state over the settings of capturing sub-
|
|
|
|
patterns inside repeats, see if we can match it. One example of the
|
2014-11-18 19:32:12 +01:00
|
|
|
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
|
|
|
|
leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
|
2014-05-13 13:20:03 +02:00
|
|
|
because I think it needs much more state to be remembered.
|
|
|
|
|
2014-11-18 19:32:12 +01:00
|
|
|
. An option to use NUL as a line terminator in subject strings. This could be
|
|
|
|
done relatively easily. If it is done, a suitable option for pcre2grep is
|
|
|
|
also required.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. A feature to suspend a match via a callout was once requested.
|
|
|
|
|
2014-11-18 19:32:12 +01:00
|
|
|
. An option to convert results into character offsets and character lengths.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. An option for pcre2grep to scan only the start of a file. I am not keen -
|
2014-11-18 19:32:12 +01:00
|
|
|
this is the job of "head".
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
|
|
|
|
preceded by a blank line, instead of adding it to every matched line, and (b)
|
|
|
|
support --outputfile=name.
|
|
|
|
|
|
|
|
. Define a union for the results from pcre2_pattern_info().
|
|
|
|
|
|
|
|
. Provide a "random access to the subject" facility so that the way in which it
|
2014-10-25 17:51:01 +02:00
|
|
|
is stored is independent of PCRE2. For efficiency, it probably isn't possible
|
|
|
|
to switch this dynamically. It would have to be specified when PCRE2 was
|
|
|
|
compiled. PCRE2 would then call a function every time it wanted a character.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
|
2014-05-13 13:20:03 +02:00
|
|
|
them will of course slow it down.
|
|
|
|
|
|
|
|
. Someone suggested --disable-callout to save code space when callouts are
|
|
|
|
never wanted. This seems rather marginal.
|
|
|
|
|
|
|
|
. A user suggested a parameter to limit the length of string matched, for
|
|
|
|
example if the parameter is N, the current match should fail if the matched
|
|
|
|
substring exceeds N. This could apply to both match functions. The value
|
2015-12-15 13:07:41 +01:00
|
|
|
could be a new field in the match context. Compare the offset_limit feature,
|
|
|
|
which limits where a match must start.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Write a function that generates random matching strings for a compiled
|
2014-10-25 17:51:01 +02:00
|
|
|
pattern.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
. Pcre2grep: an option to specify the output line separator, either as a string
|
2014-11-18 19:32:12 +01:00
|
|
|
or select from a fixed list. This is not straightforward, because at the
|
|
|
|
moment it outputs whatever is in the input file.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete,
|
2014-05-13 13:20:03 +02:00
|
|
|
non-thread-safe patch showed that this can help performance for patterns
|
|
|
|
where there are many alternatives. However, a simple thread-safe
|
|
|
|
implementation that I tried made things worse in many simple cases, so this
|
|
|
|
is not an obviously good thing.
|
|
|
|
|
2014-10-25 17:51:01 +02:00
|
|
|
. PCRE2 cannot at present distinguish between subpatterns with different names,
|
2014-05-13 13:20:03 +02:00
|
|
|
but the same number (created by the use of ?|). In order to do so, a way of
|
|
|
|
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
|
2014-11-18 19:32:12 +01:00
|
|
|
(*MARK) can perhaps be used as a way round this problem.
|
2014-05-13 13:20:03 +02:00
|
|
|
|
|
|
|
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
|
|
|
|
"something" and the the #ifdef appears only in one place, in "something".
|
|
|
|
|
2014-11-26 17:51:53 +01:00
|
|
|
. Implement something like (?(R2+)... to check outer recursions.
|
|
|
|
|
|
|
|
. If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
|
|
|
|
to follow.
|
|
|
|
|
2015-12-15 13:07:41 +01:00
|
|
|
. Bugzilla #554 requested support for invalid UTF-8 strings.
|
|
|
|
|
|
|
|
. A user wanted a way of ignoring all Unicode "mark" characters so that, for
|
|
|
|
example "a" followed by an accent would, together, match "a".
|
|
|
|
|
|
|
|
. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC.
|
|
|
|
|
|
|
|
. Unicode stuff from Perl:
|
|
|
|
|
|
|
|
\b{gcb} or \b{g} grapheme cluster boundary
|
|
|
|
\b{sb} sentence boundary
|
|
|
|
\b{wb} word boundary
|
|
|
|
|
|
|
|
See Unicode TR 29.
|
|
|
|
|
|
|
|
. (?[...]) extended classes: big project.
|
|
|
|
|
|
|
|
. Bugzilla #1694 requests backwards searching.
|
|
|
|
|
|
|
|
. A callout from pcre2_substitute() that happens after (before?) each
|
|
|
|
substitution (value = 256?).
|
|
|
|
|
|
|
|
. Allow a callout to specify a number of characters to skip. This can be done
|
|
|
|
compatibly via an extra callout field.
|
|
|
|
|
|
|
|
. Allow callouts to return *PRUNE, *COMMIT, *THEN, *SKIP, with and without
|
|
|
|
continuing (that is, with and without an implied *FAIL). A new option,
|
|
|
|
PCRE2_CALLOUT_EXTENDED say, would be needed. This is unlikely ever to be
|
|
|
|
implemented by JIT, so this could be an option for pcre2_match().
|
|
|
|
|
|
|
|
. A limit on substitutions: a user suggested somehow finding a way of making
|
|
|
|
match_limit apply to the whole operation instead of each match separately.
|
|
|
|
|
2017-05-20 18:25:29 +02:00
|
|
|
. There was a suggestion that Perl should lock out \K in lookarounds. If it
|
|
|
|
does, PCRE2 should follow.
|
|
|
|
|
|
|
|
. Redesign handling of class/nclass/xclass because the compile code logic is
|
|
|
|
currently very contorted and obscure.
|
|
|
|
|
|
|
|
. Some #defines could be replaced with enums to improve robustness.
|
2015-12-15 13:07:41 +01:00
|
|
|
|
2014-05-13 13:20:03 +02:00
|
|
|
Philip Hazel
|
|
|
|
Email local part: ph10
|
|
|
|
Email domain: cam.ac.uk
|
2017-05-20 18:25:29 +02:00
|
|
|
Last updated: 20 May 2017
|