Documentation update.
parent 16d47a9cb1
commit dea540877b
136
maint/README
|
@@ -38,7 +38,7 @@ pcre2_chartables.c.non-standard
|
||||||
|
|
||||||
README This file.
|
README This file.
|
||||||
|
|
||||||
Unicode.tables The files in this directory were downloaded from the Unicode
|
Unicode.tables The files in this directory were downloaded from the Unicode
|
||||||
web site. They contain information about Unicode characters
|
web site. They contain information about Unicode characters
|
||||||
and scripts. The ones used by the MultiStage2.py script are
|
and scripts. The ones used by the MultiStage2.py script are
|
||||||
CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
|
CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
|
||||||
|
@@ -97,7 +97,7 @@ lists of scripts.
|
||||||
The ucptest program can be compiled and used to check that the new tables in
|
The ucptest program can be compiled and used to check that the new tables in
|
||||||
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
|
pcre2_ucd.c work properly, using the data files in ucptestdata to check a
|
||||||
number of test characters. The source file ucptest.c should also be updated
|
number of test characters. The source file ucptest.c should also be updated
|
||||||
whenever new Unicode script names are added, and adding a few tests for new
|
whenever new Unicode script names are added, and adding a few tests for new
|
||||||
scripts is a good idea.
|
scripts is a good idea.
|
||||||
|
|
||||||
|
|
||||||
|
@@ -141,8 +141,9 @@ distribution for a new release.
|
||||||
|
|
||||||
. Run perltest.sh on the test data for tests 1 and 4. The output should match
|
. Run perltest.sh on the test data for tests 1 and 4. The output should match
|
||||||
the PCRE2 test output, apart from the version identification at the start of
|
the PCRE2 test output, apart from the version identification at the start of
|
||||||
each test. The other tests are not Perl-compatible (they use various
|
each test. Sometimes there are other differences in test 4 if PCRE2 and Perl
|
||||||
PCRE2-specific features or options).
|
are using different Unicode releases. The other tests are not Perl-compatible
|
||||||
|
(they use various PCRE2-specific features or options).
|
||||||
|
|
||||||
. It is possible to test with the emulated memmove() function by undefining
|
. It is possible to test with the emulated memmove() function by undefining
|
||||||
HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often.
|
HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often.
|
||||||
|
@@ -155,8 +156,9 @@ distribution for a new release.
|
||||||
systems. For example, on Solaris it is helpful to test using Sun's cc
|
systems. For example, on Solaris it is helpful to test using Sun's cc
|
||||||
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
|
compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
|
||||||
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
|
64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
|
||||||
for test 2. Since I retired I can no longer do this, but instead I rely on
|
for test 2. Since I retired I can no longer do much of this, but instead I
|
||||||
putting out release candidates for folks on the pcre-dev list to test.
|
rely on putting out release candidates for folks on the pcre-dev list to
|
||||||
|
test.
|
||||||
|
|
||||||
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
|
. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
|
||||||
of PCRE2 and should be checked before putting out a release.
|
of PCRE2 and should be checked before putting out a release.
|
||||||
|
@@ -285,7 +287,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
to switch this dynamically. It would have to be specified when PCRE2 was
|
to switch this dynamically. It would have to be specified when PCRE2 was
|
||||||
compiled. PCRE2 would then call a function every time it wanted a character.
|
compiled. PCRE2 would then call a function every time it wanted a character.
|
||||||
|
|
||||||
. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
|
. pcre2grep: add -rs for a sorted recurse. Having to store file names and sort
|
||||||
them will of course slow it down.
|
them will of course slow it down.
|
||||||
|
|
||||||
. Someone suggested --disable-callout to save code space when callouts are
|
. Someone suggested --disable-callout to save code space when callouts are
|
||||||
|
@@ -314,8 +316,8 @@ very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
but the same number (created by the use of ?|). In order to do so, a way of
|
but the same number (created by the use of ?|). In order to do so, a way of
|
||||||
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
|
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
|
||||||
(*MARK) can perhaps be used as a way round this problem. However, note that
|
(*MARK) can perhaps be used as a way round this problem. However, note that
|
||||||
Perl does not distinguish: like PCRE2, a name is just an alias for a number
|
Perl does not distinguish: like PCRE2, a name is just an alias for a number
|
||||||
in Perl.
|
in Perl.
|
||||||
|
|
||||||
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
|
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
|
||||||
"something" and the #ifdef appears only in one place, in "something".
|
"something" and the #ifdef appears only in one place, in "something".
|
||||||
|
@@ -325,12 +327,12 @@ very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
. If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
|
. If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
|
||||||
to follow.
|
to follow.
|
||||||
|
|
||||||
. Bugzilla #554 requested support for invalid UTF-8 strings.
|
|
||||||
|
|
||||||
. A user wanted a way of ignoring all Unicode "mark" characters so that, for
|
. A user wanted a way of ignoring all Unicode "mark" characters so that, for
|
||||||
example "a" followed by an accent would, together, match "a".
|
example "a" followed by an accent would, together, match "a". This can only
|
||||||
|
be done clumsily at present by using a lookahead such as /(?=a)\X/, which
|
||||||
|
works for "combining" characters.
|
||||||
|
|
||||||
. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
|
. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
|
||||||
supports \N{U+dd..} everywhere, but not in EBCDIC.
|
supports \N{U+dd..} everywhere, but not in EBCDIC.
|
||||||
|
|
||||||
. Unicode stuff from Perl:
|
. Unicode stuff from Perl:
|
||||||
|
@@ -345,9 +347,6 @@ very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
|
|
||||||
. Bugzilla #1694 requests backwards searching.
|
. Bugzilla #1694 requests backwards searching.
|
||||||
|
|
||||||
. A callout from pcre2_substitute() that happens after (before?) each
|
|
||||||
substitution (value = 256?).
|
|
||||||
|
|
||||||
. Allow a callout to specify a number of characters to skip. This can be done
|
. Allow a callout to specify a number of characters to skip. This can be done
|
||||||
compatibly via an extra callout field.
|
compatibly via an extra callout field.
|
||||||
|
|
||||||
|
@@ -359,74 +358,83 @@ very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
. A limit on substitutions: a user suggested somehow finding a way of making
|
. A limit on substitutions: a user suggested somehow finding a way of making
|
||||||
match_limit apply to the whole operation instead of each match separately.
|
match_limit apply to the whole operation instead of each match separately.
|
||||||
|
|
||||||
. There was a suggestion that Perl should lock out \K in lookarounds. If it
|
|
||||||
does, PCRE2 should follow.
|
|
||||||
|
|
||||||
. Redesign handling of class/nclass/xclass because the compile code logic is
|
. Redesign handling of class/nclass/xclass because the compile code logic is
|
||||||
currently very contorted and obscure.
|
currently very contorted and obscure.
|
||||||
|
|
||||||
. Some #defines could be replaced with enums to improve robustness.
|
. Some #defines could be replaced with enums to improve robustness.
|
||||||
|
|
||||||
. There was a request for and option for pcre2_match() to return the longest
|
. There was a request for an option for pcre2_match() to return the longest
|
||||||
match. This would mean searching for all possible matches, of course.
|
match. This would mean searching for all possible matches, of course.
|
||||||
|
|
||||||
. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters,
|
. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters,
|
||||||
which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
|
which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
|
||||||
Perl also has /aa, which in addition, disables ASCII/non-ASCII caseless
|
Perl also has /aa, which in addition, disables ASCII/non-ASCII caseless
|
||||||
matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
|
matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
|
||||||
practice, this just means not using the ucd_caseless_sets[] table.
|
practice, this just means not using the ucd_caseless_sets[] table.
|
||||||
|
|
||||||
. There is more that could be done to the oss-fuzz setup (needs some research).
|
|
||||||
A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
|
|
||||||
The test function could make use of get_substrings() to cover more code.
|
|
||||||
|
|
||||||
. A neater way of handling recursion file names in pcre2grep, e.g. a single
|
|
||||||
buffer that can grow.
|
|
||||||
|
|
||||||
. A user suggested that before/after parameters in pcre2grep could have
|
|
||||||
negative values, to list lines near to the matched line, but not necessarily
|
|
||||||
the line itself. For example, --before-context=-1 would list the line *after*
|
|
||||||
each matched line, without showing the matched line. The problem here is what
|
|
||||||
to do with matches that are close together. Maybe a simpler way would be a
|
|
||||||
flag to disable showing matched lines, only valid with either -A or -B?
|
|
||||||
|
|
||||||
. There was a suggestion for a pcre2grep colour default, or possibly a more
|
|
||||||
general PCRE2GREP_OPT, but only for some options - not file names or patterns.
|
|
||||||
|
|
||||||
. Breaking loops that match an empty string: perhaps find a way of continuing
|
. There is more that could be done to the oss-fuzz setup (needs some research).
|
||||||
|
A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
|
||||||
|
The test function could make use of get_substrings() to cover more code.
|
||||||
|
|
||||||
|
. A neater way of handling recursion file names in pcre2grep, e.g. a single
|
||||||
|
buffer that can grow.
|
||||||
|
|
||||||
|
. A user suggested that before/after parameters in pcre2grep could have
|
||||||
|
negative values, to list lines near to the matched line, but not necessarily
|
||||||
|
the line itself. For example, --before-context=-1 would list the line *after*
|
||||||
|
each matched line, without showing the matched line. The problem here is what
|
||||||
|
to do with matches that are close together. Maybe a simpler way would be a
|
||||||
|
flag to disable showing matched lines, only valid with either -A or -B?
|
||||||
|
|
||||||
|
. There was a suggestion for a pcre2grep colour default, or possibly a more
|
||||||
|
general PCRE2GREP_OPT, but only for some options - not file names or patterns.
|
||||||
|
|
||||||
|
. Breaking loops that match an empty string: perhaps find a way of continuing
|
||||||
if *something* has changed, but this might mean remembering additional data.
|
if *something* has changed, but this might mean remembering additional data.
|
||||||
"Something" could be a capture value, but then a list of previous values
|
"Something" could be a capture value, but then a list of previous values
|
||||||
would be needed to avoid a cycle of changes. Bugzilla #2182.
|
would be needed to avoid a cycle of changes. Bugzilla #2182.
|
||||||
|
|
||||||
. The use of \K in assertions is problematic. There was some talk of Perl
|
. The use of \K in assertions is problematic. There was some talk of Perl
|
||||||
banning this, but it hasn't happened. Some problems could be avoided by
|
banning this, but it hasn't happened. Some problems could be avoided by
|
||||||
not allowing it to set a value before the match start; others by not allowing
|
not allowing it to set a value before the match start; others by not allowing
|
||||||
it to set a value after the match end. This could be controlled by an option
|
it to set a value after the match end. This could be controlled by an option
|
||||||
such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane
|
such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane
|
||||||
behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
|
behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
|
||||||
|
|
||||||
. If a function could be written to find 3-character (or other length) fixed
|
. If a function could be written to find 3-character (or other length) fixed
|
||||||
strings, at least one of which must be present for a match, efficient
|
strings, at least one of which must be present for a match, efficient
|
||||||
pre-searching of large datasets could be implemented.
|
pre-searching of large datasets could be implemented.
|
||||||
|
|
||||||
. If pcre2grep had --first-line (match only in the first line) it could be
|
. If pcre2grep had --first-line (match only in the first line) it could be
|
||||||
efficiently used to find files "starting with xxx". What about --last-line?
|
efficiently used to find files "starting with xxx". What about --last-line?
|
||||||
|
|
||||||
. A user requested a means of determining whether a failed match was failed by
|
. A user requested a means of determining whether a failed match was failed by
|
||||||
the start-of-match optimizations, or by running the match engine. Easy enough
|
the start-of-match optimizations, or by running the match engine. Easy enough
|
||||||
to define a bit in the match data, but all three matchers would need work.
|
to define a bit in the match data, but all three matchers would need work.
|
||||||
|
|
||||||
. Would inlining "simple" recursions provide a useful performance boost for the
|
. Would inlining "simple" recursions provide a useful performance boost for the
|
||||||
interpreters? JIT already does some of this.
|
interpreters? JIT already does some of this, but it may not be worth it for
|
||||||
|
the interpreters.
|
||||||
. There was a request for a way of re-defining \w (and therefore \W, \b, and
|
|
||||||
\B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way
|
. There was a request for a way of re-defining \w (and therefore \W, \b, and
|
||||||
would be simply to inline the class, with lookarounds for \b and \B. Ideally
|
\B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way
|
||||||
the setting should last till the end of the group, which means remembering
|
would be simply to inline the class, with lookarounds for \b and \B. Ideally
|
||||||
all previous settings; maybe a fixed amount of stack would do - how deep
|
the setting should last till the end of the group, which means remembering
|
||||||
|
all previous settings; maybe a fixed amount of stack would do - how deep
|
||||||
would anyone want to nest these things? Bugzilla #2301.
|
would anyone want to nest these things? Bugzilla #2301.
|
||||||
|
|
||||||
|
. Recognize the short script names. They are already listed in maint/
|
||||||
|
MultiStage2.py because they are needed for scanning the script extensions
|
||||||
|
file.
|
||||||
|
|
||||||
|
. Use script extensions for \p?
|
||||||
|
|
||||||
|
. A user suggested something like --with-build-info to set a build information
|
||||||
|
string that could be retrieved by pcre2_config(). However, there's no
|
||||||
|
facility for a length limit in pcre2_config(), and what would be the
|
||||||
|
encoding?
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: ph10
|
Email local part: ph10
|
||||||
Email domain: cam.ac.uk
|
Email domain: cam.ac.uk
|
||||||
Last updated: 07 October 2018
|
Last updated: 03 June 2019
|
||||||
|
|