Documentation update.

2019-06-03 16:39:20 +00:00 · dea540877b
parent 16d47a9cb1
commit dea540877b
1 changed file with 72 additions and 64 deletions
--- a/maint/README
+++ b/maint/README
@@ -38,7 +38,7 @@ pcre2_chartables.c.non-standard

 README           This file.

-Unicode.tables   The files in this directory were downloaded from the Unicode 
+Unicode.tables   The files in this directory were downloaded from the Unicode
                 web site. They contain information about Unicode characters
                 and scripts. The ones used by the MultiStage2.py script are
                 CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
@@ -97,7 +97,7 @@ lists of scripts.
 The ucptest program can be compiled and used to check that the new tables in
 pcre2_ucd.c work properly, using the data files in ucptestdata to check a
 number of test characters. The source file ucptest.c should also be updated
-whenever new Unicode script names are added, and adding a few tests for new 
+whenever new Unicode script names are added, and adding a few tests for new
 scripts is a good idea.


@@ -141,8 +141,9 @@ distribution for a new release.

 . Run perltest.sh on the test data for tests 1 and 4. The output should match
  the PCRE2 test output, apart from the version identification at the start of
-  each test. The other tests are not Perl-compatible (they use various
-  PCRE2-specific features or options).
+  each test. Sometimes there are other differences in test 4 if PCRE2 and Perl
+  are using different Unicode releases. The other tests are not Perl-compatible
+  (they use various PCRE2-specific features or options).

 . It is possible to test with the emulated memmove() function by undefining
  HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often.
@@ -155,8 +156,9 @@ distribution for a new release.
  systems. For example, on Solaris it is helpful to test using Sun's cc
  compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
  64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
-  for test 2. Since I retired I can no longer do this, but instead I rely on
-  putting out release candidates for folks on the pcre-dev list to test.
+  for test 2. Since I retired I can no longer do much of this, but instead I
+  rely on putting out release candidates for folks on the pcre-dev list to
+  test.

 . The buildbots at http://buildfarm.opencsw.org/ do some automated testing
  of PCRE2 and should be checked before putting out a release.
@@ -285,7 +287,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
  to switch this dynamically. It would have to be specified when PCRE2 was
  compiled. PCRE2 would then call a function every time it wanted a character.

-. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
+. pcre2grep: add -rs for a sorted recurse. Having to store file names and sort
  them will of course slow it down.

 . Someone suggested --disable-callout to save code space when callouts are
@@ -314,8 +316,8 @@ very sensible; some are rather wacky. Some have been on this list for years.
  but the same number (created by the use of ?|). In order to do so, a way of
  remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
  (*MARK) can perhaps be used as a way round this problem. However, note that
-  Perl does not distinguish: like PCRE2, a name is just an alias for a number 
-  in Perl. 
+  Perl does not distinguish: like PCRE2, a name is just an alias for a number
+  in Perl.

 . Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
  "something" and the #ifdef appears only in one place, in "something".
@@ -325,12 +327,12 @@ very sensible; some are rather wacky. Some have been on this list for years.
 . If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
  to follow.

-. Bugzilla #554 requested support for invalid UTF-8 strings.
-
 . A user wanted a way of ignoring all Unicode "mark" characters so that, for
-  example "a" followed by an accent would, together, match "a".
+  example "a" followed by an accent would, together, match "a". This can only
+  be done clumsily at present by using a lookahead such as /(?=a)\X/, which
+  works for "combining" characters.

-. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2 
+. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
  supports \N{U+dd..} everywhere, but not in EBCDIC.

 . Unicode stuff from Perl:
@@ -345,9 +347,6 @@ very sensible; some are rather wacky. Some have been on this list for years.

 . Bugzilla #1694 requests backwards searching.

-. A callout from pcre2_substitute() that happens after (before?) each
-  substitution (value = 256?).
-
 . Allow a callout to specify a number of characters to skip. This can be done
  compatibly via an extra callout field.

@@ -359,74 +358,83 @@ very sensible; some are rather wacky. Some have been on this list for years.
 . A limit on substitutions: a user suggested somehow finding a way of making
  match_limit apply to the whole operation instead of each match separately.

-. There was a suggestion that Perl should lock out \K in lookarounds. If it
-  does, PCRE2 should follow.
-
 . Redesign handling of class/nclass/xclass because the compile code logic is
  currently very contorted and obscure.

 . Some #defines could be replaced with enums to improve robustness.

-. There was a request for and option for pcre2_match() to return the longest 
+. There was a request for an option for pcre2_match() to return the longest
  match. This would mean searching for all possible matches, of course.
-  
-. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters, 
+
+. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters,
  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
  Perl also has /aa, which in addition, disables ASCII/non-ASCII caseless
-  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In 
+  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
  practice, this just means not using the ucd_caseless_sets[] table.
-  
-. There is more that could be done to the oss-fuzz setup (needs some research). 
-  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE. 
-  The test function could make use of get_substrings() to cover more code.
-  
-. A neater way of handling recursion file names in pcre2grep, e.g. a single 
-  buffer that can grow.  
-  
-. A user suggested that before/after parameters in pcre2grep could have 
-  negative values, to list lines near to the matched line, but not necessarily 
-  the line itself. For example, --before-context=-1 would list the line *after* 
-  each matched line, without showing the matched line. The problem here is what
-  to do with matches that are close together. Maybe a simpler way would be a 
-  flag to disable showing matched lines, only valid with either -A or -B?
-  
-. There was a suggestiong for a pcre2grep colour default, or possibly a more
-  general PCRE2GREP_OPT, but only for some options - not file names or patterns. 

-. Breaking loops that match an empty string: perhaps find a way of continuing 
+. There is more that could be done to the oss-fuzz setup (needs some research).
+  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
+  The test function could make use of get_substrings() to cover more code.
+
+. A neater way of handling recursion file names in pcre2grep, e.g. a single
+  buffer that can grow.
+
+. A user suggested that before/after parameters in pcre2grep could have
+  negative values, to list lines near to the matched line, but not necessarily
+  the line itself. For example, --before-context=-1 would list the line *after*
+  each matched line, without showing the matched line. The problem here is what
+  to do with matches that are close together. Maybe a simpler way would be a
+  flag to disable showing matched lines, only valid with either -A or -B?
+
+. There was a suggestion for a pcre2grep colour default, or possibly a more
+  general PCRE2GREP_OPT, but only for some options - not file names or patterns.
+
+. Breaking loops that match an empty string: perhaps find a way of continuing
  if *something* has changed, but this might mean remembering additional data.
-  "Something" could be a capture value, but then a list of previous values 
+  "Something" could be a capture value, but then a list of previous values
  would be needed to avoid a cycle of changes. Bugzilla #2182.
-  
-. The use of \K in assertions is problematic. There was some talk of Perl 
-  banning this, but it hasn't happened. Some problems could be avoided by 
-  not allowing it to set a value before the match start; others by not allowing 
-  it to set a value after the match end. This could be controlled by an option 
-  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane 
+
+. The use of \K in assertions is problematic. There was some talk of Perl
+  banning this, but it hasn't happened. Some problems could be avoided by
+  not allowing it to set a value before the match start; others by not allowing
+  it to set a value after the match end. This could be controlled by an option
+  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane
  behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
-  
-. If a function could be written to find 3-character (or other length) fixed 
+
+. If a function could be written to find 3-character (or other length) fixed
  strings, at least one of which must be present for a match, efficient
  pre-searching of large datasets could be implemented.
-  
-. If pcre2grep had --first-line (match only in the first line) it could be 
+
+. If pcre2grep had --first-line (match only in the first line) it could be
  efficiently used to find files "starting with xxx". What about --last-line?
-  
+
 . A user requested a means of determining whether a failed match was failed by
-  the start-of-match optimizations, or by running the match engine. Easy enough 
+  the start-of-match optimizations, or by running the match engine. Easy enough
  to define a bit in the match data, but all three matchers would need work.
-  
-. Would inlining "simple" recursions provide a useful performance boost for the 
-  interpreters? JIT already does some of this.
-  
-. There was a request for a way of re-defining \w (and therefore \W, \b, and 
-  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way 
-  would be simply to inline the class, with lookarounds for \b and \B. Ideally 
-  the setting should last till the end of the group, which means remembering 
-  all previous settings; maybe a fixed amount of stack would do - how deep 
+
+. Would inlining "simple" recursions provide a useful performance boost for the
+  interpreters? JIT already does some of this, but it may not be worth it for
+  the interpreters.
+
+. There was a request for a way of re-defining \w (and therefore \W, \b, and
+  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way
+  would be simply to inline the class, with lookarounds for \b and \B. Ideally
+  the setting should last till the end of the group, which means remembering
+  all previous settings; maybe a fixed amount of stack would do - how deep
  would anyone want to nest these things? Bugzilla #2301.

+. Recognize the short script names. They are already listed in
+  maint/MultiStage2.py because they are needed for scanning the script
+  extensions file.
+
+. Use script extensions for \p?
+
+. A user suggested something like --with-build-info to set a build information
+  string that could be retrieved by pcre2_config(). However, there's no
+  facility for a length limit in pcre2_config(), and what would be the
+  encoding?
+
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 07 October 2018
+Last updated: 03 June 2019