Documentation update.

2018-08-23 16:53:45 +00:00
parent 5d12e53399
commit 6c631997d0
1 changed file with 82 additions and 3 deletions
--- a/maint/README
+++ b/maint/README
@@ -323,7 +323,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
  example "a" followed by an accent would, together, match "a".
 . Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2 
-  supports \N{U+dd..} everywhere but, not in EBCDIC.
+  supports \N{U+dd..} everywhere, but not in EBCDIC.
 . Unicode stuff from Perl:
@@ -331,7 +331,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
    \b{sb}              sentence boundary
    \b{wb}              word boundary
-  See Unicode TR 29.
+  See Unicode TR 29. The last two are very much aimed at natural language.
 . (?[...]) extended classes: big project.
@@ -359,7 +359,86 @@ very sensible; some are rather wacky. Some have been on this list for years.
 . Some #defines could be replaced with enums to improve robustness.
 . There was a request for an option for pcre2_match() to return the longest
  match. This would mean searching for all possible matches, of course.
 . Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters, 
  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
  Perl also has /aa, which, in addition, disables ASCII/non-ASCII caseless
  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In 
  practice, this just means not using the ucd_caseless_sets[] table.
 . There is more that could be done to the oss-fuzz setup (needs some research). 
  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE. 
  The test function could make use of get_substrings() to cover more code.
 . A neater way of handling recursion file names in pcre2grep, e.g. a single 
  buffer that can grow.  
 . A user suggested that before/after parameters in pcre2grep could have 
  negative values, to list lines near to the matched line, but not necessarily 
  the line itself. For example, --before-context=-1 would list the line *after* 
  each matched line, without showing the matched line. The problem here is what
  to do with matches that are close together. Maybe a simpler way would be a 
  flag to disable showing matched lines, only valid with either -A or -B?
 . There was a suggestion for a pcre2grep colour default, or possibly a more
  general PCRE2GREP_OPT, but only for some options - not file names or patterns. 
 . Breaking loops that match an empty string: perhaps find a way of continuing 
  if *something* has changed, but this might mean remembering additional data.
  "Something" could be a capture value, but then a list of previous values 
  would be needed to avoid a cycle of changes. Bugzilla #2182.
 . The use of \K in assertions is problematic. There was some talk of Perl 
  banning this, but it hasn't happened. Some problems could be avoided by 
  not allowing it to set a value before the match start; others by not allowing 
  it to set a value after the match end. This could be controlled by an option 
  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane 
  behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
 . If a function could be written to find 3-character (or other length) fixed 
  strings, at least one of which must be present for a match, efficient
  pre-searching of large datasets could be implemented.
 . There's a Perl proposal for some new (* things, including alpha synonyms for 
  the lookaround assertions:
  (*pla: ...)
  (*plb: ...)
  (*nla: ...)
  (*nlb: ...)
  (*atomic: ...)
  (*positive_look_ahead:...)
  (*negative_look_ahead:...)
  (*positive_look_behind:...)
  (*negative_look_behind:...)
  Also a new one (with synonyms):
  (*script_run: ...)        Ensure all captured chars are in the same script
  (*sr: ...)
  (*atomic_script_run: ...) A combination of script_run and atomic
  (*asr:...)
 . If pcre2grep had --first-line (match only in the first line) it could be 
  efficiently used to find files "starting with xxx". What about --last-line?
 . A user requested a means of determining whether a failed match was failed by
  the start-of-match optimizations, or by running the match engine. Easy enough 
  to define a bit in the match data, but all three matchers would need work.
 . Would inlining "simple" recursions provide a useful performance boost for the 
  interpreters? JIT already does some of this.
 . There was a request for a way of re-defining \w (and therefore \W, \b, and 
  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way 
  would be simply to inline the class, with lookarounds for \b and \B. Ideally 
  the setting should last till the end of the group, which means remembering 
  all previous settings; maybe a fixed amount of stack would do - how deep 
  would anyone want to nest these things? Bugzilla #2301.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 13 August 2018
+Last updated: 21 August 2018
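
The pcre2grep item about listing lines near a match without the matched line
itself can be sketched as follows. This is only an illustration in Python,
not pcre2grep code: the function name context_only and its parameters are
hypothetical, and it assumes one possible answer to the "matches that are
close together" question, namely that a matched line is always suppressed,
even when it falls inside another match's context, and that overlapping
context lines are listed once.

```python
import re

def context_only(lines, pattern, before=0, after=0):
    """Return the context lines around each match, omitting matched lines.

    Hypothetical helper: `before` and `after` play the role of -B and -A.
    """
    rx = re.compile(pattern)
    matched = [i for i, line in enumerate(lines) if rx.search(line)]
    matched_set = set(matched)
    wanted = set()
    for i in matched:
        first = max(0, i - before)
        last = min(len(lines) - 1, i + after)
        for j in range(first, last + 1):
            # Assumption: matched lines are never shown, even as context.
            if j not in matched_set:
                wanted.add(j)
    # Emit each context line once, in file order.
    return [lines[j] for j in sorted(wanted)]

if __name__ == "__main__":
    text = ["a", "hit 1", "b", "hit 2", "c"]
    print(context_only(text, "hit", after=1))
    print(context_only(text, "hit", before=1))
```

With after=1 this plays the role of the suggested --before-context=-1: it
lists the line after each matched line and nothing else.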
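
The pre-searching item (fixed 3-character strings, at least one of which must
be present for a match) can be illustrated for the easy case only. The sketch
below is in Python and assumes the pattern is a plain alternation of literal
strings; extracting such strings from a general pattern is the hard part and
is not attempted. The names required_trigrams and prefiltered_search are
hypothetical.

```python
import re

def required_trigrams(literals):
    """Pick one 3-character substring from each literal alternative.

    A subject that matches the alternation must contain at least one of the
    returned strings. Returns None when an alternative is shorter than three
    characters, in which case no prefilter is possible.
    """
    chosen = set()
    for lit in literals:
        if len(lit) < 3:
            return None
        chosen.add(lit[:3])
    return chosen

def prefiltered_search(literals, subjects):
    """Run the regex only on subjects that pass the cheap substring test."""
    pattern = re.compile("|".join(re.escape(lit) for lit in literals))
    trigrams = required_trigrams(literals)
    hits = []
    for s in subjects:
        if trigrams is not None and not any(t in s for t in trigrams):
            continue  # none of the required strings is present: cannot match
        if pattern.search(s):
            hits.append(s)
    return hits

if __name__ == "__main__":
    subjects = ["all fine", "disk error", "a warning", "warm errand"]
    print(prefiltered_search(["error", "warning"], subjects))
```

The prefilter may let through subjects that do not match ("warm errand"
contains both "war" and "err"), so the full match is still run on whatever
survives; it must never reject a subject that could match.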
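
The "easiest way" mentioned for re-defining \w (inline the class, with
lookarounds for \b and \B) can be sketched as a textual rewrite. This uses
Python's re module as a stand-in, not PCRE2, and the helper name
expand_word_class is hypothetical. It assumes a single setting for the whole
pattern, so the scoping and nesting question raised in that item is not
addressed, and it does not handle escaped backslashes or sequences that
already sit inside a [...] class.

```python
import re

def expand_word_class(pattern, word_class):
    """Rewrite \\w, \\W, \\b and \\B in terms of a user-supplied class.

    `word_class` is the inside of a character class, for example "a-z-".
    """
    w = f"[{word_class}]"
    not_w = f"[^{word_class}]"
    # A boundary is a position with a word character on exactly one side.
    boundary = f"(?:(?<={w})(?!{w})|(?<!{w})(?={w}))"
    # A non-boundary has word characters on both sides or on neither.
    non_boundary = f"(?:(?<={w})(?={w})|(?<!{w})(?!{w}))"
    table = {r"\w": w, r"\W": not_w, r"\b": boundary, r"\B": non_boundary}
    # A function replacement is inserted literally, so no escaping is needed.
    return re.sub(r"\\[wWbB]", lambda m: table[m.group(0)], pattern)

if __name__ == "__main__":
    rewritten = expand_word_class(r"\b\w+\b", "a-z-")
    print(re.findall(rewritten, "well-known fact"))
    print(re.findall(r"\b\w+\b", "well-known fact"))
```

Here the hyphen is treated as a word character, so "well-known" is found as
one word, whereas the unmodified pattern splits it at the hyphen.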