Documentation update.

2018-08-23 16:53:45 +00:00 · 2018-08-23 16:53:45 +00:00 · 6c631997d0
parent 5d12e53399
commit 6c631997d0
1 changed file with 82 additions and 3 deletions
--- a/maint/README
+++ b/maint/README
@@ -323,7 +323,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
  example "a" followed by an accent would, together, match "a".

 . Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2 
-  supports \N{U+dd..} everywhere but, not in EBCDIC.
+  supports \N{U+dd..} everywhere, but not in EBCDIC.

 . Unicode stuff from Perl:

@@ -331,7 +331,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
    \b{sb}              sentence boundary
    \b{wb}              word boundary

-  See Unicode TR 29.
+  See Unicode TR 29. The last two are very much aimed at natural language.

 . (?[...]) extended classes: big project.

@@ -359,7 +359,86 @@ very sensible; some are rather wacky. Some have been on this list for years.

 . Some #defines could be replaced with enums to improve robustness.

+. There was a request for an option for pcre2_match() to return the longest 
+  match. This would mean searching for all possible matches, of course.
+  
+. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters, 
+  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
+  Perl also has /aa, which, in addition, disables ASCII/non-ASCII caseless
+  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In 
+  practice, this just means not using the ucd_caseless_sets[] table.
+  
+. There is more that could be done to the oss-fuzz setup (needs some research). 
+  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE. 
+  The test function could make use of get_substrings() to cover more code.
+  
+. A neater way of handling recursion file names in pcre2grep, e.g. a single 
+  buffer that can grow.  
+  
+. A user suggested that before/after parameters in pcre2grep could have 
+  negative values, to list lines near to the matched line, but not necessarily 
+  the line itself. For example, --before-context=-1 would list the line *after* 
+  each matched line, without showing the matched line. The problem here is what
+  to do with matches that are close together. Maybe a simpler way would be a 
+  flag to disable showing matched lines, only valid with either -A or -B?
+  
+. There was a suggestion for a pcre2grep colour default, or possibly a more
+  general PCRE2GREP_OPT, but only for some options - not file names or patterns. 
+
+. Breaking loops that match an empty string: perhaps find a way of continuing 
+  if *something* has changed, but this might mean remembering additional data.
+  "Something" could be a capture value, but then a list of previous values 
+  would be needed to avoid a cycle of changes. Bugzilla #2182.
+  
+. The use of \K in assertions is problematic. There was some talk of Perl 
+  banning this, but it hasn't happened. Some problems could be avoided by 
+  not allowing it to set a value before the match start; others by not allowing 
+  it to set a value after the match end. This could be controlled by an option 
+  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane 
+  behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
+  
+. If a function could be written to find 3-character (or other length) fixed 
+  strings, at least one of which must be present for a match, efficient
+  pre-searching of large datasets could be implemented.
+  
+. There's a Perl proposal for some new (* things, including alpha synonyms for 
+  the lookaround assertions:
+
+  (*pla:...)
+  (*plb:...)
+  (*nla:...)
+  (*nlb:...)
+  (*atomic:...)
+  (*positive_look_ahead:...)
+  (*negative_look_ahead:...)
+  (*positive_look_behind:...)
+  (*negative_look_behind:...)
+
+  Also a new one (with synonyms):
+
+  (*script_run:...)         Ensure all captured chars are in the same script
+  (*sr:...)
+  (*atomic_script_run:...)  A combination of script_run and atomic
+  (*asr:...)
+
+. If pcre2grep had --first-line (match only in the first line) it could be 
+  efficiently used to find files "starting with xxx". What about --last-line?
+  
+. A user requested a means of determining whether a failed match was failed by
+  the start-of-match optimizations, or by running the match engine. Easy enough 
+  to define a bit in the match data, but all three matchers would need work.
+  
+. Would inlining "simple" recursions provide a useful performance boost for the 
+  interpreters? JIT already does some of this.
+  
+. There was a request for a way of re-defining \w (and therefore \W, \b, and 
+  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way 
+  would be simply to inline the class, with lookarounds for \b and \B. Ideally 
+  the setting should last till the end of the group, which means remembering 
+  all previous settings; maybe a fixed amount of stack would do - how deep 
+  would anyone want to nest these things? Bugzilla #2301.
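The "inline the class, with lookarounds for \b and \B" idea in the item above can be sketched outside PCRE2. The following uses Python's standard re module, not PCRE2; the helper name redefine_word and its textual rewriting approach are assumptions made for illustration only. It does not handle \w inside an existing character class, and it applies one setting to the whole pattern rather than scoping it to a group.

```python
import re


def redefine_word(pattern: str, word_class: str) -> str:
    """Rewrite \\w, \\W, \\b and \\B in `pattern` using a custom word class.

    `word_class` is the body of a character class, e.g. "A-Za-z0-9_-".
    Hypothetical helper: a textual rewrite, not how PCRE2 would do it.
    """
    w = "[" + word_class + "]"
    not_w = "[^" + word_class + "]"
    replacements = {
        "w": w,
        "W": not_w,
        # Boundary: a word character on exactly one side of the position.
        "b": "(?:(?<=" + w + ")(?!" + w + ")|(?<!" + w + ")(?=" + w + "))",
        # Non-boundary: word characters on both sides, or on neither.
        "B": "(?:(?<=" + w + ")(?=" + w + ")|(?<!" + w + ")(?!" + w + "))",
    }

    def substitute(m: re.Match) -> str:
        # Every escape pair is consumed, so an escaped backslash followed
        # by a literal "w" is left alone.
        return replacements.get(m.group(1), m.group(0))

    return re.sub(r"\\(.)", substitute, pattern, flags=re.S)


if __name__ == "__main__":
    rewritten = redefine_word(r"\b\w+\b", "A-Za-z0-9_-")
    print(re.findall(rewritten, "foo-bar baz"))       # ['foo-bar', 'baz']
    print(re.findall(r"\b\w+\b", "foo-bar baz"))      # ['foo', 'bar', 'baz']
```

Scoping the setting to the enclosing group, as the item suggests, would need the rewrite to track group nesting; this sketch leaves that out.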
+
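The longest-match request earlier in this list (an option for pcre2_match() to return the longest match) implies trying every possible end point, as that item notes. A minimal sketch of the idea, using Python's standard re module rather than PCRE2: the function name longest_match is hypothetical, and truncating the subject with endpos is an assumption that changes how $, \b and lookaheads behave near the cut.

```python
import re


def longest_match(pattern: str, subject: str, start: int = 0):
    """Return (start, end) of the longest match at the leftmost matching
    position, or None if the pattern matches nowhere.

    Brute force: for each start position, try the longest candidate span
    first and shrink it until the whole span matches.
    """
    compiled = re.compile(pattern)
    for pos in range(start, len(subject) + 1):
        for end in range(len(subject), pos - 1, -1):
            # fullmatch() must consume exactly subject[pos:end], so the
            # engine backtracks through every alternative for this span.
            if compiled.fullmatch(subject, pos, end):
                return pos, end
    return None


if __name__ == "__main__":
    # A backtracking matcher stops at the first alternative that works.
    print(re.search("a|ab", "xab").span())   # (1, 2)
    print(longest_match("a|ab", "xab"))      # (1, 3)
```

The quadratic number of attempts is the cost the item alludes to; an implementation inside the matcher could instead record the best end point seen while exploring alternatives.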
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 13 August 2018
+Last updated: 21 August 2018