Documentation update.
parent 5d12e53399
commit 6c631997d0

maint/README: 85 lines changed
--- a/maint/README
+++ b/maint/README
@@ -323,7 +323,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
   example "a" followed by an accent would, together, match "a".

 . Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
-  supports \N{U+dd..} everywhere but, not in EBCDIC.
+  supports \N{U+dd..} everywhere, but not in EBCDIC.

 . Unicode stuff from Perl:

@@ -331,7 +331,7 @@ very sensible; some are rather wacky. Some have been on this list for years.
     \b{sb}   sentence boundary
     \b{wb}   word boundary

-  See Unicode TR 29.
+  See Unicode TR 29. The last two are very much aimed at natural language.

 . (?[...]) extended classes: big project.

@@ -359,7 +359,86 @@ very sensible; some are rather wacky. Some have been on this list for years.

 . Some #defines could be replaced with enums to improve robustness.

+. There was a request for an option for pcre2_match() to return the longest
+  match. This would mean searching for all possible matches, of course.
+
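The cost mentioned in the item above can be seen in a small sketch. It uses Python's `re` module rather than PCRE2, purely as an illustration, and `longest_match` is a hypothetical helper. It finds the leftmost-longest match by testing every candidate end point, which is the "all possible matches" search. Note that `endpos` makes the engine treat the candidate end as the end of the subject, so patterns containing `$` or lookaheads could behave differently in a real implementation.

```python
import re

def longest_match(pattern, subject):
    """Return the leftmost-longest match of pattern in subject, or None.

    Hypothetical helper: brute force over every (start, end) pair.
    """
    rx = re.compile(pattern)
    for start in range(len(subject) + 1):
        # Try the longest candidate first, so the first hit is the longest.
        for end in range(len(subject), start - 1, -1):
            m = rx.fullmatch(subject, start, end)
            if m is not None:
                return m
    return None

first = re.search(r"a|ab|abc", "xabcd")        # ordinary leftmost-first match
longest = longest_match(r"a|ab|abc", "xabcd")  # leftmost-longest match
print(first.group(0), longest.group(0))
```

The ordinary search stops at the first alternative that succeeds (`a`), while the brute-force version returns `abc`, at the price of a quadratic number of match attempts.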
+. Perl's /a modifier sets Unicode, but restricts \d etc. to ASCII characters,
+  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
+  Perl also has /aa, which, in addition, disables ASCII/non-ASCII caseless
+  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
+  practice, this just means not using the ucd_caseless_sets[] table.
+
+. There is more that could be done to the oss-fuzz setup (needs some research).
+  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
+  The test function could make use of get_substrings() to cover more code.
+
+. A neater way of handling recursion file names in pcre2grep, e.g. a single
+  buffer that can grow.
+
+. A user suggested that before/after parameters in pcre2grep could have
+  negative values, to list lines near to the matched line, but not necessarily
+  the line itself. For example, --before-context=-1 would list the line *after*
+  each matched line, without showing the matched line. The problem here is what
+  to do with matches that are close together. Maybe a simpler way would be a
+  flag to disable showing matched lines, only valid with either -A or -B?
+
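The simpler alternative at the end of the item above could behave like the following sketch. It is written in Python for brevity; `context_only` and its parameters are hypothetical and are not part of pcre2grep. Matched lines are used only to select their neighbours and are never output, so matches that are close together need no special handling.

```python
import re

def context_only(lines, pattern, before=0, after=0):
    """Return (line_number, text) pairs for context lines, omitting matched lines."""
    rx = re.compile(pattern)
    matched = [rx.search(line) is not None for line in lines]
    keep = set()
    for i, hit in enumerate(matched):
        if hit:
            # Mark the window around each match, clipped to the file.
            lo = max(0, i - before)
            hi = min(len(lines) - 1, i + after)
            keep.update(range(lo, hi + 1))
    # Matched lines select the windows but are never output themselves.
    return [(i + 1, lines[i]) for i in sorted(keep) if not matched[i]]

lines = ["a", "hit 1", "b", "hit 2", "hit 3", "c", "d"]
print(context_only(lines, "hit", after=1))
```

With `after=1` this lists lines 3 and 6 only: the line after `hit 2` is itself a match, so it is suppressed rather than shown as context.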
+. There was a suggestion for a pcre2grep colour default, or possibly a more
+  general PCRE2GREP_OPT, but only for some options - not file names or patterns.
+
+. Breaking loops that match an empty string: perhaps find a way of continuing
+  if *something* has changed, but this might mean remembering additional data.
+  "Something" could be a capture value, but then a list of previous values
+  would be needed to avoid a cycle of changes. Bugzilla #2182.
+
+. The use of \K in assertions is problematic. There was some talk of Perl
+  banning this, but it hasn't happened. Some problems could be avoided by
+  not allowing it to set a value before the match start; others by not allowing
+  it to set a value after the match end. This could be controlled by an option
+  such as PCRE2_SANE_BACKSLASH_K, for compatibility (or possibly make the sane
+  behaviour the default and implement PCRE2_INSANE_BACKSLASH_K).
+
+. If a function could be written to find 3-character (or other length) fixed
+  strings, at least one of which must be present for a match, efficient
+  pre-searching of large datasets could be implemented.
+
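A sketch of how such a pre-search might be used, assuming the set of required strings has already been derived from the pattern (that derivation is the hard part and is not shown). Python's `re` stands in for PCRE2 here, and both `prefilter_search` and the hand-written `required` list are assumptions made for illustration.

```python
import re

def prefilter_search(pattern, required, records):
    """Yield (record, matched_text) for records that match pattern.

    `required` must hold fixed strings such that every possible match
    contains at least one of them; otherwise matches would be missed.
    """
    rx = re.compile(pattern)
    for record in records:
        # Cheap substring scan first; records without any required string
        # are rejected without running the regex engine at all.
        if not any(s in record for s in required):
            continue
        m = rx.search(record)
        if m is not None:
            yield record, m.group(0)

records = ["nothing here", "id=foobar42", "bazqux", "x bazqux7 y"]
hits = list(prefilter_search(r"(?:foobar|bazqux)\d+", ["foo", "baz"], records))
print(hits)
```

The record `bazqux` passes the substring scan but fails the full match, which shows that the pre-search is only a filter: it can reject records cheaply, but it never confirms a match by itself.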
+. There's a Perl proposal for some new (* things, including alpha synonyms for
+  the lookaround assertions:
+
+   (*pla: ...)
+   (*plb: ...)
+   (*nla: ...)
+   (*nlb: ...)
+   (*atomic: ...)
+   (*positive_look_ahead:...)
+   (*negative_look_ahead:...)
+   (*positive_look_behind:...)
+   (*negative_look_behind:...)
+
+  Also a new one (with synonyms):
+
+   (*script_run: ...)         Ensure all captured chars are in the same script
+   (*sr: ...)
+   (*atomic_script_run: ...)  A combination of script_run and atomic
+   (*asr:...)
+
+. If pcre2grep had --first-line (match only in the first line) it could be
+  efficiently used to find files "starting with xxx". What about --last-line?
+
+. A user requested a means of determining whether a failed match was failed by
+  the start-of-match optimizations, or by running the match engine. Easy enough
+  to define a bit in the match data, but all three matchers would need work.
+
+. Would inlining "simple" recursions provide a useful performance boost for the
+  interpreters? JIT already does some of this.
+
+. There was a request for a way of re-defining \w (and therefore \W, \b, and
+  \B). An in-pattern sequence such as (?w=[...]) was suggested. Easiest way
+  would be simply to inline the class, with lookarounds for \b and \B. Ideally
+  the setting should last till the end of the group, which means remembering
+  all previous settings; maybe a fixed amount of stack would do - how deep
+  would anyone want to nest these things? Bugzilla #2301.
+
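The "inline the class" approach described above can be sketched as a textual rewrite. The example uses Python's `re` syntax as a stand-in for PCRE2, and `inline_word_class` is a hypothetical helper. It ignores scoping to the end of a group, and it does not treat escapes inside an existing character class specially, so it only demonstrates the lookaround idea.

```python
import re

def inline_word_class(pattern, cls):
    r"""Rewrite \w, \W, \b and \B in pattern using the class body cls."""
    inside = f"[{cls}]"
    outside = f"[^{cls}]"
    # A boundary is a position where exactly one neighbour is in the class.
    boundary = f"(?:(?<={inside})(?!{inside})|(?<!{inside})(?={inside}))"
    not_boundary = f"(?:(?<={inside})(?={inside})|(?<!{inside})(?!{inside}))"
    table = {"w": inside, "W": outside, "b": boundary, "B": not_boundary}

    def swap(m):
        # Unknown escapes (including an escaped backslash) are kept as they are.
        return table.get(m.group(1), m.group(0))

    # Escapes are consumed as pairs, so an escaped backslash followed by a
    # plain "w" is left untouched.
    return re.sub(r"\\(.)", swap, pattern, flags=re.DOTALL)

# Treat the hyphen as a word character in addition to the usual ASCII set.
rx = re.compile(inline_word_class(r"\b\w+\b", "A-Za-z0-9_-"))
print(rx.findall("well-known fact, re-run"))
```

With the hyphen included in the class, `well-known` and `re-run` are each found as a single word, which the default `\w` and `\b` would split at the hyphen.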
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 13 August 2018
+Last updated: 21 August 2018