Change Log for PCRE2
--------------------


Version 10.30-DEV 09-March-2017
-------------------------------

1. The main interpreter, pcre2_match(), has been refactored into a new version
that does not use recursive function calls (and therefore the stack) for
remembering backtracking positions. This makes --disable-stack-for-recursion a
NOOP. The new implementation allows backtracking into recursive group calls in
patterns, making it more compatible with Perl, and also fixes some other
hard-to-do issues such as #1887 in Bugzilla. The code is also cleaner because
the old code had a number of fudges to try to reduce stack usage. It seems to
run no slower than the old code.

A number of bugs in the refactored code were subsequently fixed during testing
before release, but after the code was made available in the repository. These
bugs were never in fully released code, but are noted here for the record.

  (a) If a pattern had fewer capturing parentheses than the ovector supplied in
      the match data block, a memory error (detectable by ASAN) occurred after
      a match, because the external block was being set from non-existent
      internal ovector fields. Fixes oss-fuzz issue 781.

  (b) A pattern with very many capturing parentheses (when the internal frame
      size was greater than the initial frame vector on the stack) caused a
      crash. A vector on the heap is now set up at the start of matching if the
      vector on the stack is not big enough to handle at least 10 frames.
      Fixes oss-fuzz issue 783.

  (c) Handling of (*VERB)s in recursions was wrong in some cases.

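The idea in item 1 of keeping backtracking positions in an explicit,
heap-allocated vector instead of on the C call stack can be illustrated with a
minimal sketch. This is not PCRE2's code: the toy matcher below understands
only literal characters, '.', and a greedy '*' after a single character, and
the names bt_frame and toy_match are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One remembered backtracking position: where to resume in the pattern and
   in the subject. (Hypothetical type, far simpler than a real match frame.) */
typedef struct { size_t p; size_t s; } bt_frame;

/* Anchored match of a toy pattern (literals, '.', and 'x*') against the whole
   subject. Returns 1 for a match, 0 for no match, -1 if memory runs out.
   Backtracking points live in a heap vector that grows on demand, so the
   depth of the search never consumes C stack. */
int toy_match(const char *pat, const char *sub)
{
  size_t cap = 16, top = 0, p = 0, s = 0;
  bt_frame *stack;
  int rc = 0;

  assert(pat != NULL && sub != NULL);
  stack = malloc(cap * sizeof *stack);
  if (stack == NULL) return -1;

  for (;;)
    {
    int ok;

    if (pat[p] == '\0' && sub[s] == '\0') { rc = 1; break; }  /* full match */

    /* Does the current pattern character match the current subject one? */
    ok = sub[s] != '\0' && (pat[p] == '.' || pat[p] == sub[s]);

    if (pat[p] != '\0' && pat[p + 1] == '*')
      {
      if (ok)
        {
        /* Greedy: consume one character, but remember that the star could
           instead stop here. */
        if (top == cap)
          {
          bt_frame *bigger = realloc(stack, 2 * cap * sizeof *stack);
          if (bigger == NULL) { rc = -1; break; }
          stack = bigger;
          cap *= 2;
          }
        stack[top].p = p + 2;
        stack[top].s = s;
        top++;
        s++;
        }
      else p += 2;              /* the star matches nothing more; move on */
      continue;
      }

    if (pat[p] != '\0' && ok) { p++; s++; continue; }

    /* Mismatch: resume from the most recent remembered position, if any. */
    if (top == 0) break;
    top--;
    p = stack[top].p;
    s = stack[top].s;
    }

  free(stack);
  return rc;
}
```

Matching "a*a" against "aaa" exercises the vector: the star first consumes all
three characters, and the final literal only matches after one remembered
position has been popped.
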
2. Now that pcre2_match() no longer uses recursive function calls (see above),
the "match limit recursion" value seems misnamed. It still exists, and limits
the depth of tree that is searched. To avoid future confusion, it has been
renamed as "depth limit" in all relevant places (--with-depth-limit,
(*LIMIT_DEPTH), pcre2_set_depth_limit(), etc.) but the old names are still
available for backwards compatibility.

3. Hardened pcre2test so as to reduce the number of bugs reported by fuzzers:

  (a) Check for malloc failures when getting memory for the ovector (POSIX) or
      the match data block (non-POSIX).

4. In the 32-bit library in non-UTF mode, an attempt to find a Unicode property
for a character with a code point greater than 0x10ffff (the Unicode maximum)
caused a crash.

5. If a lookbehind assertion that contained a back reference to a group
appearing later in the pattern was compiled with the PCRE2_ANCHORED option,
undefined actions (often a segmentation fault) could occur, depending on what
other options were set. An example assertion is (?<!\1(abc)) where the
reference \1 precedes the group (abc). This fixes oss-fuzz issue 865.

6. Added the PCRE2_INFO_FRAMESIZE item to pcre2_pattern_info() and arranged for
pcre2test to use it to output the frame size when the "framesize" modifier is
given.

7. Reworked the recursive pattern matching in the JIT compiler to follow the
interpreter changes.

8. When the zero_terminate modifier was specified on a pcre2test subject line
for global matching, unpredictable things could happen. For example, in UTF-8
mode, the pattern //g,zero_terminate read random memory when matched against an
empty string with zero_terminate. This was a bug in pcre2test, not the library.

9. Moved some Windows-specific code in pcre2grep (introduced in 10.23/13) out
of the section that is compiled when Unix-style directory scanning is
available, and into a new section that is always compiled for Windows.

10. In pcre2test, explicitly close the file after an error during serialization
or deserialization (the "load" or "save" commands).

11. Fix memory leak in pcre2_serialize_decode() when the input is invalid.

12. Fix potential NULL dereference in pcre2_callout_enumerate() if called with
a NULL pattern pointer when Unicode support is available.

13. When the 32-bit library was being tested by pcre2test, error messages that
were longer than 64 code units could cause a buffer overflow. This was a bug in
pcre2test.

14. The alternative matching function, pcre2_dfa_match(), misbehaved if it
encountered a character class with a possessive repeat, for example [a-f]{3}+.

15. The depth (formerly recursion) limit now applies to DFA matching (as
of 10.23/36); pcre2test has been upgraded so that \=find_limits works with DFA
matching to find the minimum value for this limit.

16. Since 10.21, if pcre2_match() was called with a null context, default
memory allocation functions were used instead of whatever was used when the
pattern was compiled.

17. Changes to the pcre2test "memory" modifier on a subject line. These apply
only to pcre2_match():

  (a) Warn if null_context is set on both pattern and subject, because the
      memory details cannot then be shown.

  (b) Remember (up to a certain number of) memory allocations and their
      lengths, and list only the lengths, so as to be system-independent.
      (In practice, the new interpreter never has more than 2 blocks allocated
      simultaneously.)

18. Make pcre2test detect an error return from pcre2_get_error_message(), give
a message, and abandon the run (this would have detected #13 above).


Version 10.23 14-February-2017
------------------------------

1. Extended pcre2test with the utf8_input modifier so that it is able to
generate all possible 16-bit and 32-bit code unit values in non-UTF modes.

2. In any wide-character mode (8-bit UTF or any 16-bit or 32-bit mode), without
PCRE2_UCP set, a negative character type such as \D in a positive class should
cause all characters greater than 255 to match, whatever else is in the class.
There was a bug that caused this not to happen if a Unicode property item was
added to such a class, for example [\D\P{Nd}] or [\W\pL].

3. There has been a major re-factoring of the pcre2_compile.c file. Most syntax
checking is now done in the pre-pass that identifies capturing groups. This has
reduced the amount of duplication and made the code tidier. While doing this,
some minor bugs and Perl incompatibilities were fixed, including:

  (a) \Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead
      of giving an invalid quantifier error.

  (b) {0} can now be used after a group in a lookbehind assertion; previously
      this caused an "assertion is not fixed length" error.

  (c) Perl always treats (?(DEFINE) as a "define" group, even if a group with
      the name "DEFINE" exists. PCRE2 now does likewise.

  (d) A recursion condition test such as (?(R2)...) must now refer to an
      existing subpattern.

  (e) A conditional recursion test such as (?(R)...) misbehaved if there was a
      group whose name began with "R".

  (f) When testing zero-terminated patterns under valgrind, the terminating
      zero is now marked "no access". This catches bugs that would otherwise
      show up only with non-zero-terminated patterns.

  (g) A hyphen appearing immediately after a POSIX character class (for example
      /[[:ascii:]-z]/) now generates an error. Perl does accept this as a
      literal, but gives a warning, so it seems best to fail it in PCRE.

  (h) An empty \Q\E sequence may appear after a callout that precedes an
      assertion condition (it is, of course, ignored).

One effect of the refactoring is that some error numbers and messages have
changed, and the pattern offset given for compiling errors is not always the
right-most character that has been read. In particular, for a variable-length
lookbehind assertion it now points to the start of the assertion. Another
change is that when a callout appears before a group, the "length of next
pattern item" that is passed now just gives the length of the opening
parenthesis item, not the length of the whole group. A length of zero is now
given only for a callout at the end of the pattern. Automatic callouts are no
longer inserted before and after explicit callouts in the pattern.

A number of bugs in the refactored code were subsequently fixed during testing
before release, but after the code was made available in the repository. Many
of the bugs were discovered by fuzz testing. Several of them were related to
the change from assuming a zero-terminated pattern (which previously had
required non-zero terminated strings to be copied). These bugs were never in
fully released code, but are noted here for the record.

  (a) An overall recursion such as (?0) inside a lookbehind assertion was not
      being diagnosed as an error.

  (b) In utf mode, the length of a *MARK (or other verb) name was being checked
      in characters instead of code units, which could lead to bad code being
      compiled, leading to unpredictable behaviour.

  (c) In extended /x mode, characters whose code was greater than 255 caused
      a lookup outside one of the global tables. A similar bug existed for wide
      characters in *VERB names.

  (d) The amount of memory needed for a compiled pattern was miscalculated if a
      lookbehind contained more than one toplevel branch and the first branch
      was of length zero.

  (e) In UTF-8 or UTF-16 modes with PCRE2_EXTENDED (/x) set and a non-zero-
      terminated pattern, if a # comment ran on to the end of the pattern, one
      or more code units past the end were being read.

  (f) An unterminated repeat at the end of a non-zero-terminated pattern (e.g.
      "{2,2") could cause reading beyond the pattern.

  (g) When reading a callout string, if the end delimiter was at the end of the
      pattern one further code unit was read.

  (h) An unterminated number after \g' could cause reading beyond the pattern.

  (i) An insufficient memory size was being computed for compiling with
      PCRE2_AUTO_CALLOUT.

  (j) A conditional group with an assertion condition used more memory than was
      allowed for it during parsing, so too many of them could therefore
      overrun a buffer.

  (k) If parsing a pattern exactly filled the buffer, the internal test for
      overrun did not check when the final META_END item was added.

  (l) If a lookbehind contained a subroutine call, and the called group
      contained an option setting such as (?s), and the PCRE2_ANCHORED option
      was set, unpredictable behaviour could occur. The underlying bug was
      incorrect code and insufficient checking while searching for the end of
      the called subroutine in the parsed pattern.

  (m) Quantifiers following (*VERB)s were not being diagnosed as errors.

  (n) The use of \Q...\E in a (*VERB) name when PCRE2_ALT_VERBNAMES and
      PCRE2_AUTO_CALLOUT were both specified caused undetermined behaviour.

  (o) If \Q was preceded by a quantified item, and the following \E was
      followed by '?' or '+', and there was at least one literal character
      between them, an internal error "unexpected repeat" occurred (example:
      /.+\QX\E+/).

  (p) A buffer overflow could occur while sorting the names in the group name
      list (depending on the order in which the names were seen).

  (q) A conditional group that started with a callout was not doing the right
      check for a following assertion, leading to compiling bad code. Example:
      /(?(C'XX))?!XX/

  (r) If a character whose code point was greater than 0xffff appeared within
      a lookbehind that was within another lookbehind, the calculation of the
      lookbehind length went wrong and could provoke an internal error.

  (t) The sequence \E- or \Q\E- after a POSIX class in a character class caused
      an internal error. Now the hyphen is treated as a literal.

4. Back references are now permitted in lookbehind assertions when there are
no duplicated group numbers (that is, (?| has not been used), and, if the
reference is by name, there is only one group of that name. The referenced
group must, of course, be of fixed length.

5. pcre2test has been upgraded so that, when run under valgrind with valgrind
support enabled, reading past the end of the pattern is detected, both when
compiling and during callout processing.

6. \g{+<number>} (e.g. \g{+2} ) is now supported. It is a "forward back
reference" and can be useful in repetitions (compare \g{-<number>} ). Perl does
not recognize this syntax.

7. Automatic callouts are no longer generated before and after callouts in the
pattern.

8. When pcre2test was outputting information from a callout, the caret
indicator for the current position in the subject line was incorrect if it was
after an escape sequence for a character whose code point was greater than
\x{ff}.

9. Change 19 for 10.22 had a typo (PCRE_STATIC_RUNTIME should be
PCRE2_STATIC_RUNTIME). Fix from David Gaussmann.

10. Added --max-buffer-size to pcre2grep, to allow for automatic buffer
expansion when long lines are encountered. Original patch by Dmitry
Cherniachenko.

11. If pcre2grep was compiled with JIT support, but the library was compiled
without it (something that neither ./configure nor CMake allow, but it can be
done by editing config.h), pcre2grep was giving a JIT error. Now it detects
this situation and does not try to use JIT.

12. Added some "const" qualifiers to variables in pcre2grep.

13. Added Dmitry Cherniachenko's patch for colouring output in Windows
(untested by me). Also, look for GREP_COLOUR or GREP_COLOR if the environment
variables PCRE2GREP_COLOUR and PCRE2GREP_COLOR are not found.

14. Add the -t (grand total) option to pcre2grep.

15. A number of bugs have been mended relating to match start-up optimizations
when the first thing in a pattern is a positive lookahead. These all applied
only when PCRE2_NO_START_OPTIMIZE was *not* set:

  (a) A pattern such as (?=.*X)X$ was incorrectly optimized as if it needed
      both an initial 'X' and a following 'X'.
  (b) Some patterns starting with an assertion that started with .* were
      incorrectly optimized as having to match at the start of the subject or
      after a newline. There are cases where this is not true, for example,
      (?=.*[A-Z])(?=.{8,16})(?!.*[\s]) matches after the start in lines that
      start with spaces. Starting .* in an assertion is no longer taken as an
      indication of matching at the start (or after a newline).

16. The "offset" modifier in pcre2test was not being ignored (as documented)
when the POSIX API was in use.

17. Added --enable-fuzz-support to "configure", causing a non-installed
library containing a test function that can be called by fuzzers to be
compiled. A non-installed binary to run the test function locally, called
pcre2fuzzcheck, is also compiled.

18. A pattern with PCRE2_DOTALL (/s) set but not PCRE2_NO_DOTSTAR_ANCHOR, and
which started with .* inside a positive lookahead was incorrectly being
compiled as implicitly anchored.

19. Removed all instances of "register" declarations, as they are considered
obsolete these days and in any case had become very haphazard.

20. Add strerror() to pcre2test for failed file opening.

21. Make pcre2test -C list valgrind support when it is enabled.

22. Add the use_length modifier to pcre2test.

23. Fix an off-by-one bug in pcre2test for the list of names for 'get' and
'copy' modifiers.

24. Add PCRE2_CALL_CONVENTION into the prototype declarations in pcre2.h as it
is apparently needed there as well as in the function definitions. (Why did
nobody ask for this in PCRE1?)

25. Change the _PCRE2_H and _PCRE2_UCP_H guard macros in the header files to
PCRE2_H_IDEMPOTENT_GUARD and PCRE2_UCP_H_IDEMPOTENT_GUARD to be more standard
compliant and unique.

26. pcre2-config --libs-posix was listing -lpcre2posix instead of
-lpcre2-posix. Also, the CMake build process was building the library with the
wrong name.

27. In pcre2test, give some offset information for errors in hex patterns.
This uses the C99 formatting sequence %td, except for MSVC which doesn't
support it - %lu is used instead.

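The portability point in item 27 can be sketched as follows. %td is the C99
conversion for a ptrdiff_t value; the fallback casts the value to unsigned long
so that it agrees with %lu, which assumes the offset is not negative. The
helper name format_offset and the use of _MSC_VER as the switch are
illustrative assumptions, not code taken from pcre2test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write "offset N" into buf, where N is the distance from start to here.
   Returns whatever snprintf() returns. */
int format_offset(char *buf, size_t size, const char *start, const char *here)
{
  ptrdiff_t offset = here - start;
  assert(buf != NULL && size > 0 && offset >= 0);
#if defined(_MSC_VER)
  /* Fallback for a runtime without %td: the cast must agree with %lu. */
  return snprintf(buf, size, "offset %lu", (unsigned long)offset);
#else
  return snprintf(buf, size, "offset %td", offset);
#endif
}
```

Passing a ptrdiff_t straight to %lu or %d without a cast is undefined
behaviour where the types differ in size, which is why the cast and the
conversion are changed together.
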
28. Implemented pcre2_code_copy_with_tables(), and added pushtablescopy to
pcre2test for testing it.

29. Fix small memory leak in pcre2test.

30. Fix out-of-bounds read for partial matching of /./ against an empty string
when the newline type is CRLF.

31. Fix a bug in pcre2test that caused a crash when a locale was set either in
the current pattern or a previous one and a wide character was matched.

32. The appearance of \p, \P, or \X in a substitution string when
PCRE2_SUBSTITUTE_EXTENDED was set caused a segmentation fault (NULL
dereference).

33. If the starting offset was specified as greater than the subject length in
a call to pcre2_substitute() an out-of-bounds memory reference could occur.

34. When PCRE2 was compiled to use the heap instead of the stack for recursive
calls to match(), a repeated minimizing caseless back reference, or a
maximizing one where the two cases had different numbers of code units,
followed by a caseful back reference, could lose the caselessness of the first
repeated back reference (example: /(Z)(a)\2{1,2}?(?-i)\1X/i should match ZaAAZX
but didn't).

35. When a pattern is too complicated, PCRE2 gives up trying to find a minimum
matching length and just records zero. Typically this happens when there are
too many nested or recursive back references. If the limit was reached in
certain recursive cases it failed to be triggered and an internal error could
be the result.

36. The pcre2_dfa_match() function now takes note of the recursion limit for
the internal recursive calls that are used for lookrounds and recursions within
the pattern.

37. More refactoring has got rid of the internal could_be_empty_branch()
function (around 400 lines of code, including comments) by keeping track of
could-be-emptiness as the pattern is compiled instead of scanning compiled
groups. (This would have been much harder before the refactoring of #3 above.)
This lifts a restriction on the number of branches in a group (more than about
1100 would give "pattern is too complicated").

38. Add the "-ac" command line option to pcre2test as a synonym for "-pattern
auto_callout".

39. In a library with Unicode support, incorrect data was compiled for a
pattern with PCRE2_UCP set without PCRE2_UTF if a class required all wide
characters to match (for example, /[\s[:^ascii:]]/).

40. The callout_error modifier has been added to pcre2test to make it possible
to return PCRE2_ERROR_CALLOUT from a callout.

41. A minor change to pcre2grep: colour reset is now "<esc>[0m" instead of
"<esc>[00m".

42. The limit in the auto-possessification code that was intended to catch
overly-complicated patterns and not spend too much time auto-possessifying was
being reset too often, resulting in very long compile times for some patterns.
Now such patterns are no longer completely auto-possessified.

43. Applied Jason Hood's revised patch for RunTest.bat.

44. Added a new Windows script RunGrepTest.bat, courtesy of Jason Hood.

45. Minor cosmetic fix to pcre2test: move a variable that is not used under
Windows into the "not Windows" code.

46. Applied Jason Hood's patches to upgrade pcre2grep under Windows and tidy
some of the code:

  * normalised the Windows condition by ensuring WIN32 is defined;
  * enables the callout feature under Windows;
  * adds globbing (Microsoft's implementation expands quoted args),
    using a tweaked opendirectory;
  * implements the is_*_tty functions for Windows;
  * --color=always will write the ANSI sequences to file;
  * add sequences 4 (underline works on Win10) and 5 (blink as bright
    background, relatively standard on DOS/Win);
  * remove the (char *) casts for the now-const strings;
  * remove GREP_COLOUR (grep's command line allowed the 'u', but not
    the environment), parsing GREP_COLORS instead;
  * uses the current colour if not set, rather than black;
  * add print_match for the undefined case;
  * fixes a typo.

In addition, colour settings containing anything other than digits and
semicolon are ignored, and the colour controls are no longer output for empty
strings.

47. Detecting patterns that are too large inside the length-measuring loop
saves processing ridiculously long patterns to their end.

48. Ignore PCRE2_CASELESS when processing \h, \H, \v, and \V in classes as it
just wastes time. In the UTF case it can also produce redundant entries in
XCLASS lists caused by characters with multiple other cases and pairs of
characters in the same "not-x" sublists.

49. A pattern such as /(?=(a\K))/ can report the end of the match being before
its start; pcre2test was not handling this correctly when using the POSIX
interface (it was OK with the native interface).

50. In pcre2grep, ignore all JIT compile errors. This means that pcre2grep will
continue to work, falling back to interpretation if anything goes wrong with
JIT.

51. Applied patches from Christian Persch to configure.ac to make use of the
AC_USE_SYSTEM_EXTENSIONS macro and to test for functions used by the JIT
modules.

52. Minor fixes to pcre2grep from Jason Hood:
  * fixed some spacing;
  * Windows doesn't usually use single quotes, so I've added a define
    to use appropriate quotes [in an example];
  * LC_ALL was displayed as "LCC_ALL";
  * numbers 11, 12 & 13 should end in "th";
  * use double quotes in usage message.

53. When autopossessifying, skip empty branches without recursion, to reduce
stack usage for the benefit of clang with -fsanitize-address, which uses huge
stack frames. Example pattern: /X?(R||){3335}/. Fixes oss-fuzz issue 553.

54. A pattern with very many explicit back references to a group that is a long
way from the start of the pattern could take a long time to compile because
searching for the referenced group in order to find the minimum length was
being done repeatedly. Now up to 128 group minimum lengths are cached and the
attempt to find a minimum length is abandoned if there is a back reference to a
group whose number is greater than 128. (In that case, the pattern is so
complicated that this optimization probably isn't worth it.) This fixes
oss-fuzz issue 557.

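The caching described in item 54 might look roughly like the sketch below. It
is written under stated assumptions and is not the library's code:
find_group_minlength() stands in for the expensive search (here simulated by an
arbitrary formula and a call counter), and the names MAX_CACHED, NOT_CACHED,
GIVE_UP, reset_cache and backref_minlength are all hypothetical.

```c
#include <assert.h>

#define MAX_CACHED 128     /* groups numbered above this abandon the search */
#define NOT_CACHED (-1)    /* cache slot has not been filled yet */
#define GIVE_UP    (-2)    /* caller should stop computing a minimum length */

static int cache[MAX_CACHED + 1];
int scan_count = 0;        /* how many expensive scans were really done */

/* Mark every slot as empty before studying a new pattern. */
void reset_cache(void)
{
  int i;
  for (i = 0; i <= MAX_CACHED; i++) cache[i] = NOT_CACHED;
  scan_count = 0;
}

/* Stand-in for the costly search for a group and the computation of its
   minimum length; the returned value is arbitrary. */
static int find_group_minlength(int group)
{
  scan_count++;
  return group % 7;
}

/* Minimum length for a back reference to the given group, computed at most
   once per group, or GIVE_UP if the group number is beyond the cache. */
int backref_minlength(int group)
{
  assert(group >= 1);
  if (group > MAX_CACHED) return GIVE_UP;
  if (cache[group] == NOT_CACHED) cache[group] = find_group_minlength(group);
  return cache[group];
}
```

With this shape, a thousand back references to the same group cost one scan
instead of a thousand, and a reference beyond the cache ends the attempt at
once.
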
55. Issue 32 for 10.22 below was not correctly fixed. If pcre2grep in multiline
mode with --only-matching matched several lines, it restarted scanning at the
next line instead of moving on to the end of the matched string, which can be
several lines after the start.

56. Applied Jason Hood's new patch for RunGrepTest.bat that updates it in line
with updates to the non-Windows version.



Version 10.22 29-July-2016
--------------------------

1. Applied Jason Hood's patches to RunTest.bat and testdata/wintestoutput3
to fix problems with running the tests under Windows.

2. Implemented a facility for quoting literal characters within hexadecimal
patterns in pcre2test, to make it easier to create patterns with just a few
non-printing characters.

3. Binary zeros are not supported in pcre2test input files. It now detects them
and gives an error.

4. Updated the valgrind parameters in RunTest: (a) changed smc-check=all to
smc-check=all-non-file; (b) changed obj:* in the suppression file to obj:??? so
that it matches only unknown objects.

5. Updated the maintenance script maint/ManyConfigTests to make it easier to
select individual groups of tests.

6. When the POSIX wrapper function regcomp() is called, the REG_NOSUB option
used to set PCRE2_NO_AUTO_CAPTURE when calling pcre2_compile(). However, this
disables the use of back references (and subroutine calls), which are supported
by other implementations of regcomp() with REG_NOSUB. Therefore, REG_NOSUB no
longer causes PCRE2_NO_AUTO_CAPTURE to be set, though it still ignores nmatch
and pmatch when regexec() is called.

7. Because of 6 above, pcre2test has been modified with a new modifier called
posix_nosub, to call regcomp() with REG_NOSUB. Previously the no_auto_capture
modifier had this effect. That option is now ignored when the POSIX API is in
use.

8. Minor tidies to the pcre2demo.c sample program, including more comments
about its 8-bit-ness.

9. Detect unmatched closing parentheses and give the error in the pre-scan
instead of later. Previously the pre-scan carried on and could give a
misleading incorrect error message. For example, /(?J)(?'a'))(?'a')/ gave a
message about invalid duplicate group names.

10. It has happened that pcre2test was accidentally linked with another POSIX
regex library instead of libpcre2-posix. In this situation, a call to regcomp()
(in the other library) may succeed, returning zero, but of course putting its
own data into the regex_t block. In one example the re_pcre2_code field was
left as NULL, which made pcre2test think it had not got a compiled POSIX regex,
so it treated the next line as another pattern line, resulting in a confusing
error message. A check has been added to pcre2test to see if the data returned
from a successful call of regcomp() are valid for PCRE2's regcomp(). If they
are not, an error message is output and the pcre2test run is abandoned. The
message points out the possibility of a mis-linking. Hopefully this will avoid
some head-scratching the next time this happens.

11. A pattern such as /(?<=((?C)0))/, which has a callout inside a lookbehind
assertion, caused pcre2test to output a very large number of spaces when the
callout was taken, making the program appear to loop.

12. A pattern that included (*ACCEPT) in the middle of a sufficiently deeply
nested set of parentheses of sufficient size caused an overflow of the
compiling workspace (which was diagnosed, but of course is not desirable).

13. Detect missing closing parentheses during the pre-pass for group
identification.

14. Changed some integer variable types and put in a number of casts, following
a report of compiler warnings from Visual Studio 2013 and a few tests with
gcc's -Wconversion (which still throws up a lot).

15. Implemented pcre2_code_copy(), and added pushcopy and #popcopy to pcre2test
for testing it.

16. Change 66 for 10.21 introduced the use of snprintf() in PCRE2's version of
regerror(). When the error buffer is too small, my version of snprintf() puts a
binary zero in the final byte. Bug #1801 seems to show that other versions do
not do this, leading to bad output from pcre2test when it was checking for
buffer overflow. It no longer assumes a binary zero at the end of a too-small
regerror() buffer.

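The defensive handling in item 16 can be sketched like this. C99 requires
snprintf() to terminate its output whenever the buffer size is non-zero, but,
as the bug report suggests, not every implementation behaved that way. The
wrapper name copy_message is hypothetical; it forces termination instead of
assuming it, and treats a negative return as truncation because some older
implementations report it that way.

```c
#include <assert.h>
#include <stdio.h>

/* Copy msg into buf (capacity size), always leaving a terminated string.
   Returns 1 if the message was truncated, 0 if it fitted. */
int copy_message(char *buf, size_t size, const char *msg)
{
  int needed;
  assert(buf != NULL && msg != NULL && size > 0);
  needed = snprintf(buf, size, "%s", msg);
  buf[size - 1] = '\0';      /* do not rely on the library to terminate */
  return needed < 0 || (size_t)needed >= size;
}
```

A caller that checks for overflow then looks only at the return value and never
at the contents of the final byte.
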
17. Fixed typo ("&&" for "&") in pcre2_study(). Fortunately, this could not
actually affect anything, by sheer luck.

18. Two minor fixes for MSVC compilation: (a) removal of apparently incorrect
"const" qualifiers in pcre2test and (b) defining snprintf as _snprintf for
older MSVC compilers. This has been done both in src/pcre2_internal.h for most
of the library, and also in src/pcre2posix.c, which no longer includes
pcre2_internal.h (see 24 below).

19. Applied Chris Wilson's patch (Bugzilla #1681) to CMakeLists.txt for MSVC
static compilation. Subsequently applied Chris Wilson's second patch, putting
the first patch under a new option instead of being unconditional when
PCRE_STATIC is set.

20. Updated pcre2grep to set stdout as binary when run under Windows, so as not
to convert \r\n at the ends of reflected lines into \r\r\n. This required
ensuring that other output that is written to stdout (e.g. file names) uses the
appropriate line terminator: \r\n for Windows, \n otherwise.

21. When a line is too long for pcre2grep's internal buffer, show the maximum
length in the error message.

22. Added support for string callouts to pcre2grep (Zoltan's patch with PH
additions).

23. RunTest.bat was missing a "set type" line for test 22.

24. The pcre2posix.c file was including pcre2_internal.h, and using some
"private" knowledge of the data structures. This is unnecessary; the code has
been re-factored and no longer includes pcre2_internal.h.

25. A race condition in JIT, reported by Mozilla, is fixed.

26. Minor code refactor to avoid "array subscript is below array bounds"
compiler warning.

27. Minor code refactor to avoid "left shift of negative number" warning.

28. Add a bit more sanity checking to pcre2_serialize_decode() and document
that it expects trusted data.

29. Fix typo in pcre2_jit_test.c.

30. Due to an oversight, pcre2grep was not making use of JIT when available.
This is now fixed.

31. The RunGrepTest script is updated to use the valgrind suppressions file
when testing with JIT under valgrind (compare 10.21/51 below). The suppressions
file is updated so that it is now the same as for PCRE1: it suppresses the
Memcheck warnings Addr16 and Cond in unknown objects (that is, JIT-compiled
code). Also changed smc-check=all to smc-check=all-non-file as was done for
RunTest (see 4 above).

32. Implemented the PCRE2_NO_JIT option for pcre2_match().
|
|
|
|
33. Fix typo that gave a compiler error when JIT not supported.
|
|
|
|
34. Fix comment describing the returns from find_fixedlength().
|
|
|
|
35. Fix potential negative index in pcre2test.
|
|
|
|
36. Calls to pcre2_get_error_message() with error numbers that are never
|
|
returned by PCRE2 functions were returning empty strings. Now the error code
|
|
PCRE2_ERROR_BADDATA is returned. A facility has been added to pcre2test to
|
|
show the texts for given error numbers (i.e. to call pcre2_get_error_message()
|
|
and display what it returns) and a few representative error codes are now
|
|
checked in RunTest.
|
|
|
|
37. Added "&& !defined(__INTEL_COMPILER)" to the test for __GNUC__ in
|
|
pcre2_match.c, in anticipation that this is needed for the same reason it was
|
|
recently added to pcrecpp.cc in PCRE1.
|
|
|
|
38. Using -o with -M in pcre2grep could cause unnecessary repeated output when
|
|
the match extended over a line boundary, as it tried to find more matches "on
|
|
the same line" - but it was already over the end.
|
|
|
|
39. Allow \C in lookbehinds and DFA matching in UTF-32 mode (by converting it
|
|
to the same code as '.' when PCRE2_DOTALL is set).
|
|
|
|
40. Fix two clang compiler warnings in pcre2test when only one code unit width
|
|
is supported.
|
|
|
|
41. Upgrade RunTest to automatically re-run test 2 with a large (64M) stack if
|
|
it fails when running the interpreter with a 16M stack (and if changing the
|
|
stack size via pcre2test is possible). This avoids having to manually set a
|
|
large stack size when testing with clang.
|
|
|
|
42. Fix register overwrite in JIT when SSE2 acceleration is enabled.
|
|
|
|
43. Detect integer overflow in pcre2test pattern and data repetition counts.
|
|
|
|
44. In pcre2test, ignore "allcaptures" after DFA matching.
|
|
|
|
45. Fix unaligned accesses on x86. Patch by Marc Mutz.
|
|
|
|
46. Fix some more clang compiler warnings.
|
|
|
|
|
|
Version 10.21 12-January-2016
-----------------------------
|
|
|
|
1. Improve matching speed of patterns starting with + or * in JIT.
|
|
|
|
2. Use memchr() to find the first character in an unanchored match in 8-bit
|
|
mode in the interpreter. This gives a significant speed improvement.
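
The idea in change 2 can be sketched as follows. This is an illustration only, not the PCRE2 source: the function name next_candidate and its parameters are hypothetical. It shows how a single memchr() call can skip to the next position where an unanchored match could possibly start, instead of testing every position in turn.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not PCRE2 code): return the offset of the first
   occurrence of first_cu at or after start, or len if there is none. */
static size_t next_candidate(const unsigned char *subject, size_t len,
                             size_t start, unsigned char first_cu)
{
  const unsigned char *p;
  assert(subject != NULL && start <= len);
  /* memchr() is usually heavily optimized by the C library. */
  p = memchr(subject + start, first_cu, len - start);
  return (p == NULL) ? len : (size_t)(p - subject);
}
```

A caller would try a full match only at the returned offset, and give up at once when the result equals len.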
|
|
|
|
3. Removed a redundant copy of the opcode_possessify table in the
|
|
pcre2_auto_possessify.c source.
|
|
|
|
4. Fix typos in dftables.c for z/OS.
|
|
|
|
5. Change 36 for 10.20 broke the handling of [[:>:]] and [[:<:]] in that
|
|
processing them could involve a buffer overflow if the following character was
|
|
an opening parenthesis.
|
|
|
|
6. Change 36 for 10.20 also introduced a bug in processing this pattern:
|
|
/((?x)(*:0))#(?'/. Specifically: if a setting of (?x) was followed by a (*MARK)
|
|
setting (which (*:0) is), then (?x) did not get unset at the end of its group
|
|
during the scan for named groups, and hence the external # was incorrectly
|
|
treated as a comment and the invalid (?' at the end of the pattern was not
|
|
diagnosed. This caused a buffer overflow during the real compile. This bug was
|
|
discovered by Karl Skomski with the LLVM fuzzer.
|
|
|
|
7. Moved the pcre2_find_bracket() function from src/pcre2_compile.c into its
|
|
own source module to avoid a circular dependency between src/pcre2_compile.c
|
|
and src/pcre2_study.c.
|
|
|
|
8. A callout with a string argument containing an opening square bracket, for
|
|
example /(?C$[$)(?<]/, was incorrectly processed and could provoke a buffer
|
|
overflow. This bug was discovered by Karl Skomski with the LLVM fuzzer.
|
|
|
|
9. The handling of callouts during the pre-pass for named group identification
|
|
has been tightened up.
|
|
|
|
10. The quantifier {1} can be ignored, whether greedy, non-greedy, or
|
|
possessive. This is a very minor optimization.
|
|
|
|
11. A possessively repeated conditional group that could match an empty string,
|
|
for example, /(?(R))*+/, was incorrectly compiled.
|
|
|
|
12. The Unicode tables have been updated to Unicode 8.0.0 (thanks to Christian
|
|
Persch).
|
|
|
|
13. An empty comment (?#) in a pattern was incorrectly processed and could
|
|
provoke a buffer overflow. This bug was discovered by Karl Skomski with the
|
|
LLVM fuzzer.
|
|
|
|
14. Fix infinite recursion in the JIT compiler when certain patterns such as
|
|
/(?:|a|){100}x/ are analysed.
|
|
|
|
15. Some patterns with character classes involving [: and \\ were incorrectly
|
|
compiled and could cause reading from uninitialized memory or an incorrect
|
|
error diagnosis. Examples are: /[[:\\](?<[::]/ and /[[:\\](?'abc')[a:]. The
|
|
first of these bugs was discovered by Karl Skomski with the LLVM fuzzer.
|
|
|
|
16. Pathological patterns containing many nested occurrences of [: caused
|
|
pcre2_compile() to run for a very long time. This bug was found by the LLVM
|
|
fuzzer.
|
|
|
|
17. A missing closing parenthesis for a callout with a string argument was not
|
|
being diagnosed, possibly leading to a buffer overflow. This bug was found by
|
|
the LLVM fuzzer.
|
|
|
|
18. A conditional group with only one branch has an implicit empty alternative
|
|
branch and must therefore be treated as potentially matching an empty string.
|
|
|
|
19. If (?R was followed by - or +, incorrect behaviour happened instead of a
|
|
diagnostic. This bug was discovered by Karl Skomski with the LLVM fuzzer.
|
|
|
|
20. Another bug that was introduced by change 36 for 10.20: conditional groups
|
|
whose condition was an assertion preceded by an explicit callout with a string
|
|
argument might be incorrectly processed, especially if the string contained \Q.
|
|
This bug was discovered by Karl Skomski with the LLVM fuzzer.
|
|
|
|
21. Compiling PCRE2 with the sanitize options of clang showed up a number of
|
|
very pedantic coding infelicities and a buffer overflow while checking a UTF-8
|
|
string if the final multi-byte UTF-8 character was truncated.
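
The truncation case in change 21 can be illustrated with a simplified checker. This is a sketch under stated assumptions, not the PCRE2 validity check: utf8_structure_ok is a hypothetical name, and it tests only lead and continuation byte structure (it ignores overlong forms and surrogates). The point is the bounds test that runs before any continuation byte is read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PCRE2 code): return 1 if s[0..len-1] consists of
   structurally complete UTF-8 sequences, 0 otherwise. */
static int utf8_structure_ok(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL || len == 0);
  while (i < len)
    {
    unsigned char c = s[i++];
    size_t extra;
    if (c < 0x80) continue;                   /* single-byte character */
    else if ((c & 0xE0) == 0xC0) extra = 1;
    else if ((c & 0xF0) == 0xE0) extra = 2;
    else if ((c & 0xF8) == 0xF0) extra = 3;
    else return 0;                            /* stray continuation or bad lead */
    if (extra > len - i) return 0;            /* truncated: test before reading */
    while (extra-- > 0)
      if ((s[i++] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
    }
  return 1;
}
```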
|
|
|
|
22. For Perl compatibility in EBCDIC environments, ranges such as a-z in a
|
|
class, where both values are literal letters in the same case, omit the
|
|
non-letter EBCDIC code points within the range.
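
The effect described in change 22 can be sketched like this. The code is illustrative only and the function names are hypothetical. It assumes the lower-case letter positions used by common EBCDIC code pages such as 037 and 1047, where a-i, j-r and s-z occupy three separate blocks with non-letters between them.

```c
#include <assert.h>

/* Assumed EBCDIC layout: a-i = 0x81-0x89, j-r = 0x91-0x99, s-z = 0xA2-0xA9. */
static int ebcdic_is_lower(unsigned int c)
{
  return (c >= 0x81 && c <= 0x89) ||
         (c >= 0x91 && c <= 0x99) ||
         (c >= 0xA2 && c <= 0xA9);
}

/* Hypothetical helper: is c matched by a class range lo-hi whose end points
   are both lower-case letters? Code points inside the numeric range that are
   not letters are left out. */
static int in_letter_range(unsigned int lo, unsigned int hi, unsigned int c)
{
  assert(ebcdic_is_lower(lo) && ebcdic_is_lower(hi) && lo <= hi);
  return c >= lo && c <= hi && ebcdic_is_lower(c);
}
```

With this rule the range a-z (0x81 to 0xA9) includes 0x91 (j) but not 0x8A or 0xA1, which lie between the letter blocks.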
|
|
|
|
23. Finding the minimum matching length of complex patterns with back
|
|
references and/or recursions can take a long time. There is now a cut-off that
|
|
gives up trying to find a minimum length when things get too complex.
|
|
|
|
24. An optimization has been added that speeds up finding the minimum matching
|
|
length for patterns containing repeated capturing groups or recursions.
|
|
|
|
25. If a pattern contained a back reference to a group whose number was
|
|
duplicated as a result of appearing in a (?|...) group, the computation of the
|
|
minimum matching length gave a wrong result, which could cause incorrect "no
|
|
match" errors. For such patterns, a minimum matching length cannot at present
|
|
be computed.
|
|
|
|
26. Added a check for integer overflow in conditions (?(<digits>) and
|
|
(?(R<digits>). This omission was discovered by Karl Skomski with the LLVM
|
|
fuzzer.
|
|
|
|
27. Fixed an issue when \p{Any} inside an xclass did not read the current
|
|
character.
|
|
|
|
28. If pcre2grep was given the -q option with -c or -l, or when handling a
|
|
binary file, it incorrectly wrote output to stdout.
|
|
|
|
29. The JIT compiler did not restore the control verb head in case of *THEN
|
|
control verbs. This issue was found by Karl Skomski with a custom LLVM fuzzer.
|
|
|
|
30. The way recursive references such as (?3) are compiled has been re-written
|
|
because the old way was the cause of many issues. Now, conversion of the group
|
|
number into a pattern offset does not happen until the pattern has been
|
|
completely compiled. This does mean that detection of all infinitely looping
|
|
recursions is postponed till match time. In the past, some easy ones were
|
|
detected at compile time. This re-writing was done in response to yet another
|
|
bug found by the LLVM fuzzer.
|
|
|
|
31. A test for a back reference to a non-existent group was missing for items
|
|
such as \987. This caused incorrect code to be compiled. This issue was found
|
|
by Karl Skomski with a custom LLVM fuzzer.
|
|
|
|
32. Error messages for syntax errors following \g and \k were giving inaccurate
|
|
offsets in the pattern.
|
|
|
|
33. Improve the performance of starting single character repetitions in JIT.
|
|
|
|
34. (*LIMIT_MATCH=) now gives an error instead of setting the value to 0.
|
|
|
|
35. Error messages for syntax errors in *LIMIT_MATCH and *LIMIT_RECURSION now
|
|
give the right offset instead of zero.
|
|
|
|
36. The JIT compiler should not check repeats after a {0,1} repeat byte code.
|
|
This issue was found by Karl Skomski with a custom LLVM fuzzer.
|
|
|
|
37. The JIT compiler should restore the control chain for empty possessive
|
|
repeats. This issue was found by Karl Skomski with a custom LLVM fuzzer.
|
|
|
|
38. A bug which was introduced by the single character repetition optimization
|
|
was fixed.
|
|
|
|
39. Match limit check added to recursion. This issue was found by Karl Skomski
|
|
with a custom LLVM fuzzer.
|
|
|
|
40. Arrange for the UTF check in pcre2_match() and pcre2_dfa_match() to look
|
|
only at the part of the subject that is relevant when the starting offset is
|
|
non-zero.
|
|
|
|
41. Improve first character match in JIT with SSE2 on x86.
|
|
|
|
42. Fix two assertion fails in JIT. These issues were found by Karl Skomski
|
|
with a custom LLVM fuzzer.
|
|
|
|
43. Correct the setting of CMAKE_C_FLAGS in CMakeLists.txt (patch from Roy Ivy
|
|
III).
|
|
|
|
44. Fix bug in RunTest.bat for new test 14, and adjust the script for the added
|
|
test (there are now 20 in total).
|
|
|
|
45. Fixed a corner case of range optimization in JIT.
|
|
|
|
46. Add the ${*MARK} facility to pcre2_substitute().
|
|
|
|
47. Modifier lists in pcre2test were splitting at spaces without the required
|
|
commas.
|
|
|
|
48. Implemented PCRE2_ALT_VERBNAMES.
|
|
|
|
49. Fixed two issues in JIT. These were found by Karl Skomski with a custom
|
|
LLVM fuzzer.
|
|
|
|
50. The pcre2test program has been extended by adding the #newline_default
|
|
command. This has made it possible to run the standard tests when PCRE2 is
|
|
compiled with either CR or CRLF as the default newline convention. As part of
|
|
this work, the new command was added to several test files and the testing
|
|
scripts were modified. The pcre2grep tests can now also be run when there is no
|
|
LF in the default newline convention.
|
|
|
|
51. The RunTest script has been modified so that, when JIT is used and valgrind
|
|
is specified, a valgrind suppressions file is set up to ignore "Invalid read of
|
|
size 16" errors because these are false positives when the hardware supports
|
|
the SSE2 instruction set.
|
|
|
|
52. It is now possible to have comment lines amid the subject strings in
|
|
pcre2test (and perltest.sh) input.
|
|
|
|
53. Implemented PCRE2_USE_OFFSET_LIMIT and pcre2_set_offset_limit().
|
|
|
|
54. Add the null_context modifier to pcre2test so that calling pcre2_compile()
|
|
and the matching functions with NULL contexts can be tested.
|
|
|
|
55. Implemented PCRE2_SUBSTITUTE_EXTENDED.
|
|
|
|
56. In a character class such as [\W\p{Any}] where both a negative-type escape
|
|
("not a word character") and a property escape were present, the property
|
|
escape was being ignored.
|
|
|
|
57. Fixed integer overflow for patterns whose minimum matching length is very,
|
|
very large.
|
|
|
|
58. Implemented --never-backslash-C.
|
|
|
|
59. Change 55 above introduced a bug by which certain patterns provoked the
|
|
erroneous error "\ at end of pattern".
|
|
|
|
60. The special sequences [[:<:]] and [[:>:]] gave rise to incorrect compiling
|
|
errors or other strange effects if compiled in UCP mode. Found with libFuzzer
|
|
and AddressSanitizer.
|
|
|
|
61. Whitespace at the end of a pcre2test pattern line caused a spurious error
|
|
message if there were only single-character modifiers. It should be ignored.
|
|
|
|
62. The use of PCRE2_NO_AUTO_CAPTURE could cause incorrect compilation results
|
|
or segmentation errors for some patterns. Found with libFuzzer and
|
|
AddressSanitizer.
|
|
|
|
63. Very long names in (*MARK) or (*THEN) etc. items could provoke a buffer
|
|
overflow.
|
|
|
|
64. Improve error message for overly-complicated patterns.
|
|
|
|
65. Implemented an optional replication feature for patterns in pcre2test, to
|
|
make it easier to test long repetitive patterns. The tests for 63 above are
|
|
converted to use the new feature.
|
|
|
|
66. In the POSIX wrapper, if regerror() was given too small a buffer, it could
|
|
misbehave.
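
For change 66, the behaviour expected of a regerror()-style function can be sketched as below. This is an illustration and not the wrapper's source: copy_message is a hypothetical name. It truncates the text to fit the caller's buffer, always terminates it when the buffer has room for at least one byte, and returns the size that the full message needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not PCRE2 code): copy message into buf, truncating if
   necessary. Returns the size needed for the whole message, including the
   terminating zero. */
static size_t copy_message(const char *message, char *buf, size_t buf_size)
{
  size_t needed;
  assert(message != NULL);
  needed = strlen(message) + 1;
  if (buf != NULL && buf_size > 0)
    {
    size_t n = (needed <= buf_size) ? needed - 1 : buf_size - 1;
    memcpy(buf, message, n);
    buf[n] = '\0';
    }
  return needed;
}
```

A caller can compare the return value with its buffer size to find out whether the text was truncated.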
|
|
|
|
67. In pcre2_substitute() in UTF mode, the UTF validity check on the
|
|
replacement string was happening before the length setting when the replacement
|
|
string was zero-terminated.
|
|
|
|
68. In pcre2_substitute() in UTF mode, PCRE2_NO_UTF_CHECK can be set for the
|
|
second and subsequent calls to pcre2_match().
|
|
|
|
69. There was no check for integer overflow for a replacement group number in
|
|
pcre2_substitute(). An added check for a number greater than the largest group
|
|
number in the pattern means this is not now needed.
|
|
|
|
70. The PCRE2-specific VERSION condition didn't work correctly if only one
|
|
digit was given after the decimal point, or if more than two digits were given.
|
|
It now works with one or two digits, and gives a compile time error if more are
|
|
given.
|
|
|
|
71. In pcre2_substitute() there was the possibility of reading one code unit
|
|
beyond the end of the replacement string.
|
|
|
|
72. The code for checking a subject's UTF-32 validity for a pattern with a
|
|
lookbehind involved an out-of-bounds pointer, which could potentially cause
|
|
trouble in some environments.
|
|
|
|
73. The maximum lookbehind length was incorrectly calculated for patterns such
|
|
as /(?<=(a)(?-1))x/ which have a recursion within a backreference.
|
|
|
|
74. Give an error if a lookbehind assertion is longer than 65535 code units.
|
|
|
|
75. Give an error in pcre2_substitute() if a match ends before it starts (as a
|
|
result of the use of \K).
|
|
|
|
76. Check the length of subpattern names and the names in (*MARK:xx) etc.
|
|
dynamically to avoid the possibility of integer overflow.
|
|
|
|
77. Implement pcre2_set_max_pattern_length() so that programs can restrict the
|
|
size of patterns that they are prepared to handle.
|
|
|
|
78. (*NO_AUTO_POSSESS) was not working.
|
|
|
|
79. Adding group information caching improves the speed of compiling when
|
|
checking whether a group has a fixed length and/or could match an empty string,
|
|
especially when recursion or subroutine calls are involved. However, this
|
|
cannot be used when (?| is present in the pattern because the same number may
|
|
be used for groups of different sizes. To catch runaway patterns in this
|
|
situation, counts have been introduced to the functions that scan for empty
|
|
branches or compute fixed lengths.
|
|
|
|
80. Allow for the possibility of the size of the nest_save structure not being
|
|
a factor of the size of the compiling workspace (it currently is).
|
|
|
|
81. Check for integer overflow in minimum length calculation and cap it at
|
|
65535.
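
The capping in change 81 can be sketched with saturating arithmetic. This is illustrative only: the helper names are hypothetical and the real calculation in PCRE2 is more involved. Each operation is tested against the cap before it is carried out, so no intermediate value can overflow.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_LENGTH_CAP 65535u   /* the cap named in the entry above */

/* Hypothetical helper: add two partial minimum lengths, saturating at the cap. */
static uint32_t add_min_length(uint32_t a, uint32_t b)
{
  assert(a <= MIN_LENGTH_CAP && b <= MIN_LENGTH_CAP);
  return (a + b > MIN_LENGTH_CAP) ? MIN_LENGTH_CAP : a + b;
}

/* Hypothetical helper: multiply a minimum length by a repeat count,
   saturating at the cap. The division avoids overflow in len * count. */
static uint32_t repeat_min_length(uint32_t len, uint32_t count)
{
  assert(len <= MIN_LENGTH_CAP);
  if (len != 0 && count > MIN_LENGTH_CAP / len) return MIN_LENGTH_CAP;
  return len * count;
}
```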
|
|
|
|
82. Small optimizations in code for finding the minimum matching length.
|
|
|
|
83. Lock out configuring for EBCDIC with non-8-bit libraries.
|
|
|
|
84. Test for error code <= 0 in regerror().
|
|
|
|
85. Check for too many replacements (more than INT_MAX) in pcre2_substitute().
|
|
|
|
86. Avoid the possibility of computing with an out-of-bounds pointer (though
|
|
not dereferencing it) while handling lookbehind assertions.
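
The kind of rewrite that change 86 refers to can be sketched as follows. This is not the PCRE2 code, and step_back is a hypothetical name. In C, merely forming a pointer that lies before the start of an array is undefined, so the sketch compares distances first and subtracts only when the result is known to be inside the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PCRE2 code): move back `back` code units from
   `current`, but never before `start`. Returns NULL if there is not enough
   room. */
static const unsigned char *step_back(const unsigned char *start,
                                      const unsigned char *current,
                                      size_t back)
{
  assert(start != NULL && current >= start);
  /* Avoid writing (current - back >= start): the left-hand side could be an
     out-of-bounds pointer. */
  if ((size_t)(current - start) < back) return NULL;
  return current - back;
}
```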
|
|
|
|
87. Failure to get memory for the match data in regcomp() is now given as a
|
|
regcomp() error instead of waiting for regexec() to pick it up.
|
|
|
|
88. In pcre2_substitute(), ensure that CRLF is not split when it is a valid
|
|
newline sequence.
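
One way to picture change 88 is the step that moves on after an empty match during a global substitution. The sketch below is an assumption about the shape of such a step, not the pcre2_substitute() source: the function name and the crlf_is_newline flag are hypothetical. When CRLF counts as a newline and the current position is at CR followed by LF, it advances over both.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PCRE2 code): return the next position to try
   after an empty match at pos, without stopping between CR and LF. */
static size_t advance_after_empty_match(const char *subject, size_t len,
                                        size_t pos, int crlf_is_newline)
{
  assert(subject != NULL && pos < len);
  if (crlf_is_newline && subject[pos] == '\r' &&
      pos + 1 < len && subject[pos + 1] == '\n')
    return pos + 2;
  return pos + 1;
}
```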
|
|
|
|
89. Paranoid check in regcomp() for bad error code from pcre2_compile().
|
|
|
|
90. Run test 8 (internal offsets and code sizes) for link sizes 3 and 4 as well
|
|
as for link size 2.
|
|
|
|
91. Document that JIT has a limit on pattern size, and give more information
|
|
about JIT compile failures in pcre2test.
|
|
|
|
92. Implement PCRE2_INFO_HASBACKSLASHC.
|
|
|
|
93. Re-arrange valgrind support code in pcre2test to avoid spurious reports
|
|
with JIT (possibly caused by SSE2?).
|
|
|
|
94. Support offset_limit in JIT.
|
|
|
|
95. A sequence such as [[:punct:]b], that is, a POSIX character class followed
|
|
by a single ASCII character in a class item, was incorrectly compiled in UCP
|
|
mode. The POSIX class got lost, but only if the single character followed it.
|
|
|
|
96. [:punct:] in UCP mode was matching some characters in the range 128-255
|
|
that should not have been matched.
|
|
|
|
97. If [:^ascii:] or [:^xdigit:] are present in a non-negated class, all
|
|
characters with code points greater than 255 are in the class. When a Unicode
|
|
property was also in the class (if PCRE2_UCP is set, escapes such as \w are
|
|
turned into Unicode properties), wide characters were not correctly handled,
|
|
and could fail to match.
|
|
|
|
98. In pcre2test, make the "startoffset" modifier a synonym of "offset",
|
|
because it sets the "startoffset" parameter for pcre2_match().
|
|
|
|
99. If PCRE2_AUTO_CALLOUT was set on a pattern that had a (?# comment between
|
|
an item and its qualifier (for example, A(?#comment)?B) pcre2_compile()
|
|
misbehaved. This bug was found by the LLVM fuzzer.
|
|
|
|
100. The error for an invalid UTF pattern string always gave the code unit
|
|
offset as zero instead of where the invalidity was found.
|
|
|
|
101. Further to 97 above, negated classes such as [^[:^ascii:]\d] were also not
|
|
working correctly in UCP mode.
|
|
|
|
102. Similar to 99 above, if an isolated \E was present between an item and its
|
|
qualifier when PCRE2_AUTO_CALLOUT was set, pcre2_compile() misbehaved. This bug
|
|
was found by the LLVM fuzzer.
|
|
|
|
103. The POSIX wrapper function regexec() crashed if the option REG_STARTEND
|
|
was set when the pmatch argument was NULL. It now returns REG_INVARG.
|
|
|
|
104. Allow for up to 32-bit numbers in the ordin() function in pcre2grep.
|
|
|
|
105. An empty \Q\E sequence between an item and its qualifier caused
|
|
pcre2_compile() to misbehave when auto callouts were enabled. This bug
|
|
was found by the LLVM fuzzer.
|
|
|
|
106. If both PCRE2_ALT_VERBNAMES and PCRE2_EXTENDED were set, and a (*MARK) or
|
|
other verb "name" ended with whitespace immediately before the closing
|
|
parenthesis, pcre2_compile() misbehaved. Example: /(*:abc )/, but only when
|
|
both those options were set.
|
|
|
|
107. In a number of places pcre2_compile() was not handling NULL characters
|
|
correctly, and pcre2test with the "bincode" modifier was not always correctly
|
|
displaying fields containing NULLS:
|
|
|
|
(a) Within /x extended #-comments
(b) Within the "name" part of (*MARK) and other *verbs
(c) Within the text argument of a callout
|
|
|
|
108. If a pattern that was compiled with PCRE2_EXTENDED started with white
|
|
space or a #-type comment that was followed by (?-x), which turns off
|
|
PCRE2_EXTENDED, and there was no subsequent (?x) to turn it on again,
|
|
pcre2_compile() assumed that (?-x) applied to the whole pattern and
|
|
consequently mis-compiled it. This bug was found by the LLVM fuzzer. The fix
|
|
for this bug means that a setting of any of the (?imsxJU) options at the start
|
|
of a pattern is no longer transferred to the options that are returned by
|
|
PCRE2_INFO_ALLOPTIONS. In fact, this was an anachronism that should have
|
|
changed when the effects of those options were all moved to compile time.
|
|
|
|
109. An escaped closing parenthesis in the "name" part of a (*verb) when
|
|
PCRE2_ALT_VERBNAMES was set caused pcre2_compile() to malfunction. This bug
|
|
was found by the LLVM fuzzer.
|
|
|
|
110. Implemented PCRE2_SUBSTITUTE_UNSET_EMPTY, and updated pcre2test to make it
|
|
possible to test it.
|
|
|
|
111. "Harden" pcre2test against ridiculously large values in modifiers and
|
|
command line arguments.
|
|
|
|
112. Implemented PCRE2_SUBSTITUTE_UNKNOWN_UNSET and
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
|
|
|
113. Fix printing of *MARK names that contain binary zeroes in pcre2test.
|
|
|
|
|
|
Version 10.20 30-June-2015
--------------------------
|
|
|
|
1. Callouts with string arguments have been added.
|
|
|
|
2. Assertion code generator in JIT has been optimized.
|
|
|
|
3. The invalid pattern (?(?C) has a missing assertion condition at the end. The
|
|
pcre2_compile() function read past the end of the input before diagnosing an
|
|
error. This bug was discovered by the LLVM fuzzer.
|
|
|
|
4. Implemented pcre2_callout_enumerate().
|
|
|
|
5. Fix JIT compilation of conditional blocks whose assertion is converted to
|
|
(*FAIL). E.g: /(?(?!))/.
|
|
|
|
6. The pattern /(?(?!)^)/ caused references to random memory. This bug was
|
|
discovered by the LLVM fuzzer.
|
|
|
|
7. The assertion (?!) is optimized to (*FAIL). This was not handled correctly
|
|
when this assertion was used as a condition, for example (?(?!)a|b). In
|
|
pcre2_match() it worked by luck; in pcre2_dfa_match() it gave an incorrect
|
|
error about an unsupported item.
|
|
|
|
8. For some types of pattern, for example /Z*(|d*){216}/, the
auto-possessification code could take exponential time to complete. A recursion
|
|
depth limit of 1000 has been imposed to limit the resources used by this
|
|
optimization. This infelicity was discovered by the LLVM fuzzer.
|
|
|
|
9. In a pattern such as /(*UTF)[\S\V\H]/, which contains a negated special class
|
|
such as \S in non-UCP mode, explicit wide characters (> 255) can be ignored
|
|
because \S ensures they are all in the class. The code for doing this was
|
|
interacting badly with the code for computing the amount of space needed to
|
|
compile the pattern, leading to a buffer overflow. This bug was discovered by
|
|
the LLVM fuzzer.
|
|
|
|
10. A pattern such as /((?2)+)((?1))/ which has mutual recursion nested inside
|
|
other kinds of group caused stack overflow at compile time. This bug was
|
|
discovered by the LLVM fuzzer.
|
|
|
|
11. A pattern such as /(?1)(?#?'){8}(a)/ which had a parenthesized comment
|
|
between a subroutine call and its quantifier was incorrectly compiled, leading
|
|
to buffer overflow or other errors. This bug was discovered by the LLVM fuzzer.
|
|
|
|
12. The illegal pattern /(?(?<E>.*!.*)?)/ was not being diagnosed as missing an
|
|
assertion after (?(. The code was failing to check the character after (?(?<
|
|
for the ! or = that would indicate a lookbehind assertion. This bug was
|
|
discovered by the LLVM fuzzer.
|
|
|
|
13. A pattern such as /X((?2)()*+){2}+/ which has a possessive quantifier with
|
|
a fixed maximum following a group that contains a subroutine reference was
|
|
incorrectly compiled and could trigger buffer overflow. This bug was discovered
|
|
by the LLVM fuzzer.
|
|
|
|
14. Negative relative recursive references such as (?-7) to non-existent
|
|
subpatterns were not being diagnosed and could lead to unpredictable behaviour.
|
|
This bug was discovered by the LLVM fuzzer.
|
|
|
|
15. The bug fixed in 14 was due to an integer variable that was unsigned when
|
|
it should have been signed. Some other "int" variables, having been checked,
|
|
have either been changed to uint32_t or commented as "must be signed".
|
|
|
|
16. A mutual recursion within a lookbehind assertion such as (?<=((?2))((?1)))
|
|
caused a stack overflow instead of the diagnosis of a non-fixed length
|
|
lookbehind assertion. This bug was discovered by the LLVM fuzzer.
|
|
|
|
17. The use of \K in a positive lookbehind assertion in a non-anchored pattern
|
|
(e.g. /(?<=\Ka)/) could make pcre2grep loop.
|
|
|
|
18. There was a similar problem to 17 in pcre2test for global matches, though
|
|
the code there did catch the loop.
|
|
|
|
19. If a greedy quantified \X was preceded by \C in UTF mode (e.g. \C\X*),
|
|
and a subsequent item in the pattern caused a non-match, backtracking over the
|
|
repeated \X did not stop, but carried on past the start of the subject, causing
|
|
reference to random memory and/or a segfault. There were also some other cases
|
|
where backtracking after \C could crash. This set of bugs was discovered by the
|
|
LLVM fuzzer.
|
|
|
|
20. The function for finding the minimum length of a matching string could take
|
|
a very long time if mutual recursion was present many times in a pattern, for
|
|
example, /((?2){73}(?2))((?1))/. A better mutual recursion detection method has
|
|
been implemented. This infelicity was discovered by the LLVM fuzzer.
|
|
|
|
21. Implemented PCRE2_NEVER_BACKSLASH_C.
|
|
|
|
22. The feature for string replication in pcre2test could read from freed
|
|
memory if the replication required a buffer to be extended, and it was not
|
|
working properly in 16-bit and 32-bit modes. This issue was discovered by a
|
|
fuzzer: see http://lcamtuf.coredump.cx/afl/.
|
|
|
|
23. Added the PCRE2_ALT_CIRCUMFLEX option.
|
|
|
|
24. Adjust the treatment of \8 and \9 to be the same as the current Perl
|
|
behaviour.
|
|
|
|
25. Static linking against the PCRE2 library using the pkg-config module was
|
|
failing on missing pthread symbols.
|
|
|
|
26. If a group that contained a recursive back reference also contained a
|
|
forward reference subroutine call followed by a non-forward-reference
|
|
subroutine call, for example /.((?2)(?R)\1)()/, pcre2_compile() failed to
|
|
compile correct code, leading to undefined behaviour or an internally detected
|
|
error. This bug was discovered by the LLVM fuzzer.
|
|
|
|
27. Quantification of certain items (e.g. atomic back references) could cause
|
|
incorrect code to be compiled when recursive forward references were involved.
|
|
For example, in this pattern: /(?1)()((((((\1++))\x85)+)|))/. This bug was
|
|
discovered by the LLVM fuzzer.
|
|
|
|
28. A repeated conditional group whose condition was a reference by name caused
|
|
a buffer overflow if there was more than one group with the given name. This
|
|
bug was discovered by the LLVM fuzzer.
|
|
|
|
29. A recursive back reference by name within a group that had the same name as
|
|
another group caused a buffer overflow. For example: /(?J)(?'d'(?'d'\g{d}))/.
|
|
This bug was discovered by the LLVM fuzzer.
|
|
|
|
30. A forward reference by name to a group whose number is the same as the
|
|
current group, for example in this pattern: /(?|(\k'Pm')|(?'Pm'))/, caused a
|
|
buffer overflow at compile time. This bug was discovered by the LLVM fuzzer.
|
|
|
|
31. Fix -fsanitize=undefined warnings for left shifts of 1 by 31 (it treats 1
|
|
as an int; fixed by writing it as 1u).
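
The fix in change 31 can be shown in a few lines. This sketch is illustrative, with a hypothetical function name, and assumes that unsigned int is at least 32 bits wide. The constant 1 has type int, so shifting it left by 31 overflows a 32-bit signed int; writing 1u keeps the shift in unsigned arithmetic, where the result is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build a 32-bit mask with only bit n set. */
static uint32_t bit_mask(unsigned int n)
{
  assert(n < 32);
  return (uint32_t)(1u << n);   /* 1u, not 1: no signed overflow when n == 31 */
}
```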
|
|
|
|
32. Fix pcre2grep compile when -std=c99 is used with gcc, though it still gives
|
|
a warning for "fileno" unless -std=gnu99 is used.
|
|
|
|
33. A lookbehind assertion within a set of mutually recursive subpatterns could
|
|
provoke a buffer overflow. This bug was discovered by the LLVM fuzzer.
|
|
|
|
34. Give an error for an empty subpattern name such as (?'').
|
|
|
|
35. Make pcre2test give an error if a pattern that follows #forbid_utf contains
|
|
\P, \p, or \X.
|
|
|
|
36. The way named subpatterns are handled has been refactored. There is now a
|
|
pre-pass over the regex which does nothing other than identify named
|
|
subpatterns and count the total captures. This means that information about
|
|
named patterns is known before the rest of the compile. In particular, it means
|
|
that forward references can be checked as they are encountered. Previously, the
|
|
code for handling forward references was contorted and led to several errors in
|
|
computing the memory requirements for some patterns, leading to buffer
|
|
overflows.
|
|
|
|
37. There was no check for integer overflow in subroutine calls such as (?123).
|
|
|
|
38. The table entry for \l in EBCDIC environments was incorrect, leading to its
|
|
being treated as a literal 'l' instead of causing an error.
|
|
|
|
39. If a non-capturing group containing a conditional group that could match
|
|
an empty string was repeated, it was not identified as matching an empty string
|
|
itself. For example: /^(?:(?(1)x|)+)+$()/.
|
|
|
|
40. In an EBCDIC environment, pcre2test was mishandling the escape sequences
|
|
\a and \e in test subject lines.
|
|
|
|
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
|
|
instead of the EBCDIC value.
|
|
|
|
42. The handling of \c in an EBCDIC environment has been revised so that it is
|
|
now compatible with the specification in Perl's perlebcdic page.
|
|
|
|
43. Single character repetition in JIT has been improved. 20-30% speedup
|
|
was achieved on certain patterns.
|
|
|
|
44. The EBCDIC character 0x41 is a non-breaking space, equivalent to 0xa0 in
|
|
ASCII/Unicode. This has now been added to the list of characters that are
|
|
recognized as white space in EBCDIC.
|
|
|
|
45. When PCRE2 was compiled without Unicode support, the use of \p and \P gave
|
|
an error (correctly) when used outside a class, but did not give an error
|
|
within a class.
|
|
|
|
46. \h within a class was incorrectly compiled in EBCDIC environments.
|
|
|
|
47. JIT should return with error when the compiled pattern requires
|
|
more stack space than the maximum.
|
|
|
|
48. Fixed a memory leak in pcre2grep when a locale is set.
|
|
|
|
|
|
Version 10.10 06-March-2015
---------------------------
|
|
|
|
1. When a pattern is compiled, it remembers the highest back reference so that
|
|
when matching, if the ovector is too small, extra memory can be obtained to
|
|
use instead. A conditional subpattern whose condition is a check on a capture
|
|
having happened, such as, for example in the pattern /^(?:(a)|b)(?(1)A|B)/, is
|
|
another kind of back reference, but it was not setting the highest
|
|
backreference number. This mattered only if pcre2_match() was called with an
|
|
ovector that was too small to hold the capture, and there was no other kind of
|
|
back reference (a situation which is probably quite rare). The effect of the
|
|
bug was that the condition was always treated as FALSE when the capture could
|
|
not be consulted, leading to incorrect behaviour by pcre2_match(). This bug
|
|
has been fixed.
|
|
|
|
2. Functions for serialization and deserialization of sets of compiled patterns
|
|
have been added.
|
|
|
|
3. The value that is returned by PCRE2_INFO_SIZE has been corrected to remove
|
|
excess code units at the end of the data block that may occasionally occur if
|
|
the code for calculating the size over-estimates. This change stops the
|
|
serialization code copying uninitialized data, to which valgrind objects. The
|
|
documentation of PCRE2_INFO_SIZE was incorrect in stating that the size did not
|
|
include the general overhead. This has been corrected.
|
|
|
|
4. All code units in every slot in the table of group names are now set, again
|
|
in order to avoid accessing uninitialized data when serializing.
|
|
|
|
5. The (*NO_JIT) feature is implemented.
|
|
|
|
6. If a bug that caused pcre2_compile() to use more memory than allocated was
|
|
triggered when using valgrind, the code in (3) above passed a stupidly large
|
|
value to valgrind. This caused a crash instead of an "internal error" return.
|
|
|
|
7. A reference to a duplicated named group (either a back reference or a test
for being set in a conditional) that occurred in a part of the pattern where
PCRE2_DUPNAMES was not set caused the amount of memory needed for the pattern
to be incorrectly calculated, leading to overwriting.

8. A mutually recursive set of back references such as (\2)(\1) caused a
segfault at compile time (while trying to find the minimum matching length).
The infinite loop is now broken (with the minimum length unset, that is, zero).

9. If an assertion that was used as a condition was quantified with a minimum
of zero, matching went wrong. In particular, if the whole group had unlimited
repetition and could match an empty string, a segfault was likely. The pattern
(?(?=0)?)+ is an example that caused this. Perl allows assertions to be
quantified, but not if they are being used as conditions, so the above pattern
is faulted by Perl. PCRE2 has now been changed so that it also rejects such
patterns.

10. The error message for an invalid quantifier has been changed from "nothing
to repeat" to "quantifier does not follow a repeatable item".

11. If a bad UTF string is compiled with NO_UTF_CHECK, it may succeed, but
scanning the compiled pattern in subsequent auto-possessification can get out
of step and lead to an unknown opcode. Previously this could have caused an
infinite loop. Now it generates an "internal error" error. This is a tidyup,
not a bug fix; passing bad UTF with NO_UTF_CHECK is documented as having an
undefined outcome.

12. A UTF pattern containing a "not" match of a non-ASCII character and a
subroutine reference could loop at compile time. Example: /[^\xff]((?1))/.

13. The locale test (RunTest 3) has been upgraded. It now checks that a locale
that is found in the output of "locale -a" can actually be set by pcre2test
before it is accepted. Previously, in an environment where a locale was listed
but would not set (an example does exist), the test would "pass" without
actually doing anything. Also the fr_CA locale has been added to the list of
locales that can be used.

14. Fixed a bug in pcre2_substitute(). If a replacement string ended in a
capturing group number without parentheses, the last character was incorrectly
literally included at the end of the replacement string.

15. A possessive capturing group such as (a)*+ with a minimum repeat of zero
failed to allow the zero-repeat case if pcre2_match() was called with an
ovector too small to capture the group.

16. Improved error message in pcre2test when setting the stack size (-S) fails.

17. Fixed two bugs in CMakeLists.txt: (1) Some lines had got lost in the
transfer from PCRE1, meaning that CMake configuration failed if "build tests"
was selected. (2) The file src/pcre2_serialize.c had not been added to the list
of PCRE2 sources, which caused a failure to build pcre2test.

18. Fixed typo in pcre2_serialize.c (DECL instead of DEFN) that causes problems
only on Windows.

19. Use binary input when reading back saved serialized patterns in pcre2test.

20. Added RunTest.bat for running the tests under Windows.

21. "make distclean" was not removing config.h, a file that may be created for
use with CMake.

22. A pattern such as "((?2){0,1999}())?", which has a group containing a
forward reference repeated a large (but limited) number of times within a
repeated outer group that has a zero minimum quantifier, caused incorrect code
to be compiled, leading to the error "internal error: previously-checked
referenced subpattern not found" when an incorrect memory address was read.
This bug was reported as "heap overflow", discovered by Kai Lu of Fortinet's
FortiGuard Labs. (Added 24-March-2015: CVE-2015-2325 was given to this.)

23. A pattern such as "((?+1)(\1))/" containing a forward reference subroutine
call within a group that also contained a recursive back reference caused
incorrect code to be compiled. This bug was reported as "heap overflow",
discovered by Kai Lu of Fortinet's FortiGuard Labs. (Added 24-March-2015:
CVE-2015-2326 was given to this.)

24. Computing the size of the JIT read-only data in advance has been a source
of various issues, and new ones still appear, unfortunately. To fix
existing and future issues, size computation is eliminated from the code,
and replaced by on-demand memory allocation.

25. A pattern such as /(?i)[A-`]/, where characters in the other case are
adjacent to the end of the range, and the range contained characters with more
than one other case, caused incorrect behaviour when compiled in UTF mode. In
that example, the range a-j was left out of the class.
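
The shape of the class in item 25 can be sketched as follows. This is only an
illustration of the expected behaviour, written with Python's re module rather
than PCRE2, so it does not reproduce the bug; the helper name
matched_lowercase_letters is hypothetical.

```python
import re

# The range A-` covers code points 0x41 to 0x60: A-Z plus [ \ ] ^ _ `
RANGE_START, RANGE_END = ord("A"), ord("`")

# The backtick (0x60) sits immediately before "a" (0x61), so the other-case
# letters a-z are adjacent to the end of the range.
assert RANGE_END + 1 == ord("a")

# K and S each have more than one other case in Unicode.
assert "\u212a".lower() == "k"   # KELVIN SIGN
assert "\u017f".upper() == "S"   # LATIN SMALL LETTER LONG S

# Caseless class compiled with Python's own engine (an assumption made for
# illustration; PCRE2 itself is not involved here).
caseless_class = re.compile(r"[A-`]", re.IGNORECASE)


def matched_lowercase_letters():
    """Return the lower-case ASCII letters that the caseless class matches."""
    return [chr(c) for c in range(ord("a"), ord("z") + 1)
            if caseless_class.fullmatch(chr(c))]


if __name__ == "__main__":
    print("".join(matched_lowercase_letters()))
```

A correct caseless class should accept every letter from a to z, including the
a-j range that item 25 reports as having been left out.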


Version 10.00 05-January-2015
-----------------------------

Version 10.00 is the first release of PCRE2, a revised API for the PCRE
library. Changes prior to 10.00 are logged in the ChangeLog file for the old
API, up to item 20 for release 8.36.

The code of the library was heavily revised as part of the new API
implementation. Details of each and every modification were not individually
logged. In addition to the API changes, the following changes were made. They
are either new functionality, or bug fixes and other noticeable changes of
behaviour that were implemented after the code had been forked.

1. Including Unicode support at build time is now enabled by default, but it
can optionally be disabled. It is not enabled by default at run time (no
change).

2. The test program, now called pcre2test, was re-specified and almost
completely re-written. Its input is not compatible with input for pcretest.

3. Patterns may start with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) to set the
PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART options for every subject line that is
matched by that pattern.

4. For the benefit of those who use PCRE2 via some other application, that is,
not writing the function calls themselves, it is possible to check the PCRE2
version by matching a pattern such as /(?(VERSION>=10)yes|no)/ against a
string such as "yesno".

5. There are case-equivalent Unicode characters whose encodings use different
numbers of code units in UTF-8. U+023A and U+2C65 are one example. (It is
theoretically possible for this to happen in UTF-16 too.) If a backreference to
a group containing one of these characters was greedily repeated, and during
the match a backtrack occurred, the subject might be backtracked by the wrong
number of code units. For example, if /^(\x{23a})\1*(.)/ is matched caselessly
(and in UTF-8 mode) against "\x{23a}\x{2c65}\x{2c65}\x{2c65}", group 2 should
capture the final character, which is the three bytes E2, B1, and A5 in UTF-8.
Incorrect backtracking meant that group 2 captured only the last two bytes.
This bug has been fixed; the new code is slower, but it is used only when the
strings matched by the repetition are not all the same length.
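
The length mismatch behind item 5 can be checked with a short sketch. It uses
only Python's standard string handling (not PCRE2), and the names CAPITAL,
SMALL and utf8_hex are hypothetical, chosen for this illustration.

```python
# The two case-equivalent characters named in item 5.
CAPITAL = "\u023a"   # LATIN CAPITAL LETTER A WITH STROKE
SMALL = "\u2c65"     # LATIN SMALL LETTER A WITH STROKE


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # The lower-case form of U+023A is U+2C65, so the two match caselessly.
    print(CAPITAL.lower() == SMALL)
    # One needs two code units in UTF-8, the other three.
    print(utf8_hex(CAPITAL))
    print(utf8_hex(SMALL))
```

Because the two encodings differ in length, stepping back over one repetition
of the back reference cannot assume a fixed number of code units, which is the
situation the slower code path now handles.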

6. A pattern such as /()a/ was not setting the "first character must be 'a'"
information. This applied to any pattern with a group that matched no
characters, for example: /(?:(?=.)|(?<!x))a/.

7. When an (*ACCEPT) is triggered inside capturing parentheses, it arranges for
those parentheses to be closed with whatever has been captured so far. However,
it was failing to mark any other groups between the highest capture so far and
the current group as "unset". Thus, the ovector for those groups contained
whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
matched against "abcd".

8. The pcre2_substitute() function has been implemented.

9. If an assertion used as a condition was quantified with a minimum of zero
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
occur.

10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented.

****