Change Log for PCRE2
--------------------

Version 10.0 xx-xxxx-2014
-------------------------

Version 10.0 is the first release of PCRE2, a revised API for the PCRE library.
Changes prior to 10.0 are logged in the ChangeLog file for the old API, up to
item 20 for release 8.36.

The code of the library was heavily revised as part of the new API
implementation. Details of each and every modification were not individually
logged. In addition to the API changes, the following changes were made. They
are either new functionality, or bug fixes and other noticeable changes of
behaviour that were implemented after the code had been forked.

1. Unicode support is now enabled by default.

2. The test program, now called pcre2test, was re-specified and almost
completely re-written. Its input is not compatible with input for pcretest.

3. Patterns may start with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) to set the
PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART options for every subject line that is
matched by that pattern.

4. For the benefit of those who use PCRE2 via some other application, that is,
not writing the function calls themselves, it is possible to check the PCRE2
version by matching a pattern such as /(?(VERSION>=10.0)yes|no)/ against a
string such as "yesno".

5. There are case-equivalent Unicode characters whose encodings use different
numbers of code units in UTF-8. U+023A and U+2C65 are one example. (It is
theoretically possible for this to happen in UTF-16 too.) If a backreference to
a group containing one of these characters was greedily repeated, and during
the match a backtrack occurred, the subject might be backtracked by the wrong
number of code units. For example, if /^(\x{23a})\1*(.)/ is matched caselessly
(and in UTF-8 mode) against "\x{23a}\x{2c65}\x{2c65}\x{2c65}", group 2 should
capture the final character, which is the three bytes E2, B1, and A5 in UTF-8.
Incorrect backtracking meant that group 2 captured only the last two bytes.
This bug has been fixed; the new code is slower, but it is used only when the
strings matched by the repetition are not all the same length.

6. A pattern such as /()a/ was not setting the "first character must be 'a'"
information. This applied to any pattern with a group that matched no
characters, for example: /(?:(?=.)|(?<!x))a/.

7. When an (*ACCEPT) is triggered inside capturing parentheses, it arranges for
those parentheses to be closed with whatever has been captured so far. However,
it was failing to mark any other groups between the highest capture so far and
the current group as "unset". Thus, the ovector for those groups contained
whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
matched against "abcd".

8. The pcre2_substitute() function has been implemented.

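Item 5 above rests on two characters that are case-equivalent but differ in
UTF-8 length. As a rough illustration in plain Python (not PCRE2 code, and not
how the library is implemented), the sketch below checks the two encodings and
shows how stepping back by the wrong length lands inside the final character.
The helper name code_units is hypothetical, and the idea that the faulty step
used the 2-byte length of U+023A is an assumption made only to reproduce the
two-byte capture described in the item.

```python
# Illustration only: plain Python, not PCRE2.
upper = "\u023a"  # U+023A LATIN CAPITAL LETTER A WITH STROKE
lower = "\u2c65"  # U+2C65 LATIN SMALL LETTER A WITH STROKE


def code_units(ch, encoding):
    """Return how many code units ch occupies (hypothetical helper)."""
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    return len(ch.encode(encoding)) // unit_size


# The two characters are case-equivalent.
case_equivalent = upper.lower() == lower

# They differ in UTF-8 length but not in UTF-16 length.
utf8_lengths = (code_units(upper, "utf-8"), code_units(lower, "utf-8"))
utf16_lengths = (code_units(upper, "utf-16-le"), code_units(lower, "utf-16-le"))

# The subject from item 5, as UTF-8 bytes: 2 + 3 + 3 + 3 = 11 bytes.
subject = (upper + lower * 3).encode("utf-8")

# Giving back one repetition means stepping back over one U+2C65 (3 bytes).
correct_capture = subject[len(subject) - code_units(lower, "utf-8"):]

# Assumption: stepping back by the length of U+023A (2 bytes) instead
# starts in the middle of the final character.
wrong_capture = subject[len(subject) - code_units(upper, "utf-8"):]

print(case_equivalent, utf8_lengths, utf16_lengths)
print(correct_capture.hex(" "), "|", wrong_capture.hex(" "))
```

Both characters occupy a single code unit in UTF-16, so this particular pair
causes trouble only in UTF-8.
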
****