From c0d0ee5365abd7fc6a7ddb271be258aa2a8e9d63 Mon Sep 17 00:00:00 2001
From: "Philip.Hazel"
Date: Wed, 26 Jun 2019 16:13:28 +0000
Subject: [PATCH] Fix partial matching bug in pcre2_dfa_match().

---
 ChangeLog             | 56 +++++++++++++++++++++++--------------------
 src/pcre2_dfa_match.c | 11 ++++-----
 testdata/testinput6   | 22 +++++++++++++++++
 testdata/testoutput6  | 36 ++++++++++++++++++++++++++++
 4 files changed, 93 insertions(+), 32 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index d71187c..cb0762c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -5,7 +5,7 @@ Change Log for PCRE2
 Version 10.34 22-April-2019
 ---------------------------
 
-1. The maximum number of capturing subpatterns is 65535 (documented), but no
+1. The maximum number of capturing subpatterns is 65535 (documented), but no
 check on this was ever implemented. This omission has been rectified; it fixes
 ClusterFuzz 14376.
 
@@ -25,40 +25,40 @@ PCRE2_MATCH_INVALID_UTF compile-time option.
 7. Adjust the limit for "must have" code unit searching, in particular,
 increase it substantially for non-anchored patterns.
 
-8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero
+8. Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero
 minimum is potentially useful.
 
 9. Some changes to the way the minimum subject length is handled:
 
-   * When PCRE2_NO_START_OPTIMIZE is set, no minimum length is computed;
+   * When PCRE2_NO_START_OPTIMIZE is set, no minimum length is computed;
      pcre2test now omits this item instead of showing a value of zero.
-
-   * An incorrect minimum length could be calculated for a pattern that
-     contained (*ACCEPT) inside a qualified group whose minimum repetition was
+
+   * An incorrect minimum length could be calculated for a pattern that
+     contained (*ACCEPT) inside a qualified group whose minimum repetition was
      zero, for example /A(?:(*ACCEPT))?B/, which incorrectly computed a minimum
-     of 2. The minimum length scan no longer happens for a pattern that
+     of 2. The minimum length scan no longer happens for a pattern that
      contains (*ACCEPT).
-
-   * When no minimum length is set by the normal scan, but a first and/or last
+
+   * When no minimum length is set by the normal scan, but a first and/or last
      code unit is recorded, set the minimum to 1 or 2 as appropriate.
-
+
    * When a pattern contains multiple groups with the same number, a back
      reference cannot know which one to scan for a minimum length. This used to
      cause the minimum length finder to give up with no result. Now it treats
-     such references as not adding to the minimum length (which it should have
+     such references as not adding to the minimum length (which it should have
      done all along).
-
-   * Furthermore, the above action now happens only if the back reference is to
-     a group that exists more than once in a pattern instead of any back
-     reference in a pattern with duplicate numbers.
-
-10. A (*MARK) value inside a successful condition was not being returned by the
+
+   * Furthermore, the above action now happens only if the back reference is to
+     a group that exists more than once in a pattern instead of any back
+     reference in a pattern with duplicate numbers.
+
+10. A (*MARK) value inside a successful condition was not being returned by the
 interpretive matcher (it was returned by JIT). This bug has been mended.
 
-11. A bug in pcre2grep meant that -o without an argument (or -o0) didn't work
-if the pattern had more than 32 capturing parentheses. This is fixed. In
-addition (a) the default limit for groups requested by -o has been raised to
-50, (b) the new --om-capture option changes the limit, (c) an error is raised
+11. A bug in pcre2grep meant that -o without an argument (or -o0) didn't work
+if the pattern had more than 32 capturing parentheses. This is fixed. In
+addition (a) the default limit for groups requested by -o has been raised to
+50, (b) the new --om-capture option changes the limit, (c) an error is raised
 if -o asks for a group that is above the limit.
 
 12. The quantifier {1} was always being ignored, but this is incorrect when it
@@ -66,19 +66,23 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.
 
-13. Nested lookbehinds are now taken into account when computing the maximum
-lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
-lookbehind of 2, because that is the largest individual lookbehind. Now it sets
+13. Nested lookbehinds are now taken into account when computing the maximum
+lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
+lookbehind of 2, because that is the largest individual lookbehind. Now it sets
 it to 3, because matching looks back 3 characters.
 
-14. For partial matches, pcre2test was always showing the maximum lookbehind
-characters, flagged with "<", which is misleading when the lookbehind didn't
+14. For partial matches, pcre2test was always showing the maximum lookbehind
+characters, flagged with "<", which is misleading when the lookbehind didn't
 actually look behind the start (because it was later in the pattern). Showing
 all consulted preceding characters for partial matches is now controlled by the
 existing "allusedtext" modifier and, as for complete matches, this facility is
 available only for non-JIT matching, because JIT does not maintain the first
 and last consulted characters.
 
+15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
+if the end of the subject was encountered in a lookahead (conditional or
+otherwise), an atomic group, or a recursion.
+
 
 Version 10.33 16-April-2019
 ---------------------------
diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c
index 911e9b9..538d15d 100644
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@@ -3152,8 +3152,8 @@ for (;;)
 
   /* We have finished the processing at the current subject character. If no
   new states have been set for the next character, we have found all the
-  matches that we are going to find. If we are at the top level and partial
-  matching has been requested, check for appropriate conditions.
+  matches that we are going to find. If partial matching has been requested,
+  check for appropriate conditions.
 
   The "forced_ fail" variable counts the number of (*F) encountered for the
   character. If it is equal to the original active_count (saved in
@@ -3165,8 +3165,7 @@ for (;;)
 
   if (new_count <= 0)
     {
-    if (rlevel == 1 && /* Top level, and */
-        could_continue && /* Some could go on, and */
+    if (could_continue && /* Some could go on, and */
         forced_fail != workspace[1] && /* Not all forced fail & */
         ( /* either... */
         (mb->moptions & PCRE2_PARTIAL_HARD) != 0 /* Hard partial */
@@ -3175,8 +3174,8 @@ for (;;)
          match_count < 0) /* no matches */
         ) && /* And... */
         (
-        partial_newline || /* Either partial NL */
-          ( /* or ... */
+        partial_newline || /* Either partial NL */
+          ( /* or ... */
           ptr >= end_subject && /* End of subject and */
           ptr > mb->start_used_ptr) /* Inspected non-empty string */
           )
diff --git a/testdata/testinput6 b/testdata/testinput6
index 403e3fa..cc3ebd0 100644
--- a/testdata/testinput6
+++ b/testdata/testinput6
@@ -4972,4 +4972,26 @@
 \= Expect no match
     0
 
+/(?<=pqr)abc(?=xyz)/
+    123pqrabcxy\=ps,allusedtext
+    123pqrabcxyz\=ps,allusedtext
+
+/(?>a+b)/
+    aaaa\=ps
+    aaaab\=ps
+
+/(abc)(?1)/
+    abca\=ps
+    abcabc\=ps
+
+/(?(?=abc).*|Z)/
+    ab\=ps
+    abcxyz\=ps
+
+/(abc)++x/
+    abcab\=ps
+    abc\=ps
+    ab\=ps
+    abcx
+
 # End of testinput6
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index 6a975dd..61cbfe2 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -7809,4 +7809,40 @@ No match
     0
 No match
 
+/(?<=pqr)abc(?=xyz)/
+    123pqrabcxy\=ps,allusedtext
+Partial match: pqrabcxy
+               <<<
+    123pqrabcxyz\=ps,allusedtext
+ 0: pqrabcxyz
+    <<<   >>>
+
+/(?>a+b)/
+    aaaa\=ps
+Partial match: aaaa
+    aaaab\=ps
+ 0: aaaab
+
+/(abc)(?1)/
+    abca\=ps
+Partial match: abca
+    abcabc\=ps
+ 0: abcabc
+
+/(?(?=abc).*|Z)/
+    ab\=ps
+Partial match: ab
+    abcxyz\=ps
+ 0: abcxyz
+
+/(abc)++x/
+    abcab\=ps
+Partial match: abcab
+    abc\=ps
+Partial match: abc
+    ab\=ps
+Partial match: ab
+    abcx
+ 0: abcx
+
 # End of testinput6
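
Note on the pcre2_dfa_match.c change: the effect of dropping the
"rlevel == 1" test can be sketched as a standalone predicate. This is a
minimal illustration only, not the library's code. The struct, the function
names and the option values (sketch_state, sketch_partial_before,
sketch_partial_after, SKETCH_PARTIAL_*) are hypothetical, offsets stand in for
the subject pointers, and the soft-partial clause falls partly outside the
hunks shown above, so its exact form here is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option bits for this sketch; NOT the real PCRE2 values. */
enum { SKETCH_PARTIAL_SOFT = 1, SKETCH_PARTIAL_HARD = 2 };

/* Hypothetical snapshot of the variables the patched test inspects. */
typedef struct {
  int rlevel;              /* 1 at top level, greater in nested calls */
  bool could_continue;     /* some state could have gone on */
  int forced_fail;         /* number of (*F) seen for this character */
  int saved_active_count;  /* stands in for workspace[1] */
  unsigned moptions;       /* SKETCH_PARTIAL_* bits */
  int match_count;         /* negative means no complete match so far */
  bool partial_newline;    /* stopped part-way through a newline */
  size_t ptr;              /* current offset into the subject */
  size_t end_subject;      /* offset of the end of the subject */
  size_t start_used_ptr;   /* earliest offset inspected */
} sketch_state;

/* Conditions common to both versions of the test. */
static bool sketch_common(const sketch_state *s)
{
  /* Assumed from context: soft partial applies only when nothing matched. */
  bool partial_wanted =
    (s->moptions & SKETCH_PARTIAL_HARD) != 0 ||
    ((s->moptions & SKETCH_PARTIAL_SOFT) != 0 && s->match_count < 0);

  /* Either a partial newline, or a non-empty inspection hit the end. */
  bool ran_out =
    s->partial_newline ||
    (s->ptr >= s->end_subject && s->ptr > s->start_used_ptr);

  return s->could_continue &&
         s->forced_fail != s->saved_active_count &&
         partial_wanted && ran_out;
}

/* Before the patch: a partial match is considered only at top level. */
bool sketch_partial_before(const sketch_state *s)
{
  return s->rlevel == 1 && sketch_common(s);
}

/* After the patch: the same conditions apply at any nesting level. */
bool sketch_partial_after(const sketch_state *s)
{
  return sketch_common(s);
}
```

With rlevel set to 2, standing in for a nested call such as the one made for
an atomic group, a state that ran off the end of the subject under soft
partial matching passes sketch_partial_after() but not
sketch_partial_before(). That matches the new test cases, for example
/(?>a+b)/ against "aaaa" now giving "Partial match: aaaa".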