Make pcre2test show the pre-match characters that were actually consulted for a partial match, instead of always showing as many characters as the longest lookbehind. This is controlled by the "allusedtext" modifier.
Philip.Hazel 2019-06-26 08:23:47 +00:00
parent d21f7daf9b
commit 434e3f7468
10 changed files with 483 additions and 350 deletions
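
As a rough illustration of the behaviour this commit describes, the following standalone C sketch mimics how the "Partial match" line and its '<' marker line are laid out. It does not call PCRE2 at all. The function format_partial, its parameters, and the buffer handling are hypothetical names invented for this sketch, and the two offsets (leftmost consulted character and match start) are supplied by hand instead of coming from a real match.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of pcre2test. It builds the two output lines
   for a partial match. "leftchar" is the offset of the leftmost consulted
   character and "startchar" is the offset where the partial match starts.
   Both are assumed to lie within the subject string. When allusedtext is 0,
   leftchar is ignored, in the spirit of "else leftchar = ovector[0]" in the
   patch. */
static void format_partial(const char *subject, size_t leftchar,
                           size_t startchar, int allusedtext,
                           char *text, size_t textsize,
                           char *marks, size_t marksize)
{
    static const char rubric[] = "Partial match: ";
    size_t rubriclength = strlen(rubric);
    size_t backlength, i, n = 0;

    if (textsize == 0 || marksize == 0) return;

    /* Without allusedtext, output starts at the match start itself. */
    if (!allusedtext || leftchar > startchar) leftchar = startchar;
    backlength = startchar - leftchar;

    /* First line: the rubric, then the text from leftchar to the end. */
    snprintf(text, textsize, "%s%s", rubric, subject + leftchar);

    /* Marker line: pad to the rubric width, then one '<' per pre-match
       character. It stays empty when nothing before the start was used. */
    if (backlength > 0) {
        for (i = 0; i < rubriclength && n + 1 < marksize; i++) marks[n++] = ' ';
        for (i = 0; i < backlength && n + 1 < marksize; i++) marks[n++] = '<';
    }
    marks[n] = '\0';
}
```

Called with the subject "xyzabc12", leftchar 3, startchar 6 and allusedtext set, this sketch fills the two buffers with "Partial match: abc12" and a marker line of three '<' characters under "abc". With allusedtext clear it produces "Partial match: 12" and an empty marker line, which corresponds to the change visible in the testoutput2 hunks of this commit.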


@@ -71,6 +71,14 @@ lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
 lookbehind of 2, because that is the largest individual lookbehind. Now it sets
 it to 3, because matching looks back 3 characters.
 
+14. For partial matches, pcre2test was always showing the maximum lookbehind
+characters, flagged with "<", which is misleading when the lookbehind didn't
+actually look behind the start (because it was later in the pattern). Showing
+all consulted preceding characters for partial matches is now controlled by the
+existing "allusedtext" modifier and, as for complete matches, this facility is
+available only for non-JIT matching, because JIT does not maintain the first
+and last consulted characters.
+
 
 Version 10.33 16-April-2019
 ---------------------------


@@ -1252,22 +1252,27 @@ following line with a plus character following the capture number.
 </P>
 <P>
 The <b>allusedtext</b> modifier requests that all the text that was consulted
-during a successful pattern match by the interpreter should be shown. This
-feature is not supported for JIT matching, and if requested with JIT it is
-ignored (with a warning message). Setting this modifier affects the output if
-there is a lookbehind at the start of a match, or a lookahead at the end, or if
-\K is used in the pattern. Characters that precede or follow the start and end
-of the actual match are indicated in the output by '&#60;' or '&#62;' characters
-underneath them. Here is an example:
+during a successful pattern match by the interpreter should be shown, for both
+full and partial matches. This feature is not supported for JIT matching, and
+if requested with JIT it is ignored (with a warning message). Setting this
+modifier affects the output if there is a lookbehind at the start of a match,
+or, for a complete match, a lookahead at the end, or if \K is used in the
+pattern. Characters that precede or follow the start and end of the actual
+match are indicated in the output by '&#60;' or '&#62;' characters underneath them.
+Here is an example:
 <pre>
     re&#62; /(?&#60;=pqr)abc(?=xyz)/
   data&#62; 123pqrabcxyz456\=allusedtext
    0: pqrabcxyz
       &#60;&#60;&#60;   &#62;&#62;&#62;
+  data&#62; 123pqrabcxy\=ph,allusedtext
+  Partial match: pqrabcxy
+                 &#60;&#60;&#60;
 </pre>
-This shows that the matched string is "abc", with the preceding and following
-strings "pqr" and "xyz" having been consulted during the match (when processing
-the assertions).
+The first, complete match shows that the matched string is "abc", with the
+preceding and following strings "pqr" and "xyz" having been consulted during
+the match (when processing the assertions). The partial match can indicate only
+the preceding string.
 </P>
 <P>
 The <b>startchar</b> modifier requests that the starting character for the match
@@ -2081,7 +2086,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC21" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 June 2019
+Last updated: 26 June 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>


@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "20 June 2019" "PCRE 10.34"
+.TH PCRE2TEST 1 "26 June 2019" "PCRE 10.34"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -1220,22 +1220,27 @@ well as the main matched substring. In each case the remainder is output on the
 following line with a plus character following the capture number.
 .P
 The \fBallusedtext\fP modifier requests that all the text that was consulted
-during a successful pattern match by the interpreter should be shown. This
-feature is not supported for JIT matching, and if requested with JIT it is
-ignored (with a warning message). Setting this modifier affects the output if
-there is a lookbehind at the start of a match, or a lookahead at the end, or if
-\eK is used in the pattern. Characters that precede or follow the start and end
-of the actual match are indicated in the output by '<' or '>' characters
-underneath them. Here is an example:
+during a successful pattern match by the interpreter should be shown, for both
+full and partial matches. This feature is not supported for JIT matching, and
+if requested with JIT it is ignored (with a warning message). Setting this
+modifier affects the output if there is a lookbehind at the start of a match,
+or, for a complete match, a lookahead at the end, or if \eK is used in the
+pattern. Characters that precede or follow the start and end of the actual
+match are indicated in the output by '<' or '>' characters underneath them.
+Here is an example:
 .sp
     re> /(?<=pqr)abc(?=xyz)/
   data> 123pqrabcxyz456\e=allusedtext
    0: pqrabcxyz
       <<<   >>>
+  data> 123pqrabcxy\e=ph,allusedtext
+  Partial match: pqrabcxy
+                 <<<
 .sp
-This shows that the matched string is "abc", with the preceding and following
-strings "pqr" and "xyz" having been consulted during the match (when processing
-the assertions).
+The first, complete match shows that the matched string is "abc", with the
+preceding and following strings "pqr" and "xyz" having been consulted during
+the match (when processing the assertions). The partial match can indicate only
+the preceding string.
 .P
 The \fBstartchar\fP modifier requests that the starting character for the match
 be indicated, if it is different to the start of the matched string. The only
@@ -2062,6 +2067,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 20 June 2019
+Last updated: 26 June 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi


@@ -1122,23 +1122,27 @@ SUBJECT MODIFIERS
        capture number.
 
        The allusedtext modifier requests that all the text that was consulted
-       during a successful pattern match by the interpreter should be shown.
-       This feature is not supported for JIT matching, and if requested with
-       JIT it is ignored (with a warning message). Setting this modifier af-
-       fects the output if there is a lookbehind at the start of a match, or a
-       lookahead at the end, or if \K is used in the pattern. Characters that
-       precede or follow the start and end of the actual match are indicated
-       in the output by '<' or '>' characters underneath them. Here is an ex-
-       ample:
+       during a successful pattern match by the interpreter should be shown,
+       for both full and partial matches. This feature is not supported for
+       JIT matching, and if requested with JIT it is ignored (with a warning
+       message). Setting this modifier affects the output if there is a look-
+       behind at the start of a match, or, for a complete match, a lookahead
+       at the end, or if \K is used in the pattern. Characters that precede or
+       follow the start and end of the actual match are indicated in the out-
+       put by '<' or '>' characters underneath them. Here is an example:
 
            re> /(?<=pqr)abc(?=xyz)/
          data> 123pqrabcxyz456\=allusedtext
           0: pqrabcxyz
              <<<   >>>
+         data> 123pqrabcxy\=ph,allusedtext
+         Partial match: pqrabcxy
+                        <<<
 
-       This shows that the matched string is "abc", with the preceding and
-       following strings "pqr" and "xyz" having been consulted during the
-       match (when processing the assertions).
+       The first, complete match shows that the matched string is "abc", with
+       the preceding and following strings "pqr" and "xyz" having been con-
+       sulted during the match (when processing the assertions). The partial
+       match can indicate only the preceding string.
 
        The startchar modifier requests that the starting character for the
        match be indicated, if it is different to the start of the matched
@@ -1893,5 +1897,5 @@ AUTHOR
 
 REVISION
 
-       Last updated: 20 June 2019
+       Last updated: 26 June 2019
        Copyright (c) 1997-2019 University of Cambridge.


@@ -7761,14 +7761,22 @@ for (gmatched = 0;; gmatched++)
       } /* End of handling a successful match */
 
     /* There was a partial match. The value of ovector[0] is the bumpalong point,
-    that is, startchar, not any \K point that might have been passed. */
+    that is, startchar, not any \K point that might have been passed. When JIT is
+    not in use, "allusedtext" may be set, in which case we indicate the leftmost
+    consulted character. */
 
     else if (capcount == PCRE2_ERROR_PARTIAL)
       {
-      PCRE2_SIZE poffset;
+      PCRE2_SIZE leftchar;
       int backlength;
       int rubriclength = 0;
 
+      if ((dat_datctl.control & CTL_ALLUSEDTEXT) != 0)
+        {
+        leftchar = FLD(match_data, leftchar);
+        }
+      else leftchar = ovector[0];
+
       fprintf(outfile, "Partial match");
       if ((dat_datctl.control & CTL_MARK) != 0 &&
           TESTFLD(match_data, mark, !=, NULL))
@@ -7781,8 +7789,7 @@ for (gmatched = 0;; gmatched++)
       fprintf(outfile, ": ");
       rubriclength += 15;
 
-      poffset = backchars(pp, ovector[0], maxlookbehind, utf);
-      PCHARS(backlength, pp, poffset, ovector[0] - poffset, utf, outfile);
+      PCHARS(backlength, pp, leftchar, ovector[0] - leftchar, utf, outfile);
       PCHARSV(pp, ovector[0], ulen - ovector[0], utf, outfile);
 
       if ((pat_patctl.control & CTL_JITVERIFY) != 0 && jit_was_used)

testdata/testinput15

@@ -122,6 +122,44 @@
 /abc(?=abcde)(?=ab)/allusedtext
     abcabcdefg
 
+#subject allusedtext
+
+/(?<=abc)123/
+    xyzabc123pqr
+    xyzabc12\=ps
+    xyzabc12\=ph
+
+/\babc\b/
+    +++abc+++
+    +++ab\=ps
+    +++ab\=ph
+
+/(?<=abc)def/
+    abc\=ph
+
+/(?<=123)(*MARK:xx)abc/mark
+    xxxx123a\=ph
+    xxxx123a\=ps
+
+/(?<=(?<=a)b)c.*/I
+    abc\=ph
+\= Expect no match
+    xbc\=ph
+
+/(?<=ab)c.*/I
+    abc\=ph
+\= Expect no match
+    xbc\=ph
+
+/abc(?<=bc)def/
+    xxxabcd\=ph
+
+/(?<=ab)cdef/
+    xxabcd\=ph
+
+#subject
+
+# -------------------------------------------------------------------
 # These tests provoke recursion loops, which give a different error message
 # when JIT is used.

testdata/testinput6

@@ -486,7 +486,7 @@
     def\=dfa_restart
 
 /(?<=foo)bar/
-    foob\=ps,offset=2
+    foob\=ps,offset=2,allusedtext
     foobar...\=ps,dfa_restart,offset=4
     foobar\=offset=2
 \= Expect no match
@@ -4415,12 +4415,12 @@
 /abc\K123/
     xyzabc123pqr
 
-/(?<=abc)123/
+/(?<=abc)123/allusedtext
     xyzabc123pqr
     xyzabc12\=ps
     xyzabc12\=ph
 
-/\babc\b/
+/\babc\b/allusedtext
     +++abc+++
     +++ab\=ps
     +++ab\=ph
@@ -4490,7 +4490,7 @@
 /^(?(?!a(*SKIP)b))/
     ac
 
-/(?<=abc)def/
+/(?<=abc)def/allusedtext
     abc\=ph
 
 /abc$/

testdata/testoutput15

@@ -265,6 +265,79 @@ Failed: error -52: nested recursion at the same subject position
  0: abcabcde
        >>>>>
 
+#subject allusedtext
+
+/(?<=abc)123/
+    xyzabc123pqr
+ 0: abc123
+    <<<
+    xyzabc12\=ps
+Partial match: abc12
+               <<<
+    xyzabc12\=ph
+Partial match: abc12
+               <<<
+
+/\babc\b/
+    +++abc+++
+ 0: +abc+
+    <   >
+    +++ab\=ps
+Partial match: +ab
+               <
+    +++ab\=ph
+Partial match: +ab
+               <
+
+/(?<=abc)def/
+    abc\=ph
+Partial match: abc
+               <<<
+
+/(?<=123)(*MARK:xx)abc/mark
+    xxxx123a\=ph
+Partial match, mark=xx: 123a
+                        <<<
+    xxxx123a\=ps
+Partial match, mark=xx: 123a
+                        <<<
+
+/(?<=(?<=a)b)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+    abc\=ph
+Partial match: abc
+               <<
+\= Expect no match
+    xbc\=ph
+No match
+
+/(?<=ab)c.*/I
+Capture group count = 0
+Max lookbehind = 2
+First code unit = 'c'
+Subject length lower bound = 1
+    abc\=ph
+Partial match: abc
+               <<
+\= Expect no match
+    xbc\=ph
+No match
+
+/abc(?<=bc)def/
+    xxxabcd\=ph
+Partial match: abcd
+
+/(?<=ab)cdef/
+    xxabcd\=ph
+Partial match: abcd
+               <<
+
+#subject
+
+# -------------------------------------------------------------------
 # These tests provoke recursion loops, which give a different error message
 # when JIT is used.

testdata/testoutput2

@@ -9369,21 +9369,17 @@ Partial match: abc12
     xyzabc123pqr
  0: 123
     xyzabc12\=ps
-Partial match: abc12
-               <<<
+Partial match: 12
     xyzabc12\=ph
-Partial match: abc12
-               <<<
+Partial match: 12
 
 /\babc\b/
     +++abc+++
  0: abc
     +++ab\=ps
-Partial match: +ab
-               <
+Partial match: ab
     +++ab\=ph
-Partial match: +ab
-               <
+Partial match: ab
 
 /(?&word)(?&element)(?(DEFINE)(?<element><[^m][^>]>[^<])(?<word>\w*+))/B
 ------------------------------------------------------------------
@@ -10401,8 +10397,7 @@ No match
 
 /(?<=abc)def/
     abc\=ph
-Partial match: abc
-               <<<
+Partial match: 
 
 /abc$/
     abc
@@ -11959,11 +11954,9 @@ Callout 2: last capture = 0
 
 /(?<=123)(*MARK:xx)abc/mark
     xxxx123a\=ph
-Partial match, mark=xx: 123a
-                        <<<
+Partial match, mark=xx: a
     xxxx123a\=ps
-Partial match, mark=xx: 123a
-                        <<<
+Partial match, mark=xx: a
 
 /123\Kabc/startchar
     xxxx123a\=ph
@@ -17045,8 +17038,7 @@ Max lookbehind = 2
 First code unit = 'c'
 Subject length lower bound = 1
     abc\=ph
-Partial match: abc
-               <<
+Partial match: c
 \= Expect no match
     xbc\=ph
 No match
@@ -17057,8 +17049,7 @@ Max lookbehind = 2
 First code unit = 'c'
 Subject length lower bound = 1
     abc\=ph
-Partial match: abc
-               <<
+Partial match: c
 \= Expect no match
     xbc\=ph
 No match

testdata/testoutput6

@@ -876,7 +876,7 @@ Partial match: abc
  0: def
 
 /(?<=foo)bar/
-    foob\=ps,offset=2
+    foob\=ps,offset=2,allusedtext
 Partial match: foob
                <<<
     foobar...\=ps,dfa_restart,offset=4
@@ -6803,9 +6803,10 @@ Partial match: dogs
     xyzabc123pqr
 Failed: error -42: pattern contains an item that is not supported for DFA matching
 
-/(?<=abc)123/
+/(?<=abc)123/allusedtext
     xyzabc123pqr
- 0: 123
+ 0: abc123
+    <<<
     xyzabc12\=ps
 Partial match: abc12
                <<<
@@ -6813,9 +6814,10 @@ Partial match: abc12
 Partial match: abc12
                <<<
 
-/\babc\b/
+/\babc\b/allusedtext
     +++abc+++
- 0: abc
+ 0: +abc+
+    <   >
     +++ab\=ps
 Partial match: +ab
                <
@@ -6932,7 +6934,7 @@ Failed: error -42: pattern contains an item that is not supported for DFA matchi
     ac
 Failed: error -42: pattern contains an item that is not supported for DFA matching
 
-/(?<=abc)def/
+/(?<=abc)def/allusedtext
     abc\=ph
 Partial match: abc
                <<<