Make pcre2test show the pre-match characters that were actually consulted for a
partial match, not the length of the longest lookbehind. This is controlled by
the "allusedtext" modifier.
Philip.Hazel 2019-06-26 08:23:47 +00:00
parent d21f7daf9b
commit 434e3f7468
10 changed files with 483 additions and 350 deletions
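As a rough illustration of the output layout this commit produces, here is a minimal, self-contained sketch. It is not the pcre2test code: `format_partial` and its parameters are hypothetical names introduced only for this example. It assumes the 15-character "Partial match: " rubric that the source accounts for with `rubriclength += 15`, assumes one '<' under each consulted character that precedes the start of the partial match, and ignores the longer rubric used when a mark is shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of pcre2test. Formats a partial match into
   'out': first the rubric and the subject text from 'leftchar' to the end,
   then (only if anything before 'start' was consulted) a second line that
   places one '<' under each consulted pre-match character.
   'leftchar' is the offset of the leftmost consulted character and 'start'
   is the offset at which the partial match begins. */
static void format_partial(const char *subject, size_t length,
                           size_t leftchar, size_t start,
                           char *out, size_t outsize)
{
  static const char rubric[] = "Partial match: ";   /* 15 characters */
  size_t back;
  int n;

  assert(leftchar <= start && start <= length);
  back = start - leftchar;          /* number of consulted pre-match chars */

  n = snprintf(out, outsize, "%s%.*s\n", rubric,
               (int)(length - leftchar), subject + leftchar);

  /* Add the marker line only if the first line fitted completely. */
  if (back > 0 && n > 0 && (size_t)n < outsize)
    {
    size_t pos = (size_t)n;
    size_t i;
    for (i = 0; i < sizeof(rubric) - 1 && pos + 1 < outsize; i++)
      out[pos++] = ' ';             /* indent past the rubric */
    for (i = 0; i < back && pos + 1 < outsize; i++)
      out[pos++] = '<';             /* flag each consulted character */
    if (pos + 1 < outsize) out[pos++] = '\n';
    out[pos] = '\0';
    }
}
```

Under these assumptions, calling `format_partial("xyzabc12", 8, 3, 6, buf, sizeof buf)` produces "Partial match: abc12" followed by a line with three '<' characters under "abc", which corresponds to the allusedtext case in the tests. Passing `leftchar` equal to `start` produces only "Partial match: 12", which corresponds to the output without the modifier.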


@ -71,6 +71,14 @@ lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
lookbehind of 2, because that is the largest individual lookbehind. Now it sets
it to 3, because matching looks back 3 characters.
14. For partial matches, pcre2test was always showing the maximum lookbehind
characters, flagged with "<", which is misleading when the lookbehind didn't
actually look behind the start (because it was later in the pattern). Showing
all consulted preceding characters for partial matches is now controlled by the
existing "allusedtext" modifier and, as for complete matches, this facility is
available only for non-JIT matching, because JIT does not maintain the first
and last consulted characters.
Version 10.33 16-April-2019
---------------------------


@ -1252,22 +1252,27 @@ following line with a plus character following the capture number.
</P>
<P>
The <b>allusedtext</b> modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. This
feature is not supported for JIT matching, and if requested with JIT it is
ignored (with a warning message). Setting this modifier affects the output if
there is a lookbehind at the start of a match, or a lookahead at the end, or if
\K is used in the pattern. Characters that precede or follow the start and end
of the actual match are indicated in the output by '&#60;' or '&#62;' characters
underneath them. Here is an example:
during a successful pattern match by the interpreter should be shown, for both
full and partial matches. This feature is not supported for JIT matching, and
if requested with JIT it is ignored (with a warning message). Setting this
modifier affects the output if there is a lookbehind at the start of a match,
or, for a complete match, a lookahead at the end, or if \K is used in the
pattern. Characters that precede or follow the start and end of the actual
match are indicated in the output by '&#60;' or '&#62;' characters underneath them.
Here is an example:
<pre>
re&#62; /(?&#60;=pqr)abc(?=xyz)/
data&#62; 123pqrabcxyz456\=allusedtext
0: pqrabcxyz
&#60;&#60;&#60; &#62;&#62;&#62;
data&#62; 123pqrabcxy\=ph,allusedtext
Partial match: pqrabcxy
&#60;&#60;&#60;
</pre>
This shows that the matched string is "abc", with the preceding and following
strings "pqr" and "xyz" having been consulted during the match (when processing
the assertions).
The first, complete match shows that the matched string is "abc", with the
preceding and following strings "pqr" and "xyz" having been consulted during
the match (when processing the assertions). The partial match can indicate only
the preceding string.
</P>
<P>
The <b>startchar</b> modifier requests that the starting character for the match
@ -2081,7 +2086,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 June 2019
Last updated: 26 June 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "20 June 2019" "PCRE 10.34"
.TH PCRE2TEST 1 "26 June 2019" "PCRE 10.34"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -1220,22 +1220,27 @@ well as the main matched substring. In each case the remainder is output on the
following line with a plus character following the capture number.
.P
The \fBallusedtext\fP modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. This
feature is not supported for JIT matching, and if requested with JIT it is
ignored (with a warning message). Setting this modifier affects the output if
there is a lookbehind at the start of a match, or a lookahead at the end, or if
\eK is used in the pattern. Characters that precede or follow the start and end
of the actual match are indicated in the output by '<' or '>' characters
underneath them. Here is an example:
during a successful pattern match by the interpreter should be shown, for both
full and partial matches. This feature is not supported for JIT matching, and
if requested with JIT it is ignored (with a warning message). Setting this
modifier affects the output if there is a lookbehind at the start of a match,
or, for a complete match, a lookahead at the end, or if \eK is used in the
pattern. Characters that precede or follow the start and end of the actual
match are indicated in the output by '<' or '>' characters underneath them.
Here is an example:
.sp
re> /(?<=pqr)abc(?=xyz)/
data> 123pqrabcxyz456\e=allusedtext
0: pqrabcxyz
<<< >>>
data> 123pqrabcxy\e=ph,allusedtext
Partial match: pqrabcxy
<<<
.sp
This shows that the matched string is "abc", with the preceding and following
strings "pqr" and "xyz" having been consulted during the match (when processing
the assertions).
The first, complete match shows that the matched string is "abc", with the
preceding and following strings "pqr" and "xyz" having been consulted during
the match (when processing the assertions). The partial match can indicate only
the preceding string.
.P
The \fBstartchar\fP modifier requests that the starting character for the match
be indicated, if it is different to the start of the matched string. The only
@ -2062,6 +2067,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 June 2019
Last updated: 26 June 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1122,23 +1122,27 @@ SUBJECT MODIFIERS
capture number.
The allusedtext modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown.
This feature is not supported for JIT matching, and if requested with
JIT it is ignored (with a warning message). Setting this modifier af-
fects the output if there is a lookbehind at the start of a match, or a
lookahead at the end, or if \K is used in the pattern. Characters that
precede or follow the start and end of the actual match are indicated
in the output by '<' or '>' characters underneath them. Here is an ex-
ample:
during a successful pattern match by the interpreter should be shown,
for both full and partial matches. This feature is not supported for
JIT matching, and if requested with JIT it is ignored (with a warning
message). Setting this modifier affects the output if there is a look-
behind at the start of a match, or, for a complete match, a lookahead
at the end, or if \K is used in the pattern. Characters that precede or
follow the start and end of the actual match are indicated in the out-
put by '<' or '>' characters underneath them. Here is an example:
re> /(?<=pqr)abc(?=xyz)/
data> 123pqrabcxyz456\=allusedtext
0: pqrabcxyz
<<< >>>
data> 123pqrabcxy\=ph,allusedtext
Partial match: pqrabcxy
<<<
This shows that the matched string is "abc", with the preceding and
following strings "pqr" and "xyz" having been consulted during the
match (when processing the assertions).
The first, complete match shows that the matched string is "abc", with
the preceding and following strings "pqr" and "xyz" having been con-
sulted during the match (when processing the assertions). The partial
match can indicate only the preceding string.
The startchar modifier requests that the starting character for the
match be indicated, if it is different to the start of the matched
@ -1893,5 +1897,5 @@ AUTHOR
REVISION
Last updated: 20 June 2019
Last updated: 26 June 2019
Copyright (c) 1997-2019 University of Cambridge.


@ -7761,14 +7761,22 @@ for (gmatched = 0;; gmatched++)
} /* End of handling a successful match */
/* There was a partial match. The value of ovector[0] is the bumpalong point,
that is, startchar, not any \K point that might have been passed. */
that is, startchar, not any \K point that might have been passed. When JIT is
not in use, "allusedtext" may be set, in which case we indicate the leftmost
consulted character. */
else if (capcount == PCRE2_ERROR_PARTIAL)
{
PCRE2_SIZE poffset;
PCRE2_SIZE leftchar;
int backlength;
int rubriclength = 0;
if ((dat_datctl.control & CTL_ALLUSEDTEXT) != 0)
{
leftchar = FLD(match_data, leftchar);
}
else leftchar = ovector[0];
fprintf(outfile, "Partial match");
if ((dat_datctl.control & CTL_MARK) != 0 &&
TESTFLD(match_data, mark, !=, NULL))
@ -7781,8 +7789,7 @@ for (gmatched = 0;; gmatched++)
fprintf(outfile, ": ");
rubriclength += 15;
poffset = backchars(pp, ovector[0], maxlookbehind, utf);
PCHARS(backlength, pp, poffset, ovector[0] - poffset, utf, outfile);
PCHARS(backlength, pp, leftchar, ovector[0] - leftchar, utf, outfile);
PCHARSV(pp, ovector[0], ulen - ovector[0], utf, outfile);
if ((pat_patctl.control & CTL_JITVERIFY) != 0 && jit_was_used)

38
testdata/testinput15 vendored

@ -122,6 +122,44 @@
/abc(?=abcde)(?=ab)/allusedtext
abcabcdefg
#subject allusedtext
/(?<=abc)123/
xyzabc123pqr
xyzabc12\=ps
xyzabc12\=ph
/\babc\b/
+++abc+++
+++ab\=ps
+++ab\=ph
/(?<=abc)def/
abc\=ph
/(?<=123)(*MARK:xx)abc/mark
xxxx123a\=ph
xxxx123a\=ps
/(?<=(?<=a)b)c.*/I
abc\=ph
\= Expect no match
xbc\=ph
/(?<=ab)c.*/I
abc\=ph
\= Expect no match
xbc\=ph
/abc(?<=bc)def/
xxxabcd\=ph
/(?<=ab)cdef/
xxabcd\=ph
#subject
# -------------------------------------------------------------------
# These tests provoke recursion loops, which give a different error message
# when JIT is used.

8
testdata/testinput6 vendored

@ -486,7 +486,7 @@
def\=dfa_restart
/(?<=foo)bar/
foob\=ps,offset=2
foob\=ps,offset=2,allusedtext
foobar...\=ps,dfa_restart,offset=4
foobar\=offset=2
\= Expect no match
@ -4415,12 +4415,12 @@
/abc\K123/
xyzabc123pqr
/(?<=abc)123/
/(?<=abc)123/allusedtext
xyzabc123pqr
xyzabc12\=ps
xyzabc12\=ph
/\babc\b/
/\babc\b/allusedtext
+++abc+++
+++ab\=ps
+++ab\=ph
@ -4490,7 +4490,7 @@
/^(?(?!a(*SKIP)b))/
ac
/(?<=abc)def/
/(?<=abc)def/allusedtext
abc\=ph
/abc$/

73
testdata/testoutput15 vendored

@ -265,6 +265,79 @@ Failed: error -52: nested recursion at the same subject position
0: abcabcde
>>>>>
#subject allusedtext
/(?<=abc)123/
xyzabc123pqr
0: abc123
<<<
xyzabc12\=ps
Partial match: abc12
<<<
xyzabc12\=ph
Partial match: abc12
<<<
/\babc\b/
+++abc+++
0: +abc+
< >
+++ab\=ps
Partial match: +ab
<
+++ab\=ph
Partial match: +ab
<
/(?<=abc)def/
abc\=ph
Partial match: abc
<<<
/(?<=123)(*MARK:xx)abc/mark
xxxx123a\=ph
Partial match, mark=xx: 123a
<<<
xxxx123a\=ps
Partial match, mark=xx: 123a
<<<
/(?<=(?<=a)b)c.*/I
Capture group count = 0
Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
\= Expect no match
xbc\=ph
No match
/(?<=ab)c.*/I
Capture group count = 0
Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
\= Expect no match
xbc\=ph
No match
/abc(?<=bc)def/
xxxabcd\=ph
Partial match: abcd
/(?<=ab)cdef/
xxabcd\=ph
Partial match: abcd
<<
#subject
# -------------------------------------------------------------------
# These tests provoke recursion loops, which give a different error message
# when JIT is used.

27
testdata/testoutput2 vendored

@ -9369,21 +9369,17 @@ Partial match: abc12
xyzabc123pqr
0: 123
xyzabc12\=ps
Partial match: abc12
<<<
Partial match: 12
xyzabc12\=ph
Partial match: abc12
<<<
Partial match: 12
/\babc\b/
+++abc+++
0: abc
+++ab\=ps
Partial match: +ab
<
Partial match: ab
+++ab\=ph
Partial match: +ab
<
Partial match: ab
/(?&word)(?&element)(?(DEFINE)(?<element><[^m][^>]>[^<])(?<word>\w*+))/B
------------------------------------------------------------------
@ -10401,8 +10397,7 @@ No match
/(?<=abc)def/
abc\=ph
Partial match: abc
<<<
Partial match:
/abc$/
abc
@ -11959,11 +11954,9 @@ Callout 2: last capture = 0
/(?<=123)(*MARK:xx)abc/mark
xxxx123a\=ph
Partial match, mark=xx: 123a
<<<
Partial match, mark=xx: a
xxxx123a\=ps
Partial match, mark=xx: 123a
<<<
Partial match, mark=xx: a
/123\Kabc/startchar
xxxx123a\=ph
@ -17045,8 +17038,7 @@ Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
Partial match: c
\= Expect no match
xbc\=ph
No match
@ -17057,8 +17049,7 @@ Max lookbehind = 2
First code unit = 'c'
Subject length lower bound = 1
abc\=ph
Partial match: abc
<<
Partial match: c
\= Expect no match
xbc\=ph
No match

14
testdata/testoutput6 vendored

@ -876,7 +876,7 @@ Partial match: abc
0: def
/(?<=foo)bar/
foob\=ps,offset=2
foob\=ps,offset=2,allusedtext
Partial match: foob
<<<
foobar...\=ps,dfa_restart,offset=4
@ -6803,9 +6803,10 @@ Partial match: dogs
xyzabc123pqr
Failed: error -42: pattern contains an item that is not supported for DFA matching
/(?<=abc)123/
/(?<=abc)123/allusedtext
xyzabc123pqr
0: 123
0: abc123
<<<
xyzabc12\=ps
Partial match: abc12
<<<
@ -6813,9 +6814,10 @@ Partial match: abc12
Partial match: abc12
<<<
/\babc\b/
/\babc\b/allusedtext
+++abc+++
0: abc
0: +abc+
< >
+++ab\=ps
Partial match: +ab
<
@ -6932,7 +6934,7 @@ Failed: error -42: pattern contains an item that is not supported for DFA matchi
ac
Failed: error -42: pattern contains an item that is not supported for DFA matching
/(?<=abc)def/
/(?<=abc)def/allusedtext
abc\=ph
Partial match: abc
<<<