Partial match documentation rewritten.
parent
59c7c5d100
commit
ce751bfc84
|
@ -14,85 +14,123 @@ please consult the man page, in case the conversion went wrong.
|
|||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
|
||||
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre2_match()</a>
|
||||
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_dfa_match()</a>
|
||||
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
|
||||
<li><a name="TOC5" href="#SEC5">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a>
|
||||
<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
|
||||
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
|
||||
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
|
||||
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
|
||||
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
|
||||
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
|
||||
<li><a name="TOC8" href="#SEC8">ISSUES WITH MULTI-SEGMENT MATCHING</a>
|
||||
<li><a name="TOC9" href="#SEC9">AUTHOR</a>
|
||||
<li><a name="TOC10" href="#SEC10">REVISION</a>
|
||||
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
|
||||
<li><a name="TOC8" href="#SEC8">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
|
||||
<P>
|
||||
In normal use of PCRE2, if the subject string that is passed to a matching
|
||||
function matches as far as it goes, but is too short to match the entire
|
||||
pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
|
||||
might be helpful to distinguish this case from other cases in which there is no
|
||||
match.
|
||||
In normal use of PCRE2, if there is a match up to the end of a subject string,
|
||||
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
|
||||
is returned, just like any other failing match. There are circumstances where
|
||||
it might be helpful to distinguish this "partial match" case.
|
||||
</P>
|
||||
<P>
|
||||
Consider, for example, an application where a human is required to type in data
|
||||
for a field with specific formatting requirements. An example might be a date
|
||||
in the form <i>ddmmmyy</i>, defined by this pattern:
|
||||
One example is an application where the subject string is very long, and not
|
||||
all available at once. The requirement here is to be able to do the matching
|
||||
segment by segment, but special action is needed when a matched substring spans
|
||||
the boundary between two segments.
|
||||
</P>
|
||||
<P>
|
||||
Another example is checking a user input string as it is typed, to ensure that
|
||||
it conforms to a required format. Invalid characters can be immediately
|
||||
diagnosed and rejected, giving instant feedback.
|
||||
</P>
|
||||
<P>
|
||||
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
|
||||
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
|
||||
options when calling a matching function. The difference between the two
|
||||
options is whether or not a partial match is preferred to an alternative
|
||||
complete match, though the details differ between the two types of matching
|
||||
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||
</P>
|
||||
<P>
|
||||
If you want to use partial matching with just-in-time optimized code, as well
|
||||
as setting a partial match option for the matching function, you must also call
|
||||
<b>pcre2_jit_compile()</b> with one or both of these options:
|
||||
<pre>
|
||||
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
|
||||
</pre>
|
||||
If the application sees the user's keystrokes one by one, and can check that
|
||||
what has been typed so far is potentially valid, it is able to raise an error
|
||||
as soon as a mistake is made, by beeping and not reflecting the character that
|
||||
has been typed, for example. This immediate feedback is likely to be a better
|
||||
user interface than a check that is delayed until the entire string has been
|
||||
entered. Partial matching can also be useful when the subject string is very
|
||||
long and is not all available at once, as discussed below.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
|
||||
The difference between the two options is whether or not a partial match is
|
||||
preferred to an alternative complete match, though the details differ between
|
||||
the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
|
||||
takes precedence.
|
||||
</P>
|
||||
<P>
|
||||
If you want to use partial matching with just-in-time optimized code, you must
|
||||
call <b>pcre2_jit_compile()</b> with one or both of these options:
|
||||
<pre>
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
PCRE2_JIT_PARTIAL_HARD
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
</pre>
|
||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
||||
matches on the same pattern. If the appropriate JIT mode has not been compiled,
|
||||
interpretive matching code is used.
|
||||
matches on the same pattern. Separate code is compiled for each mode. If the
|
||||
appropriate JIT mode has not been compiled, interpretive matching code is used.
|
||||
</P>
|
||||
<P>
|
||||
Setting a partial matching option disables two of PCRE2's standard
|
||||
optimizations. PCRE2 remembers the last literal code unit in a pattern, and
|
||||
abandons matching immediately if it is not present in the subject string. This
|
||||
optimization cannot be used for a subject string that might match only
|
||||
partially. PCRE2 also knows the minimum length of a matching string, and does
|
||||
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
|
||||
and abandons matching immediately if it is not present in the subject string.
|
||||
This optimization cannot be used for a subject string that might match only
|
||||
partially. PCRE2 also remembers a minimum length of a matching string, and does
|
||||
not bother to run the matching function on shorter strings. This optimization
|
||||
is also disabled for partial matching.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
|
||||
<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
|
||||
<P>
|
||||
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
|
||||
subject string is reached successfully, but matching cannot continue because
|
||||
more characters are needed, and in addition, either at least one character in
|
||||
the subject has been inspected or the pattern contains a lookbehind, or (when
|
||||
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
|
||||
inspected character need not form part of the final matched string; lookbehind
|
||||
assertions and the \K escape sequence provide ways of inspecting characters
|
||||
before the start of a matched string.
|
||||
A possible partial match occurs during matching when the end of the subject
|
||||
string is reached successfully, but either more characters are needed to
|
||||
complete the match, or the addition of more characters might change what is
|
||||
matched.
|
||||
</P>
|
||||
<P>
|
||||
The three additional requirements define the cases where adding more characters
|
||||
to the existing subject may complete the same match that would occur if they
|
||||
had all been present in the first place. Without these conditions there would
|
||||
be a partial match of an empty string at the end of the subject for all
|
||||
unanchored patterns (and also for anchored patterns if the subject itself is
|
||||
empty).
|
||||
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
|
||||
definitely needed to complete a match. In this case both hard and soft matching
|
||||
options yield a partial match.
|
||||
</P>
|
||||
<P>
|
||||
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
|
||||
can be found, but the addition of more characters might change what is
|
||||
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
|
||||
PCRE2_PARTIAL_SOFT returns the complete match.
|
||||
</P>
|
||||
<P>
|
||||
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
|
||||
pattern item is \z, \Z, \b, \B, or $ there is always a partial match.
|
||||
Otherwise, for both options, the next pattern item must be one that inspects a
|
||||
character, and at least one of the following must be true:
|
||||
</P>
|
||||
<P>
|
||||
(1) At least one character has already been inspected. An inspected character
|
||||
need not form part of the final matched string; lookbehind assertions and the
|
||||
\K escape sequence provide ways of inspecting characters before the start of a
|
||||
matched string.
|
||||
</P>
|
||||
<P>
|
||||
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||
exists in case there is a lookbehind that inspects characters before the start
|
||||
of the match.
|
||||
</P>
|
||||
<P>
|
||||
(3) There is a special case when the whole pattern can match an empty string.
|
||||
When the starting point is at the end of the subject, the empty string match is
|
||||
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
|
||||
conditions is true, it is returned. However, because adding more characters
|
||||
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
|
||||
which in this case means "there is going to be a match at this point, but until
|
||||
some more characters are added, we do not know if it will be an empty string or
|
||||
something longer".
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
|
||||
<P>
|
||||
When a partial matching option is set, the result of calling
|
||||
<b>pcre2_match()</b> can be one of the following:
|
||||
</P>
|
||||
<P>
|
||||
<b>A successful match</b>
|
||||
A complete match has been found, starting and ending within this subject.
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_ERROR_NOMATCH</b>
|
||||
No match can start anywhere in this subject.
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_ERROR_PARTIAL</b>
|
||||
Adding more characters may result in a complete match that uses one or more
|
||||
characters from the end of this subject.
|
||||
</P>
|
||||
<P>
|
||||
When a partial match is returned, the first two elements in the ovector point
|
||||
|
@ -110,26 +148,6 @@ these characters are needed for a subsequent re-match with additional
|
|||
characters.
|
||||
</P>
|
||||
<P>
|
||||
What happens when a partial match is identified depends on which of the two
|
||||
partial matching options is set.
|
||||
</P>
|
||||
<br><b>
|
||||
PCRE2_PARTIAL_SOFT WITH pcre2_match()
|
||||
</b><br>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_SOFT is set when <b>pcre2_match()</b> identifies a partial
|
||||
match, the partial match is remembered, but matching continues as normal, and
|
||||
other alternatives in the pattern are tried. If no complete match can be found,
|
||||
PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
|
||||
</P>
|
||||
<P>
|
||||
This option is "soft" because it prefers a complete match over a partial match.
|
||||
All the various matching items in a pattern behave as if the subject string is
|
||||
potentially complete. For example, \z, \Z, and $ match at the end of the
|
||||
subject, as normal, and for \b and \B the end of the subject is treated as a
|
||||
non-alphanumeric.
|
||||
</P>
|
||||
<P>
|
||||
If there is more than one partial match, the first one that was found provides
|
||||
the data that is returned. Consider this pattern:
|
||||
<pre>
|
||||
|
@ -138,26 +156,34 @@ the data that is returned. Consider this pattern:
|
|||
If this is matched against the subject string "abc123dog", both alternatives
|
||||
fail to match, but the end of the subject is reached during matching, so
|
||||
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
||||
"123dog" as the first partial match that was found. (In this example, there are
|
||||
two partial matches, because "dog" on its own partially matches the second
|
||||
alternative.)
|
||||
"123dog" as the first partial match. (In this example, there are two partial
|
||||
matches, because "dog" on its own partially matches the second alternative.)
|
||||
</P>
|
||||
<br><b>
|
||||
PCRE2_PARTIAL_HARD WITH pcre2_match()
|
||||
How a partial match is processed by pcre2_match()
|
||||
</b><br>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_HARD is set for <b>pcre2_match()</b>, PCRE2_ERROR_PARTIAL is
|
||||
returned as soon as a partial match is found, without continuing to search for
|
||||
possible complete matches. This option is "hard" because it prefers an earlier
|
||||
partial match over a later complete match. For this reason, the assumption is
|
||||
made that the end of the supplied subject string may not be the true end of the
|
||||
available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
|
||||
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
|
||||
characters have been inspected.
|
||||
What happens when a partial match is identified depends on which of the two
|
||||
partial matching options is set.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||
partial match is found, without continuing to search for possible complete
|
||||
matches. This option is "hard" because it prefers an earlier partial match over
|
||||
a later complete match. For this reason, the assumption is made that the end of
|
||||
the supplied subject string is not the true end of the available data, which is
|
||||
why \z, \Z, \b, \B, and $ always give a partial match.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||
continues as normal, and other alternatives in the pattern are tried. If no
|
||||
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
|
||||
over a partial match. All the various matching items in a pattern behave as if
|
||||
the subject string is potentially complete; \z, \Z, and $ match at the end of
|
||||
the subject, as normal, and for \b and \B the end of the subject is treated
|
||||
as a non-alphanumeric.
|
||||
</P>
|
||||
<br><b>
|
||||
Comparing hard and soft partial matching
|
||||
</b><br>
|
||||
<P>
|
||||
The difference between the two partial matching options can be illustrated by a
|
||||
pattern such as:
|
||||
|
@ -182,26 +208,132 @@ to follow this explanation by thinking of the two patterns like this:
|
|||
The second pattern will never match "dogsbody", because it will always find the
|
||||
shorter match first.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||
<br><b>
|
||||
Example of partial matching using pcre2test
|
||||
</b><br>
|
||||
<P>
|
||||
The DFA functions move along the subject string character by character, without
|
||||
The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
|
||||
<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
|
||||
respectively, when calling <b>pcre2_match()</b>. Here is a run of
|
||||
<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
|
||||
date:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25dec3\=ph
|
||||
Partial match: 25dec3
|
||||
data> 3ju\=ph
|
||||
Partial match: 3ju
|
||||
data> 3juj\=ph
|
||||
No match
|
||||
</pre>
|
||||
This example gives the same results for both hard and soft partial matching
|
||||
options. Here is an example where there is a difference:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25jun04\=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25jun04\=ph
|
||||
Partial match: 25jun04
|
||||
</pre>
|
||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||
there is only a partial match.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
|
||||
<P>
|
||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||
over time, features (including partial matching) that make multi-segment
|
||||
matching possible have been added. The string is searched segment by segment by
|
||||
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
|
||||
results that would happen if the entire string were available for searching.
|
||||
</P>
|
||||
<P>
|
||||
Special logic must be implemented to handle a matched substring that spans a
|
||||
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||
partial match at the end of a segment whenever there is the possibility of
|
||||
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||
also be set for all but the first segment.
|
||||
</P>
|
||||
<P>
|
||||
When a partial match occurs, the next segment must be added to the current
|
||||
subject and the match re-run, using the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the middle
|
||||
of very long sequences, so the patterns are normally not anchored. For example:
|
||||
<pre>
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> ...the date is 23ja\=ph
|
||||
Partial match: 23ja
|
||||
data> ...the date is 23jan19 and on that day...\=offset=15
|
||||
0: 23jan19
|
||||
1: jan
|
||||
</pre>
|
||||
Note the use of the <b>offset</b> modifier to start the new match where the
|
||||
partial match was found.
|
||||
</P>
|
||||
<P>
|
||||
In this simple example, the next segment was just added to the one in which the
|
||||
partial match was found. However, if there are memory constraints, it may be
|
||||
necessary to discard text that precedes the partial match before adding the
|
||||
next segment. In cases such as the above, where the pattern does not contain
|
||||
any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process.
|
||||
</P>
|
||||
<P>
|
||||
The only lookbehind information that is available is the length of the longest
|
||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||
pattern, but retaining that many characters before the partial match is
|
||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||
</P>
|
||||
<P>
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units. Characters before the point you have
|
||||
now reached can be discarded.
|
||||
</P>
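<P>
For UTF-8 subjects, the "count characters while moving back" step can be
sketched as below. The function <b>utf8_back()</b> is a hypothetical helper,
not a PCRE2 function, and it assumes that the buffer holds valid UTF-8. The
character count would come from PCRE2_INFO_MAXLOOKBEHIND and the starting
offset from ovector[0].
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: starting at byte offset pos in a valid UTF-8
   buffer, move back by nchars characters (or to offset 0, whichever
   comes first) and return the resulting byte offset. */
size_t utf8_back(const unsigned char *s, size_t pos, size_t nchars)
{
    assert(s != NULL);

    while (nchars > 0 && pos > 0) {
        pos--;
        /* Bytes of the form 10xxxxxx are continuation bytes; keep
           moving back until the first byte of the character is reached. */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

<P>
Everything before the returned offset can then be discarded. In a non-UTF or
32-bit situation no such helper is needed, because the same step is a plain
subtraction.
</P>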
|
||||
<P>
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the <b>allusedtext</b> modifier is set:
|
||||
<pre>
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
</pre>
|
||||
Note that the <b>allusedtext</b> modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted characters.
|
||||
</P>
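<P>
The offset arithmetic in the "xx123ab" example can be sketched like this for
the non-UTF case, where one character is one code unit. The function
<b>discard_before_partial()</b> is an illustrative name, not part of PCRE2. It
drops the text that is no longer needed and returns the <i>startoffset</i> value
to use for the next match.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper for non-UTF subjects. match_start is ovector[0]
   from the partial match and max_lookbehind is the value obtained with
   PCRE2_INFO_MAXLOOKBEHIND. The retained text is moved to the front of
   buf, *len is updated, and the new start offset is returned. */
size_t discard_before_partial(char *buf, size_t *len, size_t match_start,
                              size_t max_lookbehind)
{
    assert(buf != NULL && len != NULL && match_start <= *len);

    /* Earliest code unit that must be kept for lookbehinds. */
    size_t keep_from = (match_start > max_lookbehind)
                           ? match_start - max_lookbehind
                           : 0;

    memmove(buf, buf + keep_from, *len - keep_from);
    *len -= keep_from;

    /* The partial match now starts this far into the shortened buffer. */
    return match_start - keep_from;
}
```

<P>
With the buffer "xx123ab", a partial match starting at offset 5 and a maximum
lookbehind of 3, the two leading characters are dropped, "123ab" remains, and
the returned start offset is 3, as stated above.
</P>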
|
||||
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||
<P>
|
||||
The DFA function moves along the subject string character by character, without
|
||||
backtracking, searching for all possible matches simultaneously. If the end of
|
||||
the subject is reached before the end of the pattern, there is the possibility
|
||||
of a partial match, again provided that at least one character has been
|
||||
inspected.
|
||||
of a partial match.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
||||
have been no complete matches. Otherwise, the complete matches are returned.
|
||||
However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
|
||||
any complete matches. The portion of the string that was matched when the
|
||||
longest partial match was found is set as the first matching string.
|
||||
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
|
||||
complete matches. The portion of the string that was matched when the longest
|
||||
partial match was found is set as the first matching string.
|
||||
</P>
|
||||
<P>
|
||||
Because the DFA functions always search for all possible matches, and there is
|
||||
no difference between greedy and ungreedy repetition, their behaviour is
|
||||
different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
|
||||
the string "dog" matched against the ungreedy pattern shown above:
|
||||
Because the DFA function always searches for all possible matches, and there is
|
||||
no difference between greedy and ungreedy repetition, its behaviour is
|
||||
different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
|
||||
against this ungreedy pattern:
|
||||
<pre>
|
||||
/dog(sbody)??/
|
||||
</pre>
|
||||
|
@ -209,58 +341,16 @@ Whereas the standard function stops as soon as it finds the complete match for
|
|||
"dog", the DFA function also finds the partial match for "dogsbody", and so
|
||||
returns that when PCRE2_PARTIAL_HARD is set.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
|
||||
<P>
|
||||
If a pattern ends with one of the sequences \b or \B, which test for word
|
||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
|
||||
results. Consider this pattern:
|
||||
<pre>
|
||||
/\bcat\b/
|
||||
</pre>
|
||||
This matches "cat", provided there is a word boundary at either end. If the
|
||||
subject string is "the cat", the comparison of the final "t" with a following
|
||||
character cannot take place, so a partial match is found. However, normal
|
||||
matching carries on, and \b matches at the end of the subject when the last
|
||||
character is a letter, so a complete match is found. The result, therefore, is
|
||||
<i>not</i> PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
|
||||
PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a><br>
|
||||
<P>
|
||||
If the <b>partial_soft</b> (or <b>ps</b>) modifier is present on a
|
||||
<b>pcre2test</b> data line, the PCRE2_PARTIAL_SOFT option is used for the match.
|
||||
Here is a run of <b>pcre2test</b> that uses the date example quoted above:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25jun04\=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25dec3\=ps
|
||||
Partial match: 25dec3
|
||||
data> 3ju\=ps
|
||||
Partial match: 3ju
|
||||
data> 3juj\=ps
|
||||
No match
|
||||
data> j\=ps
|
||||
No match
|
||||
</pre>
|
||||
The first data string is matched completely, so <b>pcre2test</b> shows the
|
||||
matched substrings. The remaining four strings do not match the complete
|
||||
pattern, but the first two are partial matches. Similar output is obtained
|
||||
if DFA matching is used.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>partial_hard</b> (or <b>ph</b>) modifier is present on a
|
||||
<b>pcre2test</b> data line, the PCRE2_PARTIAL_HARD option is set for the match.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
|
||||
<P>
|
||||
When a partial match has been found using a DFA matching function, it is
|
||||
When a partial match has been found using the DFA matching function, it is
|
||||
possible to continue the match by providing additional subject data and calling
|
||||
the function again with the same compiled regular expression, this time setting
|
||||
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
||||
because this is where details of the previous partial match are stored. Here is
|
||||
an example using <b>pcre2test</b>:
|
||||
because this is where details of the previous partial match are stored. You can
|
||||
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
|
||||
to continue partial matching over multiple segments. Here is an example using
|
||||
<b>pcre2test</b>:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 23ja\=dfa,ps
|
||||
|
@ -272,143 +362,10 @@ The first call has "23ja" as the subject, and requests partial matching; the
|
|||
second call has "n05" as the subject for the continued (restarted) match.
|
||||
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
||||
not retain the previously partially-matched string. It is up to the calling
|
||||
program to do that if it needs to.
|
||||
</P>
|
||||
<P>
|
||||
That means that, for an unanchored pattern, if a continued match fails, it is
|
||||
not possible to try again at a new starting point. All this facility is capable
|
||||
of doing is continuing with the previous match attempt. In the previous
|
||||
example, if the second set of data is "ug23" the result is no match, even
|
||||
though there would be a match for "aug23" if the entire string were given at
|
||||
once. Depending on the application, this may or may not be what you want.
|
||||
The only way to allow for starting again at the next character is to retain the
|
||||
matched part of the subject and try a new complete match.
|
||||
</P>
|
||||
<P>
|
||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
|
||||
facility can be used to pass very long subject strings to the DFA matching
|
||||
functions.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
|
||||
<P>
|
||||
Unlike the DFA function, it is not possible to restart the previous match with
|
||||
a new segment of data when using <b>pcre2_match()</b>. Instead, new data must be
|
||||
added to the previous subject string, and the entire match re-run, starting
|
||||
from the point where the partial match occurred. Earlier data can be discarded.
|
||||
</P>
|
||||
<P>
|
||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
|
||||
treat the end of a segment as the end of the subject when matching \z, \Z,
|
||||
\b, \B, and $. Consider an unanchored pattern that matches dates:
|
||||
<pre>
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> The date is 23ja\=ph
|
||||
Partial match: 23ja
|
||||
</pre>
|
||||
At this stage, an application could discard the text preceding "23ja", add on
|
||||
text from the next segment, and call the matching function again. Unlike the
|
||||
DFA matching function, the entire matching string must always be available,
|
||||
and the complete matching process occurs for each call, so more memory and more
|
||||
processing time are needed.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
|
||||
<P>
|
||||
Certain types of pattern may give problems with multi-segment matching,
|
||||
whichever matching function is used.
|
||||
</P>
|
||||
<P>
|
||||
1. If the pattern contains a test for the beginning of a line, you need to pass
|
||||
the PCRE2_NOTBOL option when the subject string for any call does not start at the
|
||||
beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
|
||||
doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
|
||||
includes the effect of PCRE2_NOTEOL.
|
||||
</P>
|
||||
<P>
|
||||
2. If a pattern contains a lookbehind assertion, characters that precede the
|
||||
start of the partial match may have been inspected during the matching process.
|
||||
When using <b>pcre2_match()</b>, sufficient characters must be retained for the
|
||||
next match attempt. You can ensure that enough characters are retained by doing
|
||||
the following:
|
||||
</P>
|
||||
<P>
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units.
|
||||
</P>
|
||||
<P>
|
||||
Characters before the point you have now reached can be discarded, and after
|
||||
the next segment has been added to what is retained, you should run the next
|
||||
match with the <b>startoffset</b> argument set so that the match begins at the
|
||||
same point as before.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the "allusedtext" modifier is set:
|
||||
<pre>
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
</pre>
|
||||
However, the "allusedtext" modifier is not available for JIT matching, because
|
||||
JIT matching does not maintain the first and last consulted characters.
|
||||
</P>
|
||||
<P>
|
||||
3. Matching a subject string that is split into multiple segments may not
|
||||
always produce exactly the same result as matching over one single long string
|
||||
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
|
||||
Boundaries" above describes an issue that arises if the pattern ends with \b
|
||||
or \B. Another kind of difference may occur when there are multiple matching
|
||||
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
|
||||
only when there are no completed matches. This means that as soon as the
|
||||
shortest match has been found, continuation to a new subject segment is no
|
||||
longer possible. Consider this <b>pcre2test</b> example:
|
||||
<pre>
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\=ps
|
||||
0: dog
|
||||
data> do\=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\=ps,dfa,dfa_restart
|
||||
0: g
|
||||
data> dogsbody\=dfa
|
||||
0: dogsbody
|
||||
1: dog
|
||||
</pre>
|
||||
The first data line passes the string "dogsb" to a standard matching function,
|
||||
setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
|
||||
for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
|
||||
string "dog" is a complete match. Similarly, when the subject is presented to
|
||||
a DFA matching function in several parts ("do" and "gsb" being the first two)
|
||||
the match stops when "dog" has been found, and it is not possible to continue.
|
||||
On the other hand, if "dogsbody" is presented as a single string, a DFA
|
||||
matching function finds both matches.
|
||||
</P>
|
||||
<P>
|
||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
|
||||
multi-segment data. The example above then behaves differently:
|
||||
<pre>
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\=ph
|
||||
Partial match: dogsb
|
||||
data> do\=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\=ph,dfa,dfa_restart
|
||||
Partial match: gsb
|
||||
</pre>
|
||||
4. Patterns that contain alternatives at the top level which do not all start
|
||||
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
|
||||
used. For example, consider this pattern:
|
||||
program to do that if it needs to. This means that, for an unanchored pattern,
|
||||
if a continued match fails, it is not possible to try again at a new starting
|
||||
point. All this facility is capable of doing is continuing with the previous
|
||||
match attempt. For example, consider this pattern:
|
||||
<pre>
|
||||
1234|3789
|
||||
</pre>
|
||||
|
@ -417,30 +374,18 @@ alternative is found at offset 3. There is no partial match for the second
|
|||
alternative, because such a match does not start at the same point in the
|
||||
subject string. Attempting to continue with the string "7890" does not yield a
|
||||
match because only those alternatives that match at one point in the subject
|
||||
are remembered. The problem arises because the start of the second alternative
|
||||
matches within the first alternative. There is no problem with anchored
|
||||
patterns or patterns such as:
|
||||
<pre>
|
||||
1234|ABCD
|
||||
</pre>
|
||||
where no string can be a partial match for both alternatives. This is not a
|
||||
problem if a standard matching function is used, because the entire match has
|
||||
to be rerun each time:
|
||||
<pre>
|
||||
re> /1234|3789/
|
||||
data> ABC123\=ph
|
||||
Partial match: 123
|
||||
data> 1237890
|
||||
0: 3789
|
||||
</pre>
|
||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
|
||||
the entire match can also be used with the DFA matching function. Another
|
||||
possibility is to work with two buffers. If a partial match at offset <i>n</i>
|
||||
in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
|
||||
the second buffer, you can then try a new match starting at offset <i>n+1</i> in
|
||||
the first buffer.
|
||||
are remembered. Depending on the application, this may or may not be what you
|
||||
want.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain the matched part of the segment and try a new complete
|
||||
match, as described for <b>pcre2_match()</b> above. Another possibility is to
|
||||
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
|
||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -449,9 +394,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 July 2019
|
||||
Last updated: 07 August 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
|
|
566
doc/pcre2.txt
|
@ -5650,72 +5650,109 @@ NAME
|
|||
|
||||
PARTIAL MATCHING IN PCRE2
|
||||
|
||||
In normal use of PCRE2, if the subject string that is passed to a
|
||||
matching function matches as far as it goes, but is too short to match
|
||||
the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
|
||||
stances where it might be helpful to distinguish this case from other
|
||||
cases in which there is no match.
|
||||
In normal use of PCRE2, if there is a match up to the end of a subject
|
||||
string, but more characters are needed to match the entire pattern,
|
||||
PCRE2_ERROR_NOMATCH is returned, just like any other failing match.
|
||||
There are circumstances where it might be helpful to distinguish this
|
||||
"partial match" case.
|
||||
|
||||
Consider, for example, an application where a human is required to type
|
||||
in data for a field with specific formatting requirements. An example
|
||||
might be a date in the form ddmmmyy, defined by this pattern:
|
||||
One example is an application where the subject string is very long,
|
||||
and not all available at once. The requirement here is to be able to do
|
||||
the matching segment by segment, but special action is needed when a
|
||||
matched substring spans the boundary between two segments.
|
||||
|
||||
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
|
||||
Another example is checking a user input string as it is typed, to en-
|
||||
sure that it conforms to a required format. Invalid characters can be
|
||||
immediately diagnosed and rejected, giving instant feedback.
|
||||
|
||||
If the application sees the user's keystrokes one by one, and can check
|
||||
that what has been typed so far is potentially valid, it is able to
|
||||
raise an error as soon as a mistake is made, by beeping and not re-
|
||||
flecting the character that has been typed, for example. This immediate
|
||||
feedback is likely to be a better user interface than a check that is
|
||||
delayed until the entire string has been entered. Partial matching can
|
||||
also be useful when the subject string is very long and is not all
|
||||
available at once, as discussed below.
|
||||
Partial matching is a PCRE2-specific feature; it is not Perl-compati-
|
||||
ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
|
||||
ference between the two options is whether or not a partial match is
|
||||
preferred to an alternative complete match, though the details differ
|
||||
between the two types of matching function. If both options are set,
|
||||
PCRE2_PARTIAL_HARD takes precedence.
|
||||
|
||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching
|
||||
function. The difference between the two options is whether or not a
|
||||
partial match is preferred to an alternative complete match, though the
|
||||
details differ between the two types of matching function. If both op-
|
||||
tions are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||
If you want to use partial matching with just-in-time optimized code,
|
||||
as well as setting a partial match option for the matching function,
|
||||
you must also call pcre2_jit_compile() with one or both of these op-
|
||||
tions:
|
||||
|
||||
If you want to use partial matching with just-in-time optimized code,
|
||||
you must call pcre2_jit_compile() with one or both of these options:
|
||||
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
PCRE2_JIT_PARTIAL_HARD
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
|
||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
||||
tial matches on the same pattern. If the appropriate JIT mode has not
|
||||
been compiled, interpretive matching code is used.
|
||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
||||
tial matches on the same pattern. Separate code is compiled for each
|
||||
mode. If the appropriate JIT mode has not been compiled, interpretive
|
||||
matching code is used.
|
||||
|
||||
Setting a partial matching option disables two of PCRE2's standard op-
|
||||
timizations. PCRE2 remembers the last literal code unit in a pattern,
|
||||
and abandons matching immediately if it is not present in the subject
|
||||
string. This optimization cannot be used for a subject string that
|
||||
might match only partially. PCRE2 also knows the minimum length of a
|
||||
matching string, and does not bother to run the matching function on
|
||||
shorter strings. This optimization is also disabled for partial match-
|
||||
ing.
|
||||
timization hints. PCRE2 remembers the last literal code unit in a pat-
|
||||
tern, and abandons matching immediately if it is not present in the
|
||||
subject string. This optimization cannot be used for a subject string
|
||||
that might match only partially. PCRE2 also remembers a minimum length
|
||||
of a matching string, and does not bother to run the matching function
|
||||
on shorter strings. This optimization is also disabled for partial
|
||||
matching.
|
||||
|
||||
|
||||
REQUIREMENTS FOR A PARTIAL MATCH
|
||||
|
||||
A possible partial match occurs during matching when the end of the
|
||||
subject string is reached successfully, but either more characters are
|
||||
needed to complete the match, or the addition of more characters might
|
||||
change what is matched.
|
||||
|
||||
Example 1: if the pattern is /abc/ and the subject is "ab", more char-
|
||||
acters are definitely needed to complete a match. In this case both
|
||||
hard and soft matching options yield a partial match.
|
||||
|
||||
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
|
||||
match can be found, but the addition of more characters might change
|
||||
what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
|
||||
tial match; PCRE2_PARTIAL_SOFT returns the complete match.
|
||||
|
||||
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
|
||||
the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
|
||||
match. Otherwise, for both options, the next pattern item must be one
|
||||
that inspects a character, and at least one of the following must be
|
||||
true:
|
||||
|
||||
(1) At least one character has already been inspected. An inspected
|
||||
character need not form part of the final matched string; lookbehind
|
||||
assertions and the \K escape sequence provide ways of inspecting char-
|
||||
acters before the start of a matched string.
|
||||
|
||||
(2) The pattern contains one or more lookbehind assertions. This condi-
|
||||
tion exists in case there is a lookbehind that inspects characters be-
|
||||
fore the start of the match.
|
||||
|
||||
(3) There is a special case when the whole pattern can match an empty
|
||||
string. When the starting point is at the end of the subject, the
|
||||
empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set
|
||||
and neither of the above conditions is true, it is returned. However,
|
||||
because adding more characters might result in a non-empty match,
|
||||
PCRE2_PARTIAL_HARD returns a partial match, which in this case means
|
||||
"there is going to be a match at this point, but until some more char-
|
||||
acters are added, we do not know if it will be an empty string or some-
|
||||
thing longer".
|
||||
|
||||
|
||||
PARTIAL MATCHING USING pcre2_match()
|
||||
|
||||
A partial match occurs during a call to pcre2_match() when the end of
|
||||
the subject string is reached successfully, but matching cannot con-
|
||||
tinue because more characters are needed, and in addition, either at
|
||||
least one character in the subject has been inspected or the pattern
|
||||
contains a lookbehind, or (when PCRE2_PARTIAL_HARD is set) the pattern
|
||||
could match an empty string. An inspected character need not form part
|
||||
of the final matched string; lookbehind assertions and the \K escape
|
||||
sequence provide ways of inspecting characters before the start of a
|
||||
matched string.
|
||||
When a partial matching option is set, the result of calling
|
||||
pcre2_match() can be one of the following:
|
||||
|
||||
The three additional requirements define the cases where adding more
|
||||
characters to the existing subject may complete the same match that
|
||||
would occur if they had all been present in the first place. Without
|
||||
these conditions there would be a partial match of an empty string at
|
||||
the end of the subject for all unanchored patterns (and also for an-
|
||||
chored patterns if the subject itself is empty).
|
||||
A successful match
|
||||
A complete match has been found, starting and ending within this sub-
|
||||
ject.
|
||||
|
||||
PCRE2_ERROR_NOMATCH
|
||||
No match can start anywhere in this subject.
|
||||
|
||||
PCRE2_ERROR_PARTIAL
|
||||
Adding more characters may result in a complete match that uses one
|
||||
or more characters from the end of this subject.
|
||||
|
||||
When a partial match is returned, the first two elements in the ovector
|
||||
point to the portion of the subject that was matched, but the values in
|
||||
|
@ -5725,29 +5762,12 @@ PARTIAL MATCHING USING pcre2_match()
|
|||
/abc\K123/
|
||||
|
||||
If it is matched against "456abc123xyz" the result is a complete match,
|
||||
and the ovector defines the matched string as "123", because \K resets
|
||||
the "start of match" point. However, if a partial match is requested
|
||||
and the subject string is "456abc12", a partial match is found for the
|
||||
string "abc12", because all these characters are needed for a subse-
|
||||
and the ovector defines the matched string as "123", because \K resets
|
||||
the "start of match" point. However, if a partial match is requested
|
||||
and the subject string is "456abc12", a partial match is found for the
|
||||
string "abc12", because all these characters are needed for a subse-
|
||||
quent re-match with additional characters.
|
||||
|
||||
What happens when a partial match is identified depends on which of the
|
||||
two partial matching options is set.
|
||||
|
||||
PCRE2_PARTIAL_SOFT WITH pcre2_match()
|
||||
|
||||
If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
|
||||
match, the partial match is remembered, but matching continues as nor-
|
||||
mal, and other alternatives in the pattern are tried. If no complete
|
||||
match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||
PCRE2_ERROR_NOMATCH.
|
||||
|
||||
This option is "soft" because it prefers a complete match over a par-
|
||||
tial match. All the various matching items in a pattern behave as if
|
||||
the subject string is potentially complete. For example, \z, \Z, and $
|
||||
match at the end of the subject, as normal, and for \b and \B the end
|
||||
of the subject is treated as a non-alphanumeric.
|
||||
|
||||
If there is more than one partial match, the first one that was found
|
||||
provides the data that is returned. Consider this pattern:
|
||||
|
||||
|
@ -5756,23 +5776,31 @@ PARTIAL MATCHING USING pcre2_match()
|
|||
If this is matched against the subject string "abc123dog", both alter-
|
||||
natives fail to match, but the end of the subject is reached during
|
||||
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
|
||||
and 9, identifying "123dog" as the first partial match that was found.
|
||||
(In this example, there are two partial matches, because "dog" on its
|
||||
own partially matches the second alternative.)
|
||||
and 9, identifying "123dog" as the first partial match. (In this exam-
|
||||
ple, there are two partial matches, because "dog" on its own partially
|
||||
matches the second alternative.)
|
||||
|
||||
PCRE2_PARTIAL_HARD WITH pcre2_match()
|
||||
How a partial match is processed by pcre2_match()
|
||||
|
||||
If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
|
||||
returned as soon as a partial match is found, without continuing to
|
||||
search for possible complete matches. This option is "hard" because it
|
||||
prefers an earlier partial match over a later complete match. For this
|
||||
reason, the assumption is made that the end of the supplied subject
|
||||
string may not be the true end of the available data, and so, if \z,
|
||||
\Z, \b, \B, or $ are encountered at the end of the subject, the result
|
||||
is PCRE2_ERROR_PARTIAL, whether or not any characters have been in-
|
||||
spected.
|
||||
What happens when a partial match is identified depends on which of the
|
||||
two partial matching options is set.
|
||||
|
||||
Comparing hard and soft partial matching
|
||||
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon
|
||||
as a partial match is found, without continuing to search for possible
|
||||
complete matches. This option is "hard" because it prefers an earlier
|
||||
partial match over a later complete match. For this reason, the assump-
|
||||
tion is made that the end of the supplied subject string is not the
|
||||
true end of the available data, which is why \z, \Z, \b, \B, and $ al-
|
||||
ways give a partial match.
|
||||
|
||||
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
|
||||
matching continues as normal, and other alternatives in the pattern are
|
||||
tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
|
||||
turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
|
||||
prefers a complete match over a partial match. All the various matching
|
||||
items in a pattern behave as if the subject string is potentially com-
|
||||
plete; \z, \Z, and $ match at the end of the subject, as normal, and
|
||||
for \b and \B the end of the subject is treated as a non-alphanumeric.
|
||||
|
||||
The difference between the two partial matching options can be illus-
|
||||
trated by a pattern such as:
|
||||
|
@ -5799,27 +5827,129 @@ PARTIAL MATCHING USING pcre2_match()
|
|||
The second pattern will never match "dogsbody", because it will always
|
||||
find the shorter match first.
|
||||
|
||||
Example of partial matching using pcre2test
|
||||
|
||||
The pcre2test data modifiers partial_hard (or ph) and partial_soft (or
|
||||
ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
|
||||
calling pcre2_match(). Here is a run of pcre2test using a pattern that
|
||||
matches the whole subject in the form of a date:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25dec3\=ph
|
||||
Partial match: 25dec3
|
||||
data> 3ju\=ph
|
||||
Partial match: 3ju
|
||||
data> 3juj\=ph
|
||||
No match
|
||||
|
||||
This example gives the same results for both hard and soft partial
|
||||
matching options. Here is an example where there is a difference:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25jun04\=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25jun04\=ph
|
||||
Partial match: 25jun04
|
||||
|
||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
|
||||
so there is only a partial match.
|
||||
|
||||
|
||||
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
||||
|
||||
PCRE was not originally designed with multi-segment matching in mind.
|
||||
However, over time, features (including partial matching) that make
|
||||
multi-segment matching possible have been added. The string is searched
|
||||
segment by segment by calling pcre2_match() repeatedly, with the aim of
|
||||
achieving the same results that would occur if the entire string were
|
||||
available for searching.
|
||||
|
||||
Special logic must be implemented to handle a matched substring that
|
||||
spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
|
||||
returns a partial match at the end of a segment whenever there is the
|
||||
possibility of changing the match by adding more characters. The
|
||||
PCRE2_NOTBOL option should also be set for all but the first segment.
|
||||
|
||||
When a partial match occurs, the next segment must be added to the cur-
|
||||
rent subject and the match re-run, using the startoffset argument of
|
||||
pcre2_match() to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the
|
||||
middle of very long sequences, so the patterns are normally not an-
|
||||
chored. For example:
|
||||
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> ...the date is 23ja\=ph
|
||||
Partial match: 23ja
|
||||
data> ...the date is 23jan19 and on that day...\=offset=15
|
||||
0: 23jan19
|
||||
1: jan
|
||||
|
||||
Note the use of the offset modifier to start the new match where the
|
||||
partial match was found.
|
||||
|
||||
In this simple example, the next segment was just added to the one in
|
||||
which the partial match was found. However, if there are memory con-
|
||||
straints, it may be necessary to discard text that precedes the partial
|
||||
match before adding the next segment. In cases such as the above, where
|
||||
the pattern does not contain any lookbehinds, it is sufficient to re-
|
||||
tain only the partially matched substring. However, if a pattern con-
|
||||
tains a lookbehind assertion, characters that precede the start of the
|
||||
partial match may have been inspected during the matching process.
|
||||
|
||||
The only lookbehind information that is available is the length of the
|
||||
longest lookbehind in a pattern. This may not, of course, be at the
|
||||
start of the pattern, but retaining that many characters before the
|
||||
partial match is sufficient, if not always strictly necessary. The way
|
||||
to do this is as follows:
|
||||
|
||||
Before doing any matching, find the length of the longest lookbehind in
|
||||
the pattern by calling pcre2_pattern_info() with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
|
||||
characters, not code units. After a partial match, moving back from the
|
||||
ovector[0] offset in the subject by the number of characters given for
|
||||
the maximum lookbehind gets you to the earliest character that must be
|
||||
retained. In a non-UTF or a 32-bit situation, moving back is just a
|
||||
subtraction, but in UTF-8 or UTF-16 you have to count characters while
|
||||
moving back through the code units. Characters before the point you
|
||||
have now reached can be discarded.
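For a UTF-16 subject, the same backward walk has to treat a surrogate pair as one character. A minimal sketch follows; the helper name utf16_back_up is hypothetical (it is not a PCRE2 function), and it assumes well-formed UTF-16 and a starting offset on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): starting at code-unit offset
   `offset` in a well-formed UTF-16 subject, move back `nchars` characters and
   return the resulting code-unit offset. If fewer than `nchars` characters
   precede `offset`, the result is 0. */
static size_t utf16_back_up(const uint16_t *subject, size_t offset, size_t nchars)
{
    assert(subject != NULL);
    while (nchars > 0 && offset > 0) {
        offset--;                       /* step onto the previous code unit */
        /* A low surrogate (0xDC00-0xDFFF) preceded by a high surrogate
           (0xD800-0xDBFF) is the second half of one character. */
        if (offset > 0 &&
            (subject[offset] & 0xFC00) == 0xDC00 &&
            (subject[offset - 1] & 0xFC00) == 0xD800)
            offset--;
        nchars--;
    }
    return offset;
}
```

In a 32-bit or non-UTF library no such loop is needed, because each character occupies exactly one code unit.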
|
||||
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against
|
||||
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
|
||||
mum lookbehind count is 3, so all characters before offset 2 can be
|
||||
discarded. The value of startoffset for the next match should be 3.
|
||||
When pcre2test displays a partial match, it indicates the lookbehind
|
||||
characters with '<' characters if the allusedtext modifier is set:
|
||||
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
|
||||
Note that the allusedtext modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted
|
||||
characters.
|
||||
|
||||
|
||||
PARTIAL MATCHING USING pcre2_dfa_match()
|
||||
|
||||
The DFA functions move along the subject string character by character,
|
||||
The DFA function moves along the subject string character by character,
|
||||
without backtracking, searching for all possible matches simultane-
|
||||
ously. If the end of the subject is reached before the end of the pat-
|
||||
tern, there is the possibility of a partial match, again provided that
|
||||
at least one character has been inspected.
|
||||
tern, there is the possibility of a partial match.
|
||||
|
||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
|
||||
there have been no complete matches. Otherwise, the complete matches
|
||||
are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
|
||||
takes precedence over any complete matches. The portion of the string
|
||||
that was matched when the longest partial match was found is set as the
|
||||
there have been no complete matches. Otherwise, the complete matches
|
||||
are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
|
||||
precedence over any complete matches. The portion of the string that
|
||||
was matched when the longest partial match was found is set as the
|
||||
first matching string.
|
||||
|
||||
Because the DFA functions always search for all possible matches, and
|
||||
there is no difference between greedy and ungreedy repetition, their
|
||||
behaviour is different from the standard functions when PCRE2_PAR-
|
||||
TIAL_HARD is set. Consider the string "dog" matched against the un-
|
||||
greedy pattern shown above:
|
||||
Because the DFA function always searches for all possible matches, and
|
||||
there is no difference between greedy and ungreedy repetition, its
|
||||
behaviour is different from that of pcre2_match(). Consider the string "dog"
|
||||
matched against this ungreedy pattern:
|
||||
|
||||
/dog(sbody)??/
|
||||
|
||||
|
@ -5828,60 +5958,16 @@ PARTIAL MATCHING USING pcre2_dfa_match()
|
|||
"dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
|
||||
|
||||
|
||||
PARTIAL MATCHING AND WORD BOUNDARIES
|
||||
|
||||
If a pattern ends with one of the sequences \b or \B, which test for word
|
||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
|
||||
intuitive results. Consider this pattern:
|
||||
|
||||
/\bcat\b/
|
||||
|
||||
This matches "cat", provided there is a word boundary at either end. If
|
||||
the subject string is "the cat", the comparison of the final "t" with a
|
||||
following character cannot take place, so a partial match is found.
|
||||
However, normal matching carries on, and \b matches at the end of the
|
||||
subject when the last character is a letter, so a complete match is
|
||||
found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
|
||||
PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
|
||||
then the partial match takes precedence.
|
||||
|
||||
|
||||
EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
|
||||
|
||||
If the partial_soft (or ps) modifier is present on a pcre2test data
|
||||
line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
|
||||
run of pcre2test that uses the date example quoted above:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25jun04\=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25dec3\=ps
|
||||
Partial match: 25dec3
|
||||
data> 3ju\=ps
|
||||
Partial match: 3ju
|
||||
data> 3juj\=ps
|
||||
No match
|
||||
data> j\=ps
|
||||
No match
|
||||
|
||||
The first data string is matched completely, so pcre2test shows the
|
||||
matched substrings. The remaining four strings do not match the com-
|
||||
plete pattern, but the first two are partial matches. Similar output is
|
||||
obtained if DFA matching is used.
|
||||
|
||||
If the partial_hard (or ph) modifier is present on a pcre2test data
|
||||
line, the PCRE2_PARTIAL_HARD option is set for the match.
|
||||
|
||||
|
||||
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
||||
|
||||
When a partial match has been found using a DFA matching function, it
|
||||
is possible to continue the match by providing additional subject data
|
||||
and calling the function again with the same compiled regular expres-
|
||||
When a partial match has been found using the DFA matching function, it
|
||||
is possible to continue the match by providing additional subject data
|
||||
and calling the function again with the same compiled regular expres-
|
||||
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
|
||||
same working space as before, because this is where details of the pre-
|
||||
vious partial match are stored. Here is an example using pcre2test:
|
||||
vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or
|
||||
PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial
|
||||
matching over multiple segments. Here is an example using pcre2test:
|
||||
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 23ja\=dfa,ps
|
||||
|
@ -5889,146 +5975,15 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
|||
data> n05\=dfa,dfa_restart
|
||||
0: n05
|
||||
|
||||
The first call has "23ja" as the subject, and requests partial match-
|
||||
ing; the second call has "n05" as the subject for the continued
|
||||
(restarted) match. Notice that when the match is complete, only the
|
||||
last part is shown; PCRE2 does not retain the previously partially-
|
||||
matched string. It is up to the calling program to do that if it needs
|
||||
to.
|
||||
|
||||
That means that, for an unanchored pattern, if a continued match fails,
|
||||
it is not possible to try again at a new starting point. All this fa-
|
||||
cility is capable of doing is continuing with the previous match at-
|
||||
tempt. In the previous example, if the second set of data is "ug23" the
|
||||
result is no match, even though there would be a match for "aug23" if
|
||||
the entire string were given at once. Depending on the application,
|
||||
this may or may not be what you want. The only way to allow for start-
|
||||
ing again at the next character is to retain the matched part of the
|
||||
subject and try a new complete match.
|
||||
|
||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments.
|
||||
This facility can be used to pass very long subject strings to the DFA
|
||||
matching functions.
|
||||
|
||||
|
||||
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
||||
|
||||
Unlike the DFA function, it is not possible to restart the previous
|
||||
match with a new segment of data when using pcre2_match(). Instead, new
|
||||
data must be added to the previous subject string, and the entire match
|
||||
re-run, starting from the point where the partial match occurred. Ear-
|
||||
lier data can be discarded.
|
||||
|
||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
|
||||
not treat the end of a segment as the end of the subject when matching
|
||||
\z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
|
||||
dates:
|
||||
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> The date is 23ja\=ph
|
||||
Partial match: 23ja
|
||||
|
||||
At this stage, an application could discard the text preceding "23ja",
|
||||
add on text from the next segment, and call the matching function
|
||||
again. Unlike the DFA matching function, the entire matching string
|
||||
must always be available, and the complete matching process occurs for
|
||||
each call, so more memory and more processing time are needed.
|
||||
|
||||
|
||||
ISSUES WITH MULTI-SEGMENT MATCHING
|
||||
|
||||
Certain types of pattern may give problems with multi-segment matching,
|
||||
whichever matching function is used.
|
||||
|
||||
1. If the pattern contains a test for the beginning of a line, you need
|
||||
to pass the PCRE2_NOTBOL option when the subject string for any call
|
||||
does not start at the beginning of a line. There is also a PCRE2_NOTEOL
|
||||
option, but in practice when doing multi-segment matching you should be
|
||||
using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
|
||||
|
||||
2. If a pattern contains a lookbehind assertion, characters that pre-
|
||||
cede the start of the partial match may have been inspected during the
|
||||
matching process. When using pcre2_match(), sufficient characters must
|
||||
be retained for the next match attempt. You can ensure that enough
|
||||
characters are retained by doing the following:
|
||||
|
||||
Before doing any matching, find the length of the longest lookbehind in
|
||||
the pattern by calling pcre2_pattern_info() with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
|
||||
characters, not code units. After a partial match, moving back from the
|
||||
ovector[0] offset in the subject by the number of characters given for
|
||||
the maximum lookbehind gets you to the earliest character that must be
|
||||
retained. In a non-UTF or a 32-bit situation, moving back is just a
|
||||
subtraction, but in UTF-8 or UTF-16 you have to count characters while
|
||||
moving back through the code units.
|
||||
|
||||
Characters before the point you have now reached can be discarded, and
|
||||
after the next segment has been added to what is retained, you should
|
||||
run the next match with the startoffset argument set so that the match
|
||||
begins at the same point as before.
|
||||
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against
|
||||
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
|
||||
mum lookbehind count is 3, so all characters before offset 2 can be
|
||||
discarded. The value of startoffset for the next match should be 3.
|
||||
When pcre2test displays a partial match, it indicates the lookbehind
|
||||
characters with '<' characters if the "allusedtext" modifier is set:
|
||||
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
|
||||
However, the "allusedtext" modifier is not available
|
||||
for JIT matching, because JIT matching does not maintain the first
|
||||
and last consulted characters.
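The arithmetic in the "(?<=123)abc" example can be sketched as a small C helper. This is only an illustration, assuming one code unit per character (non-UTF mode, or 32-bit code units); the names retain_plan and plan_retention are hypothetical and not part of the PCRE2 API. In a real application match_start would be ovector[0] after a partial match, and max_lookbehind the value obtained through PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <stddef.h>

/* Hypothetical helper type; not part of the PCRE2 API. */
typedef struct {
    size_t keep_from;    /* offset of the earliest code unit to retain */
    size_t startoffset;  /* match start, relative to the retained data */
} retain_plan;

/* Assumes one code unit per character (non-UTF or 32-bit). */
retain_plan plan_retention(size_t match_start, size_t max_lookbehind)
{
    retain_plan plan;

    /* Do not move back past the start of the subject. */
    size_t back = (max_lookbehind < match_start) ? max_lookbehind
                                                 : match_start;

    plan.keep_from = match_start - back;  /* discard everything before this */
    plan.startoffset = back;              /* where the next match begins */
    return plan;
}
```

With the values from the example above (match start 5, maximum lookbehind 3), this gives keep_from 2 and startoffset 3.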
|
||||
|
||||
3. Matching a subject string that is split into multiple segments may
|
||||
not always produce exactly the same result as matching over one single
|
||||
long string when PCRE2_PARTIAL_SOFT is used. The section "Partial
|
||||
Matching and Word Boundaries" above describes an issue that arises if
|
||||
the pattern ends with \b or \B. Another kind of difference may occur
|
||||
when there are multiple matching possibilities, because (for PCRE2_PAR-
|
||||
TIAL_SOFT) a partial match result is given only when there are no com-
|
||||
pleted matches. This means that as soon as the shortest match has been
|
||||
found, continuation to a new subject segment is no longer possible.
|
||||
Consider this pcre2test example:
|
||||
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\=ps
|
||||
0: dog
|
||||
data> do\=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\=ps,dfa,dfa_restart
|
||||
0: g
|
||||
data> dogsbody\=dfa
|
||||
0: dogsbody
|
||||
1: dog
|
||||
|
||||
The first data line passes the string "dogsb" to a standard matching
|
||||
function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
|
||||
a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
|
||||
because the shorter string "dog" is a complete match. Similarly, when
|
||||
the subject is presented to a DFA matching function in several parts
|
||||
("do" and "gsb" being the first two) the match stops when "dog" has
|
||||
been found, and it is not possible to continue. On the other hand, if
|
||||
"dogsbody" is presented as a single string, a DFA matching function
|
||||
finds both matches.
|
||||
|
||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
|
||||
matching multi-segment data. The example above then behaves differ-
|
||||
ently:
|
||||
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\=ph
|
||||
Partial match: dogsb
|
||||
data> do\=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\=ph,dfa,dfa_restart
|
||||
Partial match: gsb
|
||||
|
||||
4. Patterns that contain alternatives at the top level which do not all
|
||||
start with the same pattern item may not work as expected when
|
||||
PCRE2_DFA_RESTART is used. For example, consider this pattern:
|
||||
The first call has "23ja" as the subject, and requests partial match-
|
||||
ing; the second call has "n05" as the subject for the continued
|
||||
(restarted) match. Notice that when the match is complete, only the
|
||||
last part is shown; PCRE2 does not retain the previously partially-
|
||||
matched string. It is up to the calling program to do that if it needs
|
||||
to. This means that, for an unanchored pattern, if a continued match
|
||||
fails, it is not possible to try again at a new starting point. All
|
||||
this facility is capable of doing is continuing with the previous match
|
||||
attempt. For example, consider this pattern:
|
||||
|
||||
1234|3789
|
||||
|
||||
|
@ -6037,29 +5992,16 @@ ISSUES WITH MULTI-SEGMENT MATCHING
|
|||
the second alternative, because such a match does not start at the same
|
||||
point in the subject string. Attempting to continue with the string
|
||||
"7890" does not yield a match because only those alternatives that
|
||||
match at one point in the subject are remembered. The problem arises
|
||||
because the start of the second alternative matches within the first
|
||||
alternative. There is no problem with anchored patterns or patterns
|
||||
such as:
|
||||
match at one point in the subject are remembered. Depending on the ap-
|
||||
plication, this may or may not be what you want.
|
||||
|
||||
1234|ABCD
|
||||
|
||||
where no string can be a partial match for both alternatives. This is
|
||||
not a problem if a standard matching function is used, because the en-
|
||||
tire match has to be rerun each time:
|
||||
|
||||
re> /1234|3789/
|
||||
data> ABC123\=ph
|
||||
Partial match: 123
|
||||
data> 1237890
|
||||
0: 3789
|
||||
|
||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of
|
||||
re-running the entire match can also be used with the DFA matching
|
||||
function. Another possibility is to work with two buffers. If a partial
|
||||
match at offset n in the first buffer is followed by "no match" when
|
||||
PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
|
||||
match starting at offset n+1 in the first buffer.
|
||||
If you do want to allow for starting again at the next character, one
|
||||
way of doing it is to retain the matched part of the segment and try a
|
||||
new complete match, as described for pcre2_match() above. Another pos-
|
||||
sibility is to work with two buffers. If a partial match at offset n in
|
||||
the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
|
||||
used on the second buffer, you can then try a new match starting at
|
||||
offset n+1 in the first buffer.
|
||||
|
||||
|
||||
AUTHOR
|
||||
|
@ -6071,7 +6013,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 22 July 2019
|
||||
Last updated: 07 August 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,73 +1,107 @@
|
|||
.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
|
||||
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions
|
||||
.SH "PARTIAL MATCHING IN PCRE2"
|
||||
.rs
|
||||
.sp
|
||||
In normal use of PCRE2, if the subject string that is passed to a matching
|
||||
function matches as far as it goes, but is too short to match the entire
|
||||
pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
|
||||
might be helpful to distinguish this case from other cases in which there is no
|
||||
match.
|
||||
In normal use of PCRE2, if there is a match up to the end of a subject string,
|
||||
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
|
||||
is returned, just like any other failing match. There are circumstances where
|
||||
it might be helpful to distinguish this "partial match" case.
|
||||
.P
|
||||
Consider, for example, an application where a human is required to type in data
|
||||
for a field with specific formatting requirements. An example might be a date
|
||||
in the form \fIddmmmyy\fP, defined by this pattern:
|
||||
.sp
|
||||
^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
|
||||
.sp
|
||||
If the application sees the user's keystrokes one by one, and can check that
|
||||
what has been typed so far is potentially valid, it is able to raise an error
|
||||
as soon as a mistake is made, by beeping and not reflecting the character that
|
||||
has been typed, for example. This immediate feedback is likely to be a better
|
||||
user interface than a check that is delayed until the entire string has been
|
||||
entered. Partial matching can also be useful when the subject string is very
|
||||
long and is not all available at once, as discussed below.
|
||||
One example is an application where the subject string is very long, and not
|
||||
all available at once. The requirement here is to be able to do the matching
|
||||
segment by segment, but special action is needed when a matched substring spans
|
||||
the boundary between two segments.
|
||||
.P
|
||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
|
||||
The difference between the two options is whether or not a partial match is
|
||||
preferred to an alternative complete match, though the details differ between
|
||||
the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
|
||||
takes precedence.
|
||||
Another example is checking a user input string as it is typed, to ensure that
|
||||
it conforms to a required format. Invalid characters can be immediately
|
||||
diagnosed and rejected, giving instant feedback.
|
||||
.P
|
||||
If you want to use partial matching with just-in-time optimized code, you must
|
||||
call \fBpcre2_jit_compile()\fP with one or both of these options:
|
||||
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
|
||||
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
|
||||
options when calling a matching function. The difference between the two
|
||||
options is whether or not a partial match is preferred to an alternative
|
||||
complete match, though the details differ between the two types of matching
|
||||
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||
.P
|
||||
If you want to use partial matching with just-in-time optimized code, as well
|
||||
as setting a partial match option for the matching function, you must also call
|
||||
\fBpcre2_jit_compile()\fP with one or both of these options:
|
||||
.sp
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
PCRE2_JIT_PARTIAL_HARD
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
.sp
|
||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
||||
matches on the same pattern. If the appropriate JIT mode has not been compiled,
|
||||
interpretive matching code is used.
|
||||
matches on the same pattern. Separate code is compiled for each mode. If the
|
||||
appropriate JIT mode has not been compiled, interpretive matching code is used.
|
||||
.P
|
||||
Setting a partial matching option disables two of PCRE2's standard
|
||||
optimizations. PCRE2 remembers the last literal code unit in a pattern, and
|
||||
abandons matching immediately if it is not present in the subject string. This
|
||||
optimization cannot be used for a subject string that might match only
|
||||
partially. PCRE2 also knows the minimum length of a matching string, and does
|
||||
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
|
||||
and abandons matching immediately if it is not present in the subject string.
|
||||
This optimization cannot be used for a subject string that might match only
|
||||
partially. PCRE2 also remembers a minimum length of a matching string, and does
|
||||
not bother to run the matching function on shorter strings. This optimization
|
||||
is also disabled for partial matching.
|
||||
.
|
||||
.
|
||||
.SH "REQUIREMENTS FOR A PARTIAL MATCH"
|
||||
.rs
|
||||
.sp
|
||||
A possible partial match occurs during matching when the end of the subject
|
||||
string is reached successfully, but either more characters are needed to
|
||||
complete the match, or the addition of more characters might change what is
|
||||
matched.
|
||||
.P
|
||||
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
|
||||
definitely needed to complete a match. In this case both hard and soft matching
|
||||
options yield a partial match.
|
||||
.P
|
||||
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
|
||||
can be found, but the addition of more characters might change what is
|
||||
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
|
||||
PCRE2_PARTIAL_SOFT returns the complete match.
|
||||
.P
|
||||
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
|
||||
pattern item is \ez, \eZ, \eb, \eB, or $ there is always a partial match.
|
||||
Otherwise, for both options, the next pattern item must be one that inspects a
|
||||
character, and at least one of the following must be true:
|
||||
.P
|
||||
(1) At least one character has already been inspected. An inspected character
|
||||
need not form part of the final matched string; lookbehind assertions and the
|
||||
\eK escape sequence provide ways of inspecting characters before the start of a
|
||||
matched string.
|
||||
.P
|
||||
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||
exists in case there is a lookbehind that inspects characters before the start
|
||||
of the match.
|
||||
.P
|
||||
(3) There is a special case when the whole pattern can match an empty string.
|
||||
When the starting point is at the end of the subject, the empty string match is
|
||||
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
|
||||
conditions is true, it is returned. However, because adding more characters
|
||||
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
|
||||
which in this case means "there is going to be a match at this point, but until
|
||||
some more characters are added, we do not know if it will be an empty string or
|
||||
something longer".
|
||||
.
|
||||
.
|
||||
.
|
||||
.SH "PARTIAL MATCHING USING pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
|
||||
subject string is reached successfully, but matching cannot continue because
|
||||
more characters are needed, and in addition, either at least one character in
|
||||
the subject has been inspected or the pattern contains a lookbehind, or (when
|
||||
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
|
||||
inspected character need not form part of the final matched string; lookbehind
|
||||
assertions and the \eK escape sequence provide ways of inspecting characters
|
||||
before the start of a matched string.
|
||||
.P
|
||||
The three additional requirements define the cases where adding more characters
|
||||
to the existing subject may complete the same match that would occur if they
|
||||
had all been present in the first place. Without these conditions there would
|
||||
be a partial match of an empty string at the end of the subject for all
|
||||
unanchored patterns (and also for anchored patterns if the subject itself is
|
||||
empty).
|
||||
When a partial matching option is set, the result of calling
|
||||
\fBpcre2_match()\fP can be one of the following:
|
||||
.TP 2
|
||||
\fBA successful match\fP
|
||||
A complete match has been found, starting and ending within this subject.
|
||||
.TP
|
||||
\fBPCRE2_ERROR_NOMATCH\fP
|
||||
No match can start anywhere in this subject.
|
||||
.TP
|
||||
\fBPCRE2_ERROR_PARTIAL\fP
|
||||
Adding more characters may result in a complete match that uses one or more
|
||||
characters from the end of this subject.
|
||||
.P
|
||||
When a partial match is returned, the first two elements in the ovector point
|
||||
to the portion of the subject that was matched, but the values in the rest of
|
||||
|
@ -83,24 +117,6 @@ is "456abc12", a partial match is found for the string "abc12", because all
|
|||
these characters are needed for a subsequent re-match with additional
|
||||
characters.
|
||||
.P
|
||||
What happens when a partial match is identified depends on which of the two
|
||||
partial matching options is set.
|
||||
.
|
||||
.
|
||||
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial
|
||||
match, the partial match is remembered, but matching continues as normal, and
|
||||
other alternatives in the pattern are tried. If no complete match can be found,
|
||||
PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
|
||||
.P
|
||||
This option is "soft" because it prefers a complete match over a partial match.
|
||||
All the various matching items in a pattern behave as if the subject string is
|
||||
potentially complete. For example, \ez, \eZ, and $ match at the end of the
|
||||
subject, as normal, and for \eb and \eB the end of the subject is treated as a
|
||||
non-alphanumeric.
|
||||
.P
|
||||
If there is more than one partial match, the first one that was found provides
|
||||
the data that is returned. Consider this pattern:
|
||||
.sp
|
||||
|
@ -109,27 +125,32 @@ the data that is returned. Consider this pattern:
|
|||
If this is matched against the subject string "abc123dog", both alternatives
|
||||
fail to match, but the end of the subject is reached during matching, so
|
||||
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
||||
"123dog" as the first partial match that was found. (In this example, there are
|
||||
two partial matches, because "dog" on its own partially matches the second
|
||||
alternative.)
|
||||
"123dog" as the first partial match. (In this example, there are two partial
|
||||
matches, because "dog" on its own partially matches the second alternative.)
|
||||
.
|
||||
.
|
||||
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is
|
||||
returned as soon as a partial match is found, without continuing to search for
|
||||
possible complete matches. This option is "hard" because it prefers an earlier
|
||||
partial match over a later complete match. For this reason, the assumption is
|
||||
made that the end of the supplied subject string may not be the true end of the
|
||||
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
|
||||
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
|
||||
characters have been inspected.
|
||||
.
|
||||
.
|
||||
.SS "Comparing hard and soft partial matching"
|
||||
.SS "How a partial match is processed by pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
What happens when a partial match is identified depends on which of the two
|
||||
partial matching options is set.
|
||||
.P
|
||||
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||
partial match is found, without continuing to search for possible complete
|
||||
matches. This option is "hard" because it prefers an earlier partial match over
|
||||
a later complete match. For this reason, the assumption is made that the end of
|
||||
the supplied subject string is not the true end of the available data, which is
|
||||
why \ez, \eZ, \eb, \eB, and $ always give a partial match.
|
||||
.P
|
||||
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||
continues as normal, and other alternatives in the pattern are tried. If no
|
||||
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
|
||||
over a partial match. All the various matching items in a pattern behave as if
|
||||
the subject string is potentially complete; \ez, \eZ, and $ match at the end of
|
||||
the subject, as normal, and for \eb and \eB the end of the subject is treated
|
||||
as a non-alphanumeric.
|
||||
.P
|
||||
The difference between the two partial matching options can be illustrated by a
|
||||
pattern such as:
|
||||
.sp
|
||||
|
@ -154,25 +175,129 @@ The second pattern will never match "dogsbody", because it will always find the
|
|||
shorter match first.
|
||||
.
|
||||
.
|
||||
.SS "Example of partial matching using pcre2test"
|
||||
.rs
|
||||
.sp
|
||||
The \fBpcre2test\fP data modifiers \fBpartial_hard\fP (or \fBph\fP) and
|
||||
\fBpartial_soft\fP (or \fBps\fP) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
|
||||
respectively, when calling \fBpcre2_match()\fP. Here is a run of
|
||||
\fBpcre2test\fP using a pattern that matches the whole subject in the form of a
|
||||
date:
|
||||
.sp
|
||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||
data> 25dec3\e=ph
|
||||
Partial match: 25dec3
|
||||
data> 3ju\e=ph
|
||||
Partial match: 3ju
|
||||
data> 3juj\e=ph
|
||||
No match
|
||||
.sp
|
||||
This example gives the same results for both hard and soft partial matching
|
||||
options. Here is an example where there is a difference:
|
||||
.sp
|
||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||
data> 25jun04\e=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25jun04\e=ph
|
||||
Partial match: 25jun04
|
||||
.sp
|
||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||
there is only a partial match.
|
||||
.
|
||||
.
|
||||
.
|
||||
.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||
over time, features (including partial matching) that make multi-segment
|
||||
matching possible have been added. The string is searched segment by segment by
|
||||
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
|
||||
results that would happen if the entire string were available for searching.
|
||||
.P
|
||||
Special logic must be implemented to handle a matched substring that spans a
|
||||
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||
partial match at the end of a segment whenever there is the possibility of
|
||||
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||
also be set for all but the first segment.
|
||||
.P
|
||||
When a partial match occurs, the next segment must be added to the current
|
||||
subject and the match re-run, using the \fIstartoffset\fP argument of
|
||||
\fBpcre2_match()\fP to begin at the point where the partial match started.
|
||||
Multi-segment matching is usually used to search for substrings in the middle
|
||||
of very long sequences, so the patterns are normally not anchored. For example:
|
||||
.sp
|
||||
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
||||
data> ...the date is 23ja\e=ph
|
||||
Partial match: 23ja
|
||||
data> ...the date is 23jan19 and on that day...\e=offset=15
|
||||
0: 23jan19
|
||||
1: jan
|
||||
.sp
|
||||
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||
partial match was found.
|
||||
.P
|
||||
In this simple example, the next segment was just added to the one in which the
|
||||
partial match was found. However, if there are memory constraints, it may be
|
||||
necessary to discard text that precedes the partial match before adding the
|
||||
next segment. In cases such as the above, where the pattern does not contain
|
||||
any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process.
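Discarding earlier text and adding the next segment might be sketched in C as follows. This is a minimal illustration under stated assumptions: the subject is held in a caller-owned buffer of 8-bit code units, keep_from is the offset of the earliest code unit that must be retained, and the function name slide_and_append is hypothetical (it is not part of PCRE2).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: drop buf[0 .. keep_from) and append the next
   segment. Returns the new length of the data in buf, or (size_t)-1
   if the result would not fit, in which case buf is left unchanged. */
size_t slide_and_append(char *buf, size_t len, size_t capacity,
                        size_t keep_from,
                        const char *segment, size_t segment_len)
{
    assert(keep_from <= len && len <= capacity);

    size_t kept = len - keep_from;
    if (segment_len > capacity - kept)
        return (size_t)-1;               /* caller must enlarge the buffer */

    memmove(buf, buf + keep_from, kept); /* source and target may overlap */
    memcpy(buf + kept, segment, segment_len);
    return kept + segment_len;
}
```

After such a call, the offset at which the partial match started has moved down by keep_from, so that adjusted value is what would be passed as the startoffset argument for the re-run.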
|
||||
.P
|
||||
The only lookbehind information that is available is the length of the longest
|
||||
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||
pattern, but retaining that many characters before the partial match is
|
||||
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||
.P
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units. Characters before the point you have
|
||||
now reached can be discarded.
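For UTF-8, counting characters while moving back through the code units could look like the following C sketch. It assumes the subject is valid UTF-8 held as 8-bit code units; the function name utf8_back is hypothetical and not a PCRE2 function. The offset argument would be ovector[0] after a partial match, and nchars the maximum lookbehind count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: move back by up to nchars characters from
   offset in a valid UTF-8 subject, stopping at the start of the
   subject. Returns the resulting code unit offset. */
size_t utf8_back(const uint8_t *subject, size_t offset, size_t nchars)
{
    while (nchars > 0 && offset > 0) {
        offset--;
        /* Continuation bytes have the form 10xxxxxx; skip them to
           reach the first byte of the character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

For UTF-16 the same idea applies, except that the units to skip are low surrogates rather than continuation bytes.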
|
||||
.P
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the \fBallusedtext\fP modifier is set:
|
||||
.sp
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\e=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
.sp
|
||||
Note that the \fBallusedtext\fP modifier is not available for JIT matching,
|
||||
because JIT matching does not maintain the first and last consulted characters.
|
||||
.
|
||||
.
|
||||
.
|
||||
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
||||
.rs
|
||||
.sp
|
||||
The DFA functions move along the subject string character by character, without
|
||||
The DFA function moves along the subject string character by character, without
|
||||
backtracking, searching for all possible matches simultaneously. If the end of
|
||||
the subject is reached before the end of the pattern, there is the possibility
|
||||
of a partial match, again provided that at least one character has been
|
||||
inspected.
|
||||
of a partial match.
|
||||
.P
|
||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
||||
have been no complete matches. Otherwise, the complete matches are returned.
|
||||
However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
|
||||
any complete matches. The portion of the string that was matched when the
|
||||
longest partial match was found is set as the first matching string.
|
||||
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
|
||||
complete matches. The portion of the string that was matched when the longest
|
||||
partial match was found is set as the first matching string.
|
||||
.P
|
||||
Because the DFA functions always search for all possible matches, and there is
|
||||
no difference between greedy and ungreedy repetition, their behaviour is
|
||||
different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
|
||||
the string "dog" matched against the ungreedy pattern shown above:
|
||||
Because the DFA function always searches for all possible matches, and there is
|
||||
no difference between greedy and ungreedy repetition, its behaviour is
|
||||
different from that of \fBpcre2_match()\fP. Consider the string "dog" matched
|
||||
against this ungreedy pattern:
|
||||
.sp
|
||||
/dog(sbody)??/
|
||||
.sp
|
||||
|
@ -181,62 +306,17 @@ Whereas the standard function stops as soon as it finds the complete match for
|
|||
returns that when PCRE2_PARTIAL_HARD is set.
|
||||
.
|
||||
.
|
||||
.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
|
||||
.rs
|
||||
.sp
|
||||
If a pattern ends with one of the sequences \eb or \eB, which test for word
|
||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
|
||||
results. Consider this pattern:
|
||||
.sp
|
||||
/\ebcat\eb/
|
||||
.sp
|
||||
This matches "cat", provided there is a word boundary at either end. If the
|
||||
subject string is "the cat", the comparison of the final "t" with a following
|
||||
character cannot take place, so a partial match is found. However, normal
|
||||
matching carries on, and \eb matches at the end of the subject when the last
|
||||
character is a letter, so a complete match is found. The result, therefore, is
|
||||
\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
|
||||
PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
|
||||
.
|
||||
.
|
||||
.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST"
|
||||
.rs
|
||||
.sp
|
||||
If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a
|
||||
\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match.
|
||||
Here is a run of \fBpcre2test\fP that uses the date example quoted above:
|
||||
.sp
|
||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||
data> 25jun04\e=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25dec3\e=ps
|
||||
Partial match: 25dec3
|
||||
data> 3ju\e=ps
|
||||
Partial match: 3ju
|
||||
data> 3juj\e=ps
|
||||
No match
|
||||
data> j\e=ps
|
||||
No match
|
||||
.sp
|
||||
The first data string is matched completely, so \fBpcre2test\fP shows the
|
||||
matched substrings. The remaining four strings do not match the complete
|
||||
pattern, but the first two are partial matches. Similar output is obtained
|
||||
if DFA matching is used.
|
||||
.P
|
||||
If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a
|
||||
\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match.
|
||||
.
|
||||
.
|
||||
.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
|
||||
.rs
|
||||
.sp
|
||||
When a partial match has been found using a DFA matching function, it is
|
||||
When a partial match has been found using the DFA matching function, it is
|
||||
possible to continue the match by providing additional subject data and calling
|
||||
the function again with the same compiled regular expression, this time setting
|
||||
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
||||
because this is where details of the previous partial match are stored. Here is
|
||||
an example using \fBpcre2test\fP:
|
||||
because this is where details of the previous partial match are stored. You can
|
||||
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
|
||||
to continue partial matching over multiple segments. Here is an example using
|
||||
\fBpcre2test\fP:
|
||||
.sp
|
||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||
data> 23ja\e=dfa,ps
|
||||
|
@ -248,136 +328,10 @@ The first call has "23ja" as the subject, and requests partial matching; the
|
|||
second call has "n05" as the subject for the continued (restarted) match.
|
||||
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
||||
not retain the previously partially-matched string. It is up to the calling
|
||||
program to do that if it needs to.
|
||||
.P
|
||||
That means that, for an unanchored pattern, if a continued match fails, it is
|
||||
not possible to try again at a new starting point. All this facility is capable
|
||||
of doing is continuing with the previous match attempt. In the previous
|
||||
example, if the second set of data is "ug23" the result is no match, even
|
||||
though there would be a match for "aug23" if the entire string were given at
|
||||
once. Depending on the application, this may or may not be what you want.
|
||||
The only way to allow for starting again at the next character is to retain the
|
||||
matched part of the subject and try a new complete match.
|
||||
.P
|
||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
|
||||
facility can be used to pass very long subject strings to the DFA matching
|
||||
functions.
|
||||
.
|
||||
.
|
||||
.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
|
||||
.rs
|
||||
.sp
|
||||
Unlike the DFA function, it is not possible to restart the previous match with
|
||||
a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be
|
||||
added to the previous subject string, and the entire match re-run, starting
|
||||
from the point where the partial match occurred. Earlier data can be discarded.
|
||||
.P
|
||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
|
||||
treat the end of a segment as the end of the subject when matching \ez, \eZ,
|
||||
\eb, \eB, and $. Consider an unanchored pattern that matches dates:
|
||||
.sp
|
||||
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
||||
data> The date is 23ja\e=ph
|
||||
Partial match: 23ja
|
||||
.sp
|
||||
At this stage, an application could discard the text preceding "23ja", add on
|
||||
text from the next segment, and call the matching function again. Unlike with the
|
||||
DFA matching function, the entire matching string must always be available,
|
||||
and the complete matching process occurs for each call, so more memory and more
|
||||
processing time are needed.
|
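The buffer handling described above (discard the text before the partial
match, then add the next segment) can be sketched in plain C. This is only an
illustration under stated assumptions: \fBcarry_over()\fP is a hypothetical
helper, not part of PCRE2, and \fIkeep_from\fP is assumed to be the offset of
the first code unit to retain, which an application would derive from
ovector[0] after a PCRE2_ERROR_PARTIAL return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Drops everything in buf that
   precedes keep_from, then appends the next segment. On success, updates
   *len and returns 0; returns -1 (leaving buf unchanged) if the retained
   text plus the new segment would not fit in cap code units. */
static int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
                      const char *segment, size_t segment_len)
{
    assert(keep_from <= *len && *len <= cap);

    size_t kept = *len - keep_from;
    if (segment_len > cap - kept)
        return -1;

    /* memmove, because source and destination overlap within buf. */
    memmove(buf, buf + keep_from, kept);
    memcpy(buf + kept, segment, segment_len);
    *len = kept + segment_len;
    return 0;
}
```

With the buffer holding "The date is 23ja" and a partial match starting at
offset 12, calling the helper with the segment "n05 and" leaves "23jan05 and"
in the buffer, ready for the next call to the matching function.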
||||
.
|
||||
.
|
||||
.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
|
||||
.rs
|
||||
.sp
|
||||
Certain types of pattern may give problems with multi-segment matching,
|
||||
whichever matching function is used.
|
||||
.P
|
||||
1. If the pattern contains a test for the beginning of a line, you need to pass
|
||||
the PCRE2_NOTBOL option when the subject string for any call does not start at the
|
||||
beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
|
||||
doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
|
||||
includes the effect of PCRE2_NOTEOL.
|
||||
.P
|
||||
2. If a pattern contains a lookbehind assertion, characters that precede the
|
||||
start of the partial match may have been inspected during the matching process.
|
||||
When using \fBpcre2_match()\fP, sufficient characters must be retained for the
|
||||
next match attempt. You can ensure that enough characters are retained by doing
|
||||
the following:
|
||||
.P
|
||||
Before doing any matching, find the length of the longest lookbehind in the
|
||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
||||
option. Note that the resulting count is in characters, not code units. After a
|
||||
partial match, moving back from the ovector[0] offset in the subject by the
|
||||
number of characters given for the maximum lookbehind gets you to the earliest
|
||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||
while moving back through the code units.
|
||||
.P
|
||||
Characters before the point you have now reached can be discarded, and after
|
||||
the next segment has been added to what is retained, you should run the next
|
||||
match with the \fBstartoffset\fP argument set so that the match begins at the
|
||||
same point as before.
|
||||
.P
|
||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
||||
displays a partial match, it indicates the lookbehind characters with '<'
|
||||
characters if the "allusedtext" modifier is set:
|
||||
.sp
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\e=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
However, the "allusedtext" modifier is not available for JIT matching, because
|
||||
JIT matching does not maintain the first and last consulted characters.
|
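The retention arithmetic for lookbehinds can be sketched as follows for a
UTF-8 subject. This is a minimal illustration, not PCRE2 code: both function
names are hypothetical, the subject is assumed to be valid UTF-8, and
\fImax_lookbehind\fP is assumed to be the value obtained beforehand from
PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Steps back up to max_lookbehind characters (not code
   units) from match_start, which is ovector[0] after a partial match, and
   returns the code unit offset of the earliest character to retain. */
static size_t retain_from_utf8(const uint8_t *subject, size_t match_start,
                               uint32_t max_lookbehind)
{
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}

/* Hypothetical helper. Gives the startoffset for the next match attempt once
   everything before retain_from has been discarded from the buffer. */
static size_t next_startoffset(size_t match_start, size_t retain_from)
{
    assert(retain_from <= match_start);
    return match_start - retain_from;
}
```

For the "(?<=123)abc" example, a match start of 5 and a maximum lookbehind of
3 give a retention offset of 2 and a new start offset of 3, which agrees with
the figures quoted above. In a non-UTF or 32-bit situation the loop reduces to
a plain subtraction.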
||||
.P
|
||||
3. Matching a subject string that is split into multiple segments may not
|
||||
always produce exactly the same result as matching over one single long string
|
||||
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
|
||||
Boundaries" above describes an issue that arises if the pattern ends with \eb
|
||||
or \eB. Another kind of difference may occur when there are multiple matching
|
||||
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
|
||||
only when there are no completed matches. This means that as soon as the
|
||||
shortest match has been found, continuation to a new subject segment is no
|
||||
longer possible. Consider this \fBpcre2test\fP example:
|
||||
.sp
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\e=ps
|
||||
0: dog
|
||||
data> do\e=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\e=ps,dfa,dfa_restart
|
||||
0: g
|
||||
data> dogsbody\e=dfa
|
||||
0: dogsbody
|
||||
1: dog
|
||||
.sp
|
||||
The first data line passes the string "dogsb" to a standard matching function,
|
||||
setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
|
||||
for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
|
||||
string "dog" is a complete match. Similarly, when the subject is presented to
|
||||
a DFA matching function in several parts ("do" and "gsb" being the first two)
|
||||
the match stops when "dog" has been found, and it is not possible to continue.
|
||||
On the other hand, if "dogsbody" is presented as a single string, a DFA
|
||||
matching function finds both matches.
|
||||
.P
|
||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
|
||||
multi-segment data. The example above then behaves differently:
|
||||
.sp
|
||||
re> /dog(sbody)?/
|
||||
data> dogsb\e=ph
|
||||
Partial match: dogsb
|
||||
data> do\e=ps,dfa
|
||||
Partial match: do
|
||||
data> gsb\e=ph,dfa,dfa_restart
|
||||
Partial match: gsb
|
||||
.sp
|
||||
4. Patterns that contain alternatives at the top level which do not all start
|
||||
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
|
||||
used. For example, consider this pattern:
|
||||
program to do that if it needs to. This means that, for an unanchored pattern,
|
||||
if a continued match fails, it is not possible to try again at a new starting
|
||||
point. All this facility is capable of doing is continuing with the previous
|
||||
match attempt. For example, consider this pattern:
|
||||
.sp
|
||||
1234|3789
|
||||
.sp
|
||||
|
@ -386,28 +340,15 @@ alternative is found at offset 3. There is no partial match for the second
|
|||
alternative, because such a match does not start at the same point in the
|
||||
subject string. Attempting to continue with the string "7890" does not yield a
|
||||
match because only those alternatives that match at one point in the subject
|
||||
are remembered. The problem arises because the start of the second alternative
|
||||
matches within the first alternative. There is no problem with anchored
|
||||
patterns or patterns such as:
|
||||
.sp
|
||||
1234|ABCD
|
||||
.sp
|
||||
where no string can be a partial match for both alternatives. This is not a
|
||||
problem if a standard matching function is used, because the entire match has
|
||||
to be rerun each time:
|
||||
.sp
|
||||
re> /1234|3789/
|
||||
data> ABC123\e=ph
|
||||
Partial match: 123
|
||||
data> 1237890
|
||||
0: 3789
|
||||
.sp
|
||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
|
||||
the entire match can also be used with the DFA matching function. Another
|
||||
possibility is to work with two buffers. If a partial match at offset \fIn\fP
|
||||
in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
|
||||
the second buffer, you can then try a new match starting at offset \fIn+1\fP in
|
||||
the first buffer.
|
||||
are remembered. Depending on the application, this may or may not be what you
|
||||
want.
|
||||
.P
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain the matched part of the segment and try a new complete
|
||||
match, as described for \fBpcre2_match()\fP above. Another possibility is to
|
||||
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
|
||||
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -424,6 +365,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 22 July 2019
|
||||
Last updated: 07 August 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
|
||||
|
|