Partial match documentation rewritten.
parent
59c7c5d100
commit
ce751bfc84
|
@@ -14,85 +14,123 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<br>
|
<br>
|
||||||
<ul>
|
<ul>
|
||||||
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
|
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
|
||||||
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre2_match()</a>
|
<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
|
||||||
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_dfa_match()</a>
|
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
|
||||||
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
|
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
|
||||||
<li><a name="TOC5" href="#SEC5">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a>
|
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
|
||||||
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
|
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
|
||||||
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
|
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
|
||||||
<li><a name="TOC8" href="#SEC8">ISSUES WITH MULTI-SEGMENT MATCHING</a>
|
<li><a name="TOC8" href="#SEC8">REVISION</a>
|
||||||
<li><a name="TOC9" href="#SEC9">AUTHOR</a>
|
|
||||||
<li><a name="TOC10" href="#SEC10">REVISION</a>
|
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
|
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
|
||||||
<P>
|
<P>
|
||||||
In normal use of PCRE2, if the subject string that is passed to a matching
|
In normal use of PCRE2, if there is a match up to the end of a subject string,
|
||||||
function matches as far as it goes, but is too short to match the entire
|
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
|
||||||
pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
|
is returned, just like any other failing match. There are circumstances where
|
||||||
might be helpful to distinguish this case from other cases in which there is no
|
it might be helpful to distinguish this "partial match" case.
|
||||||
match.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Consider, for example, an application where a human is required to type in data
|
One example is an application where the subject string is very long, and not
|
||||||
for a field with specific formatting requirements. An example might be a date
|
all available at once. The requirement here is to be able to do the matching
|
||||||
in the form <i>ddmmmyy</i>, defined by this pattern:
|
segment by segment, but special action is needed when a matched substring spans
|
||||||
|
the boundary between two segments.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Another example is checking a user input string as it is typed, to ensure that
|
||||||
|
it conforms to a required format. Invalid characters can be immediately
|
||||||
|
diagnosed and rejected, giving instant feedback.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
|
||||||
|
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
|
||||||
|
options when calling a matching function. The difference between the two
|
||||||
|
options is whether or not a partial match is preferred to an alternative
|
||||||
|
complete match, though the details differ between the two types of matching
|
||||||
|
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If you want to use partial matching with just-in-time optimized code, as well
|
||||||
|
as setting a partial match option for the matching function, you must also call
|
||||||
|
<b>pcre2_jit_compile()</b> with one or both of these options:
|
||||||
<pre>
|
<pre>
|
||||||
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
|
|
||||||
</pre>
|
|
||||||
If the application sees the user's keystrokes one by one, and can check that
|
|
||||||
what has been typed so far is potentially valid, it is able to raise an error
|
|
||||||
as soon as a mistake is made, by beeping and not reflecting the character that
|
|
||||||
has been typed, for example. This immediate feedback is likely to be a better
|
|
||||||
user interface than a check that is delayed until the entire string has been
|
|
||||||
entered. Partial matching can also be useful when the subject string is very
|
|
||||||
long and is not all available at once, as discussed below.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
|
||||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
|
|
||||||
The difference between the two options is whether or not a partial match is
|
|
||||||
preferred to an alternative complete match, though the details differ between
|
|
||||||
the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
|
|
||||||
takes precedence.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
If you want to use partial matching with just-in-time optimized code, you must
|
|
||||||
call <b>pcre2_jit_compile()</b> with one or both of these options:
|
|
||||||
<pre>
|
|
||||||
PCRE2_JIT_PARTIAL_SOFT
|
|
||||||
PCRE2_JIT_PARTIAL_HARD
|
PCRE2_JIT_PARTIAL_HARD
|
||||||
|
PCRE2_JIT_PARTIAL_SOFT
|
||||||
</pre>
|
</pre>
|
||||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
||||||
matches on the same pattern. If the appropriate JIT mode has not been compiled,
|
matches on the same pattern. Separate code is compiled for each mode. If the
|
||||||
interpretive matching code is used.
|
appropriate JIT mode has not been compiled, interpretive matching code is used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Setting a partial matching option disables two of PCRE2's standard
|
Setting a partial matching option disables two of PCRE2's standard
|
||||||
optimizations. PCRE2 remembers the last literal code unit in a pattern, and
|
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
|
||||||
abandons matching immediately if it is not present in the subject string. This
|
and abandons matching immediately if it is not present in the subject string.
|
||||||
optimization cannot be used for a subject string that might match only
|
This optimization cannot be used for a subject string that might match only
|
||||||
partially. PCRE2 also knows the minimum length of a matching string, and does
|
partially. PCRE2 also remembers a minimum length of a matching string, and does
|
||||||
not bother to run the matching function on shorter strings. This optimization
|
not bother to run the matching function on shorter strings. This optimization
|
||||||
is also disabled for partial matching.
|
is also disabled for partial matching.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
|
<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
|
||||||
<P>
|
<P>
|
||||||
A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
|
A possible partial match occurs during matching when the end of the subject
|
||||||
subject string is reached successfully, but matching cannot continue because
|
string is reached successfully, but either more characters are needed to
|
||||||
more characters are needed, and in addition, either at least one character in
|
complete the match, or the addition of more characters might change what is
|
||||||
the subject has been inspected or the pattern contains a lookbehind, or (when
|
matched.
|
||||||
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
|
|
||||||
inspected character need not form part of the final matched string; lookbehind
|
|
||||||
assertions and the \K escape sequence provide ways of inspecting characters
|
|
||||||
before the start of a matched string.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The three additional requirements define the cases where adding more characters
|
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
|
||||||
to the existing subject may complete the same match that would occur if they
|
definitely needed to complete a match. In this case both hard and soft matching
|
||||||
had all been present in the first place. Without these conditions there would
|
options yield a partial match.
|
||||||
be a partial match of an empty string at the end of the subject for all
|
</P>
|
||||||
unanchored patterns (and also for anchored patterns if the subject itself is
|
<P>
|
||||||
empty).
|
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
|
||||||
|
can be found, but the addition of more characters might change what is
|
||||||
|
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
|
||||||
|
PCRE2_PARTIAL_SOFT returns the complete match.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
|
||||||
|
pattern item is \z, \Z, \b, \B, or $ there is always a partial match.
|
||||||
|
Otherwise, for both options, the next pattern item must be one that inspects a
|
||||||
|
character, and at least one of the following must be true:
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
(1) At least one character has already been inspected. An inspected character
|
||||||
|
need not form part of the final matched string; lookbehind assertions and the
|
||||||
|
\K escape sequence provide ways of inspecting characters before the start of a
|
||||||
|
matched string.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||||
|
exists in case there is a lookbehind that inspects characters before the start
|
||||||
|
of the match.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
(3) There is a special case when the whole pattern can match an empty string.
|
||||||
|
When the starting point is at the end of the subject, the empty string match is
|
||||||
|
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
|
||||||
|
conditions is true, it is returned. However, because adding more characters
|
||||||
|
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
|
||||||
|
which in this case means "there is going to be a match at this point, but until
|
||||||
|
some more characters are added, we do not know if it will be an empty string or
|
||||||
|
something longer".
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
|
||||||
|
<P>
|
||||||
|
When a partial matching option is set, the result of calling
|
||||||
|
<b>pcre2_match()</b> can be one of the following:
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>A successful match</b>
|
||||||
|
A complete match has been found, starting and ending within this subject.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>PCRE2_ERROR_NOMATCH</b>
|
||||||
|
No match can start anywhere in this subject.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>PCRE2_ERROR_PARTIAL</b>
|
||||||
|
Adding more characters may result in a complete match that uses one or more
|
||||||
|
characters from the end of this subject.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When a partial match is returned, the first two elements in the ovector point
|
When a partial match is returned, the first two elements in the ovector point
|
||||||
|
@@ -110,26 +148,6 @@ these characters are needed for a subsequent re-match with additional
|
||||||
characters.
|
characters.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
What happens when a partial match is identified depends on which of the two
|
|
||||||
partial matching options is set.
|
|
||||||
</P>
|
|
||||||
<br><b>
|
|
||||||
PCRE2_PARTIAL_SOFT WITH pcre2_match()
|
|
||||||
</b><br>
|
|
||||||
<P>
|
|
||||||
If PCRE2_PARTIAL_SOFT is set when <b>pcre2_match()</b> identifies a partial
|
|
||||||
match, the partial match is remembered, but matching continues as normal, and
|
|
||||||
other alternatives in the pattern are tried. If no complete match can be found,
|
|
||||||
PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
This option is "soft" because it prefers a complete match over a partial match.
|
|
||||||
All the various matching items in a pattern behave as if the subject string is
|
|
||||||
potentially complete. For example, \z, \Z, and $ match at the end of the
|
|
||||||
subject, as normal, and for \b and \B the end of the subject is treated as a
|
|
||||||
non-alphanumeric.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
If there is more than one partial match, the first one that was found provides
|
If there is more than one partial match, the first one that was found provides
|
||||||
the data that is returned. Consider this pattern:
|
the data that is returned. Consider this pattern:
|
||||||
<pre>
|
<pre>
|
||||||
|
@@ -138,26 +156,34 @@ the data that is returned. Consider this pattern:
|
||||||
If this is matched against the subject string "abc123dog", both alternatives
|
If this is matched against the subject string "abc123dog", both alternatives
|
||||||
fail to match, but the end of the subject is reached during matching, so
|
fail to match, but the end of the subject is reached during matching, so
|
||||||
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
||||||
"123dog" as the first partial match that was found. (In this example, there are
|
"123dog" as the first partial match. (In this example, there are two partial
|
||||||
two partial matches, because "dog" on its own partially matches the second
|
matches, because "dog" on its own partially matches the second alternative.)
|
||||||
alternative.)
|
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
PCRE2_PARTIAL_HARD WITH pcre2_match()
|
How a partial match is processed by pcre2_match()
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
If PCRE2_PARTIAL_HARD is set for <b>pcre2_match()</b>, PCRE2_ERROR_PARTIAL is
|
What happens when a partial match is identified depends on which of the two
|
||||||
returned as soon as a partial match is found, without continuing to search for
|
partial matching options is set.
|
||||||
possible complete matches. This option is "hard" because it prefers an earlier
|
</P>
|
||||||
partial match over a later complete match. For this reason, the assumption is
|
<P>
|
||||||
made that the end of the supplied subject string may not be the true end of the
|
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||||
available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
|
partial match is found, without continuing to search for possible complete
|
||||||
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
|
matches. This option is "hard" because it prefers an earlier partial match over
|
||||||
characters have been inspected.
|
a later complete match. For this reason, the assumption is made that the end of
|
||||||
|
the supplied subject string is not the true end of the available data, which is
|
||||||
|
why \z, \Z, \b, \B, and $ always give a partial match.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||||
|
continues as normal, and other alternatives in the pattern are tried. If no
|
||||||
|
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||||
|
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
|
||||||
|
over a partial match. All the various matching items in a pattern behave as if
|
||||||
|
the subject string is potentially complete; \z, \Z, and $ match at the end of
|
||||||
|
the subject, as normal, and for \b and \B the end of the subject is treated
|
||||||
|
as a non-alphanumeric.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
|
||||||
Comparing hard and soft partial matching
|
|
||||||
</b><br>
|
|
||||||
<P>
|
<P>
|
||||||
The difference between the two partial matching options can be illustrated by a
|
The difference between the two partial matching options can be illustrated by a
|
||||||
pattern such as:
|
pattern such as:
|
||||||
|
@@ -182,26 +208,132 @@ to follow this explanation by thinking of the two patterns like this:
|
||||||
The second pattern will never match "dogsbody", because it will always find the
|
The second pattern will never match "dogsbody", because it will always find the
|
||||||
shorter match first.
|
shorter match first.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
<br><b>
|
||||||
|
Example of partial matching using pcre2test
|
||||||
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The DFA functions move along the subject string character by character, without
|
The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
|
||||||
|
<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
|
||||||
|
respectively, when calling <b>pcre2_match()</b>. Here is a run of
|
||||||
|
<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
|
||||||
|
date:
|
||||||
|
<pre>
|
||||||
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
data> 25dec3\=ph
|
||||||
|
Partial match: 25dec3
|
||||||
|
data> 3ju\=ph
|
||||||
|
Partial match: 3ju
|
||||||
|
data> 3juj\=ph
|
||||||
|
No match
|
||||||
|
</pre>
|
||||||
|
This example gives the same results for both hard and soft partial matching
|
||||||
|
options. Here is an example where there is a difference:
|
||||||
|
<pre>
|
||||||
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
data> 25jun04\=ps
|
||||||
|
0: 25jun04
|
||||||
|
1: jun
|
||||||
|
data> 25jun04\=ph
|
||||||
|
Partial match: 25jun04
|
||||||
|
</pre>
|
||||||
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||||
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||||
|
there is only a partial match.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
|
||||||
|
<P>
|
||||||
|
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||||
|
over time, features (including partial matching) that make multi-segment
|
||||||
|
matching possible have been added. The string is searched segment by segment by
|
||||||
|
calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
|
||||||
|
results that would happen if the entire string was available for searching.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Special logic must be implemented to handle a matched substring that spans a
|
||||||
|
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||||
|
partial match at the end of a segment whenever there is the possibility of
|
||||||
|
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||||
|
also be set for all but the first segment.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When a partial match occurs, the next segment must be added to the current
|
||||||
|
subject and the match re-run, using the <i>startoffset</i> argument of
|
||||||
|
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
||||||
|
Multi-segment matching is usually used to search for substrings in the middle
|
||||||
|
of very long sequences, so the patterns are normally not anchored. For example:
|
||||||
|
<pre>
|
||||||
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||||
|
data> ...the date is 23ja\=ph
|
||||||
|
Partial match: 23ja
|
||||||
|
data> ...the date is 23jan19 and on that day...\=offset=15
|
||||||
|
0: 23jan19
|
||||||
|
1: jan
|
||||||
|
</pre>
|
||||||
|
Note the use of the <b>offset</b> modifier to start the new match where the
|
||||||
|
partial match was found.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In this simple example, the next segment was just added to the one in which the
|
||||||
|
partial match was found. However, if there are memory constraints, it may be
|
||||||
|
necessary to discard text that precedes the partial match before adding the
|
||||||
|
next segment. In cases such as the above, where the pattern does not contain
|
||||||
|
any lookbehinds, it is sufficient to retain only the partially matched
|
||||||
|
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||||
|
that precede the start of the partial match may have been inspected during the
|
||||||
|
matching process.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The only lookbehind information that is available is the length of the longest
|
||||||
|
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||||
|
pattern, but retaining that many characters before the partial match is
|
||||||
|
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Before doing any matching, find the length of the longest lookbehind in the
|
||||||
|
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
||||||
|
option. Note that the resulting count is in characters, not code units. After a
|
||||||
|
partial match, moving back from the ovector[0] offset in the subject by the
|
||||||
|
number of characters given for the maximum lookbehind gets you to the earliest
|
||||||
|
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||||
|
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||||
|
while moving back through the code units. Characters before the point you have
|
||||||
|
now reached can be discarded.
|
||||||
|
</P>
|
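The back-stepping through UTF-8 code units described above can be sketched in C. This is a minimal illustration and not code from PCRE2: the function name utf8_retain_offset and its parameters are hypothetical, and it assumes that the subject is valid UTF-8 and that max_lookbehind was previously obtained with PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the earliest code unit that must be retained after a
   partial match in a UTF-8 subject.
     subject        : the current subject (assumed to be valid UTF-8)
     match_start    : ovector[0] from the partial match (a code unit offset)
     max_lookbehind : maximum lookbehind length, counted in characters
   Hypothetical helper for illustration; it is not part of the PCRE2 API. */
static size_t utf8_retain_offset(const unsigned char *subject,
                                 size_t match_start, size_t max_lookbehind)
{
    size_t pos = match_start;
    assert(subject != NULL);
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                         /* step back one code unit */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                     /* skip continuation bytes (10xxxxxx) */
        max_lookbehind--;              /* one whole character consumed */
    }
    return pos;                        /* stops at 0 if the subject is short */
}
```

Everything before the returned offset can be discarded. In a non-UTF or 32-bit situation the loop is unnecessary, because the result is simply the match start minus the lookbehind count, with a lower limit of zero.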
||||||
|
<P>
|
||||||
|
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||||
|
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||||
|
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||||
|
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
||||||
|
displays a partial match, it indicates the lookbehind characters with '<'
|
||||||
|
characters if the <b>allusedtext</b> modifier is set:
|
||||||
|
<pre>
|
||||||
|
re> "(?<=123)abc"
|
||||||
|
data> xx123ab\=ph,allusedtext
|
||||||
|
Partial match: 123ab
|
||||||
|
<<<
|
||||||
|
</pre>
|
||||||
|
Note that the <b>allusedtext</b> modifier is not available for JIT matching,
|
||||||
|
because JIT matching does not maintain the first and last consulted characters.
|
||||||
|
</P>
|
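The discard step in the "xx123ab" example can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name discard_before_partial is hypothetical (it is not a PCRE2 function), and the maximum lookbehind is assumed to equal a code unit count, as in the non-UTF or 32-bit case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Discard subject text that is no longer needed after a partial match and
   return the startoffset to use for the next match attempt.
     buf, *len      : current subject and its length in code units
     match_start    : ovector[0] from the partial match
     max_lookbehind : maximum lookbehind, assumed here to be a code unit count
   Hypothetical helper for illustration; it is not part of the PCRE2 API. */
static size_t discard_before_partial(char *buf, size_t *len,
                                     size_t match_start, size_t max_lookbehind)
{
    size_t keep_from;
    assert(buf != NULL && len != NULL && match_start <= *len);

    /* Earliest code unit that a lookbehind might still need to inspect. */
    keep_from = (match_start > max_lookbehind) ? match_start - max_lookbehind : 0;

    /* Slide the retained text to the front of the buffer. */
    memmove(buf, buf + keep_from, *len - keep_from);
    *len -= keep_from;

    /* The partial match now begins this far into the shortened subject. */
    return match_start - keep_from;
}
```

With the subject "xx123ab", a match start of 5 and a lookbehind of 3, the buffer becomes "123ab" and the returned start offset is 3. The next segment would then be appended at buf + *len before the match is re-run.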
||||||
|
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||||
|
<P>
|
||||||
|
The DFA function moves along the subject string character by character, without
|
||||||
backtracking, searching for all possible matches simultaneously. If the end of
|
backtracking, searching for all possible matches simultaneously. If the end of
|
||||||
the subject is reached before the end of the pattern, there is the possibility
|
the subject is reached before the end of the pattern, there is the possibility
|
||||||
of a partial match, again provided that at least one character has been
|
of a partial match.
|
||||||
inspected.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
||||||
have been no complete matches. Otherwise, the complete matches are returned.
|
have been no complete matches. Otherwise, the complete matches are returned.
|
||||||
However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
|
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
|
||||||
any complete matches. The portion of the string that was matched when the
|
complete matches. The portion of the string that was matched when the longest
|
||||||
longest partial match was found is set as the first matching string.
|
partial match was found is set as the first matching string.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Because the DFA functions always search for all possible matches, and there is
|
Because the DFA function always searches for all possible matches, and there is
|
||||||
no difference between greedy and ungreedy repetition, their behaviour is
|
no difference between greedy and ungreedy repetition, its behaviour is
|
||||||
different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
|
different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
|
||||||
the string "dog" matched against the ungreedy pattern shown above:
|
against this ungreedy pattern:
|
||||||
<pre>
|
<pre>
|
||||||
/dog(sbody)??/
|
/dog(sbody)??/
|
||||||
</pre>
|
</pre>
|
||||||
|
@@ -209,58 +341,16 @@ Whereas the standard function stops as soon as it finds the complete match for
|
||||||
"dog", the DFA function also finds the partial match for "dogsbody", and so
|
"dog", the DFA function also finds the partial match for "dogsbody", and so
|
||||||
returns that when PCRE2_PARTIAL_HARD is set.
|
returns that when PCRE2_PARTIAL_HARD is set.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
|
|
||||||
<P>
|
|
||||||
If a pattern ends with one of the sequences \b or \B, which test for word
|
|
||||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
|
|
||||||
results. Consider this pattern:
|
|
||||||
<pre>
|
|
||||||
/\bcat\b/
|
|
||||||
</pre>
|
|
||||||
This matches "cat", provided there is a word boundary at either end. If the
|
|
||||||
subject string is "the cat", the comparison of the final "t" with a following
|
|
||||||
character cannot take place, so a partial match is found. However, normal
|
|
||||||
matching carries on, and \b matches at the end of the subject when the last
|
|
||||||
character is a letter, so a complete match is found. The result, therefore, is
|
|
||||||
<i>not</i> PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
|
|
||||||
PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
|
|
||||||
</P>
|
|
||||||
<br><a name="SEC5" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a><br>
|
|
||||||
<P>
|
|
||||||
If the <b>partial_soft</b> (or <b>ps</b>) modifier is present on a
|
|
||||||
<b>pcre2test</b> data line, the PCRE2_PARTIAL_SOFT option is used for the match.
|
|
||||||
Here is a run of <b>pcre2test</b> that uses the date example quoted above:
|
|
||||||
<pre>
|
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
||||||
data> 25jun04\=ps
|
|
||||||
0: 25jun04
|
|
||||||
1: jun
|
|
||||||
data> 25dec3\=ps
|
|
||||||
Partial match: 25dec3
|
|
||||||
data> 3ju\=ps
|
|
||||||
Partial match: 3ju
|
|
||||||
data> 3juj\=ps
|
|
||||||
No match
|
|
||||||
data> j\=ps
|
|
||||||
No match
|
|
||||||
</pre>
|
|
||||||
The first data string is matched completely, so <b>pcre2test</b> shows the
|
|
||||||
matched substrings. The remaining four strings do not match the complete
|
|
||||||
pattern, but the first two are partial matches. Similar output is obtained
|
|
||||||
if DFA matching is used.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
If the <b>partial_hard</b> (or <b>ph</b>) modifier is present on a
|
|
||||||
<b>pcre2test</b> data line, the PCRE2_PARTIAL_HARD option is set for the match.
|
|
||||||
</P>
|
|
||||||
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
|
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
|
||||||
<P>
|
<P>
|
||||||
When a partial match has been found using a DFA matching function, it is
|
When a partial match has been found using the DFA matching function, it is
|
||||||
possible to continue the match by providing additional subject data and calling
|
possible to continue the match by providing additional subject data and calling
|
||||||
the function again with the same compiled regular expression, this time setting
|
the function again with the same compiled regular expression, this time setting
|
||||||
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
||||||
because this is where details of the previous partial match are stored. Here is
|
because this is where details of the previous partial match are stored. You can
|
||||||
an example using <b>pcre2test</b>:
|
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
|
||||||
|
to continue partial matching over multiple segments. Here is an example using
|
||||||
|
<b>pcre2test</b>:
|
||||||
<pre>
|
<pre>
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
data> 23ja\=dfa,ps
|
data> 23ja\=dfa,ps
|
||||||
|
@@ -272,143 +362,10 @@ The first call has "23ja" as the subject, and requests partial matching; the
|
||||||
second call has "n05" as the subject for the continued (restarted) match.
|
second call has "n05" as the subject for the continued (restarted) match.
|
||||||
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
||||||
not retain the previously partially-matched string. It is up to the calling
|
not retain the previously partially-matched string. It is up to the calling
|
||||||
program to do that if it needs to.
|
program to do that if it needs to. This means that, for an unanchored pattern,
|
||||||
</P>
|
if a continued match fails, it is not possible to try again at a new starting
|
||||||
<P>
|
point. All this facility is capable of doing is continuing with the previous
|
||||||
That means that, for an unanchored pattern, if a continued match fails, it is
|
match attempt. For example, consider this pattern:
|
||||||
not possible to try again at a new starting point. All this facility is capable
|
|
||||||
of doing is continuing with the previous match attempt. In the previous
|
|
||||||
example, if the second set of data is "ug23" the result is no match, even
|
|
||||||
though there would be a match for "aug23" if the entire string were given at
|
|
||||||
once. Depending on the application, this may or may not be what you want.
|
|
||||||
The only way to allow for starting again at the next character is to retain the
|
|
||||||
matched part of the subject and try a new complete match.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
|
||||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
|
|
||||||
facility can be used to pass very long subject strings to the DFA matching
|
|
||||||
functions.
|
|
||||||
</P>
|
|
||||||
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
|
|
||||||
<P>
|
|
||||||
Unlike the DFA function, it is not possible to restart the previous match with
|
|
||||||
a new segment of data when using <b>pcre2_match()</b>. Instead, new data must be
|
|
||||||
added to the previous subject string, and the entire match re-run, starting
|
|
||||||
from the point where the partial match occurred. Earlier data can be discarded.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
|
|
||||||
treat the end of a segment as the end of the subject when matching \z, \Z,
|
|
||||||
\b, \B, and $. Consider an unanchored pattern that matches dates:
|
|
||||||
<pre>
|
|
||||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
|
||||||
data> The date is 23ja\=ph
|
|
||||||
Partial match: 23ja
|
|
||||||
</pre>
|
|
||||||
At this stage, an application could discard the text preceding "23ja", add on
|
|
||||||
text from the next segment, and call the matching function again. Unlike the
|
|
||||||
DFA matching function, the entire matching string must always be available,
|
|
||||||
and the complete matching process occurs for each call, so more memory and more
|
|
||||||
processing time is needed.
|
|
||||||
</P>
|
|
||||||
<br><a name="SEC8" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
|
|
||||||
<P>
|
|
||||||
Certain types of pattern may give problems with multi-segment matching,
|
|
||||||
whichever matching function is used.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
1. If the pattern contains a test for the beginning of a line, you need to pass
|
|
||||||
the PCRE2_NOTBOL option when the subject string for any call does not start at the
|
|
||||||
beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
|
|
||||||
doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
|
|
||||||
includes the effect of PCRE2_NOTEOL.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
2. If a pattern contains a lookbehind assertion, characters that precede the
|
|
||||||
start of the partial match may have been inspected during the matching process.
|
|
||||||
When using <b>pcre2_match()</b>, sufficient characters must be retained for the
|
|
||||||
next match attempt. You can ensure that enough characters are retained by doing
|
|
||||||
the following:
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Before doing any matching, find the length of the longest lookbehind in the
|
|
||||||
pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
|
|
||||||
option. Note that the resulting count is in characters, not code units. After a
|
|
||||||
partial match, moving back from the ovector[0] offset in the subject by the
|
|
||||||
number of characters given for the maximum lookbehind gets you to the earliest
|
|
||||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
|
||||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
|
||||||
while moving back through the code units.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Characters before the point you have now reached can be discarded, and after
|
|
||||||
the next segment has been added to what is retained, you should run the next
|
|
||||||
match with the <b>startoffset</b> argument set so that the match begins at the
|
|
||||||
same point as before.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
|
||||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
|
||||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
|
||||||
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
|
|
||||||
displays a partial match, it indicates the lookbehind characters with '<'
|
|
||||||
characters if the "allusedtext" modifier is set:
|
|
||||||
<pre>
|
|
||||||
re> "(?<=123)abc"
|
|
||||||
data> xx123ab\=ph,allusedtext
|
|
||||||
Partial match: 123ab
|
|
||||||
<<<
|
|
||||||
</pre>
|
|
||||||
However, the "allusedtext" modifier is not available for JIT matching, because
|
|
||||||
JIT matching does not maintain the first and last consulted characters.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
3. Matching a subject string that is split into multiple segments may not
|
|
||||||
always produce exactly the same result as matching over one single long string
|
|
||||||
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
|
|
||||||
Boundaries" above describes an issue that arises if the pattern ends with \b
|
|
||||||
or \B. Another kind of difference may occur when there are multiple matching
|
|
||||||
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
|
|
||||||
only when there are no completed matches. This means that as soon as the
|
|
||||||
shortest match has been found, continuation to a new subject segment is no
|
|
||||||
longer possible. Consider this <b>pcre2test</b> example:
|
|
||||||
<pre>
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\=ps
|
|
||||||
0: dog
|
|
||||||
data> do\=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\=ps,dfa,dfa_restart
|
|
||||||
0: g
|
|
||||||
data> dogsbody\=dfa
|
|
||||||
0: dogsbody
|
|
||||||
1: dog
|
|
||||||
</pre>
|
|
||||||
The first data line passes the string "dogsb" to a standard matching function,
|
|
||||||
setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
|
|
||||||
for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
|
|
||||||
string "dog" is a complete match. Similarly, when the subject is presented to
|
|
||||||
a DFA matching function in several parts ("do" and "gsb" being the first two)
|
|
||||||
the match stops when "dog" has been found, and it is not possible to continue.
|
|
||||||
On the other hand, if "dogsbody" is presented as a single string, a DFA
|
|
||||||
matching function finds both matches.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
|
|
||||||
multi-segment data. The example above then behaves differently:
|
|
||||||
<pre>
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\=ph
|
|
||||||
Partial match: dogsb
|
|
||||||
data> do\=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\=ph,dfa,dfa_restart
|
|
||||||
Partial match: gsb
|
|
||||||
</pre>
|
|
||||||
4. Patterns that contain alternatives at the top level which do not all start
|
|
||||||
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
|
|
||||||
used. For example, consider this pattern:
|
|
||||||
<pre>
|
<pre>
|
||||||
1234|3789
|
1234|3789
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -417,30 +374,18 @@ alternative is found at offset 3. There is no partial match for the second
|
||||||
alternative, because such a match does not start at the same point in the
|
alternative, because such a match does not start at the same point in the
|
||||||
subject string. Attempting to continue with the string "7890" does not yield a
|
subject string. Attempting to continue with the string "7890" does not yield a
|
||||||
match because only those alternatives that match at one point in the subject
|
match because only those alternatives that match at one point in the subject
|
||||||
are remembered. The problem arises because the start of the second alternative
|
are remembered. Depending on the application, this may or may not be what you
|
||||||
matches within the first alternative. There is no problem with anchored
|
want.
|
||||||
patterns or patterns such as:
|
|
||||||
<pre>
|
|
||||||
1234|ABCD
|
|
||||||
</pre>
|
|
||||||
where no string can be a partial match for both alternatives. This is not a
|
|
||||||
problem if a standard matching function is used, because the entire match has
|
|
||||||
to be rerun each time:
|
|
||||||
<pre>
|
|
||||||
re> /1234|3789/
|
|
||||||
data> ABC123\=ph
|
|
||||||
Partial match: 123
|
|
||||||
data> 1237890
|
|
||||||
0: 3789
|
|
||||||
</pre>
|
|
||||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
|
|
||||||
the entire match can also be used with the DFA matching function. Another
|
|
||||||
possibility is to work with two buffers. If a partial match at offset <i>n</i>
|
|
||||||
in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
|
|
||||||
the second buffer, you can then try a new match starting at offset <i>n+1</i> in
|
|
||||||
the first buffer.
|
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">AUTHOR</a><br>
|
<P>
|
||||||
|
If you do want to allow for starting again at the next character, one way of
|
||||||
|
doing it is to retain the matched part of the segment and try a new complete
|
||||||
|
match, as described for <b>pcre2_match()</b> above. Another possibility is to
|
||||||
|
work with two buffers. If a partial match at offset <i>n</i> in the first buffer
|
||||||
|
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||||
|
you can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -449,9 +394,9 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 22 July 2019
|
Last updated: 07 August 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
566
doc/pcre2.txt
|
@ -5650,72 +5650,109 @@ NAME
|
||||||
|
|
||||||
PARTIAL MATCHING IN PCRE2
|
PARTIAL MATCHING IN PCRE2
|
||||||
|
|
||||||
In normal use of PCRE2, if the subject string that is passed to a
|
In normal use of PCRE2, if there is a match up to the end of a subject
|
||||||
matching function matches as far as it goes, but is too short to match
|
string, but more characters are needed to match the entire pattern,
|
||||||
the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
|
PCRE2_ERROR_NOMATCH is returned, just like any other failing match.
|
||||||
stances where it might be helpful to distinguish this case from other
|
There are circumstances where it might be helpful to distinguish this
|
||||||
cases in which there is no match.
|
"partial match" case.
|
||||||
|
|
||||||
Consider, for example, an application where a human is required to type
|
One example is an application where the subject string is very long,
|
||||||
in data for a field with specific formatting requirements. An example
|
and not all available at once. The requirement here is to be able to do
|
||||||
might be a date in the form ddmmmyy, defined by this pattern:
|
the matching segment by segment, but special action is needed when a
|
||||||
|
matched substring spans the boundary between two segments.
|
||||||
|
|
||||||
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
|
Another example is checking a user input string as it is typed, to en-
|
||||||
|
sure that it conforms to a required format. Invalid characters can be
|
||||||
|
immediately diagnosed and rejected, giving instant feedback.
|
||||||
|
|
||||||
If the application sees the user's keystrokes one by one, and can check
|
Partial matching is a PCRE2-specific feature; it is not Perl-compati-
|
||||||
that what has been typed so far is potentially valid, it is able to
|
ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or
|
||||||
raise an error as soon as a mistake is made, by beeping and not re-
|
PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
|
||||||
flecting the character that has been typed, for example. This immediate
|
ference between the two options is whether or not a partial match is
|
||||||
feedback is likely to be a better user interface than a check that is
|
preferred to an alternative complete match, though the details differ
|
||||||
delayed until the entire string has been entered. Partial matching can
|
between the two types of matching function. If both options are set,
|
||||||
also be useful when the subject string is very long and is not all
|
PCRE2_PARTIAL_HARD takes precedence.
|
||||||
available at once, as discussed below.
|
|
||||||
|
|
||||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
If you want to use partial matching with just-in-time optimized code,
|
||||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching
|
as well as setting a partial match option for the matching function,
|
||||||
function. The difference between the two options is whether or not a
|
you must also call pcre2_jit_compile() with one or both of these op-
|
||||||
partial match is preferred to an alternative complete match, though the
|
tions:
|
||||||
details differ between the two types of matching function. If both op-
|
|
||||||
tions are set, PCRE2_PARTIAL_HARD takes precedence.
|
|
||||||
|
|
||||||
If you want to use partial matching with just-in-time optimized code,
|
|
||||||
you must call pcre2_jit_compile() with one or both of these options:
|
|
||||||
|
|
||||||
PCRE2_JIT_PARTIAL_SOFT
|
|
||||||
PCRE2_JIT_PARTIAL_HARD
|
PCRE2_JIT_PARTIAL_HARD
|
||||||
|
PCRE2_JIT_PARTIAL_SOFT
|
||||||
|
|
||||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
||||||
tial matches on the same pattern. If the appropriate JIT mode has not
|
tial matches on the same pattern. Separate code is compiled for each
|
||||||
been compiled, interpretive matching code is used.
|
mode. If the appropriate JIT mode has not been compiled, interpretive
|
||||||
|
matching code is used.
|
||||||
|
|
||||||
Setting a partial matching option disables two of PCRE2's standard op-
|
Setting a partial matching option disables two of PCRE2's standard op-
|
||||||
timizations. PCRE2 remembers the last literal code unit in a pattern,
|
timization hints. PCRE2 remembers the last literal code unit in a pat-
|
||||||
and abandons matching immediately if it is not present in the subject
|
tern, and abandons matching immediately if it is not present in the
|
||||||
string. This optimization cannot be used for a subject string that
|
subject string. This optimization cannot be used for a subject string
|
||||||
might match only partially. PCRE2 also knows the minimum length of a
|
that might match only partially. PCRE2 also remembers a minimum length
|
||||||
matching string, and does not bother to run the matching function on
|
of a matching string, and does not bother to run the matching function
|
||||||
shorter strings. This optimization is also disabled for partial match-
|
on shorter strings. This optimization is also disabled for partial
|
||||||
ing.
|
matching.
|
||||||
|
|
||||||
|
|
||||||
|
REQUIREMENTS FOR A PARTIAL MATCH
|
||||||
|
|
||||||
|
A possible partial match occurs during matching when the end of the
|
||||||
|
subject string is reached successfully, but either more characters are
|
||||||
|
needed to complete the match, or the addition of more characters might
|
||||||
|
change what is matched.
|
||||||
|
|
||||||
|
Example 1: if the pattern is /abc/ and the subject is "ab", more char-
|
||||||
|
acters are definitely needed to complete a match. In this case both
|
||||||
|
hard and soft matching options yield a partial match.
|
||||||
|
|
||||||
|
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
|
||||||
|
match can be found, but the addition of more characters might change
|
||||||
|
what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
|
||||||
|
tial match; PCRE2_PARTIAL_SOFT returns the complete match.
|
||||||
|
|
||||||
|
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
|
||||||
|
the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
|
||||||
|
match. Otherwise, for both options, the next pattern item must be one
|
||||||
|
that inspects a character, and at least one of the following must be
|
||||||
|
true:
|
||||||
|
|
||||||
|
(1) At least one character has already been inspected. An inspected
|
||||||
|
character need not form part of the final matched string; lookbehind
|
||||||
|
assertions and the \K escape sequence provide ways of inspecting char-
|
||||||
|
acters before the start of a matched string.
|
||||||
|
|
||||||
|
(2) The pattern contains one or more lookbehind assertions. This condi-
|
||||||
|
tion exists in case there is a lookbehind that inspects characters be-
|
||||||
|
fore the start of the match.
|
||||||
|
|
||||||
|
(3) There is a special case when the whole pattern can match an empty
|
||||||
|
string. When the starting point is at the end of the subject, the
|
||||||
|
empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set
|
||||||
|
and neither of the above conditions is true, it is returned. However,
|
||||||
|
because adding more characters might result in a non-empty match,
|
||||||
|
PCRE2_PARTIAL_HARD returns a partial match, which in this case means
|
||||||
|
"there is going to be a match at this point, but until some more char-
|
||||||
|
acters are added, we do not know if it will be an empty string or some-
|
||||||
|
thing longer".
|
||||||
|
|
||||||
|
|
||||||
PARTIAL MATCHING USING pcre2_match()
|
PARTIAL MATCHING USING pcre2_match()
|
||||||
|
|
||||||
A partial match occurs during a call to pcre2_match() when the end of
|
When a partial matching option is set, the result of calling
|
||||||
the subject string is reached successfully, but matching cannot con-
|
pcre2_match() can be one of the following:
|
||||||
tinue because more characters are needed, and in addition, either at
|
|
||||||
least one character in the subject has been inspected or the pattern
|
|
||||||
contains a lookbehind, or (when PCRE2_PARTIAL_HARD is set) the pattern
|
|
||||||
could match an empty string. An inspected character need not form part
|
|
||||||
of the final matched string; lookbehind assertions and the \K escape
|
|
||||||
sequence provide ways of inspecting characters before the start of a
|
|
||||||
matched string.
|
|
||||||
|
|
||||||
The three additional requirements define the cases where adding more
|
A successful match
|
||||||
characters to the existing subject may complete the same match that
|
A complete match has been found, starting and ending within this sub-
|
||||||
would occur if they had all been present in the first place. Without
|
ject.
|
||||||
these conditions there would be a partial match of an empty string at
|
|
||||||
the end of the subject for all unanchored patterns (and also for an-
|
PCRE2_ERROR_NOMATCH
|
||||||
chored patterns if the subject itself is empty).
|
No match can start anywhere in this subject.
|
||||||
|
|
||||||
|
PCRE2_ERROR_PARTIAL
|
||||||
|
Adding more characters may result in a complete match that uses one
|
||||||
|
or more characters from the end of this subject.
|
||||||
|
|
||||||
When a partial match is returned, the first two elements in the ovector
|
When a partial match is returned, the first two elements in the ovector
|
||||||
point to the portion of the subject that was matched, but the values in
|
point to the portion of the subject that was matched, but the values in
|
||||||
|
@ -5725,29 +5762,12 @@ PARTIAL MATCHING USING pcre2_match()
|
||||||
/abc\K123/
|
/abc\K123/
|
||||||
|
|
||||||
If it is matched against "456abc123xyz" the result is a complete match,
|
If it is matched against "456abc123xyz" the result is a complete match,
|
||||||
and the ovector defines the matched string as "123", because \K resets
|
and the ovector defines the matched string as "123", because \K resets
|
||||||
the "start of match" point. However, if a partial match is requested
|
the "start of match" point. However, if a partial match is requested
|
||||||
and the subject string is "456abc12", a partial match is found for the
|
and the subject string is "456abc12", a partial match is found for the
|
||||||
string "abc12", because all these characters are needed for a subse-
|
string "abc12", because all these characters are needed for a subse-
|
||||||
quent re-match with additional characters.
|
quent re-match with additional characters.
|
||||||
|
|
||||||
What happens when a partial match is identified depends on which of the
|
|
||||||
two partial matching options is set.
|
|
||||||
|
|
||||||
PCRE2_PARTIAL_SOFT WITH pcre2_match()
|
|
||||||
|
|
||||||
If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
|
|
||||||
match, the partial match is remembered, but matching continues as nor-
|
|
||||||
mal, and other alternatives in the pattern are tried. If no complete
|
|
||||||
match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
|
||||||
PCRE2_ERROR_NOMATCH.
|
|
||||||
|
|
||||||
This option is "soft" because it prefers a complete match over a par-
|
|
||||||
tial match. All the various matching items in a pattern behave as if
|
|
||||||
the subject string is potentially complete. For example, \z, \Z, and $
|
|
||||||
match at the end of the subject, as normal, and for \b and \B the end
|
|
||||||
of the subject is treated as a non-alphanumeric.
|
|
||||||
|
|
||||||
If there is more than one partial match, the first one that was found
|
If there is more than one partial match, the first one that was found
|
||||||
provides the data that is returned. Consider this pattern:
|
provides the data that is returned. Consider this pattern:
|
||||||
|
|
||||||
|
@ -5756,23 +5776,31 @@ PARTIAL MATCHING USING pcre2_match()
|
||||||
If this is matched against the subject string "abc123dog", both alter-
|
If this is matched against the subject string "abc123dog", both alter-
|
||||||
natives fail to match, but the end of the subject is reached during
|
natives fail to match, but the end of the subject is reached during
|
||||||
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
|
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
|
||||||
and 9, identifying "123dog" as the first partial match that was found.
|
and 9, identifying "123dog" as the first partial match. (In this exam-
|
||||||
(In this example, there are two partial matches, because "dog" on its
|
ple, there are two partial matches, because "dog" on its own partially
|
||||||
own partially matches the second alternative.)
|
matches the second alternative.)
|
||||||
|
|
||||||
PCRE2_PARTIAL_HARD WITH pcre2_match()
|
How a partial match is processed by pcre2_match()
|
||||||
|
|
||||||
If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
|
What happens when a partial match is identified depends on which of the
|
||||||
returned as soon as a partial match is found, without continuing to
|
two partial matching options is set.
|
||||||
search for possible complete matches. This option is "hard" because it
|
|
||||||
prefers an earlier partial match over a later complete match. For this
|
|
||||||
reason, the assumption is made that the end of the supplied subject
|
|
||||||
string may not be the true end of the available data, and so, if \z,
|
|
||||||
\Z, \b, \B, or $ are encountered at the end of the subject, the result
|
|
||||||
is PCRE2_ERROR_PARTIAL, whether or not any characters have been in-
|
|
||||||
spected.
|
|
||||||
|
|
||||||
Comparing hard and soft partial matching
|
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon
|
||||||
|
as a partial match is found, without continuing to search for possible
|
||||||
|
complete matches. This option is "hard" because it prefers an earlier
|
||||||
|
partial match over a later complete match. For this reason, the assump-
|
||||||
|
tion is made that the end of the supplied subject string is not the
|
||||||
|
true end of the available data, which is why \z, \Z, \b, \B, and $ al-
|
||||||
|
ways give a partial match.
|
||||||
|
|
||||||
|
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
|
||||||
|
matching continues as normal, and other alternatives in the pattern are
|
||||||
|
tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
|
||||||
|
turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
|
||||||
|
prefers a complete match over a partial match. All the various matching
|
||||||
|
items in a pattern behave as if the subject string is potentially com-
|
||||||
|
plete; \z, \Z, and $ match at the end of the subject, as normal, and
|
||||||
|
for \b and \B the end of the subject is treated as a non-alphanumeric.
|
||||||
|
|
||||||
The difference between the two partial matching options can be illus-
|
The difference between the two partial matching options can be illus-
|
||||||
trated by a pattern such as:
|
trated by a pattern such as:
|
||||||
|
@ -5799,27 +5827,129 @@ PARTIAL MATCHING USING pcre2_match()
|
||||||
The second pattern will never match "dogsbody", because it will always
|
The second pattern will never match "dogsbody", because it will always
|
||||||
find the shorter match first.
|
find the shorter match first.
|
||||||
|
|
||||||
|
Example of partial matching using pcre2test
|
||||||
|
|
||||||
|
The pcre2test data modifiers partial_hard (or ph) and partial_soft (or
|
||||||
|
ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
|
||||||
|
calling pcre2_match(). Here is a run of pcre2test using a pattern that
|
||||||
|
matches the whole subject in the form of a date:
|
||||||
|
|
||||||
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
data> 25dec3\=ph
|
||||||
|
Partial match: 25dec3
|
||||||
|
data> 3ju\=ph
|
||||||
|
Partial match: 3ju
|
||||||
|
data> 3juj\=ph
|
||||||
|
No match
|
||||||
|
|
||||||
|
This example gives the same results for both hard and soft partial
|
||||||
|
matching options. Here is an example where there is a difference:
|
||||||
|
|
||||||
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
|
data> 25jun04\=ps
|
||||||
|
0: 25jun04
|
||||||
|
1: jun
|
||||||
|
data> 25jun04\=ph
|
||||||
|
Partial match: 25jun04
|
||||||
|
|
||||||
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||||
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
|
||||||
|
so there is only a partial match.
|
||||||
|
|
||||||
|
|
||||||
|
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
||||||
|
|
||||||
|
PCRE was not originally designed with multi-segment matching in mind.
|
||||||
|
However, over time, features (including partial matching) that make
|
||||||
|
multi-segment matching possible have been added. The string is searched
|
||||||
|
segment by segment by calling pcre2_match() repeatedly, with the aim of
|
||||||
|
achieving the same results that would happen if the entire string was
|
||||||
|
available for searching.
|
||||||
|
|
||||||
|
Special logic must be implemented to handle a matched substring that
|
||||||
|
spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
|
||||||
|
returns a partial match at the end of a segment whenever there is the
|
||||||
|
possibility of changing the match by adding more characters. The
|
||||||
|
PCRE2_NOTBOL option should also be set for all but the first segment.
|
||||||
|
|
||||||
|
When a partial match occurs, the next segment must be added to the cur-
|
||||||
|
rent subject and the match re-run, using the startoffset argument of
|
||||||
|
pcre2_match() to begin at the point where the partial match started.
|
||||||
|
Multi-segment matching is usually used to search for substrings in the
|
||||||
|
middle of very long sequences, so the patterns are normally not an-
|
||||||
|
chored. For example:
|
||||||
|
|
||||||
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||||
|
data> ...the date is 23ja\=ph
|
||||||
|
Partial match: 23ja
|
||||||
|
data> ...the date is 23jan19 and on that day...\=offset=15
|
||||||
|
0: 23jan19
|
||||||
|
1: jan
|
||||||
|
|
||||||
|
Note the use of the offset modifier to start the new match where the
|
||||||
|
partial match was found.
|
||||||
|
|
||||||
|
In this simple example, the next segment was just added to the one in
|
||||||
|
which the partial match was found. However, if there are memory con-
|
||||||
|
straints, it may be necessary to discard text that precedes the partial
|
||||||
|
match before adding the next segment. In cases such as the above, where
|
||||||
|
the pattern does not contain any lookbehinds, it is sufficient to re-
|
||||||
|
tain only the partially matched substring. However, if a pattern con-
|
||||||
|
tains a lookbehind assertion, characters that precede the start of the
|
||||||
|
partial match may have been inspected during the matching process.
|
||||||
|
|
||||||
|
The only lookbehind information that is available is the length of the
|
||||||
|
longest lookbehind in a pattern. This may not, of course, be at the
|
||||||
|
start of the pattern, but retaining that many characters before the
|
||||||
|
partial match is sufficient, if not always strictly necessary. The way
|
||||||
|
to do this is as follows:
|
||||||
|
|
||||||
|
Before doing any matching, find the length of the longest lookbehind in
|
||||||
|
the pattern by calling pcre2_pattern_info() with the
|
||||||
|
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
|
||||||
|
characters, not code units. After a partial match, moving back from the
|
||||||
|
ovector[0] offset in the subject by the number of characters given for
|
||||||
|
the maximum lookbehind gets you to the earliest character that must be
|
||||||
|
retained. In a non-UTF or a 32-bit situation, moving back is just a
|
||||||
|
subtraction, but in UTF-8 or UTF-16 you have to count characters while
|
||||||
|
moving back through the code units. Characters before the point you
|
||||||
|
have now reached can be discarded.
|
||||||
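The backward step for UTF-8 subjects can be sketched in C as follows. This is
only an illustration: retain_from_utf8() is a hypothetical helper, not a PCRE2
function, and it assumes the subject is valid UTF-8, that match_start is the
ovector[0] value from the partial match, and that max_lookbehind is the
uint32_t value obtained via PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2). Moves back from match_start by up
   to max_lookbehind characters in a valid UTF-8 subject and returns the code
   unit offset of the earliest character that must be retained. Stops at
   offset 0 if the subject holds fewer characters than that. */
static size_t retain_from_utf8(const uint8_t *subject, size_t match_start,
                               uint32_t max_lookbehind)
{
    assert(subject != NULL);
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;  /* step back one code unit */
        /* Bytes of the form 10xxxxxx are continuation bytes; keep moving
           back until the first byte of the character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

In a non-UTF or 32-bit situation the loop reduces to a plain subtraction,
clamped at zero. Code units before the returned offset can be discarded.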
|
|
||||||
|
For example, if the pattern "(?<=123)abc" is partially matched against
|
||||||
|
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
|
||||||
|
mum lookbehind count is 3, so all characters before offset 2 can be
|
||||||
|
discarded. The value of startoffset for the next match should be 3.
|
||||||
|
When pcre2test displays a partial match, it indicates the lookbehind
|
||||||
|
characters with '<' characters if the allusedtext modifier is set:
|
||||||
|
|
||||||
|
re> "(?<=123)abc"
|
||||||
|
data> xx123ab\=ph,allusedtext
|
||||||
|
Partial match: 123ab
|
||||||
|
<<<
|
||||||
|
|
||||||
|
Note that the allusedtext modifier is not available for JIT matching,
|
||||||
|
because JIT matching does not maintain the first and last consulted
|
||||||
|
characters.
|
||||||
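The buffer handling that follows a partial match (discard what precedes the
retained text, add the next segment, and re-run from the same point) might be
sketched in C like this. The helper compact_and_append() and its parameters
are hypothetical and not part of PCRE2; the sketch assumes an 8-bit code unit
buffer owned by the caller, with retain_from being the earliest offset that
must be kept and match_start the ovector[0] value of the partial match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE2). buf holds *len code units and has
   room for cap. Everything before retain_from is dropped, the next segment
   is appended, and *len is updated. The return value is the start of the
   earlier partial match relative to the compacted buffer, suitable for use
   as the startoffset of the next match. Returns (size_t)-1, leaving the
   buffer unchanged, if the new segment does not fit. */
static size_t compact_and_append(char *buf, size_t *len, size_t cap,
                                 size_t retain_from, size_t match_start,
                                 const char *segment, size_t seg_len)
{
    assert(*len <= cap);
    assert(retain_from <= match_start && match_start <= *len);

    size_t kept = *len - retain_from;
    if (seg_len > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + retain_from, kept);  /* slide retained text to front */
    memcpy(buf + kept, segment, seg_len);   /* add the next segment */
    *len = kept + seg_len;
    return match_start - retain_from;
}
```

With the "(?<=123)abc" example above, a buffer holding "xx123ab" with
retain_from 2 and match_start 5 becomes "123ab" plus the new segment, and the
returned value is 3, which agrees with the startoffset given in the text.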
|
|
||||||
|
|
||||||
PARTIAL MATCHING USING pcre2_dfa_match()
|
PARTIAL MATCHING USING pcre2_dfa_match()
|
||||||
|
|
||||||
The DFA functions move along the subject string character by character,
|
The DFA function moves along the subject string character by character,
|
||||||
without backtracking, searching for all possible matches simultane-
|
without backtracking, searching for all possible matches simultane-
|
||||||
ously. If the end of the subject is reached before the end of the pat-
|
ously. If the end of the subject is reached before the end of the pat-
|
||||||
tern, there is the possibility of a partial match, again provided that
|
tern, there is the possibility of a partial match.
|
||||||
at least one character has been inspected.
|
|
||||||
|
|
||||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
|
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
|
||||||
there have been no complete matches. Otherwise, the complete matches
|
there have been no complete matches. Otherwise, the complete matches
|
||||||
are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
|
are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
|
||||||
takes precedence over any complete matches. The portion of the string
|
precedence over any complete matches. The portion of the string that
|
||||||
that was matched when the longest partial match was found is set as the
|
was matched when the longest partial match was found is set as the
|
||||||
first matching string.
|
first matching string.
|
||||||
|
|
||||||
Because the DFA functions always search for all possible matches, and
|
Because the DFA function always searches for all possible matches, and
|
||||||
there is no difference between greedy and ungreedy repetition, their
|
there is no difference between greedy and ungreedy repetition, its
|
||||||
behaviour is different from the standard functions when PCRE2_PAR-
|
behaviour is different from that of pcre2_match(). Consider the string "dog"
|
||||||
TIAL_HARD is set. Consider the string "dog" matched against the un-
|
matched against this ungreedy pattern:
|
||||||
greedy pattern shown above:
|
|
||||||
|
|
||||||
/dog(sbody)??/
|
/dog(sbody)??/
|
||||||
|
|
||||||
|
@ -5828,60 +5958,16 @@ PARTIAL MATCHING USING pcre2_dfa_match()
|
||||||
"dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
|
"dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
|
||||||
|
|
||||||
|
|
||||||
PARTIAL MATCHING AND WORD BOUNDARIES
|
|
||||||
|
|
||||||
If a pattern ends with one of the sequences \b or \B, which test for word
|
|
||||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
|
|
||||||
intuitive results. Consider this pattern:
|
|
||||||
|
|
||||||
/\bcat\b/
|
|
||||||
|
|
||||||
This matches "cat", provided there is a word boundary at either end. If
|
|
||||||
the subject string is "the cat", the comparison of the final "t" with a
|
|
||||||
following character cannot take place, so a partial match is found.
|
|
||||||
However, normal matching carries on, and \b matches at the end of the
|
|
||||||
subject when the last character is a letter, so a complete match is
|
|
||||||
found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
|
|
||||||
PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
|
|
||||||
then the partial match takes precedence.
|
|
||||||
|
|
||||||
|
|
||||||
EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
|
|
||||||
|
|
||||||
If the partial_soft (or ps) modifier is present on a pcre2test data
|
|
||||||
line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
|
|
||||||
run of pcre2test that uses the date example quoted above:
|
|
||||||
|
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
||||||
data> 25jun04\=ps
|
|
||||||
0: 25jun04
|
|
||||||
1: jun
|
|
||||||
data> 25dec3\=ps
|
|
||||||
Partial match: 25dec3
|
|
||||||
data> 3ju\=ps
|
|
||||||
Partial match: 3ju
|
|
||||||
data> 3juj\=ps
|
|
||||||
No match
|
|
||||||
data> j\=ps
|
|
||||||
No match
|
|
||||||
|
|
||||||
The first data string is matched completely, so pcre2test shows the
|
|
||||||
matched substrings. The remaining four strings do not match the com-
|
|
||||||
plete pattern, but the first two are partial matches. Similar output is
|
|
||||||
obtained if DFA matching is used.
|
|
||||||
|
|
||||||
If the partial_hard (or ph) modifier is present on a pcre2test data
|
|
||||||
line, the PCRE2_PARTIAL_HARD option is set for the match.
|
|
||||||
|
|
||||||
|
|
||||||
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
||||||
|
|
||||||
When a partial match has been found using a DFA matching function, it
|
When a partial match has been found using the DFA matching function, it
|
||||||
is possible to continue the match by providing additional subject data
|
is possible to continue the match by providing additional subject data
|
||||||
and calling the function again with the same compiled regular expres-
|
and calling the function again with the same compiled regular expres-
|
||||||
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
|
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
|
||||||
same working space as before, because this is where details of the pre-
|
same working space as before, because this is where details of the pre-
|
||||||
vious partial match are stored. Here is an example using pcre2test:
|
vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or
|
||||||
|
PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial
|
||||||
|
matching over multiple segments. Here is an example using pcre2test:
|
||||||
|
|
||||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||||
data> 23ja\=dfa,ps
|
data> 23ja\=dfa,ps
|
||||||
|
@ -5889,146 +5975,15 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
||||||
data> n05\=dfa,dfa_restart
|
data> n05\=dfa,dfa_restart
|
||||||
0: n05
|
0: n05
|
||||||
|
|
||||||
The first call has "23ja" as the subject, and requests partial match-
|
The first call has "23ja" as the subject, and requests partial match-
|
||||||
ing; the second call has "n05" as the subject for the continued
|
ing; the second call has "n05" as the subject for the continued
|
||||||
(restarted) match. Notice that when the match is complete, only the
|
(restarted) match. Notice that when the match is complete, only the
|
||||||
last part is shown; PCRE2 does not retain the previously partially-
|
last part is shown; PCRE2 does not retain the previously partially-
|
||||||
matched string. It is up to the calling program to do that if it needs
|
matched string. It is up to the calling program to do that if it needs
|
||||||
to.
|
to. This means that, for an unanchored pattern, if a continued match
|
||||||
|
fails, it is not possible to try again at a new starting point. All
|
||||||
That means that, for an unanchored pattern, if a continued match fails,
|
this facility is capable of doing is continuing with the previous match
|
||||||
it is not possible to try again at a new starting point. All this fa-
|
attempt. For example, consider this pattern:
|
||||||
cility is capable of doing is continuing with the previous match at-
|
|
||||||
tempt. In the previous example, if the second set of data is "ug23" the
|
|
||||||
result is no match, even though there would be a match for "aug23" if
|
|
||||||
the entire string were given at once. Depending on the application,
|
|
||||||
this may or may not be what you want. The only way to allow for start-
|
|
||||||
ing again at the next character is to retain the matched part of the
|
|
||||||
subject and try a new complete match.
|
|
||||||
|
|
||||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
|
||||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments.
|
|
||||||
This facility can be used to pass very long subject strings to the DFA
|
|
||||||
matching functions.
|
|
||||||
|
|
||||||
|
|
||||||
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|
||||||
|
|
||||||
Unlike the DFA function, it is not possible to restart the previous
|
|
||||||
match with a new segment of data when using pcre2_match(). Instead, new
|
|
||||||
data must be added to the previous subject string, and the entire match
|
|
||||||
re-run, starting from the point where the partial match occurred. Ear-
|
|
||||||
lier data can be discarded.
|
|
||||||
|
|
||||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
|
|
||||||
not treat the end of a segment as the end of the subject when matching
|
|
||||||
\z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
|
|
||||||
dates:
|
|
||||||
|
|
||||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
|
||||||
data> The date is 23ja\=ph
|
|
||||||
Partial match: 23ja
|
|
||||||
|
|
||||||
At this stage, an application could discard the text preceding "23ja",
|
|
||||||
add on text from the next segment, and call the matching function
|
|
||||||
again. Unlike the DFA matching function, the entire matching string
|
|
||||||
must always be available, and the complete matching process occurs for
|
|
||||||
each call, so more memory and more processing time are needed.
|
|
||||||
|
|
||||||
|
|
||||||
ISSUES WITH MULTI-SEGMENT MATCHING
|
|
||||||
|
|
||||||
Certain types of pattern may give problems with multi-segment matching,
|
|
||||||
whichever matching function is used.
|
|
||||||
|
|
||||||
1. If the pattern contains a test for the beginning of a line, you need
|
|
||||||
to pass the PCRE2_NOTBOL option when the subject string for any call
|
|
||||||
does not start at the beginning of a line. There is also a PCRE2_NOTEOL option,
|
|
||||||
but in practice when doing multi-segment matching you should be
|
|
||||||
using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
|
|
||||||
|
|
||||||
2. If a pattern contains a lookbehind assertion, characters that pre-
|
|
||||||
cede the start of the partial match may have been inspected during the
|
|
||||||
matching process. When using pcre2_match(), sufficient characters must
|
|
||||||
be retained for the next match attempt. You can ensure that enough
|
|
||||||
characters are retained by doing the following:
|
|
||||||
|
|
||||||
Before doing any matching, find the length of the longest lookbehind in
|
|
||||||
the pattern by calling pcre2_pattern_info() with the
|
|
||||||
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
|
|
||||||
characters, not code units. After a partial match, moving back from the
|
|
||||||
ovector[0] offset in the subject by the number of characters given for
|
|
||||||
the maximum lookbehind gets you to the earliest character that must be
|
|
||||||
retained. In a non-UTF or a 32-bit situation, moving back is just a
|
|
||||||
subtraction, but in UTF-8 or UTF-16 you have to count characters while
|
|
||||||
moving back through the code units.
|
|
||||||
|
|
||||||
Characters before the point you have now reached can be discarded, and
|
|
||||||
after the next segment has been added to what is retained, you should
|
|
||||||
run the next match with the startoffset argument set so that the match
|
|
||||||
begins at the same point as before.
|
|
||||||
|
|
||||||
For example, if the pattern "(?<=123)abc" is partially matched against
|
|
||||||
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
|
|
||||||
mum lookbehind count is 3, so all characters before offset 2 can be
|
|
||||||
discarded. The value of startoffset for the next match should be 3.
|
|
||||||
When pcre2test displays a partial match, it indicates the lookbehind
|
|
||||||
characters with '<' characters if the "allusedtext" modifier is set:
|
|
||||||
|
|
||||||
re> "(?<=123)abc"
|
|
||||||
data> xx123ab\=ph,allusedtext
|
|
||||||
Partial match: 123ab
|
|
||||||
<<< However, the "allusedtext" modifier is not avail-
|
|
||||||
able for JIT matching, because JIT matching does not maintain the first
|
|
||||||
and last consulted characters.
|
|
||||||
|
|
||||||
3. Matching a subject string that is split into multiple segments may
|
|
||||||
not always produce exactly the same result as matching over one single
|
|
||||||
long string when PCRE2_PARTIAL_SOFT is used. The section "Partial
|
|
||||||
Matching and Word Boundaries" above describes an issue that arises if
|
|
||||||
the pattern ends with \b or \B. Another kind of difference may occur
|
|
||||||
when there are multiple matching possibilities, because (for PCRE2_PAR-
|
|
||||||
TIAL_SOFT) a partial match result is given only when there are no com-
|
|
||||||
pleted matches. This means that as soon as the shortest match has been
|
|
||||||
found, continuation to a new subject segment is no longer possible.
|
|
||||||
Consider this pcre2test example:
|
|
||||||
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\=ps
|
|
||||||
0: dog
|
|
||||||
data> do\=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\=ps,dfa,dfa_restart
|
|
||||||
0: g
|
|
||||||
data> dogsbody\=dfa
|
|
||||||
0: dogsbody
|
|
||||||
1: dog
|
|
||||||
|
|
||||||
The first data line passes the string "dogsb" to a standard matching
|
|
||||||
function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
|
|
||||||
a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
|
|
||||||
because the shorter string "dog" is a complete match. Similarly, when
|
|
||||||
the subject is presented to a DFA matching function in several parts
|
|
||||||
("do" and "gsb" being the first two) the match stops when "dog" has
|
|
||||||
been found, and it is not possible to continue. On the other hand, if
|
|
||||||
"dogsbody" is presented as a single string, a DFA matching function
|
|
||||||
finds both matches.
|
|
||||||
|
|
||||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
|
|
||||||
matching multi-segment data. The example above then behaves differ-
|
|
||||||
ently:
|
|
||||||
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\=ph
|
|
||||||
Partial match: dogsb
|
|
||||||
data> do\=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\=ph,dfa,dfa_restart
|
|
||||||
Partial match: gsb
|
|
||||||
|
|
||||||
4. Patterns that contain alternatives at the top level which do not all
|
|
||||||
start with the same pattern item may not work as expected when
|
|
||||||
PCRE2_DFA_RESTART is used. For example, consider this pattern:
|
|
||||||
|
|
||||||
1234|3789
|
1234|3789
|
||||||
|
|
||||||
|
@ -6037,29 +5992,16 @@ ISSUES WITH MULTI-SEGMENT MATCHING
|
||||||
the second alternative, because such a match does not start at the same
|
the second alternative, because such a match does not start at the same
|
||||||
point in the subject string. Attempting to continue with the string
|
point in the subject string. Attempting to continue with the string
|
||||||
"7890" does not yield a match because only those alternatives that
|
"7890" does not yield a match because only those alternatives that
|
||||||
match at one point in the subject are remembered. The problem arises
|
match at one point in the subject are remembered. Depending on the ap-
|
||||||
because the start of the second alternative matches within the first
|
plication, this may or may not be what you want.
|
||||||
alternative. There is no problem with anchored patterns or patterns
|
|
||||||
such as:
|
|
||||||
|
|
||||||
1234|ABCD
|
If you do want to allow for starting again at the next character, one
|
||||||
|
way of doing it is to retain the matched part of the segment and try a
|
||||||
where no string can be a partial match for both alternatives. This is
|
new complete match, as described for pcre2_match() above. Another pos-
|
||||||
not a problem if a standard matching function is used, because the en-
|
sibility is to work with two buffers. If a partial match at offset n in
|
||||||
tire match has to be rerun each time:
|
the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
|
||||||
|
used on the second buffer, you can then try a new match starting at
|
||||||
re> /1234|3789/
|
offset n+1 in the first buffer.
|
||||||
data> ABC123\=ph
|
|
||||||
Partial match: 123
|
|
||||||
data> 1237890
|
|
||||||
0: 3789
|
|
||||||
|
|
||||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of
|
|
||||||
re-running the entire match can also be used with the DFA matching
|
|
||||||
function. Another possibility is to work with two buffers. If a partial
|
|
||||||
match at offset n in the first buffer is followed by "no match" when
|
|
||||||
PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
|
|
||||||
match starting at offset n+1 in the first buffer.
|
|
||||||
|
|
||||||
|
|
||||||
AUTHOR
|
AUTHOR
|
||||||
|
@ -6071,7 +6013,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 22 July 2019
|
Last updated: 07 August 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,73 +1,107 @@
|
||||||
.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
|
.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions
|
PCRE2 - Perl-compatible regular expressions
|
||||||
.SH "PARTIAL MATCHING IN PCRE2"
|
.SH "PARTIAL MATCHING IN PCRE2"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
In normal use of PCRE2, if the subject string that is passed to a matching
|
In normal use of PCRE2, if there is a match up to the end of a subject string,
|
||||||
function matches as far as it goes, but is too short to match the entire
|
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
|
||||||
pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
|
is returned, just like any other failing match. There are circumstances where
|
||||||
might be helpful to distinguish this case from other cases in which there is no
|
it might be helpful to distinguish this "partial match" case.
|
||||||
match.
|
|
||||||
.P
|
.P
|
||||||
Consider, for example, an application where a human is required to type in data
|
One example is an application where the subject string is very long, and not
|
||||||
for a field with specific formatting requirements. An example might be a date
|
all available at once. The requirement here is to be able to do the matching
|
||||||
in the form \fIddmmmyy\fP, defined by this pattern:
|
segment by segment, but special action is needed when a matched substring spans
|
||||||
.sp
|
the boundary between two segments.
|
||||||
^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
|
|
||||||
.sp
|
|
||||||
If the application sees the user's keystrokes one by one, and can check that
|
|
||||||
what has been typed so far is potentially valid, it is able to raise an error
|
|
||||||
as soon as a mistake is made, by beeping and not reflecting the character that
|
|
||||||
has been typed, for example. This immediate feedback is likely to be a better
|
|
||||||
user interface than a check that is delayed until the entire string has been
|
|
||||||
entered. Partial matching can also be useful when the subject string is very
|
|
||||||
long and is not all available at once, as discussed below.
|
|
||||||
.P
|
.P
|
||||||
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
Another example is checking a user input string as it is typed, to ensure that
|
||||||
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
|
it conforms to a required format. Invalid characters can be immediately
|
||||||
The difference between the two options is whether or not a partial match is
|
diagnosed and rejected, giving instant feedback.
|
||||||
preferred to an alternative complete match, though the details differ between
|
|
||||||
the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
|
|
||||||
takes precedence.
|
|
||||||
.P
|
.P
|
||||||
If you want to use partial matching with just-in-time optimized code, you must
|
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
|
||||||
call \fBpcre2_jit_compile()\fP with one or both of these options:
|
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
|
||||||
|
options when calling a matching function. The difference between the two
|
||||||
|
options is whether or not a partial match is preferred to an alternative
|
||||||
|
complete match, though the details differ between the two types of matching
|
||||||
|
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||||
|
.P
|
||||||
|
If you want to use partial matching with just-in-time optimized code, as well
|
||||||
|
as setting a partial match option for the matching function, you must also call
|
||||||
|
\fBpcre2_jit_compile()\fP with one or both of these options:
|
||||||
.sp
|
.sp
|
||||||
PCRE2_JIT_PARTIAL_SOFT
|
|
||||||
PCRE2_JIT_PARTIAL_HARD
|
PCRE2_JIT_PARTIAL_HARD
|
||||||
|
PCRE2_JIT_PARTIAL_SOFT
|
||||||
.sp
|
.sp
|
||||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
||||||
matches on the same pattern. If the appropriate JIT mode has not been compiled,
|
matches on the same pattern. Separate code is compiled for each mode. If the
|
||||||
interpretive matching code is used.
|
appropriate JIT mode has not been compiled, interpretive matching code is used.
|
||||||
.P
|
.P
|
||||||
Setting a partial matching option disables two of PCRE2's standard
|
Setting a partial matching option disables two of PCRE2's standard
|
||||||
optimizations. PCRE2 remembers the last literal code unit in a pattern, and
|
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
|
||||||
abandons matching immediately if it is not present in the subject string. This
|
and abandons matching immediately if it is not present in the subject string.
|
||||||
optimization cannot be used for a subject string that might match only
|
This optimization cannot be used for a subject string that might match only
|
||||||
partially. PCRE2 also knows the minimum length of a matching string, and does
|
partially. PCRE2 also remembers a minimum length of a matching string, and does
|
||||||
not bother to run the matching function on shorter strings. This optimization
|
not bother to run the matching function on shorter strings. This optimization
|
||||||
is also disabled for partial matching.
|
is also disabled for partial matching.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "REQUIREMENTS FOR A PARTIAL MATCH"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
A possible partial match occurs during matching when the end of the subject
|
||||||
|
string is reached successfully, but either more characters are needed to
|
||||||
|
complete the match, or the addition of more characters might change what is
|
||||||
|
matched.
|
||||||
|
.P
|
||||||
|
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
|
||||||
|
definitely needed to complete a match. In this case both hard and soft matching
|
||||||
|
options yield a partial match.
|
||||||
|
.P
|
||||||
|
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
|
||||||
|
can be found, but the addition of more characters might change what is
|
||||||
|
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
|
||||||
|
PCRE2_PARTIAL_SOFT returns the complete match.
|
||||||
|
.P
|
||||||
|
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
|
||||||
|
pattern item is \ez, \eZ, \eb, \eB, or $ there is always a partial match.
|
||||||
|
Otherwise, for both options, the next pattern item must be one that inspects a
|
||||||
|
character, and at least one of the following must be true:
|
||||||
|
.P
|
||||||
|
(1) At least one character has already been inspected. An inspected character
|
||||||
|
need not form part of the final matched string; lookbehind assertions and the
|
||||||
|
\eK escape sequence provide ways of inspecting characters before the start of a
|
||||||
|
matched string.
|
||||||
|
.P
|
||||||
|
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||||
|
exists in case there is a lookbehind that inspects characters before the start
|
||||||
|
of the match.
|
||||||
|
.P
|
||||||
|
(3) There is a special case when the whole pattern can match an empty string.
|
||||||
|
When the starting point is at the end of the subject, the empty string match is
|
||||||
|
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
|
||||||
|
conditions is true, it is returned. However, because adding more characters
|
||||||
|
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
|
||||||
|
which in this case means "there is going to be a match at this point, but until
|
||||||
|
some more characters are added, we do not know if it will be an empty string or
|
||||||
|
something longer".
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "PARTIAL MATCHING USING pcre2_match()"
|
.SH "PARTIAL MATCHING USING pcre2_match()"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
|
When a partial matching option is set, the result of calling
|
||||||
subject string is reached successfully, but matching cannot continue because
|
\fBpcre2_match()\fP can be one of the following:
|
||||||
more characters are needed, and in addition, either at least one character in
|
.TP 2
|
||||||
the subject has been inspected or the pattern contains a lookbehind, or (when
|
\fBA successful match\fP
|
||||||
PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
|
A complete match has been found, starting and ending within this subject.
|
||||||
inspected character need not form part of the final matched string; lookbehind
|
.TP
|
||||||
assertions and the \eK escape sequence provide ways of inspecting characters
|
\fBPCRE2_ERROR_NOMATCH\fP
|
||||||
before the start of a matched string.
|
No match can start anywhere in this subject.
|
||||||
.P
|
.TP
|
||||||
The three additional requirements define the cases where adding more characters
|
\fBPCRE2_ERROR_PARTIAL\fP
|
||||||
to the existing subject may complete the same match that would occur if they
|
Adding more characters may result in a complete match that uses one or more
|
||||||
had all been present in the first place. Without these conditions there would
|
characters from the end of this subject.
|
||||||
be a partial match of an empty string at the end of the subject for all
|
|
||||||
unanchored patterns (and also for anchored patterns if the subject itself is
|
|
||||||
empty).
|
|
||||||
.P
|
.P
|
||||||
When a partial match is returned, the first two elements in the ovector point
|
When a partial match is returned, the first two elements in the ovector point
|
||||||
to the portion of the subject that was matched, but the values in the rest of
|
to the portion of the subject that was matched, but the values in the rest of
|
||||||
|
@ -83,24 +117,6 @@ is "456abc12", a partial match is found for the string "abc12", because all
|
||||||
these characters are needed for a subsequent re-match with additional
|
these characters are needed for a subsequent re-match with additional
|
||||||
characters.
|
characters.
|
||||||
.P
|
.P
|
||||||
What happens when a partial match is identified depends on which of the two
|
|
||||||
partial matching options is set.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial
|
|
||||||
match, the partial match is remembered, but matching continues as normal, and
|
|
||||||
other alternatives in the pattern are tried. If no complete match can be found,
|
|
||||||
PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
|
|
||||||
.P
|
|
||||||
This option is "soft" because it prefers a complete match over a partial match.
|
|
||||||
All the various matching items in a pattern behave as if the subject string is
|
|
||||||
potentially complete. For example, \ez, \eZ, and $ match at the end of the
|
|
||||||
subject, as normal, and for \eb and \eB the end of the subject is treated as a
|
|
||||||
non-alphanumeric.
|
|
||||||
.P
|
|
||||||
If there is more than one partial match, the first one that was found provides
|
If there is more than one partial match, the first one that was found provides
|
||||||
the data that is returned. Consider this pattern:
|
the data that is returned. Consider this pattern:
|
||||||
.sp
|
.sp
|
||||||
|
@ -109,27 +125,32 @@ the data that is returned. Consider this pattern:
|
||||||
If this is matched against the subject string "abc123dog", both alternatives
|
If this is matched against the subject string "abc123dog", both alternatives
|
||||||
fail to match, but the end of the subject is reached during matching, so
|
fail to match, but the end of the subject is reached during matching, so
|
||||||
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
||||||
"123dog" as the first partial match that was found. (In this example, there are
|
"123dog" as the first partial match. (In this example, there are two partial
|
||||||
two partial matches, because "dog" on its own partially matches the second
|
matches, because "dog" on its own partially matches the second alternative.)
|
||||||
alternative.)
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
|
.SS "How a partial match is processed by pcre2_match()"
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is
|
|
||||||
returned as soon as a partial match is found, without continuing to search for
|
|
||||||
possible complete matches. This option is "hard" because it prefers an earlier
|
|
||||||
partial match over a later complete match. For this reason, the assumption is
|
|
||||||
made that the end of the supplied subject string may not be the true end of the
|
|
||||||
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
|
|
||||||
of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
|
|
||||||
characters have been inspected.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SS "Comparing hard and soft partial matching"
|
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
What happens when a partial match is identified depends on which of the two
|
||||||
|
partial matching options is set.
|
||||||
|
.P
|
||||||
|
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||||
|
partial match is found, without continuing to search for possible complete
|
||||||
|
matches. This option is "hard" because it prefers an earlier partial match over
|
||||||
|
a later complete match. For this reason, the assumption is made that the end of
|
||||||
|
the supplied subject string is not the true end of the available data, which is
|
||||||
|
why \ez, \eZ, \eb, \eB, and $ always give a partial match.
|
||||||
|
.P
|
||||||
|
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||||
|
continues as normal, and other alternatives in the pattern are tried. If no
|
||||||
|
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||||
|
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
|
||||||
|
over a partial match. All the various matching items in a pattern behave as if
|
||||||
|
the subject string is potentially complete; \ez, \eZ, and $ match at the end of
|
||||||
|
the subject, as normal, and for \eb and \eB the end of the subject is treated
|
||||||
|
as a non-alphanumeric.
|
||||||
|
.P
|
||||||
The difference between the two partial matching options can be illustrated by a
|
The difference between the two partial matching options can be illustrated by a
|
||||||
pattern such as:
|
pattern such as:
|
||||||
.sp
|
.sp
|
||||||
|
@ -154,25 +175,129 @@ The second pattern will never match "dogsbody", because it will always find the
|
||||||
shorter match first.
|
shorter match first.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SS "Example of partial matching using pcre2test"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The \fBpcre2test\fP data modifiers \fBpartial_hard\fP (or \fBph\fP) and
|
||||||
|
\fBpartial_soft\fP (or \fBps\fP) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
|
||||||
|
respectively, when calling \fBpcre2_match()\fP. Here is a run of
|
||||||
|
\fBpcre2test\fP using a pattern that matches the whole subject in the form of a
|
||||||
|
date:
|
||||||
|
.sp
|
||||||
|
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||||
|
data> 25dec3\e=ph
|
||||||
|
Partial match: 25dec3
|
||||||
|
data> 3ju\e=ph
|
||||||
|
Partial match: 3ju
|
||||||
|
data> 3juj\e=ph
|
||||||
|
No match
|
||||||
|
.sp
|
||||||
|
This example gives the same results for both hard and soft partial matching
|
||||||
|
options. Here is an example where there is a difference:
|
||||||
|
.sp
|
||||||
|
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||||
|
data> 25jun04\e=ps
|
||||||
|
0: 25jun04
|
||||||
|
1: jun
|
||||||
|
data> 25jun04\e=ph
|
||||||
|
Partial match: 25jun04
|
||||||
|
.sp
|
||||||
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||||
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||||
|
there is only a partial match.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||||
|
over time, features (including partial matching) that make multi-segment
|
||||||
|
matching possible have been added. The string is searched segment by segment by
|
||||||
|
calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
|
||||||
|
results that would happen if the entire string were available for searching.
|
||||||
|
.P
|
||||||
|
Special logic must be implemented to handle a matched substring that spans a
|
||||||
|
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||||
|
partial match at the end of a segment whenever there is the possibility of
|
||||||
|
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||||
|
also be set for all but the first segment.
|
||||||
|
.P
|
||||||
|
When a partial match occurs, the next segment must be added to the current
|
||||||
|
subject and the match re-run, using the \fIstartoffset\fP argument of
|
||||||
|
\fBpcre2_match()\fP to begin at the point where the partial match started.
|
||||||
|
Multi-segment matching is usually used to search for substrings in the middle
|
||||||
|
of very long sequences, so the patterns are normally not anchored. For example:
|
||||||
|
.sp
|
||||||
|
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
||||||
|
data> ...the date is 23ja\e=ph
|
||||||
|
Partial match: 23ja
|
||||||
|
data> ...the date is 23jan19 and on that day...\e=offset=15
|
||||||
|
0: 23jan19
|
||||||
|
1: jan
|
||||||
|
.sp
|
||||||
|
Note the use of the \fBoffset\fP modifier to start the new match where the
|
||||||
|
partial match was found.
|
||||||
|
.P
|
||||||
|
In this simple example, the next segment was just added to the one in which the
|
||||||
|
partial match was found. However, if there are memory constraints, it may be
|
||||||
|
necessary to discard text that precedes the partial match before adding the
|
||||||
|
next segment. In cases such as the above, where the pattern does not contain
|
||||||
|
any lookbehinds, it is sufficient to retain only the partially matched
|
||||||
|
substring. However, if a pattern contains a lookbehind assertion, characters
|
||||||
|
that precede the start of the partial match may have been inspected during the
|
||||||
|
matching process.
|
||||||
|
.P
|
||||||
|
The only lookbehind information that is available is the length of the longest
|
||||||
|
lookbehind in a pattern. This may not, of course, be at the start of the
|
||||||
|
pattern, but retaining that many characters before the partial match is
|
||||||
|
sufficient, if not always strictly necessary. The way to do this is as follows:
|
||||||
|
.P
|
||||||
|
Before doing any matching, find the length of the longest lookbehind in the
|
||||||
|
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
||||||
|
option. Note that the resulting count is in characters, not code units. After a
|
||||||
|
partial match, moving back from the ovector[0] offset in the subject by the
|
||||||
|
number of characters given for the maximum lookbehind gets you to the earliest
|
||||||
|
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
||||||
|
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
||||||
|
while moving back through the code units. Characters before the point you have
|
||||||
|
now reached can be discarded.
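The character counting described above for UTF-8 can be sketched in C as
follows. This is only an illustration, not part of the PCRE2 API: the function
name utf8_retain_offset() is hypothetical, and the sketch assumes that the
subject is valid UTF-8 and that the starting offset (ovector[0]) lies on a
character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function). Starting at code unit offset
   "offset" in a valid UTF-8 subject, move back by up to "nchars" characters
   and return the code unit offset that is reached. The result is the offset
   of the earliest code unit that must be retained. */
size_t utf8_retain_offset(const unsigned char *subject, size_t offset,
                          size_t nchars)
{
    size_t pos = offset;

    assert(subject != NULL);
    while (nchars > 0 && pos > 0) {
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip back over them
           until the first byte of the character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

The count passed as nchars would be the value obtained from
PCRE2_INFO_MAXLOOKBEHIND. If fewer than nchars characters precede the offset,
the function stops at the start of the subject and returns 0.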
|
||||||
|
.P
|
||||||
|
For example, if the pattern "(?<=123)abc" is partially matched against the
|
||||||
|
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
||||||
|
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
||||||
|
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
||||||
|
displays a partial match, it indicates the lookbehind characters with '<'
|
||||||
|
characters if the \fBallusedtext\fP modifier is set:
|
||||||
|
.sp
|
||||||
|
re> "(?<=123)abc"
|
||||||
|
data> xx123ab\e=ph,allusedtext
|
||||||
|
Partial match: 123ab
|
||||||
|
<<<
|
||||||
|
.sp
|
||||||
|
Note that the \fBallusedtext\fP modifier is not available for JIT matching,
|
||||||
|
because JIT matching does not maintain the first and last consulted characters.
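The offset arithmetic in the "(?<=123)abc" example, together with the step of
discarding earlier text and adding the next segment, might look like the
following C sketch. It is an illustration under stated assumptions, not code
from PCRE2: the names retain_plan, plan_retention() and compact_and_append()
are hypothetical, and the sketch assumes a non-UTF 8-bit subject, so that
characters and code units are the same thing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: what to keep and where to restart. */
typedef struct {
    size_t retain_from;  /* first code unit of the old buffer to keep */
    size_t startoffset;  /* startoffset for the next match attempt,
                            relative to the compacted buffer */
} retain_plan;

/* partial_start is ovector[0] after a partial match; max_lookbehind is the
   value given by PCRE2_INFO_MAXLOOKBEHIND. Non-UTF subjects only. */
retain_plan plan_retention(size_t partial_start, size_t max_lookbehind)
{
    retain_plan plan;
    size_t back = max_lookbehind < partial_start ? max_lookbehind
                                                 : partial_start;
    plan.retain_from = partial_start - back;
    plan.startoffset = back;
    return plan;
}

/* Discard buf[0..retain_from), move the rest to the front, and append the
   next segment. Returns the new length, or (size_t)-1 if the arguments are
   inconsistent or the result does not fit in cap code units. */
size_t compact_and_append(char *buf, size_t len, size_t cap,
                          size_t retain_from,
                          const char *seg, size_t seg_len)
{
    size_t kept;

    assert(buf != NULL && seg != NULL);
    if (retain_from > len || len > cap)
        return (size_t)-1;
    kept = len - retain_from;
    if (seg_len > cap - kept)
        return (size_t)-1;
    memmove(buf, buf + retain_from, kept);
    memcpy(buf + kept, seg, seg_len);
    return kept + seg_len;
}
```

For the example above, plan_retention(5, 3) gives retain_from 2 and
startoffset 3. The compacted buffer then begins with "123ab", and the next
call to \fBpcre2_match()\fP would be made on the new buffer with that
startoffset.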
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
The DFA functions move along the subject string character by character, without
|
The DFA function moves along the subject string character by character, without
|
||||||
backtracking, searching for all possible matches simultaneously. If the end of
|
backtracking, searching for all possible matches simultaneously. If the end of
|
||||||
the subject is reached before the end of the pattern, there is the possibility
|
the subject is reached before the end of the pattern, there is the possibility
|
||||||
of a partial match, again provided that at least one character has been
|
of a partial match.
|
||||||
inspected.
|
|
||||||
.P
|
.P
|
||||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
||||||
have been no complete matches. Otherwise, the complete matches are returned.
|
have been no complete matches. Otherwise, the complete matches are returned.
|
||||||
However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
|
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
|
||||||
any complete matches. The portion of the string that was matched when the
|
complete matches. The portion of the string that was matched when the longest
|
||||||
longest partial match was found is set as the first matching string.
|
partial match was found is set as the first matching string.
|
||||||
.P
|
.P
|
||||||
Because the DFA functions always search for all possible matches, and there is
|
Because the DFA function always searches for all possible matches, and there is
|
||||||
no difference between greedy and ungreedy repetition, their behaviour is
|
no difference between greedy and ungreedy repetition, its behaviour is
|
||||||
different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
|
different from that of \fBpcre2_match()\fP. Consider the string "dog" matched
|
||||||
the string "dog" matched against the ungreedy pattern shown above:
|
against this ungreedy pattern:
|
||||||
.sp
|
.sp
|
||||||
/dog(sbody)??/
|
/dog(sbody)??/
|
||||||
.sp
|
.sp
|
||||||
|
@ -181,62 +306,17 @@ Whereas the standard function stops as soon as it finds the complete match for
|
||||||
returns that when PCRE2_PARTIAL_HARD is set.
|
returns that when PCRE2_PARTIAL_HARD is set.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
If a pattern ends with one of the sequences \eb or \eB, which test for word
|
|
||||||
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
|
|
||||||
results. Consider this pattern:
|
|
||||||
.sp
|
|
||||||
/\ebcat\eb/
|
|
||||||
.sp
|
|
||||||
This matches "cat", provided there is a word boundary at either end. If the
|
|
||||||
subject string is "the cat", the comparison of the final "t" with a following
|
|
||||||
character cannot take place, so a partial match is found. However, normal
|
|
||||||
matching carries on, and \eb matches at the end of the subject when the last
|
|
||||||
character is a letter, so a complete match is found. The result, therefore, is
|
|
||||||
\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
|
|
||||||
PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a
|
|
||||||
\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match.
|
|
||||||
Here is a run of \fBpcre2test\fP that uses the date example quoted above:
|
|
||||||
.sp
|
|
||||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
|
||||||
data> 25jun04\e=ps
|
|
||||||
0: 25jun04
|
|
||||||
1: jun
|
|
||||||
data> 25dec3\e=ps
|
|
||||||
Partial match: 25dec3
|
|
||||||
data> 3ju\e=ps
|
|
||||||
Partial match: 3ju
|
|
||||||
data> 3juj\e=ps
|
|
||||||
No match
|
|
||||||
data> j\e=ps
|
|
||||||
No match
|
|
||||||
.sp
|
|
||||||
The first data string is matched completely, so \fBpcre2test\fP shows the
|
|
||||||
matched substrings. The remaining four strings do not match the complete
|
|
||||||
pattern, but the first two are partial matches. Similar output is obtained
|
|
||||||
if DFA matching is used.
|
|
||||||
.P
|
|
||||||
If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a
|
|
||||||
\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
|
.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
When a partial match has been found using a DFA matching function, it is
|
When a partial match has been found using the DFA matching function, it is
|
||||||
possible to continue the match by providing additional subject data and calling
|
possible to continue the match by providing additional subject data and calling
|
||||||
the function again with the same compiled regular expression, this time setting
|
the function again with the same compiled regular expression, this time setting
|
||||||
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
||||||
because this is where details of the previous partial match are stored. Here is
|
because this is where details of the previous partial match are stored. You can
|
||||||
an example using \fBpcre2test\fP:
|
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
|
||||||
|
to continue partial matching over multiple segments. Here is an example using
|
||||||
|
\fBpcre2test\fP:
|
||||||
.sp
|
.sp
|
||||||
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
|
||||||
data> 23ja\e=dfa,ps
|
data> 23ja\e=dfa,ps
|
||||||
|
@ -248,136 +328,10 @@ The first call has "23ja" as the subject, and requests partial matching; the
|
||||||
second call has "n05" as the subject for the continued (restarted) match.
|
second call has "n05" as the subject for the continued (restarted) match.
|
||||||
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
||||||
not retain the previously partially-matched string. It is up to the calling
|
not retain the previously partially-matched string. It is up to the calling
|
||||||
program to do that if it needs to.
|
program to do that if it needs to. This means that, for an unanchored pattern,
|
||||||
.P
|
if a continued match fails, it is not possible to try again at a new starting
|
||||||
That means that, for an unanchored pattern, if a continued match fails, it is
|
point. All this facility is capable of doing is continuing with the previous
|
||||||
not possible to try again at a new starting point. All this facility is capable
|
match attempt. For example, consider this pattern:
|
||||||
of doing is continuing with the previous match attempt. In the previous
|
|
||||||
example, if the second set of data is "ug23" the result is no match, even
|
|
||||||
though there would be a match for "aug23" if the entire string were given at
|
|
||||||
once. Depending on the application, this may or may not be what you want.
|
|
||||||
The only way to allow for starting again at the next character is to retain the
|
|
||||||
matched part of the subject and try a new complete match.
|
|
||||||
.P
|
|
||||||
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
|
||||||
PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
|
|
||||||
facility can be used to pass very long subject strings to the DFA matching
|
|
||||||
functions.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
Unlike the DFA function, it is not possible to restart the previous match with
|
|
||||||
a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be
|
|
||||||
added to the previous subject string, and the entire match re-run, starting
|
|
||||||
from the point where the partial match occurred. Earlier data can be discarded.
|
|
||||||
.P
|
|
||||||
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
|
|
||||||
treat the end of a segment as the end of the subject when matching \ez, \eZ,
|
|
||||||
\eb, \eB, and $. Consider an unanchored pattern that matches dates:
|
|
||||||
.sp
|
|
||||||
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
|
|
||||||
data> The date is 23ja\e=ph
|
|
||||||
Partial match: 23ja
|
|
||||||
.sp
|
|
||||||
At this stage, an application could discard the text preceding "23ja", add on
|
|
||||||
text from the next segment, and call the matching function again. Unlike the
|
|
||||||
DFA matching function, the entire matching string must always be available,
|
|
||||||
and the complete matching process occurs for each call, so more memory and more
|
|
||||||
processing time are needed.
|
|
||||||
.
|
|
||||||
.
|
|
||||||
.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
|
|
||||||
.rs
|
|
||||||
.sp
|
|
||||||
Certain types of pattern may give problems with multi-segment matching,
|
|
||||||
whichever matching function is used.
|
|
||||||
.P
|
|
||||||
1. If the pattern contains a test for the beginning of a line, you need to pass
|
|
||||||
the PCRE2_NOTBOL option when the subject string for any call does not start at the
|
|
||||||
beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
|
|
||||||
doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
|
|
||||||
includes the effect of PCRE2_NOTEOL.
|
|
||||||
.P
|
|
||||||
2. If a pattern contains a lookbehind assertion, characters that precede the
|
|
||||||
start of the partial match may have been inspected during the matching process.
|
|
||||||
When using \fBpcre2_match()\fP, sufficient characters must be retained for the
|
|
||||||
next match attempt. You can ensure that enough characters are retained by doing
|
|
||||||
the following:
|
|
||||||
.P
|
|
||||||
Before doing any matching, find the length of the longest lookbehind in the
|
|
||||||
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
|
|
||||||
option. Note that the resulting count is in characters, not code units. After a
|
|
||||||
partial match, moving back from the ovector[0] offset in the subject by the
|
|
||||||
number of characters given for the maximum lookbehind gets you to the earliest
|
|
||||||
character that must be retained. In a non-UTF or a 32-bit situation, moving
|
|
||||||
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
|
|
||||||
while moving back through the code units.
|
|
||||||
.P
|
|
||||||
Characters before the point you have now reached can be discarded, and after
|
|
||||||
the next segment has been added to what is retained, you should run the next
|
|
||||||
match with the \fBstartoffset\fP argument set so that the match begins at the
|
|
||||||
same point as before.
|
|
||||||
.P
|
|
||||||
For example, if the pattern "(?<=123)abc" is partially matched against the
|
|
||||||
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
|
|
||||||
lookbehind count is 3, so all characters before offset 2 can be discarded. The
|
|
||||||
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
|
|
||||||
displays a partial match, it indicates the lookbehind characters with '<'
|
|
||||||
characters if the "allusedtext" modifier is set:
|
|
||||||
.sp
|
|
||||||
re> "(?<=123)abc"
|
|
||||||
data> xx123ab\e=ph,allusedtext
|
|
||||||
Partial match: 123ab
|
|
||||||
<<<
|
|
||||||
However, the "allusedtext" modifier is not available for JIT matching, because
|
|
||||||
JIT matching does not maintain the first and last consulted characters.
|
|
||||||
.P
|
|
||||||
3. Matching a subject string that is split into multiple segments may not
|
|
||||||
always produce exactly the same result as matching over one single long string
|
|
||||||
when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
|
|
||||||
Boundaries" above describes an issue that arises if the pattern ends with \eb
|
|
||||||
or \eB. Another kind of difference may occur when there are multiple matching
|
|
||||||
possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
|
|
||||||
only when there are no completed matches. This means that as soon as the
|
|
||||||
shortest match has been found, continuation to a new subject segment is no
|
|
||||||
longer possible. Consider this \fBpcre2test\fP example:
|
|
||||||
.sp
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\e=ps
|
|
||||||
0: dog
|
|
||||||
data> do\e=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\e=ps,dfa,dfa_restart
|
|
||||||
0: g
|
|
||||||
data> dogsbody\e=dfa
|
|
||||||
0: dogsbody
|
|
||||||
1: dog
|
|
||||||
.sp
|
|
||||||
The first data line passes the string "dogsb" to a standard matching function,
|
|
||||||
setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
|
|
||||||
for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
|
|
||||||
string "dog" is a complete match. Similarly, when the subject is presented to
|
|
||||||
a DFA matching function in several parts ("do" and "gsb" being the first two)
|
|
||||||
the match stops when "dog" has been found, and it is not possible to continue.
|
|
||||||
On the other hand, if "dogsbody" is presented as a single string, a DFA
|
|
||||||
matching function finds both matches.
|
|
||||||
.P
|
|
||||||
Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
|
|
||||||
multi-segment data. The example above then behaves differently:
|
|
||||||
.sp
|
|
||||||
re> /dog(sbody)?/
|
|
||||||
data> dogsb\e=ph
|
|
||||||
Partial match: dogsb
|
|
||||||
data> do\e=ps,dfa
|
|
||||||
Partial match: do
|
|
||||||
data> gsb\e=ph,dfa,dfa_restart
|
|
||||||
Partial match: gsb
|
|
||||||
.sp
|
|
||||||
4. Patterns that contain alternatives at the top level which do not all start
|
|
||||||
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
|
|
||||||
used. For example, consider this pattern:
|
|
||||||
.sp
|
.sp
|
||||||
1234|3789
|
1234|3789
|
||||||
.sp
|
.sp
|
||||||
|
@ -386,28 +340,15 @@ alternative is found at offset 3. There is no partial match for the second
|
||||||
alternative, because such a match does not start at the same point in the
|
alternative, because such a match does not start at the same point in the
|
||||||
subject string. Attempting to continue with the string "7890" does not yield a
|
subject string. Attempting to continue with the string "7890" does not yield a
|
||||||
match because only those alternatives that match at one point in the subject
|
match because only those alternatives that match at one point in the subject
|
||||||
are remembered. The problem arises because the start of the second alternative
|
are remembered. Depending on the application, this may or may not be what you
|
||||||
matches within the first alternative. There is no problem with anchored
|
want.
|
||||||
patterns or patterns such as:
|
.P
|
||||||
.sp
|
If you do want to allow for starting again at the next character, one way of
|
||||||
1234|ABCD
|
doing it is to retain the matched part of the segment and try a new complete
|
||||||
.sp
|
match, as described for \fBpcre2_match()\fP above. Another possibility is to
|
||||||
where no string can be a partial match for both alternatives. This is not a
|
work with two buffers. If a partial match at offset \fIn\fP in the first buffer
|
||||||
problem if a standard matching function is used, because the entire match has
|
is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
|
||||||
to be rerun each time:
|
you can then try a new match starting at offset \fIn+1\fP in the first buffer.
|
||||||
.sp
|
|
||||||
re> /1234|3789/
|
|
||||||
data> ABC123\e=ph
|
|
||||||
Partial match: 123
|
|
||||||
data> 1237890
|
|
||||||
0: 3789
|
|
||||||
.sp
|
|
||||||
Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
|
|
||||||
the entire match can also be used with the DFA matching function. Another
|
|
||||||
possibility is to work with two buffers. If a partial match at offset \fIn\fP
|
|
||||||
in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
|
|
||||||
the second buffer, you can then try a new match starting at offset \fIn+1\fP in
|
|
||||||
the first buffer.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH AUTHOR
|
.SH AUTHOR
|
||||||
|
@ -424,6 +365,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 22 July 2019
|
Last updated: 07 August 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|