Documentation update to clarify ovector usage with DFA matching.
parent
5ff1daffa0
commit
6c2fe9da99
|
@ -45,10 +45,16 @@ just once (except when processing lookaround assertions). This function is
|
||||||
<i>workspace</i> Points to a vector of ints used as working space
|
<i>workspace</i> Points to a vector of ints used as working space
|
||||||
<i>wscount</i> Number of elements in the vector
|
<i>wscount</i> Number of elements in the vector
|
||||||
</pre>
|
</pre>
|
||||||
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
|
The size of the output vector needed to contain all the results depends on the
|
||||||
up a callout function or specify the heap limit or the match or the recursion
|
number of simultaneous matches, not on the number of parentheses in the
|
||||||
depth limits. The <i>length</i> and <i>startoffset</i> values are code units, not
|
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
|
||||||
characters. The options are:
|
data block is therefore not advisable when using this function.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
A match context is needed only if you want to set up a callout function or
|
||||||
|
specify the heap limit or the match or the recursion depth limits. The
|
||||||
|
<i>length</i> and <i>startoffset</i> values are code units, not characters. The
|
||||||
|
options are:
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_ANCHORED Match only at the first position
|
PCRE2_ANCHORED Match only at the first position
|
||||||
PCRE2_COPY_MATCHED_SUBJECT
|
PCRE2_COPY_MATCHED_SUBJECT
|
||||||
|
|
|
@ -30,8 +30,9 @@ This function creates a new match data block, which is used for holding the
|
||||||
result of a match. The first argument specifies the number of pairs of offsets
|
result of a match. The first argument specifies the number of pairs of offsets
|
||||||
that are required. These form the "output vector" (ovector) within the match
|
that are required. These form the "output vector" (ovector) within the match
|
||||||
data block, and are used to identify the matched string and any captured
|
data block, and are used to identify the matched string and any captured
|
||||||
substrings. There is always one pair of offsets; if <b>ovecsize</b> is zero, it
|
substrings when matching with <b>pcre2_match()</b>, or a number of different
|
||||||
is treated as one.
|
matches at the same point when used with <b>pcre2_dfa_match()</b>. There is
|
||||||
|
always one pair of offsets; if <b>ovecsize</b> is zero, it is treated as one.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The second argument points to a general context, for custom memory management,
|
The second argument points to a general context, for custom memory management,
|
||||||
|
|
|
@ -26,12 +26,15 @@ SYNOPSIS
|
||||||
DESCRIPTION
|
DESCRIPTION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
This function creates a new match data block, which is used for holding the
|
This function creates a new match data block for holding the result of a match.
|
||||||
result of a match. The first argument points to a compiled pattern. The number
|
The first argument points to a compiled pattern. The number of capturing
|
||||||
of capturing parentheses within the pattern is used to compute the number of
|
parentheses within the pattern is used to compute the number of pairs of
|
||||||
pairs of offsets that are required in the match data block. These form the
|
offsets that are required in the match data block. These form the "output
|
||||||
"output vector" (ovector) within the match data block, and are used to identify
|
vector" (ovector) within the match data block, and are used to identify the
|
||||||
the matched string and any captured substrings.
|
matched string and any captured substrings when matching with
|
||||||
|
<b>pcre2_match()</b>. If you are using <b>pcre2_dfa_match()</b>, which uses the
|
||||||
|
output vector in a different way, you should use <b>pcre2_match_data_create()</b>
|
||||||
|
instead of this function.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The second argument points to a general context, for custom memory management,
|
The second argument points to a general context, for custom memory management,
|
||||||
|
|
|
@ -2512,20 +2512,31 @@ to an abstract format like Java or .NET serialization.
|
||||||
Information about a successful or unsuccessful match is placed in a match
|
Information about a successful or unsuccessful match is placed in a match
|
||||||
data block, which is an opaque structure that is accessed by function calls. In
|
data block, which is an opaque structure that is accessed by function calls. In
|
||||||
particular, the match data block contains a vector of offsets into the subject
|
particular, the match data block contains a vector of offsets into the subject
|
||||||
string that define the matched part of the subject and any substrings that were
|
string that define the matched parts of the subject. This is known as the
|
||||||
captured. This is known as the <i>ovector</i>.
|
<i>ovector</i>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
|
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
|
||||||
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
|
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
|
||||||
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
|
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
|
||||||
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
|
argument is the number of pairs of offsets in the <i>ovector</i>.
|
||||||
offsets is required to identify the string that matched the whole pattern, with
|
</P>
|
||||||
an additional pair for each captured substring. For example, a value of 4
|
<P>
|
||||||
creates enough space to record the matched portion of the subject plus three
|
When using <b>pcre2_match()</b>, one pair of offsets is required to identify the
|
||||||
captured substrings. A minimum of at least 1 pair is imposed by
|
string that matched the whole pattern, with an additional pair for each
|
||||||
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
|
captured substring. For example, a value of 4 creates enough space to record
|
||||||
matched string.
|
the matched portion of the subject plus three captured substrings.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When using <b>pcre2_dfa_match()</b> there may be multiple matched substrings of
|
||||||
|
different lengths at the same point in the subject. The ovector should be made
|
||||||
|
large enough to hold as many as are expected.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
A minimum of 1 pair is imposed by <b>pcre2_match_data_create()</b>, so
|
||||||
|
it is always possible to return the overall matched string in the case of
|
||||||
|
<b>pcre2_match()</b> or the longest match in the case of
|
||||||
|
<b>pcre2_dfa_match()</b>.
|
||||||
</P>
|
</P>
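The sizing rules in the preceding paragraphs come down to a little arithmetic, sketched here in plain C. The function names are hypothetical and only model the documented behaviour; they do not call into PCRE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pairs needed by pcre2_match(): one for the string that matched the whole
   pattern, plus one for each captured substring. */
uint32_t pairs_for_match(uint32_t capture_count)
{
    assert(capture_count < UINT32_MAX);   /* guard against wrap-around */
    return capture_count + 1;
}

/* Models the documented minimum: an ovecsize of zero is treated as one. */
uint32_t effective_pairs(uint32_t ovecsize)
{
    return ovecsize == 0 ? 1 : ovecsize;
}

/* Each pair holds two offsets (start and end), so the ovector has twice
   as many elements as it has pairs. */
size_t ovector_elements(uint32_t ovecsize)
{
    return 2 * (size_t)effective_pairs(ovecsize);
}
```

For instance, pairs_for_match(3) gives 4, which matches the example of a value of 4 recording the matched portion of the subject plus three captured substrings. For pcre2_dfa_match() no such formula applies, because the number of pairs wanted depends on how many matches are expected at one point in the subject.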
|
||||||
<P>
|
<P>
|
||||||
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
|
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
|
||||||
|
@ -2536,10 +2547,11 @@ pass NULL, which causes <b>malloc()</b> to be used.
|
||||||
<P>
|
<P>
|
||||||
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
||||||
pointer to a compiled pattern. The ovector is created to be exactly the right
|
pointer to a compiled pattern. The ovector is created to be exactly the right
|
||||||
size to hold all the substrings a pattern might capture. The second argument is
|
size to hold all the substrings a pattern might capture when matched using
|
||||||
again a pointer to a general context, but in this case if NULL is passed, the
|
<b>pcre2_match()</b>. You should not use this call when matching with
|
||||||
memory is obtained using the same allocator that was used for the compiled
|
<b>pcre2_dfa_match()</b>. The second argument is again a pointer to a general
|
||||||
pattern (custom or default).
|
context, but in this case if NULL is passed, the memory is obtained using the
|
||||||
|
same allocator that was used for the compiled pattern (custom or default).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A match data block can be used many times, with the same or different compiled
|
A match data block can be used many times, with the same or different compiled
|
||||||
|
@ -3982,16 +3994,16 @@ fail, this error is given.
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 04 November 2020
|
Last updated: 28 August 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2020 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -78,8 +78,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
|
||||||
If a leaf node is reached, a matching string has been found, and at that point
|
If a leaf node is reached, a matching string has been found, and at that point
|
||||||
the algorithm stops. Thus, if there is more than one possible match, this
|
the algorithm stops. Thus, if there is more than one possible match, this
|
||||||
algorithm returns the first one that it finds. Whether this is the shortest,
|
algorithm returns the first one that it finds. Whether this is the shortest,
|
||||||
the longest, or some intermediate length depends on the way the greedy and
|
the longest, or some intermediate length depends on the way the alternations
|
||||||
ungreedy repetition quantifiers are specified in the pattern.
|
and the greedy or ungreedy repetition quantifiers are specified in the
|
||||||
|
pattern.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Because it ends up with a single path through the tree, it is relatively
|
Because it ends up with a single path through the tree, it is relatively
|
||||||
|
@ -109,11 +110,17 @@ no more unterminated paths. At this point, terminated paths represent the
|
||||||
different matching possibilities (if there are none, the match has failed).
|
different matching possibilities (if there are none, the match has failed).
|
||||||
Thus, if there is more than one possible match, this algorithm finds all of
|
Thus, if there is more than one possible match, this algorithm finds all of
|
||||||
them, and in particular, it finds the longest. The matches are returned in
|
them, and in particular, it finds the longest. The matches are returned in
|
||||||
decreasing order of length. There is an option to stop the algorithm after the
|
the output vector in decreasing order of length. There is an option to stop the
|
||||||
first match (which is necessarily the shortest) is found.
|
algorithm after the first match (which is necessarily the shortest) is found.
|
||||||
</P>
|
</P>
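The layout described above can be illustrated with a small C sketch that walks an ovector filled in this way. It does not call PCRE2. The function names are hypothetical, and the sample offsets are hand-written to represent what the pattern cat(er(pillar)?)? would be expected to produce for the subject "caterpillar" (all matches start at the same point and are listed longest first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if all `count` pairs start at the same offset and their
   lengths strictly decrease, which is the DFA ordering; otherwise 0. */
int is_dfa_ordered(const size_t *ovector, int count)
{
    assert(ovector != NULL && count >= 0);
    for (int i = 1; i < count; i++) {
        size_t prev_len = ovector[2 * (i - 1) + 1] - ovector[2 * (i - 1)];
        size_t len = ovector[2 * i + 1] - ovector[2 * i];
        if (ovector[2 * i] != ovector[0] || len >= prev_len) return 0;
    }
    return 1;
}

/* Prints each match, longest first, and returns how many were printed. */
int print_matches(const char *subject, const size_t *ovector, int count)
{
    for (int i = 0; i < count; i++) {
        size_t start = ovector[2 * i];
        size_t end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, (int)(end - start), subject + start);
    }
    return count;
}

/* Hand-written sample data (an assumption for illustration): three
   matches at offset 0 with lengths 11, 5 and 3. */
int demo_caterpillar(void)
{
    const char *subject = "caterpillar";
    const size_t ovector[] = { 0, 11, 0, 5, 0, 3 };

    if (!is_dfa_ordered(ovector, 3)) return -1;
    return print_matches(subject, ovector, 3);
}
```

In real code the offsets would come from the match data block after a successful DFA match, and the count would be the value returned by the matching function.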
|
||||||
<P>
|
<P>
|
||||||
Note that all the matches that are found start at the same point in the
|
Note that the size of the vector needed to contain all the results depends on the
|
||||||
|
number of simultaneous matches, not on the number of parentheses in the
|
||||||
|
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
|
||||||
|
data block is therefore not advisable when doing DFA matching.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Note also that all the matches that are found start at the same point in the
|
||||||
subject. If the pattern
|
subject. If the pattern
|
||||||
<pre>
|
<pre>
|
||||||
cat(er(pillar)?)?
|
cat(er(pillar)?)?
|
||||||
|
@ -194,21 +201,14 @@ supported by <b>pcre2_dfa_match()</b>.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||||
<P>
|
<P>
|
||||||
Using the alternative matching algorithm provides the following advantages:
|
The main advantage of the alternative algorithm is that all possible matches
|
||||||
|
(at a single point in the subject) are automatically found, and in particular,
|
||||||
|
the longest match is found. To find more than one match at the same point using
|
||||||
|
the standard algorithm, you have to do kludgy things with callouts.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
1. All possible matches (at a single point in the subject) are automatically
|
Partial matching is possible with this algorithm, though it has some
|
||||||
found, and in particular, the longest match is found. To find more than one
|
limitations. The
|
||||||
match using the standard algorithm, you have to do kludgy things with
|
|
||||||
callouts.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
2. Because the alternative algorithm scans the subject string just once, and
|
|
||||||
never needs to backtrack (except for lookbehinds), it is possible to pass very
|
|
||||||
long subject strings to the matching function in several pieces, checking for
|
|
||||||
partial matching each time. Although it is also possible to do multi-segment
|
|
||||||
matching using the standard algorithm, by retaining partially matched
|
|
||||||
substrings, it is more complicated. The
|
|
||||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||||
documentation gives details of partial matching and discusses multi-segment
|
documentation gives details of partial matching and discusses multi-segment
|
||||||
matching.
|
matching.
|
||||||
|
@ -230,20 +230,23 @@ invalid UTF string are not supported.
|
||||||
3. Although atomic groups are supported, their use does not provide the
|
3. Although atomic groups are supported, their use does not provide the
|
||||||
performance advantage that it does for the standard algorithm.
|
performance advantage that it does for the standard algorithm.
|
||||||
</P>
|
</P>
|
||||||
|
<P>
|
||||||
|
4. JIT optimization is not supported.
|
||||||
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
<br>
|
<br>
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 May 2019
|
Last updated: 28 August 2021
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2021 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
103
doc/pcre2.txt
|
@ -2468,20 +2468,28 @@ THE MATCH DATA BLOCK
|
||||||
Information about a successful or unsuccessful match is placed in a
|
Information about a successful or unsuccessful match is placed in a
|
||||||
match data block, which is an opaque structure that is accessed by
|
match data block, which is an opaque structure that is accessed by
|
||||||
function calls. In particular, the match data block contains a vector
|
function calls. In particular, the match data block contains a vector
|
||||||
of offsets into the subject string that define the matched part of the
|
of offsets into the subject string that define the matched parts of the
|
||||||
subject and any substrings that were captured. This is known as the
|
subject. This is known as the ovector.
|
||||||
ovector.
|
|
||||||
|
|
||||||
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
|
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
|
||||||
you must create a match data block by calling one of the creation func-
|
you must create a match data block by calling one of the creation func-
|
||||||
tions above. For pcre2_match_data_create(), the first argument is the
|
tions above. For pcre2_match_data_create(), the first argument is the
|
||||||
number of pairs of offsets in the ovector. One pair of offsets is re-
|
number of pairs of offsets in the ovector.
|
||||||
quired to identify the string that matched the whole pattern, with an
|
|
||||||
additional pair for each captured substring. For example, a value of 4
|
When using pcre2_match(), one pair of offsets is required to identify
|
||||||
creates enough space to record the matched portion of the subject plus
|
the string that matched the whole pattern, with an additional pair for
|
||||||
three captured substrings. A minimum of at least 1 pair is imposed by
|
each captured substring. For example, a value of 4 creates enough space
|
||||||
pcre2_match_data_create(), so it is always possible to return the over-
|
to record the matched portion of the subject plus three captured sub-
|
||||||
all matched string.
|
strings.
|
||||||
|
|
||||||
|
When using pcre2_dfa_match() there may be multiple matched substrings
|
||||||
|
of different lengths at the same point in the subject. The ovector
|
||||||
|
should be made large enough to hold as many as are expected.
|
||||||
|
|
||||||
|
       A minimum of 1 pair is imposed by pcre2_match_data_create(),
|
||||||
|
so it is always possible to return the overall matched string in the
|
||||||
|
case of pcre2_match() or the longest match in the case of
|
||||||
|
pcre2_dfa_match().
|
||||||
|
|
||||||
The second argument of pcre2_match_data_create() is a pointer to a gen-
|
The second argument of pcre2_match_data_create() is a pointer to a gen-
|
||||||
eral context, which can specify custom memory management for obtaining
|
eral context, which can specify custom memory management for obtaining
|
||||||
|
@ -2490,10 +2498,12 @@ THE MATCH DATA BLOCK
|
||||||
|
|
||||||
For pcre2_match_data_create_from_pattern(), the first argument is a
|
For pcre2_match_data_create_from_pattern(), the first argument is a
|
||||||
pointer to a compiled pattern. The ovector is created to be exactly the
|
pointer to a compiled pattern. The ovector is created to be exactly the
|
||||||
right size to hold all the substrings a pattern might capture. The sec-
|
right size to hold all the substrings a pattern might capture when
|
||||||
ond argument is again a pointer to a general context, but in this case
|
matched using pcre2_match(). You should not use this call when matching
|
||||||
if NULL is passed, the memory is obtained using the same allocator that
|
with pcre2_dfa_match(). The second argument is again a pointer to a
|
||||||
was used for the compiled pattern (custom or default).
|
general context, but in this case if NULL is passed, the memory is ob-
|
||||||
|
tained using the same allocator that was used for the compiled pattern
|
||||||
|
(custom or default).
|
||||||
|
|
||||||
A match data block can be used many times, with the same or different
|
A match data block can be used many times, with the same or different
|
||||||
compiled patterns. You can extract information from a match data block
|
compiled patterns. You can extract information from a match data block
|
||||||
|
@ -3825,14 +3835,14 @@ SEE ALSO
|
||||||
AUTHOR
|
AUTHOR
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
|
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 04 November 2020
|
Last updated: 28 August 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -5635,8 +5645,8 @@ THE STANDARD MATCHING ALGORITHM
|
||||||
that point the algorithm stops. Thus, if there is more than one possi-
|
that point the algorithm stops. Thus, if there is more than one possi-
|
||||||
ble match, this algorithm returns the first one that it finds. Whether
|
ble match, this algorithm returns the first one that it finds. Whether
|
||||||
this is the shortest, the longest, or some intermediate length depends
|
this is the shortest, the longest, or some intermediate length depends
|
||||||
on the way the greedy and ungreedy repetition quantifiers are specified
|
on the way the alternations and the greedy or ungreedy repetition quan-
|
||||||
in the pattern.
|
tifiers are specified in the pattern.
|
||||||
|
|
||||||
Because it ends up with a single path through the tree, it is rela-
|
Because it ends up with a single path through the tree, it is rela-
|
||||||
tively straightforward for this algorithm to keep track of the sub-
|
tively straightforward for this algorithm to keep track of the sub-
|
||||||
|
@ -5665,12 +5675,18 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
||||||
represent the different matching possibilities (if there are none, the
|
represent the different matching possibilities (if there are none, the
|
||||||
match has failed). Thus, if there is more than one possible match,
|
match has failed). Thus, if there is more than one possible match,
|
||||||
this algorithm finds all of them, and in particular, it finds the long-
|
this algorithm finds all of them, and in particular, it finds the long-
|
||||||
est. The matches are returned in decreasing order of length. There is
|
est. The matches are returned in the output vector in decreasing order
|
||||||
an option to stop the algorithm after the first match (which is neces-
|
of length. There is an option to stop the algorithm after the first
|
||||||
sarily the shortest) is found.
|
match (which is necessarily the shortest) is found.
|
||||||
|
|
||||||
Note that all the matches that are found start at the same point in the
|
       Note that the size of the vector needed to contain all the results depends
|
||||||
subject. If the pattern
|
on the number of simultaneous matches, not on the number of parentheses
|
||||||
|
in the pattern. Using pcre2_match_data_create_from_pattern() to create
|
||||||
|
the match data block is therefore not advisable when doing DFA match-
|
||||||
|
ing.
|
||||||
|
|
||||||
|
Note also that all the matches that are found start at the same point
|
||||||
|
in the subject. If the pattern
|
||||||
|
|
||||||
cat(er(pillar)?)?
|
cat(er(pillar)?)?
|
||||||
|
|
||||||
|
@ -5746,50 +5762,45 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
||||||
|
|
||||||
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
||||||
|
|
||||||
Using the alternative matching algorithm provides the following advan-
|
The main advantage of the alternative algorithm is that all possible
|
||||||
tages:
|
matches (at a single point in the subject) are automatically found, and
|
||||||
|
in particular, the longest match is found. To find more than one match
|
||||||
1. All possible matches (at a single point in the subject) are automat-
|
at the same point using the standard algorithm, you have to do kludgy
|
||||||
ically found, and in particular, the longest match is found. To find
|
|
||||||
more than one match using the standard algorithm, you have to do kludgy
|
|
||||||
things with callouts.
|
things with callouts.
|
||||||
|
|
||||||
2. Because the alternative algorithm scans the subject string just
|
Partial matching is possible with this algorithm, though it has some
|
||||||
once, and never needs to backtrack (except for lookbehinds), it is pos-
|
limitations. The pcre2partial documentation gives details of partial
|
||||||
sible to pass very long subject strings to the matching function in
|
matching and discusses multi-segment matching.
|
||||||
several pieces, checking for partial matching each time. Although it is
|
|
||||||
also possible to do multi-segment matching using the standard algo-
|
|
||||||
rithm, by retaining partially matched substrings, it is more compli-
|
|
||||||
cated. The pcre2partial documentation gives details of partial matching
|
|
||||||
and discusses multi-segment matching.
|
|
||||||
|
|
||||||
|
|
||||||
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
||||||
|
|
||||||
The alternative algorithm suffers from a number of disadvantages:
|
The alternative algorithm suffers from a number of disadvantages:
|
||||||
|
|
||||||
1. It is substantially slower than the standard algorithm. This is
|
1. It is substantially slower than the standard algorithm. This is
|
||||||
partly because it has to search for all possible matches, but is also
|
partly because it has to search for all possible matches, but is also
|
||||||
because it is less susceptible to optimization.
|
because it is less susceptible to optimization.
|
||||||
|
|
||||||
2. Capturing parentheses, backreferences, script runs, and matching
|
2. Capturing parentheses, backreferences, script runs, and matching
|
||||||
within invalid UTF string are not supported.
|
within invalid UTF string are not supported.
|
||||||
|
|
||||||
3. Although atomic groups are supported, their use does not provide the
|
3. Although atomic groups are supported, their use does not provide the
|
||||||
performance advantage that it does for the standard algorithm.
|
performance advantage that it does for the standard algorithm.
|
||||||
|
|
||||||
|
4. JIT optimization is not supported.
|
||||||
|
|
||||||
|
|
||||||
AUTHOR
|
AUTHOR
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
|
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 May 2019
|
Last updated: 28 August 2021
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_DFA_MATCH 3 "16 October 2018" "PCRE2 10.33"
|
.TH PCRE2_DFA_MATCH 3 "28 August 2021" "PCRE2 10.38"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -33,10 +33,15 @@ just once (except when processing lookaround assertions). This function is
|
||||||
\fIworkspace\fP Points to a vector of ints used as working space
|
\fIworkspace\fP Points to a vector of ints used as working space
|
||||||
\fIwscount\fP Number of elements in the vector
|
\fIwscount\fP Number of elements in the vector
|
||||||
.sp
|
.sp
|
||||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
The size of the output vector needed to contain all the results depends on the
|
||||||
up a callout function or specify the heap limit or the match or the recursion
|
number of simultaneous matches, not on the number of parentheses in the
|
||||||
depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not
|
pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
|
||||||
characters. The options are:
|
data block is therefore not advisable when using this function.
|
||||||
|
.P
|
||||||
|
A match context is needed only if you want to set up a callout function or
|
||||||
|
specify the heap limit or the match or the recursion depth limits. The
|
||||||
|
\fIlength\fP and \fIstartoffset\fP values are code units, not characters. The
|
||||||
|
options are:
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ANCHORED Match only at the first position
|
PCRE2_ANCHORED Match only at the first position
|
||||||
PCRE2_COPY_MATCHED_SUBJECT
|
PCRE2_COPY_MATCHED_SUBJECT
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_MATCH_DATA_CREATE 3 "29 July 2015" "PCRE2 10.21"
|
.TH PCRE2_MATCH_DATA_CREATE 3 "28 August 2021" "PCRE2 10.38"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -18,8 +18,9 @@ This function creates a new match data block, which is used for holding the
|
||||||
result of a match. The first argument specifies the number of pairs of offsets
|
result of a match. The first argument specifies the number of pairs of offsets
|
||||||
that are required. These form the "output vector" (ovector) within the match
|
that are required. These form the "output vector" (ovector) within the match
|
||||||
data block, and are used to identify the matched string and any captured
|
data block, and are used to identify the matched string and any captured
|
||||||
substrings. There is always one pair of offsets; if \fBovecsize\fP is zero, it
|
substrings when matching with \fBpcre2_match()\fP, or a number of different
|
||||||
is treated as one.
|
matches at the same point when used with \fBpcre2_dfa_match()\fP. There is
|
||||||
|
always one pair of offsets; if \fBovecsize\fP is zero, it is treated as one.
|
||||||
.P
|
.P
|
||||||
The second argument points to a general context, for custom memory management,
|
The second argument points to a general context, for custom memory management,
|
||||||
or is NULL for system memory management. The result of the function is NULL if
|
or is NULL for system memory management. The result of the function is NULL if
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "29 July 2015" "PCRE2 10.21"
|
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "28 August 2021" "PCRE2 10.38"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -14,12 +14,15 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH DESCRIPTION
|
.SH DESCRIPTION
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
This function creates a new match data block, which is used for holding the
|
This function creates a new match data block for holding the result of a match.
|
||||||
result of a match. The first argument points to a compiled pattern. The number
|
The first argument points to a compiled pattern. The number of capturing
|
||||||
of capturing parentheses within the pattern is used to compute the number of
|
parentheses within the pattern is used to compute the number of pairs of
|
||||||
pairs of offsets that are required in the match data block. These form the
|
offsets that are required in the match data block. These form the "output
|
||||||
"output vector" (ovector) within the match data block, and are used to identify
|
vector" (ovector) within the match data block, and are used to identify the
|
||||||
the matched string and any captured substrings.
|
matched string and any captured substrings when matching with
|
||||||
|
\fBpcre2_match()\fP. If you are using \fBpcre2_dfa_match()\fP, which uses the
|
||||||
|
output vector in a different way, you should use \fBpcre2_match_data_create()\fP
|
||||||
|
instead of this function.
|
||||||
.P
|
.P
|
||||||
The second argument points to a general context, for custom memory management,
|
The second argument points to a general context, for custom memory management,
|
||||||
or is NULL to use the same memory allocator as was used for the compiled
|
or is NULL to use the same memory allocator as was used for the compiled
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "04 November 2020" "PCRE2 10.36"
|
.TH PCRE2API 3 "28 August 2021" "PCRE2 10.38"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -2490,19 +2490,27 @@ to an abstract format like Java or .NET serialization.
|
||||||
Information about a successful or unsuccessful match is placed in a match
|
Information about a successful or unsuccessful match is placed in a match
|
||||||
data block, which is an opaque structure that is accessed by function calls. In
|
data block, which is an opaque structure that is accessed by function calls. In
|
||||||
particular, the match data block contains a vector of offsets into the subject
|
particular, the match data block contains a vector of offsets into the subject
|
||||||
string that define the matched part of the subject and any substrings that were
|
string that define the matched parts of the subject. This is known as the
|
||||||
captured. This is known as the \fIovector\fP.
|
\fIovector\fP.
|
||||||
.P
|
.P
|
||||||
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
|
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
|
||||||
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
|
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
|
||||||
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
|
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
|
||||||
argument is the number of pairs of offsets in the \fIovector\fP. One pair of
|
argument is the number of pairs of offsets in the \fIovector\fP.
|
||||||
offsets is required to identify the string that matched the whole pattern, with
|
.P
|
||||||
an additional pair for each captured substring. For example, a value of 4
|
When using \fBpcre2_match()\fP, one pair of offsets is required to identify the
|
||||||
creates enough space to record the matched portion of the subject plus three
|
string that matched the whole pattern, with an additional pair for each
|
||||||
captured substrings. A minimum of at least 1 pair is imposed by
|
captured substring. For example, a value of 4 creates enough space to record
|
||||||
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall
|
the matched portion of the subject plus three captured substrings.
|
||||||
matched string.
|
.P
|
||||||
|
When using \fBpcre2_dfa_match()\fP there may be multiple matched substrings of
|
||||||
|
different lengths at the same point in the subject. The ovector should be made
|
||||||
|
large enough to hold as many as are expected.
|
||||||
|
.P
|
||||||
|
A minimum of 1 pair is imposed by \fBpcre2_match_data_create()\fP, so
|
||||||
|
it is always possible to return the overall matched string in the case of
|
||||||
|
\fBpcre2_match()\fP or the longest match in the case of
|
||||||
|
\fBpcre2_dfa_match()\fP.
|
||||||
.P
|
.P
|
||||||
The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
|
The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
|
||||||
general context, which can specify custom memory management for obtaining the
|
general context, which can specify custom memory management for obtaining the
|
||||||
|
@ -2511,10 +2519,11 @@ pass NULL, which causes \fBmalloc()\fP to be used.
|
||||||
.P
|
.P
|
||||||
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
|
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
|
||||||
pointer to a compiled pattern. The ovector is created to be exactly the right
|
pointer to a compiled pattern. The ovector is created to be exactly the right
|
||||||
size to hold all the substrings a pattern might capture. The second argument is
|
size to hold all the substrings a pattern might capture when matched using
|
||||||
again a pointer to a general context, but in this case if NULL is passed, the
|
\fBpcre2_match()\fP. You should not use this call when matching with
|
||||||
memory is obtained using the same allocator that was used for the compiled
|
\fBpcre2_dfa_match()\fP. The second argument is again a pointer to a general
|
||||||
pattern (custom or default).
|
context, but in this case if NULL is passed, the memory is obtained using the
|
||||||
|
same allocator that was used for the compiled pattern (custom or default).
|
||||||
.P
|
.P
|
||||||
A match data block can be used many times, with the same or different compiled
|
A match data block can be used many times, with the same or different compiled
|
||||||
patterns. You can extract information from a match data block after a match
|
patterns. You can extract information from a match data block after a match
|
||||||
|
@ -3991,7 +4000,7 @@ fail, this error is given.
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
.fi
|
.fi
|
||||||
.
|
.
|
||||||
|
@ -4000,6 +4009,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 04 November 2020
|
Last updated: 28 August 2021
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
|
.TH PCRE2MATCHING 3 "28 August 2021" "PCRE2 10.38"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 MATCHING ALGORITHMS"
|
.SH "PCRE2 MATCHING ALGORITHMS"
|
||||||
|
@ -61,8 +61,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
|
||||||
If a leaf node is reached, a matching string has been found, and at that point
|
If a leaf node is reached, a matching string has been found, and at that point
|
||||||
the algorithm stops. Thus, if there is more than one possible match, this
|
the algorithm stops. Thus, if there is more than one possible match, this
|
||||||
algorithm returns the first one that it finds. Whether this is the shortest,
|
algorithm returns the first one that it finds. Whether this is the shortest,
|
||||||
the longest, or some intermediate length depends on the way the greedy and
|
the longest, or some intermediate length depends on the way the alternations
|
||||||
ungreedy repetition quantifiers are specified in the pattern.
|
and the greedy or ungreedy repetition quantifiers are specified in the
|
||||||
|
pattern.
|
||||||
.P
|
.P
|
||||||
Because it ends up with a single path through the tree, it is relatively
|
Because it ends up with a single path through the tree, it is relatively
|
||||||
straightforward for this algorithm to keep track of the substrings that are
|
straightforward for this algorithm to keep track of the substrings that are
|
||||||
|
@ -91,10 +92,15 @@ no more unterminated paths. At this point, terminated paths represent the
|
||||||
different matching possibilities (if there are none, the match has failed).
|
different matching possibilities (if there are none, the match has failed).
|
||||||
Thus, if there is more than one possible match, this algorithm finds all of
|
Thus, if there is more than one possible match, this algorithm finds all of
|
||||||
them, and in particular, it finds the longest. The matches are returned in
|
them, and in particular, it finds the longest. The matches are returned in
|
||||||
decreasing order of length. There is an option to stop the algorithm after the
|
the output vector in decreasing order of length. There is an option to stop the
|
||||||
first match (which is necessarily the shortest) is found.
|
algorithm after the first match (which is necessarily the shortest) is found.
|
||||||
.P
|
.P
|
||||||
Note that all the matches that are found start at the same point in the
|
Note that the size of the vector needed to contain all the results depends on the
|
||||||
|
number of simultaneous matches, not on the number of parentheses in the
|
||||||
|
pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
|
||||||
|
data block is therefore not advisable when doing DFA matching.
|
||||||
|
.P
|
||||||
|
Note also that all the matches that are found start at the same point in the
|
||||||
subject. If the pattern
|
subject. If the pattern
|
||||||
.sp
|
.sp
|
||||||
cat(er(pillar)?)?
|
cat(er(pillar)?)?
|
||||||
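The sizing point made in this hunk can be illustrated with a small toy model. The sketch below does not call PCRE2; the function name toy_dfa_match_at and its fixed candidate list are assumptions made purely for illustration. It mimics the reporting convention described above: every match starting at one point is written to a vector of offset pairs, longest first, and the number of pairs needed is set by how many matches coexist, not by the two pairs of parentheses in cat(er(pillar)?)?. The return value of 0 when the vector is too small follows the behaviour documented for pcre2_dfa_match() elsewhere in the PCRE2 manual; treat that detail as background rather than something stated in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: this is NOT PCRE2 and does no real pattern matching.
 * The three candidates are the strings that cat(er(pillar)?)? can match,
 * listed longest first so results come out in decreasing order of length. */
static const char *const toy_candidates[] = { "caterpillar", "cater", "cat" };

/* Hypothetical helper. Tries each candidate at subject[start] and stores
 * (start, end) offset pairs in ovector, which has room for `pairs` pairs.
 * Returns the number of matches if they all fit, 0 if more matches were
 * found than there are pairs (the longest ones are still stored), and
 * -1 if nothing matched. */
int toy_dfa_match_at(const char *subject, size_t start,
                     size_t *ovector, size_t pairs)
{
    size_t subject_len = strlen(subject);
    size_t found = 0;

    assert(ovector != NULL || pairs == 0);
    if (start > subject_len)
        return -1;

    for (size_t i = 0; i < sizeof toy_candidates / sizeof toy_candidates[0]; i++) {
        size_t len = strlen(toy_candidates[i]);
        if (len <= subject_len - start &&
            strncmp(subject + start, toy_candidates[i], len) == 0) {
            if (found < pairs) {            /* store only while room remains */
                ovector[2 * found] = start;
                ovector[2 * found + 1] = start + len;
            }
            found++;
        }
    }

    if (found == 0)
        return -1;
    return found > pairs ? 0 : (int)found;
}
```

With the subject "the caterpillar" and a start offset of 4, three pairs are needed to hold all the results even though the pattern has only two capturing groups; a vector with one pair keeps just the longest match and the function signals the shortfall by returning 0.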
|
@ -165,19 +171,13 @@ supported by \fBpcre2_dfa_match()\fP.
|
||||||
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
|
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
Using the alternative matching algorithm provides the following advantages:
|
The main advantage of the alternative algorithm is that all possible matches
|
||||||
|
(at a single point in the subject) are automatically found, and in particular,
|
||||||
|
the longest match is found. To find more than one match at the same point using
|
||||||
|
the standard algorithm, you have to do kludgy things with callouts.
|
||||||
.P
|
.P
|
||||||
1. All possible matches (at a single point in the subject) are automatically
|
Partial matching is possible with this algorithm, though it has some
|
||||||
found, and in particular, the longest match is found. To find more than one
|
limitations. The
|
||||||
match using the standard algorithm, you have to do kludgy things with
|
|
||||||
callouts.
|
|
||||||
.P
|
|
||||||
2. Because the alternative algorithm scans the subject string just once, and
|
|
||||||
never needs to backtrack (except for lookbehinds), it is possible to pass very
|
|
||||||
long subject strings to the matching function in several pieces, checking for
|
|
||||||
partial matching each time. Although it is also possible to do multi-segment
|
|
||||||
matching using the standard algorithm, by retaining partially matched
|
|
||||||
substrings, it is more complicated. The
|
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2partial\fP
|
\fBpcre2partial\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -199,6 +199,8 @@ invalid UTF string are not supported.
|
||||||
.P
|
.P
|
||||||
3. Although atomic groups are supported, their use does not provide the
|
3. Although atomic groups are supported, their use does not provide the
|
||||||
performance advantage that it does for the standard algorithm.
|
performance advantage that it does for the standard algorithm.
|
||||||
|
.P
|
||||||
|
4. JIT optimization is not supported.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH AUTHOR
|
.SH AUTHOR
|
||||||
|
@ -206,7 +208,7 @@ performance advantage that it does for the standard algorithm.
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
University Computing Service
|
Retired from University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
.fi
|
.fi
|
||||||
.
|
.
|
||||||
|
@ -215,6 +217,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 May 2019
|
Last updated: 28 August 2021
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2021 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|