Documentation update to clarify ovector usage with DFA matching.

Philip Hazel 2021-08-28 16:25:59 +01:00
parent 5ff1daffa0
commit 6c2fe9da99
11 changed files with 204 additions and 148 deletions

@ -45,10 +45,16 @@ just once (except when processing lookaround assertions). This function is
<i>workspace</i> Points to a vector of ints used as working space <i>workspace</i> Points to a vector of ints used as working space
<i>wscount</i> Number of elements in the vector <i>wscount</i> Number of elements in the vector
</pre> </pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set The size of output vector needed to contain all the results depends on the
up a callout function or specify the heap limit or the match or the recursion number of simultaneous matches, not on the number of parentheses in the
depth limits. The <i>length</i> and <i>startoffset</i> values are code units, not pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
characters. The options are: data block is therefore not advisable when using this function.
</P>
<P>
A match context is needed only if you want to set up a callout function or
specify the heap limit or the match or the recursion depth limits. The
<i>length</i> and <i>startoffset</i> values are code units, not characters. The
options are:
<pre> <pre>
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_COPY_MATCHED_SUBJECT PCRE2_COPY_MATCHED_SUBJECT

@ -30,8 +30,9 @@ This function creates a new match data block, which is used for holding the
result of a match. The first argument specifies the number of pairs of offsets result of a match. The first argument specifies the number of pairs of offsets
that are required. These form the "output vector" (ovector) within the match that are required. These form the "output vector" (ovector) within the match
data block, and are used to identify the matched string and any captured data block, and are used to identify the matched string and any captured
substrings. There is always one pair of offsets; if <b>ovecsize</b> is zero, it substrings when matching with <b>pcre2_match()</b>, or a number of different
is treated as one. matches at the same point when used with <b>pcre2_dfa_match()</b>. There is
always one pair of offsets; if <b>ovecsize</b> is zero, it is treated as one.
</P> </P>
<P> <P>
The second argument points to a general context, for custom memory management, The second argument points to a general context, for custom memory management,

@ -26,12 +26,15 @@ SYNOPSIS
DESCRIPTION DESCRIPTION
</b><br> </b><br>
<P> <P>
This function creates a new match data block, which is used for holding the This function creates a new match data block for holding the result of a match.
result of a match. The first argument points to a compiled pattern. The number The first argument points to a compiled pattern. The number of capturing
of capturing parentheses within the pattern is used to compute the number of parentheses within the pattern is used to compute the number of pairs of
pairs of offsets that are required in the match data block. These form the offsets that are required in the match data block. These form the "output
"output vector" (ovector) within the match data block, and are used to identify vector" (ovector) within the match data block, and are used to identify the
the matched string and any captured substrings. matched string and any captured substrings when matching with
<b>pcre2_match()</b>. If you are using <b>pcre2_dfa_match()</b>, which uses the
output vector in a different way, you should use <b>pcre2_match_data_create()</b>
instead of this function.
</P> </P>
<P> <P>
The second argument points to a general context, for custom memory management, The second argument points to a general context, for custom memory management,

@ -2512,20 +2512,31 @@ to an abstract format like Java or .NET serialization.
Information about a successful or unsuccessful match is placed in a match Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched parts of the subject. This is known as the
captured. This is known as the <i>ovector</i>. <i>ovector</i>.
</P> </P>
<P> <P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
<b>pcre2_jit_match()</b> you must create a match data block by calling one of <b>pcre2_jit_match()</b> you must create a match data block by calling one of
the creation functions above. For <b>pcre2_match_data_create()</b>, the first the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of argument is the number of pairs of offsets in the <i>ovector</i>.
offsets is required to identify the string that matched the whole pattern, with </P>
an additional pair for each captured substring. For example, a value of 4 <P>
creates enough space to record the matched portion of the subject plus three When using <b>pcre2_match()</b>, one pair of offsets is required to identify the
captured substrings. A minimum of at least 1 pair is imposed by string that matched the whole pattern, with an additional pair for each
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall captured substring. For example, a value of 4 creates enough space to record
matched string. the matched portion of the subject plus three captured substrings.
</P>
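The sizing rule for <b>pcre2_match()</b> described above is simple arithmetic. The sketch below spells it out; both helper names are hypothetical and are not part of the PCRE2 API.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: number of ovector pairs that
   pcre2_match() needs for a pattern with `capture_count` capturing groups.
   One pair identifies the whole match, plus one pair per captured
   substring. */
size_t pairs_for_match(size_t capture_count)
{
    return capture_count + 1;
}

/* Hypothetical helper: each pair holds a start offset and an end offset,
   so the vector has twice as many elements as pairs. */
size_t ovector_elements(size_t pairs)
{
    return 2 * pairs;
}
```

For a pattern with three capturing groups this gives the value of 4 used in the text. The rule says nothing about <b>pcre2_dfa_match()</b>, where the number of pairs needed depends on how many matches are found.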
<P>
When using <b>pcre2_dfa_match()</b> there may be multiple matched substrings of
different lengths at the same point in the subject. The ovector should be made
large enough to hold as many as are expected.
</P>
<P>
A minimum of at least 1 pair is imposed by <b>pcre2_match_data_create()</b>, so
it is always possible to return the overall matched string in the case of
<b>pcre2_match()</b> or the longest match in the case of
<b>pcre2_dfa_match()</b>.
</P> </P>
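To make the DFA case concrete, here is a small model of how several matches at one start point could be laid out in an ovector of limited size. It does not call PCRE2. The function name and parameters are hypothetical, and the "return zero and keep the longest matches when the vector is too small" behaviour is taken from the description of <b>pcre2_dfa_match()</b> return values elsewhere in the PCRE2 API documentation, not from this change.

```c
#include <stddef.h>
#include <stdlib.h>

/* Orders size_t values from largest to smallest. */
static int cmp_desc(const void *a, const void *b)
{
    size_t x = *(const size_t *)a;
    size_t y = *(const size_t *)b;
    return (x < y) - (x > y);
}

/* Model only (hypothetical, not a PCRE2 function). `ends` holds the end
   offset of every match found at the single start offset `start`; it is
   sorted in place and must contain at least one entry. The function writes
   (start, end) pairs into `ovector`, longest match first. It returns the
   number of matches, or 0 when `pairs` is too small to hold them all, in
   which case only the longest `pairs` matches are stored. */
int model_dfa_fill(size_t start, size_t *ends, size_t n_ends,
                   size_t *ovector, size_t pairs)
{
    size_t i, kept;

    qsort(ends, n_ends, sizeof *ends, cmp_desc);
    kept = n_ends < pairs ? n_ends : pairs;
    for (i = 0; i < kept; i++) {
        ovector[2 * i] = start;
        ovector[2 * i + 1] = ends[i];
    }
    return n_ends <= pairs ? (int)n_ends : 0;
}
```

With three matches and room for three pairs the model reports all of them. With room for only one pair it keeps just the longest, which is why a minimum of one pair is still enough to return the longest match.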
<P> <P>
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
@ -2536,10 +2547,11 @@ pass NULL, which causes <b>malloc()</b> to be used.
<P> <P>
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
pointer to a compiled pattern. The ovector is created to be exactly the right pointer to a compiled pattern. The ovector is created to be exactly the right
size to hold all the substrings a pattern might capture. The second argument is size to hold all the substrings a pattern might capture when matched using
again a pointer to a general context, but in this case if NULL is passed, the <b>pcre2_match()</b>. You should not use this call when matching with
memory is obtained using the same allocator that was used for the compiled <b>pcre2_dfa_match()</b>. The second argument is again a pointer to a general
pattern (custom or default). context, but in this case if NULL is passed, the memory is obtained using the
same allocator that was used for the compiled pattern (custom or default).
</P> </P>
<P> <P>
A match data block can be used many times, with the same or different compiled A match data block can be used many times, with the same or different compiled
@ -3982,16 +3994,16 @@ fail, this error is given.
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
University Computing Service Retired from University Computing Service
<br> <br>
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 04 November 2020 Last updated: 28 August 2021
<br> <br>
Copyright &copy; 1997-2020 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -78,8 +78,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
If a leaf node is reached, a matching string has been found, and at that point If a leaf node is reached, a matching string has been found, and at that point
the algorithm stops. Thus, if there is more than one possible match, this the algorithm stops. Thus, if there is more than one possible match, this
algorithm returns the first one that it finds. Whether this is the shortest, algorithm returns the first one that it finds. Whether this is the shortest,
the longest, or some intermediate length depends on the way the greedy and the longest, or some intermediate length depends on the way the alternations
ungreedy repetition quantifiers are specified in the pattern. and the greedy or ungreedy repetition quantifiers are specified in the
pattern.
</P> </P>
<P> <P>
Because it ends up with a single path through the tree, it is relatively Because it ends up with a single path through the tree, it is relatively
@ -109,11 +110,17 @@ no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed). different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of Thus, if there is more than one possible match, this algorithm finds all of
them, and in particular, it finds the longest. The matches are returned in them, and in particular, it finds the longest. The matches are returned in
decreasing order of length. There is an option to stop the algorithm after the the output vector in decreasing order of length. There is an option to stop the
first match (which is necessarily the shortest) is found. algorithm after the first match (which is necessarily the shortest) is found.
</P> </P>
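Reading the results back follows directly from the "decreasing order of length" rule: pair 0 is the longest match, and each later pair is shorter. The sketch below is a model that works on a plain array of offsets; the function name is hypothetical. In real code the offsets would come from <b>pcre2_get_ovector_pointer()</b> and have type PCRE2_SIZE.

```c
#include <stddef.h>
#include <string.h>

/* Model only (hypothetical, not a PCRE2 function). Copies result number
   `i` out of an ovector that holds `rc` (start, end) pairs, as reported
   after a successful DFA match. Returns the length of the copied
   substring, or -1 if `i` is out of range or `buf` is too small. */
long copy_result(const char *subject, const size_t *ovector, int rc,
                 int i, char *buf, size_t bufsize)
{
    size_t start, len;

    if (i < 0 || i >= rc)
        return -1;
    start = ovector[2 * i];
    len = ovector[2 * i + 1] - start;
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

As an assumed example, take the subject "the caterpillar catchment" and three matches that all start at offset 4 and end at offsets 15, 9 and 7. Results 0, 1 and 2 are then "caterpillar", "cater" and "cat".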
<P> <P>
Note that all the matches that are found start at the same point in the Note that the size of vector needed to contain all the results depends on the
number of simultaneous matches, not on the number of parentheses in the
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
data block is therefore not advisable when doing DFA matching.
</P>
<P>
Note also that all the matches that are found start at the same point in the
subject. If the pattern subject. If the pattern
<pre> <pre>
cat(er(pillar)?)? cat(er(pillar)?)?
@ -194,21 +201,14 @@ supported by <b>pcre2_dfa_match()</b>.
</P> </P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P> <P>
Using the alternative matching algorithm provides the following advantages: The main advantage of the alternative algorithm is that all possible matches
(at a single point in the subject) are automatically found, and in particular,
the longest match is found. To find more than one match at the same point using
the standard algorithm, you have to do kludgy things with callouts.
</P> </P>
<P> <P>
1. All possible matches (at a single point in the subject) are automatically Partial matching is possible with this algorithm, though it has some
found, and in particular, the longest match is found. To find more than one limitations. The
match using the standard algorithm, you have to do kludgy things with
callouts.
</P>
<P>
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack (except for lookbehinds), it is possible to pass very
long subject strings to the matching function in several pieces, checking for
partial matching each time. Although it is also possible to do multi-segment
matching using the standard algorithm, by retaining partially matched
substrings, it is more complicated. The
<a href="pcre2partial.html"><b>pcre2partial</b></a> <a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation gives details of partial matching and discusses multi-segment documentation gives details of partial matching and discusses multi-segment
matching. matching.
@ -230,20 +230,23 @@ invalid UTF string are not supported.
3. Although atomic groups are supported, their use does not provide the 3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm. performance advantage that it does for the standard algorithm.
</P> </P>
<P>
4. JIT optimization is not supported.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
University Computing Service Retired from University Computing Service
<br> <br>
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br> <br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 May 2019 Last updated: 28 August 2021
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -2468,20 +2468,28 @@ THE MATCH DATA BLOCK
Information about a successful or unsuccessful match is placed in a Information about a successful or unsuccessful match is placed in a
match data block, which is an opaque structure that is accessed by match data block, which is an opaque structure that is accessed by
function calls. In particular, the match data block contains a vector function calls. In particular, the match data block contains a vector
of offsets into the subject string that define the matched part of the of offsets into the subject string that define the matched parts of the
subject and any substrings that were captured. This is known as the subject. This is known as the ovector.
ovector.
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match() Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
you must create a match data block by calling one of the creation func- you must create a match data block by calling one of the creation func-
tions above. For pcre2_match_data_create(), the first argument is the tions above. For pcre2_match_data_create(), the first argument is the
number of pairs of offsets in the ovector. One pair of offsets is re- number of pairs of offsets in the ovector.
quired to identify the string that matched the whole pattern, with an
additional pair for each captured substring. For example, a value of 4 When using pcre2_match(), one pair of offsets is required to identify
creates enough space to record the matched portion of the subject plus the string that matched the whole pattern, with an additional pair for
three captured substrings. A minimum of at least 1 pair is imposed by each captured substring. For example, a value of 4 creates enough space
pcre2_match_data_create(), so it is always possible to return the over- to record the matched portion of the subject plus three captured sub-
all matched string. strings.
When using pcre2_dfa_match() there may be multiple matched substrings
of different lengths at the same point in the subject. The ovector
should be made large enough to hold as many as are expected.
A minimum of at least 1 pair is imposed by pcre2_match_data_create(),
so it is always possible to return the overall matched string in the
case of pcre2_match() or the longest match in the case of
pcre2_dfa_match().
The second argument of pcre2_match_data_create() is a pointer to a gen- The second argument of pcre2_match_data_create() is a pointer to a gen-
eral context, which can specify custom memory management for obtaining eral context, which can specify custom memory management for obtaining
@ -2490,10 +2498,12 @@ THE MATCH DATA BLOCK
For pcre2_match_data_create_from_pattern(), the first argument is a For pcre2_match_data_create_from_pattern(), the first argument is a
pointer to a compiled pattern. The ovector is created to be exactly the pointer to a compiled pattern. The ovector is created to be exactly the
right size to hold all the substrings a pattern might capture. The sec- right size to hold all the substrings a pattern might capture when
ond argument is again a pointer to a general context, but in this case matched using pcre2_match(). You should not use this call when matching
if NULL is passed, the memory is obtained using the same allocator that with pcre2_dfa_match(). The second argument is again a pointer to a
was used for the compiled pattern (custom or default). general context, but in this case if NULL is passed, the memory is ob-
tained using the same allocator that was used for the compiled pattern
(custom or default).
A match data block can be used many times, with the same or different A match data block can be used many times, with the same or different
compiled patterns. You can extract information from a match data block compiled patterns. You can extract information from a match data block
@ -3825,14 +3835,14 @@ SEE ALSO
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service Retired from University Computing Service
Cambridge, England. Cambridge, England.
REVISION REVISION
Last updated: 04 November 2020 Last updated: 28 August 2021
Copyright (c) 1997-2020 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -5635,8 +5645,8 @@ THE STANDARD MATCHING ALGORITHM
that point the algorithm stops. Thus, if there is more than one possi- that point the algorithm stops. Thus, if there is more than one possi-
ble match, this algorithm returns the first one that it finds. Whether ble match, this algorithm returns the first one that it finds. Whether
this is the shortest, the longest, or some intermediate length depends this is the shortest, the longest, or some intermediate length depends
on the way the greedy and ungreedy repetition quantifiers are specified on the way the alternations and the greedy or ungreedy repetition quan-
in the pattern. tifiers are specified in the pattern.
Because it ends up with a single path through the tree, it is rela- Because it ends up with a single path through the tree, it is rela-
tively straightforward for this algorithm to keep track of the sub- tively straightforward for this algorithm to keep track of the sub-
@ -5665,12 +5675,18 @@ THE ALTERNATIVE MATCHING ALGORITHM
represent the different matching possibilities (if there are none, the represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match, match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the long- this algorithm finds all of them, and in particular, it finds the long-
est. The matches are returned in decreasing order of length. There is est. The matches are returned in the output vector in decreasing order
an option to stop the algorithm after the first match (which is neces- of length. There is an option to stop the algorithm after the first
sarily the shortest) is found. match (which is necessarily the shortest) is found.
Note that all the matches that are found start at the same point in the Note that the size of vector needed to contain all the results depends
subject. If the pattern on the number of simultaneous matches, not on the number of parentheses
in the pattern. Using pcre2_match_data_create_from_pattern() to create
the match data block is therefore not advisable when doing DFA match-
ing.
Note also that all the matches that are found start at the same point
in the subject. If the pattern
cat(er(pillar)?)? cat(er(pillar)?)?
@ -5746,50 +5762,45 @@ THE ALTERNATIVE MATCHING ALGORITHM
ADVANTAGES OF THE ALTERNATIVE ALGORITHM ADVANTAGES OF THE ALTERNATIVE ALGORITHM
Using the alternative matching algorithm provides the following advan- The main advantage of the alternative algorithm is that all possible
tages: matches (at a single point in the subject) are automatically found, and
in particular, the longest match is found. To find more than one match
1. All possible matches (at a single point in the subject) are automat- at the same point using the standard algorithm, you have to do kludgy
ically found, and in particular, the longest match is found. To find
more than one match using the standard algorithm, you have to do kludgy
things with callouts. things with callouts.
2. Because the alternative algorithm scans the subject string just Partial matching is possible with this algorithm, though it has some
once, and never needs to backtrack (except for lookbehinds), it is pos- limitations. The pcre2partial documentation gives details of partial
sible to pass very long subject strings to the matching function in matching and discusses multi-segment matching.
several pieces, checking for partial matching each time. Although it is
also possible to do multi-segment matching using the standard algo-
rithm, by retaining partially matched substrings, it is more compli-
cated. The pcre2partial documentation gives details of partial matching
and discusses multi-segment matching.
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
The alternative algorithm suffers from a number of disadvantages: The alternative algorithm suffers from a number of disadvantages:
1. It is substantially slower than the standard algorithm. This is 1. It is substantially slower than the standard algorithm. This is
partly because it has to search for all possible matches, but is also partly because it has to search for all possible matches, but is also
because it is less susceptible to optimization. because it is less susceptible to optimization.
2. Capturing parentheses, backreferences, script runs, and matching 2. Capturing parentheses, backreferences, script runs, and matching
within an invalid UTF string are not supported. within an invalid UTF string are not supported.
3. Although atomic groups are supported, their use does not provide the 3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm. performance advantage that it does for the standard algorithm.
4. JIT optimization is not supported.
AUTHOR AUTHOR
Philip Hazel Philip Hazel
University Computing Service Retired from University Computing Service
Cambridge, England. Cambridge, England.
REVISION REVISION
Last updated: 23 May 2019 Last updated: 28 August 2021
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2_DFA_MATCH 3 "16 October 2018" "PCRE2 10.33" .TH PCRE2_DFA_MATCH 3 "28 August 2021" "PCRE2 10.38"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -33,10 +33,15 @@ just once (except when processing lookaround assertions). This function is
\fIworkspace\fP Points to a vector of ints used as working space \fIworkspace\fP Points to a vector of ints used as working space
\fIwscount\fP Number of elements in the vector \fIwscount\fP Number of elements in the vector
.sp .sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set The size of output vector needed to contain all the results depends on the
up a callout function or specify the heap limit or the match or the recursion number of simultaneous matches, not on the number of parentheses in the
depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
characters. The options are: data block is therefore not advisable when using this function.
.P
A match context is needed only if you want to set up a callout function or
specify the heap limit or the match or the recursion depth limits. The
\fIlength\fP and \fIstartoffset\fP values are code units, not characters. The
options are:
.sp .sp
PCRE2_ANCHORED Match only at the first position PCRE2_ANCHORED Match only at the first position
PCRE2_COPY_MATCHED_SUBJECT PCRE2_COPY_MATCHED_SUBJECT

@ -1,4 +1,4 @@
.TH PCRE2_MATCH_DATA_CREATE 3 "29 July 2015" "PCRE2 10.21" .TH PCRE2_MATCH_DATA_CREATE 3 "28 August 2021" "PCRE2 10.38"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -18,8 +18,9 @@ This function creates a new match data block, which is used for holding the
result of a match. The first argument specifies the number of pairs of offsets result of a match. The first argument specifies the number of pairs of offsets
that are required. These form the "output vector" (ovector) within the match that are required. These form the "output vector" (ovector) within the match
data block, and are used to identify the matched string and any captured data block, and are used to identify the matched string and any captured
substrings. There is always one pair of offsets; if \fBovecsize\fP is zero, it substrings when matching with \fBpcre2_match()\fP, or a number of different
is treated as one. matches at the same point when used with \fBpcre2_dfa_match()\fP. There is
always one pair of offsets; if \fBovecsize\fP is zero, it is treated as one.
.P .P
The second argument points to a general context, for custom memory management, The second argument points to a general context, for custom memory management,
or is NULL for system memory management. The result of the function is NULL if or is NULL for system memory management. The result of the function is NULL if

@ -1,4 +1,4 @@
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "29 July 2015" "PCRE2 10.21" .TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "28 August 2021" "PCRE2 10.38"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -14,12 +14,15 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION .SH DESCRIPTION
.rs .rs
.sp .sp
This function creates a new match data block, which is used for holding the This function creates a new match data block for holding the result of a match.
result of a match. The first argument points to a compiled pattern. The number The first argument points to a compiled pattern. The number of capturing
of capturing parentheses within the pattern is used to compute the number of parentheses within the pattern is used to compute the number of pairs of
pairs of offsets that are required in the match data block. These form the offsets that are required in the match data block. These form the "output
"output vector" (ovector) within the match data block, and are used to identify vector" (ovector) within the match data block, and are used to identify the
the matched string and any captured substrings. matched string and any captured substrings when matching with
\fBpcre2_match()\fP. If you are using \fBpcre2_dfa_match()\fP, which uses the
output vector in a different way, you should use \fBpcre2_match_data_create()\fP
instead of this function.
.P .P
The second argument points to a general context, for custom memory management, The second argument points to a general context, for custom memory management,
or is NULL to use the same memory allocator as was used for the compiled or is NULL to use the same memory allocator as was used for the compiled

@ -1,4 +1,4 @@
.TH PCRE2API 3 "04 November 2020" "PCRE2 10.36" .TH PCRE2API 3 "28 August 2021" "PCRE2 10.38"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -2490,19 +2490,27 @@ to an abstract format like Java or .NET serialization.
Information about a successful or unsuccessful match is placed in a match Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched parts of the subject. This is known as the
captured. This is known as the \fIovector\fP. \fIovector\fP.
.P .P
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
\fBpcre2_jit_match()\fP you must create a match data block by calling one of \fBpcre2_jit_match()\fP you must create a match data block by calling one of
the creation functions above. For \fBpcre2_match_data_create()\fP, the first the creation functions above. For \fBpcre2_match_data_create()\fP, the first
argument is the number of pairs of offsets in the \fIovector\fP. One pair of argument is the number of pairs of offsets in the \fIovector\fP.
offsets is required to identify the string that matched the whole pattern, with .P
an additional pair for each captured substring. For example, a value of 4 When using \fBpcre2_match()\fP, one pair of offsets is required to identify the
creates enough space to record the matched portion of the subject plus three string that matched the whole pattern, with an additional pair for each
captured substrings. A minimum of at least 1 pair is imposed by captured substring. For example, a value of 4 creates enough space to record
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall the matched portion of the subject plus three captured substrings.
matched string. .P
When using \fBpcre2_dfa_match()\fP there may be multiple matched substrings of
different lengths at the same point in the subject. The ovector should be made
large enough to hold as many as are expected.
.P
A minimum of at least 1 pair is imposed by \fBpcre2_match_data_create()\fP, so
it is always possible to return the overall matched string in the case of
\fBpcre2_match()\fP or the longest match in the case of
\fBpcre2_dfa_match()\fP.
.P .P
The second argument of \fBpcre2_match_data_create()\fP is a pointer to a The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
general context, which can specify custom memory management for obtaining the general context, which can specify custom memory management for obtaining the
@@ -2511,10 +2519,11 @@ pass NULL, which causes \fBmalloc()\fP to be used.
 .P
 For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
 pointer to a compiled pattern. The ovector is created to be exactly the right
-size to hold all the substrings a pattern might capture. The second argument is
-again a pointer to a general context, but in this case if NULL is passed, the
-memory is obtained using the same allocator that was used for the compiled
-pattern (custom or default).
+size to hold all the substrings a pattern might capture when matched using
+\fBpcre2_match()\fP. You should not use this call when matching with
+\fBpcre2_dfa_match()\fP. The second argument is again a pointer to a general
+context, but in this case if NULL is passed, the memory is obtained using the
+same allocator that was used for the compiled pattern (custom or default).
 .P
 A match data block can be used many times, with the same or different compiled
 patterns. You can extract information from a match data block after a match
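
The sizing arithmetic described in the hunks above can be sketched in plain C. This is only an illustration, not PCRE2 code: the helper names pairs_for_match, effective_pairs and ovector_elements are hypothetical, and the sketch assumes that each pair occupies two offset slots (start and end). It does not cover pcre2_dfa_match(), where the number of pairs has to be chosen from the number of matches expected rather than derived from the pattern.

```c
#include <stddef.h>

/* Hypothetical helper: pairs needed when matching with pcre2_match(),
   i.e. one pair for the whole match plus one per capturing group. */
static size_t pairs_for_match(size_t capture_groups)
{
    return capture_groups + 1;
}

/* Hypothetical helper: mirrors the documented minimum of 1 pair, so a
   request for 0 pairs still leaves room for the overall match. */
static size_t effective_pairs(size_t requested_pairs)
{
    return requested_pairs == 0 ? 1 : requested_pairs;
}

/* Assumption: each pair is a start offset plus an end offset, so the
   vector holds two elements per pair. */
static size_t ovector_elements(size_t requested_pairs)
{
    return 2 * effective_pairs(requested_pairs);
}
```

With three capturing groups, pairs_for_match(3) gives 4, which is the "value of 4" case in the text; a request for 0 pairs is treated as 1.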
@@ -3991,7 +4000,7 @@ fail, this error is given.
 .sp
 .nf
 Philip Hazel
-University Computing Service
+Retired from University Computing Service
 Cambridge, England.
 .fi
 .
@@ -4000,6 +4009,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 04 November 2020
-Copyright (c) 1997-2020 University of Cambridge.
+Last updated: 28 August 2021
+Copyright (c) 1997-2021 University of Cambridge.
 .fi


@@ -1,4 +1,4 @@
-.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
+.TH PCRE2MATCHING 3 "28 August 2021" "PCRE2 10.38"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 MATCHING ALGORITHMS"
@@ -61,8 +61,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
 If a leaf node is reached, a matching string has been found, and at that point
 the algorithm stops. Thus, if there is more than one possible match, this
 algorithm returns the first one that it finds. Whether this is the shortest,
-the longest, or some intermediate length depends on the way the greedy and
-ungreedy repetition quantifiers are specified in the pattern.
+the longest, or some intermediate length depends on the way the alternations
+and the greedy or ungreedy repetition quantifiers are specified in the
+pattern.
 .P
 Because it ends up with a single path through the tree, it is relatively
 straightforward for this algorithm to keep track of the substrings that are
@@ -91,10 +92,15 @@ no more unterminated paths. At this point, terminated paths represent the
 different matching possibilities (if there are none, the match has failed).
 Thus, if there is more than one possible match, this algorithm finds all of
 them, and in particular, it finds the longest. The matches are returned in
-decreasing order of length. There is an option to stop the algorithm after the
-first match (which is necessarily the shortest) is found.
+the output vector in decreasing order of length. There is an option to stop the
+algorithm after the first match (which is necessarily the shortest) is found.
 .P
-Note that all the matches that are found start at the same point in the
+Note that the size of vector needed to contain all the results depends on the
+number of simultaneous matches, not on the number of parentheses in the
+pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
+data block is therefore not advisable when doing DFA matching.
+.P
+Note also that all the matches that are found start at the same point in the
 subject. If the pattern
 .sp
 cat(er(pillar)?)?
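
To make the "all matches at one starting point, longest first" behaviour concrete, here is a toy stand-in written in plain C. It is not PCRE2 and does not call pcre2_dfa_match(): the function toy_all_matches and its hard-coded candidate list are assumptions made for illustration, covering only the strings that cat(er(pillar)?)? can match. Unlike the real function, it returns the total number of matches found even when the vector is too small, so the caller can see how many pairs would have been needed.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in (NOT PCRE2): record every match of cat(er(pillar)?)? that
   begins at subject[start], longest first, as (start, end) offset pairs.
   Returns the number of matches found; only the first `pairs` are stored. */
static int toy_all_matches(const char *subject, size_t start,
                           size_t *ovector, int pairs)
{
    /* The only strings this pattern can match, in decreasing length. */
    static const char *const candidates[] = { "caterpillar", "cater", "cat" };
    int found = 0;

    if (start > strlen(subject)) return 0;  /* start lies outside the subject */

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        size_t len = strlen(candidates[i]);
        if (strncmp(subject + start, candidates[i], len) != 0) continue;
        if (found < pairs) {                /* store only while room remains */
            ovector[2 * found] = start;
            ovector[2 * found + 1] = start + len;
        }
        found++;
    }
    return found;
}
```

For the subject "the caterpillar catchment" and start offset 4, a three-pair vector receives (4,15), (4,9) and (4,7). With a one-pair vector only (4,15) is stored while the return value is still 3, which shows why the vector size has to follow the number of simultaneous matches.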
@@ -165,19 +171,13 @@ supported by \fBpcre2_dfa_match()\fP.
 .SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
 .rs
 .sp
-Using the alternative matching algorithm provides the following advantages:
+The main advantage of the alternative algorithm is that all possible matches
+(at a single point in the subject) are automatically found, and in particular,
+the longest match is found. To find more than one match at the same point using
+the standard algorithm, you have to do kludgy things with callouts.
 .P
-1. All possible matches (at a single point in the subject) are automatically
-found, and in particular, the longest match is found. To find more than one
-match using the standard algorithm, you have to do kludgy things with
-callouts.
-.P
-2. Because the alternative algorithm scans the subject string just once, and
-never needs to backtrack (except for lookbehinds), it is possible to pass very
-long subject strings to the matching function in several pieces, checking for
-partial matching each time. Although it is also possible to do multi-segment
-matching using the standard algorithm, by retaining partially matched
-substrings, it is more complicated. The
+Partial matching is possible with this algorithm, though it has some
+limitations. The
 .\" HREF
 \fBpcre2partial\fP
 .\"
@@ -199,6 +199,8 @@ invalid UTF string are not supported.
 .P
 3. Although atomic groups are supported, their use does not provide the
 performance advantage that it does for the standard algorithm.
+.P
+4. JIT optimization is not supported.
 .
 .
 .SH AUTHOR
@@ -206,7 +208,7 @@ performance advantage that it does for the standard algorithm.
 .sp
 .nf
 Philip Hazel
-University Computing Service
+Retired from University Computing Service
 Cambridge, England.
 .fi
 .
@@ -215,6 +217,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 May 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 28 August 2021
+Copyright (c) 1997-2021 University of Cambridge.
 .fi