Documentation update to clarify ovector usage with DFA matching.
parent 5ff1daffa0
commit 6c2fe9da99
|
@ -45,10 +45,16 @@ just once (except when processing lookaround assertions). This function is
|
|||
<i>workspace</i> Points to a vector of ints used as working space
|
||||
<i>wscount</i> Number of elements in the vector
|
||||
</pre>
|
||||
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
|
||||
up a callout function or specify the heap limit or the match or the recursion
|
||||
depth limits. The <i>length</i> and <i>startoffset</i> values are code units, not
|
||||
characters. The options are:
|
||||
The size of the output vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of parentheses in the
|
||||
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
|
||||
data block is therefore not advisable when using this function.
|
||||
</P>
|
||||
<P>
|
||||
A match context is needed only if you want to set up a callout function or
|
||||
specify the heap limit or the match or the recursion depth limits. The
|
||||
<i>length</i> and <i>startoffset</i> values are code units, not characters. The
|
||||
options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_COPY_MATCHED_SUBJECT
|
||||
|
|
|
@ -30,8 +30,9 @@ This function creates a new match data block, which is used for holding the
|
|||
result of a match. The first argument specifies the number of pairs of offsets
|
||||
that are required. These form the "output vector" (ovector) within the match
|
||||
data block, and are used to identify the matched string and any captured
|
||||
substrings. There is always one pair of offsets; if <b>ovecsize</b> is zero, it
|
||||
is treated as one.
|
||||
substrings when matching with <b>pcre2_match()</b>, or a number of different
|
||||
matches at the same point when used with <b>pcre2_dfa_match()</b>. There is
|
||||
always one pair of offsets; if <b>ovecsize</b> is zero, it is treated as one.
|
||||
</P>
|
||||
<P>
|
||||
The second argument points to a general context, for custom memory management,
|
||||
|
|
|
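The sizing rule described in this hunk can be sketched in plain C. This is a minimal model, not PCRE2 code: the helper names `effective_pairs`, `ovector_elements`, `pair_start_index`, and `pair_end_index` are hypothetical and exist only for illustration. It assumes the layout given in the PCRE2 API documentation, where each pair occupies two consecutive ovector elements (start offset, then end offset).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function: models the documented rule
   that an ovecsize of zero is treated as one pair. */
static size_t effective_pairs(size_t ovecsize)
{
    return ovecsize == 0 ? 1 : ovecsize;
}

/* Each pair holds a start offset and an end offset, so the vector has
   two elements per pair. */
static size_t ovector_elements(size_t ovecsize)
{
    return 2 * effective_pairs(ovecsize);
}

/* Index of the start offset of pair number `pair` within the vector. */
static size_t pair_start_index(size_t pair)
{
    return 2 * pair;
}

/* Index of the end offset of pair number `pair` within the vector. */
static size_t pair_end_index(size_t pair)
{
    return 2 * pair + 1;
}
```

With `pcre2_match()`, pair 0 describes the whole match and pair n describes capture group n. With `pcre2_dfa_match()`, the same slots hold alternative matches that start at the same point, so the pair number carries no capture-group meaning.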
@ -26,12 +26,15 @@ SYNOPSIS
|
|||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates a new match data block, which is used for holding the
|
||||
result of a match. The first argument points to a compiled pattern. The number
|
||||
of capturing parentheses within the pattern is used to compute the number of
|
||||
pairs of offsets that are required in the match data block. These form the
|
||||
"output vector" (ovector) within the match data block, and are used to identify
|
||||
the matched string and any captured substrings.
|
||||
This function creates a new match data block for holding the result of a match.
|
||||
The first argument points to a compiled pattern. The number of capturing
|
||||
parentheses within the pattern is used to compute the number of pairs of
|
||||
offsets that are required in the match data block. These form the "output
|
||||
vector" (ovector) within the match data block, and are used to identify the
|
||||
matched string and any captured substrings when matching with
|
||||
<b>pcre2_match()</b>. If you are using <b>pcre2_dfa_match()</b>, which uses the
|
||||
output vector in a different way, you should use <b>pcre2_match_data_create()</b>
|
||||
instead of this function.
|
||||
</P>
|
||||
<P>
|
||||
The second argument points to a general context, for custom memory management,
|
||||
|
|
|
@ -2512,20 +2512,31 @@ to an abstract format like Java or .NET serialization.
|
|||
Information about a successful or unsuccessful match is placed in a match
|
||||
data block, which is an opaque structure that is accessed by function calls. In
|
||||
particular, the match data block contains a vector of offsets into the subject
|
||||
string that define the matched part of the subject and any substrings that were
|
||||
captured. This is known as the <i>ovector</i>.
|
||||
string that define the matched parts of the subject. This is known as the
|
||||
<i>ovector</i>.
|
||||
</P>
|
||||
<P>
|
||||
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
|
||||
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
|
||||
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
|
||||
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
|
||||
offsets is required to identify the string that matched the whole pattern, with
|
||||
an additional pair for each captured substring. For example, a value of 4
|
||||
creates enough space to record the matched portion of the subject plus three
|
||||
captured substrings. A minimum of at least 1 pair is imposed by
|
||||
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
|
||||
matched string.
|
||||
argument is the number of pairs of offsets in the <i>ovector</i>.
|
||||
</P>
|
||||
<P>
|
||||
When using <b>pcre2_match()</b>, one pair of offsets is required to identify the
|
||||
string that matched the whole pattern, with an additional pair for each
|
||||
captured substring. For example, a value of 4 creates enough space to record
|
||||
the matched portion of the subject plus three captured substrings.
|
||||
</P>
|
||||
<P>
|
||||
When using <b>pcre2_dfa_match()</b> there may be multiple matched substrings of
|
||||
different lengths at the same point in the subject. The ovector should be made
|
||||
large enough to hold as many as are expected.
|
||||
</P>
|
||||
<P>
|
||||
A minimum of 1 pair is imposed by <b>pcre2_match_data_create()</b>, so
|
||||
it is always possible to return the overall matched string in the case of
|
||||
<b>pcre2_match()</b> or the longest match in the case of
|
||||
<b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
|
||||
|
@ -2536,10 +2547,11 @@ pass NULL, which causes <b>malloc()</b> to be used.
|
|||
<P>
|
||||
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
||||
pointer to a compiled pattern. The ovector is created to be exactly the right
|
||||
size to hold all the substrings a pattern might capture. The second argument is
|
||||
again a pointer to a general context, but in this case if NULL is passed, the
|
||||
memory is obtained using the same allocator that was used for the compiled
|
||||
pattern (custom or default).
|
||||
size to hold all the substrings a pattern might capture when matched using
|
||||
<b>pcre2_match()</b>. You should not use this call when matching with
|
||||
<b>pcre2_dfa_match()</b>. The second argument is again a pointer to a general
|
||||
context, but in this case if NULL is passed, the memory is obtained using the
|
||||
same allocator that was used for the compiled pattern (custom or default).
|
||||
</P>
|
||||
<P>
|
||||
A match data block can be used many times, with the same or different compiled
|
||||
|
@ -3982,16 +3994,16 @@ fail, this error is given.
|
|||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 04 November 2020
|
||||
Last updated: 28 August 2021
|
||||
<br>
|
||||
Copyright © 1997-2020 University of Cambridge.
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
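The difference in how `pcre2_dfa_match()` uses the ovector can be illustrated with a toy model in plain C. It does not call PCRE2, and `fill_dfa_ovector` and `cmp_desc` are hypothetical names. It assumes the behaviour described in the PCRE2 API documentation for DFA results: matches are stored longest first, and when there are more matches than pairs, the return value is zero and only the longest matches are kept. The `-1` return stands in for a "no match" error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison function for qsort: larger end offsets sort first. */
static int cmp_desc(const void *a, const void *b)
{
    size_t x = *(const size_t *)a;
    size_t y = *(const size_t *)b;
    return (x < y) - (x > y);
}

/* Toy model, NOT the PCRE2 implementation. All candidate matches begin at
   `start`; `ends` holds their end offsets in any order (it is sorted in
   place). Pairs are written longest match first.
   Returns the number of matches, 0 if `pairs` is too small to hold them all
   (only the longest are stored), or -1 if there are no candidates. */
static int fill_dfa_ovector(size_t start, size_t *ends, size_t n_ends,
                            size_t *ovector, size_t pairs)
{
    if (n_ends == 0)
        return -1;

    qsort(ends, n_ends, sizeof ends[0], cmp_desc);

    size_t stored = n_ends < pairs ? n_ends : pairs;
    for (size_t i = 0; i < stored; i++) {
        ovector[2 * i] = start;        /* every match starts at the same point */
        ovector[2 * i + 1] = ends[i];  /* longest match first */
    }
    return n_ends > pairs ? 0 : (int)n_ends;
}
```

In this model, three candidates with a one-pair vector yield a return of zero with the longest match in pair 0, which mirrors the statement above that the minimum of one pair always leaves room for the longest match.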
@ -78,8 +78,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
|
|||
If a leaf node is reached, a matching string has been found, and at that point
|
||||
the algorithm stops. Thus, if there is more than one possible match, this
|
||||
algorithm returns the first one that it finds. Whether this is the shortest,
|
||||
the longest, or some intermediate length depends on the way the greedy and
|
||||
ungreedy repetition quantifiers are specified in the pattern.
|
||||
the longest, or some intermediate length depends on the way the alternations
|
||||
and the greedy or ungreedy repetition quantifiers are specified in the
|
||||
pattern.
|
||||
</P>
|
||||
<P>
|
||||
Because it ends up with a single path through the tree, it is relatively
|
||||
|
@ -109,11 +110,17 @@ no more unterminated paths. At this point, terminated paths represent the
|
|||
different matching possibilities (if there are none, the match has failed).
|
||||
Thus, if there is more than one possible match, this algorithm finds all of
|
||||
them, and in particular, it finds the longest. The matches are returned in
|
||||
decreasing order of length. There is an option to stop the algorithm after the
|
||||
first match (which is necessarily the shortest) is found.
|
||||
the output vector in decreasing order of length. There is an option to stop the
|
||||
algorithm after the first match (which is necessarily the shortest) is found.
|
||||
</P>
|
||||
<P>
|
||||
Note that all the matches that are found start at the same point in the
|
||||
Note that the size of the vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of parentheses in the
|
||||
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
|
||||
data block is therefore not advisable when doing DFA matching.
|
||||
</P>
|
||||
<P>
|
||||
Note also that all the matches that are found start at the same point in the
|
||||
subject. If the pattern
|
||||
<pre>
|
||||
cat(er(pillar)?)?
|
||||
|
@ -194,21 +201,14 @@ supported by <b>pcre2_dfa_match()</b>.
|
|||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||
<P>
|
||||
Using the alternative matching algorithm provides the following advantages:
|
||||
The main advantage of the alternative algorithm is that all possible matches
|
||||
(at a single point in the subject) are automatically found, and in particular,
|
||||
the longest match is found. To find more than one match at the same point using
|
||||
the standard algorithm, you have to do kludgy things with callouts.
|
||||
</P>
|
||||
<P>
|
||||
1. All possible matches (at a single point in the subject) are automatically
|
||||
found, and in particular, the longest match is found. To find more than one
|
||||
match using the standard algorithm, you have to do kludgy things with
|
||||
callouts.
|
||||
</P>
|
||||
<P>
|
||||
2. Because the alternative algorithm scans the subject string just once, and
|
||||
never needs to backtrack (except for lookbehinds), it is possible to pass very
|
||||
long subject strings to the matching function in several pieces, checking for
|
||||
partial matching each time. Although it is also possible to do multi-segment
|
||||
matching using the standard algorithm, by retaining partially matched
|
||||
substrings, it is more complicated. The
|
||||
Partial matching is possible with this algorithm, though it has some
|
||||
limitations. The
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation gives details of partial matching and discusses multi-segment
|
||||
matching.
|
||||
|
@ -230,20 +230,23 @@ invalid UTF string are not supported.
|
|||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
</P>
|
||||
<P>
|
||||
4. JIT optimization is not supported.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 May 2019
|
||||
Last updated: 28 August 2021
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
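The point that the ovector size for DFA matching depends on the number of simultaneous matches, not on the number of parentheses, can be checked with a small plain-C model of the `cat(er(pillar)?)?` pattern shown above. This is a sketch that does not use PCRE2: the helper names are hypothetical, and the candidate list is simply the set of strings that this pattern can match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The pattern cat(er(pillar)?)? can match exactly these strings,
   listed longest first. */
static const char *const candidates[] = { "caterpillar", "cater", "cat" };
#define N_CANDIDATES (sizeof candidates / sizeof candidates[0])

/* Toy model, not PCRE2: count the candidates that match at subject + start.
   This is the number of ovector pairs a DFA match at that point would need.
   `start` must not exceed strlen(subject). */
static size_t dfa_pairs_needed(const char *subject, size_t start)
{
    size_t n = 0;
    for (size_t i = 0; i < N_CANDIDATES; i++) {
        size_t len = strlen(candidates[i]);
        if (strncmp(subject + start, candidates[i], len) == 0)
            n++;
    }
    return n;
}

/* Pairs implied by a capture count: one for the whole match plus one
   for each capture group. */
static size_t pairs_from_capture_count(size_t groups)
{
    return groups + 1;
}
```

For the subject "the caterpillar catchment", `dfa_pairs_needed(subject, 4)` is 3 and `dfa_pairs_needed(subject, 16)` is 1. The pattern as written has two capture groups, so a capture-based size of 3 pairs happens to be enough here, but that is a coincidence: the equivalent non-capturing form `cat(?:er(?:pillar)?)?` has no capture groups, giving `pairs_from_capture_count(0) == 1`, while the DFA match at offset 4 still needs 3 pairs.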
@ -2468,20 +2468,28 @@ THE MATCH DATA BLOCK
|
|||
Information about a successful or unsuccessful match is placed in a
|
||||
match data block, which is an opaque structure that is accessed by
|
||||
function calls. In particular, the match data block contains a vector
|
||||
of offsets into the subject string that define the matched part of the
|
||||
subject and any substrings that were captured. This is known as the
|
||||
ovector.
|
||||
of offsets into the subject string that define the matched parts of the
|
||||
subject. This is known as the ovector.
|
||||
|
||||
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
|
||||
you must create a match data block by calling one of the creation func-
|
||||
tions above. For pcre2_match_data_create(), the first argument is the
|
||||
number of pairs of offsets in the ovector. One pair of offsets is re-
|
||||
quired to identify the string that matched the whole pattern, with an
|
||||
additional pair for each captured substring. For example, a value of 4
|
||||
creates enough space to record the matched portion of the subject plus
|
||||
three captured substrings. A minimum of at least 1 pair is imposed by
|
||||
pcre2_match_data_create(), so it is always possible to return the over-
|
||||
all matched string.
|
||||
number of pairs of offsets in the ovector.
|
||||
|
||||
When using pcre2_match(), one pair of offsets is required to identify
|
||||
the string that matched the whole pattern, with an additional pair for
|
||||
each captured substring. For example, a value of 4 creates enough space
|
||||
to record the matched portion of the subject plus three captured sub-
|
||||
strings.
|
||||
|
||||
When using pcre2_dfa_match() there may be multiple matched substrings
|
||||
of different lengths at the same point in the subject. The ovector
|
||||
should be made large enough to hold as many as are expected.
|
||||
|
||||
A minimum of 1 pair is imposed by pcre2_match_data_create(),
|
||||
so it is always possible to return the overall matched string in the
|
||||
case of pcre2_match() or the longest match in the case of
|
||||
pcre2_dfa_match().
|
||||
|
||||
The second argument of pcre2_match_data_create() is a pointer to a gen-
|
||||
eral context, which can specify custom memory management for obtaining
|
||||
|
@ -2490,10 +2498,12 @@ THE MATCH DATA BLOCK
|
|||
|
||||
For pcre2_match_data_create_from_pattern(), the first argument is a
|
||||
pointer to a compiled pattern. The ovector is created to be exactly the
|
||||
right size to hold all the substrings a pattern might capture. The sec-
|
||||
ond argument is again a pointer to a general context, but in this case
|
||||
if NULL is passed, the memory is obtained using the same allocator that
|
||||
was used for the compiled pattern (custom or default).
|
||||
right size to hold all the substrings a pattern might capture when
|
||||
matched using pcre2_match(). You should not use this call when matching
|
||||
with pcre2_dfa_match(). The second argument is again a pointer to a
|
||||
general context, but in this case if NULL is passed, the memory is ob-
|
||||
tained using the same allocator that was used for the compiled pattern
|
||||
(custom or default).
|
||||
|
||||
A match data block can be used many times, with the same or different
|
||||
compiled patterns. You can extract information from a match data block
|
||||
|
@ -3825,14 +3835,14 @@ SEE ALSO
|
|||
AUTHOR
|
||||
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
|
||||
|
||||
REVISION
|
||||
|
||||
Last updated: 04 November 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
Last updated: 28 August 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -5635,8 +5645,8 @@ THE STANDARD MATCHING ALGORITHM
|
|||
that point the algorithm stops. Thus, if there is more than one possi-
|
||||
ble match, this algorithm returns the first one that it finds. Whether
|
||||
this is the shortest, the longest, or some intermediate length depends
|
||||
on the way the greedy and ungreedy repetition quantifiers are specified
|
||||
in the pattern.
|
||||
on the way the alternations and the greedy or ungreedy repetition quan-
|
||||
tifiers are specified in the pattern.
|
||||
|
||||
Because it ends up with a single path through the tree, it is rela-
|
||||
tively straightforward for this algorithm to keep track of the sub-
|
||||
|
@ -5665,12 +5675,18 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
|||
represent the different matching possibilities (if there are none, the
|
||||
match has failed). Thus, if there is more than one possible match,
|
||||
this algorithm finds all of them, and in particular, it finds the long-
|
||||
est. The matches are returned in decreasing order of length. There is
|
||||
an option to stop the algorithm after the first match (which is neces-
|
||||
sarily the shortest) is found.
|
||||
est. The matches are returned in the output vector in decreasing order
|
||||
of length. There is an option to stop the algorithm after the first
|
||||
match (which is necessarily the shortest) is found.
|
||||
|
||||
Note that all the matches that are found start at the same point in the
|
||||
subject. If the pattern
|
||||
Note that the size of the vector needed to contain all the results depends
|
||||
on the number of simultaneous matches, not on the number of parentheses
|
||||
in the pattern. Using pcre2_match_data_create_from_pattern() to create
|
||||
the match data block is therefore not advisable when doing DFA match-
|
||||
ing.
|
||||
|
||||
Note also that all the matches that are found start at the same point
|
||||
in the subject. If the pattern
|
||||
|
||||
cat(er(pillar)?)?
|
||||
|
||||
|
@ -5746,22 +5762,15 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
|||
|
||||
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
||||
|
||||
Using the alternative matching algorithm provides the following advan-
|
||||
tages:
|
||||
|
||||
1. All possible matches (at a single point in the subject) are automat-
|
||||
ically found, and in particular, the longest match is found. To find
|
||||
more than one match using the standard algorithm, you have to do kludgy
|
||||
The main advantage of the alternative algorithm is that all possible
|
||||
matches (at a single point in the subject) are automatically found, and
|
||||
in particular, the longest match is found. To find more than one match
|
||||
at the same point using the standard algorithm, you have to do kludgy
|
||||
things with callouts.
|
||||
|
||||
2. Because the alternative algorithm scans the subject string just
|
||||
once, and never needs to backtrack (except for lookbehinds), it is pos-
|
||||
sible to pass very long subject strings to the matching function in
|
||||
several pieces, checking for partial matching each time. Although it is
|
||||
also possible to do multi-segment matching using the standard algo-
|
||||
rithm, by retaining partially matched substrings, it is more compli-
|
||||
cated. The pcre2partial documentation gives details of partial matching
|
||||
and discusses multi-segment matching.
|
||||
Partial matching is possible with this algorithm, though it has some
|
||||
limitations. The pcre2partial documentation gives details of partial
|
||||
matching and discusses multi-segment matching.
|
||||
|
||||
|
||||
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
||||
|
@ -5778,18 +5787,20 @@ DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
|
||||
4. JIT optimization is not supported.
|
||||
|
||||
|
||||
AUTHOR
|
||||
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
|
||||
|
||||
REVISION
|
||||
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 28 August 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_DFA_MATCH 3 "16 October 2018" "PCRE2 10.33"
|
||||
.TH PCRE2_DFA_MATCH 3 "28 August 2021" "PCRE2 10.38"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -33,10 +33,15 @@ just once (except when processing lookaround assertions). This function is
|
|||
\fIworkspace\fP Points to a vector of ints used as working space
|
||||
\fIwscount\fP Number of elements in the vector
|
||||
.sp
|
||||
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
|
||||
up a callout function or specify the heap limit or the match or the recursion
|
||||
depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not
|
||||
characters. The options are:
|
||||
The size of the output vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of parentheses in the
|
||||
pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
|
||||
data block is therefore not advisable when using this function.
|
||||
.P
|
||||
A match context is needed only if you want to set up a callout function or
|
||||
specify the heap limit or the match or the recursion depth limits. The
|
||||
\fIlength\fP and \fIstartoffset\fP values are code units, not characters. The
|
||||
options are:
|
||||
.sp
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_COPY_MATCHED_SUBJECT
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_MATCH_DATA_CREATE 3 "29 July 2015" "PCRE2 10.21"
|
||||
.TH PCRE2_MATCH_DATA_CREATE 3 "28 August 2021" "PCRE2 10.38"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -18,8 +18,9 @@ This function creates a new match data block, which is used for holding the
|
|||
result of a match. The first argument specifies the number of pairs of offsets
|
||||
that are required. These form the "output vector" (ovector) within the match
|
||||
data block, and are used to identify the matched string and any captured
|
||||
substrings. There is always one pair of offsets; if \fBovecsize\fP is zero, it
|
||||
is treated as one.
|
||||
substrings when matching with \fBpcre2_match()\fP, or a number of different
|
||||
matches at the same point when used with \fBpcre2_dfa_match()\fP. There is
|
||||
always one pair of offsets; if \fBovecsize\fP is zero, it is treated as one.
|
||||
.P
|
||||
The second argument points to a general context, for custom memory management,
|
||||
or is NULL for system memory management. The result of the function is NULL if
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "29 July 2015" "PCRE2 10.21"
|
||||
.TH PCRE2_MATCH_DATA_CREATE_FROM_PATTERN 3 "28 August 2021" "PCRE2 10.38"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -14,12 +14,15 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
|||
.SH DESCRIPTION
|
||||
.rs
|
||||
.sp
|
||||
This function creates a new match data block, which is used for holding the
|
||||
result of a match. The first argument points to a compiled pattern. The number
|
||||
of capturing parentheses within the pattern is used to compute the number of
|
||||
pairs of offsets that are required in the match data block. These form the
|
||||
"output vector" (ovector) within the match data block, and are used to identify
|
||||
the matched string and any captured substrings.
|
||||
This function creates a new match data block for holding the result of a match.
|
||||
The first argument points to a compiled pattern. The number of capturing
|
||||
parentheses within the pattern is used to compute the number of pairs of
|
||||
offsets that are required in the match data block. These form the "output
|
||||
vector" (ovector) within the match data block, and are used to identify the
|
||||
matched string and any captured substrings when matching with
|
||||
\fBpcre2_match()\fP. If you are using \fBpcre2_dfa_match()\fP, which uses the
|
||||
output vector in a different way, you should use \fBpcre2_match_data_create()\fP
|
||||
instead of this function.
|
||||
.P
|
||||
The second argument points to a general context, for custom memory management,
|
||||
or is NULL to use the same memory allocator as was used for the compiled
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "04 November 2020" "PCRE2 10.36"
|
||||
.TH PCRE2API 3 "28 August 2021" "PCRE2 10.38"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -2490,19 +2490,27 @@ to an abstract format like Java or .NET serialization.
|
|||
Information about a successful or unsuccessful match is placed in a match
|
||||
data block, which is an opaque structure that is accessed by function calls. In
|
||||
particular, the match data block contains a vector of offsets into the subject
|
||||
string that define the matched part of the subject and any substrings that were
|
||||
captured. This is known as the \fIovector\fP.
|
||||
string that define the matched parts of the subject. This is known as the
|
||||
\fIovector\fP.
|
||||
.P
|
||||
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
|
||||
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
|
||||
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
|
||||
argument is the number of pairs of offsets in the \fIovector\fP. One pair of
|
||||
offsets is required to identify the string that matched the whole pattern, with
|
||||
an additional pair for each captured substring. For example, a value of 4
|
||||
creates enough space to record the matched portion of the subject plus three
|
||||
captured substrings. A minimum of at least 1 pair is imposed by
|
||||
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall
|
||||
matched string.
|
||||
argument is the number of pairs of offsets in the \fIovector\fP.
|
||||
.P
|
||||
When using \fBpcre2_match()\fP, one pair of offsets is required to identify the
|
||||
string that matched the whole pattern, with an additional pair for each
|
||||
captured substring. For example, a value of 4 creates enough space to record
|
||||
the matched portion of the subject plus three captured substrings.
|
||||
.P
|
||||
When using \fBpcre2_dfa_match()\fP there may be multiple matched substrings of
|
||||
different lengths at the same point in the subject. The ovector should be made
|
||||
large enough to hold as many as are expected.
|
||||
.P
|
||||
A minimum of 1 pair is imposed by \fBpcre2_match_data_create()\fP, so
|
||||
it is always possible to return the overall matched string in the case of
|
||||
\fBpcre2_match()\fP or the longest match in the case of
|
||||
\fBpcre2_dfa_match()\fP.
|
||||
.P
|
||||
The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
|
||||
general context, which can specify custom memory management for obtaining the
|
||||
|
@ -2511,10 +2519,11 @@ pass NULL, which causes \fBmalloc()\fP to be used.
|
|||
.P
|
||||
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
|
||||
pointer to a compiled pattern. The ovector is created to be exactly the right
|
||||
size to hold all the substrings a pattern might capture. The second argument is
|
||||
again a pointer to a general context, but in this case if NULL is passed, the
|
||||
memory is obtained using the same allocator that was used for the compiled
|
||||
pattern (custom or default).
|
||||
size to hold all the substrings a pattern might capture when matched using
|
||||
\fBpcre2_match()\fP. You should not use this call when matching with
|
||||
\fBpcre2_dfa_match()\fP. The second argument is again a pointer to a general
|
||||
context, but in this case if NULL is passed, the memory is obtained using the
|
||||
same allocator that was used for the compiled pattern (custom or default).
|
||||
.P
|
||||
A match data block can be used many times, with the same or different compiled
|
||||
patterns. You can extract information from a match data block after a match
|
||||
|
@ -3991,7 +4000,7 @@ fail, this error is given.
|
|||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
.fi
|
||||
.
|
||||
|
@ -4000,6 +4009,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 04 November 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
Last updated: 28 August 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2MATCHING 3 "23 May 2019" "PCRE2 10.34"
|
||||
.TH PCRE2MATCHING 3 "28 August 2021" "PCRE2 10.38"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 MATCHING ALGORITHMS"
|
||||
|
@ -61,8 +61,9 @@ tried is controlled by the greedy or ungreedy nature of the quantifier.
|
|||
If a leaf node is reached, a matching string has been found, and at that point
|
||||
the algorithm stops. Thus, if there is more than one possible match, this
|
||||
algorithm returns the first one that it finds. Whether this is the shortest,
|
||||
the longest, or some intermediate length depends on the way the greedy and
|
||||
ungreedy repetition quantifiers are specified in the pattern.
|
||||
the longest, or some intermediate length depends on the way the alternations
|
||||
and the greedy or ungreedy repetition quantifiers are specified in the
|
||||
pattern.
|
||||
.P
|
||||
Because it ends up with a single path through the tree, it is relatively
|
||||
straightforward for this algorithm to keep track of the substrings that are
|
||||
|
@ -91,10 +92,15 @@ no more unterminated paths. At this point, terminated paths represent the
|
|||
different matching possibilities (if there are none, the match has failed).
|
||||
Thus, if there is more than one possible match, this algorithm finds all of
|
||||
them, and in particular, it finds the longest. The matches are returned in
|
||||
decreasing order of length. There is an option to stop the algorithm after the
|
||||
first match (which is necessarily the shortest) is found.
|
||||
the output vector in decreasing order of length. There is an option to stop the
|
||||
algorithm after the first match (which is necessarily the shortest) is found.
|
||||
.P
|
||||
Note that all the matches that are found start at the same point in the
|
||||
Note that the size of the vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of parentheses in the
|
||||
pattern. Using \fBpcre2_match_data_create_from_pattern()\fP to create the match
|
||||
data block is therefore not advisable when doing DFA matching.
|
||||
.P
|
||||
Note also that all the matches that are found start at the same point in the
|
||||
subject. If the pattern
|
||||
.sp
|
||||
cat(er(pillar)?)?
|
||||
|
@ -165,19 +171,13 @@ supported by \fBpcre2_dfa_match()\fP.
|
|||
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
|
||||
.rs
|
||||
.sp
|
||||
Using the alternative matching algorithm provides the following advantages:
|
||||
The main advantage of the alternative algorithm is that all possible matches
|
||||
(at a single point in the subject) are automatically found, and in particular,
|
||||
the longest match is found. To find more than one match at the same point using
|
||||
the standard algorithm, you have to do kludgy things with callouts.
|
||||
.P
|
||||
1. All possible matches (at a single point in the subject) are automatically
|
||||
found, and in particular, the longest match is found. To find more than one
|
||||
match using the standard algorithm, you have to do kludgy things with
|
||||
callouts.
|
||||
.P
|
||||
2. Because the alternative algorithm scans the subject string just once, and
|
||||
never needs to backtrack (except for lookbehinds), it is possible to pass very
|
||||
long subject strings to the matching function in several pieces, checking for
|
||||
partial matching each time. Although it is also possible to do multi-segment
|
||||
matching using the standard algorithm, by retaining partially matched
|
||||
substrings, it is more complicated. The
|
||||
Partial matching is possible with this algorithm, though it has some
|
||||
limitations. The
|
||||
.\" HREF
|
||||
\fBpcre2partial\fP
|
||||
.\"
|
||||
|
@ -199,6 +199,8 @@ invalid UTF string are not supported.
|
|||
.P
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
.P
|
||||
4. JIT optimization is not supported.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -206,7 +208,7 @@ performance advantage that it does for the standard algorithm.
|
|||
.sp
|
||||
.nf
|
||||
Philip Hazel
|
||||
University Computing Service
|
||||
Retired from University Computing Service
|
||||
Cambridge, England.
|
||||
.fi
|
||||
.
|
||||
|
@ -215,6 +217,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 28 August 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|