Improvements for substring handling with partial matches.

Philip.Hazel 2014-12-22 17:33:10 +00:00
parent 2a5767d757
commit b8dbae1474
13 changed files with 540 additions and 376 deletions


@ -31,9 +31,11 @@ The arguments are:
<pre> <pre>
<i>match_data</i> The match data block for the match <i>match_data</i> The match data block for the match
<i>number</i> The substring number <i>number</i> The substring number
<i>length</i> Where to return the length <i>length</i> Where to return the length, or NULL
</pre> </pre>
The yield is zero on success, or an error code if the substring is not found. The third argument may be NULL if all you want to know is whether or not a
substring is set. The yield is zero on success, or a negative error code
otherwise. After a partial match, only substring 0 is available.
</P> </P>
<P> <P>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the


@ -1740,6 +1740,12 @@ and
below. below.
</P> </P>
<P> <P>
When a call of <b>pcre2_match()</b> fails, valid data is available in the match
block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
of the error codes for an invalid UTF string. Exactly what is available depends
on the error, and is detailed below.
</P>
<P>
When one of the matching functions is called, pointers to the compiled pattern When one of the matching functions is called, pointers to the compiled pattern
and the subject string are set in the match data block so that they can be and the subject string are set in the match data block so that they can be
referenced by the extraction functions. After running a match, you must not referenced by the extraction functions. After running a match, you must not
@ -2018,9 +2024,9 @@ function can be used to find out how many capturing subpatterns there are in a
compiled pattern. compiled pattern.
</P> </P>
<P> <P>
The overall matched string and any captured substrings are returned to the A successful match returns the overall matched string and any captured
caller via a vector of PCRE2_SIZE values. This is called the <b>ovector</b>, and substrings to the caller via a vector of PCRE2_SIZE values. This is called the
is contained within the <b>ovector</b>, and is contained within the
<a href="#matchdatablock">match data block.</a> <a href="#matchdatablock">match data block.</a>
You can obtain direct access to the ovector by calling You can obtain direct access to the ovector by calling
<b>pcre2_get_ovector_pointer()</b> to find its address, and <b>pcre2_get_ovector_pointer()</b> to find its address, and
@ -2041,20 +2047,26 @@ library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
library. library.
</P> </P>
<P> <P>
The first pair of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
identifies the portion of the subject string that was matched by the entire of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They
pattern. The next pair is used for the first capturing subpattern, and so on. identify the part of the subject that was partially matched. See the
The value returned by <b>pcre2_match()</b> is one more than the highest numbered <a href="pcre2partial.html"><b>pcre2partial</b></a>
pair that has been set. For example, if two substrings have been captured, the documentation for details of partial matching.
returned value is 3. If there are no capturing subpatterns, the return value </P>
from a successful match is 1, indicating that just the first pair of offsets <P>
has been set. After a successful match, the first pair of offsets identifies the portion of
the subject string that was matched by the entire pattern. The next pair is
used for the first capturing subpattern, and so on. The value returned by
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
3. If there are no capturing subpatterns, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
</P> </P>
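As a rough illustration of the return-value rules described here, the following C sketch maps the value returned by pcre2_match() to the number of leading ovector pairs that may safely be read. The helper name usable_pairs() and the constant SKETCH_ERROR_PARTIAL are assumptions made so that the sketch compiles without pcre2.h; real code would include pcre2.h and compare against PCRE2_ERROR_PARTIAL. The zero-return case follows the "ovector too small" rule documented elsewhere on this page.

```c
#include <stddef.h>

/* Stand-in for PCRE2_ERROR_PARTIAL so the sketch builds without pcre2.h.
   This is an assumption for illustration; use the library's name in real code. */
#define SKETCH_ERROR_PARTIAL (-2)

/* Hypothetical helper: given the value rc returned by pcre2_match() and the
   number of pairs the ovector can hold, return how many leading pairs
   (pair 0, pair 1, ...) contain meaningful offsets. */
size_t usable_pairs(int rc, size_t ovec_pairs)
{
    if (rc > 0)
        return (size_t)rc;               /* success: pairs 0 .. rc-1 are set */
    if (rc == 0)
        return ovec_pairs;               /* ovector too small: it was filled completely */
    if (rc == SKETCH_ERROR_PARTIAL)
        return ovec_pairs > 0 ? 1 : 0;   /* partial match: only pair 0 is set */
    return 0;                            /* no match or another error: read nothing */
}
```

With two captured substrings the return value is 3, so usable_pairs(3, 10) yields 3; after a partial match it yields 1 whatever the pattern's capture count is.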
<P> <P>
If a pattern uses the \K escape sequence within a positive assertion, the If a pattern uses the \K escape sequence within a positive assertion, the
reported start of the match can be greater than the end of the match. For reported start of a successful match can be greater than the end of the match.
example, if the pattern (?=ab\K) is matched against "ab", the start and end For example, if the pattern (?=ab\K) is matched against "ab", the start and
offset values for the match are 2 and 0. end offset values for the match are 2 and 0.
</P> </P>
<P> <P>
If a capturing subpattern group is matched repeatedly within a single match If a capturing subpattern group is matched repeatedly within a single match
@ -2104,24 +2116,38 @@ had.
</P> </P>
<P> <P>
As well as the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions in
appropriate circumstances. If they are called at other times, the result is
undefined.
</P> </P>
<P> <P>
When a (*MARK) name is to be passed back, <b>pcre2_get_mark()</b> returns a After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
pointer to the zero-terminated name, which is within the compiled pattern. to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
Otherwise NULL is returned. A (*MARK) name may be available after a failed <b>pcre2_get_mark()</b> can be called. It returns a pointer to the
match or a partial match, as well as after a successful one. zero-terminated name, which is within the compiled pattern. Otherwise NULL is
returned. After a successful match, the (*MARK) name that is returned is the
last one encountered on the matching path through the pattern. After a "no
match" or a partial match, the last encountered (*MARK) name is returned. For
example, consider this pattern:
<pre>
^(*MARK:A)((*MARK:B)a|b)c
</pre>
When it matches "bc", the returned mark is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
when this pattern fails to match "bx", the returned mark is B.
</P> </P>
<P> <P>
The code unit offset of the character at which a successful match started is After a successful match, a partial match, or one of the invalid UTF errors
returned by <b>pcre2_get_startchar()</b>. For a non-partial match, this can be (for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be
called. After a successful or partial match it returns the code unit offset of
the character at which the match started. For a non-partial match, this can be
different to the value of <i>ovector[0]</i> if the pattern contains the \K different to the value of <i>ovector[0]</i> if the pattern contains the \K
escape sequence. After a partial match, however, this value is always the same escape sequence. After a partial match, however, this value is always the same
as <i>ovector[0]</i> because \K does not affect the result of a partial match. as <i>ovector[0]</i> because \K does not affect the result of a partial match.
</P> </P>
<P> <P>
The <b>startchar</b> field is also used to return the offset of an invalid After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
UTF character when UTF checking fails. Details are given in the the code unit offset of the invalid UTF character. Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a> <a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. page.
<a name="errorlist"></a></P> <a name="errorlist"></a></P>
@ -2256,19 +2282,23 @@ The internal recursion limit was reached.
Captured substrings can be accessed directly by using the ovector as described Captured substrings can be accessed directly by using the ovector as described
<a href="#matchedstrings">above.</a> <a href="#matchedstrings">above.</a>
For convenience, auxiliary functions are provided for extracting captured For convenience, auxiliary functions are provided for extracting captured
substrings as new, separate, zero-terminated strings. The functions in this substrings as new, separate, zero-terminated strings. A substring that contains
section identify substrings by number. The number zero refers to the entire a binary zero is correctly extracted and has a further zero added on the end,
matched substring, with higher numbers referring to substrings captured by but the result is not, of course, a C string.
parenthesized groups. The next section describes similar functions for </P>
extracting captured substrings by name. A substring that contains a binary zero <P>
is correctly extracted and has a further zero added on the end, but the result The functions in this section identify substrings by number. The number zero
is not, of course, a C string. refers to the entire matched substring, with higher numbers referring to
substrings captured by parenthesized groups. After a partial match, only
substring zero is available. An attempt to extract any other substring gives
the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
extracting captured substrings by name.
</P> </P>
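The interaction between the match result and the substring number can be sketched as a small decision function. This is only a model of the documented behaviour, not the library's code: the name substring_precheck() and the constant SKETCH_ERROR_PARTIAL are assumptions for illustration, and later checks (such as a number that is too large, or an unset substring) are deliberately left out.

```c
/* Stand-in for PCRE2_ERROR_PARTIAL so the sketch builds without pcre2.h.
   This is an assumption for illustration; use the library's name in real code. */
#define SKETCH_ERROR_PARTIAL (-2)

/* Hypothetical model of the first check made by the "bynumber" functions.
   match_rc is the value that pcre2_match() returned; number is the requested
   substring. Returns 0 if extraction may proceed, otherwise the negative
   error code that the caller should expect. */
int substring_precheck(int match_rc, unsigned int number)
{
    if (match_rc == SKETCH_ERROR_PARTIAL)
        return (number > 0) ? SKETCH_ERROR_PARTIAL : 0;  /* only substring 0 */
    if (match_rc < 0)
        return match_rc;     /* match failed: the failure code is passed back */
    return 0;                /* successful match: further checks not modelled */
}
```

After a partial match, substring_precheck(SKETCH_ERROR_PARTIAL, 0) gives 0, while any higher substring number gives the partial-match error code.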
<P> <P>
If a pattern uses the \K escape sequence within a positive assertion, the If a pattern uses the \K escape sequence within a positive assertion, the
reported start of the match can be greater than the end of the match. For reported start of a successful match can be greater than the end of the match.
example, if the pattern (?=ab\K) is matched against "ab", the start and end For example, if the pattern (?=ab\K) is matched against "ab", the start and
offset values for the match are 2 and 0. In this situation, calling these end offset values for the match are 2 and 0. In this situation, calling these
functions with a zero substring number extracts a zero-length empty string. functions with a zero substring number extracts a zero-length empty string.
</P> </P>
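Code that reads the ovector directly has to handle the same \K case itself. The following C sketch computes the length of a substring from one offset pair, treating a start beyond the end as an empty string and reporting unset pairs separately. The helper name pair_length() and the constant SKETCH_UNSET (standing in for PCRE2_UNSET) are assumptions so that the sketch compiles without pcre2.h; in real code the array would come from pcre2_get_ovector_pointer().

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET (all bits set in a size_t-sized offset).
   This is an assumption for illustration; use the library's name in real code. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper: look at offset pair n in the ovector.
   Returns 0 if the pair is unset; otherwise stores the substring length in
   *length and returns 1. A start offset greater than the end offset, which
   \K inside a positive assertion can produce, gives a length of zero. */
int pair_length(const size_t *ovector, size_t n, size_t *length)
{
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];

    if (start == SKETCH_UNSET)
        return 0;                                  /* group did not participate */
    *length = (start > end) ? 0 : end - start;     /* clamp the \K case to empty */
    return 1;
}
```

For the pattern (?=ab\K) matched against "ab", pair 0 holds the offsets 2 and 0, so the computed length is zero rather than a wrapped-around unsigned value.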
<P> <P>
@ -2302,7 +2332,8 @@ calling <b>pcre2_substring_free()</b>.
<P> <P>
The return value from all these functions is zero for success, or a negative The return value from all these functions is zero for success, or a negative
error code. If the pattern match failed, the match failure code is returned. error code. If the pattern match failed, the match failure code is returned.
Other possible error codes are: If a substring number greater than zero is used after a partial match,
PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
<pre> <pre>
PCRE2_ERROR_NOMEMORY PCRE2_ERROR_NOMEMORY
</pre> </pre>
@ -2343,6 +2374,10 @@ that is obtained using the same memory allocation function that was used to get
the match data block. the match data block.
</P> </P>
<P> <P>
This function must be called only after a successful match. If called after a
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
</P>
<P>
The address of the memory block is returned via <i>listptr</i>, which is also The address of the memory block is returned via <i>listptr</i>, which is also
the start of the list of string pointers. The end of the list is marked by a the start of the list of string pointers. The end of the list is marked by a
NULL pointer. The address of the list of lengths is returned via NULL pointer. The address of the list of lengths is returned via
@ -2757,7 +2792,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC37" href="#TOC1">REVISION</a><br> <br><a name="SEC37" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 14 December 2014 Last updated: 22 December 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -89,8 +89,9 @@ empty string at the end of the subject.
</P> </P>
<P> <P>
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched. The appearance of \K in the to the portion of the subject that was matched, but the values in the rest of
pattern has no effect for a partial match. Consider this pattern: the ovector are undefined. The appearance of \K in the pattern has no effect
for a partial match. Consider this pattern:
<pre> <pre>
/abc\K123/ /abc\K123/
</pre> </pre>
@ -455,7 +456,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br> <br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 14 October 2014 Last updated: 22 December 2014
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2014 University of Cambridge.
<br> <br>


@ -1753,6 +1753,12 @@ THE MATCH DATA BLOCK
described in the sections on matched strings and other match data described in the sections on matched strings and other match data
below. below.
When a call of pcre2_match() fails, valid data is available in the
match block only when the error is PCRE2_ERROR_NOMATCH,
PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
string. Exactly what is available depends on the error, and is detailed
below.
When one of the matching functions is called, pointers to the compiled When one of the matching functions is called, pointers to the compiled
pattern and the subject string are set in the match data block so that pattern and the subject string are set in the match data block so that
they can be referenced by the extraction functions. After running a they can be referenced by the extraction functions. After running a
@ -2008,14 +2014,14 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
be captured. The pcre2_pattern_info() function can be used to find out be captured. The pcre2_pattern_info() function can be used to find out
how many capturing subpatterns there are in a compiled pattern. how many capturing subpatterns there are in a compiled pattern.
The overall matched string and any captured substrings are returned to A successful match returns the overall matched string and any captured
the caller via a vector of PCRE2_SIZE values. This is called the ovec- substrings to the caller via a vector of PCRE2_SIZE values. This is
tor, and is contained within the match data block. You can obtain called the ovector, and is contained within the match data block. You
direct access to the ovector by calling pcre2_get_ovector_pointer() to can obtain direct access to the ovector by calling pcre2_get_ovec-
find its address, and pcre2_get_ovector_count() to find the number of tor_pointer() to find its address, and pcre2_get_ovector_count() to
pairs of values it contains. Alternatively, you can use the auxiliary find the number of pairs of values it contains. Alternatively, you can
functions for accessing captured substrings by number or by name (see use the auxiliary functions for accessing captured substrings by number
below). or by name (see below).
Within the ovector, the first in each pair of values is set to the off- Within the ovector, the first in each pair of values is set to the off-
set of the first code unit of a substring, and the second is set to the set of the first code unit of a substring, and the second is set to the
@ -2024,53 +2030,58 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
library, and 32-bit offsets in the 32-bit library. library, and 32-bit offsets in the 32-bit library.
The first pair of offsets (that is, ovector[0] and ovector[1]) identi- After a partial match (error return PCRE2_ERROR_PARTIAL), only the
fies the portion of the subject string that was matched by the entire first pair of offsets (that is, ovector[0] and ovector[1]) are set.
pattern. The next pair is used for the first capturing subpattern, and They identify the part of the subject that was partially matched. See
so on. The value returned by pcre2_match() is one more than the high- the pcre2partial documentation for details of partial matching.
est numbered pair that has been set. For example, if two substrings
have been captured, the returned value is 3. If there are no capturing
subpatterns, the return value from a successful match is 1, indicating
that just the first pair of offsets has been set.
If a pattern uses the \K escape sequence within a positive assertion, After a successful match, the first pair of offsets identifies the por-
the reported start of the match can be greater than the end of the tion of the subject string that was matched by the entire pattern. The
match. For example, if the pattern (?=ab\K) is matched against "ab", next pair is used for the first capturing subpattern, and so on. The
the start and end offset values for the match are 2 and 0. value returned by pcre2_match() is one more than the highest numbered
pair that has been set. For example, if two substrings have been cap-
tured, the returned value is 3. If there are no capturing subpatterns,
the return value from a successful match is 1, indicating that just the
first pair of offsets has been set.
If a capturing subpattern group is matched repeatedly within a single If a pattern uses the \K escape sequence within a positive assertion,
match operation, it is the last portion of the subject that it matched the reported start of a successful match can be greater than the end of
the match. For example, if the pattern (?=ab\K) is matched against
"ab", the start and end offset values for the match are 2 and 0.
If a capturing subpattern group is matched repeatedly within a single
match operation, it is the last portion of the subject that it matched
that is returned. that is returned.
If the ovector is too small to hold all the captured substring offsets, If the ovector is too small to hold all the captured substring offsets,
as much as possible is filled in, and the function returns a value of as much as possible is filled in, and the function returns a value of
zero. If captured substrings are not of interest, pcre2_match() may be zero. If captured substrings are not of interest, pcre2_match() may be
called with a match data block whose ovector is of minimum length (that called with a match data block whose ovector is of minimum length (that
is, one pair). However, if the pattern contains back references and the is, one pair). However, if the pattern contains back references and the
ovector is not big enough to remember the related substrings, PCRE2 has ovector is not big enough to remember the related substrings, PCRE2 has
to get additional memory for use during matching. Thus it is usually to get additional memory for use during matching. Thus it is usually
advisable to set up a match data block containing an ovector of reason- advisable to set up a match data block containing an ovector of reason-
able size. able size.
It is possible for capturing subpattern number n+1 to match some part It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example, of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs corre- 2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to PCRE2_UNSET. sponding to unused subpatterns are set to PCRE2_UNSET.
Offset values that correspond to unused subpatterns at the end of the Offset values that correspond to unused subpatterns at the end of the
expression are also set to PCRE2_UNSET. For example, if the string expression are also set to PCRE2_UNSET. For example, if the string
"abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
are not matched. The return from the function is 2, because the are not matched. The return from the function is 2, because the
highest used capturing subpattern number is 1. The offsets for the highest used capturing subpattern number is 1. The offsets for the
second and third capturing subpatterns (assuming the vector is large second and third capturing subpatterns (assuming the vector is large
enough, of course) are set to PCRE2_UNSET. enough, of course) are set to PCRE2_UNSET.
Elements in the ovector that do not correspond to capturing parentheses Elements in the ovector that do not correspond to capturing parentheses
in the pattern are never changed. That is, if a pattern contains n cap- in the pattern are never changed. That is, if a pattern contains n cap-
turing parentheses, no more than ovector[0] to ovector[2n+1] are set by turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
pcre2_match(). The other elements retain whatever values they previ- pcre2_match(). The other elements retain whatever values they previ-
ously had. ously had.
@ -2080,26 +2091,39 @@ OTHER INFORMATION ABOUT A MATCH
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
As well as the offsets in the ovector, other information about a match As well as the offsets in the ovector, other information about a match
is retained in the match data block and can be retrieved by the above is retained in the match data block and can be retrieved by the above
functions. functions in appropriate circumstances. If they are called at other
times, the result is undefined.
When a (*MARK) name is to be passed back, pcre2_get_mark() returns a After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
pointer to the zero-terminated name, which is within the compiled pat- failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
tern. Otherwise NULL is returned. A (*MARK) name may be available able, and pcre2_get_mark() can be called. It returns a pointer to the
after a failed match or a partial match, as well as after a successful zero-terminated name, which is within the compiled pattern. Otherwise
one. NULL is returned. After a successful match, the (*MARK) name that is
returned is the last one encountered on the matching path through the
pattern. After a "no match" or a partial match, the last encountered
(*MARK) name is returned. For example, consider this pattern:
The code unit offset of the character at which a successful match ^(*MARK:A)((*MARK:B)a|b)c
started is returned by pcre2_get_startchar(). For a non-partial match,
this can be different to the value of ovector[0] if the pattern con- When it matches "bc", the returned mark is A. The B mark is "seen" in
tains the \K escape sequence. After a partial match, however, this the first branch of the group, but it is not on the matching path. On
the other hand, when this pattern fails to match "bx", the returned
mark is B.
After a successful match, a partial match, or one of the invalid UTF
errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
be called. After a successful or partial match it returns the code unit
offset of the character at which the match started. For a non-partial
match, this can be different to the value of ovector[0] if the pattern
contains the \K escape sequence. After a partial match, however, this
value is always the same as ovector[0] because \K does not affect the value is always the same as ovector[0] because \K does not affect the
result of a partial match. result of a partial match.
The startchar field is also used to return the offset of an invalid UTF After a UTF check failure, pcre2_get_startchar() can be used to obtain
character when UTF checking fails. Details are given in the pcre2uni- the code unit offset of the invalid UTF character. Details are given in
code page. the pcre2unicode page.
ERROR RETURNS FROM pcre2_match() ERROR RETURNS FROM pcre2_match()
@ -2225,33 +2249,36 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
Captured substrings can be accessed directly by using the ovector as Captured substrings can be accessed directly by using the ovector as
described above. For convenience, auxiliary functions are provided for described above. For convenience, auxiliary functions are provided for
extracting captured substrings as new, separate, zero-terminated extracting captured substrings as new, separate, zero-terminated
strings. The functions in this section identify substrings by number. strings. A substring that contains a binary zero is correctly extracted
The number zero refers to the entire matched substring, with higher and has a further zero added on the end, but the result is not, of
numbers referring to substrings captured by parenthesized groups. The course, a C string.
next section describes similar functions for extracting captured sub-
strings by name. A substring that contains a binary zero is correctly
extracted and has a further zero added on the end, but the result is
not, of course, a C string.
If a pattern uses the \K escape sequence within a positive assertion, The functions in this section identify substrings by number. The number
the reported start of the match can be greater than the end of the zero refers to the entire matched substring, with higher numbers refer-
match. For example, if the pattern (?=ab\K) is matched against "ab", ring to substrings captured by parenthesized groups. After a partial
the start and end offset values for the match are 2 and 0. In this sit- match, only substring zero is available. An attempt to extract any
uation, calling these functions with a zero substring number extracts a other substring gives the error PCRE2_ERROR_PARTIAL. The next section
zero-length empty string. describes similar functions for extracting captured substrings by name.
You can find the length in code units of a captured substring without If a pattern uses the \K escape sequence within a positive assertion,
extracting it by calling pcre2_substring_length_bynumber(). The first the reported start of a successful match can be greater than the end of
argument is a pointer to the match data block, the second is the group the match. For example, if the pattern (?=ab\K) is matched against
number, and the third is a pointer to a variable into which the length "ab", the start and end offset values for the match are 2 and 0. In
is placed. If you just want to know whether or not the substring has this situation, calling these functions with a zero substring number
extracts a zero-length empty string.
You can find the length in code units of a captured substring without
extracting it by calling pcre2_substring_length_bynumber(). The first
argument is a pointer to the match data block, the second is the group
number, and the third is a pointer to a variable into which the length
is placed. If you just want to know whether or not the substring has
been captured, you can pass the third argument as NULL. been captured, you can pass the third argument as NULL.
The pcre2_substring_copy_bynumber() function copies a captured sub- The pcre2_substring_copy_bynumber() function copies a captured sub-
string into a supplied buffer, whereas pcre2_substring_get_bynumber() string into a supplied buffer, whereas pcre2_substring_get_bynumber()
copies it into new memory, obtained using the same memory allocation copies it into new memory, obtained using the same memory allocation
function that was used for the match data block. The first two argu- function that was used for the match data block. The first two argu-
ments of these functions are a pointer to the match data block and a ments of these functions are a pointer to the match data block and a
capturing group number. capturing group number.
The final arguments of pcre2_substring_copy_bynumber() are a pointer to The final arguments of pcre2_substring_copy_bynumber() are a pointer to
@ -2260,23 +2287,25 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
for the extracted substring, excluding the terminating zero. for the extracted substring, excluding the terminating zero.
For pcre2_substring_get_bynumber() the third and fourth arguments point For pcre2_substring_get_bynumber() the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the to variables that are updated with a pointer to the new memory and the
number of code units that comprise the substring, again excluding the number of code units that comprise the substring, again excluding the
terminating zero. When the substring is no longer needed, the memory terminating zero. When the substring is no longer needed, the memory
should be freed by calling pcre2_substring_free(). should be freed by calling pcre2_substring_free().
The return value from all these functions is zero for success, or a The return value from all these functions is zero for success, or a
negative error code. If the pattern match failed, the match failure negative error code. If the pattern match failed, the match failure
code is returned. Other possible error codes are: code is returned. If a substring number greater than zero is used
after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
error codes are:
PCRE2_ERROR_NOMEMORY PCRE2_ERROR_NOMEMORY
The buffer was too small for pcre2_substring_copy_bynumber(), or the The buffer was too small for pcre2_substring_copy_bynumber(), or the
attempt to get memory failed for pcre2_substring_get_bynumber(). attempt to get memory failed for pcre2_substring_get_bynumber().
PCRE2_ERROR_NOSUBSTRING PCRE2_ERROR_NOSUBSTRING
There is no substring with that number in the pattern, that is, the There is no substring with that number in the pattern, that is, the
number is greater than the number of capturing parentheses. number is greater than the number of capturing parentheses.
PCRE2_ERROR_UNAVAILABLE PCRE2_ERROR_UNAVAILABLE
@ -2287,8 +2316,8 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
PCRE2_ERROR_UNSET PCRE2_ERROR_UNSET
The substring did not participate in the match. For example, if the The substring did not participate in the match. For example, if the
pattern is (abc)|(def) and the subject is "def", and the ovector con- pattern is (abc)|(def) and the subject is "def", and the ovector con-
tains at least two capturing slots, substring number 1 is unset. tains at least two capturing slots, substring number 1 is unset.
@ -2299,13 +2328,16 @@ EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
void pcre2_substring_list_free(PCRE2_SPTR *list); void pcre2_substring_list_free(PCRE2_SPTR *list);
The pcre2_substring_list_get() function extracts all available sub- The pcre2_substring_list_get() function extracts all available sub-
strings and builds a list of pointers to them. It also (optionally) strings and builds a list of pointers to them. It also (optionally)
builds a second list that contains their lengths (in code units), builds a second list that contains their lengths (in code units),
excluding a terminating zero that is added to each of them. All this is excluding a terminating zero that is added to each of them. All this is
done in a single block of memory that is obtained using the same memory done in a single block of memory that is obtained using the same memory
allocation function that was used to get the match data block. allocation function that was used to get the match data block.
This function must be called only after a successful match. If called
after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
The address of the memory block is returned via listptr, which is also The address of the memory block is returned via listptr, which is also
the start of the list of string pointers. The end of the list is marked the start of the list of string pointers. The end of the list is marked
by a NULL pointer. The address of the list of lengths is returned via by a NULL pointer. The address of the list of lengths is returned via
@ -2694,7 +2726,7 @@ AUTHOR
REVISION REVISION
Last updated: 14 December 2014 Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -4314,9 +4346,9 @@ PARTIAL MATCHING USING pcre2_match()
string at the end of the subject. string at the end of the subject.
When a partial match is returned, the first two elements in the ovector When a partial match is returned, the first two elements in the ovector
point to the portion of the subject that was matched. The appearance of point to the portion of the subject that was matched, but the values in
\K in the pattern has no effect for a partial match. Consider this pat- the rest of the ovector are undefined. The appearance of \K in the pat-
tern: tern has no effect for a partial match. Consider this pattern:
/abc\K123/ /abc\K123/
@ -4678,7 +4710,7 @@ AUTHOR
REVISION REVISION
Last updated: 14 October 2014 Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_SUBSTRING_LENGTH_BYNUMBER 3 "01 December 2014" "PCRE2 10.00" .TH PCRE2_SUBSTRING_LENGTH_BYNUMBER 3 "22 December 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -19,9 +19,11 @@ The arguments are:
.sp .sp
\fImatch_data\fP The match data block for the match \fImatch_data\fP The match data block for the match
\fInumber\fP The substring number \fInumber\fP The substring number
\fIlength\fP Where to return the length \fIlength\fP Where to return the length, or NULL
.sp .sp
The yield is zero on success, or an error code if the substring is not found. The third argument may be NULL if all you want to know is whether or not a
substring is set. The yield is zero on success, or a negative error code
otherwise. After a partial match, only substring 0 is available.
.P .P
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
.\" HREF .\" HREF


@ -1,4 +1,4 @@
.TH PCRE2API 3 "14 December 2014" "PCRE2 10.00" .TH PCRE2API 3 "22 December 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1736,6 +1736,11 @@ other match data
.\" .\"
below. below.
.P .P
When a call of \fBpcre2_match()\fP fails, valid data is available in the match
block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
of the error codes for an invalid UTF string. Exactly what is available depends
on the error, and is detailed below.
.P
When one of the matching functions is called, pointers to the compiled pattern When one of the matching functions is called, pointers to the compiled pattern
and the subject string are set in the match data block so that they can be and the subject string are set in the match data block so that they can be
referenced by the extraction functions. After running a match, you must not referenced by the extraction functions. After running a match, you must not
@ -2031,9 +2036,9 @@ that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
function can be used to find out how many capturing subpatterns there are in a function can be used to find out how many capturing subpatterns there are in a
compiled pattern. compiled pattern.
.P .P
The overall matched string and any captured substrings are returned to the A successful match returns the overall matched string and any captured
caller via a vector of PCRE2_SIZE values. This is called the \fBovector\fP, and substrings to the caller via a vector of PCRE2_SIZE values. This is called the
is contained within the \fBovector\fP, and is contained within the
.\" HTML <a href="#matchdatablock"> .\" HTML <a href="#matchdatablock">
.\" </a> .\" </a>
match data block. match data block.
@ -2061,19 +2066,26 @@ offsets, not character offsets. That is, they are byte offsets in the 8-bit
library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
library. library.
.P .P
The first pair of offsets (that is, \fIovector[0]\fP and \fIovector[1]\fP) After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
identifies the portion of the subject string that was matched by the entire of offsets (that is, \fIovector[0]\fP and \fIovector[1]\fP) are set. They
pattern. The next pair is used for the first capturing subpattern, and so on. identify the part of the subject that was partially matched. See the
The value returned by \fBpcre2_match()\fP is one more than the highest numbered .\" HREF
pair that has been set. For example, if two substrings have been captured, the \fBpcre2partial\fP
returned value is 3. If there are no capturing subpatterns, the return value .\"
from a successful match is 1, indicating that just the first pair of offsets documentation for details of partial matching.
has been set. .P
After a successful match, the first pair of offsets identifies the portion of
the subject string that was matched by the entire pattern. The next pair is
used for the first capturing subpattern, and so on. The value returned by
\fBpcre2_match()\fP is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
3. If there are no capturing subpatterns, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
.P .P
If a pattern uses the \eK escape sequence within a positive assertion, the If a pattern uses the \eK escape sequence within a positive assertion, the
reported start of the match can be greater than the end of the match. For reported start of a successful match can be greater than the end of the match.
example, if the pattern (?=ab\eK) is matched against "ab", the start and end For example, if the pattern (?=ab\eK) is matched against "ab", the start and
offset values for the match are 2 and 0. end offset values for the match are 2 and 0.
.P .P
If a capturing subpattern group is matched repeatedly within a single match If a capturing subpattern group is matched repeatedly within a single match
operation, it is the last portion of the subject that it matched that is operation, it is the last portion of the subject that it matched that is
@ -2121,21 +2133,35 @@ had.
.fi .fi
.P .P
As well as the offsets in the ovector, other information about a match is As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions. retained in the match data block and can be retrieved by the above functions in
appropriate circumstances. If they are called at other times, the result is
undefined.
.P .P
When a (*MARK) name is to be passed back, \fBpcre2_get_mark()\fP returns a After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
pointer to the zero-terminated name, which is within the compiled pattern. to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
Otherwise NULL is returned. A (*MARK) name may be available after a failed \fBpcre2_get_mark()\fP can be called. It returns a pointer to the
match or a partial match, as well as after a successful one. zero-terminated name, which is within the compiled pattern. Otherwise NULL is
returned. After a successful match, the (*MARK) name that is returned is the
last one encountered on the matching path through the pattern. After a "no
match" or a partial match, the last encountered (*MARK) name is returned. For
example, consider this pattern:
.sp
^(*MARK:A)((*MARK:B)a|b)c
.sp
When it matches "bc", the returned mark is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
when this pattern fails to match "bx", the returned mark is B.
.P .P
The code unit offset of the character at which a successful match started is After a successful match, a partial match, or one of the invalid UTF errors
returned by \fBpcre2_get_startchar()\fP. For a non-partial match, this can be (for example, PCRE2_ERROR_UTF8_ERR5), \fBpcre2_get_startchar()\fP can be
called. After a successful or partial match it returns the code unit offset of
the character at which the match started. For a non-partial match, this can be
different to the value of \fIovector[0]\fP if the pattern contains the \eK different to the value of \fIovector[0]\fP if the pattern contains the \eK
escape sequence. After a partial match, however, this value is always the same escape sequence. After a partial match, however, this value is always the same
as \fIovector[0]\fP because \eK does not affect the result of a partial match. as \fIovector[0]\fP because \eK does not affect the result of a partial match.
.P .P
The \fBstartchar\fP field is also used to return the offset of an invalid After a UTF check failure, \fBpcre2_get_startchar()\fP can be used to obtain
UTF character when UTF checking fails. Details are given in the the code unit offset of the invalid UTF character. Details are given in the
.\" HREF .\" HREF
\fBpcre2unicode\fP \fBpcre2unicode\fP
.\" .\"
@ -2289,18 +2315,21 @@ Captured substrings can be accessed directly by using the ovector as described
above. above.
.\" .\"
For convenience, auxiliary functions are provided for extracting captured For convenience, auxiliary functions are provided for extracting captured
substrings as new, separate, zero-terminated strings. The functions in this substrings as new, separate, zero-terminated strings. A substring that contains
section identify substrings by number. The number zero refers to the entire a binary zero is correctly extracted and has a further zero added on the end,
matched substring, with higher numbers referring to substrings captured by but the result is not, of course, a C string.
parenthesized groups. The next section describes similar functions for .P
extracting captured substrings by name. A substring that contains a binary zero The functions in this section identify substrings by number. The number zero
is correctly extracted and has a further zero added on the end, but the result refers to the entire matched substring, with higher numbers referring to
is not, of course, a C string. substrings captured by parenthesized groups. After a partial match, only
substring zero is available. An attempt to extract any other substring gives
the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
extracting captured substrings by name.
.P .P
If a pattern uses the \eK escape sequence within a positive assertion, the If a pattern uses the \eK escape sequence within a positive assertion, the
reported start of the match can be greater than the end of the match. For reported start of a successful match can be greater than the end of the match.
example, if the pattern (?=ab\eK) is matched against "ab", the start and end For example, if the pattern (?=ab\eK) is matched against "ab", the start and
offset values for the match are 2 and 0. In this situation, calling these end offset values for the match are 2 and 0. In this situation, calling these
functions with a zero substring number extracts a zero-length empty string. functions with a zero substring number extracts a zero-length empty string.
.P .P
You can find the length in code units of a captured substring without You can find the length in code units of a captured substring without
@ -2329,7 +2358,8 @@ calling \fBpcre2_substring_free()\fP.
.P .P
The return value from all these functions is zero for success, or a negative The return value from all these functions is zero for success, or a negative
error code. If the pattern match failed, the match failure code is returned. error code. If the pattern match failed, the match failure code is returned.
Other possible error codes are: If a substring number greater than zero is used after a partial match,
PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
.sp .sp
PCRE2_ERROR_NOMEMORY PCRE2_ERROR_NOMEMORY
.sp .sp
@ -2371,6 +2401,9 @@ that is added to each of them. All this is done in a single block of memory
that is obtained using the same memory allocation function that was used to get that is obtained using the same memory allocation function that was used to get
the match data block. the match data block.
.P .P
This function must be called only after a successful match. If called after a
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
.P
The address of the memory block is returned via \fIlistptr\fP, which is also The address of the memory block is returned via \fIlistptr\fP, which is also
the start of the list of string pointers. The end of the list is marked by a the start of the list of string pointers. The end of the list is marked by a
NULL pointer. The address of the list of lengths is returned via NULL pointer. The address of the list of lengths is returned via
@ -2802,6 +2835,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 December 2014 Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi
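
The revised pcre2api text above says that after a partial match only the first ovector pair is set, that a successful match returns one more than the highest numbered pair that was set, and that a start offset greater than the end offset (the \eK case) yields a zero-length string. A minimal sketch of those rules follows. It is not library code: the names sketch_defined_pairs, sketch_pair_length and the SKETCH_ macros are hypothetical, and the values -1 and -2 are assumed to mirror PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL (the test output in this commit prints -2 for "partial match").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCRE2 error codes (assumed values). */
#define SKETCH_ERROR_NOMATCH (-1)
#define SKETCH_ERROR_PARTIAL (-2)

/* How many leading ovector pairs hold defined values for a given return
   code from the matching function. oveccount is the number of pairs the
   ovector can hold. */
int sketch_defined_pairs(int rc, int oveccount)
{
assert(oveccount >= 1);
if (rc == SKETCH_ERROR_PARTIAL) return 1;  /* only ovector[0], ovector[1] */
if (rc < 0) return 0;                      /* no match or another error */
if (rc == 0) return oveccount;             /* ovector too small: all slots used */
return rc;                                 /* one more than highest pair set */
}

/* Length in code units described by one ovector pair. A start beyond the
   end, as with (?=ab\K) matched against "ab", gives an empty string. */
size_t sketch_pair_length(size_t start, size_t end)
{
return (start > end)? 0 : end - start;
}
```

With these helpers, a caller that sees the partial-match code would look only at pair 0, and the offsets 2 and 0 from the \eK example give a length of zero.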


@ -1,4 +1,4 @@
.TH PCRE2PARTIAL 3 "14 October 2014" "PCRE2 10.00" .TH PCRE2PARTIAL 3 "22 December 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2" .SH "PARTIAL MATCHING IN PCRE2"
@ -64,8 +64,9 @@ matched; without such a restriction there would always be a partial match of an
empty string at the end of the subject. empty string at the end of the subject.
.P .P
When a partial match is returned, the first two elements in the ovector point When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched. The appearance of \eK in the to the portion of the subject that was matched, but the values in the rest of
pattern has no effect for a partial match. Consider this pattern: the ovector are undefined. The appearance of \eK in the pattern has no effect
for a partial match. Consider this pattern:
.sp .sp
/abc\eK123/ /abc\eK123/
.sp .sp
@ -428,6 +429,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 October 2014 Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -312,9 +312,15 @@ PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_substring_length_bynumber(pcre2_match_data *match_data, pcre2_substring_length_bynumber(pcre2_match_data *match_data,
uint32_t stringnumber, PCRE2_SIZE *sizeptr) uint32_t stringnumber, PCRE2_SIZE *sizeptr)
{ {
int count;
PCRE2_SIZE left, right; PCRE2_SIZE left, right;
if ((count = match_data->rc) < 0) return count; /* Match failed */ int count = match_data->rc;
if (count == PCRE2_ERROR_PARTIAL)
{
if (stringnumber > 0) return PCRE2_ERROR_PARTIAL;
count = 0;
}
else if (count < 0) return count; /* Match failed */
if (match_data->matchedby != PCRE2_MATCHEDBY_DFA_INTERPRETER) if (match_data->matchedby != PCRE2_MATCHEDBY_DFA_INTERPRETER)
{ {
if (stringnumber > match_data->code->top_bracket) if (stringnumber > match_data->code->top_bracket)
@ -329,6 +335,7 @@ else /* Matched using pcre2_dfa_match() */
if (stringnumber >= match_data->oveccount) return PCRE2_ERROR_UNAVAILABLE; if (stringnumber >= match_data->oveccount) return PCRE2_ERROR_UNAVAILABLE;
if (count != 0 && stringnumber >= (uint32_t)count) return PCRE2_ERROR_UNSET; if (count != 0 && stringnumber >= (uint32_t)count) return PCRE2_ERROR_UNSET;
} }
left = match_data->ovector[stringnumber*2]; left = match_data->ovector[stringnumber*2];
right = match_data->ovector[stringnumber*2+1]; right = match_data->ovector[stringnumber*2+1];
if (sizeptr != NULL) *sizeptr = (left > right)? 0 : right - left; if (sizeptr != NULL) *sizeptr = (left > right)? 0 : right - left;
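
The control flow of the hunk above can be tried out in isolation. The following is a simplified model only, not the library source: sketch_match_data and sketch_length_bynumber are hypothetical names, the structure keeps just the fields this hunk touches, and only the DFA-style bounds checks visible in the hunk are modelled. The error values are assumptions taken from the test output in this commit (-2 partial match, -54 not available, -55 not set).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCRE2 error codes (assumed values). */
#define SKETCH_ERROR_NOMATCH (-1)
#define SKETCH_ERROR_PARTIAL (-2)
#define SKETCH_ERROR_UNAVAILABLE (-54)
#define SKETCH_ERROR_UNSET (-55)

/* Reduced model of a match data block. */
typedef struct {
  int rc;                 /* return code of the last match */
  uint32_t oveccount;     /* number of pairs the ovector can hold */
  const size_t *ovector;  /* start/end offsets, two per substring */
} sketch_match_data;

/* Returns 0 and (optionally) the length of substring n, or a negative
   error code. len may be NULL when only the "is it set" answer is wanted. */
int sketch_length_bynumber(const sketch_match_data *md, uint32_t n,
  size_t *len)
{
size_t left, right;
int count;
assert(md != NULL && md->ovector != NULL);
count = md->rc;
if (count == SKETCH_ERROR_PARTIAL)
  {
  if (n > 0) return SKETCH_ERROR_PARTIAL;  /* only substring 0 exists */
  count = 0;                               /* skip the "unset" check below */
  }
else if (count < 0) return count;          /* match failed */
if (n >= md->oveccount) return SKETCH_ERROR_UNAVAILABLE;
if (count != 0 && n >= (uint32_t)count) return SKETCH_ERROR_UNSET;
left = md->ovector[n*2];
right = md->ovector[n*2+1];
if (len != NULL) *len = (left > right)? 0 : right - left;
return 0;
}
```

Setting count to zero in the partial case is what lets substring 0 pass the later range check, which would otherwise compare the number against a negative count.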


@ -4233,6 +4233,232 @@ return (cb->callout_number != dat_datctl.cfail[0])? 0 :
/*************************************************
* Handle *MARK and copy/get tests *
*************************************************/
/* This function is called after complete and partial matches. It runs the
tests for substring extraction.
Arguments:
utf TRUE for utf
capcount return from pcre2_match()
Returns: nothing
*/
static void
copy_and_get(BOOL utf, int capcount)
{
int i;
uint8_t *nptr;
/* Test copy strings by number */
for (i = 0; i < MAXCPYGET && dat_datctl.copy_numbers[i] >= 0; i++)
{
int rc;
PCRE2_SIZE length, length2;
uint32_t copybuffer[256];
uint32_t n = (uint32_t)(dat_datctl.copy_numbers[i]);
length = sizeof(copybuffer)/code_unit_size;
PCRE2_SUBSTRING_COPY_BYNUMBER(rc, match_data, n, copybuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Copy substring %d failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
PCRE2_SUBSTRING_LENGTH_BYNUMBER(rc, match_data, n, &length2);
if (rc < 0)
{
fprintf(outfile, "Get substring %d length failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else if (length2 != length)
{
fprintf(outfile, "Mismatched substring lengths: %lu %lu\n",
(unsigned long)length, (unsigned long)length2);
}
fprintf(outfile, "%2dC ", n);
PCHARSV(copybuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu)\n", (unsigned long)length);
}
}
/* Test copy strings by name */
nptr = dat_datctl.copy_names;
for (;;)
{
int rc;
int groupnumber;
PCRE2_SIZE length, length2;
uint32_t copybuffer[256];
int namelen = strlen((const char *)nptr);
#if defined SUPPORT_PCRE2_16 || defined SUPPORT_PCRE2_32
PCRE2_SIZE cnl = namelen;
#endif
if (namelen == 0) break;
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) strcpy((char *)pbuffer8, (char *)nptr);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)(void)to16(nptr, utf, &cnl);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE)(void)to32(nptr, utf, &cnl);
#endif
PCRE2_SUBSTRING_NUMBER_FROM_NAME(groupnumber, compiled_code, pbuffer);
if (groupnumber < 0 && groupnumber != PCRE2_ERROR_NOUNIQUESUBSTRING)
fprintf(outfile, "Number not found for group '%s'\n", nptr);
length = sizeof(copybuffer)/code_unit_size;
PCRE2_SUBSTRING_COPY_BYNAME(rc, match_data, pbuffer, copybuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Copy substring '%s' failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
PCRE2_SUBSTRING_LENGTH_BYNAME(rc, match_data, pbuffer, &length2);
if (rc < 0)
{
fprintf(outfile, "Get substring '%s' length failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else if (length2 != length)
{
fprintf(outfile, "Mismatched substring lengths: %lu %lu\n",
(unsigned long)length, (unsigned long)length2);
}
fprintf(outfile, " C ");
PCHARSV(copybuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu) %s", (unsigned long)length, nptr);
if (groupnumber >= 0) fprintf(outfile, " (group %d)\n", groupnumber);
else fprintf(outfile, " (non-unique)\n");
}
nptr += namelen + 1;
}
/* Test get strings by number */
for (i = 0; i < MAXCPYGET && dat_datctl.get_numbers[i] >= 0; i++)
{
int rc;
PCRE2_SIZE length;
void *gotbuffer;
uint32_t n = (uint32_t)(dat_datctl.get_numbers[i]);
PCRE2_SUBSTRING_GET_BYNUMBER(rc, match_data, n, &gotbuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Get substring %d failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
fprintf(outfile, "%2dG ", n);
PCHARSV(gotbuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu)\n", (unsigned long)length);
PCRE2_SUBSTRING_FREE(gotbuffer);
}
}
/* Test get strings by name */
nptr = dat_datctl.get_names;
for (;;)
{
PCRE2_SIZE length;
void *gotbuffer;
int rc;
int groupnumber;
int namelen = strlen((const char *)nptr);
#if defined SUPPORT_PCRE2_16 || defined SUPPORT_PCRE2_32
PCRE2_SIZE cnl = namelen;
#endif
if (namelen == 0) break;
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) strcpy((char *)pbuffer8, (char *)nptr);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)(void)to16(nptr, utf, &cnl);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE)(void)to32(nptr, utf, &cnl);
#endif
PCRE2_SUBSTRING_NUMBER_FROM_NAME(groupnumber, compiled_code, pbuffer);
if (groupnumber < 0 && groupnumber != PCRE2_ERROR_NOUNIQUESUBSTRING)
fprintf(outfile, "Number not found for group '%s'\n", nptr);
PCRE2_SUBSTRING_GET_BYNAME(rc, match_data, pbuffer, &gotbuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Get substring '%s' failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
fprintf(outfile, " G ");
PCHARSV(gotbuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu) %s", (unsigned long)length, nptr);
if (groupnumber >= 0) fprintf(outfile, " (group %d)\n", groupnumber);
else fprintf(outfile, " (non-unique)\n");
PCRE2_SUBSTRING_FREE(gotbuffer);
}
nptr += namelen + 1;
}
/* Test getting the complete list of captured strings. */
if ((dat_datctl.control & CTL_GETALL) != 0)
{
int rc;
void **stringlist;
PCRE2_SIZE *lengths;
PCRE2_SUBSTRING_LIST_GET(rc, match_data, &stringlist, &lengths);
if (rc < 0)
{
fprintf(outfile, "get substring list failed (%d): ", rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
for (i = 0; i < capcount; i++)
{
fprintf(outfile, "%2dL ", i);
PCHARSV(stringlist[i], 0, lengths[i], utf, outfile);
putc('\n', outfile);
}
if (stringlist[i] != NULL)
fprintf(outfile, "string list not terminated by NULL\n");
PCRE2_SUBSTRING_LIST_FREE(stringlist);
}
}
}
/************************************************* /*************************************************
* Process a data line * * Process a data line *
*************************************************/ *************************************************/
@ -5074,7 +5300,6 @@ else for (gmatched = 0;; gmatched++)
{ {
int i; int i;
uint32_t oveccount; uint32_t oveccount;
uint8_t *nptr;
/* This is a check against a lunatic return value. */ /* This is a check against a lunatic return value. */
@ -5239,7 +5464,7 @@ else for (gmatched = 0;; gmatched++)
} }
} }
/* Output mark data if requested. */ /* Output (*MARK) data if requested */
if ((dat_datctl.control & CTL_MARK) != 0 && if ((dat_datctl.control & CTL_MARK) != 0 &&
TESTFLD(match_data, mark, !=, NULL)) TESTFLD(match_data, mark, !=, NULL))
@ -5249,208 +5474,10 @@ else for (gmatched = 0;; gmatched++)
fprintf(outfile, "\n"); fprintf(outfile, "\n");
} }
/* Test copy strings by number */ /* Process copy/get strings */
for (i = 0; i < MAXCPYGET && dat_datctl.copy_numbers[i] >= 0; i++) copy_and_get(utf, capcount);
{
int rc;
PCRE2_SIZE length, length2;
uint32_t copybuffer[256];
uint32_t n = (uint32_t)(dat_datctl.copy_numbers[i]);
length = sizeof(copybuffer)/code_unit_size;
PCRE2_SUBSTRING_COPY_BYNUMBER(rc, match_data, n, copybuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Copy substring %d failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
PCRE2_SUBSTRING_LENGTH_BYNUMBER(rc, match_data, n, &length2);
if (rc < 0)
{
fprintf(outfile, "Get substring %d length failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else if (length2 != length)
{
fprintf(outfile, "Mismatched substring lengths: %ld %ld\n",
length, length2);
}
fprintf(outfile, "%2dC ", n);
PCHARSV(copybuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu)\n", (unsigned long)length);
}
}
/* Test copy strings by name */
nptr = dat_datctl.copy_names;
for (;;)
{
int rc;
int groupnumber;
PCRE2_SIZE length, length2;
uint32_t copybuffer[256];
int namelen = strlen((const char *)nptr);
#if defined SUPPORT_PCRE2_16 || defined SUPPORT_PCRE2_32
PCRE2_SIZE cnl = namelen;
#endif
if (namelen == 0) break;
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) strcpy((char *)pbuffer8, (char *)nptr);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)(void)to16(nptr, utf, &cnl);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE)(void)to32(nptr, utf, &cnl);
#endif
PCRE2_SUBSTRING_NUMBER_FROM_NAME(groupnumber, compiled_code, pbuffer);
if (groupnumber < 0 && groupnumber != PCRE2_ERROR_NOUNIQUESUBSTRING)
fprintf(outfile, "Number not found for group '%s'\n", nptr);
length = sizeof(copybuffer)/code_unit_size;
PCRE2_SUBSTRING_COPY_BYNAME(rc, match_data, pbuffer, copybuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Copy substring '%s' failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
PCRE2_SUBSTRING_LENGTH_BYNAME(rc, match_data, pbuffer, &length2);
if (rc < 0)
{
fprintf(outfile, "Get substring '%s' length failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else if (length2 != length)
{
fprintf(outfile, "Mismatched substring lengths: %ld %ld\n",
length, length2);
}
fprintf(outfile, " C ");
PCHARSV(copybuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu) %s", (unsigned long)length, nptr);
if (groupnumber >= 0) fprintf(outfile, " (group %d)\n", groupnumber);
else fprintf(outfile, " (non-unique)\n");
}
nptr += namelen + 1;
}
/* Test get strings by number */
for (i = 0; i < MAXCPYGET && dat_datctl.get_numbers[i] >= 0; i++)
{
int rc;
PCRE2_SIZE length;
void *gotbuffer;
uint32_t n = (uint32_t)(dat_datctl.get_numbers[i]);
PCRE2_SUBSTRING_GET_BYNUMBER(rc, match_data, n, &gotbuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Get substring %d failed (%d): ", n, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
fprintf(outfile, "%2dG ", n);
PCHARSV(gotbuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu)\n", (unsigned long)length);
PCRE2_SUBSTRING_FREE(gotbuffer);
}
}
/* Test get strings by name */
nptr = dat_datctl.get_names;
for (;;)
{
PCRE2_SIZE length;
void *gotbuffer;
int rc;
int groupnumber;
int namelen = strlen((const char *)nptr);
#if defined SUPPORT_PCRE2_16 || defined SUPPORT_PCRE2_32
PCRE2_SIZE cnl = namelen;
#endif
if (namelen == 0) break;
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) strcpy((char *)pbuffer8, (char *)nptr);
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)(void)to16(nptr, utf, &cnl);
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE)(void)to32(nptr, utf, &cnl);
#endif
PCRE2_SUBSTRING_NUMBER_FROM_NAME(groupnumber, compiled_code, pbuffer);
if (groupnumber < 0 && groupnumber != PCRE2_ERROR_NOUNIQUESUBSTRING)
fprintf(outfile, "Number not found for group '%s'\n", nptr);
PCRE2_SUBSTRING_GET_BYNAME(rc, match_data, pbuffer, &gotbuffer, &length);
if (rc < 0)
{
fprintf(outfile, "Get substring '%s' failed (%d): ", nptr, rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
fprintf(outfile, " G ");
PCHARSV(gotbuffer, 0, length, utf, outfile);
fprintf(outfile, " (%lu) %s", (unsigned long)length, nptr);
if (groupnumber >= 0) fprintf(outfile, " (group %d)\n", groupnumber);
else fprintf(outfile, " (non-unique)\n");
PCRE2_SUBSTRING_FREE(gotbuffer);
}
nptr += namelen + 1;
}
/* Test getting the complete list of captured strings. */
if ((dat_datctl.control & CTL_GETALL) != 0)
{
int rc;
void **stringlist;
PCRE2_SIZE *lengths;
PCRE2_SUBSTRING_LIST_GET(rc, match_data, &stringlist, &lengths);
if (rc < 0)
{
fprintf(outfile, "get substring list failed (%d): ", rc);
PCRE2_GET_ERROR_MESSAGE(rc, rc, pbuffer);
PCHARSV(CASTVAR(void *, pbuffer), 0, rc, FALSE, outfile);
fprintf(outfile, "\n");
}
else
{
for (i = 0; i < capcount; i++)
{
fprintf(outfile, "%2dL ", i);
PCHARSV(stringlist[i], 0, lengths[i], utf, outfile);
putc('\n', outfile);
}
if (stringlist[i] != NULL)
fprintf(outfile, "string list not terminated by NULL\n");
PCRE2_SUBSTRING_LIST_FREE(stringlist);
}
}
} /* End of handling a successful match */ } /* End of handling a successful match */
/* There was a partial match. The value of ovector[0] is the bumpalong point, /* There was a partial match. The value of ovector[0] is the bumpalong point,
@ -5489,6 +5516,10 @@ else for (gmatched = 0;; gmatched++)
fprintf(outfile, "\n"); fprintf(outfile, "\n");
} }
/* Process copy/get strings */
copy_and_get(utf, 1);
break; /* Out of the /g loop */ break; /* Out of the /g loop */
} /* End of handling partial match */ } /* End of handling partial match */

3
testdata/testinput2 vendored

@ -4097,4 +4097,7 @@ a random value. /Ix
a\=ovector=2,copy=A,get=A,get=2 a\=ovector=2,copy=A,get=A,get=2
b\=ovector=2,copy=A,get=A,get=2 b\=ovector=2,copy=A,get=A,get=2
/a(b)c(d)/
abc\=ph,copy=0,copy=1,getall
# End of testinput2 # End of testinput2

3
testdata/testinput6 vendored

@ -4808,4 +4808,7 @@
a\=ovector=2,get=1,get=2,getall a\=ovector=2,get=1,get=2,getall
aaa\=ovector=2,get=1,get=2,getall aaa\=ovector=2,get=1,get=2,getall
/a(b)c(d)/
abc\=ph,copy=0,copy=1,getall
# End of testinput6 # End of testinput6


@ -13762,4 +13762,11 @@ Copy substring 'A' failed (-55): requested value is not set
Get substring 2 failed (-54): requested value is not available Get substring 2 failed (-54): requested value is not available
Get substring 'A' failed (-55): requested value is not set Get substring 'A' failed (-55): requested value is not set
/a(b)c(d)/
abc\=ph,copy=0,copy=1,getall
Partial match: abc
0C abc (3)
Copy substring 1 failed (-2): partial match
get substring list failed (-2): partial match
# End of testinput2 # End of testinput2


@ -7766,4 +7766,11 @@ Get substring 2 failed (-54): requested value is not available
0L aaa 0L aaa
1L aa 1L aa
/a(b)c(d)/
abc\=ph,copy=0,copy=1,getall
Partial match: abc
0C abc (3)
Copy substring 1 failed (-2): partial match
get substring list failed (-2): partial match
# End of testinput6 # End of testinput6