Fix problems with new PCRE2_SUBSTITUTE_MATCHED code.
parent
29c0d64158
commit
a57787b7cd
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "22 January 2020" "PCRE2 10.35"
|
.TH PCRE2API 3 "16 February 2020" "PCRE2 10.35"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -3328,12 +3328,12 @@ can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
|
||||||
option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
|
option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
|
||||||
replacement string(s). The default action is to perform just one replacement if
|
replacement string(s). The default action is to perform just one replacement if
|
||||||
the pattern matches, but there is an option that requests multiple replacements
|
the pattern matches, but there is an option that requests multiple replacements
|
||||||
(see PCRE2_SUBSTITUTE_GLOBAL below for details).
|
(see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||||
.P
|
.P
|
||||||
If successful, \fBpcre2_substitute()\fP returns the number of substitutions
|
If successful, \fBpcre2_substitute()\fP returns the number of substitutions
|
||||||
that were carried out. This may be zero if no match was found, and is never
|
that were carried out. This may be zero if no match was found, and is never
|
||||||
greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A negative value is
|
greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A negative value is
|
||||||
returned if an error is detected (see below for details).
|
returned if an error is detected.
|
||||||
.P
|
.P
|
||||||
Matches in which a \eK item in a lookahead in the pattern causes the match to
|
Matches in which a \eK item in a lookahead in the pattern causes the match to
|
||||||
end before it starts are not supported, and give rise to an error return. For
|
end before it starts are not supported, and give rise to an error return. For
|
||||||
|
@ -3348,10 +3348,11 @@ data block is obtained and freed within this function, using memory management
|
||||||
functions from the match context, if provided, or else those that were used to
|
functions from the match context, if provided, or else those that were used to
|
||||||
allocate memory for the compiled code.
|
allocate memory for the compiled code.
|
||||||
.P
|
.P
|
||||||
If an external \fImatch_data\fP block is provided, its contents afterwards
|
If \fImatch_data\fP is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
|
||||||
are those set by the final call to \fBpcre2_match()\fP. For global changes,
|
provided block is used for all calls to \fBpcre2_match()\fP, and its contents
|
||||||
this will have ended in a no-match error. The contents of the ovector within
|
afterwards are the result of the final call. For global changes, this will
|
||||||
the match data block may or may not have been changed.
|
always be a no-match error. The contents of the ovector within the match data
|
||||||
|
block may or may not have been changed.
|
||||||
.P
|
.P
|
||||||
As well as the usual options for \fBpcre2_match()\fP, a number of additional
|
As well as the usual options for \fBpcre2_match()\fP, a number of additional
|
||||||
options can be set in the \fIoptions\fP argument of \fBpcre2_substitute()\fP.
|
options can be set in the \fIoptions\fP argument of \fBpcre2_substitute()\fP.
|
||||||
|
@ -3363,16 +3364,22 @@ calling \fBpcre2_match()\fP from within \fBpcre2_substitute()\fP. This allows
|
||||||
an application to check for a match before choosing to substitute, without
|
an application to check for a match before choosing to substitute, without
|
||||||
having to repeat the match.
|
having to repeat the match.
|
||||||
.P
|
.P
|
||||||
The \fIcode\fP argument is not used for the first substitution when
|
The contents of the externally supplied match data block are not changed when
|
||||||
PCRE2_SUBSTITUTE_MATCHED is set, but if PCRE2_SUBSTITUTE_GLOBAL is also set,
|
PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTITUTE_GLOBAL is also set,
|
||||||
\fBpcre2_match()\fP will be called after the first substitution to check for
|
\fBpcre2_match()\fP is called after the first substitution to check for further
|
||||||
further matches, and the contents of the \fImatch_data\fP block will be
|
matches, but this is done using an internally obtained match data block, thus
|
||||||
changed.
|
always leaving the external block unchanged.
|
||||||
.P
|
.P
|
||||||
The default is to return a copy of the subject string with matched substrings
|
The \fIcode\fP argument is not used for matching before the first substitution
|
||||||
replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the
|
when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, even when
|
||||||
replacement substrings are returned. In the global case, multiple replacements
|
PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains information such as the
|
||||||
are concatenated in the output buffer. Substitution callouts (see
|
UTF setting and the number of capturing parentheses in the pattern.
|
||||||
|
.P
|
||||||
|
The default action of \fBpcre2_substitute()\fP is to return a copy of the
|
||||||
|
subject string with matched substrings replaced. However, if
|
||||||
|
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
|
||||||
|
returned. In the global case, multiple replacements are concatenated in the
|
||||||
|
output buffer. Substitution callouts (see
|
||||||
.\" HTML <a href="#subcallouts">
|
.\" HTML <a href="#subcallouts">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
below)
|
below)
|
||||||
|
@ -3381,26 +3388,39 @@ can be used to separate them if necessary.
|
||||||
.P
|
.P
|
||||||
The \fIoutlengthptr\fP argument of \fBpcre2_substitute()\fP must point to a
|
The \fIoutlengthptr\fP argument of \fBpcre2_substitute()\fP must point to a
|
||||||
variable that contains the length, in code units, of the output buffer. If the
|
variable that contains the length, in code units, of the output buffer. If the
|
||||||
function is successful, the value is updated to contain the length of the new
|
function is successful, the value is updated to contain the length in code
|
||||||
string, excluding the trailing zero that is automatically added.
|
units of the new string, excluding the trailing zero that is automatically
|
||||||
|
added.
|
||||||
.P
|
.P
|
||||||
If the function is not successful, the value set via \fIoutlengthptr\fP depends
|
If the function is not successful, the value set via \fIoutlengthptr\fP depends
|
||||||
on the type of error. For syntax errors in the replacement string, the value is
|
on the type of error. For syntax errors in the replacement string, the value is
|
||||||
the offset in the replacement string where the error was detected. For other
|
the offset in the replacement string where the error was detected. For other
|
||||||
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
||||||
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
|
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
|
||||||
(see below), in which case the value is the minimum length needed, including
|
|
||||||
space for the trailing zero. Note that in order to compute the required length,
|
|
||||||
\fBpcre2_substitute()\fP has to simulate all the matching and copying, instead
|
|
||||||
of giving an error return as soon as the buffer overflows. Note also that the
|
|
||||||
length is in code units, not bytes.
|
|
||||||
.P
|
.P
|
||||||
The replacement string, which is interpreted as a UTF string in UTF mode,
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
||||||
is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set. If the
|
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
||||||
PCRE2_SUBSTITUTE_LITERAL option is set, it is not interpreted in any way. By
|
this option is set, however, \fBpcre2_substitute()\fP continues to go through
|
||||||
default, however, a dollar character is an escape character that can specify
|
the motions of matching and substituting (without, of course, writing anything)
|
||||||
the insertion of characters from capture groups and names from (*MARK) or other
|
in order to compute the size of buffer that is needed. This value is passed
|
||||||
control verbs in the pattern. The following forms are always recognized:
|
back via the \fIoutlengthptr\fP variable, with the result of the function still
|
||||||
|
being PCRE2_ERROR_NOMEMORY.
|
||||||
|
.P
|
||||||
|
Passing a buffer size of zero is a permitted way of finding out how much memory
|
||||||
|
is needed for given substitution. However, this does mean that the entire
|
||||||
|
operation is carried out twice. Depending on the application, it may be more
|
||||||
|
efficient to allocate a large buffer and free the excess afterwards, instead of
|
||||||
|
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
||||||
|
.P
|
||||||
|
The replacement string, which is interpreted as a UTF string in UTF mode, is
|
||||||
|
checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An invalid UTF
|
||||||
|
replacement string causes an immediate return with the relevant UTF error code.
|
||||||
|
.P
|
||||||
|
If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not interpreted
|
||||||
|
in any way. By default, however, a dollar character is an escape character that
|
||||||
|
can specify the insertion of characters from capture groups and names from
|
||||||
|
(*MARK) or other control verbs in the pattern. The following forms are always
|
||||||
|
recognized:
|
||||||
.sp
|
.sp
|
||||||
$$ insert a dollar character
|
$$ insert a dollar character
|
||||||
$<n> or ${<n>} insert the contents of group <n>
|
$<n> or ${<n>} insert the contents of group <n>
|
||||||
|
@ -3445,20 +3465,6 @@ If this is not successful, the offset is advanced by one character except when
|
||||||
CRLF is a valid newline sequence and the next two characters are CR, LF. In
|
CRLF is a valid newline sequence and the next two characters are CR, LF. In
|
||||||
this case, the offset is advanced by two characters.
|
this case, the offset is advanced by two characters.
|
||||||
.P
|
.P
|
||||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
|
||||||
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
|
||||||
this option is set, however, \fBpcre2_substitute()\fP continues to go through
|
|
||||||
the motions of matching and substituting (without, of course, writing anything)
|
|
||||||
in order to compute the size of buffer that is needed. This value is passed
|
|
||||||
back via the \fIoutlengthptr\fP variable, with the result of the function still
|
|
||||||
being PCRE2_ERROR_NOMEMORY.
|
|
||||||
.P
|
|
||||||
Passing a buffer size of zero is a permitted way of finding out how much memory
|
|
||||||
is needed for given substitution. However, this does mean that the entire
|
|
||||||
operation is carried out twice. Depending on the application, it may be more
|
|
||||||
efficient to allocate a large buffer and free the excess afterwards, instead of
|
|
||||||
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
|
||||||
.P
|
|
||||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do
|
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do
|
||||||
not appear in the pattern to be treated as unset groups. This option should be
|
not appear in the pattern to be treated as unset groups. This option should be
|
||||||
used with care, because it means that a typo in a group name or number no
|
used with care, because it means that a typo in a group name or number no
|
||||||
|
@ -3917,6 +3923,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 22 January 2020
|
Last updated: 16 February 2020
|
||||||
Copyright (c) 1997-2020 University of Cambridge.
|
Copyright (c) 1997-2020 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
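The documentation hunks above describe how the \fIoutlengthptr\fP value behaves when the output buffer is too small. The C sketch below models only that contract. It is not PCRE2 code: demo_substitute(), DEMO_ERROR_NOMEMORY and DEMO_UNSET are hypothetical stand-ins. It replaces a literal substring rather than a pattern match, so it can work out the required size arithmetically, whereas the documentation says pcre2_substitute() has to go through the motions of matching and substituting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the PCRE2 constants. */
#define DEMO_ERROR_NOMEMORY (-1)
#define DEMO_UNSET          ((size_t)-1)

/* Replace the first occurrence of `needle` in `subject` with `repl`.
   On entry, *outlen is the capacity of `out`. On success, *outlen becomes the
   length of the result, excluding the trailing zero, and the return value is
   the number of replacements (0 or 1). If the buffer is too small, the return
   value is DEMO_ERROR_NOMEMORY and *outlen is either DEMO_UNSET or, when
   `overflow_length` is non-zero, the size needed including the trailing zero. */
int demo_substitute(const char *subject, const char *needle, const char *repl,
                    char *out, size_t *outlen, int overflow_length)
{
  size_t cap, slen, nlen, rlen, needed;
  const char *hit;

  assert(subject != NULL && needle != NULL && repl != NULL && outlen != NULL);

  cap = *outlen;
  slen = strlen(subject);
  nlen = strlen(needle);
  rlen = strlen(repl);
  hit = (nlen == 0) ? NULL : strstr(subject, needle);

  /* Size of the result plus one for the trailing zero. */
  needed = ((hit != NULL) ? slen - nlen + rlen : slen) + 1;

  if (needed > cap)
    {
    /* Nothing is written, so `out` may be NULL when the capacity is zero. */
    *outlen = overflow_length ? needed : DEMO_UNSET;
    return DEMO_ERROR_NOMEMORY;
    }

  if (hit != NULL)
    {
    size_t prefix = (size_t)(hit - subject);
    memcpy(out, subject, prefix);
    memcpy(out + prefix, repl, rlen);
    /* Copy the rest of the subject together with its terminating zero. */
    memcpy(out + prefix + rlen, hit + nlen, slen - prefix - nlen + 1);
    }
  else memcpy(out, subject, slen + 1);

  *outlen = needed - 1;
  return (hit != NULL) ? 1 : 0;
}
```

A caller following the "buffer size of zero" approach would call this once with a capacity of 0 and the overflow flag set, allocate the reported size, and call it again. As the documentation notes, the whole operation is then carried out twice.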
@ -229,7 +229,7 @@ int forcecasereset = 0;
|
||||||
uint32_t ovector_count;
|
uint32_t ovector_count;
|
||||||
uint32_t goptions = 0;
|
uint32_t goptions = 0;
|
||||||
uint32_t suboptions;
|
uint32_t suboptions;
|
||||||
BOOL match_data_created = FALSE;
|
pcre2_match_data *internal_match_data = NULL;
|
||||||
BOOL escaped_literal = FALSE;
|
BOOL escaped_literal = FALSE;
|
||||||
BOOL overflowed = FALSE;
|
BOOL overflowed = FALSE;
|
||||||
BOOL use_existing_match;
|
BOOL use_existing_match;
|
||||||
|
@ -265,22 +265,42 @@ pointer in the match data may be NULL after a no-match. */
|
||||||
use_existing_match = ((options & PCRE2_SUBSTITUTE_MATCHED) != 0);
|
use_existing_match = ((options & PCRE2_SUBSTITUTE_MATCHED) != 0);
|
||||||
replacement_only = ((options & PCRE2_SUBSTITUTE_REPLACEMENT_ONLY) != 0);
|
replacement_only = ((options & PCRE2_SUBSTITUTE_REPLACEMENT_ONLY) != 0);
|
||||||
|
|
||||||
if (use_existing_match)
|
/* If starting from an existing match, there must be an externally provided
|
||||||
|
match data block. We create an internal match_data block in two cases: (a) an
|
||||||
|
external one is not supplied (and we are not starting from an existing match);
|
||||||
|
(b) an existing match is to be used for the first substitution. In the latter
|
||||||
|
case, we copy the existing match into the internal block. This ensures that no
|
||||||
|
changes are made to the existing match data block. */
|
||||||
|
|
||||||
|
if (match_data == NULL)
|
||||||
{
|
{
|
||||||
if (match_data == NULL) return PCRE2_ERROR_NULL;
|
pcre2_general_context *gcontext;
|
||||||
|
if (use_existing_match) return PCRE2_ERROR_NULL;
|
||||||
|
gcontext = (mcontext == NULL)?
|
||||||
|
(pcre2_general_context *)code :
|
||||||
|
(pcre2_general_context *)mcontext;
|
||||||
|
match_data = internal_match_data =
|
||||||
|
pcre2_match_data_create_from_pattern(code, gcontext);
|
||||||
|
if (internal_match_data == NULL) return PCRE2_ERROR_NOMEMORY;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Otherwise, if no match data block is provided, create one. */
|
else if (use_existing_match)
|
||||||
|
|
||||||
else if (match_data == NULL)
|
|
||||||
{
|
{
|
||||||
pcre2_general_context *gcontext = (mcontext == NULL)?
|
pcre2_general_context *gcontext = (mcontext == NULL)?
|
||||||
(pcre2_general_context *)code :
|
(pcre2_general_context *)code :
|
||||||
(pcre2_general_context *)mcontext;
|
(pcre2_general_context *)mcontext;
|
||||||
match_data = pcre2_match_data_create_from_pattern(code, gcontext);
|
int pairs = (code->top_bracket + 1 < match_data->oveccount)?
|
||||||
if (match_data == NULL) return PCRE2_ERROR_NOMEMORY;
|
code->top_bracket + 1 : match_data->oveccount;
|
||||||
match_data_created = TRUE;
|
internal_match_data = pcre2_match_data_create(match_data->oveccount,
|
||||||
|
gcontext);
|
||||||
|
if (internal_match_data == NULL) return PCRE2_ERROR_NOMEMORY;
|
||||||
|
memcpy(internal_match_data, match_data, offsetof(pcre2_match_data, ovector)
|
||||||
|
+ 2*pairs*sizeof(PCRE2_SIZE));
|
||||||
|
match_data = internal_match_data;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Remember ovector details */
|
||||||
|
|
||||||
ovector = pcre2_get_ovector_pointer(match_data);
|
ovector = pcre2_get_ovector_pointer(match_data);
|
||||||
ovector_count = pcre2_get_ovector_count(match_data);
|
ovector_count = pcre2_get_ovector_count(match_data);
|
||||||
|
|
||||||
|
@ -302,7 +322,7 @@ repend = replacement + rlength;
|
||||||
#ifdef SUPPORT_UNICODE
|
#ifdef SUPPORT_UNICODE
|
||||||
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||||
{
|
{
|
||||||
rc = PRIV(valid_utf)(replacement, rlength, &(match_data->rightchar));
|
rc = PRIV(valid_utf)(replacement, rlength, &(match_data->startchar));
|
||||||
if (rc != 0)
|
if (rc != 0)
|
||||||
{
|
{
|
||||||
match_data->leftchar = 0;
|
match_data->leftchar = 0;
|
||||||
|
@ -316,7 +336,7 @@ if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
|
||||||
suboptions = options & SUBSTITUTE_OPTIONS;
|
suboptions = options & SUBSTITUTE_OPTIONS;
|
||||||
options &= ~SUBSTITUTE_OPTIONS;
|
options &= ~SUBSTITUTE_OPTIONS;
|
||||||
|
|
||||||
/* Error if the start match offset it greater than the length of the subject. */
|
/* Error if the start match offset is greater than the length of the subject. */
|
||||||
|
|
||||||
if (start_offset > length)
|
if (start_offset > length)
|
||||||
{
|
{
|
||||||
|
@ -898,8 +918,9 @@ do
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Save the details of this match. See above for how this data is used. If we
|
/* Save the details of this match. See above for how this data is used. If we
|
||||||
matched an empty string, do the magic for global matches. Finally, update the
|
matched an empty string, do the magic for global matches. Update the start
|
||||||
start offset to point to the rest of the subject string. */
|
offset to point to the rest of the subject string. If we re-used an existing
|
||||||
|
match for the first match, switch to the internal match data block. */
|
||||||
|
|
||||||
ovecsave[0] = ovector[0];
|
ovecsave[0] = ovector[0];
|
||||||
ovecsave[1] = ovector[1];
|
ovecsave[1] = ovector[1];
|
||||||
|
@ -942,7 +963,7 @@ else
|
||||||
}
|
}
|
||||||
|
|
||||||
EXIT:
|
EXIT:
|
||||||
if (match_data_created) pcre2_match_data_free(match_data);
|
if (internal_match_data != NULL) pcre2_match_data_free(internal_match_data);
|
||||||
else match_data->rc = rc;
|
else match_data->rc = rc;
|
||||||
return rc;
|
return rc;
|
||||||
|
|
||||||
|
|
|
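The source hunks above copy an existing match into an internally allocated block, so that the caller's block is never modified. The C sketch below isolates that copying step. It is illustrative only: demo_match_data, demo_create() and demo_clone_for_substitute() are hypothetical stand-ins, and the real pcre2_match_data layout is assumed to differ. Only the size arithmetic (header up to the ovector, plus 2*pairs offsets) follows the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a match data block: a small header followed by
   pairs of start/end offsets in a flexible array member. */
typedef struct demo_match_data {
  int rc;               /* result of the last match */
  uint16_t oveccount;   /* number of offset pairs the block can hold */
  size_t ovector[];     /* 2 * oveccount entries */
} demo_match_data;

/* Allocate a zeroed block with room for `oveccount` pairs (at least one). */
demo_match_data *demo_create(uint16_t oveccount)
{
  demo_match_data *md;
  if (oveccount < 1) oveccount = 1;
  md = calloc(1, offsetof(demo_match_data, ovector) +
                 2 * (size_t)oveccount * sizeof(size_t));
  if (md != NULL) md->oveccount = oveccount;
  return md;
}

/* Copy the header and the pairs that the pattern can set into a private
   block. `top_bracket` is the highest capture group number in the pattern,
   so at most top_bracket + 1 pairs are meaningful. Returns NULL if the
   allocation fails. */
demo_match_data *demo_clone_for_substitute(const demo_match_data *external,
                                           uint16_t top_bracket)
{
  size_t pairs;
  demo_match_data *internal;

  assert(external != NULL);

  pairs = ((size_t)top_bracket + 1 < external->oveccount)?
    (size_t)top_bracket + 1 : external->oveccount;

  internal = demo_create(external->oveccount);
  if (internal == NULL) return NULL;

  memcpy(internal, external, offsetof(demo_match_data, ovector) +
                             2 * pairs * sizeof(size_t));
  return internal;
}
```

All later matching would then work on the clone, which is freed on exit. Capping the copy at top_bracket + 1 pairs avoids copying ovector slots that the pattern can never set.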
@ -5806,6 +5806,33 @@ a)"xI
|
||||||
12abc34xyz99abc55\=substitute_skip=1
|
12abc34xyz99abc55\=substitute_skip=1
|
||||||
12abc34xyz99abc55\=substitute_skip=2
|
12abc34xyz99abc55\=substitute_skip=2
|
||||||
|
|
||||||
|
/a(..)d/replace=>$1<,substitute_matched
|
||||||
|
xyzabcdxyzabcdxyz
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=2
|
||||||
|
\= Expect error
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1
|
||||||
|
|
||||||
|
/a(..)d/g,replace=>$1<,substitute_matched
|
||||||
|
xyzabcdxyzabcdxyz
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=2
|
||||||
|
\= Expect error
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1,substitute_unset_empty
|
||||||
|
|
||||||
|
/55|a(..)d/g,replace=>$1<,substitute_matched
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
\= Expect error
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2
|
||||||
|
|
||||||
|
/55|a(..)d/replace=>$1<,substitute_matched
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
|
||||||
|
/55|a(..)d/replace=>$1<
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
|
||||||
|
/55|a(..)d/g,replace=>$1<
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
|
||||||
# Expect non-fixed-length error
|
# Expect non-fixed-length error
|
||||||
|
|
||||||
"(?<=X(?(DEFINE)(.*))(?1))."
|
"(?<=X(?(DEFINE)(.*))(?1))."
|
||||||
|
|
|
@ -17536,6 +17536,45 @@ Callout 0: last capture = 2
|
||||||
3(2) Old 12 15 "abc" New 5 10 "<abc>"
|
3(2) Old 12 15 "abc" New 5 10 "<abc>"
|
||||||
3: <abc><abc>
|
3: <abc><abc>
|
||||||
|
|
||||||
|
/a(..)d/replace=>$1<,substitute_matched
|
||||||
|
xyzabcdxyzabcdxyz
|
||||||
|
1: xyz>bc<xyzabcdxyz
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=2
|
||||||
|
1: xyz>bc<xyzabcdxyz
|
||||||
|
\= Expect error
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1
|
||||||
|
Failed: error -54 at offset 3 in replacement: requested value is not available
|
||||||
|
|
||||||
|
/a(..)d/g,replace=>$1<,substitute_matched
|
||||||
|
xyzabcdxyzabcdxyz
|
||||||
|
2: xyz>bc<xyz>bc<xyz
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=2
|
||||||
|
2: xyz>bc<xyz>bc<xyz
|
||||||
|
\= Expect error
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1
|
||||||
|
Failed: error -54 at offset 3 in replacement: requested value is not available
|
||||||
|
xyzabcdxyzabcdxyz\=ovector=1,substitute_unset_empty
|
||||||
|
Failed: error -54 at offset 3 in replacement: requested value is not available
|
||||||
|
|
||||||
|
/55|a(..)d/g,replace=>$1<,substitute_matched
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
3: xyz><>bc<xyz>bc<xyz
|
||||||
|
\= Expect error
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2
|
||||||
|
Failed: error -55 at offset 3 in replacement: requested value is not set
|
||||||
|
|
||||||
|
/55|a(..)d/replace=>$1<,substitute_matched
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
1: xyz><abcdxyzabcdxyz
|
||||||
|
|
||||||
|
/55|a(..)d/replace=>$1<
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
1: xyz><abcdxyzabcdxyz
|
||||||
|
|
||||||
|
/55|a(..)d/g,replace=>$1<
|
||||||
|
xyz55abcdxyzabcdxyz\=ovector=2,substitute_unset_empty
|
||||||
|
3: xyz><>bc<xyz>bc<xyz
|
||||||
|
|
||||||
# Expect non-fixed-length error
|
# Expect non-fixed-length error
|
||||||
|
|
||||||
"(?<=X(?(DEFINE)(.*))(?1))."
|
"(?<=X(?(DEFINE)(.*))(?1))."
|
||||||
|
|