Interpret NULL pointer, zero length as an empty string for subjects and replacements.

Philip Hazel 2021-11-30 16:34:39 +00:00
parent 7ab2769728
commit 4ef0c51d2b
16 changed files with 241 additions and 131 deletions
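
The rule this commit introduces for subject strings can be modelled outside the library. The following is a minimal sketch, not PCRE2 code: `normalize_subject` and `ERR_NULL` are hypothetical names chosen for illustration, and the value -51 is taken from the test output further down this page.

```c
#include <stddef.h>

/* Hypothetical stand-in for PCRE2_ERROR_NULL; -51 is the error number that
   appears in the test output of this commit. */
#define ERR_NULL (-51)

/* Model of the new rule: a NULL pointer with length 0 is treated as "", a
   NULL pointer with any other length is an error, and a non-NULL pointer is
   passed through unchanged. Returns 0 on success and stores the pointer to
   use in *out. */
int normalize_subject(const char *subject, size_t length, const char **out)
{
  if (subject == NULL)
    {
    if (length != 0) return ERR_NULL;
    subject = "";
    }
  *out = subject;
  return 0;
}
```

A caller that already represents "no data" as NULL/0 can therefore pass that pair straight through, while NULL combined with a non-zero length is still rejected.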


@ -34,6 +34,11 @@ substituting.
12. Add check for NULL replacement to pcre2_substitute().
+13. For the subject arguments of pcre2_match(), pcre2_dfa_match(), and
+pcre2_substitute(), and the replacement argument of the latter, if the pointer
+is NULL and the length is zero, treat as an empty string. Apparently a number
+of applications treat NULL/0 in this way.
Version 10.39 29-October-2021
-----------------------------


@ -2640,7 +2640,9 @@ The subject string is passed to <b>pcre2_match()</b> as a pointer in
<i>startoffset</i>. The length and offset are in code units, not characters.
That is, they are in bytes for the 8-bit library, 16-bit code units for the
16-bit library, and 32-bit code units for the 32-bit library, whether or not
-UTF processing is enabled.
+UTF processing is enabled. As a special case, if <i>subject</i> is NULL and
+<i>length</i> is zero, the subject is assumed to be an empty string. If
+<i>length</i> is non-zero, an error occurs if <i>subject</i> is NULL.
</P>
<P>
If <i>startoffset</i> is greater than the length of the subject,
@ -3394,12 +3396,17 @@ same number causes an error at compile time.
<P>
This function optionally calls <b>pcre2_match()</b> and then makes a copy of the
subject string in <i>outputbuffer</i>, replacing parts that were matched with
-the <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This
-can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
-option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
-replacement string(s). The default action is to perform just one replacement if
-the pattern matches, but there is an option that requests multiple replacements
-(see PCRE2_SUBSTITUTE_GLOBAL below).
+the <i>replacement</i> string, whose length is supplied in <b>rlength</b>, which
+can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
+special case, if <i>replacement</i> is NULL and <i>rlength</i> is zero, the
+replacement is assumed to be an empty string. If <i>rlength</i> is non-zero, an
+error occurs if <i>replacement</i> is NULL.
+</P>
+<P>
+There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
+the replacement string(s). The default action is to perform just one
+replacement if the pattern matches, but there is an option that requests
+multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
</P>
<P>
If successful, <b>pcre2_substitute()</b> returns the number of substitutions
@ -3812,12 +3819,13 @@ other alternatives. Ultimately, when it runs out of matches,
<P>
The function <b>pcre2_dfa_match()</b> is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
-string just once (not counting lookaround assertions), and does not backtrack.
-This has different characteristics to the normal algorithm, and is not
-compatible with Perl. Some of the features of PCRE2 patterns are not supported.
-Nevertheless, there are times when this kind of matching can be useful. For a
-discussion of the two matching algorithms, and a list of features that
-<b>pcre2_dfa_match()</b> does not support, see the
+string just once (not counting lookaround assertions), and does not backtrack
+(except when processing lookaround assertions). This has different
+characteristics to the normal algorithm, and is not compatible with Perl. Some
+of the features of PCRE2 patterns are not supported. Nevertheless, there are
+times when this kind of matching can be useful. For a discussion of the two
+matching algorithms, and a list of features that <b>pcre2_dfa_match()</b> does
+not support, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
@ -4010,7 +4018,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 30 August 2021
+Last updated: 30 November 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>


@ -269,11 +269,11 @@ starts another match, that match must use a different JIT stack to the one used
for currently suspended match(es).
</P>
<P>
-In a multithread application, if you do not
-specify a JIT stack, or if you assign or pass back NULL from a callback, that
-is thread-safe, because each thread has its own machine stack. However, if you
-assign or pass back a non-NULL JIT stack, this must be a different stack for
-each thread so that the application is thread-safe.
+In a multithread application, if you do not specify a JIT stack, or if you
+assign or pass back NULL from a callback, that is thread-safe, because each
+thread has its own machine stack. However, if you assign or pass back a
+non-NULL JIT stack, this must be a different stack for each thread so that the
+application is thread-safe.
</P>
<P>
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
@ -382,8 +382,8 @@ out this complicated API.
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<P>
-The JIT executable allocator does not free all memory when it is possible.
-It expects new allocations, and keeps some free memory around to improve
+The JIT executable allocator does not free all memory when it is possible. It
+expects new allocations, and keeps some free memory around to improve
allocation speed. However, in low memory conditions, it might be better to free
all possible memory. You can cause this to happen by calling
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
@ -442,10 +442,10 @@ that was not compiled.
<P>
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
-the subject pointer is NULL, an immediate error is given. Also, unless
-PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
-interests of speed, these checks do not happen on the JIT fast path, and if
-invalid data is passed, the result is undefined.
+the subject pointer is NULL but the length is non-zero, an immediate error is
+given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
+for validity. In the interests of speed, these checks do not happen on the JIT
+fast path, and if invalid data is passed, the result is undefined.
</P>
<P>
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
@ -466,9 +466,9 @@ Cambridge, England.
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 May 2019
+Last updated: 30 November 2021
<br>
-Copyright &copy; 1997-2019 University of Cambridge.
+Copyright &copy; 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -2579,7 +2579,9 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
and offset are in code units, not characters. That is, they are in
bytes for the 8-bit library, 16-bit code units for the 16-bit library,
and 32-bit code units for the 32-bit library, whether or not UTF pro-
-cessing is enabled.
+cessing is enabled. As a special case, if subject is NULL and length is
+zero, the subject is assumed to be an empty string. If length is non-
+zero, an error occurs if subject is NULL.
If startoffset is greater than the length of the subject, pcre2_match()
returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
@ -3280,8 +3282,12 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
This function optionally calls pcre2_match() and then makes a copy of
the subject string in outputbuffer, replacing parts that were matched
-with the replacement string, whose length is supplied in rlength. This
-can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
+with the replacement string, whose length is supplied in rlength, which
+can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
+a special case, if replacement is NULL and rlength is zero, the re-
+placement is assumed to be an empty string. If rlength is non-zero, an
+error occurs if replacement is NULL.
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
turn just the replacement string(s). The default action is to perform
just one replacement if the pattern matches, but there is an option
@ -3666,12 +3672,13 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
The function pcre2_dfa_match() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does
-not backtrack. This has different characteristics to the normal algo-
-rithm, and is not compatible with Perl. Some of the features of PCRE2
-patterns are not supported. Nevertheless, there are times when this
-kind of matching can be useful. For a discussion of the two matching
-algorithms, and a list of features that pcre2_dfa_match() does not sup-
-port, see the pcre2matching documentation.
+not backtrack (except when processing lookaround assertions). This has
+different characteristics to the normal algorithm, and is not compati-
+ble with Perl. Some of the features of PCRE2 patterns are not sup-
+ported. Nevertheless, there are times when this kind of matching can be
+useful. For a discussion of the two matching algorithms, and a list of
+features that pcre2_dfa_match() does not support, see the pcre2matching
+documentation.
The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block
@ -3850,7 +3857,7 @@ AUTHOR
REVISION
-Last updated: 30 August 2021
+Last updated: 30 November 2021
Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------
@ -5494,10 +5501,11 @@ JIT FAST PATH API
When you call pcre2_match(), as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For exam-
-ple, if the subject pointer is NULL, an immediate error is given. Also,
-unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
-validity. In the interests of speed, these checks do not happen on the
-JIT fast path, and if invalid data is passed, the result is undefined.
+ple, if the subject pointer is NULL but the length is non-zero, an im-
+mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
+subject string is tested for validity. In the interests of speed, these
+checks do not happen on the JIT fast path, and if invalid data is
+passed, the result is undefined.
Bypassing the sanity checks and the pcre2_match() wrapping can give
speedups of more than 10%.
@ -5517,8 +5525,8 @@ AUTHOR
REVISION
-Last updated: 23 May 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 30 November 2021
+Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
-.TH PCRE2API 3 "30 August 2021" "PCRE2 10.38"
+.TH PCRE2API 3 "30 November 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -2624,7 +2624,9 @@ The subject string is passed to \fBpcre2_match()\fP as a pointer in
\fIstartoffset\fP. The length and offset are in code units, not characters.
That is, they are in bytes for the 8-bit library, 16-bit code units for the
16-bit library, and 32-bit code units for the 32-bit library, whether or not
-UTF processing is enabled.
+UTF processing is enabled. As a special case, if \fIsubject\fP is NULL and
+\fIlength\fP is zero, the subject is assumed to be an empty string. If
+\fIlength\fP is non-zero, an error occurs if \fIsubject\fP is NULL.
.P
If \fIstartoffset\fP is greater than the length of the subject,
\fBpcre2_match()\fP returns PCRE2_ERROR_BADOFFSET. When the starting offset is
@ -3413,12 +3415,16 @@ same number causes an error at compile time.
.P
This function optionally calls \fBpcre2_match()\fP and then makes a copy of the
subject string in \fIoutputbuffer\fP, replacing parts that were matched with
-the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This
-can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
-option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
-replacement string(s). The default action is to perform just one replacement if
-the pattern matches, but there is an option that requests multiple replacements
-(see PCRE2_SUBSTITUTE_GLOBAL below).
+the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP, which
+can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
+special case, if \fIreplacement\fP is NULL and \fIrlength\fP is zero, the
+replacement is assumed to be an empty string. If \fIrlength\fP is non-zero, an
+error occurs if \fIreplacement\fP is NULL.
+.P
+There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
+the replacement string(s). The default action is to perform just one
+replacement if the pattern matches, but there is an option that requests
+multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
.P
If successful, \fBpcre2_substitute()\fP returns the number of substitutions
that were carried out. This may be zero if no match was found, and is never
@ -3813,12 +3819,13 @@ other alternatives. Ultimately, when it runs out of matches,
.P
The function \fBpcre2_dfa_match()\fP is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
-string just once (not counting lookaround assertions), and does not backtrack.
-This has different characteristics to the normal algorithm, and is not
-compatible with Perl. Some of the features of PCRE2 patterns are not supported.
-Nevertheless, there are times when this kind of matching can be useful. For a
-discussion of the two matching algorithms, and a list of features that
-\fBpcre2_dfa_match()\fP does not support, see the
+string just once (not counting lookaround assertions), and does not backtrack
+(except when processing lookaround assertions). This has different
+characteristics to the normal algorithm, and is not compatible with Perl. Some
+of the features of PCRE2 patterns are not supported. Nevertheless, there are
+times when this kind of matching can be useful. For a discussion of the two
+matching algorithms, and a list of features that \fBpcre2_dfa_match()\fP does
+not support, see the
.\" HREF
\fBpcre2matching\fP
.\"
@ -4018,6 +4025,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 30 August 2021
+Last updated: 30 November 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
+.TH PCRE2JIT 3 "30 November 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -251,11 +251,11 @@ non-sequential matches in one thread is to use callouts: if a callout function
starts another match, that match must use a different JIT stack to the one used
for currently suspended match(es).
.P
-In a multithread application, if you do not
-specify a JIT stack, or if you assign or pass back NULL from a callback, that
-is thread-safe, because each thread has its own machine stack. However, if you
-assign or pass back a non-NULL JIT stack, this must be a different stack for
-each thread so that the application is thread-safe.
+In a multithread application, if you do not specify a JIT stack, or if you
+assign or pass back NULL from a callback, that is thread-safe, because each
+thread has its own machine stack. However, if you assign or pass back a
+non-NULL JIT stack, this must be a different stack for each thread so that the
+application is thread-safe.
.P
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
to a match context that is used by any number of patterns, as long as they are
@ -355,8 +355,8 @@ out this complicated API.
.B void pcre2_jit_free_unused_memory(pcre2_general_context *\fIgcontext\fP);
.fi
.P
-The JIT executable allocator does not free all memory when it is possible.
-It expects new allocations, and keeps some free memory around to improve
+The JIT executable allocator does not free all memory when it is possible. It
+expects new allocations, and keeps some free memory around to improve
allocation speed. However, in low memory conditions, it might be better to free
all possible memory. You can cause this to happen by calling
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
@ -416,10 +416,10 @@ that was not compiled.
.P
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
-the subject pointer is NULL, an immediate error is given. Also, unless
-PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
-interests of speed, these checks do not happen on the JIT fast path, and if
-invalid data is passed, the result is undefined.
+the subject pointer is NULL but the length is non-zero, an immediate error is
+given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
+for validity. In the interests of speed, these checks do not happen on the JIT
+fast path, and if invalid data is passed, the result is undefined.
.P
Bypassing the sanity checks and the \fBpcre2_match()\fP wrapping can give
speedups of more than 10%.
@ -445,6 +445,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 23 May 2019
-Copyright (c) 1997-2019 University of Cambridge.
+Last updated: 30 November 2021
+Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -3285,6 +3285,10 @@ rws->next = NULL;
rws->size = RWS_BASE_SIZE;
rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE;
+/* Recognize NULL, length 0 as an empty string. */
+if (subject == NULL && length == 0) subject = (PCRE2_SPTR)"";
/* Plausibility checks */
if ((options & ~PUBLIC_DFA_MATCH_OPTIONS) != 0) return PCRE2_ERROR_BADOPTION;


@ -253,7 +253,7 @@ static const unsigned char match_error_texts[] =
"unknown substring\0"
/* 50 */
"non-unique substring name\0"
-"NULL argument passed\0"
+"NULL argument passed with non-zero length\0"
"nested recursion at the same subject position\0"
"matching depth limit exceeded\0"
"requested value is not available\0"


@ -6170,6 +6170,10 @@ PCRE2_SPTR stack_frames_vector[START_FRAMES_SIZE/sizeof(PCRE2_SPTR)]
PCRE2_KEEP_UNINITIALIZED;
mb->stack_frames = (heapframe *)stack_frames_vector;
+/* Recognize NULL, length 0 as an empty string. */
+if (subject == NULL && length == 0) subject = (PCRE2_SPTR)"";
/* Plausibility checks */
if ((options & ~PUBLIC_MATCH_OPTIONS) != 0) return PCRE2_ERROR_BADOPTION;


@ -260,9 +260,15 @@ PCRE2_UNSET, so as not to imply an offset in the replacement. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0)
return PCRE2_ERROR_BADOPTION;
-/* Validate length and find the end of the replacement. */
+/* Validate length and find the end of the replacement. A NULL replacement of
+zero length is interpreted as an empty string. */
+if (replacement == NULL)
+{
+if (rlength != 0) return PCRE2_ERROR_NULL;
+replacement = (PCRE2_SPTR)"";
+}
-if (replacement == NULL) return PCRE2_ERROR_NULL;
if (rlength == PCRE2_ZERO_TERMINATED) rlength = PRIV(strlen)(replacement);
repend = replacement + rlength;
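
The order of the checks in this hunk matters: the NULL test runs before the zero-terminated length is resolved, so a NULL replacement is never passed to the string-length routine. A minimal standalone sketch of that ordering follows. It is an illustration under stated assumptions, not the library's code: `find_replacement_end`, `ZERO_TERMINATED` and `ERR_NULL` are hypothetical stand-ins for the corresponding PCRE2 names, and plain `char` strings replace `PCRE2_SPTR`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: PCRE2 defines PCRE2_ZERO_TERMINATED as an all-ones
   size value, and -51 is the error number shown in this commit's test output. */
#define ZERO_TERMINATED (~(size_t)0)
#define ERR_NULL (-51)

/* Validate the replacement and locate its end. A NULL replacement is accepted
   only when rlength is exactly 0; any other length, including the
   zero-terminated marker, yields ERR_NULL. Returns 0 on success. */
int find_replacement_end(const char *replacement, size_t rlength,
  const char **start, const char **end)
{
  if (replacement == NULL)
    {
    if (rlength != 0) return ERR_NULL;
    replacement = "";
    }
  if (rlength == ZERO_TERMINATED) rlength = strlen(replacement);
  *start = replacement;
  *end = replacement + rlength;
  return 0;
}
```

Note that in this model NULL combined with the zero-terminated marker is still an error, because the marker is a non-zero length value.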


@ -304,4 +304,7 @@
/[aCz]/mg,firstline,newline=lf
match\nmatch
//jitfast
\=null_subject
# End of testinput17


@ -135,4 +135,9 @@
123ace
123ace\=posix_startend=2:6
//posix
\= Expect errors
\=null_subject
abc\=null_subject
# End of testdata/testinput18

testdata/testinput2

@ -5902,4 +5902,25 @@ a)"xI
# ---------
# Tests for zero-length NULL to be treated as an empty string.
//
\=null_subject
\= Expect error
abc\=null_subject
//replace=[20]
abc\=null_replacement
\=null_subject
\=null_replacement
/X*/g,replace=xy
\= Expect error
>X<\=null_replacement
/X+/replace=[20]
>XX<\=null_replacement
# ---------
# End of testinput2


@ -550,4 +550,8 @@ Failed: error -47: match limit exceeded
match\nmatch
0: a (JIT)
//jitfast
\=null_subject
0: (JIT)
# End of testinput17


@ -215,4 +215,11 @@ Failed: POSIX code 16: bad argument at offset 0
3: <unset>
4: c
//posix
\= Expect errors
\=null_subject
No match: POSIX code 16: bad argument
abc\=null_subject
No match: POSIX code 16: bad argument
# End of testdata/testinput18

testdata/testoutput2

@ -17674,6 +17674,34 @@ Failed: error 199 at offset 14: \K is not allowed in lookarounds (but see PCRE2_
# ---------
# Tests for zero-length NULL to be treated as an empty string.
//
\=null_subject
0:
\= Expect error
abc\=null_subject
Failed: error -51: NULL argument passed with non-zero length
//replace=[20]
abc\=null_replacement
1: abc
\=null_subject
1:
\=null_replacement
1:
/X*/g,replace=xy
\= Expect error
>X<\=null_replacement
Failed: error -51: NULL argument passed with non-zero length
/X+/replace=[20]
>XX<\=null_replacement
1: ><
# ---------
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data