Interpret NULL pointer, zero length as an empty string for subjects and replacements.
This commit is contained in:
parent
7ab2769728
commit
4ef0c51d2b
|
@ -34,6 +34,11 @@ substituting.
|
|||
|
||||
12. Add check for NULL replacement to pcre2_substitute().
|
||||
|
||||
13. For the subject arguments of pcre2_match(), pcre2_dfa_match(), and
|
||||
pcre2_substitute(), and the replacement argument of the latter, if the pointer
|
||||
is NULL and the length is zero, treat as an empty string. Apparently a number
|
||||
of applications treat NULL/0 in this way.
|
||||
|
||||
|
||||
Version 10.39 29-October-2021
|
||||
-----------------------------
|
||||
|
|
|
@ -2640,7 +2640,9 @@ The subject string is passed to <b>pcre2_match()</b> as a pointer in
|
|||
<i>startoffset</i>. The length and offset are in code units, not characters.
|
||||
That is, they are in bytes for the 8-bit library, 16-bit code units for the
|
||||
16-bit library, and 32-bit code units for the 32-bit library, whether or not
|
||||
UTF processing is enabled.
|
||||
UTF processing is enabled. As a special case, if <i>subject</i> is NULL and
|
||||
<i>length</i> is zero, the subject is assumed to be an empty string. If
|
||||
<i>length</i> is non-zero, an error occurs if <i>subject</i> is NULL.
|
||||
</P>
|
||||
<P>
|
||||
If <i>startoffset</i> is greater than the length of the subject,
|
||||
|
@ -3394,12 +3396,17 @@ same number causes an error at compile time.
|
|||
<P>
|
||||
This function optionally calls <b>pcre2_match()</b> and then makes a copy of the
|
||||
subject string in <i>outputbuffer</i>, replacing parts that were matched with
|
||||
the <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
|
||||
option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
|
||||
replacement string(s). The default action is to perform just one replacement if
|
||||
the pattern matches, but there is an option that requests multiple replacements
|
||||
(see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||
the <i>replacement</i> string, whose length is supplied in <b>rlength</b>, which
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
|
||||
special case, if <i>replacement</i> is NULL and <i>rlength</i> is zero, the
|
||||
replacement is assumed to be an empty string. If <i>rlength</i> is non-zero, an
|
||||
error occurs if <i>replacement</i> is NULL.
|
||||
</P>
|
||||
<P>
|
||||
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
|
||||
the replacement string(s). The default action is to perform just one
|
||||
replacement if the pattern matches, but there is an option that requests
|
||||
multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||
</P>
|
||||
<P>
|
||||
If successful, <b>pcre2_substitute()</b> returns the number of substitutions
|
||||
|
@ -3812,12 +3819,13 @@ other alternatives. Ultimately, when it runs out of matches,
|
|||
<P>
|
||||
The function <b>pcre2_dfa_match()</b> is called to match a subject string
|
||||
against a compiled pattern, using a matching algorithm that scans the subject
|
||||
string just once (not counting lookaround assertions), and does not backtrack.
|
||||
This has different characteristics to the normal algorithm, and is not
|
||||
compatible with Perl. Some of the features of PCRE2 patterns are not supported.
|
||||
Nevertheless, there are times when this kind of matching can be useful. For a
|
||||
discussion of the two matching algorithms, and a list of features that
|
||||
<b>pcre2_dfa_match()</b> does not support, see the
|
||||
string just once (not counting lookaround assertions), and does not backtrack
|
||||
(except when processing lookaround assertions). This has different
|
||||
characteristics to the normal algorithm, and is not compatible with Perl. Some
|
||||
of the features of PCRE2 patterns are not supported. Nevertheless, there are
|
||||
times when this kind of matching can be useful. For a discussion of the two
|
||||
matching algorithms, and a list of features that <b>pcre2_dfa_match()</b> does
|
||||
not support, see the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
|
@ -4010,7 +4018,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 30 August 2021
|
||||
Last updated: 30 November 2021
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -269,11 +269,11 @@ starts another match, that match must use a different JIT stack to the one used
|
|||
for currently suspended match(es).
|
||||
</P>
|
||||
<P>
|
||||
In a multithread application, if you do not
|
||||
specify a JIT stack, or if you assign or pass back NULL from a callback, that
|
||||
is thread-safe, because each thread has its own machine stack. However, if you
|
||||
assign or pass back a non-NULL JIT stack, this must be a different stack for
|
||||
each thread so that the application is thread-safe.
|
||||
In a multithread application, if you do not specify a JIT stack, or if you
|
||||
assign or pass back NULL from a callback, that is thread-safe, because each
|
||||
thread has its own machine stack. However, if you assign or pass back a
|
||||
non-NULL JIT stack, this must be a different stack for each thread so that the
|
||||
application is thread-safe.
|
||||
</P>
|
||||
<P>
|
||||
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
|
||||
|
@ -382,8 +382,8 @@ out this complicated API.
|
|||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The JIT executable allocator does not free all memory when it is possible.
|
||||
It expects new allocations, and keeps some free memory around to improve
|
||||
The JIT executable allocator does not free all memory when it is possible. It
|
||||
expects new allocations, and keeps some free memory around to improve
|
||||
allocation speed. However, in low memory conditions, it might be better to free
|
||||
all possible memory. You can cause this to happen by calling
|
||||
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
|
||||
|
@ -442,10 +442,10 @@ that was not compiled.
|
|||
<P>
|
||||
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
|
||||
number of other sanity checks are performed on the arguments. For example, if
|
||||
the subject pointer is NULL, an immediate error is given. Also, unless
|
||||
PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
|
||||
interests of speed, these checks do not happen on the JIT fast path, and if
|
||||
invalid data is passed, the result is undefined.
|
||||
the subject pointer is NULL but the length is non-zero, an immediate error is
|
||||
given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
|
||||
for validity. In the interests of speed, these checks do not happen on the JIT
|
||||
fast path, and if invalid data is passed, the result is undefined.
|
||||
</P>
|
||||
<P>
|
||||
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
|
||||
|
@ -466,9 +466,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 May 2019
|
||||
Last updated: 30 November 2021
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
152
doc/pcre2.txt
152
doc/pcre2.txt
|
@ -2579,7 +2579,9 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
|||
and offset are in code units, not characters. That is, they are in
|
||||
bytes for the 8-bit library, 16-bit code units for the 16-bit library,
|
||||
and 32-bit code units for the 32-bit library, whether or not UTF pro-
|
||||
cessing is enabled.
|
||||
cessing is enabled. As a special case, if subject is NULL and length is
|
||||
zero, the subject is assumed to be an empty string. If length is non-
|
||||
zero, an error occurs if subject is NULL.
|
||||
|
||||
If startoffset is greater than the length of the subject, pcre2_match()
|
||||
returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
|
||||
|
@ -3280,8 +3282,12 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
|
|||
|
||||
This function optionally calls pcre2_match() and then makes a copy of
|
||||
the subject string in outputbuffer, replacing parts that were matched
|
||||
with the replacement string, whose length is supplied in rlength. This
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
|
||||
with the replacement string, whose length is supplied in rlength, which
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
|
||||
a special case, if replacement is NULL and rlength is zero, the re-
|
||||
placement is assumed to be an empty string. If rlength is non-zero, an
|
||||
error occurs if replacement is NULL.
|
||||
|
||||
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
|
||||
turn just the replacement string(s). The default action is to perform
|
||||
just one replacement if the pattern matches, but there is an option
|
||||
|
@ -3666,23 +3672,24 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|||
The function pcre2_dfa_match() is called to match a subject string
|
||||
against a compiled pattern, using a matching algorithm that scans the
|
||||
subject string just once (not counting lookaround assertions), and does
|
||||
not backtrack. This has different characteristics to the normal algo-
|
||||
rithm, and is not compatible with Perl. Some of the features of PCRE2
|
||||
patterns are not supported. Nevertheless, there are times when this
|
||||
kind of matching can be useful. For a discussion of the two matching
|
||||
algorithms, and a list of features that pcre2_dfa_match() does not sup-
|
||||
port, see the pcre2matching documentation.
|
||||
not backtrack (except when processing lookaround assertions). This has
|
||||
different characteristics to the normal algorithm, and is not compati-
|
||||
ble with Perl. Some of the features of PCRE2 patterns are not sup-
|
||||
ported. Nevertheless, there are times when this kind of matching can be
|
||||
useful. For a discussion of the two matching algorithms, and a list of
|
||||
features that pcre2_dfa_match() does not support, see the pcre2matching
|
||||
documentation.
|
||||
|
||||
The arguments for the pcre2_dfa_match() function are the same as for
|
||||
The arguments for the pcre2_dfa_match() function are the same as for
|
||||
pcre2_match(), plus two extras. The ovector within the match data block
|
||||
is used in a different way, and this is described below. The other com-
|
||||
mon arguments are used in the same way as for pcre2_match(), so their
|
||||
mon arguments are used in the same way as for pcre2_match(), so their
|
||||
description is not repeated here.
|
||||
|
||||
The two additional arguments provide workspace for the function. The
|
||||
workspace vector should contain at least 20 elements. It is used for
|
||||
The two additional arguments provide workspace for the function. The
|
||||
workspace vector should contain at least 20 elements. It is used for
|
||||
keeping track of multiple paths through the pattern tree. More
|
||||
workspace is needed for patterns and subjects where there are a lot of
|
||||
workspace is needed for patterns and subjects where there are a lot of
|
||||
potential matches.
|
||||
|
||||
Here is an example of a simple call to pcre2_dfa_match():
|
||||
|
@ -3702,45 +3709,45 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|||
|
||||
Option bits for pcre2_dfa_match()
|
||||
|
||||
The unused bits of the options argument for pcre2_dfa_match() must be
|
||||
zero. The only bits that may be set are PCRE2_ANCHORED,
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
|
||||
The unused bits of the options argument for pcre2_dfa_match() must be
|
||||
zero. The only bits that may be set are PCRE2_ANCHORED,
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
|
||||
TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
||||
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
|
||||
PCRE2_DFA_RESTART. All but the last four of these are exactly the same
|
||||
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
|
||||
PCRE2_DFA_RESTART. All but the last four of these are exactly the same
|
||||
as for pcre2_match(), so their description is not repeated here.
|
||||
|
||||
PCRE2_PARTIAL_HARD
|
||||
PCRE2_PARTIAL_SOFT
|
||||
|
||||
These have the same general effect as they do for pcre2_match(), but
|
||||
the details are slightly different. When PCRE2_PARTIAL_HARD is set for
|
||||
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
|
||||
These have the same general effect as they do for pcre2_match(), but
|
||||
the details are slightly different. When PCRE2_PARTIAL_HARD is set for
|
||||
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
|
||||
subject is reached and there is still at least one matching possibility
|
||||
that requires additional characters. This happens even if some complete
|
||||
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
|
||||
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
|
||||
if the end of the subject is reached, there have been no complete
|
||||
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
|
||||
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
|
||||
if the end of the subject is reached, there have been no complete
|
||||
matches, but there is still at least one matching possibility. The por-
|
||||
tion of the string that was inspected when the longest partial match
|
||||
tion of the string that was inspected when the longest partial match
|
||||
was found is set as the first matching string in both cases. There is a
|
||||
more detailed discussion of partial and multi-segment matching, with
|
||||
more detailed discussion of partial and multi-segment matching, with
|
||||
examples, in the pcre2partial documentation.
|
||||
|
||||
PCRE2_DFA_SHORTEST
|
||||
|
||||
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
|
||||
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
|
||||
stop as soon as it has found one match. Because of the way the alterna-
|
||||
tive algorithm works, this is necessarily the shortest possible match
|
||||
tive algorithm works, this is necessarily the shortest possible match
|
||||
at the first possible matching point in the subject string.
|
||||
|
||||
PCRE2_DFA_RESTART
|
||||
|
||||
When pcre2_dfa_match() returns a partial match, it is possible to call
|
||||
When pcre2_dfa_match() returns a partial match, it is possible to call
|
||||
it again, with additional subject characters, and have it continue with
|
||||
the same match. The PCRE2_DFA_RESTART option requests this action; when
|
||||
it is set, the workspace and wscount options must reference the same
|
||||
vector as before because data about the match so far is left in them
|
||||
it is set, the workspace and wscount options must reference the same
|
||||
vector as before because data about the match so far is left in them
|
||||
after a partial match. There is more discussion of this facility in the
|
||||
pcre2partial documentation.
|
||||
|
||||
|
@ -3748,8 +3755,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|||
|
||||
When pcre2_dfa_match() succeeds, it may have matched more than one sub-
|
||||
string in the subject. Note, however, that all the matches from one run
|
||||
of the function start at the same point in the subject. The shorter
|
||||
matches are all initial substrings of the longer matches. For example,
|
||||
of the function start at the same point in the subject. The shorter
|
||||
matches are all initial substrings of the longer matches. For example,
|
||||
if the pattern
|
||||
|
||||
<.*>
|
||||
|
@ -3764,80 +3771,80 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|||
<something> <something else>
|
||||
<something>
|
||||
|
||||
On success, the yield of the function is a number greater than zero,
|
||||
which is the number of matched substrings. The offsets of the sub-
|
||||
strings are returned in the ovector, and can be extracted by number in
|
||||
the same way as for pcre2_match(), but the numbers bear no relation to
|
||||
any capture groups that may exist in the pattern, because DFA matching
|
||||
On success, the yield of the function is a number greater than zero,
|
||||
which is the number of matched substrings. The offsets of the sub-
|
||||
strings are returned in the ovector, and can be extracted by number in
|
||||
the same way as for pcre2_match(), but the numbers bear no relation to
|
||||
any capture groups that may exist in the pattern, because DFA matching
|
||||
does not support capturing.
|
||||
|
||||
Calls to the convenience functions that extract substrings by name re-
|
||||
Calls to the convenience functions that extract substrings by name re-
|
||||
turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
|
||||
ter a DFA match. The convenience functions that extract substrings by
|
||||
ter a DFA match. The convenience functions that extract substrings by
|
||||
number never return PCRE2_ERROR_NOSUBSTRING.
|
||||
|
||||
The matched strings are stored in the ovector in reverse order of
|
||||
length; that is, the longest matching string is first. If there were
|
||||
too many matches to fit into the ovector, the yield of the function is
|
||||
The matched strings are stored in the ovector in reverse order of
|
||||
length; that is, the longest matching string is first. If there were
|
||||
too many matches to fit into the ovector, the yield of the function is
|
||||
zero, and the vector is filled with the longest matches.
|
||||
|
||||
NOTE: PCRE2's "auto-possessification" optimization usually applies to
|
||||
character repeats at the end of a pattern (as well as internally). For
|
||||
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
|
||||
matching, this means that only one possible match is found. If you re-
|
||||
NOTE: PCRE2's "auto-possessification" optimization usually applies to
|
||||
character repeats at the end of a pattern (as well as internally). For
|
||||
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
|
||||
matching, this means that only one possible match is found. If you re-
|
||||
ally do want multiple matches in such cases, either use an ungreedy re-
|
||||
peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
|
||||
peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
|
||||
piling.
|
||||
|
||||
Error returns from pcre2_dfa_match()
|
||||
|
||||
The pcre2_dfa_match() function returns a negative number when it fails.
|
||||
Many of the errors are the same as for pcre2_match(), as described
|
||||
Many of the errors are the same as for pcre2_match(), as described
|
||||
above. There are in addition the following errors that are specific to
|
||||
pcre2_dfa_match():
|
||||
|
||||
PCRE2_ERROR_DFA_UITEM
|
||||
|
||||
This return is given if pcre2_dfa_match() encounters an item in the
|
||||
pattern that it does not support, for instance, the use of \C in a UTF
|
||||
This return is given if pcre2_dfa_match() encounters an item in the
|
||||
pattern that it does not support, for instance, the use of \C in a UTF
|
||||
mode or a backreference.
|
||||
|
||||
PCRE2_ERROR_DFA_UCOND
|
||||
|
||||
This return is given if pcre2_dfa_match() encounters a condition item
|
||||
This return is given if pcre2_dfa_match() encounters a condition item
|
||||
that uses a backreference for the condition, or a test for recursion in
|
||||
a specific capture group. These are not supported.
|
||||
|
||||
PCRE2_ERROR_DFA_UINVALID_UTF
|
||||
|
||||
This return is given if pcre2_dfa_match() is called for a pattern that
|
||||
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
|
||||
This return is given if pcre2_dfa_match() is called for a pattern that
|
||||
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
|
||||
DFA matching.
|
||||
|
||||
PCRE2_ERROR_DFA_WSSIZE
|
||||
|
||||
This return is given if pcre2_dfa_match() runs out of space in the
|
||||
This return is given if pcre2_dfa_match() runs out of space in the
|
||||
workspace vector.
|
||||
|
||||
PCRE2_ERROR_DFA_RECURSE
|
||||
|
||||
When a recursion or subroutine call is processed, the matching function
|
||||
calls itself recursively, using private memory for the ovector and
|
||||
workspace. This error is given if the internal ovector is not large
|
||||
enough. This should be extremely rare, as a vector of size 1000 is
|
||||
calls itself recursively, using private memory for the ovector and
|
||||
workspace. This error is given if the internal ovector is not large
|
||||
enough. This should be extremely rare, as a vector of size 1000 is
|
||||
used.
|
||||
|
||||
PCRE2_ERROR_DFA_BADRESTART
|
||||
|
||||
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
|
||||
some plausibility checks are made on the contents of the workspace,
|
||||
which should contain data about the previous partial match. If any of
|
||||
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
|
||||
some plausibility checks are made on the contents of the workspace,
|
||||
which should contain data about the previous partial match. If any of
|
||||
these checks fail, this error is given.
|
||||
|
||||
|
||||
SEE ALSO
|
||||
|
||||
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
|
||||
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
|
||||
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
|
||||
|
||||
|
||||
|
@ -3850,7 +3857,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 30 August 2021
|
||||
Last updated: 30 November 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -5436,7 +5443,7 @@ FREEING JIT SPECULATIVE MEMORY
|
|||
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
||||
|
||||
The JIT executable allocator does not free all memory when it is possi-
|
||||
ble. It expects new allocations, and keeps some free memory around to
|
||||
ble. It expects new allocations, and keeps some free memory around to
|
||||
improve allocation speed. However, in low memory conditions, it might
|
||||
be better to free all possible memory. You can cause this to happen by
|
||||
calling pcre2_jit_free_unused_memory(). Its argument is a general con-
|
||||
|
@ -5494,12 +5501,13 @@ JIT FAST PATH API
|
|||
|
||||
When you call pcre2_match(), as well as testing for invalid options, a
|
||||
number of other sanity checks are performed on the arguments. For exam-
|
||||
ple, if the subject pointer is NULL, an immediate error is given. Also,
|
||||
unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
|
||||
validity. In the interests of speed, these checks do not happen on the
|
||||
JIT fast path, and if invalid data is passed, the result is undefined.
|
||||
ple, if the subject pointer is NULL but the length is non-zero, an im-
|
||||
mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
|
||||
subject string is tested for validity. In the interests of speed, these
|
||||
checks do not happen on the JIT fast path, and if invalid data is
|
||||
passed, the result is undefined.
|
||||
|
||||
Bypassing the sanity checks and the pcre2_match() wrapping can give
|
||||
Bypassing the sanity checks and the pcre2_match() wrapping can give
|
||||
speedups of more than 10%.
|
||||
|
||||
|
||||
|
@ -5517,8 +5525,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 30 November 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "30 August 2021" "PCRE2 10.38"
|
||||
.TH PCRE2API 3 "30 November 2021" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -2624,7 +2624,9 @@ The subject string is passed to \fBpcre2_match()\fP as a pointer in
|
|||
\fIstartoffset\fP. The length and offset are in code units, not characters.
|
||||
That is, they are in bytes for the 8-bit library, 16-bit code units for the
|
||||
16-bit library, and 32-bit code units for the 32-bit library, whether or not
|
||||
UTF processing is enabled.
|
||||
UTF processing is enabled. As a special case, if \fIsubject\fP is NULL and
|
||||
\fIlength\fP is zero, the subject is assumed to be an empty string. If
|
||||
\fIlength\fP is non-zero, an error occurs if \fIsubject\fP is NULL.
|
||||
.P
|
||||
If \fIstartoffset\fP is greater than the length of the subject,
|
||||
\fBpcre2_match()\fP returns PCRE2_ERROR_BADOFFSET. When the starting offset is
|
||||
|
@ -3413,12 +3415,16 @@ same number causes an error at compile time.
|
|||
.P
|
||||
This function optionally calls \fBpcre2_match()\fP and then makes a copy of the
|
||||
subject string in \fIoutputbuffer\fP, replacing parts that were matched with
|
||||
the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
|
||||
option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
|
||||
replacement string(s). The default action is to perform just one replacement if
|
||||
the pattern matches, but there is an option that requests multiple replacements
|
||||
(see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||
the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP, which
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
|
||||
special case, if \fIreplacement\fP is NULL and \fIrlength\fP is zero, the
|
||||
replacement is assumed to be an empty string. If \fIrlength\fP is non-zero, an
|
||||
error occurs if \fIreplacement\fP is NULL.
|
||||
.P
|
||||
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
|
||||
the replacement string(s). The default action is to perform just one
|
||||
replacement if the pattern matches, but there is an option that requests
|
||||
multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||
.P
|
||||
If successful, \fBpcre2_substitute()\fP returns the number of substitutions
|
||||
that were carried out. This may be zero if no match was found, and is never
|
||||
|
@ -3813,12 +3819,13 @@ other alternatives. Ultimately, when it runs out of matches,
|
|||
.P
|
||||
The function \fBpcre2_dfa_match()\fP is called to match a subject string
|
||||
against a compiled pattern, using a matching algorithm that scans the subject
|
||||
string just once (not counting lookaround assertions), and does not backtrack.
|
||||
This has different characteristics to the normal algorithm, and is not
|
||||
compatible with Perl. Some of the features of PCRE2 patterns are not supported.
|
||||
Nevertheless, there are times when this kind of matching can be useful. For a
|
||||
discussion of the two matching algorithms, and a list of features that
|
||||
\fBpcre2_dfa_match()\fP does not support, see the
|
||||
string just once (not counting lookaround assertions), and does not backtrack
|
||||
(except when processing lookaround assertions). This has different
|
||||
characteristics to the normal algorithm, and is not compatible with Perl. Some
|
||||
of the features of PCRE2 patterns are not supported. Nevertheless, there are
|
||||
times when this kind of matching can be useful. For a discussion of the two
|
||||
matching algorithms, and a list of features that \fBpcre2_dfa_match()\fP does
|
||||
not support, see the
|
||||
.\" HREF
|
||||
\fBpcre2matching\fP
|
||||
.\"
|
||||
|
@ -4018,6 +4025,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 30 August 2021
|
||||
Last updated: 30 November 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2JIT 3 "23 May 2019" "PCRE2 10.34"
|
||||
.TH PCRE2JIT 3 "30 November 2021" "PCRE2 10.40"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||
|
@ -251,11 +251,11 @@ non-sequential matches in one thread is to use callouts: if a callout function
|
|||
starts another match, that match must use a different JIT stack to the one used
|
||||
for currently suspended match(es).
|
||||
.P
|
||||
In a multithread application, if you do not
|
||||
specify a JIT stack, or if you assign or pass back NULL from a callback, that
|
||||
is thread-safe, because each thread has its own machine stack. However, if you
|
||||
assign or pass back a non-NULL JIT stack, this must be a different stack for
|
||||
each thread so that the application is thread-safe.
|
||||
In a multithread application, if you do not specify a JIT stack, or if you
|
||||
assign or pass back NULL from a callback, that is thread-safe, because each
|
||||
thread has its own machine stack. However, if you assign or pass back a
|
||||
non-NULL JIT stack, this must be a different stack for each thread so that the
|
||||
application is thread-safe.
|
||||
.P
|
||||
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
|
||||
to a match context that is used by any number of patterns, as long as they are
|
||||
|
@ -355,8 +355,8 @@ out this complicated API.
|
|||
.B void pcre2_jit_free_unused_memory(pcre2_general_context *\fIgcontext\fP);
|
||||
.fi
|
||||
.P
|
||||
The JIT executable allocator does not free all memory when it is possible.
|
||||
It expects new allocations, and keeps some free memory around to improve
|
||||
The JIT executable allocator does not free all memory when it is possible. It
|
||||
expects new allocations, and keeps some free memory around to improve
|
||||
allocation speed. However, in low memory conditions, it might be better to free
|
||||
all possible memory. You can cause this to happen by calling
|
||||
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
|
||||
|
@ -416,10 +416,10 @@ that was not compiled.
|
|||
.P
|
||||
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
|
||||
number of other sanity checks are performed on the arguments. For example, if
|
||||
the subject pointer is NULL, an immediate error is given. Also, unless
|
||||
PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
|
||||
interests of speed, these checks do not happen on the JIT fast path, and if
|
||||
invalid data is passed, the result is undefined.
|
||||
the subject pointer is NULL but the length is non-zero, an immediate error is
|
||||
given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
|
||||
for validity. In the interests of speed, these checks do not happen on the JIT
|
||||
fast path, and if invalid data is passed, the result is undefined.
|
||||
.P
|
||||
Bypassing the sanity checks and the \fBpcre2_match()\fP wrapping can give
|
||||
speedups of more than 10%.
|
||||
|
@ -445,6 +445,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 May 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
Last updated: 30 November 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -3285,6 +3285,10 @@ rws->next = NULL;
|
|||
rws->size = RWS_BASE_SIZE;
|
||||
rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE;
|
||||
|
||||
/* Recognize NULL, length 0 as an empty string. */
|
||||
|
||||
if (subject == NULL && length == 0) subject = (PCRE2_SPTR)"";
|
||||
|
||||
/* Plausibility checks */
|
||||
|
||||
if ((options & ~PUBLIC_DFA_MATCH_OPTIONS) != 0) return PCRE2_ERROR_BADOPTION;
|
||||
|
|
|
@ -253,7 +253,7 @@ static const unsigned char match_error_texts[] =
|
|||
"unknown substring\0"
|
||||
/* 50 */
|
||||
"non-unique substring name\0"
|
||||
"NULL argument passed\0"
|
||||
"NULL argument passed with non-zero length\0"
|
||||
"nested recursion at the same subject position\0"
|
||||
"matching depth limit exceeded\0"
|
||||
"requested value is not available\0"
|
||||
|
|
|
@ -6170,6 +6170,10 @@ PCRE2_SPTR stack_frames_vector[START_FRAMES_SIZE/sizeof(PCRE2_SPTR)]
|
|||
PCRE2_KEEP_UNINITIALIZED;
|
||||
mb->stack_frames = (heapframe *)stack_frames_vector;
|
||||
|
||||
/* Recognize NULL, length 0 as an empty string. */
|
||||
|
||||
if (subject == NULL && length == 0) subject = (PCRE2_SPTR)"";
|
||||
|
||||
/* Plausibility checks */
|
||||
|
||||
if ((options & ~PUBLIC_MATCH_OPTIONS) != 0) return PCRE2_ERROR_BADOPTION;
|
||||
|
|
|
@ -260,9 +260,15 @@ PCRE2_UNSET, so as not to imply an offset in the replacement. */
|
|||
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0)
|
||||
return PCRE2_ERROR_BADOPTION;
|
||||
|
||||
/* Validate length and find the end of the replacement. */
|
||||
/* Validate length and find the end of the replacement. A NULL replacement of
|
||||
zero length is interpreted as an empty string. */
|
||||
|
||||
if (replacement == NULL) return PCRE2_ERROR_NULL;
|
||||
if (replacement == NULL)
|
||||
{
|
||||
if (rlength != 0) return PCRE2_ERROR_NULL;
|
||||
replacement = (PCRE2_SPTR)"";
|
||||
}
|
||||
|
||||
if (rlength == PCRE2_ZERO_TERMINATED) rlength = PRIV(strlen)(replacement);
|
||||
repend = replacement + rlength;
|
||||
|
||||
|
|
|
@ -304,4 +304,7 @@
|
|||
/[aCz]/mg,firstline,newline=lf
|
||||
match\nmatch
|
||||
|
||||
//jitfast
|
||||
\=null_subject
|
||||
|
||||
# End of testinput17
|
||||
|
|
|
@ -135,4 +135,9 @@
|
|||
123ace
|
||||
123ace\=posix_startend=2:6
|
||||
|
||||
//posix
|
||||
\= Expect errors
|
||||
\=null_subject
|
||||
abc\=null_subject
|
||||
|
||||
# End of testdata/testinput18
|
||||
|
|
|
@ -5902,4 +5902,25 @@ a)"xI
|
|||
|
||||
# ---------
|
||||
|
||||
# Tests for zero-length NULL to be treated as an empty string.
|
||||
|
||||
//
|
||||
\=null_subject
|
||||
\= Expect error
|
||||
abc\=null_subject
|
||||
|
||||
//replace=[20]
|
||||
abc\=null_replacement
|
||||
\=null_subject
|
||||
\=null_replacement
|
||||
|
||||
/X*/g,replace=xy
|
||||
\= Expect error
|
||||
>X<\=null_replacement
|
||||
|
||||
/X+/replace=[20]
|
||||
>XX<\=null_replacement
|
||||
|
||||
# ---------
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -550,4 +550,8 @@ Failed: error -47: match limit exceeded
|
|||
match\nmatch
|
||||
0: a (JIT)
|
||||
|
||||
//jitfast
|
||||
\=null_subject
|
||||
0: (JIT)
|
||||
|
||||
# End of testinput17
|
||||
|
|
|
@ -215,4 +215,11 @@ Failed: POSIX code 16: bad argument at offset 0
|
|||
3: <unset>
|
||||
4: c
|
||||
|
||||
//posix
|
||||
\= Expect errors
|
||||
\=null_subject
|
||||
No match: POSIX code 16: bad argument
|
||||
abc\=null_subject
|
||||
No match: POSIX code 16: bad argument
|
||||
|
||||
# End of testdata/testinput18
|
||||
|
|
|
@ -17674,6 +17674,34 @@ Failed: error 199 at offset 14: \K is not allowed in lookarounds (but see PCRE2_
|
|||
|
||||
# ---------
|
||||
|
||||
# Tests for zero-length NULL to be treated as an empty string.
|
||||
|
||||
//
|
||||
\=null_subject
|
||||
0:
|
||||
\= Expect error
|
||||
abc\=null_subject
|
||||
Failed: error -51: NULL argument passed with non-zero length
|
||||
|
||||
//replace=[20]
|
||||
abc\=null_replacement
|
||||
1: abc
|
||||
\=null_subject
|
||||
1:
|
||||
\=null_replacement
|
||||
1:
|
||||
|
||||
/X*/g,replace=xy
|
||||
\= Expect error
|
||||
>X<\=null_replacement
|
||||
Failed: error -51: NULL argument passed with non-zero length
|
||||
|
||||
/X+/replace=[20]
|
||||
>XX<\=null_replacement
|
||||
1: ><
|
||||
|
||||
# ---------
|
||||
|
||||
# End of testinput2
|
||||
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
|
||||
Error -62: bad serialized data
|
||||
|
|
Loading…
Reference in New Issue