Implement PCRE2_SUBSTITUTE_REPLACEMENT_ONLY.

Philip.Hazel 2020-01-22 17:50:12 +00:00
parent 7171d86587
commit e8d70e2459
14 changed files with 591 additions and 460 deletions

@ -11,14 +11,14 @@ Version 10.35
3. A JIT bug is fixed which allowed to read the fields of the compiled 3. A JIT bug is fixed which allowed to read the fields of the compiled
pattern before its existence is checked. pattern before its existence is checked.
4. Back in the PCRE1 day, capturing groups that contained recursive back 4. Back in the PCRE1 day, capturing groups that contained recursive back
references to themselves were made atomic (version 8.01, change 18) because references to themselves were made atomic (version 8.01, change 18) because
after the end of a repeated group, the captured substrings had their values from after the end of a repeated group, the captured substrings had their values from
the final repetition, not from an earlier repetition that might be the the final repetition, not from an earlier repetition that might be the
destination of a backtrack. This feature was documented, and was carried over destination of a backtrack. This feature was documented, and was carried over
into PCRE2. However, it has now been realized that the major refactoring that into PCRE2. However, it has now been realized that the major refactoring that
was done for 10.30 has made this atomicizing unnecessary, and it is confusing was done for 10.30 has made this atomicizing unnecessary, and it is confusing
when users are unaware of it, making some patterns appear not to be working as when users are unaware of it, making some patterns appear not to be working as
expected. Capture values of recursive back references in repeated groups are expected. Capture values of recursive back references in repeated groups are
now correctly backtracked, so this unnecessary restriction has been removed. now correctly backtracked, so this unnecessary restriction has been removed.
@ -28,19 +28,21 @@ now correctly backtracked, so this unnecessary restriction has been removed.
7. Added PCRE2_SUBSTITUTE_MATCHED. 7. Added PCRE2_SUBSTITUTE_MATCHED.
8. Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another 8. Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another
regex engine. The Perl regex folks are aware of this usage and have made a note regex engine. The Perl regex folks are aware of this usage and have made a note
about it. about it.
9. When an assertion is repeated, PCRE2 used to limit the maximum repetition to 9. When an assertion is repeated, PCRE2 used to limit the maximum repetition to
1, believing that repeating an assertion is pointless. However, if a positive 1, believing that repeating an assertion is pointless. However, if a positive
assertion contains capturing groups, repetition can be useful. In any case, an assertion contains capturing groups, repetition can be useful. In any case, an
assertion could always be wrapped in a repeated group. The only restriction assertion could always be wrapped in a repeated group. The only restriction
that is now imposed is that an unlimited maximum is changed to one more than that is now imposed is that an unlimited maximum is changed to one more than
the minimum. the minimum.
10. Fix *THEN verbs in lookahead assertions in JIT. 10. Fix *THEN verbs in lookahead assertions in JIT.
11. Added PCRE2_SUBSTITUTE_REPLACEMENT_ONLY.
Version 10.34 21-November-2019 Version 10.34 21-November-2019
------------------------------ ------------------------------

@ -82,6 +82,7 @@ zero-terminated strings. The options are:
PCRE2_SUBSTITUTE_LITERAL The replacement string is literal PCRE2_SUBSTITUTE_LITERAL The replacement string is literal
PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for 1st match PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for 1st match
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY Return only replacement string(s)
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
</pre> </pre>

@ -3305,10 +3305,11 @@ same number causes an error at compile time.
This function optionally calls <b>pcre2_match()</b> and then makes a copy of the This function optionally calls <b>pcre2_match()</b> and then makes a copy of the
subject string in <i>outputbuffer</i>, replacing parts that were matched with subject string in <i>outputbuffer</i>, replacing parts that were matched with
the <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This the <i>replacement</i> string, whose length is supplied in <b>rlength</b>. This
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. The default can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
is to perform just one replacement if the pattern matches, but there is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
option that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below replacement string(s). The default action is to perform just one replacement if
for details). the pattern matches, but there is an option that requests multiple replacements
(see PCRE2_SUBSTITUTE_GLOBAL below for details).
</P> </P>
<P> <P>
If successful, <b>pcre2_substitute()</b> returns the number of substitutions If successful, <b>pcre2_substitute()</b> returns the number of substitutions
@ -3349,10 +3350,19 @@ an application to check for a match before choosing to substitute, without
having to repeat the match. having to repeat the match.
</P> </P>
<P> <P>
The <i>code</i> argument is not used for the first substitution, but if The <i>code</i> argument is not used for the first substitution when
PCRE2_SUBSTITUTE_GLOBAL is set, <b>pcre2_match()</b> will be called after the PCRE2_SUBSTITUTE_MATCHED is set, but if PCRE2_SUBSTITUTE_GLOBAL is also set,
first substitution to check for further matches, and the contents of the <b>pcre2_match()</b> will be called after the first substitution to check for
<i>match_data</i> block will be changed. further matches, and the contents of the <i>match_data</i> block will be
changed.
</P>
<P>
The default is to return a copy of the subject string with matched substrings
replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the
replacement substrings are returned. In the global case, multiple replacements
are concatenated in the output buffer. Substitution callouts (see
<a href="#subcallouts">below)</a>
can be used to separate them if necessary.
</P> </P>
<P> <P>
The <i>outlengthptr</i> argument of <b>pcre2_substitute()</b> must point to a The <i>outlengthptr</i> argument of <b>pcre2_substitute()</b> must point to a
@ -3560,7 +3570,7 @@ As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the <b>pcre2_get_error_message()</b> function (see obtained by calling the <b>pcre2_get_error_message()</b> function (see
"Obtaining a textual error message" "Obtaining a textual error message"
<a href="#geterrormessage">above).</a> <a href="#geterrormessage">above).</a>
</P> <a name="subcallouts"></a></P>
<br><b> <br><b>
Substitution callouts Substitution callouts
</b><br> </b><br>
@ -3897,9 +3907,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 27 December 2019 Last updated: 22 January 2020
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2020 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.
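
The two output modes described in the pcre2api change above can be sketched outside PCRE2. What follows is a minimal emulation using Python's re module, not a call to pcre2_substitute(). The function name substitute_sketch and its parameters are hypothetical, and the replacement uses Python's \1 template syntax rather than PCRE2's $1. It illustrates only the difference between returning a copy of the subject with matches replaced and returning the replacement strings alone, concatenated in the global case.

```python
import re


def substitute_sketch(pattern, subject, replacement,
                      global_=False, replacement_only=False):
    """Emulate the default and replacement-only output modes.

    Illustration only: this is Python's re engine, not PCRE2, and the
    replacement template uses Python syntax (\\1), not PCRE2 syntax ($1).
    Returns (number_of_substitutions, output_string).
    """
    pieces = []
    pos = 0      # end of the previous match in the original subject
    count = 0
    for m in re.finditer(pattern, subject):
        if not replacement_only:
            # Default mode: keep the unmatched text before this match.
            pieces.append(subject[pos:m.start()])
        # Both modes: emit the expanded replacement string.
        pieces.append(m.expand(replacement))
        pos = m.end()
        count += 1
        if not global_:
            break  # non-global: only the first match is replaced
    if not replacement_only:
        # Default mode: keep the rest of the subject after the last match.
        pieces.append(subject[pos:])
    return count, "".join(pieces)


if __name__ == "__main__":
    subject = "=abc= =abc="
    print(substitute_sketch(r"a(b)c", subject, r"+\1+"))
    print(substitute_sketch(r"a(b)c", subject, r"+\1+", global_=True))
    print(substitute_sketch(r"a(b)c", subject, r"+\1+",
                            global_=True, replacement_only=True))
```

In the replacement-only global case the replacements run together with nothing between them, which is why the documentation suggests substitution callouts as a way to separate them.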

@ -1050,25 +1050,27 @@ modifier list, in which case they are applied to every subject line that is
processed with that pattern. These modifiers do not affect the compilation processed with that pattern. These modifiers do not affect the compilation
process. process.
<pre> <pre>
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allvector show the entire ovector allvector show the entire ovector
allusedtext show all consulted text allusedtext show all consulted text
altglobal alternative global matching altglobal alternative global matching
/g global global matching /g global global matching
jitstack=&#60;n&#62; set size of JIT stack jitstack=&#60;n&#62; set size of JIT stack
mark show mark values mark show mark values
replace=&#60;string&#62; specify a replacement string replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_skip=&#60;n&#62; skip substitution number n substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_stop=&#60;n&#62; skip substitution number n and greater substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_skip=&#60;n&#62; skip substitution &#60;n&#62;
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_stop=&#60;n&#62; skip substitution &#60;n&#62; and following
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
</pre> </pre>
These modifiers may not appear in a <b>#pattern</b> command. If you want them as These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command. defaults, set them in a <b>#subject</b> command.
@ -1235,7 +1237,9 @@ pattern.
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_skip=&#60;n&#62; skip substitution number n substitute_skip=&#60;n&#62; skip substitution number n
substitute_stop=&#60;n&#62; skip substitution number n and greater substitute_stop=&#60;n&#62; skip substitution number n and greater
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
@ -1397,9 +1401,10 @@ Testing the substitution function
</b><br> </b><br>
<P> <P>
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
called instead of one of the matching functions. Note that replacement strings called instead of one of the matching functions (or after one call of
cannot contain commas, because a comma signifies the end of a modifier. This is <b>pcre2_match()</b> in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
not thought to be an issue in a test program. replacement strings cannot contain commas, because a comma signifies the end of
a modifier. This is not thought to be an issue in a test program.
</P> </P>
<P> <P>
Unlike subject strings, <b>pcre2test</b> does not process replacement strings Unlike subject strings, <b>pcre2test</b> does not process replacement strings
@ -1416,11 +1421,15 @@ for <b>pcre2_substitute()</b>:
global PCRE2_SUBSTITUTE_GLOBAL global PCRE2_SUBSTITUTE_GLOBAL
substitute_extended PCRE2_SUBSTITUTE_EXTENDED substitute_extended PCRE2_SUBSTITUTE_EXTENDED
substitute_literal PCRE2_SUBSTITUTE_LITERAL substitute_literal PCRE2_SUBSTITUTE_LITERAL
substitute_matched PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
</pre>
</PRE> See the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation for details of these options.
</P> </P>
<P> <P>
After a successful substitution, the modified string is output, preceded by the After a successful substitution, the modified string is output, preceded by the
@ -2096,9 +2105,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 26 December 2019 Last updated: 22 January 2020
<br> <br>
Copyright &copy; 1997-2019 University of Cambridge. Copyright &copy; 1997-2020 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -3196,10 +3196,12 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
This function optionally calls pcre2_match() and then makes a copy of This function optionally calls pcre2_match() and then makes a copy of
the subject string in outputbuffer, replacing parts that were matched the subject string in outputbuffer, replacing parts that were matched
with the replacement string, whose length is supplied in rlength. This with the replacement string, whose length is supplied in rlength. This
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. The can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
default is to perform just one replacement if the pattern matches, but There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
there is an option that requests multiple replacements (see PCRE2_SUB- turn just the replacement string(s). The default action is to perform
STITUTE_GLOBAL below for details). just one replacement if the pattern matches, but there is an option
that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below
for details).
If successful, pcre2_substitute() returns the number of substitutions If successful, pcre2_substitute() returns the number of substitutions
that were carried out. This may be zero if no match was found, and is that were carried out. This may be zero if no match was found, and is
@ -3234,53 +3236,60 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
application to check for a match before choosing to substitute, without application to check for a match before choosing to substitute, without
having to repeat the match. having to repeat the match.
The code argument is not used for the first substitution, but if The code argument is not used for the first substitution when
PCRE2_SUBSTITUTE_GLOBAL is set, pcre2_match() will be called after the PCRE2_SUBSTITUTE_MATCHED is set, but if PCRE2_SUBSTITUTE_GLOBAL is also
first substitution to check for further matches, and the contents of set, pcre2_match() will be called after the first substitution to check
the match_data block will be changed. for further matches, and the contents of the match_data block will be
changed.
The outlengthptr argument of pcre2_substitute() must point to a vari- The default is to return a copy of the subject string with matched sub-
able that contains the length, in code units, of the output buffer. If strings replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set,
the function is successful, the value is updated to contain the length only the replacement substrings are returned. In the global case, mul-
of the new string, excluding the trailing zero that is automatically tiple replacements are concatenated in the output buffer. Substitution
callouts (see below) can be used to separate them if necessary.
The outlengthptr argument of pcre2_substitute() must point to a vari-
able that contains the length, in code units, of the output buffer. If
the function is successful, the value is updated to contain the length
of the new string, excluding the trailing zero that is automatically
added. added.
If the function is not successful, the value set via outlengthptr de- If the function is not successful, the value set via outlengthptr de-
pends on the type of error. For syntax errors in the replacement pends on the type of error. For syntax errors in the replacement
string, the value is the offset in the replacement string where the er- string, the value is the offset in the replacement string where the er-
ror was detected. For other errors, the value is PCRE2_UNSET by de- ror was detected. For other errors, the value is PCRE2_UNSET by de-
fault. This includes the case of the output buffer being too small, un- fault. This includes the case of the output buffer being too small, un-
less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which case less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which case
the value is the minimum length needed, including space for the trail- the value is the minimum length needed, including space for the trail-
ing zero. Note that in order to compute the required length, pcre2_sub- ing zero. Note that in order to compute the required length, pcre2_sub-
stitute() has to simulate all the matching and copying, instead of giv- stitute() has to simulate all the matching and copying, instead of giv-
ing an error return as soon as the buffer overflows. Note also that the ing an error return as soon as the buffer overflows. Note also that the
length is in code units, not bytes. length is in code units, not bytes.
The replacement string, which is interpreted as a UTF string in UTF The replacement string, which is interpreted as a UTF string in UTF
mode, is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option mode, is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option
is set. If the PCRE2_SUBSTITUTE_LITERAL option is set, it is not inter- is set. If the PCRE2_SUBSTITUTE_LITERAL option is set, it is not inter-
preted in any way. By default, however, a dollar character is an escape preted in any way. By default, however, a dollar character is an escape
character that can specify the insertion of characters from capture character that can specify the insertion of characters from capture
groups and names from (*MARK) or other control verbs in the pattern. groups and names from (*MARK) or other control verbs in the pattern.
The following forms are always recognized: The following forms are always recognized:
$$ insert a dollar character $$ insert a dollar character
$<n> or ${<n>} insert the contents of group <n> $<n> or ${<n>} insert the contents of group <n>
$*MARK or ${*MARK} insert a control verb name $*MARK or ${*MARK} insert a control verb name
Either a group number or a group name can be given for <n>. Curly Either a group number or a group name can be given for <n>. Curly
brackets are required only if the following character would be inter- brackets are required only if the following character would be inter-
preted as part of the number or name. The number may be zero to include preted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is the entire matched string. For example, if the pattern a(b)c is
matched with "=abc=" and the replacement string "+$1$0$1+", the result matched with "=abc=" and the replacement string "+$1$0$1+", the result
is "=+babcb+=". is "=+babcb+=".
$*MARK inserts the name from the last encountered backtracking control $*MARK inserts the name from the last encountered backtracking control
verb on the matching path that has a name. (*MARK) must always include verb on the matching path that has a name. (*MARK) must always include
a name, but the other verbs need not. For example, in the case of a name, but the other verbs need not. For example, in the case of
(*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B) (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
the relevant name is "B". This facility can be used to perform simple the relevant name is "B". This facility can be used to perform simple
simultaneous substitutions, as this pcre2test example shows: simultaneous substitutions, as this pcre2test example shows:
/(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
@ -3288,15 +3297,15 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
2: pear orange 2: pear orange
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
string, replacing every matching substring. If this option is not set, string, replacing every matching substring. If this option is not set,
only the first matching substring is replaced. The search for matches only the first matching substring is replaced. The search for matches
takes place in the original subject string (that is, previous replace- takes place in the original subject string (that is, previous replace-
ments do not affect it). Iteration is implemented by advancing the ments do not affect it). Iteration is implemented by advancing the
startoffset value for each search, which is always passed the entire startoffset value for each search, which is always passed the entire
subject string. If an offset limit is set in the match context, search- subject string. If an offset limit is set in the match context, search-
ing stops when that limit is reached. ing stops when that limit is reached.
You can restrict the effect of a global substitution to a portion of You can restrict the effect of a global substitution to a portion of
the subject string by setting either or both of startoffset and an off- the subject string by setting either or both of startoffset and an off-
set limit. Here is a pcre2test example: set limit. Here is a pcre2test example:
@ -3304,87 +3313,87 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
ABC ABC ABC ABC\=offset=3,offset_limit=12 ABC ABC ABC ABC\=offset=3,offset_limit=12
2: ABC A!C A!C ABC 2: ABC A!C A!C ABC
When continuing with global substitutions after matching a substring When continuing with global substitutions after matching a substring
with zero length, an attempt to find a non-empty match at the same off- with zero length, an attempt to find a non-empty match at the same off-
set is performed. If this is not successful, the offset is advanced by set is performed. If this is not successful, the offset is advanced by
one character except when CRLF is a valid newline sequence and the next one character except when CRLF is a valid newline sequence and the next
two characters are CR, LF. In this case, the offset is advanced by two two characters are CR, LF. In this case, the offset is advanced by two
characters. characters.
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
ORY immediately. If this option is set, however, pcre2_substitute() ORY immediately. If this option is set, however, pcre2_substitute()
continues to go through the motions of matching and substituting (with- continues to go through the motions of matching and substituting (with-
out, of course, writing anything) in order to compute the size of buf- out, of course, writing anything) in order to compute the size of buf-
fer that is needed. This value is passed back via the outlengthptr fer that is needed. This value is passed back via the outlengthptr
variable, with the result of the function still being PCRE2_ER- variable, with the result of the function still being PCRE2_ER-
ROR_NOMEMORY. ROR_NOMEMORY.
Passing a buffer size of zero is a permitted way of finding out how Passing a buffer size of zero is a permitted way of finding out how
much memory is needed for given substitution. However, this does mean much memory is needed for given substitution. However, this does mean
that the entire operation is carried out twice. Depending on the appli- that the entire operation is carried out twice. Depending on the appli-
cation, it may be more efficient to allocate a large buffer and free cation, it may be more efficient to allocate a large buffer and free
the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
FLOW_LENGTH. FLOW_LENGTH.
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
do not appear in the pattern to be treated as unset groups. This option do not appear in the pattern to be treated as unset groups. This option
should be used with care, because it means that a typo in a group name should be used with care, because it means that a typo in a group name
or number no longer causes the PCRE2_ERROR_NOSUBSTRING error. or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un- PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
as empty strings when inserted as described above. If this option is as empty strings when inserted as described above. If this option is
not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN- not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
SET error. This option does not influence the extended substitution SET error. This option does not influence the extended substitution
syntax described below. syntax described below.
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
replacement string. Without this option, only the dollar character is replacement string. Without this option, only the dollar character is
special, and only the group insertion forms listed above are valid. special, and only the group insertion forms listed above are valid.
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
Firstly, backslash in a replacement string is interpreted as an escape Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify character. The usual forms such as \n or \x{ddd} can be used to specify
particular character codes, and backslash followed by any non-alphanu- particular character codes, and backslash followed by any non-alphanu-
meric character quotes that character. Extended quoting can be coded meric character quotes that character. Extended quoting can be coded
using \Q...\E, exactly as in pattern strings. using \Q...\E, exactly as in pattern strings.
There are also four escape sequences for forcing the case of inserted There are also four escape sequences for forcing the case of inserted
letters. The insertion mechanism has three states: no case forcing, letters. The insertion mechanism has three states: no case forcing,
force upper case, and force lower case. The escape sequences change the force upper case, and force lower case. The escape sequences change the
current state: \U and \L change to upper or lower case forcing, respec- current state: \U and \L change to upper or lower case forcing, respec-
tively, and \E (when not terminating a \Q quoted sequence) reverts to tively, and \E (when not terminating a \Q quoted sequence) reverts to
no case forcing. The sequences \u and \l force the next character (if no case forcing. The sequences \u and \l force the next character (if
it is a letter) to upper or lower case, respectively, and then the it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to state automatically reverts to no case forcing. Case forcing applies to
all inserted characters, including those from capture groups and let- all inserted characters, including those from capture groups and let-
ters within \Q...\E quoted sequences. ters within \Q...\E quoted sequences.
Note that case forcing sequences such as \U...\E do not nest. For exam- Note that case forcing sequences such as \U...\E do not nest. For exam-
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
\E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX- \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX-
TRA_ALT_BSUX options do not apply to replacement strings. TRA_ALT_BSUX options do not apply to replacement strings.
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to capture group substitution. The syntax is similar to flexibility to capture group substitution. The syntax is similar to
that used by Bash: that used by Bash:
${<n>:-<string>} ${<n>:-<string>}
${<n>:+<string1>:<string2>} ${<n>:+<string1>:<string2>}
As before, <n> may be a group number or a name. The first form speci- As before, <n> may be a group number or a name. The first form speci-
fies a default value. If group <n> is set, its value is inserted; if fies a default value. If group <n> is set, its value is inserted; if
not, <string> is expanded and the result inserted. The second form not, <string> is expanded and the result inserted. The second form
specifies strings that are expanded and inserted when group <n> is set specifies strings that are expanded and inserted when group <n> is set
or unset, respectively. The first form is just a convenient shorthand or unset, respectively. The first form is just a convenient shorthand
for for
${<n>:+${<n>}:<string>} ${<n>:+${<n>}:<string>}
Backslash can be used to escape colons and closing curly brackets in Backslash can be used to escape colons and closing curly brackets in
the replacement strings. A change of the case forcing state within a the replacement strings. A change of the case forcing state within a
replacement string remains in force afterwards, as shown in this replacement string remains in force afterwards, as shown in this
pcre2test example: pcre2test example:
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
@ -3393,8 +3402,8 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
somebody somebody
1: HELLO 1: HELLO
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un- substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
known groups in the extended syntax forms to be treated as unset. known groups in the extended syntax forms to be treated as unset.
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
@ -3403,37 +3412,37 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
Substitution errors Substitution errors
In the event of an error, pcre2_substitute() returns a negative error In the event of an error, pcre2_substitute() returns a negative error
code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors
from pcre2_match() are passed straight back. from pcre2_match() are passed straight back.
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN- when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
SET_EMPTY is not set. SET_EMPTY is not set.
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
of buffer that is needed is returned via outlengthptr. Note that this of buffer that is needed is returned via outlengthptr. Note that this
does not happen by default. does not happen by default.
PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
match_data argument is NULL. match_data argument is NULL.
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
the replacement string, with more particular errors being PCRE2_ER- the replacement string, with more particular errors being PCRE2_ER-
ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
(closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
(the pattern match ended before it started or the match started earlier (the pattern match ended before it started or the match started earlier
than the current position in the subject, which can happen if \K is than the current position in the subject, which can happen if \K is
used in an assertion). used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the pcre2_get_error_message() function (see "Ob- obtained by calling the pcre2_get_error_message() function (see "Ob-
taining a textual error message" above). taining a textual error message" above).
Substitution callouts Substitution callouts
@ -3442,15 +3451,15 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
int (*callout_function)(pcre2_substitute_callout_block *, void *), int (*callout_function)(pcre2_substitute_callout_block *, void *),
void *callout_data); void *callout_data);
The pcre2_set_substitution_callout() function can be used to specify a The pcre2_set_substitution_callout() function can be used to specify a
callout function for pcre2_substitute(). This information is passed in callout function for pcre2_substitute(). This information is passed in
a match context. The callout function is called after each substitution a match context. The callout function is called after each substitution
has been processed, but it can cause the replacement not to happen. The has been processed, but it can cause the replacement not to happen. The
callout function is not called for simulated substitutions that happen callout function is not called for simulated substitutions that happen
as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
The first argument of the callout function is a pointer to a substitute The first argument of the callout function is a pointer to a substitute
callout block structure, which contains the following fields, not nec- callout block structure, which contains the following fields, not nec-
essarily in this order: essarily in this order:
uint32_t version; uint32_t version;
@ -3461,34 +3470,34 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
uint32_t oveccount; uint32_t oveccount;
PCRE2_SIZE output_offsets[2]; PCRE2_SIZE output_offsets[2];
The version field contains the version number of the block format. The The version field contains the version number of the block format. The
current version is 0. The version number will increase in future if current version is 0. The version number will increase in future if
more fields are added, but the intention is never to remove any of the more fields are added, but the intention is never to remove any of the
existing fields. existing fields.
The subscount field is the number of the current match. It is 1 for the The subscount field is the number of the current match. It is 1 for the
first callout, 2 for the second, and so on. The input and output point- first callout, 2 for the second, and so on. The input and output point-
ers are copies of the values passed to pcre2_substitute(). ers are copies of the values passed to pcre2_substitute().
The ovector field points to the ovector, which contains the result of The ovector field points to the ovector, which contains the result of
the most recent match. The oveccount field contains the number of pairs the most recent match. The oveccount field contains the number of pairs
that are set in the ovector, and is always greater than zero. that are set in the ovector, and is always greater than zero.
The output_offsets vector contains the offsets of the replacement in The output_offsets vector contains the offsets of the replacement in
the output string. This has already been processed for dollar and (if the output string. This has already been processed for dollar and (if
requested) backslash substitutions as described above. requested) backslash substitutions as described above.
The second argument of the callout function is the value passed as The second argument of the callout function is the value passed as
callout_data when the function was registered. The value returned by callout_data when the function was registered. The value returned by
the callout function is interpreted as follows: the callout function is interpreted as follows:
If the value is zero, the replacement is accepted, and, if PCRE2_SUB- If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
STITUTE_GLOBAL is set, processing continues with a search for the next STITUTE_GLOBAL is set, processing continues with a search for the next
match. If the value is not zero, the current replacement is not ac- match. If the value is not zero, the current replacement is not ac-
cepted. If the value is greater than zero, processing continues when cepted. If the value is greater than zero, processing continues when
PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is
copied to the output and the call to pcre2_substitute() exits, return- copied to the output and the call to pcre2_substitute() exits, return-
ing the number of matches so far. ing the number of matches so far.
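The return-value rules above can be restated as a small decision function. This is only a sketch: the names callout_outcome and interpret_callout_result are hypothetical and are not part of the PCRE2 API. The logic is taken directly from the paragraph above.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of PCRE2: restates how pcre2_substitute()
   is documented to treat the value returned by a substitution callout. */
typedef struct {
  bool replacement_accepted;  /* the current replacement is kept */
  bool search_continues;      /* a search for the next match follows */
} callout_outcome;

callout_outcome interpret_callout_result(int rc, bool global)
{
  callout_outcome o;
  /* Only a zero return accepts the replacement. */
  o.replacement_accepted = (rc == 0);
  /* Processing goes on only when PCRE2_SUBSTITUTE_GLOBAL is set and the
     callout did not return a negative value. */
  o.search_continues = global && rc >= 0;
  return o;
}
```

When search_continues is false, the documented behaviour is that the rest of the input is copied to the output and pcre2_substitute() returns the number of matches so far.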
@ -3497,56 +3506,56 @@ DUPLICATE CAPTURE GROUP NAMES
int pcre2_substring_nametable_scan(const pcre2_code *code, int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
When a pattern is compiled with the PCRE2_DUPNAMES option, names for When a pattern is compiled with the PCRE2_DUPNAMES option, names for
capture groups are not required to be unique. Duplicate names are al- capture groups are not required to be unique. Duplicate names are al-
ways allowed for groups with the same number, created by using the (?| ways allowed for groups with the same number, created by using the (?|
feature. Indeed, if such groups are named, they are required to use the feature. Indeed, if such groups are named, they are required to use the
same names. same names.
Normally, patterns that use duplicate names are such that in any one Normally, patterns that use duplicate names are such that in any one
match, only one of each set of identically-named groups participates. match, only one of each set of identically-named groups participates.
An example is shown in the pcre2pattern documentation. An example is shown in the pcre2pattern documentation.
When duplicates are present, pcre2_substring_copy_byname() and When duplicates are present, pcre2_substring_copy_byname() and
pcre2_substring_get_byname() return the first substring corresponding pcre2_substring_get_byname() return the first substring corresponding
to the given name that is set. Only if none are set is PCRE2_ERROR_UN- to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
SET returned. The pcre2_substring_number_from_name() function re- SET returned. The pcre2_substring_number_from_name() function re-
turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
names. names.
If you want to get full details of all captured substrings for a given If you want to get full details of all captured substrings for a given
name, you must use the pcre2_substring_nametable_scan() function. The name, you must use the pcre2_substring_nametable_scan() function. The
first argument is the compiled pattern, and the second is the name. If first argument is the compiled pattern, and the second is the name. If
the third and fourth arguments are NULL, the function returns a group the third and fourth arguments are NULL, the function returns a group
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers When the third and fourth arguments are not NULL, they must be pointers
to variables that are updated by the function. After it has run, they to variables that are updated by the function. After it has run, they
point to the first and last entries in the name-to-number table for the point to the first and last entries in the name-to-number table for the
given name, and the function returns the length of each entry in code given name, and the function returns the length of each entry in code
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name. no entries for the given name.
The format of the name table is described above in the section entitled The format of the name table is described above in the section entitled
Information about a pattern. Given all the relevant entries for the Information about a pattern. Given all the relevant entries for the
name, you can extract each of their numbers, and hence the captured name, you can extract each of their numbers, and hence the captured
data. data.
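As a rough illustration of extracting the group numbers, the sketch below walks the entries between the first and last pointers. It assumes the 8-bit library, where each entry begins with a two-byte group number (most significant byte first) followed by the zero-terminated name. The function name group_numbers_from_entries is hypothetical and is not part of PCRE2.

```c
#include <stddef.h>

/* Illustrative helper, not a PCRE2 function. first and last point to the
   first and last name table entries for one name, and entry_size is the
   length of each entry in code units, as yielded by
   pcre2_substring_nametable_scan(). Assumes 8-bit code units. */
size_t group_numbers_from_entries(const unsigned char *first,
  const unsigned char *last, int entry_size, int *numbers, size_t max)
{
  size_t count = 0;
  const unsigned char *entry;
  if (entry_size <= 0) return 0;   /* a negative value is an error code */
  for (entry = first; entry <= last && count < max; entry += entry_size)
    {
    /* The group number occupies the first two bytes, high byte first. */
    numbers[count++] = (entry[0] << 8) | entry[1];
    }
  return count;
}
```

Each number collected this way could then be passed to a by-number function such as pcre2_substring_get_bynumber() to fetch the captured data.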
FINDING ALL POSSIBLE MATCHES AT ONE POSITION FINDING ALL POSSIBLE MATCHES AT ONE POSITION
The traditional matching function uses a similar algorithm to Perl, The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match at a given point in the sub- which stops when it finds the first match at a given point in the sub-
ject. If you want to find all possible matches, or the longest possible ject. If you want to find all possible matches, or the longest possible
match at a given position, consider using the alternative matching match at a given position, consider using the alternative matching
function (see below) instead. If you cannot use the alternative func- function (see below) instead. If you cannot use the alternative func-
tion, you can kludge it up by making use of the callout facility, which tion, you can kludge it up by making use of the callout facility, which
is described in the pcre2callout documentation. is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat- What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur- tern. When your callout function is called, extract and save the cur-
rent matched substring. Then return 1, which forces pcre2_match() to rent matched substring. Then return 1, which forces pcre2_match() to
backtrack and try other alternatives. Ultimately, when it runs out of backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@ -3558,26 +3567,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
pcre2_match_context *mcontext, pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount); int *workspace, PCRE2_SIZE wscount);
The function pcre2_dfa_match() is called to match a subject string The function pcre2_dfa_match() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does subject string just once (not counting lookaround assertions), and does
not backtrack. This has different characteristics to the normal algo- not backtrack. This has different characteristics to the normal algo-
rithm, and is not compatible with Perl. Some of the features of PCRE2 rithm, and is not compatible with Perl. Some of the features of PCRE2
patterns are not supported. Nevertheless, there are times when this patterns are not supported. Nevertheless, there are times when this
kind of matching can be useful. For a discussion of the two matching kind of matching can be useful. For a discussion of the two matching
algorithms, and a list of features that pcre2_dfa_match() does not sup- algorithms, and a list of features that pcre2_dfa_match() does not sup-
port, see the pcre2matching documentation. port, see the pcre2matching documentation.
The arguments for the pcre2_dfa_match() function are the same as for The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com- is used in a different way, and this is described below. The other com-
mon arguments are used in the same way as for pcre2_match(), so their mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here. description is not repeated here.
The two additional arguments provide workspace for the function. The The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More keeping track of multiple paths through the pattern tree. More
workspace is needed for patterns and subjects where there are a lot of workspace is needed for patterns and subjects where there are a lot of
potential matches. potential matches.
Here is an example of a simple call to pcre2_dfa_match(): Here is an example of a simple call to pcre2_dfa_match():
@ -3597,45 +3606,45 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre2_dfa_match() Option bits for pcre2_dfa_match()
The unused bits of the options argument for pcre2_dfa_match() must be The unused bits of the options argument for pcre2_dfa_match() must be
zero. The only bits that may be set are PCRE2_ANCHORED, zero. The only bits that may be set are PCRE2_ANCHORED,
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO- PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
PCRE2_DFA_RESTART. All but the last four of these are exactly the same PCRE2_DFA_RESTART. All but the last four of these are exactly the same
as for pcre2_match(), so their description is not repeated here. as for pcre2_match(), so their description is not repeated here.
PCRE2_PARTIAL_HARD PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT PCRE2_PARTIAL_SOFT
These have the same general effect as they do for pcre2_match(), but These have the same general effect as they do for pcre2_match(), but
the details are slightly different. When PCRE2_PARTIAL_HARD is set for the details are slightly different. When PCRE2_PARTIAL_HARD is set for
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete that requires additional characters. This happens even if some complete
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
if the end of the subject is reached, there have been no complete if the end of the subject is reached, there have been no complete
matches, but there is still at least one matching possibility. The por- matches, but there is still at least one matching possibility. The por-
tion of the string that was inspected when the longest partial match tion of the string that was inspected when the longest partial match
was found is set as the first matching string in both cases. There is a was found is set as the first matching string in both cases. There is a
more detailed discussion of partial and multi-segment matching, with more detailed discussion of partial and multi-segment matching, with
examples, in the pcre2partial documentation. examples, in the pcre2partial documentation.
PCRE2_DFA_SHORTEST PCRE2_DFA_SHORTEST
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna- stop as soon as it has found one match. Because of the way the alterna-
tive algorithm works, this is necessarily the shortest possible match tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string. at the first possible matching point in the subject string.
PCRE2_DFA_RESTART PCRE2_DFA_RESTART
When pcre2_dfa_match() returns a partial match, it is possible to call When pcre2_dfa_match() returns a partial match, it is possible to call
it again, with additional subject characters, and have it continue with it again, with additional subject characters, and have it continue with
the same match. The PCRE2_DFA_RESTART option requests this action; when the same match. The PCRE2_DFA_RESTART option requests this action; when
it is set, the workspace and wscount options must reference the same it is set, the workspace and wscount options must reference the same
vector as before because data about the match so far is left in them vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the after a partial match. There is more discussion of this facility in the
pcre2partial documentation. pcre2partial documentation.
@ -3643,8 +3652,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
When pcre2_dfa_match() succeeds, it may have matched more than one sub- When pcre2_dfa_match() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run string in the subject. Note, however, that all the matches from one run
of the function start at the same point in the subject. The shorter of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example, matches are all initial substrings of the longer matches. For example,
if the pattern if the pattern
<.*> <.*>
@ -3659,80 +3668,80 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else> <something> <something else>
<something> <something>
On success, the yield of the function is a number greater than zero, On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The offsets of the sub- which is the number of matched substrings. The offsets of the sub-
strings are returned in the ovector, and can be extracted by number in strings are returned in the ovector, and can be extracted by number in
the same way as for pcre2_match(), but the numbers bear no relation to the same way as for pcre2_match(), but the numbers bear no relation to
any capture groups that may exist in the pattern, because DFA matching any capture groups that may exist in the pattern, because DFA matching
does not support capturing. does not support capturing.
Calls to the convenience functions that extract substrings by name re- Calls to the convenience functions that extract substrings by name re-
turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af- turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
ter a DFA match. The convenience functions that extract substrings by ter a DFA match. The convenience functions that extract substrings by
number never return PCRE2_ERROR_NOSUBSTRING. number never return PCRE2_ERROR_NOSUBSTRING.
The matched strings are stored in the ovector in reverse order of The matched strings are stored in the ovector in reverse order of
length; that is, the longest matching string is first. If there were length; that is, the longest matching string is first. If there were
too many matches to fit into the ovector, the yield of the function is too many matches to fit into the ovector, the yield of the function is
zero, and the vector is filled with the longest matches. zero, and the vector is filled with the longest matches.
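The yield conventions above can be folded into one helper that says how many ovector pairs hold match offsets after a call. This is a sketch under stated assumptions: dfa_pairs_available is a hypothetical name, and the zero case assumes that "filled" means every pair in the ovector is in use.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. rc is the value returned by
   pcre2_dfa_match(); ovector_pairs is the number of offset pairs the
   match data block can hold. */
uint32_t dfa_pairs_available(int rc, uint32_t ovector_pairs)
{
  if (rc > 0) return (uint32_t)rc;    /* every match fitted */
  if (rc == 0) return ovector_pairs;  /* overflow: longest matches kept */
  return 0;                           /* negative: error or no match */
}
```

In a real program the pair count would normally come from pcre2_get_ovector_count(). Pair 0 is the longest match, because the strings are stored in reverse order of length.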
NOTE: PCRE2's "auto-possessification" optimization usually applies to NOTE: PCRE2's "auto-possessification" optimization usually applies to
character repeats at the end of a pattern (as well as internally). For character repeats at the end of a pattern (as well as internally). For
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
matching, this means that only one possible match is found. If you re- matching, this means that only one possible match is found. If you re-
ally do want multiple matches in such cases, either use an ungreedy re- ally do want multiple matches in such cases, either use an ungreedy re-
peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com- peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
piling. piling.
Error returns from pcre2_dfa_match() Error returns from pcre2_dfa_match()
The pcre2_dfa_match() function returns a negative number when it fails. The pcre2_dfa_match() function returns a negative number when it fails.
Many of the errors are the same as for pcre2_match(), as described Many of the errors are the same as for pcre2_match(), as described
above. There are in addition the following errors that are specific to above. There are in addition the following errors that are specific to
pcre2_dfa_match(): pcre2_dfa_match():
PCRE2_ERROR_DFA_UITEM PCRE2_ERROR_DFA_UITEM
This return is given if pcre2_dfa_match() encounters an item in the This return is given if pcre2_dfa_match() encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF pattern that it does not support, for instance, the use of \C in a UTF
mode or a backreference. mode or a backreference.
PCRE2_ERROR_DFA_UCOND PCRE2_ERROR_DFA_UCOND
This return is given if pcre2_dfa_match() encounters a condition item This return is given if pcre2_dfa_match() encounters a condition item
that uses a backreference for the condition, or a test for recursion in that uses a backreference for the condition, or a test for recursion in
a specific capture group. These are not supported. a specific capture group. These are not supported.
PCRE2_ERROR_DFA_UINVALID_UTF PCRE2_ERROR_DFA_UINVALID_UTF
This return is given if pcre2_dfa_match() is called for a pattern that This return is given if pcre2_dfa_match() is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
DFA matching. DFA matching.
PCRE2_ERROR_DFA_WSSIZE PCRE2_ERROR_DFA_WSSIZE
This return is given if pcre2_dfa_match() runs out of space in the This return is given if pcre2_dfa_match() runs out of space in the
workspace vector. workspace vector.
PCRE2_ERROR_DFA_RECURSE PCRE2_ERROR_DFA_RECURSE
When a recursion or subroutine call is processed, the matching function When a recursion or subroutine call is processed, the matching function
calls itself recursively, using private memory for the ovector and calls itself recursively, using private memory for the ovector and
workspace. This error is given if the internal ovector is not large workspace. This error is given if the internal ovector is not large
enough. This should be extremely rare, as a vector of size 1000 is enough. This should be extremely rare, as a vector of size 1000 is
used. used.
PCRE2_ERROR_DFA_BADRESTART PCRE2_ERROR_DFA_BADRESTART
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
some plausibility checks are made on the contents of the workspace, some plausibility checks are made on the contents of the workspace,
which should contain data about the previous partial match. If any of which should contain data about the previous partial match. If any of
these checks fail, this error is given. these checks fail, this error is given.
SEE ALSO SEE ALSO
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
@ -3745,8 +3754,8 @@ AUTHOR
REVISION REVISION
Last updated: 27 December 2019 Last updated: 22 January 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2_SUBSTITUTE 3 "05 January 2020" "PCRE2 10.35" .TH PCRE2_SUBSTITUTE 3 "22 January 2020" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -73,6 +73,7 @@ zero-terminated strings. The options are:
PCRE2_SUBSTITUTE_LITERAL The replacement string is literal PCRE2_SUBSTITUTE_LITERAL The replacement string is literal
PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for 1st match PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for 1st match
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY Return only replacement string(s)
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
.sp .sp

@ -1,4 +1,4 @@
.TH PCRE2API 3 "27 December 2019" "PCRE2 10.35" .TH PCRE2API 3 "22 January 2020" "PCRE2 10.35"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -3324,10 +3324,11 @@ same number causes an error at compile time.
This function optionally calls \fBpcre2_match()\fP and then makes a copy of the This function optionally calls \fBpcre2_match()\fP and then makes a copy of the
subject string in \fIoutputbuffer\fP, replacing parts that were matched with subject string in \fIoutputbuffer\fP, replacing parts that were matched with
the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This the \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. The default can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. There is an
is to perform just one replacement if the pattern matches, but there is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just the
option that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below replacement string(s). The default action is to perform just one replacement if
for details). the pattern matches, but there is an option that requests multiple replacements
(see PCRE2_SUBSTITUTE_GLOBAL below for details).
.P .P
If successful, \fBpcre2_substitute()\fP returns the number of substitutions If successful, \fBpcre2_substitute()\fP returns the number of substitutions
that were carried out. This may be zero if no match was found, and is never that were carried out. This may be zero if no match was found, and is never
@ -3362,10 +3363,21 @@ calling \fBpcre2_match()\fP from within \fBpcre2_substitute()\fP. This allows
an application to check for a match before choosing to substitute, without an application to check for a match before choosing to substitute, without
having to repeat the match. having to repeat the match.
.P .P
The \fIcode\fP argument is not used for the first substitution, but if The \fIcode\fP argument is not used for the first substitution when
PCRE2_SUBSTITUTE_GLOBAL is set, \fBpcre2_match()\fP will be called after the PCRE2_SUBSTITUTE_MATCHED is set, but if PCRE2_SUBSTITUTE_GLOBAL is also set,
first substitution to check for further matches, and the contents of the \fBpcre2_match()\fP will be called after the first substitution to check for
\fImatch_data\fP block will be changed. further matches, and the contents of the \fImatch_data\fP block will be
changed.
.P
The default is to return a copy of the subject string with matched substrings
replaced. However, if PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the
replacement substrings are returned. In the global case, multiple replacements
are concatenated in the output buffer. Substitution callouts (see
.\" HTML <a href="#subcallouts">
.\" </a>
below)
.\"
can be used to separate them if necessary.
.P .P
The \fIoutlengthptr\fP argument of \fBpcre2_substitute()\fP must point to a The \fIoutlengthptr\fP argument of \fBpcre2_substitute()\fP must point to a
variable that contains the length, in code units, of the output buffer. If the variable that contains the length, in code units, of the output buffer. If the
@ -3557,6 +3569,7 @@ above).
.\" .\"
. .
. .
.\" HTML <a name="subcallouts"></a>
.SS "Substitution callouts" .SS "Substitution callouts"
.rs .rs
.sp .sp
@ -3904,6 +3917,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 27 December 2019 Last updated: 22 January 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
.fi .fi
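
The output rule documented in the pcre2api change above (copy the whole subject by default, or only the replacement substrings when PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set) can be modelled in a few lines of plain C. This is a minimal sketch and not PCRE2 code: it assumes `strstr()` as a stand-in for `pcre2_match()` and copies the replacement literally, and `model_substitute` and its parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: strstr() stands in for pcre2_match(), and the replacement
   is copied literally. Returns the number of substitutions, or -1 if the
   output buffer is too small. */
static int model_substitute(const char *subject, const char *needle,
                            const char *replacement, int global,
                            int replacement_only, char *out, size_t outsize)
{
  size_t used = 0;
  size_t nlen = strlen(needle);
  size_t rlen = strlen(replacement);
  const char *pos = subject;
  int subs = 0;

  assert(nlen > 0);   /* an empty needle would never advance */

  for (;;)
    {
    const char *hit = strstr(pos, needle);
    if (hit == NULL) break;

    /* Text before the match is copied unless only replacements are wanted. */
    if (!replacement_only)
      {
      size_t frag = (size_t)(hit - pos);
      if (used + frag >= outsize) return -1;
      memcpy(out + used, pos, frag);
      used += frag;
      }

    /* The replacement itself is always copied. */
    if (used + rlen >= outsize) return -1;
    memcpy(out + used, replacement, rlen);
    used += rlen;

    subs++;
    pos = hit + nlen;
    if (!global) break;
    }

  /* The rest of the subject is copied unless not required. */
  if (!replacement_only)
    {
    size_t rest = strlen(pos);
    if (used + rest >= outsize) return -1;
    memcpy(out + used, pos, rest);
    used += rest;
    }

  if (used >= outsize) return -1;
  out[used] = '\0';   /* the output is always zero-terminated */
  return subs;
}
```

In this model, a global call with `replacement_only` set simply concatenates the replacements, which is why the documentation suggests substitution callouts when the pieces need to be separated.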


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "26 December 2019" "PCRE 10.35" .TH PCRE2TEST 1 "22 January 2020" "PCRE 10.35"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -1011,25 +1011,27 @@ modifier list, in which case they are applied to every subject line that is
processed with that pattern. These modifiers do not affect the compilation processed with that pattern. These modifiers do not affect the compilation
process. process.
.sp .sp
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allvector show the entire ovector allvector show the entire ovector
allusedtext show all consulted text allusedtext show all consulted text
altglobal alternative global matching altglobal alternative global matching
/g global global matching /g global global matching
jitstack=<n> set size of JIT stack jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_skip=<n> skip substitution number n substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_stop=<n> skip substitution number n and greater substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_skip=<n> skip substitution <n>
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_stop=<n> skip substitution <n> and following
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
.sp .sp
These modifiers may not appear in a \fB#pattern\fP command. If you want them as These modifiers may not appear in a \fB#pattern\fP command. If you want them as
defaults, set them in a \fB#subject\fP command. defaults, set them in a \fB#subject\fP command.
@ -1203,7 +1205,9 @@ pattern.
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_skip=<n> skip substitution number n substitute_skip=<n> skip substitution number n
substitute_stop=<n> skip substitution number n and greater substitute_stop=<n> skip substitution number n and greater
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
@ -1367,9 +1371,10 @@ by name.
.rs .rs
.sp .sp
If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
called instead of one of the matching functions. Note that replacement strings called instead of one of the matching functions (or after one call of
cannot contain commas, because a comma signifies the end of a modifier. This is \fBpcre2_match()\fP in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
not thought to be an issue in a test program. replacement strings cannot contain commas, because a comma signifies the end of
a modifier. This is not thought to be an issue in a test program.
.P .P
Unlike subject strings, \fBpcre2test\fP does not process replacement strings Unlike subject strings, \fBpcre2test\fP does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to see if it for escape sequences. In UTF mode, a replacement string is checked to see if it
@ -1384,10 +1389,17 @@ for \fBpcre2_substitute()\fP:
global PCRE2_SUBSTITUTE_GLOBAL global PCRE2_SUBSTITUTE_GLOBAL
substitute_extended PCRE2_SUBSTITUTE_EXTENDED substitute_extended PCRE2_SUBSTITUTE_EXTENDED
substitute_literal PCRE2_SUBSTITUTE_LITERAL substitute_literal PCRE2_SUBSTITUTE_LITERAL
substitute_matched PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
.sp .sp
See the
.\" HREF
\fBpcre2api\fP
.\"
documentation for details of these options.
.P .P
After a successful substitution, the modified string is output, preceded by the After a successful substitution, the modified string is output, preceded by the
number of replacements. This may be zero if there were no matches. Here is a number of replacements. This may be zero if there were no matches. Here is a
@ -2076,6 +2088,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 26 December 2019 Last updated: 22 January 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.
.fi .fi


@ -936,25 +936,27 @@ PATTERN MODIFIERS
ject line that is processed with that pattern. These modifiers do not ject line that is processed with that pattern. These modifiers do not
affect the compilation process. affect the compilation process.
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allvector show the entire ovector allvector show the entire ovector
allusedtext show all consulted text allusedtext show all consulted text
altglobal alternative global matching altglobal alternative global matching
/g global global matching /g global global matching
jitstack=<n> set size of JIT stack jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_skip=<n> skip substitution number n substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_stop=<n> skip substitution number n and greater substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_skip=<n> skip substitution <n>
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_stop=<n> skip substitution <n> and following
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
These modifiers may not appear in a #pattern command. If you want them These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command. as defaults, set them in a #subject command.
@ -1105,7 +1107,9 @@ SUBJECT MODIFIERS
substitute_callout use substitution callouts substitute_callout use substitution callouts
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
substitute_literal use PCRE2_SUBSTITUTE_LITERAL substitute_literal use PCRE2_SUBSTITUTE_LITERAL
substitute_matched use PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_skip=<n> skip substitution number n substitute_skip=<n> skip substitution number n
substitute_stop=<n> skip substitution number n and greater substitute_stop=<n> skip substitution number n and greater
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
@ -1251,9 +1255,11 @@ SUBJECT MODIFIERS
Testing the substitution function Testing the substitution function
If the replace modifier is set, the pcre2_substitute() function is If the replace modifier is set, the pcre2_substitute() function is
called instead of one of the matching functions. Note that replacement called instead of one of the matching functions (or after one call of
strings cannot contain commas, because a comma signifies the end of a pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re-
modifier. This is not thought to be an issue in a test program. placement strings cannot contain commas, because a comma signifies the
end of a modifier. This is not thought to be an issue in a test pro-
gram.
Unlike subject strings, pcre2test does not process replacement strings Unlike subject strings, pcre2test does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to for escape sequences. In UTF mode, a replacement string is checked to
@ -1268,10 +1274,13 @@ SUBJECT MODIFIERS
global PCRE2_SUBSTITUTE_GLOBAL global PCRE2_SUBSTITUTE_GLOBAL
substitute_extended PCRE2_SUBSTITUTE_EXTENDED substitute_extended PCRE2_SUBSTITUTE_EXTENDED
substitute_literal PCRE2_SUBSTITUTE_LITERAL substitute_literal PCRE2_SUBSTITUTE_LITERAL
substitute_matched PCRE2_SUBSTITUTE_MATCHED
substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
See the pcre2api documentation for details of these options.
After a successful substitution, the modified string is output, pre- After a successful substitution, the modified string is output, pre-
ceded by the number of replacements. This may be zero if there were no ceded by the number of replacements. This may be zero if there were no
@ -1905,5 +1914,5 @@ AUTHOR
REVISION REVISION
Last updated: 26 December 2019 Last updated: 22 January 2020
Copyright (c) 1997-2019 University of Cambridge. Copyright (c) 1997-2020 University of Cambridge.


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be /* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions. #included by applications that call PCRE2 functions.
Copyright (c) 2016-2019 University of Cambridge Copyright (c) 2016-2020 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -183,6 +183,7 @@ pcre2_jit_match() ignores the latter since it bypasses all sanity checks). */
#define PCRE2_COPY_MATCHED_SUBJECT 0x00004000u #define PCRE2_COPY_MATCHED_SUBJECT 0x00004000u
#define PCRE2_SUBSTITUTE_LITERAL 0x00008000u /* pcre2_substitute() only */ #define PCRE2_SUBSTITUTE_LITERAL 0x00008000u /* pcre2_substitute() only */
#define PCRE2_SUBSTITUTE_MATCHED 0x00010000u /* pcre2_substitute() only */ #define PCRE2_SUBSTITUTE_MATCHED 0x00010000u /* pcre2_substitute() only */
#define PCRE2_SUBSTITUTE_REPLACEMENT_ONLY 0x00020000u /* pcre2_substitute() only */
/* Options for pcre2_pattern_convert(). */ /* Options for pcre2_pattern_convert(). */
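
The new option takes the next free bit after PCRE2_SUBSTITUTE_MATCHED, so it can be OR'ed with the other options and tested independently. A small sketch of that idea follows. The numeric values are copied from the hunk above, but the `DEMO_` names and the two helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the pcre2.h hunk above; the DEMO_ names are local to
   this sketch and are not PCRE2 identifiers. */
#define DEMO_COPY_MATCHED_SUBJECT        0x00004000u
#define DEMO_SUBSTITUTE_LITERAL          0x00008000u
#define DEMO_SUBSTITUTE_MATCHED          0x00010000u
#define DEMO_SUBSTITUTE_REPLACEMENT_ONLY 0x00020000u

/* Returns 1 if the given flag is set in options, 0 otherwise. */
static int option_is_set(uint32_t options, uint32_t flag)
{
  assert(flag != 0);
  return (options & flag) != 0;
}

/* Returns 1 if value has exactly one bit set, 0 otherwise. */
static int is_single_bit(uint32_t value)
{
  return value != 0 && (value & (value - 1)) == 0;
}
```

For example, `option_is_set(DEMO_SUBSTITUTE_MATCHED | DEMO_SUBSTITUTE_REPLACEMENT_ONLY, DEMO_SUBSTITUTE_LITERAL)` is 0, because the three flags occupy different bits.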


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2016-2019 University of Cambridge New API code Copyright (c) 2016-2020 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -50,8 +50,8 @@ POSSIBILITY OF SUCH DAMAGE.
#define SUBSTITUTE_OPTIONS \ #define SUBSTITUTE_OPTIONS \
(PCRE2_SUBSTITUTE_EXTENDED|PCRE2_SUBSTITUTE_GLOBAL| \ (PCRE2_SUBSTITUTE_EXTENDED|PCRE2_SUBSTITUTE_GLOBAL| \
PCRE2_SUBSTITUTE_LITERAL|PCRE2_SUBSTITUTE_MATCHED| \ PCRE2_SUBSTITUTE_LITERAL|PCRE2_SUBSTITUTE_MATCHED| \
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH|PCRE2_SUBSTITUTE_UNKNOWN_UNSET| \ PCRE2_SUBSTITUTE_OVERFLOW_LENGTH|PCRE2_SUBSTITUTE_REPLACEMENT_ONLY| \
PCRE2_SUBSTITUTE_UNSET_EMPTY) PCRE2_SUBSTITUTE_UNKNOWN_UNSET|PCRE2_SUBSTITUTE_UNSET_EMPTY)
@ -195,6 +195,7 @@ overflow, either give an error immediately, or keep on, accumulating the
length. */ length. */
#define CHECKMEMCPY(from,length) \ #define CHECKMEMCPY(from,length) \
{ \
if (!overflowed && lengthleft < length) \ if (!overflowed && lengthleft < length) \
{ \ { \
if ((suboptions & PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) == 0) goto NOROOM; \ if ((suboptions & PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) == 0) goto NOROOM; \
@ -210,7 +211,8 @@ length. */
memcpy(buffer + buff_offset, from, CU2BYTES(length)); \ memcpy(buffer + buff_offset, from, CU2BYTES(length)); \
buff_offset += length; \ buff_offset += length; \
lengthleft -= length; \ lengthleft -= length; \
} } \
}
/* Here's the function */ /* Here's the function */
@ -231,6 +233,7 @@ BOOL match_data_created = FALSE;
BOOL escaped_literal = FALSE; BOOL escaped_literal = FALSE;
BOOL overflowed = FALSE; BOOL overflowed = FALSE;
BOOL use_existing_match; BOOL use_existing_match;
BOOL replacement_only;
#ifdef SUPPORT_UNICODE #ifdef SUPPORT_UNICODE
BOOL utf = (code->overall_options & PCRE2_UTF) != 0; BOOL utf = (code->overall_options & PCRE2_UTF) != 0;
#endif #endif
@ -256,10 +259,11 @@ PCRE2_UNSET, so as not to imply an offset in the replacement. */
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0) if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0)
return PCRE2_ERROR_BADOPTION; return PCRE2_ERROR_BADOPTION;
/* Check for using a match that has already happened. Note that the subject /* Check for using a match that has already happened. Note that the subject
pointer in the match data may be NULL after a no-match. */ pointer in the match data may be NULL after a no-match. */
use_existing_match = ((options & PCRE2_SUBSTITUTE_MATCHED) != 0); use_existing_match = ((options & PCRE2_SUBSTITUTE_MATCHED) != 0);
replacement_only = ((options & PCRE2_SUBSTITUTE_REPLACEMENT_ONLY) != 0);
if (use_existing_match) if (use_existing_match)
{ {
@ -312,7 +316,7 @@ if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
suboptions = options & SUBSTITUTE_OPTIONS; suboptions = options & SUBSTITUTE_OPTIONS;
options &= ~SUBSTITUTE_OPTIONS; options &= ~SUBSTITUTE_OPTIONS;
/* Copy up to the start offset */ /* Error if the start match offset is greater than the length of the subject. */
if (start_offset > length) if (start_offset > length)
{ {
@ -320,7 +324,10 @@ if (start_offset > length)
rc = PCRE2_ERROR_BADOFFSET; rc = PCRE2_ERROR_BADOFFSET;
goto EXIT; goto EXIT;
} }
CHECKMEMCPY(subject, start_offset);
/* Copy up to the start offset, unless only the replacement is required. */
if (!replacement_only) CHECKMEMCPY(subject, start_offset);
/* Loop for global substituting. If PCRE2_SUBSTITUTE_MATCHED is set, the first /* Loop for global substituting. If PCRE2_SUBSTITUTE_MATCHED is set, the first
match is taken from the match_data that was passed in. */ match is taken from the match_data that was passed in. */
@ -382,11 +389,11 @@ do
#endif #endif
} }
/* Copy what we have advanced past, reset the special global options, and /* Copy what we have advanced past (unless not required), reset the special
continue to the next match. */ global options, and continue to the next match. */
fraglength = start_offset - save_start; fraglength = start_offset - save_start;
CHECKMEMCPY(subject + save_start, fraglength); if (!replacement_only) CHECKMEMCPY(subject + save_start, fraglength);
goptions = 0; goptions = 0;
continue; continue;
} }
@ -430,12 +437,12 @@ do
} }
subs++; subs++;
/* Copy the text leading up to the match, and remember where the insert /* Copy the text leading up to the match (unless not required), and remember
begins and how many ovector pairs are set. */ where the insert begins and how many ovector pairs are set. */
if (rc == 0) rc = ovector_count; if (rc == 0) rc = ovector_count;
fraglength = ovector[0] - start_offset; fraglength = ovector[0] - start_offset;
CHECKMEMCPY(subject + start_offset, fraglength); if (!replacement_only) CHECKMEMCPY(subject + start_offset, fraglength);
scb.output_offsets[0] = buff_offset; scb.output_offsets[0] = buff_offset;
scb.oveccount = rc; scb.oveccount = rc;
@ -882,7 +889,7 @@ do
buff_offset -= newlength; buff_offset -= newlength;
lengthleft += newlength; lengthleft += newlength;
CHECKMEMCPY(subject + ovector[0], oldlength); if (!replacement_only) CHECKMEMCPY(subject + ovector[0], oldlength);
/* A negative return means do not do any more. */ /* A negative return means do not do any more. */
@ -903,12 +910,17 @@ do
start_offset = ovector[1]; start_offset = ovector[1];
} while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */ } while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */
/* Copy the rest of the subject. */ /* Copy the rest of the subject unless not required, and terminate the output
with a binary zero. */
if (!replacement_only)
{
fraglength = length - start_offset;
CHECKMEMCPY(subject + start_offset, fraglength);
}
fraglength = length - start_offset;
CHECKMEMCPY(subject + start_offset, fraglength);
temp[0] = 0; temp[0] = 0;
CHECKMEMCPY(temp , 1); CHECKMEMCPY(temp, 1);
/* If overflowed is set it means the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set, /* If overflowed is set it means the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set,
and matching has carried on after a full buffer, in order to compute the length and matching has carried on after a full buffer, in order to compute the length
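
The change above wraps the CHECKMEMCPY body in an extra pair of braces and then guards several calls with `if (!replacement_only)`. The sketch below illustrates the general reason for bracing a multi-statement macro before putting it behind an `if`. It is an assumption-laden illustration: the hunk shows only part of the real macro, so `APPEND` and `build` are hypothetical, simplified stand-ins with no overflow handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for CHECKMEMCPY: two statements wrapped in one brace
   pair so that the whole body is a single compound statement. It relies on
   local variables named buffer and buff_offset, as the real macro does. */
#define APPEND(from, length) \
  { \
  memcpy(buffer + buff_offset, from, length); \
  buff_offset += length; \
  }

/* Builds "<prefix><replacement>" or, when replacement_only is non-zero, just
   "<replacement>". Returns the number of characters written, excluding the
   terminating zero. The caller must supply a large enough buffer. */
static size_t build(char *buffer, const char *prefix, const char *replacement,
                    int replacement_only)
{
  size_t buff_offset = 0;

  assert(buffer != NULL);

  /* Because of the braces, the guard covers both statements in APPEND. */
  if (!replacement_only) APPEND(prefix, strlen(prefix));
  APPEND(replacement, strlen(replacement));

  buffer[buff_offset] = '\0';
  return buff_offset;
}
```

Without the braces in `APPEND`, only the `memcpy()` would be governed by the `if`, and `buff_offset` would advance even when the prefix is skipped. A `do { ... } while (0)` wrapper is a common alternative when the macro may be followed by `else`.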


@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
Written by Philip Hazel Written by Philip Hazel
Original code Copyright (c) 1997-2012 University of Cambridge Original code Copyright (c) 1997-2012 University of Cambridge
Rewritten code Copyright (c) 2016-2019 University of Cambridge Rewritten code Copyright (c) 2016-2020 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -505,12 +505,13 @@ so many of them that they are split into two fields. */
#define CTL2_SUBSTITUTE_LITERAL 0x00000004u #define CTL2_SUBSTITUTE_LITERAL 0x00000004u
#define CTL2_SUBSTITUTE_MATCHED 0x00000008u #define CTL2_SUBSTITUTE_MATCHED 0x00000008u
#define CTL2_SUBSTITUTE_OVERFLOW_LENGTH 0x00000010u #define CTL2_SUBSTITUTE_OVERFLOW_LENGTH 0x00000010u
#define CTL2_SUBSTITUTE_UNKNOWN_UNSET 0x00000020u #define CTL2_SUBSTITUTE_REPLACEMENT_ONLY 0x00000020u
#define CTL2_SUBSTITUTE_UNSET_EMPTY 0x00000040u #define CTL2_SUBSTITUTE_UNKNOWN_UNSET 0x00000040u
#define CTL2_SUBJECT_LITERAL 0x00000080u #define CTL2_SUBSTITUTE_UNSET_EMPTY 0x00000080u
#define CTL2_CALLOUT_NO_WHERE 0x00000100u #define CTL2_SUBJECT_LITERAL 0x00000100u
#define CTL2_CALLOUT_EXTRA 0x00000200u #define CTL2_CALLOUT_NO_WHERE 0x00000200u
#define CTL2_ALLVECTOR 0x00000400u #define CTL2_CALLOUT_EXTRA 0x00000400u
#define CTL2_ALLVECTOR 0x00000800u
#define CTL2_NL_SET 0x40000000u /* Informational */ #define CTL2_NL_SET 0x40000000u /* Informational */
#define CTL2_BSR_SET 0x80000000u /* Informational */ #define CTL2_BSR_SET 0x80000000u /* Informational */
@ -535,6 +536,7 @@ different things in the two cases. */
CTL2_SUBSTITUTE_LITERAL|\ CTL2_SUBSTITUTE_LITERAL|\
CTL2_SUBSTITUTE_MATCHED|\ CTL2_SUBSTITUTE_MATCHED|\
CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\ CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\
CTL2_SUBSTITUTE_REPLACEMENT_ONLY|\
CTL2_SUBSTITUTE_UNKNOWN_UNSET|\ CTL2_SUBSTITUTE_UNKNOWN_UNSET|\
CTL2_SUBSTITUTE_UNSET_EMPTY|\ CTL2_SUBSTITUTE_UNSET_EMPTY|\
CTL2_ALLVECTOR) CTL2_ALLVECTOR)
@ -614,129 +616,130 @@ typedef struct modstruct {
} modstruct; } modstruct;
static modstruct modlist[] = { static modstruct modlist[] = {
{ "aftertext", MOD_PNDP, MOD_CTL, CTL_AFTERTEXT, PO(control) }, { "aftertext", MOD_PNDP, MOD_CTL, CTL_AFTERTEXT, PO(control) },
{ "allaftertext", MOD_PNDP, MOD_CTL, CTL_ALLAFTERTEXT, PO(control) }, { "allaftertext", MOD_PNDP, MOD_CTL, CTL_ALLAFTERTEXT, PO(control) },
{ "allcaptures", MOD_PND, MOD_CTL, CTL_ALLCAPTURES, PO(control) }, { "allcaptures", MOD_PND, MOD_CTL, CTL_ALLCAPTURES, PO(control) },
{ "allow_empty_class", MOD_PAT, MOD_OPT, PCRE2_ALLOW_EMPTY_CLASS, PO(options) }, { "allow_empty_class", MOD_PAT, MOD_OPT, PCRE2_ALLOW_EMPTY_CLASS, PO(options) },
{ "allow_surrogate_escapes", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES, CO(extra_options) }, { "allow_surrogate_escapes", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES, CO(extra_options) },
{ "allusedtext", MOD_PNDP, MOD_CTL, CTL_ALLUSEDTEXT, PO(control) }, { "allusedtext", MOD_PNDP, MOD_CTL, CTL_ALLUSEDTEXT, PO(control) },
{ "allvector", MOD_PND, MOD_CTL, CTL2_ALLVECTOR, PO(control2) }, { "allvector", MOD_PND, MOD_CTL, CTL2_ALLVECTOR, PO(control2) },
{ "alt_bsux", MOD_PAT, MOD_OPT, PCRE2_ALT_BSUX, PO(options) }, { "alt_bsux", MOD_PAT, MOD_OPT, PCRE2_ALT_BSUX, PO(options) },
{ "alt_circumflex", MOD_PAT, MOD_OPT, PCRE2_ALT_CIRCUMFLEX, PO(options) }, { "alt_circumflex", MOD_PAT, MOD_OPT, PCRE2_ALT_CIRCUMFLEX, PO(options) },
{ "alt_verbnames", MOD_PAT, MOD_OPT, PCRE2_ALT_VERBNAMES, PO(options) }, { "alt_verbnames", MOD_PAT, MOD_OPT, PCRE2_ALT_VERBNAMES, PO(options) },
{ "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) }, { "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) },
{ "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) }, { "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) },
{ "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) }, { "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) },
{ "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) }, { "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
{ "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) }, { "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) },
{ "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) }, { "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) },
{ "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) }, { "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) },
{ "callout_data", MOD_DAT, MOD_INS, 0, DO(callout_data) }, { "callout_data", MOD_DAT, MOD_INS, 0, DO(callout_data) },
{ "callout_error", MOD_DAT, MOD_IN2, 0, DO(cerror) }, { "callout_error", MOD_DAT, MOD_IN2, 0, DO(cerror) },
{ "callout_extra", MOD_DAT, MOD_CTL, CTL2_CALLOUT_EXTRA, DO(control2) }, { "callout_extra", MOD_DAT, MOD_CTL, CTL2_CALLOUT_EXTRA, DO(control2) },
{ "callout_fail", MOD_DAT, MOD_IN2, 0, DO(cfail) }, { "callout_fail", MOD_DAT, MOD_IN2, 0, DO(cfail) },
{ "callout_info", MOD_PAT, MOD_CTL, CTL_CALLOUT_INFO, PO(control) }, { "callout_info", MOD_PAT, MOD_CTL, CTL_CALLOUT_INFO, PO(control) },
{ "callout_no_where", MOD_DAT, MOD_CTL, CTL2_CALLOUT_NO_WHERE, DO(control2) }, { "callout_no_where", MOD_DAT, MOD_CTL, CTL2_CALLOUT_NO_WHERE, DO(control2) },
{ "callout_none", MOD_DAT, MOD_CTL, CTL_CALLOUT_NONE, DO(control) }, { "callout_none", MOD_DAT, MOD_CTL, CTL_CALLOUT_NONE, DO(control) },
{ "caseless", MOD_PATP, MOD_OPT, PCRE2_CASELESS, PO(options) }, { "caseless", MOD_PATP, MOD_OPT, PCRE2_CASELESS, PO(options) },
{ "convert", MOD_PAT, MOD_CON, 0, PO(convert_type) }, { "convert", MOD_PAT, MOD_CON, 0, PO(convert_type) },
{ "convert_glob_escape", MOD_PAT, MOD_CHR, 0, PO(convert_glob_escape) }, { "convert_glob_escape", MOD_PAT, MOD_CHR, 0, PO(convert_glob_escape) },
{ "convert_glob_separator", MOD_PAT, MOD_CHR, 0, PO(convert_glob_separator) }, { "convert_glob_separator", MOD_PAT, MOD_CHR, 0, PO(convert_glob_separator) },
{ "convert_length", MOD_PAT, MOD_INT, 0, PO(convert_length) }, { "convert_length", MOD_PAT, MOD_INT, 0, PO(convert_length) },
{ "copy", MOD_DAT, MOD_NN, DO(copy_numbers), DO(copy_names) }, { "copy", MOD_DAT, MOD_NN, DO(copy_numbers), DO(copy_names) },
{ "copy_matched_subject", MOD_DAT, MOD_OPT, PCRE2_COPY_MATCHED_SUBJECT, DO(options) }, { "copy_matched_subject", MOD_DAT, MOD_OPT, PCRE2_COPY_MATCHED_SUBJECT, DO(options) },
{ "debug", MOD_PAT, MOD_CTL, CTL_DEBUG, PO(control) }, { "debug", MOD_PAT, MOD_CTL, CTL_DEBUG, PO(control) },
{ "depth_limit", MOD_CTM, MOD_INT, 0, MO(depth_limit) }, { "depth_limit", MOD_CTM, MOD_INT, 0, MO(depth_limit) },
{ "dfa", MOD_DAT, MOD_CTL, CTL_DFA, DO(control) }, { "dfa", MOD_DAT, MOD_CTL, CTL_DFA, DO(control) },
{ "dfa_restart", MOD_DAT, MOD_OPT, PCRE2_DFA_RESTART, DO(options) }, { "dfa_restart", MOD_DAT, MOD_OPT, PCRE2_DFA_RESTART, DO(options) },
{ "dfa_shortest", MOD_DAT, MOD_OPT, PCRE2_DFA_SHORTEST, DO(options) }, { "dfa_shortest", MOD_DAT, MOD_OPT, PCRE2_DFA_SHORTEST, DO(options) },
{ "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) }, { "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) },
{ "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) }, { "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) },
{ "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) }, { "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) },
{ "endanchored", MOD_PD, MOD_OPT, PCRE2_ENDANCHORED, PD(options) }, { "endanchored", MOD_PD, MOD_OPT, PCRE2_ENDANCHORED, PD(options) },
{ "escaped_cr_is_lf", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ESCAPED_CR_IS_LF, CO(extra_options) }, { "escaped_cr_is_lf", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ESCAPED_CR_IS_LF, CO(extra_options) },
{ "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) }, { "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) },
{ "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) }, { "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) },
{ "extended_more", MOD_PATP, MOD_OPT, PCRE2_EXTENDED_MORE, PO(options) }, { "extended_more", MOD_PATP, MOD_OPT, PCRE2_EXTENDED_MORE, PO(options) },
{ "extra_alt_bsux", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALT_BSUX, CO(extra_options) }, { "extra_alt_bsux", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ALT_BSUX, CO(extra_options) },
{ "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) }, { "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) },
{ "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) }, { "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) },
{ "framesize", MOD_PAT, MOD_CTL, CTL_FRAMESIZE, PO(control) }, { "framesize", MOD_PAT, MOD_CTL, CTL_FRAMESIZE, PO(control) },
{ "fullbincode", MOD_PAT, MOD_CTL, CTL_FULLBINCODE, PO(control) }, { "fullbincode", MOD_PAT, MOD_CTL, CTL_FULLBINCODE, PO(control) },
{ "get", MOD_DAT, MOD_NN, DO(get_numbers), DO(get_names) }, { "get", MOD_DAT, MOD_NN, DO(get_numbers), DO(get_names) },
{ "getall", MOD_DAT, MOD_CTL, CTL_GETALL, DO(control) }, { "getall", MOD_DAT, MOD_CTL, CTL_GETALL, DO(control) },
{ "global", MOD_PNDP, MOD_CTL, CTL_GLOBAL, PO(control) }, { "global", MOD_PNDP, MOD_CTL, CTL_GLOBAL, PO(control) },
{ "heap_limit", MOD_CTM, MOD_INT, 0, MO(heap_limit) }, { "heap_limit", MOD_CTM, MOD_INT, 0, MO(heap_limit) },
{ "hex", MOD_PAT, MOD_CTL, CTL_HEXPAT, PO(control) }, { "hex", MOD_PAT, MOD_CTL, CTL_HEXPAT, PO(control) },
{ "info", MOD_PAT, MOD_CTL, CTL_INFO, PO(control) }, { "info", MOD_PAT, MOD_CTL, CTL_INFO, PO(control) },
{ "jit", MOD_PAT, MOD_IND, 7, PO(jit) }, { "jit", MOD_PAT, MOD_IND, 7, PO(jit) },
{ "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) }, { "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) },
{ "jitstack", MOD_PNDP, MOD_INT, 0, PO(jitstack) }, { "jitstack", MOD_PNDP, MOD_INT, 0, PO(jitstack) },
{ "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) }, { "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) },
{ "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) }, { "literal", MOD_PAT, MOD_OPT, PCRE2_LITERAL, PO(options) },
{ "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) }, { "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) },
{ "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) }, { "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) },
{ "match_invalid_utf", MOD_PAT, MOD_OPT, PCRE2_MATCH_INVALID_UTF, PO(options) }, { "match_invalid_utf", MOD_PAT, MOD_OPT, PCRE2_MATCH_INVALID_UTF, PO(options) },
{ "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) }, { "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) },
{ "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) }, { "match_line", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_LINE, CO(extra_options) },
{ "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) }, { "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
{ "match_word", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_WORD, CO(extra_options) }, { "match_word", MOD_CTC, MOD_OPT, PCRE2_EXTRA_MATCH_WORD, CO(extra_options) },
{ "max_pattern_length", MOD_CTC, MOD_SIZ, 0, CO(max_pattern_length) }, { "max_pattern_length", MOD_CTC, MOD_SIZ, 0, CO(max_pattern_length) },
{ "memory", MOD_PD, MOD_CTL, CTL_MEMORY, PD(control) }, { "memory", MOD_PD, MOD_CTL, CTL_MEMORY, PD(control) },
{ "multiline", MOD_PATP, MOD_OPT, PCRE2_MULTILINE, PO(options) }, { "multiline", MOD_PATP, MOD_OPT, PCRE2_MULTILINE, PO(options) },
{ "never_backslash_c", MOD_PAT, MOD_OPT, PCRE2_NEVER_BACKSLASH_C, PO(options) }, { "never_backslash_c", MOD_PAT, MOD_OPT, PCRE2_NEVER_BACKSLASH_C, PO(options) },
{ "never_ucp", MOD_PAT, MOD_OPT, PCRE2_NEVER_UCP, PO(options) }, { "never_ucp", MOD_PAT, MOD_OPT, PCRE2_NEVER_UCP, PO(options) },
{ "never_utf", MOD_PAT, MOD_OPT, PCRE2_NEVER_UTF, PO(options) }, { "never_utf", MOD_PAT, MOD_OPT, PCRE2_NEVER_UTF, PO(options) },
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) }, { "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) }, { "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) }, { "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
{ "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) }, { "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
{ "no_jit", MOD_DAT, MOD_OPT, PCRE2_NO_JIT, DO(options) }, { "no_jit", MOD_DAT, MOD_OPT, PCRE2_NO_JIT, DO(options) },
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) }, { "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) }, { "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) }, { "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
{ "notempty", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY, DO(options) }, { "notempty", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY, DO(options) },
{ "notempty_atstart", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY_ATSTART, DO(options) }, { "notempty_atstart", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY_ATSTART, DO(options) },
{ "noteol", MOD_DAT, MOD_OPT, PCRE2_NOTEOL, DO(options) }, { "noteol", MOD_DAT, MOD_OPT, PCRE2_NOTEOL, DO(options) },
{ "null_context", MOD_PD, MOD_CTL, CTL_NULLCONTEXT, PO(control) }, { "null_context", MOD_PD, MOD_CTL, CTL_NULLCONTEXT, PO(control) },
{ "offset", MOD_DAT, MOD_INT, 0, DO(offset) }, { "offset", MOD_DAT, MOD_INT, 0, DO(offset) },
{ "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)}, { "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)},
{ "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) }, { "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) },
{ "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) }, { "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) },
{ "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, { "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) },
{ "partial_soft", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, { "partial_soft", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "ph", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, { "ph", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) },
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) }, { "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
{ "posix_nosub", MOD_PAT, MOD_CTL, CTL_POSIX|CTL_POSIX_NOSUB, PO(control) }, { "posix_nosub", MOD_PAT, MOD_CTL, CTL_POSIX|CTL_POSIX_NOSUB, PO(control) },
{ "posix_startend", MOD_DAT, MOD_IN2, 0, DO(startend) }, { "posix_startend", MOD_DAT, MOD_IN2, 0, DO(startend) },
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "push", MOD_PAT, MOD_CTL, CTL_PUSH, PO(control) }, { "push", MOD_PAT, MOD_CTL, CTL_PUSH, PO(control) },
{ "pushcopy", MOD_PAT, MOD_CTL, CTL_PUSHCOPY, PO(control) }, { "pushcopy", MOD_PAT, MOD_CTL, CTL_PUSHCOPY, PO(control) },
{ "pushtablescopy", MOD_PAT, MOD_CTL, CTL_PUSHTABLESCOPY, PO(control) }, { "pushtablescopy", MOD_PAT, MOD_CTL, CTL_PUSHTABLESCOPY, PO(control) },
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(depth_limit) }, /* Obsolete synonym */ { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(depth_limit) }, /* Obsolete synonym */
{ "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) }, { "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) },
{ "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) }, { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) }, { "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) }, { "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) },
{ "subject_literal", MOD_PATP, MOD_CTL, CTL2_SUBJECT_LITERAL, PO(control2) }, { "subject_literal", MOD_PATP, MOD_CTL, CTL2_SUBJECT_LITERAL, PO(control2) },
{ "substitute_callout", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_CALLOUT, PO(control2) }, { "substitute_callout", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_CALLOUT, PO(control2) },
{ "substitute_extended", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_EXTENDED, PO(control2) }, { "substitute_extended", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_EXTENDED, PO(control2) },
{ "substitute_literal", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_LITERAL, PO(control2) }, { "substitute_literal", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_LITERAL, PO(control2) },
{ "substitute_matched", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_MATCHED, PO(control2) }, { "substitute_matched", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_MATCHED, PO(control2) },
{ "substitute_overflow_length", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_OVERFLOW_LENGTH, PO(control2) }, { "substitute_overflow_length", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_OVERFLOW_LENGTH, PO(control2) },
{ "substitute_replacement_only", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_REPLACEMENT_ONLY, PO(control2) },
{ "substitute_skip", MOD_PND, MOD_INT, 0, PO(substitute_skip) }, { "substitute_skip", MOD_PND, MOD_INT, 0, PO(substitute_skip) },
{ "substitute_stop", MOD_PND, MOD_INT, 0, PO(substitute_stop) }, { "substitute_stop", MOD_PND, MOD_INT, 0, PO(substitute_stop) },
{ "substitute_unknown_unset", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNKNOWN_UNSET, PO(control2) }, { "substitute_unknown_unset", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNKNOWN_UNSET, PO(control2) },
{ "substitute_unset_empty", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNSET_EMPTY, PO(control2) }, { "substitute_unset_empty", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNSET_EMPTY, PO(control2) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
{ "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) }, { "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) },
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) }, { "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
{ "use_length", MOD_PAT, MOD_CTL, CTL_USE_LENGTH, PO(control) }, { "use_length", MOD_PAT, MOD_CTL, CTL_USE_LENGTH, PO(control) },
{ "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) }, { "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) },
{ "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) }, { "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) },
{ "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) }, { "utf8_input", MOD_PAT, MOD_CTL, CTL_UTF8_INPUT, PO(control) },
{ "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) } { "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) }
}; };
#define MODLISTCOUNT sizeof(modlist)/sizeof(modstruct) #define MODLISTCOUNT sizeof(modlist)/sizeof(modstruct)
@ -4091,7 +4094,7 @@ Returns: nothing
static void static void
show_controls(uint32_t controls, uint32_t controls2, const char *before) show_controls(uint32_t controls, uint32_t controls2, const char *before)
{ {
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before, before,
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "", ((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "", ((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@ -4132,6 +4135,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
((controls2 & CTL2_SUBSTITUTE_LITERAL) != 0)? " substitute_literal" : "", ((controls2 & CTL2_SUBSTITUTE_LITERAL) != 0)? " substitute_literal" : "",
((controls2 & CTL2_SUBSTITUTE_MATCHED) != 0)? " substitute_matched" : "", ((controls2 & CTL2_SUBSTITUTE_MATCHED) != 0)? " substitute_matched" : "",
((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "", ((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "",
((controls2 & CTL2_SUBSTITUTE_REPLACEMENT_ONLY) != 0)? " substitute_replacement_only" : "",
((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "", ((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "",
((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "", ((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
((controls & CTL_USE_LENGTH) != 0)? " use_length" : "", ((controls & CTL_USE_LENGTH) != 0)? " use_length" : "",
@ -7257,17 +7261,17 @@ if (dat_datctl.replacement[0] != 0)
if (timeitm) if (timeitm)
fprintf(outfile, "** Timing is not supported with replace: ignored\n"); fprintf(outfile, "** Timing is not supported with replace: ignored\n");
if ((dat_datctl.control & CTL_ALTGLOBAL) != 0) if ((dat_datctl.control & CTL_ALTGLOBAL) != 0)
fprintf(outfile, "** Altglobal is not supported with replace: ignored\n"); fprintf(outfile, "** Altglobal is not supported with replace: ignored\n");
/* Check for a test that does substitution after an initial external match. /* Check for a test that does substitution after an initial external match.
If this is set, we run the external match, but leave the interpretation of If this is set, we run the external match, but leave the interpretation of
its output to pcre2_substitute(). */ its output to pcre2_substitute(). */
emoption = ((dat_datctl.control2 & CTL2_SUBSTITUTE_MATCHED) == 0)? 0 : emoption = ((dat_datctl.control2 & CTL2_SUBSTITUTE_MATCHED) == 0)? 0 :
PCRE2_SUBSTITUTE_MATCHED; PCRE2_SUBSTITUTE_MATCHED;
if (emoption != 0) if (emoption != 0)
{ {
PCRE2_MATCH(rc, compiled_code, pp, arg_ulen, dat_datctl.offset, PCRE2_MATCH(rc, compiled_code, pp, arg_ulen, dat_datctl.offset,
@ -7283,6 +7287,8 @@ if (dat_datctl.replacement[0] != 0)
PCRE2_SUBSTITUTE_LITERAL) | PCRE2_SUBSTITUTE_LITERAL) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) == 0)? 0 : (((dat_datctl.control2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) == 0)? 0 :
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) | PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_REPLACEMENT_ONLY) == 0)? 0 :
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) == 0)? 0 : (((dat_datctl.control2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) == 0)? 0 :
PCRE2_SUBSTITUTE_UNKNOWN_UNSET) | PCRE2_SUBSTITUTE_UNKNOWN_UNSET) |
(((dat_datctl.control2 & CTL2_SUBSTITUTE_UNSET_EMPTY) == 0)? 0 : (((dat_datctl.control2 & CTL2_SUBSTITUTE_UNSET_EMPTY) == 0)? 0 :

testdata/testinput2 (vendored; 13 lines changed)

@ -5793,4 +5793,17 @@ a)"xI
/^((\1+)(?C)|\d)+133X$/ /^((\1+)(?C)|\d)+133X$/
111133X\=callout_capture 111133X\=callout_capture
/abc/replace=xyz,substitute_replacement_only
123abc456
/a(?<ONE>b)c(?<TWO>d)e/g,replace=X$ONE+${TWO}Z,substitute_replacement_only
"abcde-abcde-"
/a(b)c|xyz/g,replace=<$0>,substitute_callout,substitute_replacement_only
abcdefabcpqr
abxyzpqrabcxyz
12abc34xyz99abc55\=substitute_stop=2
12abc34xyz99abc55\=substitute_skip=1
12abc34xyz99abc55\=substitute_skip=2
# End of testinput2 # End of testinput2

testdata/testoutput2 (vendored; 33 lines changed)

@ -17503,6 +17503,39 @@ Callout 0: last capture = 2
1: 11 1: 11
2: 11 2: 11
/abc/replace=xyz,substitute_replacement_only
123abc456
1: xyz
/a(?<ONE>b)c(?<TWO>d)e/g,replace=X$ONE+${TWO}Z,substitute_replacement_only
"abcde-abcde-"
2: Xb+dZXb+dZ
/a(b)c|xyz/g,replace=<$0>,substitute_callout,substitute_replacement_only
abcdefabcpqr
1(2) Old 0 3 "abc" New 0 5 "<abc>"
2(2) Old 6 9 "abc" New 5 10 "<abc>"
2: <abc><abc>
abxyzpqrabcxyz
1(1) Old 2 5 "xyz" New 0 5 "<xyz>"
2(2) Old 8 11 "abc" New 5 10 "<abc>"
3(1) Old 11 14 "xyz" New 10 15 "<xyz>"
3: <xyz><abc><xyz>
12abc34xyz99abc55\=substitute_stop=2
1(2) Old 2 5 "abc" New 0 5 "<abc>"
2(1) Old 7 10 "xyz" New 5 10 "<xyz> STOPPED"
2: <abc>
12abc34xyz99abc55\=substitute_skip=1
1(2) Old 2 5 "abc" New 0 5 "<abc> SKIPPED"
2(1) Old 7 10 "xyz" New 0 5 "<xyz>"
3(2) Old 12 15 "abc" New 5 10 "<abc>"
3: <xyz><abc>
12abc34xyz99abc55\=substitute_skip=2
1(2) Old 2 5 "abc" New 0 5 "<abc>"
2(1) Old 7 10 "xyz" New 5 10 "<xyz> SKIPPED"
3(2) Old 12 15 "abc" New 5 10 "<abc>"
3: <abc><abc>
# End of testinput2 # End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data Error -62: bad serialized data