diff --git a/ChangeLog b/ChangeLog index e27b96b..c650e8a 100644 --- a/ChangeLog +++ b/ChangeLog @@ -386,6 +386,9 @@ possible to test it. 111. "Harden" pcre2test against ridiculously large values in modifiers and command line arguments. +112. Implemented PCRE2_SUBSTITUTE_UNKNOWN_UNSET and PCRE2_SUBSTITUTE_OVERFLOW_ +LENGTH. + Version 10.20 30-June-2015 -------------------------- diff --git a/doc/pcre2_substitute.3 b/doc/pcre2_substitute.3 index 55d58d2..74b5849 100644 --- a/doc/pcre2_substitute.3 +++ b/doc/pcre2_substitute.3 @@ -1,4 +1,4 @@ -.TH PCRE2_SUBSTITUTE 3 "04 December 2015" "PCRE2 10.21" +.TH PCRE2_SUBSTITUTE 3 "12 December 2015" "PCRE2 10.21" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -58,6 +58,8 @@ The options are: PCRE2_UTF was set at compile time) PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject + PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length + PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string .sp The function returns the number of substitutions, which may be zero if there diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index 2ec6f67..f1de15d 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "04 December 2015" "PCRE2 10.21" +.TH PCRE2API 3 "12 December 2015" "PCRE2 10.21" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -2704,12 +2704,20 @@ functions from the match context, if provided, or else those that were used to allocate memory for the compiled code. .P The \fIoutlengthptr\fP argument must point to a variable that contains the -length, in code units, of the output buffer. If the function is successful, -the value is updated to contain the length of the new string, excluding the -trailing zero that is automatically added. 
If the function is not successful, -the value is set to PCRE2_UNSET for general errors (such as output buffer too -small). For syntax errors in the replacement string, the value is set to the -offset in the replacement string where the error was detected. +length, in code units, of the output buffer. If the function is successful, the +value is updated to contain the length of the new string, excluding the +trailing zero that is automatically added. +.P +If the function is not successful, the value set via \fIoutlengthptr\fP depends +on the type of error. For syntax errors in the replacement string, the value is +the offset in the replacement string where the error was detected. For other +errors, the value is PCRE2_UNSET by default. This includes the case of the +output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set +(see below), in which case the value is the minimum length needed, including +space for the trailing zero. Note that in order to compute the required length, +\fBpcre2_substitute()\fP has to simulate all the matching and copying, instead +of giving an error return as soon as the buffer overflows. Note also that the +length is in code units, not bytes. .P In the replacement string, which is interpreted as a UTF string in UTF mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a @@ -2734,7 +2742,8 @@ simultaneous substitutions, as this \fBpcre2test\fP example shows: apple lemon 2: pear orange .sp -Three additional options are available: +As well as the usual options for \fBpcre2_match()\fP, a number of additional +options can be set in the \fIoptions\fP argument. .P PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string, replacing every matching substring. If this is not set, only the first matching @@ -2745,10 +2754,30 @@ advanced by one character except when CRLF is a valid newline sequence and the next two characters are CR, LF. 
In this case, the current position is advanced by two characters. .P -PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups to be treated as -empty strings when inserted as described above. If this option is not set, an -attempt to insert an unset group causes the PCRE2_ERROR_UNSET error. This -option does not influence the extended substitution syntax described below. +PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is +too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If +this option is set, however, \fBpcre2_substitute()\fP continues to go through +the motions of matching and substituting (without, of course, writing anything) +in order to compute the size of buffer that is needed. This value is passed +back via the \fIoutlengthptr\fP variable, with the result of the function still +being PCRE2_ERROR_NOMEMORY. +.P +Passing a buffer size of zero is a permitted way of finding out how much memory +is needed for a given substitution. However, this does mean that the entire +operation is carried out twice. Depending on the application, it may be more +efficient to allocate a large buffer and free the excess afterwards, instead of +using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH. +.P +PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do +not appear in the pattern to be treated as unset groups. This option should be +used with care, because it means that a typo in a group name or number no +longer causes the PCRE2_ERROR_NOSUBSTRING error. +.P +PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown +groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty +strings when inserted as described above. If this option is not set, an attempt +to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does +not influence the extended substitution syntax described below. 
.P PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the replacement string. Without this option, only the dollar character is special, @@ -2800,26 +2829,38 @@ string remains in force afterwards, as shown in this \fBpcre2test\fP example: 1: HELLO .sp The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended -substitutions. +substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown +groups in the extended syntax forms to be treated as unset. .P -If successful, the function returns the number of replacements that were made. -This may be zero if no matches were found, and is never greater than 1 unless -PCRE2_SUBSTITUTE_GLOBAL is set. +If successful, \fBpcre2_substitute()\fP returns the number of replacements that +were made. This may be zero if no matches were found, and is never greater than +1 unless PCRE2_SUBSTITUTE_GLOBAL is set. .P In the event of an error, a negative error code is returned. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors from \fBpcre2_match()\fP -are passed straight back. PCRE2_ERROR_NOSUBSTRING is returned for a -non-existent substring insertion, and PCRE2_ERROR_UNSET is returned for an -unset substring insertion when the simple (non-extended) syntax is used and -PCRE2_SUBSTITUTE_UNSET_EMPTY is not set. PCRE2_ERROR_NOMEMORY is returned if -the output buffer is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for -miscellaneous syntax errors in the replacement string, with more particular -errors being PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), -PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found), -PCRE2_BADSUBSTITUTION (syntax error in extended group substitution), and -PCRE2_BADSUBPATTERN (the pattern match ended before it started). As for all -PCRE2 errors, a text message that describes the error can be obtained by -calling \fBpcre2_get_error_message()\fP. +are passed straight back. 
+.P +PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion, +unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. +.P +PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an +unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple +(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set. +.P +PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the +PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is +needed is returned via \fIoutlengthptr\fP. Note that this does not happen by +default. +.P +PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the +replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE +(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket +not found), PCRE2_BADSUBSTITUTION (syntax error in extended group +substitution), and PCRE2_BADSUBPATTERN (the pattern match ended before it +started, which can happen if \eK is used in an assertion). +.P +As for all PCRE2 errors, a text message that describes the error can be +obtained by calling \fBpcre2_get_error_message()\fP. . . .SH "DUPLICATE SUBPATTERN NAMES" @@ -3113,6 +3154,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 04 December 2015 +Last updated: 21 December 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 2315ddb..1255d29 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -1,4 +1,4 @@ -.TH PCRE2TEST 1 "04 December 2015" "PCRE 10.21" +.TH PCRE2TEST 1 "12 December 2015" "PCRE 10.21" .SH NAME pcre2test - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS @@ -854,16 +854,18 @@ are applied to every subject line that is processed with that pattern. They may not appear in \fB#pattern\fP commands. These modifiers do not affect the compilation process. 
.sp - aftertext show text after match - allaftertext show text after captures - allcaptures show all captures - allusedtext show all consulted text - /g global global matching - mark show mark values - replace= specify a replacement string - startchar show starting character when relevant - substitute_extended use PCRE2_SUBSTITUTE_EXTENDED - substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text + /g global global matching + mark show mark values + replace= specify a replacement string + startchar show starting character when relevant + substitute_extended use PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY .sp These modifiers may not appear in a \fB#pattern\fP command. If you want them as defaults, set them in a \fB#subject\fP command. @@ -935,36 +937,38 @@ information. Some of them may also be specified on a pattern line (see above), in which case they apply to every subject line that is matched against that pattern. 
.sp - aftertext show text after match - allaftertext show text after captures - allcaptures show all captures - allusedtext show all consulted text (non-JIT only) - altglobal alternative global matching - callout_capture show captures at callout time - callout_data= set a value to pass via callouts - callout_fail=[:] control callout failure - callout_none do not supply a callout function - copy= copy captured substring - dfa use \fBpcre2_dfa_match()\fP - find_limits find match and recursion limits - get= extract captured substring - getall extract all captured substrings - /g global global matching - jitstack= set size of JIT stack - mark show mark values - match_limit= set a match limit - memory show memory usage - null_context match with a NULL context - offset= set starting offset - offset_limit= set offset limit - ovector= set size of output vector - recursion_limit= set a recursion limit - replace= specify a replacement string - startchar show startchar when relevant - startoffset= same as offset= - substitute_extedded use PCRE2_SUBSTITUTE_EXTENDED - substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY - zero_terminate pass the subject as zero-terminated + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text (non-JIT only) + altglobal alternative global matching + callout_capture show captures at callout time + callout_data= set a value to pass via callouts + callout_fail=[:] control callout failure + callout_none do not supply a callout function + copy= copy captured substring + dfa use \fBpcre2_dfa_match()\fP + find_limits find match and recursion limits + get= extract captured substring + getall extract all captured substrings + /g global global matching + jitstack= set size of JIT stack + mark show mark values + match_limit= set a match limit + memory show memory usage + null_context match with a NULL context + offset= set starting offset + offset_limit= set offset 
limit + ovector= set size of output vector + recursion_limit= set a recursion limit + replace= specify a replacement string + startchar show startchar when relevant + startoffset= same as offset= + substitute_extended use PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY + zero_terminate pass the subject as zero-terminated .sp The effects of these modifiers are described in the following sections. . @@ -1107,10 +1111,15 @@ the appropriate code unit width. If it is not a valid UTF-8 string, the individual code units are copied directly. This provides a means of passing an invalid UTF-8 string for testing purposes. .P -If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to -\fBpcre2_substitute()\fP. The \fBsubstitute_extended\fP and -\fBsubstitute_unset_empty\fP modifiers set PCRE2_SUBSTITUTE_EXTENDED and -PCRE2_SUBSTITUTE_UNSET_EMPTY, respectively. +The following modifiers set options (in addition to the normal match options) +for \fBpcre2_substitute()\fP: +.sp + global PCRE2_SUBSTITUTE_GLOBAL + substitute_extended PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY +.sp .P After a successful substitution, the modified string is output, preceded by the number of replacements. This may be zero if there were no matches. Here is a @@ -1135,6 +1144,19 @@ character. Here is an example that tests the edge case: 123abc123\e=replace=[9]XYZ Failed: error -47: no more memory .sp +The default action of \fBpcre2_substitute()\fP is to return +PCRE2_ERROR_NOMEMORY when the output buffer is too small. 
However, if the +PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the +\fBsubstitute_overflow_length\fP modifier), \fBpcre2_substitute()\fP continues +to go through the motions of matching and substituting, in order to compute the +size of buffer that is required. When this happens, \fBpcre2test\fP shows the +required buffer length (which includes space for the trailing zero) as part of +the error message. For example: +.sp + /abc/substitute_overflow_length + 123abc123\e=replace=[9]XYZ + Failed: error -47: no more memory: 10 code units are needed +.sp A replacement string is ignored with POSIX and DFA matching. Specifying partial matching provokes an error return ("bad option value") from \fBpcre2_substitute()\fP. @@ -1618,6 +1640,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 04 December 2015 +Last updated: 12 December 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/src/pcre2.h b/src/pcre2.h index 23460e1..3f8f8e0 100644 --- a/src/pcre2.h +++ b/src/pcre2.h @@ -148,9 +148,11 @@ sanity checks). */ /* These are additional options for pcre2_substitute(). */ -#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u -#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u -#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u +#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u +#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u +#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u +#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u +#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u /* Newline and \R settings, for use in compile contexts. The newline values must be kept in step with values set in config.h and both sets must all be diff --git a/src/pcre2.h.in b/src/pcre2.h.in index c70f765..c11e130 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -148,9 +148,11 @@ sanity checks). */ /* These are additional options for pcre2_substitute(). 
*/ -#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u -#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u -#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u +#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u +#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u +#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u +#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u +#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u /* Newline and \R settings, for use in compile contexts. The newline values must be kept in step with values set in config.h and both sets must all be diff --git a/src/pcre2_substitute.c b/src/pcre2_substitute.c index 76dd37b..f3517cc 100644 --- a/src/pcre2_substitute.c +++ b/src/pcre2_substitute.c @@ -47,6 +47,12 @@ POSSIBILITY OF SUCH DAMAGE. #define PTR_STACK_SIZE 20 +#define SUBSTITUTE_OPTIONS \ + (PCRE2_SUBSTITUTE_EXTENDED|PCRE2_SUBSTITUTE_GLOBAL| \ + PCRE2_SUBSTITUTE_OVERFLOW_LENGTH|PCRE2_SUBSTITUTE_UNKNOWN_UNSET| \ + PCRE2_SUBSTITUTE_UNSET_EMPTY) + + /************************************************* * Find end of substitute text * @@ -181,6 +187,30 @@ Returns: >= 0 number of substitutions made PCRE2_ERROR_BADREPLACEMENT means invalid use of $ */ +/* This macro checks for space in the buffer before copying into it. On +overflow, either give an error immediately, or keep on, accumulating the +length. 
*/ + +#define CHECKMEMCPY(from,length) \ + if (!overflowed && lengthleft < length) \ + { \ + if ((suboptions & PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) == 0) goto NOROOM; \ + overflowed = TRUE; \ + extra_needed = length - lengthleft; \ + } \ + else if (overflowed) \ + { \ + extra_needed += length; \ + } \ + else \ + { \ + memcpy(buffer + buff_offset, from, CU2BYTES(length)); \ + buff_offset += length; \ + lengthleft -= length; \ + } + +/* Here's the function */ + PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length, PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data, @@ -193,20 +223,22 @@ int forcecase = 0; int forcecasereset = 0; uint32_t ovector_count; uint32_t goptions = 0; +uint32_t suboptions; BOOL match_data_created = FALSE; -BOOL global = FALSE; -BOOL extended = FALSE; BOOL literal = FALSE; -BOOL uempty = FALSE; /* Unset/unknown groups => empty string */ +BOOL overflowed = FALSE; #ifdef SUPPORT_UNICODE BOOL utf = (code->overall_options & PCRE2_UTF) != 0; #endif +PCRE2_UCHAR temp[6]; PCRE2_SPTR ptr; PCRE2_SPTR repend; +PCRE2_SIZE extra_needed = 0; PCRE2_SIZE buff_offset, buff_length, lengthleft, fraglength; PCRE2_SIZE *ovector; -buff_length = *blength; +buff_offset = 0; +lengthleft = buff_length = *blength; *blength = PCRE2_UNSET; /* Partial matching is not valid. */ @@ -248,33 +280,14 @@ if (utf && (options & PCRE2_NO_UTF_CHECK) == 0) } #endif /* SUPPORT_UNICODE */ -/* Notice the global and extended options and remove them from the options that -are passed to pcre2_match(). */ +/* Save the substitute options and remove them from the match options. 
*/ -if ((options & PCRE2_SUBSTITUTE_GLOBAL) != 0) - { - options &= ~PCRE2_SUBSTITUTE_GLOBAL; - global = TRUE; - } - -if ((options & PCRE2_SUBSTITUTE_EXTENDED) != 0) - { - options &= ~PCRE2_SUBSTITUTE_EXTENDED; - extended = TRUE; - } - -if ((options & PCRE2_SUBSTITUTE_UNSET_EMPTY) != 0) - { - options &= ~PCRE2_SUBSTITUTE_UNSET_EMPTY; - uempty = TRUE; - } +suboptions = options & SUBSTITUTE_OPTIONS; +options &= ~SUBSTITUTE_OPTIONS; /* Copy up to the start offset */ -if (start_offset > buff_length) goto NOROOM; -memcpy(buffer, subject, start_offset * (PCRE2_CODE_UNIT_WIDTH/8)); -buff_offset = start_offset; -lengthleft = buff_length - start_offset; +CHECKMEMCPY(subject, start_offset); /* Loop for global substituting. */ @@ -330,13 +343,11 @@ do #endif } - fraglength = start_offset - save_start; - if (lengthleft < fraglength) goto NOROOM; - memcpy(buffer + buff_offset, subject + save_start, - fraglength*(PCRE2_CODE_UNIT_WIDTH/8)); - buff_offset += fraglength; - lengthleft -= fraglength; + /* Copy what we have advanced past, reset the special global options, and + continue to the next match. */ + fraglength = start_offset - save_start; + CHECKMEMCPY(subject + save_start, fraglength); goptions = 0; continue; } @@ -350,25 +361,21 @@ do goto EXIT; } - /* Paranoid check for integer overflow; surely no real call to this function - would ever hit this! */ + /* Count substitutions with a paranoid check for integer overflow; surely no + real call to this function would ever hit this! */ if (subs == INT_MAX) { rc = PCRE2_ERROR_TOOMANYREPLACE; goto EXIT; } - - /* Count substitutions and proceed */ - subs++; + + /* Copy the text leading up to the match. 
*/ + if (rc == 0) rc = ovector_count; fraglength = ovector[0] - start_offset; - if (fraglength >= lengthleft) goto NOROOM; - memcpy(buffer + buff_offset, subject + start_offset, - fraglength*(PCRE2_CODE_UNIT_WIDTH/8)); - buff_offset += fraglength; - lengthleft -= fraglength; + CHECKMEMCPY(subject + start_offset, fraglength); /* Process the replacement string. Literal mode is set by \Q, but only in extended mode when backslashes are being interpreted. In extended mode we @@ -378,12 +385,13 @@ do for (;;) { uint32_t ch; + unsigned int chlen; /* If at the end of a nested substring, pop the stack. */ if (ptr >= repend) { - if (ptrstackptr <= 0) break; + if (ptrstackptr <= 0) break; /* End of replacement string */ repend = ptrstack[--ptrstackptr]; ptr = ptrstack[--ptrstackptr]; continue; @@ -450,12 +458,22 @@ do group = group * 10 + next - CHAR_0; /* A check for a number greater than the hightest captured group - is sufficient here; no need for a separate overflow check. */ + is sufficient here; no need for a separate overflow check. If unknown + groups are to be treated as unset, just skip over any remaining + digits and carry on. */ if (group > code->top_bracket) { - rc = PCRE2_ERROR_NOSUBSTRING; - goto PTREXIT; + if ((suboptions & PCRE2_SUBSTITUTE_UNKNOWN_UNSET) != 0) + { + while (++ptr < repend && *ptr >= CHAR_0 && *ptr <= CHAR_9); + break; + } + else + { + rc = PCRE2_ERROR_NOSUBSTRING; + goto PTREXIT; + } } } } @@ -478,7 +496,8 @@ do if (inparens) { - if (extended && !star && ptr < repend - 2 && next == CHAR_COLON) + if ((suboptions & PCRE2_SUBSTITUTE_EXTENDED) != 0 && + !star && ptr < repend - 2 && next == CHAR_COLON) { special = *(++ptr); if (special != CHAR_PLUS && special != CHAR_MINUS) @@ -513,8 +532,8 @@ do ptr++; } - /* Have found a syntactically correct group number or name, or - *name. Only *MARK is currently recognized. */ + /* Have found a syntactically correct group number or name, or *name. + Only *MARK is currently recognized. 
*/ if (star) { @@ -523,11 +542,10 @@ do PCRE2_SPTR mark = pcre2_get_mark(match_data); if (mark != NULL) { - while (*mark != 0) - { - if (lengthleft-- < 1) goto NOROOM; - buffer[buff_offset++] = *mark++; - } + PCRE2_SPTR mark_start = mark; + while (*mark != 0) mark++; + fraglength = mark - mark_start; + CHECKMEMCPY(mark_start, fraglength); } } else goto BAD; @@ -541,31 +559,41 @@ do PCRE2_SPTR subptr, subptrend; /* Find a number for a named group. In case there are duplicate names, - search for the first one that is set. */ + search for the first one that is set. If the name is not found when + PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set, set the group number to a + non-existent group. */ if (group < 0) { PCRE2_SPTR first, last, entry; rc = pcre2_substring_nametable_scan(code, name, &first, &last); - if (rc < 0) goto PTREXIT; - for (entry = first; entry <= last; entry += rc) + if (rc == PCRE2_ERROR_NOSUBSTRING && + (suboptions & PCRE2_SUBSTITUTE_UNKNOWN_UNSET) != 0) { - uint32_t ng = GET2(entry, 0); - if (ng < ovector_count) + group = code->top_bracket + 1; + } + else + { + if (rc < 0) goto PTREXIT; + for (entry = first; entry <= last; entry += rc) { - if (group < 0) group = ng; /* First in ovector */ - if (ovector[ng*2] != PCRE2_UNSET) + uint32_t ng = GET2(entry, 0); + if (ng < ovector_count) { - group = ng; /* First that is set */ - break; + if (group < 0) group = ng; /* First in ovector */ + if (ovector[ng*2] != PCRE2_UNSET) + { + group = ng; /* First that is set */ + break; + } } } + + /* If group is still negative, it means we did not find a group + that is in the ovector. Just set the first group. */ + + if (group < 0) group = GET2(first, 0); } - - /* If group is still negative, it means we did not find a group that - is in the ovector. Just set the first group. */ - - if (group < 0) group = GET2(first, 0); } /* We now have a group that is identified by number. 
Find the length of @@ -575,10 +603,15 @@ do rc = pcre2_substring_length_bynumber(match_data, group, &sublength); if (rc < 0) { + if (rc == PCRE2_ERROR_NOSUBSTRING && + (suboptions & PCRE2_SUBSTITUTE_UNKNOWN_UNSET) != 0) + { + rc = PCRE2_ERROR_UNSET; + } if (rc != PCRE2_ERROR_UNSET) goto PTREXIT; /* Non-unset errors */ if (special == 0) /* Plain substitution */ { - if (uempty) continue; /* Treat as empty */ + if ((suboptions & PCRE2_SUBSTITUTE_UNSET_EMPTY) != 0) continue; goto PTREXIT; /* Else error */ } } @@ -646,26 +679,13 @@ do } #ifdef SUPPORT_UNICODE - if (utf) - { - unsigned int chlen; -#if PCRE2_CODE_UNIT_WIDTH == 8 - if (lengthleft < 6) goto NOROOM; -#elif PCRE2_CODE_UNIT_WIDTH == 16 - if (lengthleft < 2) goto NOROOM; -#else - if (lengthleft < 1) goto NOROOM; -#endif - chlen = PRIV(ord2utf)(ch, buffer + buff_offset); - buff_offset += chlen; - lengthleft -= chlen; - } - else + if (utf) chlen = PRIV(ord2utf)(ch, temp); else #endif { - if (lengthleft-- < 1) goto NOROOM; - buffer[buff_offset++] = ch; + temp[0] = ch; + chlen = 1; } + CHECKMEMCPY(temp, chlen); } } } @@ -675,7 +695,8 @@ do the case-forcing escapes are not supported in pcre2_compile() so must be recognized here. 
*/ - else if (extended && *ptr == CHAR_BACKSLASH) + else if ((suboptions & PCRE2_SUBSTITUTE_EXTENDED) != 0 && + *ptr == CHAR_BACKSLASH) { int errorcode = 0; @@ -756,33 +777,19 @@ do )[ch/8] & (1 << (ch%8))) == 0) ch = (code->tables + fcc_offset)[ch]; } - forcecase = forcecasereset; } #ifdef SUPPORT_UNICODE - if (utf) - { - unsigned int chlen; -#if PCRE2_CODE_UNIT_WIDTH == 8 - if (lengthleft < 6) goto NOROOM; -#elif PCRE2_CODE_UNIT_WIDTH == 16 - if (lengthleft < 2) goto NOROOM; -#else - if (lengthleft < 1) goto NOROOM; -#endif - chlen = PRIV(ord2utf)(ch, buffer + buff_offset); - buff_offset += chlen; - lengthleft -= chlen; - } - else + if (utf) chlen = PRIV(ord2utf)(ch, temp); else #endif { - if (lengthleft-- < 1) goto NOROOM; - buffer[buff_offset++] = ch; + temp[0] = ch; + chlen = 1; } - } - } + CHECKMEMCPY(temp, chlen); + } /* End handling a literal code unit */ + } /* End of loop for scanning the replacement. */ /* The replacement has been copied to the output. Update the start offset to point to the rest of the subject string. If we matched an empty string, @@ -791,18 +798,33 @@ do start_offset = ovector[1]; goptions = (ovector[0] != ovector[1])? 0 : PCRE2_ANCHORED|PCRE2_NOTEMPTY_ATSTART; - } while (global); /* Repeat "do" loop */ + } while ((suboptions & PCRE2_SUBSTITUTE_GLOBAL) != 0); /* Repeat "do" loop */ -/* Copy the rest of the subject and return the number of substitutions. */ +/* Copy the rest of the subject. */ -rc = subs; fraglength = length - start_offset; -if (fraglength + 1 > lengthleft) goto NOROOM; -memcpy(buffer + buff_offset, subject + start_offset, - fraglength*(PCRE2_CODE_UNIT_WIDTH/8)); -buff_offset += fraglength; -buffer[buff_offset] = 0; -*blength = buff_offset; +CHECKMEMCPY(subject + start_offset, fraglength); +temp[0] = 0; +CHECKMEMCPY(temp , 1); + +/* If overflowed is set it means the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set, +and matching has carried on after a full buffer, in order to compute the length +needed. 
Otherwise, an overflow generates an immediate error return. */ + +if (overflowed) + { + rc = PCRE2_ERROR_NOMEMORY; + *blength = buff_length + extra_needed; + } + +/* After a successful execution, return the number of substitutions and set the +length of buffer used, excluding the trailing zero. */ + +else + { + rc = subs; + *blength = buff_offset - 1; + } EXIT: if (match_data_created) pcre2_match_data_free(match_data); diff --git a/src/pcre2test.c b/src/pcre2test.c index 63b5996..85b3cff 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -399,39 +399,50 @@ enum { MOD_CTC, /* Applies to a compile context */ MOD_STR }; /* Is a string */ /* Control bits. Some apply to compiling, some to matching, but some can be set -either on a pattern or a data line, so they must all be distinct. */ +either on a pattern or a data line, so they must all be distinct. There are now +so many of them that they are split into two fields. */ -#define CTL_AFTERTEXT 0x00000001u -#define CTL_ALLAFTERTEXT 0x00000002u -#define CTL_ALLCAPTURES 0x00000004u -#define CTL_ALLUSEDTEXT 0x00000008u -#define CTL_ALTGLOBAL 0x00000010u -#define CTL_BINCODE 0x00000020u -#define CTL_CALLOUT_CAPTURE 0x00000040u -#define CTL_CALLOUT_INFO 0x00000080u -#define CTL_CALLOUT_NONE 0x00000100u -#define CTL_DFA 0x00000200u -#define CTL_EXPAND 0x00000400u -#define CTL_FINDLIMITS 0x00000800u -#define CTL_FULLBINCODE 0x00001000u -#define CTL_GETALL 0x00002000u -#define CTL_GLOBAL 0x00004000u -#define CTL_HEXPAT 0x00008000u -#define CTL_INFO 0x00010000u -#define CTL_JITFAST 0x00020000u -#define CTL_JITVERIFY 0x00040000u -#define CTL_MARK 0x00080000u -#define CTL_MEMORY 0x00100000u -#define CTL_NULLCONTEXT 0x00200000u -#define CTL_POSIX 0x00400000u -#define CTL_PUSH 0x00800000u -#define CTL_STARTCHAR 0x01000000u -#define CTL_SUBSTITUTE_EXTENDED 0x02000000u -#define CTL_SUBSTITUTE_UNSET_EMPTY 0x04000000u -#define CTL_ZERO_TERMINATE 0x08000000u +#define CTL_AFTERTEXT 0x00000001u +#define CTL_ALLAFTERTEXT 0x00000002u 
+#define CTL_ALLCAPTURES 0x00000004u +#define CTL_ALLUSEDTEXT 0x00000008u +#define CTL_ALTGLOBAL 0x00000010u +#define CTL_BINCODE 0x00000020u +#define CTL_CALLOUT_CAPTURE 0x00000040u +#define CTL_CALLOUT_INFO 0x00000080u +#define CTL_CALLOUT_NONE 0x00000100u +#define CTL_DFA 0x00000200u +#define CTL_EXPAND 0x00000400u +#define CTL_FINDLIMITS 0x00000800u +#define CTL_FULLBINCODE 0x00001000u +#define CTL_GETALL 0x00002000u +#define CTL_GLOBAL 0x00004000u +#define CTL_HEXPAT 0x00008000u +#define CTL_INFO 0x00010000u +#define CTL_JITFAST 0x00020000u +#define CTL_JITVERIFY 0x00040000u +#define CTL_MARK 0x00080000u +#define CTL_MEMORY 0x00100000u +#define CTL_NULLCONTEXT 0x00200000u +#define CTL_POSIX 0x00400000u +#define CTL_PUSH 0x00800000u +#define CTL_STARTCHAR 0x01000000u +#define CTL_ZERO_TERMINATE 0x02000000u +/* Spare 0x04000000u */ +/* Spare 0x08000000u */ +/* Spare 0x10000000u */ +/* Spare 0x20000000u */ +#define CTL_NL_SET 0x40000000u /* Informational */ +#define CTL_BSR_SET 0x80000000u /* Informational */ -#define CTL_BSR_SET 0x80000000u /* This is informational */ -#define CTL_NL_SET 0x40000000u /* This is informational */ +/* Second control word */ + +#define CTL2_SUBSTITUTE_EXTENDED 0x00000001u +#define CTL2_SUBSTITUTE_OVERFLOW_LENGTH 0x00000002u +#define CTL2_SUBSTITUTE_UNKNOWN_UNSET 0x00000004u +#define CTL2_SUBSTITUTE_UNSET_EMPTY 0x00000008u + +/* Combinations */ #define CTL_DEBUG (CTL_FULLBINCODE|CTL_INFO) /* For setting */ #define CTL_ANYINFO (CTL_DEBUG|CTL_BINCODE|CTL_CALLOUT_INFO) @@ -448,9 +459,12 @@ data line. */ CTL_GLOBAL|\ CTL_MARK|\ CTL_MEMORY|\ - CTL_STARTCHAR|\ - CTL_SUBSTITUTE_EXTENDED|\ - CTL_SUBSTITUTE_UNSET_EMPTY) + CTL_STARTCHAR) + +#define CTL2_ALLPD (CTL2_SUBSTITUTE_EXTENDED|\ + CTL2_SUBSTITUTE_OVERFLOW_LENGTH|\ + CTL2_SUBSTITUTE_UNKNOWN_UNSET|\ + CTL2_SUBSTITUTE_UNSET_EMPTY) /* Structures for holding modifier information for patterns and subject strings (data). 
Fields containing modifiers that can be set either for a pattern or a @@ -460,6 +474,7 @@ same offset in the big table below works for both. */ typedef struct patctl { /* Structure for pattern modifiers. */ uint32_t options; /* Must be in same position as datctl */ uint32_t control; /* Must be in same position as datctl */ + uint32_t control2; /* Must be in same position as datctl */ uint8_t replacement[REPLACE_MODSIZE]; /* So must this */ uint32_t jit; uint32_t stackguard_test; @@ -474,6 +489,7 @@ typedef struct patctl { /* Structure for pattern modifiers. */ typedef struct datctl { /* Structure for data line modifiers. */ uint32_t options; /* Must be in same position as patctl */ uint32_t control; /* Must be in same position as patctl */ + uint32_t control2; /* Must be in same position as patctl */ uint8_t replacement[REPLACE_MODSIZE]; /* So must this */ uint32_t cfail[2]; int32_t callout_data; @@ -514,92 +530,94 @@ typedef struct modstruct { } modstruct; static modstruct modlist[] = { - { "aftertext", MOD_PNDP, MOD_CTL, CTL_AFTERTEXT, PO(control) }, - { "allaftertext", MOD_PNDP, MOD_CTL, CTL_ALLAFTERTEXT, PO(control) }, - { "allcaptures", MOD_PND, MOD_CTL, CTL_ALLCAPTURES, PO(control) }, - { "allow_empty_class", MOD_PAT, MOD_OPT, PCRE2_ALLOW_EMPTY_CLASS, PO(options) }, - { "allusedtext", MOD_PNDP, MOD_CTL, CTL_ALLUSEDTEXT, PO(control) }, - { "alt_bsux", MOD_PAT, MOD_OPT, PCRE2_ALT_BSUX, PO(options) }, - { "alt_circumflex", MOD_PAT, MOD_OPT, PCRE2_ALT_CIRCUMFLEX, PO(options) }, - { "alt_verbnames", MOD_PAT, MOD_OPT, PCRE2_ALT_VERBNAMES, PO(options) }, - { "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) }, - { "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) }, - { "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) }, - { "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) }, - { "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) }, - { "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) }, - { "callout_data", 
MOD_DAT, MOD_INS, 0, DO(callout_data) }, - { "callout_fail", MOD_DAT, MOD_IN2, 0, DO(cfail) }, - { "callout_info", MOD_PAT, MOD_CTL, CTL_CALLOUT_INFO, PO(control) }, - { "callout_none", MOD_DAT, MOD_CTL, CTL_CALLOUT_NONE, DO(control) }, - { "caseless", MOD_PATP, MOD_OPT, PCRE2_CASELESS, PO(options) }, - { "copy", MOD_DAT, MOD_NN, DO(copy_numbers), DO(copy_names) }, - { "debug", MOD_PAT, MOD_CTL, CTL_DEBUG, PO(control) }, - { "dfa", MOD_DAT, MOD_CTL, CTL_DFA, DO(control) }, - { "dfa_restart", MOD_DAT, MOD_OPT, PCRE2_DFA_RESTART, DO(options) }, - { "dfa_shortest", MOD_DAT, MOD_OPT, PCRE2_DFA_SHORTEST, DO(options) }, - { "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) }, - { "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) }, - { "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) }, - { "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) }, - { "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) }, - { "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) }, - { "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) }, - { "fullbincode", MOD_PAT, MOD_CTL, CTL_FULLBINCODE, PO(control) }, - { "get", MOD_DAT, MOD_NN, DO(get_numbers), DO(get_names) }, - { "getall", MOD_DAT, MOD_CTL, CTL_GETALL, DO(control) }, - { "global", MOD_PNDP, MOD_CTL, CTL_GLOBAL, PO(control) }, - { "hex", MOD_PAT, MOD_CTL, CTL_HEXPAT, PO(control) }, - { "info", MOD_PAT, MOD_CTL, CTL_INFO, PO(control) }, - { "jit", MOD_PAT, MOD_IND, 7, PO(jit) }, - { "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) }, - { "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) }, - { "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) }, - { "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) }, - { "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) }, - { "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) }, - { "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) }, - { "max_pattern_length", MOD_CTC, MOD_SIZ, 0, CO(max_pattern_length) 
}, - { "memory", MOD_PD, MOD_CTL, CTL_MEMORY, PD(control) }, - { "multiline", MOD_PATP, MOD_OPT, PCRE2_MULTILINE, PO(options) }, - { "never_backslash_c", MOD_PAT, MOD_OPT, PCRE2_NEVER_BACKSLASH_C, PO(options) }, - { "never_ucp", MOD_PAT, MOD_OPT, PCRE2_NEVER_UCP, PO(options) }, - { "never_utf", MOD_PAT, MOD_OPT, PCRE2_NEVER_UTF, PO(options) }, - { "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) }, - { "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) }, - { "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) }, - { "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) }, - { "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) }, - { "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) }, - { "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) }, - { "notempty", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY, DO(options) }, - { "notempty_atstart", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY_ATSTART, DO(options) }, - { "noteol", MOD_DAT, MOD_OPT, PCRE2_NOTEOL, DO(options) }, - { "null_context", MOD_PD, MOD_CTL, CTL_NULLCONTEXT, PO(control) }, - { "offset", MOD_DAT, MOD_INT, 0, DO(offset) }, - { "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)}, - { "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) }, - { "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) }, - { "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, - { "partial_soft", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, - { "ph", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, - { "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) }, - { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, - { "push", MOD_PAT, MOD_CTL, CTL_PUSH, PO(control) }, - { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) }, - { "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) }, - { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) }, - { "stackguard", MOD_PAT, 
MOD_INT, 0, PO(stackguard_test) }, - { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, - { "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) }, - { "substitute_extended", MOD_PND, MOD_CTL, CTL_SUBSTITUTE_EXTENDED, PO(control) }, - { "substitute_unset_empty", MOD_PND, MOD_CTL, CTL_SUBSTITUTE_UNSET_EMPTY, PO(control) }, - { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, - { "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) }, - { "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) }, - { "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) }, - { "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) }, - { "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) } + { "aftertext", MOD_PNDP, MOD_CTL, CTL_AFTERTEXT, PO(control) }, + { "allaftertext", MOD_PNDP, MOD_CTL, CTL_ALLAFTERTEXT, PO(control) }, + { "allcaptures", MOD_PND, MOD_CTL, CTL_ALLCAPTURES, PO(control) }, + { "allow_empty_class", MOD_PAT, MOD_OPT, PCRE2_ALLOW_EMPTY_CLASS, PO(options) }, + { "allusedtext", MOD_PNDP, MOD_CTL, CTL_ALLUSEDTEXT, PO(control) }, + { "alt_bsux", MOD_PAT, MOD_OPT, PCRE2_ALT_BSUX, PO(options) }, + { "alt_circumflex", MOD_PAT, MOD_OPT, PCRE2_ALT_CIRCUMFLEX, PO(options) }, + { "alt_verbnames", MOD_PAT, MOD_OPT, PCRE2_ALT_VERBNAMES, PO(options) }, + { "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) }, + { "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) }, + { "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) }, + { "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) }, + { "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) }, + { "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) }, + { "callout_data", MOD_DAT, MOD_INS, 0, DO(callout_data) }, + { "callout_fail", MOD_DAT, MOD_IN2, 0, DO(cfail) }, + { "callout_info", MOD_PAT, MOD_CTL, CTL_CALLOUT_INFO, PO(control) }, + { "callout_none", MOD_DAT, MOD_CTL, CTL_CALLOUT_NONE, DO(control) }, + { "caseless", MOD_PATP, MOD_OPT, PCRE2_CASELESS, 
PO(options) }, + { "copy", MOD_DAT, MOD_NN, DO(copy_numbers), DO(copy_names) }, + { "debug", MOD_PAT, MOD_CTL, CTL_DEBUG, PO(control) }, + { "dfa", MOD_DAT, MOD_CTL, CTL_DFA, DO(control) }, + { "dfa_restart", MOD_DAT, MOD_OPT, PCRE2_DFA_RESTART, DO(options) }, + { "dfa_shortest", MOD_DAT, MOD_OPT, PCRE2_DFA_SHORTEST, DO(options) }, + { "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) }, + { "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) }, + { "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) }, + { "expand", MOD_PAT, MOD_CTL, CTL_EXPAND, PO(control) }, + { "extended", MOD_PATP, MOD_OPT, PCRE2_EXTENDED, PO(options) }, + { "find_limits", MOD_DAT, MOD_CTL, CTL_FINDLIMITS, DO(control) }, + { "firstline", MOD_PAT, MOD_OPT, PCRE2_FIRSTLINE, PO(options) }, + { "fullbincode", MOD_PAT, MOD_CTL, CTL_FULLBINCODE, PO(control) }, + { "get", MOD_DAT, MOD_NN, DO(get_numbers), DO(get_names) }, + { "getall", MOD_DAT, MOD_CTL, CTL_GETALL, DO(control) }, + { "global", MOD_PNDP, MOD_CTL, CTL_GLOBAL, PO(control) }, + { "hex", MOD_PAT, MOD_CTL, CTL_HEXPAT, PO(control) }, + { "info", MOD_PAT, MOD_CTL, CTL_INFO, PO(control) }, + { "jit", MOD_PAT, MOD_IND, 7, PO(jit) }, + { "jitfast", MOD_PAT, MOD_CTL, CTL_JITFAST, PO(control) }, + { "jitstack", MOD_DAT, MOD_INT, 0, DO(jitstack) }, + { "jitverify", MOD_PAT, MOD_CTL, CTL_JITVERIFY, PO(control) }, + { "locale", MOD_PAT, MOD_STR, LOCALESIZE, PO(locale) }, + { "mark", MOD_PNDP, MOD_CTL, CTL_MARK, PO(control) }, + { "match_limit", MOD_CTM, MOD_INT, 0, MO(match_limit) }, + { "match_unset_backref", MOD_PAT, MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) }, + { "max_pattern_length", MOD_CTC, MOD_SIZ, 0, CO(max_pattern_length) }, + { "memory", MOD_PD, MOD_CTL, CTL_MEMORY, PD(control) }, + { "multiline", MOD_PATP, MOD_OPT, PCRE2_MULTILINE, PO(options) }, + { "never_backslash_c", MOD_PAT, MOD_OPT, PCRE2_NEVER_BACKSLASH_C, PO(options) }, + { "never_ucp", MOD_PAT, MOD_OPT, PCRE2_NEVER_UCP, PO(options) }, + { 
"never_utf", MOD_PAT, MOD_OPT, PCRE2_NEVER_UTF, PO(options) }, + { "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) }, + { "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) }, + { "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) }, + { "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) }, + { "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) }, + { "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) }, + { "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) }, + { "notempty", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY, DO(options) }, + { "notempty_atstart", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY_ATSTART, DO(options) }, + { "noteol", MOD_DAT, MOD_OPT, PCRE2_NOTEOL, DO(options) }, + { "null_context", MOD_PD, MOD_CTL, CTL_NULLCONTEXT, PO(control) }, + { "offset", MOD_DAT, MOD_INT, 0, DO(offset) }, + { "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)}, + { "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) }, + { "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) }, + { "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, + { "partial_soft", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, + { "ph", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, + { "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) }, + { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, + { "push", MOD_PAT, MOD_CTL, CTL_PUSH, PO(control) }, + { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) }, + { "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) }, + { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) }, + { "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) }, + { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, + { "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) }, + { "substitute_extended", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_EXTENDED, PO(control2) }, + { "substitute_overflow_length", MOD_PND, 
MOD_CTL, CTL2_SUBSTITUTE_OVERFLOW_LENGTH, PO(control2) }, + { "substitute_unknown_unset", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNKNOWN_UNSET, PO(control2) }, + { "substitute_unset_empty", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNSET_EMPTY, PO(control2) }, + { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, + { "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) }, + { "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) }, + { "use_offset_limit", MOD_PAT, MOD_OPT, PCRE2_USE_OFFSET_LIMIT, PO(options) }, + { "utf", MOD_PATP, MOD_OPT, PCRE2_UTF, PO(options) }, + { "zero_terminate", MOD_DAT, MOD_CTL, CTL_ZERO_TERMINATE, DO(control) } }; #define MODLISTCOUNT sizeof(modlist)/sizeof(modstruct) @@ -613,10 +631,13 @@ static modstruct modlist[] = { #define POSIX_SUPPORTED_COMPILE_CONTROLS ( \ CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_POSIX) +#define POSIX_SUPPORTED_COMPILE_CONTROLS2 (0) + #define POSIX_SUPPORTED_MATCH_OPTIONS ( \ PCRE2_NOTBOL|PCRE2_NOTEMPTY|PCRE2_NOTEOL) -#define POSIX_SUPPORTED_MATCH_CONTROLS (CTL_AFTERTEXT|CTL_ALLAFTERTEXT) +#define POSIX_SUPPORTED_MATCH_CONTROLS (CTL_AFTERTEXT|CTL_ALLAFTERTEXT) +#define POSIX_SUPPORTED_MATCH_CONTROLS2 (0) /* Control bits that are not ignored with 'push'. */ @@ -624,23 +645,27 @@ static modstruct modlist[] = { CTL_BINCODE|CTL_CALLOUT_INFO|CTL_FULLBINCODE|CTL_HEXPAT|CTL_INFO| \ CTL_JITVERIFY|CTL_MEMORY|CTL_PUSH|CTL_BSR_SET|CTL_NL_SET) +#define PUSH_SUPPORTED_COMPILE_CONTROLS2 (0) + /* Controls that apply only at compile time with 'push'. */ -#define PUSH_COMPILE_ONLY_CONTROLS CTL_JITVERIFY +#define PUSH_COMPILE_ONLY_CONTROLS CTL_JITVERIFY +#define PUSH_COMPILE_ONLY_CONTROLS2 (0) /* Controls that are forbidden with #pop. */ #define NOTPOP_CONTROLS (CTL_HEXPAT|CTL_POSIX|CTL_PUSH) -/* Pattern controls that are mutually exclusive. */ +/* Pattern controls that are mutually exclusive. At present these are all in +the first control word. 
*/ static uint32_t exclusive_pat_controls[] = { CTL_POSIX | CTL_HEXPAT, CTL_POSIX | CTL_PUSH, CTL_EXPAND | CTL_HEXPAT }; -/* Data controls that are mutually exclusive. */ - +/* Data controls that are mutually exclusive. At present these are all in the +first control word. */ static uint32_t exclusive_dat_controls[] = { CTL_ALLUSEDTEXT | CTL_STARTCHAR, CTL_FINDLIMITS | CTL_NULLCONTEXT }; @@ -3528,15 +3553,16 @@ words. Arguments: controls control bits + controls2 more control bits before text to print before Returns: nothing */ static void -show_controls(uint32_t controls, const char *before) +show_controls(uint32_t controls, uint32_t controls2, const char *before) { -fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", +fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", before, ((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "", ((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "", @@ -3565,8 +3591,10 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s ((controls & CTL_POSIX) != 0)? " posix" : "", ((controls & CTL_PUSH) != 0)? " push" : "", ((controls & CTL_STARTCHAR) != 0)? " startchar" : "", - ((controls & CTL_SUBSTITUTE_EXTENDED) != 0)? " substitute_extended" : "", - ((controls & CTL_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "", + ((controls2 & CTL2_SUBSTITUTE_EXTENDED) != 0)? " substitute_extended" : "", + ((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "", + ((controls2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) != 0)? " substitute_unknown_unset" : "", + ((controls2 & CTL2_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "", ((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : ""); } @@ -4398,14 +4426,15 @@ patlen = p - buffer - 2; if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP; utf = (pat_patctl.options & PCRE2_UTF) != 0; -/* Check for mutually exclusive modifiers. 
*/ +/* Check for mutually exclusive modifiers. At present, these are all in the +first control word. */ for (k = 0; k < sizeof(exclusive_pat_controls)/sizeof(uint32_t); k++) { uint32_t c = pat_patctl.control & exclusive_pat_controls[k]; if (c != 0 && c != (c & (~c+1))) { - show_controls(c, "** Not allowed together:"); + show_controls(c, 0, "** Not allowed together:"); fprintf(outfile, "\n"); return PR_SKIP; } @@ -4605,9 +4634,11 @@ if ((pat_patctl.control & CTL_POSIX) != 0) pat_patctl.options & ~POSIX_SUPPORTED_COMPILE_OPTIONS, msg, ""); msg = ""; } - if ((pat_patctl.control & ~POSIX_SUPPORTED_COMPILE_CONTROLS) != 0) + if ((pat_patctl.control & ~POSIX_SUPPORTED_COMPILE_CONTROLS) != 0 || + (pat_patctl.control2 & ~POSIX_SUPPORTED_COMPILE_CONTROLS2) != 0) { - show_controls(pat_patctl.control & ~POSIX_SUPPORTED_COMPILE_CONTROLS, msg); + show_controls(pat_patctl.control & ~POSIX_SUPPORTED_COMPILE_CONTROLS, + pat_patctl.control2 & ~POSIX_SUPPORTED_COMPILE_CONTROLS2, msg); msg = ""; } @@ -4663,15 +4694,19 @@ if ((pat_patctl.control & CTL_PUSH) != 0) fprintf(outfile, "** Replacement text is not supported with 'push'.\n"); return PR_OK; } - if ((pat_patctl.control & ~PUSH_SUPPORTED_COMPILE_CONTROLS) != 0) + if ((pat_patctl.control & ~PUSH_SUPPORTED_COMPILE_CONTROLS) != 0 || + (pat_patctl.control2 & ~PUSH_SUPPORTED_COMPILE_CONTROLS2) != 0) { show_controls(pat_patctl.control & ~PUSH_SUPPORTED_COMPILE_CONTROLS, + pat_patctl.control2 & ~PUSH_SUPPORTED_COMPILE_CONTROLS2, "** Ignored when compiled pattern is stacked with 'push':"); fprintf(outfile, "\n"); } - if ((pat_patctl.control & PUSH_COMPILE_ONLY_CONTROLS) != 0) + if ((pat_patctl.control & PUSH_COMPILE_ONLY_CONTROLS) != 0 || + (pat_patctl.control2 & PUSH_COMPILE_ONLY_CONTROLS2) != 0) { show_controls(pat_patctl.control & PUSH_COMPILE_ONLY_CONTROLS, + pat_patctl.control2 & PUSH_COMPILE_ONLY_CONTROLS2, "** Applies only to compile when pattern is stacked with 'push':"); fprintf(outfile, "\n"); } @@ -5340,6 +5375,7 @@ matching. 
*/ DATCTXCPY(dat_context, default_dat_context); memcpy(&dat_datctl, &def_datctl, sizeof(datctl)); dat_datctl.control |= (pat_patctl.control & CTL_ALLPD); +dat_datctl.control2 |= (pat_patctl.control2 & CTL2_ALLPD); strcpy((char *)dat_datctl.replacement, (char *)pat_patctl.replacement); /* Initialize for scanning the data line. */ @@ -5657,14 +5693,15 @@ ulen = len/code_unit_size; /* Length in code units */ if (p[-1] != 0 && !decode_modifiers(p, CTX_DAT, NULL, &dat_datctl)) return PR_OK; -/* Check for mutually exclusive modifiers. */ +/* Check for mutually exclusive modifiers. At present, these are all in the +first control word. */ for (k = 0; k < sizeof(exclusive_dat_controls)/sizeof(uint32_t); k++) { c = dat_datctl.control & exclusive_dat_controls[k]; if (c != 0 && c != (c & (~c+1))) { - show_controls(c, "** Not allowed together:"); + show_controls(c, 0, "** Not allowed together:"); fprintf(outfile, "\n"); return PR_OK; } @@ -5717,9 +5754,11 @@ if ((pat_patctl.control & CTL_POSIX) != 0) show_match_options(dat_datctl.options & ~POSIX_SUPPORTED_MATCH_OPTIONS); msg = ""; } - if ((dat_datctl.control & ~POSIX_SUPPORTED_MATCH_CONTROLS) != 0) + if ((dat_datctl.control & ~POSIX_SUPPORTED_MATCH_CONTROLS) != 0 || + (dat_datctl.control2 & ~POSIX_SUPPORTED_MATCH_CONTROLS2) != 0) { - show_controls(dat_datctl.control & ~POSIX_SUPPORTED_MATCH_CONTROLS, msg); + show_controls(dat_datctl.control & ~POSIX_SUPPORTED_MATCH_CONTROLS, + dat_datctl.control2 & ~POSIX_SUPPORTED_MATCH_CONTROLS2, msg); msg = ""; } @@ -5891,9 +5930,13 @@ if (dat_datctl.replacement[0] != 0) xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 : PCRE2_SUBSTITUTE_GLOBAL) | - (((dat_datctl.control & CTL_SUBSTITUTE_EXTENDED) == 0)? 0 : + (((dat_datctl.control2 & CTL2_SUBSTITUTE_EXTENDED) == 0)? 0 : PCRE2_SUBSTITUTE_EXTENDED) | - (((dat_datctl.control & CTL_SUBSTITUTE_UNSET_EMPTY) == 0)? 0 : + (((dat_datctl.control2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) == 0)? 
0 : + PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) | + (((dat_datctl.control2 & CTL2_SUBSTITUTE_UNKNOWN_UNSET) == 0)? 0 : + PCRE2_SUBSTITUTE_UNKNOWN_UNSET) | + (((dat_datctl.control2 & CTL2_SUBSTITUTE_UNSET_EMPTY) == 0)? 0 : PCRE2_SUBSTITUTE_UNSET_EMPTY); SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */ @@ -5987,12 +6030,16 @@ if (dat_datctl.replacement[0] != 0) if (rc < 0) { + PCRE2_SIZE msize; fprintf(outfile, "Failed: error %d", rc); - if (nsize != PCRE2_UNSET) + if (rc != PCRE2_ERROR_NOMEMORY && nsize != PCRE2_UNSET) fprintf(outfile, " at offset %ld in replacement", nsize); fprintf(outfile, ": "); - PCRE2_GET_ERROR_MESSAGE(nsize, rc, pbuffer); - PCHARSV(CASTVAR(void *, pbuffer), 0, nsize, FALSE, outfile); + PCRE2_GET_ERROR_MESSAGE(msize, rc, pbuffer); + PCHARSV(CASTVAR(void *, pbuffer), 0, msize, FALSE, outfile); + if (rc == PCRE2_ERROR_NOMEMORY && + (xoptions & PCRE2_SUBSTITUTE_OVERFLOW_LENGTH) != 0) + fprintf(outfile, ": %ld code units are needed", nsize); } else { @@ -6850,7 +6897,8 @@ control blocks must be the same so that common options and controls such as We cannot test this till runtime because "offsetof" does not work in the preprocessor. 
*/ -if (PO(options) != DO(options) || PO(control) != DO(control)) +if (PO(options) != DO(options) || PO(control) != DO(control) || + PO(control2) != DO(control2)) { fprintf(stderr, "** Coding error: " "options and control offsets for pattern and data must be the same.\n"); diff --git a/testdata/testinput2 b/testdata/testinput2 index 075d414..461bc0a 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -4042,8 +4042,6 @@ /(((((a)))))/parens_nest_limit=2 -# Tests for pcre2_substitute() - /abc/replace=XYZ 123123 123abc123 @@ -4149,11 +4147,24 @@ /(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=[22]${*MARK} apple lemon blackberry + apple lemon blackberry\=substitute_overflow_length /(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=[23]${*MARK} apple lemon blackberry -# End of substitute tests +/abc/ + 123abc123\=replace=[9]XYZ + 123abc123\=substitute_overflow_length,replace=[9]XYZ + 123abc123\=substitute_overflow_length,replace=[6]XYZ + 123abc123\=substitute_overflow_length,replace=[1]XYZ + 123abc123\=substitute_overflow_length,replace=[0]XYZ + +/a(b)c/ + 123abc123\=replace=[9]x$1z + 123abc123\=substitute_overflow_length,replace=[9]x$1z + 123abc123\=substitute_overflow_length,replace=[6]x$1z + 123abc123\=substitute_overflow_length,replace=[1]x$1z + 123abc123\=substitute_overflow_length,replace=[0]x$1z "((?=(?(?=(?(?=(?(?=()))))))))" a @@ -4749,12 +4760,24 @@ a)"xI cat\=replace=>$1<,substitute_unset_empty xbcom\=replace=>$1<,substitute_unset_empty +/a|(b)c/substitute_extended + cat\=replace=>${2:-xx}< + cat\=replace=>${2:-xx}<,substitute_unknown_unset + cat\=replace=>${X:-xx}<,substitute_unknown_unset + /a|(?'X'b)c/replace=>$X<,substitute_unset_empty cat xbcom +/a|(?'X'b)c/replace=>$Y<,substitute_unset_empty + cat + cat\=substitute_unknown_unset + cat\=substitute_unknown_unset,-substitute_unset_empty + /a|(b)c/replace=>$2<,substitute_unset_empty cat + cat\=substitute_unknown_unset + 
cat\=substitute_unknown_unset,-substitute_unset_empty /()()()/use_offset_limit \=ovector=11000000000 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index ea0c69d..91bca64 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -13432,8 +13432,6 @@ Subject length lower bound = 0 /(((((a)))))/parens_nest_limit=2 Failed: error 119 at offset 3: parentheses are too deeply nested -# Tests for pcre2_substitute() - /abc/replace=XYZ 123123 0: 123123 @@ -13583,12 +13581,36 @@ Failed: error -35 at offset 9 in replacement: invalid replacement string /(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=[22]${*MARK} apple lemon blackberry Failed: error -48: no more memory + apple lemon blackberry\=substitute_overflow_length +Failed: error -48: no more memory: 23 code units are needed /(*:pear)apple|(*:orange)lemon|(*:strawberry)blackberry/g,replace=[23]${*MARK} apple lemon blackberry 3: pear orange strawberry -# End of substitute tests +/abc/ + 123abc123\=replace=[9]XYZ +Failed: error -48: no more memory + 123abc123\=substitute_overflow_length,replace=[9]XYZ +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[6]XYZ +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[1]XYZ +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[0]XYZ +Failed: error -48: no more memory: 10 code units are needed + +/a(b)c/ + 123abc123\=replace=[9]x$1z +Failed: error -48: no more memory + 123abc123\=substitute_overflow_length,replace=[9]x$1z +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[6]x$1z +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[1]x$1z +Failed: error -48: no more memory: 10 code units are needed + 123abc123\=substitute_overflow_length,replace=[0]x$1z +Failed: error 
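-48: no more memory: 10 code units are needed "((?=(?(?=(?(?=(?(?=()))))))))" a @@ -15075,15 +15097,35 @@ Failed: error -55 at offset 3 in replacement: requested value is not set xbcom\=replace=>$1<,substitute_unset_empty 1: x>b<om +/a|(b)c/substitute_extended + cat\=replace=>${2:-xx}< +Failed: error -49 at offset 9 in replacement: unknown substring + cat\=replace=>${2:-xx}<,substitute_unknown_unset + 1: c>xx<t + cat\=replace=>${X:-xx}<,substitute_unknown_unset + 1: c>xx<t + /a|(?'X'b)c/replace=>$X<,substitute_unset_empty cat 1: c><t xbcom 1: x>b<om +/a|(?'X'b)c/replace=>$Y<,substitute_unset_empty + cat +Failed: error -49 at offset 3 in replacement: unknown substring + cat\=substitute_unknown_unset + 1: c><t + cat\=substitute_unknown_unset,-substitute_unset_empty +Failed: error -55 at offset 3 in replacement: requested value is not set + /a|(b)c/replace=>$2<,substitute_unset_empty cat Failed: error -49 at offset 3 in replacement: unknown substring + cat\=substitute_unknown_unset + 1: c><t + cat\=substitute_unknown_unset,-substitute_unset_empty +Failed: error -55 at offset 3 in replacement: requested value is not set /()()()/use_offset_limit \=ovector=11000000000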
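The "10 code units are needed" results in the test output above suggest a two-call pattern for callers: try a buffer, and on overflow retry with the reported size. The following is a minimal sketch of that pattern only. It does not call PCRE2: `demo_substitute`, `demo_substitute_alloc`, and the `DEMO_*` constants are hypothetical stand-ins invented for illustration, the "pattern" is a plain literal string, and the needed length is computed directly rather than by simulating the matching and copying as the real function is documented to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the PCRE2 API. */
#define DEMO_ERROR_NOMEMORY  (-48)     /* mirrors the error number in the test output */
#define DEMO_UNSET           SIZE_MAX  /* plays the role of PCRE2_UNSET */
#define DEMO_OVERFLOW_LENGTH 0x1u      /* plays the role of the new option bit */

/* Replace the first literal occurrence of needle in subject with repl.
   On entry *outlen is the buffer size. On success it becomes the length of
   the result, excluding the trailing zero, and the return value is the
   number of substitutions (0 or 1). On overflow the return value is
   DEMO_ERROR_NOMEMORY and *outlen is either the needed size, including the
   trailing zero, or DEMO_UNSET, depending on the option bit. */
static int demo_substitute(const char *subject, const char *needle,
                           const char *repl, unsigned options,
                           char *out, size_t *outlen)
{
    assert(outlen != NULL);
    const char *hit = strstr(subject, needle);
    size_t head = hit ? (size_t)(hit - subject) : strlen(subject);
    size_t mid = hit ? strlen(repl) : 0;
    size_t tail = hit ? strlen(hit + strlen(needle)) : 0;
    size_t needed = head + mid + tail + 1;  /* +1 for the trailing zero */

    if (needed > *outlen) {
        *outlen = (options & DEMO_OVERFLOW_LENGTH) ? needed : DEMO_UNSET;
        return DEMO_ERROR_NOMEMORY;
    }
    memcpy(out, subject, head);
    if (hit) {
        memcpy(out + head, repl, mid);
        memcpy(out + head + mid, hit + strlen(needle), tail);
    }
    out[head + mid + tail] = '\0';
    *outlen = needed - 1;
    return hit ? 1 : 0;
}

/* Caller-side pattern: try a first guess, then retry once with the size
   reported through *outlen. Returns a malloc'd string or NULL. */
static char *demo_substitute_alloc(const char *subject, const char *needle,
                                   const char *repl, size_t first_guess,
                                   int *count)
{
    size_t len = first_guess;
    char *buf = malloc(len ? len : 1);
    if (buf == NULL) return NULL;
    int rc = demo_substitute(subject, needle, repl, DEMO_OVERFLOW_LENGTH,
                             buf, &len);
    if (rc == DEMO_ERROR_NOMEMORY) {
        char *bigger = realloc(buf, len);  /* len now holds the needed size */
        if (bigger == NULL) { free(buf); return NULL; }
        buf = bigger;
        rc = demo_substitute(subject, needle, repl, DEMO_OVERFLOW_LENGTH,
                             buf, &len);
    }
    if (rc < 0) { free(buf); return NULL; }
    *count = rc;
    return buf;
}
```

With the subject `123abc123`, the literal `abc`, and the replacement `XYZ`, a 9-unit buffer overflows and the sketch reports 10, the same figure as the `replace=[9]XYZ` test case. Without the option bit the length comes back as the unset marker, so the caller has no size to retry with.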