Implement PCRE2_SUBSTITUTE_UNSET_EMPTY.
parent 38caadff03
commit 2f684a60ed
|
@ -380,6 +380,9 @@ changed when the effects of those options were all moved to compile time.
|
|||
PCRE2_ALT_VERBNAMES was set caused pcre2_compile() to malfunction. This bug
|
||||
was found by the LLVM fuzzer.
|
||||
|
||||
110. Implemented PCRE2_SUBSTITUTE_UNSET_EMPTY, and updated pcre2test to make it
|
||||
possible to test it.
|
||||
|
||||
|
||||
Version 10.20 30-June-2015
|
||||
--------------------------
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "03 December 2015" "PCRE2 10.21"
|
||||
.TH PCRE2API 3 "04 December 2015" "PCRE2 10.21"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -2734,19 +2734,26 @@ simultaneous substitutions, as this \fBpcre2test\fP example shows:
|
|||
apple lemon
|
||||
2: pear orange
|
||||
.sp
|
||||
There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
|
||||
function to iterate over the subject string, replacing every matching
|
||||
substring. If this is not set, only the first matching substring is replaced.
|
||||
If any matched substring has zero length, after the substitution has happened,
|
||||
an attempt to find a non-empty match at the same position is performed. If this
|
||||
is not successful, the current position is advanced by one character except
|
||||
when CRLF is a valid newline sequence and the next two characters are CR, LF.
|
||||
In this case, the current position is advanced by two characters.
|
||||
Three additional options are available:
|
||||
.P
|
||||
A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
|
||||
to be applied to the replacement string. Without this option, only the dollar
|
||||
character is special, and only the group insertion forms listed above are
|
||||
valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
|
||||
replacing every matching substring. If this is not set, only the first matching
|
||||
substring is replaced. If any matched substring has zero length, after the
|
||||
substitution has happened, an attempt to find a non-empty match at the same
|
||||
position is performed. If this is not successful, the current position is
|
||||
advanced by one character except when CRLF is a valid newline sequence and the
|
||||
next two characters are CR, LF. In this case, the current position is advanced
|
||||
by two characters.
|
||||
.P
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups to be treated as
|
||||
empty strings when inserted as described above. If this option is not set, an
|
||||
attempt to insert an unset group causes the PCRE2_ERROR_UNSET error. This
|
||||
option does not influence the extended substitution syntax described below.
|
||||
.P
|
||||
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||
replacement string. Without this option, only the dollar character is special,
|
||||
and only the group insertion forms listed above are valid. When
|
||||
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
||||
.P
|
||||
Firstly, backslash in a replacement string is interpreted as an escape
|
||||
character. The usual forms such as \en or \ex{ddd} can be used to specify
|
||||
|
@ -2792,16 +2799,22 @@ string remains in force afterwards, as shown in this \fBpcre2test\fP example:
|
|||
somebody
|
||||
1: HELLO
|
||||
.sp
|
||||
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||
substitutions.
|
||||
.P
|
||||
If successful, the function returns the number of replacements that were made.
|
||||
This may be zero if no matches were found, and is never greater than 1 unless
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set.
|
||||
.P
|
||||
In the event of an error, a negative error code is returned. Except for
|
||||
PCRE2_ERROR_NOMATCH (which is never returned), errors from \fBpcre2_match()\fP
|
||||
are passed straight back. PCRE2_ERROR_NOMEMORY is returned if the output buffer
|
||||
is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax
|
||||
errors in the replacement string, with more particular errors being
|
||||
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
||||
are passed straight back. PCRE2_ERROR_NOSUBSTRING is returned for a
|
||||
non-existent substring insertion, and PCRE2_ERROR_UNSET is returned for an
|
||||
unset substring insertion when the simple (non-extended) syntax is used and
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY is not set. PCRE2_ERROR_NOMEMORY is returned if
|
||||
the output buffer is not big enough. PCRE2_ERROR_BADREPLACEMENT is used for
|
||||
miscellaneous syntax errors in the replacement string, with more particular
|
||||
errors being PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
|
||||
PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket not found),
|
||||
PCRE2_BADSUBSTITUTION (syntax error in extended group substitution), and
|
||||
PCRE2_BADSUBPATTERN (the pattern match ended before it started). As for all
|
||||
|
@ -3100,6 +3113,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 03 December 2015
|
||||
Last updated: 04 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "21 November 2015" "PCRE 10.21"
|
||||
.TH PCRE2TEST 1 "04 December 2015" "PCRE 10.21"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -862,6 +862,8 @@ compilation process.
|
|||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
substitute_extended use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
.sp
|
||||
These modifiers may not appear in a \fB#pattern\fP command. If you want them as
|
||||
defaults, set them in a \fB#subject\fP command.
|
||||
|
@ -960,6 +962,8 @@ pattern.
|
|||
replace=<string> specify a replacement string
|
||||
startchar show startchar when relevant
|
||||
startoffset=<n> same as offset=<n>
|
||||
substitute_extedded use PCRE2_SUBSTITUTE_EXTENDED
|
||||
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
zero_terminate pass the subject as zero-terminated
|
||||
.sp
|
||||
The effects of these modifiers are described in the following sections.
|
||||
|
@ -1104,9 +1108,13 @@ individual code units are copied directly. This provides a means of passing an
|
|||
invalid UTF-8 string for testing purposes.
|
||||
.P
|
||||
If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
|
||||
\fBpcre2_substitute()\fP. After a successful substitution, the modified string
|
||||
is output, preceded by the number of replacements. This may be zero if there
|
||||
were no matches. Here is a simple example of a substitution test:
|
||||
\fBpcre2_substitute()\fP. The \fBsubstitute_extended\fP and
|
||||
\fBsubstitute_unset_empty\fP modifiers set PCRE2_SUBSTITUTE_EXTENDED and
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY, respectively.
|
||||
.P
|
||||
After a successful substitution, the modified string is output, preceded by the
|
||||
number of replacements. This may be zero if there were no matches. Here is a
|
||||
simple example of a substitution test:
|
||||
.sp
|
||||
/abc/replace=xxx
|
||||
=abc=abc=
|
||||
|
@ -1610,6 +1618,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 November 2015
|
||||
Last updated: 04 December 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -150,6 +150,7 @@ sanity checks). */
|
|||
|
||||
#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u
|
||||
#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u
|
||||
#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u
|
||||
|
||||
/* Newline and \R settings, for use in compile contexts. The newline values
|
||||
must be kept in step with values set in config.h and both sets must all be
|
||||
|
|
|
@ -150,6 +150,7 @@ sanity checks). */
|
|||
|
||||
#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u
|
||||
#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u
|
||||
#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u
|
||||
|
||||
/* Newline and \R settings, for use in compile contexts. The newline values
|
||||
must be kept in step with values set in config.h and both sets must all be
|
||||
|
|
|
@ -197,6 +197,7 @@ BOOL match_data_created = FALSE;
|
|||
BOOL global = FALSE;
|
||||
BOOL extended = FALSE;
|
||||
BOOL literal = FALSE;
|
||||
BOOL uempty = FALSE; /* Unset/unknown groups => empty string */
|
||||
#ifdef SUPPORT_UNICODE
|
||||
BOOL utf = (code->overall_options & PCRE2_UTF) != 0;
|
||||
#endif
|
||||
|
@ -262,6 +263,12 @@ if ((options & PCRE2_SUBSTITUTE_EXTENDED) != 0)
|
|||
extended = TRUE;
|
||||
}
|
||||
|
||||
if ((options & PCRE2_SUBSTITUTE_UNSET_EMPTY) != 0)
|
||||
{
|
||||
options &= ~PCRE2_SUBSTITUTE_UNSET_EMPTY;
|
||||
uempty = TRUE;
|
||||
}
|
||||
|
||||
/* Copy up to the start offset */
|
||||
|
||||
if (start_offset > buff_length) goto NOROOM;
|
||||
|
@ -471,7 +478,6 @@ do
|
|||
|
||||
if (inparens)
|
||||
{
|
||||
|
||||
if (extended && !star && ptr < repend - 2 && next == CHAR_COLON)
|
||||
{
|
||||
special = *(++ptr);
|
||||
|
@ -562,8 +568,20 @@ do
|
|||
if (group < 0) group = GET2(first, 0);
|
||||
}
|
||||
|
||||
/* We now have a group that is identified by number. Find the length of
|
||||
the captured string. If a group in a non-special substitution is unset
|
||||
when PCRE2_SUBSTITUTE_UNSET_EMPTY is set, substitute nothing. */
|
||||
|
||||
rc = pcre2_substring_length_bynumber(match_data, group, &sublength);
|
||||
if (rc < 0 && (special == 0 || rc != PCRE2_ERROR_UNSET)) goto PTREXIT;
|
||||
if (rc < 0)
|
||||
{
|
||||
if (rc != PCRE2_ERROR_UNSET) goto PTREXIT; /* Non-unset errors */
|
||||
if (special == 0) /* Plain substitution */
|
||||
{
|
||||
if (uempty) continue; /* Treat as empty */
|
||||
goto PTREXIT; /* Else error */
|
||||
}
|
||||
}
|
||||
|
||||
/* If special is '+' we have a 'set' and possibly an 'unset' text,
|
||||
both of which are reprocessed when used. If special is '-' we have a
|
||||
|
|
|
@ -411,7 +411,8 @@ either on a pattern or a data line, so they must all be distinct. */
|
|||
#define CTL_PUSH 0x00800000u
|
||||
#define CTL_STARTCHAR 0x01000000u
|
||||
#define CTL_SUBSTITUTE_EXTENDED 0x02000000u
|
||||
#define CTL_ZERO_TERMINATE 0x04000000u
|
||||
#define CTL_SUBSTITUTE_UNSET_EMPTY 0x04000000u
|
||||
#define CTL_ZERO_TERMINATE 0x08000000u
|
||||
|
||||
#define CTL_BSR_SET 0x80000000u /* This is informational */
|
||||
#define CTL_NL_SET 0x40000000u /* This is informational */
|
||||
|
@ -431,7 +432,9 @@ data line. */
|
|||
CTL_GLOBAL|\
|
||||
CTL_MARK|\
|
||||
CTL_MEMORY|\
|
||||
CTL_STARTCHAR)
|
||||
CTL_STARTCHAR|\
|
||||
CTL_SUBSTITUTE_EXTENDED|\
|
||||
CTL_SUBSTITUTE_UNSET_EMPTY)
|
||||
|
||||
/* Structures for holding modifier information for patterns and subject strings
|
||||
(data). Fields containing modifiers that can be set either for a pattern or a
|
||||
|
@ -573,7 +576,8 @@ static modstruct modlist[] = {
|
|||
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
|
||||
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
|
||||
{ "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) },
|
||||
{ "substitute_extended", MOD_PAT, MOD_CTL, CTL_SUBSTITUTE_EXTENDED, PO(control) },
|
||||
{ "substitute_extended", MOD_PND, MOD_CTL, CTL_SUBSTITUTE_EXTENDED, PO(control) },
|
||||
{ "substitute_unset_empty", MOD_PND, MOD_CTL, CTL_SUBSTITUTE_UNSET_EMPTY, PO(control) },
|
||||
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
|
||||
{ "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) },
|
||||
{ "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) },
|
||||
|
@ -3519,7 +3523,7 @@ Returns: nothing
|
|||
static void
|
||||
show_controls(uint32_t controls, const char *before)
|
||||
{
|
||||
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
before,
|
||||
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
|
||||
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
|
||||
|
@ -3549,6 +3553,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
|||
((controls & CTL_PUSH) != 0)? " push" : "",
|
||||
((controls & CTL_STARTCHAR) != 0)? " startchar" : "",
|
||||
((controls & CTL_SUBSTITUTE_EXTENDED) != 0)? " substitute_extended" : "",
|
||||
((controls & CTL_SUBSTITUTE_UNSET_EMPTY) != 0)? " substitute_unset_empty" : "",
|
||||
((controls & CTL_ZERO_TERMINATE) != 0)? " zero_terminate" : "");
|
||||
}
|
||||
|
||||
|
@ -5873,8 +5878,10 @@ if (dat_datctl.replacement[0] != 0)
|
|||
|
||||
xoptions = (((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
|
||||
PCRE2_SUBSTITUTE_GLOBAL) |
|
||||
(((pat_patctl.control & CTL_SUBSTITUTE_EXTENDED) == 0)? 0 :
|
||||
PCRE2_SUBSTITUTE_EXTENDED);
|
||||
(((dat_datctl.control & CTL_SUBSTITUTE_EXTENDED) == 0)? 0 :
|
||||
PCRE2_SUBSTITUTE_EXTENDED) |
|
||||
(((dat_datctl.control & CTL_SUBSTITUTE_UNSET_EMPTY) == 0)? 0 :
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY);
|
||||
|
||||
SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */
|
||||
pr = dat_datctl.replacement;
|
||||
|
|
|
@ -4576,6 +4576,9 @@ B)x/alt_verbnames,mark
|
|||
/(abcd)/replace=${1:+xy\kz},substitute_extended
|
||||
abcd
|
||||
|
||||
/(abcd)/
|
||||
abcd\=replace=${1:+xy\kz},substitute_extended
|
||||
|
||||
/abcd/substitute_extended,replace=>$1<
|
||||
abcd
|
||||
|
||||
|
@ -4737,4 +4740,20 @@ a)"xI
|
|||
|
||||
/(8(*:6^\x09x\xa6l\)6!|\xd0:[^:|)\x09d\Z\d{85*m(?'(?<1!)*\W[*\xff]!!h\w]*\xbe;/alt_bsux,alt_verbnames,allow_empty_class,dollar_endonly,extended,multiline,never_utf,no_dotstar_anchor,no_start_optimize
|
||||
|
||||
/a|(b)c/replace=>$1<,substitute_unset_empty
|
||||
cat
|
||||
xbcom
|
||||
|
||||
/a|(b)c/
|
||||
cat\=replace=>$1<
|
||||
cat\=replace=>$1<,substitute_unset_empty
|
||||
xbcom\=replace=>$1<,substitute_unset_empty
|
||||
|
||||
/a|(?'X'b)c/replace=>$X<,substitute_unset_empty
|
||||
cat
|
||||
xbcom
|
||||
|
||||
/a|(b)c/replace=>$2<,substitute_unset_empty
|
||||
cat
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -14648,6 +14648,10 @@ Failed: error -58 at offset 7 in replacement: expected closing curly bracket in
|
|||
abcd
|
||||
Failed: error -57 at offset 8 in replacement: bad escape sequence in replacement string
|
||||
|
||||
/(abcd)/
|
||||
abcd\=replace=${1:+xy\kz},substitute_extended
|
||||
Failed: error -57 at offset 8 in replacement: bad escape sequence in replacement string
|
||||
|
||||
/abcd/substitute_extended,replace=>$1<
|
||||
abcd
|
||||
Failed: error -49 at offset 3 in replacement: unknown substring
|
||||
|
@ -15057,4 +15061,28 @@ Subject length lower bound = 0
|
|||
/(8(*:6^\x09x\xa6l\)6!|\xd0:[^:|)\x09d\Z\d{85*m(?'(?<1!)*\W[*\xff]!!h\w]*\xbe;/alt_bsux,alt_verbnames,allow_empty_class,dollar_endonly,extended,multiline,never_utf,no_dotstar_anchor,no_start_optimize
|
||||
Failed: error 124 at offset 49: letter or underscore expected after (?< or (?'
|
||||
|
||||
/a|(b)c/replace=>$1<,substitute_unset_empty
|
||||
cat
|
||||
1: c><t
|
||||
xbcom
|
||||
1: x>b<om
|
||||
|
||||
/a|(b)c/
|
||||
cat\=replace=>$1<
|
||||
Failed: error -55 at offset 3 in replacement: requested value is not set
|
||||
cat\=replace=>$1<,substitute_unset_empty
|
||||
1: c><t
|
||||
xbcom\=replace=>$1<,substitute_unset_empty
|
||||
1: x>b<om
|
||||
|
||||
/a|(?'X'b)c/replace=>$X<,substitute_unset_empty
|
||||
cat
|
||||
1: c><t
|
||||
xbcom
|
||||
1: x>b<om
|
||||
|
||||
/a|(b)c/replace=>$2<,substitute_unset_empty
|
||||
cat
|
||||
Failed: error -49 at offset 3 in replacement: unknown substring
|
||||
|
||||
# End of testinput2
|
||||
|
|