Don't split CRLF in pcre2_substitute() when it's a valid newline sequence.

Philip.Hazel 2015-11-13 16:52:26 +00:00
parent 2f8febd4b1
commit 299e587f9b
6 changed files with 87 additions and 7 deletions

@@ -296,6 +296,9 @@ not dereferencing it) while handling lookbehind assertions.
 87. Failure to get memory for the match data in regcomp() is now given as a
     regcomp() error instead of waiting for regexec() to pick it up.
 
+88. In pcre2_substitute(), ensure that CRLF is not split when it is a valid
+    newline sequence.
+
 
 Version 10.20 30-June-2015
 --------------------------

@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "10 November 2015" "PCRE2 10.21"
+.TH PCRE2API 3 "13 November 2015" "PCRE2 10.21"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -2729,6 +2729,11 @@ simultaneous substitutions, as this \fBpcre2test\fP example shows:
 There is an additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes the
 function to iterate over the subject string, replacing every matching
 substring. If this is not set, only the first matching substring is replaced.
+If any matched substring has zero length, after the substitution has happened,
+an attempt to find a non-empty match at the same position is performed. If this
+is not successful, the current position is advanced by one character except
+when CRLF is a valid newline sequence and the next two characters are CR, LF.
+In this case, the current position is advanced by two characters.
 .P
 A second additional option, PCRE2_SUBSTITUTE_EXTENDED, causes extra processing
 to be applied to the replacement string. Without this option, only the dollar
@@ -3087,6 +3092,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 November 2015
+Last updated: 13 November 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi

@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "10 November 2015" "PCRE2 10.21"
+.TH PCRE2PATTERN 3 "13 November 2015" "PCRE2 10.21"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -671,8 +671,8 @@ below.
 This particular group matches either the two-character sequence CR followed by
 LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
 U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
-line, U+0085). The two-character sequence is treated as a single unit that
-cannot be split.
+line, U+0085). Because this is an atomic group, the two-character sequence is
+treated as a single unit that cannot be split.
 .P
 In other modes, two additional characters whose codepoints are greater than 255
 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
@@ -1183,6 +1183,18 @@ patterns that are anchored in single line mode because all branches start with
 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
 PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
 .P
+When the newline convention (see
+.\" HTML <a href="#newlines">
+.\" </a>
+"Newline conventions"
+.\"
+below) recognizes the two-character sequence CRLF as a newline, this is
+preferred, even if the single characters CR and LF are also recognized as
+newlines. For example, if the newline convention is "any", a multiline mode
+circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
+CR, even though CR on its own is a valid newline. (It also matches at the very
+start of the string, of course.)
+.P
 Note that the sequences \eA, \eZ, and \ez can be used to match the start and
 end of the subject in both modes, and if all branches of a pattern start with
 \eA it is always anchored, whether or not PCRE2_MULTILINE is set.
@@ -3413,6 +3425,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 10 November 2015
+Last updated: 13 November 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi

@@ -296,8 +296,22 @@ do
     if (rc != PCRE2_ERROR_NOMATCH) goto EXIT;
     if (goptions == 0 || start_offset >= length) break;
 
+    /* Advance by one code point. Then, if CRLF is a valid newline sequence and
+    we have advanced into the middle of it, advance one more code point. In
+    other words, do not start in the middle of CRLF, even if CR and LF on their
+    own are valid newlines. */
+
     save_start = start_offset++;
-    if ((code->overall_options & PCRE2_UTF) != 0)
+    if (subject[start_offset-1] == CHAR_CR &&
+        code->newline_convention != PCRE2_NEWLINE_CR &&
+        code->newline_convention != PCRE2_NEWLINE_LF &&
+        start_offset < length &&
+        subject[start_offset] == CHAR_LF)
+      start_offset++;
+
+    /* Otherwise, in UTF mode, advance past any secondary code points. */
+
+    else if ((code->overall_options & PCRE2_UTF) != 0)
       {
 #if PCRE2_CODE_UNIT_WIDTH == 8
       while (start_offset < length && (subject[start_offset] & 0xc0) == 0x80)

testdata/testinput2

@@ -498,6 +498,9 @@
 /^ab\n/Igm,aftertext
     ab\nab\ncd
 
+/^/gm,newline=any
+    a\rb\nc\r\nxyz\=aftertext
+
 /abc/I
 
 /abc|bac/I
@@ -4659,4 +4662,18 @@ a)"xI
 
 /(?|(a)|())/BI
 
+# Test CRLF handling in empty string substitutions
+
+/^$/gm,newline=anycrlf,replace=-
+    X\r\n\r\nY
+
+/^$/gm,newline=crlf,replace=-
+    X\r\n\r\nY
+
+/^$/gm,newline=any,replace=-
+    X\r\n\r\nY
+
+"(*ANYCRLF)(?m)^(.*[^0-9\r\n].*|)$"g,replace=NaN
+    15\r\nfoo\r\n20\r\nbar\r\nbaz\r\n\r\n20
+
 # End of testinput2

testdata/testoutput2

@@ -1316,6 +1316,17 @@ Subject length lower bound = 3
  0: ab\x0a
  0+ cd
 
+/^/gm,newline=any
+    a\rb\nc\r\nxyz\=aftertext
+ 0:
+ 0+ a\x0db\x0ac\x0d\x0axyz
+ 0:
+ 0+ b\x0ac\x0d\x0axyz
+ 0:
+ 0+ c\x0d\x0axyz
+ 0:
+ 0+ xyz
+
 /abc/I
 Capturing subpattern count = 0
 First code unit = 'a'
@@ -14842,4 +14853,22 @@ Capturing subpattern count = 1
 May match empty string
 Subject length lower bound = 0
 
+# Test CRLF handling in empty string substitutions
+
+/^$/gm,newline=anycrlf,replace=-
+    X\r\n\r\nY
+ 1: X\x0d\x0a-\x0d\x0aY
+
+/^$/gm,newline=crlf,replace=-
+    X\r\n\r\nY
+ 1: X\x0d\x0a-\x0d\x0aY
+
+/^$/gm,newline=any,replace=-
+    X\r\n\r\nY
+ 1: X\x0d\x0a-\x0d\x0aY
+
+"(*ANYCRLF)(?m)^(.*[^0-9\r\n].*|)$"g,replace=NaN
+    15\r\nfoo\r\n20\r\nbar\r\nbaz\r\n\r\n20
+ 4: 15\x0d\x0aNaN\x0d\x0a20\x0d\x0aNaN\x0d\x0aNaN\x0d\x0aNaN\x0d\x0a20
+
 # End of testinput2