Fix compatibility issues for \8 and \9.

Philip.Hazel 2015-04-23 17:28:39 +00:00
parent e75aa00591
commit 4d35b44b43
8 changed files with 129 additions and 41 deletions


@@ -94,6 +94,9 @@ fuzzer: see http://lcamtuf.coredump.cx/afl/.
23. Added the PCRE2_ALT_CIRCUMFLEX option.
24. Adjust the treatment of \8 and \9 to be the same as the current Perl
behaviour.
Version 10.10 06-March-2015
---------------------------


@@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "22 April 2015" "PCRE2 10.20"
.TH PCRE2PATTERN 3 "23 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -387,11 +387,13 @@ numbers, and \eg{} to specify back references. The following paragraphs
describe the old, ambiguous syntax.
.P
The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed in recent releases, causing PCRE2 also to change. Outside
a character class, PCRE2 reads the digit and any following digits as a decimal
number. If the number is less than 8, or if there have been at least that many
previous capturing left parentheses in the expression, the entire sequence is
taken as a \fIback reference\fP. A description of how this works is given
and Perl has changed over time, causing PCRE2 also to change.
.P
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a \fIback reference\fP. A
description of how this works is given
.\" HTML <a href="#backreferences">
.\" </a>
later,
@@ -399,14 +401,14 @@ later,
following the discussion of
.\" HTML <a href="#subpattern">
.\" </a>
parenthesized subpatterns.
parenthesized subpatterns.
.\"
Otherwise, up to three octal digits are read to form a character code.
.P
Inside a character class, or if the decimal number following \e is greater than
7 and there have not been that many capturing subpatterns, PCRE2 handles \e8
and \e9 as the literal characters "8" and "9", and otherwise re-reads up to
three octal digits following the backslash, using them to generate a data
character. Any subsequent digits stand for themselves. For example:
Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
"8" and "9", and otherwise reads up to three octal digits following the
backslash, using them to generate a data character. Any subsequent digits stand
for themselves. For example, outside a character class:
.sp
\e040 is another way of writing an ASCII space
.\" JOIN
@@ -425,8 +427,7 @@ character. Any subsequent digits stand for themselves. For example:
\e377 might be a back reference, otherwise
the value 255 (decimal)
.\" JOIN
\e81 is either a back reference, or the two
characters "8" and "1"
\e81 is always a back reference
.sp
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
@@ -3337,6 +3338,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 22 April 2015
Last updated: 23 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "22 April 2015" "PCRE2 10.20"
.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -19,7 +19,7 @@ documentation. This document contains a quick-reference summary of the syntax.
\eQ...\eE treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.SH "ESCAPED CHARACTERS"
.rs
.sp
\ea alarm, that is, the BEL character (hex 07)
@@ -32,17 +32,28 @@ documentation. This document contains a quick-reference summary of the syntax.
\e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd..
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\exhh character with hex code hh
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9". When \ex is not followed by {, from zero to two
hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \ex must be followed
by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise
it matches a literal "x". Likewise, if \eu (in ALT_BSUX mode) is not followed
by four hexadecimal digits, it matches a literal "u".
Note that \e0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
.\" </a>
"Non-printing characters"
.\"
in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation.
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
.
.
.SH "CHARACTER TYPES"
@@ -329,7 +340,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\eB not a word boundary
^ start of subject
also after an internal newline in multiline mode
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\eA start of subject
$ end of subject
also before newline at end of subject
@@ -407,8 +418,8 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
.
.
@@ -530,9 +541,9 @@ pattern is not anchored.
(?Cn) callout with numerical data n
(?C"text") callout with string data
.sp
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
.
.
.SH "SEE ALSO"
@@ -556,6 +567,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 22 April 2015
Last updated: 23 April 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@@ -1868,9 +1868,9 @@ else
Outside a character class, the digits are read as a decimal number. If the
number is less than 10, or if there are that many previous extracting left
brackets, it is a back reference. Otherwise, up to three octal digits are
read to form an escaped byte. Thus \123 is likely to be octal 123 (cf
\0123, which is octal 012 followed by the literal 3). If the octal value is
greater than 377, the least significant 8 bits are taken.
read to form an escaped character code. Thus \123 is likely to be octal 123
(cf \0123, which is octal 012 followed by the literal 3). If the octal
value is greater than 377, the least significant 8 bits are taken.
Inside a character class, \ followed by a digit is always either a literal
8 or 9 or an octal number. */
@@ -1899,18 +1899,24 @@ else
*errorcodeptr = ERR61;
break;
}
if (s < 10 || s <= cb->bracount) /* Check for back reference */
/* \1 to \9 are always back references. \8x and \9x are too, unless there
are an awful lot of previous captures; \1x to \7x are octal escapes if
there are not that many previous captures. */
if (s < 10 || *oldptr >= CHAR_8 || s <= cb->bracount)
{
escape = -s;
escape = -s; /* Indicates a back reference */
break;
}
ptr = oldptr; /* Put the pointer back and fall through */
}
/* Handle a digit following \ when the number is not a back reference. If
the first digit is 8 or 9, Perl used to generate a binary zero byte and
then treat the digit as a following literal. At least by Perl 5.18 this
changed so as not to insert the binary zero. */
/* Handle a digit following \ when the number is not a back reference, or
we are within a character class. If the first digit is 8 or 9, Perl used to
generate a binary zero byte and then treat the digit as a following
literal. At least by Perl 5.18 this changed so as not to insert the binary
zero. */
if ((c = *ptr) >= CHAR_8) break;
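
The decision made in the hunk above can be sketched as a small standalone C function. This is an illustrative sketch only, not the code in the commit: the names esc_kind, esc_result and classify_digit_escape are hypothetical, the overflow cap is arbitrary, and the later checks (whether the referenced group exists, how octal values above \377 are treated) are left out.

```c
#include <stddef.h>

/* Hypothetical types for this sketch; they are not part of PCRE2. */
typedef enum { ESC_BACKREF, ESC_OCTAL, ESC_LITERAL, ESC_OVERFLOW } esc_kind;

typedef struct {
  esc_kind kind;
  unsigned int value;  /* group number, character code, or literal character */
  size_t consumed;     /* number of digits used after the backslash */
} esc_result;

/* Classify the digits that follow a backslash. "digits" points at the first
digit, assumed to be 1-9 (\0 is always octal and is not the interesting case
here); "bracount" is the number of capturing parentheses opened so far. */

static esc_result classify_digit_escape(const char *digits,
  unsigned int bracount, int in_class)
{
esc_result r = { ESC_OCTAL, 0, 0 };
size_t n = 0;

if (!in_class)
  {
  unsigned int s = 0;

  /* Read the digit and any following digits as a decimal number. */
  while (digits[n] >= '0' && digits[n] <= '9')
    {
    if (s > 65535)  /* Arbitrary cap for the sketch, to avoid overflow */
      {
      r.kind = ESC_OVERFLOW;
      return r;
      }
    s = s * 10 + (unsigned int)(digits[n++] - '0');
    }

  /* \1 to \9 are always back references, and so is anything that starts
  with 8 or 9; \1x to \7x are back references only if there have been that
  many previous captures. */
  if (s < 10 || digits[0] >= '8' || s <= bracount)
    {
    r.kind = ESC_BACKREF;
    r.value = s;
    r.consumed = n;
    return r;
    }

  n = 0;  /* Not a back reference: go back and re-read as octal */
  }

/* Only reachable with 8 or 9 inside a character class: a literal digit. */
if (digits[0] == '8' || digits[0] == '9')
  {
  r.kind = ESC_LITERAL;
  r.value = (unsigned int)digits[0];
  r.consumed = 1;
  return r;
  }

/* Otherwise read up to three octal digits to form a character code. */
while (n < 3 && digits[n] >= '0' && digits[n] <= '7')
  r.value = r.value * 8 + (unsigned int)(digits[n++] - '0');
r.consumed = n;
return r;
}
```

Under these assumptions, classify_digit_escape("81", 8, 0) reports a back reference to group 81 even though only eight groups are open, which is consistent with the "reference to non-existent subpattern" error recorded for /((((((((x))))))))\81/ in testoutput2, while classify_digit_escape("8", 0, 1) reports the literal character "8", as in the /A[\8]B[\9]C/ test.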

testdata/testinput1 (6 changes)

@@ -5715,4 +5715,10 @@ name)/mark
"(?1)(?#?'){8}(a)"
baaaaaaaaac
/((((((((((((x))))))))))))\12/
xx
/A[\8]B[\9]C/
A8B9C
# End of testinput1

testdata/testinput2 (11 changes)

@@ -4279,4 +4279,15 @@ a random value. /Ix
/^/gm,alt_circumflex
\n\n\n
/((((((((x))))))))\81/
xx1
/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
xx
/\80/
/A\8B\9C/
A8B9C
# End of testinput2

testdata/testoutput1 (20 changes)

@@ -9427,4 +9427,24 @@ No match
0: aaaaaaaaa
1: a
/((((((((((((x))))))))))))\12/
xx
0: xx
1: x
2: x
3: x
4: x
5: x
6: x
7: x
8: x
9: x
10: x
11: x
12: x
/A[\8]B[\9]C/
A8B9C
0: A8B9C
# End of testinput1

testdata/testoutput2 (30 changes)

@@ -14318,4 +14318,34 @@ No match
0:
0:
/((((((((x))))))))\81/
Failed: error 115 at offset 20: reference to non-existent subpattern
xx1
/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
xx
Matched, but too many substrings
0: xx
1: x
2: x
3: x
4: x
5: x
6: x
7: x
8: x
9: x
10: x
11: x
12: x
13: x
14: x
/\80/
Failed: error 115 at offset 3: reference to non-existent subpattern
/A\8B\9C/
Failed: error 115 at offset 7: reference to non-existent subpattern
A8B9C
# End of testinput2