Fix compatibility issues for \8 and \9.

Philip.Hazel 2015-04-23 17:28:39 +00:00
parent e75aa00591
commit 4d35b44b43
8 changed files with 129 additions and 41 deletions


@ -94,6 +94,9 @@ fuzzer: see http://lcamtuf.coredump.cx/afl/.
 23. Added the PCRE2_ALT_CIRCUMFLEX option.
+24. Adjust the treatment of \8 and \9 to be the same as the current Perl
+behaviour.
 Version 10.10 06-March-2015
 ---------------------------


@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "23 April 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -387,11 +387,13 @@ numbers, and \eg{} to specify back references. The following paragraphs
 describe the old, ambiguous syntax.
 .P
 The handling of a backslash followed by a digit other than 0 is complicated,
-and Perl has changed in recent releases, causing PCRE2 also to change. Outside
-a character class, PCRE2 reads the digit and any following digits as a decimal
-number. If the number is less than 8, or if there have been at least that many
-previous capturing left parentheses in the expression, the entire sequence is
-taken as a \fIback reference\fP. A description of how this works is given
+and Perl has changed over time, causing PCRE2 also to change.
+.P
+Outside a character class, PCRE2 reads the digit and any following digits as a
+decimal number. If the number is less than 10, begins with the digit 8 or 9, or
+if there are at least that many previous capturing left parentheses in the
+expression, the entire sequence is taken as a \fIback reference\fP. A
+description of how this works is given
 .\" HTML <a href="#backreferences">
 .\" </a>
 later,
@ -399,14 +401,14 @@ later,
 following the discussion of
 .\" HTML <a href="#subpattern">
 .\" </a>
 parenthesized subpatterns.
 .\"
+Otherwise, up to three octal digits are read to form a character code.
 .P
-Inside a character class, or if the decimal number following \e is greater than
-7 and there have not been that many capturing subpatterns, PCRE2 handles \e8
-and \e9 as the literal characters "8" and "9", and otherwise re-reads up to
-three octal digits following the backslash, using them to generate a data
-character. Any subsequent digits stand for themselves. For example:
+Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
+"8" and "9", and otherwise reads up to three octal digits following the
+backslash, using them to generate a data character. Any subsequent digits stand
+for themselves. For example, outside a character class:
 .sp
 \e040 is another way of writing an ASCII space
 .\" JOIN
@ -425,8 +427,7 @@ character. Any subsequent digits stand for themselves. For example:
 \e377 might be a back reference, otherwise
 the value 255 (decimal)
 .\" JOIN
-\e81 is either a back reference, or the two
-characters "8" and "1"
+\e81 is always a back reference
 .sp
 Note that octal values of 100 or greater that are specified using this syntax
 must not be introduced by a leading zero, because no more than three octal
@ -3337,6 +3338,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi


@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -19,7 +19,7 @@ documentation. This document contains a quick-reference summary of the syntax.
 \eQ...\eE treat enclosed characters as literal
 .
 .
-.SH "CHARACTERS"
+.SH "ESCAPED CHARACTERS"
 .rs
 .sp
 \ea alarm, that is, the BEL character (hex 07)
@ -32,17 +32,28 @@ documentation. This document contains a quick-reference summary of the syntax.
 \e0dd character with octal code 0dd
 \eddd character with octal code ddd, or backreference
 \eo{ddd..} character with octal code ddd..
 \eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
 \euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
 \exhh character with hex code hh
 \ex{hhh..} character with hex code hhh..
 .sp
-Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
-characters "8" and "9". When \ex is not followed by {, from zero to two
-hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \ex must be followed
-by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise
-it matches a literal "x". Likewise, if \eu (in ALT_BSUX mode) is not followed
-by four hexadecimal digits, it matches a literal "u".
+Note that \e0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+documentation.
+.P
+When \ex is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
 .
 .
 .SH "CHARACTER TYPES"
@ -329,7 +340,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\eB not a word boundary \eB not a word boundary
^ start of subject ^ start of subject
also after an internal newline in multiline mode also after an internal newline in multiline mode
(after any newline if PCRE2_ALT_CIRCUMFLEX is set) (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\eA start of subject \eA start of subject
$ end of subject $ end of subject
also before newline at end of subject also before newline at end of subject
@ -407,8 +418,8 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp .sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time. PCRE2_NEVER_UCP options, respectively, at compile time.
. .
. .
@ -530,9 +541,9 @@ pattern is not anchored.
(?Cn) callout with numerical data n (?Cn) callout with numerical data n
(?C"text") callout with string data (?C"text") callout with string data
.sp .sp
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it. delimiter }. To encode the ending delimiter within the string, double it.
. .
. .
.SH "SEE ALSO" .SH "SEE ALSO"
@ -556,6 +567,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi


@ -1868,9 +1868,9 @@ else
 Outside a character class, the digits are read as a decimal number. If the
 number is less than 10, or if there are that many previous extracting left
 brackets, it is a back reference. Otherwise, up to three octal digits are
-read to form an escaped byte. Thus \123 is likely to be octal 123 (cf
-\0123, which is octal 012 followed by the literal 3). If the octal value is
-greater than 377, the least significant 8 bits are taken.
+read to form an escaped character code. Thus \123 is likely to be octal 123
+(cf \0123, which is octal 012 followed by the literal 3). If the octal
+value is greater than 377, the least significant 8 bits are taken.
 Inside a character class, \ followed by a digit is always either a literal
 8 or 9 or an octal number. */
@ -1899,18 +1899,24 @@ else
 *errorcodeptr = ERR61;
 break;
 }
-if (s < 10 || s <= cb->bracount) /* Check for back reference */
+/* \1 to \9 are always back references. \8x and \9x are too, unless there
+are an awful lot of previous captures; \1x to \7x are octal escapes if
+there are not that many previous captures. */
+if (s < 10 || *oldptr >= CHAR_8 || s <= cb->bracount)
 {
-escape = -s;
+escape = -s; /* Indicates a back reference */
 break;
 }
 ptr = oldptr; /* Put the pointer back and fall through */
 }
-/* Handle a digit following \ when the number is not a back reference. If
-the first digit is 8 or 9, Perl used to generate a binary zero byte and
-then treat the digit as a following literal. At least by Perl 5.18 this
-changed so as not to insert the binary zero. */
+/* Handle a digit following \ when the number is not a back reference, or
+we are within a character class. If the first digit is 8 or 9, Perl used to
+generate a binary zero byte and then treat the digit as a following
+literal. At least by Perl 5.18 this changed so as not to insert the binary
+zero. */
 if ((c = *ptr) >= CHAR_8) break;
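The revised back-reference test can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the code in the PCRE2 source: `is_back_reference` is a hypothetical helper, it uses `strtol` in place of PCRE2's own digit loop, and the overflow check that raises ERR61 is omitted.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of PCRE2. digits points at the first digit
   (1-9) after a backslash, outside a character class; bracount is the
   number of capturing groups opened so far. Returns true when the sequence
   would be taken as a back reference under the rule in this commit. */
static bool is_back_reference(const char *digits, int bracount)
{
  long s = strtol(digits, NULL, 10);   /* whole run of digits, as decimal */

  return s < 10                 /* \1 to \9 are always back references */
      || digits[0] >= '8'       /* \8x and \9x cannot be octal escapes */
      || s <= bracount;         /* enough previous captures exist */
}
```

Under this rule \12 after twelve groups is a back reference, \12 with no groups falls through to the octal path, and \81 is a back reference even when only eight groups exist, which is consistent with the "reference to non-existent subpattern" errors in the new test output.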

6
testdata/testinput1 vendored

@ -5715,4 +5715,10 @@ name)/mark
"(?1)(?#?'){8}(a)" "(?1)(?#?'){8}(a)"
 baaaaaaaaac
+/((((((((((((x))))))))))))\12/
+xx
+/A[\8]B[\9]C/
+A8B9C
 # End of testinput1

11
testdata/testinput2 vendored

@ -4279,4 +4279,15 @@ a random value. /Ix
 /^/gm,alt_circumflex
 \n\n\n
+/((((((((x))))))))\81/
+xx1
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+xx
+/\80/
+/A\8B\9C/
+A8B9C
 # End of testinput2

20
testdata/testoutput1 vendored

@ -9427,4 +9427,24 @@ No match
 0: aaaaaaaaa
 1: a
+/((((((((((((x))))))))))))\12/
+xx
+0: xx
+1: x
+2: x
+3: x
+4: x
+5: x
+6: x
+7: x
+8: x
+9: x
+10: x
+11: x
+12: x
+/A[\8]B[\9]C/
+A8B9C
+0: A8B9C
 # End of testinput1

30
testdata/testoutput2 vendored

@ -14318,4 +14318,34 @@ No match
 0:
 0:
+/((((((((x))))))))\81/
+Failed: error 115 at offset 20: reference to non-existent subpattern
+xx1
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+xx
+Matched, but too many substrings
+0: xx
+1: x
+2: x
+3: x
+4: x
+5: x
+6: x
+7: x
+8: x
+9: x
+10: x
+11: x
+12: x
+13: x
+14: x
+/\80/
+Failed: error 115 at offset 3: reference to non-existent subpattern
+/A\8B\9C/
+Failed: error 115 at offset 7: reference to non-existent subpattern
+A8B9C
 # End of testinput2