Fix compatibility issues for \8 and \9.
parent e75aa00591
commit 4d35b44b43
@@ -94,6 +94,9 @@ fuzzer: see http://lcamtuf.coredump.cx/afl/.
 
 23. Added the PCRE2_ALT_CIRCUMFLEX option.
 
+24. Adjust the treatment of \8 and \9 to be the same as the current Perl
+behaviour.
+
 
 Version 10.10 06-March-2015
 ---------------------------
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "23 April 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -387,11 +387,13 @@ numbers, and \eg{} to specify back references. The following paragraphs
 describe the old, ambiguous syntax.
 .P
 The handling of a backslash followed by a digit other than 0 is complicated,
-and Perl has changed in recent releases, causing PCRE2 also to change. Outside
-a character class, PCRE2 reads the digit and any following digits as a decimal
-number. If the number is less than 8, or if there have been at least that many
-previous capturing left parentheses in the expression, the entire sequence is
-taken as a \fIback reference\fP. A description of how this works is given
+and Perl has changed over time, causing PCRE2 also to change.
+.P
+Outside a character class, PCRE2 reads the digit and any following digits as a
+decimal number. If the number is less than 10, begins with the digit 8 or 9, or
+if there are at least that many previous capturing left parentheses in the
+expression, the entire sequence is taken as a \fIback reference\fP. A
+description of how this works is given
 .\" HTML <a href="#backreferences">
 .\" </a>
 later,
@@ -399,14 +401,14 @@ later,
 following the discussion of
 .\" HTML <a href="#subpattern">
 .\" </a>
 parenthesized subpatterns.
 .\"
+Otherwise, up to three octal digits are read to form a character code.
 .P
-Inside a character class, or if the decimal number following \e is greater than
-7 and there have not been that many capturing subpatterns, PCRE2 handles \e8
-and \e9 as the literal characters "8" and "9", and otherwise re-reads up to
-three octal digits following the backslash, using them to generate a data
-character. Any subsequent digits stand for themselves. For example:
+Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
+"8" and "9", and otherwise reads up to three octal digits following the
+backslash, using them to generate a data character. Any subsequent digits stand
+for themselves. For example, outside a character class:
 .sp
 \e040 is another way of writing an ASCII space
 .\" JOIN
@@ -425,8 +427,7 @@ character. Any subsequent digits stand for themselves. For example:
 \e377 might be a back reference, otherwise
 the value 255 (decimal)
 .\" JOIN
-\e81 is either a back reference, or the two
-characters "8" and "1"
+\e81 is always a back reference
 .sp
 Note that octal values of 100 or greater that are specified using this syntax
 must not be introduced by a leading zero, because no more than three octal
@@ -3337,6 +3338,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -19,7 +19,7 @@ documentation. This document contains a quick-reference summary of the syntax.
 \eQ...\eE treat enclosed characters as literal
 .
 .
-.SH "CHARACTERS"
+.SH "ESCAPED CHARACTERS"
 .rs
 .sp
 \ea alarm, that is, the BEL character (hex 07)
@@ -32,17 +32,28 @@ documentation. This document contains a quick-reference summary of the syntax.
 \e0dd character with octal code 0dd
 \eddd character with octal code ddd, or backreference
 \eo{ddd..} character with octal code ddd..
 \eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
 \euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
 \exhh character with hex code hh
 \ex{hhh..} character with hex code hhh..
 .sp
-Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
-characters "8" and "9". When \ex is not followed by {, from zero to two
-hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \ex must be followed
-by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise
-it matches a literal "x". Likewise, if \eu (in ALT_BSUX mode) is not followed
-by four hexadecimal digits, it matches a literal "u".
+Note that \e0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+documentation.
+.P
+When \ex is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
 .
 .
 .SH "CHARACTER TYPES"
@@ -329,7 +340,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 \eB not a word boundary
 ^ start of subject
 also after an internal newline in multiline mode
 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
 \eA start of subject
 $ end of subject
 also before newline at end of subject
@@ -407,8 +418,8 @@ appear.
 (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
 .sp
 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
 limits set by the caller of pcre2_match(), not increase them. The application
 can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
 PCRE2_NEVER_UCP options, respectively, at compile time.
 .
 .
@@ -530,9 +541,9 @@ pattern is not anchored.
 (?Cn) callout with numerical data n
 (?C"text") callout with string data
 .sp
 The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
 start and the end), and the starting delimiter { matched with the ending
 delimiter }. To encode the ending delimiter within the string, double it.
 .
 .
 .SH "SEE ALSO"
@@ -556,6 +567,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 22 April 2015
+Last updated: 23 April 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi
@@ -1868,9 +1868,9 @@ else
 Outside a character class, the digits are read as a decimal number. If the
 number is less than 10, or if there are that many previous extracting left
 brackets, it is a back reference. Otherwise, up to three octal digits are
-read to form an escaped byte. Thus \123 is likely to be octal 123 (cf
-\0123, which is octal 012 followed by the literal 3). If the octal value is
-greater than 377, the least significant 8 bits are taken.
+read to form an escaped character code. Thus \123 is likely to be octal 123
+(cf \0123, which is octal 012 followed by the literal 3). If the octal
+value is greater than 377, the least significant 8 bits are taken.
 
 Inside a character class, \ followed by a digit is always either a literal
 8 or 9 or an octal number. */
@@ -1899,18 +1899,24 @@ else
 *errorcodeptr = ERR61;
 break;
 }
-if (s < 10 || s <= cb->bracount) /* Check for back reference */
+
+/* \1 to \9 are always back references. \8x and \9x are too, unless there
+are an awful lot of previous captures; \1x to \7x are octal escapes if
+there are not that many previous captures. */
+
+if (s < 10 || *oldptr >= CHAR_8 || s <= cb->bracount)
 {
-escape = -s;
+escape = -s; /* Indicates a back reference */
 break;
 }
 ptr = oldptr; /* Put the pointer back and fall through */
 }
 
-/* Handle a digit following \ when the number is not a back reference. If
-the first digit is 8 or 9, Perl used to generate a binary zero byte and
-then treat the digit as a following literal. At least by Perl 5.18 this
-changed so as not to insert the binary zero. */
+/* Handle a digit following \ when the number is not a back reference, or
+we are within a character class. If the first digit is 8 or 9, Perl used to
+generate a binary zero byte and then treat the digit as a following
+literal. At least by Perl 5.18 this changed so as not to insert the binary
+zero. */
 
 if ((c = *ptr) >= CHAR_8) break;
 
@@ -5715,4 +5715,10 @@ name)/mark
 "(?1)(?#?'){8}(a)"
 baaaaaaaaac
 
+/((((((((((((x))))))))))))\12/
+xx
+
+/A[\8]B[\9]C/
+A8B9C
+
 # End of testinput1
@@ -4279,4 +4279,15 @@ a random value. /Ix
 /^/gm,alt_circumflex
 \n\n\n
 
+/((((((((x))))))))\81/
+xx1
+
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+xx
+
+/\80/
+
+/A\8B\9C/
+A8B9C
+
 # End of testinput2
@@ -9427,4 +9427,24 @@ No match
 0: aaaaaaaaa
 1: a
 
+/((((((((((((x))))))))))))\12/
+xx
+0: xx
+1: x
+2: x
+3: x
+4: x
+5: x
+6: x
+7: x
+8: x
+9: x
+10: x
+11: x
+12: x
+
+/A[\8]B[\9]C/
+A8B9C
+0: A8B9C
+
 # End of testinput1
@@ -14318,4 +14318,34 @@ No match
 0:
 0:
 
+/((((((((x))))))))\81/
+Failed: error 115 at offset 20: reference to non-existent subpattern
+xx1
+
+/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))\80/
+xx
+Matched, but too many substrings
+0: xx
+1: x
+2: x
+3: x
+4: x
+5: x
+6: x
+7: x
+8: x
+9: x
+10: x
+11: x
+12: x
+13: x
+14: x
+
+/\80/
+Failed: error 115 at offset 3: reference to non-existent subpattern
+
+/A\8B\9C/
+Failed: error 115 at offset 7: reference to non-existent subpattern
+A8B9C
+
 # End of testinput2