Implement PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL.

Philip.Hazel 2017-06-01 18:10:15 +00:00
parent c0902e176f
commit e3a0f22349
16 changed files with 206 additions and 50 deletions


@ -31,6 +31,7 @@ housed in a compile context. It completely replaces all the bits. The extra
options are:
<pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
</pre>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>


@ -1706,6 +1706,24 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
<pre>
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
</pre>
This is a dangerous option. Use with care. By default, an unrecognized escape
such as \j or a malformed one such as \x{2z} causes a compile-time error when
detected by <b>pcre2_compile()</b>. Perl is somewhat inconsistent in handling
such items: for example, \j is treated as a literal "j", and non-hexadecimal
digits in \x{} are just ignored, though warnings are given in both cases if
Perl's warning switch is enabled. However, a malformed octal number after \o{
always causes an error in Perl.
</P>
<P>
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
<b>pcre2_compile()</b>, all unrecognized or erroneous escape sequences are
treated as single-character escapes. For example, \j is a literal "j" and
\x{2z} is treated as the literal string "x{2z}". Setting this option means
that typos in patterns may go undetected and have unexpected results. This is a
dangerous option. Use with care.
</P>
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
<P>
@ -3471,7 +3489,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 May 2017
Last updated: 01 June 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>


@ -577,6 +577,7 @@ for a description of the effects of these options.
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
@ -1816,7 +1817,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 26 May 2017
Last updated: 01 June 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>


@ -1688,6 +1688,24 @@ COMPILING A PATTERN
only match subject characters if the matching function is called with
PCRE2_NO_UTF_CHECK set.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
This is a dangerous option. Use with care. By default, an unrecognized
escape such as \j or a malformed one such as \x{2z} causes a compile-
time error when detected by pcre2_compile(). Perl is somewhat inconsis-
tent in handling such items: for example, \j is treated as a literal
"j", and non-hexadecimal digits in \x{} are just ignored, though warn-
ings are given in both cases if Perl's warning switch is enabled. How-
ever, a malformed octal number after \o{ always causes an error in
Perl.
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
pcre2_compile(), all unrecognized or erroneous escape sequences are
treated as single-character escapes. For example, \j is a literal "j"
and \x{2z} is treated as the literal string "x{2z}". Setting this
option means that typos in patterns may go undetected and have unex-
pected results. This is a dangerous option. Use with care.
COMPILATION ERROR CODES
@ -3350,7 +3368,7 @@ AUTHOR
REVISION
Last updated: 30 May 2017
Last updated: 01 June 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "17 May 2017" "PCRE2 10.30"
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -21,6 +21,9 @@ options are:
.\" JOIN
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \ex{d800} to \ex{dfff}
in UTF-8 and UTF-32 modes
.\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character
.sp
There is a complete description of the PCRE2 native API in the
.\" HREF


@ -1,4 +1,4 @@
.TH PCRE2API 3 "30 May 2017" "PCRE2 10.30"
.TH PCRE2API 3 "01 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1661,6 +1661,23 @@ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
.sp
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
.sp
This is a dangerous option. Use with care. By default, an unrecognized escape
such as \ej or a malformed one such as \ex{2z} causes a compile-time error when
detected by \fBpcre2_compile()\fP. Perl is somewhat inconsistent in handling
such items: for example, \ej is treated as a literal "j", and non-hexadecimal
digits in \ex{} are just ignored, though warnings are given in both cases if
Perl's warning switch is enabled. However, a malformed octal number after \eo{
always causes an error in Perl.
.P
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
\fBpcre2_compile()\fP, all unrecognized or erroneous escape sequences are
treated as single-character escapes. For example, \ej is a literal "j" and
\ex{2z} is treated as the literal string "x{2z}". Setting this option means
that typos in patterns may go undetected and have unexpected results. This is a
dangerous option. Use with care.
.
.
.SH "COMPILATION ERROR CODES"
@ -3491,6 +3508,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 30 May 2017
Last updated: 01 June 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "26 May 2017" "PCRE 10.30"
.TH PCRE2TEST 1 "01 June 2017" "PCRE 10.30"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -539,6 +539,7 @@ for a description of the effects of these options.
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
@ -1792,6 +1793,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 26 May 2017
Last updated: 01 June 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi


@ -521,6 +521,7 @@ PATTERN MODIFIERS
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
@ -1650,5 +1651,5 @@ AUTHOR
REVISION
Last updated: 26 May 2017
Last updated: 01 June 2017
Copyright (c) 1997-2017 University of Cambridge.


@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
/* An additional compile options word is available in the compile context. */
#define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u /* C */
#define PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL 0x00000002u /* C */
/* These are for pcre2_jit_compile(). */


@ -142,6 +142,7 @@ D is inspected during pcre2_dfa_match() execution
/* An additional compile options word is available in the compile context. */
#define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u /* C */
#define PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL 0x00000002u /* C */
/* These are for pcre2_jit_compile(). */


@ -2591,11 +2591,23 @@ while (ptr < ptrend)
/* ---- Escape sequence ---- */
case CHAR_BACKSLASH:
tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options,
FALSE, cb);
if (errorcode != 0) goto FAILED;
if (errorcode != 0)
{
ESCAPE_FAILED:
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED;
ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else
{
GETCHARINCTEST(c, ptr); /* Get character value, increment pointer */
}
escape = 0; /* Treat as literal character */
}
/* The escape was a data character. */
/* The escape was a data escape or literal character. */
if (escape == 0)
{
@ -2647,12 +2659,12 @@ while (ptr < ptrend)
case ESC_C:
#ifdef NEVER_BACKSLASH_C
errorcode = ERR85;
goto FAILED;
goto ESCAPE_FAILED;
#else
if ((options & PCRE2_NEVER_BACKSLASH_C) != 0)
{
errorcode = ERR83;
goto FAILED;
goto ESCAPE_FAILED;
}
#endif
okquantifier = TRUE;
@ -2662,7 +2674,7 @@ while (ptr < ptrend)
case ESC_X:
#ifndef SUPPORT_UNICODE
errorcode = ERR45; /* Supported only with Unicode support */
goto FAILED;
goto ESCAPE_FAILED;
#endif
case ESC_H:
case ESC_h:
@ -2727,7 +2739,7 @@ while (ptr < ptrend)
BOOL negated;
uint16_t ptype = 0, pdata = 0;
if (!get_ucp(&ptr, &negated, &ptype, &pdata, &errorcode, cb))
goto FAILED;
goto ESCAPE_FAILED;
if (negated) escape = (escape == ESC_P)? ESC_p : ESC_P;
*parsed_pattern++ = META_ESCAPE + escape;
*parsed_pattern++ = (ptype << 16) | pdata;
@ -2735,7 +2747,7 @@ while (ptr < ptrend)
}
#else
errorcode = ERR45;
goto FAILED;
goto ESCAPE_FAILED;
#endif
break; /* End \P and \p */
@ -2751,7 +2763,7 @@ while (ptr < ptrend)
*ptr != CHAR_LESS_THAN_SIGN && *ptr != CHAR_APOSTROPHE))
{
errorcode = (escape == ESC_g)? ERR57 : ERR69;
goto FAILED;
goto ESCAPE_FAILED;
}
terminator = (*ptr == CHAR_LESS_THAN_SIGN)?
CHAR_GREATER_THAN_SIGN : (*ptr == CHAR_APOSTROPHE)?
@ -2769,18 +2781,18 @@ while (ptr < ptrend)
if (p >= ptrend || *p != terminator)
{
errorcode = ERR57;
goto FAILED;
goto ESCAPE_FAILED;
}
ptr = p;
goto SET_RECURSION;
}
if (errorcode != 0) goto FAILED;
if (errorcode != 0) goto ESCAPE_FAILED;
}
/* Not a numerical recursion */
if (!read_name(&ptr, ptrend, terminator, &offset, &name, &namelen,
&errorcode, cb)) goto FAILED;
&errorcode, cb)) goto ESCAPE_FAILED;
/* \k and \g when used with braces are back references, whereas \g used
with quotes or angle brackets is a recursion */
@ -2792,7 +2804,7 @@ while (ptr < ptrend)
PUTOFFSET(offset, parsed_pattern);
okquantifier = TRUE;
break;
break; /* End special escape processing */
}
break; /* End escape sequence processing */
@ -3139,10 +3151,23 @@ while (ptr < ptrend)
else
{
tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
options, TRUE, cb);
if (errorcode != 0) goto FAILED;
if (errorcode != 0)
{
CLASS_ESCAPE_FAILED:
if ((cb->cx->extra_options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0)
goto FAILED;
ptr = tempptr;
if (ptr >= ptrend) c = CHAR_BACKSLASH; else
{
GETCHARINCTEST(c, ptr); /* Get character value, increment pointer */
}
escape = 0; /* Treat as literal character */
}
if (escape == 0) /* Escaped character code point is in c */
{
char_is_literal = FALSE;
@ -3176,7 +3201,7 @@ while (ptr < ptrend)
if (class_range_state == RANGE_STARTED)
{
errorcode = ERR50;
goto FAILED;
goto CLASS_ESCAPE_FAILED;
}
/* Of the remaining escapes, only those that define characters are
@ -3187,7 +3212,7 @@ while (ptr < ptrend)
{
case ESC_N:
errorcode = ERR71; /* Not supported in a class */
goto FAILED;
goto CLASS_ESCAPE_FAILED;
case ESC_H:
case ESC_h:
@ -3250,13 +3275,14 @@ while (ptr < ptrend)
}
#else
errorcode = ERR45;
goto FAILED;
goto CLASS_ESCAPE_FAILED;
#endif
break; /* End \P and \p */
default: /* All others are not allowed in a class */
errorcode = ERR7;
goto FAILED_BACK;
ptr--;
goto CLASS_ESCAPE_FAILED;
}
}
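
The recovery path at ESCAPE_FAILED and CLASS_ESCAPE_FAILED can be illustrated with a small standalone model. This is a sketch, not PCRE2 code: toy_check_escape and toy_expand are hypothetical names, and the toy parser accepts only \n and \t. It shows the same three steps as the hunks above: remember the position after the backslash, rewind to it when the escape is rejected, then take the next character as a literal (or the backslash itself at the end of the pattern).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the escape checker: *pp points just past the backslash.
   Only \n and \t are recognized; anything else is reported as an error. */
bool toy_check_escape(const char **pp, const char *end, char *out)
{
  const char *p = *pp;
  if (p >= end) return false;          /* backslash at end of pattern */
  switch (*p)
  {
    case 'n': *out = '\n'; break;
    case 't': *out = '\t'; break;
    default: return false;             /* unrecognized escape */
  }
  *pp = p + 1;
  return true;
}

/* Copy pattern to out, resolving escapes. Returns the number of characters
   written, or -1 on a bad escape (when bad_is_literal is false) or if out
   is too small. */
int toy_expand(const char *pattern, bool bad_is_literal,
               char *out, size_t outsize)
{
  const char *ptr = pattern;
  const char *end = pattern + strlen(pattern);
  size_t n = 0;

  while (ptr < end)
  {
    char c = *ptr++;
    if (c == '\\')
    {
      const char *tempptr = ptr;       /* position just after the backslash */
      if (!toy_check_escape(&ptr, end, &c))
      {
        if (!bad_is_literal) return -1;
        ptr = tempptr;                 /* undo any partial parsing */
        if (ptr >= end) c = '\\';      /* lone trailing backslash */
        else c = *ptr++;               /* next character becomes a literal */
      }
    }
    if (n + 1 >= outsize) return -1;
    out[n++] = c;
  }
  out[n] = '\0';
  return (int)n;
}
```

In this model, the input \j\x{2z} expands to jx{2z} when bad_is_literal is true, which mirrors the documented behaviour that \x{2z} is treated as the literal string "x{2z}".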


@ -590,6 +590,7 @@ static modstruct modlist[] = {
{ "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) },
{ "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) },
{ "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) },
{ "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
{ "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) },
{ "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) },
{ "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) },
@ -4077,9 +4078,10 @@ show_compile_extra_options(uint32_t options, const char *before,
const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s",
else fprintf(outfile, "%s%s%s%s",
before,
((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "",
((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
after);
}

13
testdata/testinput2 vendored

@ -5279,4 +5279,17 @@ a)"xI
/(a)(?-n:(b))(c)/nB
# ----------------------------------------------------------------------
# These test the dangerous PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option.
/\j\x{z}\o{82}\L\uabcd\u\U\g{\g/B,\bad_escape_is_literal
/\N{\c/B,bad_escape_is_literal
/[\j\x{z}\o\gA-\Nb-\g]/B,bad_escape_is_literal
/[Q-\N]/B,bad_escape_is_literal
# ----------------------------------------------------------------------
# End of testinput2

9
testdata/testinput5 vendored

@ -2015,6 +2015,13 @@
\= Expect no match
X$
# ---------------------------------------------------------------------------
# ----------------------------------------------------------------------
# These test the dangerous PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option.
/\x{d800}/B,utf,bad_escape_is_literal
/\ud800/B,utf,alt_bsux,bad_escape_is_literal
# ----------------------------------------------------------------------
# End of testinput5

27
testdata/testoutput2 vendored

@ -15988,6 +15988,33 @@ Subject length lower bound = 1
End
------------------------------------------------------------------
# ----------------------------------------------------------------------
# These test the dangerous PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option.
/\j\x{z}\o{82}\L\uabcd\u\U\g{\g/B,\bad_escape_is_literal
** Unrecognized modifier '\' in '\bad_escape_is_literal'
/\N{\c/B,bad_escape_is_literal
------------------------------------------------------------------
Bra
N{c
Ket
End
------------------------------------------------------------------
/[\j\x{z}\o\gA-\Nb-\g]/B,bad_escape_is_literal
------------------------------------------------------------------
Bra
[A-Nb-gjoxz{}]
Ket
End
------------------------------------------------------------------
/[Q-\N]/B,bad_escape_is_literal
Failed: error 108 at offset 4: range out of order in character class
# ----------------------------------------------------------------------
# End of testinput2
Error -65: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data

21
testdata/testoutput5 vendored

@ -4579,6 +4579,25 @@ No match
X$
No match
# ---------------------------------------------------------------------------
# ----------------------------------------------------------------------
# These test the dangerous PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option.
/\x{d800}/B,utf,bad_escape_is_literal
------------------------------------------------------------------
Bra
x{d800}
Ket
End
------------------------------------------------------------------
/\ud800/B,utf,alt_bsux,bad_escape_is_literal
------------------------------------------------------------------
Bra
ud800
Ket
End
------------------------------------------------------------------
# ----------------------------------------------------------------------
# End of testinput5