Add support for (?^) as now supported by Perl.

Philip.Hazel 2018-07-28 16:23:24 +00:00
parent 27337495dc
commit 6e245572b8
15 changed files with 2281 additions and 2162 deletions

@ -131,6 +131,8 @@ present.
terminated by (*ACCEPT).
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
30. Add support for (?^) for unsetting all imnsx options.
Version 10.31 12-February-2018

@ -1466,7 +1466,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
not match when the current position in the subject is at a newline. This option
is equivalent to Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option.
characters, and the \N escape sequence always matches a non-newline character,
independent of the setting of PCRE2_DOTALL.
<pre>
PCRE2_DUPNAMES
</pre>
@ -3634,7 +3635,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 July 2018
Last updated: 27 July 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

@ -42,13 +42,14 @@ assertion is a condition that has a matching branch (that is, the condition is
false).
</P>
<P>
4. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on its
own, matching a non-newline character, is supported.) In fact these are
4. The following Perl escape sequences are not supported: \F, \l, \L, \u,
\U, and \N when followed by a character name. \N on its own, matching a
non-newline character, and \N{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set,
\U and \u are interpreted as ECMAScript interprets them.
generated by default. However, if the PCRE2_ALT_BSUX option is set, \U and \u
are interpreted as ECMAScript interprets them.
</P>
<P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
@ -61,17 +62,22 @@ internal representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates."
</P>
<P>
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
in between are treated as literals. This is slightly different from Perl in
that $ and @ are also handled as literals inside the quotes. In Perl, they
cause variable interpolation (but of course PCRE2 does not have variables).
Note the following examples:
6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different from
Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
they cause variable interpolation (but of course PCRE2 does not have
variables). Also, Perl does "double-quotish backslash interpolation" on any
backslashes between \Q and \E which, its documentation says, "may lead to
confusing results". PCRE2 treats a backslash between \Q and \E just like any
other character. Note the following examples:
<pre>
Pattern PCRE2 matches Perl matches
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
</P>
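The table above can be reproduced with a short standalone sketch. The function `quoted_literal` is hypothetical and is not part of PCRE2. It assumes the pattern consists only of \Q...\E runs, backslash-escaped punctuation, and ordinary characters, and it writes out the literal text that PCRE2 would match according to the rules described in this section. It does not model Perl's interpolation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Writes to out the literal text
   matched by a pattern built only from \Q...\E runs, backslash-escaped
   punctuation, and ordinary characters. The out buffer must have room for
   strlen(pattern) + 1 bytes. */
void quoted_literal(const char *pattern, char *out)
{
  size_t n, i = 0, o = 0;
  int in_quote = 0;

  assert(pattern != NULL && out != NULL);
  n = strlen(pattern);

  while (i < n)
  {
    char c = pattern[i];
    char next = (i + 1 < n) ? pattern[i + 1] : '\0';

    if (in_quote)
    {
      /* Inside \Q...\E only \E is special; a backslash is just a character. */
      if (c == '\\' && next == 'E') { in_quote = 0; i += 2; }
      else out[o++] = pattern[i++];
    }
    else if (c == '\\' && next == 'Q') { in_quote = 1; i += 2; }
    else if (c == '\\' && next == 'E') i += 2;   /* isolated \E is ignored */
    else if (c == '\\' && next != '\0' && !isalnum((unsigned char)next))
    {
      out[o++] = next;   /* backslash removes any special meaning, e.g. \$ */
      i += 2;
    }
    else out[o++] = pattern[i++];   /* anything else is copied unchanged */
  }
  out[o] = '\0';
}
```

Called with the five patterns in the table, the function produces the strings in the "PCRE2 matches" column.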
@ -229,9 +235,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 18 April 2017
Last updated: 28 July 2018
<br>
Copyright &copy; 1997-2017 University of Cambridge.
Copyright &copy; 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -357,13 +357,18 @@ of the pattern.
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \Q and \E. This is different from Perl in
that $ and @ are handled as literals in \Q...\E sequences in PCRE2, whereas
in Perl, $ and @ cause variable interpolation. Note the following examples:
in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
backslash interpolation" on any backslashes between \Q and \E which, its
documentation says, "may lead to confusing results". PCRE2 treats a backslash
between \Q and \E just like any other character. Note the following examples:
<pre>
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
@ -545,7 +550,7 @@ character class, these sequences have different meanings.
Unsupported escape sequences
</b><br>
<P>
In Perl, the sequences \l, \L, \u, and \U are recognized by its string
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
@ -1635,21 +1640,27 @@ Perl option letters enclosed between "(?" and ")". The option letters are
xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting either
one cancels the effects of both of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
effect.
permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
</P>
<P>
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
appear.
</P>
<P>
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
the same way as the Perl-compatible options by using the characters J and U
respectively.
respectively. However, these are not unset by (?^).
</P>
<P>
When one of these option changes occurs at top level (that is, not inside
@ -3579,7 +3590,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 July 2018
Last updated: 28 July 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

@ -456,7 +456,15 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
(?x) extended: ignore white space except in classes
(?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
(?^) unset imnsx options
</pre>
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but no unsetting) is allowed after (?^ for example
(?^in). An option setting may appear at the start of a non-capturing group, for
example (?i:...).
</P>
<P>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
@ -624,7 +632,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 July 2018
Last updated: 28 July 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

File diff suppressed because it is too large.

@ -1639,19 +1639,24 @@ Perl option letters enclosed between "(?" and ")". The option letters are
xx for PCRE2_EXTENDED_MORE
.sp
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting either
one cancels the effects of both of them.
.P
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
effect.
permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
.P
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
the circumflex to cause some options to be re-instated, but a hyphen may not
appear.
.P
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
the same way as the Perl-compatible options by using the characters J and U
respectively.
respectively. However, these are not unset by (?^).
.P
When one of these option changes occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the pattern

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -431,7 +431,14 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
(?x) extended: ignore white space except in classes
(?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
(?^) unset imnsx options
.sp
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but no unsetting) is allowed after (?^ for example
(?^in). An option setting may appear at the start of a non-capturing group, for
example (?i:...).
.P
The following are recognized only at the very start of a pattern or after one
of the newline or \eR options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
@ -612,6 +619,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 27 July 2018
Last updated: 28 July 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

@ -317,6 +317,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
#define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC 193
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
/* "Expected" matching error codes: no match and partial match. */

@ -263,7 +263,7 @@ versions. */
#define META_SKIP 0x802d0000u /* kept */
#define META_SKIP_ARG 0x802e0000u /* in */
#define META_THEN 0x802f0000u /* this */
#define META_THEN_ARG 0x80300000u /* order */
#define META_THEN_ARG 0x80300000u /* order */
/* These must be kept in groups of adjacent 3 values, and all together. */
@ -330,7 +330,7 @@ static unsigned char meta_extra_lengths[] = {
0, /* META_ACCEPT */
0, /* META_FAIL */
0, /* META_COMMIT */
1, /* META_COMMIT_ARG - plus the string length */
1, /* META_COMMIT_ARG - plus the string length */
0, /* META_PRUNE */
1, /* META_PRUNE_ARG - plus the string length */
0, /* META_SKIP */
@ -612,7 +612,7 @@ static const int verbcount = sizeof(verbs)/sizeof(verbitem);
/* Verb opcodes, indexed by their META code offset from META_MARK. */
static const uint32_t verbops[] = {
OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE,
OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE,
OP_PRUNE_ARG, OP_SKIP, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG };
/* Offsets from OP_STAR for case-independent and negative repeat opcodes. */
@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93 };
ERR91, ERR92, ERR93, ERR94 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1441,41 +1441,41 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
escape = -i; /* Else return a special escape */
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
/* Perl supports \N{name} for character names and \N{U+dddd} for numerical
Unicode code points, as well as plain \N for "not newline". PCRE does not
support \N{name}. However, it does support quantification such as \N{2,3},
support \N{name}. However, it does support quantification such as \N{2,3},
so if \N{ is not followed by U+dddd we check for a quantifier. */
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
{
PCRE2_SPTR p = ptr + 1;
/* \N{U+ can be handled by the \x{ code. However, this construction is
not valid in EBCDIC environments because it specifies a Unicode
character, not a codepoint in the local code. For example \N{U+0041}
/* \N{U+ can be handled by the \x{ code. However, this construction is
not valid in EBCDIC environments because it specifies a Unicode
character, not a codepoint in the local code. For example \N{U+0041}
must be "A" in all environments. */
if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
{
#ifdef EBCDIC
*errorcodeptr = ERR93;
#else
#else
ptr = p + 1;
escape = 0; /* Not a fancy escape after all */
escape = 0; /* Not a fancy escape after all */
goto COME_FROM_NU;
#endif
}
/* Give an error if what follows is not a quantifier, but don't override
#endif
}
/* Give an error if what follows is not a quantifier, but don't override
an error set by the quantifier reader (e.g. number overflow). */
else
{
{
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
*errorcodeptr == 0)
*errorcodeptr = ERR37;
}
}
}
}
}
@ -1762,9 +1762,9 @@ else
{
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
{
#ifndef EBCDIC
COME_FROM_NU:
#endif
#ifndef EBCDIC
COME_FROM_NU:
#endif
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
{
*errorcodeptr = ERR78;
@ -2495,15 +2495,15 @@ while (ptr < ptrend)
goto FAILED;
}
*verblengthptr = (uint32_t)verbnamelength;
/* If this name was on a verb such as (*ACCEPT) which does not continue,
a (*MARK) was generated for the name. We now add the original verb as the
next item. */
a (*MARK) was generated for the name. We now add the original verb as the
next item. */
if (add_after_mark != 0)
{
*parsed_pattern++ = add_after_mark;
add_after_mark = 0;
add_after_mark = 0;
}
break;
@ -3498,22 +3498,22 @@ while (ptr < ptrend)
if (*ptr++ == CHAR_COLON) /* Skip past : or ) */
{
/* Some optional arguments can be treated as a preceding (*MARK) */
if (verbs[i].has_arg < 0)
{
add_after_mark = verbs[i].meta;
*parsed_pattern++ = META_MARK;
*parsed_pattern++ = META_MARK;
}
/* The remaining verbs with arguments (except *MARK) need a different
opcode. */
else
{
{
*parsed_pattern++ = verbs[i].meta +
((verbs[i].meta != META_MARK)? 0x00010000u:0);
}
}
/* Set up for reading the name in the main loop. */
verblengthptr = parsed_pattern++;
@ -3576,17 +3576,37 @@ while (ptr < ptrend)
else
{
BOOL hyphenok = TRUE;
top_nest->reset_group = 0;
top_nest->max_group = 0;
set = unset = 0;
optset = &set;
/* ^ at the start unsets imnsx and disables the subsequent use of - */
if (ptr < ptrend && *ptr == CHAR_CIRCUMFLEX_ACCENT)
{
options &= ~(PCRE2_CASELESS|PCRE2_MULTILINE|PCRE2_NO_AUTO_CAPTURE|
PCRE2_DOTALL|PCRE2_EXTENDED|PCRE2_EXTENDED_MORE);
hyphenok = FALSE;
ptr++;
}
while (ptr < ptrend && *ptr != CHAR_RIGHT_PARENTHESIS &&
*ptr != CHAR_COLON)
{
switch (*ptr++)
{
case CHAR_MINUS: optset = &unset; break;
case CHAR_MINUS:
if (!hyphenok)
{
errorcode = ERR94;
ptr--; /* Correct the offset */
goto FAILED;
}
optset = &unset;
hyphenok = FALSE;
break;
case CHAR_J: /* Record that it changed in the external options */
*optset |= PCRE2_DUPNAMES;
@ -3644,9 +3664,10 @@ while (ptr < ptrend)
}
else *parsed_pattern++ = META_NOCAPTURE;
/* If nothing changed, no need to record. */
/* If nothing changed, no need to record. The check of hyphenok catches
the (?^) case. */
if (set != 0 || unset != 0)
if (set != 0 || unset != 0 || !hyphenok)
{
*parsed_pattern++ = META_OPTIONS;
*parsed_pattern++ = options;
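The option-parsing hunk above can be read as the following standalone sketch. It is an illustration under stated assumptions, not the PCRE2 implementation: the `OPT_*` bits, `ERR_INVALID_HYPHEN`, and `parse_option_setting` are hypothetical names, and only the rules visible in this commit are modelled (a leading circumflex unsets imnsx and forbids a hyphen, at most one hyphen is allowed, and unsetting x or xx unsets both).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bits for this sketch; PCRE2 itself uses PCRE2_CASELESS,
   PCRE2_MULTILINE, and so on. */
#define OPT_I  0x01u
#define OPT_M  0x02u
#define OPT_N  0x04u
#define OPT_S  0x08u
#define OPT_X  0x10u
#define OPT_XX 0x20u
#define OPT_J  0x40u
#define OPT_U  0x80u
#define OPT_IMNSX (OPT_I|OPT_M|OPT_N|OPT_S|OPT_X|OPT_XX)

/* Same number as PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS in this commit. */
#define ERR_INVALID_HYPHEN 194

/* Parses an option setting starting at "(?". Returns 0 and updates *options
   on success, ERR_INVALID_HYPHEN with *erroroffset set to the offending
   hyphen, or -1 for input this sketch does not handle. */
int parse_option_setting(const char *pattern, unsigned *options,
  size_t *erroroffset)
{
  unsigned set = 0, unset = 0, result;
  unsigned *optset = &set;
  int hyphenok = 1;
  size_t i = 2;

  assert(pattern[0] == '(' && pattern[1] == '?');
  result = *options;

  /* ^ at the start unsets imnsx and disables the subsequent use of - */
  if (pattern[i] == '^')
  {
    result &= ~OPT_IMNSX;
    hyphenok = 0;
    i++;
  }

  while (pattern[i] != '\0' && pattern[i] != ')' && pattern[i] != ':')
  {
    switch (pattern[i++])
    {
      case '-':
        if (!hyphenok)
        {
          *erroroffset = i - 1;   /* point at the hyphen itself */
          return ERR_INVALID_HYPHEN;
        }
        optset = &unset;
        hyphenok = 0;
        break;
      case 'i': *optset |= OPT_I; break;
      case 'm': *optset |= OPT_M; break;
      case 'n': *optset |= OPT_N; break;
      case 's': *optset |= OPT_S; break;
      case 'J': *optset |= OPT_J; break;
      case 'U': *optset |= OPT_U; break;
      case 'x':
        *optset |= OPT_X;
        if (pattern[i] == 'x') { *optset |= OPT_XX; i++; }
        break;
      default: return -1;
    }
  }
  if (pattern[i] == '\0') return -1;   /* unterminated setting */

  /* Unsetting either x or xx cancels both. */
  if ((unset & (OPT_X|OPT_XX)) != 0) unset |= OPT_X|OPT_XX;

  *options = (result | set) & ~unset;
  return 0;
}
```

Under these assumptions, `(?^)` applied to options i, s, and J leaves only J set, and the three failing patterns added to testdata/testinput2 report the hyphen at offsets 4, 3, and 5.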
@ -3952,7 +3973,7 @@ while (ptr < ptrend)
{
if (++ptr >= ptrend || !IS_DIGIT(*ptr)) goto BAD_VERSION_CONDITION;
minor = (*ptr++ - CHAR_0) * 10;
if (IS_DIGIT(*ptr)) minor += *ptr++ - CHAR_0;
if (IS_DIGIT(*ptr)) minor += *ptr++ - CHAR_0;
if (ptr >= ptrend || *ptr != CHAR_RIGHT_PARENTHESIS)
goto BAD_VERSION_CONDITION;
}
@ -5709,7 +5730,7 @@ for (;; pptr++)
cb->had_pruneorskip = TRUE;
/* Fall through */
case META_MARK:
case META_COMMIT_ARG:
case META_COMMIT_ARG:
VERB_ARG:
*code++ = verbops[(meta - META_MARK) >> 16];
/* The length is in characters. */
@ -8058,7 +8079,7 @@ for (;;)
break;
case OP_MARK:
case OP_COMMIT_ARG:
case OP_COMMIT_ARG:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
@ -8367,7 +8388,7 @@ for (;; pptr++)
break;
case META_MARK: /* Add the length of the name. */
case META_COMMIT_ARG:
case META_COMMIT_ARG:
case META_PRUNE_ARG:
case META_SKIP_ARG:
case META_THEN_ARG:
@ -8558,7 +8579,7 @@ for (;; pptr++)
goto EXIT;
case META_MARK:
case META_COMMIT_ARG:
case META_COMMIT_ARG:
case META_PRUNE_ARG:
case META_SKIP_ARG:
case META_THEN_ARG:
@ -8630,31 +8651,31 @@ for (;; pptr++)
case META_LOOKAHEADNOT:
pptr = parsed_skip(pptr + 1, PSKIP_KET);
if (pptr == NULL) goto PARSED_SKIP_FAILED;
/* Also ignore any qualifiers that follow a lookahead assertion. */
switch (pptr[1])
{
case META_ASTERISK:
case META_ASTERISK_PLUS:
case META_ASTERISK_QUERY:
case META_ASTERISK_QUERY:
case META_PLUS:
case META_PLUS_PLUS:
case META_PLUS_PLUS:
case META_PLUS_QUERY:
case META_QUERY:
case META_QUERY_PLUS:
case META_QUERY_QUERY:
case META_QUERY_QUERY:
pptr++;
break;
case META_MINMAX:
case META_MINMAX_PLUS:
case META_MINMAX_QUERY:
pptr += 3;
break;
default:
break;
break;
}
break;
@ -9026,7 +9047,7 @@ for (pptr = cb->parsed_pattern; *pptr != META_END; pptr++)
break;
case META_MARK:
case META_COMMIT_ARG:
case META_COMMIT_ARG:
case META_PRUNE_ARG:
case META_SKIP_ARG:
case META_THEN_ARG:

@ -179,7 +179,8 @@ static const unsigned char compile_error_texts[] =
"internal error: bad code value in parsed_skip()\0"
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
"invalid option bits with PCRE2_LITERAL\0"
"\\N{U+dddd} is not supported in EBCDIC mode\0"
"\\N{U+dddd} is not supported in EBCDIC mode\0"
"invalid hyphen in option setting\0"
;
/* Match-time and UTF error texts are in the same format. */

testdata/testinput1 vendored

@ -6252,4 +6252,10 @@ ef) x/x,mark
/(*COMMIT:]w)/
/(?i)A(?^)B(?^x:C D)(?^i)e f/
aBCDE F
\= Expect no match
aBCDEF
AbCDe f
# End of testinput1

testdata/testinput2 vendored

@ -5453,4 +5453,10 @@ a)"xI
\= Expect no match
axy
/(?^x-i)AB/
/(?^-i)AB/
/(?x-i-i)/
# End of testinput2

@ -9912,4 +9912,13 @@ No match, mark = X
/(*COMMIT:]w)/
/(?i)A(?^)B(?^x:C D)(?^i)e f/
aBCDE F
0: aBCDE F
\= Expect no match
aBCDEF
No match
AbCDe f
No match
# End of testinput1
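To see why the new test gives these results, the pattern /(?i)A(?^)B(?^x:C D)(?^i)e f/ can be expanded by hand into literal runs, each with the case setting in force at that point. The sketch below is a hypothetical illustration that does not call PCRE2. The `segments` table encodes one reading of the pattern and is an assumption of this example: the group-local (?^x:...) does not affect what follows it, and the space in "e f" is literal because x is not set there.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One literal run of the pattern and whether it is matched caselessly. */
typedef struct { const char *text; int caseless; } segment;

/* Hand-expanded form of /(?i)A(?^)B(?^x:C D)(?^i)e f/ (an assumption of
   this sketch, not output from PCRE2). */
static const segment segments[] = {
  { "A",   1 },   /* (?i) is in force */
  { "B",   0 },   /* (?^) has unset i */
  { "CD",  0 },   /* (?^x:C D): only x is set, so the space is ignored */
  { "e f", 1 }    /* (?^i): caseless, x unset, so the space is literal */
};

/* Returns 1 if all segments match consecutively starting at s. */
static int match_here(const char *s)
{
  size_t k;
  for (k = 0; k < sizeof(segments)/sizeof(segments[0]); k++)
  {
    const char *t;
    for (t = segments[k].text; *t != '\0'; t++, s++)
    {
      int same;
      if (*s == '\0') return 0;   /* subject ended too early */
      same = segments[k].caseless
        ? tolower((unsigned char)*s) == tolower((unsigned char)*t)
        : *s == *t;
      if (!same) return 0;
    }
  }
  return 1;
}

/* Unanchored search: tries every starting position in the subject. */
int matches_anywhere(const char *subject)
{
  const char *p;
  assert(subject != NULL);
  for (p = subject; *p != '\0'; p++)
    if (match_here(p)) return 1;
  return 0;
}
```

With this expansion, "aBCDE F" matches, "aBCDEF" fails because the literal space is missing, and "AbCDe f" fails because B is compared case-sensitively, which agrees with the expected output above.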

@ -16622,6 +16622,15 @@ No match, mark = X
axy
No match, mark = X
/(?^x-i)AB/
Failed: error 194 at offset 4: invalid hyphen in option setting
/(?^-i)AB/
Failed: error 194 at offset 3: invalid hyphen in option setting
/(?x-i-i)/
Failed: error 194 at offset 5: invalid hyphen in option setting
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data