Implement Perl 5.28's alphabetic lookaround syntax, e.g. (*pla:...), and also (*atomic:...).
Philip.Hazel 2018-09-24 16:23:53 +00:00
parent 69254c77f1
commit f26b0b0bae
21 changed files with 1218 additions and 734 deletions

@ -5,8 +5,8 @@ Change Log for PCRE2
Version 10.33-RC1 15-September-2018
-----------------------------------
1. Added "allvector" to pcre2test to make it easy to check the part of the
ovector that shouldn't be changed, in particular after substitute and failed or
1. Added "allvector" to pcre2test to make it easy to check the part of the
ovector that shouldn't be changed, in particular after substitute and failed or
partial matches.
2. Fix subject buffer overread in JIT when UTF is disabled and \X or \R has
@ -15,13 +15,21 @@ a greater than 1 fixed quantifier. This issue was found by Yunho Kim.
3. Added support for callouts from pcre2_substitute().
4. The POSIX functions are now all called pcre2_regcomp() etc., with wrappers
that use the standard POSIX names. This should help avoid linking with the
that use the standard POSIX names. This should help avoid linking with the
wrong library in some environments.
5. Fix an xclass matching issue in JIT.
6. Implement PCRE2_EXTRA_ESCAPED_CR_IS_LF (see Bugzilla 2315).
7. Implement the Perl 5.28 experimental alphabetic names for atomic groups and
lookaround assertions, for example, (*pla:...) and (*atomic:...). These are
characterized by a lower case letter following (* and to simplify coding for
this, the character tables created by pcre2_maketables() were updated to add a
new "is lower case letter" bit. At the same time, the now unused "is
hexadecimal digit" bit was removed. The default tables in
src/pcre2_chartables.c.dist are updated.
Version 10.32 10-September-2018
-------------------------------

@ -2120,6 +2120,11 @@ special parenthesis, starting with (?> as in this example:
<pre>
(?&#62;\d+)foo
</pre>
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
<pre>
(*atomic:\d+)foo
</pre>
This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
@ -2342,11 +2347,17 @@ coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<P>
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must
succeed for matching to continue) or negative (must not succeed for matching to
continue). An assertion subpattern is matched in the normal way, except that,
when matching continues after a successful assertion, the matching position in
the subject string is as it was before the assertion was processed.
that look behind it, and in each case an assertion may be positive (must match
for the assertion to be true) or negative (must not match for the assertion to
be true). An assertion subpattern is matched in the normal way, and if it is
true, matching continues after it, but with the matching position in the
subject string as it was before the assertion was processed.
</P>
<P>
A lookaround assertion may also appear as the condition in a
<a href="#conditions">conditional subpattern</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P>
<P>
Assertion subpatterns are not capturing subpatterns. If an assertion contains
@ -2359,7 +2370,7 @@ adjacent characters are the same.
<P>
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
@ -2368,7 +2379,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a
assertion is not true. If such an assertion is being used as a condition in a
<a href="#conditions">conditional subpattern</a>
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions,
@ -2398,6 +2409,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching.
</P>
<br><b>
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
<pre>
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?&#60;=
(*negative_lookbehind: or (*nlb: is the same as (?&#60;!
</pre>
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
</P>
<br><b>
Lookahead assertions
</b><br>
<P>
@ -3630,7 +3660,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 September 2018
Last updated: 24 September 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>
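
The synonym list in the "Alphabetic assertion names" section of this file can be pictured as a small lookup table. The following is a minimal sketch in C, not code from PCRE2: the type `alpha_synonym` and the function `alpha_to_symbolic()` are hypothetical names introduced only for illustration, and the `atomic` entry comes from the atomic group hunk earlier in the same file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration only; this table is not part of PCRE2. */
typedef struct {
  const char *alpha;     /* name written after "(*" and before ":" */
  const char *symbolic;  /* traditional opener with the same meaning */
} alpha_synonym;

static const alpha_synonym alpha_synonyms[] = {
  { "pla", "(?=" },  { "positive_lookahead",  "(?=" },
  { "nla", "(?!" },  { "negative_lookahead",  "(?!" },
  { "plb", "(?<=" }, { "positive_lookbehind", "(?<=" },
  { "nlb", "(?<!" }, { "negative_lookbehind", "(?<!" },
  { "atomic", "(?>" }
};

/* Return the symbolic opener for an alphabetic name, or NULL if the name is
   unknown. The comparison is case-sensitive, so "PLA" is rejected, matching
   the documented rule that the names must be written in lower case. */
const char *alpha_to_symbolic(const char *name)
{
  size_t i;
  assert(name != NULL);
  for (i = 0; i < sizeof(alpha_synonyms) / sizeof(alpha_synonyms[0]); i++)
    {
    if (strcmp(name, alpha_synonyms[i].alpha) == 0)
      return alpha_synonyms[i].symbolic;
    }
  return NULL;
}
```

With this sketch, `alpha_to_symbolic("pla")` yields `(?=`, which corresponds to the documented equivalence of (*pla:foo) and (?=foo).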

@ -436,6 +436,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
<P>
<pre>
(?&#62;...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
@ -514,12 +515,23 @@ setting with a similar syntax.
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) positive look ahead
(?!...) negative look ahead
(?&#60;=...) positive look behind
(?&#60;!...) negative look behind
(?=...) )
(*pla:...) ) positive lookahead
(*positive_lookahead:...) )
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
(?&#60;=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
(?&#60;!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
</pre>
Each top-level branch of a look behind must be of a fixed length.
Each top-level branch of a lookbehind must be of a fixed length.
</P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
@ -634,7 +646,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 September 2018
Last updated: 24 September 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

File diff suppressed because it is too large.

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "21 September 2018" "PCRE2 10.33"
.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2124,6 +2124,11 @@ special parenthesis, starting with (?> as in this example:
.sp
(?>\ed+)foo
.sp
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
.sp
(*atomic:\ed+)foo
.sp
This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
@ -2351,11 +2356,19 @@ above.
.P
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must
succeed for matching to continue) or negative (must not succeed for matching to
continue). An assertion subpattern is matched in the normal way, except that,
when matching continues after a successful assertion, the matching position in
the subject string is as it was before the assertion was processed.
that look behind it, and in each case an assertion may be positive (must match
for the assertion to be true) or negative (must not match for the assertion to
be true). An assertion subpattern is matched in the normal way, and if it is
true, matching continues after it, but with the matching position in the
subject string as it was before the assertion was processed.
.P
A lookaround assertion may also appear as the condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional subpattern
.\"
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
.P
Assertion subpatterns are not capturing subpatterns. If an assertion contains
capturing subpatterns within it, these are counted for the purposes of
@ -2366,7 +2379,7 @@ adjacent characters are the same.
.P
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
@ -2374,7 +2387,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a
assertion is not true. If such an assertion is being used as a condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional subpattern
@ -2406,6 +2419,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching.
.
.
.SS "Alphabetic assertion names"
.rs
.sp
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
.sp
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?<=
(*negative_lookbehind: or (*nlb: is the same as (?<!
.sp
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
.
.
.SS "Lookahead assertions"
.rs
.sp
@ -3660,6 +3692,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "02 September 2018" "PCRE2 10.32"
.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -411,6 +411,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
.rs
.sp
(?>...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
.
.
.SH "COMMENT"
@ -491,12 +492,23 @@ setting with a similar syntax.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind
(?=...) )
(*pla:...) ) positive lookahead
(*positive_lookahead:...) )
.sp
Each top-level branch of a look behind must be of a fixed length.
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
.sp
(?<=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
.sp
(?<!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
.sp
Each top-level branch of a lookbehind must be of a fixed length.
.
.
.SH "BACKREFERENCES"
@ -621,6 +633,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

@ -103,19 +103,22 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,128,
255,255,255,255,0,0,0,0,
0,0,0,0,0,0,0,0,
/* Fiddled by hand when the table bits changed. May be broken! */
128,0,0,0,0,0,0,0,
0,1,1,0,1,1,0,0,
0,1,1,1,1,1,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
1,0,0,0,128,0,0,0,
128,128,128,128,0,0,128,0,
28,28,28,28,28,28,28,28,
28,28,0,0,0,0,0,128,
0,26,26,26,26,26,26,18,
24,24,24,24,24,24,24,24,
24,24,0,0,0,0,0,128,
0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,128,128,0,128,16,
0,26,26,26,26,26,26,18,
0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,128,128,0,0,0,
@ -125,8 +128,8 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,
0,0,18,0,0,0,0,0,
0,0,20,20,0,18,0,0,
0,20,18,0,0,0,0,0,
0,0,24,24,0,18,0,0,
0,24,18,0,0,0,0,0,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,0,

@ -75,6 +75,10 @@ fi
(echo "$prefix" ; cat <<'PERLEND'
# The alpha assertions currently give warnings even when -w is not specified.
no warnings "experimental::alpha_assertions";
# Function for turning a string into a string of printing chars.
sub pchars {
@ -129,6 +133,9 @@ else { $outfile = "STDOUT"; }
printf($outfile "Perl $] Regular Expressions\n\n");
$extra_modifiers = "";
$default_show_mark = 0;
# Main loop
NEXT_RE:
@ -370,7 +377,10 @@ for (;;)
}
}
# printf $outfile "\n";
# By closing OUTFILE explicitly, we avoid a Perl warning in -w mode
# 'Name "main::OUTFILE" used only once'.
close(OUTFILE) if $outfile eq "OUTFILE";
PERLEND
) | $perl $perlarg - $@

@ -183,10 +183,10 @@ fprintf(f,
"/* This table identifies various classes of character by individual bits:\n"
" 0x%02x white space character\n"
" 0x%02x letter\n"
" 0x%02x lower case letter\n"
" 0x%02x decimal digit\n"
" 0x%02x hexadecimal digit\n"
" 0x%02x alphanumeric or '_'\n*/\n\n",
ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word);
ctype_space, ctype_letter, ctype_lcletter, ctype_digit, ctype_word);
fprintf(f, " ");
for (i = 0; i < 256; i++)

@ -320,6 +320,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
#define PCRE2_ERROR_SUPPORTED_ONLY_IN_UNICODE 193
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
/* "Expected" matching error codes: no match and partial match. */

@ -157,8 +157,8 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
/* This table identifies various classes of character by individual bits:
0x01 white space character
0x02 letter
0x04 decimal digit
0x08 hexadecimal digit
0x04 lower case letter
0x08 decimal digit
0x10 alphanumeric or '_'
*/
@ -168,16 +168,16 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */
0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* - ' */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* ( - / */
0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c, /* 0 - 7 */
0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* @ - G */
0x18,0x18,0x18,0x18,0x18,0x18,0x18,0x18, /* 0 - 7 */
0x18,0x18,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */
0x00,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* @ - G */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x10, /* X - _ */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* ` - g */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* h - o */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* p - w */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x00, /* x -127 */
0x00,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* ` - g */
0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* h - o */
0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* p - w */
0x16,0x16,0x16,0x00,0x00,0x00,0x00,0x00, /* x -127 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */

@ -615,6 +615,46 @@ static const uint32_t verbops[] = {
OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE,
OP_PRUNE_ARG, OP_SKIP, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG };
/* Table of "alpha assertions" like (*pla:...), similar to the (*VERB) table. */
typedef struct alasitem {
unsigned int len; /* Length of name */
uint32_t meta; /* Base META_ code */
} alasitem;
static const char alasnames[] =
STRING_pla0
STRING_plb0
STRING_nla0
STRING_nlb0
STRING_positive_lookahead0
STRING_positive_lookbehind0
STRING_negative_lookahead0
STRING_negative_lookbehind0
STRING_atomic0
STRING_sr0
STRING_asr0
STRING_script_run0
STRING_atomic_script_run;
static const alasitem alasmeta[] = {
{ 3, META_LOOKAHEAD },
{ 3, META_LOOKBEHIND },
{ 3, META_LOOKAHEADNOT },
{ 3, META_LOOKBEHINDNOT },
{ 18, META_LOOKAHEAD },
{ 19, META_LOOKBEHIND },
{ 18, META_LOOKAHEADNOT },
{ 19, META_LOOKBEHINDNOT },
{ 6, META_ATOMIC },
{ 2, 0 }, /* sr = script run */
{ 3, 0 }, /* asr = atomic script run */
{ 10, 0 }, /* script run */
{ 17, 0 } /* atomic script run */
};
static const int alascount = sizeof(alasmeta)/sizeof(alasitem);
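
The pairing of a NUL-separated name string with a parallel table of lengths can be sketched on its own as follows. This is a reduced illustration, not the PCRE2 code: the `SK_` enum values stand in for the real META_ codes, only five of the names are included, and `sk_lookup()` is a hypothetical helper that walks the string in the way the scanning loop later in this file appears to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the META_ codes; the real values live in PCRE2's sources. */
enum { SK_LOOKAHEAD = 1, SK_LOOKBEHIND, SK_LOOKAHEADNOT, SK_LOOKBEHINDNOT,
       SK_ATOMIC };

typedef struct sk_item {
  unsigned int len;   /* Length of name */
  int meta;           /* Stand-in for the base META_ code */
} sk_item;

/* Names are stored back to back, each followed by a NUL. The final NUL is
   supplied by the string literal itself. */
static const char sk_names[] =
  "pla\0" "plb\0" "nla\0" "nlb\0" "atomic";

static const sk_item sk_meta[] = {
  { 3, SK_LOOKAHEAD },
  { 3, SK_LOOKBEHIND },
  { 3, SK_LOOKAHEADNOT },
  { 3, SK_LOOKBEHINDNOT },
  { 6, SK_ATOMIC }
};

static const size_t sk_count = sizeof(sk_meta) / sizeof(sk_item);

/* Return the code for a name of the given length, or -1 if unknown. The
   length is compared first, so a prefix such as "pl" never matches "pla". */
int sk_lookup(const char *name, size_t namelen)
{
  const char *vn = sk_names;
  size_t i;
  assert(name != NULL);
  for (i = 0; i < sk_count; i++)
    {
    if (namelen == sk_meta[i].len && memcmp(name, vn, namelen) == 0)
      return sk_meta[i].meta;
    vn += sk_meta[i].len + 1;   /* Skip this name and its NUL */
    }
  return -1;
}
```

Keeping the lengths in a separate table means the scan never has to call `strlen()`, and the names and lengths must be kept in the same order for the walk to stay aligned.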
/* Offsets from OP_STAR for case-independent and negative repeat opcodes. */
static uint32_t chartypeoffset[] = {
@ -732,7 +772,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93, ERR94 };
ERR91, ERR92, ERR93, ERR94, ERR95 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1447,9 +1487,9 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
c = (uint32_t)i;
if (cb != NULL && c == CHAR_CR &&
(cb->cx->extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
c = CHAR_LF;
c = CHAR_LF;
}
else /* Negative table entry */
else /* Negative table entry */
{
escape = -i; /* Else return a special escape */
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
@ -1499,7 +1539,7 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
}
}
/* Escapes that need further processing, including those that are unknown, have
/* Escapes that need further processing, including those that are unknown, have
a zero entry in the lookup table. When called from pcre2_substitute(), only \c,
\o, and \x are recognized (and \u when BSUX is set). */
@ -2133,9 +2173,10 @@ return -1;
*************************************************/
/* This function is called from parse_regex() below whenever it needs to read
the name of a subpattern or a (*VERB). The initial pointer must be to the
character before the name. If that character is '*' we are reading a verb name.
The pointer is updated to point after the name, for a VERB, or after the name's
the name of a subpattern or a (*VERB) or an (*alpha_assertion). The initial
pointer must be to the character before the name. If that character is '*' we
are reading a verb or alpha assertion name. The pointer is updated to point
after the name, for a VERB or alpha assertion name, or after the name's
terminator for a subpattern name. Returning both the offset and the name
pointer is redundant information, but some callers use one and some the other,
so it is simplest just to return both.
@ -2160,27 +2201,29 @@ read_name(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t terminator,
int *errorcodeptr, compile_block *cb)
{
PCRE2_SPTR ptr = *ptrptr;
BOOL is_verb = (*ptr == CHAR_ASTERISK);
BOOL is_group = (*ptr != CHAR_ASTERISK);
uint32_t namelen = 0;
uint32_t ctype = is_verb? ctype_letter : ctype_word;
if (++ptr >= ptrend)
if (++ptr >= ptrend) /* No characters in name */
{
*errorcodeptr = is_verb? ERR60: /* Verb not recognized or malformed */
ERR62; /* Subpattern name expected */
*errorcodeptr = is_group? ERR62: /* Subpattern name expected */
ERR60; /* Verb not recognized or malformed */
goto FAILED;
}
/* A group name must not start with a digit. If either of the others starts
with a digit, it just won't be recognized. */
if (is_group && IS_DIGIT(*ptr))
{
*errorcodeptr = ERR44;
goto FAILED;
}
*nameptr = ptr;
*offsetptr = (PCRE2_SIZE)(ptr - cb->start_pattern);
if (IS_DIGIT(*ptr))
{
*errorcodeptr = ERR44; /* Group name must not start with digit */
goto FAILED;
}
while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype_word) != 0)
{
ptr++;
namelen++;
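
The revised name-reading rule in this hunk (word characters for every kind of name, with the leading-digit check applied only to group names) can be sketched in isolation. The function `scan_name_sketch()` below is hypothetical; it uses `<ctype.h>` in place of the `ctypes` table and returns a length instead of setting an error code, so it is an approximation of the logic and not the actual `read_name()`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch. p points at the first character of the name and end
   points one past the last available character. Returns the number of name
   characters, or -1 when a group name starts with a digit. */
long scan_name_sketch(const char *p, const char *end, int is_group)
{
  long namelen = 0;
  assert(p != NULL && end != NULL && p <= end);

  /* Only group names are rejected here. A verb or alpha assertion name that
     starts with a digit is read normally and fails later at table lookup. */
  if (p < end && is_group && isdigit((unsigned char)*p))
    return -1;

  /* "Word" characters: alphanumerics and underscore. */
  while (p < end && (isalnum((unsigned char)*p) || *p == '_'))
    {
    p++;
    namelen++;
    }
  return namelen;
}
```

Because underscore counts as a word character, names such as `positive_lookahead` are read as a single name, which the earlier letters-only scan for verbs would not have allowed.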
@ -2192,9 +2235,9 @@ while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
}
/* Subpattern names must not be empty, and their terminator is checked here.
(What follows a verb name is checked separately.) */
(What follows a verb or alpha assertion name is checked separately.) */
if (!is_verb)
if (is_group)
{
if (namelen == 0)
{
@ -2652,24 +2695,31 @@ while (ptr < ptrend)
if (expect_cond_assert > 0)
{
BOOL ok = c == CHAR_LEFT_PARENTHESIS && ptrend - ptr >= 3 &&
ptr[0] == CHAR_QUESTION_MARK;
if (ok) switch(ptr[1])
(ptr[0] == CHAR_QUESTION_MARK || ptr[0] == CHAR_ASTERISK);
if (ok)
{
case CHAR_C:
ok = expect_cond_assert == 2;
break;
case CHAR_EQUALS_SIGN:
case CHAR_EXCLAMATION_MARK:
break;
case CHAR_LESS_THAN_SIGN:
ok = ptr[2] == CHAR_EQUALS_SIGN || ptr[2] == CHAR_EXCLAMATION_MARK;
break;
default:
ok = FALSE;
}
if (ptr[0] == CHAR_ASTERISK) /* New alpha assertion format, possibly */
{
ok = MAX_255(ptr[1]) && (cb->ctypes[ptr[1]] & ctype_lcletter) != 0;
}
else switch(ptr[1]) /* Traditional symbolic format */
{
case CHAR_C:
ok = expect_cond_assert == 2;
break;
case CHAR_EQUALS_SIGN:
case CHAR_EXCLAMATION_MARK:
break;
case CHAR_LESS_THAN_SIGN:
ok = ptr[2] == CHAR_EQUALS_SIGN || ptr[2] == CHAR_EXCLAMATION_MARK;
break;
default:
ok = FALSE;
}
}
if (!ok)
{
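
The check above decides whether the text after an opening parenthesis can start a condition assertion. A standalone sketch of that decision follows, under stated assumptions: `starts_condition_assertion()` is a hypothetical name, `islower()` stands in for the `ctype_lcletter` table bit, and `callout_ok` plays the role of `expect_cond_assert == 2`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch. p points just after the "(" and avail is the number
   of characters available from p. Returns 1 if an assertion (or, when
   callout_ok is nonzero, a callout) may start here, otherwise 0. */
int starts_condition_assertion(const char *p, size_t avail, int callout_ok)
{
  assert(p != NULL);
  if (avail < 3) return 0;

  /* New alphabetic format, possibly: "(*" then a lower case letter. Whether
     the name is a known assertion is checked later. */
  if (p[0] == '*') return islower((unsigned char)p[1]) != 0;

  /* Traditional symbolic format. */
  if (p[0] != '?') return 0;
  switch (p[1])
    {
    case 'C': return callout_ok != 0;
    case '=':
    case '!': return 1;
    case '<': return p[2] == '=' || p[2] == '!';
    default:  return 0;
    }
}
```

Note that an upper case verb such as (*ACCEPT) is rejected at this stage, because only a lower case letter after `(*` is treated as a possible alphabetic assertion.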
@ -3453,7 +3503,8 @@ while (ptr < ptrend)
case CHAR_LEFT_PARENTHESIS:
if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
/* If ( is not followed by ? it is either a capture or a special verb. */
/* If ( is not followed by ? it is either a capture or a special verb or an
alpha assertion. */
if (*ptr != CHAR_QUESTION_MARK)
{
@ -3473,13 +3524,88 @@ while (ptr < ptrend)
else *parsed_pattern++ = META_NOCAPTURE;
}
/* Do nothing for (* followed by end of pattern or ) so it gives a "bad
quantifier" error rather than "(*MARK) must have an argument". */
else if (ptrend - ptr <= 1 || (c = ptr[1]) == CHAR_RIGHT_PARENTHESIS)
break;
/* Handle "alpha assertions" such as (*pla:...). Most of these are
synonyms for the historical symbolic assertions, but the script run ones
are new. They are distinguished by starting with a lower case letter.
Testing the "lower case letter" bit in the character tables makes this
work in all character codes. */
else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0)
{
uint32_t meta;
vn = alasnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
cb)) goto FAILED;
if (ptr >= ptrend || *ptr != CHAR_COLON)
{
errorcode = ERR95; /* Malformed */
goto FAILED;
}
/* Scan the table of alpha assertion names */
for (i = 0; i < alascount; i++)
{
if (namelen == alasmeta[i].len &&
PRIV(strncmp_c8)(name, vn, namelen) == 0)
break;
vn += alasmeta[i].len + 1;
}
if (i >= alascount)
{
errorcode = ERR95; /* Alpha assertion not recognized */
goto FAILED;
}
/* Check for expecting an assertion condition. If so, only lookaround
assertions are valid. */
meta = alasmeta[i].meta;
if (prev_expect_cond_assert > 0 &&
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
{
errorcode = ERR28; /* Assertion expected */
goto FAILED;
}
switch(meta)
{
case META_ATOMIC:
goto ATOMIC_GROUP;
case META_LOOKAHEAD:
goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD;
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
*parsed_pattern++ = meta;
ptr--;
goto LOOKBEHIND;
/* FIXME: Script Run stuff ... */
}
}
/* ---- Handle (*VERB) and (*VERB:NAME) ---- */
/* Do nothing for (*) so it gives a "bad quantifier" error rather than
"(*MARK) must have an argument". */
else if (ptrend - ptr > 1 && ptr[1] != CHAR_RIGHT_PARENTHESIS)
else
{
vn = verbnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
@ -3946,14 +4072,15 @@ while (ptr < ptrend)
if (++ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
nest_depth++;
/* If the next character is ? there must be an assertion next (optionally
preceded by a callout). We do not check this here, but instead we set
expect_cond_assert to 2. If this is still greater than zero (callouts
decrement it) when the next assertion is read, it will be marked as a
condition that must not be repeated. A value greater than zero also
causes checking that an assertion (possibly with callout) follows. */
/* If the next character is ? or * there must be an assertion next
(optionally preceded by a callout). We do not check this here, but
instead we set expect_cond_assert to 2. If this is still greater than
zero (callouts decrement it) when the next assertion is read, it will be
marked as a condition that must not be repeated. A value greater than
zero also causes checking that an assertion (possibly with callout)
follows. */
if (*ptr == CHAR_QUESTION_MARK)
if (*ptr == CHAR_QUESTION_MARK || *ptr == CHAR_ASTERISK)
{
*parsed_pattern++ = META_COND_ASSERT;
ptr--; /* Pull pointer back to the opening parenthesis. */
@ -4099,6 +4226,7 @@ while (ptr < ptrend)
/* ---- Atomic group ---- */
case CHAR_GREATER_THAN_SIGN:
ATOMIC_GROUP: /* Come from (*atomic: */
*parsed_pattern++ = META_ATOMIC;
nest_depth++;
ptr++;
@ -4108,11 +4236,13 @@ while (ptr < ptrend)
/* ---- Lookahead assertions ---- */
case CHAR_EQUALS_SIGN:
POSITIVE_LOOK_AHEAD: /* Come from (*pla: */
*parsed_pattern++ = META_LOOKAHEAD;
ptr++;
goto POST_ASSERTION;
case CHAR_EXCLAMATION_MARK:
NEGATIVE_LOOK_AHEAD: /* Come from (*nla: */
*parsed_pattern++ = META_LOOKAHEADNOT;
ptr++;
goto POST_ASSERTION;
@ -4132,6 +4262,8 @@ while (ptr < ptrend)
}
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
META_LOOKBEHIND : META_LOOKBEHINDNOT;
LOOKBEHIND: /* Come from (*plb: and (*nlb: */
*has_lookbehind = TRUE;
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
PUTOFFSET(offset, parsed_pattern);

@ -181,6 +181,8 @@ static const unsigned char compile_error_texts[] =
"invalid option bits with PCRE2_LITERAL\0"
"\\N{U+dddd} is supported only in Unicode (UTF) mode\0"
"invalid hyphen in option setting\0"
/* 95 */
"(*alpha_assertion) not recognized\0"
;
/* Match-time and UTF error texts are in the same format. */

@ -569,11 +569,11 @@ these tables. */
without checking pcre2_jit_compile.c, which has an assertion to ensure that
ctype_word has the value 16. */
#define ctype_space 0x01
#define ctype_letter 0x02
#define ctype_digit 0x04
#define ctype_xdigit 0x08 /* not actually used any more */
#define ctype_word 0x10 /* alphanumeric or '_' */
#define ctype_space 0x01
#define ctype_letter 0x02
#define ctype_lcletter 0x04
#define ctype_digit 0x08
#define ctype_word 0x10 /* alphanumeric or '_' */
/* Offsets of the various tables from the base tables pointer, and
total length of the tables. */
@ -874,34 +874,48 @@ a positive value. */
#define STR_RIGHT_CURLY_BRACKET "}"
#define STR_TILDE "~"
#define STRING_ACCEPT0 "ACCEPT\0"
#define STRING_COMMIT0 "COMMIT\0"
#define STRING_F0 "F\0"
#define STRING_FAIL0 "FAIL\0"
#define STRING_MARK0 "MARK\0"
#define STRING_PRUNE0 "PRUNE\0"
#define STRING_SKIP0 "SKIP\0"
#define STRING_THEN "THEN"
#define STRING_ACCEPT0 "ACCEPT\0"
#define STRING_COMMIT0 "COMMIT\0"
#define STRING_F0 "F\0"
#define STRING_FAIL0 "FAIL\0"
#define STRING_MARK0 "MARK\0"
#define STRING_PRUNE0 "PRUNE\0"
#define STRING_SKIP0 "SKIP\0"
#define STRING_THEN "THEN"
#define STRING_alpha0 "alpha\0"
#define STRING_lower0 "lower\0"
#define STRING_upper0 "upper\0"
#define STRING_alnum0 "alnum\0"
#define STRING_ascii0 "ascii\0"
#define STRING_blank0 "blank\0"
#define STRING_cntrl0 "cntrl\0"
#define STRING_digit0 "digit\0"
#define STRING_graph0 "graph\0"
#define STRING_print0 "print\0"
#define STRING_punct0 "punct\0"
#define STRING_space0 "space\0"
#define STRING_word0 "word\0"
#define STRING_xdigit "xdigit"
#define STRING_atomic0 "atomic\0"
#define STRING_pla0 "pla\0"
#define STRING_plb0 "plb\0"
#define STRING_nla0 "nla\0"
#define STRING_nlb0 "nlb\0"
#define STRING_sr0 "sr\0"
#define STRING_asr0 "asr\0"
#define STRING_positive_lookahead0 "positive_lookahead\0"
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
#define STRING_negative_lookahead0 "negative_lookahead\0"
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
#define STRING_script_run0 "script_run\0"
#define STRING_atomic_script_run "atomic_script_run"
#define STRING_DEFINE "DEFINE"
#define STRING_VERSION "VERSION"
#define STRING_WEIRD_STARTWORD "[:<:]]"
#define STRING_WEIRD_ENDWORD "[:>:]]"
#define STRING_alpha0 "alpha\0"
#define STRING_lower0 "lower\0"
#define STRING_upper0 "upper\0"
#define STRING_alnum0 "alnum\0"
#define STRING_ascii0 "ascii\0"
#define STRING_blank0 "blank\0"
#define STRING_cntrl0 "cntrl\0"
#define STRING_digit0 "digit\0"
#define STRING_graph0 "graph\0"
#define STRING_print0 "print\0"
#define STRING_punct0 "punct\0"
#define STRING_space0 "space\0"
#define STRING_word0 "word\0"
#define STRING_xdigit "xdigit"
#define STRING_DEFINE "DEFINE"
#define STRING_VERSION "VERSION"
#define STRING_WEIRD_STARTWORD "[:<:]]"
#define STRING_WEIRD_ENDWORD "[:>:]]"
#define STRING_CR_RIGHTPAR "CR)"
#define STRING_LF_RIGHTPAR "LF)"
@ -1150,34 +1164,48 @@ only. */
#define STR_RIGHT_CURLY_BRACKET "\175"
#define STR_TILDE "\176"
#define STRING_ACCEPT0 STR_A STR_C STR_C STR_E STR_P STR_T "\0"
#define STRING_COMMIT0 STR_C STR_O STR_M STR_M STR_I STR_T "\0"
#define STRING_F0 STR_F "\0"
#define STRING_FAIL0 STR_F STR_A STR_I STR_L "\0"
#define STRING_MARK0 STR_M STR_A STR_R STR_K "\0"
#define STRING_PRUNE0 STR_P STR_R STR_U STR_N STR_E "\0"
#define STRING_SKIP0 STR_S STR_K STR_I STR_P "\0"
#define STRING_THEN STR_T STR_H STR_E STR_N
#define STRING_ACCEPT0 STR_A STR_C STR_C STR_E STR_P STR_T "\0"
#define STRING_COMMIT0 STR_C STR_O STR_M STR_M STR_I STR_T "\0"
#define STRING_F0 STR_F "\0"
#define STRING_FAIL0 STR_F STR_A STR_I STR_L "\0"
#define STRING_MARK0 STR_M STR_A STR_R STR_K "\0"
#define STRING_PRUNE0 STR_P STR_R STR_U STR_N STR_E "\0"
#define STRING_SKIP0 STR_S STR_K STR_I STR_P "\0"
#define STRING_THEN STR_T STR_H STR_E STR_N
#define STRING_alpha0 STR_a STR_l STR_p STR_h STR_a "\0"
#define STRING_lower0 STR_l STR_o STR_w STR_e STR_r "\0"
#define STRING_upper0 STR_u STR_p STR_p STR_e STR_r "\0"
#define STRING_alnum0 STR_a STR_l STR_n STR_u STR_m "\0"
#define STRING_ascii0 STR_a STR_s STR_c STR_i STR_i "\0"
#define STRING_blank0 STR_b STR_l STR_a STR_n STR_k "\0"
#define STRING_cntrl0 STR_c STR_n STR_t STR_r STR_l "\0"
#define STRING_digit0 STR_d STR_i STR_g STR_i STR_t "\0"
#define STRING_graph0 STR_g STR_r STR_a STR_p STR_h "\0"
#define STRING_print0 STR_p STR_r STR_i STR_n STR_t "\0"
#define STRING_punct0 STR_p STR_u STR_n STR_c STR_t "\0"
#define STRING_space0 STR_s STR_p STR_a STR_c STR_e "\0"
#define STRING_word0 STR_w STR_o STR_r STR_d "\0"
#define STRING_xdigit STR_x STR_d STR_i STR_g STR_i STR_t
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
#define STRING_pla0 STR_p STR_l STR_a "\0"
#define STRING_plb0 STR_p STR_l STR_b "\0"
#define STRING_nla0 STR_n STR_l STR_a "\0"
#define STRING_nlb0 STR_n STR_l STR_b "\0"
#define STRING_sr0 STR_s STR_r "\0"
#define STRING_asr0 STR_a STR_s STR_r "\0"
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
#define STRING_atomic_script_run STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n
#define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
#define STRING_VERSION STR_V STR_E STR_R STR_S STR_I STR_O STR_N
#define STRING_WEIRD_STARTWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_LESS_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
#define STRING_WEIRD_ENDWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_GREATER_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
#define STRING_alpha0 STR_a STR_l STR_p STR_h STR_a "\0"
#define STRING_lower0 STR_l STR_o STR_w STR_e STR_r "\0"
#define STRING_upper0 STR_u STR_p STR_p STR_e STR_r "\0"
#define STRING_alnum0 STR_a STR_l STR_n STR_u STR_m "\0"
#define STRING_ascii0 STR_a STR_s STR_c STR_i STR_i "\0"
#define STRING_blank0 STR_b STR_l STR_a STR_n STR_k "\0"
#define STRING_cntrl0 STR_c STR_n STR_t STR_r STR_l "\0"
#define STRING_digit0 STR_d STR_i STR_g STR_i STR_t "\0"
#define STRING_graph0 STR_g STR_r STR_a STR_p STR_h "\0"
#define STRING_print0 STR_p STR_r STR_i STR_n STR_t "\0"
#define STRING_punct0 STR_p STR_u STR_n STR_c STR_t "\0"
#define STRING_space0 STR_s STR_p STR_a STR_c STR_e "\0"
#define STRING_word0 STR_w STR_o STR_r STR_d "\0"
#define STRING_xdigit STR_x STR_d STR_i STR_g STR_i STR_t
#define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
#define STRING_VERSION STR_V STR_E STR_R STR_S STR_I STR_O STR_N
#define STRING_WEIRD_STARTWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_LESS_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
#define STRING_WEIRD_ENDWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_GREATER_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS


@ -138,8 +138,8 @@ for (i = 0; i < 256; i++)
int x = 0;
if (isspace(i)) x += ctype_space;
if (isalpha(i)) x += ctype_letter;
if (islower(i)) x += ctype_lcletter;
if (isdigit(i)) x += ctype_digit;
if (isxdigit(i)) x += ctype_xdigit;
if (isalnum(i) || i == '_') x += ctype_word;
*p++ = x;
}
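
Note on the loop above: it sets one bit per character class for each of the 256 table entries, and the new lower case bit lets the pattern compiler test for a lower case letter with a single table lookup. A minimal stand-alone sketch of the same idea follows. The bit values are assumptions chosen for this example rather than copied from pcre2_internal.h, and `build_ctypes()` and `is_lower_letter()` are hypothetical helpers. As with the real table builder, the result depends on the current C locale.

```c
#include <ctype.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define ctype_space    0x01
#define ctype_letter   0x02
#define ctype_lcletter 0x04
#define ctype_digit    0x08
#define ctype_word     0x10

/* Fill a 256-entry table with one class bitmask per character value. */
static void build_ctypes(uint8_t table[256])
{
  for (int i = 0; i < 256; i++) {
    int x = 0;
    if (isspace(i)) x |= ctype_space;
    if (isalpha(i)) x |= ctype_letter;
    if (islower(i)) x |= ctype_lcletter;
    if (isdigit(i)) x |= ctype_digit;
    if (isalnum(i) || i == '_') x |= ctype_word;
    table[i] = (uint8_t)x;
  }
}

/* Non-zero if c is a lower case letter according to the table. */
static int is_lower_letter(const uint8_t table[256], unsigned char c)
{
  return (table[c] & ctype_lcletter) != 0;
}
```

With such a table, deciding whether the character after "(*" starts an alphabetic name or an upper case verb is one mask test, for example `is_lower_letter(table, 'p')` versus `is_lower_letter(table, 'A')`.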

65
testdata/testinput1 vendored

@ -6263,4 +6263,69 @@ ef) x/x,mark
aBCDEF
AbCDe f
/(*pla:foo).{6}/
abcfoobarxyz
\= Expect no match
abcfooba
/(*positive_lookahead:foo).{6}/
abcfoobarxyz
/(?(*pla:foo).{6}|a..)/
foobarbaz
abcfoobar
/(?(*positive_lookahead:foo).{6}|a..)/
foobarbaz
abcfoobar
/(*plb:foo)bar/
abcfoobar
\= Expect no match
abcbarfoo
/(*positive_lookbehind:foo)bar/
abcfoobar
\= Expect no match
abcbarfoo
/(?(*plb:foo)bar|baz)/
abcfoobar
bazfoobar
abcbazfoobar
foobazfoobar
/(?(*positive_lookbehind:foo)bar|baz)/
abcfoobar
bazfoobar
abcbazfoobar
foobazfoobar
/(*nlb:foo)bar/
abcbarfoo
\= Expect no match
abcfoobar
/(*negative_lookbehind:foo)bar/
abcbarfoo
\= Expect no match
abcfoobar
/(?(*nlb:foo)bar|baz)/
abcfoobaz
abcbarbaz
\= Expect no match
abcfoobar
/(?(*negative_lookbehind:foo)bar|baz)/
abcfoobaz
abcbarbaz
\= Expect no match
abcfoobar
/(*atomic:a+)\w/
aaab
\= Expect no match
aaaa
# End of testinput1

6
testdata/testinput2 vendored

@ -5525,4 +5525,10 @@ a)"xI
\= Expect no match
abc\ndef\nxyz
/(?(*ACCEPT)xxx)/
/(?(*atomic:xx)xxx)/
/(?(*script_run:xxx)zzz)/
# End of testinput2

96
testdata/testoutput1 vendored

@ -9929,4 +9929,100 @@ No match
AbCDe f
No match
/(*pla:foo).{6}/
abcfoobarxyz
0: foobar
\= Expect no match
abcfooba
No match
/(*positive_lookahead:foo).{6}/
abcfoobarxyz
0: foobar
/(?(*pla:foo).{6}|a..)/
foobarbaz
0: foobar
abcfoobar
0: abc
/(?(*positive_lookahead:foo).{6}|a..)/
foobarbaz
0: foobar
abcfoobar
0: abc
/(*plb:foo)bar/
abcfoobar
0: bar
\= Expect no match
abcbarfoo
No match
/(*positive_lookbehind:foo)bar/
abcfoobar
0: bar
\= Expect no match
abcbarfoo
No match
/(?(*plb:foo)bar|baz)/
abcfoobar
0: bar
bazfoobar
0: baz
abcbazfoobar
0: baz
foobazfoobar
0: bar
/(?(*positive_lookbehind:foo)bar|baz)/
abcfoobar
0: bar
bazfoobar
0: baz
abcbazfoobar
0: baz
foobazfoobar
0: bar
/(*nlb:foo)bar/
abcbarfoo
0: bar
\= Expect no match
abcfoobar
No match
/(*negative_lookbehind:foo)bar/
abcbarfoo
0: bar
\= Expect no match
abcfoobar
No match
/(?(*nlb:foo)bar|baz)/
abcfoobaz
0: baz
abcbarbaz
0: bar
\= Expect no match
abcfoobar
No match
/(?(*negative_lookbehind:foo)bar|baz)/
abcfoobaz
0: baz
abcbarbaz
0: bar
\= Expect no match
abcfoobar
No match
/(*atomic:a+)\w/
aaab
0: aaab
\= Expect no match
aaaa
No match
# End of testinput1


@ -575,7 +575,7 @@ Last code unit = 'b'
Subject length lower bound = 3
/(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
Failed: error 160 at offset 12: (*VERB) not recognized or malformed
Failed: error 160 at offset 14: (*VERB) not recognized or malformed
/\h/I,utf
Capturing subpattern count = 0


@ -538,7 +538,7 @@ No match
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
/(*UTF16)\x{11234}/
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
Failed: error 160 at offset 7: (*VERB) not recognized or malformed
abcd\x{11234}pqr
/(*UTF)\x{11234}/I
@ -559,7 +559,7 @@ Failed: error 160 at offset 5: (*VERB) not recognized or malformed
abcd\x{11234}pqr
/(*CRLF)(*UTF16)(*BSR_UNICODE)a\Rb/I
Failed: error 160 at offset 12: (*VERB) not recognized or malformed
Failed: error 160 at offset 14: (*VERB) not recognized or malformed
/(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
Capturing subpattern count = 0


@ -16812,6 +16812,15 @@ No match
abc\ndef\nxyz
No match
/(?(*ACCEPT)xxx)/
Failed: error 128 at offset 2: assertion expected after (?( or (?(?C)
/(?(*atomic:xx)xxx)/
Failed: error 128 at offset 10: assertion expected after (?( or (?(?C)
/(?(*script_run:xxx)zzz)/
Failed: error 128 at offset 14: assertion expected after (?( or (?(?C)
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data