Implement Perl 5.28's alphabetic lookaround syntax, e.g. (*pla:...) and also (*atomic:...).
Philip.Hazel 2018-09-24 16:23:53 +00:00
parent 69254c77f1
commit f26b0b0bae
21 changed files with 1218 additions and 734 deletions

View File

@ -22,6 +22,14 @@ wrong library in some environments.
6. Implement PCRE2_EXTRA_ESCAPED_CR_IS_LF (see Bugzilla 2315).
7. Implement the Perl 5.28 experimental alphabetic names for atomic groups and
lookaround assertions, for example, (*pla:...) and (*atomic:...). These are
characterized by a lower case letter following (* and to simplify coding for
this, the character tables created by pcre2_maketables() were updated to add a
new "is lower case letter" bit. At the same time, the now unused "is
hexadecimal digit" bit was removed. The default tables in
src/pcre2_chartables.c.dist are updated.
Version 10.32 10-September-2018
-------------------------------

View File

@ -2120,6 +2120,11 @@ special parenthesis, starting with (?> as in this example:
<pre>
(?&#62;\d+)foo
</pre>
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
<pre>
(*atomic:\d+)foo
</pre>
This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
@ -2342,11 +2347,17 @@ coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<P>
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must
succeed for matching to continue) or negative (must not succeed for matching to
continue). An assertion subpattern is matched in the normal way, except that,
when matching continues after a successful assertion, the matching position in
the subject string is as it was before the assertion was processed.
that look behind it, and in each case an assertion may be positive (must match
for the assertion to be true) or negative (must not match for the assertion to
be true). An assertion subpattern is matched in the normal way, and if it is
true, matching continues after it, but with the matching position in the
subject string as it was before the assertion was processed.
</P>
<P>
A lookaround assertion may also appear as the condition in a
<a href="#conditions">conditional subpattern</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P>
<P>
Assertion subpatterns are not capturing subpatterns. If an assertion contains
@ -2359,7 +2370,7 @@ adjacent characters are the same.
<P>
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
@ -2368,7 +2379,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a
assertion is not true. If such an assertion is being used as a condition in a
<a href="#conditions">conditional subpattern</a>
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions,
@ -2398,6 +2409,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching.
</P>
<br><b>
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
<pre>
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?&#60;=
(*negative_lookbehind: or (*nlb: is the same as (?&#60;!
</pre>
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
</P>
<br><b>
Lookahead assertions
</b><br>
<P>
@ -3630,7 +3660,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 September 2018
Last updated: 24 September 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

View File

@ -436,6 +436,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
<P>
<pre>
(?&#62;...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
@ -514,10 +515,21 @@ setting with a similar syntax.
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) positive look ahead
(?!...) negative look ahead
(?&#60;=...) positive look behind
(?&#60;!...) negative look behind
(?=...) )
(*pla:...) ) positive lookahead
(*positive_lookahead:...) )
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
(?&#60;=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
(?&#60;!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
</pre>
Each top-level branch of a lookbehind must be of a fixed length.
</P>
@ -634,7 +646,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 September 2018
Last updated: 24 September 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

View File

@ -7760,6 +7760,11 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
(?>\d+)foo
Perl 5.28 introduced an experimental alphabetic form starting with (*
which may be easier to remember:
(*atomic:\d+)foo
This kind of parenthesis "locks up" the part of the pattern it con-
tains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
@ -7970,12 +7975,16 @@ ASSERTIONS
More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it, and in each case an assertion
may be positive (must succeed for matching to continue) or negative
(must not succeed for matching to continue). An assertion subpattern is
matched in the normal way, except that, when matching continues after a
successful assertion, the matching position in the subject string is as
may be positive (must match for the assertion to be true) or negative
(must not match for the assertion to be true). An assertion subpattern
is matched in the normal way, and if it is true, matching continues
after it, but with the matching position in the subject string as it
was before the assertion was processed.
A lookaround assertion may also appear as the condition in a condi-
tional subpattern (see below). In this case, the result of matching the
assertion determines which branch of the condition is followed.
Assertion subpatterns are not capturing subpatterns. If an assertion
contains capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole pattern.
@ -7985,7 +7994,7 @@ ASSERTIONS
When a branch within an assertion fails to match, any substrings that
were captured are discarded (as happens with any pattern branch that
fails to match). A negative assertion succeeds only when all its
fails to match). A negative assertion is true only when all its
branches fail to match; this means that no captured substrings are ever
retained after a successful negative assertion. When an assertion con-
tains a matching branch, what happens depends on the type of assertion.
@ -7993,9 +8002,9 @@ ASSERTIONS
For a positive assertion, internally captured substrings in the suc-
cessful branch are retained, and matching continues with the next pat-
tern item after the assertion. For a negative assertion, a matching
branch means that the assertion has failed. If the assertion is being
used as a condition in a conditional subpattern (see below), captured
substrings are retained, because matching continues with the "no"
branch means that the assertion is not true. If such an assertion is
being used as a condition in a conditional subpattern (see below), cap-
tured substrings are retained, because matching continues with the "no"
branch of the condition. For other failing negative assertions, control
passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
@ -8020,6 +8029,23 @@ ASSERTIONS
ignored. The assertion is obeyed just once when encountered during
matching.
Alphabetic assertion names
Traditionally, symbolic sequences such as (?= and (?<= have been used
to specify lookaround assertions. Perl 5.28 introduced some experimen-
tal alphabetic alternatives which might be easier to remember. They all
start with (* instead of (? and must be written using lower case let-
ters. PCRE2 supports the following synonyms:
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?<=
(*negative_lookbehind: or (*nlb: is the same as (?<!
For example, (*pla:foo) is the same assertion as (?=foo). However, in
the following sections, the various assertions are described using the
original symbolic forms.
Lookahead assertions
Lookahead assertions start with (?= for positive assertions and (?! for
@ -9179,7 +9205,7 @@ AUTHOR
REVISION
Last updated: 21 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
@ -10291,6 +10317,7 @@ CAPTURING
ATOMIC GROUPS
(?>...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
COMMENT
@ -10367,10 +10394,21 @@ WHAT \R MATCHES
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind
(?=...) )
(*pla:...) ) positive lookahead
(*positive_lookahead:...) )
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
(?<=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
(?<!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
Each top-level branch of a lookbehind must be of a fixed length.
@ -10487,7 +10525,7 @@ AUTHOR
REVISION
Last updated: 02 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "21 September 2018" "PCRE2 10.33"
.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2124,6 +2124,11 @@ special parenthesis, starting with (?> as in this example:
.sp
(?>\ed+)foo
.sp
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
.sp
(*atomic:\ed+)foo
.sp
This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
@ -2351,11 +2356,19 @@ above.
.P
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must
succeed for matching to continue) or negative (must not succeed for matching to
continue). An assertion subpattern is matched in the normal way, except that,
when matching continues after a successful assertion, the matching position in
the subject string is as it was before the assertion was processed.
that look behind it, and in each case an assertion may be positive (must match
for the assertion to be true) or negative (must not match for the assertion to
be true). An assertion subpattern is matched in the normal way, and if it is
true, matching continues after it, but with the matching position in the
subject string as it was before the assertion was processed.
.P
A lookaround assertion may also appear as the condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional subpattern
.\"
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
.P
Assertion subpatterns are not capturing subpatterns. If an assertion contains
capturing subpatterns within it, these are counted for the purposes of
@ -2366,7 +2379,7 @@ adjacent characters are the same.
.P
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match;
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
@ -2374,7 +2387,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a
assertion is not true. If such an assertion is being used as a condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional subpattern
@ -2406,6 +2419,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching.
.
.
.SS "Alphabetic assertion names"
.rs
.sp
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
.sp
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?<=
(*negative_lookbehind: or (*nlb: is the same as (?<!
.sp
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
.
.
.SS "Lookahead assertions"
.rs
.sp
@ -3660,6 +3692,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 21 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "02 September 2018" "PCRE2 10.32"
.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -411,6 +411,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
.rs
.sp
(?>...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
.
.
.SH "COMMENT"
@ -491,10 +492,21 @@ setting with a similar syntax.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind
(?=...) )
(*pla:...) ) positive lookahead
(*positive_lookahead:...) )
.sp
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
.sp
(?<=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
.sp
(?<!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
.sp
Each top-level branch of a lookbehind must be of a fixed length.
.
@ -621,6 +633,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 September 2018
Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

View File

@ -103,19 +103,22 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,128,
255,255,255,255,0,0,0,0,
0,0,0,0,0,0,0,0,
/* Fiddled by hand when the table bits changed. May be broken! */
128,0,0,0,0,0,0,0,
0,1,1,0,1,1,0,0,
0,1,1,1,1,1,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
1,0,0,0,128,0,0,0,
128,128,128,128,0,0,128,0,
28,28,28,28,28,28,28,28,
28,28,0,0,0,0,0,128,
0,26,26,26,26,26,26,18,
24,24,24,24,24,24,24,24,
24,24,0,0,0,0,0,128,
0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,128,128,0,128,16,
0,26,26,26,26,26,26,18,
0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,128,128,0,0,0,
@ -125,8 +128,8 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,
0,0,18,0,0,0,0,0,
0,0,20,20,0,18,0,0,
0,20,18,0,0,0,0,0,
0,0,24,24,0,18,0,0,
0,24,18,0,0,0,0,0,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,0,

View File

@ -75,6 +75,10 @@ fi
(echo "$prefix" ; cat <<'PERLEND'
# The alpha assertions currently give warnings even when -w is not specified.
no warnings "experimental::alpha_assertions";
# Function for turning a string into a string of printing chars.
sub pchars {
@ -129,6 +133,9 @@ else { $outfile = "STDOUT"; }
printf($outfile "Perl $] Regular Expressions\n\n");
$extra_modifiers = "";
$default_show_mark = 0;
# Main loop
NEXT_RE:
@ -370,7 +377,10 @@ for (;;)
}
}
# printf $outfile "\n";
# By closing OUTFILE explicitly, we avoid a Perl warning in -w mode
# 'Name "main::OUTFILE" used only once'.
close(OUTFILE) if $outfile eq "OUTFILE";
PERLEND
) | $perl $perlarg - $@

View File

@ -183,10 +183,10 @@ fprintf(f,
"/* This table identifies various classes of character by individual bits:\n"
" 0x%02x white space character\n"
" 0x%02x letter\n"
" 0x%02x lower case letter\n"
" 0x%02x decimal digit\n"
" 0x%02x hexadecimal digit\n"
" 0x%02x alphanumeric or '_'\n*/\n\n",
ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word);
ctype_space, ctype_letter, ctype_lcletter, ctype_digit, ctype_word);
fprintf(f, " ");
for (i = 0; i < 256; i++)

View File

@ -320,6 +320,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
#define PCRE2_ERROR_SUPPORTED_ONLY_IN_UNICODE 193
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
/* "Expected" matching error codes: no match and partial match. */

View File

@ -157,8 +157,8 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
/* This table identifies various classes of character by individual bits:
0x01 white space character
0x02 letter
0x04 decimal digit
0x08 hexadecimal digit
0x04 lower case letter
0x08 decimal digit
0x10 alphanumeric or '_'
*/
@ -168,16 +168,16 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */
0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* - ' */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* ( - / */
0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c, /* 0 - 7 */
0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* @ - G */
0x18,0x18,0x18,0x18,0x18,0x18,0x18,0x18, /* 0 - 7 */
0x18,0x18,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */
0x00,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* @ - G */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x10, /* X - _ */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* ` - g */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* h - o */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* p - w */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x00, /* x -127 */
0x00,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* ` - g */
0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* h - o */
0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* p - w */
0x16,0x16,0x16,0x00,0x00,0x00,0x00,0x00, /* x -127 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */

View File

@ -615,6 +615,46 @@ static const uint32_t verbops[] = {
OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE,
OP_PRUNE_ARG, OP_SKIP, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG };
/* Table of "alpha assertions" like (*pla:...), similar to the (*VERB) table. */
typedef struct alasitem {
unsigned int len; /* Length of name */
uint32_t meta; /* Base META_ code */
} alasitem;
static const char alasnames[] =
STRING_pla0
STRING_plb0
STRING_nla0
STRING_nlb0
STRING_positive_lookahead0
STRING_positive_lookbehind0
STRING_negative_lookahead0
STRING_negative_lookbehind0
STRING_atomic0
STRING_sr0
STRING_asr0
STRING_script_run0
STRING_atomic_script_run;
static const alasitem alasmeta[] = {
{ 3, META_LOOKAHEAD },
{ 3, META_LOOKBEHIND },
{ 3, META_LOOKAHEADNOT },
{ 3, META_LOOKBEHINDNOT },
{ 18, META_LOOKAHEAD },
{ 19, META_LOOKBEHIND },
{ 18, META_LOOKAHEADNOT },
{ 19, META_LOOKBEHINDNOT },
{ 6, META_ATOMIC },
{ 2, 0 }, /* sr = script run */
{ 3, 0 }, /* asr = atomic script run */
{ 10, 0 }, /* script run */
{ 17, 0 } /* atomic script run */
};
static const int alascount = sizeof(alasmeta)/sizeof(alasitem);
/* Offsets from OP_STAR for case-independent and negative repeat opcodes. */
static uint32_t chartypeoffset[] = {
@ -732,7 +772,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93, ERR94 };
ERR91, ERR92, ERR93, ERR94, ERR95 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -2133,9 +2173,10 @@ return -1;
*************************************************/
/* This function is called from parse_regex() below whenever it needs to read
the name of a subpattern or a (*VERB). The initial pointer must be to the
character before the name. If that character is '*' we are reading a verb name.
The pointer is updated to point after the name, for a VERB, or after the name's
the name of a subpattern or a (*VERB) or an (*alpha_assertion). The initial
pointer must be to the character before the name. If that character is '*' we
are reading a verb or alpha assertion name. The pointer is updated to point
after the name, for a VERB or alpha assertion name, or after the name's
terminator for a subpattern name. Returning both the offset and the name
pointer is redundant information, but some callers use one and some the other,
so it is simplest just to return both.
@ -2160,27 +2201,29 @@ read_name(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t terminator,
int *errorcodeptr, compile_block *cb)
{
PCRE2_SPTR ptr = *ptrptr;
BOOL is_verb = (*ptr == CHAR_ASTERISK);
BOOL is_group = (*ptr != CHAR_ASTERISK);
uint32_t namelen = 0;
uint32_t ctype = is_verb? ctype_letter : ctype_word;
if (++ptr >= ptrend)
if (++ptr >= ptrend) /* No characters in name */
{
*errorcodeptr = is_verb? ERR60: /* Verb not recognized or malformed */
ERR62; /* Subpattern name expected */
*errorcodeptr = is_group? ERR62: /* Subpattern name expected */
ERR60; /* Verb not recognized or malformed */
goto FAILED;
}
/* A group name must not start with a digit. If either of the others start with
a digit it just won't be recognized. */
if (is_group && IS_DIGIT(*ptr))
{
*errorcodeptr = ERR44;
goto FAILED;
}
*nameptr = ptr;
*offsetptr = (PCRE2_SIZE)(ptr - cb->start_pattern);
if (IS_DIGIT(*ptr))
{
*errorcodeptr = ERR44; /* Group name must not start with digit */
goto FAILED;
}
while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype_word) != 0)
{
ptr++;
namelen++;
@ -2192,9 +2235,9 @@ while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
}
/* Subpattern names must not be empty, and their terminator is checked here.
(What follows a verb name is checked separately.) */
(What follows a verb or alpha assertion name is checked separately.) */
if (!is_verb)
if (is_group)
{
if (namelen == 0)
{
@ -2652,8 +2695,14 @@ while (ptr < ptrend)
if (expect_cond_assert > 0)
{
BOOL ok = c == CHAR_LEFT_PARENTHESIS && ptrend - ptr >= 3 &&
ptr[0] == CHAR_QUESTION_MARK;
if (ok) switch(ptr[1])
(ptr[0] == CHAR_QUESTION_MARK || ptr[0] == CHAR_ASTERISK);
if (ok)
{
if (ptr[0] == CHAR_ASTERISK) /* New alpha assertion format, possibly */
{
ok = MAX_255(ptr[1]) && (cb->ctypes[ptr[1]] & ctype_lcletter) != 0;
}
else switch(ptr[1]) /* Traditional symbolic format */
{
case CHAR_C:
ok = expect_cond_assert == 2;
@ -2670,6 +2719,7 @@ while (ptr < ptrend)
default:
ok = FALSE;
}
}
if (!ok)
{
@ -3453,7 +3503,8 @@ while (ptr < ptrend)
case CHAR_LEFT_PARENTHESIS:
if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
/* If ( is not followed by ? it is either a capture or a special verb. */
/* If ( is not followed by ? it is either a capture or a special verb or an
alpha assertion. */
if (*ptr != CHAR_QUESTION_MARK)
{
@ -3473,13 +3524,88 @@ while (ptr < ptrend)
else *parsed_pattern++ = META_NOCAPTURE;
}
/* Do nothing for (* followed by end of pattern or ) so it gives a "bad
quantifier" error rather than "(*MARK) must have an argument". */
else if (ptrend - ptr <= 1 || (c = ptr[1]) == CHAR_RIGHT_PARENTHESIS)
break;
/* Handle "alpha assertions" such as (*pla:...). Most of these are
synonyms for the historical symbolic assertions, but the script run ones
are new. They are distinguished by starting with a lower case letter.
The test uses the ctype_lcletter bit in the character tables, so it works
in all character codes. */
else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0)
{
uint32_t meta;
vn = alasnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
cb)) goto FAILED;
if (ptr >= ptrend || *ptr != CHAR_COLON)
{
errorcode = ERR95; /* Malformed */
goto FAILED;
}
/* Scan the table of alpha assertion names */
for (i = 0; i < alascount; i++)
{
if (namelen == alasmeta[i].len &&
PRIV(strncmp_c8)(name, vn, namelen) == 0)
break;
vn += alasmeta[i].len + 1;
}
if (i >= alascount)
{
errorcode = ERR95; /* Alpha assertion not recognized */
goto FAILED;
}
/* Check for expecting an assertion condition. If so, only lookaround
assertions are valid. */
meta = alasmeta[i].meta;
if (prev_expect_cond_assert > 0 &&
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
{
errorcode = ERR28; /* Assertion expected */
goto FAILED;
}
switch(meta)
{
case META_ATOMIC:
goto ATOMIC_GROUP;
case META_LOOKAHEAD:
goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD;
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
*parsed_pattern++ = meta;
ptr--;
goto LOOKBEHIND;
/* FIXME: Script Run stuff ... */
}
}
/* ---- Handle (*VERB) and (*VERB:NAME) ---- */
/* Do nothing for (*) so it gives a "bad quantifier" error rather than
"(*MARK) must have an argument". */
else if (ptrend - ptr > 1 && ptr[1] != CHAR_RIGHT_PARENTHESIS)
else
{
vn = verbnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
@ -3946,14 +4072,15 @@ while (ptr < ptrend)
if (++ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
nest_depth++;
/* If the next character is ? there must be an assertion next (optionally
preceded by a callout). We do not check this here, but instead we set
expect_cond_assert to 2. If this is still greater than zero (callouts
decrement it) when the next assertion is read, it will be marked as a
condition that must not be repeated. A value greater than zero also
causes checking that an assertion (possibly with callout) follows. */
/* If the next character is ? or * there must be an assertion next
(optionally preceded by a callout). We do not check this here, but
instead we set expect_cond_assert to 2. If this is still greater than
zero (callouts decrement it) when the next assertion is read, it will be
marked as a condition that must not be repeated. A value greater than
zero also causes checking that an assertion (possibly with callout)
follows. */
if (*ptr == CHAR_QUESTION_MARK)
if (*ptr == CHAR_QUESTION_MARK || *ptr == CHAR_ASTERISK)
{
*parsed_pattern++ = META_COND_ASSERT;
ptr--; /* Pull pointer back to the opening parenthesis. */
@ -4099,6 +4226,7 @@ while (ptr < ptrend)
/* ---- Atomic group ---- */
case CHAR_GREATER_THAN_SIGN:
ATOMIC_GROUP: /* Come from (*atomic: */
*parsed_pattern++ = META_ATOMIC;
nest_depth++;
ptr++;
@ -4108,11 +4236,13 @@ while (ptr < ptrend)
/* ---- Lookahead assertions ---- */
case CHAR_EQUALS_SIGN:
POSITIVE_LOOK_AHEAD: /* Come from (*pla: */
*parsed_pattern++ = META_LOOKAHEAD;
ptr++;
goto POST_ASSERTION;
case CHAR_EXCLAMATION_MARK:
NEGATIVE_LOOK_AHEAD: /* Come from (*nla: */
*parsed_pattern++ = META_LOOKAHEADNOT;
ptr++;
goto POST_ASSERTION;
@ -4132,6 +4262,8 @@ while (ptr < ptrend)
}
*parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
META_LOOKBEHIND : META_LOOKBEHINDNOT;
LOOKBEHIND: /* Come from (*plb: and (*nlb: */
*has_lookbehind = TRUE;
offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
PUTOFFSET(offset, parsed_pattern);

View File

@ -181,6 +181,8 @@ static const unsigned char compile_error_texts[] =
"invalid option bits with PCRE2_LITERAL\0"
"\\N{U+dddd} is supported only in Unicode (UTF) mode\0"
"invalid hyphen in option setting\0"
/* 95 */
"(*alpha_assertion) not recognized\0"
;
/* Match-time and UTF error texts are in the same format. */

View File

@ -571,8 +571,8 @@ ctype_word has the value 16. */
#define ctype_space 0x01
#define ctype_letter 0x02
#define ctype_digit 0x04
#define ctype_xdigit 0x08 /* not actually used any more */
#define ctype_lcletter 0x04
#define ctype_digit 0x08
#define ctype_word 0x10 /* alphanumeric or '_' */
/* Offsets of the various tables from the base tables pointer, and
@ -883,6 +883,20 @@ a positive value. */
#define STRING_SKIP0 "SKIP\0"
#define STRING_THEN "THEN"
#define STRING_atomic0 "atomic\0"
#define STRING_pla0 "pla\0"
#define STRING_plb0 "plb\0"
#define STRING_nla0 "nla\0"
#define STRING_nlb0 "nlb\0"
#define STRING_sr0 "sr\0"
#define STRING_asr0 "asr\0"
#define STRING_positive_lookahead0 "positive_lookahead\0"
#define STRING_positive_lookbehind0 "positive_lookbehind\0"
#define STRING_negative_lookahead0 "negative_lookahead\0"
#define STRING_negative_lookbehind0 "negative_lookbehind\0"
#define STRING_script_run0 "script_run\0"
#define STRING_atomic_script_run "atomic_script_run"
#define STRING_alpha0 "alpha\0"
#define STRING_lower0 "lower\0"
#define STRING_upper0 "upper\0"
@ -1159,6 +1173,20 @@ only. */
#define STRING_SKIP0 STR_S STR_K STR_I STR_P "\0"
#define STRING_THEN STR_T STR_H STR_E STR_N
#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
#define STRING_pla0 STR_p STR_l STR_a "\0"
#define STRING_plb0 STR_p STR_l STR_b "\0"
#define STRING_nla0 STR_n STR_l STR_a "\0"
#define STRING_nlb0 STR_n STR_l STR_b "\0"
#define STRING_sr0 STR_s STR_r "\0"
#define STRING_asr0 STR_a STR_s STR_r "\0"
#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
#define STRING_atomic_script_run STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n
#define STRING_alpha0 STR_a STR_l STR_p STR_h STR_a "\0"
#define STRING_lower0 STR_l STR_o STR_w STR_e STR_r "\0"
#define STRING_upper0 STR_u STR_p STR_p STR_e STR_r "\0"
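
The trailing "\0" on every new name except the last suggests that the names are concatenated into a single NUL-separated list. The following is a minimal sketch of searching such a list, not the code in pcre2_compile.c: the identifiers alpha_names, ALPHA_NAME_COUNT and find_alpha_name are hypothetical, and the ordering of the names is only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list: names separated by NUL; the last one relies on the
   implicit terminator of the string literal. */
static const char alpha_names[] =
  "atomic\0" "pla\0" "plb\0" "nla\0" "nlb\0" "sr\0" "asr\0"
  "positive_lookahead\0" "positive_lookbehind\0"
  "negative_lookahead\0" "negative_lookbehind\0"
  "script_run\0" "atomic_script_run";

#define ALPHA_NAME_COUNT 13   /* number of names in the list above */

/* Return the index of name (len code units, not NUL-terminated) in the
   list, or -1 if it is not present. */
static int find_alpha_name(const char *name, size_t len)
{
  const char *p = alpha_names;
  assert(name != NULL);
  for (int i = 0; i < ALPHA_NAME_COUNT; i++)
    {
    size_t n = strlen(p);
    if (n == len && memcmp(p, name, len) == 0) return i;
    p += n + 1;   /* step over this name and its NUL */
    }
  return -1;
}
```

With this layout, find_alpha_name("pla", 3) yields 1, while a name such as "UTF16" that is not in the list yields -1.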


@ -138,8 +138,8 @@ for (i = 0; i < 256; i++)
int x = 0;
if (isspace(i)) x += ctype_space;
if (isalpha(i)) x += ctype_letter;
+if (islower(i)) x += ctype_lcletter;
if (isdigit(i)) x += ctype_digit;
-if (isxdigit(i)) x += ctype_xdigit;
if (isalnum(i) || i == '_') x += ctype_word;
*p++ = x;
}
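
A minimal sketch of how the new "is lower case letter" bit could be tested once the table has been built, assuming the bit values that apply after this change. The helpers build_ctypes and is_lower_letter are hypothetical names for illustration; they are not PCRE2 functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as they stand after this change. */
#define ctype_space    0x01
#define ctype_letter   0x02
#define ctype_lcletter 0x04
#define ctype_digit    0x08
#define ctype_word     0x10   /* alphanumeric or '_' */

/* Hypothetical helper: fill a 256-entry table in the same way as the
   loop in the hunk above, using the current C locale. */
static void build_ctypes(uint8_t table[256])
{
  assert(table != NULL);
  for (int i = 0; i < 256; i++)
    {
    int x = 0;
    if (isspace(i)) x += ctype_space;
    if (isalpha(i)) x += ctype_letter;
    if (islower(i)) x += ctype_lcletter;
    if (isdigit(i)) x += ctype_digit;
    if (isalnum(i) || i == '_') x += ctype_word;
    table[i] = (uint8_t)x;
    }
}

/* Hypothetical helper: one table lookup and one mask decide whether the
   character following "(*" is a lower case letter. */
static int is_lower_letter(const uint8_t table[256], unsigned char c)
{
  return (table[c] & ctype_lcletter) != 0;
}
```

In the default "C" locale, is_lower_letter reports true for 'p' (as in (*pla:...)) and false for 'P' (as in (*PRUNE)), which is the distinction the change relies on.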

testdata/testinput1

@ -6263,4 +6263,69 @@ ef) x/x,mark
aBCDEF
AbCDe f
/(*pla:foo).{6}/
abcfoobarxyz
\= Expect no match
abcfooba
/(*positive_lookahead:foo).{6}/
abcfoobarxyz
/(?(*pla:foo).{6}|a..)/
foobarbaz
abcfoobar
/(?(*positive_lookahead:foo).{6}|a..)/
foobarbaz
abcfoobar
/(*plb:foo)bar/
abcfoobar
\= Expect no match
abcbarfoo
/(*positive_lookbehind:foo)bar/
abcfoobar
\= Expect no match
abcbarfoo
/(?(*plb:foo)bar|baz)/
abcfoobar
bazfoobar
abcbazfoobar
foobazfoobar
/(?(*positive_lookbehind:foo)bar|baz)/
abcfoobar
bazfoobar
abcbazfoobar
foobazfoobar
/(*nlb:foo)bar/
abcbarfoo
\= Expect no match
abcfoobar
/(*negative_lookbehind:foo)bar/
abcbarfoo
\= Expect no match
abcfoobar
/(?(*nlb:foo)bar|baz)/
abcfoobaz
abcbarbaz
\= Expect no match
abcfoobar
/(?(*negative_lookbehind:foo)bar|baz)/
abcfoobaz
abcbarbaz
\= Expect no match
abcfoobar
/(*atomic:a+)\w/
aaab
\= Expect no match
aaaa
# End of testinput1

testdata/testinput2

@ -5525,4 +5525,10 @@ a)"xI
\= Expect no match
abc\ndef\nxyz
/(?(*ACCEPT)xxx)/
/(?(*atomic:xx)xxx)/
/(?(*script_run:xxx)zzz)/
# End of testinput2

testdata/testoutput1

@ -9929,4 +9929,100 @@ No match
AbCDe f
No match
/(*pla:foo).{6}/
abcfoobarxyz
0: foobar
\= Expect no match
abcfooba
No match
/(*positive_lookahead:foo).{6}/
abcfoobarxyz
0: foobar
/(?(*pla:foo).{6}|a..)/
foobarbaz
0: foobar
abcfoobar
0: abc
/(?(*positive_lookahead:foo).{6}|a..)/
foobarbaz
0: foobar
abcfoobar
0: abc
/(*plb:foo)bar/
abcfoobar
0: bar
\= Expect no match
abcbarfoo
No match
/(*positive_lookbehind:foo)bar/
abcfoobar
0: bar
\= Expect no match
abcbarfoo
No match
/(?(*plb:foo)bar|baz)/
abcfoobar
0: bar
bazfoobar
0: baz
abcbazfoobar
0: baz
foobazfoobar
0: bar
/(?(*positive_lookbehind:foo)bar|baz)/
abcfoobar
0: bar
bazfoobar
0: baz
abcbazfoobar
0: baz
foobazfoobar
0: bar
/(*nlb:foo)bar/
abcbarfoo
0: bar
\= Expect no match
abcfoobar
No match
/(*negative_lookbehind:foo)bar/
abcbarfoo
0: bar
\= Expect no match
abcfoobar
No match
/(?(*nlb:foo)bar|baz)/
abcfoobaz
0: baz
abcbarbaz
0: bar
\= Expect no match
abcfoobar
No match
/(?(*negative_lookbehind:foo)bar|baz)/
abcfoobaz
0: baz
abcbarbaz
0: bar
\= Expect no match
abcfoobar
No match
/(*atomic:a+)\w/
aaab
0: aaab
\= Expect no match
aaaa
No match
# End of testinput1


@ -575,7 +575,7 @@ Last code unit = 'b'
Subject length lower bound = 3
/(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
-Failed: error 160 at offset 12: (*VERB) not recognized or malformed
+Failed: error 160 at offset 14: (*VERB) not recognized or malformed
/\h/I,utf
Capturing subpattern count = 0


@ -538,7 +538,7 @@ No match
Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
/(*UTF16)\x{11234}/
-Failed: error 160 at offset 5: (*VERB) not recognized or malformed
+Failed: error 160 at offset 7: (*VERB) not recognized or malformed
abcd\x{11234}pqr
/(*UTF)\x{11234}/I
@ -559,7 +559,7 @@ Failed: error 160 at offset 5: (*VERB) not recognized or malformed
abcd\x{11234}pqr
/(*CRLF)(*UTF16)(*BSR_UNICODE)a\Rb/I
-Failed: error 160 at offset 12: (*VERB) not recognized or malformed
+Failed: error 160 at offset 14: (*VERB) not recognized or malformed
/(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
Capturing subpattern count = 0


@ -16812,6 +16812,15 @@ No match
abc\ndef\nxyz
No match
/(?(*ACCEPT)xxx)/
Failed: error 128 at offset 2: assertion expected after (?( or (?(?C)
/(?(*atomic:xx)xxx)/
Failed: error 128 at offset 10: assertion expected after (?( or (?(?C)
/(?(*script_run:xxx)zzz)/
Failed: error 128 at offset 14: assertion expected after (?( or (?(?C)
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)
Error -62: bad serialized data