Implement Perl 5.28's alphabetic lookaround syntax, e.g. (*pla:...), and also (*atomic:...).
Philip.Hazel 2018-09-24 16:23:53 +00:00
parent 69254c77f1
commit f26b0b0bae
21 changed files with 1218 additions and 734 deletions

@ -5,8 +5,8 @@ Change Log for PCRE2
Version 10.33-RC1 15-September-2018 Version 10.33-RC1 15-September-2018
----------------------------------- -----------------------------------
1. Added "allvector" to pcre2test to make it easy to check the part of the 1. Added "allvector" to pcre2test to make it easy to check the part of the
ovector that shouldn't be changed, in particular after substitute and failed or ovector that shouldn't be changed, in particular after substitute and failed or
partial matches. partial matches.
2. Fix subject buffer overread in JIT when UTF is disabled and \X or \R has 2. Fix subject buffer overread in JIT when UTF is disabled and \X or \R has
@ -15,13 +15,21 @@ a greater than 1 fixed quantifier. This issue was found by Yunho Kim.
3. Added support for callouts from pcre2_substitute(). 3. Added support for callouts from pcre2_substitute().
4. The POSIX functions are now all called pcre2_regcomp() etc., with wrappers 4. The POSIX functions are now all called pcre2_regcomp() etc., with wrappers
that use the standard POSIX names. This should help avoid linking with the that use the standard POSIX names. This should help avoid linking with the
wrong library in some environments. wrong library in some environments.
5. Fix an xclass matching issue in JIT. 5. Fix an xclass matching issue in JIT.
6. Implement PCRE2_EXTRA_ESCAPED_CR_IS_LF (see Bugzilla 2315). 6. Implement PCRE2_EXTRA_ESCAPED_CR_IS_LF (see Bugzilla 2315).
7. Implement the Perl 5.28 experimental alphabetic names for atomic groups and
lookaround assertions, for example, (*pla:...) and (*atomic:...). These are
characterized by a lower case letter following (* and to simplify coding for
this, the character tables created by pcre2_maketables() were updated to add a
new "is lower case letter" bit. At the same time, the now unused "is
hexadecimal digit" bit was removed. The default tables in
src/pcre2_chartables.c.dist are updated.
Version 10.32 10-September-2018 Version 10.32 10-September-2018
------------------------------- -------------------------------

@ -2120,6 +2120,11 @@ special parenthesis, starting with (?> as in this example:
<pre> <pre>
(?&#62;\d+)foo (?&#62;\d+)foo
</pre> </pre>
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
<pre>
(*atomic:\d+)foo
</pre>
This kind of parenthesis "locks up" the part of the pattern it contains once This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as backtracking into it. Backtracking past it to previous items, however, works as
@ -2342,11 +2347,17 @@ coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<P> <P>
More complicated assertions are coded as subpatterns. There are two kinds: More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must that look behind it, and in each case an assertion may be positive (must match
succeed for matching to continue) or negative (must not succeed for matching to for the assertion to be true) or negative (must not match for the assertion to
continue). An assertion subpattern is matched in the normal way, except that, be true). An assertion subpattern is matched in the normal way, and if it is
when matching continues after a successful assertion, the matching position in true, matching continues after it, but with the matching position in the
the subject string is as it was before the assertion was processed. subject string as it was before the assertion was processed.
</P>
<P>
A lookaround assertion may also appear as the condition in a
<a href="#conditions">conditional subpattern</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P> </P>
<P> <P>
Assertion subpatterns are not capturing subpatterns. If an assertion contains Assertion subpatterns are not capturing subpatterns. If an assertion contains
@ -2359,7 +2370,7 @@ adjacent characters are the same.
<P> <P>
When a branch within an assertion fails to match, any substrings that were When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match; match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion. depends on the type of assertion.
@ -2368,7 +2379,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a assertion is not true. If such an assertion is being used as a condition in a
<a href="#conditions">conditional subpattern</a> <a href="#conditions">conditional subpattern</a>
(see below), captured substrings are retained, because matching continues with (see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions, the "no" branch of the condition. For other failing negative assertions,
@ -2398,6 +2409,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching. The assertion is obeyed just once when encountered during matching.
</P> </P>
<br><b> <br><b>
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
<pre>
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?&#60;=
(*negative_lookbehind: or (*nlb: is the same as (?&#60;!
</pre>
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
</P>
<br><b>
Lookahead assertions Lookahead assertions
</b><br> </b><br>
<P> <P>
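For illustration, the synonym table in the "Alphabetic assertion names" section above, together with the (*atomic:...) form, can be written as a small lookup. This is a minimal, self-contained C sketch and not code from this commit: the names `synonym` and `symbolic_form` are hypothetical, and PCRE2 does not translate patterns to text this way internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: one alphabetic name and its symbolic equivalent. */
typedef struct {
  const char *alpha;     /* name written after "(*" and before ":" */
  const char *symbolic;  /* traditional opening sequence */
} synonym;

static const synonym synonyms[] = {
  { "positive_lookahead",  "(?="  }, { "pla", "(?="  },
  { "negative_lookahead",  "(?!"  }, { "nla", "(?!"  },
  { "positive_lookbehind", "(?<=" }, { "plb", "(?<=" },
  { "negative_lookbehind", "(?<!" }, { "nlb", "(?<!" },
  { "atomic",              "(?>"  }
};

/* Return the symbolic opening for an alphabetic name, or NULL if the name
   is not one of the documented synonyms. The comparison is case-sensitive
   because the names must be written in lower case. */
const char *symbolic_form(const char *name)
{
  assert(name != NULL);
  for (size_t i = 0; i < sizeof synonyms / sizeof synonyms[0]; i++)
    if (strcmp(synonyms[i].alpha, name) == 0) return synonyms[i].symbolic;
  return NULL;
}
```

With this sketch, `symbolic_form("pla")` yields `(?=`, while an upper case spelling such as `"PLA"` is rejected.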
@ -3630,7 +3660,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 21 September 2018 Last updated: 24 September 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

@ -436,6 +436,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
<P> <P>
<pre> <pre>
(?&#62;...) atomic, non-capturing group (?&#62;...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
</PRE> </PRE>
</P> </P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br> <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
@ -514,12 +515,23 @@ setting with a similar syntax.
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br> <br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P> <P>
<pre> <pre>
(?=...) positive look ahead (?=...) )
(?!...) negative look ahead (*pla:...) ) positive lookahead
(?&#60;=...) positive look behind (*positive_lookahead:...) )
(?&#60;!...) negative look behind
(?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
(?&#60;=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
(?&#60;!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
</pre> </pre>
Each top-level branch of a look behind must be of a fixed length. Each top-level branch of a lookbehind must be of a fixed length.
</P> </P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br> <br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P> <P>
@ -634,7 +646,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 02 September 2018 Last updated: 24 September 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

File diff suppressed because it is too large

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "21 September 2018" "PCRE2 10.33" .TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2124,6 +2124,11 @@ special parenthesis, starting with (?> as in this example:
.sp .sp
(?>\ed+)foo (?>\ed+)foo
.sp .sp
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
.sp
(*atomic:\ed+)foo
.sp
This kind of parenthesis "locks up" the part of the pattern it contains once This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as backtracking into it. Backtracking past it to previous items, however, works as
@ -2351,11 +2356,19 @@ above.
.P .P
More complicated assertions are coded as subpatterns. There are two kinds: More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those those that look ahead of the current position in the subject string, and those
that look behind it, and in each case an assertion may be positive (must that look behind it, and in each case an assertion may be positive (must match
succeed for matching to continue) or negative (must not succeed for matching to for the assertion to be true) or negative (must not match for the assertion to
continue). An assertion subpattern is matched in the normal way, except that, be true). An assertion subpattern is matched in the normal way, and if it is
when matching continues after a successful assertion, the matching position in true, matching continues after it, but with the matching position in the
the subject string is as it was before the assertion was processed. subject string as it was before the assertion was processed.
.P
A lookaround assertion may also appear as the condition in a
.\" HTML <a href="#conditions">
.\" </a>
conditional subpattern
.\"
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
.P .P
Assertion subpatterns are not capturing subpatterns. If an assertion contains Assertion subpatterns are not capturing subpatterns. If an assertion contains
capturing subpatterns within it, these are counted for the purposes of capturing subpatterns within it, these are counted for the purposes of
@ -2366,7 +2379,7 @@ adjacent characters are the same.
.P .P
When a branch within an assertion fails to match, any substrings that were When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion succeeds only when all its branches fail to match; match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion. depends on the type of assertion.
@ -2374,7 +2387,7 @@ depends on the type of assertion.
For a positive assertion, internally captured substrings in the successful For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the the assertion. For a negative assertion, a matching branch means that the
assertion has failed. If the assertion is being used as a condition in a assertion is not true. If such an assertion is being used as a condition in a
.\" HTML <a href="#conditions"> .\" HTML <a href="#conditions">
.\" </a> .\" </a>
conditional subpattern conditional subpattern
@ -2406,6 +2419,25 @@ without the assertion, the order depending on the greediness of the quantifier.
The assertion is obeyed just once when encountered during matching. The assertion is obeyed just once when encountered during matching.
. .
. .
.SS "Alphabetic assertion names"
.rs
.sp
Traditionally, symbolic sequences such as (?= and (?<= have been used to specify
lookaround assertions. Perl 5.28 introduced some experimental alphabetic
alternatives which might be easier to remember. They all start with (* instead
of (? and must be written using lower case letters. PCRE2 supports the
following synonyms:
.sp
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?<=
(*negative_lookbehind: or (*nlb: is the same as (?<!
.sp
For example, (*pla:foo) is the same assertion as (?=foo). However, in the
following sections, the various assertions are described using the original
symbolic forms.
.
.
.SS "Lookahead assertions" .SS "Lookahead assertions"
.rs .rs
.sp .sp
@ -3660,6 +3692,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 September 2018 Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "02 September 2018" "PCRE2 10.32" .TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -411,6 +411,7 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
.rs .rs
.sp .sp
(?>...) atomic, non-capturing group (?>...) atomic, non-capturing group
(*atomic:...) atomic, non-capturing group
. .
. .
.SH "COMMENT" .SH "COMMENT"
@ -491,12 +492,23 @@ setting with a similar syntax.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs .rs
.sp .sp
(?=...) positive look ahead (?=...) )
(?!...) negative look ahead (*pla:...) ) positive lookahead
(?<=...) positive look behind (*positive_lookahead:...) )
(?<!...) negative look behind
.sp .sp
Each top-level branch of a look behind must be of a fixed length. (?!...) )
(*nla:...) ) negative lookahead
(*negative_lookahead:...) )
.sp
(?<=...) )
(*plb:...) ) positive lookbehind
(*positive_lookbehind:...) )
.sp
(?<!...) )
(*nlb:...) ) negative lookbehind
(*negative_lookbehind:...) )
.sp
Each top-level branch of a lookbehind must be of a fixed length.
. .
. .
.SH "BACKREFERENCES" .SH "BACKREFERENCES"
@ -621,6 +633,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 02 September 2018 Last updated: 24 September 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi

@ -103,19 +103,22 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,128, 0,0,0,0,0,0,0,128,
255,255,255,255,0,0,0,0, 255,255,255,255,0,0,0,0,
0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,
/* Fiddled by hand when the table bits changed. May be broken! */
128,0,0,0,0,0,0,0, 128,0,0,0,0,0,0,0,
0,1,1,0,1,1,0,0, 0,1,1,1,1,1,0,0,
0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,
1,0,0,0,128,0,0,0, 1,0,0,0,128,0,0,0,
128,128,128,128,0,0,128,0, 128,128,128,128,0,0,128,0,
28,28,28,28,28,28,28,28, 24,24,24,24,24,24,24,24,
28,28,0,0,0,0,0,128, 24,24,0,0,0,0,0,128,
0,26,26,26,26,26,26,18, 0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,128,128,0,128,16, 18,18,18,128,128,0,128,16,
0,26,26,26,26,26,26,18, 0,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,128,128,0,0,0, 18,18,18,128,128,0,0,0,
@ -125,8 +128,8 @@ const unsigned char _pcre_default_tables[] = {
0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0,
0,0,18,0,0,0,0,0, 0,0,18,0,0,0,0,0,
0,0,20,20,0,18,0,0, 0,0,24,24,0,18,0,0,
0,20,18,0,0,0,0,0, 0,24,18,0,0,0,0,0,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,18, 18,18,18,18,18,18,18,18,
18,18,18,18,18,18,18,0, 18,18,18,18,18,18,18,0,

@ -75,6 +75,10 @@ fi
(echo "$prefix" ; cat <<'PERLEND' (echo "$prefix" ; cat <<'PERLEND'
# The alpha assertions currently give warnings even when -w is not specified.
no warnings "experimental::alpha_assertions";
# Function for turning a string into a string of printing chars. # Function for turning a string into a string of printing chars.
sub pchars { sub pchars {
@ -129,6 +133,9 @@ else { $outfile = "STDOUT"; }
printf($outfile "Perl $] Regular Expressions\n\n"); printf($outfile "Perl $] Regular Expressions\n\n");
$extra_modifiers = "";
$default_show_mark = 0;
# Main loop # Main loop
NEXT_RE: NEXT_RE:
@ -370,7 +377,10 @@ for (;;)
} }
} }
# printf $outfile "\n"; # By closing OUTFILE explicitly, we avoid a Perl warning in -w mode
# 'Name "main::OUTFILE" used only once'.
close(OUTFILE) if $outfile eq "OUTFILE";
PERLEND PERLEND
) | $perl $perlarg - $@ ) | $perl $perlarg - $@

@ -183,10 +183,10 @@ fprintf(f,
"/* This table identifies various classes of character by individual bits:\n" "/* This table identifies various classes of character by individual bits:\n"
" 0x%02x white space character\n" " 0x%02x white space character\n"
" 0x%02x letter\n" " 0x%02x letter\n"
" 0x%02x lower case letter\n"
" 0x%02x decimal digit\n" " 0x%02x decimal digit\n"
" 0x%02x hexadecimal digit\n"
" 0x%02x alphanumeric or '_'\n*/\n\n", " 0x%02x alphanumeric or '_'\n*/\n\n",
ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word); ctype_space, ctype_letter, ctype_lcletter, ctype_digit, ctype_word);
fprintf(f, " "); fprintf(f, " ");
for (i = 0; i < 256; i++) for (i = 0; i < 256; i++)

@ -320,6 +320,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192 #define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
#define PCRE2_ERROR_SUPPORTED_ONLY_IN_UNICODE 193 #define PCRE2_ERROR_SUPPORTED_ONLY_IN_UNICODE 193
#define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194 #define PCRE2_ERROR_INVALID_HYPHEN_IN_OPTIONS 194
#define PCRE2_ERROR_ALPHA_ASSERTION_UNKNOWN 195
/* "Expected" matching error codes: no match and partial match. */ /* "Expected" matching error codes: no match and partial match. */

@ -157,8 +157,8 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
/* This table identifies various classes of character by individual bits: /* This table identifies various classes of character by individual bits:
0x01 white space character 0x01 white space character
0x02 letter 0x02 letter
0x04 decimal digit 0x04 lower case letter
0x08 hexadecimal digit 0x08 decimal digit
0x10 alphanumeric or '_' 0x10 alphanumeric or '_'
*/ */
@ -168,16 +168,16 @@ graph print, punct, and cntrl. Other classes are built from combinations. */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */
0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* - ' */ 0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* - ' */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* ( - / */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* ( - / */
0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c, /* 0 - 7 */ 0x18,0x18,0x18,0x18,0x18,0x18,0x18,0x18, /* 0 - 7 */
0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */ 0x18,0x18,0x00,0x00,0x00,0x00,0x00,0x00, /* 8 - ? */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* @ - G */ 0x00,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* @ - G */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */ 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */ 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x10, /* X - _ */ 0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x10, /* X - _ */
0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* ` - g */ 0x00,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* ` - g */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* h - o */ 0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* h - o */
0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* p - w */ 0x16,0x16,0x16,0x16,0x16,0x16,0x16,0x16, /* p - w */
0x12,0x12,0x12,0x00,0x00,0x00,0x00,0x00, /* x -127 */ 0x16,0x16,0x16,0x00,0x00,0x00,0x00,0x00, /* x -127 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */
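The new bit layout can be checked against the table rows above with a small sketch. The bit values and the `ctype_*` names are taken from the diff; the function `ascii_ctype` is a hypothetical stand-in that assumes ASCII and the C-locale white space set, whereas the real tables are built by pcre2_maketables().

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as listed in the updated table comment. */
enum {
  ctype_space    = 0x01,  /* white space character */
  ctype_letter   = 0x02,  /* letter */
  ctype_lcletter = 0x04,  /* lower case letter (new in this commit) */
  ctype_digit    = 0x08,  /* decimal digit (moved from 0x04) */
  ctype_word     = 0x10   /* alphanumeric or '_' */
};

/* Hypothetical helper: compute the ctype byte for one ASCII character. */
uint8_t ascii_ctype(unsigned char c)
{
  unsigned int bits = 0;
  if (c == ' ' || (c >= '\t' && c <= '\r')) bits |= ctype_space;
  if (c >= '0' && c <= '9') bits |= ctype_digit | ctype_word;
  if (c >= 'A' && c <= 'Z') bits |= ctype_letter | ctype_word;
  if (c >= 'a' && c <= 'z') bits |= ctype_letter | ctype_lcletter | ctype_word;
  if (c == '_') bits |= ctype_word;
  /* A lower case letter is always also a letter. */
  assert((bits & ctype_lcletter) == 0 || (bits & ctype_letter) != 0);
  return (uint8_t)bits;
}
```

Under these assumptions the sketch reproduces the new rows: digits give 0x18, upper case letters 0x12, lower case letters 0x16 and '_' gives 0x10.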

@ -615,6 +615,46 @@ static const uint32_t verbops[] = {
OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE, OP_MARK, OP_ACCEPT, OP_FAIL, OP_COMMIT, OP_COMMIT_ARG, OP_PRUNE,
OP_PRUNE_ARG, OP_SKIP, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG }; OP_PRUNE_ARG, OP_SKIP, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG };
/* Table of "alpha assertions" like (*pla:...), similar to the (*VERB) table. */
typedef struct alasitem {
unsigned int len; /* Length of name */
uint32_t meta; /* Base META_ code */
} alasitem;
static const char alasnames[] =
STRING_pla0
STRING_plb0
STRING_nla0
STRING_nlb0
STRING_positive_lookahead0
STRING_positive_lookbehind0
STRING_negative_lookahead0
STRING_negative_lookbehind0
STRING_atomic0
STRING_sr0
STRING_asr0
STRING_script_run0
STRING_atomic_script_run;
static const alasitem alasmeta[] = {
{ 3, META_LOOKAHEAD },
{ 3, META_LOOKBEHIND },
{ 3, META_LOOKAHEADNOT },
{ 3, META_LOOKBEHINDNOT },
{ 18, META_LOOKAHEAD },
{ 19, META_LOOKBEHIND },
{ 18, META_LOOKAHEADNOT },
{ 19, META_LOOKBEHINDNOT },
{ 6, META_ATOMIC },
{ 2, 0 }, /* sr = script run */
{ 3, 0 }, /* asr = atomic script run */
{ 10, 0 }, /* script run */
{ 17, 0 } /* atomic script run */
};
static const int alascount = sizeof(alasmeta)/sizeof(alasitem);
/* Offsets from OP_STAR for case-independent and negative repeat opcodes. */ /* Offsets from OP_STAR for case-independent and negative repeat opcodes. */
static uint32_t chartypeoffset[] = { static uint32_t chartypeoffset[] = {
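The alasnames/alasmeta pair in the hunk above stores all names in one NUL-separated string, with a parallel array of lengths and codes. A minimal sketch of that lookup technique follows. It is an illustration only: the enum values stand in for the real META_ codes, `find_alpha_assertion` is a hypothetical name, `memcmp` replaces the internal string comparison, and the script run entries are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the META_ codes; the real values differ. */
enum { LOOKAHEAD = 1, LOOKBEHIND, LOOKAHEADNOT, LOOKBEHINDNOT, ATOMIC };

typedef struct {
  unsigned int len;  /* length of name */
  int meta;          /* code for the group type */
} name_item;

/* All names in one string, each terminated by NUL (the last one implicitly). */
static const char names[] =
  "pla\0" "plb\0" "nla\0" "nlb\0"
  "positive_lookahead\0" "positive_lookbehind\0"
  "negative_lookahead\0" "negative_lookbehind\0"
  "atomic";

/* Same order as the names above. */
static const name_item items[] = {
  { 3, LOOKAHEAD }, { 3, LOOKBEHIND }, { 3, LOOKAHEADNOT }, { 3, LOOKBEHINDNOT },
  { 18, LOOKAHEAD }, { 19, LOOKBEHIND }, { 18, LOOKAHEADNOT }, { 19, LOOKBEHINDNOT },
  { 6, ATOMIC }
};

/* Scan the table: compare lengths first, then bytes, and step over each
   name plus its NUL terminator. Returns -1 for an unknown name. */
int find_alpha_assertion(const char *name, size_t namelen)
{
  const char *vn = names;
  assert(name != NULL);
  for (size_t i = 0; i < sizeof items / sizeof items[0]; i++)
    {
    if (namelen == items[i].len && memcmp(name, vn, namelen) == 0)
      return items[i].meta;
    vn += items[i].len + 1;
    }
  return -1;
}
```

Because the length is passed separately, the name does not need its own terminator: `find_alpha_assertion("pla:foo)", 3)` finds the lookahead entry.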
@ -732,7 +772,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92, ERR93, ERR94 }; ERR91, ERR92, ERR93, ERR94, ERR95 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such /* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1447,9 +1487,9 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
c = (uint32_t)i; c = (uint32_t)i;
if (cb != NULL && c == CHAR_CR && if (cb != NULL && c == CHAR_CR &&
(cb->cx->extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0) (cb->cx->extra_options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)
c = CHAR_LF; c = CHAR_LF;
} }
else /* Negative table entry */ else /* Negative table entry */
{ {
escape = -i; /* Else return a special escape */ escape = -i; /* Else return a special escape */
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X)) if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
@ -1499,7 +1539,7 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
} }
} }
/* Escapes that need further processing, including those that are unknown, have /* Escapes that need further processing, including those that are unknown, have
a zero entry in the lookup table. When called from pcre2_substitute(), only \c, a zero entry in the lookup table. When called from pcre2_substitute(), only \c,
\o, and \x are recognized (and \u when BSUX is set). */ \o, and \x are recognized (and \u when BSUX is set). */
@ -2133,9 +2173,10 @@ return -1;
*************************************************/ *************************************************/
/* This function is called from parse_regex() below whenever it needs to read /* This function is called from parse_regex() below whenever it needs to read
the name of a subpattern or a (*VERB). The initial pointer must be to the the name of a subpattern or a (*VERB) or an (*alpha_assertion). The initial
character before the name. If that character is '*' we are reading a verb name. pointer must be to the character before the name. If that character is '*' we
The pointer is updated to point after the name, for a VERB, or after the name's are reading a verb or alpha assertion name. The pointer is updated to point
after the name, for a VERB or alpha assertion name, or after the name's
terminator for a subpattern name. Returning both the offset and the name terminator for a subpattern name. Returning both the offset and the name
pointer is redundant information, but some callers use one and some the other, pointer is redundant information, but some callers use one and some the other,
so it is simplest just to return both. so it is simplest just to return both.
@ -2160,27 +2201,29 @@ read_name(PCRE2_SPTR *ptrptr, PCRE2_SPTR ptrend, uint32_t terminator,
int *errorcodeptr, compile_block *cb) int *errorcodeptr, compile_block *cb)
{ {
PCRE2_SPTR ptr = *ptrptr; PCRE2_SPTR ptr = *ptrptr;
BOOL is_verb = (*ptr == CHAR_ASTERISK); BOOL is_group = (*ptr != CHAR_ASTERISK);
uint32_t namelen = 0; uint32_t namelen = 0;
uint32_t ctype = is_verb? ctype_letter : ctype_word;
if (++ptr >= ptrend) if (++ptr >= ptrend) /* No characters in name */
{ {
*errorcodeptr = is_verb? ERR60: /* Verb not recognized or malformed */ *errorcodeptr = is_group? ERR62: /* Subpattern name expected */
ERR62; /* Subpattern name expected */ ERR60; /* Verb not recognized or malformed */
goto FAILED; goto FAILED;
} }
/* A group name must not start with a digit. If either of the others start with
a digit it just won't be recognized. */
if (is_group && IS_DIGIT(*ptr))
{
*errorcodeptr = ERR44;
goto FAILED;
}
*nameptr = ptr; *nameptr = ptr;
*offsetptr = (PCRE2_SIZE)(ptr - cb->start_pattern); *offsetptr = (PCRE2_SIZE)(ptr - cb->start_pattern);
if (IS_DIGIT(*ptr)) while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype_word) != 0)
{
*errorcodeptr = ERR44; /* Group name must not start with digit */
goto FAILED;
}
while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
{ {
ptr++; ptr++;
namelen++; namelen++;
@ -2192,9 +2235,9 @@ while (ptr < ptrend && MAX_255(*ptr) && (cb->ctypes[*ptr] & ctype) != 0)
} }
/* Subpattern names must not be empty, and their terminator is checked here. /* Subpattern names must not be empty, and their terminator is checked here.
(What follows a verb name is checked separately.) */ (What follows a verb or alpha assertion name is checked separately.) */
if (!is_verb) if (is_group)
{ {
if (namelen == 0) if (namelen == 0)
{ {
@ -2652,24 +2695,31 @@ while (ptr < ptrend)
if (expect_cond_assert > 0) if (expect_cond_assert > 0)
{ {
BOOL ok = c == CHAR_LEFT_PARENTHESIS && ptrend - ptr >= 3 && BOOL ok = c == CHAR_LEFT_PARENTHESIS && ptrend - ptr >= 3 &&
ptr[0] == CHAR_QUESTION_MARK; (ptr[0] == CHAR_QUESTION_MARK || ptr[0] == CHAR_ASTERISK);
if (ok) switch(ptr[1]) if (ok)
{ {
case CHAR_C: if (ptr[0] == CHAR_ASTERISK) /* New alpha assertion format, possibly */
ok = expect_cond_assert == 2; {
break; ok = MAX_255(ptr[1]) && (cb->ctypes[ptr[1]] & ctype_lcletter) != 0;
}
case CHAR_EQUALS_SIGN: else switch(ptr[1]) /* Traditional symbolic format */
case CHAR_EXCLAMATION_MARK: {
break; case CHAR_C:
ok = expect_cond_assert == 2;
case CHAR_LESS_THAN_SIGN: break;
ok = ptr[2] == CHAR_EQUALS_SIGN || ptr[2] == CHAR_EXCLAMATION_MARK;
break; case CHAR_EQUALS_SIGN:
case CHAR_EXCLAMATION_MARK:
default: break;
ok = FALSE;
} case CHAR_LESS_THAN_SIGN:
ok = ptr[2] == CHAR_EQUALS_SIGN || ptr[2] == CHAR_EXCLAMATION_MARK;
break;
default:
ok = FALSE;
}
}
if (!ok) if (!ok)
{ {
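The check in the hunk above, which decides whether the start of a condition looks like an assertion, can be summarized as a predicate. This is a hedged sketch rather than the PCRE2 code: `condition_start_ok` and its parameters are hypothetical, ASCII is assumed for the lower case test, and `callout_allowed` stands in for the `expect_cond_assert == 2` comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* after_paren points just past the opening '(' and remaining is the number
   of characters left from there. Returns true if what follows could be an
   assertion (or a callout, when one is still allowed). */
bool condition_start_ok(const char *after_paren, size_t remaining,
  bool callout_allowed)
{
  assert(after_paren != NULL);
  if (remaining < 3) return false;

  /* New alphabetic format: "(*" followed by a lower case letter. The name
     itself is only validated later. */
  if (after_paren[0] == '*')
    return after_paren[1] >= 'a' && after_paren[1] <= 'z';

  if (after_paren[0] != '?') return false;

  switch (after_paren[1])   /* Traditional symbolic format */
    {
    case 'C':               /* Callout before the assertion */
    return callout_allowed;

    case '=':               /* (?= and (?! */
    case '!':
    return true;

    case '<':               /* (?<= and (?<! but not (?<name> */
    return after_paren[2] == '=' || after_paren[2] == '!';

    default:
    return false;
    }
}
```

A lower case start only means the text may be an alpha assertion; an unknown name such as (*foo:...) is still rejected when the name table is scanned.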
@ -3453,7 +3503,8 @@ while (ptr < ptrend)
case CHAR_LEFT_PARENTHESIS: case CHAR_LEFT_PARENTHESIS:
if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS; if (ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
/* If ( is not followed by ? it is either a capture or a special verb. */ /* If ( is not followed by ? it is either a capture or a special verb or an
alpha assertion. */
if (*ptr != CHAR_QUESTION_MARK) if (*ptr != CHAR_QUESTION_MARK)
{ {
@ -3473,13 +3524,88 @@ while (ptr < ptrend)
else *parsed_pattern++ = META_NOCAPTURE; else *parsed_pattern++ = META_NOCAPTURE;
} }
/* Do nothing for (* followed by end of pattern or ) so it gives a "bad
quantifier" error rather than "(*MARK) must have an argument". */
else if (ptrend - ptr <= 1 || (c = ptr[1]) == CHAR_RIGHT_PARENTHESIS)
break;
/* Handle "alpha assertions" such as (*pla:...). Most of these are
synonyms for the historical symbolic assertions, but the script run ones
are new. They are distinguished by starting with a lower case letter.
The test uses the ctype_lcletter bit in the character tables, which makes
this work in all character codes. */
else if (CHMAX_255(c) && (cb->ctypes[c] & ctype_lcletter) != 0)
{
uint32_t meta;
vn = alasnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
cb)) goto FAILED;
if (ptr >= ptrend || *ptr != CHAR_COLON)
{
errorcode = ERR95; /* Malformed */
goto FAILED;
}
/* Scan the table of alpha assertion names */
for (i = 0; i < alascount; i++)
{
if (namelen == alasmeta[i].len &&
PRIV(strncmp_c8)(name, vn, namelen) == 0)
break;
vn += alasmeta[i].len + 1;
}
if (i >= alascount)
{
errorcode = ERR95; /* Alpha assertion not recognized */
goto FAILED;
}
/* Check for expecting an assertion condition. If so, only lookaround
assertions are valid. */
meta = alasmeta[i].meta;
if (prev_expect_cond_assert > 0 &&
(meta < META_LOOKAHEAD || meta > META_LOOKBEHINDNOT))
{
errorcode = ERR28; /* Assertion expected */
goto FAILED;
}
switch(meta)
{
case META_ATOMIC:
goto ATOMIC_GROUP;
case META_LOOKAHEAD:
goto POSITIVE_LOOK_AHEAD;
case META_LOOKAHEADNOT:
goto NEGATIVE_LOOK_AHEAD;
case META_LOOKBEHIND:
case META_LOOKBEHINDNOT:
*parsed_pattern++ = meta;
ptr--;
goto LOOKBEHIND;
/* FIXME: Script Run stuff ... */
}
}
/* ---- Handle (*VERB) and (*VERB:NAME) ---- */ /* ---- Handle (*VERB) and (*VERB:NAME) ---- */
/* Do nothing for (*) so it gives a "bad quantifier" error rather than else
"(*MARK) must have an argument". */
else if (ptrend - ptr > 1 && ptr[1] != CHAR_RIGHT_PARENTHESIS)
{ {
vn = verbnames; vn = verbnames;
if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode, if (!read_name(&ptr, ptrend, 0, &offset, &name, &namelen, &errorcode,
@@ -3946,14 +4072,15 @@ while (ptr < ptrend)
 if (++ptr >= ptrend) goto UNCLOSED_PARENTHESIS;
 nest_depth++;
-/* If the next character is ? there must be an assertion next (optionally
-preceded by a callout). We do not check this here, but instead we set
-expect_cond_assert to 2. If this is still greater than zero (callouts
-decrement it) when the next assertion is read, it will be marked as a
-condition that must not be repeated. A value greater than zero also
-causes checking that an assertion (possibly with callout) follows. */
+/* If the next character is ? or * there must be an assertion next
+(optionally preceded by a callout). We do not check this here, but
+instead we set expect_cond_assert to 2. If this is still greater than
+zero (callouts decrement it) when the next assertion is read, it will be
+marked as a condition that must not be repeated. A value greater than
+zero also causes checking that an assertion (possibly with callout)
+follows. */
-if (*ptr == CHAR_QUESTION_MARK)
+if (*ptr == CHAR_QUESTION_MARK || *ptr == CHAR_ASTERISK)
 {
 *parsed_pattern++ = META_COND_ASSERT;
 ptr--; /* Pull pointer back to the opening parenthesis. */
@@ -4099,6 +4226,7 @@ while (ptr < ptrend)
 /* ---- Atomic group ---- */
 case CHAR_GREATER_THAN_SIGN:
+ATOMIC_GROUP: /* Come from (*atomic: */
 *parsed_pattern++ = META_ATOMIC;
 nest_depth++;
 ptr++;
@@ -4108,11 +4236,13 @@ while (ptr < ptrend)
 /* ---- Lookahead assertions ---- */
 case CHAR_EQUALS_SIGN:
+POSITIVE_LOOK_AHEAD: /* Come from (*pla: */
 *parsed_pattern++ = META_LOOKAHEAD;
 ptr++;
 goto POST_ASSERTION;
 case CHAR_EXCLAMATION_MARK:
+NEGATIVE_LOOK_AHEAD: /* Come from (*nla: */
 *parsed_pattern++ = META_LOOKAHEADNOT;
 ptr++;
 goto POST_ASSERTION;
@@ -4132,6 +4262,8 @@ while (ptr < ptrend)
 }
 *parsed_pattern++ = (ptr[1] == CHAR_EQUALS_SIGN)?
 META_LOOKBEHIND : META_LOOKBEHINDNOT;
+LOOKBEHIND: /* Come from (*plb: and (*nlb: */
 *has_lookbehind = TRUE;
 offset = (PCRE2_SIZE)(ptr - cb->start_pattern - 2);
 PUTOFFSET(offset, parsed_pattern);
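
The name scan in the alpha assertion branch walks a single string of NUL-separated names in step with a parallel array of lengths and codes. The following is a minimal, self-contained sketch of that technique only. Every identifier in it (demo_names, demo_items, demo_lookup, the DEMO_* codes) is hypothetical, and plain strncmp() stands in for PRIV(strncmp_c8); it is not the code added by this commit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical codes; the real META_* values are defined in pcre2_compile.c. */
enum { DEMO_ATOMIC = 1, DEMO_PLA, DEMO_PLB, DEMO_NLA, DEMO_NLB };

typedef struct {
  size_t len;   /* length of the name, excluding its terminating NUL */
  int meta;     /* code to return when the name matches */
} demo_alasitem;

/* All names in one string, each followed by NUL (the literal supplies the
last one). The order must match demo_items[]. */
static const char demo_names[] = "pla\0" "plb\0" "nla\0" "nlb\0" "atomic";

static const demo_alasitem demo_items[] = {
  { 3, DEMO_PLA }, { 3, DEMO_PLB }, { 3, DEMO_NLA }, { 3, DEMO_NLB },
  { 6, DEMO_ATOMIC }
};

static const size_t demo_count = sizeof(demo_items) / sizeof(demo_items[0]);

/* Return the code for name[0..namelen), or 0 if the name is not listed. */
static int demo_lookup(const char *name, size_t namelen)
{
  const char *vn = demo_names;
  size_t i;
  for (i = 0; i < demo_count; i++)
    {
    if (namelen == demo_items[i].len && strncmp(name, vn, namelen) == 0)
      return demo_items[i].meta;
    vn += demo_items[i].len + 1;   /* step over this name and its NUL */
    }
  return 0;
}
```

Comparing the lengths first means a prefix such as "pl" cannot match "pla", and the caller does not need a NUL-terminated copy of the name taken from the pattern.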


@@ -181,6 +181,8 @@ static const unsigned char compile_error_texts[] =
 "invalid option bits with PCRE2_LITERAL\0"
 "\\N{U+dddd} is supported only in Unicode (UTF) mode\0"
 "invalid hyphen in option setting\0"
+/* 95 */
+"(*alpha_assertion) not recognized\0"
 ;
 /* Match-time and UTF error texts are in the same format. */


@@ -569,11 +569,11 @@ these tables. */
 without checking pcre2_jit_compile.c, which has an assertion to ensure that
 ctype_word has the value 16. */
 #define ctype_space 0x01
 #define ctype_letter 0x02
-#define ctype_digit 0x04
-#define ctype_xdigit 0x08 /* not actually used any more */
+#define ctype_lcletter 0x04
+#define ctype_digit 0x08
 #define ctype_word 0x10 /* alphanumeric or '_' */
 /* Offsets of the various tables from the base tables pointer, and
 total length of the tables. */
@@ -874,34 +874,48 @@ a positive value. */
 #define STR_RIGHT_CURLY_BRACKET "}"
 #define STR_TILDE "~"
 #define STRING_ACCEPT0 "ACCEPT\0"
 #define STRING_COMMIT0 "COMMIT\0"
 #define STRING_F0 "F\0"
 #define STRING_FAIL0 "FAIL\0"
 #define STRING_MARK0 "MARK\0"
 #define STRING_PRUNE0 "PRUNE\0"
 #define STRING_SKIP0 "SKIP\0"
 #define STRING_THEN "THEN"
+#define STRING_atomic0 "atomic\0"
+#define STRING_pla0 "pla\0"
+#define STRING_plb0 "plb\0"
+#define STRING_nla0 "nla\0"
+#define STRING_nlb0 "nlb\0"
+#define STRING_sr0 "sr\0"
+#define STRING_asr0 "asr\0"
+#define STRING_positive_lookahead0 "positive_lookahead\0"
+#define STRING_positive_lookbehind0 "positive_lookbehind\0"
+#define STRING_negative_lookahead0 "negative_lookahead\0"
+#define STRING_negative_lookbehind0 "negative_lookbehind\0"
+#define STRING_script_run0 "script_run\0"
+#define STRING_atomic_script_run "atomic_script_run"
 #define STRING_alpha0 "alpha\0"
 #define STRING_lower0 "lower\0"
 #define STRING_upper0 "upper\0"
 #define STRING_alnum0 "alnum\0"
 #define STRING_ascii0 "ascii\0"
 #define STRING_blank0 "blank\0"
 #define STRING_cntrl0 "cntrl\0"
 #define STRING_digit0 "digit\0"
 #define STRING_graph0 "graph\0"
 #define STRING_print0 "print\0"
 #define STRING_punct0 "punct\0"
 #define STRING_space0 "space\0"
 #define STRING_word0 "word\0"
 #define STRING_xdigit "xdigit"
 #define STRING_DEFINE "DEFINE"
 #define STRING_VERSION "VERSION"
 #define STRING_WEIRD_STARTWORD "[:<:]]"
 #define STRING_WEIRD_ENDWORD "[:>:]]"
 #define STRING_CR_RIGHTPAR "CR)"
 #define STRING_LF_RIGHTPAR "LF)"
@@ -1150,34 +1164,48 @@ only. */
 #define STR_RIGHT_CURLY_BRACKET "\175"
 #define STR_TILDE "\176"
 #define STRING_ACCEPT0 STR_A STR_C STR_C STR_E STR_P STR_T "\0"
 #define STRING_COMMIT0 STR_C STR_O STR_M STR_M STR_I STR_T "\0"
 #define STRING_F0 STR_F "\0"
 #define STRING_FAIL0 STR_F STR_A STR_I STR_L "\0"
 #define STRING_MARK0 STR_M STR_A STR_R STR_K "\0"
 #define STRING_PRUNE0 STR_P STR_R STR_U STR_N STR_E "\0"
 #define STRING_SKIP0 STR_S STR_K STR_I STR_P "\0"
 #define STRING_THEN STR_T STR_H STR_E STR_N
+#define STRING_atomic0 STR_a STR_t STR_o STR_m STR_i STR_c "\0"
+#define STRING_pla0 STR_p STR_l STR_a "\0"
+#define STRING_plb0 STR_p STR_l STR_b "\0"
+#define STRING_nla0 STR_n STR_l STR_a "\0"
+#define STRING_nlb0 STR_n STR_l STR_b "\0"
+#define STRING_sr0 STR_s STR_r "\0"
+#define STRING_asr0 STR_a STR_s STR_r "\0"
+#define STRING_positive_lookahead0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
+#define STRING_positive_lookbehind0 STR_p STR_o STR_s STR_i STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
+#define STRING_negative_lookahead0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_a STR_h STR_e STR_a STR_d "\0"
+#define STRING_negative_lookbehind0 STR_n STR_e STR_g STR_a STR_t STR_i STR_v STR_e STR_UNDERSCORE STR_l STR_o STR_o STR_k STR_b STR_e STR_h STR_i STR_n STR_d "\0"
+#define STRING_script_run0 STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n "\0"
+#define STRING_atomic_script_run STR_a STR_t STR_o STR_m STR_i STR_c STR_UNDERSCORE STR_s STR_c STR_r STR_i STR_p STR_t STR_UNDERSCORE STR_r STR_u STR_n
 #define STRING_alpha0 STR_a STR_l STR_p STR_h STR_a "\0"
 #define STRING_lower0 STR_l STR_o STR_w STR_e STR_r "\0"
 #define STRING_upper0 STR_u STR_p STR_p STR_e STR_r "\0"
 #define STRING_alnum0 STR_a STR_l STR_n STR_u STR_m "\0"
 #define STRING_ascii0 STR_a STR_s STR_c STR_i STR_i "\0"
 #define STRING_blank0 STR_b STR_l STR_a STR_n STR_k "\0"
 #define STRING_cntrl0 STR_c STR_n STR_t STR_r STR_l "\0"
 #define STRING_digit0 STR_d STR_i STR_g STR_i STR_t "\0"
 #define STRING_graph0 STR_g STR_r STR_a STR_p STR_h "\0"
 #define STRING_print0 STR_p STR_r STR_i STR_n STR_t "\0"
 #define STRING_punct0 STR_p STR_u STR_n STR_c STR_t "\0"
 #define STRING_space0 STR_s STR_p STR_a STR_c STR_e "\0"
 #define STRING_word0 STR_w STR_o STR_r STR_d "\0"
 #define STRING_xdigit STR_x STR_d STR_i STR_g STR_i STR_t
 #define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
 #define STRING_VERSION STR_V STR_E STR_R STR_S STR_I STR_O STR_N
 #define STRING_WEIRD_STARTWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_LESS_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
 #define STRING_WEIRD_ENDWORD STR_LEFT_SQUARE_BRACKET STR_COLON STR_GREATER_THAN_SIGN STR_COLON STR_RIGHT_SQUARE_BRACKET STR_RIGHT_SQUARE_BRACKET
 #define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
 #define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS


@@ -138,8 +138,8 @@ for (i = 0; i < 256; i++)
 int x = 0;
 if (isspace(i)) x += ctype_space;
 if (isalpha(i)) x += ctype_letter;
+if (islower(i)) x += ctype_lcletter;
 if (isdigit(i)) x += ctype_digit;
-if (isxdigit(i)) x += ctype_xdigit;
 if (isalnum(i) || i == '_') x += ctype_word;
 *p++ = x;
 }
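
With the lower case bit in the ctypes table, the compiler can tell an alpha assertion from a verb by a single table lookup on the character after (*. A minimal sketch of the idea follows, assuming the default "C" locale; the DEMO_* macros and the two functions are hypothetical stand-ins that reuse the bit values shown in the header change, not the library's own code.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical copies of the bit assignments shown in the header change. */
#define DEMO_CTYPE_SPACE    0x01
#define DEMO_CTYPE_LETTER   0x02
#define DEMO_CTYPE_LCLETTER 0x04
#define DEMO_CTYPE_DIGIT    0x08
#define DEMO_CTYPE_WORD     0x10   /* alphanumeric or '_' */

/* Fill a 256-entry table with one set of type bits per character value. */
static void demo_make_ctypes(uint8_t *table)
{
  int i;
  for (i = 0; i < 256; i++)
    {
    uint8_t x = 0;
    if (isspace(i)) x |= DEMO_CTYPE_SPACE;
    if (isalpha(i)) x |= DEMO_CTYPE_LETTER;
    if (islower(i)) x |= DEMO_CTYPE_LCLETTER;
    if (isdigit(i)) x |= DEMO_CTYPE_DIGIT;
    if (isalnum(i) || i == '_') x |= DEMO_CTYPE_WORD;
    table[i] = x;
    }
}

/* True if c is within the table and is a lower case letter. The range check
plays the role that CHMAX_255() plays in the compile code. */
static int demo_is_lcletter(const uint8_t *table, uint32_t c)
{
  return c <= 255 && (table[c] & DEMO_CTYPE_LCLETTER) != 0;
}
```

A table test also sidesteps assumptions about the letters being contiguous in the character set, which a range comparison such as c >= 'a' && c <= 'z' would depend on.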

testdata/testinput1 vendored

@@ -6263,4 +6263,69 @@ ef) x/x,mark
 aBCDEF
 AbCDe f
+/(*pla:foo).{6}/
+abcfoobarxyz
+\= Expect no match
+abcfooba
+/(*positive_lookahead:foo).{6}/
+abcfoobarxyz
+/(?(*pla:foo).{6}|a..)/
+foobarbaz
+abcfoobar
+/(?(*positive_lookahead:foo).{6}|a..)/
+foobarbaz
+abcfoobar
+/(*plb:foo)bar/
+abcfoobar
+\= Expect no match
+abcbarfoo
+/(*positive_lookbehind:foo)bar/
+abcfoobar
+\= Expect no match
+abcbarfoo
+/(?(*plb:foo)bar|baz)/
+abcfoobar
+bazfoobar
+abcbazfoobar
+foobazfoobar
+/(?(*positive_lookbehind:foo)bar|baz)/
+abcfoobar
+bazfoobar
+abcbazfoobar
+foobazfoobar
+/(*nlb:foo)bar/
+abcbarfoo
+\= Expect no match
+abcfoobar
+/(*negative_lookbehind:foo)bar/
+abcbarfoo
+\= Expect no match
+abcfoobar
+/(?(*nlb:foo)bar|baz)/
+abcfoobaz
+abcbarbaz
+\= Expect no match
+abcfoobar
+/(?(*negative_lookbehind:foo)bar|baz)/
+abcfoobaz
+abcbarbaz
+\= Expect no match
+abcfoobar
+/(*atomic:a+)\w/
+aaab
+\= Expect no match
+aaaa
 # End of testinput1

testdata/testinput2 vendored

@@ -5525,4 +5525,10 @@ a)"xI
 \= Expect no match
 abc\ndef\nxyz
+/(?(*ACCEPT)xxx)/
+/(?(*atomic:xx)xxx)/
+/(?(*script_run:xxx)zzz)/
 # End of testinput2

testdata/testoutput1 vendored

@@ -9929,4 +9929,100 @@ No match
 AbCDe f
 No match
+/(*pla:foo).{6}/
+abcfoobarxyz
+0: foobar
+\= Expect no match
+abcfooba
+No match
+/(*positive_lookahead:foo).{6}/
+abcfoobarxyz
+0: foobar
+/(?(*pla:foo).{6}|a..)/
+foobarbaz
+0: foobar
+abcfoobar
+0: abc
+/(?(*positive_lookahead:foo).{6}|a..)/
+foobarbaz
+0: foobar
+abcfoobar
+0: abc
+/(*plb:foo)bar/
+abcfoobar
+0: bar
+\= Expect no match
+abcbarfoo
+No match
+/(*positive_lookbehind:foo)bar/
+abcfoobar
+0: bar
+\= Expect no match
+abcbarfoo
+No match
+/(?(*plb:foo)bar|baz)/
+abcfoobar
+0: bar
+bazfoobar
+0: baz
+abcbazfoobar
+0: baz
+foobazfoobar
+0: bar
+/(?(*positive_lookbehind:foo)bar|baz)/
+abcfoobar
+0: bar
+bazfoobar
+0: baz
+abcbazfoobar
+0: baz
+foobazfoobar
+0: bar
+/(*nlb:foo)bar/
+abcbarfoo
+0: bar
+\= Expect no match
+abcfoobar
+No match
+/(*negative_lookbehind:foo)bar/
+abcbarfoo
+0: bar
+\= Expect no match
+abcfoobar
+No match
+/(?(*nlb:foo)bar|baz)/
+abcfoobaz
+0: baz
+abcbarbaz
+0: bar
+\= Expect no match
+abcfoobar
+No match
+/(?(*negative_lookbehind:foo)bar|baz)/
+abcfoobaz
+0: baz
+abcbarbaz
+0: bar
+\= Expect no match
+abcfoobar
+No match
+/(*atomic:a+)\w/
+aaab
+0: aaab
+\= Expect no match
+aaaa
+No match
 # End of testinput1


@@ -575,7 +575,7 @@ Last code unit = 'b'
 Subject length lower bound = 3
 /(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
-Failed: error 160 at offset 12: (*VERB) not recognized or malformed
+Failed: error 160 at offset 14: (*VERB) not recognized or malformed
 /\h/I,utf
 Capturing subpattern count = 0


@@ -538,7 +538,7 @@ No match
 Failed: error -27: UTF-32 error: code points 0xd800-0xdfff are not defined at offset 2
 /(*UTF16)\x{11234}/
-Failed: error 160 at offset 5: (*VERB) not recognized or malformed
+Failed: error 160 at offset 7: (*VERB) not recognized or malformed
 abcd\x{11234}pqr
 /(*UTF)\x{11234}/I
@@ -559,7 +559,7 @@ Failed: error 160 at offset 5: (*VERB) not recognized or malformed
 abcd\x{11234}pqr
 /(*CRLF)(*UTF16)(*BSR_UNICODE)a\Rb/I
-Failed: error 160 at offset 12: (*VERB) not recognized or malformed
+Failed: error 160 at offset 14: (*VERB) not recognized or malformed
 /(*CRLF)(*UTF32)(*BSR_UNICODE)a\Rb/I
 Capturing subpattern count = 0


@@ -16812,6 +16812,15 @@ No match
 abc\ndef\nxyz
 No match
+/(?(*ACCEPT)xxx)/
+Failed: error 128 at offset 2: assertion expected after (?( or (?(?C)
+/(?(*atomic:xx)xxx)/
+Failed: error 128 at offset 10: assertion expected after (?( or (?(?C)
+/(?(*script_run:xxx)zzz)/
+Failed: error 128 at offset 14: assertion expected after (?( or (?(?C)
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data
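
The three error 128 results above follow from the range test in the alpha assertion code: when a condition is expected, only the four lookaround codes are accepted. A minimal sketch of that test is below. The enum values and names are hypothetical; the only property assumed, as the range comparison in the compile code implies, is that the four lookaround codes are consecutive.

```c
#include <stdbool.h>

/* Hypothetical codes; only their relative order matters for this sketch. */
enum demo_meta {
  DEMO_META_ATOMIC = 1,
  DEMO_META_LOOKAHEAD,
  DEMO_META_LOOKAHEADNOT,
  DEMO_META_LOOKBEHIND,
  DEMO_META_LOOKBEHINDNOT,
  DEMO_META_SCRIPT_RUN
};

/* Return false when an assertion condition is expected but the group that
was just read is not one of the four lookaround assertions. */
static bool demo_valid_here(int expect_cond_assert, enum demo_meta meta)
{
  if (expect_cond_assert > 0 &&
      (meta < DEMO_META_LOOKAHEAD || meta > DEMO_META_LOOKBEHINDNOT))
    return false;
  return true;
}
```

Outside a condition (expect_cond_assert of zero) every code passes, which is why (*atomic:a+)\w compiles in testinput1 while (?(*atomic:xx)xxx) is rejected here.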