Major refactoring of pcre2_compile.c; see ChangeLog and HACKING.

Philip.Hazel 2016-10-02 16:01:01 +00:00
parent dda1e79060
commit 99264dfc23
46 changed files with 7298 additions and 6268 deletions


@ -14,6 +14,46 @@ cause all characters greater than 255 to match, whatever else is in the class.
There was a bug that caused this not to happen if a Unicode property item was
added to such a class, for example [\D\P{Nd}] or [\W\pL].
3. There has been a major re-factoring of the pcre2_compile.c file. Most syntax
checking is now done in the pre-pass that identifies capturing groups. This has
reduced the amount of duplication and made the code tidier. While doing this,
some minor bugs and Perl incompatibilities were fixed, including:
(a) \Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead
of giving an invalid quantifier error.
(b) {0} can now be used after a group in a lookbehind assertion; previously
this caused an "assertion is not fixed length" error.
(c) Perl always treats (?(DEFINE) as a "define" group, even if a group with
the name "DEFINE" exists. PCRE2 now does likewise.
(d) A recursion condition test such as (?(R2)...) must now refer to an
existing subpattern.
One effect of the refactoring is that some error numbers and messages have
changed, and the pattern offset given for compiling errors is not always the
right-most character that has been read. In particular, for a variable-length
lookbehind assertion it now points to the start of the assertion. Another
change is that when a callout appears before a group, the "length of next
pattern item" that is passed now just gives the length of the opening
parenthesis item, not the length of the whole group. A length of zero is now
given only for a callout at the end of the pattern. Automatic callouts are no
longer inserted before and after explicit callouts in the pattern.
4. Back references are now permitted in lookbehind assertions when there are
no duplicated group numbers (that is, (?| has not been used), and, if the
reference is by name, there is only one group of that name. The referenced
group must, of course, be of fixed length.
5. pcre2test has been upgraded so that, when run under valgrind with valgrind
support enabled, reading past the end of the pattern is detected, both when
compiling and during callout processing.
6. \g{+<number>} (e.g. \g{+2} ) is now supported. It is a "forward back
reference" and can be useful in repetitions (compare \g{-<number>}). Perl does
not recognize this syntax.
7. Automatic callouts are no longer generated before and after callouts in the
pattern.
Version 10.22 29-July-2016
--------------------------

HACKING

@ -7,8 +7,8 @@ but with a revised (and incompatible) API. To avoid confusion, the original
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will
continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
@ -16,19 +16,20 @@ Historical note 1
-----------------
Many years ago I implemented some regular expression functions to an algorithm
suggested by Martin Richards. These were not Unix-like in form, and were quite
restricted in what they could do by comparison with Perl. The interesting part
about the algorithm was that the amount of space required to hold the compiled
form of an expression was known in advance. The code to apply an expression did
not operate by backtracking, as the original Henry Spencer code and current
PCRE2 and Perl code does, but instead checked all possibilities simultaneously
by keeping a list of current states and checking all of them as it advanced
through the subject string. In the terminology of Jeffrey Friedl's book, it was
a "DFA algorithm", though it was not a traditional Finite State Machine (FSM).
When the pattern was all used up, all remaining states were possible matches,
and the one matching the longest subset of the subject string was chosen. This
did not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.
suggested by Martin Richards. The rather simple patterns were not Unix-like in
form, and were quite restricted in what they could do by comparison with Perl.
The interesting part about the algorithm was that the amount of space required
to hold the compiled form of an expression was known in advance. The code to
apply an expression did not operate by backtracking, as the original Henry
Spencer code and current PCRE2 and Perl code does, but instead checked all
possibilities simultaneously by keeping a list of current states and checking
all of them as it advanced through the subject string. In the terminology of
Jeffrey Friedl's book, it was a "DFA algorithm", though it was not a
traditional Finite State Machine (FSM). When the pattern was all used up, all
remaining states were possible matches, and the one matching the longest subset
of the subject string was chosen. This did not necessarily maximize the
individual wild portions of the pattern, as is expected in Unix and Perl-style
regular expressions.
Historical note 2
@ -85,7 +86,7 @@ had become very complicated and hard to maintain. Indeed one of the early
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while
actually only ever using a few hundred bytes of working memory, and without too
in most cases only ever using a small amount of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling
functions to work this way. This got rid of about 600 lines of source. It
should make future maintenance and development easier. As this was such a major
@ -104,20 +105,204 @@ system stack used by the compile function, which uses recursive function calls
for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users.
History repeated itself for release 10.20. A number of bugs relating to named
subpatterns had been discovered by fuzzers. Most of these were related to the
handling of forward references when it was not known if the named pattern was
Yet another pattern scan
------------------------
History repeated itself for PCRE2 release 10.20. A number of bugs relating to
named subpatterns had been discovered by fuzzers. Most of these were related to
the handling of forward references when it was not known if the named group was
unique. (References to non-unique names use a different opcode and more
memory.) The use of duplicate group numbers (the (?| facility) also caused
issues.
issues.
To get around these problems I adopted a new approach by adding a third pass,
really a "pre-pass", over the pattern, which does nothing other than identify
all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both pre-pass and real compile) have full knowledge of
group names and numbers throughout. Several dozen lines of messy code were
eliminated, though the new pre-pass is not short (skipping over [] classes is
complicated).
To get around these problems I adopted a new approach by adding a third pass
over the pattern (really a "pre-pass"), which did nothing other than identify
all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.
While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:
The function called parse_regex() scans the pattern characters, parsing them
into literal data and meta characters. It converts escapes such as \x{123}
into literals, handles \Q...\E, and skips over comments and non-significant
white space. The result of the scanning is put into a vector of 32-bit unsigned
integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16 bits of such values identify the meta-character,
and these are given names such as META_CAPTURE. The lower 16 bits are available
for data, for example, the capturing group number. The only situation in which
literal data values greater than 0x7fffffff can appear is when the 32-bit
library is running in non-UTF mode. This is handled by having a special
meta-character that is followed by the 32-bit data value.
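As an illustration only (this is not the code in pcre2_compile.c), the split
between literal and meta elements described above might be sketched as follows.
The EX_ names, the helper functions, and the value used for the capture
meta-character are assumptions made for this sketch; only the value 0x80000000
for the end-of-pattern item comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants. Only the value of META_END (0x80000000) is stated
   in the text; EX_META_CAPTURE is an assumed value for this sketch. */
#define EX_META_END      0x80000000u
#define EX_META_CAPTURE  0x80080000u

/* Split a meta element into its identifying code and its 16-bit data. */
#define EX_META_CODE(x)  ((x) & 0xffff0000u)
#define EX_META_DATA(x)  ((x) & 0x0000ffffu)

/* Elements below 0x80000000 are literal data; all others are meta items. */
static int is_meta(uint32_t element)
{
return element >= EX_META_END;
}

/* Combine a meta code with up to 16 bits of data, e.g. a group number. */
static uint32_t make_meta(uint32_t code, uint32_t data)
{
assert(code >= EX_META_END && (code & 0x0000ffffu) == 0);
assert(data <= 0xffffu);
return code | data;
}
```

With these helpers, make_meta(EX_META_CAPTURE, 3) would stand for the opening
parenthesis of capturing group 3, and a scanner can separate literals from meta
items with a single comparison.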
The size of the parsed pattern vector, when auto-callouts are not enabled, is
bounded by the length of the pattern (with one exception). The code is written
so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the stack, but if this
is not big enough, heap memory is used.
As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
names. As escapes and comments have already been processed, the code is a bit
simpler than before.
Most errors can be diagnosed during the parsing scan. For those that cannot
(for example, "lookbehind assertion is not fixed length"), the parsed code
contains offsets into the pattern so that the actual compiling code can
identify where errors occur.
The elements of the parsed pattern vector
-----------------------------------------
The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
element, with no data:
META_ACCEPT (*ACCEPT)
META_ALT | alternation
META_ASTERISK *
META_ASTERISK_PLUS *+
META_ASTERISK_QUERY *?
META_ATOMIC (?> start of atomic group
META_CIRCUMFLEX ^ metacharacter
META_CLASS [ start of non-empty class
META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
META_CLASS_END ] end of non-empty class
META_CLASS_NOT [^ start non-empty negative class
META_COMMIT (*COMMIT)
META_DOLLAR $ metacharacter
META_DOT . metacharacter
META_END End of pattern (this value is 0x80000000)
META_FAIL (*FAIL)
META_KET ) closing parenthesis
META_LOOKAHEAD (?= start of lookahead
META_LOOKAHEADNOT (?! start of negative lookahead
META_NOCAPTURE (?: no capture parens
META_PLUS +
META_PLUS_PLUS ++
META_PLUS_QUERY +?
META_PRUNE (*PRUNE) - no argument
META_QUERY ?
META_QUERY_PLUS ?+
META_QUERY_QUERY ??
META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument
META_THEN (*THEN) - no argument
The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
environment it is necessary to know whether either of the range values was
specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.
The following have data in the lower 16 bits, and may be followed by other data
elements:
META_BACKREF
META_CAPTURE
META_ESCAPE
META_RECURSE
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element.
META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error is
given for a reference to a non-existent group.
META_RECURSE is always followed by an offset, for use in error messages.
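The storage of an offset in one or two parsed pattern elements can be pictured
with the sketch below. It is a hedged illustration, not the macros used in the
source: it assumes that PCRE2_SIZE is size_t, it invents the names put_offset(),
get_offset() and EX_OFFSET_ELEMENTS, and the choice of storing the high-order
half first is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit elements that one offset occupies on this platform. */
#if SIZE_MAX > UINT32_MAX
#define EX_OFFSET_ELEMENTS 2
#else
#define EX_OFFSET_ELEMENTS 1
#endif

/* Store an offset at p and return a pointer to the next free element. */
static uint32_t *put_offset(uint32_t *p, size_t offset)
{
assert(p != NULL);
#if EX_OFFSET_ELEMENTS == 2
*p++ = (uint32_t)(offset >> 32);          /* high-order half (assumed order) */
*p++ = (uint32_t)(offset & 0xffffffffu);  /* low-order half */
#else
*p++ = (uint32_t)offset;
#endif
return p;
}

/* Read an offset back from p and return a pointer past its elements. */
static const uint32_t *get_offset(const uint32_t *p, size_t *offset)
{
assert(p != NULL && offset != NULL);
#if EX_OFFSET_ELEMENTS == 2
*offset = ((size_t)p[0] << 32) | (size_t)p[1];
#else
*offset = (size_t)p[0];
#endif
return p + EX_OFFSET_ELEMENTS;
}
```

This also shows why keeping the offsets for small group numbers in a side table
matters on 64-bit systems: an inline offset would add two elements to an item
such as \3, which is only two code units long.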
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
element contains the 16-bit type and data property values, packed together.
ESC_g and ESC_k are used only for named references - numerical ones are turned
into META_RECURSE or META_BACKREF as appropriate. They are followed by a length
and an offset into the pattern to specify the name.
The following have one data item that follows in the next vector element:
META_BIGVALUE Next is a literal >= META_END
META_OPTIONS (?i) and friends (data is new option bits)
META_POSIX POSIX class item (data identifies the class)
META_POSIX_NEG negative POSIX class item (ditto)
The following are followed by a length element, then a number of character code
values (which should match with the length):
META_MARK (*MARK:xxxx)
META_PRUNE_ARG (*PRUNE:xxx)
META_SKIP_ARG (*SKIP:xxxx)
META_THEN_ARG (*THEN:xxxx)
The following are followed by a length element, then an offset in the pattern
that identifies the name:
META_COND_NAME (?(<name>) or (?('name') or (?(name)
META_COND_RNAME (?(R&name)
META_COND_RNUMBER (?(Rdigits)
META_RECURSE_BYNAME (?&name)
META_BACKREF_BYNAME \k'name'
META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
that name, or it could be a recursion test on a numbered group.
This one is followed by an offset, for use in error messages, then a number:
META_COND_NUMBER (?([+-]digits)
The following are followed just by an offset, for use in error messages:
META_COND_ASSERT (?(?assertion)
META_COND_DEFINE (?(DEFINE)
META_LOOKBEHIND (?<=
META_LOOKBEHINDNOT (?<!
In fact, META_COND_ASSERT is used for any group starting (?( that does not
match any of the other META_COND cases. The check that this group is an
assertion (optionally preceded by a callout) happens at compile time.
The following are followed by two values, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
META_MINMAX {n,m} repeat
META_MINMAX_PLUS {n,m}+ repeat
META_MINMAX_QUERY {n,m}? repeat
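A minimal sketch of how a {n,m} repeat could be laid out in the parsed vector
is given below. The limit of 65535 comes from the text; the value chosen for
the "unlimited" marker, the value of EX_META_MINMAX, and the function
put_minmax() itself are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_REPEAT        65535u                 /* limit stated in the text */
#define EX_UNLIMITED_REPEAT  (EX_MAX_REPEAT + 1u)   /* assumed: just above the limit */
#define EX_META_MINMAX       0x80300000u            /* assumed meta value */

/* Write a {min,max} repeat as three elements: the meta item, the minimum,
   and the maximum. Returns the number of elements written, or 0 if the
   values are out of range or inconsistent. */
static int put_minmax(uint32_t *p, uint32_t min, uint32_t max)
{
assert(p != NULL);
if (min > EX_MAX_REPEAT) return 0;
if (max != EX_UNLIMITED_REPEAT && (max > EX_MAX_REPEAT || max < min)) return 0;
p[0] = EX_META_MINMAX;
p[1] = min;
p[2] = max;
return 3;
}
```

Under these assumptions, a{2,5} would yield the literal 'a' followed by the
three elements written by put_minmax(p, 2, 5), and a{1,} would use
EX_UNLIMITED_REPEAT as its maximum.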
This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
the next two are the major and minor numbers:
META_COND_VERSION (?(VERSION<op>x.y)
Callouts are converted into one of two items:
META_CALLOUT_NUMBER (?C with numerical argument
META_CALLOUT_STRING (?C with string argument
In both cases, the next two elements contain the offset and length of the next
item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.
Traditional matching function
@ -606,4 +791,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
correct length, in order to catch updating errors.
Philip Hazel
June 2016
September 2016


@ -65,6 +65,7 @@ dist_html_DATA = \
doc/html/pcre2_set_character_tables.html \
doc/html/pcre2_set_compile_recursion_guard.html \
doc/html/pcre2_set_match_limit.html \
doc/html/pcre2_set_max_pattern_length.html \
doc/html/pcre2_set_offset_limit.html \
doc/html/pcre2_set_newline.html \
doc/html/pcre2_set_parens_nest_limit.html \
@ -146,6 +147,7 @@ dist_man_MANS = \
doc/pcre2_set_character_tables.3 \
doc/pcre2_set_compile_recursion_guard.3 \
doc/pcre2_set_match_limit.3 \
doc/pcre2_set_max_pattern_length.3 \
doc/pcre2_set_offset_limit.3 \
doc/pcre2_set_newline.3 \
doc/pcre2_set_parens_nest_limit.3 \


@ -502,7 +502,7 @@ for bmode in "$test8" "$test16" "$test32"; do
for opt in "" $jitopt; do
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $test2stack $bmode $opt $testdata/testinput2 testtry
if [ $? = 0 ] ; then
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -63,-62,-2,-1,0,100,188,189 >>testtry
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -63,-62,-2,-1,0,100,188,189,190 >>testtry
checkresult $? 2 "$opt"
else
echo " "


@ -1,4 +1,4 @@
.TH PCRE2API 3 "17 June 2016" "PCRE2 10.22"
.TH PCRE2API 3 "30 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -693,7 +693,8 @@ functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
.sp
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
using up too much system stack when being compiled.
using up too much system stack when being compiled. The limit applies to
parentheses of all kinds, not just capturing parentheses.
.sp
.nf
.B int pcre2_set_compile_recursion_guard(pcre2_compile_context *\fIccontext\fP,
@ -1091,7 +1092,13 @@ NULL immediately. Otherwise, the variables to which these point are set to an
error code and an offset (number of code units) within the pattern,
respectively, when \fBpcre2_compile()\fP returns NULL because a compilation
error has occurred. The values are not defined when compilation is successful
and \fBpcre2_compile()\fP returns a non-NULL value.
and \fBpcre2_compile()\fP returns a non-NULL value.
.P
The value returned in \fIerroroffset\fP is an indication of where in the
pattern the error occurred. It is not necessarily the furthest point in the
pattern that was read. For example, after the error "lookbehind assertion is
not fixed length", the error offset points to the start of the failing
assertion.
.P
The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
message"
@ -1184,8 +1191,8 @@ recognized, exactly as in the rest of the pattern.
PCRE2_AUTO_CALLOUT
.sp
If this bit is set, \fBpcre2_compile()\fP automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the callout
facility, see the
all with number 255, before each pattern item, except immediately before or
after a callout in the pattern. For discussion of the callout facility, see the
.\" HREF
\fBpcre2callout\fP
.\"
@ -3292,6 +3299,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 17 June 2016
Last updated: 30 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "23 March 2015" "PCRE2 10.20"
.TH PCRE2CALLOUT 3 "29 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -40,11 +40,20 @@ two callout points:
.sp
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
pattern except for immediately before or after a callout item in the pattern.
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
.sp
A(?C3)B
.sp
it is processed as if it were
.sp
(?C255)A(?C3)B(?C255)
.sp
Here is a more complicated example:
.sp
A(\ed{2}|--)
.sp
it is processed as if it were
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
.sp
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
.sp
@ -91,10 +100,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
No match
.sp
This indicates that when matching [bc] fails, there is no backtracking into a+
and therefore the callouts that would be taken for the backtracks do not occur.
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
case, the output changes to this:
(because it is being treated as a++) and therefore the callouts that would be
taken for the backtracks do not occur. You can disable the auto-possessify
feature by passing PCRE2_NO_AUTO_POSSESS to \fBpcre2_compile()\fP, or starting
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
.sp
--->aaaa
+0 ^ a+
@ -220,8 +229,8 @@ but the intention is never to remove any of the existing fields.
.sp
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
contains the number of the callout, in the range 0-255. This is the number
that follows (?C for manual callouts; it is 255 for automatically generated
callouts.
that follows (?C for callouts that are part of the pattern; it is 255 for
automatically generated callouts.
.
.
.SS "Fields for string callouts"
@ -286,10 +295,15 @@ The \fIpattern_position\fP field contains the offset in the pattern string to
the next item to be matched.
.P
The \fInext_item_length\fP field contains the length of the next item to be
matched in the pattern string. When the callout immediately precedes an
alternation bar, a closing parenthesis, or the end of the pattern, the length
is zero. When the callout precedes an opening parenthesis, the length is that
of the entire subpattern.
processed in the pattern string. When the callout is at the end of the pattern,
the length is zero. When the callout precedes an opening parenthesis, the
length includes meta characters that follow the parenthesis. For example, in a
callout before an assertion such as (?=ab) the length is 3. For an
alternation bar or a closing parenthesis, the length is one, unless a closing
parenthesis is followed by a quantifier, in which case its length is included.
(This changed in release 10.23. In earlier releases, before an opening
parenthesis the length was that of the entire subpattern, and before an
alternation bar or a closing parenthesis the length was zero.)
.P
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
help in distinguishing between different automatic callouts, which all have the
@ -382,6 +396,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 March 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 29 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
.TH PCRE2COMPAT 3 "30 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -96,7 +96,7 @@ processed as anchored at the point where they are tested.
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are examples where it differs.
same as PCRE2, but there are cases where it differs.
.P
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
@ -116,10 +116,11 @@ would not be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
.P
14. Perl recognizes comments in some places that PCRE2 does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
Perl allows white space between ( and ? (though current Perls warn that this is
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
14. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a subpattern. If the /x modifier
is set, Perl allowed white space between ( and ? though the latest Perls give
an error (for a while it was just deprecated). There may still be some cases
where Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
@ -139,35 +140,39 @@ list is with respect to Perl 5.10:
.sp
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
of string. Perl requires them all to have the same length.
.sp
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
(b) From PCRE2 10.23, back references to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
.sp
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
.sp
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
.sp
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
.sp
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
.sp
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
.sp
(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
.sp
(h) The callout facility is PCRE2-specific.
(i) The callout facility is PCRE2-specific.
.sp
(i) The partial matching facility is PCRE2-specific.
(j) The partial matching facility is PCRE2-specific.
.sp
(j) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
(k) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
different way and is not Perl-compatible.
.sp
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
.
.
@ -185,6 +190,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 30 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2LIMITS 3 "05 November 2015" "PCRE2 10.21"
.TH PCRE2LIMITS 3 "29 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SIZE AND OTHER LIMITATIONS"
@ -46,19 +46,19 @@ The maximum length of a lookbehind assertion is 65535 characters.
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The limit can
be specified when PCRE2 is built; the default is 250.
.P
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
order to limit the amount of system stack used at compile time. The default
limit can be specified when PCRE2 is built; if it is not, the default is 250. An
application can change this limit by calling pcre2_set_parens_nest_limit() to
set the limit in a compile context.
.P
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.
.P
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
.P
The maximum length of a string argument to a callout is the largest number a
32-bit unsigned integer can hold.
.
.
.SH AUTHOR
@ -75,6 +75,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 05 November 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 29 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "20 June 2016" "PCRE2 10.22"
.TH PCRE2PATTERN 3 "30 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -508,9 +508,9 @@ by code point, as described in the previous section.
.SS "Absolute and relative back references"
.rs
.sp
The sequence \eg followed by an unsigned or a negative number, optionally
enclosed in braces, is an absolute or relative back reference. A named back
reference can be coded as \eg{name}. Back references are discussed
The sequence \eg followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative back reference. A named back reference
can be coded as \eg{name}. Back references are discussed
.\" HTML <a href="#backreferences">
.\" </a>
later,
@ -1325,13 +1325,33 @@ when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
.P
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
\eV, \ew, and \eW may appear in a character class, and add the characters that
they match to the class. For example, [\edABCDEF] matches any hexadecimal
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
and their upper case partners, just as it does when they appear outside a
character class, as described in the section entitled
.\" HTML <a href="#genericchartypes">
.\" </a>
"Generic character types"
.\"
above. The escape sequence \eb has a different meaning inside a character
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
are not special inside a character class. Like any other unrecognized escape
sequences, they cause an error.
.P
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class, or
immediately after a range. For example, [b-d-z] matches letters in the range b
to d, a hyphen character, or z.
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
.P
Perl treats a hyphen as a literal if it appears before a POSIX class (see
below) or a character type escape such as \ed, but gives a warning in its
warning mode, as this is most likely a user error. As PCRE2 has no facility for
warning, an error is given in these cases.
.P
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
@ -1341,11 +1361,6 @@ the end of range, so [W-\e]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
.P
An error is generated if a POSIX character class (see below) or an escape
sequence other than one that defines a single character appears at a point
where a range ending character is expected. For example, [z-\exff] is valid,
but [A-\ed] and [A-[:digit:]] are not.
.P
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\e000-\e037]. Ranges can include any characters that are valid for the
@ -1365,21 +1380,6 @@ matches the letters in either case. For example, [W-c] is equivalent to
tables for a French locale are in use, [\exc8-\excb] matches accented E
characters in both cases.
.P
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
\eV, \ew, and \eW may appear in a character class, and add the characters that
they match to the class. For example, [\edABCDEF] matches any hexadecimal
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
and their upper case partners, just as it does when they appear outside a
character class, as described in the section entitled
.\" HTML <a href="#genericchartypes">
.\" </a>
"Generic character types"
.\"
above. The escape sequence \eb has a different meaning inside a character
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
are not special inside a character class. Like any other unrecognized escape
sequence, they cause an error.
.P
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\eW_] matches any letter or digit, but not underscore,
@ -2096,9 +2096,9 @@ no such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
.P
Another way of avoiding the ambiguity inherent in the use of digits following a
backslash is to use the \eg escape sequence. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in braces. These
examples are all identical:
backslash is to use the \eg escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all identical:
.sp
(ring), \e1
(ring), \eg1
@ -2106,8 +2106,7 @@ examples are all identical:
.sp
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
the reference. A negative number is a relative reference. Consider this
example:
the reference. A signed number is a relative reference. Consider this example:
.sp
(abc(def)ghi)\eg{-1}
.sp
@ -2117,6 +2116,10 @@ Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
can be helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
.P
The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
of forward reference can be useful in patterns that repeat. Perl does not
support the use of + in this way.
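
The numbering rule for relative references can be sketched in C. This is only an illustration of the arithmetic described in the text, not PCRE2's own code: the name resolve_relative_ref and the convention of returning 0 for an invalid reference are assumptions made for this example.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE2. "opened" is the number of capturing
   groups whose opening parenthesis precedes the reference; "offset" is the
   signed number from \g{-n} or \g{+n}. Returns the absolute group number, or
   0 if the reference is invalid (a zero offset, or one that points before
   group 1). Whether a forward reference names a group that really exists
   can only be checked once the whole pattern has been read. */
static int resolve_relative_ref(int opened, int offset)
{
  assert(opened >= 0);
  if (offset == 0) return 0;            /* a relative value of zero is not allowed */
  if (offset < 0)
    {
    int absolute = opened + offset + 1; /* -1 is the most recently opened group */
    return (absolute >= 1) ? absolute : 0;
    }
  return opened + offset;               /* +1 is the next group to be opened */
}
```

With two groups opened, as in (abc(def)ghi), resolve_relative_ref(2, -1) gives 2 and resolve_relative_ref(2, -2) gives 1, which agrees with the equivalences stated for \eg{-1} and \eg{-2}. A call with offset +1 gives 3, the next group to be opened.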
.P
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
@ -2321,23 +2324,34 @@ temporarily move the current position back by the fixed length and then try to
match. If there are insufficient characters before the current position, the
assertion fails.
.P
In a UTF mode, PCRE2 does not allow the \eC escape (which matches a single code
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
it impossible to calculate the length of the lookbehind. The \eX and \eR
escapes, which can match different numbers of code units, are also not
permitted.
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
single code unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbehind. The
\eX and \eR escapes, which can match different numbers of code units, are never
permitted in lookbehinds.
.P
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"Subroutine"
.\"
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
as the subpattern matches a fixed-length string.
as the subpattern matches a fixed-length string. However,
.\" HTML <a href="#recursion">
.\" </a>
Recursion,
recursion,
.\"
however, is not supported.
that is, a "subroutine" call into a group that is already active,
is not supported.
.P
Perl does not support back references in lookbehinds. PCRE2 does support them,
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
must not be set, there must be no use of (?| in the pattern (it creates
duplicate subpattern numbers), and if the back reference is by name, the name
must be unique. Of course, the referenced subpattern must itself be of fixed
length. The following pattern matches words containing at least two characters
that begin and end with the same character:
.sp
\eb(\ew)\ew++(?<=\e1)
.P
Possessive quantifiers can be used in conjunction with lookbehind assertions to
specify efficient matching of fixed-length strings at the end of subject
@ -2476,7 +2490,9 @@ This makes the fragment independent of the parentheses in the larger pattern.
.sp
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE1, which had
this facility before Perl, the syntax (?(name)...) is also recognized.
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
however, that undelimited names consisting of the letter R followed by digits
are ambiguous (see the following section).
.P
Rewriting the above example to use a named subpattern gives this:
.sp
@ -2490,33 +2506,55 @@ matched.
.SS "Checking for pattern recursion"
.rs
.sp
If the condition is the string (R), and there is no subpattern with the name R,
the condition is true if a recursive call to the whole pattern or any
subpattern has been made. If digits or a name preceded by ampersand follow the
letter R, for example:
.sp
(?(R3)...) or (?(R&name)...)
.sp
the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
stack. If the name used in a condition of this kind is a duplicate, the test is
applied to all subpatterns of the same name, and is true if any one of them is
the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
.\" HTML <a href="#recursion">
.\" </a>
The syntax for recursive patterns
"Recursive patterns"
.\"
is described below.
and
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"Subpatterns as subroutines"
.\"
below for details of recursion and subpattern calls.
.P
If a condition is the string (R), and there is no subpattern with the name R,
the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any subpattern. If digits follow the letter R, and
there is no subpattern with that name, the condition is true if the most recent
call is into a subpattern with the given number, which must exist somewhere in
the overall pattern. This is a contrived example that is equivalent to a+b:
.sp
((?(R1)a+|(?1)b))
.sp
However, in both cases, if there is a subpattern with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?<R1>) to the above pattern completely changes its meaning.
.P
If a name preceded by ampersand follows the letter R, for example:
.sp
(?(R&name)...)
.sp
the condition is true if the most recent recursion is into a subpattern of that
name (which must exist within the pattern).
.P
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all subpatterns of the same name, and is true if any one of
them is the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
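
The precedence between named-reference conditions, recursion tests, and DEFINE can be summarized in a short C sketch. It models only the documented rules for the undelimited text inside (?(...); the enum, the function names, and the demo_groups table are hypothetical, and numeric, delimited-name, VERSION, and assertion conditions are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
  COND_DEFINE,          /* (?(DEFINE), always a define group */
  COND_NAMED_REF,       /* tests whether a named group is set */
  COND_RECURSE_ANY,     /* (?(R) with no group named R */
  COND_RECURSE_NUMBER,  /* (?(Rn) with no group named Rn */
  COND_RECURSE_NAME     /* (?(R&name) */
} cond_kind;

/* Hypothetical list of group names defined in some pattern. */
static const char *const demo_groups[] = { "R1", "DEFINE" };

static int name_exists(const char *name, const char *const *names, size_t count)
{
  size_t i;
  for (i = 0; i < count; i++)
    if (strcmp(names[i], name) == 0) return 1;
  return 0;
}

/* Classify the text between "(?(" and ")" using the documented precedence. */
static cond_kind classify_condition(const char *cond,
  const char *const *names, size_t count)
{
  const char *p;
  assert(cond != NULL);
  if (strcmp(cond, "DEFINE") == 0) return COND_DEFINE;  /* even if so named */
  if (name_exists(cond, names, count)) return COND_NAMED_REF;
  if (cond[0] != 'R') return COND_NAMED_REF;            /* plain (?(name) */
  if (cond[1] == '\0') return COND_RECURSE_ANY;
  if (cond[1] == '&') return COND_RECURSE_NAME;
  for (p = cond + 1; *p != '\0'; p++)
    if (!isdigit((unsigned char)*p)) return COND_NAMED_REF;
  return COND_RECURSE_NUMBER;
}
```

With no named groups, "R1" is classified as a numbered recursion test. Once a group called R1 exists, the same text becomes a named-reference condition, which is the change of meaning the text warns about.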
.
.
.\" HTML <a name="subdefine"></a>
.SS "Defining subpatterns for use by reference only"
.rs
.sp
If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
@ -2994,12 +3032,20 @@ depending on whether or not a name is present.
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
is applied to verb names and only an unescaped closing parenthesis terminates
the name. A closing parenthesis can be included in a name either as \e) or
between \eQ and \eE. If the PCRE2_EXTENDED option is set, unescaped whitespace
in verb names is skipped and #-comments are recognized, exactly as in the rest
of the pattern.
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
is no longer Perl-compatible.
.P
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
and only an unescaped closing parenthesis terminates the name. However, the
only backslash items that are permitted are \eQ, \eE, and sequences such as
\ex{100} that define character code points. Character type escapes such as \ed
are faulted.
.P
A closing parenthesis can be included in a name either as \e) or between \eQ
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
also set, unescaped whitespace in verb names is skipped, and #-comments are
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
.P
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@ -3429,6 +3475,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 June 2016
Last updated: 30 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "16 October 2015" "PCRE2 10.21"
.TH PCRE2SYNTAX 3 "28 September 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -473,6 +473,9 @@ Each top-level branch of a look behind must be of a fixed length.
\en reference by number (can be ambiguous)
\egn reference by number
\eg{n} reference by number
\eg+n relative reference by number (PCRE2 extension)
\eg-n relative reference by number
\eg{+n} relative reference by number (PCRE2 extension)
\eg{-n} relative reference by number
\ek<name> reference by name (Perl)
\ek'name' reference by name (Perl)
@ -511,13 +514,17 @@ Each top-level branch of a look behind must be of a fixed length.
(?(-n) relative reference condition
(?(<name>) named reference condition (Perl)
(?('name') named reference condition (Perl)
(?(name) named reference condition (PCRE2)
(?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
(?(Rn) specific group recursion condition
(?(R&name) specific recursion condition
(?(Rn) specific numbered group recursion condition
(?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
.sp
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
.
.
.SH "BACKTRACKING CONTROL"
@ -577,6 +584,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 16 October 2015
Copyright (c) 1997-2015 University of Cambridge.
Last updated: 28 September 2016
Copyright (c) 1997-2016 University of Cambridge.
.fi


@ -1,14 +1,17 @@
#! /bin/sh
# Script for testing regular expressions with perl to check that PCRE2 handles
# them the same. The Perl code has to have "use utf8" and "require Encode" at
# the start when running UTF-8 tests, but *not* for non-utf8 tests. (The
# "require" would actually be OK for non-utf8-tests, but is not always
# installed, so this way the script will always run for these tests.)
# them the same. If the first argument to this script is "-w", Perl is also
# called with "-w", which turns on its warning mode.
#
# The Perl code has to have "use utf8" and "require Encode" at the start when
# running UTF-8 tests, but *not* for non-utf8 tests. (The "require" would
# actually be OK for non-utf8-tests, but is not always installed, so this way
# the script will always run for these tests.)
#
# The desired effect is achieved by making this a shell script that passes the
# Perl script to Perl through a pipe. If the first argument is "-utf8", a
# suitable prefix is set up.
# Perl script to Perl through a pipe. If the first argument (possibly after
# removing "-w") is "-utf8", a suitable prefix is set up.
#
# The remaining arguments, if any, are passed to Perl. They are an input file
# and an output file. If there is one argument, the output is written to
@ -17,7 +20,14 @@
# of the contorted piping input.)
perl=perl
perlarg=''
prefix=''
if [ $# -gt 0 -a "$1" = "-w" ] ; then
perlarg="-w"
shift
fi
if [ $# -gt 0 -a "$1" = "-utf8" ] ; then
prefix="use utf8; require Encode;"
shift
@ -292,6 +302,6 @@ for (;;)
# printf $outfile "\n";
PERLEND
) | $perl - $@
) | $perl $perlarg - $@
# End

File diff suppressed because it is too large


@ -91,13 +91,13 @@ static const unsigned char compile_error_texts[] =
"failed to allocate heap memory\0"
"unmatched closing parenthesis\0"
"internal error: code overflow\0"
"letter or underscore expected after (?< or (?'\0"
"missing closing parenthesis for condition\0"
/* 25 */
"lookbehind assertion is not fixed length\0"
"malformed number or name after (?(\0"
"a relative value of zero is not allowed\0"
"conditional group contains more than two branches\0"
"assertion expected after (?( or (?(?C)\0"
"(?R or (?[+-]digits must be followed by )\0"
"digit expected after (?+ or (?-\0"
/* 30 */
"unknown POSIX class name\0"
"internal error in pcre2_study(): should not occur\0"
@ -105,7 +105,7 @@ static const unsigned char compile_error_texts[] =
"parentheses are too deeply nested (stack check)\0"
"character code point value in \\x{} or \\o{} is too large\0"
/* 35 */
"invalid condition (?(0)\0"
"lookbehind is too complicated\0"
"\\C is not allowed in a lookbehind assertion in UTF-" XSTRING(PCRE2_CODE_UNIT_WIDTH) " mode\0"
"PCRE does not support \\L, \\l, \\N{name}, \\U, or \\u\0"
"number after (?C is greater than 255\0"
@ -132,13 +132,13 @@ static const unsigned char compile_error_texts[] =
"missing opening brace after \\o\0"
"internal error: unknown newline setting\0"
"\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
"a numbered reference must not be zero\0"
"(?R (recursive pattern call) must be followed by a closing parenthesis\0"
"an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)\0"
/* 60 */
"(*VERB) not recognized or malformed\0"
"number is too big\0"
"group number is too big\0"
"subpattern name expected\0"
"digit expected after (?+\0"
"SPARE ERROR\0"
"non-octal character in \\o{} (closing brace missing?)\0"
/* 65 */
"different names for subpatterns of the same number are not allowed\0"
@ -151,9 +151,9 @@ static const unsigned char compile_error_texts[] =
#endif
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
/* 70 */
"internal error: unknown opcode in find_fixedlength()\0"
"internal error: unknown meta code in check_lookbehinds()\0"
"\\N is not supported in a class\0"
"SPARE ERROR\0"
"callout string is too long\0"
"disallowed Unicode code point (>= 0xd800 && <= 0xdfff)\0"
"using UTF is disabled by the application\0"
/* 75 */
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
"character code point value in \\u.... sequence is too large\0"
"digits missing in \\x{} or \\o{}\0"
"syntax error in (?(VERSION condition\0"
"syntax error or number too big in (?(VERSION condition\0"
/* 80 */
"internal error: unknown opcode in auto_possessify()\0"
"missing terminating delimiter for callout with string argument\0"
@ -173,6 +173,8 @@ static const unsigned char compile_error_texts[] =
"regular expression is too complicated\0"
"lookbehind assertion is too long\0"
"pattern string is longer than the limit set by the application\0"
"internal error: unknown code in parsed pattern\0"
/* 90 */
;
/* Match-time and UTF error texts are in the same format. */
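
The table layout shown in this hunk (messages joined end to end, each terminated by \0, with the string literal's own terminator acting as an empty final entry) can be walked with a simple loop. The sketch below is an illustration under assumptions: demo_error_texts is a made-up miniature table whose texts are borrowed from the hunk but whose positions do not correspond to real error numbers, and nth_error_text is a hypothetical helper rather than the library's lookup routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature table in the same format as compile_error_texts. */
static const char demo_error_texts[] =
  "unmatched closing parenthesis\0"
  "lookbehind assertion is not fixed length\0"
  "unknown POSIX class name\0"
  ;

/* Return the n-th message (0-based), or NULL if n is past the last entry. */
static const char *nth_error_text(const char *table, int n)
{
  assert(table != NULL && n >= 0);
  while (n > 0)
    {
    if (*table == '\0') return NULL;  /* reached the empty terminating entry */
    table += strlen(table) + 1;       /* skip this message and its \0 */
    n--;
    }
  return (*table == '\0') ? NULL : table;
}
```

Because lookup is by position, replacing a message with "SPARE ERROR" rather than deleting it keeps every later entry at the same index, which is consistent with how the hunk retires unused messages.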


@ -1298,23 +1298,16 @@ mode rather than an escape sequence. It is also used for [^] in JavaScript
compatibility mode, and for \C in non-utf mode. In non-DOTALL mode, "." behaves
like \N.
The special values ESC_DU, ESC_du, etc. are used instead of ESC_D, ESC_d, etc.
when PCRE2_UCP is set and replacement of \d etc by \p sequences is required.
They must be contiguous, and remain in order so that the replacements can be
looked up from a table.
Negative numbers are used to encode a backreference (\1, \2, \3, etc.) in
check_escape(). There are two tests in the code for an escape
greater than ESC_b and less than ESC_Z to detect the types that may be
repeated. These are the types that consume characters. If any new escapes are
put in between that don't consume a character, that code will have to change.
*/
check_escape(). There are tests in the code for an escape greater than ESC_b
and less than ESC_Z to detect the types that may be repeated. These are the
types that consume characters. If any new escapes are put in between that don't
consume a character, that code will have to change. */
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
ESC_W, ESC_w, ESC_N, ESC_dum, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z,
ESC_E, ESC_Q, ESC_g, ESC_k,
ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu };
ESC_E, ESC_Q, ESC_g, ESC_k };
/********************** Opcode definitions ******************/
@ -1380,7 +1373,8 @@ enum {
OP_CIRC, /* 27 Start of line - not multiline */
OP_CIRCM, /* 28 Start of line - multiline */
/* Single characters; caseful must precede the caseless ones */
/* Single characters; caseful must precede the caseless ones, and these
must remain in this order, and adjacent. */
OP_CHAR, /* 29 Match one character, casefully */
OP_CHARI, /* 30 Match one character, caselessly */


@ -648,18 +648,24 @@ typedef struct pcre2_real_match_data {
#ifndef PCRE2_PCRE2TEST
/* Structure for checking for mutual recursion when scanning compiled code. */
/* Structures for checking for mutual recursion when scanning compiled or
parsed code. */
typedef struct recurse_check {
struct recurse_check *prev;
PCRE2_SPTR group;
} recurse_check;
typedef struct parsed_recurse_check {
struct parsed_recurse_check *prev;
uint32_t *groupptr;
} parsed_recurse_check;
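
A chain of this shape is typically built on the call stack: each level of a recursive scan links a local record to its caller's record, and a group that already appears in the chain signals a recursion loop. The sketch below shows that general technique under assumptions. The demo_recurse_check type, the calls[] model of a pattern (each group calls at most one other group), and scan_group are all hypothetical and are not taken from pcre2_compile.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the chain records: one per active scan level. */
typedef struct demo_recurse_check {
  struct demo_recurse_check *prev;  /* record of the calling level, or NULL */
  int group;                        /* group number being scanned */
} demo_recurse_check;

/* Hypothetical pattern models: calls[g] is the group that group g calls as a
   subroutine, or -1 if it calls none. Index 0 is unused. */
static const int demo_mutual[] = { -1, 2, 1 };   /* 1 calls 2, 2 calls 1 */
static const int demo_simple[] = { -1, 2, -1 };  /* 1 calls 2, 2 calls none */

/* Return 1 if scanning group g leads back into a group already being
   scanned, otherwise 0. */
static int scan_group(const int *calls, int g, demo_recurse_check *chain)
{
  demo_recurse_check this_level;
  demo_recurse_check *r;

  for (r = chain; r != NULL; r = r->prev)
    if (r->group == g) return 1;    /* g is already active: a loop */

  if (calls[g] < 0) return 0;       /* no further call to follow */
  this_level.prev = chain;          /* link this level onto the chain */
  this_level.group = g;
  return scan_group(calls, calls[g], &this_level);
}
```

Keeping the records on the stack means the chain unwinds automatically as each scan level returns, so no heap allocation or explicit cleanup is needed.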
/* Structure for building a cache when filling in recursion offsets. */
typedef struct recurse_cache {
PCRE2_SPTR group;
int recno;
int groupnumber;
} recurse_cache;
/* Structure for maintaining a chain of pointers to the currently incomplete
@ -693,9 +699,10 @@ typedef struct compile_block {
PCRE2_SPTR start_code; /* The start of the compiled code */
PCRE2_SPTR start_pattern; /* The start of the pattern */
PCRE2_SPTR end_pattern; /* The end of the pattern */
PCRE2_SPTR nestptr[2]; /* Pointer(s) saved for string substitution */
PCRE2_UCHAR *name_table; /* The name/number table */
size_t workspace_size; /* Size of workspace */
PCRE2_SIZE workspace_size; /* Size of workspace */
PCRE2_SIZE small_ref_offset[10]; /* Offsets for \1 to \9 */
PCRE2_SIZE erroroffset; /* Offset of error in pattern */
uint16_t names_found; /* Number of entries so far */
uint16_t name_entry_size; /* Size of each entry */
open_capitem *open_caps; /* Chain of open capture items */
@ -703,8 +710,9 @@ typedef struct compile_block {
uint32_t named_group_list_size; /* Number of entries in the list */
uint32_t external_options; /* External (initial) options */
uint32_t external_flags; /* External flag bits to be set */
uint32_t bracount; /* Count of capturing parens as we compile */
uint32_t final_bracount; /* Saved value after first pass */
uint32_t bracount; /* Count of capturing parentheses */
uint32_t lastcapture; /* Last capture encountered */
uint32_t *parsed_pattern; /* Parsed pattern buffer */
uint32_t *groupinfo; /* Group info vector */
uint32_t top_backref; /* Maximum back reference */
uint32_t backref_map; /* Bitmap of low back refs */
@ -718,9 +726,7 @@ typedef struct compile_block {
BOOL had_accept; /* (*ACCEPT) encountered */
BOOL had_pruneorskip; /* (*PRUNE) or (*SKIP) encountered */
BOOL had_recurse; /* Had a recursion or subroutine call */
BOOL check_lookbehind; /* Lookbehinds need later checking */
BOOL dupnames; /* Duplicate names exist */
BOOL iscondassert; /* Next assert is a condition */
} compile_block;
/* Structure for keeping the properties of the in-memory stack used


@ -114,7 +114,7 @@ for (; ptr < ptrend; ptr++)
else if (*ptr == CHAR_BACKSLASH)
{
int erc;
int errorcode = 0;
int errorcode;
uint32_t ch;
if (ptr < ptrend - 1) switch (ptr[1])
@ -127,8 +127,10 @@ for (; ptr < ptrend; ptr++)
continue;
}
ptr += 1; /* Must point after \ */
erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode,
code->overall_options, FALSE, NULL);
ptr -= 1; /* Back to last code unit of escape */
if (errorcode != 0)
{
rc = errorcode;
@ -698,7 +700,7 @@ do
else if ((suboptions & PCRE2_SUBSTITUTE_EXTENDED) != 0 &&
*ptr == CHAR_BACKSLASH)
{
int errorcode = 0;
int errorcode;
if (ptr < repend - 1) switch (ptr[1])
{
@ -728,10 +730,10 @@ do
break;
}
ptr++; /* Point after \ */
rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode,
code->overall_options, FALSE, NULL);
if (errorcode != 0) goto BADESCAPE;
ptr++;
switch(rc)
{


@ -2808,7 +2808,7 @@ return 0;
/* In UTF mode the input is always interpreted as a string of UTF-8 bytes using
the original UTF-8 definition of RFC 2279, which allows for up to 6 bytes, and
code values from 0 to 0x7fffffff. However, values greater than the later UTF
code values from 0 to 0x7fffffff. However, values greater than the later UTF
limit of 0x10ffff cause an error.
In non-UTF mode the input is interpreted as UTF-8 if the utf8_input modifier
@ -2867,7 +2867,7 @@ if (!utf && (pat_patctl.control & CTL_UTF8_INPUT) == 0)
else while (len > 0)
{
int chlen;
int chlen;
uint32_t c;
uint32_t topbit = 0;
if (!utf && *p == 0xff && len > 1)
@ -2875,7 +2875,7 @@ else while (len > 0)
topbit = 0x80000000u;
p++;
len--;
}
}
chlen = utf82ord(p, &c);
if (chlen <= 0) return -1;
if (utf && c > 0x10ffff) return -2;
@ -4494,6 +4494,7 @@ unsigned int delimiter = *p++;
int errorcode;
void *use_pat_context;
PCRE2_SIZE patlen;
PCRE2_SIZE valgrind_access_length;
PCRE2_SIZE erroroffset;
/* Initialize the context and pattern/data controls for this test from the
@ -4537,7 +4538,7 @@ patlen = p - buffer - 2;
if (!decode_modifiers(p, CTX_PAT, &pat_patctl, NULL)) return PR_SKIP;
utf = (pat_patctl.options & PCRE2_UTF) != 0;
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
/* The utf8_input modifier is not allowed in 8-bit mode, and is mutually
exclusive with the utf modifier. */
if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
@ -4550,8 +4551,8 @@ if ((pat_patctl.control & CTL_UTF8_INPUT) != 0)
if (utf)
{
fprintf(outfile, "** The utf and utf8_input modifiers are mutually exclusive\n");
return PR_SKIP;
}
return PR_SKIP;
}
}
/* Check for mutually exclusive modifiers. At present, these are all in the
@ -4949,11 +4950,43 @@ switch(errorcode)
break;
}
/* The pattern is now in pbuffer[8|16|32], with the length in patlen. By
default, however, we pass a zero-terminated pattern. The length is passed only
if we had a hex pattern. */
/* The pattern is now in pbuffer[8|16|32], with the length in code units in
patlen. By default, however, we pass a zero-terminated pattern. The length is
passed only if we had a hex pattern. When valgrind is supported, arrange for
the unused part of the buffer to be marked as no access. */
if ((pat_patctl.control & CTL_HEXPAT) == 0) patlen = PCRE2_ZERO_TERMINATED;
valgrind_access_length = patlen;
if ((pat_patctl.control & CTL_HEXPAT) == 0)
{
patlen = PCRE2_ZERO_TERMINATED;
valgrind_access_length += 1; /* For the terminating zero */
}
#ifdef SUPPORT_VALGRIND
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE && pbuffer8 != NULL)
{
VALGRIND_MAKE_MEM_NOACCESS(pbuffer8 + valgrind_access_length,
pbuffer8_size - valgrind_access_length);
}
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE && pbuffer16 != NULL)
{
VALGRIND_MAKE_MEM_NOACCESS(pbuffer16 + valgrind_access_length,
pbuffer16_size - valgrind_access_length*sizeof(uint16_t));
}
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE && pbuffer32 != NULL)
{
VALGRIND_MAKE_MEM_NOACCESS(pbuffer32 + valgrind_access_length,
pbuffer32_size - valgrind_access_length*sizeof(uint32_t));
}
#endif
#else /* Valgrind not supported */
(void)valgrind_access_length; /* Avoid compiler warning */
#endif
/* If #newline_default has been used and the library was not compiled with an
appropriate default newline setting, local_newline_default will be non-zero. We
@ -4996,6 +5029,65 @@ if (timeit > 0)
PCRE2_COMPILE(compiled_code, pbuffer, patlen, pat_patctl.options|forbid_utf,
&errorcode, &erroroffset, use_pat_context);
/* Call the JIT compiler if requested. When timing, we must free and recompile
the pattern each time because that is the only way to free the JIT compiled
code. We know that compilation will always succeed. */
if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
{
if (timeit > 0)
{
register int i;
clock_t time_taken = 0;
for (i = 0; i < timeit; i++)
{
clock_t start_time;
SUB1(pcre2_code_free, compiled_code);
PCRE2_COMPILE(compiled_code, pbuffer, patlen,
pat_patctl.options|forbid_utf, &errorcode, &erroroffset,
use_pat_context);
start_time = clock();
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
time_taken += clock() - start_time;
}
total_jit_compile_time += time_taken;
fprintf(outfile, "JIT compile %.4f milliseconds\n",
(((double)time_taken * 1000.0) / (double)timeit) /
(double)CLOCKS_PER_SEC);
}
else
{
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
}
}
/* If valgrind is supported, mark the pbuffer as accessible again. The 16-bit
and 32-bit buffers can be marked completely undefined, but we must leave the
pattern in the 8-bit buffer defined because it may be read from a callout
during matching. */
#ifdef SUPPORT_VALGRIND
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE)
{
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer8 + valgrind_access_length,
pbuffer8_size - valgrind_access_length);
}
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE)
{
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer16, pbuffer16_size);
}
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE)
{
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer32, pbuffer32_size);
}
#endif
#endif
/* Compilation failed; go back for another re, skipping to blank line
if non-interactive. */
@ -5029,38 +5121,6 @@ if (forbid_utf != 0)
if (pattern_info(PCRE2_INFO_MAXLOOKBEHIND, &maxlookbehind, FALSE) != 0)
return PR_ABEND;
/* Call the JIT compiler if requested. When timing, we must free and recompile
the pattern each time because that is the only way to free the JIT compiled
code. We know that compilation will always succeed. */
if (pat_patctl.jit != 0)
{
if (timeit > 0)
{
register int i;
clock_t time_taken = 0;
for (i = 0; i < timeit; i++)
{
clock_t start_time;
SUB1(pcre2_code_free, compiled_code);
PCRE2_COMPILE(compiled_code, pbuffer, patlen,
pat_patctl.options|forbid_utf, &errorcode, &erroroffset,
use_pat_context);
start_time = clock();
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
time_taken += clock() - start_time;
}
total_jit_compile_time += time_taken;
fprintf(outfile, "JIT compile %.4f milliseconds\n",
(((double)time_taken * 1000.0) / (double)timeit) /
(double)CLOCKS_PER_SEC);
}
else
{
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
}
}
/* If an explicit newline modifier was given, set the information flag in the
pattern so that it is preserved over push/pop. */
@ -5300,10 +5360,10 @@ if (post_start > 0)
for (i = 0; i < subject_length - pre_start - post_start + 4; i++)
fprintf(outfile, " ");
fprintf(outfile, "%.*s",
(int)((cb->next_item_length == 0)? 1 : cb->next_item_length),
pbuffer8 + cb->pattern_position);
if (cb->next_item_length != 0)
fprintf(outfile, "%.*s", (int)(cb->next_item_length),
pbuffer8 + cb->pattern_position);
fprintf(outfile, "\n");
first_callout = FALSE;
@ -5740,18 +5800,18 @@ while ((c = *p++) != 0)
continue;
}
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
/* Handle a non-escaped character. In non-UTF 32-bit mode with utf8_input
set, do the fudge for setting the top bit. */
if (c != '\\')
{
uint32_t topbit = 0;
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
if (test_mode == PCRE32_MODE && c == 0xff && *p != 0)
{
topbit = 0x80000000;
c = *p++;
}
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
}
if ((utf || (pat_patctl.control & CTL_UTF8_INPUT) != 0) &&
HASUTF8EXTRALEN(c)) { GETUTF8INC(c, p); }
c |= topbit;
}
@ -6405,7 +6465,7 @@ else for (gmatched = 0;; gmatched++)
}
/* Otherwise just run a single match, setting up a callout if required (the
default). */
default). There is a copy of the pattern in pbuffer8 for use by callouts. */
else
{
@ -7583,6 +7643,10 @@ if (argc > 1 && strcmp(argv[op], "-") != 0)
}
}
#if defined(SUPPORT_LIBREADLINE) || defined(SUPPORT_LIBEDIT)
if (INTERACTIVE(infile)) using_history();
#endif
if (argc > 2)
{
outfile = fopen(argv[op+1], OUTPUT_MODE);
@ -7621,8 +7685,7 @@ while (notdone)
p = buffer;
/* If we have a pattern set up for testing, or we are skipping after a
compile failure, a blank line terminates this test; otherwise process the
line as a data line. */
compile failure, a blank line terminates this test. */
if (expectdata || skipping)
{
@ -7645,14 +7708,21 @@ while (notdone)
skipping = FALSE;
setlocale(LC_CTYPE, "C");
}
/* Otherwise, if we are not skipping, and the line is not a data comment
line starting with "\=", process a data line. */
else if (!skipping && !(p[0] == '\\' && p[1] == '=' && isspace(p[2])))
{
rc = process_data();
}
}
/* We do not have a pattern set up for testing. Lines starting with # are
either comments or special commands. Blank lines are ignored. Otherwise, the
line must start with a valid delimiter. It is then processed as a pattern
line. */
line. A copy of the pattern is left in pbuffer8 for use by callouts. Under
valgrind, make the unused part of the buffer undefined, to catch overruns. */
else if (*p == '#')
{
@ -7713,6 +7783,10 @@ if (showtotaltimes)
EXIT:
#if defined(SUPPORT_LIBREADLINE) || defined(SUPPORT_LIBEDIT)
if (infile != NULL && INTERACTIVE(infile)) clear_history();
#endif
if (infile != NULL && infile != stdin) fclose(infile);
if (outfile != NULL && outfile != stdout) fclose(outfile);
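The data-comment test added in the main loop above (a line starting with "\=" followed by white space is not passed to process_data()) can be sketched as a standalone predicate. The function name `is_data_comment_line` is hypothetical and exists only for this illustration; the cast to `unsigned char` is an addition here to keep the `isspace()` call well defined for bytes above 127.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of pcre2test): return 1 if the line starts
   with "\=" followed by a white space character, which is the form the main
   loop treats as a data comment line, and 0 otherwise. The && chain stops at
   the first mismatch, so short strings are never read past their terminator. */
int is_data_comment_line(const char *p)
{
  assert(p != NULL);
  return p[0] == '\\' && p[1] == '=' && isspace((unsigned char)p[2]);
}
```

With this predicate, a line such as "\= Expect no match" is recognised as a comment, whereas "\=" followed directly by a non-space character is still treated as a data line.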

14
testdata/testinput1 vendored

@ -5792,4 +5792,18 @@ name)/mark
aaaccccaaa
bccccb
# /x does not apply to MARK labels
/x (*MARK:ab cd # comment
ef) x/x,mark
axxz
/(?<=a(B){0}c)X/
acX
/(?<DEFINE>b)(?(DEFINE)(a+))(?&DEFINE)/
bbbb
\= Expect no match
baaab
# End of testinput1

View File

@ -79,7 +79,7 @@
/((?2))((?1))/
abc
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
/(?(R)a*(?1)|((?R))b)/

View File

@ -177,7 +177,7 @@
/((?2))((?1))/
abc
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
/(?(R)a*(?1)|((?R))b)/

73
testdata/testinput2 vendored
View File

@ -189,9 +189,9 @@
the barfoo
and cattlefoo
/(?<=a+)b/
/abc(?<=a+)b/
/(?<=aaa|b{0,3})b/
/12345(?<=aaa|b{0,3})b/
/(?<!(foo)a\1)bar/
@ -4518,6 +4518,18 @@
\ B)x/x,alt_verbnames,mark
x
/(*: A \ and #comment
\ B)x/alt_verbnames,mark
x
/(*: A \ and #comment
\ B)x/x,mark
x
/(*: A \ and #comment
\ B)x/mark
x
/(*:A
B)x/alt_verbnames,mark
x
@ -4819,4 +4831,61 @@ a)"xI
/\[AB]{6000000000000000000000}/expand
# Hex uses pattern length, not zero-terminated. This tests for overrunning
# the given length of a pattern.
/'(*U'/hex
/'(*'/hex
/'('/hex
//hex
# These tests are here because Perl never allows a back reference in a
# lookbehind. PCRE2 supports some limited cases.
/([ab])...(?<=\1)z/
a11az
b11bz
\= Expect no match
b11az
/(?|([ab]))...(?<=\1)z/
/([ab])(\1)...(?<=\2)z/
aa11az
/(a\2)(b\1)(?<=\2)/
/(?<A>[ab])...(?<=\k'A')z/
a11az
b11bz
\= Expect no match
b11az
/(?<A>[ab])...(?<=\k'A')(?<A>)z/dupnames
# Perl does not support \g+n
/((\g+1X)?([ab]))+/
aaXbbXa
/ab(?C1)c/auto_callout
abc
/'ab(?C1)c'/hex,auto_callout
abc
# Perl accepts these, but gives a warning. We can't warn, so give an error.
/[a-[:digit:]]+/
a-a9-a
/[A-[:digit:]]+/
A-A9-A
/[a-\d]+/
a-a9-a
# End of testinput2

9
testdata/testinput5 vendored
View File

@ -1723,5 +1723,14 @@
\x{1d7cf}
\= Expect no match
\x{10000}
# Hex uses pattern length, not zero-terminated. This tests for overrunning
# the given length of a pattern.
/'(*UTF)'/hex
/a(?<=A\XB)/utf
/ab(?<=A\RB)/utf
# End of testinput5

2
testdata/testinput6 vendored
View File

@ -4635,7 +4635,7 @@
/((?(R)a+|(?1)b))/
aaaabcde
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
/(?(R)a*(?1)|((?R))b)/

12
testdata/testinput8 vendored
View File

@ -161,18 +161,14 @@
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
/(?(1)(?1)){8,}+()/debug
abcd

20
testdata/testoutput1 vendored
View File

@ -9257,4 +9257,24 @@ No match
1: b
2: cccc
# /x does not apply to MARK labels
/x (*MARK:ab cd # comment
ef) x/x,mark
axxz
0: xx
MK: ab cd # comment\x0aef
/(?<=a(B){0}c)X/
acX
0: X
/(?<DEFINE>b)(?(DEFINE)(a+))(?&DEFINE)/
bbbb
0: bb
1: b
\= Expect no match
baaab
No match
# End of testinput1

View File

@ -557,7 +557,7 @@ Subject length lower bound = 1
0: \x{11234}
/(*UTF-32)\x{11234}/
Failed: error 134 at offset 17: character code point value in \x{} or \o{} is too large
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
abcd\x{11234}pqr
/(*UTF-32)\x{112}/

View File

@ -188,7 +188,7 @@ Failed: error -53: recursion limit exceeded
abc
Failed: error -52: nested recursion at the same subject position
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
Failed: error -52: nested recursion at the same subject position

View File

@ -335,7 +335,7 @@ Failed: error -47: match limit exceeded
abc
Failed: error -46: JIT stack limit reached
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
Failed: error -46: JIT stack limit reached

View File

@ -139,7 +139,7 @@ No match: POSIX code 17: match failed
0+ issippi
/abc/\
Failed: POSIX code 9: bad escape sequence at offset 3
Failed: POSIX code 9: bad escape sequence at offset 4
"(?(?C)"
Failed: POSIX code 11: unbalanced () at offset 6

552
testdata/testoutput2 vendored

File diff suppressed because it is too large

View File

@ -76,7 +76,7 @@
------------------------------------------------------------------
/ab\Cde/never_backslash_c
Failed: error 183 at offset 3: using \C is disabled by the application
Failed: error 183 at offset 4: using \C is disabled by the application
/ab\Cde/info
Capturing subpattern count = 0

View File

@ -17,7 +17,7 @@ Subject length lower bound = 0
# 16-bit modes, but not in 32-bit mode.
/(?<=ab\Cde)X/utf
Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-16 mode
Failed: error 136 at offset 0: \C is not allowed in a lookbehind assertion in UTF-16 mode
ab!deXYZ
# Autopossessification tests

View File

@ -17,7 +17,7 @@ Subject length lower bound = 0
# 16-bit modes, but not in 32-bit mode.
/(?<=ab\Cde)X/utf
Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-8 mode
Failed: error 136 at offset 0: \C is not allowed in a lookbehind assertion in UTF-8 mode
ab!deXYZ
# Autopossessification tests

View File

@ -3,6 +3,6 @@
# correct error message.
/a\Cb/
Failed: error 185 at offset 2: using \C is disabled in this PCRE2 library
Failed: error 185 at offset 3: using \C is disabled in this PCRE2 library
# End of testinput23

15
testdata/testoutput5 vendored

@ -1746,7 +1746,7 @@ No match
------------------------------------------------------------------
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
Failed: error 173 at offset 5: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
Failed: error 173 at offset 6: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
/^a+[a\x{200}]/B,utf
------------------------------------------------------------------
@ -3997,7 +3997,7 @@ Failed: error 122 at offset 1227: unmatched closing parenthesis
/$(&.+[\p{Me}].\s\xdcC*?(?(<y>))(?<!^)$C((;*?(R))+(?(R)){0,6}?|){12\x8a\X*?\x8a\x0b\xd1^9\3*+(\xc1,\k'P'\xb4)\xcc(z\z(?JJ)(?'X'8};(\x0b\xd1^9\?'3*+(\xc1.]k+\x0b'Pm'\xb4\xcc4'\xd1'(?'X'))?-%--\x95$9*\4'|\xd1(''%\x95*$9)#(?'R')3\x07?('P\xed')\\x16:;()\x1e\x10*:(?<y>)\xd1+!~:(?)''(d'E:yD!\s(?'R'\x1e;\x10:U))|')g!\xb0*){29+))#(?'P'})*?/
"(*UTF)(*UCP)(.UTF).+X(\V+;\^(\D|)!999}(?(?C{7(?C')\H*\S*/^\x5\xa\\xd3\x85n?(;\D*(?m).[^mH+((*UCP)(*U:F)})(?!^)(?'"
Failed: error 124 at offset 113: letter or underscore expected after (?< or (?'
Failed: error 162 at offset 113: subpattern name expected
/[\pS#moq]/
=
@ -4159,5 +4159,16 @@ No match
\= Expect no match
\x{10000}
No match
# Hex uses pattern length, not zero-terminated. This tests for overrunning
# the given length of a pattern.
/'(*UTF)'/hex
/a(?<=A\XB)/utf
Failed: error 125 at offset 1: lookbehind assertion is not fixed length
/ab(?<=A\RB)/utf
Failed: error 125 at offset 2: lookbehind assertion is not fixed length
# End of testinput5

52
testdata/testoutput6 vendored
View File

@ -713,7 +713,7 @@ No match
/(ab|cd){3,4}/auto_callout
ababab
--->ababab
+0 ^ (ab|cd){3,4}
+0 ^ (
+1 ^ a
+4 ^ c
+2 ^^ b
@ -732,7 +732,7 @@ No match
0: ababab
abcdabcd
--->abcdabcd
+0 ^ (ab|cd){3,4}
+0 ^ (
+1 ^ a
+4 ^ c
+2 ^^ b
@ -740,7 +740,7 @@ No match
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ )
+6 ^ ^ ){3,4}
+1 ^ ^ a
+4 ^ ^ c
+2 ^ ^ b
@ -749,13 +749,13 @@ No match
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ )
+6 ^ ^ ){3,4}
+12 ^ ^
0: abcdabcd
1: abcdab
abcdcdcdcdcd
--->abcdcdcdcdcd
+0 ^ (ab|cd){3,4}
+0 ^ (
+1 ^ a
+4 ^ c
+2 ^^ b
@ -763,16 +763,16 @@ No match
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ )
+6 ^ ^ ){3,4}
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ )
+6 ^ ^ ){3,4}
+12 ^ ^
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ )
+6 ^ ^ ){3,4}
+12 ^ ^
0: abcdcdcd
1: abcdcd
@ -6712,26 +6712,26 @@ No match
--->"ab"
+0 ^ ^
+1 ^ "
+2 ^^ ((?(?=[a])[^"])|b)*
+2 ^^ (
+21 ^^ "
+3 ^^ (?(?=[a])[^"])
+3 ^^ (?
+18 ^^ b
+5 ^^ (?=[a])
+5 ^^ (?=
+8 ^ [a]
+11 ^^ )
+12 ^^ [^"]
+16 ^ ^ )
+17 ^ ^ |
+21 ^ ^ "
+3 ^ ^ (?(?=[a])[^"])
+3 ^ ^ (?
+18 ^ ^ b
+5 ^ ^ (?=[a])
+5 ^ ^ (?=
+8 ^ [a]
+19 ^ ^ )
+19 ^ ^ )*
+21 ^ ^ "
+3 ^ ^ (?(?=[a])[^"])
+3 ^ ^ (?
+18 ^ ^ b
+5 ^ ^ (?=[a])
+5 ^ ^ (?=
+8 ^ [a]
+17 ^ ^ |
+22 ^ ^ $
@ -7154,7 +7154,7 @@ Failed: error -52: nested recursion at the same subject position
aaaabcde
0: aaaab
/((?(R2)a+|(?1)b))/
/((?(R2)a+|(?1)b))()/
aaaabcde
Failed: error -40: backreference condition or recursion test is not supported for DFA matching
@ -7548,7 +7548,7 @@ Callout (10): {AB} last capture = 0
Bra
^
Cond
Callout 25 9 7
Callout 25 9 3
Assert
abc
Ket
@ -7561,11 +7561,11 @@ Callout (10): {AB} last capture = 0
------------------------------------------------------------------
abcdefg
--->abcdefg
25 ^ (?=abc)
25 ^ (?=
0: abcd
xyz123
--->xyz123
25 ^ (?=abc)
25 ^ (?=
0: xyz
/^(?(?C$abc$)(?=abc)abcd|xyz)/B
@ -7573,7 +7573,7 @@ Callout (10): {AB} last capture = 0
Bra
^
Cond
CalloutStr $abc$ 7 12 7
CalloutStr $abc$ 7 12 3
Assert
abc
Ket
@ -7587,12 +7587,12 @@ Callout (10): {AB} last capture = 0
abcdefg
Callout (7): $abc$
--->abcdefg
^ (?=abc)
^ (?=
0: abcd
xyz123
Callout (7): $abc$
--->xyz123
^ (?=abc)
^ (?=
0: xyz
/^ab(?C'first')cd(?C"second")ef/
@ -7609,13 +7609,13 @@ Callout (20): "second"
aaaXY
Callout (8): `code`
--->aaaXY
^^ )
^^ ){3}
Callout (8): `code`
--->aaaXY
^ ^ )
^ ^ ){3}
Callout (8): `code`
--->aaaXY
^ ^ )
^ ^ ){3}
0: aaaX
# Binary zero in callout string

View File

@ -854,23 +854,17 @@ Failed: error 184 at offset 1540: (?| and/or (?J: or (?x: parentheses are too de
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
Failed: error 186 at offset 594: regular expression is too complicated
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5813: regular expression is too complicated
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
Failed: error 186 at offset 594: regular expression is too complicated
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5820: regular expression is too complicated
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
Failed: error 186 at offset 594: regular expression is too complicated
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 594: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 594: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------
@ -1031,6 +1025,5 @@ Subject length lower bound = 0
Failed: error 114 at offset 509: missing closing parenthesis
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
Failed: error 186 at offset 490: regular expression is too complicated
# End of testinput8

View File

@ -853,20 +853,15 @@ Memory allocation (code space): 18
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 1147: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 1147: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,20 +853,15 @@ Memory allocation (code space): 18
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 1147: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 1147: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,20 +853,17 @@ Memory allocation (code space): 28
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5813: regular expression is too complicated
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5820: regular expression is too complicated
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,20 +853,17 @@ Memory allocation (code space): 28
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5813: regular expression is too complicated
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5820: regular expression is too complicated
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,20 +853,17 @@ Memory allocation (code space): 28
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5813: regular expression is too complicated
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5820: regular expression is too complicated
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 979: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -854,22 +854,16 @@ Failed: error 184 at offset 1540: (?| and/or (?J: or (?x: parentheses are too de
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
Failed: error 186 at offset 637: regular expression is too complicated
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
Failed: error 186 at offset 5820: regular expression is too complicated
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
Failed: error 186 at offset 637: regular expression is too complicated
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 637: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 637: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,21 +853,15 @@ Memory allocation (code space): 12
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
Failed: error 186 at offset 936: regular expression is too complicated
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
Failed: error 186 at offset 936: regular expression is too complicated
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 936: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -853,19 +853,15 @@ Memory allocation (code space): 14
# Use "expand" to create some very long patterns with nested parentheses, in
# order to test workspace overflow. Again, this varies with code unit width,
# and even with it fails in two modes, the error offset differs. It also varies
# and even when it fails in two modes, the error offset differs. It also varies
# with link size - hence multiple tests with different values.
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
Failed: error 186 at offset 1224: regular expression is too complicated
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
Failed: error 186 at offset 12820: regular expression is too complicated
/(?(1)(?1)){8,}+()/debug
------------------------------------------------------------------

View File

@ -307,14 +307,14 @@ Subject length lower bound = 1
------------------------------------------------------------------
/\777/I
Failed: error 151 at offset 3: octal value is greater than \377 in 8-bit non-UTF-8 mode
Failed: error 151 at offset 4: octal value is greater than \377 in 8-bit non-UTF-8 mode
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF)XX/mark
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
XX
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF)XX/mark,alt_verbnames
Failed: error 176 at offset 258: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
XX
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE)XX/mark
@ -328,10 +328,10 @@ MK: 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789AB
MK: 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE
/\u0100/alt_bsux,allow_empty_class,match_unset_backref,dupnames
Failed: error 177 at offset 5: character code point value in \u.... sequence is too large
Failed: error 177 at offset 6: character code point value in \u.... sequence is too large
/[\u0100-\u0200]/alt_bsux,allow_empty_class,match_unset_backref,dupnames
Failed: error 177 at offset 6: character code point value in \u.... sequence is too large
Failed: error 177 at offset 7: character code point value in \u.... sequence is too large
/[^\x00-a]{12,}[^b-\xff]*/B
------------------------------------------------------------------