Major refactoring of pcre2_compile.c; see ChangeLog and HACKING.
parent
dda1e79060
commit
99264dfc23
40
ChangeLog
|
@ -14,6 +14,46 @@ cause all characters greater than 255 to match, whatever else is in the class.
|
|||
There was a bug that caused this not to happen if a Unicode property item was
|
||||
added to such a class, for example [\D\P{Nd}] or [\W\pL].
|
||||
|
||||
3. There has been a major re-factoring of the pcre2_compile.c file. Most syntax
|
||||
checking is now done in the pre-pass that identifies capturing groups. This has
|
||||
reduced the amount of duplication and made the code tidier. While doing this,
|
||||
some minor bugs and Perl incompatibilities were fixed, including:
|
||||
|
||||
(a) \Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead
|
||||
of giving an invalid quantifier error.
|
||||
(b) {0} can now be used after a group in a lookbehind assertion; previously
|
||||
this caused an "assertion is not fixed length" error.
|
||||
(c) Perl always treats (?(DEFINE) as a "define" group, even if a group with
|
||||
the name "DEFINE" exists. PCRE2 now does likewise.
|
||||
(d) A recursion condition test such as (?(R2)...) must now refer to an
|
||||
existing subpattern.
|
||||
|
||||
One effect of the refactoring is that some error numbers and messages have
|
||||
changed, and the pattern offset given for compiling errors is not always the
|
||||
right-most character that has been read. In particular, for a variable-length
|
||||
lookbehind assertion it now points to the start of the assertion. Another
|
||||
change is that when a callout appears before a group, the "length of next
|
||||
pattern item" that is passed now just gives the length of the opening
|
||||
parenthesis item, not the length of the whole group. A length of zero is now
|
||||
given only for a callout at the end of the pattern. Automatic callouts are no
|
||||
longer inserted before and after explicit callouts in the pattern.
|
||||
|
||||
4. Back references are now permitted in lookbehind assertions when there are
|
||||
no duplicated group numbers (that is, (?| has not been used), and, if the
|
||||
reference is by name, there is only one group of that name. The referenced
|
||||
group must, of course, be of fixed length.
|
||||
|
||||
5. pcre2test has been upgraded so that, when run under valgrind with valgrind
|
||||
support enabled, reading past the end of the pattern is detected, both when
|
||||
compiling and during callout processing.
|
||||
|
||||
6. \g{+<number>} (e.g. \g{+2}) is now supported. It is a "forward back
|
||||
reference" and can be useful in repetitions (compare \g{-<number>}). Perl does
|
||||
not recognize this syntax.
|
||||
|
||||
7. Automatic callouts are no longer generated before and after callouts in the
|
||||
pattern.
|
||||
|
||||
|
||||
Version 10.22 29-July-2016
|
||||
--------------------------
|
||||
|
|
237
HACKING
|
@ -7,8 +7,8 @@ but with a revised (and incompatible) API. To avoid confusion, the original
|
|||
library is referred to as PCRE1 below. For information about testing PCRE2, see
|
||||
the pcre2test documentation and the comment at the head of the RunTest file.
|
||||
|
||||
PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will
|
||||
continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid
|
||||
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
|
||||
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
|
||||
confusion with PCRE1.
|
||||
|
||||
|
||||
|
@ -16,19 +16,20 @@ Historical note 1
|
|||
-----------------
|
||||
|
||||
Many years ago I implemented some regular expression functions to an algorithm
|
||||
suggested by Martin Richards. These were not Unix-like in form, and were quite
|
||||
restricted in what they could do by comparison with Perl. The interesting part
|
||||
about the algorithm was that the amount of space required to hold the compiled
|
||||
form of an expression was known in advance. The code to apply an expression did
|
||||
not operate by backtracking, as the original Henry Spencer code and current
|
||||
PCRE2 and Perl code does, but instead checked all possibilities simultaneously
|
||||
by keeping a list of current states and checking all of them as it advanced
|
||||
through the subject string. In the terminology of Jeffrey Friedl's book, it was
|
||||
a "DFA algorithm", though it was not a traditional Finite State Machine (FSM).
|
||||
When the pattern was all used up, all remaining states were possible matches,
|
||||
and the one matching the longest subset of the subject string was chosen. This
|
||||
did not necessarily maximize the individual wild portions of the pattern, as is
|
||||
expected in Unix and Perl-style regular expressions.
|
||||
suggested by Martin Richards. The rather simple patterns were not Unix-like in
|
||||
form, and were quite restricted in what they could do by comparison with Perl.
|
||||
The interesting part about the algorithm was that the amount of space required
|
||||
to hold the compiled form of an expression was known in advance. The code to
|
||||
apply an expression did not operate by backtracking, as the original Henry
|
||||
Spencer code and current PCRE2 and Perl code does, but instead checked all
|
||||
possibilities simultaneously by keeping a list of current states and checking
|
||||
all of them as it advanced through the subject string. In the terminology of
|
||||
Jeffrey Friedl's book, it was a "DFA algorithm", though it was not a
|
||||
traditional Finite State Machine (FSM). When the pattern was all used up, all
|
||||
remaining states were possible matches, and the one matching the longest subset
|
||||
of the subject string was chosen. This did not necessarily maximize the
|
||||
individual wild portions of the pattern, as is expected in Unix and Perl-style
|
||||
regular expressions.
|
||||
|
||||
|
||||
Historical note 2
|
||||
|
@ -85,7 +86,7 @@ had become very complicated and hard to maintain. Indeed one of the early
|
|||
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
|
||||
I had a flash of inspiration as to how I could run the real compile function in
|
||||
a "fake" mode that enables it to compute how much memory it would need, while
|
||||
actually only ever using a few hundred bytes of working memory, and without too
|
||||
in most cases only ever using a small amount of working memory, and without too
|
||||
many tests of the mode that might slow it down. So I refactored the compiling
|
||||
functions to work this way. This got rid of about 600 lines of source. It
|
||||
should make future maintenance and development easier. As this was such a major
|
||||
|
@ -104,20 +105,204 @@ system stack used by the compile function, which uses recursive function calls
|
|||
for nested parenthesized groups. This is a safety feature for environments with
|
||||
small stacks where the patterns are provided by users.
|
||||
|
||||
History repeated itself for release 10.20. A number of bugs relating to named
|
||||
subpatterns had been discovered by fuzzers. Most of these were related to the
|
||||
handling of forward references when it was not known if the named pattern was
|
||||
|
||||
Yet another pattern scan
|
||||
------------------------
|
||||
|
||||
History repeated itself for PCRE2 release 10.20. A number of bugs relating to
|
||||
named subpatterns had been discovered by fuzzers. Most of these were related to
|
||||
the handling of forward references when it was not known if the named group was
|
||||
unique. (References to non-unique names use a different opcode and more
|
||||
memory.) The use of duplicate group numbers (the (?| facility) also caused
|
||||
issues.
|
||||
|
||||
To get around these problems I adopted a new approach by adding a third pass,
|
||||
really a "pre-pass", over the pattern, which does nothing other than identify
|
||||
To get around these problems I adopted a new approach by adding a third pass
|
||||
over the pattern (really a "pre-pass"), which did nothing other than identify
|
||||
all the named subpatterns and their corresponding group numbers. This means
|
||||
that the actual compile (both pre-pass and real compile) have full knowledge of
|
||||
group names and numbers throughout. Several dozen lines of messy code were
|
||||
eliminated, though the new pre-pass is not short (skipping over [] classes is
|
||||
complicated).
|
||||
that the actual compile (both the memory-computing dummy run and the real
|
||||
compile) has full knowledge of group names and numbers throughout. Several
|
||||
dozen lines of messy code were eliminated, though the new pre-pass was not
|
||||
short. In particular, parsing and skipping over [] classes is complicated.
|
||||
|
||||
While working on 10.22 I realized that I could simplify yet again by moving
|
||||
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
||||
after 10.22 was released, the code underwent yet another big refactoring. This
|
||||
is how it is from 10.23 onwards:
|
||||
|
||||
The function called parse_regex() scans the pattern characters, parsing them
|
||||
into literal data and meta characters. It converts escapes such as \x{123}
|
||||
into literals, handles \Q...\E, and skips over comments and non-significant
|
||||
white space. The result of the scanning is put into a vector of 32-bit unsigned
|
||||
integers. Values less than 0x80000000 are literal data. Higher values represent
|
||||
meta-characters. The top 16 bits of such values identify the meta-character,
|
||||
and these are given names such as META_CAPTURE. The lower 16 bits are available
|
||||
for data, for example, the capturing group number. The only situation in which
|
||||
literal data values greater than 0x7fffffff can appear is when the 32-bit
|
||||
library is running in non-UTF mode. This is handled by having a special
|
||||
meta-character that is followed by the 32-bit data value.
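
The element layout just described can be sketched as follows. This is an
illustration only, not the code in pcre2_compile.c. The only value taken from
this file is META_END (0x80000000); the META_CAPTURE value and the helper
names is_meta(), meta_code(), meta_data() and make_meta() are assumptions made
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* META_END is documented as 0x80000000. The META_CAPTURE value below is
   hypothetical and chosen only for this sketch. */
#define META_END      0x80000000u
#define META_CAPTURE  0x80080000u

/* An element is a meta item if it is not below 0x80000000. */
static int is_meta(uint32_t element)
{
  return element >= META_END;
}

/* The top 16 bits identify the meta item. */
static uint32_t meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits hold data, for example a capturing group number. */
static uint32_t meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}

/* Combine a meta code with up to 16 bits of data. */
static uint32_t make_meta(uint32_t code, uint32_t data)
{
  assert(is_meta(code) && (code & 0xffffu) == 0 && data <= 0xffffu);
  return code | data;
}
```

With these helpers, make_meta(META_CAPTURE, 3) would stand for the opening
parenthesis of capturing group 3, while an element such as 'A' is a literal.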
|
||||
|
||||
The size of the parsed pattern vector, when auto-callouts are not enabled, is
|
||||
bounded by the length of the pattern (with one exception). The code is written
|
||||
so that each item in the pattern uses no more vector elements than the number
|
||||
of code units in the item itself. The exception is the aforementioned large
|
||||
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
|
||||
advance to check for such values. When auto-callouts are enabled, the generous
|
||||
assumption is made that there will be a callout for each pattern code unit
|
||||
(which of course is only actually true if all code units are literals) plus one
|
||||
at the end. There is a default parsed pattern vector on the stack, but if this
|
||||
is not big enough, heap memory is used.
|
||||
|
||||
As before, the actual compiling function is run twice, the first time to
|
||||
determine the amount of memory needed for the final compiled pattern. It
|
||||
now processes the parsed pattern vector, not the pattern itself, although some
|
||||
of the parsed items refer to strings in the pattern - for example, group
|
||||
names. As escapes and comments have already been processed, the code is a bit
|
||||
simpler than before.
|
||||
|
||||
Most errors can be diagnosed during the parsing scan. For those that cannot
|
||||
(for example, "lookbehind assertion is not fixed length"), the parsed code
|
||||
contains offsets into the pattern so that the actual compiling code can
|
||||
identify where errors occur.
|
||||
|
||||
|
||||
The elements of the parsed pattern vector
|
||||
-----------------------------------------
|
||||
|
||||
The word "offset" below means a code unit offset into the pattern. When
|
||||
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
|
||||
stored in a single parsed pattern element. Otherwise (typically on 64-bit
|
||||
systems) it occupies two elements. The following meta items occupy just one
|
||||
element, with no data:
|
||||
|
||||
  META_ACCEPT           (*ACCEPT)
  META_ALT              | alternation
  META_ASTERISK         *
  META_ASTERISK_PLUS    *+
  META_ASTERISK_QUERY   *?
  META_ATOMIC           (?> start of atomic group
  META_CIRCUMFLEX       ^ metacharacter
  META_CLASS            [ start of non-empty class
  META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
  META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
  META_CLASS_END        ] end of non-empty class
  META_CLASS_NOT        [^ start non-empty negative class
  META_COMMIT           (*COMMIT)
  META_DOLLAR           $ metacharacter
  META_DOT              . metacharacter
  META_END              End of pattern (this value is 0x80000000)
  META_FAIL             (*FAIL)
  META_KET              ) closing parenthesis
  META_LOOKAHEAD        (?= start of lookahead
  META_LOOKAHEADNOT     (?! start of negative lookahead
  META_NOCAPTURE        (?: no capture parens
  META_PLUS             +
  META_PLUS_PLUS        ++
  META_PLUS_QUERY       +?
  META_PRUNE            (*PRUNE) - no argument
  META_QUERY            ?
  META_QUERY_PLUS       ?+
  META_QUERY_QUERY      ??
  META_RANGE_ESCAPED    hyphen in class range with at least one escape
  META_RANGE_LITERAL    hyphen in class range defined literally
  META_SKIP             (*SKIP) - no argument
  META_THEN             (*THEN) - no argument
|
||||
|
||||
The two RANGE values occur only in character classes. They are positioned
|
||||
between two literals that define the start and end of the range. In an EBCDIC
|
||||
environment it is necessary to know whether either of the range values was
|
||||
specified as an escape. In an ASCII/Unicode environment the distinction is not
|
||||
relevant.
|
||||
|
||||
The following have data in the lower 16 bits, and may be followed by other data
|
||||
elements:
|
||||
|
||||
  META_BACKREF
  META_CAPTURE
  META_ESCAPE
  META_RECURSE
|
||||
|
||||
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
|
||||
their data in the lower 16 bits of the element.
|
||||
|
||||
META_BACKREF is followed by an offset if the back reference group number is 10
|
||||
or more. The offsets of the first occurrences of references to groups whose
|
||||
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
|
||||
occurrence is useful). On 64-bit systems this avoids using more than two parsed
|
||||
pattern elements for items such as \3. The offset is used when an error is
|
||||
given for a reference to a non-existent group.
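
The treatment of small and large back reference numbers might look roughly
like this. It is a sketch under stated assumptions, not the actual parser: the
META_BACKREF value, the NO_OFFSET marker, the cut-down compile_block structure
and the function names are all hypothetical, and the offset is assumed to fit
in a single element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define META_BACKREF  0x80030000u    /* hypothetical value */
#define NO_OFFSET     ((size_t)-1)   /* hypothetical "not yet seen" marker */

/* Hypothetical stand-in for the relevant part of the compile block. */
typedef struct {
  size_t small_ref_offset[10];
} compile_block;

static void init_block(compile_block *cb)
{
  for (int i = 0; i < 10; i++) cb->small_ref_offset[i] = NO_OFFSET;
}

/* Emit a numbered back reference and return the number of elements written.
   For groups below 10 only the first pattern offset is remembered, in the
   side table; for groups 10 and above the offset follows in the vector. */
static size_t emit_backref(compile_block *cb, uint32_t *out, uint32_t group,
  size_t offset)
{
  assert(group >= 1 && group <= 0xffffu);
  out[0] = META_BACKREF | group;
  if (group < 10)
    {
    if (cb->small_ref_offset[group] == NO_OFFSET)
      cb->small_ref_offset[group] = offset;
    return 1;
    }
  assert(offset <= UINT32_MAX);      /* sketch assumes a one-element offset */
  out[1] = (uint32_t)offset;
  return 2;
}
```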
|
||||
|
||||
META_RECURSE is always followed by an offset, for use in error messages.
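
The rule given earlier for storing an offset (one element when PCRE2_SIZE is
no bigger than uint32_t, otherwise two) can be sketched like this. Here size_t
stands in for PCRE2_SIZE, and the high-word-first order, the OFFSET_ELEMENTS
macro and the function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of vector elements needed for one offset on this system. */
#define OFFSET_ELEMENTS ((size_t)(sizeof(size_t) > sizeof(uint32_t) ? 2 : 1))

/* Store an offset and return the number of elements used. */
static size_t put_offset(uint32_t *out, size_t offset)
{
  assert(out != NULL);
  if (OFFSET_ELEMENTS == 1)
    {
    out[0] = (uint32_t)offset;
    return 1;
    }
  out[0] = (uint32_t)((uint64_t)offset >> 32);         /* high half */
  out[1] = (uint32_t)((uint64_t)offset & 0xffffffffu); /* low half */
  return 2;
}

/* Read an offset back and return the number of elements consumed. */
static size_t get_offset(const uint32_t *in, size_t *offset)
{
  assert(in != NULL && offset != NULL);
  if (OFFSET_ELEMENTS == 1)
    {
    *offset = in[0];
    return 1;
    }
  *offset = (size_t)(((uint64_t)in[0] << 32) | in[1]);
  return 2;
}
```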
|
||||
|
||||
META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
|
||||
element contains the 16-bit type and data property values, packed together.
|
||||
ESC_g and ESC_k are used only for named references - numerical ones are turned
|
||||
into META_RECURSE or META_BACKREF as appropriate. They are followed by a length
|
||||
and an offset into the pattern to specify the name.
|
||||
|
||||
The following have one data item that follows in the next vector element:
|
||||
|
||||
  META_BIGVALUE   Next is a literal >= META_END
  META_OPTIONS    (?i) and friends (data is new option bits)
  META_POSIX      POSIX class item (data identifies the class)
  META_POSIX_NEG  negative POSIX class item (ditto)
|
||||
|
||||
The following are followed by a length element, then a number of character code
|
||||
values (which should match with the length):
|
||||
|
||||
  META_MARK       (*MARK:xxxx)
  META_PRUNE_ARG  (*PRUNE:xxx)
  META_SKIP_ARG   (*SKIP:xxxx)
  META_THEN_ARG   (*THEN:xxxx)
|
||||
|
||||
The following are followed by a length element, then an offset in the pattern
|
||||
that identifies the name:
|
||||
|
||||
  META_COND_NAME       (?(<name>) or (?('name') or (?(name)
  META_COND_RNAME      (?(R&name)
  META_COND_RNUMBER    (?(Rdigits)
  META_RECURSE_BYNAME  (?&name)
  META_BACKREF_BYNAME  \k'name'
|
||||
|
||||
META_COND_RNUMBER is used for names that start with R and continue with digits,
|
||||
because this is an ambiguous case. It could be a back reference to a group with
|
||||
that name, or it could be a recursion test on a numbered group.
|
||||
|
||||
This one is followed by an offset, for use in error messages, then a number:
|
||||
|
||||
META_COND_NUMBER (?([+-]digits)
|
||||
|
||||
The following are followed just by an offset, for use in error messages:
|
||||
|
||||
  META_COND_ASSERT    (?(?assertion)
  META_COND_DEFINE    (?(DEFINE)
  META_LOOKBEHIND     (?<=
  META_LOOKBEHINDNOT  (?<!
|
||||
|
||||
In fact, META_COND_ASSERT is used for any group starting (?( that does not
|
||||
match any of the other META_COND cases. The check that this group is an
|
||||
assertion (optionally preceded by a callout) happens at compile time.
|
||||
|
||||
The following are followed by two values, the minimum and maximum. Repeat
|
||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
||||
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
||||
|
||||
  META_MINMAX        {n,m}  repeat
  META_MINMAX_PLUS   {n,m}+ repeat
  META_MINMAX_QUERY  {n,m}? repeat
|
||||
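
A minimal sketch of how a {n,m} item and its two following values could be
written to the vector. MAX_REPEAT is 65535 as stated above; the text says only
that UNLIMITED_REPEAT is bigger than MAX_REPEAT, so the value used here, the
META_MINMAX value and the function name emit_minmax() are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REPEAT        65535u             /* limit stated in the text */
#define UNLIMITED_REPEAT  (MAX_REPEAT + 1u)  /* assumed: any value > MAX_REPEAT */
#define META_MINMAX       0x80200000u        /* hypothetical value */

/* Write the meta item followed by the minimum and maximum. Returns the
   number of elements written, or 0 if the repeat values are not acceptable. */
static int emit_minmax(uint32_t *out, uint32_t min, uint32_t max)
{
  assert(out != NULL);
  if (min > MAX_REPEAT) return 0;
  if (max != UNLIMITED_REPEAT && (max > MAX_REPEAT || max < min)) return 0;
  out[0] = META_MINMAX;
  out[1] = min;
  out[2] = max;
  return 3;
}
```

In this sketch a quantifier such as {2,} would be passed as
emit_minmax(out, 2, UNLIMITED_REPEAT).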
|
||||
This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
|
||||
the next two are the major and minor numbers:
|
||||
|
||||
META_COND_VERSION (?(VERSION<op>x.y)
|
||||
|
||||
Callouts are converted into one of two items:
|
||||
|
||||
  META_CALLOUT_NUMBER  (?C with numerical argument
  META_CALLOUT_STRING  (?C with string argument
|
||||
|
||||
In both cases, the next two elements contain the offset and length of the next
|
||||
item in the pattern. Then there is either one callout number, or a length and
|
||||
an offset for the string argument. The length includes both delimiters.
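
For the numerical case, the sequence of elements described above might be laid
out as in this sketch. It assumes that the offset fits in one element; the
META_CALLOUT_NUMBER value and the function name emit_callout_number() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define META_CALLOUT_NUMBER  0x80060000u   /* hypothetical value */

/* Write a numerical callout: the meta item, the offset and length of the
   next pattern item, then the callout number. Returns elements written. */
static size_t emit_callout_number(uint32_t *out, uint32_t next_offset,
  uint32_t next_length, uint32_t number)
{
  assert(out != NULL && number <= 255);
  out[0] = META_CALLOUT_NUMBER;
  out[1] = next_offset;
  out[2] = next_length;
  out[3] = number;
  return 4;
}
```

For the pattern A(?C3)B the call would be emit_callout_number(out, 6, 1, 3),
because the next item, B, is at offset 6 and is one code unit long.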
|
||||
|
||||
|
||||
Traditional matching function
|
||||
|
@ -606,4 +791,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
|
|||
correct length, in order to catch updating errors.
|
||||
|
||||
Philip Hazel
|
||||
June 2016
|
||||
September 2016
|
||||
|
|
|
@ -65,6 +65,7 @@ dist_html_DATA = \
|
|||
doc/html/pcre2_set_character_tables.html \
|
||||
doc/html/pcre2_set_compile_recursion_guard.html \
|
||||
doc/html/pcre2_set_match_limit.html \
|
||||
doc/html/pcre2_set_max_pattern_length.html \
|
||||
doc/html/pcre2_set_offset_limit.html \
|
||||
doc/html/pcre2_set_newline.html \
|
||||
doc/html/pcre2_set_parens_nest_limit.html \
|
||||
|
@ -146,6 +147,7 @@ dist_man_MANS = \
|
|||
doc/pcre2_set_character_tables.3 \
|
||||
doc/pcre2_set_compile_recursion_guard.3 \
|
||||
doc/pcre2_set_match_limit.3 \
|
||||
doc/pcre2_set_max_pattern_length.3 \
|
||||
doc/pcre2_set_offset_limit.3 \
|
||||
doc/pcre2_set_newline.3 \
|
||||
doc/pcre2_set_parens_nest_limit.3 \
|
||||
|
|
2
RunTest
|
@ -502,7 +502,7 @@ for bmode in "$test8" "$test16" "$test32"; do
|
|||
for opt in "" $jitopt; do
|
||||
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $test2stack $bmode $opt $testdata/testinput2 testtry
|
||||
if [ $? = 0 ] ; then
|
||||
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -63,-62,-2,-1,0,100,188,189 >>testtry
|
||||
$sim $valgrind ${opt:+$vjs} ./pcre2test -q $bmode $opt -error -63,-62,-2,-1,0,100,188,189,190 >>testtry
|
||||
checkresult $? 2 "$opt"
|
||||
else
|
||||
echo " "
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "17 June 2016" "PCRE2 10.22"
|
||||
.TH PCRE2API 3 "30 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -693,7 +693,8 @@ functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
|
|||
.sp
|
||||
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
|
||||
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
|
||||
using up too much system stack when being compiled.
|
||||
using up too much system stack when being compiled. The limit applies to
|
||||
parentheses of all kinds, not just capturing parentheses.
|
||||
.sp
|
||||
.nf
|
||||
.B int pcre2_set_compile_recursion_guard(pcre2_compile_context *\fIccontext\fP,
|
||||
|
@ -1093,6 +1094,12 @@ respectively, when \fBpcre2_compile()\fP returns NULL because a compilation
|
|||
error has occurred. The values are not defined when compilation is successful
|
||||
and \fBpcre2_compile()\fP returns a non-NULL value.
|
||||
.P
|
||||
The value returned in \fIerroroffset\fP is an indication of where in the
|
||||
pattern the error occurred. It is not necessarily the furthest point in the
|
||||
pattern that was read. For example, after the error "lookbehind assertion is
|
||||
not fixed length", the error offset points to the start of the failing
|
||||
assertion.
|
||||
.P
|
||||
The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
|
||||
message"
|
||||
.\" HTML <a href="#geterrormessage">
|
||||
|
@ -1184,8 +1191,8 @@ recognized, exactly as in the rest of the pattern.
|
|||
PCRE2_AUTO_CALLOUT
|
||||
.sp
|
||||
If this bit is set, \fBpcre2_compile()\fP automatically inserts callout items,
|
||||
all with number 255, before each pattern item. For discussion of the callout
|
||||
facility, see the
|
||||
all with number 255, before each pattern item, except immediately before or
|
||||
after a callout in the pattern. For discussion of the callout facility, see the
|
||||
.\" HREF
|
||||
\fBpcre2callout\fP
|
||||
.\"
|
||||
|
@ -3292,6 +3299,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 17 June 2016
|
||||
Last updated: 30 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2CALLOUT 3 "23 March 2015" "PCRE2 10.20"
|
||||
.TH PCRE2CALLOUT 3 "29 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -40,11 +40,20 @@ two callout points:
|
|||
.sp
|
||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||
automatically inserts callouts, all with number 255, before each item in the
|
||||
pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
pattern except for immediately before or after a callout item in the pattern.
|
||||
For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
.sp
|
||||
A(?C3)B
|
||||
.sp
|
||||
it is processed as if it were
|
||||
.sp
|
||||
(?C255)A(?C3)B(?C255)
|
||||
.sp
|
||||
Here is a more complicated example:
|
||||
.sp
|
||||
A(\ed{2}|--)
|
||||
.sp
|
||||
it is processed as if it were
|
||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||
.sp
|
||||
(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||
.sp
|
||||
|
@ -91,10 +100,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
|||
No match
|
||||
.sp
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
and therefore the callouts that would be taken for the backtracks do not occur.
|
||||
You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
||||
\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
|
||||
case, the output changes to this:
|
||||
(because it is being treated as a++) and therefore the callouts that would be
|
||||
taken for the backtracks do not occur. You can disable the auto-possessify
|
||||
feature by passing PCRE2_NO_AUTO_POSSESS to \fBpcre2_compile()\fP, or starting
|
||||
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||
.sp
|
||||
--->aaaa
|
||||
+0 ^ a+
|
||||
|
@ -220,8 +229,8 @@ but the intention is never to remove any of the existing fields.
|
|||
.sp
|
||||
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
|
||||
contains the number of the callout, in the range 0-255. This is the number
|
||||
that follows (?C for manual callouts; it is 255 for automatically generated
|
||||
callouts.
|
||||
that follows (?C for callouts that are part of the pattern; it is 255 for
|
||||
automatically generated callouts.
|
||||
.
|
||||
.
|
||||
.SS "Fields for string callouts"
|
||||
|
@ -286,10 +295,15 @@ The \fIpattern_position\fP field contains the offset in the pattern string to
|
|||
the next item to be matched.
|
||||
.P
|
||||
The \fInext_item_length\fP field contains the length of the next item to be
|
||||
matched in the pattern string. When the callout immediately precedes an
|
||||
alternation bar, a closing parenthesis, or the end of the pattern, the length
|
||||
is zero. When the callout precedes an opening parenthesis, the length is that
|
||||
of the entire subpattern.
|
||||
processed in the pattern string. When the callout is at the end of the pattern,
|
||||
the length is zero. When the callout precedes an opening parenthesis, the
|
||||
length includes meta characters that follow the parenthesis. For example, in a
|
||||
callout before an assertion such as (?=ab) the length is 3. For an
|
||||
alternation bar or a closing parenthesis, the length is one, unless a closing
|
||||
parenthesis is followed by a quantifier, in which case its length is included.
|
||||
(This changed in release 10.23. In earlier releases, before an opening
|
||||
parenthesis the length was that of the entire subpattern, and before an
|
||||
alternation bar or a closing parenthesis the length was zero.)
|
||||
.P
|
||||
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
|
||||
help in distinguishing between different automatic callouts, which all have the
|
||||
|
@ -382,6 +396,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 29 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
|
||||
.TH PCRE2COMPAT 3 "30 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
|
||||
|
@ -96,7 +96,7 @@ processed as anchored at the point where they are tested.
|
|||
one that is backtracked onto acts. For example, in the pattern
|
||||
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
|
||||
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
|
||||
same as PCRE2, but there are examples where it differs.
|
||||
same as PCRE2, but there are cases where it differs.
|
||||
.P
|
||||
11. Most backtracking verbs in assertions have their normal actions. They are
|
||||
not confined to the assertion.
|
||||
|
@ -116,10 +116,11 @@ would not be possible to distinguish which parentheses matched, because both
|
|||
names map to capturing subpattern number 1. To avoid this confusing situation,
|
||||
an error is given at compile time.
|
||||
.P
|
||||
14. Perl recognizes comments in some places that PCRE2 does not, for example,
|
||||
between the ( and ? at the start of a subpattern. If the /x modifier is set,
|
||||
Perl allows white space between ( and ? (though current Perls warn that this is
|
||||
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
|
||||
14. Perl used to recognize comments in some places that PCRE2 does not, for
|
||||
example, between the ( and ? at the start of a subpattern. If the /x modifier
|
||||
is set, Perl allowed white space between ( and ? though the latest Perls give
|
||||
an error (for a while it was just deprecated). There may still be some cases
|
||||
where Perl behaves differently.
|
||||
.P
|
||||
15. Perl, when in warning mode, gives warnings for character classes such as
|
||||
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
|
||||
|
@ -141,33 +142,37 @@ list is with respect to Perl 5.10:
|
|||
each alternative branch of a lookbehind assertion can match a different length
|
||||
of string. Perl requires them all to have the same length.
|
||||
.sp
|
||||
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
(b) From PCRE2 10.23, back references to groups of fixed length are supported
|
||||
in lookbehinds, provided that there is no possibility of referencing a
|
||||
non-unique number or name. Perl does not support backreferences in lookbehinds.
|
||||
.sp
|
||||
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
meta-character matches only at the very end of the string.
|
||||
.sp
|
||||
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
can be made to issue a warning.)
|
||||
.sp
|
||||
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
inverted, that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
.sp
|
||||
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
only at the first matching position in the subject string.
|
||||
.sp
|
||||
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
|
||||
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
|
||||
.sp
|
||||
(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
by the PCRE2_BSR_ANYCRLF option.
|
||||
.sp
|
||||
(h) The callout facility is PCRE2-specific.
|
||||
(i) The callout facility is PCRE2-specific.
|
||||
.sp
|
||||
(i) The partial matching facility is PCRE2-specific.
|
||||
(j) The partial matching facility is PCRE2-specific.
|
||||
.sp
|
||||
(j) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
|
||||
(k) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
|
||||
different way and is not Perl-compatible.
|
||||
.sp
|
||||
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
(l) PCRE2 recognizes some special sequences such as (*CR) at the start of
|
||||
a pattern that set overall options that cannot be changed within the pattern.
|
||||
.
|
||||
.
|
||||
|
@ -185,6 +190,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 15 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 30 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2LIMITS 3 "05 November 2015" "PCRE2 10.21"
|
||||
.TH PCRE2LIMITS 3 "29 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "SIZE AND OTHER LIMITATIONS"
|
||||
|
@ -46,19 +46,19 @@ The maximum length of a lookbehind assertion is 65535 characters.
|
|||
There is no limit to the number of parenthesized subpatterns, but there can be
|
||||
no more than 65535 capturing subpatterns. There is, however, a limit to the
|
||||
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
|
||||
order to limit the amount of system stack used at compile time. The limit can
|
||||
be specified when PCRE2 is built; the default is 250.
|
||||
.P
|
||||
There is a limit to the number of forward references to subsequent subpatterns
|
||||
of around 200,000. Repeated forward references with fixed upper limits, for
|
||||
example, (?2){0,100} when subpattern number 2 is to the right, are included in
|
||||
the count. There is no limit to the number of backward references.
|
||||
order to limit the amount of system stack used at compile time. The default
|
||||
limit can be specified when PCRE2 is built; the default default is 250. An
|
||||
application can change this limit by calling pcre2_set_parens_nest_limit() to
|
||||
set the limit in a compile context.
|
||||
.P
|
||||
The maximum length of name for a named subpattern is 32 code units, and the
|
||||
maximum number of named subpatterns is 10000.
|
||||
.P
|
||||
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
|
||||
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
|
||||
.P
|
||||
The maximum length of a string argument to a callout is the largest number a
|
||||
32-bit unsigned integer can hold.
|
||||
.
|
||||
.
|
||||
.SH AUTHOR
|
||||
|
@ -75,6 +75,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 05 November 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 29 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "20 June 2016" "PCRE2 10.22"
|
||||
.TH PCRE2PATTERN 3 "30 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -508,9 +508,9 @@ by code point, as described in the previous section.
|
|||
.SS "Absolute and relative back references"
|
||||
.rs
|
||||
.sp
|
||||
The sequence \eg followed by an unsigned or a negative number, optionally
|
||||
enclosed in braces, is an absolute or relative back reference. A named back
|
||||
reference can be coded as \eg{name}. Back references are discussed
|
||||
The sequence \eg followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \eg{name}. Back references are discussed
|
||||
.\" HTML <a href="#backreferences">
|
||||
.\" </a>
|
||||
later,
|
||||
|
@ -1325,13 +1325,33 @@ when matching character classes, whatever line-ending sequence is in use, and
|
|||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
.P
|
||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
|
||||
and their upper case partners, just as it does when they appear outside a
|
||||
character class, as described in the section entitled
|
||||
.\" HTML <a href="#genericchartypes">
|
||||
.\" </a>
|
||||
"Generic character types"
|
||||
.\"
|
||||
above. The escape sequence \eb has a different meaning inside a character
|
||||
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
||||
are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
.P
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class, or
|
||||
immediately after a range. For example, [b-d-z] matches letters in the range b
|
||||
to d, a hyphen character, or z.
|
||||
indicating a range, typically as the first or last character in the class,
|
||||
or immediately after a range. For example, [b-d-z] matches letters in the range
|
||||
b to d, a hyphen character, or z.
|
||||
.P
|
||||
Perl treats a hyphen as a literal if it appears before a POSIX class (see
|
||||
below) or a character type escape such as \ed, but gives a warning in its
|
||||
warning mode, as this is most likely a user error. As PCRE2 has no facility for
|
||||
warning, an error is given in these cases.
|
||||
.P
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
|
@ -1341,11 +1361,6 @@ the end of range, so [W-\e]46] is interpreted as a class containing a range
|
|||
followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
.P
|
||||
An error is generated if a POSIX character class (see below) or an escape
|
||||
sequence other than one that defines a single character appears at a point
|
||||
where a range ending character is expected. For example, [z-\exff] is valid,
|
||||
but [A-\ed] and [A-[:digit:]] are not.
|
||||
.P
|
||||
Ranges normally include all code points between the start and end characters,
|
||||
inclusive. They can also be used for code points specified numerically, for
|
||||
example [\e000-\e037]. Ranges can include any characters that are valid for the
|
||||
|
@ -1365,21 +1380,6 @@ matches the letters in either case. For example, [W-c] is equivalent to
|
|||
tables for a French locale are in use, [\exc8-\excb] matches accented E
|
||||
characters in both cases.
|
||||
.P
|
||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
|
||||
and their upper case partners, just as it does when they appear outside a
|
||||
character class, as described in the section entitled
|
||||
.\" HTML <a href="#genericchartypes">
|
||||
.\" </a>
|
||||
"Generic character types"
|
||||
.\"
|
||||
above. The escape sequence \eb has a different meaning inside a character
|
||||
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
||||
are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
.P
|
||||
A circumflex can conveniently be used with the upper case character types to
|
||||
specify a more restricted set of characters than the matching lower case type.
|
||||
For example, the class [^\eW_] matches any letter or digit, but not underscore,
|
||||
|
@ -2096,9 +2096,9 @@ no such problem when named parentheses are used. A back reference to any
|
|||
subpattern is possible using named parentheses (see below).
|
||||
.P
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
backslash is to use the \eg escape sequence. This escape must be followed by an
|
||||
unsigned number or a negative number, optionally enclosed in braces. These
|
||||
examples are all identical:
|
||||
backslash is to use the \eg escape sequence. This escape must be followed by a
|
||||
signed or unsigned number, optionally enclosed in braces. These examples are
|
||||
all identical:
|
||||
.sp
|
||||
(ring), \e1
|
||||
(ring), \eg1
|
||||
|
@ -2106,8 +2106,7 @@ examples are all identical:
|
|||
.sp
|
||||
An unsigned number specifies an absolute reference without the ambiguity that
|
||||
is present in the older syntax. It is also useful when literal digits follow
|
||||
the reference. A negative number is a relative reference. Consider this
|
||||
example:
|
||||
the reference. A signed number is a relative reference. Consider this example:
|
||||
.sp
|
||||
(abc(def)ghi)\eg{-1}
|
||||
.sp
|
||||
|
@ -2117,6 +2116,10 @@ Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
|
|||
can be helpful in long patterns, and also in patterns that are created by
|
||||
joining together fragments that contain references within themselves.
|
||||
.P
|
||||
The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
|
||||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
.P
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
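The arithmetic behind the relative references described in this hunk can be sketched in a few lines of C. This is only an illustration, not code from pcre2_compile.c: the helper name resolve_relative_ref() and its parameters are hypothetical, and it assumes that the caller already knows how many capturing groups open to the left of the reference.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE2. Converts the signed number in
   \g{+n} or \g{-n} into an absolute group number. groups_opened is the
   number of capturing groups whose opening parenthesis lies to the left
   of the reference. Returns 0 when the reference cannot be resolved. */
static unsigned int resolve_relative_ref(unsigned int groups_opened, int offset)
{
  unsigned int back;

  /* The documentation allows at most 65535 capturing subpatterns. */
  assert(groups_opened <= 65535u);

  if (offset == 0) return 0;  /* "a relative value of zero is not allowed" */

  /* \g{+n} counts forwards from the most recently opened group. */
  if (offset > 0) return groups_opened + (unsigned int)offset;

  /* \g{-n} counts backwards; -1 is the most recently opened group. */
  back = 0u - (unsigned int)offset;
  if (back > groups_opened) return 0;  /* points before the first group */
  return groups_opened - back + 1;
}
```

For (abc(def)ghi)\eg{-1}, two groups have been opened when the reference is read, so an offset of -1 resolves to group 2 and an offset of -2 to group 1, which agrees with the text above. Whether the real compiler rejects an out-of-range forward reference at this point or later is not something the sketch tries to model.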
|
@ -2321,23 +2324,34 @@ temporarily move the current position back by the fixed length and then try to
|
|||
match. If there are insufficient characters before the current position, the
|
||||
assertion fails.
|
||||
.P
|
||||
In a UTF mode, PCRE2 does not allow the \eC escape (which matches a single code
|
||||
unit even in a UTF mode) to appear in lookbehind assertions, because it makes
|
||||
it impossible to calculate the length of the lookbehind. The \eX and \eR
|
||||
escapes, which can match different numbers of code units, are also not
|
||||
permitted.
|
||||
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
|
||||
single code unit even in a UTF mode) to appear in lookbehind assertions,
|
||||
because it makes it impossible to calculate the length of the lookbehind. The
|
||||
\eX and \eR escapes, which can match different numbers of code units, are never
|
||||
permitted in lookbehinds.
|
||||
.P
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
"Subroutine"
|
||||
.\"
|
||||
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
|
||||
as the subpattern matches a fixed-length string.
|
||||
as the subpattern matches a fixed-length string. However,
|
||||
.\" HTML <a href="#recursion">
|
||||
.\" </a>
|
||||
Recursion,
|
||||
recursion,
|
||||
.\"
|
||||
however, is not supported.
|
||||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
.P
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
.sp
|
||||
\eb(\ew)\ew++(?<=\e1)
|
||||
.P
|
||||
Possessive quantifiers can be used in conjunction with lookbehind assertions to
|
||||
specify efficient matching of fixed-length strings at the end of subject
|
||||
|
@ -2476,7 +2490,9 @@ This makes the fragment independent of the parentheses in the larger pattern.
|
|||
.sp
|
||||
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
|
||||
subpattern by name. For compatibility with earlier versions of PCRE1, which had
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized.
|
||||
this facility before Perl, the syntax (?(name)...) is also recognized. Note,
|
||||
however, that undelimited names consisting of the letter R followed by digits
|
||||
are ambiguous (see the following section).
|
||||
.P
|
||||
Rewriting the above example to use a named subpattern gives this:
|
||||
.sp
|
||||
|
@ -2490,33 +2506,55 @@ matched.
|
|||
.SS "Checking for pattern recursion"
|
||||
.rs
|
||||
.sp
|
||||
If the condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if a recursive call to the whole pattern or any
|
||||
subpattern has been made. If digits or a name preceded by ampersand follow the
|
||||
letter R, for example:
|
||||
.sp
|
||||
(?(R3)...) or (?(R&name)...)
|
||||
.sp
|
||||
the condition is true if the most recent recursion is into a subpattern whose
|
||||
number or name is given. This condition does not check the entire recursion
|
||||
stack. If the name used in a condition of this kind is a duplicate, the test is
|
||||
applied to all subpatterns of the same name, and is true if any one of them is
|
||||
the most recent recursion.
|
||||
.P
|
||||
At "top level", all these recursion test conditions are false.
|
||||
"Recursion" in this sense refers to any subroutine-like call from one part of
|
||||
the pattern to another, whether or not it is actually recursive. See the
|
||||
sections entitled
|
||||
.\" HTML <a href="#recursion">
|
||||
.\" </a>
|
||||
The syntax for recursive patterns
|
||||
"Recursive patterns"
|
||||
.\"
|
||||
is described below.
|
||||
and
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
"Subpatterns as subroutines"
|
||||
.\"
|
||||
below for details of recursion and subpattern calls.
|
||||
.P
|
||||
If a condition is the string (R), and there is no subpattern with the name R,
|
||||
the condition is true if matching is currently in a recursion or subroutine
|
||||
call to the whole pattern or any subpattern. If digits follow the letter R, and
|
||||
there is no subpattern with that name, the condition is true if the most recent
|
||||
call is into a subpattern with the given number, which must exist somewhere in
|
||||
the overall pattern. This is a contrived example that is equivalent to a+b:
|
||||
.sp
|
||||
((?(R1)a+|(?1)b))
|
||||
.sp
|
||||
However, in both cases, if there is a subpattern with a matching name, the
|
||||
condition tests for its being set, as described in the section above, instead
|
||||
of testing for recursion. For example, creating a group with the name R1 by
|
||||
adding (?<R1>) to the above pattern completely changes its meaning.
|
||||
.P
|
||||
If a name preceded by ampersand follows the letter R, for example:
|
||||
.sp
|
||||
(?(R&name)...)
|
||||
.sp
|
||||
the condition is true if the most recent recursion is into a subpattern of that
|
||||
name (which must exist within the pattern).
|
||||
.P
|
||||
This condition does not check the entire recursion stack. It tests only the
|
||||
current level. If the name used in a condition of this kind is a duplicate, the
|
||||
test is applied to all subpatterns of the same name, and is true if any one of
|
||||
them is the most recent recursion.
|
||||
.P
|
||||
At "top level", all these recursion test conditions are false.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="subdefine"></a>
|
||||
.SS "Defining subpatterns for use by reference only"
|
||||
.rs
|
||||
.sp
|
||||
If the condition is the string (DEFINE), and there is no subpattern with the
|
||||
name DEFINE, the condition is always false. In this case, there may be only one
|
||||
If the condition is the string (DEFINE), the condition is always false, even if
|
||||
there is a group with the name DEFINE. In this case, there may be only one
|
||||
alternative in the subpattern. It is always skipped if control reaches this
|
||||
point in the pattern; the idea of DEFINE is that it can be used to define
|
||||
subroutines that can be referenced from elsewhere. (The use of
|
||||
|
@ -2994,12 +3032,20 @@ depending on whether or not a name is present.
|
|||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
any way, and it is not possible to include a closing parenthesis in the name.
|
||||
However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
|
||||
is applied to verb names and only an unescaped closing parenthesis terminates
|
||||
the name. A closing parenthesis can be included in a name either as \e) or
|
||||
between \eQ and \eE. If the PCRE2_EXTENDED option is set, unescaped whitespace
|
||||
in verb names is skipped and #-comments are recognized, exactly as in the rest
|
||||
of the pattern.
|
||||
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
|
||||
is no longer Perl-compatible.
|
||||
.P
|
||||
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
|
||||
and only an unescaped closing parenthesis terminates the name. However, the
|
||||
only backslash items that are permitted are \eQ, \eE, and sequences such as
|
||||
\ex{100} that define character code points. Character type escapes such as \ed
|
||||
are faulted.
|
||||
.P
|
||||
A closing parenthesis can be included in a name either as \e) or between \eQ
|
||||
and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
|
||||
also set, unescaped whitespace in verb names is skipped, and #-comments are
|
||||
recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
|
||||
affect verb names unless PCRE2_ALT_VERBNAMES is also set.
|
||||
.P
|
||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||
|
@ -3429,6 +3475,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 June 2016
|
||||
Last updated: 30 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "16 October 2015" "PCRE2 10.21"
|
||||
.TH PCRE2SYNTAX 3 "28 September 2016" "PCRE2 10.23"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -473,6 +473,9 @@ Each top-level branch of a look behind must be of a fixed length.
|
|||
\en reference by number (can be ambiguous)
|
||||
\egn reference by number
|
||||
\eg{n} reference by number
|
||||
\eg+n relative reference by number (PCRE2 extension)
|
||||
\eg-n relative reference by number
|
||||
\eg{+n} relative reference by number (PCRE2 extension)
|
||||
\eg{-n} relative reference by number
|
||||
\ek<name> reference by name (Perl)
|
||||
\ek'name' reference by name (Perl)
|
||||
|
@ -511,13 +514,17 @@ Each top-level branch of a look behind must be of a fixed length.
|
|||
(?(-n) relative reference condition
|
||||
(?(<name>) named reference condition (Perl)
|
||||
(?('name') named reference condition (Perl)
|
||||
(?(name) named reference condition (PCRE2)
|
||||
(?(name) named reference condition (PCRE2, deprecated)
|
||||
(?(R) overall recursion condition
|
||||
(?(Rn) specific group recursion condition
|
||||
(?(R&name) specific recursion condition
|
||||
(?(Rn) specific numbered group recursion condition
|
||||
(?(R&name) specific named group recursion condition
|
||||
(?(DEFINE) define subpattern for reference
|
||||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
.sp
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
.
|
||||
.
|
||||
.SH "BACKTRACKING CONTROL"
|
||||
|
@ -577,6 +584,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 16 October 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
Last updated: 28 September 2016
|
||||
Copyright (c) 1997-2016 University of Cambridge.
|
||||
.fi
|
||||
|
|
24
perltest.sh
|
@ -1,14 +1,17 @@
|
|||
#! /bin/sh
|
||||
|
||||
# Script for testing regular expressions with perl to check that PCRE2 handles
|
||||
# them the same. The Perl code has to have "use utf8" and "require Encode" at
|
||||
# the start when running UTF-8 tests, but *not* for non-utf8 tests. (The
|
||||
# "require" would actually be OK for non-utf8-tests, but is not always
|
||||
# installed, so this way the script will always run for these tests.)
|
||||
# them the same. If the first argument to this script is "-w", Perl is also
|
||||
# called with "-w", which turns on its warning mode.
|
||||
#
|
||||
# The Perl code has to have "use utf8" and "require Encode" at the start when
|
||||
# running UTF-8 tests, but *not* for non-utf8 tests. (The "require" would
|
||||
# actually be OK for non-utf8-tests, but is not always installed, so this way
|
||||
# the script will always run for these tests.)
|
||||
#
|
||||
# The desired effect is achieved by making this a shell script that passes the
|
||||
# Perl script to Perl through a pipe. If the first argument is "-utf8", a
|
||||
# suitable prefix is set up.
|
||||
# Perl script to Perl through a pipe. If the first argument (possibly after
|
||||
# removing "-w") is "-utf8", a suitable prefix is set up.
|
||||
#
|
||||
# The remaining arguments, if any, are passed to Perl. They are an input file
|
||||
# and an output file. If there is one argument, the output is written to
|
||||
|
@ -17,7 +20,14 @@
|
|||
# of the contorted piping input.)
|
||||
|
||||
perl=perl
|
||||
perlarg=''
|
||||
prefix=''
|
||||
|
||||
if [ $# -gt 0 -a "$1" = "-w" ] ; then
|
||||
perlarg="-w"
|
||||
shift
|
||||
fi
|
||||
|
||||
if [ $# -gt 0 -a "$1" = "-utf8" ] ; then
|
||||
prefix="use utf8; require Encode;"
|
||||
shift
|
||||
|
@ -292,6 +302,6 @@ for (;;)
|
|||
# printf $outfile "\n";
|
||||
|
||||
PERLEND
|
||||
) | $perl - $@
|
||||
) | $perl $perlarg - $@
|
||||
|
||||
# End
|
||||
|
|
10919
src/pcre2_compile.c
File diff suppressed because it is too large
|
@ -91,13 +91,13 @@ static const unsigned char compile_error_texts[] =
|
|||
"failed to allocate heap memory\0"
|
||||
"unmatched closing parenthesis\0"
|
||||
"internal error: code overflow\0"
|
||||
"letter or underscore expected after (?< or (?'\0"
|
||||
"missing closing parenthesis for condition\0"
|
||||
/* 25 */
|
||||
"lookbehind assertion is not fixed length\0"
|
||||
"malformed number or name after (?(\0"
|
||||
"a relative value of zero is not allowed\0"
|
||||
"conditional group contains more than two branches\0"
|
||||
"assertion expected after (?( or (?(?C)\0"
|
||||
"(?R or (?[+-]digits must be followed by )\0"
|
||||
"digit expected after (?+ or (?-\0"
|
||||
/* 30 */
|
||||
"unknown POSIX class name\0"
|
||||
"internal error in pcre2_study(): should not occur\0"
|
||||
|
@ -105,7 +105,7 @@ static const unsigned char compile_error_texts[] =
|
|||
"parentheses are too deeply nested (stack check)\0"
|
||||
"character code point value in \\x{} or \\o{} is too large\0"
|
||||
/* 35 */
|
||||
"invalid condition (?(0)\0"
|
||||
"lookbehind is too complicated\0"
|
||||
"\\C is not allowed in a lookbehind assertion in UTF-" XSTRING(PCRE2_CODE_UNIT_WIDTH) " mode\0"
|
||||
"PCRE does not support \\L, \\l, \\N{name}, \\U, or \\u\0"
|
||||
"number after (?C is greater than 255\0"
|
||||
|
@ -132,13 +132,13 @@ static const unsigned char compile_error_texts[] =
|
|||
"missing opening brace after \\o\0"
|
||||
"internal error: unknown newline setting\0"
|
||||
"\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
|
||||
"a numbered reference must not be zero\0"
|
||||
"(?R (recursive pattern call) must be followed by a closing parenthesis\0"
|
||||
"an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)\0"
|
||||
/* 60 */
|
||||
"(*VERB) not recognized or malformed\0"
|
||||
"number is too big\0"
|
||||
"group number is too big\0"
|
||||
"subpattern name expected\0"
|
||||
"digit expected after (?+\0"
|
||||
"SPARE ERROR\0"
|
||||
"non-octal character in \\o{} (closing brace missing?)\0"
|
||||
/* 65 */
|
||||
"different names for subpatterns of the same number are not allowed\0"
|
||||
|
@ -151,9 +151,9 @@ static const unsigned char compile_error_texts[] =
|
|||
#endif
|
||||
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
|
||||
/* 70 */
|
||||
"internal error: unknown opcode in find_fixedlength()\0"
|
||||
"internal error: unknown meta code in check_lookbehinds()\0"
|
||||
"\\N is not supported in a class\0"
|
||||
"SPARE ERROR\0"
|
||||
"callout string is too long\0"
|
||||
"disallowed Unicode code point (>= 0xd800 && <= 0xdfff)\0"
|
||||
"using UTF is disabled by the application\0"
|
||||
/* 75 */
|
||||
|
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
|
|||
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
||||
"character code point value in \\u.... sequence is too large\0"
|
||||
"digits missing in \\x{} or \\o{}\0"
|
||||
"syntax error in (?(VERSION condition\0"
|
||||
"syntax error or number too big in (?(VERSION condition\0"
|
||||
/* 80 */
|
||||
"internal error: unknown opcode in auto_possessify()\0"
|
||||
"missing terminating delimiter for callout with string argument\0"
|
||||
|
@ -173,6 +173,8 @@ static const unsigned char compile_error_texts[] =
|
|||
"regular expression is too complicated\0"
|
||||
"lookbehind assertion is too long\0"
|
||||
"pattern string is longer than the limit set by the application\0"
|
||||
"internal error: unknown code in parsed pattern\0"
|
||||
/* 90 */
|
||||
;
|
||||
|
||||
/* Match-time and UTF error texts are in the same format. */
|
||||
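The error texts in these hunks are stored as one long array in which every message ends with \0. A lookup over a table of that shape can be sketched as follows. The table demo_error_texts and the function nth_error_text() are hypothetical stand-ins written for this illustration; the sketch uses a plain zero-based position and does not reproduce the library's own error numbering or its lookup code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature table in the same layout as compile_error_texts:
   each message ends with \0, and the implicit terminator of the whole
   literal leaves an empty string that marks the end of the table. */
static const char demo_error_texts[] =
  "no error\0"
  "lookbehind assertion is not fixed length\0"
  "a relative value of zero is not allowed\0"
  ;

/* Return the message at zero-based position n, or NULL if n is past the
   end of the table. */
static const char *nth_error_text(const char *table, unsigned int n)
{
  const char *p = table;

  assert(table != NULL);

  for (; n > 0; n--)
    {
    if (*p == '\0') return NULL;   /* reached the end marker early */
    while (*p != '\0') p++;        /* skip the current message */
    p++;                           /* step over its terminator */
    }
  return (*p == '\0')? NULL : p;
}
```

Because messages are found by counting terminators, replacing one text with another (as several of these hunks do) does not shift the positions of the messages that follow it, whereas inserting or deleting a text would.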
|
|
|
@ -1298,23 +1298,16 @@ mode rather than an escape sequence. It is also used for [^] in JavaScript
|
|||
compatibility mode, and for \C in non-utf mode. In non-DOTALL mode, "." behaves
|
||||
like \N.
|
||||
|
||||
The special values ESC_DU, ESC_du, etc. are used instead of ESC_D, ESC_d, etc.
|
||||
when PCRE2_UCP is set and replacement of \d etc by \p sequences is required.
|
||||
They must be contiguous, and remain in order so that the replacements can be
|
||||
looked up from a table.
|
||||
|
||||
Negative numbers are used to encode a backreference (\1, \2, \3, etc.) in
|
||||
check_escape(). There are two tests in the code for an escape
|
||||
greater than ESC_b and less than ESC_Z to detect the types that may be
|
||||
repeated. These are the types that consume characters. If any new escapes are
|
||||
put in between that don't consume a character, that code will have to change.
|
||||
*/
|
||||
check_escape(). There are tests in the code for an escape greater than ESC_b
|
||||
and less than ESC_Z to detect the types that may be repeated. These are the
|
||||
types that consume characters. If any new escapes are put in between that don't
|
||||
consume a character, that code will have to change. */
|
||||
|
||||
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
|
||||
ESC_W, ESC_w, ESC_N, ESC_dum, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
|
||||
ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z,
|
||||
ESC_E, ESC_Q, ESC_g, ESC_k,
|
||||
ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu };
|
||||
ESC_E, ESC_Q, ESC_g, ESC_k };
|
||||
|
||||
|
||||
/********************** Opcode definitions ******************/
|
||||
|
@ -1380,7 +1373,8 @@ enum {
|
|||
OP_CIRC, /* 27 Start of line - not multiline */
|
||||
OP_CIRCM, /* 28 Start of line - multiline */
|
||||
|
||||
/* Single characters; caseful must precede the caseless ones */
|
||||
/* Single characters; caseful must precede the caseless ones, and these
|
||||
must remain in this order, and adjacent. */
|
||||
|
||||
OP_CHAR, /* 29 Match one character, casefully */
|
||||
OP_CHARI, /* 30 Match one character, caselessly */
|
||||
|
|
|
@ -648,18 +648,24 @@ typedef struct pcre2_real_match_data {
|
|||
|
||||
#ifndef PCRE2_PCRE2TEST
|
||||
|
||||
/* Structure for checking for mutual recursion when scanning compiled code. */
|
||||
/* Structures for checking for mutual recursion when scanning compiled or
|
||||
parsed code. */
|
||||
|
||||
typedef struct recurse_check {
|
||||
struct recurse_check *prev;
|
||||
PCRE2_SPTR group;
|
||||
} recurse_check;
|
||||
|
||||
typedef struct parsed_recurse_check {
|
||||
struct parsed_recurse_check *prev;
|
||||
uint32_t *groupptr;
|
||||
} parsed_recurse_check;
|
||||
|
||||
/* Structure for building a cache when filling in recursion offsets. */
|
||||
|
||||
typedef struct recurse_cache {
|
||||
PCRE2_SPTR group;
|
||||
int recno;
|
||||
int groupnumber;
|
||||
} recurse_cache;
|
||||
|
||||
/* Structure for maintaining a chain of pointers to the currently incomplete
|
||||
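The recurse_check and parsed_recurse_check structures above are both backward-linked chains. One plausible way to use such a chain, shown here only as a sketch, is to walk the prev pointers and see whether a group is already being scanned. The type demo_recurse_check and the function demo_group_is_active() are hypothetical, and a plain pointer stands in for PCRE2_SPTR; the actual checks in pcre2_compile.c are not visible in this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for recurse_check: the real structure stores a
   PCRE2_SPTR, which is replaced by a plain pointer here. */
typedef struct demo_recurse_check {
  struct demo_recurse_check *prev;   /* enclosing call, or NULL at the top */
  const void *group;                 /* group being scanned at this level */
} demo_recurse_check;

/* Return 1 if group is already on the chain of active calls, else 0. */
static int demo_group_is_active(const demo_recurse_check *chain,
  const void *group)
{
  assert(group != NULL);
  for (; chain != NULL; chain = chain->prev)
    if (chain->group == group) return 1;
  return 0;
}
```

Each level of a recursive scan would typically declare one such item on its own stack frame and link it to the caller's item, so no heap allocation is needed and the chain disappears as the scan unwinds.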
|
@ -693,9 +699,10 @@ typedef struct compile_block {
|
|||
PCRE2_SPTR start_code; /* The start of the compiled code */
|
||||
PCRE2_SPTR start_pattern; /* The start of the pattern */
|
||||
PCRE2_SPTR end_pattern; /* The end of the pattern */
|
||||
PCRE2_SPTR nestptr[2]; /* Pointer(s) saved for string substitution */
|
||||
PCRE2_UCHAR *name_table; /* The name/number table */
|
||||
size_t workspace_size; /* Size of workspace */
|
||||
PCRE2_SIZE workspace_size; /* Size of workspace */
|
||||
PCRE2_SIZE small_ref_offset[10]; /* Offsets for \1 to \9 */
|
||||
PCRE2_SIZE erroroffset; /* Offset of error in pattern */
|
||||
uint16_t names_found; /* Number of entries so far */
|
||||
uint16_t name_entry_size; /* Size of each entry */
|
||||
open_capitem *open_caps; /* Chain of open capture items */
|
||||
|
@ -703,8 +710,9 @@ typedef struct compile_block {
|
|||
uint32_t named_group_list_size; /* Number of entries in the list */
|
||||
uint32_t external_options; /* External (initial) options */
|
||||
uint32_t external_flags; /* External flag bits to be set */
|
||||
uint32_t bracount; /* Count of capturing parens as we compile */
|
||||
uint32_t final_bracount; /* Saved value after first pass */
|
||||
uint32_t bracount; /* Count of capturing parentheses */
|
||||
uint32_t lastcapture; /* Last capture encountered */
|
||||
uint32_t *parsed_pattern; /* Parsed pattern buffer */
|
||||
uint32_t *groupinfo; /* Group info vector */
|
||||
uint32_t top_backref; /* Maximum back reference */
|
||||
uint32_t backref_map; /* Bitmap of low back refs */
|
||||
|
@ -718,9 +726,7 @@ typedef struct compile_block {
|
|||
BOOL had_accept; /* (*ACCEPT) encountered */
|
||||
BOOL had_pruneorskip; /* (*PRUNE) or (*SKIP) encountered */
|
||||
BOOL had_recurse; /* Had a recursion or subroutine call */
|
||||
BOOL check_lookbehind; /* Lookbehinds need later checking */
|
||||
BOOL dupnames; /* Duplicate names exist */
|
||||
BOOL iscondassert; /* Next assert is a condition */
|
||||
} compile_block;
|
||||
|
||||
/* Structure for keeping the properties of the in-memory stack used
|
||||
|
|
|
@ -114,7 +114,7 @@ for (; ptr < ptrend; ptr++)
|
|||
else if (*ptr == CHAR_BACKSLASH)
|
||||
{
|
||||
int erc;
|
||||
int errorcode = 0;
|
||||
int errorcode;
|
||||
uint32_t ch;
|
||||
|
||||
if (ptr < ptrend - 1) switch (ptr[1])
|
||||
|
@ -127,8 +127,10 @@ for (; ptr < ptrend; ptr++)
|
|||
continue;
|
||||
}
|
||||
|
||||
ptr += 1; /* Must point after \ */
|
||||
erc = PRIV(check_escape)(&ptr, ptrend, &ch, &errorcode,
|
||||
code->overall_options, FALSE, NULL);
|
||||
ptr -= 1; /* Back to last code unit of escape */
|
||||
if (errorcode != 0)
|
||||
{
|
||||
rc = errorcode;
|
||||
|
@ -698,7 +700,7 @@ do
|
|||
else if ((suboptions & PCRE2_SUBSTITUTE_EXTENDED) != 0 &&
|
||||
*ptr == CHAR_BACKSLASH)
|
||||
{
|
||||
int errorcode = 0;
|
||||
int errorcode;
|
||||
|
||||
if (ptr < repend - 1) switch (ptr[1])
|
||||
{
|
||||
|
@ -728,10 +730,10 @@ do
|
|||
break;
|
||||
}
|
||||
|
||||
ptr++; /* Point after \ */
|
||||
rc = PRIV(check_escape)(&ptr, repend, &ch, &errorcode,
|
||||
code->overall_options, FALSE, NULL);
|
||||
if (errorcode != 0) goto BADESCAPE;
|
||||
ptr++;
|
||||
|
||||
switch(rc)
|
||||
{
|
||||
|
|
158
src/pcre2test.c
|
@ -4494,6 +4494,7 @@ unsigned int delimiter = *p++;
|
|||
int errorcode;
|
||||
void *use_pat_context;
|
||||
PCRE2_SIZE patlen;
|
||||
PCRE2_SIZE valgrind_access_length;
|
||||
PCRE2_SIZE erroroffset;
|
||||
|
||||
/* Initialize the context and pattern/data controls for this test from the
|
||||
|
@ -4949,11 +4950,43 @@ switch(errorcode)
|
|||
break;
|
||||
}
|
||||
|
||||
/* The pattern is now in pbuffer[8|16|32], with the length in patlen. By
|
||||
default, however, we pass a zero-terminated pattern. The length is passed only
|
||||
if we had a hex pattern. */
|
||||
/* The pattern is now in pbuffer[8|16|32], with the length in code units in
|
||||
patlen. By default, however, we pass a zero-terminated pattern. The length is
|
||||
passed only if we had a hex pattern. When valgrind is supported, arrange for
|
||||
the unused part of the buffer to be marked as no access. */
|
||||
|
||||
if ((pat_patctl.control & CTL_HEXPAT) == 0) patlen = PCRE2_ZERO_TERMINATED;
|
||||
valgrind_access_length = patlen;
|
||||
if ((pat_patctl.control & CTL_HEXPAT) == 0)
|
||||
{
|
||||
patlen = PCRE2_ZERO_TERMINATED;
|
||||
valgrind_access_length += 1; /* For the terminating zero */
|
||||
}
|
||||
|
||||
#ifdef SUPPORT_VALGRIND
|
||||
#ifdef SUPPORT_PCRE2_8
|
||||
if (test_mode == PCRE8_MODE && pbuffer8 != NULL)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_NOACCESS(pbuffer8 + valgrind_access_length,
|
||||
pbuffer8_size - valgrind_access_length);
|
||||
}
|
||||
#endif
|
||||
#ifdef SUPPORT_PCRE2_16
|
||||
if (test_mode == PCRE16_MODE && pbuffer16 != NULL)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_NOACCESS(pbuffer16 + valgrind_access_length,
|
||||
pbuffer16_size - valgrind_access_length*sizeof(uint16_t));
|
||||
}
|
||||
#endif
|
||||
#ifdef SUPPORT_PCRE2_32
|
||||
if (test_mode == PCRE32_MODE && pbuffer32 != NULL)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_NOACCESS(pbuffer32 + valgrind_access_length,
|
||||
pbuffer32_size - valgrind_access_length*sizeof(uint32_t));
|
||||
}
|
||||
#endif
|
||||
#else /* Valgrind not supported */
|
||||
(void)valgrind_access_length; /* Avoid compiler warning */
|
||||
#endif
|
||||
|
||||
/* If #newline_default has been used and the library was not compiled with an
|
||||
appropriate default newline setting, local_newline_default will be non-zero. We
|
||||
|
@ -4996,6 +5029,65 @@ if (timeit > 0)
|
|||
PCRE2_COMPILE(compiled_code, pbuffer, patlen, pat_patctl.options|forbid_utf,
|
||||
&errorcode, &erroroffset, use_pat_context);
|
||||
|
||||
/* Call the JIT compiler if requested. When timing, we must free and recompile
|
||||
the pattern each time because that is the only way to free the JIT compiled
|
||||
code. We know that compilation will always succeed. */
|
||||
|
||||
if (TEST(compiled_code, !=, NULL) && pat_patctl.jit != 0)
|
||||
{
|
||||
if (timeit > 0)
|
||||
{
|
||||
register int i;
|
||||
clock_t time_taken = 0;
|
||||
for (i = 0; i < timeit; i++)
|
||||
{
|
||||
clock_t start_time;
|
||||
SUB1(pcre2_code_free, compiled_code);
|
||||
PCRE2_COMPILE(compiled_code, pbuffer, patlen,
|
||||
pat_patctl.options|forbid_utf, &errorcode, &erroroffset,
|
||||
use_pat_context);
|
||||
start_time = clock();
|
||||
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
|
||||
time_taken += clock() - start_time;
|
||||
}
|
||||
total_jit_compile_time += time_taken;
|
||||
fprintf(outfile, "JIT compile %.4f milliseconds\n",
|
||||
(((double)time_taken * 1000.0) / (double)timeit) /
|
||||
(double)CLOCKS_PER_SEC);
|
||||
}
|
||||
else
|
||||
{
|
||||
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
|
||||
}
|
||||
}
|
||||
|
||||
/* If valgrind is supported, mark the pbuffer as accessible again. The 16-bit
|
||||
and 32-bit buffers can be marked completely undefined, but we must leave the
|
||||
pattern in the 8-bit buffer defined because it may be read from a callout
|
||||
during matching. */
|
||||
|
||||
#ifdef SUPPORT_VALGRIND
|
||||
#ifdef SUPPORT_PCRE2_8
|
||||
if (test_mode == PCRE8_MODE)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer8 + valgrind_access_length,
|
||||
pbuffer8_size - valgrind_access_length);
|
||||
}
|
||||
#endif
|
||||
#ifdef SUPPORT_PCRE2_16
|
||||
if (test_mode == PCRE16_MODE)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer16, pbuffer16_size);
|
||||
}
|
||||
#endif
|
||||
#ifdef SUPPORT_PCRE2_32
|
||||
if (test_mode == PCRE32_MODE)
|
||||
{
|
||||
VALGRIND_MAKE_MEM_UNDEFINED(pbuffer32, pbuffer32_size);
|
||||
}
|
||||
#endif
|
||||
#endif
|
||||
|
||||
/* Compilation failed; go back for another re, skipping to blank line
|
||||
if non-interactive. */
|
||||
|
||||
|
@ -5029,38 +5121,6 @@ if (forbid_utf != 0)
|
|||
if (pattern_info(PCRE2_INFO_MAXLOOKBEHIND, &maxlookbehind, FALSE) != 0)
|
||||
return PR_ABEND;
|
||||
|
||||
/* Call the JIT compiler if requested. When timing, we must free and recompile
|
||||
the pattern each time because that is the only way to free the JIT compiled
|
||||
code. We know that compilation will always succeed. */
|
||||
|
||||
if (pat_patctl.jit != 0)
|
||||
{
|
||||
if (timeit > 0)
|
||||
{
|
||||
register int i;
|
||||
clock_t time_taken = 0;
|
||||
for (i = 0; i < timeit; i++)
|
||||
{
|
||||
clock_t start_time;
|
||||
SUB1(pcre2_code_free, compiled_code);
|
||||
PCRE2_COMPILE(compiled_code, pbuffer, patlen,
|
||||
pat_patctl.options|forbid_utf, &errorcode, &erroroffset,
|
||||
use_pat_context);
|
||||
start_time = clock();
|
||||
PCRE2_JIT_COMPILE(jitrc,compiled_code, pat_patctl.jit);
|
||||
time_taken += clock() - start_time;
|
||||
}
|
||||
total_jit_compile_time += time_taken;
|
||||
fprintf(outfile, "JIT compile %.4f milliseconds\n",
|
||||
(((double)time_taken * 1000.0) / (double)timeit) /
|
||||
(double)CLOCKS_PER_SEC);
|
||||
}
|
||||
else
|
||||
{
|
||||
PCRE2_JIT_COMPILE(jitrc, compiled_code, pat_patctl.jit);
|
||||
}
|
||||
}
|
||||
|
||||
/* If an explicit newline modifier was given, set the information flag in the
|
||||
pattern so that it is preserved over push/pop. */
|
||||
|
||||
|
@ -5300,8 +5360,8 @@ if (post_start > 0)
|
|||
for (i = 0; i < subject_length - pre_start - post_start + 4; i++)
|
||||
fprintf(outfile, " ");
|
||||
|
||||
fprintf(outfile, "%.*s",
|
||||
(int)((cb->next_item_length == 0)? 1 : cb->next_item_length),
|
||||
if (cb->next_item_length != 0)
|
||||
fprintf(outfile, "%.*s", (int)(cb->next_item_length),
|
||||
pbuffer8 + cb->pattern_position);
|
||||
|
||||
fprintf(outfile, "\n");
|
||||
|
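Reviewer note on the hunk above: the callout display now prints nothing for a zero next_item_length instead of forcing one character. The following is a minimal sketch of that rule, not the pcre2test code itself. The helper name format_next_item and its parameters are hypothetical; only the "%.*s" formatting and the zero-length check are taken from the diff.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not in pcre2test): copy the text of the "next pattern
   item" into out. Mirrors the new behaviour in the hunk: when
   next_item_length is zero, nothing is printed at all. Returns the number of
   characters the item text needs, as reported by snprintf. */
static int format_next_item(char *out, size_t outsize, const char *pattern,
  size_t position, size_t next_item_length)
{
if (outsize == 0) return 0;
out[0] = '\0';
if (next_item_length == 0) return 0;   /* callout at the end of the pattern */
return snprintf(out, outsize, "%.*s", (int)next_item_length,
  pattern + position);
}
```

With the pattern (ab|cd){3,4}, position 0 and length 1 give "(", which matches the updated auto_callout expectations, where the callout before a group now shows only the opening parenthesis item.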
@ -6405,7 +6465,7 @@ else for (gmatched = 0;; gmatched++)
|
|||
}
|
||||
|
||||
/* Otherwise just run a single match, setting up a callout if required (the
|
||||
default). */
|
||||
default). There is a copy of the pattern in pbuffer8 for use by callouts. */
|
||||
|
||||
else
|
||||
{
|
||||
|
@ -7583,6 +7643,10 @@ if (argc > 1 && strcmp(argv[op], "-") != 0)
|
|||
}
|
||||
}
|
||||
|
||||
#if defined(SUPPORT_LIBREADLINE) || defined(SUPPORT_LIBEDIT)
|
||||
if (INTERACTIVE(infile)) using_history();
|
||||
#endif
|
||||
|
||||
if (argc > 2)
|
||||
{
|
||||
outfile = fopen(argv[op+1], OUTPUT_MODE);
|
||||
|
@ -7621,8 +7685,7 @@ while (notdone)
|
|||
p = buffer;
|
||||
|
||||
/* If we have a pattern set up for testing, or we are skipping after a
|
||||
compile failure, a blank line terminates this test; otherwise process the
|
||||
line as a data line. */
|
||||
compile failure, a blank line terminates this test. */
|
||||
|
||||
if (expectdata || skipping)
|
||||
{
|
||||
|
@ -7645,14 +7708,21 @@ while (notdone)
|
|||
skipping = FALSE;
|
||||
setlocale(LC_CTYPE, "C");
|
||||
}
|
||||
|
||||
/* Otherwise, if we are not skipping, and the line is not a data comment
|
||||
line starting with "\=", process a data line. */
|
||||
|
||||
else if (!skipping && !(p[0] == '\\' && p[1] == '=' && isspace(p[2])))
|
||||
{
|
||||
rc = process_data();
|
||||
}
|
||||
}
|
||||
|
||||
/* We do not have a pattern set up for testing. Lines starting with # are
|
||||
either comments or special commands. Blank lines are ignored. Otherwise, the
|
||||
line must start with a valid delimiter. It is then processed as a pattern
|
||||
line. */
|
||||
line. A copy of the pattern is left in pbuffer8 for use by callouts. Under
|
||||
valgrind, make the unused part of the buffer undefined, to catch overruns. */
|
||||
|
||||
else if (*p == '#')
|
||||
{
|
||||
|
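Reviewer note on the hunk above: a data line starting with "\=" followed by white space is now skipped rather than passed to process_data(). The test can be sketched in isolation as below. The function name is_data_comment is hypothetical, and the cast to unsigned char is an addition made here for portability of isspace(); it is not in the diff.

```c
#include <ctype.h>

/* Hypothetical helper (not in pcre2test): returns 1 if the line is a data
   comment, that is, it starts with "\=" followed by a white space character.
   The && chain stops at the first mismatch, so an empty or one-character
   line never causes a read past its terminating zero. */
static int is_data_comment(const char *p)
{
return p[0] == '\\' && p[1] == '=' && isspace((unsigned char)p[2]) != 0;
}
```

Under this sketch a line such as "\= Expect no match" counts as a comment, while "\=x" and ordinary subject lines do not.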
@ -7713,6 +7783,10 @@ if (showtotaltimes)
|
|||
|
||||
EXIT:
|
||||
|
||||
#if defined(SUPPORT_LIBREADLINE) || defined(SUPPORT_LIBEDIT)
|
||||
if (infile != NULL && INTERACTIVE(infile)) clear_history();
|
||||
#endif
|
||||
|
||||
if (infile != NULL && infile != stdin) fclose(infile);
|
||||
if (outfile != NULL && outfile != stdout) fclose(outfile);
|
||||
|
||||
|
|
|
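Reviewer note on the valgrind changes in src/pcre2test.c: the region marked as no access starts just after the pattern (plus one code unit for the terminating zero when the pattern is not a hex pattern) and runs to the end of the buffer. The arithmetic can be sketched as below. This is an illustration only: the type unused_tail and the function compute_unused_tail are hypothetical, the buffer size is assumed to be in bytes, and the clamp to zero is an extra safety check that the diff does not contain.

```c
#include <stddef.h>

/* Hypothetical description of the part of a pattern buffer that lies beyond
   the pattern itself. */
typedef struct {
  size_t offset_units;   /* index of the first unused code unit */
  size_t length_bytes;   /* number of unused bytes from that point */
} unused_tail;

/* patlen is the pattern length in code units, unit_size is 1, 2 or 4 bytes,
   and buffer_bytes is the total buffer size in bytes (an assumption). A
   non-hex pattern is passed zero-terminated, so one more unit is in use. */
static unused_tail compute_unused_tail(size_t patlen, int is_hex,
  size_t unit_size, size_t buffer_bytes)
{
unused_tail t;
size_t used_units = patlen + (is_hex? 0 : 1);
size_t used_bytes = used_units * unit_size;
t.offset_units = used_units;
t.length_bytes = (used_bytes < buffer_bytes)? buffer_bytes - used_bytes : 0;
return t;
}
```

In a build with valgrind support, offset_units and length_bytes would correspond to the two arguments given to VALGRIND_MAKE_MEM_NOACCESS in the diff.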
@ -5792,4 +5792,18 @@ name)/mark
|
|||
aaaccccaaa
|
||||
bccccb
|
||||
|
||||
# /x does not apply to MARK labels
|
||||
|
||||
/x (*MARK:ab cd # comment
|
||||
ef) x/x,mark
|
||||
axxz
|
||||
|
||||
/(?<=a(B){0}c)X/
|
||||
acX
|
||||
|
||||
/(?<DEFINE>b)(?(DEFINE)(a+))(?&DEFINE)/
|
||||
bbbb
|
||||
\= Expect no match
|
||||
baaab
|
||||
|
||||
# End of testinput1
|
||||
|
|
|
@ -79,7 +79,7 @@
|
|||
/((?2))((?1))/
|
||||
abc
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
|
||||
/(?(R)a*(?1)|((?R))b)/
|
||||
|
|
|
@ -177,7 +177,7 @@
|
|||
/((?2))((?1))/
|
||||
abc
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
|
||||
/(?(R)a*(?1)|((?R))b)/
|
||||
|
|
|
@ -189,9 +189,9 @@
|
|||
the barfoo
|
||||
and cattlefoo
|
||||
|
||||
/(?<=a+)b/
|
||||
/abc(?<=a+)b/
|
||||
|
||||
/(?<=aaa|b{0,3})b/
|
||||
/12345(?<=aaa|b{0,3})b/
|
||||
|
||||
/(?<!(foo)a\1)bar/
|
||||
|
||||
|
@ -4518,6 +4518,18 @@
|
|||
\ B)x/x,alt_verbnames,mark
|
||||
x
|
||||
|
||||
/(*: A \ and #comment
|
||||
\ B)x/alt_verbnames,mark
|
||||
x
|
||||
|
||||
/(*: A \ and #comment
|
||||
\ B)x/x,mark
|
||||
x
|
||||
|
||||
/(*: A \ and #comment
|
||||
\ B)x/mark
|
||||
x
|
||||
|
||||
/(*:A
|
||||
B)x/alt_verbnames,mark
|
||||
x
|
||||
|
@ -4819,4 +4831,61 @@ a)"xI
|
|||
|
||||
/\[AB]{6000000000000000000000}/expand
|
||||
|
||||
# Hex uses pattern length, not zero-terminated. This tests for overrunning
|
||||
# the given length of a pattern.
|
||||
|
||||
/'(*U'/hex
|
||||
|
||||
/'(*'/hex
|
||||
|
||||
/'('/hex
|
||||
|
||||
//hex
|
||||
|
||||
# These tests are here because Perl never allows a back reference in a
|
||||
# lookbehind. PCRE2 supports some limited cases.
|
||||
|
||||
/([ab])...(?<=\1)z/
|
||||
a11az
|
||||
b11bz
|
||||
\= Expect no match
|
||||
b11az
|
||||
|
||||
/(?|([ab]))...(?<=\1)z/
|
||||
|
||||
/([ab])(\1)...(?<=\2)z/
|
||||
aa11az
|
||||
|
||||
/(a\2)(b\1)(?<=\2)/
|
||||
|
||||
/(?<A>[ab])...(?<=\k'A')z/
|
||||
a11az
|
||||
b11bz
|
||||
\= Expect no match
|
||||
b11az
|
||||
|
||||
/(?<A>[ab])...(?<=\k'A')(?<A>)z/dupnames
|
||||
|
||||
# Perl does not support \g+n
|
||||
|
||||
/((\g+1X)?([ab]))+/
|
||||
aaXbbXa
|
||||
|
||||
/ab(?C1)c/auto_callout
|
||||
abc
|
||||
|
||||
/'ab(?C1)c'/hex,auto_callout
|
||||
abc
|
||||
|
||||
# Perl accepts these, but gives a warning. We can't warn, so give an error.
|
||||
|
||||
/[a-[:digit:]]+/
|
||||
a-a9-a
|
||||
|
||||
/[A-[:digit:]]+/
|
||||
A-A9-A
|
||||
|
||||
/[a-\d]+/
|
||||
a-a9-a
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -1724,4 +1724,13 @@
|
|||
\= Expect no match
|
||||
\x{10000}
|
||||
|
||||
# Hex uses pattern length, not zero-terminated. This tests for overrunning
|
||||
# the given length of a pattern.
|
||||
|
||||
/'(*UTF)'/hex
|
||||
|
||||
/a(?<=A\XB)/utf
|
||||
|
||||
/ab(?<=A\RB)/utf
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -4635,7 +4635,7 @@
|
|||
/((?(R)a+|(?1)b))/
|
||||
aaaabcde
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
|
||||
/(?(R)a*(?1)|((?R))b)/
|
||||
|
|
|
@ -161,18 +161,14 @@
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
abcd
|
||||
|
|
|
@ -9257,4 +9257,24 @@ No match
|
|||
1: b
|
||||
2: cccc
|
||||
|
||||
# /x does not apply to MARK labels
|
||||
|
||||
/x (*MARK:ab cd # comment
|
||||
ef) x/x,mark
|
||||
axxz
|
||||
0: xx
|
||||
MK: ab cd # comment\x0aef
|
||||
|
||||
/(?<=a(B){0}c)X/
|
||||
acX
|
||||
0: X
|
||||
|
||||
/(?<DEFINE>b)(?(DEFINE)(a+))(?&DEFINE)/
|
||||
bbbb
|
||||
0: bb
|
||||
1: b
|
||||
\= Expect no match
|
||||
baaab
|
||||
No match
|
||||
|
||||
# End of testinput1
|
||||
|
|
|
@ -557,7 +557,7 @@ Subject length lower bound = 1
|
|||
0: \x{11234}
|
||||
|
||||
/(*UTF-32)\x{11234}/
|
||||
Failed: error 134 at offset 17: character code point value in \x{} or \o{} is too large
|
||||
Failed: error 160 at offset 5: (*VERB) not recognized or malformed
|
||||
abcd\x{11234}pqr
|
||||
|
||||
/(*UTF-32)\x{112}/
|
||||
|
|
|
@ -188,7 +188,7 @@ Failed: error -53: recursion limit exceeded
|
|||
abc
|
||||
Failed: error -52: nested recursion at the same subject position
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
Failed: error -52: nested recursion at the same subject position
|
||||
|
||||
|
|
|
@ -335,7 +335,7 @@ Failed: error -47: match limit exceeded
|
|||
abc
|
||||
Failed: error -46: JIT stack limit reached
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
Failed: error -46: JIT stack limit reached
|
||||
|
||||
|
|
|
@ -139,7 +139,7 @@ No match: POSIX code 17: match failed
|
|||
0+ issippi
|
||||
|
||||
/abc/\
|
||||
Failed: POSIX code 9: bad escape sequence at offset 3
|
||||
Failed: POSIX code 9: bad escape sequence at offset 4
|
||||
|
||||
"(?(?C)"
|
||||
Failed: POSIX code 11: unbalanced () at offset 6
|
||||
|
|
File diff suppressed because it is too large
|
@ -76,7 +76,7 @@
|
|||
------------------------------------------------------------------
|
||||
|
||||
/ab\Cde/never_backslash_c
|
||||
Failed: error 183 at offset 3: using \C is disabled by the application
|
||||
Failed: error 183 at offset 4: using \C is disabled by the application
|
||||
|
||||
/ab\Cde/info
|
||||
Capturing subpattern count = 0
|
||||
|
|
|
@ -17,7 +17,7 @@ Subject length lower bound = 0
|
|||
# 16-bit modes, but not in 32-bit mode.
|
||||
|
||||
/(?<=ab\Cde)X/utf
|
||||
Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-16 mode
|
||||
Failed: error 136 at offset 0: \C is not allowed in a lookbehind assertion in UTF-16 mode
|
||||
ab!deXYZ
|
||||
|
||||
# Autopossessification tests
|
||||
|
|
|
@ -17,7 +17,7 @@ Subject length lower bound = 0
|
|||
# 16-bit modes, but not in 32-bit mode.
|
||||
|
||||
/(?<=ab\Cde)X/utf
|
||||
Failed: error 136 at offset 10: \C is not allowed in a lookbehind assertion in UTF-8 mode
|
||||
Failed: error 136 at offset 0: \C is not allowed in a lookbehind assertion in UTF-8 mode
|
||||
ab!deXYZ
|
||||
|
||||
# Autopossessification tests
|
||||
|
|
|
@ -3,6 +3,6 @@
|
|||
# correct error message.
|
||||
|
||||
/a\Cb/
|
||||
Failed: error 185 at offset 2: using \C is disabled in this PCRE2 library
|
||||
Failed: error 185 at offset 3: using \C is disabled in this PCRE2 library
|
||||
|
||||
# End of testinput23
|
||||
|
|
|
@ -1746,7 +1746,7 @@ No match
|
|||
------------------------------------------------------------------
|
||||
|
||||
/\ud800/utf,alt_bsux,allow_empty_class,match_unset_backref
|
||||
Failed: error 173 at offset 5: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
|
||||
Failed: error 173 at offset 6: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
|
||||
|
||||
/^a+[a\x{200}]/B,utf
|
||||
------------------------------------------------------------------
|
||||
|
@ -3997,7 +3997,7 @@ Failed: error 122 at offset 1227: unmatched closing parenthesis
|
|||
/$(&.+[\p{Me}].\s\xdcC*?(?(<y>))(?<!^)$C((;*?(R))+(?(R)){0,6}?|){12\x8a\X*?\x8a\x0b\xd1^9\3*+(\xc1,\k'P'\xb4)\xcc(z\z(?JJ)(?'X'8};(\x0b\xd1^9\?'3*+(\xc1.]k+\x0b'Pm'\xb4\xcc4'\xd1'(?'X'))?-%--\x95$9*\4'|\xd1(''%\x95*$9)#(?'R')3\x07?('P\xed')\\x16:;()\x1e\x10*:(?<y>)\xd1+!~:(?)''(d'E:yD!\s(?'R'\x1e;\x10:U))|')g!\xb0*){29+))#(?'P'})*?/
|
||||
|
||||
"(*UTF)(*UCP)(.UTF).+X(\V+;\^(\D|)!999}(?(?C{7(?C')\H*\S*/^\x5\xa\\xd3\x85n?(;\D*(?m).[^mH+((*UCP)(*U:F)})(?!^)(?'"
|
||||
Failed: error 124 at offset 113: letter or underscore expected after (?< or (?'
|
||||
Failed: error 162 at offset 113: subpattern name expected
|
||||
|
||||
/[\pS#moq]/
|
||||
=
|
||||
|
@ -4160,4 +4160,15 @@ No match
|
|||
\x{10000}
|
||||
No match
|
||||
|
||||
# Hex uses pattern length, not zero-terminated. This tests for overrunning
|
||||
# the given length of a pattern.
|
||||
|
||||
/'(*UTF)'/hex
|
||||
|
||||
/a(?<=A\XB)/utf
|
||||
Failed: error 125 at offset 1: lookbehind assertion is not fixed length
|
||||
|
||||
/ab(?<=A\RB)/utf
|
||||
Failed: error 125 at offset 2: lookbehind assertion is not fixed length
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -713,7 +713,7 @@ No match
|
|||
/(ab|cd){3,4}/auto_callout
|
||||
ababab
|
||||
--->ababab
|
||||
+0 ^ (ab|cd){3,4}
|
||||
+0 ^ (
|
||||
+1 ^ a
|
||||
+4 ^ c
|
||||
+2 ^^ b
|
||||
|
@ -732,7 +732,7 @@ No match
|
|||
0: ababab
|
||||
abcdabcd
|
||||
--->abcdabcd
|
||||
+0 ^ (ab|cd){3,4}
|
||||
+0 ^ (
|
||||
+1 ^ a
|
||||
+4 ^ c
|
||||
+2 ^^ b
|
||||
|
@ -740,7 +740,7 @@ No match
|
|||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+5 ^ ^ d
|
||||
+6 ^ ^ )
|
||||
+6 ^ ^ ){3,4}
|
||||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+2 ^ ^ b
|
||||
|
@ -749,13 +749,13 @@ No match
|
|||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+5 ^ ^ d
|
||||
+6 ^ ^ )
|
||||
+6 ^ ^ ){3,4}
|
||||
+12 ^ ^
|
||||
0: abcdabcd
|
||||
1: abcdab
|
||||
abcdcdcdcdcd
|
||||
--->abcdcdcdcdcd
|
||||
+0 ^ (ab|cd){3,4}
|
||||
+0 ^ (
|
||||
+1 ^ a
|
||||
+4 ^ c
|
||||
+2 ^^ b
|
||||
|
@ -763,16 +763,16 @@ No match
|
|||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+5 ^ ^ d
|
||||
+6 ^ ^ )
|
||||
+6 ^ ^ ){3,4}
|
||||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+5 ^ ^ d
|
||||
+6 ^ ^ )
|
||||
+6 ^ ^ ){3,4}
|
||||
+12 ^ ^
|
||||
+1 ^ ^ a
|
||||
+4 ^ ^ c
|
||||
+5 ^ ^ d
|
||||
+6 ^ ^ )
|
||||
+6 ^ ^ ){3,4}
|
||||
+12 ^ ^
|
||||
0: abcdcdcd
|
||||
1: abcdcd
|
||||
|
@ -6712,26 +6712,26 @@ No match
|
|||
--->"ab"
|
||||
+0 ^ ^
|
||||
+1 ^ "
|
||||
+2 ^^ ((?(?=[a])[^"])|b)*
|
||||
+2 ^^ (
|
||||
+21 ^^ "
|
||||
+3 ^^ (?(?=[a])[^"])
|
||||
+3 ^^ (?
|
||||
+18 ^^ b
|
||||
+5 ^^ (?=[a])
|
||||
+5 ^^ (?=
|
||||
+8 ^ [a]
|
||||
+11 ^^ )
|
||||
+12 ^^ [^"]
|
||||
+16 ^ ^ )
|
||||
+17 ^ ^ |
|
||||
+21 ^ ^ "
|
||||
+3 ^ ^ (?(?=[a])[^"])
|
||||
+3 ^ ^ (?
|
||||
+18 ^ ^ b
|
||||
+5 ^ ^ (?=[a])
|
||||
+5 ^ ^ (?=
|
||||
+8 ^ [a]
|
||||
+19 ^ ^ )
|
||||
+19 ^ ^ )*
|
||||
+21 ^ ^ "
|
||||
+3 ^ ^ (?(?=[a])[^"])
|
||||
+3 ^ ^ (?
|
||||
+18 ^ ^ b
|
||||
+5 ^ ^ (?=[a])
|
||||
+5 ^ ^ (?=
|
||||
+8 ^ [a]
|
||||
+17 ^ ^ |
|
||||
+22 ^ ^ $
|
||||
|
@ -7154,7 +7154,7 @@ Failed: error -52: nested recursion at the same subject position
|
|||
aaaabcde
|
||||
0: aaaab
|
||||
|
||||
/((?(R2)a+|(?1)b))/
|
||||
/((?(R2)a+|(?1)b))()/
|
||||
aaaabcde
|
||||
Failed: error -40: backreference condition or recursion test is not supported for DFA matching
|
||||
|
||||
|
@ -7548,7 +7548,7 @@ Callout (10): {AB} last capture = 0
|
|||
Bra
|
||||
^
|
||||
Cond
|
||||
Callout 25 9 7
|
||||
Callout 25 9 3
|
||||
Assert
|
||||
abc
|
||||
Ket
|
||||
|
@ -7561,11 +7561,11 @@ Callout (10): {AB} last capture = 0
|
|||
------------------------------------------------------------------
|
||||
abcdefg
|
||||
--->abcdefg
|
||||
25 ^ (?=abc)
|
||||
25 ^ (?=
|
||||
0: abcd
|
||||
xyz123
|
||||
--->xyz123
|
||||
25 ^ (?=abc)
|
||||
25 ^ (?=
|
||||
0: xyz
|
||||
|
||||
/^(?(?C$abc$)(?=abc)abcd|xyz)/B
|
||||
|
@ -7573,7 +7573,7 @@ Callout (10): {AB} last capture = 0
|
|||
Bra
|
||||
^
|
||||
Cond
|
||||
CalloutStr $abc$ 7 12 7
|
||||
CalloutStr $abc$ 7 12 3
|
||||
Assert
|
||||
abc
|
||||
Ket
|
||||
|
@ -7587,12 +7587,12 @@ Callout (10): {AB} last capture = 0
|
|||
abcdefg
|
||||
Callout (7): $abc$
|
||||
--->abcdefg
|
||||
^ (?=abc)
|
||||
^ (?=
|
||||
0: abcd
|
||||
xyz123
|
||||
Callout (7): $abc$
|
||||
--->xyz123
|
||||
^ (?=abc)
|
||||
^ (?=
|
||||
0: xyz
|
||||
|
||||
/^ab(?C'first')cd(?C"second")ef/
|
||||
|
@ -7609,13 +7609,13 @@ Callout (20): "second"
|
|||
aaaXY
|
||||
Callout (8): `code`
|
||||
--->aaaXY
|
||||
^^ )
|
||||
^^ ){3}
|
||||
Callout (8): `code`
|
||||
--->aaaXY
|
||||
^ ^ )
|
||||
^ ^ ){3}
|
||||
Callout (8): `code`
|
||||
--->aaaXY
|
||||
^ ^ )
|
||||
^ ^ ){3}
|
||||
0: aaaX
|
||||
|
||||
# Binary zero in callout string
|
||||
|
|
|
@ -854,23 +854,17 @@ Failed: error 184 at offset 1540: (?| and/or (?J: or (?x: parentheses are too de
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
Failed: error 186 at offset 594: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5813: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
Failed: error 186 at offset 594: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5820: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
Failed: error 186 at offset 594: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 594: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 594: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
@ -1031,6 +1025,5 @@ Subject length lower bound = 0
|
|||
Failed: error 114 at offset 509: missing closing parenthesis
|
||||
|
||||
/([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00]([00](*ACCEPT)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))/-fullbincode
|
||||
Failed: error 186 at offset 490: regular expression is too complicated
|
||||
|
||||
# End of testinput8
|
||||
|
|
|
@ -853,20 +853,15 @@ Memory allocation (code space): 18
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 1147: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 1147: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,20 +853,15 @@ Memory allocation (code space): 18
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 1147: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 1147: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,20 +853,17 @@ Memory allocation (code space): 28
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5813: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5820: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,20 +853,17 @@ Memory allocation (code space): 28
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5813: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5820: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,20 +853,17 @@ Memory allocation (code space): 28
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5813: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5820: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 979: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -854,22 +854,16 @@ Failed: error 184 at offset 1540: (?| and/or (?J: or (?x: parentheses are too de
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
Failed: error 186 at offset 637: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
Failed: error 186 at offset 5820: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
Failed: error 186 at offset 637: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 637: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 637: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,21 +853,15 @@ Memory allocation (code space): 12
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
Failed: error 186 at offset 936: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
Failed: error 186 at offset 936: regular expression is too complicated
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 936: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -853,19 +853,15 @@ Memory allocation (code space): 14
|
|||
|
||||
# Use "expand" to create some very long patterns with nested parentheses, in
|
||||
# order to test workspace overflow. Again, this varies with code unit width,
|
||||
# and even with it fails in two modes, the error offset differs. It also varies
|
||||
# and even when it fails in two modes, the error offset differs. It also varies
|
||||
# with link size - hence multiple tests with different values.
|
||||
|
||||
/(?'ABC'\[[bar](]{105}*THEN:\[A]{255}\[)]{106}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{792}*THEN:\[A]{255}\[)]{793}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{106}*THEN:\[A]{255}\[)]{107}/expand,-fullbincode
|
||||
/(?'ABC'\[[bar](]{793}*THEN:\[A]{255}\[)]{794}/expand,-fullbincode,parens_nest_limit=1000
|
||||
|
||||
/(?'ABC'\[[bar](]{159}*THEN:\[A]{255}\[)]{160}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{199}*THEN:\[A]{255}\[)]{200}/expand,-fullbincode
|
||||
|
||||
/(?'ABC'\[[bar](]{299}*THEN:\[A]{255}\[)]{300}/expand,-fullbincode
|
||||
Failed: error 186 at offset 1224: regular expression is too complicated
|
||||
/(?'ABC'\[[bar](]{1793}*THEN:\[A]{255}\[)]{1794}/expand,-fullbincode,parens_nest_limit=2000
|
||||
Failed: error 186 at offset 12820: regular expression is too complicated
|
||||
|
||||
/(?(1)(?1)){8,}+()/debug
|
||||
------------------------------------------------------------------
|
||||
|
|
|
@ -307,14 +307,14 @@ Subject length lower bound = 1
|
|||
------------------------------------------------------------------
|
||||
|
||||
/\777/I
|
||||
Failed: error 151 at offset 3: octal value is greater than \377 in 8-bit non-UTF-8 mode
|
||||
Failed: error 151 at offset 4: octal value is greater than \377 in 8-bit non-UTF-8 mode
|
||||
|
||||
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF)XX/mark
|
||||
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
|
||||
XX
|
||||
|
||||
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF)XX/mark,alt_verbnames
|
||||
Failed: error 176 at offset 258: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
|
||||
Failed: error 176 at offset 259: name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
|
||||
XX
|
||||
|
||||
/(*:0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE)XX/mark
|
||||
|
@ -328,10 +328,10 @@ MK: 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789AB
|
|||
MK: 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDE
|
||||
|
||||
/\u0100/alt_bsux,allow_empty_class,match_unset_backref,dupnames
|
||||
Failed: error 177 at offset 5: character code point value in \u.... sequence is too large
|
||||
Failed: error 177 at offset 6: character code point value in \u.... sequence is too large
|
||||
|
||||
/[\u0100-\u0200]/alt_bsux,allow_empty_class,match_unset_backref,dupnames
|
||||
Failed: error 177 at offset 6: character code point value in \u.... sequence is too large
|
||||
Failed: error 177 at offset 7: character code point value in \u.... sequence is too large
|
||||
|
||||
/[^\x00-a]{12,}[^b-\xff]*/B
|
||||
------------------------------------------------------------------
|
||||
|
|