Add PCRE2_NO_DOTSTAR_ANCHOR and revise documentation for .* optimizing.
parent 019e115060
commit 5a18651441
|
@@ -58,4 +58,6 @@ matched against "abcd".
|
|||
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
|
||||
occur.
|
||||
|
||||
10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented.
|
||||
|
||||
****
|
||||
|
|
|
@@ -63,6 +63,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
theses (named ones available)
|
||||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
||||
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
|
||||
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
||||
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
||||
(only relevant if PCRE2_UTF is set)
|
||||
|
|
|
@@ -1187,6 +1187,19 @@ use, auto-possessification means that some callouts are never taken. You can
|
|||
set this option if you want the matching functions to do a full unoptimized
|
||||
search and run all the callouts, but it is mainly provided for testing
|
||||
purposes.
|
||||
<pre>
|
||||
PCRE2_NO_DOTSTAR_ANCHOR
|
||||
</pre>
|
||||
If this option is set, it disables an optimization that is applied when .* is
|
||||
the first significant item in a top-level branch of a pattern, and all the
|
||||
other branches also start with .* or with \A or \G or ^. The optimization is
|
||||
automatically disabled for .* if it is inside an atomic group or a capturing
|
||||
group that is the subject of a back reference, or if the pattern contains
|
||||
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
|
||||
automatically anchored if PCRE2_DOTALL is set for all the .* items and
|
||||
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
|
||||
must start either at the start of the subject or following a newline is
|
||||
remembered. Like other optimizations, this can cause callouts to be skipped.
|
||||
<pre>
|
||||
PCRE2_NO_START_OPTIMIZE
|
||||
</pre>
|
||||
|
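The requirement in the hunk above that every top-level branch be anchorable can be illustrated with a minimal sketch. It uses Python's standard `re` module purely as a stand-in for generic regex semantics; this is an assumption made for illustration, since `re` is not PCRE2 and has no counterpart to PCRE2_NO_DOTSTAR_ANCHOR. The helper name `first_match_start` is hypothetical.

```python
import re

def first_match_start(pattern, subject, flags=0):
    """Return the offset where an unanchored search first matches, or None."""
    m = re.search(pattern, subject, flags)
    return None if m is None else m.start()

# The second branch "b" is not anchorable, so a match may begin past
# offset 0 and the pattern as a whole cannot be treated as anchored.
mixed = first_match_start(r".*a|b", "xb", re.DOTALL)

# Both branches are anchorable (.* under DOTALL, and \A), so a match,
# if there is one, can only begin at offset 0.
all_anchorable = first_match_start(r".*a|\Ab", "xb", re.DOTALL)
anchorable_hit = first_match_start(r".*a|\Ab", "xxa", re.DOTALL)

print(mixed, all_anchorable, anchorable_hit)
```

In the first pattern the match starts at offset 1, which is why a single non-anchorable branch is enough to rule out automatic anchoring.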
@@ -1442,16 +1455,25 @@ compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
|||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||
</P>
|
||||
<P>
|
||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
||||
alternatives begin with one of the following:
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||
the first significant item in every top-level branch is one of the following:
|
||||
<pre>
|
||||
^ unless PCRE2_MULTILINE is set
|
||||
\A always
|
||||
\G always
|
||||
.* if PCRE2_DOTALL is set and there are no back references to the subpattern in which .* appears
|
||||
.* sometimes - see below
|
||||
</pre>
|
||||
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
|
||||
PCRE2_INFO_ALLOPTIONS.
|
||||
When .* is the first significant item, anchoring is possible only when all the
|
||||
following are true:
|
||||
<pre>
|
||||
.* is not in an atomic group
|
||||
.* is not in a capturing group that is the subject of a back reference
|
||||
PCRE2_DOTALL is in force for .*
|
||||
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||
</pre>
|
||||
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
|
||||
options returned for PCRE2_INFO_ALLOPTIONS.
|
||||
<pre>
|
||||
PCRE2_INFO_BACKREFMAX
|
||||
</pre>
|
||||
|
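One of the conditions listed in the hunk above is that .* must not be in a capturing group that is the subject of a back reference. The sketch below suggests why, again using Python's `re` module as a stand-in for generic regex behaviour (an assumption; it is not the PCRE2 API). The pattern and subject are illustrative choices, not taken from this commit.

```python
import re

# Even with DOTALL, the back reference \1 constrains what (.*) may capture,
# so the only successful match does not begin at the start of the subject.
pattern = re.compile(r"(.*)abc\1", re.DOTALL)
subject = "xyz123abc123"

anchored = pattern.match(subject)      # tries offset 0 only
unanchored = pattern.search(subject)   # tries every offset in turn

anchored_found = anchored is not None
unanchored_start = unanchored.start()
captured = unanchored.group(1)

print(anchored_found, unanchored_start, captured)
```

An anchored attempt fails here while the unanchored search succeeds at offset 3, so treating such a pattern as anchored would lose a valid match.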
@@ -1480,21 +1502,10 @@ variable.
|
|||
<P>
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
|
||||
if either
|
||||
<br>
|
||||
<br>
|
||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
|
||||
starts with "^", or
|
||||
<br>
|
||||
<br>
|
||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
|
||||
(if it were set, the pattern would be anchored),
|
||||
<br>
|
||||
<br>
|
||||
2 is returned, indicating that the pattern matches only at the start of a
|
||||
subject string or after any newline within the string. Otherwise 0 is
|
||||
returned. For anchored patterns, 0 is returned.
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
patterns, 0 is returned.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTCODEUNIT
|
||||
</pre>
|
||||
|
@@ -2792,9 +2803,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC37" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 December 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@@ -82,6 +82,9 @@ You should be aware that, because of optimizations in the way PCRE2 compiles
|
|||
and matches patterns, callouts sometimes do not happen exactly as you might
|
||||
expect.
|
||||
</P>
|
||||
<br><b>
|
||||
Auto-possessification
|
||||
</b><br>
|
||||
<P>
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||
|
@@ -111,6 +114,56 @@ case, the output changes to this:
|
|||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||
again, repeatedly, until a+ itself fails.
|
||||
</P>
|
||||
<br><b>
|
||||
Automatic .* anchoring
|
||||
</b><br>
|
||||
<P>
|
||||
By default, an optimization is applied when .* is the first significant item in
|
||||
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||
start only after an internal newline or at the beginning of the subject, and
|
||||
<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
|
||||
if .* is in an atomic group or if there is a back reference to the capturing
|
||||
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
||||
or (*SKIP). However, the presence of callouts does not affect it.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
|
||||
applied to the string "aa", the <b>pcre2test</b> output is:
|
||||
<pre>
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
</pre>
|
||||
This shows that all match attempts start at the beginning of the subject. In
|
||||
other words, the pattern is anchored. You can disable this optimization by
|
||||
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
|
||||
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||
<pre>
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
+0 ^ .*
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
</pre>
|
||||
This shows more match attempts, starting at the second subject character.
|
||||
Another optimization, described in the next section, means that there is no
|
||||
subsequent attempt to match with an empty subject.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
||||
all branches are anchorable.
|
||||
</P>
|
||||
<br><b>
|
||||
Other optimizations
|
||||
</b><br>
|
||||
<P>
|
||||
Other optimizations that provide fast "no match" results also affect callouts.
|
||||
For example, if the pattern is
|
||||
|
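The callout traces in the hunk above differ only in how many starting offsets are tried, not in the final result. The following rough sketch models that with Python's `re` module as a stand-in (an assumption; it does not call PCRE2 and does not reproduce pcre2test output). The helpers `attempts` and `matches` are hypothetical, and the naive attempt count includes the end-of-subject offset, which the text says PCRE2 skips because of another optimization.

```python
import re

pattern = re.compile(r".*\d", re.DOTALL)
subjects = ["aa", "a1", "a\n1b", "", "\n\n7"]

def attempts(subject, anchored):
    """Offsets a naive matcher would try: only 0 if anchored, else all."""
    return [0] if anchored else list(range(len(subject) + 1))

def matches(subject, anchored):
    # pattern.match(subject, pos) tries a match starting exactly at pos.
    return any(pattern.match(subject, pos) for pos in attempts(subject, anchored))

# Anchoring never changes the outcome for this pattern under DOTALL ...
same_outcome = all(matches(s, True) == matches(s, False) for s in subjects)

# ... it only reduces the number of starting offsets that are tried.
tried_anchored = len(attempts("aa", True))
tried_unanchored = len(attempts("aa", False))

print(same_outcome, tried_anchored, tried_unanchored)
```

Fewer starting offsets means fewer passes through the pattern, which is the reason some callouts are never reached when the optimization is active.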
@@ -254,9 +307,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 25 November 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@@ -151,6 +151,17 @@ reaching "no match" results. For more details, see the
|
|||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling automatic anchoring
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
|
||||
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
|
||||
apply to patterns whose top-level branches all start with .* (match any number
|
||||
of arbitrary characters). For more details, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match and recursion limits
|
||||
</b><br>
|
||||
<P>
|
||||
|
@@ -1841,7 +1852,8 @@ one succeeds. Consider this pattern:
|
|||
(?>.*?a)b
|
||||
</pre>
|
||||
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
||||
(*PRUNE) and (*SKIP) also disable this optimization.
|
||||
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
|
||||
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
|
||||
</P>
|
||||
<P>
|
||||
When a capturing subpattern is repeated, the value captured is the substring
|
||||
|
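The statement in the hunk above that (?>.*?a)b matches "ab" in the subject "aab" can be checked with a small sketch. It assumes Python's `re` module as a stand-in for PCRE2, and it emulates the atomic group with a lookahead plus back reference, `(?=(.*?a))\1`, so that it does not depend on native `(?>...)` support. The emulation is a substitute technique, not the syntax used in the PCRE2 documentation.

```python
import re

subject = "aab"

# Non-atomic: after "b" fails, .*? is extended, so offset 0 succeeds.
plain = re.search(r".*?ab", subject)

# Emulated atomic group: once the lookahead has succeeded it is not
# backtracked into, so (?=(.*?a))\1 behaves like (?>.*?a).
atomic = re.compile(r"(?=(.*?a))\1b")
at_start = atomic.match(subject)     # tries offset 0 only
anywhere = atomic.search(subject)    # tries every offset in turn

plain_span = plain.span()
atomic_span = anywhere.span()

print(plain_span, at_start, atomic_span)
```

Because the atomic version can match only from offset 1, a pattern that starts with .* inside an atomic group cannot safely be anchored.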
@@ -3236,9 +3248,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 14 November 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@@ -115,14 +115,19 @@ less with a DFA matching function, and in both cases there is not much
|
|||
difference for \b.
|
||||
</P>
|
||||
<P>
|
||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||
this optimization, because the dot metacharacter does not then match a newline,
|
||||
and if the subject string contains newlines, the pattern may match from the
|
||||
character immediately following one of them instead of from the very start. For
|
||||
example, the pattern
|
||||
When a pattern begins with .* not in atomic parentheses, nor in parentheses
|
||||
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
|
||||
the pattern is implicitly anchored by PCRE2, since it can match only at the
|
||||
start of a subject string. If the pattern has multiple top-level branches, they
|
||||
must all be anchorable. The optimization can be disabled by the
|
||||
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
|
||||
contains (*PRUNE) or (*SKIP).
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
|
||||
dot metacharacter does not then match a newline, and if the subject string
|
||||
contains newlines, the pattern may match from the character immediately
|
||||
following one of them instead of from the very start. For example, the pattern
|
||||
<pre>
|
||||
.*second
|
||||
</pre>
|
||||
|
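The effect of PCRE2_DOTALL described in the hunk above can be sketched with the pattern .*second. The example assumes Python's `re` module as a stand-in for PCRE2 (`re.DOTALL` playing the role of PCRE2_DOTALL), and the subject string is an illustrative choice.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match has to begin
# on the second line.
without_dotall = re.search(r".*second", subject)

# With DOTALL the dot crosses the newline, so the match begins at offset 0.
with_dotall = re.search(r".*second", subject, re.DOTALL)

start_without = without_dotall.start()
start_with = with_dotall.start()

# Candidate starts when DOTALL is off: the subject start and the offsets
# just after each newline.
line_starts = [0] + [i + 1 for i, ch in enumerate(subject) if ch == "\n"]

print(start_without, start_with, line_starts)
```

Without DOTALL the match start is one of the line starts rather than offset 0. That is why the pattern cannot be anchored in that case, although the restriction on starting positions can still be remembered.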
@@ -187,9 +192,9 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 20 October 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@@ -416,6 +416,7 @@ appear.
|
|||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
|
||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
|
@@ -553,9 +554,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 November 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@@ -291,7 +291,7 @@ checked for compatibility with the <b>perltest.sh</b> script, which is used to
|
|||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||
lines, none of the other command lines are permitted, because they and many
|
||||
of the modifiers are specific to <b>pcre2test</b>, and should not be used in
|
||||
test files that are also processed by <b>perltest.sh</b>. The \fP#perltest\fB
|
||||
test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
|
||||
command helps detect tests that are accidentally put in the wrong file.
|
||||
<pre>
|
||||
#subject <modifier-list>
|
||||
|
@@ -454,6 +454,7 @@ for a description of their effects.
|
|||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
ucp set PCRE2_UCP
|
||||
|
@@ -596,7 +597,7 @@ setting the size of the JIT stack.
|
|||
</P>
|
||||
<P>
|
||||
If the <b>jitfast</b> modifier is specified, matching is done using the JIT
|
||||
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
|
||||
"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
|
||||
checks that are done by <b>pcre2_match()</b>, and of course does not work when
|
||||
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
|
||||
assumed.
|
||||
|
@@ -1309,9 +1310,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 November 2014
|
||||
Last updated: 02 January 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
107
doc/pcre2.txt
|
@@ -1226,6 +1226,20 @@ COMPILING A PATTERN
|
|||
a full unoptimized search and run all the callouts, but it is mainly
|
||||
provided for testing purposes.
|
||||
|
||||
PCRE2_NO_DOTSTAR_ANCHOR
|
||||
|
||||
If this option is set, it disables an optimization that is applied when
|
||||
.* is the first significant item in a top-level branch of a pattern,
|
||||
and all the other branches also start with .* or with \A or \G or ^.
|
||||
The optimization is automatically disabled for .* if it is inside an
|
||||
atomic group or a capturing group that is the subject of a back refer-
|
||||
ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
|
||||
mization is not disabled, such a pattern is automatically anchored if
|
||||
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
|
||||
for any ^ items. Otherwise, the fact that any match must start either
|
||||
at the start of the subject or following a newline is remembered. Like
|
||||
other optimizations, this can cause callouts to be skipped.
|
||||
|
||||
PCRE2_NO_START_OPTIMIZE
|
||||
|
||||
This is an option whose main effect is at matching time. It does not
|
||||
|
@@ -1465,17 +1479,27 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
||||
PCRE2_EXTENDED.
|
||||
|
||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
||||
alternatives begin with one of the following:
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
||||
PCRE2 if the first significant item in every top-level branch is one of
|
||||
the following:
|
||||
|
||||
^ unless PCRE2_MULTILINE is set
|
||||
\A always
|
||||
\G always
|
||||
.* if PCRE2_DOTALL is set and there are no back
|
||||
references to the subpattern in which .* appears
|
||||
.* sometimes - see below
|
||||
|
||||
For such patterns, the PCRE2_ANCHORED bit is set in the options
|
||||
returned for PCRE2_INFO_ALLOPTIONS.
|
||||
When .* is the first significant item, anchoring is possible only when
|
||||
all the following are true:
|
||||
|
||||
.* is not in an atomic group
|
||||
.* is not in a capturing group that is the subject
|
||||
of a back reference
|
||||
PCRE2_DOTALL is in force for .*
|
||||
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||
|
||||
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
|
||||
the options returned for PCRE2_INFO_ALLOPTIONS.
|
||||
|
||||
PCRE2_INFO_BACKREFMAX
|
||||
|
||||
|
@@ -1504,17 +1528,9 @@ INFORMATION ABOUT A COMPILED PATTERN
|
|||
If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
||||
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
||||
fixed first value, and if either
|
||||
|
||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every
|
||||
branch starts with "^", or
|
||||
|
||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is
|
||||
not set (if it were set, the pattern would be anchored),
|
||||
|
||||
2 is returned, indicating that the pattern matches only at the start of
|
||||
a subject string or after any newline within the string. Otherwise 0 is
|
||||
returned. For anchored patterns, 0 is returned.
|
||||
fixed first value, but it is known that a match can occur only at the
|
||||
start of the subject or following a newline in the subject, 2 is
|
||||
returned. Otherwise, and for anchored patterns, 0 is returned.
|
||||
|
||||
PCRE2_INFO_FIRSTCODEUNIT
|
||||
|
||||
|
@@ -2726,8 +2742,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 22 December 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@@ -3251,6 +3267,8 @@ MISSING CALLOUTS
|
|||
compiles and matches patterns, callouts sometimes do not happen exactly
|
||||
as you might expect.
|
||||
|
||||
Auto-possessification
|
||||
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
||||
that what follows cannot be part of the repeat. For example, a+[bc] is
|
||||
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
||||
|
@@ -3279,6 +3297,53 @@ MISSING CALLOUTS
|
|||
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
||||
tries again, repeatedly, until a+ itself fails.
|
||||
|
||||
Automatic .* anchoring
|
||||
|
||||
By default, an optimization is applied when .* is the first significant
|
||||
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
||||
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
||||
is not set, a match can start only after an internal newline or at the
|
||||
beginning of the subject, and pcre2_compile() remembers this. This
|
||||
optimization is disabled, however, if .* is in an atomic group or if
|
||||
there is a back reference to the capturing group in which it appears.
|
||||
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
|
||||
ever, the presence of callouts does not affect it.
|
||||
|
||||
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
||||
and applied to the string "aa", the pcre2test output is:
|
||||
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
|
||||
This shows that all match attempts start at the beginning of the sub-
|
||||
ject. In other words, the pattern is anchored. You can disable this
|
||||
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
|
||||
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
|
||||
put changes to:
|
||||
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
+0 ^ .*
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
|
||||
This shows more match attempts, starting at the second subject charac-
|
||||
ter. Another optimization, described in the next section, means that
|
||||
there is no subsequent attempt to match with an empty subject.
|
||||
|
||||
If a pattern has more than one top-level branch, automatic anchoring
|
||||
occurs if all branches are anchorable.
|
||||
|
||||
Other optimizations
|
||||
|
||||
Other optimizations that provide fast "no match" results also affect
|
||||
callouts. For example, if the pattern is
|
||||
|
||||
|
@@ -3410,8 +3475,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 25 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@@ -1,4 +1,4 @@
|
|||
.TH PCRE2_COMPILE 3 "21 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@@ -51,6 +51,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
theses (named ones available)
|
||||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
||||
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
|
||||
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
||||
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
||||
(only relevant if PCRE2_UTF is set)
|
||||
|
|
|
@@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "22 December 2014" "PCRE2 10.00"
|
||||
.TH PCRE2API 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@@ -1163,6 +1163,19 @@ use, auto-possessification means that some callouts are never taken. You can
|
|||
set this option if you want the matching functions to do a full unoptimized
|
||||
search and run all the callouts, but it is mainly provided for testing
|
||||
purposes.
|
||||
.sp
|
||||
PCRE2_NO_DOTSTAR_ANCHOR
|
||||
.sp
|
||||
If this option is set, it disables an optimization that is applied when .* is
|
||||
the first significant item in a top-level branch of a pattern, and all the
|
||||
other branches also start with .* or with \eA or \eG or ^. The optimization is
|
||||
automatically disabled for .* if it is inside an atomic group or a capturing
|
||||
group that is the subject of a back reference, or if the pattern contains
|
||||
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
|
||||
automatically anchored if PCRE2_DOTALL is set for all the .* items and
|
||||
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
|
||||
must start either at the start of the subject or following a newline is
|
||||
remembered. Like other optimizations, this can cause callouts to be skipped.
|
||||
.sp
|
||||
PCRE2_NO_START_OPTIMIZE
|
||||
.sp
|
||||
|
@@ -1436,18 +1449,27 @@ force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
|
|||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||
.P
|
||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
||||
alternatives begin with one of the following:
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||
the first significant item in every top-level branch is one of the following:
|
||||
.sp
|
||||
^ unless PCRE2_MULTILINE is set
|
||||
\eA always
|
||||
\eG always
|
||||
.\" JOIN
|
||||
.* if PCRE2_DOTALL is set and there are no back
|
||||
references to the subpattern in which .* appears
|
||||
.* sometimes - see below
|
||||
.sp
|
||||
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
|
||||
PCRE2_INFO_ALLOPTIONS.
|
||||
When .* is the first significant item, anchoring is possible only when all the
|
||||
following are true:
|
||||
.sp
|
||||
.* is not in an atomic group
|
||||
.\" JOIN
|
||||
.* is not in a capturing group that is the subject
|
||||
of a back reference
|
||||
PCRE2_DOTALL is in force for .*
|
||||
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||
.sp
|
||||
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
|
||||
options returned for PCRE2_INFO_ALLOPTIONS.
|
||||
.sp
|
||||
PCRE2_INFO_BACKREFMAX
|
||||
.sp
|
||||
|
@@ -1475,18 +1497,10 @@ variable.
|
|||
.P
|
||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
|
||||
if either
|
||||
.sp
|
||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
|
||||
starts with "^", or
|
||||
.sp
|
||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
|
||||
(if it were set, the pattern would be anchored),
|
||||
.sp
|
||||
2 is returned, indicating that the pattern matches only at the start of a
|
||||
subject string or after any newline within the string. Otherwise 0 is
|
||||
returned. For anchored patterns, 0 is returned.
|
||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||
it is known that a match can occur only at the start of the subject or
|
||||
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||
patterns, 0 is returned.
|
||||
.sp
|
||||
PCRE2_INFO_FIRSTCODEUNIT
|
||||
.sp
|
||||
|
@@ -2835,6 +2849,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 22 December 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@@ -1,4 +1,4 @@
|
|||
.TH PCRE2CALLOUT 3 "25 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@@ -65,7 +65,11 @@ particular pattern.
|
|||
You should be aware that, because of optimizations in the way PCRE2 compiles
|
||||
and matches patterns, callouts sometimes do not happen exactly as you might
|
||||
expect.
|
||||
.P
|
||||
.
|
||||
.
|
||||
.SS "Auto-possessification"
|
||||
.rs
|
||||
.sp
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
|
||||
|
@@ -93,7 +97,56 @@ case, the output changes to this:
|
|||
.sp
|
||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||
again, repeatedly, until a+ itself fails.
|
||||
.
|
||||
.
|
||||
.SS "Automatic .* anchoring"
|
||||
.rs
|
||||
.sp
|
||||
By default, an optimization is applied when .* is the first significant item in
|
||||
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||
start only after an internal newline or at the beginning of the subject, and
|
||||
\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
|
||||
if .* is in an atomic group or if there is a back reference to the capturing
|
||||
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
||||
or (*SKIP). However, the presence of callouts does not affect it.
|
||||
.P
|
||||
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
|
||||
applied to the string "aa", the \fBpcre2test\fP output is:
|
||||
.sp
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \ed
|
||||
+2 ^^ \ed
|
||||
+2 ^ \ed
|
||||
No match
|
||||
.sp
|
||||
This shows that all match attempts start at the beginning of the subject. In
|
||||
other words, the pattern is anchored. You can disable this optimization by
|
||||
passing PCRE2_NO_DOTSTAR_ANCHOR to \fBpcre2_compile()\fP, or starting the
|
||||
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||
.sp
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \ed
|
||||
+2 ^^ \ed
|
||||
+2 ^ \ed
|
||||
+0 ^ .*
|
||||
+2 ^^ \ed
|
||||
+2 ^ \ed
|
||||
No match
|
||||
.sp
|
||||
This shows more match attempts, starting at the second subject character.
|
||||
Another optimization, described in the next section, means that there is no
|
||||
subsequent attempt to match with an empty subject.
|
||||
.P
|
||||
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
||||
all branches are anchorable.
|
||||
.
|
||||
.
|
||||
.SS "Other optimizations"
|
||||
.rs
|
||||
.sp
|
||||
Other optimizations that provide fast "no match" results also affect callouts.
|
||||
For example, if the pattern is
|
||||
.sp
|
||||
|
@@ -232,6 +285,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 25 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@@ -117,6 +117,19 @@ reaching "no match" results. For more details, see the
|
|||
documentation.
|
||||
.
|
||||
.
|
||||
.SS "Disabling automatic anchoring"
|
||||
.rs
|
||||
.sp
|
||||
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
|
||||
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
|
||||
apply to patterns whose top-level branches all start with .* (match any number
|
||||
of arbitrary characters). For more details, see the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
documentation.
|
||||
.
|
||||
.
|
||||
.SS "Setting match and recursion limits"
|
||||
.rs
|
||||
.sp
|
||||
|
@@ -1853,7 +1866,8 @@ one succeeds. Consider this pattern:
|
|||
(?>.*?a)b
|
||||
.sp
|
||||
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
||||
(*PRUNE) and (*SKIP) also disable this optimization.
|
||||
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
|
||||
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
|
||||
.P
|
||||
When a capturing subpattern is repeated, the value captured is the substring
|
||||
that matched the final iteration. For example, after
|
||||
|
@@ -3278,6 +3292,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 14 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00"
|
||||
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 PERFORMANCE"
|
||||
|
@ -105,14 +105,18 @@ such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
|
|||
less with a DFA matching function, and in both cases there is not much
|
||||
difference for \eb.
|
||||
.P
|
||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||
this optimization, because the dot metacharacter does not then match a newline,
|
||||
and if the subject string contains newlines, the pattern may match from the
|
||||
character immediately following one of them instead of from the very start. For
|
||||
example, the pattern
|
||||
When a pattern begins with .* not in atomic parentheses, nor in parentheses
|
||||
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
|
||||
the pattern is implicitly anchored by PCRE2, since it can match only at the
|
||||
start of a subject string. If the pattern has multiple top-level branches, they
|
||||
must all be anchorable. The optimization can be disabled by the
|
||||
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
|
||||
contains (*PRUNE) or (*SKIP).
|
||||
.P
|
||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
|
||||
dot metacharacter does not then match a newline, and if the subject string
|
||||
contains newlines, the pattern may match from the character immediately
|
||||
following one of them instead of from the very start. For example, the pattern
|
||||
.sp
|
||||
.*second
|
||||
.sp
|
||||
|
@ -173,6 +177,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 October 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
|
||||
.TH PCRE2SYNTAX 3 "02 January 2015" "PCRE2 10.00"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -389,6 +389,7 @@ appear.
|
|||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
|
||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||
|
@ -536,6 +537,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
|
||||
.TH PCRE2TEST 1 "02 January 2015" "PCRE 10.00"
|
||||
.SH NAME
|
||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
|
@ -247,7 +247,7 @@ checked for compatibility with the \fBperltest.sh\fP script, which is used to
|
|||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||
lines, none of the other command lines are permitted, because they and many
|
||||
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
||||
test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB
|
||||
test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
|
||||
command helps detect tests that are accidentally put in the wrong file.
|
||||
.sp
|
||||
#subject <modifier-list>
|
||||
|
@ -413,6 +413,7 @@ for a description of their effects.
|
|||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
ucp set PCRE2_UCP
|
||||
|
@ -552,7 +553,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
|
|||
setting the size of the JIT stack.
|
||||
.P
|
||||
If the \fBjitfast\fP modifier is specified, matching is done using the JIT
|
||||
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
|
||||
"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
|
||||
checks that are done by \fBpcre2_match()\fP, and of course does not work when
|
||||
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
|
||||
assumed.
|
||||
|
@ -1274,6 +1275,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -402,6 +402,7 @@ PATTERN MODIFIERS
|
|||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||
ucp set PCRE2_UCP
|
||||
|
@ -1185,5 +1186,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 02 January 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
/* This is the public header file for the PCRE library, second API, to be
|
||||
#included by applications that call PCRE2 functions.
|
||||
|
||||
Copyright (c) 2014 University of Cambridge
|
||||
Copyright (c) 2015 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -113,10 +113,11 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_NEVER_UTF 0x00001000u /* C */
|
||||
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
|
||||
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
|
||||
#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */
|
||||
#define PCRE2_UCP 0x00010000u /* C J M D */
|
||||
#define PCRE2_UNGREEDY 0x00020000u /* C */
|
||||
#define PCRE2_UTF 0x00040000u /* C J M D */
|
||||
#define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u /* C */
|
||||
#define PCRE2_NO_START_OPTIMIZE 0x00010000u /* J M D */
|
||||
#define PCRE2_UCP 0x00020000u /* C J M D */
|
||||
#define PCRE2_UNGREEDY 0x00040000u /* C */
|
||||
#define PCRE2_UTF 0x00080000u /* C J M D */
|
||||
|
||||
/* These are for pcre2_jit_compile(). */
|
||||
|
||||
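The hunk above inserts PCRE2_NO_DOTSTAR_ANCHOR at 0x00008000u and shifts the four options after it up by one bit. The following is a minimal sketch, not part of the commit, that copies the new values from the hunk and checks that each is a single bit and that none collide. The helper `bits_are_distinct`, the array `new_bits`, and the `OLD_NO_START_OPTIMIZE` name are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the new side of the hunk above. */
#define PCRE2_NO_AUTO_POSSESS    0x00004000u
#define PCRE2_NO_DOTSTAR_ANCHOR  0x00008000u
#define PCRE2_NO_START_OPTIMIZE  0x00010000u
#define PCRE2_UCP                0x00020000u
#define PCRE2_UNGREEDY           0x00040000u
#define PCRE2_UTF                0x00080000u

/* Value of PCRE2_NO_START_OPTIMIZE on the old side of the hunk
   (hypothetical name, used here only for comparison). */
#define OLD_NO_START_OPTIMIZE    0x00008000u

static const uint32_t new_bits[] = {
  PCRE2_NO_AUTO_POSSESS, PCRE2_NO_DOTSTAR_ANCHOR, PCRE2_NO_START_OPTIMIZE,
  PCRE2_UCP, PCRE2_UNGREEDY, PCRE2_UTF
};
#define NEW_BITS_COUNT (sizeof new_bits / sizeof new_bits[0])

/* Hypothetical helper: returns 1 if every value has exactly one bit set
   and no two values share a bit, otherwise 0. */
static int bits_are_distinct(const uint32_t *bits, size_t n)
{
  uint32_t seen = 0;
  size_t i;
  assert(bits != NULL);
  for (i = 0; i < n; i++)
    {
    uint32_t b = bits[i];
    if (b == 0 || (b & (b - 1)) != 0) return 0;  /* not a single bit */
    if ((seen & b) != 0) return 0;               /* shares a bit with an earlier value */
    seen |= b;
    }
  return 1;
}
```

Note that the old value of PCRE2_NO_START_OPTIMIZE equals the new value of PCRE2_NO_DOTSTAR_ANCHOR, so code compiled against the earlier header would presumably need rebuilding to keep its intended meaning.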
|
|
|
@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
|
|||
|
||||
Written by Philip Hazel
|
||||
Original API code Copyright (c) 1997-2012 University of Cambridge
|
||||
New API code Copyright (c) 2014 University of Cambridge
|
||||
New API code Copyright (c) 2015 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -557,8 +557,8 @@ static PCRE2_SPTR posix_substitutes[] = {
|
|||
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
|
||||
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
|
||||
PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
|
||||
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \
|
||||
PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
|
||||
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
|
||||
PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
|
||||
|
||||
/* Compile time error code numbers. They are given names so that they can more
|
||||
easily be tracked. When a new number is added, the tables called eint1 and
|
||||
|
@ -597,22 +597,23 @@ typedef struct pso {
|
|||
/* NB: STRING_UTFn_RIGHTPAR contains the length as well */
|
||||
|
||||
static pso pso_list[] = {
|
||||
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
|
||||
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
|
||||
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
|
||||
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
|
||||
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR,17, PSO_FLG, PCRE2_NE_ATST_SET },
|
||||
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
|
||||
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
|
||||
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
|
||||
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
|
||||
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
|
||||
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
|
||||
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
|
||||
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
|
||||
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
|
||||
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
|
||||
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
|
||||
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
|
||||
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
|
||||
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
|
||||
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
|
||||
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
|
||||
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
|
||||
{ (uint8_t *)STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
|
||||
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
|
||||
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
|
||||
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
|
||||
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
|
||||
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
|
||||
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
|
||||
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
|
||||
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
|
||||
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
|
||||
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
|
||||
};
|
||||
|
||||
|
||||
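Each pso_list entry above carries a hand-written length next to its string, so the new entry's 18 has to agree with the text "NO_DOTSTAR_ANCHOR)". The sketch below is a simplified check, not code from the commit: the `pso_len` type and `first_bad_length` helper are hypothetical, and the literal strings are taken from the non-EBCDIC STRING_..._RIGHTPAR definitions that appear later in this same commit. Only entries whose strings are visible in the diff are included.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the pso structure:
   just the string and the length recorded beside it. */
typedef struct {
  const char *name;
  size_t length;
} pso_len;

static const pso_len pso_lengths[] = {
  { "UTF)",                4 },
  { "UCP)",                4 },
  { "NOTEMPTY)",           9 },
  { "NOTEMPTY_ATSTART)",  17 },
  { "NO_AUTO_POSSESS)",   16 },
  { "NO_DOTSTAR_ANCHOR)", 18 },
  { "NO_START_OPT)",      13 }
};
#define PSO_LENGTHS_COUNT (sizeof pso_lengths / sizeof pso_lengths[0])

/* Returns the index of the first entry whose recorded length differs
   from strlen(name), or -1 if every entry is consistent. */
static int first_bad_length(const pso_len *list, size_t n)
{
  size_t i;
  assert(list != NULL);
  for (i = 0; i < n; i++)
    if (strlen(list[i].name) != list[i].length) return (int)i;
  return -1;
}
```

Calling `first_bad_length(pso_lengths, PSO_LENGTHS_COUNT)` is one way a unit test could catch a miscounted length when another pattern-start option is added.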
|
@ -7020,13 +7021,14 @@ do {
|
|||
|
||||
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
|
||||
it isn't in brackets that are or may be referenced or inside an atomic
|
||||
group. */
|
||||
group. There is also an option that disables auto-anchoring. */
|
||||
|
||||
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
|
||||
op == OP_TYPEPOSSTAR))
|
||||
{
|
||||
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
|
||||
atomcount > 0 || cb->had_pruneorskip)
|
||||
atomcount > 0 || cb->had_pruneorskip ||
|
||||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
|
||||
return FALSE;
|
||||
}
|
||||
|
||||
|
@ -7140,12 +7142,13 @@ do {
|
|||
brackets that may be referenced, as long as the pattern does not contain
|
||||
*PRUNE or *SKIP, because these break the feature. Consider, for example,
|
||||
/.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
|
||||
start of a line. */
|
||||
start of a line. There is also an option that disables this optimization. */
|
||||
|
||||
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
|
||||
{
|
||||
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
|
||||
atomcount > 0 || cb->had_pruneorskip)
|
||||
atomcount > 0 || cb->had_pruneorskip ||
|
||||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
|
||||
return FALSE;
|
||||
}
|
||||
|
||||
|
@ -7863,7 +7866,8 @@ if (errorcode != 0)
|
|||
/* Successful compile. If the anchored option was not passed, set it if
|
||||
we can determine that the pattern is anchored by virtue of ^ characters or \A
|
||||
or anything else, such as starting with non-atomic .* when DOTALL is set and
|
||||
there are no occurrences of *PRUNE or *SKIP. */
|
||||
there are no occurrences of *PRUNE or *SKIP (though there is an option to
|
||||
disable this case). */
|
||||
|
||||
if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
|
||||
is_anchored(codestart, 0, &cb, 0))
|
||||
|
@ -7912,7 +7916,8 @@ if ((re->overall_options & (PCRE2_ANCHORED|PCRE2_NO_START_OPTIMIZE)) == 0)
|
|||
/* When there is no first code unit, see if we can set the PCRE2_STARTLINE
|
||||
flag. This is helpful for multiline matches when all branches start with ^
|
||||
and also when all branches start with non-atomic .* for non-DOTALL matches
|
||||
when *PRUNE and SKIP are not present. */
|
||||
when *PRUNE and SKIP are not present. (There is an option that disables this
|
||||
case.) */
|
||||
|
||||
else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
|
||||
}
|
||||
|
|
|
@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
|
|||
|
||||
Written by Philip Hazel
|
||||
Original API code Copyright (c) 1997-2012 University of Cambridge
|
||||
New API code Copyright (c) 2014 University of Cambridge
|
||||
New API code Copyright (c) 2015 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -904,6 +904,7 @@ a positive value. */
|
|||
#define STRING_UTF_RIGHTPAR "UTF)"
|
||||
#define STRING_UCP_RIGHTPAR "UCP)"
|
||||
#define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
|
||||
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR "NO_DOTSTAR_ANCHOR)"
|
||||
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
|
||||
#define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
|
||||
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
|
||||
|
@ -1173,6 +1174,7 @@ only. */
|
|||
#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_D STR_O STR_T STR_S STR_T STR_A STR_R STR_UNDERSCORE STR_A STR_N STR_C STR_H STR_O STR_R STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS
|
||||
|
|
|
@ -445,7 +445,7 @@ match() only in the case when ovecsave is needed. (David Wheeler used to say
|
|||
"All problems in computer science can be solved by another level of
|
||||
indirection.")
|
||||
|
||||
HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
|
||||
HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
|
||||
function is called only once, and only from within match(), gcc will "inline"
|
||||
it - that is, move it inside match() - and this completely negates its reason
|
||||
for existence. Therefore, we mark it as non-inline when gcc is in use.
|
||||
|
|
|
@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
|
|||
|
||||
Written by Philip Hazel
|
||||
Original code Copyright (c) 1997-2012 University of Cambridge
|
||||
Rewritten code Copyright (c) 2014 University of Cambridge
|
||||
Rewritten code Copyright (c) 2015 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
|
@ -498,6 +498,7 @@ static modstruct modlist[] = {
|
|||
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
|
||||
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
|
||||
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
|
||||
{ "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
|
||||
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
|
||||
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
|
||||
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
|
||||
|
@ -3291,29 +3292,30 @@ static void
|
|||
show_compile_options(uint32_t options, const char *before, const char *after)
|
||||
{
|
||||
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
|
||||
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||
before,
|
||||
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
|
||||
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
|
||||
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
|
||||
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
|
||||
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
|
||||
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
|
||||
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
|
||||
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
|
||||
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
|
||||
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
|
||||
((options & PCRE2_UTF) != 0)? " utf" : "",
|
||||
((options & PCRE2_UCP) != 0)? " ucp" : "",
|
||||
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
|
||||
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
|
||||
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
|
||||
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
|
||||
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
|
||||
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
|
||||
((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
|
||||
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
|
||||
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
|
||||
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
|
||||
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
|
||||
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
|
||||
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
|
||||
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
|
||||
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
|
||||
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
|
||||
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
|
||||
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
|
||||
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
|
||||
((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)? " no_dotstar_anchor" : "",
|
||||
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
|
||||
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
|
||||
((options & PCRE2_UCP) != 0)? " ucp" : "",
|
||||
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
|
||||
((options & PCRE2_UTF) != 0)? " utf" : "",
|
||||
after);
|
||||
}
|
||||
|
||||
|
|
|
@ -4100,4 +4100,20 @@ a random value. /Ix
|
|||
/a(b)c(d)/
|
||||
abc\=ph,copy=0,copy=1,getall
|
||||
|
||||
/^abc/info
|
||||
|
||||
/^abc/info,no_dotstar_anchor
|
||||
|
||||
/.*\d/info,auto_callout
|
||||
aaa
|
||||
|
||||
/.*\d/info,no_dotstar_anchor,auto_callout
|
||||
aaa
|
||||
|
||||
/.*\d/dotall,info
|
||||
|
||||
/.*\d/dotall,no_dotstar_anchor,info
|
||||
|
||||
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
|
||||
|
||||
# End of testinput2
|
||||
|
|
|
@ -5361,7 +5361,7 @@ No match
|
|||
"<(\w+)/?>(.)*</(\1)>"Igms
|
||||
Capturing subpattern count = 3
|
||||
Max back reference = 1
|
||||
Options: multiline dotall
|
||||
Options: dotall multiline
|
||||
First code unit = '<'
|
||||
Last code unit = '>'
|
||||
Subject length lower bound = 7
|
||||
|
@ -5399,7 +5399,7 @@ No match
|
|||
/line\nbreak/Im,firstline
|
||||
Capturing subpattern count = 0
|
||||
Contains explicit CR or LF match
|
||||
Options: multiline firstline
|
||||
Options: firstline multiline
|
||||
First code unit = 'l'
|
||||
Last code unit = 'k'
|
||||
Subject length lower bound = 10
|
||||
|
@ -9698,7 +9698,7 @@ Subject length lower bound = 41
|
|||
/Iisx
|
||||
Capturing subpattern count = 3
|
||||
Max back reference = 1
|
||||
Options: caseless extended dotall
|
||||
Options: caseless dotall extended
|
||||
First code unit = '<'
|
||||
Last code unit = '='
|
||||
Subject length lower bound = 9
|
||||
|
@ -9747,7 +9747,7 @@ Named capturing subpatterns:
|
|||
quote 4
|
||||
realquote 3
|
||||
realquote 6
|
||||
Options: extended dupnames
|
||||
Options: dupnames extended
|
||||
Starting code units: a b
|
||||
Subject length lower bound = 3
|
||||
a"aaaaa
|
||||
|
@ -9805,8 +9805,8 @@ Capturing subpattern count = 4
|
|||
Named capturing subpatterns:
|
||||
D 4
|
||||
D 1
|
||||
Compile options: extended dupnames
|
||||
Overall options: anchored extended dupnames
|
||||
Compile options: dupnames extended
|
||||
Overall options: anchored dupnames extended
|
||||
Subject length lower bound = 2
|
||||
abcdX
|
||||
0: abcdX
|
||||
|
@ -9852,7 +9852,7 @@ Capturing subpattern count = 4
|
|||
Named capturing subpatterns:
|
||||
A 1
|
||||
A 4
|
||||
Options: extended dupnames
|
||||
Options: dupnames extended
|
||||
First code unit = 'a'
|
||||
Last code unit = 'd'
|
||||
Subject length lower bound = 4
|
||||
|
@ -9936,7 +9936,7 @@ No match
|
|||
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
|
||||
Capturing subpattern count = 3
|
||||
Max back reference = 3
|
||||
Options: dupnames alt_bsux allow_empty_class match_unset_backref
|
||||
Options: alt_bsux allow_empty_class dupnames match_unset_backref
|
||||
Last code unit = 'a'
|
||||
Subject length lower bound = 1
|
||||
cat
|
||||
|
@ -13769,4 +13769,67 @@ Partial match: abc
|
|||
Copy substring 1 failed (-2): partial match
|
||||
get substring list failed (-2): partial match
|
||||
|
||||
/^abc/info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: <none>
|
||||
Overall options: anchored
|
||||
Subject length lower bound = 3
|
||||
|
||||
/^abc/info,no_dotstar_anchor
|
||||
Capturing subpattern count = 0
|
||||
Compile options: no_dotstar_anchor
|
||||
Overall options: anchored no_dotstar_anchor
|
||||
Subject length lower bound = 3
|
||||
|
||||
/.*\d/info,auto_callout
|
||||
Capturing subpattern count = 0
|
||||
Options: auto_callout
|
||||
First code unit at start or follows newline
|
||||
Subject length lower bound = 1
|
||||
aaa
|
||||
--->aaa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
|
||||
/.*\d/info,no_dotstar_anchor,auto_callout
|
||||
Capturing subpattern count = 0
|
||||
Options: auto_callout no_dotstar_anchor
|
||||
Subject length lower bound = 1
|
||||
aaa
|
||||
--->aaa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
+0 ^ .*
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
|
||||
/.*\d/dotall,info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: dotall
|
||||
Overall options: anchored dotall
|
||||
Subject length lower bound = 1
|
||||
|
||||
/.*\d/dotall,no_dotstar_anchor,info
|
||||
Capturing subpattern count = 0
|
||||
Options: dotall no_dotstar_anchor
|
||||
Subject length lower bound = 1
|
||||
|
||||
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
|
||||
Capturing subpattern count = 0
|
||||
Compile options: <none>
|
||||
Overall options: dotall no_dotstar_anchor
|
||||
Subject length lower bound = 1
|
||||
|
||||
# End of testinput2
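The two auto_callout runs of /.*\d/ against "aaa" above differ in how many times the `+0` callout fires: once with the optimization, three times with no_dotstar_anchor. A rough way to picture this is to count the candidate start offsets. The function below is a simplified model, not PCRE2 code: `count_start_attempts` is a hypothetical name, only LF is treated as a newline, and the "Subject length lower bound" from the test output is passed in as `min_length`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the offsets at which a match attempt would begin in this model.
   startline != 0 models the "at start or follows newline" restriction;
   offsets that leave fewer than min_length characters are never tried. */
static size_t count_start_attempts(const char *subject, size_t min_length,
  int startline)
{
  size_t len, i, count = 0;
  assert(subject != NULL);
  len = strlen(subject);
  for (i = 0; i + min_length <= len; i++)
    {
    /* With the restriction, skip offsets that are neither the subject
       start nor immediately after a newline. */
    if (startline && i != 0 && subject[i - 1] != '\n') continue;
    count++;
    }
  return count;
}
```

For "aaa" with a lower bound of 1 the model gives one attempt with the restriction and three without, which lines up with the number of `+0` callouts in the two test outputs.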
|
||||
|
|