Add PCRE2_NO_DOTSTAR_ANCHOR and revise documentation for .* optimizing.

Philip.Hazel 2015-01-02 17:09:16 +00:00
parent 019e115060
commit 5a18651441
24 changed files with 502 additions and 173 deletions

@ -58,4 +58,6 @@ matched against "abcd".
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could (an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
occur. occur.
10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented.
**** ****

@ -63,6 +63,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available) theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
(only relevant if PCRE2_UTF is set) (only relevant if PCRE2_UTF is set)

@ -1187,6 +1187,19 @@ use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing search and run all the callouts, but it is mainly provided for testing
purposes. purposes.
<pre>
PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
<pre> <pre>
PCRE2_NO_START_OPTIMIZE PCRE2_NO_START_OPTIMIZE
</pre> </pre>
@ -1442,16 +1455,25 @@ compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
PCRE2_MULTILINE, and PCRE2_EXTENDED. PCRE2_MULTILINE, and PCRE2_EXTENDED.
</P> </P>
<P> <P>
A pattern is automatically anchored by PCRE2 if all of its top-level A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
alternatives begin with one of the following: the first significant item in every top-level branch is one of the following:
<pre> <pre>
^ unless PCRE2_MULTILINE is set ^ unless PCRE2_MULTILINE is set
\A always \A always
\G always \G always
.* if PCRE2_DOTALL is set and there are no back references to the subpattern in which .* appears .* sometimes - see below
</pre> </pre>
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for When .* is the first significant item, anchoring is possible only when all the
PCRE2_INFO_ALLOPTIONS. following are true:
<pre>
.* is not in an atomic group
.* is not in a capturing group that is the subject of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
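The conditions listed above can be read as a predicate over the top-level branches. The following is a minimal sketch of that reading, not PCRE2 source code: the types `first_item`, `branch_info` and `pattern_info` and the function `pattern_is_auto_anchored()` are hypothetical names introduced only for illustration, and the real compiler derives this information from the compiled pattern rather than from a hand-filled structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a top-level branch; not a PCRE2 data structure. */
typedef enum {
    ITEM_CARET,        /* ^  */
    ITEM_BACKSLASH_A,  /* \A */
    ITEM_BACKSLASH_G,  /* \G */
    ITEM_DOTSTAR,      /* .* */
    ITEM_OTHER         /* anything else */
} first_item;

typedef struct {
    first_item first;              /* first significant item of the branch */
    bool multiline;                /* PCRE2_MULTILINE in force for a leading ^ */
    bool dotall;                   /* PCRE2_DOTALL in force for a leading .* */
    bool in_atomic_group;          /* leading .* is inside an atomic group */
    bool in_backreferenced_group;  /* leading .* is in a back-referenced capture */
} branch_info;

typedef struct {
    bool has_prune_or_skip;   /* (*PRUNE) or (*SKIP) appears in the pattern */
    bool no_dotstar_anchor;   /* PCRE2_NO_DOTSTAR_ANCHOR was requested */
} pattern_info;

/* Applies the documented per-branch rules to a single branch. */
static bool branch_is_anchorable(const branch_info *b, const pattern_info *p)
{
    switch (b->first) {
    case ITEM_CARET:
        return !b->multiline;          /* ^ anchors unless multiline */
    case ITEM_BACKSLASH_A:
    case ITEM_BACKSLASH_G:
        return true;                   /* always */
    case ITEM_DOTSTAR:
        return b->dotall
            && !b->in_atomic_group
            && !b->in_backreferenced_group
            && !p->has_prune_or_skip
            && !p->no_dotstar_anchor;
    default:
        return false;
    }
}

/* A pattern is auto-anchored only if every top-level branch is anchorable. */
bool pattern_is_auto_anchored(const branch_info *branches, size_t n,
                              const pattern_info *p)
{
    assert(p != NULL);
    assert(branches != NULL || n == 0);
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (!branch_is_anchorable(&branches[i], p))
            return false;
    }
    return true;
}
```

In this model, setting `no_dotstar_anchor` affects only branches that start with .*, which mirrors the wording of the option description; branches that start with ^, \A or \G are judged as before.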
<pre> <pre>
PCRE2_INFO_BACKREFMAX PCRE2_INFO_BACKREFMAX
</pre> </pre>
@ -1480,21 +1502,10 @@ variable.
<P> <P>
If there is a fixed first value, for example, the letter "c" from a pattern If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), 1 is returned, and the character value can be such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
if either it is known that a match can occur only at the start of the subject or
<br> following a newline in the subject, 2 is returned. Otherwise, and for anchored
<br> patterns, 0 is returned.
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
starts with "^", or
<br>
<br>
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
(if it were set, the pattern would be anchored),
<br>
<br>
2 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
<pre> <pre>
PCRE2_INFO_FIRSTCODEUNIT PCRE2_INFO_FIRSTCODEUNIT
</pre> </pre>
@ -2792,9 +2803,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC37" href="#TOC1">REVISION</a><br> <br><a name="SEC37" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 22 December 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -82,6 +82,9 @@ You should be aware that, because of optimizations in the way PCRE2 compiles
and matches patterns, callouts sometimes do not happen exactly as you might and matches patterns, callouts sometimes do not happen exactly as you might
expect. expect.
</P> </P>
<br><b>
Auto-possessification
</b><br>
<P> <P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as what follows cannot be part of the repeat. For example, a+[bc] is compiled as
@ -111,6 +114,56 @@ case, the output changes to this:
This time, when matching [bc] fails, the matcher backtracks into a+ and tries This time, when matching [bc] fails, the matcher backtracks into a+ and tries
again, repeatedly, until a+ itself fails. again, repeatedly, until a+ itself fails.
</P> </P>
<br><b>
Automatic .* anchoring
</b><br>
<P>
By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
if .* is in an atomic group or if there is a back reference to the capturing
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
or (*SKIP). However, the presence of callouts does not affect it.
</P>
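When PCRE2_DOTALL is not set, the paragraph above says that a match can start only at the beginning of the subject or just after an internal newline. A rough sketch of what that restriction means for a matcher's choice of starting offsets follows. It is an illustration only: `next_candidate_start()` and `NO_CANDIDATE` are hypothetical names, and the sketch assumes that the only newline sequence is a single linefeed, which need not be the newline convention in force.

```c
#include <assert.h>
#include <stddef.h>

/* Returned when no further starting offset is possible. */
#define NO_CANDIDATE ((size_t)-1)

/*
 * Returns the smallest offset i with from <= i <= len at which a match
 * attempt could begin, that is, offset 0 or any offset that directly
 * follows a '\n'. Returns NO_CANDIDATE if there is none.
 */
size_t next_candidate_start(const char *subject, size_t len, size_t from)
{
    assert(subject != NULL);
    for (size_t i = from; i <= len; i++) {
        if (i == 0 || subject[i - 1] == '\n')
            return i;
    }
    return NO_CANDIDATE;
}
```

For a 16-character subject consisting of "first", a linefeed, and "and second", the candidates in this model are offsets 0 and 6; every other offset is skipped without running a match attempt.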
<P>
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the <b>pcre2test</b> output is:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows that all match attempts start at the beginning of the subject. In
other words, the pattern is anchored. You can disable this optimization by
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
</P>
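The two outputs above differ in how many starting offsets are tried: one when the pattern is anchored, two when it is not. The toy model below reproduces that count. It is a sketch under stated assumptions, not a description of the PCRE2 matcher: `count_start_attempts()` is a hypothetical function, every attempt is assumed to fail, the subject is assumed to contain no newlines, and the cutoff based on a minimum match length is an assumption about which optimization suppresses the final attempt at the empty tail of the subject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Counts the starting offsets a simple left-to-right matcher would try
 * when every attempt fails.
 *
 * subject_len    length of the subject in code units
 * min_match_len  shortest string the pattern can match (assumed known)
 * anchored       true if only offset 0 may be tried
 */
size_t count_start_attempts(size_t subject_len, size_t min_match_len,
                            bool anchored)
{
    /* A subject shorter than the minimum cannot match anywhere. */
    if (subject_len < min_match_len)
        return 0;

    /* An anchored pattern is tried once, at offset 0. */
    if (anchored)
        return 1;

    /* Otherwise try every offset that leaves at least min_match_len units. */
    return subject_len - min_match_len + 1;
}
```

For the pattern .*\d (minimum length 1) and the subject "aa" (length 2), the model gives one attempt when anchored and two when not, in line with the number of `+0 ^ .*` lines in each output.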
<P>
If a pattern has more than one top-level branch, automatic anchoring occurs if
all branches are anchorable.
</P>
<br><b>
Other optimizations
</b><br>
<P> <P>
Other optimizations that provide fast "no match" results also affect callouts. Other optimizations that provide fast "no match" results also affect callouts.
For example, if the pattern is For example, if the pattern is
@ -254,9 +307,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br> <br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 25 November 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -151,6 +151,17 @@ reaching "no match" results. For more details, see the
documentation. documentation.
</P> </P>
<br><b> <br><b>
Disabling automatic anchoring
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Setting match and recursion limits Setting match and recursion limits
</b><br> </b><br>
<P> <P>
@ -1841,7 +1852,8 @@ one succeeds. Consider this pattern:
(?&#62;.*?a)b (?&#62;.*?a)b
</pre> </pre>
It matches "ab" in the subject "aab". The use of the backtracking control verbs It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disable this optimization. (*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
</P> </P>
<P> <P>
When a capturing subpattern is repeated, the value captured is the substring When a capturing subpattern is repeated, the value captured is the substring
@ -3236,9 +3248,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 14 November 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -115,14 +115,19 @@ less with a DFA matching function, and in both cases there is not much
difference for \b. difference for \b.
</P> </P>
<P> <P>
When a pattern begins with .* not in parentheses, or in parentheses that are When a pattern begins with .* not in atomic parentheses, nor in parentheses
not the subject of a backreference, and the PCRE2_DOTALL option is set, the that are the subject of a backreference, and the PCRE2_DOTALL option is set,
pattern is implicitly anchored by PCRE2, since it can match only at the start the pattern is implicitly anchored by PCRE2, since it can match only at the
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make start of a subject string. If the pattern has multiple top-level branches, they
this optimization, because the dot metacharacter does not then match a newline, must all be anchorable. The optimization can be disabled by the
and if the subject string contains newlines, the pattern may match from the PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
character immediately following one of them instead of from the very start. For contains (*PRUNE) or (*SKIP).
example, the pattern </P>
<P>
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
dot metacharacter does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the pattern
<pre> <pre>
.*second .*second
</pre> </pre>
@ -187,9 +192,9 @@ Cambridge, England.
REVISION REVISION
</b><br> </b><br>
<P> <P>
Last updated: 20 October 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -416,6 +416,7 @@ appear.
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc) (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
@ -553,9 +554,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 November 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -291,7 +291,7 @@ checked for compatibility with the <b>perltest.sh</b> script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to <b>pcre2test</b>, and should not be used in of the modifiers are specific to <b>pcre2test</b>, and should not be used in
test files that are also processed by <b>perltest.sh</b>. The \fP#perltest\fB test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
command helps detect tests that are accidentally put in the wrong file. command helps detect tests that are accidentally put in the wrong file.
<pre> <pre>
#subject &#60;modifier-list&#62; #subject &#60;modifier-list&#62;
@ -454,6 +454,7 @@ for a description of their effects.
never_utf set PCRE2_NEVER_UTF never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP ucp set PCRE2_UCP
@ -596,7 +597,7 @@ setting the size of the JIT stack.
</P> </P>
<P> <P>
If the <b>jitfast</b> modifier is specified, matching is done using the JIT If the <b>jitfast</b> modifier is specified, matching is done using the JIT
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity "fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
checks that are done by <b>pcre2_match()</b>, and of course does not work when checks that are done by <b>pcre2_match()</b>, and of course does not work when
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
assumed. assumed.
@ -1309,9 +1310,9 @@ Cambridge, England.
</P> </P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br> <br><a name="SEC20" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 November 2014 Last updated: 02 January 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -1226,6 +1226,20 @@ COMPILING A PATTERN
a full unoptimized search and run all the callouts, but it is mainly a full unoptimized search and run all the callouts, but it is mainly
provided for testing purposes. provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when
.* is the first significant item in a top-level branch of a pattern,
and all the other branches also start with .* or with \A or \G or ^.
The optimization is automatically disabled for .* if it is inside an
atomic group or a capturing group that is the subject of a back refer-
ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
mization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
for any ^ items. Otherwise, the fact that any match must start either
at the start of the subject or following a newline is remembered. Like
other optimizations, this can cause callouts to be skipped.
PCRE2_NO_START_OPTIMIZE PCRE2_NO_START_OPTIMIZE
This is an option whose main effect is at matching time. It does not This is an option whose main effect is at matching time. It does not
@ -1465,17 +1479,27 @@ INFORMATION ABOUT A COMPILED PATTERN
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
PCRE2_EXTENDED. PCRE2_EXTENDED.
A pattern is automatically anchored by PCRE2 if all of its top-level A pattern compiled without PCRE2_ANCHORED is automatically anchored by
alternatives begin with one of the following: PCRE2 if the first significant item in every top-level branch is one of
the following:
^ unless PCRE2_MULTILINE is set ^ unless PCRE2_MULTILINE is set
\A always \A always
\G always \G always
.* if PCRE2_DOTALL is set and there are no back .* sometimes - see below
references to the subpattern in which .* appears
For such patterns, the PCRE2_ANCHORED bit is set in the options When .* is the first significant item, anchoring is possible only when
returned for PCRE2_INFO_ALLOPTIONS. all the following are true:
.* is not in an atomic group
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
the options returned for PCRE2_INFO_ALLOPTIONS.
PCRE2_INFO_BACKREFMAX PCRE2_INFO_BACKREFMAX
@ -1504,17 +1528,9 @@ INFORMATION ABOUT A COMPILED PATTERN
If there is a fixed first value, for example, the letter "c" from a If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the character pattern such as (cat|cow|coyote), 1 is returned, and the character
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
fixed first value, and if either fixed first value, but it is known that a match can occur only at the
start of the subject or following a newline in the subject, 2 is
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every returned. Otherwise, and for anchored patterns, 0 is returned.
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is
not set (if it were set, the pattern would be anchored),
2 is returned, indicating that the pattern matches only at the start of
a subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
PCRE2_INFO_FIRSTCODEUNIT PCRE2_INFO_FIRSTCODEUNIT
@ -2726,8 +2742,8 @@ AUTHOR
REVISION REVISION
Last updated: 22 December 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -3251,6 +3267,8 @@ MISSING CALLOUTS
compiles and matches patterns, callouts sometimes do not happen exactly compiles and matches patterns, callouts sometimes do not happen exactly
as you might expect. as you might expect.
Auto-possessification
At compile time, PCRE2 "auto-possessifies" repeated items when it knows At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern compiled as if it were a++[bc]. The pcre2test output when this pattern
@ -3279,6 +3297,53 @@ MISSING CALLOUTS
This time, when matching [bc] fails, the matcher backtracks into a+ and This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails. tries again, repeatedly, until a+ itself fails.
Automatic .* anchoring
By default, an optimization is applied when .* is the first significant
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
any character, the pattern is automatically anchored. If PCRE2_DOTALL
is not set, a match can start only after an internal newline or at the
beginning of the subject, and pcre2_compile() remembers this. This
optimization is disabled, however, if .* is in an atomic group or if
there is a back reference to the capturing group in which it appears.
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
ever, the presence of callouts does not affect it.
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
and applied to the string "aa", the pcre2test output is:
--->aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
This shows that all match attempts start at the beginning of the sub-
ject. In other words, the pattern is anchored. You can disable this
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
put changes to:
--->aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
This shows more match attempts, starting at the second subject charac-
ter. Another optimization, described in the next section, means that
there is no subsequent attempt to match with an empty subject.
If a pattern has more than one top-level branch, automatic anchoring
occurs if all branches are anchorable.
Other optimizations
Other optimizations that provide fast "no match" results also affect Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is callouts. For example, if the pattern is
@ -3410,8 +3475,8 @@ AUTHOR
REVISION REVISION
Last updated: 25 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "21 October 2014" "PCRE2 10.00" .TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -51,6 +51,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available) theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
(only relevant if PCRE2_UTF is set) (only relevant if PCRE2_UTF is set)

@ -1,4 +1,4 @@
.TH PCRE2API 3 "22 December 2014" "PCRE2 10.00" .TH PCRE2API 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1163,6 +1163,19 @@ use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing search and run all the callouts, but it is mainly provided for testing
purposes. purposes.
.sp
PCRE2_NO_DOTSTAR_ANCHOR
.sp
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \eA or \eG or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
.sp .sp
PCRE2_NO_START_OPTIMIZE PCRE2_NO_START_OPTIMIZE
.sp .sp
@ -1436,18 +1449,27 @@ force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS, compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
PCRE2_MULTILINE, and PCRE2_EXTENDED. PCRE2_MULTILINE, and PCRE2_EXTENDED.
.P .P
A pattern is automatically anchored by PCRE2 if all of its top-level A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
alternatives begin with one of the following: the first significant item in every top-level branch is one of the following:
.sp .sp
^ unless PCRE2_MULTILINE is set ^ unless PCRE2_MULTILINE is set
\eA always \eA always
\eG always \eG always
.\" JOIN .* sometimes - see below
.* if PCRE2_DOTALL is set and there are no back
references to the subpattern in which .* appears
.sp .sp
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for When .* is the first significant item, anchoring is possible only when all the
PCRE2_INFO_ALLOPTIONS. following are true:
.sp
.* is not in an atomic group
.\" JOIN
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
.sp
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
.sp .sp
PCRE2_INFO_BACKREFMAX PCRE2_INFO_BACKREFMAX
.sp .sp
@ -1475,18 +1497,10 @@ variable.
.P .P
If there is a fixed first value, for example, the letter "c" from a pattern If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), 1 is returned, and the character value can be such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
if either it is known that a match can occur only at the start of the subject or
.sp following a newline in the subject, 2 is returned. Otherwise, and for anchored
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch patterns, 0 is returned.
starts with "^", or
.sp
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
(if it were set, the pattern would be anchored),
.sp
2 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
.sp .sp
PCRE2_INFO_FIRSTCODEUNIT PCRE2_INFO_FIRSTCODEUNIT
.sp .sp
@ -2835,6 +2849,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 22 December 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "25 November 2014" "PCRE2 10.00" .TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -65,7 +65,11 @@ particular pattern.
You should be aware that, because of optimizations in the way PCRE2 compiles You should be aware that, because of optimizations in the way PCRE2 compiles
and matches patterns, callouts sometimes do not happen exactly as you might and matches patterns, callouts sometimes do not happen exactly as you might
expect. expect.
.P .
.
.SS "Auto-possessification"
.rs
.sp
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
@ -93,7 +97,56 @@ case, the output changes to this:
.sp .sp
This time, when matching [bc] fails, the matcher backtracks into a+ and tries This time, when matching [bc] fails, the matcher backtracks into a+ and tries
again, repeatedly, until a+ itself fails. again, repeatedly, until a+ itself fails.
.
.
.SS "Automatic .* anchoring"
.rs
.sp
By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
if .* is in an atomic group or if there is a back reference to the capturing
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
or (*SKIP). However, the presence of callouts does not affect it.
.P .P
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the \fBpcre2test\fP output is:
.sp
--->aa
+0 ^ .*
+2 ^ ^ \ed
+2 ^^ \ed
+2 ^ \ed
No match
.sp
This shows that all match attempts start at the beginning of the subject. In
other words, the pattern is anchored. You can disable this optimization by
passing PCRE2_NO_DOTSTAR_ANCHOR to \fBpcre2_compile()\fP, or starting the
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
.sp
--->aa
+0 ^ .*
+2 ^ ^ \ed
+2 ^^ \ed
+2 ^ \ed
+0 ^ .*
+2 ^^ \ed
+2 ^ \ed
No match
.sp
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
.P
If a pattern has more than one top-level branch, automatic anchoring occurs if
all branches are anchorable.
.
.
.SS "Other optimizations"
.rs
.sp
Other optimizations that provide fast "no match" results also affect callouts. Other optimizations that provide fast "no match" results also affect callouts.
For example, if the pattern is For example, if the pattern is
.sp .sp
@ -232,6 +285,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 25 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00" .TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -117,6 +117,19 @@ reaching "no match" results. For more details, see the
documentation. documentation.
. .
. .
.SS "Disabling automatic anchoring"
.rs
.sp
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.
.
.SS "Setting match and recursion limits" .SS "Setting match and recursion limits"
.rs .rs
.sp .sp
@ -1853,7 +1866,8 @@ one succeeds. Consider this pattern:
(?>.*?a)b (?>.*?a)b
.sp .sp
It matches "ab" in the subject "aab". The use of the backtracking control verbs It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disable this optimization. (*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
.P .P
When a capturing subpattern is repeated, the value captured is the substring When a capturing subpattern is repeated, the value captured is the substring
that matched the final iteration. For example, after that matched the final iteration. For example, after
@ -3278,6 +3292,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 14 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00" .TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE" .SH "PCRE2 PERFORMANCE"
@ -105,14 +105,18 @@ such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
less with a DFA matching function, and in both cases there is not much less with a DFA matching function, and in both cases there is not much
difference for \eb. difference for \eb.
.P .P
When a pattern begins with .* not in parentheses, or in parentheses that are When a pattern begins with .* not in atomic parentheses, nor in parentheses
not the subject of a backreference, and the PCRE2_DOTALL option is set, the that are the subject of a backreference, and the PCRE2_DOTALL option is set,
pattern is implicitly anchored by PCRE2, since it can match only at the start the pattern is implicitly anchored by PCRE2, since it can match only at the
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make start of a subject string. If the pattern has multiple top-level branches, they
this optimization, because the dot metacharacter does not then match a newline, must all be anchorable. The optimization can be disabled by the
and if the subject string contains newlines, the pattern may match from the PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
character immediately following one of them instead of from the very start. For contains (*PRUNE) or (*SKIP).
example, the pattern .P
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
dot metacharacter does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the pattern
.sp .sp
.*second .*second
.sp .sp
@ -173,6 +177,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 20 October 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00" .TH PCRE2SYNTAX 3 "02 January 2015" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -389,6 +389,7 @@ appear.
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
@ -536,6 +537,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 23 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00" .TH PCRE2TEST 1 "02 January 2015" "PCRE 10.00"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -247,7 +247,7 @@ checked for compatibility with the \fBperltest.sh\fP script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to \fBpcre2test\fP, and should not be used in of the modifiers are specific to \fBpcre2test\fP, and should not be used in
test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file. command helps detect tests that are accidentally put in the wrong file.
.sp .sp
#subject <modifier-list> #subject <modifier-list>
@ -413,6 +413,7 @@ for a description of their effects.
never_utf set PCRE2_NEVER_UTF never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP ucp set PCRE2_UCP
@ -552,7 +553,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
setting the size of the JIT stack. setting the size of the JIT stack.
.P .P
If the \fBjitfast\fP modifier is specified, matching is done using the JIT If the \fBjitfast\fP modifier is specified, matching is done using the JIT
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity "fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
checks that are done by \fBpcre2_match()\fP, and of course does not work when checks that are done by \fBpcre2_match()\fP, and of course does not work when
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
assumed. assumed.
@ -1274,6 +1275,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 23 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -402,6 +402,7 @@ PATTERN MODIFIERS
never_utf set PCRE2_NEVER_UTF never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP ucp set PCRE2_UCP
@ -1185,5 +1186,5 @@ AUTHOR
REVISION REVISION
Last updated: 23 November 2014 Last updated: 02 January 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be /* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions. #included by applications that call PCRE2 functions.
Copyright (c) 2014 University of Cambridge Copyright (c) 2015 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -113,10 +113,11 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_NEVER_UTF 0x00001000u /* C */ #define PCRE2_NEVER_UTF 0x00001000u /* C */
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */ #define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */ #define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */ #define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u /* C */
#define PCRE2_UCP 0x00010000u /* C J M D */ #define PCRE2_NO_START_OPTIMIZE 0x00010000u /* J M D */
#define PCRE2_UNGREEDY 0x00020000u /* C */ #define PCRE2_UCP 0x00020000u /* C J M D */
#define PCRE2_UTF 0x00040000u /* C J M D */ #define PCRE2_UNGREEDY 0x00040000u /* C */
#define PCRE2_UTF 0x00080000u /* C J M D */
/* These are for pcre2_jit_compile(). */ /* These are for pcre2_jit_compile(). */


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2014 University of Cambridge New API code Copyright (c) 2015 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -557,8 +557,8 @@ static PCRE2_SPTR posix_substitutes[] = {
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \ PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \ PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \ PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \ PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF) PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
/* Compile time error code numbers. They are given names so that they can more /* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and easily be tracked. When a new number is added, the tables called eint1 and
@ -597,22 +597,23 @@ typedef struct pso {
/* NB: STRING_UTFn_RIGHTPAR contains the length as well */ /* NB: STRING_UTFn_RIGHTPAR contains the length as well */
static pso pso_list[] = { static pso pso_list[] = {
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF }, { (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF }, { (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP }, { (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET }, { (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR,17, PSO_FLG, PCRE2_NE_ATST_SET }, { (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS }, { (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE }, { (uint8_t *)STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 }, { (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 }, { (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR }, { (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF }, { (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF }, { (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY }, { (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF }, { (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF }, { (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE } { (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
}; };
@ -7020,13 +7021,14 @@ do {
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and /* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
it isn't in brackets that are or may be referenced or inside an atomic it isn't in brackets that are or may be referenced or inside an atomic
group. */ group. There is also an option that disables auto-anchoring. */
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR || else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
op == OP_TYPEPOSSTAR)) op == OP_TYPEPOSSTAR))
{ {
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 || if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip) atomcount > 0 || cb->had_pruneorskip ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
return FALSE; return FALSE;
} }
@ -7140,12 +7142,13 @@ do {
brackets that may be referenced, as long as the pattern does not contain brackets that may be referenced, as long as the pattern does not contain
*PRUNE or *SKIP, because these break the feature. Consider, for example, *PRUNE or *SKIP, because these break the feature. Consider, for example,
/.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the /.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
start of a line. */ start of a line. There is also an option that disables this optimization. */
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR) else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
{ {
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 || if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip) atomcount > 0 || cb->had_pruneorskip ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
return FALSE; return FALSE;
} }
@ -7863,7 +7866,8 @@ if (errorcode != 0)
/* Successful compile. If the anchored option was not passed, set it if /* Successful compile. If the anchored option was not passed, set it if
we can determine that the pattern is anchored by virtue of ^ characters or \A we can determine that the pattern is anchored by virtue of ^ characters or \A
or anything else, such as starting with non-atomic .* when DOTALL is set and or anything else, such as starting with non-atomic .* when DOTALL is set and
there are no occurrences of *PRUNE or *SKIP. */ there are no occurrences of *PRUNE or *SKIP (though there is an option to
disable this case). */
if ((re->overall_options & PCRE2_ANCHORED) == 0 && if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
is_anchored(codestart, 0, &cb, 0)) is_anchored(codestart, 0, &cb, 0))
@ -7912,7 +7916,8 @@ if ((re->overall_options & (PCRE2_ANCHORED|PCRE2_NO_START_OPTIMIZE)) == 0)
/* When there is no first code unit, see if we can set the PCRE2_STARTLINE /* When there is no first code unit, see if we can set the PCRE2_STARTLINE
flag. This is helpful for multiline matches when all branches start with ^ flag. This is helpful for multiline matches when all branches start with ^
and also when all branches start with non-atomic .* for non-DOTALL matches and also when all branches start with non-atomic .* for non-DOTALL matches
when *PRUNE and SKIP are not present. */ when *PRUNE and SKIP are not present. (There is an option that disables this
case.) */
else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE; else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
} }


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2014 University of Cambridge New API code Copyright (c) 2015 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -904,6 +904,7 @@ a positive value. */
#define STRING_UTF_RIGHTPAR "UTF)" #define STRING_UTF_RIGHTPAR "UTF)"
#define STRING_UCP_RIGHTPAR "UCP)" #define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)" #define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR "NO_DOTSTAR_ANCHOR)"
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)" #define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
#define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)" #define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)" #define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
@ -1173,6 +1174,7 @@ only. */
#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS #define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS #define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS #define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_D STR_O STR_T STR_S STR_T STR_A STR_R STR_UNDERSCORE STR_A STR_N STR_C STR_H STR_O STR_R STR_RIGHT_PARENTHESIS
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS #define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
#define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS #define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS #define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS


@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
Written by Philip Hazel Written by Philip Hazel
Original code Copyright (c) 1997-2012 University of Cambridge Original code Copyright (c) 1997-2012 University of Cambridge
Rewritten code Copyright (c) 2014 University of Cambridge Rewritten code Copyright (c) 2015 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
@ -498,6 +498,7 @@ static modstruct modlist[] = {
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) }, { "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) }, { "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) }, { "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
{ "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) }, { "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) }, { "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) }, { "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
@ -3291,29 +3292,30 @@ static void
show_compile_options(uint32_t options, const char *before, const char *after) show_compile_options(uint32_t options, const char *before, const char *after)
{ {
if (options == 0) fprintf(outfile, "%s <none>%s", before, after); if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before, before,
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
((options & PCRE2_UTF) != 0)? " utf" : "",
((options & PCRE2_UCP) != 0)? " ucp" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "", ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "", ((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "", ((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "", ((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "", ((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "", ((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)? " no_dotstar_anchor" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
((options & PCRE2_UCP) != 0)? " ucp" : "",
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE2_UTF) != 0)? " utf" : "",
after); after);
} }

16
testdata/testinput2 vendored

@ -4100,4 +4100,20 @@ a random value. /Ix
/a(b)c(d)/ /a(b)c(d)/
abc\=ph,copy=0,copy=1,getall abc\=ph,copy=0,copy=1,getall
/^abc/info
/^abc/info,no_dotstar_anchor
/.*\d/info,auto_callout
aaa
/.*\d/info,no_dotstar_anchor,auto_callout
aaa
/.*\d/dotall,info
/.*\d/dotall,no_dotstar_anchor,info
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
# End of testinput2 # End of testinput2

79
testdata/testoutput2 vendored

@ -5361,7 +5361,7 @@ No match
"<(\w+)/?>(.)*</(\1)>"Igms "<(\w+)/?>(.)*</(\1)>"Igms
Capturing subpattern count = 3 Capturing subpattern count = 3
Max back reference = 1 Max back reference = 1
Options: multiline dotall Options: dotall multiline
First code unit = '<' First code unit = '<'
Last code unit = '>' Last code unit = '>'
Subject length lower bound = 7 Subject length lower bound = 7
@ -5399,7 +5399,7 @@ No match
/line\nbreak/Im,firstline /line\nbreak/Im,firstline
Capturing subpattern count = 0 Capturing subpattern count = 0
Contains explicit CR or LF match Contains explicit CR or LF match
Options: multiline firstline Options: firstline multiline
First code unit = 'l' First code unit = 'l'
Last code unit = 'k' Last code unit = 'k'
Subject length lower bound = 10 Subject length lower bound = 10
@ -9698,7 +9698,7 @@ Subject length lower bound = 41
/Iisx /Iisx
Capturing subpattern count = 3 Capturing subpattern count = 3
Max back reference = 1 Max back reference = 1
Options: caseless extended dotall Options: caseless dotall extended
First code unit = '<' First code unit = '<'
Last code unit = '=' Last code unit = '='
Subject length lower bound = 9 Subject length lower bound = 9
@ -9747,7 +9747,7 @@ Named capturing subpatterns:
quote 4 quote 4
realquote 3 realquote 3
realquote 6 realquote 6
Options: extended dupnames Options: dupnames extended
Starting code units: a b Starting code units: a b
Subject length lower bound = 3 Subject length lower bound = 3
a"aaaaa a"aaaaa
@ -9805,8 +9805,8 @@ Capturing subpattern count = 4
Named capturing subpatterns: Named capturing subpatterns:
D 4 D 4
D 1 D 1
Compile options: extended dupnames Compile options: dupnames extended
Overall options: anchored extended dupnames Overall options: anchored dupnames extended
Subject length lower bound = 2 Subject length lower bound = 2
abcdX abcdX
0: abcdX 0: abcdX
@ -9852,7 +9852,7 @@ Capturing subpattern count = 4
Named capturing subpatterns: Named capturing subpatterns:
A 1 A 1
A 4 A 4
Options: extended dupnames Options: dupnames extended
First code unit = 'a' First code unit = 'a'
Last code unit = 'd' Last code unit = 'd'
Subject length lower bound = 4 Subject length lower bound = 4
@ -9936,7 +9936,7 @@ No match
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames /(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
Capturing subpattern count = 3 Capturing subpattern count = 3
Max back reference = 3 Max back reference = 3
Options: dupnames alt_bsux allow_empty_class match_unset_backref Options: alt_bsux allow_empty_class dupnames match_unset_backref
Last code unit = 'a' Last code unit = 'a'
Subject length lower bound = 1 Subject length lower bound = 1
cat cat
@ -13769,4 +13769,67 @@ Partial match: abc
Copy substring 1 failed (-2): partial match Copy substring 1 failed (-2): partial match
get substring list failed (-2): partial match get substring list failed (-2): partial match
/^abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: anchored
Subject length lower bound = 3
/^abc/info,no_dotstar_anchor
Capturing subpattern count = 0
Compile options: no_dotstar_anchor
Overall options: anchored no_dotstar_anchor
Subject length lower bound = 3
/.*\d/info,auto_callout
Capturing subpattern count = 0
Options: auto_callout
First code unit at start or follows newline
Subject length lower bound = 1
aaa
--->aaa
+0 ^ .*
+2 ^ ^ \d
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
/.*\d/info,no_dotstar_anchor,auto_callout
Capturing subpattern count = 0
Options: auto_callout no_dotstar_anchor
Subject length lower bound = 1
aaa
--->aaa
+0 ^ .*
+2 ^ ^ \d
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
/.*\d/dotall,info
Capturing subpattern count = 0
Compile options: dotall
Overall options: anchored dotall
Subject length lower bound = 1
/.*\d/dotall,no_dotstar_anchor,info
Capturing subpattern count = 0
Options: dotall no_dotstar_anchor
Subject length lower bound = 1
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: dotall no_dotstar_anchor
Subject length lower bound = 1
# End of testinput2 # End of testinput2