Add PCRE2_NO_DOTSTAR_ANCHOR and revise documentation for .* optimizing.

Philip.Hazel 2015-01-02 17:09:16 +00:00
parent 019e115060
commit 5a18651441
24 changed files with 502 additions and 173 deletions

@ -58,4 +58,6 @@ matched against "abcd".
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
occur.
10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented.
****

@ -63,6 +63,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
(only relevant if PCRE2_UTF is set)

@ -1187,6 +1187,19 @@ use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
<pre>
PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
<pre>
PCRE2_NO_START_OPTIMIZE
</pre>
@ -1442,16 +1455,25 @@ compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
PCRE2_MULTILINE, and PCRE2_EXTENDED.
</P>
<P>
A pattern is automatically anchored by PCRE2 if all of its top-level
alternatives begin with one of the following:
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
the first significant item in every top-level branch is one of the following:
<pre>
^ unless PCRE2_MULTILINE is set
\A always
\G always
.* if PCRE2_DOTALL is set and there are no back references to the subpattern in which .* appears
.* sometimes - see below
</pre>
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
PCRE2_INFO_ALLOPTIONS.
When .* is the first significant item, anchoring is possible only when all the
following are true:
<pre>
.* is not in an atomic group
.* is not in a capturing group that is the subject of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
<pre>
PCRE2_INFO_BACKREFMAX
</pre>
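The auto-anchoring conditions listed in this hunk can be read as a small decision procedure. The sketch below is only a model of the documented rule, not PCRE2's implementation: every type and function name in it (`first_item`, `branch_info`, `pattern_info`, `branch_is_anchorable`, `pattern_is_auto_anchored`) is hypothetical and introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types: none of these names exist in PCRE2. */
typedef enum {
  FIRST_CARET,        /* ^  */
  FIRST_BACKSLASH_A,  /* \A */
  FIRST_BACKSLASH_G,  /* \G */
  FIRST_DOTSTAR,      /* .* */
  FIRST_OTHER         /* any other first significant item */
} first_item;

typedef struct {
  first_item first;       /* first significant item of this top-level branch */
  bool multiline;         /* PCRE2_MULTILINE in force for a leading ^ */
  bool dotall;            /* PCRE2_DOTALL in force for a leading .* */
  bool in_atomic_group;   /* leading .* sits inside an atomic group */
  bool in_backref_group;  /* leading .* sits inside a back-referenced group */
} branch_info;

typedef struct {
  bool has_prune_or_skip; /* pattern contains (*PRUNE) or (*SKIP) */
  bool no_dotstar_anchor; /* PCRE2_NO_DOTSTAR_ANCHOR was passed */
} pattern_info;

/* One branch is anchorable if its first item is in the documented list
   and, for .*, all five documented conditions hold. */
static bool branch_is_anchorable(const branch_info *b, const pattern_info *p)
{
  switch (b->first) {
    case FIRST_BACKSLASH_A:
    case FIRST_BACKSLASH_G:
      return true;                       /* "always" */
    case FIRST_CARET:
      return !b->multiline;              /* "unless PCRE2_MULTILINE is set" */
    case FIRST_DOTSTAR:
      return b->dotall && !b->in_atomic_group && !b->in_backref_group &&
             !p->has_prune_or_skip && !p->no_dotstar_anchor;
    default:
      return false;
  }
}

/* The pattern is auto-anchored only if every top-level branch is anchorable. */
static bool pattern_is_auto_anchored(const branch_info *branches, size_t n,
                                     const pattern_info *p)
{
  assert(p != NULL && (branches != NULL || n == 0));
  if (n == 0) return false;
  for (size_t i = 0; i < n; i++) {
    if (!branch_is_anchorable(&branches[i], p)) return false;
  }
  return true;
}
```

In this model, a single branch starting with .* under PCRE2_DOTALL is anchored, and setting `no_dotstar_anchor` flips the result to false without touching the other fields. That is the behaviour the new option is documented to provide.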
@ -1480,21 +1502,10 @@ variable.
<P>
If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
if either
<br>
<br>
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
starts with "^", or
<br>
<br>
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
(if it were set, the pattern would be anchored),
<br>
<br>
2 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
it is known that a match can occur only at the start of the subject or
following a newline in the subject, 2 is returned. Otherwise, and for anchored
patterns, 0 is returned.
<pre>
PCRE2_INFO_FIRSTCODEUNIT
</pre>
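The value 2 described in this hunk says that a match can begin only at the start of the subject or just after a newline. The sketch below shows one way a caller or matcher might use that knowledge to skip impossible start offsets. `next_line_start` is a hypothetical helper, not a PCRE2 function. It assumes "\n" is the only newline sequence, which is a simplification, and it ignores the offset at the very end of the subject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Returns the first offset >= from
   that is either 0 or immediately follows a '\n' in subject. Returns len
   when no such offset exists before the end of the subject. */
static size_t next_line_start(const char *subject, size_t len, size_t from)
{
  assert(subject != NULL && from <= len);
  for (size_t i = from; i < len; i++) {
    if (i == 0 || subject[i - 1] == '\n') return i;
  }
  return len;
}
```

For the subject "ab\ncd", a search from offset 1 moves straight to offset 3, the character after the newline, rather than trying offsets 1 and 2.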
@ -2792,9 +2803,9 @@ Cambridge, England.
</P>
<br><a name="SEC37" href="#TOC1">REVISION</a><br>
<P>
Last updated: 22 December 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -82,6 +82,9 @@ You should be aware that, because of optimizations in the way PCRE2 compiles
and matches patterns, callouts sometimes do not happen exactly as you might
expect.
</P>
<br><b>
Auto-possessification
</b><br>
<P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
@ -111,6 +114,56 @@ case, the output changes to this:
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
again, repeatedly, until a+ itself fails.
</P>
<br><b>
Automatic .* anchoring
</b><br>
<P>
By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
if .* is in an atomic group or if there is a back reference to the capturing
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
or (*SKIP). However, the presence of callouts does not affect it.
</P>
<P>
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the <b>pcre2test</b> output is:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows that all match attempts start at the beginning of the subject. In
other words, the pattern is anchored. You can disable this optimization by
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
</P>
<P>
If a pattern has more than one top-level branch, automatic anchoring occurs if
all branches are anchorable.
</P>
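The two traces above differ only in where match attempts begin. As a rough model, assuming nothing about PCRE2's internals, the start offsets can be listed as below. `list_start_offsets` is a hypothetical name. The model leaves out the attempt at the end of the subject, mirroring the second trace, where another optimization removes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not PCRE2 code. Writes the start offsets that would
   be tried into out (at most out_cap of them) and returns how many were
   written. An anchored pattern is tried only at offset 0. An unanchored
   pattern is tried at each offset up to the last subject character. */
static size_t list_start_offsets(size_t subject_len, bool anchored,
                                 size_t *out, size_t out_cap)
{
  size_t last = anchored ? 0 : (subject_len > 0 ? subject_len - 1 : 0);
  size_t n = 0;
  assert(out != NULL);
  for (size_t off = 0; off <= last && n < out_cap; off++) {
    out[n++] = off;
  }
  return n;
}
```

For the two-character subject "aa" this gives one start offset when the pattern is anchored and two when it is not. That matches the single `+0 ^ .*` line in the first trace and the two such lines in the second.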
<br><b>
Other optimizations
</b><br>
<P>
Other optimizations that provide fast "no match" results also affect callouts.
For example, if the pattern is
@ -254,9 +307,9 @@ Cambridge, England.
</P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P>
Last updated: 25 November 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -151,6 +151,17 @@ reaching "no match" results. For more details, see the
documentation.
</P>
<br><b>
Disabling automatic anchoring
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Setting match and recursion limits
</b><br>
<P>
@ -1841,7 +1852,8 @@ one succeeds. Consider this pattern:
(?&#62;.*?a)b
</pre>
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disable this optimization.
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
</P>
<P>
When a capturing subpattern is repeated, the value captured is the substring
@ -3236,9 +3248,9 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 14 November 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -115,14 +115,19 @@ less with a DFA matching function, and in both cases there is not much
difference for \b.
</P>
<P>
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
pattern is implicitly anchored by PCRE2, since it can match only at the start
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
this optimization, because the dot metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may match from the
character immediately following one of them instead of from the very start. For
example, the pattern
When a pattern begins with .* not in atomic parentheses, nor in parentheses
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
the pattern is implicitly anchored by PCRE2, since it can match only at the
start of a subject string. If the pattern has multiple top-level branches, they
must all be anchorable. The optimization can be disabled by the
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
contains (*PRUNE) or (*SKIP).
</P>
<P>
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
dot metacharacter does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the pattern
<pre>
.*second
</pre>
@ -187,9 +192,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 20 October 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -416,6 +416,7 @@ appear.
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
@ -553,9 +554,9 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 November 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -291,7 +291,7 @@ checked for compatibility with the <b>perltest.sh</b> script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to <b>pcre2test</b>, and should not be used in
test files that are also processed by <b>perltest.sh</b>. The \fP#perltest\fB
test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
command helps detect tests that are accidentally put in the wrong file.
<pre>
#subject &#60;modifier-list&#62;
@ -454,6 +454,7 @@ for a description of their effects.
never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
@ -596,7 +597,7 @@ setting the size of the JIT stack.
</P>
<P>
If the <b>jitfast</b> modifier is specified, matching is done using the JIT
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
checks that are done by <b>pcre2_match()</b>, and of course does not work when
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
assumed.
@ -1309,9 +1310,9 @@ Cambridge, England.
</P>
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
<P>
Last updated: 23 November 2014
Last updated: 02 January 2015
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

@ -1226,6 +1226,20 @@ COMPILING A PATTERN
a full unoptimized search and run all the callouts, but it is mainly
provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when
.* is the first significant item in a top-level branch of a pattern,
and all the other branches also start with .* or with \A or \G or ^.
The optimization is automatically disabled for .* if it is inside an
atomic group or a capturing group that is the subject of a back refer-
ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
mization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
for any ^ items. Otherwise, the fact that any match must start either
at the start of the subject or following a newline is remembered. Like
other optimizations, this can cause callouts to be skipped.
PCRE2_NO_START_OPTIMIZE
This is an option whose main effect is at matching time. It does not
@ -1465,17 +1479,27 @@ INFORMATION ABOUT A COMPILED PATTERN
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
PCRE2_EXTENDED.
A pattern is automatically anchored by PCRE2 if all of its top-level
alternatives begin with one of the following:
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
PCRE2 if the first significant item in every top-level branch is one of
the following:
^ unless PCRE2_MULTILINE is set
\A always
\G always
.* if PCRE2_DOTALL is set and there are no back
references to the subpattern in which .* appears
.* sometimes - see below
For such patterns, the PCRE2_ANCHORED bit is set in the options
returned for PCRE2_INFO_ALLOPTIONS.
When .* is the first significant item, anchoring is possible only when
all the following are true:
.* is not in an atomic group
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
the options returned for PCRE2_INFO_ALLOPTIONS.
PCRE2_INFO_BACKREFMAX
@ -1504,17 +1528,9 @@ INFORMATION ABOUT A COMPILED PATTERN
If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the character
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
fixed first value, and if either
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is
not set (if it were set, the pattern would be anchored),
2 is returned, indicating that the pattern matches only at the start of
a subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
fixed first value, but it is known that a match can occur only at the
start of the subject or following a newline in the subject, 2 is
returned. Otherwise, and for anchored patterns, 0 is returned.
PCRE2_INFO_FIRSTCODEUNIT
@ -2726,8 +2742,8 @@ AUTHOR
REVISION
Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------
@ -3251,6 +3267,8 @@ MISSING CALLOUTS
compiles and matches patterns, callouts sometimes do not happen exactly
as you might expect.
Auto-possessification
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
that what follows cannot be part of the repeat. For example, a+[bc] is
compiled as if it were a++[bc]. The pcre2test output when this pattern
@ -3279,6 +3297,53 @@ MISSING CALLOUTS
This time, when matching [bc] fails, the matcher backtracks into a+ and
tries again, repeatedly, until a+ itself fails.
Automatic .* anchoring
By default, an optimization is applied when .* is the first significant
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
any character, the pattern is automatically anchored. If PCRE2_DOTALL
is not set, a match can start only after an internal newline or at the
beginning of the subject, and pcre2_compile() remembers this. This
optimization is disabled, however, if .* is in an atomic group or if
there is a back reference to the capturing group in which it appears.
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
ever, the presence of callouts does not affect it.
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
and applied to the string "aa", the pcre2test output is:
--->aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
This shows that all match attempts start at the beginning of the sub-
ject. In other words, the pattern is anchored. You can disable this
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
put changes to:
--->aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
This shows more match attempts, starting at the second subject charac-
ter. Another optimization, described in the next section, means that
there is no subsequent attempt to match with an empty subject.
If a pattern has more than one top-level branch, automatic anchoring
occurs if all branches are anchorable.
Other optimizations
Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is
@ -3410,8 +3475,8 @@ AUTHOR
REVISION
Last updated: 25 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -51,6 +51,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
(only relevant if PCRE2_UTF is set)

@ -1,4 +1,4 @@
.TH PCRE2API 3 "22 December 2014" "PCRE2 10.00"
.TH PCRE2API 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -1163,6 +1163,19 @@ use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
.sp
PCRE2_NO_DOTSTAR_ANCHOR
.sp
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \eA or \eG or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
.sp
PCRE2_NO_START_OPTIMIZE
.sp
@ -1436,18 +1449,27 @@ force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
PCRE2_MULTILINE, and PCRE2_EXTENDED.
.P
A pattern is automatically anchored by PCRE2 if all of its top-level
alternatives begin with one of the following:
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
the first significant item in every top-level branch is one of the following:
.sp
^ unless PCRE2_MULTILINE is set
\eA always
\eG always
.\" JOIN
.* if PCRE2_DOTALL is set and there are no back
references to the subpattern in which .* appears
.* sometimes - see below
.sp
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
PCRE2_INFO_ALLOPTIONS.
When .* is the first significant item, anchoring is possible only when all the
following are true:
.sp
.* is not in an atomic group
.\" JOIN
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
PCRE2_NO_DOTSTAR_ANCHOR is not set.
.sp
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
.sp
PCRE2_INFO_BACKREFMAX
.sp
@ -1475,18 +1497,10 @@ variable.
.P
If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
if either
.sp
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
starts with "^", or
.sp
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
(if it were set, the pattern would be anchored),
.sp
2 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise 0 is
returned. For anchored patterns, 0 is returned.
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
it is known that a match can occur only at the start of the subject or
following a newline in the subject, 2 is returned. Otherwise, and for anchored
patterns, 0 is returned.
.sp
PCRE2_INFO_FIRSTCODEUNIT
.sp
@ -2835,6 +2849,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "25 November 2014" "PCRE2 10.00"
.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -65,7 +65,11 @@ particular pattern.
You should be aware that, because of optimizations in the way PCRE2 compiles
and matches patterns, callouts sometimes do not happen exactly as you might
expect.
.P
.
.
.SS "Auto-possessification"
.rs
.sp
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
@ -93,7 +97,56 @@ case, the output changes to this:
.sp
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
again, repeatedly, until a+ itself fails.
.
.
.SS "Automatic .* anchoring"
.rs
.sp
By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
if .* is in an atomic group or if there is a back reference to the capturing
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
or (*SKIP). However, the presence of callouts does not affect it.
.P
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the \fBpcre2test\fP output is:
.sp
--->aa
+0 ^ .*
+2 ^ ^ \ed
+2 ^^ \ed
+2 ^ \ed
No match
.sp
This shows that all match attempts start at the beginning of the subject. In
other words, the pattern is anchored. You can disable this optimization by
passing PCRE2_NO_DOTSTAR_ANCHOR to \fBpcre2_compile()\fP, or starting the
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
.sp
--->aa
+0 ^ .*
+2 ^ ^ \ed
+2 ^^ \ed
+2 ^ \ed
+0 ^ .*
+2 ^^ \ed
+2 ^ \ed
No match
.sp
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
.P
If a pattern has more than one top-level branch, automatic anchoring occurs if
all branches are anchorable.
.
.
.SS "Other optimizations"
.rs
.sp
Other optimizations that provide fast "no match" results also affect callouts.
For example, if the pattern is
.sp
@ -232,6 +285,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 25 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -117,6 +117,19 @@ reaching "no match" results. For more details, see the
documentation.
.
.
.SS "Disabling automatic anchoring"
.rs
.sp
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.
.
.SS "Setting match and recursion limits"
.rs
.sp
@ -1853,7 +1866,8 @@ one succeeds. Consider this pattern:
(?>.*?a)b
.sp
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disable this optimization.
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
.P
When a capturing subpattern is repeated, the value captured is the substring
that matched the final iteration. For example, after
@ -3278,6 +3292,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 14 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00"
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE"
@ -105,14 +105,18 @@ such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
less with a DFA matching function, and in both cases there is not much
difference for \eb.
.P
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
pattern is implicitly anchored by PCRE2, since it can match only at the start
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
this optimization, because the dot metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may match from the
character immediately following one of them instead of from the very start. For
example, the pattern
When a pattern begins with .* not in atomic parentheses, nor in parentheses
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
the pattern is implicitly anchored by PCRE2, since it can match only at the
start of a subject string. If the pattern has multiple top-level branches, they
must all be anchorable. The optimization can be disabled by the
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
contains (*PRUNE) or (*SKIP).
.P
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
dot metacharacter does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the pattern
.sp
.*second
.sp
@ -173,6 +177,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
.TH PCRE2SYNTAX 3 "02 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -389,6 +389,7 @@ appear.
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
@ -536,6 +537,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
.TH PCRE2TEST 1 "02 January 2015" "PCRE 10.00"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -247,7 +247,7 @@ checked for compatibility with the \fBperltest.sh\fP script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB
test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
command helps detect tests that are accidentally put in the wrong file.
.sp
#subject <modifier-list>
@ -413,6 +413,7 @@ for a description of their effects.
never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
@ -552,7 +553,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
setting the size of the JIT stack.
.P
If the \fBjitfast\fP modifier is specified, matching is done using the JIT
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
checks that are done by \fBpcre2_match()\fP, and of course does not work when
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
assumed.
@ -1274,6 +1275,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

@ -402,6 +402,7 @@ PATTERN MODIFIERS
never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
@ -1185,5 +1186,5 @@ AUTHOR
REVISION
Last updated: 23 November 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 02 January 2015
Copyright (c) 1997-2015 University of Cambridge.


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, second API, to be
#included by applications that call PCRE2 functions.
Copyright (c) 2014 University of Cambridge
Copyright (c) 2015 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -113,10 +113,11 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_NEVER_UTF 0x00001000u /* C */
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */
#define PCRE2_UCP 0x00010000u /* C J M D */
#define PCRE2_UNGREEDY 0x00020000u /* C */
#define PCRE2_UTF 0x00040000u /* C J M D */
#define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u /* C */
#define PCRE2_NO_START_OPTIMIZE 0x00010000u /* J M D */
#define PCRE2_UCP 0x00020000u /* C J M D */
#define PCRE2_UNGREEDY 0x00040000u /* C */
#define PCRE2_UTF 0x00080000u /* C J M D */
/* These are for pcre2_jit_compile(). */
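
Note on the renumbering in this hunk: inserting PCRE2_NO_DOTSTAR_ANCHOR at 0x00008000u moves PCRE2_NO_START_OPTIMIZE, PCRE2_UCP, PCRE2_UNGREEDY and PCRE2_UTF up by one bit each, so callers built against the earlier header would presumably need recompiling. The following is a minimal sketch, not part of the commit, that checks the new values are single bits and do not collide. The values are copied from the hunk rather than taken from pcre2.h, and the names `ex_new_bits`, `ex_is_single_bit` and `ex_new_bits_are_disjoint` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* New values copied from the hunk above. This sketch does not include
   pcre2.h; the table is local to the example. */
static const uint32_t ex_new_bits[] = {
  0x00008000u,  /* PCRE2_NO_DOTSTAR_ANCHOR */
  0x00010000u,  /* PCRE2_NO_START_OPTIMIZE */
  0x00020000u,  /* PCRE2_UCP */
  0x00040000u,  /* PCRE2_UNGREEDY */
  0x00080000u   /* PCRE2_UTF */
};

/* Returns 1 if exactly one bit is set in v, otherwise 0. */
int ex_is_single_bit(uint32_t v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

/* Returns 1 if every table entry is a single bit and no two entries
   share a bit, otherwise 0. */
int ex_new_bits_are_disjoint(void)
{
  uint32_t seen = 0;
  for (size_t i = 0; i < sizeof(ex_new_bits) / sizeof(ex_new_bits[0]); i++)
    {
    if (!ex_is_single_bit(ex_new_bits[i]) || (seen & ex_new_bits[i]) != 0)
      return 0;
    seen |= ex_new_bits[i];
    }
  return 1;
}
```

A check of this kind could be run once after any future insertion into the option list.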


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2014 University of Cambridge
New API code Copyright (c) 2015 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -557,8 +557,8 @@ static PCRE2_SPTR posix_substitutes[] = {
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \
PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
/* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and
@ -597,22 +597,23 @@ typedef struct pso {
/* NB: STRING_UTFn_RIGHTPAR contains the length as well */
static pso pso_list[] = {
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR,17, PSO_FLG, PCRE2_NE_ATST_SET },
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
{ (uint8_t *)STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
};
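
The table above pairs each pattern-start option name with its length and the bit it sets. Below is a rough sketch of how such a table can be searched; it assumes the lookup happens after a leading "(*", as in the test pattern /(*NO_DOTSTAR_ANCHOR)(?s).*\d/. It is not the code in pcre2_compile.c: the `ex_pso` type, the three-entry table and `ex_find_start_option` are hypothetical, and only the lengths and option bits are taken from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down version of a pattern-start option table. */
typedef struct ex_pso {
  const char *name;    /* text after "(*", including the closing ")" */
  size_t      length;  /* strlen(name), stored as in the original table */
  uint32_t    option;  /* option bit, values as in the header hunk */
} ex_pso;

static const ex_pso ex_pso_list[] = {
  { "NO_AUTO_POSSESS)",   16, 0x00004000u },
  { "NO_DOTSTAR_ANCHOR)", 18, 0x00008000u },
  { "NO_START_OPT)",      13, 0x00010000u }
};

/* If the pattern starts with "(*" followed by a listed name, return the
   matching table entry; otherwise return NULL. */
const ex_pso *ex_find_start_option(const char *pattern)
{
  if (pattern[0] != '(' || pattern[1] != '*') return NULL;
  for (size_t i = 0; i < sizeof(ex_pso_list) / sizeof(ex_pso_list[0]); i++)
    {
    const ex_pso *p = &ex_pso_list[i];
    /* strncmp stops at the pattern's terminating NUL, so a short pattern
       cannot be read past its end. */
    if (strncmp(pattern + 2, p->name, p->length) == 0) return p;
    }
  return NULL;
}
```

Storing the length next to the name avoids a strlen() call per entry, which is presumably why the original table carries it.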
@ -7020,13 +7021,14 @@ do {
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
it isn't in brackets that are or may be referenced or inside an atomic
group. */
group. There is also an option that disables auto-anchoring. */
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
op == OP_TYPEPOSSTAR))
{
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip)
atomcount > 0 || cb->had_pruneorskip ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
return FALSE;
}
@ -7140,12 +7142,13 @@ do {
brackets that may be referenced, as long as the pattern does not contain
*PRUNE or *SKIP, because these break the feature. Consider, for example,
/.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
start of a line. */
start of a line. There is also an option that disables this optimization. */
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
{
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip)
atomcount > 0 || cb->had_pruneorskip ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
return FALSE;
}
@ -7863,7 +7866,8 @@ if (errorcode != 0)
/* Successful compile. If the anchored option was not passed, set it if
we can determine that the pattern is anchored by virtue of ^ characters or \A
or anything else, such as starting with non-atomic .* when DOTALL is set and
there are no occurrences of *PRUNE or *SKIP. */
there are no occurrences of *PRUNE or *SKIP (though there is an option to
disable this case). */
if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
is_anchored(codestart, 0, &cb, 0))
@ -7912,7 +7916,8 @@ if ((re->overall_options & (PCRE2_ANCHORED|PCRE2_NO_START_OPTIMIZE)) == 0)
/* When there is no first code unit, see if we can set the PCRE2_STARTLINE
flag. This is helpful for multiline matches when all branches start with ^
and also when all branches start with non-atomic .* for non-DOTALL matches
when *PRUNE and SKIP are not present. */
when *PRUNE and SKIP are not present. (There is an option that disables this
case.) */
else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
}
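
Taken together, the is_anchored() and is_startline() hunks above apply the same four exclusions and differ only in whether dot matches newline (OP_ALLANY against OP_ANY). A simplified model of that decision for a single-branch pattern whose first item is .* might look like the sketch below. It ignores alternation, ^, \A and \G; every name in it is hypothetical, and it illustrates the conditions shown in the diff rather than the library's code.

```c
#include <assert.h>
#include <stdint.h>

#define EX_NO_DOTSTAR_ANCHOR 0x00008000u  /* value from the header hunk */

typedef enum {
  EX_NONE,       /* no optimization: a match may start anywhere */
  EX_STARTLINE,  /* a match must start at the subject start or after a newline */
  EX_ANCHORED    /* the pattern is treated as anchored */
} ex_dotstar_result;

/* dotall            nonzero if dot matches newline (models OP_ALLANY)
   in_backref_group  nonzero if .* is inside a group that is back-referenced
   atomcount         nesting depth inside atomic groups
   had_pruneorskip   nonzero if the pattern contains (*PRUNE) or (*SKIP)
   options           compile options; only EX_NO_DOTSTAR_ANCHOR is examined */
ex_dotstar_result ex_classify_leading_dotstar(int dotall,
  int in_backref_group, int atomcount, int had_pruneorskip, uint32_t options)
{
  if (in_backref_group || atomcount > 0 || had_pruneorskip ||
      (options & EX_NO_DOTSTAR_ANCHOR) != 0)
    return EX_NONE;
  return dotall ? EX_ANCHORED : EX_STARTLINE;
}
```

This lines up with the new entries in testoutput2: /.*\d/ reports "First code unit at start or follows newline", the dotall variant reports "Overall options: anchored dotall", and with no_dotstar_anchor neither line appears.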


@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Original API code Copyright (c) 1997-2012 University of Cambridge
New API code Copyright (c) 2014 University of Cambridge
New API code Copyright (c) 2015 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -904,6 +904,7 @@ a positive value. */
#define STRING_UTF_RIGHTPAR "UTF)"
#define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR "NO_DOTSTAR_ANCHOR)"
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
#define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
@ -1173,6 +1174,7 @@ only. */
#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_D STR_O STR_T STR_S STR_T STR_A STR_R STR_UNDERSCORE STR_A STR_N STR_C STR_H STR_O STR_R STR_RIGHT_PARENTHESIS
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
#define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS


@ -445,7 +445,7 @@ match() only in the case when ovecsave is needed. (David Wheeler used to say
"All problems in computer science can be solved by another level of
indirection.")
HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
function is called only once, and only from within match(), gcc will "inline"
it - that is, move it inside match() - and this completely negates its reason
for existence. Therefore, we mark it as non-inline when gcc is in use.


@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
Written by Philip Hazel
Original code Copyright (c) 1997-2012 University of Cambridge
Rewritten code Copyright (c) 2014 University of Cambridge
Rewritten code Copyright (c) 2015 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -498,6 +498,7 @@ static modstruct modlist[] = {
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
{ "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
@ -3291,29 +3292,30 @@ static void
show_compile_options(uint32_t options, const char *before, const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
((options & PCRE2_UTF) != 0)? " utf" : "",
((options & PCRE2_UCP) != 0)? " ucp" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)? " no_dotstar_anchor" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
((options & PCRE2_UCP) != 0)? " ucp" : "",
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE2_UTF) != 0)? " utf" : "",
after);
}
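
With this many options, the single fprintf() above needs one more %s for each new flag, and the format string and the argument list have to be kept in step by hand. One alternative, shown here only as a sketch and not something this commit does, is to drive the output from a table. The names `ex_names` and `ex_format_options` are hypothetical, and the table lists only bits whose values appear in the header hunk of this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit values as in the header hunk; names as printed by pcre2test. */
static const struct { uint32_t bit; const char *name; } ex_names[] = {
  { 0x00004000u, " no_auto_possess" },
  { 0x00008000u, " no_dotstar_anchor" },
  { 0x00010000u, " no_start_optimize" },
  { 0x00020000u, " ucp" },
  { 0x00040000u, " ungreedy" },
  { 0x00080000u, " utf" }
};

/* Writes the names of the set bits into buf in table order, or " <none>"
   when no bit is set. Output is cut short if buf is too small.
   Returns buf. */
const char *ex_format_options(uint32_t options, char *buf, size_t size)
{
  size_t used = 0;
  if (size == 0) return buf;
  buf[0] = '\0';
  if (options == 0)
    {
    snprintf(buf, size, " <none>");
    return buf;
    }
  for (size_t i = 0; i < sizeof(ex_names) / sizeof(ex_names[0]); i++)
    {
    int n;
    if ((options & ex_names[i].bit) == 0) continue;
    n = snprintf(buf + used, size - used, "%s", ex_names[i].name);
    if (n < 0 || (size_t)n >= size - used) break;  /* error or truncated */
    used += (size_t)n;
    }
  return buf;
}
```

The table order fixes the print order, so keeping the table alphabetical would give the same kind of reordering that this hunk does by hand.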

testdata/testinput2

@ -4100,4 +4100,20 @@ a random value. /Ix
/a(b)c(d)/
abc\=ph,copy=0,copy=1,getall
/^abc/info
/^abc/info,no_dotstar_anchor
/.*\d/info,auto_callout
aaa
/.*\d/info,no_dotstar_anchor,auto_callout
aaa
/.*\d/dotall,info
/.*\d/dotall,no_dotstar_anchor,info
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
# End of testinput2

testdata/testoutput2

@ -5361,7 +5361,7 @@ No match
"<(\w+)/?>(.)*</(\1)>"Igms
Capturing subpattern count = 3
Max back reference = 1
Options: multiline dotall
Options: dotall multiline
First code unit = '<'
Last code unit = '>'
Subject length lower bound = 7
@ -5399,7 +5399,7 @@ No match
/line\nbreak/Im,firstline
Capturing subpattern count = 0
Contains explicit CR or LF match
Options: multiline firstline
Options: firstline multiline
First code unit = 'l'
Last code unit = 'k'
Subject length lower bound = 10
@ -9698,7 +9698,7 @@ Subject length lower bound = 41
/Iisx
Capturing subpattern count = 3
Max back reference = 1
Options: caseless extended dotall
Options: caseless dotall extended
First code unit = '<'
Last code unit = '='
Subject length lower bound = 9
@ -9747,7 +9747,7 @@ Named capturing subpatterns:
quote 4
realquote 3
realquote 6
Options: extended dupnames
Options: dupnames extended
Starting code units: a b
Subject length lower bound = 3
a"aaaaa
@ -9805,8 +9805,8 @@ Capturing subpattern count = 4
Named capturing subpatterns:
D 4
D 1
Compile options: extended dupnames
Overall options: anchored extended dupnames
Compile options: dupnames extended
Overall options: anchored dupnames extended
Subject length lower bound = 2
abcdX
0: abcdX
@ -9852,7 +9852,7 @@ Capturing subpattern count = 4
Named capturing subpatterns:
A 1
A 4
Options: extended dupnames
Options: dupnames extended
First code unit = 'a'
Last code unit = 'd'
Subject length lower bound = 4
@ -9936,7 +9936,7 @@ No match
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
Capturing subpattern count = 3
Max back reference = 3
Options: dupnames alt_bsux allow_empty_class match_unset_backref
Options: alt_bsux allow_empty_class dupnames match_unset_backref
Last code unit = 'a'
Subject length lower bound = 1
cat
@ -13769,4 +13769,67 @@ Partial match: abc
Copy substring 1 failed (-2): partial match
get substring list failed (-2): partial match
/^abc/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: anchored
Subject length lower bound = 3
/^abc/info,no_dotstar_anchor
Capturing subpattern count = 0
Compile options: no_dotstar_anchor
Overall options: anchored no_dotstar_anchor
Subject length lower bound = 3
/.*\d/info,auto_callout
Capturing subpattern count = 0
Options: auto_callout
First code unit at start or follows newline
Subject length lower bound = 1
aaa
--->aaa
+0 ^ .*
+2 ^ ^ \d
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
/.*\d/info,no_dotstar_anchor,auto_callout
Capturing subpattern count = 0
Options: auto_callout no_dotstar_anchor
Subject length lower bound = 1
aaa
--->aaa
+0 ^ .*
+2 ^ ^ \d
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
/.*\d/dotall,info
Capturing subpattern count = 0
Compile options: dotall
Overall options: anchored dotall
Subject length lower bound = 1
/.*\d/dotall,no_dotstar_anchor,info
Capturing subpattern count = 0
Options: dotall no_dotstar_anchor
Subject length lower bound = 1
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
Capturing subpattern count = 0
Compile options: <none>
Overall options: dotall no_dotstar_anchor
Subject length lower bound = 1
# End of testinput2
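
The two auto_callout traces for /.*\d/ against "aaa" show the effect of the option: with the optimization there is a single "+0" callout (one start position), and with no_dotstar_anchor there are three (one per start offset). A toy model of that count is sketched below. It assumes that only "\n" counts as a newline and that start offsets leaving fewer than `minlength` characters are skipped; the function name `ex_count_start_attempts` is hypothetical and the code is not taken from pcre2_match().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the start offsets tried under a simplified model.
   subject    NUL-terminated subject string
   minlength  the "Subject length lower bound" reported for the pattern
   startline  nonzero if matches may only begin at the subject start or
              immediately after a '\n' */
size_t ex_count_start_attempts(const char *subject, size_t minlength,
  int startline)
{
  size_t len = strlen(subject);
  size_t attempts = 0;
  for (size_t i = 0; i + minlength <= len; i++)
    {
    if (startline && i > 0 && subject[i - 1] != '\n') continue;
    attempts++;
    }
  return attempts;
}
```

For the subject "aaa" with a lower bound of 1, the model gives one attempt with `startline` set and three without, which agrees with the number of "+0" lines in the two traces.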