Add PCRE2_NO_DOTSTAR_ANCHOR and revise documentation for .* optimizing.
parent
019e115060
commit
5a18651441
|
@ -58,4 +58,6 @@ matched against "abcd".
|
||||||
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
|
(an odd thing to do, but it happened), SIGSEGV or other misbehaviour could
|
||||||
occur.
|
occur.
|
||||||
|
|
||||||
|
10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented.
|
||||||
|
|
||||||
****
|
****
|
||||||
|
|
|
@ -63,6 +63,7 @@ or provide an external function for stack size checking. The option bits are:
|
||||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||||
theses (named ones available)
|
theses (named ones available)
|
||||||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
|
||||||
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
||||||
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
||||||
(only relevant if PCRE2_UTF is set)
|
(only relevant if PCRE2_UTF is set)
|
||||||
|
|
|
@ -1187,6 +1187,19 @@ use, auto-possessification means that some callouts are never taken. You can
|
||||||
set this option if you want the matching functions to do a full unoptimized
|
set this option if you want the matching functions to do a full unoptimized
|
||||||
search and run all the callouts, but it is mainly provided for testing
|
search and run all the callouts, but it is mainly provided for testing
|
||||||
purposes.
|
purposes.
|
||||||
|
<pre>
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
|
</pre>
|
||||||
|
If this option is set, it disables an optimization that is applied when .* is
|
||||||
|
the first significant item in a top-level branch of a pattern, and all the
|
||||||
|
other branches also start with .* or with \A or \G or ^. The optimization is
|
||||||
|
automatically disabled for .* if it is inside an atomic group or a capturing
|
||||||
|
group that is the subject of a back reference, or if the pattern contains
|
||||||
|
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
|
||||||
|
automatically anchored if PCRE2_DOTALL is set for all the .* items and
|
||||||
|
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
|
||||||
|
must start either at the start of the subject or following a newline is
|
||||||
|
remembered. Like other optimizations, this can cause callouts to be skipped.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_NO_START_OPTIMIZE
|
PCRE2_NO_START_OPTIMIZE
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1442,16 +1455,25 @@ compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
||||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||||
alternatives begin with one of the following:
|
the first significant item in every top-level branch is one of the following:
|
||||||
<pre>
|
<pre>
|
||||||
^ unless PCRE2_MULTILINE is set
|
^ unless PCRE2_MULTILINE is set
|
||||||
\A always
|
\A always
|
||||||
\G always
|
\G always
|
||||||
.* if PCRE2_DOTALL is set and there are no back references to the subpattern in which .* appears
|
.* sometimes - see below
|
||||||
</pre>
|
</pre>
|
||||||
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
|
When .* is the first significant item, anchoring is possible only when all the
|
||||||
PCRE2_INFO_ALLOPTIONS.
|
following are true:
|
||||||
|
<pre>
|
||||||
|
.* is not in an atomic group
|
||||||
|
.* is not in a capturing group that is the subject of a back reference
|
||||||
|
PCRE2_DOTALL is in force for .*
|
||||||
|
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||||
|
</pre>
|
||||||
|
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
|
||||||
|
options returned for PCRE2_INFO_ALLOPTIONS.
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_BACKREFMAX
|
PCRE2_INFO_BACKREFMAX
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1480,21 +1502,10 @@ variable.
|
||||||
<P>
|
<P>
|
||||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
|
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||||
if either
|
it is known that a match can occur only at the start of the subject or
|
||||||
<br>
|
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||||
<br>
|
patterns, 0 is returned.
|
||||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
|
|
||||||
starts with "^", or
|
|
||||||
<br>
|
|
||||||
<br>
|
|
||||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
|
|
||||||
(if it were set, the pattern would be anchored),
|
|
||||||
<br>
|
|
||||||
<br>
|
|
||||||
2 is returned, indicating that the pattern matches only at the start of a
|
|
||||||
subject string or after any newline within the string. Otherwise 0 is
|
|
||||||
returned. For anchored patterns, 0 is returned.
|
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_INFO_FIRSTCODEUNIT
|
PCRE2_INFO_FIRSTCODEUNIT
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -2792,9 +2803,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC37" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC37" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 22 December 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -82,6 +82,9 @@ You should be aware that, because of optimizations in the way PCRE2 compiles
|
||||||
and matches patterns, callouts sometimes do not happen exactly as you might
|
and matches patterns, callouts sometimes do not happen exactly as you might
|
||||||
expect.
|
expect.
|
||||||
</P>
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Auto-possessification
|
||||||
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||||
|
@ -111,6 +114,56 @@ case, the output changes to this:
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||||
again, repeatedly, until a+ itself fails.
|
again, repeatedly, until a+ itself fails.
|
||||||
</P>
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Automatic .* anchoring
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
By default, an optimization is applied when .* is the first significant item in
|
||||||
|
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||||
|
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||||
|
start only after an internal newline or at the beginning of the subject, and
|
||||||
|
<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
|
||||||
|
if .* is in an atomic group or if there is a back reference to the capturing
|
||||||
|
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
||||||
|
or (*SKIP). However, the presence of callouts does not affect it.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
|
||||||
|
applied to the string "aa", the <b>pcre2test</b> output is:
|
||||||
|
<pre>
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
</pre>
|
||||||
|
This shows that all match attempts start at the beginning of the subject. In
|
||||||
|
other words, the pattern is anchored. You can disable this optimization by
|
||||||
|
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
|
||||||
|
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||||
|
<pre>
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
</pre>
|
||||||
|
This shows more match attempts, starting at the second subject character.
|
||||||
|
Another optimization, described in the next section, means that there is no
|
||||||
|
subsequent attempt to match with an empty subject.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
||||||
|
all branches are anchorable.
|
||||||
|
</P>
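
The reasoning behind automatic .* anchoring can be checked with a small sketch. It uses Python's re module purely as a stand-in: re has no counterpart to PCRE2_NO_DOTSTAR_ANCHOR and does not report callouts, so the helper names below are hypothetical and only enumerate the start offsets at which an attempt would succeed. The point it illustrates is that, with dot matching any character, an attempt at a later offset can succeed only if the attempt at offset 0 also succeeds, which is why the later attempts can be skipped.

```python
import re

# Stand-in for the pattern .*\d compiled with PCRE2_DOTALL.
# (Assumption: Python's re.DOTALL gives the dot the same "match anything" meaning.)
PATTERN = re.compile(r".*\d", re.DOTALL)


def matching_offsets(subject):
    """Return every start offset at which a match attempt succeeds."""
    return [
        pos
        for pos in range(len(subject) + 1)
        if PATTERN.match(subject, pos) is not None
    ]


def later_attempts_are_redundant(subject):
    """True if no offset succeeds unless offset 0 succeeds as well."""
    offsets = matching_offsets(subject)
    return not offsets or offsets[0] == 0


if __name__ == "__main__":
    for subject in ["aa", "a1\nb", "\n\n7", ""]:
        print(repr(subject), matching_offsets(subject),
              later_attempts_are_redundant(subject))
```

For the subject "aa" no offset succeeds at all, which corresponds to the "No match" result in both traces above; the unanchored trace simply spends extra attempts confirming it.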
|
||||||
|
<br><b>
|
||||||
|
Other optimizations
|
||||||
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Other optimizations that provide fast "no match" results also affect callouts.
|
Other optimizations that provide fast "no match" results also affect callouts.
|
||||||
For example, if the pattern is
|
For example, if the pattern is
|
||||||
|
@ -254,9 +307,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 25 November 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -151,6 +151,17 @@ reaching "no match" results. For more details, see the
|
||||||
documentation.
|
documentation.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
Disabling automatic anchoring
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
|
||||||
|
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
|
||||||
|
apply to patterns whose top-level branches all start with .* (match any number
|
||||||
|
of arbitrary characters). For more details, see the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
Setting match and recursion limits
|
Setting match and recursion limits
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -1841,7 +1852,8 @@ one succeeds. Consider this pattern:
|
||||||
(?>.*?a)b
|
(?>.*?a)b
|
||||||
</pre>
|
</pre>
|
||||||
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
||||||
(*PRUNE) and (*SKIP) also disable this optimization.
|
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
|
||||||
</P>
|
</P>
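
Why a back reference to the group containing .* rules out implicit anchoring can be seen in a short sketch. It uses Python's re module as a stand-in for PCRE2, and the pattern and subject are chosen only for illustration. With the dot matching any character, the leftmost match still starts at the fourth character, so treating the pattern as anchored would wrongly report no match.

```python
import re

SUBJECT = "xyz123abc123"

# .* is inside a capturing group that \1 refers back to.
PATTERN = re.compile(r"(.*)abc\1", re.DOTALL)

# Unanchored search: the first successful start offset is 3, not 0.
found = PATTERN.search(SUBJECT)

# Anchored attempt at offset 0 only: this fails, because the only text
# the group can capture from offset 0 is "xyz123", which does not recur.
anchored = PATTERN.match(SUBJECT)

if __name__ == "__main__":
    print(found.start(), found.group(1))  # 3 123
    print(anchored)                       # None
```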
|
||||||
<P>
|
<P>
|
||||||
When a capturing subpattern is repeated, the value captured is the substring
|
When a capturing subpattern is repeated, the value captured is the substring
|
||||||
|
@ -3236,9 +3248,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 14 November 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -115,14 +115,19 @@ less with a DFA matching function, and in both cases there is not much
|
||||||
difference for \b.
|
difference for \b.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
When a pattern begins with .* not in atomic parentheses, nor in parentheses
|
||||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
|
||||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
the pattern is implicitly anchored by PCRE2, since it can match only at the
|
||||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
start of a subject string. If the pattern has multiple top-level branches, they
|
||||||
this optimization, because the dot metacharacter does not then match a newline,
|
must all be anchorable. The optimization can be disabled by the
|
||||||
and if the subject string contains newlines, the pattern may match from the
|
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
|
||||||
character immediately following one of them instead of from the very start. For
|
contains (*PRUNE) or (*SKIP).
|
||||||
example, the pattern
|
</P>
|
||||||
|
<P>
|
||||||
|
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
|
||||||
|
dot metacharacter does not then match a newline, and if the subject string
|
||||||
|
contains newlines, the pattern may match from the character immediately
|
||||||
|
following one of them instead of from the very start. For example, the pattern
|
||||||
<pre>
|
<pre>
|
||||||
.*second
|
.*second
|
||||||
</pre>
|
</pre>
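
The effect of the dot not matching a newline can be reproduced with a short sketch. Python's re module stands in for PCRE2 here, and the subject string is an assumption chosen for illustration. Without DOTALL the match starts just after the newline; with DOTALL it starts at the very beginning, which is the case in which implicit anchoring is safe.

```python
import re

# Illustrative subject containing one internal newline.
SUBJECT = "first\nand second"

# Dot does not match newline: the match cannot begin on the first line.
no_dotall = re.search(r".*second", SUBJECT)

# Dot matches any character: the match begins at offset 0.
with_dotall = re.search(r".*second", SUBJECT, re.DOTALL)

if __name__ == "__main__":
    print(no_dotall.start())    # 6, the character after the newline
    print(with_dotall.start())  # 0
```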
|
||||||
|
@ -187,9 +192,9 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 20 October 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -416,6 +416,7 @@ appear.
|
||||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||||
|
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
|
||||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||||
(*UTF) set appropriate UTF mode for the library in use
|
(*UTF) set appropriate UTF mode for the library in use
|
||||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||||
|
@ -553,9 +554,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 November 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -291,7 +291,7 @@ checked for compatibility with the <b>perltest.sh</b> script, which is used to
|
||||||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||||
lines, none of the other command lines are permitted, because they and many
|
lines, none of the other command lines are permitted, because they and many
|
||||||
of the modifiers are specific to <b>pcre2test</b>, and should not be used in
|
of the modifiers are specific to <b>pcre2test</b>, and should not be used in
|
||||||
test files that are also processed by <b>perltest.sh</b>. The \fP#perltest\fB
|
test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
|
||||||
command helps detect tests that are accidentally put in the wrong file.
|
command helps detect tests that are accidentally put in the wrong file.
|
||||||
<pre>
|
<pre>
|
||||||
#subject <modifier-list>
|
#subject <modifier-list>
|
||||||
|
@ -454,6 +454,7 @@ for a description of their effects.
|
||||||
never_utf set PCRE2_NEVER_UTF
|
never_utf set PCRE2_NEVER_UTF
|
||||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||||
|
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||||
ucp set PCRE2_UCP
|
ucp set PCRE2_UCP
|
||||||
|
@ -596,7 +597,7 @@ setting the size of the JIT stack.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the <b>jitfast</b> modifier is specified, matching is done using the JIT
|
If the <b>jitfast</b> modifier is specified, matching is done using the JIT
|
||||||
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
|
"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
|
||||||
checks that are done by <b>pcre2_match()</b>, and of course does not work when
|
checks that are done by <b>pcre2_match()</b>, and of course does not work when
|
||||||
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
|
JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
|
||||||
assumed.
|
assumed.
|
||||||
|
@ -1309,9 +1310,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC20" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 23 November 2014
|
Last updated: 02 January 2015
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2015 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
107
doc/pcre2.txt
|
@ -1226,6 +1226,20 @@ COMPILING A PATTERN
|
||||||
a full unoptimized search and run all the callouts, but it is mainly
|
a full unoptimized search and run all the callouts, but it is mainly
|
||||||
provided for testing purposes.
|
provided for testing purposes.
|
||||||
|
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
|
|
||||||
|
If this option is set, it disables an optimization that is applied when
|
||||||
|
.* is the first significant item in a top-level branch of a pattern,
|
||||||
|
and all the other branches also start with .* or with \A or \G or ^.
|
||||||
|
The optimization is automatically disabled for .* if it is inside an
|
||||||
|
atomic group or a capturing group that is the subject of a back refer-
|
||||||
|
ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
|
||||||
|
mization is not disabled, such a pattern is automatically anchored if
|
||||||
|
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
|
||||||
|
for any ^ items. Otherwise, the fact that any match must start either
|
||||||
|
at the start of the subject or following a newline is remembered. Like
|
||||||
|
other optimizations, this can cause callouts to be skipped.
|
||||||
|
|
||||||
PCRE2_NO_START_OPTIMIZE
|
PCRE2_NO_START_OPTIMIZE
|
||||||
|
|
||||||
This is an option whose main effect is at matching time. It does not
|
This is an option whose main effect is at matching time. It does not
|
||||||
|
@ -1465,17 +1479,27 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
||||||
PCRE2_EXTENDED.
|
PCRE2_EXTENDED.
|
||||||
|
|
||||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
||||||
alternatives begin with one of the following:
|
PCRE2 if the first significant item in every top-level branch is one of
|
||||||
|
the following:
|
||||||
|
|
||||||
^ unless PCRE2_MULTILINE is set
|
^ unless PCRE2_MULTILINE is set
|
||||||
\A always
|
\A always
|
||||||
\G always
|
\G always
|
||||||
.* if PCRE2_DOTALL is set and there are no back
|
.* sometimes - see below
|
||||||
references to the subpattern in which .* appears
|
|
||||||
|
|
||||||
For such patterns, the PCRE2_ANCHORED bit is set in the options
|
When .* is the first significant item, anchoring is possible only when
|
||||||
returned for PCRE2_INFO_ALLOPTIONS.
|
all the following are true:
|
||||||
|
|
||||||
|
.* is not in an atomic group
|
||||||
|
.* is not in a capturing group that is the subject
|
||||||
|
of a back reference
|
||||||
|
PCRE2_DOTALL is in force for .*
|
||||||
|
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||||
|
|
||||||
|
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
|
||||||
|
the options returned for PCRE2_INFO_ALLOPTIONS.
|
||||||
|
|
||||||
PCRE2_INFO_BACKREFMAX
|
PCRE2_INFO_BACKREFMAX
|
||||||
|
|
||||||
|
@ -1504,17 +1528,9 @@ INFORMATION ABOUT A COMPILED PATTERN
|
||||||
If there is a fixed first value, for example, the letter "c" from a
|
If there is a fixed first value, for example, the letter "c" from a
|
||||||
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
||||||
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
||||||
fixed first value, and if either
|
fixed first value, but it is known that a match can occur only at the
|
||||||
|
start of the subject or following a newline in the subject, 2 is
|
||||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every
|
returned. Otherwise, and for anchored patterns, 0 is returned.
|
||||||
branch starts with "^", or
|
|
||||||
|
|
||||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is
|
|
||||||
not set (if it were set, the pattern would be anchored),
|
|
||||||
|
|
||||||
2 is returned, indicating that the pattern matches only at the start of
|
|
||||||
a subject string or after any newline within the string. Otherwise 0 is
|
|
||||||
returned. For anchored patterns, 0 is returned.
|
|
||||||
|
|
||||||
PCRE2_INFO_FIRSTCODEUNIT
|
PCRE2_INFO_FIRSTCODEUNIT
|
||||||
|
|
||||||
|
@ -2726,8 +2742,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 22 December 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
@ -3251,6 +3267,8 @@ MISSING CALLOUTS
|
||||||
compiles and matches patterns, callouts sometimes do not happen exactly
|
compiles and matches patterns, callouts sometimes do not happen exactly
|
||||||
as you might expect.
|
as you might expect.
|
||||||
|
|
||||||
|
Auto-possessification
|
||||||
|
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
||||||
that what follows cannot be part of the repeat. For example, a+[bc] is
|
that what follows cannot be part of the repeat. For example, a+[bc] is
|
||||||
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
||||||
|
@ -3279,6 +3297,53 @@ MISSING CALLOUTS
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
||||||
tries again, repeatedly, until a+ itself fails.
|
tries again, repeatedly, until a+ itself fails.
|
||||||
|
|
||||||
|
Automatic .* anchoring
|
||||||
|
|
||||||
|
By default, an optimization is applied when .* is the first significant
|
||||||
|
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
||||||
|
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
||||||
|
is not set, a match can start only after an internal newline or at the
|
||||||
|
beginning of the subject, and pcre2_compile() remembers this. This
|
||||||
|
optimization is disabled, however, if .* is in an atomic group or if
|
||||||
|
there is a back reference to the capturing group in which it appears.
|
||||||
|
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
|
||||||
|
ever, the presence of callouts does not affect it.
|
||||||
|
|
||||||
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
||||||
|
and applied to the string "aa", the pcre2test output is:
|
||||||
|
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
|
||||||
|
This shows that all match attempts start at the beginning of the sub-
|
||||||
|
ject. In other words, the pattern is anchored. You can disable this
|
||||||
|
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
|
||||||
|
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
|
||||||
|
put changes to:
|
||||||
|
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
|
||||||
|
This shows more match attempts, starting at the second subject charac-
|
||||||
|
ter. Another optimization, described in the next section, means that
|
||||||
|
there is no subsequent attempt to match with an empty subject.
|
||||||
|
|
||||||
|
If a pattern has more than one top-level branch, automatic anchoring
|
||||||
|
occurs if all branches are anchorable.
|
||||||
|
|
||||||
|
Other optimizations
|
||||||
|
|
||||||
Other optimizations that provide fast "no match" results also affect
|
Other optimizations that provide fast "no match" results also affect
|
||||||
callouts. For example, if the pattern is
|
callouts. For example, if the pattern is
|
||||||
|
|
||||||
|
@ -3410,8 +3475,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 25 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_COMPILE 3 "21 October 2014" "PCRE2 10.00"
|
.TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -51,6 +51,7 @@ or provide an external function for stack size checking. The option bits are:
|
||||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||||
theses (named ones available)
|
theses (named ones available)
|
||||||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
|
||||||
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
||||||
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
||||||
(only relevant if PCRE2_UTF is set)
|
(only relevant if PCRE2_UTF is set)
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "22 December 2014" "PCRE2 10.00"
|
.TH PCRE2API 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -1163,6 +1163,19 @@ use, auto-possessification means that some callouts are never taken. You can
|
||||||
set this option if you want the matching functions to do a full unoptimized
|
set this option if you want the matching functions to do a full unoptimized
|
||||||
search and run all the callouts, but it is mainly provided for testing
|
search and run all the callouts, but it is mainly provided for testing
|
||||||
purposes.
|
purposes.
|
||||||
|
.sp
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
|
.sp
|
||||||
|
If this option is set, it disables an optimization that is applied when .* is
|
||||||
|
the first significant item in a top-level branch of a pattern, and all the
|
||||||
|
other branches also start with .* or with \eA or \eG or ^. The optimization is
|
||||||
|
automatically disabled for .* if it is inside an atomic group or a capturing
|
||||||
|
group that is the subject of a back reference, or if the pattern contains
|
||||||
|
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
|
||||||
|
automatically anchored if PCRE2_DOTALL is set for all the .* items and
|
||||||
|
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
|
||||||
|
must start either at the start of the subject or following a newline is
|
||||||
|
remembered. Like other optimizations, this can cause callouts to be skipped.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_NO_START_OPTIMIZE
|
PCRE2_NO_START_OPTIMIZE
|
||||||
.sp
|
.sp
|
||||||
|
@ -1436,18 +1449,27 @@ force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is
|
||||||
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS,
|
||||||
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
PCRE2_MULTILINE, and PCRE2_EXTENDED.
|
||||||
.P
|
.P
|
||||||
A pattern is automatically anchored by PCRE2 if all of its top-level
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||||
alternatives begin with one of the following:
|
the first significant item in every top-level branch is one of the following:
|
||||||
.sp
|
.sp
|
||||||
^ unless PCRE2_MULTILINE is set
|
^ unless PCRE2_MULTILINE is set
|
||||||
\eA always
|
\eA always
|
||||||
\eG always
|
\eG always
|
||||||
.\" JOIN
|
.* sometimes - see below
|
||||||
.* if PCRE2_DOTALL is set and there are no back
|
|
||||||
references to the subpattern in which .* appears
|
|
||||||
.sp
|
.sp
|
||||||
For such patterns, the PCRE2_ANCHORED bit is set in the options returned for
|
When .* is the first significant item, anchoring is possible only when all the
|
||||||
PCRE2_INFO_ALLOPTIONS.
|
following are true:
|
||||||
|
.sp
|
||||||
|
.* is not in an atomic group
|
||||||
|
.\" JOIN
|
||||||
|
.* is not in a capturing group that is the subject
|
||||||
|
of a back reference
|
||||||
|
PCRE2_DOTALL is in force for .*
|
||||||
|
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
||||||
|
.sp
|
||||||
|
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
|
||||||
|
options returned for PCRE2_INFO_ALLOPTIONS.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_BACKREFMAX
|
PCRE2_INFO_BACKREFMAX
|
||||||
.sp
|
.sp
|
||||||
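Review note: the revised text in this hunk lists five conditions that must all hold before a leading .* can auto-anchor a pattern. A minimal sketch of that rule as a single predicate follows. The struct and function names are hypothetical and are not part of the PCRE2 API; the library's own check is the is_anchored() change shown later in this commit.

```c
#include <stdbool.h>

/* Hypothetical summary of one leading .* item; these are not PCRE2 types. */
typedef struct {
  bool dotall_in_force;          /* PCRE2_DOTALL applies to this .* */
  bool in_atomic_group;          /* .* is inside an atomic group */
  bool in_backreferenced_group;  /* .* is in a capturing group that is the
                                    subject of a back reference */
} dotstar_item;

/* Hypothetical pattern-wide facts. */
typedef struct {
  bool has_prune_or_skip;        /* (*PRUNE) or (*SKIP) appears in the pattern */
  bool no_dotstar_anchor;        /* PCRE2_NO_DOTSTAR_ANCHOR was set */
} pattern_facts;

/* Returns true only when every documented condition holds. */
static bool dotstar_can_anchor(const dotstar_item *item,
                               const pattern_facts *pattern)
{
  return item->dotall_in_force &&
         !item->in_atomic_group &&
         !item->in_backreferenced_group &&
         !pattern->has_prune_or_skip &&
         !pattern->no_dotstar_anchor;
}
```

For a pattern with several top-level branches, the documentation says every branch must be anchorable, so a caller of this sketch would apply the predicate to each branch that starts with .* and require all of them to pass.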
|
@ -1475,18 +1497,10 @@ variable.
|
||||||
.P
|
.P
|
||||||
If there is a fixed first value, for example, the letter "c" from a pattern
|
If there is a fixed first value, for example, the letter "c" from a pattern
|
||||||
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
such as (cat|cow|coyote), 1 is returned, and the character value can be
|
||||||
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
|
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
|
||||||
if either
|
it is known that a match can occur only at the start of the subject or
|
||||||
.sp
|
following a newline in the subject, 2 is returned. Otherwise, and for anchored
|
||||||
(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
|
patterns, 0 is returned.
|
||||||
starts with "^", or
|
|
||||||
.sp
|
|
||||||
(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
|
|
||||||
(if it were set, the pattern would be anchored),
|
|
||||||
.sp
|
|
||||||
2 is returned, indicating that the pattern matches only at the start of a
|
|
||||||
subject string or after any newline within the string. Otherwise 0 is
|
|
||||||
returned. For anchored patterns, 0 is returned.
|
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_FIRSTCODEUNIT
|
PCRE2_INFO_FIRSTCODEUNIT
|
||||||
.sp
|
.sp
|
||||||
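Review note: the rewritten paragraph in this hunk describes three possible return values (0, 1 and 2). The sketch below restates that decision order in code. It assumes the paragraph belongs to the PCRE2_INFO_FIRSTCODETYPE item, whose name falls outside the lines shown here; the function and parameter names are hypothetical, and this is not how the library computes the value internally.

```c
#include <stdbool.h>

/* Hypothetical restatement of the documented return values:
   0 for anchored patterns and when nothing is known,
   1 when there is a fixed first value,
   2 when a match can start only at the start of the subject
     or following a newline. */
static int first_code_type(bool anchored,
                           bool has_fixed_first_value,
                           bool start_of_line_only)
{
  if (anchored) return 0;
  if (has_fixed_first_value) return 1;
  if (start_of_line_only) return 2;
  return 0;
}
```

The second flag takes priority over the third because the text returns 2 only "if there is no fixed first value".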
|
@ -2835,6 +2849,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 22 December 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2CALLOUT 3 "25 November 2014" "PCRE2 10.00"
|
.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -65,7 +65,11 @@ particular pattern.
|
||||||
You should be aware that, because of optimizations in the way PCRE2 compiles
|
You should be aware that, because of optimizations in the way PCRE2 compiles
|
||||||
and matches patterns, callouts sometimes do not happen exactly as you might
|
and matches patterns, callouts sometimes do not happen exactly as you might
|
||||||
expect.
|
expect.
|
||||||
.P
|
.
|
||||||
|
.
|
||||||
|
.SS "Auto-possessification"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||||
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
|
if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled
|
||||||
|
@ -93,7 +97,56 @@ case, the output changes to this:
|
||||||
.sp
|
.sp
|
||||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||||
again, repeatedly, until a+ itself fails.
|
again, repeatedly, until a+ itself fails.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Automatic .* anchoring"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
By default, an optimization is applied when .* is the first significant item in
|
||||||
|
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||||
|
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||||
|
start only after an internal newline or at the beginning of the subject, and
|
||||||
|
\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
|
||||||
|
if .* is in an atomic group or if there is a back reference to the capturing
|
||||||
|
group in which it appears. It is also disabled if the pattern contains (*PRUNE)
|
||||||
|
or (*SKIP). However, the presence of callouts does not affect it.
|
||||||
.P
|
.P
|
||||||
|
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
|
||||||
|
applied to the string "aa", the \fBpcre2test\fP output is:
|
||||||
|
.sp
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \ed
|
||||||
|
+2 ^^ \ed
|
||||||
|
+2 ^ \ed
|
||||||
|
No match
|
||||||
|
.sp
|
||||||
|
This shows that all match attempts start at the beginning of the subject. In
|
||||||
|
other words, the pattern is anchored. You can disable this optimization by
|
||||||
|
passing PCRE2_NO_DOTSTAR_ANCHOR to \fBpcre2_compile()\fP, or starting the
|
||||||
|
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||||
|
.sp
|
||||||
|
--->aa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \ed
|
||||||
|
+2 ^^ \ed
|
||||||
|
+2 ^ \ed
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^^ \ed
|
||||||
|
+2 ^ \ed
|
||||||
|
No match
|
||||||
|
.sp
|
||||||
|
This shows more match attempts, starting at the second subject character.
|
||||||
|
Another optimization, described in the next section, means that there is no
|
||||||
|
subsequent attempt to match with an empty subject.
|
||||||
|
.P
|
||||||
|
If a pattern has more than one top-level branch, automatic anchoring occurs if
|
||||||
|
all branches are anchorable.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Other optimizations"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
Other optimizations that provide fast "no match" results also affect callouts.
|
Other optimizations that provide fast "no match" results also affect callouts.
|
||||||
For example, if the pattern is
|
For example, if the pattern is
|
||||||
.sp
|
.sp
|
||||||
|
@ -232,6 +285,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 25 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
|
.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -117,6 +117,19 @@ reaching "no match" results. For more details, see the
|
||||||
documentation.
|
documentation.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SS "Disabling automatic anchoring"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
|
||||||
|
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
|
||||||
|
apply to patterns whose top-level branches all start with .* (match any number
|
||||||
|
of arbitrary characters). For more details, see the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SS "Setting match and recursion limits"
|
.SS "Setting match and recursion limits"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -1853,7 +1866,8 @@ one succeeds. Consider this pattern:
|
||||||
(?>.*?a)b
|
(?>.*?a)b
|
||||||
.sp
|
.sp
|
||||||
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
||||||
(*PRUNE) and (*SKIP) also disable this optimization.
|
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
|
||||||
|
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
|
||||||
.P
|
.P
|
||||||
When a capturing subpattern is repeated, the value captured is the substring
|
When a capturing subpattern is repeated, the value captured is the substring
|
||||||
that matched the final iteration. For example, after
|
that matched the final iteration. For example, after
|
||||||
|
@ -3278,6 +3292,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 14 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00"
|
.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 PERFORMANCE"
|
.SH "PCRE2 PERFORMANCE"
|
||||||
|
@ -105,14 +105,18 @@ such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
|
||||||
less with a DFA matching function, and in both cases there is not much
|
less with a DFA matching function, and in both cases there is not much
|
||||||
difference for \eb.
|
difference for \eb.
|
||||||
.P
|
.P
|
||||||
When a pattern begins with .* not in parentheses, or in parentheses that are
|
When a pattern begins with .* not in atomic parentheses, nor in parentheses
|
||||||
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
|
||||||
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
the pattern is implicitly anchored by PCRE2, since it can match only at the
|
||||||
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
start of a subject string. If the pattern has multiple top-level branches, they
|
||||||
this optimization, because the dot metacharacter does not then match a newline,
|
must all be anchorable. The optimization can be disabled by the
|
||||||
and if the subject string contains newlines, the pattern may match from the
|
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
|
||||||
character immediately following one of them instead of from the very start. For
|
contains (*PRUNE) or (*SKIP).
|
||||||
example, the pattern
|
.P
|
||||||
|
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
|
||||||
|
dot metacharacter does not then match a newline, and if the subject string
|
||||||
|
contains newlines, the pattern may match from the character immediately
|
||||||
|
following one of them instead of from the very start. For example, the pattern
|
||||||
.sp
|
.sp
|
||||||
.*second
|
.*second
|
||||||
.sp
|
.sp
|
||||||
|
@ -173,6 +177,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 20 October 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00"
|
.TH PCRE2SYNTAX 3 "02 January 2015" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -389,6 +389,7 @@ appear.
|
||||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||||
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||||
|
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
|
||||||
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||||
(*UTF) set appropriate UTF mode for the library in use
|
(*UTF) set appropriate UTF mode for the library in use
|
||||||
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||||
|
@ -536,6 +537,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00"
|
.TH PCRE2TEST 1 "02 January 2015" "PCRE 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
pcre2test - a program for testing Perl-compatible regular expressions.
|
pcre2test - a program for testing Perl-compatible regular expressions.
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -247,7 +247,7 @@ checked for compatibility with the \fBperltest.sh\fP script, which is used to
|
||||||
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
confirm that Perl gives the same results as PCRE2. Also, apart from comment
|
||||||
lines, none of the other command lines are permitted, because they and many
|
lines, none of the other command lines are permitted, because they and many
|
||||||
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
of the modifiers are specific to \fBpcre2test\fP, and should not be used in
|
||||||
test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB
|
test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP
|
||||||
command helps detect tests that are accidentally put in the wrong file.
|
command helps detect tests that are accidentally put in the wrong file.
|
||||||
.sp
|
.sp
|
||||||
#subject <modifier-list>
|
#subject <modifier-list>
|
||||||
|
@ -413,6 +413,7 @@ for a description of their effects.
|
||||||
never_utf set PCRE2_NEVER_UTF
|
never_utf set PCRE2_NEVER_UTF
|
||||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||||
|
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||||
ucp set PCRE2_UCP
|
ucp set PCRE2_UCP
|
||||||
|
@ -552,7 +553,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
|
||||||
setting the size of the JIT stack.
|
setting the size of the JIT stack.
|
||||||
.P
|
.P
|
||||||
If the \fBjitfast\fP modifier is specified, matching is done using the JIT
|
If the \fBjitfast\fP modifier is specified, matching is done using the JIT
|
||||||
"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
|
"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
|
||||||
checks that are done by \fBpcre2_match()\fP, and of course does not work when
|
checks that are done by \fBpcre2_match()\fP, and of course does not work when
|
||||||
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
|
JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
|
||||||
assumed.
|
assumed.
|
||||||
|
@ -1274,6 +1275,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -402,6 +402,7 @@ PATTERN MODIFIERS
|
||||||
never_utf set PCRE2_NEVER_UTF
|
never_utf set PCRE2_NEVER_UTF
|
||||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||||
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
no_auto_possess set PCRE2_NO_AUTO_POSSESS
|
||||||
|
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
|
||||||
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
no_start_optimize set PCRE2_NO_START_OPTIMIZE
|
||||||
no_utf_check set PCRE2_NO_UTF_CHECK
|
no_utf_check set PCRE2_NO_UTF_CHECK
|
||||||
ucp set PCRE2_UCP
|
ucp set PCRE2_UCP
|
||||||
|
@ -1185,5 +1186,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 23 November 2014
|
Last updated: 02 January 2015
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
|
|
|
@ -5,7 +5,7 @@
|
||||||
/* This is the public header file for the PCRE library, second API, to be
|
/* This is the public header file for the PCRE library, second API, to be
|
||||||
#included by applications that call PCRE2 functions.
|
#included by applications that call PCRE2 functions.
|
||||||
|
|
||||||
Copyright (c) 2014 University of Cambridge
|
Copyright (c) 2015 University of Cambridge
|
||||||
|
|
||||||
-----------------------------------------------------------------------------
|
-----------------------------------------------------------------------------
|
||||||
Redistribution and use in source and binary forms, with or without
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
@ -113,10 +113,11 @@ D is inspected during pcre2_dfa_match() execution
|
||||||
#define PCRE2_NEVER_UTF 0x00001000u /* C */
|
#define PCRE2_NEVER_UTF 0x00001000u /* C */
|
||||||
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
|
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
|
||||||
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
|
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
|
||||||
#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */
|
#define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u /* C */
|
||||||
#define PCRE2_UCP 0x00010000u /* C J M D */
|
#define PCRE2_NO_START_OPTIMIZE 0x00010000u /* J M D */
|
||||||
#define PCRE2_UNGREEDY 0x00020000u /* C */
|
#define PCRE2_UCP 0x00020000u /* C J M D */
|
||||||
#define PCRE2_UTF 0x00040000u /* C J M D */
|
#define PCRE2_UNGREEDY 0x00040000u /* C */
|
||||||
|
#define PCRE2_UTF 0x00080000u /* C J M D */
|
||||||
|
|
||||||
/* These are for pcre2_jit_compile(). */
|
/* These are for pcre2_jit_compile(). */
|
||||||
|
|
||||||
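Review note: this hunk gives PCRE2_NO_DOTSTAR_ANCHOR the value 0x00008000u and moves the four options that follow it up by one bit. The sketch below shows what that renumbering would mean for option bits produced against the old header, assuming such bits were ever passed to code built with the new one. The diff does not discuss compatibility, so treat this as an illustration only. The OLD_ and NEW_ prefixes and the function name are hypothetical; the numeric values are copied from the two sides of the hunk.

```c
#include <stdint.h>

/* Old-side values from the hunk. */
#define OLD_NO_START_OPTIMIZE 0x00008000u
#define OLD_UCP               0x00010000u
#define OLD_UNGREEDY          0x00020000u
#define OLD_UTF               0x00040000u

/* New-side values from the hunk. */
#define NEW_NO_DOTSTAR_ANCHOR 0x00008000u
#define NEW_NO_START_OPTIMIZE 0x00010000u
#define NEW_UCP               0x00020000u
#define NEW_UNGREEDY          0x00040000u
#define NEW_UTF               0x00080000u

/* Returns the option name that a single bit has under the new header. */
static const char *new_meaning(uint32_t bit)
{
  switch (bit) {
    case NEW_NO_DOTSTAR_ANCHOR: return "PCRE2_NO_DOTSTAR_ANCHOR";
    case NEW_NO_START_OPTIMIZE: return "PCRE2_NO_START_OPTIMIZE";
    case NEW_UCP:               return "PCRE2_UCP";
    case NEW_UNGREEDY:          return "PCRE2_UNGREEDY";
    case NEW_UTF:               return "PCRE2_UTF";
    default:                    return "(not one of the renumbered bits)";
  }
}
```

For example, new_meaning(OLD_NO_START_OPTIMIZE) yields "PCRE2_NO_DOTSTAR_ANCHOR", and new_meaning(OLD_UTF) yields "PCRE2_UNGREEDY", so every option from PCRE2_NO_START_OPTIMIZE onward changes meaning unless callers are recompiled.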
|
|
|
@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
|
||||||
|
|
||||||
Written by Philip Hazel
|
Written by Philip Hazel
|
||||||
Original API code Copyright (c) 1997-2012 University of Cambridge
|
Original API code Copyright (c) 1997-2012 University of Cambridge
|
||||||
New API code Copyright (c) 2014 University of Cambridge
|
New API code Copyright (c) 2015 University of Cambridge
|
||||||
|
|
||||||
-----------------------------------------------------------------------------
|
-----------------------------------------------------------------------------
|
||||||
Redistribution and use in source and binary forms, with or without
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
@ -557,8 +557,8 @@ static PCRE2_SPTR posix_substitutes[] = {
|
||||||
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
|
PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
|
||||||
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
|
PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
|
||||||
PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
|
PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
|
||||||
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \
|
PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
|
||||||
PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
|
PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
|
||||||
|
|
||||||
/* Compile time error code numbers. They are given names so that they can more
|
/* Compile time error code numbers. They are given names so that they can more
|
||||||
easily be tracked. When a new number is added, the tables called eint1 and
|
easily be tracked. When a new number is added, the tables called eint1 and
|
||||||
|
@ -597,22 +597,23 @@ typedef struct pso {
|
||||||
/* NB: STRING_UTFn_RIGHTPAR contains the length as well */
|
/* NB: STRING_UTFn_RIGHTPAR contains the length as well */
|
||||||
|
|
||||||
static pso pso_list[] = {
|
static pso pso_list[] = {
|
||||||
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
|
{ (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
|
||||||
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
|
{ (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
|
||||||
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
|
{ (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
|
||||||
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
|
{ (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
|
||||||
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR,17, PSO_FLG, PCRE2_NE_ATST_SET },
|
{ (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
|
||||||
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
|
{ (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
|
||||||
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
|
{ (uint8_t *)STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
|
||||||
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
|
{ (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
|
||||||
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
|
{ (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
|
||||||
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
|
{ (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
|
||||||
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
|
{ (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
|
||||||
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
|
{ (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
|
||||||
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
|
{ (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
|
||||||
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
|
{ (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
|
||||||
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
|
{ (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
|
||||||
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
|
{ (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
|
||||||
|
{ (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|
||||||
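Review note: each pso_list entry in this hunk pairs a string with a hand-written length, and the new entry uses 18 for STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR. A small sketch for checking those numbers follows. The pso_check type and the function name are hypothetical, the literals are the ASCII forms given by the STRING_..._RIGHTPAR definitions elsewhere in this commit, and only entries whose text appears in the diff are included.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of a start-of-pattern option string with the
   length written beside it in pso_list. */
typedef struct {
  const char *text;
  size_t length;
} pso_check;

static const pso_check entries[] = {
  { "UTF)",                4 },
  { "UCP)",                4 },
  { "NOTEMPTY)",           9 },
  { "NOTEMPTY_ATSTART)",  17 },
  { "NO_AUTO_POSSESS)",   16 },
  { "NO_DOTSTAR_ANCHOR)", 18 },
  { "NO_START_OPT)",      13 }
};

/* Counts entries whose recorded length differs from strlen(). */
static size_t count_length_mismatches(void)
{
  size_t mismatches = 0;
  for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
    if (strlen(entries[i].text) != entries[i].length) mismatches++;
  return mismatches;
}
```

A result of zero from count_length_mismatches() means the recorded lengths, including the new 18, agree with the strings.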
|
@ -7020,13 +7021,14 @@ do {
|
||||||
|
|
||||||
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
|
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
|
||||||
it isn't in brackets that are or may be referenced or inside an atomic
|
it isn't in brackets that are or may be referenced or inside an atomic
|
||||||
group. */
|
group. There is also an option that disables auto-anchoring. */
|
||||||
|
|
||||||
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
|
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
|
||||||
op == OP_TYPEPOSSTAR))
|
op == OP_TYPEPOSSTAR))
|
||||||
{
|
{
|
||||||
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
|
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
|
||||||
atomcount > 0 || cb->had_pruneorskip)
|
atomcount > 0 || cb->had_pruneorskip ||
|
||||||
|
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
|
||||||
return FALSE;
|
return FALSE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -7140,12 +7142,13 @@ do {
|
||||||
brackets that may be referenced, as long as the pattern does not contain
|
brackets that may be referenced, as long as the pattern does not contain
|
||||||
*PRUNE or *SKIP, because these break the feature. Consider, for example,
|
*PRUNE or *SKIP, because these break the feature. Consider, for example,
|
||||||
/.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
|
/.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
|
||||||
start of a line. */
|
start of a line. There is also an option that disables this optimization. */
|
||||||
|
|
||||||
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
|
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
|
||||||
{
|
{
|
||||||
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
|
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
|
||||||
atomcount > 0 || cb->had_pruneorskip)
|
atomcount > 0 || cb->had_pruneorskip ||
|
||||||
|
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
|
||||||
return FALSE;
|
return FALSE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -7863,7 +7866,8 @@ if (errorcode != 0)
|
||||||
/* Successful compile. If the anchored option was not passed, set it if
|
/* Successful compile. If the anchored option was not passed, set it if
|
||||||
we can determine that the pattern is anchored by virtue of ^ characters or \A
|
we can determine that the pattern is anchored by virtue of ^ characters or \A
|
||||||
or anything else, such as starting with non-atomic .* when DOTALL is set and
|
or anything else, such as starting with non-atomic .* when DOTALL is set and
|
||||||
there are no occurrences of *PRUNE or *SKIP. */
|
there are no occurrences of *PRUNE or *SKIP (though there is an option to
|
||||||
|
disable this case). */
|
||||||
|
|
||||||
if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
|
if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
|
||||||
is_anchored(codestart, 0, &cb, 0))
|
is_anchored(codestart, 0, &cb, 0))
|
||||||
|
@ -7912,7 +7916,8 @@ if ((re->overall_options & (PCRE2_ANCHORED|PCRE2_NO_START_OPTIMIZE)) == 0)
|
||||||
/* When there is no first code unit, see if we can set the PCRE2_STARTLINE
|
/* When there is no first code unit, see if we can set the PCRE2_STARTLINE
|
||||||
flag. This is helpful for multiline matches when all branches start with ^
|
flag. This is helpful for multiline matches when all branches start with ^
|
||||||
and also when all branches start with non-atomic .* for non-DOTALL matches
|
and also when all branches start with non-atomic .* for non-DOTALL matches
|
||||||
when *PRUNE and SKIP are not present. */
|
when *PRUNE and SKIP are not present. (There is an option that disables this
|
||||||
|
case.) */
|
||||||
|
|
||||||
else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
|
else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
|
||||||
}
|
}
|
||||||
|
|
|
@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
|
||||||
|
|
||||||
Written by Philip Hazel
|
Written by Philip Hazel
|
||||||
Original API code Copyright (c) 1997-2012 University of Cambridge
|
Original API code Copyright (c) 1997-2012 University of Cambridge
|
||||||
New API code Copyright (c) 2014 University of Cambridge
|
New API code Copyright (c) 2015 University of Cambridge
|
||||||
|
|
||||||
-----------------------------------------------------------------------------
|
-----------------------------------------------------------------------------
|
||||||
Redistribution and use in source and binary forms, with or without
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
@ -904,6 +904,7 @@ a positive value. */
|
||||||
#define STRING_UTF_RIGHTPAR "UTF)"
|
#define STRING_UTF_RIGHTPAR "UTF)"
|
||||||
#define STRING_UCP_RIGHTPAR "UCP)"
|
#define STRING_UCP_RIGHTPAR "UCP)"
|
||||||
#define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
|
#define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
|
||||||
|
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR "NO_DOTSTAR_ANCHOR)"
|
||||||
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
|
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
|
||||||
#define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
|
#define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
|
||||||
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
|
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
|
||||||
|
@ -1173,6 +1174,7 @@ only. */
|
||||||
#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
|
#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
|
||||||
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
|
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
|
||||||
#define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
|
#define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
|
||||||
|
#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_D STR_O STR_T STR_S STR_T STR_A STR_R STR_UNDERSCORE STR_A STR_N STR_C STR_H STR_O STR_R STR_RIGHT_PARENTHESIS
|
||||||
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
|
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
|
||||||
#define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
|
#define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
|
||||||
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS
|
#define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS
|
||||||
|
|
|
@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
|
||||||
|
|
||||||
Written by Philip Hazel
|
Written by Philip Hazel
|
||||||
Original code Copyright (c) 1997-2012 University of Cambridge
|
Original code Copyright (c) 1997-2012 University of Cambridge
|
||||||
Rewritten code Copyright (c) 2014 University of Cambridge
|
Rewritten code Copyright (c) 2015 University of Cambridge
|
||||||
|
|
||||||
-----------------------------------------------------------------------------
|
-----------------------------------------------------------------------------
|
||||||
Redistribution and use in source and binary forms, with or without
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
@ -498,6 +498,7 @@ static modstruct modlist[] = {
|
||||||
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
|
{ "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
|
||||||
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
|
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
|
||||||
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
|
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
|
||||||
|
{ "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
|
||||||
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
|
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
|
||||||
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
|
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
|
||||||
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
|
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
|
||||||
|
@ -3291,29 +3292,30 @@ static void
|
||||||
show_compile_options(uint32_t options, const char *before, const char *after)
|
show_compile_options(uint32_t options, const char *before, const char *after)
|
||||||
{
|
{
|
||||||
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
|
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
|
||||||
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
|
||||||
before,
|
before,
|
||||||
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
|
|
||||||
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
|
|
||||||
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
|
|
||||||
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
|
|
||||||
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
|
|
||||||
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
|
|
||||||
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
|
|
||||||
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
|
|
||||||
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
|
|
||||||
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
|
|
||||||
((options & PCRE2_UTF) != 0)? " utf" : "",
|
|
||||||
((options & PCRE2_UCP) != 0)? " ucp" : "",
|
|
||||||
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
|
|
||||||
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
|
|
||||||
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
|
|
||||||
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
|
((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
|
||||||
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
|
((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
|
||||||
|
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
|
||||||
((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
|
((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
|
||||||
|
((options & PCRE2_CASELESS) != 0)? " caseless" : "",
|
||||||
|
((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
|
||||||
|
((options & PCRE2_DOTALL) != 0)? " dotall" : "",
|
||||||
|
((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
|
||||||
|
((options & PCRE2_EXTENDED) != 0)? " extended" : "",
|
||||||
|
((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
|
||||||
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
|
((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
|
||||||
|
((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
|
||||||
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
|
((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
|
||||||
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
|
((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
|
||||||
|
((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
|
||||||
|
((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
|
||||||
|
((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)? " no_dotstar_anchor" : "",
|
||||||
|
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
|
||||||
|
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
|
||||||
|
((options & PCRE2_UCP) != 0)? " ucp" : "",
|
||||||
|
((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
|
||||||
|
((options & PCRE2_UTF) != 0)? " utf" : "",
|
||||||
after);
|
after);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
@ -4100,4 +4100,20 @@ a random value. /Ix
|
||||||
/a(b)c(d)/
|
/a(b)c(d)/
|
||||||
abc\=ph,copy=0,copy=1,getall
|
abc\=ph,copy=0,copy=1,getall
|
||||||
|
|
||||||
|
/^abc/info
|
||||||
|
|
||||||
|
/^abc/info,no_dotstar_anchor
|
||||||
|
|
||||||
|
/.*\d/info,auto_callout
|
||||||
|
aaa
|
||||||
|
|
||||||
|
/.*\d/info,no_dotstar_anchor,auto_callout
|
||||||
|
aaa
|
||||||
|
|
||||||
|
/.*\d/dotall,info
|
||||||
|
|
||||||
|
/.*\d/dotall,no_dotstar_anchor,info
|
||||||
|
|
||||||
|
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
|
|
|
@ -5361,7 +5361,7 @@ No match
|
||||||
"<(\w+)/?>(.)*</(\1)>"Igms
|
"<(\w+)/?>(.)*</(\1)>"Igms
|
||||||
Capturing subpattern count = 3
|
Capturing subpattern count = 3
|
||||||
Max back reference = 1
|
Max back reference = 1
|
||||||
Options: multiline dotall
|
Options: dotall multiline
|
||||||
First code unit = '<'
|
First code unit = '<'
|
||||||
Last code unit = '>'
|
Last code unit = '>'
|
||||||
Subject length lower bound = 7
|
Subject length lower bound = 7
|
||||||
|
@ -5399,7 +5399,7 @@ No match
|
||||||
/line\nbreak/Im,firstline
|
/line\nbreak/Im,firstline
|
||||||
Capturing subpattern count = 0
|
Capturing subpattern count = 0
|
||||||
Contains explicit CR or LF match
|
Contains explicit CR or LF match
|
||||||
Options: multiline firstline
|
Options: firstline multiline
|
||||||
First code unit = 'l'
|
First code unit = 'l'
|
||||||
Last code unit = 'k'
|
Last code unit = 'k'
|
||||||
Subject length lower bound = 10
|
Subject length lower bound = 10
|
||||||
|
@ -9698,7 +9698,7 @@ Subject length lower bound = 41
|
||||||
/Iisx
|
/Iisx
|
||||||
Capturing subpattern count = 3
|
Capturing subpattern count = 3
|
||||||
Max back reference = 1
|
Max back reference = 1
|
||||||
Options: caseless extended dotall
|
Options: caseless dotall extended
|
||||||
First code unit = '<'
|
First code unit = '<'
|
||||||
Last code unit = '='
|
Last code unit = '='
|
||||||
Subject length lower bound = 9
|
Subject length lower bound = 9
|
||||||
|
@ -9747,7 +9747,7 @@ Named capturing subpatterns:
|
||||||
quote 4
|
quote 4
|
||||||
realquote 3
|
realquote 3
|
||||||
realquote 6
|
realquote 6
|
||||||
Options: extended dupnames
|
Options: dupnames extended
|
||||||
Starting code units: a b
|
Starting code units: a b
|
||||||
Subject length lower bound = 3
|
Subject length lower bound = 3
|
||||||
a"aaaaa
|
a"aaaaa
|
||||||
|
@ -9805,8 +9805,8 @@ Capturing subpattern count = 4
|
||||||
Named capturing subpatterns:
|
Named capturing subpatterns:
|
||||||
D 4
|
D 4
|
||||||
D 1
|
D 1
|
||||||
Compile options: extended dupnames
|
Compile options: dupnames extended
|
||||||
Overall options: anchored extended dupnames
|
Overall options: anchored dupnames extended
|
||||||
Subject length lower bound = 2
|
Subject length lower bound = 2
|
||||||
abcdX
|
abcdX
|
||||||
0: abcdX
|
0: abcdX
|
||||||
|
@ -9852,7 +9852,7 @@ Capturing subpattern count = 4
|
||||||
Named capturing subpatterns:
|
Named capturing subpatterns:
|
||||||
A 1
|
A 1
|
||||||
A 4
|
A 4
|
||||||
Options: extended dupnames
|
Options: dupnames extended
|
||||||
First code unit = 'a'
|
First code unit = 'a'
|
||||||
Last code unit = 'd'
|
Last code unit = 'd'
|
||||||
Subject length lower bound = 4
|
Subject length lower bound = 4
|
||||||
|
@ -9936,7 +9936,7 @@ No match
|
||||||
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
|
/(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
|
||||||
Capturing subpattern count = 3
|
Capturing subpattern count = 3
|
||||||
Max back reference = 3
|
Max back reference = 3
|
||||||
Options: dupnames alt_bsux allow_empty_class match_unset_backref
|
Options: alt_bsux allow_empty_class dupnames match_unset_backref
|
||||||
Last code unit = 'a'
|
Last code unit = 'a'
|
||||||
Subject length lower bound = 1
|
Subject length lower bound = 1
|
||||||
cat
|
cat
|
||||||
|
@ -13769,4 +13769,67 @@ Partial match: abc
|
||||||
Copy substring 1 failed (-2): partial match
|
Copy substring 1 failed (-2): partial match
|
||||||
get substring list failed (-2): partial match
|
get substring list failed (-2): partial match
|
||||||
|
|
||||||
|
/^abc/info
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Compile options: <none>
|
||||||
|
Overall options: anchored
|
||||||
|
Subject length lower bound = 3
|
||||||
|
|
||||||
|
/^abc/info,no_dotstar_anchor
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Compile options: no_dotstar_anchor
|
||||||
|
Overall options: anchored no_dotstar_anchor
|
||||||
|
Subject length lower bound = 3
|
||||||
|
|
||||||
|
/.*\d/info,auto_callout
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Options: auto_callout
|
||||||
|
First code unit at start or follows newline
|
||||||
|
Subject length lower bound = 1
|
||||||
|
aaa
|
||||||
|
--->aaa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
|
||||||
|
/.*\d/info,no_dotstar_anchor,auto_callout
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Options: auto_callout no_dotstar_anchor
|
||||||
|
Subject length lower bound = 1
|
||||||
|
aaa
|
||||||
|
--->aaa
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^ ^ \d
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
+0 ^ .*
|
||||||
|
+2 ^^ \d
|
||||||
|
+2 ^ \d
|
||||||
|
No match
|
||||||
|
|
||||||
|
/.*\d/dotall,info
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Compile options: dotall
|
||||||
|
Overall options: anchored dotall
|
||||||
|
Subject length lower bound = 1
|
||||||
|
|
||||||
|
/.*\d/dotall,no_dotstar_anchor,info
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Options: dotall no_dotstar_anchor
|
||||||
|
Subject length lower bound = 1
|
||||||
|
|
||||||
|
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
|
||||||
|
Capturing subpattern count = 0
|
||||||
|
Compile options: <none>
|
||||||
|
Overall options: dotall no_dotstar_anchor
|
||||||
|
Subject length lower bound = 1
|
||||||
|
|
||||||
# End of testinput2
|
# End of testinput2
|
||||||
|
|