diff --git a/ChangeLog b/ChangeLog index f9dada4..40cc171 100644 --- a/ChangeLog +++ b/ChangeLog @@ -58,4 +58,6 @@ matched against "abcd". (an odd thing to do, but it happened), SIGSEGV or other misbehaviour could occur. +10. The PCRE2_NO_DOTSTAR_ANCHOR option has been implemented. + **** diff --git a/doc/html/pcre2_compile.html b/doc/html/pcre2_compile.html index 8657470..d833ebd 100644 --- a/doc/html/pcre2_compile.html +++ b/doc/html/pcre2_compile.html @@ -63,6 +63,7 @@ or provide an external function for stack size checking. The option bits are: PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- theses (named ones available) PCRE2_NO_AUTO_POSSESS Disable auto-possessification + PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .* PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity (only relevant if PCRE2_UTF is set) diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html index a0aef87..3d519c7 100644 --- a/doc/html/pcre2api.html +++ b/doc/html/pcre2api.html @@ -1187,6 +1187,19 @@ use, auto-possessification means that some callouts are never taken. You can set this option if you want the matching functions to do a full unoptimized search and run all the callouts, but it is mainly provided for testing purposes. +
+ PCRE2_NO_DOTSTAR_ANCHOR ++If this option is set, it disables an optimization that is applied when .* is +the first significant item in a top-level branch of a pattern, and all the +other branches also start with .* or with \A or \G or ^. The optimization is +automatically disabled for .* if it is inside an atomic group or a capturing +group that is the subject of a back reference, or if the pattern contains +(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is +automatically anchored if PCRE2_DOTALL is set for all the .* items and +PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match +must start either at the start of the subject or following a newline is +remembered. Like other optimizations, this can cause callouts to be skipped.
 PCRE2_NO_START_OPTIMIZE
@@ -1442,16 +1455,25 @@ compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and PCRE2_EXTENDED.
-A pattern is automatically anchored by PCRE2 if all of its top-level -alternatives begin with one of the following: +A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if +the first significant item in every top-level branch is one of the following:
^ unless PCRE2_MULTILINE is set \A always \G always - .* if PCRE2_DOTALL is set and there are no back references to the subpattern in which .* appears + .* sometimes - see below-For such patterns, the PCRE2_ANCHORED bit is set in the options returned for -PCRE2_INFO_ALLOPTIONS. +When .* is the first significant item, anchoring is possible only when all the +following are true: +
+ .* is not in an atomic group + .* is not in a capturing group that is the subject of a back reference + PCRE2_DOTALL is in force for .* + Neither (*PRUNE) nor (*SKIP) appears in the pattern. + PCRE2_NO_DOTSTAR_ANCHOR is not set. ++For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the +options returned for PCRE2_INFO_ALLOPTIONS.
 PCRE2_INFO_BACKREFMAX
@@ -1480,21 +1502,10 @@ variable.
If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and
-if either
-
-
-(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch
-starts with "^", or
-
-
-(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set
-(if it were set, the pattern would be anchored),
-
-
-2 is returned, indicating that the pattern matches only at the start of a
-subject string or after any newline within the string. Otherwise 0 is
-returned. For anchored patterns, 0 is returned.
+retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
+it is known that a match can occur only at the start of the subject or
+following a newline in the subject, 2 is returned. Otherwise, and for anchored
+patterns, 0 is returned.
 PCRE2_INFO_FIRSTCODEUNIT
@@ -2792,9 +2803,9 @@ Cambridge, England.
-Last updated: 22 December 2014
+Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html index e6894da..93deba9 100644 --- a/doc/html/pcre2callout.html +++ b/doc/html/pcre2callout.html @@ -82,6 +82,9 @@ You should be aware that, because of optimizations in the way PCRE2 compiles and matches patterns, callouts sometimes do not happen exactly as you might expect.
+At compile time, PCRE2 "auto-possessifies" repeated items when it knows that what follows cannot be part of the repeat. For example, a+[bc] is compiled as @@ -111,6 +114,56 @@ case, the output changes to this: This time, when matching [bc] fails, the matcher backtracks into a+ and tries again, repeatedly, until a+ itself fails.
++By default, an optimization is applied when .* is the first significant item in +a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the +pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can +start only after an internal newline or at the beginning of the subject, and +pcre2_compile() remembers this. This optimization is disabled, however, +if .* is in an atomic group or if there is a back reference to the capturing +group in which it appears. It is also disabled if the pattern contains (*PRUNE) +or (*SKIP). However, the presence of callouts does not affect it. +
++For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and +applied to the string "aa", the pcre2test output is: +
+ --->aa + +0 ^ .* + +2 ^ ^ \d + +2 ^^ \d + +2 ^ \d + No match ++This shows that all match attempts start at the beginning of the subject. In +other words, the pattern is anchored. You can disable this optimization by +passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or starting the +pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to: +
+ --->aa + +0 ^ .* + +2 ^ ^ \d + +2 ^^ \d + +2 ^ \d + +0 ^ .* + +2 ^^ \d + +2 ^ \d + No match ++This shows more match attempts, starting at the second subject character. +Another optimization, described in the next section, means that there is no +subsequent attempt to match with an empty subject. + +
+If a pattern has more than one top-level branch, automatic anchoring occurs if +all branches are anchorable. +
+Other optimizations that provide fast "no match" results also affect callouts. For example, if the pattern is @@ -254,9 +307,9 @@ Cambridge, England.
-Last updated: 25 November 2014
+Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 3237c8f..27de3f0 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -151,6 +151,17 @@ reaching "no match" results. For more details, see the documentation.
+If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as +setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that +apply to patterns whose top-level branches all start with .* (match any number +of arbitrary characters). For more details, see the +pcre2api +documentation. +
+@@ -1841,7 +1852,8 @@ one succeeds. Consider this pattern: (?>.*?a)b It matches "ab" in the subject "aab". The use of the backtracking control verbs -(*PRUNE) and (*SKIP) also disable this optimization. +(*PRUNE) and (*SKIP) also disable this optimization, and there is an option, +PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
When a capturing subpattern is repeated, the value captured is the substring @@ -3236,9 +3248,9 @@ Cambridge, England.
-Last updated: 14 November 2014
+Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html index 1b0b145..3b6a4a6 100644 --- a/doc/html/pcre2perform.html +++ b/doc/html/pcre2perform.html @@ -115,14 +115,19 @@ less with a DFA matching function, and in both cases there is not much difference for \b.
-When a pattern begins with .* not in parentheses, or in parentheses that are -not the subject of a backreference, and the PCRE2_DOTALL option is set, the -pattern is implicitly anchored by PCRE2, since it can match only at the start -of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make -this optimization, because the dot metacharacter does not then match a newline, -and if the subject string contains newlines, the pattern may match from the -character immediately following one of them instead of from the very start. For -example, the pattern +When a pattern begins with .* not in atomic parentheses, nor in parentheses +that are the subject of a backreference, and the PCRE2_DOTALL option is set, +the pattern is implicitly anchored by PCRE2, since it can match only at the +start of a subject string. If the pattern has multiple top-level branches, they +must all be anchorable. The optimization can be disabled by the +PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern +contains (*PRUNE) or (*SKIP). +
++If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the +dot metacharacter does not then match a newline, and if the subject string +contains newlines, the pattern may match from the character immediately +following one of them instead of from the very start. For example, the pattern
 .*second
@@ -187,9 +192,9 @@ Cambridge, England. REVISION
-Last updated: 20 October 2014
+Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index dca9868..373b5aa 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -416,6 +416,7 @@ appear. (*NOTEMPTY) set PCRE2_NOTEMPTY when matching (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) + (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) (*UTF) set appropriate UTF mode for the library in use (*UCP) set PCRE2_UCP (use Unicode properties for \d etc) @@ -553,9 +554,9 @@ Cambridge, England.
-Last updated: 23 November 2014
+Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html index 8d74235..ea704c9 100644 --- a/doc/html/pcre2test.html +++ b/doc/html/pcre2test.html @@ -291,7 +291,7 @@ checked for compatibility with the perltest.sh script, which is used to confirm that Perl gives the same results as PCRE2. Also, apart from comment lines, none of the other command lines are permitted, because they and many of the modifiers are specific to pcre2test, and should not be used in -test files that are also processed by perltest.sh. The \fP#perltest\fB +test files that are also processed by perltest.sh. The #perltest command helps detect tests that are accidentally put in the wrong file.
#subject <modifier-list> @@ -454,6 +454,7 @@ for a description of their effects. never_utf set PCRE2_NEVER_UTF no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_possess set PCRE2_NO_AUTO_POSSESS + no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR no_start_optimize set PCRE2_NO_START_OPTIMIZE no_utf_check set PCRE2_NO_UTF_CHECK ucp set PCRE2_UCP @@ -596,7 +597,7 @@ setting the size of the JIT stack.If the jitfast modifier is specified, matching is done using the JIT -"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity +"fast path" interface, pcre2_jit_match(), which skips some of the sanity checks that are done by pcre2_match(), and of course does not work when JIT is not supported. If jitfast is specified without jit, jit=7 is assumed. @@ -1309,9 +1310,9 @@ Cambridge, England.
REVISION
-Last updated: 23 November 2014 +Last updated: 02 January 2015
-Copyright © 1997-2014 University of Cambridge. +Copyright © 1997-2015 University of Cambridge.
Return to the PCRE2 index page. diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 419797a..c979b24 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -1226,6 +1226,20 @@ COMPILING A PATTERN a full unoptimized search and run all the callouts, but it is mainly provided for testing purposes. + PCRE2_NO_DOTSTAR_ANCHOR + + If this option is set, it disables an optimization that is applied when + .* is the first significant item in a top-level branch of a pattern, + and all the other branches also start with .* or with \A or \G or ^. + The optimization is automatically disabled for .* if it is inside an + atomic group or a capturing group that is the subject of a back refer- + ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti- + mization is not disabled, such a pattern is automatically anchored if + PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set + for any ^ items. Otherwise, the fact that any match must start either + at the start of the subject or following a newline is remembered. Like + other optimizations, this can cause callouts to be skipped. + PCRE2_NO_START_OPTIMIZE This is an option whose main effect is at matching time. It does not @@ -1465,17 +1479,27 @@ INFORMATION ABOUT A COMPILED PATTERN option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and PCRE2_EXTENDED. - A pattern is automatically anchored by PCRE2 if all of its top-level - alternatives begin with one of the following: + A pattern compiled without PCRE2_ANCHORED is automatically anchored by + PCRE2 if the first significant item in every top-level branch is one of + the following: ^ unless PCRE2_MULTILINE is set \A always \G always - .* if PCRE2_DOTALL is set and there are no back - references to the subpattern in which .* appears + .* sometimes - see below - For such patterns, the PCRE2_ANCHORED bit is set in the options - returned for PCRE2_INFO_ALLOPTIONS. 
+ When .* is the first significant item, anchoring is possible only when + all the following are true: + + .* is not in an atomic group + .* is not in a capturing group that is the subject + of a back reference + PCRE2_DOTALL is in force for .* + Neither (*PRUNE) nor (*SKIP) appears in the pattern. + PCRE2_NO_DOTSTAR_ANCHOR is not set. + + For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in + the options returned for PCRE2_INFO_ALLOPTIONS. PCRE2_INFO_BACKREFMAX @@ -1504,17 +1528,9 @@ INFORMATION ABOUT A COMPILED PATTERN If there is a fixed first value, for example, the letter "c" from a pattern such as (cat|cow|coyote), 1 is returned, and the character value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no - fixed first value, and if either - - (a) the pattern was compiled with the PCRE2_MULTILINE option, and every - branch starts with "^", or - - (b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is - not set (if it were set, the pattern would be anchored), - - 2 is returned, indicating that the pattern matches only at the start of - a subject string or after any newline within the string. Otherwise 0 is - returned. For anchored patterns, 0 is returned. + fixed first value, but it is known that a match can occur only at the + start of the subject or following a newline in the subject, 2 is + returned. Otherwise, and for anchored patterns, 0 is returned. PCRE2_INFO_FIRSTCODEUNIT @@ -2726,8 +2742,8 @@ AUTHOR REVISION - Last updated: 22 December 2014 - Copyright (c) 1997-2014 University of Cambridge. + Last updated: 02 January 2015 + Copyright (c) 1997-2015 University of Cambridge. ------------------------------------------------------------------------------ @@ -3251,6 +3267,8 @@ MISSING CALLOUTS compiles and matches patterns, callouts sometimes do not happen exactly as you might expect. 
+ Auto-possessification + At compile time, PCRE2 "auto-possessifies" repeated items when it knows that what follows cannot be part of the repeat. For example, a+[bc] is compiled as if it were a++[bc]. The pcre2test output when this pattern @@ -3279,6 +3297,53 @@ MISSING CALLOUTS This time, when matching [bc] fails, the matcher backtracks into a+ and tries again, repeatedly, until a+ itself fails. + Automatic .* anchoring + + By default, an optimization is applied when .* is the first significant + item in a pattern. If PCRE2_DOTALL is set, so that the dot can match + any character, the pattern is automatically anchored. If PCRE2_DOTALL + is not set, a match can start only after an internal newline or at the + beginning of the subject, and pcre2_compile() remembers this. This + optimization is disabled, however, if .* is in an atomic group or if + there is a back reference to the capturing group in which it appears. + It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How- + ever, the presence of callouts does not affect it. + + For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT + and applied to the string "aa", the pcre2test output is: + + --->aa + +0 ^ .* + +2 ^ ^ \d + +2 ^^ \d + +2 ^ \d + No match + + This shows that all match attempts start at the beginning of the sub- + ject. In other words, the pattern is anchored. You can disable this + optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or + starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out- + put changes to: + + --->aa + +0 ^ .* + +2 ^ ^ \d + +2 ^^ \d + +2 ^ \d + +0 ^ .* + +2 ^^ \d + +2 ^ \d + No match + + This shows more match attempts, starting at the second subject charac- + ter. Another optimization, described in the next section, means that + there is no subsequent attempt to match with an empty subject. + + If a pattern has more than one top-level branch, automatic anchoring + occurs if all branches are anchorable. 
+ + Other optimizations + Other optimizations that provide fast "no match" results also affect callouts. For example, if the pattern is @@ -3410,8 +3475,8 @@ AUTHOR REVISION - Last updated: 25 November 2014 - Copyright (c) 1997-2014 University of Cambridge. + Last updated: 02 January 2015 + Copyright (c) 1997-2015 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/doc/pcre2_compile.3 b/doc/pcre2_compile.3 index 6040420..cf0858d 100644 --- a/doc/pcre2_compile.3 +++ b/doc/pcre2_compile.3 @@ -1,4 +1,4 @@ -.TH PCRE2_COMPILE 3 "21 October 2014" "PCRE2 10.00" +.TH PCRE2_COMPILE 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -51,6 +51,7 @@ or provide an external function for stack size checking. The option bits are: PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- theses (named ones available) PCRE2_NO_AUTO_POSSESS Disable auto-possessification + PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .* PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity (only relevant if PCRE2_UTF is set) diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index 282fdc6..183f7fa 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "22 December 2014" "PCRE2 10.00" +.TH PCRE2API 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -1163,6 +1163,19 @@ use, auto-possessification means that some callouts are never taken. You can set this option if you want the matching functions to do a full unoptimized search and run all the callouts, but it is mainly provided for testing purposes. 
+.sp + PCRE2_NO_DOTSTAR_ANCHOR +.sp +If this option is set, it disables an optimization that is applied when .* is +the first significant item in a top-level branch of a pattern, and all the +other branches also start with .* or with \eA or \eG or ^. The optimization is +automatically disabled for .* if it is inside an atomic group or a capturing +group that is the subject of a back reference, or if the pattern contains +(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is +automatically anchored if PCRE2_DOTALL is set for all the .* items and +PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match +must start either at the start of the subject or following a newline is +remembered. Like other optimizations, this can cause callouts to be skipped. .sp PCRE2_NO_START_OPTIMIZE .sp @@ -1436,18 +1449,27 @@ force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with the PCRE2_EXTENDED option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and PCRE2_EXTENDED. .P -A pattern is automatically anchored by PCRE2 if all of its top-level -alternatives begin with one of the following: +A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if +the first significant item in every top-level branch is one of the following: .sp ^ unless PCRE2_MULTILINE is set \eA always \eG always -.\" JOIN - .* if PCRE2_DOTALL is set and there are no back - references to the subpattern in which .* appears + .* sometimes - see below .sp -For such patterns, the PCRE2_ANCHORED bit is set in the options returned for -PCRE2_INFO_ALLOPTIONS. +When .* is the first significant item, anchoring is possible only when all the +following are true: +.sp + .* is not in an atomic group +.\" JOIN + .* is not in a capturing group that is the subject + of a back reference + PCRE2_DOTALL is in force for .* + Neither (*PRUNE) nor (*SKIP) appears in the pattern. + PCRE2_NO_DOTSTAR_ANCHOR is not set. 
+.sp +For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the +options returned for PCRE2_INFO_ALLOPTIONS. .sp PCRE2_INFO_BACKREFMAX .sp @@ -1475,18 +1497,10 @@ variable. .P If there is a fixed first value, for example, the letter "c" from a pattern such as (cat|cow|coyote), 1 is returned, and the character value can be -retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, and -if either -.sp -(a) the pattern was compiled with the PCRE2_MULTILINE option, and every branch -starts with "^", or -.sp -(b) every branch of the pattern starts with ".*" and PCRE2_DOTALL is not set -(if it were set, the pattern would be anchored), -.sp -2 is returned, indicating that the pattern matches only at the start of a -subject string or after any newline within the string. Otherwise 0 is -returned. For anchored patterns, 0 is returned. +retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but +it is known that a match can occur only at the start of the subject or +following a newline in the subject, 2 is returned. Otherwise, and for anchored +patterns, 0 is returned. .sp PCRE2_INFO_FIRSTCODEUNIT .sp @@ -2835,6 +2849,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 22 December 2014 -Copyright (c) 1997-2014 University of Cambridge. +Last updated: 02 January 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2callout.3 b/doc/pcre2callout.3 index 4e83305..eeac0d5 100644 --- a/doc/pcre2callout.3 +++ b/doc/pcre2callout.3 @@ -1,4 +1,4 @@ -.TH PCRE2CALLOUT 3 "25 November 2014" "PCRE2 10.00" +.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -65,7 +65,11 @@ particular pattern. You should be aware that, because of optimizations in the way PCRE2 compiles and matches patterns, callouts sometimes do not happen exactly as you might expect. -.P +. +. 
+.SS "Auto-possessification" +.rs +.sp At compile time, PCRE2 "auto-possessifies" repeated items when it knows that what follows cannot be part of the repeat. For example, a+[bc] is compiled as if it were a++[bc]. The \fBpcre2test\fP output when this pattern is compiled @@ -93,7 +97,56 @@ case, the output changes to this: .sp This time, when matching [bc] fails, the matcher backtracks into a+ and tries again, repeatedly, until a+ itself fails. +. +. +.SS "Automatic .* anchoring" +.rs +.sp +By default, an optimization is applied when .* is the first significant item in +a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the +pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can +start only after an internal newline or at the beginning of the subject, and +\fBpcre2_compile()\fP remembers this. This optimization is disabled, however, +if .* is in an atomic group or if there is a back reference to the capturing +group in which it appears. It is also disabled if the pattern contains (*PRUNE) +or (*SKIP). However, the presence of callouts does not affect it. .P +For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and +applied to the string "aa", the \fBpcre2test\fP output is: +.sp + --->aa + +0 ^ .* + +2 ^ ^ \ed + +2 ^^ \ed + +2 ^ \ed + No match +.sp +This shows that all match attempts start at the beginning of the subject. In +other words, the pattern is anchored. You can disable this optimization by +passing PCRE2_NO_DOTSTAR_ANCHOR to \fBpcre2_compile()\fP, or starting the +pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to: +.sp + --->aa + +0 ^ .* + +2 ^ ^ \ed + +2 ^^ \ed + +2 ^ \ed + +0 ^ .* + +2 ^^ \ed + +2 ^ \ed + No match +.sp +This shows more match attempts, starting at the second subject character. +Another optimization, described in the next section, means that there is no +subsequent attempt to match with an empty subject. 
+.P +If a pattern has more than one top-level branch, automatic anchoring occurs if +all branches are anchorable. +. +. +.SS "Other optimizations" +.rs +.sp Other optimizations that provide fast "no match" results also affect callouts. For example, if the pattern is .sp @@ -232,6 +285,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 25 November 2014 -Copyright (c) 1997-2014 University of Cambridge. +Last updated: 02 January 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index e9b0f8f..fcd76a1 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00" +.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -117,6 +117,19 @@ reaching "no match" results. For more details, see the documentation. . . +.SS "Disabling automatic anchoring" +.rs +.sp +If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as +setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that +apply to patterns whose top-level branches all start with .* (match any number +of arbitrary characters). For more details, see the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +. +. .SS "Setting match and recursion limits" .rs .sp @@ -1853,7 +1866,8 @@ one succeeds. Consider this pattern: (?>.*?a)b .sp It matches "ab" in the subject "aab". The use of the backtracking control verbs -(*PRUNE) and (*SKIP) also disable this optimization. +(*PRUNE) and (*SKIP) also disable this optimization, and there is an option, +PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. .P When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after @@ -3278,6 +3292,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 14 November 2014 -Copyright (c) 1997-2014 University of Cambridge. 
+Last updated: 02 January 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3 index e163e84..ec86fe7 100644 --- a/doc/pcre2perform.3 +++ b/doc/pcre2perform.3 @@ -1,4 +1,4 @@ -.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00" +.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 PERFORMANCE" @@ -105,14 +105,18 @@ such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is less with a DFA matching function, and in both cases there is not much difference for \eb. .P -When a pattern begins with .* not in parentheses, or in parentheses that are -not the subject of a backreference, and the PCRE2_DOTALL option is set, the -pattern is implicitly anchored by PCRE2, since it can match only at the start -of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make -this optimization, because the dot metacharacter does not then match a newline, -and if the subject string contains newlines, the pattern may match from the -character immediately following one of them instead of from the very start. For -example, the pattern +When a pattern begins with .* not in atomic parentheses, nor in parentheses +that are the subject of a backreference, and the PCRE2_DOTALL option is set, +the pattern is implicitly anchored by PCRE2, since it can match only at the +start of a subject string. If the pattern has multiple top-level branches, they +must all be anchorable. The optimization can be disabled by the +PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern +contains (*PRUNE) or (*SKIP). +.P +If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the +dot metacharacter does not then match a newline, and if the subject string +contains newlines, the pattern may match from the character immediately +following one of them instead of from the very start. 
For example, the pattern .sp .*second .sp @@ -173,6 +177,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 20 October 2014 -Copyright (c) 1997-2014 University of Cambridge. +Last updated: 02 January 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index 7420896..580f892 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "23 November 2014" "PCRE2 10.00" +.TH PCRE2SYNTAX 3 "02 January 2015" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -389,6 +389,7 @@ appear. (*NOTEMPTY) set PCRE2_NOTEMPTY when matching (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) + (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) (*UTF) set appropriate UTF mode for the library in use (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) @@ -536,6 +537,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 23 November 2014 -Copyright (c) 1997-2014 University of Cambridge. +Last updated: 02 January 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 9223f66..89d182d 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -1,4 +1,4 @@ -.TH PCRE2TEST 1 "23 November 2014" "PCRE 10.00" +.TH PCRE2TEST 1 "02 January 2015" "PCRE 10.00" .SH NAME pcre2test - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS @@ -247,7 +247,7 @@ checked for compatibility with the \fBperltest.sh\fP script, which is used to confirm that Perl gives the same results as PCRE2. 
Also, apart from comment lines, none of the other command lines are permitted, because they and many of the modifiers are specific to \fBpcre2test\fP, and should not be used in -test files that are also processed by \fBperltest.sh\fP. The \fP#perltest\fB +test files that are also processed by \fBperltest.sh\fP. The \fB#perltest\fP command helps detect tests that are accidentally put in the wrong file. .sp #subject <modifier-list>
@@ -413,6 +413,7 @@ for a description of their effects.
  never_utf set PCRE2_NEVER_UTF
  no_auto_capture set PCRE2_NO_AUTO_CAPTURE
  no_auto_possess set PCRE2_NO_AUTO_POSSESS
+ no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
  no_start_optimize set PCRE2_NO_START_OPTIMIZE
  no_utf_check set PCRE2_NO_UTF_CHECK
  ucp set PCRE2_UCP
@@ -552,7 +553,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
 setting the size of the JIT stack.
 .P
 If the \fBjitfast\fP modifier is specified, matching is done using the JIT
-"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
+"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
 checks that are done by \fBpcre2_match()\fP, and of course does not work when
 JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
 assumed.
@@ -1274,6 +1275,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 02 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
 .fi
diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
index 721aa7d..b1a5551 100644
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -402,6 +402,7 @@ PATTERN MODIFIERS
  never_utf set PCRE2_NEVER_UTF
  no_auto_capture set PCRE2_NO_AUTO_CAPTURE
  no_auto_possess set PCRE2_NO_AUTO_POSSESS
+ no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
  no_start_optimize set PCRE2_NO_START_OPTIMIZE
  no_utf_check set PCRE2_NO_UTF_CHECK
  ucp set PCRE2_UCP
@@ -1185,5 +1186,5 @@ AUTHOR
 REVISION
- Last updated: 23 November 2014
- Copyright (c) 1997-2014 University of Cambridge.
+ Last updated: 02 January 2015
+ Copyright (c) 1997-2015 University of Cambridge.
diff --git a/src/pcre2.h.in b/src/pcre2.h.in
index 00c7460..2d0d031 100644
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@@ -5,7 +5,7 @@
 /* This is the public header file for the PCRE library, second API, to be
 #included by applications that call PCRE2 functions.
- Copyright (c) 2014 University of Cambridge
+ Copyright (c) 2015 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -113,10 +113,11 @@ D is inspected during pcre2_dfa_match() execution
 #define PCRE2_NEVER_UTF 0x00001000u /* C */
 #define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
 #define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
-#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */
-#define PCRE2_UCP 0x00010000u /* C J M D */
-#define PCRE2_UNGREEDY 0x00020000u /* C */
-#define PCRE2_UTF 0x00040000u /* C J M D */
+#define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u /* C */
+#define PCRE2_NO_START_OPTIMIZE 0x00010000u /* J M D */
+#define PCRE2_UCP 0x00020000u /* C J M D */
+#define PCRE2_UNGREEDY 0x00040000u /* C */
+#define PCRE2_UTF 0x00080000u /* C J M D */
 /* These are for pcre2_jit_compile(). */
diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c
index 57753e9..149abe9 100644
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
  Written by Philip Hazel
  Original API code Copyright (c) 1997-2012 University of Cambridge
- New API code Copyright (c) 2014 University of Cambridge
+ New API code Copyright (c) 2015 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -557,8 +557,8 @@ static PCRE2_SPTR posix_substitutes[] = {
  PCRE2_CASELESS|PCRE2_DOLLAR_ENDONLY|PCRE2_DOTALL|PCRE2_DUPNAMES| \
  PCRE2_EXTENDED|PCRE2_FIRSTLINE|PCRE2_MATCH_UNSET_BACKREF| \
  PCRE2_MULTILINE|PCRE2_NEVER_UCP|PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE| \
- PCRE2_NO_AUTO_POSSESS|PCRE2_NO_START_OPTIMIZE|PCRE2_NO_UTF_CHECK| \
- PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
+ PCRE2_NO_AUTO_POSSESS|PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_NO_START_OPTIMIZE| \
+ PCRE2_NO_UTF_CHECK|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_UTF)
 /* Compile time error code numbers. They are given names so that they can more
 easily be tracked. When a new number is added, the tables called eint1 and
@@ -597,22 +597,23 @@ typedef struct pso {
 /* NB: STRING_UTFn_RIGHTPAR contains the length as well */
 static pso pso_list[] = {
- { (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
- { (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
- { (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
- { (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
- { (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR,17, PSO_FLG, PCRE2_NE_ATST_SET },
- { (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
- { (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
- { (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
- { (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
- { (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
- { (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
- { (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
- { (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
- { (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
- { (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
- { (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
+ { (uint8_t *)STRING_UTFn_RIGHTPAR, PSO_OPT, PCRE2_UTF },
+ { (uint8_t *)STRING_UTF_RIGHTPAR, 4, PSO_OPT, PCRE2_UTF },
+ { (uint8_t *)STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
+ { (uint8_t *)STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
+ { (uint8_t *)STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
+ { (uint8_t *)STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
+ { (uint8_t *)STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
+ { (uint8_t *)STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
+ { (uint8_t *)STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
+ { (uint8_t *)STRING_LIMIT_RECURSION_EQ, 16, PSO_LIMR, 0 },
+ { (uint8_t *)STRING_CR_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_CR },
+ { (uint8_t *)STRING_LF_RIGHTPAR, 3, PSO_NL, PCRE2_NEWLINE_LF },
+ { (uint8_t *)STRING_CRLF_RIGHTPAR, 5, PSO_NL, PCRE2_NEWLINE_CRLF },
+ { (uint8_t *)STRING_ANY_RIGHTPAR, 4, PSO_NL, PCRE2_NEWLINE_ANY },
+ { (uint8_t *)STRING_ANYCRLF_RIGHTPAR, 8, PSO_NL, PCRE2_NEWLINE_ANYCRLF },
+ { (uint8_t *)STRING_BSR_ANYCRLF_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_ANYCRLF },
+ { (uint8_t *)STRING_BSR_UNICODE_RIGHTPAR, 12, PSO_BSR, PCRE2_BSR_UNICODE }
  };
@@ -7020,13 +7021,14 @@ do {
  /* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
  it isn't in brackets that are or may be referenced or inside an atomic
- group. */
+ group. There is also an option that disables auto-anchoring. */
  else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
  op == OP_TYPEPOSSTAR))
  {
  if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
- atomcount > 0 || cb->had_pruneorskip)
+ atomcount > 0 || cb->had_pruneorskip ||
+ (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
  return FALSE;
  }
@@ -7140,12 +7142,13 @@ do {
  brackets that may be referenced, as long as the pattern does not contain
  *PRUNE or *SKIP, because these break the feature. Consider, for example,
  /.*?a(*PRUNE)b/ with the subject "aab", which matches "ab", i.e. not at the
- start of a line. */
+ start of a line. There is also an option that disables this optimization. */
  else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
  {
  if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
- atomcount > 0 || cb->had_pruneorskip)
+ atomcount > 0 || cb->had_pruneorskip ||
+ (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
  return FALSE;
  }
@@ -7863,7 +7866,8 @@ if (errorcode != 0)
 /* Successful compile. If the anchored option was not passed, set it if we can
 determine that the pattern is anchored by virtue of ^ characters or \A or
 anything else, such as starting with non-atomic .* when DOTALL is set and
-there are no occurrences of *PRUNE or *SKIP. */
+there are no occurrences of *PRUNE or *SKIP (though there is an option to
+disable this case). */
 if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
  is_anchored(codestart, 0, &cb, 0))
@@ -7912,7 +7916,8 @@ if ((re->overall_options & (PCRE2_ANCHORED|PCRE2_NO_START_OPTIMIZE)) == 0)
  /* When there is no first code unit, see if we can set the PCRE2_STARTLINE
  flag. This is helpful for multiline matches when all branches start with ^
  and also when all branches start with non-atomic .* for non-DOTALL matches
- when *PRUNE and SKIP are not present. */
+ when *PRUNE and SKIP are not present. (There is an option that disables this
+ case.) */
  else if (is_startline(codestart, 0, &cb, 0)) re->flags |= PCRE2_STARTLINE;
  }
diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h
index cbddecc..38eb20e 100644
--- a/src/pcre2_internal.h
+++ b/src/pcre2_internal.h
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
  Written by Philip Hazel
  Original API code Copyright (c) 1997-2012 University of Cambridge
- New API code Copyright (c) 2014 University of Cambridge
+ New API code Copyright (c) 2015 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -904,6 +904,7 @@ a positive value. */
 #define STRING_UTF_RIGHTPAR "UTF)"
 #define STRING_UCP_RIGHTPAR "UCP)"
 #define STRING_NO_AUTO_POSSESS_RIGHTPAR "NO_AUTO_POSSESS)"
+#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR "NO_DOTSTAR_ANCHOR)"
 #define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
 #define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)"
 #define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)"
@@ -1173,6 +1174,7 @@ only. */
 #define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
 #define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
 #define STRING_NO_AUTO_POSSESS_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_A STR_U STR_T STR_O STR_UNDERSCORE STR_P STR_O STR_S STR_S STR_E STR_S STR_S STR_RIGHT_PARENTHESIS
+#define STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_D STR_O STR_T STR_S STR_T STR_A STR_R STR_UNDERSCORE STR_A STR_N STR_C STR_H STR_O STR_R STR_RIGHT_PARENTHESIS
 #define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
 #define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS
 #define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS
diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index da05cf6..1c44503 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -445,7 +445,7 @@ match() only in the case when ovecsave is needed. (David Wheeler used to say
 "All problems in computer science can be solved by another level of
 indirection.")
-HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
+HOWEVER: when this file is compiled by gcc in an optimizing mode, because this
 function is called only once, and only from within match(), gcc will "inline"
 it - that is, move it inside match() - and this completely negates its reason
 for existence. Therefore, we mark it as non-inline when gcc is in use.
diff --git a/src/pcre2test.c b/src/pcre2test.c
index a6b6229..173b76b 100644
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@@ -11,7 +11,7 @@ hacked-up (non-) design had also run out of steam.
  Written by Philip Hazel
  Original code Copyright (c) 1997-2012 University of Cambridge
- Rewritten code Copyright (c) 2014 University of Cambridge
+ Rewritten code Copyright (c) 2015 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -498,6 +498,7 @@ static modstruct modlist[] = {
  { "newline", MOD_CTC, MOD_NL, 0, CO(newline_convention) },
  { "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
  { "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
+ { "no_dotstar_anchor", MOD_PAT, MOD_OPT, PCRE2_NO_DOTSTAR_ANCHOR, PO(options) },
  { "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
  { "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
  { "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
@@ -3291,29 +3292,30 @@ static void show_compile_options(uint32_t options, const char *before,
  const char *after)
 {
 if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
  before,
- ((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
- ((options & PCRE2_CASELESS) != 0)? " caseless" : "",
- ((options & PCRE2_EXTENDED) != 0)? " extended" : "",
- ((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
- ((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
- ((options & PCRE2_DOTALL) != 0)? " dotall" : "",
- ((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
- ((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
- ((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
- ((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
- ((options & PCRE2_UTF) != 0)? " utf" : "",
- ((options & PCRE2_UCP) != 0)? " ucp" : "",
- ((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
- ((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
- ((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
  ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
  ((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
+ ((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
  ((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
+ ((options & PCRE2_CASELESS) != 0)? " caseless" : "",
+ ((options & PCRE2_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
+ ((options & PCRE2_DOTALL) != 0)? " dotall" : "",
+ ((options & PCRE2_DUPNAMES) != 0)? " dupnames" : "",
+ ((options & PCRE2_EXTENDED) != 0)? " extended" : "",
+ ((options & PCRE2_FIRSTLINE) != 0)? " firstline" : "",
  ((options & PCRE2_MATCH_UNSET_BACKREF) != 0)? " match_unset_backref" : "",
+ ((options & PCRE2_MULTILINE) != 0)? " multiline" : "",
  ((options & PCRE2_NEVER_UCP) != 0)? " never_ucp" : "",
  ((options & PCRE2_NEVER_UTF) != 0)? " never_utf" : "",
+ ((options & PCRE2_NO_AUTO_CAPTURE) != 0)? " no_auto_capture" : "",
+ ((options & PCRE2_NO_AUTO_POSSESS) != 0)? " no_auto_possess" : "",
+ ((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)? " no_dotstar_anchor" : "",
+ ((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
+ ((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
+ ((options & PCRE2_UCP) != 0)? " ucp" : "",
+ ((options & PCRE2_UNGREEDY) != 0)? " ungreedy" : "",
+ ((options & PCRE2_UTF) != 0)? " utf" : "",
  after);
 }
diff --git a/testdata/testinput2 b/testdata/testinput2
index 914d620..6d0d259 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -4100,4 +4100,20 @@ a random value. /Ix
 /a(b)c(d)/
     abc\=ph,copy=0,copy=1,getall
+/^abc/info
+
+/^abc/info,no_dotstar_anchor
+
+/.*\d/info,auto_callout
+    aaa
+
+/.*\d/info,no_dotstar_anchor,auto_callout
+    aaa
+
+/.*\d/dotall,info
+
+/.*\d/dotall,no_dotstar_anchor,info
+
+/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
+
 # End of testinput2
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index f677c9f..4d90a96 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -5361,7 +5361,7 @@ No match
 "<(\w+)/?>(.)*(\1)>"Igms
 Capturing subpattern count = 3
 Max back reference = 1
-Options: multiline dotall
+Options: dotall multiline
 First code unit = '<'
 Last code unit = '>'
 Subject length lower bound = 7
@@ -5399,7 +5399,7 @@ No match
 /line\nbreak/Im,firstline
 Capturing subpattern count = 0
 Contains explicit CR or LF match
-Options: multiline firstline
+Options: firstline multiline
 First code unit = 'l'
 Last code unit = 'k'
 Subject length lower bound = 10
@@ -9698,7 +9698,7 @@ Subject length lower bound = 41
 /Iisx
 Capturing subpattern count = 3
 Max back reference = 1
-Options: caseless extended dotall
+Options: caseless dotall extended
 First code unit = '<'
 Last code unit = '='
 Subject length lower bound = 9
@@ -9747,7 +9747,7 @@ Named capturing subpatterns:
  quote 4
  realquote 3
  realquote 6
-Options: extended dupnames
+Options: dupnames extended
 Starting code units: a b
 Subject length lower bound = 3
     a"aaaaa
@@ -9805,8 +9805,8 @@ Capturing subpattern count = 4
 Named capturing subpatterns:
  D 4
  D 1
-Compile options: extended dupnames
-Overall options: anchored extended dupnames
+Compile options: dupnames extended
+Overall options: anchored dupnames extended
 Subject length lower bound = 2
     abcdX
  0: abcdX
@@ -9852,7 +9852,7 @@ Capturing subpattern count = 4
 Named capturing subpatterns:
  A 1
  A 4
-Options: extended dupnames
+Options: dupnames extended
 First code unit = 'a'
 Last code unit = 'd'
 Subject length lower bound = 4
@@ -9936,7 +9936,7 @@ No match
 /(\3)(\1)(a)/I,alt_bsux,allow_empty_class,match_unset_backref,dupnames
 Capturing subpattern count = 3
 Max back reference = 3
-Options: dupnames alt_bsux allow_empty_class match_unset_backref
+Options: alt_bsux allow_empty_class dupnames match_unset_backref
 Last code unit = 'a'
 Subject length lower bound = 1
     cat
@@ -13769,4 +13769,67 @@ Partial match: abc
 Copy substring 1 failed (-2): partial match
 get substring list failed (-2): partial match
+/^abc/info
+Capturing subpattern count = 0
+Compile options: <none>
+Overall options: anchored
+Subject length lower bound = 3
+
+/^abc/info,no_dotstar_anchor
+Capturing subpattern count = 0
+Compile options: no_dotstar_anchor
+Overall options: anchored no_dotstar_anchor
+Subject length lower bound = 3
+
+/.*\d/info,auto_callout
+Capturing subpattern count = 0
+Options: auto_callout
+First code unit at start or follows newline
+Subject length lower bound = 1
+    aaa
+--->aaa
+ +0 ^       .*
+ +2 ^  ^    \d
+ +2 ^ ^     \d
+ +2 ^^      \d
+ +2 ^       \d
+No match
+
+/.*\d/info,no_dotstar_anchor,auto_callout
+Capturing subpattern count = 0
+Options: auto_callout no_dotstar_anchor
+Subject length lower bound = 1
+    aaa
+--->aaa
+ +0 ^       .*
+ +2 ^  ^    \d
+ +2 ^ ^     \d
+ +2 ^^      \d
+ +2 ^       \d
+ +0  ^      .*
+ +2  ^ ^    \d
+ +2  ^^     \d
+ +2  ^      \d
+ +0   ^     .*
+ +2   ^^    \d
+ +2   ^     \d
+No match
+
+/.*\d/dotall,info
+Capturing subpattern count = 0
+Compile options: dotall
+Overall options: anchored dotall
+Subject length lower bound = 1
+
+/.*\d/dotall,no_dotstar_anchor,info
+Capturing subpattern count = 0
+Options: dotall no_dotstar_anchor
+Subject length lower bound = 1
+
+/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
+Capturing subpattern count = 0
+Compile options: <none>
+Overall options: dotall no_dotstar_anchor
+Subject length lower bound = 1
+
 # End of testinput2
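Both compile-time checks changed in src/pcre2_compile.c gain the same extra condition, so the gate can be read in isolation. What follows is a minimal standalone sketch of that condition, not the library's code: the struct `dotstar_ctx`, the function `dotstar_blocks_optimization` and the `dot_op_matches` parameter are hypothetical names introduced here for illustration, and only the option value is copied from the src/pcre2.h.in hunk in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value copied from the src/pcre2.h.in hunk in this patch. */
#define PCRE2_NO_DOTSTAR_ANCHOR 0x00008000u

/* Hypothetical stand-in for the few compile-block fields that the patched
   tests read; the real compile block has many more members. */
typedef struct {
  uint32_t external_options;  /* options passed in by the caller */
  uint32_t backref_map;       /* bit map of groups that are back referenced */
  bool had_pruneorskip;       /* pattern contains (*PRUNE) or (*SKIP) */
} dotstar_ctx;

/* Returns true when a leading .* must NOT trigger the optimization.
   dot_op_matches is an assumption of this sketch: it is true when the item
   being repeated is the opcode the caller requires (OP_ALLANY in the
   anchoring test, OP_ANY in the start-of-line test). */
static bool dotstar_blocks_optimization(const dotstar_ctx *cb,
  bool dot_op_matches, uint32_t bracket_map, int atomcount)
{
  return !dot_op_matches ||
         (bracket_map & cb->backref_map) != 0 ||  /* inside a referenced group */
         atomcount > 0 ||                         /* inside an atomic group */
         cb->had_pruneorskip ||
         (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0;  /* new test */
}
```

With every other condition clear, setting the new bit in `external_options` is enough to make the function return true. This corresponds to the test output in this patch, where `/.*\d/dotall,no_dotstar_anchor,info` no longer reports `anchored` among its options.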