Make PCRE2_NO_START_OPTIMIZE a compile-only option.

Philip.Hazel 2014-10-01 16:16:27 +00:00
parent 313245365d
commit a0410efc56
12 changed files with 142 additions and 137 deletions

@ -930,9 +930,8 @@ documentation).
.P
For those options that can be different in different parts of the pattern, the
contents of the \fIoptions\fP argument specifies their settings at the start of
compilation. The PCRE2_ANCHORED, PCRE2_NO_UTF_CHECK, and
PCRE2_NO_START_OPTIMIZE options can be set at the time of matching as well as
at compile time.
compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at
the time of matching as well as at compile time.
.P
Other, less frequently required compile-time parameters (for example, the
newline setting) can be provided in a compile context (as described
@ -1150,17 +1149,52 @@ purposes.
.sp
PCRE2_NO_START_OPTIMIZE
.sp
This is an option that acts at matching time; that is, it is really an option
for \fBpcre2_match()\fP or \fBpcre_dfa_match()\fP. If it is set at compile
time, it is remembered with the compiled pattern and assumed at matching time.
This is necessary if you want to use JIT execution, because the JIT compiler
needs to know whether or not this option is set. For details, see the
discussion of PCRE2_NO_START_OPTIMIZE in the section on \fBpcre2_match()\fP
options
.\" HTML <a href="#matchoptions">
.\" </a>
below.
.\"
This is an option whose main effect is at matching time. It does not change
what \fBpcre2_compile()\fP generates, but it does affect the output of the JIT
compiler.
.P
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific character, the matching code searches the
subject for that character, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
(*MARK) items are in use, these "start-up" optimizations can cause them to be
skipped if the pattern is never actually used. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
.P
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases where the
result is "no match", the callouts do occur, and that items such as (*COMMIT)
and (*MARK) are considered at every possible starting position in the subject
string.
.P
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
Consider the pattern
.sp
(*COMMIT)ABC
.sp
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match". There are also other start-up optimizations.
For example, a minimum length for the subject may be recorded. Consider the
pattern
.sp
(*MARK:A)(X|Y)
.sp
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
.sp
PCRE2_NO_UTF_CHECK
.sp
@ -1787,10 +1821,9 @@ pattern does not require the match to be at the start of the subject.
.rs
.sp
The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be
zero. The only bits that may be set are PCRE2_ANCHORED,
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
PCRE2_PARTIAL_SOFT. Their action is described below.
zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
.P
If the pattern was successfully processed by the just-in-time (JIT) compiler,
the only supported options for matching using the JIT code are PCRE2_NOTBOL,
@ -1840,54 +1873,6 @@ valid, so PCRE2 searches further into the string for occurrences of "a" or "b".
This is like PCRE2_NOTEMPTY, except that an empty string match that is not at
the start of the subject is permitted. If the pattern is anchored, such a match
can occur only if the pattern contains \eK.
.sp
PCRE2_NO_START_OPTIMIZE
.sp
There are a number of optimizations that \fBpcre2_match()\fP uses at the start
of a match, in order to speed up the process. For example, if it is known that
an unanchored match must start with a specific character, it searches the
subject for that character, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
(*MARK) items are in use, these "start-up" optimizations can cause them to be
skipped if the pattern is never actually used. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
.P
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases where the
result is "no match", the callouts do occur, and that items such as (*COMMIT)
and (*MARK) are considered at every possible starting position in the subject
string. If PCRE2_NO_START_OPTIMIZE is set at compile time, it cannot be unset
at matching time. The use of PCRE2_NO_START_OPTIMIZE at matching time (that is,
passing it to \fBpcre2_match()\fP) disables JIT execution; in this situation,
matching is always done using interpretively.
.P
Setting PCRE2_NO_START_OPTIMIZE can change the outcome of a matching operation.
Consider the pattern
.sp
(*COMMIT)ABC
.sp
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match". There are also other start-up optimizations.
For example, a minimum length for the subject may be recorded. Consider the
pattern
.sp
(*MARK:A)(X|Y)
.sp
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
.sp
PCRE2_NO_UTF_CHECK
.sp
@ -2550,10 +2535,9 @@ Here is an example of a simple call to \fBpcre2_dfa_match()\fP:
The unused bits of the \fIoptions\fP argument for \fBpcre2_dfa_match()\fP must
be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_NO_START_OPTIMIZE, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of these are
exactly the same as for \fBpcre2_match()\fP, so their description is not
repeated here.
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
\fBpcre2_match()\fP, so their description is not repeated here.
.sp
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
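
The (*COMMIT)ABC case described in the documentation hunk above can be sketched as a toy model in C. This is not PCRE2 code and calls no PCRE2 functions: `toy_commit_abc` and its `no_start_optimize` parameter are hypothetical names that only mimic the documented behaviour, namely that the start-up scan skips to the first "A" before (*COMMIT) is ever seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: it imitates the documented behaviour of the pattern
   (*COMMIT)ABC and is not taken from the PCRE2 sources. */

/* Returns the offset at which "ABC" matched, or -1 for "no match". */
int toy_commit_abc(const char *subject, bool no_start_optimize)
{
    size_t len = strlen(subject);
    size_t start = 0;

    if (!no_start_optimize) {
        /* Start-up optimization: move to the first 'A' before the
           pattern, and therefore (*COMMIT), is run at all. */
        const char *a = strchr(subject, 'A');
        if (a == NULL) return -1;
        start = (size_t)(a - subject);
    }

    /* (*COMMIT) is passed on the first attempt, so only this one
       starting position is tried: ABC matches here or the match fails. */
    if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
        return (int)start;
    return -1;
}
```

With the subject "DEFABC" the model matches at offset 3 when the scan is enabled and reports no match when it is disabled, which is the difference the documentation describes.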

@ -111,7 +111,7 @@ give a "no match" return without actually running a match if the subject is not
long enough, or, for unanchored patterns, if it has been scanned far enough.
.P
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
option to the matching function, or by starting the pattern with
option to \fBpcre2_compile()\fP, or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure that
callouts such as the example above are obeyed.
.

@ -107,9 +107,8 @@ or the JIT compiler was not able to handle the pattern.
.sp
The \fBpcre2_match()\fP options that are supported for JIT matching are
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The options
that are not supported at match time are PCRE2_ANCHORED and
PCRE2_NO_START_OPTIMIZE, though they are supported if given at compile time.
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
PCRE2_ANCHORED option is not supported at match time.
.P
The only unsupported pattern items are \eC (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition

@ -662,7 +662,6 @@ for a description of their effects.
anchored set PCRE2_ANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
notbol set PCRE2_NOTBOL
notempty set PCRE2_NOTEMPTY

@ -86,8 +86,7 @@ passed. Put these bits at the most significant end of the options word so
others can be added next to them */
#define PCRE2_ANCHORED 0x80000000u
#define PCRE2_NO_START_OPTIMIZE 0x40000000u
#define PCRE2_NO_UTF_CHECK 0x20000000u
#define PCRE2_NO_UTF_CHECK 0x40000000u
/* Other options that can be passed to pcre2_compile(). They may affect
compilation, JIT compilation, and/or interpretive execution. The following tags
@ -95,7 +94,7 @@ indicate which:
C alters what is compiled
J alters what JIT compiles
E is inspected during pcre2_match() execution
M is inspected during pcre2_match() execution
D is inspected during pcre2_dfa_match() execution
*/
@ -103,20 +102,21 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_ALT_BSUX 0x00000002u /* C */
#define PCRE2_AUTO_CALLOUT 0x00000004u /* C */
#define PCRE2_CASELESS 0x00000008u /* C */
#define PCRE2_DOLLAR_ENDONLY 0x00000010u /* J E D */
#define PCRE2_DOLLAR_ENDONLY 0x00000010u /* J M D */
#define PCRE2_DOTALL 0x00000020u /* C */
#define PCRE2_DUPNAMES 0x00000040u /* C */
#define PCRE2_EXTENDED 0x00000080u /* C */
#define PCRE2_FIRSTLINE 0x00000100u /* J E D */
#define PCRE2_MATCH_UNSET_BACKREF 0x00000200u /* C J E */
#define PCRE2_FIRSTLINE 0x00000100u /* J M D */
#define PCRE2_MATCH_UNSET_BACKREF 0x00000200u /* C J M */
#define PCRE2_MULTILINE 0x00000400u /* C */
#define PCRE2_NEVER_UCP 0x00000800u /* C */
#define PCRE2_NEVER_UTF 0x00001000u /* C */
#define PCRE2_NO_AUTO_CAPTURE 0x00002000u /* C */
#define PCRE2_NO_AUTO_POSSESS 0x00004000u /* C */
#define PCRE2_UCP 0x00008000u /* C J E D */
#define PCRE2_UNGREEDY 0x00010000u /* C */
#define PCRE2_UTF 0x00020000u /* C J E D */
#define PCRE2_NO_START_OPTIMIZE 0x00008000u /* J M D */
#define PCRE2_UCP 0x00010000u /* C J M D */
#define PCRE2_UNGREEDY 0x00020000u /* C */
#define PCRE2_UTF 0x00040000u /* C J M D */
/* These are for pcre2_jit_compile(). */
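
Because this hunk renumbers several option bits, a quick consistency check may be useful. The sketch below copies the values from the new side of the hunk (only the options visible there) and verifies that each is a single bit and that no two coincide. The array and function names are hypothetical and not part of pcre2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the new side of the pcre2.h hunk. */
static const uint32_t new_option_bits[] = {
    0x80000000u, /* PCRE2_ANCHORED */
    0x40000000u, /* PCRE2_NO_UTF_CHECK */
    0x00000002u, /* PCRE2_ALT_BSUX */
    0x00000004u, /* PCRE2_AUTO_CALLOUT */
    0x00000008u, /* PCRE2_CASELESS */
    0x00000010u, /* PCRE2_DOLLAR_ENDONLY */
    0x00000020u, /* PCRE2_DOTALL */
    0x00000040u, /* PCRE2_DUPNAMES */
    0x00000080u, /* PCRE2_EXTENDED */
    0x00000100u, /* PCRE2_FIRSTLINE */
    0x00000200u, /* PCRE2_MATCH_UNSET_BACKREF */
    0x00000400u, /* PCRE2_MULTILINE */
    0x00000800u, /* PCRE2_NEVER_UCP */
    0x00001000u, /* PCRE2_NEVER_UTF */
    0x00002000u, /* PCRE2_NO_AUTO_CAPTURE */
    0x00004000u, /* PCRE2_NO_AUTO_POSSESS */
    0x00008000u, /* PCRE2_NO_START_OPTIMIZE */
    0x00010000u, /* PCRE2_UCP */
    0x00020000u, /* PCRE2_UNGREEDY */
    0x00040000u  /* PCRE2_UTF */
};

/* Returns 1 if every entry is exactly one bit and no bit is reused. */
int option_bits_are_distinct(const uint32_t *bits, size_t n)
{
    uint32_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t b = bits[i];
        if (b == 0 || (b & (b - 1)) != 0) return 0; /* not a single bit */
        if ((seen & b) != 0) return 0;              /* bit already used */
        seen |= b;
    }
    return 1;
}

/* Convenience wrapper for the table above. */
int new_layout_is_consistent(void)
{
    return option_bits_are_distinct(
        new_option_bits, sizeof new_option_bits / sizeof new_option_bits[0]);
}
```

Note that the value 0x40000000u, previously PCRE2_NO_START_OPTIMIZE, now belongs to PCRE2_NO_UTF_CHECK, so code built against the old header would need recompiling.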

@ -85,8 +85,7 @@ in others, so I abandoned this code. */
#define PUBLIC_DFA_MATCH_OPTIONS \
(PCRE2_ANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \
PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \
PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART| \
PCRE2_NO_START_OPTIMIZE)
PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART)
/*************************************************
@ -3319,12 +3318,12 @@ for (;;)
/* There are some optimizations that avoid running the match if a known
starting point is not found, or if a known later code unit is not present.
However, there is an option (settable at compile or match time) that disables
However, there is an option (settable at compile time) that disables
these, for testing and for ensuring that all callouts do actually occur.
The must also be avoided when restarting a DFA match. */
The optimizations must also be avoided when restarting a DFA match. */
if (((options | re->overall_options) &
(PCRE2_NO_START_OPTIMIZE|PCRE2_DFA_RESTART)) == 0)
if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0 &&
(options & PCRE2_DFA_RESTART) == 0)
{
PCRE2_SPTR save_end_subject = end_subject;

@ -55,7 +55,7 @@ POSSIBILITY OF SUCH DAMAGE.
#define PUBLIC_MATCH_OPTIONS \
(PCRE2_ANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \
PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \
PCRE2_PARTIAL_SOFT|PCRE2_NO_START_OPTIMIZE)
PCRE2_PARTIAL_SOFT)
#define PUBLIC_JIT_MATCH_OPTIONS \
(PCRE2_NO_UTF_CHECK|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY|\
@ -6687,10 +6687,10 @@ for(;;)
/* There are some optimizations that avoid running the match if a known
starting point is not found, or if a known later code unit is not present.
However, there is an option (settable at compile or match time) that disables
these, for testing and for ensuring that all callouts do actually occur. */
However, there is an option (settable at compile time) that disables these,
for testing and for ensuring that all callouts do actually occur. */
if (((options | re->overall_options) & PCRE2_NO_START_OPTIMIZE) == 0)
if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
{
PCRE2_SPTR save_end_subject = end_subject;
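
The effect of the changed condition in this hunk can be sketched in isolation. The struct and function names below are hypothetical stand-ins; only the `overall_options` field and the bit value come from the diff. The old test consulted both the match-time options and the compiled pattern, while the new test consults the compiled pattern only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit value from the new side of the pcre2.h hunk. */
#define SKETCH_NO_START_OPTIMIZE 0x00008000u

/* Hypothetical stand-in for the compiled pattern: only the field that
   appears in the hunk is modelled. */
typedef struct {
    uint32_t overall_options;
} sketch_code;

/* Old condition: a match-time bit could also disable the scan. */
bool old_runs_startup_scan(const sketch_code *re, uint32_t options)
{
    return ((options | re->overall_options) & SKETCH_NO_START_OPTIMIZE) == 0;
}

/* New condition: only the options stored at compile time matter. */
bool new_runs_startup_scan(const sketch_code *re, uint32_t options)
{
    (void)options; /* no longer consulted for this decision */
    return (re->overall_options & SKETCH_NO_START_OPTIMIZE) == 0;
}
```

In the real code a match-time use of the bit would presumably never reach this test, since the option has also been removed from PUBLIC_MATCH_OPTIONS; the sketch ignores that earlier validation step.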

@ -461,7 +461,7 @@ static modstruct modlist[] = {
{ "newline", MOD_CTB, MOD_NL, MO(newline_convention), CO(newline_convention) },
{ "no_auto_capture", MOD_PAT, MOD_OPT, PCRE2_NO_AUTO_CAPTURE, PO(options) },
{ "no_auto_possess", MOD_PATP, MOD_OPT, PCRE2_NO_AUTO_POSSESS, PO(options) },
{ "no_start_optimize", MOD_PDP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PD(options) },
{ "no_start_optimize", MOD_PATP, MOD_OPT, PCRE2_NO_START_OPTIMIZE, PO(options) },
{ "no_utf_check", MOD_PD, MOD_OPT, PCRE2_NO_UTF_CHECK, PD(options) },
{ "notbol", MOD_DAT, MOD_OPT, PCRE2_NOTBOL, DO(options) },
{ "notempty", MOD_DAT, MOD_OPT, PCRE2_NOTEMPTY, DO(options) },
@ -3058,11 +3058,10 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s",
static void
show_match_options(uint32_t options)
{
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s",
fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s",
((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
((options & PCRE2_DFA_RESTART) != 0)? " dfa_restart" : "",
((options & PCRE2_DFA_SHORTEST) != 0)? " dfa_shortest" : "",
((options & PCRE2_NO_START_OPTIMIZE) != 0)? " no_start_optimize" : "",
((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
((options & PCRE2_NOTBOL) != 0)? " notbol" : "",
((options & PCRE2_NOTEMPTY) != 0)? " notempty" : "",

13
testdata/testinput2 vendored

@ -2491,12 +2491,15 @@ a random value. /Ix
/xyz/auto_callout
xyz
abcxyz
abcxyz\=no_start_optimize
** Failers
abc
abc\=no_start_optimize
abcxypqr
abcxypqr\=no_start_optimize
/xyz/auto_callout,no_start_optimize
abcxyz
** Failers
abc
abcxypqr
/(*NO_START_OPT)xyz/auto_callout
abcxyz
@ -2987,8 +2990,10 @@ a random value. /Ix
/(*COMMIT)ABC/
ABCDEFG
/(*COMMIT)ABC/no_start_optimize
** Failers
DEFGABC\=no_start_optimize
DEFGABC
/^(ab (c+(*THEN)cd) | xyz)/x
abcccd

19
testdata/testinput6 vendored

@ -4349,12 +4349,15 @@
/xyz/auto_callout
xyz
abcxyz
abcxyz\=no_start_optimize
** Failers
abc
abc\=no_start_optimize
abcxypqr
abcxypqr\=no_start_optimize
/xyz/auto_callout,no_start_optimize
abcxyz
** Failers
abc
abcxypqr
/(*NO_START_OPT)xyz/auto_callout
abcxyz
@ -4439,20 +4442,14 @@
/(abc|def|xyz)/I
terhjk;abcdaadsfe
the quick xyz brown fox
terhjk;abcdaadsfe\=no_start_optimize
the quick xyz brown fox\=no_start_optimize
** Failers
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd\=no_start_optimize
/(abc|def|xyz)/I
/(abc|def|xyz)/I,no_start_optimize
terhjk;abcdaadsfe
the quick xyz brown fox
terhjk;abcdaadsfe\=no_start_optimize
the quick xyz brown fox\=no_start_optimize
the quick xyz brown fox
** Failers
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd\=no_start_optimize
/abcd*/aftertext
xxxxabcd\=ps

30
testdata/testoutput2 vendored

@ -8941,7 +8941,15 @@ Subject length lower bound = 1
+2 ^ ^ z
+3 ^ ^
0: xyz
abcxyz\=no_start_optimize
** Failers
No match
abc
No match
abcxypqr
No match
/xyz/auto_callout,no_start_optimize
abcxyz
--->abcxyz
+0 ^ x
+0 ^ x
@ -8952,10 +8960,20 @@ Subject length lower bound = 1
+3 ^ ^
0: xyz
** Failers
--->** Failers
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
No match
abc
No match
abc\=no_start_optimize
--->abc
+0 ^ x
+0 ^ x
@ -8963,8 +8981,6 @@ No match
+0 ^ x
No match
abcxypqr
No match
abcxypqr\=no_start_optimize
--->abcxypqr
+0 ^ x
+0 ^ x
@ -10182,9 +10198,11 @@ No match, mark = A
/(*COMMIT)ABC/
ABCDEFG
0: ABC
/(*COMMIT)ABC/no_start_optimize
** Failers
No match
DEFGABC\=no_start_optimize
DEFGABC
No match
/^(ab (c+(*THEN)cd) | xyz)/x

43
testdata/testoutput6 vendored

@ -6882,7 +6882,15 @@ No match
+2 ^ ^ z
+3 ^ ^
0: xyz
abcxyz\=no_start_optimize
** Failers
No match
abc
No match
abcxypqr
No match
/xyz/auto_callout,no_start_optimize
abcxyz
--->abcxyz
+0 ^ x
+0 ^ x
@ -6893,10 +6901,20 @@ No match
+3 ^ ^
0: xyz
** Failers
--->** Failers
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
+0 ^ x
No match
abc
No match
abc\=no_start_optimize
--->abc
+0 ^ x
+0 ^ x
@ -6904,8 +6922,6 @@ No match
+0 ^ x
No match
abcxypqr
No match
abcxypqr\=no_start_optimize
--->abcxypqr
+0 ^ x
+0 ^ x
@ -7091,36 +7107,25 @@ Subject length lower bound = 3
terhjk;abcdaadsfe
0: abc
the quick xyz brown fox
0: xyz
terhjk;abcdaadsfe\=no_start_optimize
0: abc
the quick xyz brown fox\=no_start_optimize
0: xyz
** Failers
No match
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
No match
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd\=no_start_optimize
No match
/(abc|def|xyz)/I
/(abc|def|xyz)/I,no_start_optimize
Capturing subpattern count = 1
Options: no_start_optimize
Starting code units: a d x
Subject length lower bound = 3
terhjk;abcdaadsfe
0: abc
the quick xyz brown fox
0: xyz
terhjk;abcdaadsfe\=no_start_optimize
0: abc
the quick xyz brown fox\=no_start_optimize
the quick xyz brown fox
0: xyz
** Failers
No match
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd
No match
thejk;adlfj aenjl;fda asdfasd ehj;kjxyasiupd\=no_start_optimize
No match
/abcd*/aftertext
xxxxabcd\=ps