Convert pcre2grep to use new pcre2_compile() options, thereby fixing two minor (?) bugs.
Philip.Hazel 2017-06-17 11:32:06 +00:00
parent 69eab9cfe7
commit 76a57bd839
9 changed files with 232 additions and 184 deletions
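
To see why the old prefix-and-suffix approach produced the two bugs, the removed tables can be replayed outside pcre2grep. The following is a minimal sketch, not code from the commit: the wrap_pattern() helper is hypothetical, the flag values (1, 2, and 4 for -w, -x, and -F) are taken from the deleted comment, and the two tables are copied from the removed lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Flag bits as described in the deleted comment: -w sets 1, -x sets 2,
   -F sets 4. The numeric values are an assumption drawn from that comment. */
enum { PO_WORD_MATCH = 1, PO_LINE_MATCH = 2, PO_FIXED_STRINGS = 4 };

/* The prefix and suffix tables that this commit removes, indexed by flags. */
static const char *prefix[] = {
  "", "\\b", "^(?:", "^(?:", "\\Q", "\\b\\Q", "^(?:\\Q", "^(?:\\Q" };
static const char *suffix[] = {
  "", "\\b", ")$", ")$", "\\E", "\\E\\b", "\\E)$", "\\E)$" };

/* Hypothetical helper: build the wrapped pattern text in the same way as the
   old sprintf() call did. Returns the snprintf() result, that is, the length
   the full string needs (excluding the terminating zero). */
static int wrap_pattern(char *out, size_t outsize, const char *pattern,
  int popts)
{
assert(out != NULL && pattern != NULL);
return snprintf(out, outsize, "%s%s%s", prefix[popts & 7], pattern,
  suffix[popts & 7]);
}
```

Calling wrap_pattern() with "cat|dog" and PO_WORD_MATCH yields \bcat|dog\b, in which each \b applies to only one branch. Calling it with "\E and (regex)" and PO_FIXED_STRINGS yields \Q\E and (regex)\E, in which the embedded \E ends the literal quoting early. Passing PCRE2_EXTRA_MATCH_WORD or PCRE2_LITERAL to the compiler, as this commit does, avoids building such strings at all.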


@@ -192,6 +192,12 @@ pattern lines.
 42. Implement PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD for the benefit
 of pcre2grep.
+43. Re-implement pcre2grep's -F, -w, and -x options using PCRE2_LITERAL,
+PCRE2_EXTRA_MATCH_WORD, and PCRE2_EXTRA_MATCH_LINE. This fixes two bugs:
+(a) The -F option did not work for fixed strings containing \E.
+(b) The -w option did not work for patterns with multiple branches.
 Version 10.23 14-February-2017
 ------------------------------


@@ -602,6 +602,19 @@ echo "---------------------------- Test 120 ------------------------------" >>te
 (cd $srcdir; $valgrind $vjs $pcre2grep -HO '$0:$2$1$3' '(\w+) binary (\w+)(\.)?' ./testdata/grepinput) >>testtrygrep
 echo "RC=$?" >>testtrygrep
+echo "---------------------------- Test 121 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $vjs $pcre2grep -F '\E and (regex)' testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+echo "---------------------------- Test 122 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $vjs $pcre2grep -w 'cat|dog' testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+echo "---------------------------- Test 122 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $vjs $pcre2grep -w 'dog|cat' testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
 # Now compare the results.
 $cf $srcdir/testdata/grepoutput testtrygrep


@@ -740,20 +740,21 @@ the patterns are the ones that are found.
 </P>
 <P>
 <b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
-Force the patterns to match only whole words. This is equivalent to having \b
-at the start and end of the pattern. This option applies only to the patterns
-that are matched against the contents of files; it does not apply to patterns
-specified by any of the <b>--include</b> or <b>--exclude</b> options.
+Force the patterns only to match "words". That is, there must be a word
+boundary at the start and end of each matched string. This is equivalent to
+having "\b(?:" at the start of each pattern, and ")\b" at the end. This
+option applies only to the patterns that are matched against the contents of
+files; it does not apply to patterns specified by any of the <b>--include</b> or
+<b>--exclude</b> options.
 </P>
 <P>
 <b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
-Force the patterns to be anchored (each must start matching at the beginning of
-a line) and in addition, require them to match entire lines. In multiline mode
-the match may be more than one line. This is equivalent to having \A and \Z
-characters at the start and end of each alternative top-level branch in every
-pattern. This option applies only to the patterns that are matched against the
-contents of files; it does not apply to patterns specified by any of the
-<b>--include</b> or <b>--exclude</b> options.
+Force the patterns to start matching only at the beginnings of lines, and in
+addition, require them to match entire lines. In multiline mode the match may
+be more than one line. This is equivalent to having "^(?:" at the start of each
+pattern and ")$" at the end. This option applies only to the patterns that are
+matched against the contents of files; it does not apply to patterns specified
+by any of the <b>--include</b> or <b>--exclude</b> options.
 </P>
 <br><a name="SEC6" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
 <P>
@@ -936,7 +937,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC15" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 26 May 2017
+Last updated: 17 June 2017
 <br>
 Copyright &copy; 1997-2017 University of Cambridge.
 <br>


@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "26 May 2017" "PCRE2 10.30"
+.TH PCRE2GREP 1 "17 June 2017" "PCRE2 10.30"
 .SH NAME
 pcre2grep - a grep with Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -639,19 +639,20 @@ Invert the sense of the match, so that lines which do \fInot\fP match any of
 the patterns are the ones that are found.
 .TP
 \fB-w\fP, \fB--word-regex\fP, \fB--word-regexp\fP
-Force the patterns to match only whole words. This is equivalent to having \eb
-at the start and end of the pattern. This option applies only to the patterns
-that are matched against the contents of files; it does not apply to patterns
-specified by any of the \fB--include\fP or \fB--exclude\fP options.
+Force the patterns only to match "words". That is, there must be a word
+boundary at the start and end of each matched string. This is equivalent to
+having "\eb(?:" at the start of each pattern, and ")\eb" at the end. This
+option applies only to the patterns that are matched against the contents of
+files; it does not apply to patterns specified by any of the \fB--include\fP or
+\fB--exclude\fP options.
 .TP
 \fB-x\fP, \fB--line-regex\fP, \fB--line-regexp\fP
-Force the patterns to be anchored (each must start matching at the beginning of
-a line) and in addition, require them to match entire lines. In multiline mode
-the match may be more than one line. This is equivalent to having \eA and \eZ
-characters at the start and end of each alternative top-level branch in every
-pattern. This option applies only to the patterns that are matched against the
-contents of files; it does not apply to patterns specified by any of the
-\fB--include\fP or \fB--exclude\fP options.
+Force the patterns to start matching only at the beginnings of lines, and in
+addition, require them to match entire lines. In multiline mode the match may
+be more than one line. This is equivalent to having "^(?:" at the start of each
+pattern and ")$" at the end. This option applies only to the patterns that are
+matched against the contents of files; it does not apply to patterns specified
+by any of the \fB--include\fP or \fB--exclude\fP options.
 .
 .
 .SH "ENVIRONMENT VARIABLES"
@@ -850,6 +851,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 26 May 2017
+Last updated: 17 June 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi


@@ -718,29 +718,30 @@ OPTIONS
 match any of the patterns are the ones that are found.
 -w, --word-regex, --word-regexp
-Force the patterns to match only whole words. This is equiva-
-lent to having \b at the start and end of the pattern. This
-option applies only to the patterns that are matched against
-the contents of files; it does not apply to patterns speci-
-fied by any of the --include or --exclude options.
+Force the patterns only to match "words". That is, there must
+be a word boundary at the start and end of each matched
+string. This is equivalent to having "\b(?:" at the start of
+each pattern, and ")\b" at the end. This option applies only
+to the patterns that are matched against the contents of
+files; it does not apply to patterns specified by any of the
+--include or --exclude options.
 -x, --line-regex, --line-regexp
-Force the patterns to be anchored (each must start matching
-at the beginning of a line) and in addition, require them to
-match entire lines. In multiline mode the match may be more
-than one line. This is equivalent to having \A and \Z charac-
-ters at the start and end of each alternative top-level
-branch in every pattern. This option applies only to the pat-
-terns that are matched against the contents of files; it does
-not apply to patterns specified by any of the --include or
---exclude options.
+Force the patterns to start matching only at the beginnings
+of lines, and in addition, require them to match entire
+lines. In multiline mode the match may be more than one line.
+This is equivalent to having "^(?:" at the start of each pat-
+tern and ")$" at the end. This option applies only to the
+patterns that are matched against the contents of files; it
+does not apply to patterns specified by any of the --include
+or --exclude options.
 ENVIRONMENT VARIABLES
 The environment variables LC_ALL and LC_CTYPE are examined, in that
 order, for a locale. The first one that is set is used. This can be
 overridden by the --locale option. If no locale is set, the PCRE2
 library's default (usually the "C" locale) is used.
@@ -748,99 +749,99 @@ NEWLINES
 The -N (--newline) option allows pcre2grep to scan files with different
 newline conventions from the default. Any parts of the input files that
 are written to the standard output are copied identically, with what-
 ever newline sequences they have in the input. However, the setting of
 this option does not affect the interpretation of files specified by
 the -f, --exclude-from, or --include-from options, which are assumed to
 use the operating system's standard newline sequence, nor does it
 affect the way in which pcre2grep writes informational messages to the
 standard error and output streams. For these it uses the string "\n" to
 indicate newlines, relying on the C I/O library to convert this to an
 appropriate sequence.
 OPTIONS COMPATIBILITY
 Many of the short and long forms of pcre2grep's options are the same as
 in the GNU grep program. Any long option of the form --xxx-regexp (GNU
 terminology) is also available as --xxx-regex (PCRE2 terminology). How-
 ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
 --include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
 line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
 are specific to pcre2grep, as is the use of the --only-matching option
 with a capturing parentheses number.
 Although most of the common options work the same way, a few are dif-
 ferent in pcre2grep. For example, the --include option's argument is a
 glob for GNU grep, but a regular expression for pcre2grep. If both the
 -c and -l options are given, GNU grep lists only file names, without
 counts, but pcre2grep gives the counts as well.
 OPTIONS WITH DATA
 There are four different ways in which an option with data can be spec-
 ified. If a short form option is used, the data may follow immedi-
 ately, or (with one exception) in the next command line item. For exam-
 ple:
 -f/some/file
 -f /some/file
 The exception is the -o option, which may appear with or without data.
 Because of this, if data is present, it must follow immediately in the
 same item, for example -o3.
 If a long form option is used, the data may appear in the same command
 line item, separated by an equals character, or (with two exceptions)
 it may appear in the next command line item. For example:
 --file=/some/file
 --file /some/file
 Note, however, that if you want to supply a file name beginning with ~
 as data in a shell command, and have the shell expand ~ to a home
 directory, you must separate the file name from the option, because the
 shell does not treat ~ specially unless it is at the start of an item.
 The exceptions to the above are the --colour (or --color) and --only-
 matching options, for which the data is optional. If one of these
 options does have data, it must be given in the first form, using an
 equals character. Otherwise pcre2grep will assume that it has no data.
 USING PCRE2'S CALLOUT FACILITY
 pcre2grep has, by default, support for calling external programs or
 scripts or echoing specific strings during matching by making use of
 PCRE2's callout facility. However, this support can be disabled when
 pcre2grep is built. You can find out whether your binary has support
 for callouts by running it with the --help option. If the support is
 not enabled, all callouts in patterns are ignored by pcre2grep.
 A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
 ment is either a number or a quoted string (see the pcre2callout docu-
 mentation for details). Numbered callouts are ignored by pcre2grep;
 only callouts with string arguments are useful.
 Calling external programs or scripts
 If the callout string does not start with a pipe (vertical bar) charac-
 ter, it is parsed into a list of substrings separated by pipe charac-
 ters. The first substring must be an executable name, with the follow-
 ing substrings specifying arguments:
 executable_name|arg1|arg2|...
 Any substring (including the executable name) may contain escape
 sequences started by a dollar character: $<digits> or ${<digits>} is
 replaced by the captured substring of the given decimal number, which
 must be greater than zero. If the number is greater than the number of
 capturing substrings, or if the capture is unset, the replacement is
 empty.
 Any other character is substituted by itself. In particular, $$ is
 replaced by a single dollar and $| is replaced by a pipe character.
 Here is an example:
 echo -e "abcde\n12345" | pcre2grep \
@@ -856,49 +857,49 @@ USING PCRE2'S CALLOUT FACILITY
 The parameters for the execv() system call that is used to run the pro-
 gram or script are zero-terminated strings. This means that binary zero
 characters in the callout argument will cause premature termination of
 their substrings, and therefore should not be present. Any syntax
 errors in the string (for example, a dollar not followed by another
 character) cause the callout to be ignored. If running the program
 fails for any reason (including the non-existence of the executable), a
 local matching failure occurs and the matcher backtracks in the normal
 way.
 Echoing a specific string
 If the callout string starts with a pipe (vertical bar) character, the
 rest of the string is written to the output, having been passed through
 the same escape processing as text from the --output option. This pro-
 vides a simple echoing facility that avoids calling an external program
 or script. No terminator is added to the string, so if you want a new-
 line, you must include it explicitly. Matching continues normally
 after the string is output. If you want to see only the callout output
 but not any output from an actual match, you should end the relevant
 pattern with (*FAIL).
 MATCHING ERRORS
 It is possible to supply a regular expression that takes a very long
 time to fail to match certain lines. Such patterns normally involve
 nested indefinite repeats, for example: (a+)*\d when matched against a
 line of a's with no final digit. The PCRE2 matching function has a
 resource limit that causes it to abort in these circumstances. If this
 happens, pcre2grep outputs an error message and the line that caused
 the problem to the standard error stream. If there are more than 20
 such errors, pcre2grep gives up.
 The --match-limit option of pcre2grep can be used to set the overall
 resource limit. There are also other limits that affect the amount of
 memory used during matching; see the discussion of --heap-limit and
 --depth-limit above.
 DIAGNOSTICS
 Exit status is 0 if any matches were found, 1 if no matches were found,
 and 2 for syntax errors, overlong lines, non-existent or inaccessible
 files (even if matches were found in other files) or too many matching
 errors. Using the -s option to suppress error messages about inaccessi-
 ble files does not affect the return code.
@@ -917,5 +918,5 @@ AUTHOR
 REVISION
-Last updated: 26 May 2017
+Last updated: 17 June 2017
 Copyright (c) 1997-2017 University of Cambridge.


@@ -103,7 +103,8 @@ typedef int BOOL;
 #define MAXPATLEN 8192
 #endif
-#define PATBUFSIZE (MAXPATLEN + 10) /* Allows for prefix+suffix */
+#define FNBUFSIZ 1024
+#define ERRBUFSIZ 256
 /* Values for the "filenames" variable, which specifies options for file name
 output. The order is important; it is assumed that a file name is wanted for
@@ -211,7 +212,7 @@ static BOOL use_jit = FALSE;
 static const uint8_t *character_tables = NULL;
 static uint32_t pcre2_options = 0;
-static uint32_t process_options = 0;
+static uint32_t extra_options = 0;
 static PCRE2_SIZE heap_limit = PCRE2_UNSET;
 static uint32_t match_limit = 0;
 static uint32_t depth_limit = 0;
@@ -441,19 +442,6 @@ of PCRE2_NEWLINE_xx in pcre2.h. */
 static const char *newlines[] = {
 "DEFAULT", "CR", "LF", "CRLF", "ANY", "ANYCRLF", "NUL" };
-/* Tables for prefixing and suffixing patterns, according to the -w, -x, and -F
-options. These set the 1, 2, and 4 bits in process_options, respectively. Note
-that the combination of -w and -x has the same effect as -x on its own, so we
-can treat them as the same. Note that the MAXPATLEN macro assumes the longest
-prefix+suffix is 10 characters; if anything longer is added, it must be
-adjusted. */
-static const char *prefix[] = {
-"", "\\b", "^(?:", "^(?:", "\\Q", "\\b\\Q", "^(?:\\Q", "^(?:\\Q" };
-static const char *suffix[] = {
-"", "\\b", ")$", ")$", "\\E", "\\E\\b", "\\E)$", "\\E)$" };
 /* UTF-8 tables - used only when the newline setting is "any". */
 const int utf8_table3[] = { 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};
@@ -2339,7 +2327,7 @@ file. However, when the newline convention is binary zero, we can't do this. */
 if (binary_files != BIN_TEXT)
 {
 if (endlinetype != PCRE2_NEWLINE_NUL)
 binary = memchr(main_buffer, 0, (bufflength > 1024)? 1024 : bufflength)
 != NULL;
 if (binary && binary_files == BIN_NOMATCH) return 1;
 }
@@ -3224,7 +3212,7 @@ switch(letter)
 case N_NOJIT: use_jit = FALSE; break;
 case 'a': binary_files = BIN_TEXT; break;
 case 'c': count_only = TRUE; break;
-case 'F': process_options |= PO_FIXED_STRINGS; break;
+case 'F': options |= PCRE2_LITERAL; break;
 case 'H': filenames = FN_FORCE; break;
 case 'I': binary_files = BIN_NOMATCH; break;
 case 'h': filenames = FN_NONE; break;
@@ -3245,8 +3233,8 @@ switch(letter)
 case 't': show_total_count = TRUE; break;
 case 'u': options |= PCRE2_UTF; utf = TRUE; break;
 case 'v': invert = TRUE; break;
-case 'w': process_options |= PO_WORD_MATCH; break;
-case 'x': process_options |= PO_LINE_MATCH; break;
+case 'w': extra_options |= PCRE2_EXTRA_MATCH_WORD; break;
+case 'x': extra_options |= PCRE2_EXTRA_MATCH_LINE; break;
 case 'V':
 {
@@ -3309,7 +3297,6 @@ pattern chain.
 Arguments:
 p points to the pattern block
 options the PCRE options
-popts the processing options
 fromfile TRUE if the pattern was read from a file
 fromtext file name or identifying text (e.g. "include")
 count 0 if this is the only command line pattern, or
@@ -3320,18 +3307,20 @@ Returns: TRUE on success, FALSE after an error
 */
 static BOOL
-compile_pattern(patstr *p, int options, int popts, int fromfile,
-const char *fromtext, int count)
+compile_pattern(patstr *p, int options, int fromfile, const char *fromtext,
+int count)
 {
-unsigned char buffer[PATBUFSIZE];
-PCRE2_SIZE erroffset;
-char *ps = p->string;
-unsigned int patlen = strlen(ps);
+char *ps;
 int errcode;
+PCRE2_SIZE patlen, erroffset;
+PCRE2_UCHAR errmessbuffer[ERRBUFSIZ];
 if (p->compiled != NULL) return TRUE;
-if ((popts & PO_FIXED_STRINGS) != 0)
+ps = p->string;
+patlen = strlen(ps);
+if ((options & PCRE2_LITERAL) != 0)
 {
 int ellength;
 char *eop = ps + patlen;
@@ -3344,8 +3333,7 @@ if ((popts & PO_FIXED_STRINGS) != 0)
 }
 }
-sprintf((char *)buffer, "%s%.*s%s", prefix[popts], patlen, ps, suffix[popts]);
-p->compiled = pcre2_compile(buffer, PCRE2_ZERO_TERMINATED, options, &errcode,
+p->compiled = pcre2_compile((PCRE2_SPTR)ps, patlen, options, &errcode,
 &erroffset, compile_context);
 /* Handle successful compile. Try JIT-compiling if supported and enabled. We
@@ -3362,23 +3350,22 @@ if (p->compiled != NULL)
 /* Handle compile errors */
-erroffset -= (int)strlen(prefix[popts]);
 if (erroffset > patlen) erroffset = patlen;
-pcre2_get_error_message(errcode, buffer, PATBUFSIZE);
+pcre2_get_error_message(errcode, errmessbuffer, sizeof(errmessbuffer));
 if (fromfile)
 {
 fprintf(stderr, "pcre2grep: Error in regex in line %d of %s "
-"at offset %d: %s\n", count, fromtext, (int)erroffset, buffer);
+"at offset %d: %s\n", count, fromtext, (int)erroffset, errmessbuffer);
 }
 else
 {
 if (count == 0)
 fprintf(stderr, "pcre2grep: Error in %s regex at offset %d: %s\n",
-fromtext, (int)erroffset, buffer);
+fromtext, (int)erroffset, errmessbuffer);
 else
 fprintf(stderr, "pcre2grep: Error in %s %s regex at offset %d: %s\n",
-ordin(count), fromtext, (int)erroffset, buffer);
+ordin(count), fromtext, (int)erroffset, errmessbuffer);
 }
 return FALSE;
@@ -3396,18 +3383,17 @@ Arguments:
 name the name of the file; "-" is stdin
 patptr pointer to the pattern chain anchor
 patlastptr pointer to the last pattern pointer
-popts the process options to pass to pattern_compile()
 Returns: TRUE if all went well
 */
 static BOOL
-read_pattern_file(char *name, patstr **patptr, patstr **patlastptr, int popts)
+read_pattern_file(char *name, patstr **patptr, patstr **patlastptr)
 {
 int linenumber = 0;
 FILE *f;
 const char *filename;
-char buffer[PATBUFSIZE];
+char buffer[MAXPATLEN+20];
 if (strcmp(name, "-") == 0)
 {
@@ -3425,7 +3411,7 @@ else
 filename = name;
 }
-while (fgets(buffer, PATBUFSIZE, f) != NULL)
+while (fgets(buffer, sizeof(buffer), f) != NULL)
 {
 char *s = buffer + (int)strlen(buffer);
 while (s > buffer && isspace((unsigned char)(s[-1]))) s--;
@@ -3453,7 +3439,7 @@ while (fgets(buffer, PATBUFSIZE, f) != NULL)
 for(;;)
 {
-if (!compile_pattern(*patlastptr, pcre2_options, popts, TRUE, filename,
+if (!compile_pattern(*patlastptr, pcre2_options, TRUE, filename,
 linenumber))
 {
 if (f != stdin) fclose(f);
@@ -3823,7 +3809,7 @@ for (i = 1; i < argc; i++)
 {
 unsigned long int n = decode_number(option_data, op, longop);
 if (op->type == OP_U32NUMBER) *((uint32_t *)op->dataptr) = n;
 else if (op->type == OP_SIZE) *((PCRE2_SIZE *)op->dataptr) = n;
 else *((int *)op->dataptr) = n;
 }
 }
@@ -3978,6 +3964,10 @@ if (DEE_option != NULL)
 }
 }
+/* Set the extra options */
+(void)pcre2_set_compile_extra_options(compile_context, extra_options);
 /* Check the values for Jeffrey Friedl's debugging options. */
 #ifdef JFRIEDL_DEBUG
@@ -4038,7 +4028,7 @@ chain, so we must not access the next pointer till after the compile. */
 for (j = 1, cp = patterns; cp != NULL; j++, cp = cp->next)
 {
-if (!compile_pattern(cp, pcre2_options, process_options, FALSE, "command-line",
+if (!compile_pattern(cp, pcre2_options, FALSE, "command-line",
 (j == 1 && patterns->next == NULL)? 0 : j))
 goto EXIT2;
 }
@@ -4047,48 +4037,35 @@ for (j = 1, cp = patterns; cp != NULL; j++, cp = cp->next)
 for (fn = pattern_files; fn != NULL; fn = fn->next)
 {
-if (!read_pattern_file(fn->name, &patterns, &patterns_last, process_options))
-goto EXIT2;
+if (!read_pattern_file(fn->name, &patterns, &patterns_last)) goto EXIT2;
 }
 /* Unless JIT has been explicitly disabled, arrange a stack for it to use. */
-#ifdef NEVER
-#ifdef SUPPORT_PCRE2GREP_JIT
-if (use_jit)
-jit_stack = pcre2_jit_stack_create(32*1024, 1024*1024, NULL);
-#endif
-for (j = 1, cp = patterns; cp != NULL; j++, cp = cp->next)
-{
-#ifdef SUPPORT_PCRE2GREP_JIT
-if (jit_stack != NULL && cp->compiled != NULL)
-pcre2_jit_stack_assign(match_context, NULL, jit_stack);
-#endif
-}
-#endif
 #ifdef SUPPORT_PCRE2GREP_JIT
 if (use_jit)
 {
 jit_stack = pcre2_jit_stack_create(32*1024, 1024*1024, NULL);
 if (jit_stack != NULL )
 pcre2_jit_stack_assign(match_context, NULL, jit_stack);
 }
 #endif
+/* -F, -w, and -x do not apply to include or exclude patterns, so we must
+adjust the options. */
+pcre2_options &= ~PCRE2_LITERAL;
+(void)pcre2_set_compile_extra_options(compile_context, 0);
 /* If there are include or exclude patterns read from the command line, compile
-them. -F, -w, and -x do not apply, so the third argument of compile_pattern is
-0. */
+them. */
 for (j = 0; j < 4; j++)
 {
 int k;
 for (k = 1, cp = *(incexlist[j]); cp != NULL; k++, cp = cp->next)
 {
-if (!compile_pattern(cp, pcre2_options, 0, FALSE, incexname[j],
+if (!compile_pattern(cp, pcre2_options, FALSE, incexname[j],
 (k == 1 && cp->next == NULL)? 0 : k))
 goto EXIT2;
 }
@@ -4098,13 +4075,13 @@ for (j = 0; j < 4; j++)
 for (fn = include_from; fn != NULL; fn = fn->next)
 {
-if (!read_pattern_file(fn->name, &include_patterns, &include_patterns_last, 0))
+if (!read_pattern_file(fn->name, &include_patterns, &include_patterns_last))
 goto EXIT2;
 }
 for (fn = exclude_from; fn != NULL; fn = fn->next)
 {
-if (!read_pattern_file(fn->name, &exclude_patterns, &exclude_patterns_last, 0))
+if (!read_pattern_file(fn->name, &exclude_patterns, &exclude_patterns_last))
 goto EXIT2;
 }
@@ -4123,7 +4100,7 @@ read them line by line and search the given files. */
 for (fn = file_lists; fn != NULL; fn = fn->next)
 {
-char buffer[PATBUFSIZE];
+char buffer[FNBUFSIZ];
 FILE *fl;
 if (strcmp(fn->name, "-") == 0) fl = stdin; else
 {
@@ -4135,7 +4112,7 @@ for (fn = file_lists; fn != NULL; fn = fn->next)
 goto EXIT2;
 }
 }
-while (fgets(buffer, PATBUFSIZE, fl) != NULL)
+while (fgets(buffer, sizeof(buffer), fl) != NULL)
 {
 int frc;
 char *end = buffer + (int)strlen(buffer);
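The pattern-file loop changed above reads each line with fgets(), sized by sizeof(buffer), and then steps back over trailing white space before using the line. A minimal, self-contained sketch of that trimming step follows. The helper name trim_trailing_space() is hypothetical and does not exist in pcre2grep, and writing the terminating zero inside the helper is an assumption of this sketch rather than something shown in the diff.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of pcre2grep): remove trailing white space,
   including the newline that fgets() leaves in the buffer, by overwriting the
   first trailing white space character with a terminating zero. Returns the
   length of the trimmed string. */
static size_t
trim_trailing_space(char *buf)
{
char *s;
assert(buf != NULL);
s = buf + strlen(buf);
/* The cast to unsigned char keeps isspace() well defined for bytes >= 0x80. */
while (s > buf && isspace((unsigned char)(s[-1]))) s--;
*s = 0;
return (size_t)(s - buf);
}
```

A caller with a local array would typically read with fgets(buffer, sizeof(buffer), f) and then call trim_trailing_space(buffer). Passing sizeof(buffer) rather than a separate constant keeps the read limit tied to the array's declared size, which is what lets the declarations in this diff change to MAXPATLEN+20 and FNBUFSIZ without touching the fgets() calls again.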

testdata/grepinputv vendored

@@ -2,3 +2,8 @@ The quick brown
 fox jumps
 over the lazy dog.
 This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate

testdata/grepoutput vendored

@@ -454,6 +454,11 @@ RC=1
 ---------------------------- Test 51 ------------------------------
 over the lazy dog.
 This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
 RC=0
 ---------------------------- Test 52 ------------------------------
 fox jumps
@@ -788,32 +793,32 @@ RC=0
 37216,12
 RC=0
 ---------------------------- Test 113 -----------------------------
-476
+478
 RC=0
 ---------------------------- Test 114 -----------------------------
 testdata/grepinput:469
 testdata/grepinput3:0
 testdata/grepinput8:0
-testdata/grepinputv:1
+testdata/grepinputv:3
 testdata/grepinputx:6
-TOTAL:476
+TOTAL:478
 RC=0
 ---------------------------- Test 115 -----------------------------
 testdata/grepinput:469
-testdata/grepinputv:1
+testdata/grepinputv:3
 testdata/grepinputx:6
-TOTAL:476
+TOTAL:478
 RC=0
 ---------------------------- Test 116 -----------------------------
-476
+478
 RC=0
 ---------------------------- Test 117 -----------------------------
 469
 0
 0
-1
+3
 6
-476
+478
 RC=0
 ---------------------------- Test 118 -----------------------------
 testdata/grepinput3
@@ -834,3 +839,14 @@ RC=0
 ./testdata/grepinput:a binary zero:zeroa
 ./testdata/grepinput:the binary zero.:zerothe.
 RC=0
+---------------------------- Test 121 -----------------------------
+This line contains \E and (regex) *meta* [characters].
+RC=0
+---------------------------- Test 122 -----------------------------
+over the lazy dog.
+The word is cat in this line
+RC=0
+---------------------------- Test 122 -----------------------------
+over the lazy dog.
+The word is cat in this line
+RC=0

testdata/grepoutputC vendored

@@ -1,14 +1,42 @@
 Arg1: [T] [he ] [ ] Arg2: |T| () () (0)
 Arg1: [T] [his] [s] Arg2: |T| () () (0)
+Arg1: [T] [his] [s] Arg2: |T| () () (0)
+Arg1: [T] [he ] [ ] Arg2: |T| () () (0)
+Arg1: [T] [he ] [ ] Arg2: |T| () () (0)
+Arg1: [T] [he ] [ ] Arg2: |T| () () (0)
 The quick brown
 This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
 Arg1: [qu] [qu]
 Arg1: [ t] [ t]
+Arg1: [ l] [ l]
+Arg1: [wo] [wo]
+Arg1: [ca] [ca]
+Arg1: [sn] [sn]
 The quick brown
 This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
 0:T
 The quick brown
 0:T
 This time it jumps and jumps and jumps.
+0:T
+This line contains \E and (regex) *meta* [characters].
+0:T
+The word is cat in this line
+0:T
+The caterpillar sat on the mat
+0:T
+The snowcat is not an animal
+T
+T
+T
+T
 T
 T