Add support to pcre2grep for binary zeros in -f files.

Philip.Hazel 2018-02-24 17:09:19 +00:00
parent c440473190
commit 04919e9d03
5 changed files with 92 additions and 60 deletions

@@ -25,6 +25,9 @@ multi-code-unit characters caused bad behaviour and possibly a crash. This
 issue was fixed for other kinds of repeat in release 10.20 by change 19, but
 repeating character classes were overlooked.
+5. pcre2grep now supports the inclusion of binary zeros in patterns that are
+read from files via the -f option.
 Version 10.31 12-February-2018
 ------------------------------

@@ -641,6 +641,12 @@ echo "RC=$?" >>testtrygrep
 $valgrind $vjs $pcre2grep --colour=always '(?=[ac]\K)' testNinputgrep >>testtrygrep
 echo "RC=$?" >>testtrygrep
+echo "---------------------------- Test 126 -----------------------------" >>testtrygrep
+printf "Next line pattern has binary zero\nABC\x00XYZ\n" >testtemp1grep
+printf "ABC\x00XYZ\nABCDEF\nDEFABC\n" >testtemp2grep
+$valgrind $vjs $pcre2grep -a -f testtemp1grep testtemp2grep >>testtrygrep
+echo "RC=$?" >>testtrygrep
 # Now compare the results.

@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "13 November 2017" "PCRE2 10.31"
+.TH PCRE2GREP 1 "24 February 2018" "PCRE2 10.32"
 .SH NAME
 pcre2grep - a grep with Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -121,6 +121,14 @@ a binary file is not applied. See the \fB--binary-files\fP option for a means
 of changing the way binary files are handled.
 .
 .
+.SH "BINARY ZEROS IN PATTERNS"
+.rs
+.sp
+Patterns passed from the command line are strings that are terminated by a
+binary zero, so cannot contain internal zeros. However, patterns that are read
+from a file via the \fB-f\fP option may contain binary zeros.
+.
+.
 .SH OPTIONS
 .rs
 .sp
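Note on the hunk above: the limitation described for command-line patterns follows from how C strings are measured. A minimal sketch, not taken from pcre2grep (the helper name `visible_length` is hypothetical), of the length a zero-terminated view of a buffer reports compared with its true length:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of pcre2grep. Given a buffer whose true
   length is known, return the length that a zero-terminated view of it
   (as strlen() would compute) would report: the offset of the first
   binary zero, or the full length if there is none. */
static size_t
visible_length(const char *data, size_t true_length)
{
const char *z;
assert(data != NULL);
z = memchr(data, 0, true_length);
return (z == NULL)? true_length : (size_t)(z - data);
}
```

For the seven-byte pattern used in Test 126 (ABC, a binary zero, XYZ), this returns 3, which is why a length has to be carried alongside the pointer for patterns read from files.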
@@ -304,12 +312,15 @@ files; it does not apply to patterns specified by any of the \fB--include\fP or
 .TP
 \fB-f\fP \fIfilename\fP, \fB--file=\fP\fIfilename\fP
 Read patterns from the file, one per line, and match them against each line of
-input. What constitutes a newline when reading the file is the operating
-system's default. The \fB--newline\fP option has no effect on this option.
-Trailing white space is removed from each line, and blank lines are ignored. An
-empty file contains no patterns and therefore matches nothing. See also the
-comments about multiple patterns versus a single pattern with alternatives in
-the description of \fB-e\fP above.
+input. As is the case with patterns on the command line, no delimiters should
+be used. What constitutes a newline when reading the file is the operating
+system's default interpretation of \en. The \fB--newline\fP option has no
+effect on this option. Trailing white space is removed from each line, and
+blank lines are ignored. An empty file contains no patterns and therefore
+matches nothing. Patterns read from a file in this way may contain binary
+zeros, which are treated as ordinary data characters. See also the comments
+about multiple patterns versus a single pattern with alternatives in the
+description of \fB-e\fP above.
 .sp
 If this option is given more than once, all the specified files are read. A
 data line is output if any of the patterns match it. A file name can be given
@@ -320,14 +331,15 @@ command line; all arguments are treated as the names of paths to be searched.
 .TP
 \fB--file-list\fP=\fIfilename\fP
 Read a list of files and/or directories that are to be scanned from the given
-file, one per line. Trailing white space is removed from each line, and blank
-lines are ignored. These paths are processed before any that are listed on the
-command line. The file name can be given as "-" to refer to the standard input.
-If \fB--file\fP and \fB--file-list\fP are both specified as "-", patterns are
-read first. This is useful only when the standard input is a terminal, from
-which further lines (the list of files) can be read after an end-of-file
-indication. If this option is given more than once, all the specified files are
-read.
+file, one per line. What constitutes a newline when reading the file is the
+operating system's default. Trailing white space is removed from each line, and
+blank lines are ignored. These paths are processed before any that are listed
+on the command line. The file name can be given as "-" to refer to the standard
+input. If \fB--file\fP and \fB--file-list\fP are both specified as "-",
+patterns are read first. This is useful only when the standard input is a
+terminal, from which further lines (the list of files) can be read after an
+end-of-file indication. If this option is given more than once, all the
+specified files are read.
 .TP
 \fB--file-offsets\fP
 Instead of showing lines or parts of lines that match, show each match as an
@@ -679,12 +691,13 @@ The \fB-N\fP (\fB--newline\fP) option allows \fBpcre2grep\fP to scan files with
 different newline conventions from the default. Any parts of the input files
 that are written to the standard output are copied identically, with whatever
 newline sequences they have in the input. However, the setting of this option
-does not affect the interpretation of files specified by the \fB-f\fP,
-\fB--exclude-from\fP, or \fB--include-from\fP options, which are assumed to use
-the operating system's standard newline sequence, nor does it affect the way in
-which \fBpcre2grep\fP writes informational messages to the standard error and
-output streams. For these it uses the string "\en" to indicate newlines,
-relying on the C I/O library to convert this to an appropriate sequence.
+affects only the way scanned files are processed. It does not affect the
+interpretation of files specified by the \fB-f\fP, \fB--file-list\fP,
+\fB--exclude-from\fP, or \fB--include-from\fP options, nor does it affect the
+way in which \fBpcre2grep\fP writes informational messages to the standard
+error and output streams. For these it uses the string "\en" to indicate
+newlines, relying on the C I/O library to convert this to an appropriate
+sequence.
 .
 .
 .SH "OPTIONS COMPATIBILITY"
@@ -862,6 +875,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 13 November 2017
-Copyright (c) 1997-2017 University of Cambridge.
+Last updated: 24 February 2018
+Copyright (c) 1997-2018 University of Cambridge.
 .fi

@@ -13,7 +13,7 @@ distribution because other apparatus is needed to compile pcre2grep for z/OS.
 The header can be found in the special z/OS distribution, which is available
 from www.zaconsultants.net or from www.cbttape.org.
-Copyright (c) 1997-2017 University of Cambridge
+Copyright (c) 1997-2018 University of Cambridge
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -303,6 +303,7 @@ also for include/exclude patterns. */
 typedef struct patstr {
 struct patstr *next;
 char *string;
+PCRE2_SIZE length;
 pcre2_code *compiled;
 } patstr;
@@ -557,13 +558,14 @@ exit(rc);
 Arguments:
 s pattern string to add
+patlen length of pattern
 after if not NULL points to item to insert after
 Returns: new pattern block or NULL on error
 */
 static patstr *
-add_pattern(char *s, patstr *after)
+add_pattern(char *s, PCRE2_SIZE patlen, patstr *after)
 {
 patstr *p = (patstr *)malloc(sizeof(patstr));
 if (p == NULL)
@@ -571,7 +573,7 @@ if (p == NULL)
 fprintf(stderr, "pcre2grep: malloc failed\n");
 pcre2grep_exit(2);
 }
-if (strlen(s) > MAXPATLEN)
+if (patlen > MAXPATLEN)
 {
 fprintf(stderr, "pcre2grep: pattern is too long (limit is %d bytes)\n",
 MAXPATLEN);
@@ -580,6 +582,7 @@ if (strlen(s) > MAXPATLEN)
 }
 p->next = NULL;
 p->string = s;
+p->length = patlen;
 p->compiled = NULL;
 if (after != NULL)
@@ -1276,12 +1279,14 @@ return om;
 * Read one line of input *
 *************************************************/
-/* Normally, input is read using fread() (or gzread, or BZ2_read) into a large
-buffer, so many lines may be read at once. However, doing this for tty input
-means that no output appears until a lot of input has been typed. Instead, tty
-input is handled line by line. We cannot use fgets() for this, because it does
-not stop at a binary zero, and therefore there is no way of telling how many
-characters it has read, because there may be binary zeros embedded in the data.
+/* Normally, input that is to be scanned is read using fread() (or gzread, or
+BZ2_read) into a large buffer, so many lines may be read at once. However,
+doing this for tty input means that no output appears until a lot of input has
+been typed. Instead, tty input is handled line by line. We cannot use fgets()
+for this, because it does not stop at a binary zero, and therefore there is no
+way of telling how many characters it has read, because there may be binary
+zeros embedded in the data. This function is also used for reading patterns
+from files (the -f option).
 Arguments:
 buffer the buffer to read into
@@ -1291,7 +1296,7 @@ Arguments:
 Returns: the number of characters read, zero at end of file
 */
-static unsigned int
+static PCRE2_SIZE
 read_one_line(char *buffer, int length, FILE *f)
 {
 int c;
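Note on the hunk above: the body of read_one_line() is not shown in this diff. The following is only a rough sketch of a reader with the same contract as its header comment (return the number of characters read, zero at end of file), written to show why a counted read copes with binary zeros where fgets() followed by strlen() would not. The name `read_line_counted` and its size_t signature are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch, not the pcre2grep source. Reads one line into buffer,
   storing every byte (binary zeros included) and returning the number of
   bytes stored, including the newline if one was read. Returns zero at end
   of file. A line longer than size bytes is returned in pieces. */
static size_t
read_line_counted(char *buffer, size_t size, FILE *f)
{
size_t n = 0;
int c;
assert(buffer != NULL && f != NULL);
while (n < size && (c = fgetc(f)) != EOF)
  {
  buffer[n++] = (char)c;
  if (c == '\n') break;
  }
return n;
}
```

Because the count comes from the loop rather than from a terminator, a line such as ABC, binary zero, XYZ, newline is reported as eight bytes.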
@@ -1651,11 +1656,11 @@ Returns: TRUE if there was a match
 */
 static BOOL
-match_patterns(char *matchptr, size_t length, unsigned int options,
-size_t startoffset, int *mrc)
+match_patterns(char *matchptr, PCRE2_SIZE length, unsigned int options,
+PCRE2_SIZE startoffset, int *mrc)
 {
 int i;
-size_t slen = length;
+PCRE2_SIZE slen = length;
 patstr *p = patterns;
 const char *msg = "this text:\n\n";
@@ -2317,7 +2322,7 @@ unsigned long int count = 0;
 char *lastmatchrestart = NULL;
 char *ptr = main_buffer;
 char *endptr;
-size_t bufflength;
+PCRE2_SIZE bufflength;
 BOOL binary = FALSE;
 BOOL endhyphenpending = FALSE;
 BOOL input_line_buffered = line_buffered;
@@ -2339,7 +2344,7 @@ bufflength = fill_buffer(handle, frtype, main_buffer, bufsize,
 input_line_buffered);
 #ifdef SUPPORT_LIBBZ2
-if (frtype == FR_LIBBZ2 && (int)bufflength < 0) return 2; /* Gotcha: bufflength is size_t; */
+if (frtype == FR_LIBBZ2 && (int)bufflength < 0) return 2; /* Gotcha: bufflength is PCRE2_SIZE; */
 #endif
 endptr = main_buffer + bufflength;
@@ -2368,8 +2373,8 @@ while (ptr < endptr)
 unsigned int options = 0;
 BOOL match;
 char *t = ptr;
-size_t length, linelength;
-size_t startoffset = 0;
+PCRE2_SIZE length, linelength;
+PCRE2_SIZE startoffset = 0;
 /* At this point, ptr is at the start of a line. We need to find the length
 of the subject string to pass to pcre2_match(). In multiline mode, it is the
@@ -2381,7 +2386,7 @@ while (ptr < endptr)
 t = end_of_line(t, endptr, &endlinelength);
 linelength = t - ptr - endlinelength;
-length = multiline? (size_t)(endptr - ptr) : linelength;
+length = multiline? (PCRE2_SIZE)(endptr - ptr) : linelength;
 /* Check to see if the line we are looking at extends right to the very end
 of the buffer without a line terminator. This means the line is too long to
@@ -2560,7 +2565,7 @@ while (ptr < endptr)
 {
 if (!invert)
 {
-size_t oldstartoffset;
+PCRE2_SIZE oldstartoffset;
 if (printname != NULL) fprintf(stdout, "%s:", printname);
 if (number) fprintf(stdout, "%lu:", linenumber);
@@ -2647,7 +2652,7 @@ while (ptr < endptr)
 startoffset -= (int)(linelength + endlinelength);
 t = end_of_line(ptr, endptr, &endlinelength);
 linelength = t - ptr - endlinelength;
-length = (size_t)(endptr - ptr);
+length = (PCRE2_SIZE)(endptr - ptr);
 }
 goto ONLY_MATCHING_RESTART;
@@ -2812,7 +2817,7 @@ while (ptr < endptr)
 endprevious -= (int)(linelength + endlinelength);
 t = end_of_line(ptr, endptr, &endlinelength);
 linelength = t - ptr - endlinelength;
-length = (size_t)(endptr - ptr);
+length = (PCRE2_SIZE)(endptr - ptr);
 }
 /* If startoffset is at the exact end of the line it means this
@@ -2895,7 +2900,7 @@ while (ptr < endptr)
 /* If input is line buffered, and the buffer is not yet full, read another
 line and add it into the buffer. */
-if (input_line_buffered && bufflength < (size_t)bufsize)
+if (input_line_buffered && bufflength < (PCRE2_SIZE)bufsize)
 {
 int add = read_one_line(ptr, bufsize - (int)(ptr - main_buffer), in);
 bufflength += add;
@@ -2907,7 +2912,7 @@ while (ptr < endptr)
 1/3 and refill it. Before we do this, if some unprinted "after" lines are
 about to be lost, print them. */
-if (bufflength >= (size_t)bufsize && ptr > main_buffer + 2*bufthird)
+if (bufflength >= (PCRE2_SIZE)bufsize && ptr > main_buffer + 2*bufthird)
 {
 if (after_context > 0 &&
 lastmatchnumber > 0 &&
@@ -3395,9 +3400,8 @@ PCRE2_SIZE patlen, erroffset;
 PCRE2_UCHAR errmessbuffer[ERRBUFSIZ];
 if (p->compiled != NULL) return TRUE;
 ps = p->string;
-patlen = strlen(ps);
+patlen = p->length;
 if ((options & PCRE2_LITERAL) != 0)
 {
@@ -3407,8 +3411,8 @@ if ((options & PCRE2_LITERAL) != 0)
 if (ellength != 0)
 {
-if (add_pattern(pe, p) == NULL) return FALSE;
-patlen = (int)(pe - ps - ellength);
+patlen = pe - ps - ellength;
+if (add_pattern(pe, p->length-patlen-ellength, p) == NULL) return FALSE;
 }
 }
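Note on the hunk above: the new code computes two lengths when a literal pattern contains a newline. The part before the newline is `pe - ps - ellength`, and the remainder handed to add_pattern() is `p->length - patlen - ellength`. A small worked sketch of that arithmetic follows. It is not the pcre2grep source: the function name `split_literal` is hypothetical and it assumes a one-byte '\n' newline, whereas pcre2grep supports other newline conventions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the pcre2grep source. Splits a literal pattern of
   known length at its first '\n'. On return, *first_len is the length of the
   part before the newline and *rest_len is the length of the part after it.
   Returns a pointer to the remainder, or NULL if there is no newline. */
static const char *
split_literal(const char *ps, size_t length, size_t *first_len,
  size_t *rest_len)
{
const size_t ellength = 1;   /* assumed newline length */
const char *nl;
const char *pe;
assert(ps != NULL && first_len != NULL && rest_len != NULL);
nl = memchr(ps, '\n', length);
if (nl == NULL)
  {
  *first_len = length;
  *rest_len = 0;
  return NULL;
  }
pe = nl + ellength;                          /* first byte after the newline */
*first_len = (size_t)(pe - ps) - ellength;   /* as patlen in the hunk */
*rest_len = length - *first_len - ellength;  /* as the add_pattern() length */
return pe;
}
```

For a nine-byte buffer holding AB, zero, C, newline, XY, zero, Z, the two lengths are 4 and 4. Neither calculation relies on a terminating zero, which is what lets both halves keep their embedded zeros.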
@@ -3470,6 +3474,7 @@ static BOOL
 read_pattern_file(char *name, patstr **patptr, patstr **patlastptr)
 {
 int linenumber = 0;
+PCRE2_SIZE patlen;
 FILE *f;
 const char *filename;
 char buffer[MAXPATLEN+20];
@@ -3490,20 +3495,18 @@ else
 filename = name;
 }
-while (fgets(buffer, sizeof(buffer), f) != NULL)
+while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
 {
-char *s = buffer + (int)strlen(buffer);
-while (s > buffer && isspace((unsigned char)(s[-1]))) s--;
-*s = 0;
+while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
 linenumber++;
-if (buffer[0] == 0) continue; /* Skip blank lines */
+if (patlen == 0) continue; /* Skip blank lines */
 /* Note: this call to add_pattern() puts a pointer to the local variable
 "buffer" into the pattern chain. However, that pointer is used only when
 compiling the pattern, which happens immediately below, so we flatten it
 afterwards, as a precaution against any later code trying to use it. */
-*patlastptr = add_pattern(buffer, *patlastptr);
+*patlastptr = add_pattern(buffer, patlen, *patlastptr);
 if (*patlastptr == NULL)
 {
 if (f != stdin) fclose(f);
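Note on the hunk above: the old loop found the end of the line with strlen() and wrote a terminating zero, so anything after an embedded zero was silently dropped. The new loop trims by shrinking a length instead. A minimal standalone sketch of that trimming step (the helper name `trim_trailing_space` is hypothetical, not a pcre2grep function):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of pcre2grep. Returns the length of the
   buffer once trailing white space (including the newline) is discounted.
   It never searches for a terminator, so embedded binary zeros are kept.
   A result of zero means the line was blank. */
static size_t
trim_trailing_space(const char *buffer, size_t length)
{
assert(buffer != NULL);
while (length > 0 && isspace((unsigned char)buffer[length-1])) length--;
return length;
}
```

A binary zero is not white space, so a pattern line that ends in a zero keeps it. Whether that is desirable for a given pattern file is outside what this diff states.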
@@ -3513,8 +3516,9 @@ while (fgets(buffer, sizeof(buffer), f) != NULL)
 /* This loop is needed because compiling a "pattern" when -F is set may add
 on additional literal patterns if the original contains a newline. In the
-common case, it never will, because fgets() stops at a newline. However,
-the -N option can be used to give pcre2grep a different newline setting. */
+common case, it never will, because read_one_line() stops at a newline.
+However, the -N option can be used to give pcre2grep a different newline
+setting. */
 for(;;)
 {
@@ -3833,7 +3837,8 @@ for (i = 1; i < argc; i++)
 else if (op->type == OP_PATLIST)
 {
 patdatastr *pd = (patdatastr *)op->dataptr;
-*(pd->lastptr) = add_pattern(option_data, *(pd->lastptr));
+*(pd->lastptr) = add_pattern(option_data, (PCRE2_SIZE)strlen(option_data),
+*(pd->lastptr));
 if (*(pd->lastptr) == NULL) goto EXIT2;
 if (*(pd->anchor) == NULL) *(pd->anchor) = *(pd->lastptr);
 }
@@ -4095,7 +4100,9 @@ the first argument is the one and only pattern, and it must exist. */
 if (patterns == NULL && pattern_files == NULL)
 {
 if (i >= argc) return usage(2);
-patterns = patterns_last = add_pattern(argv[i++], NULL);
+patterns = patterns_last = add_pattern(argv[i], (PCRE2_SIZE)strlen(argv[i]),
+NULL);
+i++;
 if (patterns == NULL) goto EXIT2;
 }

testdata/grepoutput

@@ -945,3 +945,6 @@ RC=0
 RC=0
 abcd
 RC=0
+---------------------------- Test 126 -----------------------------
+ABCXYZ
+RC=0