Substitution tests and documentation.

Philip.Hazel 2014-11-12 16:57:56 +00:00
parent b3ac0ffb32
commit c19bd9a377
8 changed files with 171 additions and 67 deletions


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "09 November 2014" "PCRE 10.00" .TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00"
.SH NAME .SH NAME
pcre2test - a program for testing Perl-compatible regular expressions. pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS .SH SYNOPSIS
@ -645,6 +645,7 @@ not affect the compilation process.
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
mark show mark values mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
.sp .sp
These modifiers may not appear in a \fB#pattern\fP command. If you want them as These modifiers may not appear in a \fB#pattern\fP command. If you want them as
@ -719,6 +720,7 @@ pattern.
offset=<n> set starting offset offset=<n> set starting offset
ovector=<n> set size of output vector ovector=<n> set size of output vector
recursion_limit=<n> set a recursion limit recursion_limit=<n> set a recursion limit
replace=<string> specify a replacement string
startchar show startchar when relevant startchar show startchar when relevant
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
.sp .sp
@ -797,6 +799,29 @@ Any value other than zero is used as a return from \fBpcre2test\fP's callout
function. function.
. .
. .
.SS "Finding all matches in a string"
.rs
.sp
Searching for all possible matches within a subject can be requested by the
\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
to start searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins with a lookbehind
assertion (including \eb or \eB).
.P
If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
another, non-empty, match at the same point in the subject. If this match
fails, the start offset is advanced, and the normal match is retried. This
imitates the way Perl handles such cases when using the \fB/g\fP modifier or
the \fBsplit()\fP function. Normally, the start offset is advanced by one
character, but if the newline convention recognizes CRLF as a newline, and the
current character is CR followed by LF, an advance of two is used.
.
.
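The advance rule described in the added paragraph can be sketched in a few lines of C. This is only an illustration under stated assumptions: `empty_match_advance` is a hypothetical helper, not a function in pcre2test, and it assumes an 8-bit, non-UTF subject so that one character is one code unit.

```c
#include <stddef.h>

/* Hypothetical helper, not part of pcre2test. Returns how many code units to
   advance the start offset after an empty match whose anchored, non-empty
   retry has failed. Assumes an 8-bit, non-UTF subject, so one character is
   exactly one code unit. */
size_t empty_match_advance(const char *subject, size_t length,
                           size_t offset, int crlf_is_newline)
{
  if (offset >= length) return 0;          /* Already at the end */
  if (crlf_is_newline &&
      offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return 2;                              /* Step over CRLF as one newline */
  return 1;                                /* Ordinary single character */
}
```

In UTF mode the single-character step would have to cover every code unit of the character, which this sketch deliberately leaves out.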
.SS "Testing substring extraction functions" .SS "Testing substring extraction functions"
.rs .rs
.sp .sp
@ -821,27 +846,38 @@ length (that is, the return from the extraction function) is given in
parentheses after each substring. parentheses after each substring.
. .
. .
.SS "Finding all matches in a string" .SS "Testing the substitution function"
.rs .rs
.sp .sp
Searching for all possible matches within a subject can be requested by the If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is
\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching called instead of one of the matching functions. Unlike subject strings,
function is called again to search the remainder of the subject. The difference \fBpcre2test\fP does not process replacement strings for escape sequences. In
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP If so, it is correctly converted to a UTF string of the appropriate code unit
to start searching at a new point within the entire string (which is what Perl width. If it is not a valid UTF-8 string, the individual code units are copied
does), whereas the latter passes over a shortened substring. This makes a directly. This provides a means of passing an invalid UTF-8 string for testing
difference to the matching process if the pattern begins with a lookbehind purposes.
assertion (including \eb or \eB).
.P .P
If an empty string is matched, the next match is done with the If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for \fBpcre2_substitute()\fP. After a successful substitution, the modified string
another, non-empty, match at the same point in the subject. If this match is output, preceded by the number of replacements. This may be zero if there
fails, the start offset is advanced, and the normal match is retried. This were no matches. Here is a simple example of a substitution test:
imitates the way Perl handles such cases when using the \fB/g\fP modifier or .sp
the \fBsplit()\fP function. Normally, the start offset is advanced by one /abc/replace=xxx
character, but if the newline convention recognizes CRLF as a newline, and the =abc=abc=
current character is CR followed by LF, an advance of two is used. 1: =xxx=abc=
=abc=abc=\=global
2: =xxx=xxx=
.sp
Subject and replacement strings should be kept relatively short for
substitution tests, as fixed-size buffers are used. To make it easy to test for
buffer overflow, if the replacement string starts with a number in square
brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
output buffer, with the replacement string starting at the next character.
.P
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
\fBpcre2_substitute()\fP.
. .
. .
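The buffer-size prefix described above can be illustrated with a small parser. This is a sketch, not the code in pcre2test: `parse_size_prefix` is a hypothetical name, and requiring at least one digit inside the brackets is an assumption made here for clarity.

```c
#include <stddef.h>

/* Hypothetical helper illustrating the "[<number>]" prefix convention.
   On success it returns a pointer to the replacement text proper and stores
   the requested output buffer size in *size. If there is no prefix, the
   string is returned unchanged and *size is left alone. A malformed prefix
   such as "[12xyz" yields NULL. Requiring at least one digit is an
   assumption of this sketch. */
const char *parse_size_prefix(const char *replacement, size_t *size)
{
  const char *p = replacement;
  size_t n = 0;

  if (*p != '[') return replacement;       /* No prefix present */
  p++;
  if (*p < '0' || *p > '9') return NULL;   /* Need at least one digit */
  while (*p >= '0' && *p <= '9')
    n = n * 10 + (size_t)(*p++ - '0');     /* Accumulate the decimal value */
  if (*p != ']') return NULL;              /* Missing closing bracket */
  *size = n;
  return p + 1;                            /* Text after the bracket */
}
```

With this convention a test line such as `/abc/replace=[10]xyz` would request a 10-unit output buffer and use `xyz` as the replacement, which makes it straightforward to provoke an overflow error on purpose.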
.SS "Setting the JIT stack size" .SS "Setting the JIT stack size"
@ -1200,6 +1236,6 @@ Cambridge CB2 3QH, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 09 November 2014 Last updated: 12 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi


@ -102,7 +102,7 @@ static const char compile_error_texts[] =
/* 30 */ /* 30 */
"unknown POSIX class name\0" "unknown POSIX class name\0"
"internal error in pcre2_study(): should not occur\0" "internal error in pcre2_study(): should not occur\0"
"this version of PCRE does not have UTF or Unicode property support\0" "this version of PCRE2 does not have Unicode support\0"
"parentheses are too deeply nested (stack check)\0" "parentheses are too deeply nested (stack check)\0"
"character code point value in \\x{} or \\o{} is too large\0" "character code point value in \\x{} or \\o{} is too large\0"
/* 35 */ /* 35 */
@ -118,7 +118,7 @@ static const char compile_error_texts[] =
"two named subpatterns have the same name (PCRE2_DUPNAMES not set)\0" "two named subpatterns have the same name (PCRE2_DUPNAMES not set)\0"
"group name must start with a non-digit\0" "group name must start with a non-digit\0"
/* 45 */ /* 45 */
"this version of PCRE does not have support for \\P, \\p, or \\X\0" "this version of PCRE2 does not have support for \\P, \\p, or \\X\0"
"malformed \\P or \\p sequence\0" "malformed \\P or \\p sequence\0"
"unknown property name after \\P or \\p\0" "unknown property name after \\P or \\p\0"
"subpattern name is too long (maximum " XSTRING(MAX_NAME_SIZE) " characters)\0" "subpattern name is too long (maximum " XSTRING(MAX_NAME_SIZE) " characters)\0"


@ -40,14 +40,16 @@ POSSIBILITY OF SUCH DAMAGE.
/* This module contains an internal function for validating UTF character /* This module contains an internal function for validating UTF character
strings. */ strings. This file is also #included by the pcre2test program, which uses
macros to change names from _pcre2_xxx to xxxx, thereby avoiding name clashes
with the library. In this case, PCRE2_PCRE2TEST is defined. */
#ifndef PCRE2_PCRE2TEST /* We're compiling the library */
#ifdef HAVE_CONFIG_H #ifdef HAVE_CONFIG_H
#include "config.h" #include "config.h"
#endif #endif
#include "pcre2_internal.h" #include "pcre2_internal.h"
#endif /* PCRE2_PCRE2TEST */
#ifndef SUPPORT_UNICODE #ifndef SUPPORT_UNICODE


@ -165,9 +165,14 @@ void vms_setsymbol( char *, char *, int );
#define DEFAULT_OVECCOUNT 15 /* Default ovector count */ #define DEFAULT_OVECCOUNT 15 /* Default ovector count */
#define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */ #define JUNK_OFFSET 0xdeadbeef /* For initializing ovector */
#define LOOPREPEAT 500000 /* Default loop count for timing */ #define LOOPREPEAT 500000 /* Default loop count for timing */
#define REPLACE_BUFFSIZE 400 /* For replacement strings */ #define REPLACE_MODSIZE 96 /* Field for reading 8-bit replacement */
#define VERSION_SIZE 64 /* Size of buffer for the version strings */ #define VERSION_SIZE 64 /* Size of buffer for the version strings */
/* Make sure the buffer into which replacement strings are copied is big enough
to hold them as 32-bit code units. */
#define REPLACE_BUFFSIZE (4*REPLACE_MODSIZE)
/* Execution modes */ /* Execution modes */
#define PCRE8_MODE 8 #define PCRE8_MODE 8
@ -258,6 +263,20 @@ these inclusions should not be changed. */
#define PCRE2_SUFFIX(a) a #define PCRE2_SUFFIX(a) a
/* We need to be able to check input text for UTF-8 validity, whatever code
widths are actually available, because the input to pcre2test is always in
8-bit code units. So we include the UTF validity checking function for 8-bit
code units. */
extern int valid_utf(PCRE2_SPTR8, PCRE2_SIZE, PCRE2_SIZE *);
#define PCRE2_CODE_UNIT_WIDTH 8
#undef PCRE2_SPTR
#define PCRE2_SPTR PCRE2_SPTR8
#include "pcre2_valid_utf.c"
#undef PCRE2_CODE_UNIT_WIDTH
#undef PCRE2_SPTR
/* If we have 8-bit support, default to it; if there is also 16-or 32-bit /* If we have 8-bit support, default to it; if there is also 16-or 32-bit
support, it can be selected by a command-line option. If there is no 8-bit support, it can be selected by a command-line option. If there is no 8-bit
support, there must be 16- or 32-bit support, so default to one of them. The support, there must be 16- or 32-bit support, so default to one of them. The
@ -370,14 +389,19 @@ data line. */
CTL_MEMORY|\ CTL_MEMORY|\
CTL_STARTCHAR) CTL_STARTCHAR)
/* Structures for holding modifier information for patterns and subject strings
(data). Fields containing modifiers that can be set either for a pattern or a
subject must be at the start and in the same order in both cases so that the
same offset in the big table below works for both. */
typedef struct patctl { /* Structure for pattern modifiers. */ typedef struct patctl { /* Structure for pattern modifiers. */
uint32_t options; /* Must be in same position as datctl */ uint32_t options; /* Must be in same position as datctl */
uint32_t control; /* Must be in same position as datctl */ uint32_t control; /* Must be in same position as datctl */
uint8_t replacement[REPLACE_MODSIZE]; /* So must this */
uint32_t jit; uint32_t jit;
uint32_t stackguard_test; uint32_t stackguard_test;
uint32_t tables_id; uint32_t tables_id;
uint8_t locale[32]; uint8_t locale[32];
uint8_t replacement[REPLACE_BUFFSIZE];
} patctl; } patctl;
#define MAXCPYGET 10 #define MAXCPYGET 10
@ -386,6 +410,7 @@ typedef struct patctl { /* Structure for pattern modifiers. */
typedef struct datctl { /* Structure for data line modifiers. */ typedef struct datctl { /* Structure for data line modifiers. */
uint32_t options; /* Must be in same position as patctl */ uint32_t options; /* Must be in same position as patctl */
uint32_t control; /* Must be in same position as patctl */ uint32_t control; /* Must be in same position as patctl */
uint8_t replacement[REPLACE_MODSIZE]; /* So must this */
uint32_t cfail[2]; uint32_t cfail[2];
int32_t callout_data; int32_t callout_data;
int32_t copy_numbers[MAXCPYGET]; int32_t copy_numbers[MAXCPYGET];
@ -487,7 +512,7 @@ static modstruct modlist[] = {
{ "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) }, { "posix", MOD_PAT, MOD_CTL, CTL_POSIX, PO(control) },
{ "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) },
{ "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) }, { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) },
{ "replace", MOD_PAT, MOD_STR, 0, PO(replacement) }, { "replace", MOD_PND, MOD_STR, 0, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) }, { "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) },
@ -4211,13 +4236,14 @@ uint32_t *q32 = NULL;
/* Copy the default context and data control blocks to the active ones. Then /* Copy the default context and data control blocks to the active ones. Then
copy from the pattern the controls that can be set in either the pattern or the copy from the pattern the controls that can be set in either the pattern or the
data. This allows them to be unset in the data line. We do not do this for data. This allows them to be overridden in the data line. We do not do this for
options because those that are common apply separately to compiling and options because those that are common apply separately to compiling and
matching. */ matching. */
DATCTXCPY(dat_context, default_dat_context); DATCTXCPY(dat_context, default_dat_context);
memcpy(&dat_datctl, &def_datctl, sizeof(datctl)); memcpy(&dat_datctl, &def_datctl, sizeof(datctl));
dat_datctl.control |= (pat_patctl.control & CTL_ALLPD); dat_datctl.control |= (pat_patctl.control & CTL_ALLPD);
strcpy((char *)dat_datctl.replacement, (char *)pat_patctl.replacement);
/* Initialize for scanning the data line. */ /* Initialize for scanning the data line. */
@ -4716,19 +4742,27 @@ else
PCRE2_MATCH_DATA_CREATE(match_data, max_oveccount, NULL); PCRE2_MATCH_DATA_CREATE(match_data, max_oveccount, NULL);
} }
/* Replacement processing is ignored for DFA matching. */
if (dat_datctl.replacement[0] != 0 && (dat_datctl.control & CTL_DFA) != 0)
{
fprintf(outfile, "** Ignored for DFA matching: replace\n");
dat_datctl.replacement[0] = 0;
}
/* If a replacement string is provided, call pcre2_substitute() instead of one /* If a replacement string is provided, call pcre2_substitute() instead of one
of the matching functions. First we have to convert the replacement string to of the matching functions. First we have to convert the replacement string to
the appropriate width. */ the appropriate width. */
if (pat_patctl.replacement[0] != 0) if (dat_datctl.replacement[0] != 0)
{ {
int rc; int rc;
uint8_t *pr; uint8_t *pr;
uint8_t rbuffer[REPLACE_BUFFSIZE]; uint8_t rbuffer[REPLACE_BUFFSIZE];
uint8_t nbuffer[REPLACE_BUFFSIZE]; uint8_t nbuffer[REPLACE_BUFFSIZE];
uint32_t goption; uint32_t goption;
PCRE2_SIZE rlen; PCRE2_SIZE rlen, nsize, erroroffset;
PCRE2_SIZE nsize; BOOL badutf = FALSE;
#ifdef SUPPORT_PCRE2_8 #ifdef SUPPORT_PCRE2_8
uint8_t *r8 = NULL; uint8_t *r8 = NULL;
@ -4740,10 +4774,13 @@ if (pat_patctl.replacement[0] != 0)
uint32_t *r32 = NULL; uint32_t *r32 = NULL;
#endif #endif
goption = ((pat_patctl.control & CTL_GLOBAL) == 0)? 0 : if (timeitm)
fprintf(outfile, "** Timing is not supported with replace: ignored\n");
goption = ((dat_datctl.control & CTL_GLOBAL) == 0)? 0 :
PCRE2_SUBSTITUTE_GLOBAL; PCRE2_SUBSTITUTE_GLOBAL;
SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */ SETCASTPTR(r, rbuffer); /* Sets r8, r16, or r32, as appropriate. */
pr = pat_patctl.replacement; pr = dat_datctl.replacement;
/* If the replacement starts with '[<number>]' we interpret that as length /* If the replacement starts with '[<number>]' we interpret that as length
value for the replacement buffer. */ value for the replacement buffer. */
@ -4767,52 +4804,58 @@ if (pat_patctl.replacement[0] != 0)
nsize = n; nsize = n;
} }
/* Now copy the replacement string to a buffer of the appropriate width. */ /* Now copy the replacement string to a buffer of the appropriate width. No
escape processing is done for replacements. In UTF mode, check for an invalid
UTF-8 input string, and if it is invalid, just copy its code units without
UTF interpretation. This provides a means of checking that an invalid string
is detected. Otherwise, UTF-8 can be used to include wide characters in a
replacement. */
while ((c = *pr++) != 0) if (utf) badutf = valid_utf(pr, strlen((const char *)pr), &erroroffset);
/* Not UTF or invalid UTF-8: just copy the code units. */
if (!utf || badutf)
{ {
if (utf && HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); } while ((c = *pr++) != 0)
{
#ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) *r8++ = c;
#endif
#ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE) *r16++ = c;
#endif
#ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE) *r32++ = c;
#endif
}
}
/* At present no escape processing is provided for replacements. */ /* Valid UTF-8 replacement string */
else while ((c = *pr++) != 0)
{
if (HASUTF8EXTRALEN(c)) { GETUTF8INC(c, pr); }
#ifdef SUPPORT_PCRE2_8 #ifdef SUPPORT_PCRE2_8
if (test_mode == PCRE8_MODE) if (test_mode == PCRE8_MODE) r8 += ord2utf8(c, r8);
{
if (utf)
{
r8 += ord2utf8(c, r8);
}
else
{
*r8++ = c;
}
}
#endif #endif
#ifdef SUPPORT_PCRE2_16 #ifdef SUPPORT_PCRE2_16
if (test_mode == PCRE16_MODE) if (test_mode == PCRE16_MODE)
{ {
if (utf) if (c >= 0x10000u)
{ {
if (c >= 0x10000u) c-= 0x10000u;
{ *r16++ = 0xD800 | (c >> 10);
c-= 0x10000u; *r16++ = 0xDC00 | (c & 0x3ff);
*r16++ = 0xD800 | (c >> 10);
*r16++ = 0xDC00 | (c & 0x3ff);
}
else
*r16++ = c;
}
else
{
*r16++ = c;
} }
else *r16++ = c;
} }
#endif #endif
#ifdef SUPPORT_PCRE2_32 #ifdef SUPPORT_PCRE2_32
if (test_mode == PCRE32_MODE) if (test_mode == PCRE32_MODE) *r32++ = c;
{
*r32++ = c;
}
#endif #endif
} }
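The 16-bit branch of the copy loop above relies on the standard UTF-16 surrogate-pair encoding. A standalone sketch of that step follows; `encode_utf16` is a hypothetical helper name, and the code assumes the input is a valid Unicode scalar value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 encoding of code point c to out and
   returns the number of 16-bit code units written (1 or 2). Assumes c is a
   valid Unicode scalar value (at most 0x10FFFF and not itself a surrogate). */
size_t encode_utf16(uint32_t c, uint16_t *out)
{
  if (c >= 0x10000u)
  {
    c -= 0x10000u;                                /* 20 significant bits left */
    out[0] = (uint16_t)(0xD800u | (c >> 10));     /* High surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00u | (c & 0x3FFu));  /* Low surrogate: low 10 bits */
    return 2;
  }
  out[0] = (uint16_t)c;                           /* BMP character: one unit */
  return 1;
}
```

For example, U+1F600 encodes as the pair 0xD83D 0xDE00, while U+03A3 stays a single unit.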


@ -444,4 +444,7 @@
/\x{3a3}B/IBi,utf /\x{3a3}B/IBi,utf
/abc/utf,replace=Ã
abc
# End of testinput10 # End of testinput10

6
testdata/testinput2 vendored

@ -4067,6 +4067,12 @@ a random value. /Ix
/abc/replace=xyz /abc/replace=xyz
1abc2\=partial_hard 1abc2\=partial_hard
/abc/replace=xyz
123abc456
123abc456\=replace=pqr
123abc456abc789
123abc456abc789\=g
# End of substitute tests # End of substitute tests
# End of testinput2 # End of testinput2


@ -1546,4 +1546,8 @@ Starting code units: \xce \xcf
Last code unit = 'B' (caseless) Last code unit = 'B' (caseless)
Subject length lower bound = 2 Subject length lower bound = 2
/abc/utf,replace=Ã
abc
Failed: error -3: UTF-8 error: 1 byte missing at end
# End of testinput10 # End of testinput10

10
testdata/testoutput2 vendored

@ -13689,6 +13689,16 @@ Failed: error -47: no more memory
1abc2\=partial_hard 1abc2\=partial_hard
Failed: error -34: bad option value Failed: error -34: bad option value
/abc/replace=xyz
123abc456
1: 123xyz456
123abc456\=replace=pqr
1: 123pqr456
123abc456abc789
1: 123xyz456abc789
123abc456abc789\=g
2: 123xyz456xyz789
# End of substitute tests # End of substitute tests
# End of testinput2 # End of testinput2