From bcba497c0b34136a9cddd94c9af1fa4b121c95d7 Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Mon, 5 Jun 2017 18:25:47 +0000 Subject: [PATCH] Implement REG_PEND (GNU extension) for the POSIX wrapper. --- ChangeLog | 2 + doc/html/pcre2posix.html | 61 ++++-- doc/html/pcre2test.html | 15 +- doc/pcre2.txt | 147 +++++++------ doc/pcre2posix.3 | 53 +++-- doc/pcre2test.txt | 443 ++++++++++++++++++++------------------- src/pcre2posix.c | 8 +- src/pcre2posix.h | 5 +- src/pcre2test.c | 19 +- testdata/testinput18 | 6 + testdata/testinput19 | 3 + testdata/testoutput18 | 8 + testdata/testoutput19 | 4 + 13 files changed, 447 insertions(+), 327 deletions(-) diff --git a/ChangeLog b/ChangeLog index 25ced58..f063442 100644 --- a/ChangeLog +++ b/ChangeLog @@ -182,6 +182,8 @@ deeply. (Compare item 10.23/36.) This should fix oss-fuzz #1761. 38. Fix returned offsets from regexec() when REG_STARTEND is used with a starting offset greater than zero. +39. Implement REG_PEND (GNU extension) for the POSIX wrapper. + Version 10.23 14-February-2017 ------------------------------ diff --git a/doc/html/pcre2posix.html b/doc/html/pcre2posix.html index 1d5fe63..a6d75e1 100644 --- a/doc/html/pcre2posix.html +++ b/doc/html/pcre2posix.html @@ -69,7 +69,7 @@ replacement library. Other POSIX options are not even defined.

There are also some options that are not defined by POSIX. These have been added at the request of users who want to make use of certain PCRE2-specific -features via the POSIX calling interface. +features via the POSIX calling interface or to add BSD or GNU functionality.

When PCRE2 is called via these functions, it is only the API that is POSIX-like @@ -91,10 +91,11 @@ identifying error codes.
COMPILING A PATTERN

The function regcomp() is called to compile a pattern into an -internal form. The pattern is a C string terminated by a binary zero, and -is passed in the argument pattern. The preg argument is a pointer -to a regex_t structure that is used as a base for storing information -about the compiled regular expression. +internal form. By default, the pattern is a C string terminated by a binary +zero (but see REG_PEND below). The preg argument is a pointer to a +regex_t structure that is used as a base for storing information about +the compiled regular expression. (It is also used for input when REG_PEND is +set.)

The argument cflags is either zero, or contains one or more of the bits @@ -124,6 +125,16 @@ matching, the nmatch and pmatch arguments are ignored, and no captured strings are returned. Versions of the PCRE library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens because it disables the use of back references. +

+  REG_PEND
+
+If this option is set, the re_endp field in the preg structure +(which has the type const char *) must be set to point to the character beyond +the end of the pattern before calling regcomp(). The pattern itself may +now contain binary zeroes, which are treated as data characters. Without +REG_PEND, a binary zero terminates the pattern and the re_endp field is +ignored. This is a GNU extension to the POSIX standard and should be used with +caution in software intended to be portable to other systems.
   REG_UCP
 
@@ -156,9 +167,10 @@ class such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero otherwise. The -preg structure is filled in on success, and one member of the structure -is public: re_nsub contains the number of capturing subpatterns in -the regular expression. Various error codes are defined in the header file. +preg structure is filled in on success, and one other member of the +structure (as well as re_endp) is public: re_nsub contains the +number of capturing subpatterns in the regular expression. Various error codes +are defined in the header file.

NOTE: If the yield of regcomp() is non-zero, you must not attempt to @@ -228,15 +240,26 @@ function.

   REG_STARTEND
 
-The string is considered to start at string + pmatch[0].rm_so and -to have a terminating NUL located at string + pmatch[0].rm_eo -(there need not actually be a NUL at that location), regardless of the value of -nmatch. This is a BSD extension, compatible with but not specified by -IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software -intended to be portable to other systems. Note that a non-zero rm_so does -not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not -how it is matched. Setting REG_STARTEND and passing pmatch as NULL are -mutually exclusive; the error REG_INVARG is returned. +When this option is set, the subject string starts at string + +pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which +should point to the first character beyond the string. There may be binary +zeroes within the subject string, and indeed, using REG_STARTEND is the only +way to pass a subject string that contains a binary zero. +

+

+Whatever the value of pmatch[0].rm_so, the offsets of the matched string +and any captured substrings are still given relative to the start of +string itself. (Before PCRE2 release 10.30 these were given relative to +string + pmatch[0].rm_so, but this differs from other +implementations.) +

+

+This is a BSD extension, compatible with but not specified by IEEE Standard +1003.2 (POSIX.2), and should be used with caution in software intended to be +portable to other systems. Note that a non-zero rm_so does not imply +REG_NOTBOL; REG_STARTEND affects only the location and length of the string, +not how it is matched. Setting REG_STARTEND and passing pmatch as NULL +are mutually exclusive; the error REG_INVARG is returned.
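The REG_STARTEND behaviour described above can be illustrated with a short C sketch. It is only an illustration under stated assumptions: `find_in_range` is a hypothetical helper that is not part of PCRE2, and the code assumes the wrapper header is reachable as `regex.h` (either pcre2posix.h renamed or aliased, or a system header that also provides REG_STARTEND). Where REG_STARTEND is not defined, the helper simply reports failure.

```c
#include <regex.h>   /* assumption: pcre2posix.h aliased as regex.h, or a
                        system header that defines REG_STARTEND */
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: search for `pattern` only within
   subject[start..end) and return the offset of the match start, measured
   from subject[0]. Returns -1 on no match, on a compile error, or when
   REG_STARTEND is unavailable. */
long find_in_range(const char *pattern, const char *subject,
                   size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    long result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;            /* preg must not be used after a failure */

    /* On input, pmatch[0] delimits the part of the subject to search. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;

    /* On output, pmatch[0] holds the offsets of the match itself. */
    if (regexec(&re, subject, 1, pm, REG_STARTEND) == 0)
        result = (long)pm[0].rm_so;

    regfree(&re);
    return result;
#else
    (void)pattern; (void)subject; (void)start; (void)end;
    return -1;
#endif
}
```

With the 10.30 semantics described above, a call such as `find_in_range("abc", "xxabcxxabc", 5, 10)` should report 7 rather than 2, because the returned offset is relative to the start of the whole string and not to `string + pmatch[0].rm_so`.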

If the pattern was compiled with the REG_NOSUB flag, no data about any matched @@ -291,9 +314,9 @@ Cambridge, England.


REVISION

-Last updated: 31 January 2016 +Last updated: 05 June 2017
-Copyright © 1997-2016 University of Cambridge. +Copyright © 1997-2017 University of Cambridge.

Return to the PCRE2 index page. diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html index 58312c0..3fbc8c5 100644 --- a/doc/html/pcre2test.html +++ b/doc/html/pcre2test.html @@ -1078,6 +1078,19 @@ are notbol, notempty, and noteol, causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). The other modifiers are ignored, with a warning message.

+

+There is one additional modifier that can be used with the POSIX wrapper. It is +ignored (with a warning) if used for non-POSIX matching. +

+      posix_startend=<n>[:<m>] 
+
+This causes the subject string to be passed to regexec() using the +REG_STARTEND option, which uses offsets to restrict which part of the string is +searched. If only one number is given, the end offset is passed as the end of +the subject string. For more detail of REG_STARTEND, see the +pcre2posix +documentation. +


Setting match controls
@@ -1817,7 +1830,7 @@ Cambridge, England.


REVISION

-Last updated: 01 June 2017 +Last updated: 03 June 2017
Copyright © 1997-2017 University of Cambridge.
diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 4341a61..ae34d4e 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -8986,32 +8986,34 @@ DESCRIPTION There are also some options that are not defined by POSIX. These have been added at the request of users who want to make use of certain - PCRE2-specific features via the POSIX calling interface. + PCRE2-specific features via the POSIX calling interface or to add BSD + or GNU functionality. - When PCRE2 is called via these functions, it is only the API that is - POSIX-like in style. The syntax and semantics of the regular expres- - sions themselves are still those of Perl, subject to the setting of - various PCRE2 options, as described below. "POSIX-like in style" means - that the API approximates to the POSIX definition; it is not fully - POSIX-compatible, and in multi-unit encoding domains it is probably + When PCRE2 is called via these functions, it is only the API that is + POSIX-like in style. The syntax and semantics of the regular expres- + sions themselves are still those of Perl, subject to the setting of + various PCRE2 options, as described below. "POSIX-like in style" means + that the API approximates to the POSIX definition; it is not fully + POSIX-compatible, and in multi-unit encoding domains it is probably even less compatible. The header for these functions is supplied as pcre2posix.h to avoid any - potential clash with other POSIX libraries. It can, of course, be + potential clash with other POSIX libraries. It can, of course, be renamed or aliased as regex.h, which is the "correct" name. It provides - two structure types, regex_t for compiled internal forms, and reg- - match_t for returning captured substrings. It also defines some con- - stants whose names start with "REG_"; these are used for setting + two structure types, regex_t for compiled internal forms, and reg- + match_t for returning captured substrings. 
It also defines some con- + stants whose names start with "REG_"; these are used for setting options and identifying error codes. COMPILING A PATTERN - The function regcomp() is called to compile a pattern into an internal - form. The pattern is a C string terminated by a binary zero, and is - passed in the argument pattern. The preg argument is a pointer to a - regex_t structure that is used as a base for storing information about - the compiled regular expression. + The function regcomp() is called to compile a pattern into an internal + form. By default, the pattern is a C string terminated by a binary zero + (but see REG_PEND below). The preg argument is a pointer to a regex_t + structure that is used as a base for storing information about the com- + piled regular expression. (It is also used for input when REG_PEND is + set.) The argument cflags is either zero, or contains one or more of the bits defined by the following macros: @@ -9042,38 +9044,50 @@ COMPILING A PATTERN used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens because it disables the use of back references. + REG_PEND + + If this option is set, the re_endp field in the preg structure (which + has the type const char *) must be set to point to the character beyond + the end of the pattern before calling regcomp(). The pattern itself may + now contain binary zeroes, which are treated as data characters. With- + out REG_PEND, a binary zero terminates the pattern and the re_endp + field is ignored. This is a GNU extension to the POSIX standard and + should be used with caution in software intended to be portable to + other systems. + REG_UCP - The PCRE2_UCP option is set when the regular expression is passed for - compilation to the native function. This causes PCRE2 to use Unicode - properties when matchine \d, \w, etc., instead of just recognizing + The PCRE2_UCP option is set when the regular expression is passed for + compilation to the native function. 
This causes PCRE2 to use Unicode + properties when matchine \d, \w, etc., instead of just recognizing ASCII values. Note that REG_UCP is not part of the POSIX standard. REG_UNGREEDY - The PCRE2_UNGREEDY option is set when the regular expression is passed - for compilation to the native function. Note that REG_UNGREEDY is not + The PCRE2_UNGREEDY option is set when the regular expression is passed + for compilation to the native function. Note that REG_UNGREEDY is not part of the POSIX standard. REG_UTF - The PCRE2_UTF option is set when the regular expression is passed for - compilation to the native function. This causes the pattern itself and - all data strings used for matching it to be treated as UTF-8 strings. + The PCRE2_UTF option is set when the regular expression is passed for + compilation to the native function. This causes the pattern itself and + all data strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF is not part of the POSIX standard. - In the absence of these flags, no options are passed to the native - function. This means the the regex is compiled with PCRE2 default - semantics. In particular, the way it handles newline characters in the - subject string is the Perl way, not the POSIX way. Note that setting + In the absence of these flags, no options are passed to the native + function. This means the the regex is compiled with PCRE2 default + semantics. In particular, the way it handles newline characters in the + subject string is the Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE. - It does not affect the way newlines are matched by the dot metacharac- + It does not affect the way newlines are matched by the dot metacharac- ter (they are not) or by a negative class such as [^a] (they are). - The yield of regcomp() is zero on success, and non-zero otherwise. 
The - preg structure is filled in on success, and one member of the structure - is public: re_nsub contains the number of capturing subpatterns in the - regular expression. Various error codes are defined in the header file. + The yield of regcomp() is zero on success, and non-zero otherwise. The + preg structure is filled in on success, and one other member of the + structure (as well as re_endp) is public: re_nsub contains the number + of capturing subpatterns in the regular expression. Various error codes + are defined in the header file. NOTE: If the yield of regcomp() is non-zero, you must not attempt to use the contents of the preg structure. If, for example, you pass it to @@ -9146,57 +9160,66 @@ MATCHING A PATTERN REG_STARTEND - The string is considered to start at string + pmatch[0].rm_so and to - have a terminating NUL located at string + pmatch[0].rm_eo (there need - not actually be a NUL at that location), regardless of the value of - nmatch. This is a BSD extension, compatible with but not specified by - IEEE Standard 1003.2 (POSIX.2), and should be used with caution in - software intended to be portable to other systems. Note that a non-zero - rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location - of the string, not how it is matched. Setting REG_STARTEND and passing - pmatch as NULL are mutually exclusive; the error REG_INVARG is + When this option is set, the subject string starts at string + + pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should + point to the first character beyond the string. There may be binary + zeroes within the subject string, and indeed, using REG_STARTEND is the + only way to pass a subject string that contains a binary zero. + + Whatever the value of pmatch[0].rm_so, the offsets of the matched + string and any captured substrings are still given relative to the + start of string itself. 
(Before PCRE2 release 10.30 these were given + relative to string + pmatch[0].rm_so, but this differs from other + implementations.) + + This is a BSD extension, compatible with but not specified by IEEE + Standard 1003.2 (POSIX.2), and should be used with caution in software + intended to be portable to other systems. Note that a non-zero rm_so + does not imply REG_NOTBOL; REG_STARTEND affects only the location and + length of the string, not how it is matched. Setting REG_STARTEND and + passing pmatch as NULL are mutually exclusive; the error REG_INVARG is returned. - If the pattern was compiled with the REG_NOSUB flag, no data about any - matched strings is returned. The nmatch and pmatch arguments of + If the pattern was compiled with the REG_NOSUB flag, no data about any + matched strings is returned. The nmatch and pmatch arguments of regexec() are ignored (except possibly as input for REG_STARTEND). - The value of nmatch may be zero, and the value pmatch may be NULL - (unless REG_STARTEND is set); in both these cases no data about any + The value of nmatch may be zero, and the value pmatch may be NULL + (unless REG_STARTEND is set); in both these cases no data about any matched strings is returned. - Otherwise, the portion of the string that was matched, and also any + Otherwise, the portion of the string that was matched, and also any captured substrings, are returned via the pmatch argument, which points - to an array of nmatch structures of type regmatch_t, containing the - members rm_so and rm_eo. These contain the byte offset to the first + to an array of nmatch structures of type regmatch_t, containing the + members rm_so and rm_eo. These contain the byte offset to the first character of each substring and the offset to the first character after - the end of each substring, respectively. The 0th element of the vector - relates to the entire portion of string that was matched; subsequent + the end of each substring, respectively. 
The 0th element of the vector + relates to the entire portion of string that was matched; subsequent elements relate to the capturing subpatterns of the regular expression. Unused entries in the array have both structure members set to -1. - A successful match yields a zero return; various error codes are - defined in the header file, of which REG_NOMATCH is the "expected" + A successful match yields a zero return; various error codes are + defined in the header file, of which REG_NOMATCH is the "expected" failure code. ERROR MESSAGES The regerror() function maps a non-zero errorcode from either regcomp() - or regexec() to a printable message. If preg is not NULL, the error + or regexec() to a printable message. If preg is not NULL, the error should have arisen from the use of that structure. A message terminated - by a binary zero is placed in errbuf. If the buffer is too short, only + by a binary zero is placed in errbuf. If the buffer is too short, only the first errbuf_size - 1 characters of the error message are used. The - yield of the function is the size of buffer needed to hold the whole - message, including the terminating zero. This value is greater than + yield of the function is the size of buffer needed to hold the whole + message, including the terminating zero. This value is greater than errbuf_size if the message was truncated. MEMORY USAGE - Compiling a regular expression causes memory to be allocated and asso- - ciated with the preg structure. The function regfree() frees all such - memory, after which preg may no longer be used as a compiled expres- + Compiling a regular expression causes memory to be allocated and asso- + ciated with the preg structure. The function regfree() frees all such + memory, after which preg may no longer be used as a compiled expres- sion. @@ -9209,8 +9232,8 @@ AUTHOR REVISION - Last updated: 31 January 2016 - Copyright (c) 1997-2016 University of Cambridge. 
+ Last updated: 05 June 2017 + Copyright (c) 1997-2017 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/doc/pcre2posix.3 b/doc/pcre2posix.3 index b37046b..cce65fa 100644 --- a/doc/pcre2posix.3 +++ b/doc/pcre2posix.3 @@ -1,4 +1,4 @@ -.TH PCRE2POSIX 3 "03 June 2017" "PCRE2 10.30" +.TH PCRE2POSIX 3 "05 June 2017" "PCRE2 10.30" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "SYNOPSIS" @@ -46,7 +46,7 @@ replacement library. Other POSIX options are not even defined. .P There are also some options that are not defined by POSIX. These have been added at the request of users who want to make use of certain PCRE2-specific -features via the POSIX calling interface. +features via the POSIX calling interface or to add BSD or GNU functionality. .P When PCRE2 is called via these functions, it is only the API that is POSIX-like in style. The syntax and semantics of the regular expressions themselves are @@ -68,10 +68,11 @@ identifying error codes. .rs .sp The function \fBregcomp()\fP is called to compile a pattern into an -internal form. The pattern is a C string terminated by a binary zero, and -is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer -to a \fBregex_t\fP structure that is used as a base for storing information -about the compiled regular expression. +internal form. By default, the pattern is a C string terminated by a binary +zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a +\fBregex_t\fP structure that is used as a base for storing information about +the compiled regular expression. (It is also used for input when REG_PEND is +set.) .P The argument \fIcflags\fP is either zero, or contains one or more of the bits defined by the following macros: @@ -100,6 +101,16 @@ matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings are returned. 
Versions of the PCRE library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens because it disables the use of back references. +.sp + REG_PEND +.sp +If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure +(which has the type const char *) must be set to point to the character beyond +the end of the pattern before calling \fBregcomp()\fP. The pattern itself may +now contain binary zeroes, which are treated as data characters. Without +REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is +ignored. This is a GNU extension to the POSIX standard and should be used with +caution in software intended to be portable to other systems. .sp REG_UCP .sp @@ -130,9 +141,10 @@ newlines are matched by the dot metacharacter (they are not) or by a negative class such as [^a] (they are). .P The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The -\fIpreg\fP structure is filled in on success, and one member of the structure -is public: \fIre_nsub\fP contains the number of capturing subpatterns in -the regular expression. Various error codes are defined in the header file. +\fIpreg\fP structure is filled in on success, and one other member of the +structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the +number of capturing subpatterns in the regular expression. Various error codes +are defined in the header file. .P NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to use the contents of the \fIpreg\fP structure. If, for example, you pass it to @@ -204,21 +216,24 @@ function. .sp REG_STARTEND .sp -When this option is set, the string is considered to start at \fIstring\fP + -\fIpmatch[0].rm_so\fP and to have a terminating NUL located at \fIstring\fP + -\fIpmatch[0].rm_eo\fP (there need not actually be a NUL at that location), -regardless of the value of \fInmatch\fP. 
However, the offsets of the matched -string and any captured substrings are still given relative to the start of -\fIstring\fP. (Before PCRE2 release 10.30 these were given relative to +When this option is set, the subject string starts at \fIstring\fP + +\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which +should point to the first character beyond the string. There may be binary +zeroes within the subject string, and indeed, using REG_STARTEND is the only +way to pass a subject string that contains a binary zero. +.P +Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string +and any captured substrings are still given relative to the start of +\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to \fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other implementations.) .P This is a BSD extension, compatible with but not specified by IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does not imply -REG_NOTBOL; REG_STARTEND affects only the location of the string, not how it is -matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are mutually -exclusive; the error REG_INVARG is returned. +REG_NOTBOL; REG_STARTEND affects only the location and length of the string, +not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL +are mutually exclusive; the error REG_INVARG is returned. .P If the pattern was compiled with the REG_NOSUB flag, no data about any matched strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of @@ -277,6 +292,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 03 June 2017 +Last updated: 05 June 2017 Copyright (c) 1997-2017 University of Cambridge. 
.fi diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt index 685c3d9..60d81ac 100644 --- a/doc/pcre2test.txt +++ b/doc/pcre2test.txt @@ -965,11 +965,22 @@ SUBJECT MODIFIERS REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). The other modifiers are ignored, with a warning message. + There is one additional modifier that can be used with the POSIX wrap- + per. It is ignored (with a warning) if used for non-POSIX matching. + + posix_startend=<n>[:<m>] + + This causes the subject string to be passed to regexec() using the + REG_STARTEND option, which uses offsets to restrict which part of the + string is searched. If only one number is given, the end offset is + passed as the end of the subject string. For more detail of REG_STAR- + TEND, see the pcre2posix documentation. + Setting match controls - The following modifiers affect the matching process or request addi- - tional information. Some of them may also be specified on a pattern - line (see above), in which case they apply to every subject line that + The following modifiers affect the matching process or request addi- + tional information. Some of them may also be specified on a pattern + line (see above), in which case they apply to every subject line that is matched against that pattern. aftertext show text after match @@ -1009,29 +1020,29 @@ SUBJECT MODIFIERS zero_terminate pass the subject as zero-terminated The effects of these modifiers are described in the following sections. - When matching via the POSIX wrapper API, the aftertext, allaftertext, - and ovector subject modifiers work as described below. All other modi- + When matching via the POSIX wrapper API, the aftertext, allaftertext, + and ovector subject modifiers work as described below. All other modi- fiers are either ignored, with a warning message, or cause an error. 
Showing more text - The aftertext modifier requests that as well as outputting the part of + The aftertext modifier requests that as well as outputting the part of the subject string that matched the entire pattern, pcre2test should in addition output the remainder of the subject string. This is useful for tests where the subject contains multiple copies of the same substring. - The allaftertext modifier requests the same action for captured sub- + The allaftertext modifier requests the same action for captured sub- strings as well as the main matched substring. In each case the remain- der is output on the following line with a plus character following the capture number. - The allusedtext modifier requests that all the text that was consulted - during a successful pattern match by the interpreter should be shown. - This feature is not supported for JIT matching, and if requested with - JIT it is ignored (with a warning message). Setting this modifier + The allusedtext modifier requests that all the text that was consulted + during a successful pattern match by the interpreter should be shown. + This feature is not supported for JIT matching, and if requested with + JIT it is ignored (with a warning message). Setting this modifier affects the output if there is a lookbehind at the start of a match, or - a lookahead at the end, or if \K is used in the pattern. Characters - that precede or follow the start and end of the actual match are indi- - cated in the output by '<' or '>' characters underneath them. Here is + a lookahead at the end, or if \K is used in the pattern. Characters + that precede or follow the start and end of the actual match are indi- + cated in the output by '<' or '>' characters underneath them. 
Here is an example: re> /(?<=pqr)abc(?=xyz)/ @@ -1039,16 +1050,16 @@ SUBJECT MODIFIERS 0: pqrabcxyz <<< >>> - This shows that the matched string is "abc", with the preceding and - following strings "pqr" and "xyz" having been consulted during the + This shows that the matched string is "abc", with the preceding and + following strings "pqr" and "xyz" having been consulted during the match (when processing the assertions). - The startchar modifier requests that the starting character for the - match be indicated, if it is different to the start of the matched + The startchar modifier requests that the starting character for the + match be indicated, if it is different to the start of the matched string. The only time when this occurs is when \K has been processed as part of the match. In this situation, the output for the matched string - is displayed from the starting character instead of from the match - point, with circumflex characters under the earlier characters. For + is displayed from the starting character instead of from the match + point, with circumflex characters under the earlier characters. For example: re> /abc\Kxyz/ @@ -1056,7 +1067,7 @@ SUBJECT MODIFIERS 0: abcxyz ^^^ - Unlike allusedtext, the startchar modifier can be used with JIT. How- + Unlike allusedtext, the startchar modifier can be used with JIT. How- ever, these two modifiers are mutually exclusive. Showing the value of all capture groups @@ -1064,98 +1075,98 @@ SUBJECT MODIFIERS The allcaptures modifier requests that the values of all potential cap- tured parentheses be output after a match. By default, only those up to the highest one actually used in the match are output (corresponding to - the return code from pcre2_match()). Groups that did not take part in - the match are output as "". This modifier is not relevant for - DFA matching (which does no capturing); it is ignored, with a warning + the return code from pcre2_match()). 
Groups that did not take part in + the match are output as "". This modifier is not relevant for + DFA matching (which does no capturing); it is ignored, with a warning message, if present. Testing callouts - A callout function is supplied when pcre2test calls the library match- - ing functions, unless callout_none is specified. If callout_capture is - set, the current captured groups are output when a callout occurs. The + A callout function is supplied when pcre2test calls the library match- + ing functions, unless callout_none is specified. If callout_capture is + set, the current captured groups are output when a callout occurs. The default return from the callout function is zero, which allows matching to continue. - The callout_fail modifier can be given one or two numbers. If there is - only one number, 1 is returned instead of 0 (causing matching to back- - track) when a callout of that number is reached. If two numbers - (:) are given, 1 is returned when callout is reached and - there have been at least callouts. The callout_error modifier is - similar, except that PCRE2_ERROR_CALLOUT is returned, causing the - entire matching process to be aborted. If both these modifiers are set + The callout_fail modifier can be given one or two numbers. If there is + only one number, 1 is returned instead of 0 (causing matching to back- + track) when a callout of that number is reached. If two numbers + (:) are given, 1 is returned when callout is reached and + there have been at least callouts. The callout_error modifier is + similar, except that PCRE2_ERROR_CALLOUT is returned, causing the + entire matching process to be aborted. If both these modifiers are set for the same callout number, callout_error takes precedence. - Note that callouts with string arguments are always given the number + Note that callouts with string arguments are always given the number zero. See "Callouts" below for a description of the output when a call- out it taken. 
- The callout_data modifier can be given an unsigned or a negative num- - ber. This is set as the "user data" that is passed to the matching - function, and passed back when the callout function is invoked. Any - value other than zero is used as a return from pcre2test's callout + The callout_data modifier can be given an unsigned or a negative num- + ber. This is set as the "user data" that is passed to the matching + function, and passed back when the callout function is invoked. Any + value other than zero is used as a return from pcre2test's callout function. Finding all matches in a string Searching for all possible matches within a subject can be requested by - the global or altglobal modifier. After finding a match, the matching - function is called again to search the remainder of the subject. The - difference between global and altglobal is that the former uses the - start_offset argument to pcre2_match() or pcre2_dfa_match() to start - searching at a new point within the entire string (which is what Perl + the global or altglobal modifier. After finding a match, the matching + function is called again to search the remainder of the subject. The + difference between global and altglobal is that the former uses the + start_offset argument to pcre2_match() or pcre2_dfa_match() to start + searching at a new point within the entire string (which is what Perl does), whereas the latter passes over a shortened subject. This makes a difference to the matching process if the pattern begins with a lookbe- hind assertion (including \b or \B). - If an empty string is matched, the next match is done with the + If an empty string is matched, the next match is done with the PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for another, non-empty, match at the same point in the subject. If this - match fails, the start offset is advanced, and the normal match is - retried. 
This imitates the way Perl handles such cases when using the - /g modifier or the split() function. Normally, the start offset is - advanced by one character, but if the newline convention recognizes - CRLF as a newline, and the current character is CR followed by LF, an + match fails, the start offset is advanced, and the normal match is + retried. This imitates the way Perl handles such cases when using the + /g modifier or the split() function. Normally, the start offset is + advanced by one character, but if the newline convention recognizes + CRLF as a newline, and the current character is CR followed by LF, an advance of two characters occurs. Testing substring extraction functions - The copy and get modifiers can be used to test the pcre2_sub- + The copy and get modifiers can be used to test the pcre2_sub- string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be - given more than once, and each can specify a group name or number, for + given more than once, and each can specify a group name or number, for example: abcd\=copy=1,copy=3,get=G1 - If the #subject command is used to set default copy and/or get lists, - these can be unset by specifying a negative number to cancel all num- + If the #subject command is used to set default copy and/or get lists, + these can be unset by specifying a negative number to cancel all num- bered groups and an empty name to cancel all named groups. - The getall modifier tests pcre2_substring_list_get(), which extracts + The getall modifier tests pcre2_substring_list_get(), which extracts all captured substrings. - If the subject line is successfully matched, the substrings extracted - by the convenience functions are output with C, G, or L after the - string number instead of a colon. This is in addition to the normal - full list. 
The string length (that is, the return from the extraction + If the subject line is successfully matched, the substrings extracted + by the convenience functions are output with C, G, or L after the + string number instead of a colon. This is in addition to the normal + full list. The string length (that is, the return from the extraction function) is given in parentheses after each substring, followed by the name when the extraction was by name. Testing the substitution function - If the replace modifier is set, the pcre2_substitute() function is - called instead of one of the matching functions. Note that replacement - strings cannot contain commas, because a comma signifies the end of a + If the replace modifier is set, the pcre2_substitute() function is + called instead of one of the matching functions. Note that replacement + strings cannot contain commas, because a comma signifies the end of a modifier. This is not thought to be an issue in a test program. - Unlike subject strings, pcre2test does not process replacement strings - for escape sequences. In UTF mode, a replacement string is checked to - see if it is a valid UTF-8 string. If so, it is correctly converted to - a UTF string of the appropriate code unit width. If it is not a valid - UTF-8 string, the individual code units are copied directly. This pro- + Unlike subject strings, pcre2test does not process replacement strings + for escape sequences. In UTF mode, a replacement string is checked to + see if it is a valid UTF-8 string. If so, it is correctly converted to + a UTF string of the appropriate code unit width. If it is not a valid + UTF-8 string, the individual code units are copied directly. This pro- vides a means of passing an invalid UTF-8 string for testing purposes. 
- The following modifiers set options (in additional to the normal match + The following modifiers set options (in additional to the normal match options) for pcre2_substitute(): global PCRE2_SUBSTITUTE_GLOBAL @@ -1165,8 +1176,8 @@ SUBJECT MODIFIERS substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY - After a successful substitution, the modified string is output, pre- - ceded by the number of replacements. This may be zero if there were no + After a successful substitution, the modified string is output, pre- + ceded by the number of replacements. This may be zero if there were no matches. Here is a simple example of a substitution test: /abc/replace=xxx @@ -1175,12 +1186,12 @@ SUBJECT MODIFIERS =abc=abc=\=global 2: =xxx=xxx= - Subject and replacement strings should be kept relatively short (fewer - than 256 characters) for substitution tests, as fixed-size buffers are - used. To make it easy to test for buffer overflow, if the replacement - string starts with a number in square brackets, that number is passed - to pcre2_substitute() as the size of the output buffer, with the - replacement string starting at the next character. Here is an example + Subject and replacement strings should be kept relatively short (fewer + than 256 characters) for substitution tests, as fixed-size buffers are + used. To make it easy to test for buffer overflow, if the replacement + string starts with a number in square brackets, that number is passed + to pcre2_substitute() as the size of the output buffer, with the + replacement string starting at the next character. Here is an example that tests the edge case: /abc/ @@ -1189,11 +1200,11 @@ SUBJECT MODIFIERS 123abc123\=replace=[9]XYZ Failed: error -47: no more memory - The default action of pcre2_substitute() is to return - PCRE2_ERROR_NOMEMORY when the output buffer is too small. 
However, if - the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub- - stitute_overflow_length modifier), pcre2_substitute() continues to go - through the motions of matching and substituting, in order to compute + The default action of pcre2_substitute() is to return + PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if + the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub- + stitute_overflow_length modifier), pcre2_substitute() continues to go + through the motions of matching and substituting, in order to compute the size of buffer that is required. When this happens, pcre2test shows the required buffer length (which includes space for the trailing zero) as part of the error message. For example: @@ -1203,151 +1214,151 @@ SUBJECT MODIFIERS Failed: error -47: no more memory: 10 code units are needed A replacement string is ignored with POSIX and DFA matching. Specifying - partial matching provokes an error return ("bad option value") from + partial matching provokes an error return ("bad option value") from pcre2_substitute(). Setting the JIT stack size - The jitstack modifier provides a way of setting the maximum stack size - that is used by the just-in-time optimization code. It is ignored if + The jitstack modifier provides a way of setting the maximum stack size + that is used by the just-in-time optimization code. It is ignored if JIT optimization is not being used. The value is a number of kilobytes. Providing a stack that is larger than the default 32K is necessary only for very complicated patterns. Setting heap, match, and depth limits - The heap_limit, match_limit, and depth_limit modifiers set the appro- - priate limits in the match context. These values are ignored when the + The heap_limit, match_limit, and depth_limit modifiers set the appro- + priate limits in the match context. These values are ignored when the find_limits modifier is specified. 
Finding minimum limits - If the find_limits modifier is present on a subject line, pcre2test - calls the relevant matching function several times, setting different - values in the match context via pcre2_set_heap_limit(), - pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the - minimum values for each parameter that allows the match to complete + If the find_limits modifier is present on a subject line, pcre2test + calls the relevant matching function several times, setting different + values in the match context via pcre2_set_heap_limit(), + pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the + minimum values for each parameter that allows the match to complete without error. If JIT is being used, only the match limit is relevant. If DFA matching is being used, only the depth limit is relevant. - The match_limit number is a measure of the amount of backtracking that - takes place, and learning the minimum value can be instructive. For - most simple matches, the number is quite small, but for patterns with - very large numbers of matching possibilities, it can become large very + The match_limit number is a measure of the amount of backtracking that + takes place, and learning the minimum value can be instructive. For + most simple matches, the number is quite small, but for patterns with + very large numbers of matching possibilities, it can become large very quickly with increasing length of subject string. - For non-DFA matching, the minimum depth_limit number is a measure of + For non-DFA matching, the minimum depth_limit number is a measure of how much nested backtracking happens (that is, how deeply the pattern's - tree is searched). In the case of DFA matching, depth_limit controls - the depth of recursive calls of the internal function that is used for + tree is searched). 
In the case of DFA matching, depth_limit controls + the depth of recursive calls of the internal function that is used for handling pattern recursion, lookaround assertions, and atomic groups. Showing MARK names The mark modifier causes the names from backtracking control verbs that - are returned from calls to pcre2_match() to be displayed. If a mark is - returned for a match, non-match, or partial match, pcre2test shows it. - For a match, it is on a line by itself, tagged with "MK:". Otherwise, + are returned from calls to pcre2_match() to be displayed. If a mark is + returned for a match, non-match, or partial match, pcre2test shows it. + For a match, it is on a line by itself, tagged with "MK:". Otherwise, it is added to the non-match message. Showing memory usage - The memory modifier causes pcre2test to log the sizes of all heap mem- - ory allocation and freeing calls that occur during a call to - pcre2_match(). These occur only when a match requires a bigger vector - than the default for remembering backtracking points. In many cases - there will be no heap memory used and therefore no additional output. - No heap memory is allocated during matching with pcre2_dfa_match or - with JIT, so in those cases the memory modifier never has any effect. + The memory modifier causes pcre2test to log the sizes of all heap mem- + ory allocation and freeing calls that occur during a call to + pcre2_match(). These occur only when a match requires a bigger vector + than the default for remembering backtracking points. In many cases + there will be no heap memory used and therefore no additional output. + No heap memory is allocated during matching with pcre2_dfa_match or + with JIT, so in those cases the memory modifier never has any effect. For this modifier to work, the null_context modifier must not be set on - both the pattern and the subject, though it can be set on one or the + both the pattern and the subject, though it can be set on one or the other. 
Setting a starting offset - The offset modifier sets an offset in the subject string at which + The offset modifier sets an offset in the subject string at which matching starts. Its value is a number of code units, not characters. Setting an offset limit - The offset_limit modifier sets a limit for unanchored matches. If a + The offset_limit modifier sets a limit for unanchored matches. If a match cannot be found starting at or before this offset in the subject, a "no match" return is given. The data value is a number of code units, - not characters. When this modifier is used, the use_offset_limit modi- + not characters. When this modifier is used, the use_offset_limit modi- fier must have been set for the pattern; if not, an error is generated. Setting the size of the output vector - The ovector modifier applies only to the subject line in which it - appears, though of course it can also be used to set a default in a - #subject command. It specifies the number of pairs of offsets that are + The ovector modifier applies only to the subject line in which it + appears, though of course it can also be used to set a default in a + #subject command. It specifies the number of pairs of offsets that are available for storing matching information. The default is 15. - A value of zero is useful when testing the POSIX API because it causes + A value of zero is useful when testing the POSIX API because it causes regexec() to be called with a NULL capture vector. When not testing the - POSIX API, a value of zero is used to cause pcre2_match_data_cre- - ate_from_pattern() to be called, in order to create a match block of + POSIX API, a value of zero is used to cause pcre2_match_data_cre- + ate_from_pattern() to be called, in order to create a match block of exactly the right size for the pattern. 
(It is not possible to create a - match block with a zero-length ovector; there is always at least one + match block with a zero-length ovector; there is always at least one pair of offsets.) Passing the subject as zero-terminated By default, the subject string is passed to a native API matching func- tion with its correct length. In order to test the facility for passing - a zero-terminated string, the zero_terminate modifier is provided. It + a zero-terminated string, the zero_terminate modifier is provided. It causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching - via the POSIX interface, this modifier has no effect, as there is no + via the POSIX interface, this modifier has no effect, as there is no facility for passing a length.) - When testing pcre2_substitute(), this modifier also has the effect of + When testing pcre2_substitute(), this modifier also has the effect of passing the replacement string as zero-terminated. Passing a NULL context - Normally, pcre2test passes a context block to pcre2_match(), + Normally, pcre2test passes a context block to pcre2_match(), pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is - set, however, NULL is passed. This is for testing that the matching + set, however, NULL is passed. This is for testing that the matching functions behave correctly in this case (they use default values). This - modifier cannot be used with the find_limits modifier or when testing + modifier cannot be used with the find_limits modifier or when testing the substitution function. THE ALTERNATIVE MATCHING FUNCTION - By default, pcre2test uses the standard PCRE2 matching function, + By default, pcre2test uses the standard PCRE2 matching function, pcre2_match() to match each subject line. PCRE2 also supports an alter- - native matching function, pcre2_dfa_match(), which operates in a dif- - ferent way, and has some restrictions. 
The differences between the two + native matching function, pcre2_dfa_match(), which operates in a dif- + ferent way, and has some restrictions. The differences between the two functions are described in the pcre2matching documentation. - If the dfa modifier is set, the alternative matching function is used. - This function finds all possible matches at a given point in the sub- - ject. If, however, the dfa_shortest modifier is set, processing stops - after the first match is found. This is always the shortest possible + If the dfa modifier is set, the alternative matching function is used. + This function finds all possible matches at a given point in the sub- + ject. If, however, the dfa_shortest modifier is set, processing stops + after the first match is found. This is always the shortest possible match. DEFAULT OUTPUT FROM pcre2test - This section describes the output when the normal matching function, + This section describes the output when the normal matching function, pcre2_match(), is being used. - When a match succeeds, pcre2test outputs the list of captured sub- - strings, starting with number 0 for the string that matched the whole - pattern. Otherwise, it outputs "No match" when the return is - PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially - matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that - this is the entire substring that was inspected during the partial - match; it may include characters before the actual match start if a + When a match succeeds, pcre2test outputs the list of captured sub- + strings, starting with number 0 for the string that matched the whole + pattern. Otherwise, it outputs "No match" when the return is + PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially + matching substring when the return is PCRE2_ERROR_PARTIAL. 
(Note that + this is the entire substring that was inspected during the partial + match; it may include characters before the actual match start if a lookbehind assertion, \K, \b, or \B was involved.) For any other return, pcre2test outputs the PCRE2 negative error number - and a short descriptive phrase. If the error is a failed UTF string - check, the code unit offset of the start of the failing character is + and a short descriptive phrase. If the error is a failed UTF string + check, the code unit offset of the start of the failing character is also output. Here is an example of an interactive pcre2test run. $ pcre2test @@ -1363,8 +1374,8 @@ DEFAULT OUTPUT FROM pcre2test Unset capturing substrings that are not followed by one that is set are not shown by pcre2test unless the allcaptures modifier is specified. In the following example, there are two capturing substrings, but when the - first data line is matched, the second, unset substring is not shown. - An "internal" unset substring is shown as "<unset>", as for the second + first data line is matched, the second, unset substring is not shown. + An "internal" unset substring is shown as "<unset>", as for the second data line. re> /(a)|(b)/ @@ -1376,11 +1387,11 @@ DEFAULT OUTPUT FROM pcre2test 1: <unset> 2: b - If the strings contain any non-printing characters, they are output as - \xhh escapes if the value is less than 256 and UTF mode is not set. + If the strings contain any non-printing characters, they are output as + \xhh escapes if the value is less than 256 and UTF mode is not set. Otherwise they are output as \x{hh...} escapes. See below for the defi- - nition of non-printing characters. 
If the aftertext modifier is set, + the output for substring 0 is followed by the the rest of the subject string, identified by "0+" like this: re> /cat/aftertext @@ -1388,7 +1399,7 @@ DEFAULT OUTPUT FROM pcre2test 0: cat 0+ aract - If global matching is requested, the results of successive matching + If global matching is requested, the results of successive matching attempts are output in sequence, like this: re> /\Bi(\w\w)/g @@ -1400,8 +1411,8 @@ DEFAULT OUTPUT FROM pcre2test 0: ipp 1: pp - "No match" is output only if the first match attempt fails. Here is an - example of a failure message (the offset 4 that is specified by the + "No match" is output only if the first match attempt fails. Here is an + example of a failure message (the offset 4 that is specified by the offset modifier is past the end of the subject string): re> /xyz/ @@ -1409,7 +1420,7 @@ DEFAULT OUTPUT FROM pcre2test Error -24 (bad offset value) Note that whereas patterns can be continued over several lines (a plain - ">" prompt is used for continuations), subject lines may not. However + ">" prompt is used for continuations), subject lines may not. However newlines can be included in a subject by means of the \n escape (or \r, \r\n, etc., depending on the newline sequence setting). @@ -1417,7 +1428,7 @@ DEFAULT OUTPUT FROM pcre2test OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION When the alternative matching function, pcre2_dfa_match(), is used, the - output consists of a list of all the matches that start at the first + output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example: re> /(tang|tangerine|tan)/ @@ -1426,11 +1437,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tang 2: tan - Using the normal matching function on this data finds only "tang". The - longest matching string is always given first (and numbered zero). 
- After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", - followed by the partially matching substring. Note that this is the - entire substring that was inspected during the partial match; it may + Using the normal matching function on this data finds only "tang". The + longest matching string is always given first (and numbered zero). + After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", + followed by the partially matching substring. Note that this is the + entire substring that was inspected during the partial match; it may include characters before the actual match start if a lookbehind asser- tion, \b, or \B was involved. (\K is not supported for DFA matching.) @@ -1446,16 +1457,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tan 0: tan - The alternative matching function does not support substring capture, - so the modifiers that are concerned with captured substrings are not + The alternative matching function does not support substring capture, + so the modifiers that are concerned with captured substrings are not relevant. RESTARTING AFTER A PARTIAL MATCH - When the alternative matching function has given the PCRE2_ERROR_PAR- + When the alternative matching function has given the PCRE2_ERROR_PAR- TIAL return, indicating that the subject partially matched the pattern, - you can restart the match with additional subject data by means of the + you can restart the match with additional subject data by means of the dfa_restart modifier. For example: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ @@ -1464,45 +1475,45 @@ RESTARTING AFTER A PARTIAL MATCH data> n05\=dfa,dfa_restart 0: n05 - For further information about partial matching, see the pcre2partial + For further information about partial matching, see the pcre2partial documentation. CALLOUTS If the pattern contains any callout requests, pcre2test's callout func- - tion is called during matching unless callout_none is specified. 
This + tion is called during matching unless callout_none is specified. This works with both matching functions. - The callout function in pcre2test returns zero (carry on matching) by - default, but you can use a callout_fail modifier in a subject line (as + The callout function in pcre2test returns zero (carry on matching) by + default, but you can use a callout_fail modifier in a subject line (as described above) to change this and other parameters of the callout. Inserting callouts can be helpful when using pcre2test to check compli- - cated regular expressions. For further information about callouts, see + cated regular expressions. For further information about callouts, see the pcre2callout documentation. - The output for callouts with numerical arguments and those with string + The output for callouts with numerical arguments and those with string arguments is slightly different. Callouts with numerical arguments By default, the callout function displays the callout number, the start - and current positions in the subject text at the callout time, and the + and current positions in the subject text at the callout time, and the next pattern item to be tested. For example: --->pqrabcdef 0 ^ ^ \d - This output indicates that callout number 0 occurred for a match - attempt starting at the fourth character of the subject string, when - the pointer was at the seventh character, and when the next pattern - item was \d. Just one circumflex is output if the start and current - positions are the same, or if the current position precedes the start + This output indicates that callout number 0 occurred for a match + attempt starting at the fourth character of the subject string, when + the pointer was at the seventh character, and when the next pattern + item was \d. Just one circumflex is output if the start and current + positions are the same, or if the current position precedes the start position, which can happen if the callout is in a lookbehind assertion. 
Callouts numbered 255 are assumed to be automatic callouts, inserted as - a result of the /auto_callout pattern modifier. In this case, instead + a result of the /auto_callout pattern modifier. In this case, instead of showing the callout number, the offset in the pattern, preceded by a plus, is output. For example: @@ -1516,7 +1527,7 @@ CALLOUTS 0: E* If a pattern contains (*MARK) items, an additional line is output when- - ever a change of latest mark is passed to the callout function. For + ever a change of latest mark is passed to the callout function. For example: re> /a(*MARK:X)bc/auto_callout @@ -1530,17 +1541,17 @@ CALLOUTS +12 ^ ^ 0: abc - The mark changes between matching "a" and "b", but stays the same for - the rest of the match, so nothing more is output. If, as a result of - backtracking, the mark reverts to being unset, the text "<unset>" is + The mark changes between matching "a" and "b", but stays the same for + the rest of the match, so nothing more is output. If, as a result of + backtracking, the mark reverts to being unset, the text "<unset>" is output. Callouts with string arguments The output for a callout with a string argument is similar, except that - instead of outputting a callout number before the position indicators, - the callout string and its offset in the pattern string are output - before the reflection of the subject string, and the subject string is + instead of outputting a callout number before the position indicators, + the callout string and its offset in the pattern string are output + before the reflection of the subject string, and the subject string is reflected for each callout. 
For example: re> /^ab(?C'first')cd(?C"second")ef/ @@ -1557,43 +1568,43 @@ CALLOUTS NON-PRINTING CHARACTERS When pcre2test is outputting text in the compiled version of a pattern, - bytes other than 32-126 are always treated as non-printing characters + bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes. - When pcre2test is outputting text that is a matched part of a subject - string, it behaves in the same way, unless a different locale has been - set for the pattern (using the locale modifier). In this case, the - isprint() function is used to distinguish printing and non-printing + When pcre2test is outputting text that is a matched part of a subject + string, it behaves in the same way, unless a different locale has been + set for the pattern (using the locale modifier). In this case, the + isprint() function is used to distinguish printing and non-printing characters. SAVING AND RESTORING COMPILED PATTERNS - It is possible to save compiled patterns on disc or elsewhere, and + It is possible to save compiled patterns on disc or elsewhere, and reload them later, subject to a number of restrictions. JIT data cannot - be saved. The host on which the patterns are reloaded must be running + be saved. The host on which the patterns are reloaded must be running the same version of PCRE2, with the same code unit width, and must also - have the same endianness, pointer width and PCRE2_SIZE type. Before - compiled patterns can be saved they must be serialized, that is, con- - verted to a stream of bytes. A single byte stream may contain any num- - ber of compiled patterns, but they must all use the same character + have the same endianness, pointer width and PCRE2_SIZE type. Before + compiled patterns can be saved they must be serialized, that is, con- + verted to a stream of bytes. A single byte stream may contain any num- + ber of compiled patterns, but they must all use the same character tables. 
A single copy of the tables is included in the byte stream (its size is 1088 bytes). - The functions whose names begin with pcre2_serialize_ are used for - serializing and de-serializing. They are described in the pcre2serial- + The functions whose names begin with pcre2_serialize_ are used for + serializing and de-serializing. They are described in the pcre2serial- ize documentation. In this section we describe the features of pcre2test that can be used to test these functions. - When a pattern with push modifier is successfully compiled, it is - pushed onto a stack of compiled patterns, and pcre2test expects the - next line to contain a new pattern (or command) instead of a subject - line. By contrast, the pushcopy modifier causes a copy of the compiled - pattern to be stacked, leaving the original available for immediate - matching. By using push and/or pushcopy, a number of patterns can be + When a pattern with push modifier is successfully compiled, it is + pushed onto a stack of compiled patterns, and pcre2test expects the + next line to contain a new pattern (or command) instead of a subject + line. By contrast, the pushcopy modifier causes a copy of the compiled + pattern to be stacked, leaving the original available for immediate + matching. By using push and/or pushcopy, a number of patterns can be compiled and retained. These modifiers are incompatible with posix, and - control modifiers that act at match time are ignored (with a message) - for the stacked patterns. The jitverify modifier applies only at com- + control modifiers that act at match time are ignored (with a message) + for the stacked patterns. The jitverify modifier applies only at com- pile time. The command @@ -1601,21 +1612,21 @@ SAVING AND RESTORING COMPILED PATTERNS #save <filename> causes all the stacked patterns to be serialized and the result written - to the named file. Afterwards, all the stacked patterns are freed. The + to the named file. Afterwards, all the stacked patterns are freed. 
The command #load <filename> - reads the data in the file, and then arranges for it to be de-serial- - ized, with the resulting compiled patterns added to the pattern stack. - The pattern on the top of the stack can be retrieved by the #pop com- - mand, which must be followed by lines of subjects that are to be - matched with the pattern, terminated as usual by an empty line or end - of file. This command may be followed by a modifier list containing - only control modifiers that act after a pattern has been compiled. In + reads the data in the file, and then arranges for it to be de-serial- + ized, with the resulting compiled patterns added to the pattern stack. + The pattern on the top of the stack can be retrieved by the #pop com- + mand, which must be followed by lines of subjects that are to be + matched with the pattern, terminated as usual by an empty line or end + of file. This command may be followed by a modifier list containing + only control modifiers that act after a pattern has been compiled. In particular, hex, posix, posix_nosub, push, and pushcopy are not - allowed, nor are any option-setting modifiers. The JIT modifiers are, - however permitted. Here is an example that saves and reloads two pat- + allowed, nor are any option-setting modifiers. The JIT modifiers are, + however permitted. Here is an example that saves and reloads two pat- terns. /abc/push @@ -1628,10 +1639,10 @@ SAVING AND RESTORING COMPILED PATTERNS #pop jit,bincode abc - If jitverify is used with #pop, it does not automatically imply jit, + If jitverify is used with #pop, it does not automatically imply jit, which is different behaviour from when it is used on a pattern. - The #popcopy command is analagous to the pushcopy modifier in that it + The #popcopy command is analagous to the pushcopy modifier in that it makes current a copy of the topmost stack pattern, leaving the original still on the stack. 
@@ -1651,5 +1662,5 @@ AUTHOR REVISION - Last updated: 01 June 2017 + Last updated: 03 June 2017 Copyright (c) 1997-2017 University of Cambridge. diff --git a/src/pcre2posix.c b/src/pcre2posix.c index 8be969a..0c460cb 100644 --- a/src/pcre2posix.c +++ b/src/pcre2posix.c @@ -231,10 +231,14 @@ PCRE2POSIX_EXP_DEFN int PCRE2_CALL_CONVENTION regcomp(regex_t *preg, const char *pattern, int cflags) { PCRE2_SIZE erroffset; +PCRE2_SIZE patlen; int errorcode; int options = 0; int re_nsub = 0; +patlen = ((cflags & REG_PEND) != 0)? (PCRE2_SIZE)(preg->re_endp - pattern) : + PCRE2_ZERO_TERMINATED; + if ((cflags & REG_ICASE) != 0) options |= PCRE2_CASELESS; if ((cflags & REG_NEWLINE) != 0) options |= PCRE2_MULTILINE; if ((cflags & REG_DOTALL) != 0) options |= PCRE2_DOTALL; @@ -243,8 +247,8 @@ if ((cflags & REG_UCP) != 0) options |= PCRE2_UCP; if ((cflags & REG_UNGREEDY) != 0) options |= PCRE2_UNGREEDY; preg->re_cflags = cflags; -preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED, - options, &errorcode, &erroffset, NULL); +preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, patlen, options, + &errorcode, &erroffset, NULL); preg->re_erroffset = erroffset; if (preg->re_pcre2_code == NULL) diff --git a/src/pcre2posix.h b/src/pcre2posix.h index 6505976..c17be3b 100644 --- a/src/pcre2posix.h +++ b/src/pcre2posix.h @@ -62,6 +62,7 @@ extern "C" { #define REG_NOTEMPTY 0x0100 /* NOT defined by POSIX; maps to PCRE2_NOTEMPTY */ #define REG_UNGREEDY 0x0200 /* NOT defined by POSIX; maps to PCRE2_UNGREEDY */ #define REG_UCP 0x0400 /* NOT defined by POSIX; maps to PCRE2_UCP */ +#define REG_PEND 0x0800 /* GNU feature: pass end pattern by re_endp */ /* This is not used by PCRE2, but by defining it we make it easier to slot PCRE2 into existing programs that make POSIX calls. */ @@ -91,11 +92,13 @@ enum { }; -/* The structure representing a compiled regular expression. */ +/* The structure representing a compiled regular expression. 
It is also used +for passing the pattern end pointer when REG_PEND is set. */ typedef struct { void *re_pcre2_code; void *re_match_data; + const char *re_endp; size_t re_nsub; size_t re_erroffset; int re_cflags; diff --git a/src/pcre2test.c b/src/pcre2test.c index 5a2b86f..626ada4 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -538,7 +538,7 @@ typedef struct datctl { /* Structure for data line modifiers. */ uint32_t control; /* Must be in same position as patctl */ uint32_t control2; /* Must be in same position as patctl */ uint8_t replacement[REPLACE_MODSIZE]; /* So must this */ - uint32_t startend[2]; + uint32_t startend[2]; uint32_t cerror[2]; uint32_t cfail[2]; int32_t callout_data; @@ -699,7 +699,8 @@ static modstruct modlist[] = { #define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0) #define POSIX_SUPPORTED_COMPILE_CONTROLS ( \ - CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_POSIX|CTL_POSIX_NOSUB) + CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_HEXPAT|CTL_POSIX| \ + CTL_POSIX_NOSUB|CTL_USE_LENGTH) #define POSIX_SUPPORTED_COMPILE_CONTROLS2 (0) @@ -733,11 +734,9 @@ the first control word. Note that CTL_POSIX_NOSUB is always accompanied by CTL_POSIX, so it doesn't need its own entries. 
*/ static uint32_t exclusive_pat_controls[] = { - CTL_POSIX | CTL_HEXPAT, CTL_POSIX | CTL_PUSH, CTL_POSIX | CTL_PUSHCOPY, CTL_POSIX | CTL_PUSHTABLESCOPY, - CTL_POSIX | CTL_USE_LENGTH, CTL_PUSH | CTL_PUSHCOPY, CTL_PUSH | CTL_PUSHTABLESCOPY, CTL_PUSHCOPY | CTL_PUSHTABLESCOPY, @@ -896,7 +895,7 @@ static PCRE2_SIZE malloclistlength[MALLOCLISTSIZE]; static uint32_t malloclistptr = 0; #ifdef SUPPORT_PCRE2_8 -static regex_t preg = { NULL, NULL, 0, 0, 0 }; +static regex_t preg = { NULL, NULL, 0, 0, 0, 0 }; #endif static int *dfa_workspace = NULL; @@ -5264,6 +5263,12 @@ if ((pat_patctl.control & CTL_POSIX) != 0) if ((pat_patctl.options & PCRE2_DOTALL) != 0) cflags |= REG_DOTALL; if ((pat_patctl.options & PCRE2_UNGREEDY) != 0) cflags |= REG_UNGREEDY; + if ((pat_patctl.control & (CTL_HEXPAT|CTL_USE_LENGTH)) != 0) + { + preg.re_endp = (char *)pbuffer8 + patlen; + cflags |= REG_PEND; + } + rc = regcomp(&preg, (char *)pbuffer8, cflags); /* Compiling failed */ @@ -6665,10 +6670,10 @@ if ((pat_patctl.control & CTL_POSIX) != 0) if (dat_datctl.startend[0] != CFORE_UNSET) { pmatch[0].rm_so = dat_datctl.startend[0]; - pmatch[0].rm_eo = (dat_datctl.startend[1] != 0)? + pmatch[0].rm_eo = (dat_datctl.startend[1] != 0)? 
dat_datctl.startend[1] : len; eflags |= REG_STARTEND; - } + } if ((dat_datctl.options & PCRE2_NOTBOL) != 0) eflags |= REG_NOTBOL; if ((dat_datctl.options & PCRE2_NOTEOL) != 0) eflags |= REG_NOTEOL; diff --git a/testdata/testinput18 b/testdata/testinput18 index ececc06..a133532 100644 --- a/testdata/testinput18 +++ b/testdata/testinput18 @@ -123,4 +123,10 @@ /^a\x{00}b$/posix a\x{00}b\=posix_startend=0:3 +/"A" 00 "B"/hex + A\x{00}B\=posix_startend=0:3 + +/ABC/use_length + ABC + # End of testdata/testinput18 diff --git a/testdata/testinput19 b/testdata/testinput19 index 7a90f1a..3bf1720 100644 --- a/testdata/testinput19 +++ b/testdata/testinput19 @@ -15,4 +15,7 @@ /\w/ucp +++\x{c2} +/"^AB" 00 "\x{1234}$"/hex,utf + AB\x{00}\x{1234}\=posix_startend=0:6 + # End of testdata/testinput19 diff --git a/testdata/testoutput18 b/testdata/testoutput18 index 96386da..b02631c 100644 --- a/testdata/testoutput18 +++ b/testdata/testoutput18 @@ -191,4 +191,12 @@ No match: POSIX code 17: match failed a\x{00}b\=posix_startend=0:3 0: a\x00b +/"A" 00 "B"/hex + A\x{00}B\=posix_startend=0:3 + 0: A\x00B + +/ABC/use_length + ABC + 0: ABC + # End of testdata/testinput18 diff --git a/testdata/testoutput19 b/testdata/testoutput19 index c4169ca..a4a8b1a 100644 --- a/testdata/testoutput19 +++ b/testdata/testoutput19 @@ -18,4 +18,8 @@ No match: POSIX code 17: match failed +++\x{c2} 0: \xc2 +/"^AB" 00 "\x{1234}$"/hex,utf + AB\x{00}\x{1234}\=posix_startend=0:6 + 0: AB\x{00}\x{1234} + # End of testdata/testinput19
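Note on the src/pcre2posix.c hunk: the length selection that regcomp() now performs can be sketched in isolation as below. This is only an illustration, not the wrapper's code. The names sketch_regex_t, SKETCH_REG_PEND and sketch_pattern_length are hypothetical, and strlen() stands in for passing PCRE2_ZERO_TERMINATED to pcre2_compile(), which is what the patch actually does when REG_PEND is not set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same bit value that the patch assigns to REG_PEND in pcre2posix.h. */
#define SKETCH_REG_PEND 0x0800

/* Hypothetical stand-in for regex_t; it holds only the field the patch adds. */
typedef struct {
  const char *re_endp;
} sketch_regex_t;

/* Return the pattern length that would be used for compilation. With the
flag set, the length is the distance from the pattern start to re_endp, so
the pattern may contain binary zeros. Without it, the pattern is treated as
a zero-terminated C string (assumption: strlen() models that case here). */

size_t sketch_pattern_length(const sketch_regex_t *preg, const char *pattern,
  int cflags)
{
if ((cflags & SKETCH_REG_PEND) != 0)
  {
  assert(preg->re_endp >= pattern);   /* end pointer must not precede start */
  return (size_t)(preg->re_endp - pattern);
  }
return strlen(pattern);
}
```

For the three pattern bytes 'A', binary zero, 'B' with re_endp pointing just past the 'B', the sketch gives a length of 3 when the flag is set and 1 when it is not. That difference is why the new hex test in testinput18 depends on REG_PEND being passed.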