Implement REG_PEND (GNU extension) for the POSIX wrapper.

Philip.Hazel 2017-06-05 18:25:47 +00:00
parent f850015168
commit bcba497c0b
13 changed files with 447 additions and 327 deletions

@ -182,6 +182,8 @@ deeply. (Compare item 10.23/36.) This should fix oss-fuzz #1761.
38. Fix returned offsets from regexec() when REG_STARTEND is used with a
starting offset greater than zero.
39. Implement REG_PEND (GNU extension) for the POSIX wrapper.
Version 10.23 14-February-2017
------------------------------

@ -69,7 +69,7 @@ replacement library. Other POSIX options are not even defined.
<P>
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface.
features via the POSIX calling interface or to add BSD or GNU functionality.
</P>
<P>
When PCRE2 is called via these functions, it is only the API that is POSIX-like
@ -91,10 +91,11 @@ identifying error codes.
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
The function <b>regcomp()</b> is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
to a <b>regex_t</b> structure that is used as a base for storing information
about the compiled regular expression.
internal form. By default, the pattern is a C string terminated by a binary
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
<b>regex_t</b> structure that is used as a base for storing information about
the compiled regular expression. (It is also used for input when REG_PEND is
set.)
</P>
<P>
The argument <i>cflags</i> is either zero, or contains one or more of the bits
@ -124,6 +125,16 @@ matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
<pre>
REG_PEND
</pre>
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
now contain binary zeroes, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
<pre>
REG_UCP
</pre>
@ -156,9 +167,10 @@ class such as [^a] (they are).
</P>
<P>
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
<i>preg</i> structure is filled in on success, and one member of the structure
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
<i>preg</i> structure is filled in on success, and one other member of the
structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
number of capturing subpatterns in the regular expression. Various error codes
are defined in the header file.
</P>
<P>
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
@ -228,15 +240,26 @@ function.
<pre>
REG_STARTEND
</pre>
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
(there need not actually be a NUL at that location), regardless of the value of
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
mutually exclusive; the error REG_INVARG is returned.
When this option is set, the subject string starts at <i>string</i> +
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
should point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
</P>
<P>
Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
and any captured substrings are still given relative to the start of
<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
implementations.)
</P>
<P>
This is a BSD extension, compatible with but not specified by IEEE Standard
1003.2 (POSIX.2), and should be used with caution in software intended to be
portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
not how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL
are mutually exclusive; the error REG_INVARG is returned.
</P>
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
@ -291,9 +314,9 @@ Cambridge, England.
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
Last updated: 31 January 2016
Last updated: 05 June 2017
<br>
Copyright &copy; 1997-2016 University of Cambridge.
Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
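
Reviewer note: the REG_STARTEND wording above can be illustrated with a small self-contained sketch. It is not PCRE2 code and not a regex engine: find_in_window() is a hypothetical stand-in that does a literal byte search, assumed here only to show two points from the revised text, namely that the window [rm_so, rm_eo) may contain binary zeroes, and that reported offsets are relative to the start of string rather than to string + rm_so.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for regexec() with REG_STARTEND: searches for a
   literal byte sequence inside string[so..eo) and reports the match
   offsets relative to the start of string (the behaviour documented for
   PCRE2 10.30), not relative to string + so.
   Returns 0 on a match and 1 when there is no match. */
int find_in_window(const char *string, size_t so, size_t eo,
                   const char *needle, size_t needle_len,
                   size_t *match_so, size_t *match_eo)
{
    assert(so <= eo);                       /* eo points just past the subject */
    if (needle_len > eo - so) return 1;     /* needle cannot fit in the window */

    for (size_t i = so; i + needle_len <= eo; i++) {
        /* memcmp, not strcmp: binary zeroes are ordinary data bytes here */
        if (memcmp(string + i, needle, needle_len) == 0) {
            *match_so = i;                  /* offset from string, not from so */
            *match_eo = i + needle_len;
            return 0;
        }
    }
    return 1;
}
```

With the subject bytes "xxa\0bxx" and a window of 2 to 5, a match on "a\0b" is reported as 2 to 5. Under the pre-10.30 behaviour described in the text it would have been reported as 0 to 3.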

@ -1078,6 +1078,19 @@ are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
The other modifiers are ignored, with a warning message.
</P>
<P>
There is one additional modifier that can be used with the POSIX wrapper. It is
ignored (with a warning) if used for non-POSIX matching.
<pre>
posix_startend=&#60;n&#62;[:&#60;m&#62;]
</pre>
This causes the subject string to be passed to <b>regexec()</b> using the
REG_STARTEND option, which uses offsets to restrict which part of the string is
searched. If only one number is given, the end offset is passed as the end of
the subject string. For more detail of REG_STARTEND, see the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation.
</P>
<br><b>
Setting match controls
</b><br>
@ -1817,7 +1830,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 01 June 2017
Last updated: 03 June 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>
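
Reviewer note: the posix_startend=<n>[:<m>] modifier described above takes a start offset and an optional end offset, and the end defaults to the end of the subject when only one number is given. The sketch below shows one way such an argument could be parsed. sketch_parse_startend() is a hypothetical helper written for this note; it is not the parser used by pcre2test, and its error handling is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical parser for an argument of the form "<n>[:<m>]".
   On success returns 0 and sets *start to n and *end to m, or to
   subject_len when ":<m>" is absent. Returns -1 on malformed input.
   Note: strtoul() also accepts leading white space and a sign, which a
   stricter parser would reject. */
int sketch_parse_startend(const char *arg, size_t subject_len,
                          size_t *start, size_t *end)
{
    char *rest;
    assert(arg != NULL);

    unsigned long n = strtoul(arg, &rest, 10);
    if (rest == arg) return -1;             /* no digits for <n> */
    *start = (size_t)n;

    if (*rest == '\0') {                    /* only <n> given */
        *end = subject_len;                 /* end defaults to end of subject */
        return 0;
    }
    if (*rest != ':') return -1;            /* unexpected separator */

    const char *p = rest + 1;
    unsigned long m = strtoul(p, &rest, 10);
    if (rest == p || *rest != '\0') return -1;  /* missing or trailing junk */
    *end = (size_t)m;
    return 0;
}
```

The two values would then be stored in pmatch[0].rm_so and pmatch[0].rm_eo before calling regexec() with REG_STARTEND, as the pcre2posix text describes.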

@ -8986,7 +8986,8 @@ DESCRIPTION
There are also some options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface.
PCRE2-specific features via the POSIX calling interface or to add BSD
or GNU functionality.
When PCRE2 is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular expres-
@ -9008,10 +9009,11 @@ DESCRIPTION
COMPILING A PATTERN
The function regcomp() is called to compile a pattern into an internal
form. The pattern is a C string terminated by a binary zero, and is
passed in the argument pattern. The preg argument is a pointer to a
regex_t structure that is used as a base for storing information about
the compiled regular expression.
form. By default, the pattern is a C string terminated by a binary zero
(but see REG_PEND below). The preg argument is a pointer to a regex_t
structure that is used as a base for storing information about the com-
piled regular expression. (It is also used for input when REG_PEND is
set.)
The argument cflags is either zero, or contains one or more of the bits
defined by the following macros:
@ -9042,6 +9044,17 @@ COMPILING A PATTERN
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
longer happens because it disables the use of back references.
REG_PEND
If this option is set, the re_endp field in the preg structure (which
has the type const char *) must be set to point to the character beyond
the end of the pattern before calling regcomp(). The pattern itself may
now contain binary zeroes, which are treated as data characters. With-
out REG_PEND, a binary zero terminates the pattern and the re_endp
field is ignored. This is a GNU extension to the POSIX standard and
should be used with caution in software intended to be portable to
other systems.
REG_UCP
The PCRE2_UCP option is set when the regular expression is passed for
@ -9071,9 +9084,10 @@ COMPILING A PATTERN
ter (they are not) or by a negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero otherwise. The
preg structure is filled in on success, and one member of the structure
is public: re_nsub contains the number of capturing subpatterns in the
regular expression. Various error codes are defined in the header file.
preg structure is filled in on success, and one other member of the
structure (as well as re_endp) is public: re_nsub contains the number
of capturing subpatterns in the regular expression. Various error codes
are defined in the header file.
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
use the contents of the preg structure. If, for example, you pass it to
@ -9146,15 +9160,24 @@ MATCHING A PATTERN
REG_STARTEND
The string is considered to start at string + pmatch[0].rm_so and to
have a terminating NUL located at string + pmatch[0].rm_eo (there need
not actually be a NUL at that location), regardless of the value of
nmatch. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
software intended to be portable to other systems. Note that a non-zero
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
of the string, not how it is matched. Setting REG_STARTEND and passing
pmatch as NULL are mutually exclusive; the error REG_INVARG is
When this option is set, the subject string starts at string +
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the
only way to pass a subject string that contains a binary zero.
Whatever the value of pmatch[0].rm_so, the offsets of the matched
string and any captured substrings are still given relative to the
start of string itself. (Before PCRE2 release 10.30 these were given
relative to string + pmatch[0].rm_so, but this differs from other
implementations.)
This is a BSD extension, compatible with but not specified by IEEE
Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero rm_so
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
length of the string, not how it is matched. Setting REG_STARTEND and
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
returned.
If the pattern was compiled with the REG_NOSUB flag, no data about any
@ -9209,8 +9232,8 @@ AUTHOR
REVISION
Last updated: 31 January 2016
Copyright (c) 1997-2016 University of Cambridge.
Last updated: 05 June 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------

@ -1,4 +1,4 @@
.TH PCRE2POSIX 3 "03 June 2017" "PCRE2 10.30"
.TH PCRE2POSIX 3 "05 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SYNOPSIS"
@ -46,7 +46,7 @@ replacement library. Other POSIX options are not even defined.
.P
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface.
features via the POSIX calling interface or to add BSD or GNU functionality.
.P
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
@ -68,10 +68,11 @@ identifying error codes.
.rs
.sp
The function \fBregcomp()\fP is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
to a \fBregex_t\fP structure that is used as a base for storing information
about the compiled regular expression.
internal form. By default, the pattern is a C string terminated by a binary
zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a
\fBregex_t\fP structure that is used as a base for storing information about
the compiled regular expression. (It is also used for input when REG_PEND is
set.)
.P
The argument \fIcflags\fP is either zero, or contains one or more of the bits
defined by the following macros:
@ -100,6 +101,16 @@ matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
.sp
REG_PEND
.sp
If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
now contain binary zeroes, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
.sp
REG_UCP
.sp
@ -130,9 +141,10 @@ newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
.P
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
\fIpreg\fP structure is filled in on success, and one member of the structure
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
\fIpreg\fP structure is filled in on success, and one other member of the
structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the
number of capturing subpatterns in the regular expression. Various error codes
are defined in the header file.
.P
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
@ -204,21 +216,24 @@ function.
.sp
REG_STARTEND
.sp
When this option is set, the string is considered to start at \fIstring\fP +
\fIpmatch[0].rm_so\fP and to have a terminating NUL located at \fIstring\fP +
\fIpmatch[0].rm_eo\fP (there need not actually be a NUL at that location),
regardless of the value of \fInmatch\fP. However, the offsets of the matched
string and any captured substrings are still given relative to the start of
\fIstring\fP. (Before PCRE2 release 10.30 these were given relative to
When this option is set, the subject string starts at \fIstring\fP +
\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
should point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
.P
Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
and any captured substrings are still given relative to the start of
\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to
\fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
implementations.)
.P
This is a BSD extension, compatible with but not specified by IEEE Standard
1003.2 (POSIX.2), and should be used with caution in software intended to be
portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
REG_NOTBOL; REG_STARTEND affects only the location of the string, not how it is
matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are mutually
exclusive; the error REG_INVARG is returned.
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL
are mutually exclusive; the error REG_INVARG is returned.
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
@ -277,6 +292,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 June 2017
Last updated: 05 June 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi

@ -965,6 +965,17 @@ SUBJECT MODIFIERS
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
The other modifiers are ignored, with a warning message.
There is one additional modifier that can be used with the POSIX wrap-
per. It is ignored (with a warning) if used for non-POSIX matching.
posix_startend=<n>[:<m>]
This causes the subject string to be passed to regexec() using the
REG_STARTEND option, which uses offsets to restrict which part of the
string is searched. If only one number is given, the end offset is
passed as the end of the subject string. For more detail of REG_STAR-
TEND, see the pcre2posix documentation.
Setting match controls
The following modifiers affect the matching process or request addi-
@ -1651,5 +1662,5 @@ AUTHOR
REVISION
Last updated: 01 June 2017
Last updated: 03 June 2017
Copyright (c) 1997-2017 University of Cambridge.

@ -231,10 +231,14 @@ PCRE2POSIX_EXP_DEFN int PCRE2_CALL_CONVENTION
regcomp(regex_t *preg, const char *pattern, int cflags)
{
PCRE2_SIZE erroffset;
PCRE2_SIZE patlen;
int errorcode;
int options = 0;
int re_nsub = 0;
patlen = ((cflags & REG_PEND) != 0)? (PCRE2_SIZE)(preg->re_endp - pattern) :
PCRE2_ZERO_TERMINATED;
if ((cflags & REG_ICASE) != 0) options |= PCRE2_CASELESS;
if ((cflags & REG_NEWLINE) != 0) options |= PCRE2_MULTILINE;
if ((cflags & REG_DOTALL) != 0) options |= PCRE2_DOTALL;
@ -243,8 +247,8 @@ if ((cflags & REG_UCP) != 0) options |= PCRE2_UCP;
if ((cflags & REG_UNGREEDY) != 0) options |= PCRE2_UNGREEDY;
preg->re_cflags = cflags;
preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED,
options, &errorcode, &erroffset, NULL);
preg->re_pcre2_code = pcre2_compile((PCRE2_SPTR)pattern, patlen, options,
&errorcode, &erroffset, NULL);
preg->re_erroffset = erroffset;
if (preg->re_pcre2_code == NULL)
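
Reviewer note: the length selection added to regcomp() in this hunk can be isolated in a small sketch. The names below are stand-ins chosen for this note: SKETCH_REG_PEND copies the 0x0800 value from the header change in this commit, and SKETCH_ZERO_TERMINATED is assumed to play the role of PCRE2_ZERO_TERMINATED as an "all bits set" sentinel. The code does not call PCRE2; sketch_effective_length() exists only to show what each mode means for a pattern that contains a binary zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_REG_PEND        0x0800        /* value taken from the header diff */
#define SKETCH_ZERO_TERMINATED (~(size_t)0)  /* assumed sentinel: "use the NUL" */

/* Mirrors the new patlen computation: with REG_PEND the length is the
   distance from pattern to re_endp; otherwise a sentinel tells the
   compiler to stop at the first binary zero and re_endp is ignored. */
size_t sketch_pattern_length(const char *pattern, const char *re_endp,
                             int cflags)
{
    if ((cflags & SKETCH_REG_PEND) != 0) {
        /* The caller must have set re_endp before calling; this sketch
           checks that, which the code in the hunk does not. */
        assert(re_endp != NULL && re_endp >= pattern);
        return (size_t)(re_endp - pattern);
    }
    return SKETCH_ZERO_TERMINATED;
}

/* Illustration only: the number of pattern bytes actually consumed. */
size_t sketch_effective_length(const char *pattern, const char *re_endp,
                               int cflags)
{
    size_t len = sketch_pattern_length(pattern, re_endp, cflags);
    return (len == SKETCH_ZERO_TERMINATED) ? strlen(pattern) : len;
}
```

For the three bytes 'a', NUL, 'b' with re_endp set three bytes past the start, the effective length is 3 with the flag and 1 without it, which is why the REG_PEND tests in this commit need the hex and use_length modifiers.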

@ -62,6 +62,7 @@ extern "C" {
#define REG_NOTEMPTY 0x0100 /* NOT defined by POSIX; maps to PCRE2_NOTEMPTY */
#define REG_UNGREEDY 0x0200 /* NOT defined by POSIX; maps to PCRE2_UNGREEDY */
#define REG_UCP 0x0400 /* NOT defined by POSIX; maps to PCRE2_UCP */
#define REG_PEND 0x0800 /* GNU feature: pass end pattern by re_endp */
/* This is not used by PCRE2, but by defining it we make it easier
to slot PCRE2 into existing programs that make POSIX calls. */
@ -91,11 +92,13 @@ enum {
};
/* The structure representing a compiled regular expression. */
/* The structure representing a compiled regular expression. It is also used
for passing the pattern end pointer when REG_PEND is set. */
typedef struct {
void *re_pcre2_code;
void *re_match_data;
const char *re_endp;
size_t re_nsub;
size_t re_erroffset;
int re_cflags;

@ -699,7 +699,8 @@ static modstruct modlist[] = {
#define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0)
#define POSIX_SUPPORTED_COMPILE_CONTROLS ( \
CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_POSIX|CTL_POSIX_NOSUB)
CTL_AFTERTEXT|CTL_ALLAFTERTEXT|CTL_EXPAND|CTL_HEXPAT|CTL_POSIX| \
CTL_POSIX_NOSUB|CTL_USE_LENGTH)
#define POSIX_SUPPORTED_COMPILE_CONTROLS2 (0)
@ -733,11 +734,9 @@ the first control word. Note that CTL_POSIX_NOSUB is always accompanied by
CTL_POSIX, so it doesn't need its own entries. */
static uint32_t exclusive_pat_controls[] = {
CTL_POSIX | CTL_HEXPAT,
CTL_POSIX | CTL_PUSH,
CTL_POSIX | CTL_PUSHCOPY,
CTL_POSIX | CTL_PUSHTABLESCOPY,
CTL_POSIX | CTL_USE_LENGTH,
CTL_PUSH | CTL_PUSHCOPY,
CTL_PUSH | CTL_PUSHTABLESCOPY,
CTL_PUSHCOPY | CTL_PUSHTABLESCOPY,
@ -896,7 +895,7 @@ static PCRE2_SIZE malloclistlength[MALLOCLISTSIZE];
static uint32_t malloclistptr = 0;
#ifdef SUPPORT_PCRE2_8
static regex_t preg = { NULL, NULL, 0, 0, 0 };
static regex_t preg = { NULL, NULL, 0, 0, 0, 0 };
#endif
static int *dfa_workspace = NULL;
@ -5264,6 +5263,12 @@ if ((pat_patctl.control & CTL_POSIX) != 0)
if ((pat_patctl.options & PCRE2_DOTALL) != 0) cflags |= REG_DOTALL;
if ((pat_patctl.options & PCRE2_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
if ((pat_patctl.control & (CTL_HEXPAT|CTL_USE_LENGTH)) != 0)
{
preg.re_endp = (char *)pbuffer8 + patlen;
cflags |= REG_PEND;
}
rc = regcomp(&preg, (char *)pbuffer8, cflags);
/* Compiling failed */

@ -123,4 +123,10 @@
/^a\x{00}b$/posix
a\x{00}b\=posix_startend=0:3
/"A" 00 "B"/hex
A\x{00}B\=posix_startend=0:3
/ABC/use_length
ABC
# End of testdata/testinput18

@ -15,4 +15,7 @@
/\w/ucp
+++\x{c2}
/"^AB" 00 "\x{1234}$"/hex,utf
AB\x{00}\x{1234}\=posix_startend=0:6
# End of testdata/testinput19

@ -191,4 +191,12 @@ No match: POSIX code 17: match failed
a\x{00}b\=posix_startend=0:3
0: a\x00b
/"A" 00 "B"/hex
A\x{00}B\=posix_startend=0:3
0: A\x00B
/ABC/use_length
ABC
0: ABC
# End of testdata/testinput18

@ -18,4 +18,8 @@ No match: POSIX code 17: match failed
+++\x{c2}
0: \xc2
/"^AB" 00 "\x{1234}$"/hex,utf
AB\x{00}\x{1234}\=posix_startend=0:6
0: AB\x{00}\x{1234}
# End of testdata/testinput19