Documentation update.
parent
a083420cac
commit
c92bfc3d21
|
@ -47,7 +47,7 @@ system stack size checking, or to change one or more of these parameters:
|
||||||
The newline character sequence;
|
The newline character sequence;
|
||||||
The compile time nested parentheses limit;
|
The compile time nested parentheses limit;
|
||||||
The maximum pattern length (in code units) that is allowed.
|
The maximum pattern length (in code units) that is allowed.
|
||||||
The additional options bits
|
The additional options bits (see pcre2_set_compile_extra_options())
|
||||||
</pre>
|
</pre>
|
||||||
The option bits are:
|
The option bits are:
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -64,6 +64,7 @@ The option bits are:
|
||||||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||||
PCRE2_EXTENDED Ignore white space and # comments
|
PCRE2_EXTENDED Ignore white space and # comments
|
||||||
PCRE2_FIRSTLINE Force matching to be before newline
|
PCRE2_FIRSTLINE Force matching to be before newline
|
||||||
|
PCRE2_LITERAL Pattern characters are all literal
|
||||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||||
|
|
|
@ -32,6 +32,8 @@ options are:
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
|
||||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
|
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
|
||||||
|
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||||
|
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||||
</pre>
|
</pre>
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
|
|
@ -1453,6 +1453,19 @@ continue over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a
|
||||||
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
|
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
|
||||||
a match must occur in the first line and also within the offset limit. In other
|
a match must occur in the first line and also within the offset limit. In other
|
||||||
words, whichever limit comes first is used.
|
words, whichever limit comes first is used.
|
||||||
|
<pre>
|
||||||
|
PCRE2_LITERAL
|
||||||
|
</pre>
|
||||||
|
If this option is set, all meta-characters in the pattern are disabled, and it
|
||||||
|
is treated as a literal string. Matching literal strings with a regular
|
||||||
|
expression engine is not the most efficient way of doing it. If you are doing a
|
||||||
|
lot of literal matching and are worried about efficiency, you should consider
|
||||||
|
using other approaches. The only other main options that are allowed with
|
||||||
|
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||||
|
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
||||||
|
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
|
||||||
|
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||||
|
error.
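As a rough illustration of the trade-off described above, the following sketch uses Python's re module as a stand-in for PCRE2. That substitution is an assumption made purely for illustration; this is not the PCRE2 API. Here re.escape() plays the role of PCRE2_LITERAL by disabling every meta-character, and str.find() represents one of the "other approaches". The helper names literal_search and plain_search are hypothetical.

```python
import re


def literal_search(needle, subject, caseless=False):
    """Search with a regex engine, but with all meta-characters disabled."""
    flags = re.IGNORECASE if caseless else 0
    # re.escape() backslash-escapes every special character, so the
    # pattern can only match the needle text itself.
    match = re.search(re.escape(needle), subject, flags)
    return match.span() if match else None


def plain_search(needle, subject):
    """The same search done with an ordinary substring function."""
    start = subject.find(needle)
    return (start, start + len(needle)) if start >= 0 else None
```

Both helpers return the same (start, end) offsets for a case-sensitive search. The regex route is only worth its cost when another option, such as caseless matching, is needed as well.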
|
||||||
<pre>
|
<pre>
|
||||||
PCRE2_MATCH_UNSET_BACKREF
|
PCRE2_MATCH_UNSET_BACKREF
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -1724,6 +1737,24 @@ treated as single-character escapes. For example, \j is a literal "j" and
|
||||||
\x{2z} is treated as the literal string "x{2z}". Setting this option means
|
\x{2z} is treated as the literal string "x{2z}". Setting this option means
|
||||||
that typos in patterns may go undetected and have unexpected results. This is a
|
that typos in patterns may go undetected and have unexpected results. This is a
|
||||||
dangerous option. Use with care.
|
dangerous option. Use with care.
|
||||||
|
<pre>
|
||||||
|
PCRE2_EXTRA_MATCH_LINE
|
||||||
|
</pre>
|
||||||
|
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
|
||||||
|
causes the pattern only to match complete lines. This is achieved by
|
||||||
|
automatically inserting the code for "^(?:" at the start of the compiled
|
||||||
|
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
|
||||||
|
line may be in the middle of the subject string. This option can be used with
|
||||||
|
PCRE2_LITERAL.
|
||||||
|
<pre>
|
||||||
|
PCRE2_EXTRA_MATCH_WORD
|
||||||
|
</pre>
|
||||||
|
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
|
||||||
|
causes the pattern only to match strings that have a word boundary at the start
|
||||||
|
and the end. This is achieved by automatically inserting the code for "\b(?:"
|
||||||
|
at the start of the compiled pattern and ")\b" at the end. The option may be
|
||||||
|
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
|
||||||
|
also set.
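The wrapping described for PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD can be sketched at the pattern-text level. PCRE2 inserts the equivalent code into the compiled pattern rather than editing the pattern string, so the following is only an approximation. It uses Python's re module as a stand-in, and wrap_pattern is a hypothetical helper.

```python
import re


def wrap_pattern(pattern, match_line=False, match_word=False):
    """Mimic the documented wrapping; match_line takes precedence."""
    if match_line:
        return "^(?:" + pattern + ")$"
    if match_word:
        return r"\b(?:" + pattern + r")\b"
    return pattern


# With multiline matching, the matched line may sit in the middle
# of the subject string.
line_match = re.search(wrap_pattern("b+", match_line=True),
                       "a\nbb\nc", re.MULTILINE)

# "dog" is found as a word, but not inside "hotdog".
word_hit = re.search(wrap_pattern("cat|dog", match_word=True), "hot dog")
word_miss = re.search(wrap_pattern("cat|dog", match_word=True), "hotdog")
```

The non-capturing group matters for patterns containing alternation: without it, "^cat|dog$" would anchor only "cat" at the start and only "dog" at the end.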
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
|
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -3489,7 +3520,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 01 June 2017
|
Last updated: 16 June 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2017 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -94,7 +94,7 @@ The function <b>regcomp()</b> is called to compile a pattern into an
|
||||||
internal form. By default, the pattern is a C string terminated by a binary
|
internal form. By default, the pattern is a C string terminated by a binary
|
||||||
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
|
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
|
||||||
<b>regex_t</b> structure that is used as a base for storing information about
|
<b>regex_t</b> structure that is used as a base for storing information about
|
||||||
the compiled regular expression. (It is also used for input when REG_PEND is
|
the compiled regular expression. (It is also used for input when REG_PEND is
|
||||||
set.)
|
set.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -117,6 +117,14 @@ compilation to the native function.
|
||||||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||||
compilation to the native function. Note that this does <i>not</i> mimic the
|
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||||
|
<pre>
|
||||||
|
REG_NOSPEC
|
||||||
|
</pre>
|
||||||
|
The PCRE2_LITERAL option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. This disables all meta-characters in the
|
||||||
|
pattern, causing it to be treated as a literal string. The only other options
|
||||||
|
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
|
||||||
|
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
|
||||||
<pre>
|
<pre>
|
||||||
REG_NOSUB
|
REG_NOSUB
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -128,8 +136,8 @@ because it disables the use of back references.
|
||||||
<pre>
|
<pre>
|
||||||
REG_PEND
|
REG_PEND
|
||||||
</pre>
|
</pre>
|
||||||
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||||
(which has the type const char *) must be set to point to the character beyond
|
(which has the type const char *) must be set to point to the character beyond
|
||||||
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
|
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
|
||||||
now contain binary zeroes, which are treated as data characters. Without
|
now contain binary zeroes, which are treated as data characters. Without
|
||||||
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
||||||
|
@ -242,8 +250,8 @@ function.
|
||||||
</pre>
|
</pre>
|
||||||
When this option is set, the subject string starts at <i>string</i> +
|
When this option is set, the subject string starts at <i>string</i> +
|
||||||
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
||||||
should point to the first character beyond the string. There may be binary
|
should point to the first character beyond the string. There may be binary
|
||||||
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||||
way to pass a subject string that contains a binary zero.
|
way to pass a subject string that contains a binary zero.
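The idea of delimiting the subject by offsets instead of a terminating zero can be illustrated by analogy. The sketch below does not call regexec(). It relies on the pos and endpos arguments of Python's compiled-pattern search(), which are an assumed stand-in for rm_so and rm_eo. The helper search_window is hypothetical, and nothing here is meant to describe how the POSIX wrapper reports offsets.

```python
import re


def search_window(pattern, data, start, end):
    """Search only data[start:end]; the data may contain binary zeros."""
    match = re.compile(pattern).search(data, start, end)
    # Python reports offsets relative to the start of data.
    return match.span() if match else None


# A subject containing two binary zeros, which a zero-terminated
# interface would cut short at offset 2.
data = b"ab\x00cd\x00ef"
second_zero = search_window(rb"\x00", data, 3, len(data))
cut_off = search_window(rb"ef", data, 0, 6)
```

Because the window is given by explicit offsets, the zero bytes are ordinary data, and text beyond the end offset is never examined.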
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@ -314,7 +322,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 05 June 2017
|
Last updated: 15 June 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2017 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -96,12 +96,12 @@ want that action.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The input is processed using C's string functions, so must not
|
The input is processed using C's string functions, so must not
|
||||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
|
||||||
treats any bytes other than newline as data characters. An error is generated
|
treats any bytes other than newline as data characters. An error is generated
|
||||||
if a binary zero is encountered. Subject lines are processed for backslash
|
if a binary zero is encountered. By default subject lines are processed for
|
||||||
escapes, which makes it possible to include any data value in strings that are
|
backslash escapes, which makes it possible to include any data value in strings
|
||||||
passed to the library for matching. For patterns, there is a facility for
|
that are passed to the library for matching. For patterns, there is a facility
|
||||||
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
for specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||||
which makes it possible to include binary zeros.
|
which makes it possible to include binary zeros.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
@ -382,8 +382,9 @@ of the standard test input files.
|
||||||
<P>
|
<P>
|
||||||
When the POSIX API is being tested there is no way to override the default
|
When the POSIX API is being tested there is no way to override the default
|
||||||
newline convention, though it is possible to set the newline convention from
|
newline convention, though it is possible to set the newline convention from
|
||||||
within the pattern. A warning is given if the <b>posix</b> modifier is used when
|
within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
|
||||||
<b>#newline_default</b> would set a default for the non-POSIX API.
|
modifier is used when <b>#newline_default</b> would set a default for the
|
||||||
|
non-POSIX API.
|
||||||
<pre>
|
<pre>
|
||||||
#pattern <modifier-list>
|
#pattern <modifier-list>
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -479,8 +480,9 @@ A pattern can be followed by a modifier list (details below).
|
||||||
<P>
|
<P>
|
||||||
Before each subject line is passed to <b>pcre2_match()</b> or
|
Before each subject line is passed to <b>pcre2_match()</b> or
|
||||||
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
|
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
|
||||||
line is scanned for backslash escapes. The following provide a means of
|
line is scanned for backslash escapes, unless the <b>subject_literal</b>
|
||||||
encoding non-printing characters in a visible way:
|
modifier was set for the pattern. The following provide a means of encoding
|
||||||
|
non-printing characters in a visible way:
|
||||||
<pre>
|
<pre>
|
||||||
\a alarm (BEL, \x07)
|
\a alarm (BEL, \x07)
|
||||||
\b backspace (\x08)
|
\b backspace (\x08)
|
||||||
|
@ -548,6 +550,12 @@ the very last character in the line is a backslash (and there is no modifier
|
||||||
list), it is ignored. This gives a way of passing an empty line as data, since
|
list), it is ignored. This gives a way of passing an empty line as data, since
|
||||||
a real empty line terminates the data input.
|
a real empty line terminates the data input.
|
||||||
</P>
|
</P>
|
||||||
|
<P>
|
||||||
|
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
|
||||||
|
that follow are treated as literals, with no special treatment of backslashes.
|
||||||
|
No replication is possible, and any subject modifiers must be set as defaults
|
||||||
|
by a <b>#subject</b> command.
|
||||||
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
|
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
|
||||||
<P>
|
<P>
|
||||||
There are several types of modifier that can appear in pattern lines. Except
|
There are several types of modifier that can appear in pattern lines. Except
|
||||||
|
@ -586,7 +594,10 @@ for a description of the effects of these options.
|
||||||
/x extended set PCRE2_EXTENDED
|
/x extended set PCRE2_EXTENDED
|
||||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||||
firstline set PCRE2_FIRSTLINE
|
firstline set PCRE2_FIRSTLINE
|
||||||
|
literal set PCRE2_LITERAL
|
||||||
|
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||||
|
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||||
/m multiline set PCRE2_MULTILINE
|
/m multiline set PCRE2_MULTILINE
|
||||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||||
never_ucp set PCRE2_NEVER_UCP
|
never_ucp set PCRE2_NEVER_UCP
|
||||||
|
@ -638,6 +649,7 @@ heavily used in the test files.
|
||||||
push push compiled pattern onto the stack
|
push push compiled pattern onto the stack
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
|
subject_literal treat all subject lines as literal
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
use_length do not zero-terminate the pattern
|
use_length do not zero-terminate the pattern
|
||||||
utf8_input treat input as UTF-8
|
utf8_input treat input as UTF-8
|
||||||
|
@ -728,18 +740,6 @@ testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
|
||||||
default values).
|
default values).
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Specifying the pattern's length
|
|
||||||
</b><br>
|
|
||||||
<P>
|
|
||||||
By default, patterns are passed to the compiling functions as zero-terminated
|
|
||||||
strings. When using the POSIX wrapper API, there is no other option. However,
|
|
||||||
when using PCRE2's native API, patterns can be passed by length instead of
|
|
||||||
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
|
|
||||||
Using a length happens automatically (whether or not <b>use_length</b> is set)
|
|
||||||
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
|
|
||||||
binary zeros.
|
|
||||||
</P>
|
|
||||||
<br><b>
|
|
||||||
Specifying pattern characters in hexadecimal
|
Specifying pattern characters in hexadecimal
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -761,11 +761,20 @@ Either single or double quotes may be used. There is no way of including
|
||||||
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
||||||
mutually exclusive.
|
mutually exclusive.
|
||||||
</P>
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Specifying the pattern's length
|
||||||
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The POSIX API cannot be used with patterns specified in hexadecimal because
|
By default, patterns are passed to the compiling functions as zero-terminated
|
||||||
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
|
strings but can be passed by length instead of being zero-terminated. The
|
||||||
requirement for a zero-terminated string. Such patterns are always passed to
|
<b>use_length</b> modifier causes this to happen. Using a length happens
|
||||||
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
|
automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
|
||||||
|
because patterns specified in hexadecimal may contain binary zeros.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
|
||||||
|
<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
|
||||||
|
below), the REG_PEND extension is used to pass the pattern's length.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Specifying wide characters in 16-bit and 32-bit modes
|
Specifying wide characters in 16-bit and 32-bit modes
|
||||||
|
@ -826,7 +835,7 @@ modifier in "Subject Modifiers"
|
||||||
for details of how these options are specified for each match attempt.
|
for details of how these options are specified for each match attempt.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
|
JIT compilation is requested by the <b>jit</b> pattern modifier, which may
|
||||||
optionally be followed by an equals sign and a number in the range 0 to 7.
|
optionally be followed by an equals sign and a number in the range 0 to 7.
|
||||||
The three bits that make up the number specify which of the three JIT operating
|
The three bits that make up the number specify which of the three JIT operating
|
||||||
modes are to be compiled:
|
modes are to be compiled:
|
||||||
|
@ -850,7 +859,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
|
||||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
||||||
match; the options enable the possibility of a partial match, but do not
|
match; the options enable the possibility of a partial match, but do not
|
||||||
require it. Note also that if you request JIT compilation only for partial
|
require it. Note also that if you request JIT compilation only for partial
|
||||||
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
|
matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
|
||||||
subject line, that match will not use JIT code because none was compiled for
|
subject line, that match will not use JIT code because none was compiled for
|
||||||
non-partial matching.
|
non-partial matching.
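The effect of the three bits can be sketched as a small lookup. The bit assignments used here (1 for non-partial, 2 for soft partial, 4 for hard partial matching) are assumed from the values of PCRE2's PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_SOFT and PCRE2_JIT_PARTIAL_HARD options, since the list is not shown in this hunk. The function name jit_code_compiled is hypothetical.

```python
JIT_COMPLETE = 1       # code for non-partial matching
JIT_PARTIAL_SOFT = 2   # code for soft partial matching
JIT_PARTIAL_HARD = 4   # code for hard partial matching


def jit_code_compiled(jit_value, partial=None):
    """Report whether jit=<jit_value> compiled code for this kind of match.

    partial is None for a non-partial match, or "soft" or "hard".
    """
    if not 0 <= jit_value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    needed = {
        None: JIT_COMPLETE,
        "soft": JIT_PARTIAL_SOFT,
        "hard": JIT_PARTIAL_HARD,
    }[partial]
    return bool(jit_value & needed)
```

For example, jit_code_compiled(2) is false: with jit=2 a subject line that lacks the partial modifier finds no JIT code for non-partial matching, as the paragraph above describes.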
|
||||||
</P>
|
</P>
|
||||||
|
@ -927,12 +936,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
|
||||||
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
|
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
|
||||||
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
||||||
variable can hold (essentially unlimited).
|
variable can hold (essentially unlimited).
|
||||||
</P>
|
<a name="posixwrapper"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Using the POSIX wrapper API
|
Using the POSIX wrapper API
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||||
PCRE2 via the POSIX wrapper API rather than its native API. When
|
PCRE2 via the POSIX wrapper API rather than its native API. When
|
||||||
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
|
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
|
||||||
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
|
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
|
||||||
|
@ -962,6 +971,11 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
|
||||||
below. All other modifiers are either ignored, with a warning message, or cause
|
below. All other modifiers are either ignored, with a warning message, or cause
|
||||||
an error.
|
an error.
|
||||||
</P>
|
</P>
|
||||||
|
<P>
|
||||||
|
The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
|
||||||
|
default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
|
||||||
|
REG_PEND extension is used to pass it by length.
|
||||||
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Testing the stack guard feature
|
Testing the stack guard feature
|
||||||
</b><br>
|
</b><br>
|
||||||
|
@ -999,17 +1013,18 @@ are mutually exclusive.
|
||||||
Setting certain match controls
|
Setting certain match controls
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
The following modifiers are really subject modifiers, and are described below.
|
The following modifiers are really subject modifiers, and are described under
|
||||||
However, they may be included in a pattern's modifier list, in which case they
|
"Subject Modifiers" below. However, they may be included in a pattern's
|
||||||
are applied to every subject line that is processed with that pattern. They may
|
modifier list, in which case they are applied to every subject line that is
|
||||||
not appear in <b>#pattern</b> commands. These modifiers do not affect the
|
processed with that pattern. They may not appear in <b>#pattern</b> commands.
|
||||||
compilation process.
|
These modifiers do not affect the compilation process.
|
||||||
<pre>
|
<pre>
|
||||||
aftertext show text after match
|
aftertext show text after match
|
||||||
allaftertext show text after captures
|
allaftertext show text after captures
|
||||||
allcaptures show all captures
|
allcaptures show all captures
|
||||||
allusedtext show all consulted text
|
allusedtext show all consulted text
|
||||||
/g global global matching
|
/g global global matching
|
||||||
|
jitstack=<n> set size of JIT stack
|
||||||
mark show mark values
|
mark show mark values
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show starting character when relevant
|
startchar show starting character when relevant
|
||||||
|
@ -1022,6 +1037,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
||||||
defaults, set them in a <b>#subject</b> command.
|
defaults, set them in a <b>#subject</b> command.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
Specifying literal subject lines
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
If the <b>subject_literal</b> modifier is present on a pattern, all the subject
|
||||||
|
lines that it matches are taken as literal strings, with no interpretation of
|
||||||
|
backslashes. It is not possible to set subject modifiers on such lines, but any
|
||||||
|
that are set as defaults by a <b>#subject</b> command are recognized.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
Saving a compiled pattern
|
Saving a compiled pattern
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -1072,11 +1096,11 @@ The partial matching modifiers are provided with abbreviations because they
|
||||||
appear frequently in tests.
|
appear frequently in tests.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
|
If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
|
||||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
causing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||||
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
|
||||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||||
The other modifiers are ignored, with a warning message.
|
<b>regexec()</b>. The other modifiers are ignored, with a warning message.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There is one additional modifier that can be used with the POSIX wrapper. It is
|
There is one additional modifier that can be used with the POSIX wrapper. It is
|
||||||
|
@ -1085,11 +1109,13 @@ ignored (with a warning) if used for non-POSIX matching.
|
||||||
posix_startend=<n>[:<m>]
|
posix_startend=<n>[:<m>]
|
||||||
</pre>
|
</pre>
|
||||||
This causes the subject string to be passed to <b>regexec()</b> using the
|
This causes the subject string to be passed to <b>regexec()</b> using the
|
||||||
REG_STARTEND option, which uses offsets to restrict which part of the string is
|
REG_STARTEND option, which uses offsets to specify which part of the string is
|
||||||
searched. If only one number is given, the end offset is passed as the end of
|
searched. If only one number is given, the end offset is passed as the end of
|
||||||
the subject string. For more detail of REG_STARTEND, see the
|
the subject string. For more detail of REG_STARTEND, see the
|
||||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||||
documentation.
|
documentation. If the subject string contains binary zeros (coded as escapes
|
||||||
|
such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
|
||||||
|
its input), you must use <b>posix_startend</b> to specify its length.
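The rule for the optional second number can be sketched as follows. This is not pcre2test's own parser; parse_posix_startend is a hypothetical helper that only mirrors the behaviour stated above, namely that a missing end offset defaults to the end of the subject string.

```python
def parse_posix_startend(value, subject_length):
    """Turn "<n>" or "<n>:<m>" into a (start, end) pair of offsets."""
    start_text, separator, end_text = value.partition(":")
    start = int(start_text)
    # With only one number, the end offset is the end of the subject.
    end = int(end_text) if separator else subject_length
    return start, end
```

For a ten-character subject, "2" yields offsets (2, 10) and "2:5" yields (2, 5).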
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting match controls
|
Setting match controls
|
||||||
|
@ -1355,9 +1381,11 @@ Setting the JIT stack size
|
||||||
<P>
|
<P>
|
||||||
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
|
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
|
||||||
that is used by the just-in-time optimization code. It is ignored if JIT
|
that is used by the just-in-time optimization code. It is ignored if JIT
|
||||||
optimization is not being used. The value is a number of kilobytes. Providing a
|
optimization is not being used. The value is a number of kilobytes. Setting
|
||||||
stack that is larger than the default 32K is necessary only for very
|
zero reverts to the default of 32K. Providing a stack that is larger than the
|
||||||
complicated patterns.
|
default is necessary only for very complicated patterns. If <b>jitstack</b> is
|
||||||
|
set non-zero on a subject line it overrides any value that was set on the
|
||||||
|
pattern.
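The precedence rules for <b>jitstack</b> can be summarized in a few lines. This is a sketch of the behaviour as described in the paragraph above, not code taken from pcre2test, and the names used are hypothetical.

```python
DEFAULT_JITSTACK_KIB = 32  # default maximum JIT stack size, in kilobytes


def effective_jitstack_kib(pattern_value=0, subject_value=0):
    """Pick the JIT stack size that applies to one subject line."""
    # A non-zero value on the subject line overrides the pattern's value.
    chosen = subject_value if subject_value != 0 else pattern_value
    # Zero means "revert to the default".
    return chosen if chosen != 0 else DEFAULT_JITSTACK_KIB
```

So a pattern with jitstack=64 uses 64K for its subject lines, except for a subject line that sets its own non-zero value.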
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Setting heap, match, and depth limits
|
Setting heap, match, and depth limits
|
||||||
|
@ -1461,8 +1489,8 @@ Passing the subject as zero-terminated
|
||||||
By default, the subject string is passed to a native API matching function with
|
By default, the subject string is passed to a native API matching function with
|
||||||
its correct length. In order to test the facility for passing a zero-terminated
|
its correct length. In order to test the facility for passing a zero-terminated
|
||||||
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
|
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
|
||||||
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
|
be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
|
||||||
this modifier has no effect, as there is no facility for passing a length.)
|
this modifier is ignored, with a warning.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||||
|
@ -1675,7 +1703,7 @@ callout is in a lookbehind assertion.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||||
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
|
result of the <b>auto_callout</b> pattern modifier. In this case, instead of
|
||||||
showing the callout number, the offset in the pattern, preceded by a plus, is
|
showing the callout number, the offset in the pattern, preceded by a plus, is
|
||||||
output. For example:
|
output. For example:
|
||||||
<pre>
|
<pre>
|
||||||
|
@ -1830,7 +1858,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 03 June 2017
|
Last updated: 16 June 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2017 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
189
doc/pcre2.txt
|
@ -1441,6 +1441,20 @@ COMPILING A PATTERN
|
||||||
first line and also within the offset limit. In other words, whichever
|
first line and also within the offset limit. In other words, whichever
|
||||||
limit comes first is used.
|
limit comes first is used.
|
||||||
|
|
||||||
|
PCRE2_LITERAL
|
||||||
|
|
||||||
|
If this option is set, all meta-characters in the pattern are disabled,
|
||||||
|
and it is treated as a literal string. Matching literal strings with a
|
||||||
|
regular expression engine is not the most efficient way of doing it. If
|
||||||
|
you are doing a lot of literal matching and are worried about effi-
|
||||||
|
ciency, you should consider using other approaches. The only other main
|
||||||
|
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
|
||||||
|
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
|
||||||
|
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||||
|
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||||
|
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||||
|
error.
|
||||||
|
|
||||||
PCRE2_MATCH_UNSET_BACKREF
|
PCRE2_MATCH_UNSET_BACKREF
|
||||||
|
|
||||||
If this option is set, a back reference to an unset subpattern group
|
If this option is set, a back reference to an unset subpattern group
|
||||||
|
@ -1706,6 +1720,24 @@ COMPILING A PATTERN
|
||||||
option means that typos in patterns may go undetected and have unex-
|
option means that typos in patterns may go undetected and have unex-
|
||||||
pected results. This is a dangerous option. Use with care.
|
pected results. This is a dangerous option. Use with care.
|
||||||
|
|
||||||
|
PCRE2_EXTRA_MATCH_LINE
|
||||||
|
|
||||||
|
This option is provided for use by the -x option of pcre2grep. It
|
||||||
|
causes the pattern only to match complete lines. This is achieved by
|
||||||
|
automatically inserting the code for "^(?:" at the start of the com-
|
||||||
|
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
|
||||||
|
the matched line may be in the middle of the subject string. This
|
||||||
|
option can be used with PCRE2_LITERAL.
|
||||||
|
|
||||||
|
PCRE2_EXTRA_MATCH_WORD
|
||||||
|
|
||||||
|
This option is provided for use by the -w option of pcre2grep. It
|
||||||
|
causes the pattern only to match strings that have a word boundary at
|
||||||
|
the start and the end. This is achieved by automatically inserting the
|
||||||
|
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
|
||||||
|
end. The option may be used with PCRE2_LITERAL. However, it is ignored
|
||||||
|
if PCRE2_EXTRA_MATCH_LINE is also set.
|
||||||
|
|
||||||
|
|
||||||
COMPILATION ERROR CODES
|
COMPILATION ERROR CODES
|
||||||
|
|
||||||
|
@ -3368,7 +3400,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 01 June 2017
|
Last updated: 16 June 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -9036,60 +9068,69 @@ COMPILING A PATTERN
|
||||||
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
|
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
|
||||||
tion).
|
tion).
|
||||||
|
|
||||||
|
REG_NOSPEC
|
||||||
|
|
||||||
|
The PCRE2_LITERAL option is set when the regular expression is passed
|
||||||
|
for compilation to the native function. This disables all meta charac-
|
||||||
|
ters in the pattern, causing it to be treated as a literal string. The
|
||||||
|
only other options that are allowed with REG_NOSPEC are REG_ICASE,
|
||||||
|
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
|
||||||
|
the POSIX standard.
|
||||||
|
|
||||||
REG_NOSUB
|
REG_NOSUB
|
||||||
|
|
||||||
When a pattern that is compiled with this flag is passed to regexec()
|
When a pattern that is compiled with this flag is passed to regexec()
|
||||||
for matching, the nmatch and pmatch arguments are ignored, and no cap-
|
for matching, the nmatch and pmatch arguments are ignored, and no cap-
|
||||||
tured strings are returned. Versions of the PCRE2 library prior to 10.22
|
tured strings are returned. Versions of the PCRE2 library prior to 10.22
|
||||||
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
||||||
longer happens because it disables the use of back references.
|
longer happens because it disables the use of back references.
|
||||||
|
|
||||||
REG_PEND
|
REG_PEND
|
||||||
|
|
||||||
If this option is set, the re_endp field in the preg structure (which
|
If this option is set, the re_endp field in the preg structure (which
|
||||||
has the type const char *) must be set to point to the character beyond
|
has the type const char *) must be set to point to the character beyond
|
||||||
the end of the pattern before calling regcomp(). The pattern itself may
|
the end of the pattern before calling regcomp(). The pattern itself may
|
||||||
now contain binary zeroes, which are treated as data characters. With-
|
now contain binary zeroes, which are treated as data characters. With-
|
||||||
out REG_PEND, a binary zero terminates the pattern and the re_endp
|
out REG_PEND, a binary zero terminates the pattern and the re_endp
|
||||||
field is ignored. This is a GNU extension to the POSIX standard and
|
field is ignored. This is a GNU extension to the POSIX standard and
|
||||||
should be used with caution in software intended to be portable to
|
should be used with caution in software intended to be portable to
|
||||||
other systems.
|
other systems.
|
||||||
|
|
||||||
REG_UCP
|
REG_UCP
|
||||||
|
|
||||||
The PCRE2_UCP option is set when the regular expression is passed for
|
The PCRE2_UCP option is set when the regular expression is passed for
|
||||||
compilation to the native function. This causes PCRE2 to use Unicode
|
compilation to the native function. This causes PCRE2 to use Unicode
|
||||||
properties when matching \d, \w, etc., instead of just recognizing
|
properties when matching \d, \w, etc., instead of just recognizing
|
||||||
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
||||||
|
|
||||||
REG_UNGREEDY
|
REG_UNGREEDY
|
||||||
|
|
||||||
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
||||||
for compilation to the native function. Note that REG_UNGREEDY is not
|
for compilation to the native function. Note that REG_UNGREEDY is not
|
||||||
part of the POSIX standard.
|
part of the POSIX standard.
|
||||||
|
|
||||||
REG_UTF
|
REG_UTF
|
||||||
|
|
||||||
The PCRE2_UTF option is set when the regular expression is passed for
|
The PCRE2_UTF option is set when the regular expression is passed for
|
||||||
compilation to the native function. This causes the pattern itself and
|
compilation to the native function. This causes the pattern itself and
|
||||||
all data strings used for matching it to be treated as UTF-8 strings.
|
all data strings used for matching it to be treated as UTF-8 strings.
|
||||||
Note that REG_UTF is not part of the POSIX standard.
|
Note that REG_UTF is not part of the POSIX standard.
|
||||||
|
|
||||||
In the absence of these flags, no options are passed to the native
|
In the absence of these flags, no options are passed to the native
|
||||||
function. This means that the regex is compiled with PCRE2 default
|
function. This means that the regex is compiled with PCRE2 default
|
||||||
semantics. In particular, the way it handles newline characters in the
|
semantics. In particular, the way it handles newline characters in the
|
||||||
subject string is the Perl way, not the POSIX way. Note that setting
|
subject string is the Perl way, not the POSIX way. Note that setting
|
||||||
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
||||||
It does not affect the way newlines are matched by the dot metacharac-
|
It does not affect the way newlines are matched by the dot metacharac-
|
||||||
ter (they are not) or by a negative class such as [^a] (they are).
|
ter (they are not) or by a negative class such as [^a] (they are).
|
||||||
|
|
||||||
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
||||||
preg structure is filled in on success, and one other member of the
|
preg structure is filled in on success, and one other member of the
|
||||||
structure (as well as re_endp) is public: re_nsub contains the number
|
structure (as well as re_endp) is public: re_nsub contains the number
|
||||||
of capturing subpatterns in the regular expression. Various error codes
|
of capturing subpatterns in the regular expression. Various error codes
|
||||||
are defined in the header file.
|
are defined in the header file.
|
||||||
|
|
||||||
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
||||||
use the contents of the preg structure. If, for example, you pass it to
|
use the contents of the preg structure. If, for example, you pass it to
|
||||||
regexec(), the result is undefined and your program is likely to crash.
|
regexec(), the result is undefined and your program is likely to crash.
|
||||||
|
|
||||||
|
@ -9097,9 +9138,9 @@ COMPILING A PATTERN
|
||||||
MATCHING NEWLINE CHARACTERS
|
MATCHING NEWLINE CHARACTERS
|
||||||
|
|
||||||
This area is not simple, because POSIX and Perl take different views of
|
This area is not simple, because POSIX and Perl take different views of
|
||||||
things. It is not possible to get PCRE2 to obey POSIX semantics, but
|
things. It is not possible to get PCRE2 to obey POSIX semantics, but
|
||||||
then PCRE2 was never intended to be a POSIX engine. The following table
|
then PCRE2 was never intended to be a POSIX engine. The following table
|
||||||
lists the different possibilities for matching newline characters in
|
lists the different possibilities for matching newline characters in
|
||||||
Perl and PCRE2:
|
Perl and PCRE2:
|
||||||
|
|
||||||
Default Change with
|
Default Change with
|
||||||
|
@ -9120,25 +9161,25 @@ MATCHING NEWLINE CHARACTERS
|
||||||
$ matches \n in middle no REG_NEWLINE
|
$ matches \n in middle no REG_NEWLINE
|
||||||
^ matches \n in middle no REG_NEWLINE
|
^ matches \n in middle no REG_NEWLINE
|
||||||
|
|
||||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||||
API. By default, PCRE2's behaviour is the same as Perl's, except that
|
API. By default, PCRE2's behaviour is the same as Perl's, except that
|
||||||
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
|
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
|
||||||
and Perl, there is no way to stop newline from matching [^a].
|
and Perl, there is no way to stop newline from matching [^a].
|
||||||
|
|
||||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
|
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
|
||||||
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
|
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
|
||||||
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
|
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
|
||||||
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
|
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
|
||||||
comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
|
comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
|
||||||
and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
|
and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
|
||||||
LAR_ENDONLY.
|
LAR_ENDONLY.
|
||||||
|
|
||||||
|
|
||||||
MATCHING A PATTERN
|
MATCHING A PATTERN
|
||||||
|
|
||||||
The function regexec() is called to match a compiled pattern preg
|
The function regexec() is called to match a compiled pattern preg
|
||||||
against a given string, which is by default terminated by a zero byte
|
against a given string, which is by default terminated by a zero byte
|
||||||
(but see REG_STARTEND below), subject to the options in eflags. These
|
(but see REG_STARTEND below), subject to the options in eflags. These
|
||||||
can be:
|
can be:
|
||||||
|
|
||||||
REG_NOTBOL
|
REG_NOTBOL
|
||||||
|
@ -9148,9 +9189,9 @@ MATCHING A PATTERN
|
||||||
|
|
||||||
REG_NOTEMPTY
|
REG_NOTEMPTY
|
||||||
|
|
||||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
|
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
|
||||||
matching function. Note that REG_NOTEMPTY is not part of the POSIX
|
matching function. Note that REG_NOTEMPTY is not part of the POSIX
|
||||||
standard. However, setting this option can give more POSIX-like behav-
|
standard. However, setting this option can give more POSIX-like behav-
|
||||||
iour in some situations.
|
iour in some situations.
|
||||||
|
|
||||||
REG_NOTEOL
|
REG_NOTEOL
|
||||||
|
@ -9160,66 +9201,66 @@ MATCHING A PATTERN
|
||||||
|
|
||||||
REG_STARTEND
|
REG_STARTEND
|
||||||
|
|
||||||
When this option is set, the subject string starts at string +
|
When this option is set, the subject string starts at string +
|
||||||
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
||||||
point to the first character beyond the string. There may be binary
|
point to the first character beyond the string. There may be binary
|
||||||
zeroes within the subject string, and indeed, using REG_STARTEND is the
|
zeroes within the subject string, and indeed, using REG_STARTEND is the
|
||||||
only way to pass a subject string that contains a binary zero.
|
only way to pass a subject string that contains a binary zero.
|
||||||
|
|
||||||
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
||||||
string and any captured substrings are still given relative to the
|
string and any captured substrings are still given relative to the
|
||||||
start of string itself. (Before PCRE2 release 10.30 these were given
|
start of string itself. (Before PCRE2 release 10.30 these were given
|
||||||
relative to string + pmatch[0].rm_so, but this differs from other
|
relative to string + pmatch[0].rm_so, but this differs from other
|
||||||
implementations.)
|
implementations.)
|
||||||
|
|
||||||
This is a BSD extension, compatible with but not specified by IEEE
|
This is a BSD extension, compatible with but not specified by IEEE
|
||||||
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||||
intended to be portable to other systems. Note that a non-zero rm_so
|
intended to be portable to other systems. Note that a non-zero rm_so
|
||||||
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
||||||
length of the string, not how it is matched. Setting REG_STARTEND and
|
length of the string, not how it is matched. Setting REG_STARTEND and
|
||||||
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||||
returned.
|
returned.
|
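The REG_STARTEND convention described above can be sketched as follows. This is an illustration only, assuming a regex.h (or pcre2posix.h) that defines REG_STARTEND and reports offsets relative to the start of string, as PCRE2 10.30 does; the helper name match_window is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match `pattern` against the bytes [start, end) of
   `string` and report the match offsets, which are relative to `string`
   itself and not to string + start. Returns the regcomp() or regexec()
   result code (zero on success). */
int match_window(const char *pattern, const char *string,
                 regoff_t start, regoff_t end,
                 regoff_t *match_so, regoff_t *match_eo)
{
    regex_t preg;
    regmatch_t pmatch[1];

    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0) return rc;      /* preg must not be used after a failure */

    /* On input, pmatch[0] delimits the subject; pmatch must not be NULL. */
    pmatch[0].rm_so = start;
    pmatch[0].rm_eo = end;

    rc = regexec(&preg, string, 1, pmatch, REG_STARTEND);
    if (rc == 0) {
        /* On output, pmatch[0] holds the offsets of the matched string. */
        *match_so = pmatch[0].rm_so;
        *match_eo = pmatch[0].rm_eo;
    }
    regfree(&preg);
    return rc;
}
```

The same pmatch[0] element serves as input (the window) and as output (the match), which is why REG_STARTEND cannot be combined with a NULL pmatch.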
||||||
|
|
||||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||||
matched strings is returned. The nmatch and pmatch arguments of
|
matched strings is returned. The nmatch and pmatch arguments of
|
||||||
regexec() are ignored (except possibly as input for REG_STARTEND).
|
regexec() are ignored (except possibly as input for REG_STARTEND).
|
||||||
|
|
||||||
The value of nmatch may be zero, and the value of pmatch may be NULL
|
The value of nmatch may be zero, and the value of pmatch may be NULL
|
||||||
(unless REG_STARTEND is set); in both these cases no data about any
|
(unless REG_STARTEND is set); in both these cases no data about any
|
||||||
matched strings is returned.
|
matched strings is returned.
|
||||||
|
|
||||||
Otherwise, the portion of the string that was matched, and also any
|
Otherwise, the portion of the string that was matched, and also any
|
||||||
captured substrings, are returned via the pmatch argument, which points
|
captured substrings, are returned via the pmatch argument, which points
|
||||||
to an array of nmatch structures of type regmatch_t, containing the
|
to an array of nmatch structures of type regmatch_t, containing the
|
||||||
members rm_so and rm_eo. These contain the byte offset to the first
|
members rm_so and rm_eo. These contain the byte offset to the first
|
||||||
character of each substring and the offset to the first character after
|
character of each substring and the offset to the first character after
|
||||||
the end of each substring, respectively. The 0th element of the vector
|
the end of each substring, respectively. The 0th element of the vector
|
||||||
relates to the entire portion of string that was matched; subsequent
|
relates to the entire portion of string that was matched; subsequent
|
||||||
elements relate to the capturing subpatterns of the regular expression.
|
elements relate to the capturing subpatterns of the regular expression.
|
||||||
Unused entries in the array have both structure members set to -1.
|
Unused entries in the array have both structure members set to -1.
|
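As a sketch of how the pmatch array is filled in, the following helper returns the offset pairs for a pattern with two capturing groups, only one of which can take part in any given match. The helper name capture_offsets and the fixed array size NMATCH are assumptions made for this illustration.

```c
#include <regex.h>
#include <stddef.h>

#define NMATCH 3   /* entry 0 for the whole match, plus two capturing groups */

/* Hypothetical helper: compile `pattern`, match it against `subject`, and
   store up to NMATCH offset pairs in `out`. Returns zero on a successful
   match, or the non-zero regcomp() or regexec() error code. */
int capture_offsets(const char *pattern, const char *subject,
                    regmatch_t out[NMATCH])
{
    regex_t preg;

    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0) return rc;

    rc = regexec(&preg, subject, NMATCH, out, 0);
    regfree(&preg);
    return rc;
}
```

For the pattern "(a)|(b)" and the subject "xb", entry 0 and entry 2 describe the matched "b", while entry 1 belongs to the group that did not participate and so has both members set to -1.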
||||||
|
|
||||||
A successful match yields a zero return; various error codes are
|
A successful match yields a zero return; various error codes are
|
||||||
defined in the header file, of which REG_NOMATCH is the "expected"
|
defined in the header file, of which REG_NOMATCH is the "expected"
|
||||||
failure code.
|
failure code.
|
||||||
|
|
||||||
|
|
||||||
ERROR MESSAGES
|
ERROR MESSAGES
|
||||||
|
|
||||||
The regerror() function maps a non-zero errorcode from either regcomp()
|
The regerror() function maps a non-zero errorcode from either regcomp()
|
||||||
or regexec() to a printable message. If preg is not NULL, the error
|
or regexec() to a printable message. If preg is not NULL, the error
|
||||||
should have arisen from the use of that structure. A message terminated
|
should have arisen from the use of that structure. A message terminated
|
||||||
by a binary zero is placed in errbuf. If the buffer is too short, only
|
by a binary zero is placed in errbuf. If the buffer is too short, only
|
||||||
the first errbuf_size - 1 characters of the error message are used. The
|
the first errbuf_size - 1 characters of the error message are used. The
|
||||||
yield of the function is the size of buffer needed to hold the whole
|
yield of the function is the size of buffer needed to hold the whole
|
||||||
message, including the terminating zero. This value is greater than
|
message, including the terminating zero. This value is greater than
|
||||||
errbuf_size if the message was truncated.
|
errbuf_size if the message was truncated.
|
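The return value of regerror() can be used to size a buffer for the complete message. The following is a minimal sketch of that idea; the helper name full_error_message and the deliberately small first buffer are assumptions chosen for the illustration.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the complete message for
   `errcode`, or NULL if memory cannot be allocated. The caller frees it. */
char *full_error_message(int errcode, const regex_t *preg)
{
    char small[8];

    /* The first call may truncate the message, but its return value is the
       buffer size needed for the whole message, terminating zero included. */
    size_t needed = regerror(errcode, preg, small, sizeof small);

    char *buf = malloc(needed);
    if (buf != NULL) {
        /* The second call has room for the complete message. */
        regerror(errcode, preg, buf, needed);
    }
    return buf;
}
```

A typical use is to call this helper with the non-zero value returned by regcomp() and the same preg, print the message, and then free it. When the first return value does not exceed the size of the small buffer, the message was not truncated and the second call simply repeats it.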
||||||
|
|
||||||
|
|
||||||
MEMORY USAGE
|
MEMORY USAGE
|
||||||
|
|
||||||
Compiling a regular expression causes memory to be allocated and asso-
|
Compiling a regular expression causes memory to be allocated and asso-
|
||||||
ciated with the preg structure. The function regfree() frees all such
|
ciated with the preg structure. The function regfree() frees all such
|
||||||
memory, after which preg may no longer be used as a compiled expres-
|
memory, after which preg may no longer be used as a compiled expres-
|
||||||
sion.
|
sion.
|
||||||
|
|
||||||
|
|
||||||
|
@ -9232,7 +9273,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 05 June 2017
|
Last updated: 15 June 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_COMPILE 3 "17 May 2017" "PCRE2 10.30"
|
.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -35,7 +35,7 @@ system stack size checking, or to change one or more of these parameters:
|
||||||
The newline character sequence;
|
The newline character sequence;
|
||||||
The compile time nested parentheses limit;
|
The compile time nested parentheses limit;
|
||||||
The maximum pattern length (in code units) that is allowed.
|
The maximum pattern length (in code units) that is allowed.
|
||||||
The additional options bits
|
The additional options bits (see pcre2_set_compile_extra_options())
|
||||||
.sp
|
.sp
|
||||||
The option bits are:
|
The option bits are:
|
||||||
.sp
|
.sp
|
||||||
|
@ -52,6 +52,7 @@ The option bits are:
|
||||||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||||
PCRE2_EXTENDED Ignore white space and # comments
|
PCRE2_EXTENDED Ignore white space and # comments
|
||||||
PCRE2_FIRSTLINE Force matching to be before newline
|
PCRE2_FIRSTLINE Force matching to be before newline
|
||||||
|
PCRE2_LITERAL Pattern characters are all literal
|
||||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30"
|
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -24,6 +24,8 @@ options are:
|
||||||
.\" JOIN
|
.\" JOIN
|
||||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
|
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
|
||||||
a literal following character
|
a literal following character
|
||||||
|
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||||
|
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||||
.sp
|
.sp
|
||||||
There is a complete description of the PCRE2 native API in the
|
There is a complete description of the PCRE2 native API in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
|
|
|
@ -64,12 +64,12 @@ INPUT ENCODING
|
||||||
unless you really want that action.
|
unless you really want that action.
|
||||||
|
|
||||||
The input is processed using C's string functions, so must not
|
The input is processed using C's string functions, so must not
|
||||||
contain binary zeroes, even though in Unix-like environments, fgets()
|
contain binary zeros, even though in Unix-like environments, fgets()
|
||||||
treats any bytes other than newline as data characters. An error is
|
treats any bytes other than newline as data characters. An error is
|
||||||
generated if a binary zero is encountered. Subject lines are processed
|
generated if a binary zero is encountered. By default subject lines are
|
||||||
for backslash escapes, which makes it possible to include any data
|
processed for backslash escapes, which makes it possible to include any
|
||||||
value in strings that are passed to the library for matching. For pat-
|
data value in strings that are passed to the library for matching. For
|
||||||
terns, there is a facility for specifying some or all of the 8-bit
|
patterns, there is a facility for specifying some or all of the 8-bit
|
||||||
input characters as hexadecimal pairs, which makes it possible to
|
input characters as hexadecimal pairs, which makes it possible to
|
||||||
include binary zeros.
|
include binary zeros.
|
||||||
|
|
||||||
|
@ -319,9 +319,9 @@ COMMAND LINES
|
||||||
|
|
||||||
When the POSIX API is being tested there is no way to override the
|
When the POSIX API is being tested there is no way to override the
|
||||||
default newline convention, though it is possible to set the newline
|
default newline convention, though it is possible to set the newline
|
||||||
convention from within the pattern. A warning is given if the posix
|
convention from within the pattern. A warning is given if the posix or
|
||||||
modifier is used when #newline_default would set a default for the non-
|
posix_nosub modifier is used when #newline_default would set a default
|
||||||
POSIX API.
|
for the non-POSIX API.
|
||||||
|
|
||||||
#pattern <modifier-list>
|
#pattern <modifier-list>
|
||||||
|
|
||||||
|
@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
|
||||||
|
|
||||||
Before each subject line is passed to pcre2_match() or
|
Before each subject line is passed to pcre2_match() or
|
||||||
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
||||||
line is scanned for backslash escapes. The following provide a means of
|
line is scanned for backslash escapes, unless the subject_literal modi-
|
||||||
encoding non-printing characters in a visible way:
|
fier was set for the pattern. The following provide a means of encoding
|
||||||
|
non-printing characters in a visible way:
|
||||||
|
|
||||||
\a alarm (BEL, \x07)
|
\a alarm (BEL, \x07)
|
||||||
\b backspace (\x08)
|
\b backspace (\x08)
|
||||||
|
@ -442,23 +443,23 @@ SUBJECT LINE SYNTAX
|
||||||
\x{hh...} hexadecimal character (any number of hex digits)
|
\x{hh...} hexadecimal character (any number of hex digits)
|
||||||
|
|
||||||
The use of \x{hh...} is not dependent on the use of the utf modifier on
|
The use of \x{hh...} is not dependent on the use of the utf modifier on
|
||||||
the pattern. It is recognized always. There may be any number of hexa-
|
the pattern. It is recognized always. There may be any number of hexa-
|
||||||
decimal digits inside the braces; invalid values provoke error mes-
|
decimal digits inside the braces; invalid values provoke error mes-
|
||||||
sages.
|
sages.
|
||||||
|
|
||||||
Note that \xhh specifies one byte rather than one character in UTF-8
|
Note that \xhh specifies one byte rather than one character in UTF-8
|
||||||
mode; this makes it possible to construct invalid UTF-8 sequences for
|
mode; this makes it possible to construct invalid UTF-8 sequences for
|
||||||
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
||||||
character in UTF-8 mode, generating more than one byte if the value is
|
character in UTF-8 mode, generating more than one byte if the value is
|
||||||
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
||||||
\x{hh} generates one byte for values less than 256, and causes an error
|
\x{hh} generates one byte for values less than 256, and causes an error
|
||||||
for greater values.
|
for greater values.
|
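To show why \x{hh} can generate more than one byte in UTF-8 mode, here is a sketch of the standard UTF-8 encoding rule. It is not pcre2test's own code; the helper name utf8_encode is hypothetical, and surrogate values are deliberately not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the code point `cp` as UTF-8 into `out`, which
   must have room for 4 bytes. Returns the number of bytes written, or 0 if
   cp is greater than 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* values up to 127: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this rule a value such as \x{e9} becomes the two bytes C3 A9 in UTF-8 mode, whereas \xe9 specifies the single byte E9, which on its own is not a valid UTF-8 sequence.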
||||||
|
|
||||||
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
|
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
|
||||||
possible to construct invalid UTF-16 sequences for testing purposes.
|
possible to construct invalid UTF-16 sequences for testing purposes.
|
||||||
|
|
||||||
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
||||||
makes it possible to construct invalid UTF-32 sequences for testing
|
makes it possible to construct invalid UTF-32 sequences for testing
|
||||||
purposes.
|
purposes.
|
||||||
|
|
||||||
There is a special backslash sequence that specifies replication of one
|
There is a special backslash sequence that specifies replication of one
|
||||||
|
@ -466,33 +467,38 @@ SUBJECT LINE SYNTAX
|
||||||
|
|
||||||
\[<characters>]{<count>}
|
\[<characters>]{<count>}
|
||||||
|
|
||||||
This makes it possible to test long strings without having to provide
|
This makes it possible to test long strings without having to provide
|
||||||
them as part of the file. For example:
|
them as part of the file. For example:
|
||||||
|
|
||||||
\[abc]{4}
|
\[abc]{4}
|
||||||
|
|
||||||
is converted to "abcabcabcabc". This feature does not support nesting.
|
is converted to "abcabcabcabc". This feature does not support nesting.
|
||||||
To include a closing square bracket in the characters, code it as \x5D.
|
To include a closing square bracket in the characters, code it as \x5D.
|
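The replication syntax above can be modelled with a small expansion routine. This is a sketch of the described behaviour, not pcre2test's implementation; the helper name expand_replication is hypothetical, and other backslash escapes (including \x5D) are copied through unchanged, not interpreted.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expand every \[chars]{count} in `in` into `out`.
   Nesting is not supported. Returns the length of the output, or -1 if the
   input is malformed or `out` (of size `outsize`) is too small. */
long expand_replication(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0) return -1;
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == '[') {
            const char *chars = in + 2;
            const char *close = strchr(chars, ']');
            if (close == NULL || close[1] != '{') return -1;

            const char *p = close + 2;
            if (!isdigit((unsigned char)*p)) return -1;
            size_t count = 0;
            while (isdigit((unsigned char)*p)) {
                count = count * 10 + (size_t)(*p - '0');
                p++;
            }
            if (*p != '}') return -1;

            size_t len = (size_t)(close - chars);
            for (size_t i = 0; i < count; i++) {
                /* Always leave room for the terminating zero. */
                if (len >= outsize - n) return -1;
                memcpy(out + n, chars, len);
                n += len;
            }
            in = p + 1;
        } else {
            if (n + 1 >= outsize) return -1;
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return (long)n;
}
```

The sketch stops at the first closing square bracket after \[, which is why a literal closing bracket inside the characters has to be written in another form such as \x5D.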
||||||
|
|
||||||
A backslash followed by an equals sign marks the end of the subject
|
A backslash followed by an equals sign marks the end of the subject
|
||||||
string and the start of a modifier list. For example:
|
string and the start of a modifier list. For example:
|
||||||
|
|
||||||
abc\=notbol,notempty
|
abc\=notbol,notempty
|
||||||
|
|
||||||
If the subject string is empty and \= is followed by whitespace, the
|
If the subject string is empty and \= is followed by whitespace, the
|
||||||
line is treated as a comment line, and is not used for matching. For
|
line is treated as a comment line, and is not used for matching. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
\= This is a comment.
|
\= This is a comment.
|
||||||
abc\= This is an invalid modifier list.
|
abc\= This is an invalid modifier list.
|
||||||
|
|
||||||
A backslash followed by any other non-alphanumeric character just
|
A backslash followed by any other non-alphanumeric character just
|
||||||
escapes that character. A backslash followed by anything else causes an
|
escapes that character. A backslash followed by anything else causes an
|
||||||
error. However, if the very last character in the line is a backslash
|
error. However, if the very last character in the line is a backslash
|
||||||
(and there is no modifier list), it is ignored. This gives a way of
|
(and there is no modifier list), it is ignored. This gives a way of
|
||||||
passing an empty line as data, since a real empty line terminates the
|
passing an empty line as data, since a real empty line terminates the
|
||||||
data input.
|
data input.
|
||||||
|
|
||||||
|
If the subject_literal modifier is set for a pattern, all subject lines
|
||||||
|
that follow are treated as literals, with no special treatment of back-
|
||||||
|
slashes. No replication is possible, and any subject modifiers must be
|
||||||
|
set as defaults by a #subject command.
|
||||||
|
|
||||||
|
|
||||||
PATTERN MODIFIERS
|
PATTERN MODIFIERS
|
||||||
|
|
||||||
|
@ -530,7 +536,10 @@ PATTERN MODIFIERS
|
||||||
/x extended set PCRE2_EXTENDED
|
/x extended set PCRE2_EXTENDED
|
||||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||||
firstline set PCRE2_FIRSTLINE
|
firstline set PCRE2_FIRSTLINE
|
||||||
|
literal set PCRE2_LITERAL
|
||||||
|
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||||
|
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||||
/m multiline set PCRE2_MULTILINE
|
/m multiline set PCRE2_MULTILINE
|
||||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||||
never_ucp set PCRE2_NEVER_UCP
|
never_ucp set PCRE2_NEVER_UCP
|
||||||
|
@ -580,6 +589,7 @@ PATTERN MODIFIERS
|
||||||
push push compiled pattern onto the stack
|
push push compiled pattern onto the stack
|
||||||
pushcopy push a copy onto the stack
|
pushcopy push a copy onto the stack
|
||||||
stackguard=<number> test the stackguard feature
|
stackguard=<number> test the stackguard feature
|
||||||
|
subject_literal treat all subject lines as literal
|
||||||
tables=[0|1|2] select internal tables
|
tables=[0|1|2] select internal tables
|
||||||
use_length do not zero-terminate the pattern
|
use_length do not zero-terminate the pattern
|
||||||
utf8_input treat input as UTF-8
|
utf8_input treat input as UTF-8
|
||||||
|
@ -659,16 +669,6 @@ PATTERN MODIFIERS
|
||||||
testing that pcre2_compile() behaves correctly in this case (it uses
|
testing that pcre2_compile() behaves correctly in this case (it uses
|
||||||
default values).
|
default values).
|
||||||
|
|
||||||
Specifying the pattern's length
|
|
||||||
|
|
||||||
By default, patterns are passed to the compiling functions as zero-ter-
|
|
||||||
minated strings. When using the POSIX wrapper API, there is no other
|
|
||||||
option. However, when using PCRE2's native API, patterns can be passed
|
|
||||||
by length instead of being zero-terminated. The use_length modifier
|
|
||||||
causes this to happen. Using a length happens automatically (whether
|
|
||||||
or not use_length is set) when hex is set, because patterns specified
|
|
||||||
in hexadecimal may contain binary zeros.
|
|
||||||
|
|
||||||
Specifying pattern characters in hexadecimal
|
Specifying pattern characters in hexadecimal
|
||||||
|
|
||||||
The hex modifier specifies that the characters of the pattern, except
|
The hex modifier specifies that the characters of the pattern, except
|
||||||
|
@ -690,61 +690,68 @@ PATTERN MODIFIERS
|
||||||
ing the delimiter within a substring. The hex and expand modifiers are
|
ing the delimiter within a substring. The hex and expand modifiers are
|
||||||
mutually exclusive.
|
mutually exclusive.
|
||||||
|
|
||||||
The POSIX API cannot be used with patterns specified in hexadecimal
|
Specifying the pattern's length
|
||||||
because they may contain binary zeros, which conflicts with regcomp()'s
|
|
||||||
requirement for a zero-terminated string. Such patterns are always
|
By default, patterns are passed to the compiling functions as zero-ter-
|
||||||
passed to pcre2_compile() as a string with a length, not as zero-termi-
|
minated strings but can be passed by length instead of being zero-ter-
|
||||||
nated.
|
minated. The use_length modifier causes this to happen. Using a length
|
||||||
|
happens automatically (whether or not use_length is set) when hex is
|
||||||
|
set, because patterns specified in hexadecimal may contain binary
|
||||||
|
zeros.
|
||||||
|
|
||||||
|
If hex or use_length is used with the POSIX wrapper API (see "Using the
|
||||||
|
POSIX wrapper API" below), the REG_PEND extension is used to pass the
|
||||||
|
pattern's length.
|
||||||
|
|
||||||
Specifying wide characters in 16-bit and 32-bit modes
|
Specifying wide characters in 16-bit and 32-bit modes
|
||||||
|
|
||||||
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
|
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
|
||||||
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
||||||
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
|
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
|
||||||
modifier can be used. It is mutually exclusive with utf. Input lines
|
modifier can be used. It is mutually exclusive with utf. Input lines
|
||||||
are interpreted as UTF-8 as a means of specifying wide characters. More
|
are interpreted as UTF-8 as a means of specifying wide characters. More
|
||||||
details are given in "Input encoding" above.
|
details are given in "Input encoding" above.
|
||||||
|
|
||||||
Generating long repetitive patterns
|
Generating long repetitive patterns
|
||||||
|
|
||||||
Some tests use long patterns that are very repetitive. Instead of cre-
|
Some tests use long patterns that are very repetitive. Instead of cre-
|
||||||
ating a very long input line for such a pattern, you can use a special
|
ating a very long input line for such a pattern, you can use a special
|
||||||
repetition feature, similar to the one described for subject lines
|
repetition feature, similar to the one described for subject lines
|
||||||
above. If the expand modifier is present on a pattern, parts of the
|
above. If the expand modifier is present on a pattern, parts of the
|
||||||
pattern that have the form
|
pattern that have the form
|
||||||
|
|
||||||
\[<characters>]{<count>}
|
\[<characters>]{<count>}
|
||||||
|
|
||||||
are expanded before the pattern is passed to pcre2_compile(). For exam-
|
are expanded before the pattern is passed to pcre2_compile(). For exam-
|
||||||
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||||
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
||||||
followed by decimal digits and "}" is found later in the pattern. If
|
followed by decimal digits and "}" is found later in the pattern. If
|
||||||
not, the characters remain in the pattern unaltered. The expand and hex
|
not, the characters remain in the pattern unaltered. The expand and hex
|
||||||
modifiers are mutually exclusive.
|
modifiers are mutually exclusive.
|
||||||
|
|
||||||
If part of an expanded pattern looks like an expansion, but is really
|
If part of an expanded pattern looks like an expansion, but is really
|
||||||
part of the actual pattern, unwanted expansion can be avoided by giving
|
part of the actual pattern, unwanted expansion can be avoided by giving
|
||||||
two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
|
two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
|
||||||
ognized as an expansion item.
|
ognized as an expansion item.
|
||||||
|
|
||||||
If the info modifier is set on an expanded pattern, the result of the
|
If the info modifier is set on an expanded pattern, the result of the
|
||||||
expansion is included in the information that is output.
|
expansion is included in the information that is output.
|
||||||
|
|
||||||
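As a rough illustration of the expansion rule described above, the following Python sketch mimics how an item of the form \[<characters>]{<count>} might be expanded. It is not pcre2test's own code: the name expand_pattern and the regular expression used to locate items are assumptions made for this example, and edge cases may be handled differently by the real program.

```python
import re

# Hypothetical helper, not part of pcre2test. It approximates the effect of
# the "expand" modifier by repeating <characters> exactly <count> times.
# An item is recognized only when "]{", decimal digits, and "}" follow "\[".
_ITEM = re.compile(r"\\\[(.+?)\]\{(\d+)\}")

def expand_pattern(pattern: str) -> str:
    # Text that does not form a complete item is left unaltered.
    return _ITEM.sub(lambda m: m.group(1) * int(m.group(2)), pattern)

print(expand_pattern(r"x\[AB]{3}y"))
print(expand_pattern(r"\[AB]{6000,6000}"))
```

In this sketch a quantifier with two values, such as {6000,6000}, does not match the item syntax, so the text is passed through unchanged, which mirrors the way of avoiding unwanted expansion described above.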
JIT compilation
|
JIT compilation
|
||||||
|
|
||||||
Just-in-time (JIT) compiling is a heavyweight optimization that can
|
Just-in-time (JIT) compiling is a heavyweight optimization that can
|
||||||
greatly speed up pattern matching. See the pcre2jit documentation for
|
greatly speed up pattern matching. See the pcre2jit documentation for
|
||||||
details. JIT compiling happens, optionally, after a pattern has been
|
details. JIT compiling happens, optionally, after a pattern has been
|
||||||
successfully compiled into an internal form. The JIT compiler converts
|
successfully compiled into an internal form. The JIT compiler converts
|
||||||
this to optimized machine code. It needs to know whether the match-time
|
this to optimized machine code. It needs to know whether the match-time
|
||||||
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
|
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
|
||||||
because different code is generated for the different cases. See the
|
because different code is generated for the different cases. See the
|
||||||
partial modifier in "Subject Modifiers" below for details of how these
|
partial modifier in "Subject Modifiers" below for details of how these
|
||||||
options are specified for each match attempt.
|
options are specified for each match attempt.
|
||||||
|
|
||||||
JIT compilation is requested by the /jit pattern modifier, which may
|
JIT compilation is requested by the jit pattern modifier, which may
|
||||||
optionally be followed by an equals sign and a number in the range 0 to
|
optionally be followed by an equals sign and a number in the range 0 to
|
||||||
7. The three bits that make up the number specify which of the three
|
7. The three bits that make up the number specify which of the three
|
||||||
JIT operating modes are to be compiled:
|
JIT operating modes are to be compiled:
|
||||||
|
|
||||||
1 compile JIT code for non-partial matching
|
1 compile JIT code for non-partial matching
|
||||||
|
@ -761,31 +768,31 @@ PATTERN MODIFIERS
|
||||||
6 soft and hard partial matching only
|
6 soft and hard partial matching only
|
||||||
7 all three modes
|
7 all three modes
|
||||||
|
|
||||||
If no number is given, 7 is assumed. The phrase "partial matching"
|
If no number is given, 7 is assumed. The phrase "partial matching"
|
||||||
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
|
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
|
||||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
|
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
|
||||||
plete match; the options enable the possibility of a partial match, but
|
plete match; the options enable the possibility of a partial match, but
|
||||||
do not require it. Note also that if you request JIT compilation only
|
do not require it. Note also that if you request JIT compilation only
|
||||||
for partial matching (for example, /jit=2) but do not set the partial
|
for partial matching (for example, jit=2) but do not set the partial
|
||||||
modifier on a subject line, that match will not use JIT code because
|
modifier on a subject line, that match will not use JIT code because
|
||||||
none was compiled for non-partial matching.
|
none was compiled for non-partial matching.
|
||||||
|
|
||||||
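The way the three bits of the jit value select operating modes can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are invented for this example, and it assumes that bit value 2 selects soft partial matching and bit value 4 selects hard partial matching, which the excerpt above does not show explicitly.

```python
# Illustrative only; these names do not exist in pcre2test.
# Assumed bit assignment: 1 = non-partial, 2 = soft partial, 4 = hard partial.
JIT_MODES = {1: "non-partial", 2: "soft partial", 4: "hard partial"}

def jit_modes(value=7):
    """Return the modes compiled for jit=<value>; 7 is assumed if omitted."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if value & bit]

def uses_jit(value, partial=None):
    """Report whether a match can use JIT code compiled with jit=<value>.

    partial is None for a non-partial match, or "soft" or "hard".
    """
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(value & needed)

print(jit_modes(6))
print(uses_jit(2))
```

Under these assumptions, uses_jit(2) is false because jit=2 compiles code for soft partial matching only, so a match attempted without a partial modifier falls back to the interpreter.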
If JIT compilation is successful, the compiled JIT code will automati-
|
If JIT compilation is successful, the compiled JIT code will automati-
|
||||||
cally be used when an appropriate type of match is run, except when
|
cally be used when an appropriate type of match is run, except when
|
||||||
incompatible run-time options are specified. For more details, see the
|
incompatible run-time options are specified. For more details, see the
|
||||||
pcre2jit documentation. See also the jitstack modifier below for a way
|
pcre2jit documentation. See also the jitstack modifier below for a way
|
||||||
of setting the size of the JIT stack.
|
of setting the size of the JIT stack.
|
||||||
|
|
||||||
If the jitfast modifier is specified, matching is done using the JIT
|
If the jitfast modifier is specified, matching is done using the JIT
|
||||||
"fast path" interface, pcre2_jit_match(), which skips some of the san-
|
"fast path" interface, pcre2_jit_match(), which skips some of the san-
|
||||||
ity checks that are done by pcre2_match(), and of course does not work
|
ity checks that are done by pcre2_match(), and of course does not work
|
||||||
when JIT is not supported. If jitfast is specified without jit, jit=7
|
when JIT is not supported. If jitfast is specified without jit, jit=7
|
||||||
is assumed.
|
is assumed.
|
||||||
|
|
||||||
If the jitverify modifier is specified, information about the compiled
|
If the jitverify modifier is specified, information about the compiled
|
||||||
pattern shows whether JIT compilation was or was not successful. If
|
pattern shows whether JIT compilation was or was not successful. If
|
||||||
jitverify is specified without jit, jit=7 is assumed. If JIT compila-
|
jitverify is specified without jit, jit=7 is assumed. If JIT compila-
|
||||||
tion is successful when jitverify is set, the text "(JIT)" is added to
|
tion is successful when jitverify is set, the text "(JIT)" is added to
|
||||||
the first output line after a match or non match when JIT-compiled code
|
the first output line after a match or non match when JIT-compiled code
|
||||||
was actually used in the match.
|
was actually used in the match.
|
||||||
|
|
||||||
|
@ -796,19 +803,19 @@ PATTERN MODIFIERS
|
||||||
/pattern/locale=fr_FR
|
/pattern/locale=fr_FR
|
||||||
|
|
||||||
The given locale is set, pcre2_maketables() is called to build a set of
|
The given locale is set, pcre2_maketables() is called to build a set of
|
||||||
character tables for the locale, and this is then passed to pcre2_com-
|
character tables for the locale, and this is then passed to pcre2_com-
|
||||||
pile() when compiling the regular expression. The same tables are used
|
pile() when compiling the regular expression. The same tables are used
|
||||||
when matching the following subject lines. The locale modifier applies
|
when matching the following subject lines. The locale modifier applies
|
||||||
only to the pattern on which it appears, but can be given in a #pattern
|
only to the pattern on which it appears, but can be given in a #pattern
|
||||||
command if a default is needed. Setting a locale and alternate charac-
|
command if a default is needed. Setting a locale and alternate charac-
|
||||||
ter tables are mutually exclusive.
|
ter tables are mutually exclusive.
|
||||||
|
|
||||||
Showing pattern memory
|
Showing pattern memory
|
||||||
|
|
||||||
The memory modifier causes the size in bytes of the memory used to hold
|
The memory modifier causes the size in bytes of the memory used to hold
|
||||||
the compiled pattern to be output. This does not include the size of
|
the compiled pattern to be output. This does not include the size of
|
||||||
the pcre2_code block; it is just the actual compiled data. If the pat-
|
the pcre2_code block; it is just the actual compiled data. If the pat-
|
||||||
tern is subsequently passed to the JIT compiler, the size of the JIT
|
tern is subsequently passed to the JIT compiler, the size of the JIT
|
||||||
compiled code is also output. Here is an example:
|
compiled code is also output. Here is an example:
|
||||||
|
|
||||||
re> /a(b)c/jit,memory
|
re> /a(b)c/jit,memory
|
||||||
|
@ -818,27 +825,27 @@ PATTERN MODIFIERS
|
||||||
|
|
||||||
Limiting nested parentheses
|
Limiting nested parentheses
|
||||||
|
|
||||||
The parens_nest_limit modifier sets a limit on the depth of nested
|
The parens_nest_limit modifier sets a limit on the depth of nested
|
||||||
parentheses in a pattern. Breaching the limit causes a compilation
|
parentheses in a pattern. Breaching the limit causes a compilation
|
||||||
error. The default for the library is set when PCRE2 is built, but
|
error. The default for the library is set when PCRE2 is built, but
|
||||||
pcre2test sets its own default of 220, which is required for running
|
pcre2test sets its own default of 220, which is required for running
|
||||||
the standard test suite.
|
the standard test suite.
|
||||||
|
|
||||||
Limiting the pattern length
|
Limiting the pattern length
|
||||||
|
|
||||||
The max_pattern_length modifier sets a limit, in code units, to the
|
The max_pattern_length modifier sets a limit, in code units, to the
|
||||||
length of pattern that pcre2_compile() will accept. Breaching the limit
|
length of pattern that pcre2_compile() will accept. Breaching the limit
|
||||||
causes a compilation error. The default is the largest number a
|
causes a compilation error. The default is the largest number a
|
||||||
PCRE2_SIZE variable can hold (essentially unlimited).
|
PCRE2_SIZE variable can hold (essentially unlimited).
|
||||||
|
|
||||||
Using the POSIX wrapper API
|
Using the POSIX wrapper API
|
||||||
|
|
||||||
The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
||||||
the POSIX wrapper API rather than its native API. When posix_nosub is
|
the POSIX wrapper API rather than its native API. When posix_nosub is
|
||||||
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
|
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
|
||||||
wrapper supports only the 8-bit library. Note that it does not imply
|
wrapper supports only the 8-bit library. Note that it does not imply
|
||||||
POSIX matching semantics; for more detail see the pcre2posix documenta-
|
POSIX matching semantics; for more detail see the pcre2posix documenta-
|
||||||
tion. The following pattern modifiers set options for the regcomp()
|
tion. The following pattern modifiers set options for the regcomp()
|
||||||
function:
|
function:
|
||||||
|
|
||||||
caseless REG_ICASE
|
caseless REG_ICASE
|
||||||
|
@ -848,35 +855,39 @@ PATTERN MODIFIERS
|
||||||
ucp REG_UCP ) the POSIX standard
|
ucp REG_UCP ) the POSIX standard
|
||||||
utf REG_UTF8 )
|
utf REG_UTF8 )
|
||||||
|
|
||||||
The regerror_buffsize modifier specifies a size for the error buffer
|
The regerror_buffsize modifier specifies a size for the error buffer
|
||||||
that is passed to regerror() in the event of a compilation error. For
|
that is passed to regerror() in the event of a compilation error. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
/abc/posix,regerror_buffsize=20
|
/abc/posix,regerror_buffsize=20
|
||||||
|
|
||||||
This provides a means of testing the behaviour of regerror() when the
|
This provides a means of testing the behaviour of regerror() when the
|
||||||
buffer is too small for the error message. If this modifier has not
|
buffer is too small for the error message. If this modifier has not
|
||||||
been set, a large buffer is used.
|
been set, a large buffer is used.
|
||||||
|
|
||||||
The aftertext and allaftertext subject modifiers work as described
|
The aftertext and allaftertext subject modifiers work as described
|
||||||
below. All other modifiers are either ignored, with a warning message,
|
below. All other modifiers are either ignored, with a warning message,
|
||||||
or cause an error.
|
or cause an error.
|
||||||
|
|
||||||
|
The pattern is passed to regcomp() as a zero-terminated string by
|
||||||
|
default, but if the use_length or hex modifiers are set, the REG_PEND
|
||||||
|
extension is used to pass it by length.
|
||||||
|
|
||||||
Testing the stack guard feature
|
Testing the stack guard feature
|
||||||
|
|
||||||
The stackguard modifier is used to test the use of pcre2_set_com-
|
The stackguard modifier is used to test the use of pcre2_set_com-
|
||||||
pile_recursion_guard(), a function that is provided to enable stack
|
pile_recursion_guard(), a function that is provided to enable stack
|
||||||
availability to be checked during compilation (see the pcre2api docu-
|
availability to be checked during compilation (see the pcre2api docu-
|
||||||
mentation for details). If the number specified by the modifier is
|
mentation for details). If the number specified by the modifier is
|
||||||
greater than zero, pcre2_set_compile_recursion_guard() is called to set
|
greater than zero, pcre2_set_compile_recursion_guard() is called to set
|
||||||
up a callback from pcre2_compile() to a local function. The argument it
|
up a callback from pcre2_compile() to a local function. The argument it
|
||||||
receives is the current nesting parenthesis depth; if this is greater
|
receives is the current nesting parenthesis depth; if this is greater
|
||||||
than the value given by the modifier, non-zero is returned, causing the
|
than the value given by the modifier, non-zero is returned, causing the
|
||||||
compilation to be aborted.
|
compilation to be aborted.
|
||||||
|
|
||||||
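The decision made by the guard callback can be modelled with a short Python sketch. The names make_guard and guard are hypothetical and stand in for the local function inside pcre2test; only the comparison against the modifier's value is taken from the description above.

```python
# Hypothetical model of the stackguard callback; not pcre2test's own code.
def make_guard(limit):
    """Return a guard for stackguard=<limit>, or None if no guard is set up."""
    if limit <= 0:
        # A guard is installed only when the number is greater than zero.
        return None

    def guard(depth):
        # depth is the current parenthesis nesting depth. A non-zero
        # result causes the compilation to be aborted.
        return 1 if depth > limit else 0

    return guard

guard = make_guard(3)
print(guard(3), guard(4))
```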
Using alternative character tables
|
Using alternative character tables
|
||||||
|
|
||||||
The value specified for the tables modifier must be one of the digits
|
The value specified for the tables modifier must be one of the digits
|
||||||
0, 1, or 2. It causes a specific set of built-in character tables to be
|
0, 1, or 2. It causes a specific set of built-in character tables to be
|
||||||
passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
|
passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
|
||||||
haviour with different character tables. The digit specifies the tables
|
haviour with different character tables. The digit specifies the tables
|
||||||
|
@ -887,23 +898,25 @@ PATTERN MODIFIERS
|
||||||
pcre2_chartables.c.dist
|
pcre2_chartables.c.dist
|
||||||
2 a set of tables defining ISO 8859 characters
|
2 a set of tables defining ISO 8859 characters
|
||||||
|
|
||||||
In table 2, some characters whose codes are greater than 128 are iden-
|
In table 2, some characters whose codes are greater than 128 are iden-
|
||||||
tified as letters, digits, spaces, etc. Setting alternate character
|
tified as letters, digits, spaces, etc. Setting alternate character
|
||||||
tables and a locale are mutually exclusive.
|
tables and a locale are mutually exclusive.
|
||||||
|
|
||||||
Setting certain match controls
|
Setting certain match controls
|
||||||
|
|
||||||
The following modifiers are really subject modifiers, and are described
|
The following modifiers are really subject modifiers, and are described
|
||||||
below. However, they may be included in a pattern's modifier list, in
|
under "Subject Modifiers" below. However, they may be included in a
|
||||||
which case they are applied to every subject line that is processed
|
pattern's modifier list, in which case they are applied to every sub-
|
||||||
with that pattern. They may not appear in #pattern commands. These mod-
|
ject line that is processed with that pattern. They may not appear in
|
||||||
ifiers do not affect the compilation process.
|
#pattern commands. These modifiers do not affect the compilation
|
||||||
|
process.
|
||||||
|
|
||||||
aftertext show text after match
|
aftertext show text after match
|
||||||
allaftertext show text after captures
|
allaftertext show text after captures
|
||||||
allcaptures show all captures
|
allcaptures show all captures
|
||||||
allusedtext show all consulted text
|
allusedtext show all consulted text
|
||||||
/g global global matching
|
/g global global matching
|
||||||
|
jitstack=<n> set size of JIT stack
|
||||||
mark show mark values
|
mark show mark values
|
||||||
replace=<string> specify a replacement string
|
replace=<string> specify a replacement string
|
||||||
startchar show starting character when relevant
|
startchar show starting character when relevant
|
||||||
|
@ -915,6 +928,14 @@ PATTERN MODIFIERS
|
||||||
These modifiers may not appear in a #pattern command. If you want them
|
These modifiers may not appear in a #pattern command. If you want them
|
||||||
as defaults, set them in a #subject command.
|
as defaults, set them in a #subject command.
|
||||||
|
|
||||||
|
Specifying literal subject lines
|
||||||
|
|
||||||
|
If the subject_literal modifier is present on a pattern, all the sub-
|
||||||
|
ject lines that it matches are taken as literal strings, with no inter-
|
||||||
|
pretation of backslashes. It is not possible to set subject modifiers
|
||||||
|
on such lines, but any that are set as defaults by a #subject command
|
||||||
|
are recognized.
|
||||||
|
|
||||||
Saving a compiled pattern
|
Saving a compiled pattern
|
||||||
|
|
||||||
When a pattern with the push modifier is successfully compiled, it is
|
When a pattern with the push modifier is successfully compiled, it is
|
||||||
|
@ -959,11 +980,11 @@ SUBJECT MODIFIERS
|
||||||
The partial matching modifiers are provided with abbreviations because
|
The partial matching modifiers are provided with abbreviations because
|
||||||
they appear frequently in tests.
|
they appear frequently in tests.
|
||||||
|
|
||||||
If the posix modifier was present on the pattern, causing the POSIX
|
If the posix or posix_nosub modifier was present on the pattern, caus-
|
||||||
wrapper API to be used, the only option-setting modifiers that have any
|
ing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||||
effect are notbol, notempty, and noteol, causing REG_NOTBOL,
|
that have any effect are notbol, notempty, and noteol, causing REG_NOT-
|
||||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
|
BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||||
The other modifiers are ignored, with a warning message.
|
regexec(). The other modifiers are ignored, with a warning message.
|
||||||
|
|
||||||
There is one additional modifier that can be used with the POSIX wrap-
|
There is one additional modifier that can be used with the POSIX wrap-
|
||||||
per. It is ignored (with a warning) if used for non-POSIX matching.
|
per. It is ignored (with a warning) if used for non-POSIX matching.
|
||||||
|
@ -971,16 +992,19 @@ SUBJECT MODIFIERS
|
||||||
posix_startend=<n>[:<m>]
|
posix_startend=<n>[:<m>]
|
||||||
|
|
||||||
This causes the subject string to be passed to regexec() using the
|
This causes the subject string to be passed to regexec() using the
|
||||||
REG_STARTEND option, which uses offsets to restrict which part of the
|
REG_STARTEND option, which uses offsets to specify which part of the
|
||||||
string is searched. If only one number is given, the end offset is
|
string is searched. If only one number is given, the end offset is
|
||||||
passed as the end of the subject string. For more detail of REG_STAR-
|
passed as the end of the subject string. For more detail of REG_STAR-
|
||||||
TEND, see the pcre2posix documentation.
|
TEND, see the pcre2posix documentation. If the subject string contains
|
||||||
|
binary zeros (coded as escapes such as \x{00} because pcre2test does
|
||||||
|
not support actual binary zeros in its input), you must use posix_star-
|
||||||
|
tend to specify its length.
|
||||||
|
|
||||||
Setting match controls
|
Setting match controls
|
||||||
|
|
||||||
The following modifiers affect the matching process or request addi-
|
The following modifiers affect the matching process or request addi-
|
||||||
tional information. Some of them may also be specified on a pattern
|
tional information. Some of them may also be specified on a pattern
|
||||||
line (see above), in which case they apply to every subject line that
|
line (see above), in which case they apply to every subject line that
|
||||||
is matched against that pattern.
|
is matched against that pattern.
|
||||||
|
|
||||||
aftertext show text after match
|
aftertext show text after match
|
||||||
|
@ -1020,29 +1044,29 @@ SUBJECT MODIFIERS
|
||||||
zero_terminate pass the subject as zero-terminated
|
zero_terminate pass the subject as zero-terminated
|
||||||
|
|
||||||
The effects of these modifiers are described in the following sections.
|
The effects of these modifiers are described in the following sections.
|
||||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||||
and ovector subject modifiers work as described below. All other modi-
|
and ovector subject modifiers work as described below. All other modi-
|
||||||
fiers are either ignored, with a warning message, or cause an error.
|
fiers are either ignored, with a warning message, or cause an error.
|
||||||
|
|
||||||
Showing more text
|
Showing more text
|
||||||
|
|
||||||
The aftertext modifier requests that as well as outputting the part of
|
The aftertext modifier requests that as well as outputting the part of
|
||||||
the subject string that matched the entire pattern, pcre2test should in
|
the subject string that matched the entire pattern, pcre2test should in
|
||||||
addition output the remainder of the subject string. This is useful for
|
addition output the remainder of the subject string. This is useful for
|
||||||
tests where the subject contains multiple copies of the same substring.
|
tests where the subject contains multiple copies of the same substring.
|
||||||
The allaftertext modifier requests the same action for captured sub-
|
The allaftertext modifier requests the same action for captured sub-
|
||||||
strings as well as the main matched substring. In each case the remain-
|
strings as well as the main matched substring. In each case the remain-
|
||||||
der is output on the following line with a plus character following the
|
der is output on the following line with a plus character following the
|
||||||
capture number.
|
capture number.
|
||||||
|
|
||||||
The allusedtext modifier requests that all the text that was consulted
|
The allusedtext modifier requests that all the text that was consulted
|
||||||
during a successful pattern match by the interpreter should be shown.
|
during a successful pattern match by the interpreter should be shown.
|
||||||
This feature is not supported for JIT matching, and if requested with
|
This feature is not supported for JIT matching, and if requested with
|
||||||
JIT it is ignored (with a warning message). Setting this modifier
|
JIT it is ignored (with a warning message). Setting this modifier
|
||||||
affects the output if there is a lookbehind at the start of a match, or
|
affects the output if there is a lookbehind at the start of a match, or
|
||||||
a lookahead at the end, or if \K is used in the pattern. Characters
|
a lookahead at the end, or if \K is used in the pattern. Characters
|
||||||
that precede or follow the start and end of the actual match are indi-
|
that precede or follow the start and end of the actual match are indi-
|
||||||
cated in the output by '<' or '>' characters underneath them. Here is
|
cated in the output by '<' or '>' characters underneath them. Here is
|
||||||
an example:
|
an example:
|
||||||
|
|
||||||
re> /(?<=pqr)abc(?=xyz)/
|
re> /(?<=pqr)abc(?=xyz)/
|
||||||
|
@ -1050,16 +1074,16 @@ SUBJECT MODIFIERS
|
||||||
0: pqrabcxyz
|
0: pqrabcxyz
|
||||||
<<< >>>
|
<<< >>>
|
||||||
|
|
||||||
This shows that the matched string is "abc", with the preceding and
|
This shows that the matched string is "abc", with the preceding and
|
||||||
following strings "pqr" and "xyz" having been consulted during the
|
following strings "pqr" and "xyz" having been consulted during the
|
||||||
match (when processing the assertions).
|
match (when processing the assertions).
|
||||||
|
|
||||||
The startchar modifier requests that the starting character for the
|
The startchar modifier requests that the starting character for the
|
||||||
match be indicated, if it is different to the start of the matched
|
match be indicated, if it is different to the start of the matched
|
||||||
string. The only time when this occurs is when \K has been processed as
|
string. The only time when this occurs is when \K has been processed as
|
||||||
part of the match. In this situation, the output for the matched string
|
part of the match. In this situation, the output for the matched string
|
||||||
is displayed from the starting character instead of from the match
|
is displayed from the starting character instead of from the match
|
||||||
point, with circumflex characters under the earlier characters. For
|
point, with circumflex characters under the earlier characters. For
|
||||||
example:
|
example:
|
||||||
|
|
||||||
re> /abc\Kxyz/
|
re> /abc\Kxyz/
|
||||||
|
@ -1067,7 +1091,7 @@ SUBJECT MODIFIERS
|
||||||
0: abcxyz
|
0: abcxyz
|
||||||
^^^
|
^^^
|
||||||
|
|
||||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||||
ever, these two modifiers are mutually exclusive.
|
ever, these two modifiers are mutually exclusive.
|
||||||
|
|
||||||
Showing the value of all capture groups
|
Showing the value of all capture groups
|
||||||
|
@ -1075,98 +1099,98 @@ SUBJECT MODIFIERS
|
||||||
The allcaptures modifier requests that the values of all potential cap-
|
The allcaptures modifier requests that the values of all potential cap-
|
||||||
tured parentheses be output after a match. By default, only those up to
|
tured parentheses be output after a match. By default, only those up to
|
||||||
the highest one actually used in the match are output (corresponding to
|
the highest one actually used in the match are output (corresponding to
|
||||||
the return code from pcre2_match()). Groups that did not take part in
|
the return code from pcre2_match()). Groups that did not take part in
|
||||||
the match are output as "<unset>". This modifier is not relevant for
|
the match are output as "<unset>". This modifier is not relevant for
|
||||||
DFA matching (which does no capturing); it is ignored, with a warning
|
DFA matching (which does no capturing); it is ignored, with a warning
|
||||||
message, if present.
|
message, if present.
|
||||||
|
|
||||||
Testing callouts
|
Testing callouts
|
||||||
|
|
||||||
A callout function is supplied when pcre2test calls the library match-
|
A callout function is supplied when pcre2test calls the library match-
|
||||||
ing functions, unless callout_none is specified. If callout_capture is
|
ing functions, unless callout_none is specified. If callout_capture is
|
||||||
set, the current captured groups are output when a callout occurs. The
|
set, the current captured groups are output when a callout occurs. The
|
||||||
default return from the callout function is zero, which allows matching
|
default return from the callout function is zero, which allows matching
|
||||||
to continue.
|
to continue.
|
||||||
|
|
||||||
The callout_fail modifier can be given one or two numbers. If there is
|
The callout_fail modifier can be given one or two numbers. If there is
|
||||||
only one number, 1 is returned instead of 0 (causing matching to back-
|
only one number, 1 is returned instead of 0 (causing matching to back-
|
||||||
track) when a callout of that number is reached. If two numbers
|
track) when a callout of that number is reached. If two numbers
|
||||||
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
||||||
there have been at least <m> callouts. The callout_error modifier is
|
there have been at least <m> callouts. The callout_error modifier is
|
||||||
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
||||||
entire matching process to be aborted. If both these modifiers are set
|
entire matching process to be aborted. If both these modifiers are set
|
||||||
for the same callout number, callout_error takes precedence.
|
for the same callout number, callout_error takes precedence.
|
||||||
|
|
||||||
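The rules for callout_fail and callout_error can be summarised in a small model. This is only a sketch of the behaviour described above, not pcre2test's own code: the function name, the (n, m) tuple form, the symbolic return values, and the choice to count the current callout in the total are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Symbolic stand-ins for the three outcomes described in the text.
CONTINUE = 0                    # default return: matching continues
BACKTRACK = 1                   # callout_fail: force a backtrack
ABORT = "PCRE2_ERROR_CALLOUT"   # callout_error: abort the whole match


def callout_return(number: int, count: int,
                   fail: Optional[Tuple[int, int]] = None,
                   error: Optional[Tuple[int, int]] = None):
    """Model the value returned for one callout (hypothetical helper).

    number -- number of the callout that has just been reached
    count  -- callouts seen so far, assumed to include this one
    fail, error -- (n, m) pairs; a single-number modifier is written (n, 0)
    """
    def triggered(spec):
        return spec is not None and number == spec[0] and count >= spec[1]

    if triggered(error):        # callout_error takes precedence
        return ABORT
    if triggered(fail):
        return BACKTRACK
    return CONTINUE
```

With fail=(2, 3), the model returns 1 only once callout 2 is reached and at least three callouts have occurred; setting both modifiers for the same number yields the error outcome.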
Note that callouts with string arguments are always given the number
|
Note that callouts with string arguments are always given the number
|
||||||
zero. See "Callouts" below for a description of the output when a
|
zero. See "Callouts" below for a description of the output when a
|
||||||
callout is taken.
|
callout is taken.
|
||||||
|
|
||||||
The callout_data modifier can be given an unsigned or a negative num-
|
The callout_data modifier can be given an unsigned or a negative num-
|
||||||
ber. This is set as the "user data" that is passed to the matching
|
ber. This is set as the "user data" that is passed to the matching
|
||||||
function, and passed back when the callout function is invoked. Any
|
function, and passed back when the callout function is invoked. Any
|
||||||
value other than zero is used as a return from pcre2test's callout
|
value other than zero is used as a return from pcre2test's callout
|
||||||
function.
|
function.
|
||||||
|
|
||||||
Finding all matches in a string
|
Finding all matches in a string
|
||||||
|
|
||||||
Searching for all possible matches within a subject can be requested by
|
Searching for all possible matches within a subject can be requested by
|
||||||
the global or altglobal modifier. After finding a match, the matching
|
the global or altglobal modifier. After finding a match, the matching
|
||||||
function is called again to search the remainder of the subject. The
|
function is called again to search the remainder of the subject. The
|
||||||
difference between global and altglobal is that the former uses the
|
difference between global and altglobal is that the former uses the
|
||||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||||
searching at a new point within the entire string (which is what Perl
|
searching at a new point within the entire string (which is what Perl
|
||||||
does), whereas the latter passes over a shortened subject. This makes a
|
does), whereas the latter passes over a shortened subject. This makes a
|
||||||
difference to the matching process if the pattern begins with a lookbe-
|
difference to the matching process if the pattern begins with a lookbe-
|
||||||
hind assertion (including \b or \B).
|
hind assertion (including \b or \B).
|
||||||
|
|
||||||
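The difference between searching from an offset and searching a shortened subject can be seen with any engine that offers both. The sketch below uses Python's re module purely as a stand-in (it is not PCRE2, and pcre2test is not involved): the pos argument of search() plays the role of start_offset, and slicing the string plays the role of the shortened subject that altglobal passes.

```python
import re

pattern = re.compile(r"(?<=a)b")   # pattern begins with a lookbehind
subject = "ab"

# Like global: keep the whole subject and start searching at offset 1.
# The lookbehind can still see the "a" that precedes the start point.
with_offset = pattern.search(subject, 1)

# Like altglobal: pass only the remainder of the subject.
# The "a" is no longer part of the string, so the lookbehind fails.
shortened = pattern.search(subject[1:])

print(with_offset is not None)   # True
print(shortened is not None)     # False
```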
If an empty string is matched, the next match is done with the
|
If an empty string is matched, the next match is done with the
|
||||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||||
for another, non-empty, match at the same point in the subject. If this
|
for another, non-empty, match at the same point in the subject. If this
|
||||||
match fails, the start offset is advanced, and the normal match is
|
match fails, the start offset is advanced, and the normal match is
|
||||||
retried. This imitates the way Perl handles such cases when using the
|
retried. This imitates the way Perl handles such cases when using the
|
||||||
/g modifier or the split() function. Normally, the start offset is
|
/g modifier or the split() function. Normally, the start offset is
|
||||||
advanced by one character, but if the newline convention recognizes
|
advanced by one character, but if the newline convention recognizes
|
||||||
CRLF as a newline, and the current character is CR followed by LF, an
|
CRLF as a newline, and the current character is CR followed by LF, an
|
||||||
advance of two characters occurs.
|
advance of two characters occurs.
|
||||||
|
|
||||||
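The advance that follows a failed retry after an empty match can be sketched as follows. The function name and parameters are hypothetical, and the sketch works in characters of a Python string; in a real run the step depends on the code unit width and on whether UTF mode is set.

```python
def advance_after_empty_match(subject: str, offset: int,
                              crlf_is_newline: bool) -> int:
    """Return the next start offset once the anchored, non-empty retry fails.

    Normally the offset moves on by one character. If the newline
    convention recognizes CRLF and the current character is CR followed
    by LF, both characters are skipped.
    """
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    return offset + 1
```

For the subject "a\r\nb" at offset 1, the result is 3 when CRLF is a recognized newline and 2 otherwise.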
Testing substring extraction functions
|
Testing substring extraction functions
|
||||||
|
|
||||||
The copy and get modifiers can be used to test the pcre2_sub-
|
The copy and get modifiers can be used to test the pcre2_sub-
|
||||||
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
||||||
given more than once, and each can specify a group name or number, for
|
given more than once, and each can specify a group name or number, for
|
||||||
example:
|
example:
|
||||||
|
|
||||||
abcd\=copy=1,copy=3,get=G1
|
abcd\=copy=1,copy=3,get=G1
|
||||||
|
|
||||||
If the #subject command is used to set default copy and/or get lists,
|
If the #subject command is used to set default copy and/or get lists,
|
||||||
these can be unset by specifying a negative number to cancel all num-
|
these can be unset by specifying a negative number to cancel all num-
|
||||||
bered groups and an empty name to cancel all named groups.
|
bered groups and an empty name to cancel all named groups.
|
||||||
|
|
||||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||||
all captured substrings.
|
all captured substrings.
|
||||||
|
|
||||||
If the subject line is successfully matched, the substrings extracted
|
If the subject line is successfully matched, the substrings extracted
|
||||||
by the convenience functions are output with C, G, or L after the
|
by the convenience functions are output with C, G, or L after the
|
||||||
string number instead of a colon. This is in addition to the normal
|
string number instead of a colon. This is in addition to the normal
|
||||||
full list. The string length (that is, the return from the extraction
|
full list. The string length (that is, the return from the extraction
|
||||||
function) is given in parentheses after each substring, followed by the
|
function) is given in parentheses after each substring, followed by the
|
||||||
name when the extraction was by name.
|
name when the extraction was by name.
|
||||||
|
|
||||||
Testing the substitution function
|
Testing the substitution function
|
||||||
|
|
||||||
If the replace modifier is set, the pcre2_substitute() function is
|
If the replace modifier is set, the pcre2_substitute() function is
|
||||||
called instead of one of the matching functions. Note that replacement
|
called instead of one of the matching functions. Note that replacement
|
||||||
strings cannot contain commas, because a comma signifies the end of a
|
strings cannot contain commas, because a comma signifies the end of a
|
||||||
modifier. This is not thought to be an issue in a test program.
|
modifier. This is not thought to be an issue in a test program.
|
||||||
|
|
||||||
Unlike subject strings, pcre2test does not process replacement strings
|
Unlike subject strings, pcre2test does not process replacement strings
|
||||||
for escape sequences. In UTF mode, a replacement string is checked to
|
for escape sequences. In UTF mode, a replacement string is checked to
|
||||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||||
a UTF string of the appropriate code unit width. If it is not a valid
|
a UTF string of the appropriate code unit width. If it is not a valid
|
||||||
UTF-8 string, the individual code units are copied directly. This pro-
|
UTF-8 string, the individual code units are copied directly. This pro-
|
||||||
vides a means of passing an invalid UTF-8 string for testing purposes.
|
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||||
|
|
||||||
The following modifiers set options (in addition to the normal match
|
The following modifiers set options (in addition to the normal match
|
||||||
options) for pcre2_substitute():
|
options) for pcre2_substitute():
|
||||||
|
|
||||||
global PCRE2_SUBSTITUTE_GLOBAL
|
global PCRE2_SUBSTITUTE_GLOBAL
|
||||||
|
@ -1176,8 +1200,8 @@ SUBJECT MODIFIERS
|
||||||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||||
|
|
||||||
|
|
||||||
After a successful substitution, the modified string is output, pre-
|
After a successful substitution, the modified string is output, pre-
|
||||||
ceded by the number of replacements. This may be zero if there were no
|
ceded by the number of replacements. This may be zero if there were no
|
||||||
matches. Here is a simple example of a substitution test:
|
matches. Here is a simple example of a substitution test:
|
||||||
|
|
||||||
/abc/replace=xxx
|
/abc/replace=xxx
|
||||||
|
@ -1186,12 +1210,12 @@ SUBJECT MODIFIERS
|
||||||
=abc=abc=\=global
|
=abc=abc=\=global
|
||||||
2: =xxx=xxx=
|
2: =xxx=xxx=
|
||||||
|
|
||||||
Subject and replacement strings should be kept relatively short (fewer
|
Subject and replacement strings should be kept relatively short (fewer
|
||||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||||
used. To make it easy to test for buffer overflow, if the replacement
|
used. To make it easy to test for buffer overflow, if the replacement
|
||||||
string starts with a number in square brackets, that number is passed
|
string starts with a number in square brackets, that number is passed
|
||||||
to pcre2_substitute() as the size of the output buffer, with the
|
to pcre2_substitute() as the size of the output buffer, with the
|
||||||
replacement string starting at the next character. Here is an example
|
replacement string starting at the next character. Here is an example
|
||||||
that tests the edge case:
|
that tests the edge case:
|
||||||
|
|
||||||
/abc/
|
/abc/
|
||||||
|
@ -1200,11 +1224,11 @@ SUBJECT MODIFIERS
|
||||||
123abc123\=replace=[9]XYZ
|
123abc123\=replace=[9]XYZ
|
||||||
Failed: error -47: no more memory
|
Failed: error -47: no more memory
|
||||||
|
|
||||||
The default action of pcre2_substitute() is to return
|
The default action of pcre2_substitute() is to return
|
||||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||||
through the motions of matching and substituting, in order to compute
|
through the motions of matching and substituting, in order to compute
|
||||||
the size of buffer that is required. When this happens, pcre2test shows
|
the size of buffer that is required. When this happens, pcre2test shows
|
||||||
the required buffer length (which includes space for the trailing zero)
|
the required buffer length (which includes space for the trailing zero)
|
||||||
as part of the error message. For example:
|
as part of the error message. For example:
|
||||||
|
@ -1214,105 +1238,106 @@ SUBJECT MODIFIERS
|
||||||
Failed: error -47: no more memory: 10 code units are needed
|
Failed: error -47: no more memory: 10 code units are needed
|
||||||
|
|
||||||
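The figure of 10 code units can be checked by hand, assuming the same pattern, subject, and replacement as in the [9]XYZ example. The helper below is a hypothetical illustration for a single literal replacement in an ASCII subject, where one character is one code unit; it does not call pcre2_substitute().

```python
def required_code_units(subject: str, match_start: int, match_end: int,
                        replacement: str) -> int:
    """Length of the substituted string plus one for the trailing zero."""
    result = subject[:match_start] + replacement + subject[match_end:]
    return len(result) + 1


# /abc/ matches characters 3..6 of "123abc123"; the result is "123XYZ123".
needed = required_code_units("123abc123", 3, 6, "XYZ")
print(needed)   # 10
```

The substituted string has nine code units, so a buffer of nine leaves no room for the trailing zero, which is why [9] is the edge case.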
A replacement string is ignored with POSIX and DFA matching. Specifying
|
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||||
partial matching provokes an error return ("bad option value") from
|
partial matching provokes an error return ("bad option value") from
|
||||||
pcre2_substitute().
|
pcre2_substitute().
|
||||||
|
|
||||||
Setting the JIT stack size
|
Setting the JIT stack size
|
||||||
|
|
||||||
The jitstack modifier provides a way of setting the maximum stack size
|
The jitstack modifier provides a way of setting the maximum stack size
|
||||||
that is used by the just-in-time optimization code. It is ignored if
|
that is used by the just-in-time optimization code. It is ignored if
|
||||||
JIT optimization is not being used. The value is a number of kilobytes.
|
JIT optimization is not being used. The value is a number of kilobytes.
|
||||||
Providing a stack that is larger than the default 32K is necessary only
|
Setting zero reverts to the default of 32K. Providing a stack that is
|
||||||
for very complicated patterns.
|
larger than the default is necessary only for very complicated pat-
|
||||||
|
terns. If jitstack is set non-zero on a subject line it overrides any
|
||||||
|
value that was set on the pattern.
|
||||||
|
|
||||||
Setting heap, match, and depth limits
|
Setting heap, match, and depth limits
|
||||||
|
|
||||||
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
||||||
priate limits in the match context. These values are ignored when the
|
priate limits in the match context. These values are ignored when the
|
||||||
find_limits modifier is specified.
|
find_limits modifier is specified.
|
||||||
|
|
||||||
Finding minimum limits
|
Finding minimum limits
|
||||||
|
|
||||||
If the find_limits modifier is present on a subject line, pcre2test
|
If the find_limits modifier is present on a subject line, pcre2test
|
||||||
calls the relevant matching function several times, setting different
|
calls the relevant matching function several times, setting different
|
||||||
values in the match context via pcre2_set_heap_limit(),
|
values in the match context via pcre2_set_heap_limit(),
|
||||||
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
||||||
minimum values for each parameter that allows the match to complete
|
minimum values for each parameter that allows the match to complete
|
||||||
without error.
|
without error.
|
||||||
|
|
||||||
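One way to find such a minimum is to double the limit until the match succeeds and then bisect. The text above does not say which strategy pcre2test uses, so the sketch below is an assumption; match_ok is a hypothetical callback that stands for "run the match with this limit and report whether it completed without error", and it is assumed to be monotonic (once a limit works, every larger limit works too).

```python
from typing import Callable


def find_min_limit(match_ok: Callable[[int], bool]) -> int:
    """Return the smallest limit >= 1 for which match_ok(limit) is True."""
    low, high = 0, 1            # low is treated as a limit that is too small
    while not match_ok(high):   # doubling phase: find some limit that works
        low, high = high, high * 2
    while high - low > 1:       # bisection phase: narrow down to the minimum
        mid = (low + high) // 2
        if match_ok(mid):
            high = mid
        else:
            low = mid
    return high
```

For example, find_min_limit(lambda n: n >= 37) returns 37 after trying 1, 2, 4, ..., 64 and then bisecting between 32 and 64.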
If JIT is being used, only the match limit is relevant. If DFA matching
|
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||||
is being used, only the depth limit is relevant.
|
is being used, only the depth limit is relevant.
|
||||||
|
|
||||||
The match_limit number is a measure of the amount of backtracking that
|
The match_limit number is a measure of the amount of backtracking that
|
||||||
takes place, and learning the minimum value can be instructive. For
|
takes place, and learning the minimum value can be instructive. For
|
||||||
most simple matches, the number is quite small, but for patterns with
|
most simple matches, the number is quite small, but for patterns with
|
||||||
very large numbers of matching possibilities, it can become large very
|
very large numbers of matching possibilities, it can become large very
|
||||||
quickly with increasing length of subject string.
|
quickly with increasing length of subject string.
|
||||||
|
|
||||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||||
how much nested backtracking happens (that is, how deeply the pattern's
|
how much nested backtracking happens (that is, how deeply the pattern's
|
||||||
tree is searched). In the case of DFA matching, depth_limit controls
|
tree is searched). In the case of DFA matching, depth_limit controls
|
||||||
the depth of recursive calls of the internal function that is used for
|
the depth of recursive calls of the internal function that is used for
|
||||||
handling pattern recursion, lookaround assertions, and atomic groups.
|
handling pattern recursion, lookaround assertions, and atomic groups.
|
||||||
|
|
||||||
Showing MARK names
|
Showing MARK names
|
||||||
|
|
||||||
|
|
||||||
The mark modifier causes the names from backtracking control verbs that
|
The mark modifier causes the names from backtracking control verbs that
|
||||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||||
it is added to the non-match message.
|
it is added to the non-match message.
|
||||||
|
|
||||||
Showing memory usage
|
Showing memory usage
|
||||||
|
|
||||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||||
ory allocation and freeing calls that occur during a call to
|
ory allocation and freeing calls that occur during a call to
|
||||||
pcre2_match(). These occur only when a match requires a bigger vector
|
pcre2_match(). These occur only when a match requires a bigger vector
|
||||||
than the default for remembering backtracking points. In many cases
|
than the default for remembering backtracking points. In many cases
|
||||||
there will be no heap memory used and therefore no additional output.
|
there will be no heap memory used and therefore no additional output.
|
||||||
No heap memory is allocated during matching with pcre2_dfa_match or
|
No heap memory is allocated during matching with pcre2_dfa_match or
|
||||||
with JIT, so in those cases the memory modifier never has any effect.
|
with JIT, so in those cases the memory modifier never has any effect.
|
||||||
For this modifier to work, the null_context modifier must not be set on
|
For this modifier to work, the null_context modifier must not be set on
|
||||||
both the pattern and the subject, though it can be set on one or the
|
both the pattern and the subject, though it can be set on one or the
|
||||||
other.
|
other.
|
||||||
|
|
||||||
Setting a starting offset
|
Setting a starting offset
|
||||||
|
|
||||||
The offset modifier sets an offset in the subject string at which
|
The offset modifier sets an offset in the subject string at which
|
||||||
matching starts. Its value is a number of code units, not characters.
|
matching starts. Its value is a number of code units, not characters.
|
||||||
|
|
||||||
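Because the value is in code units, the number needed to start at a given character depends on the code unit width when UTF mode is in use. The following sketch (a hypothetical helper, not part of pcre2test) converts a character index into a code unit offset for the 8-bit, 16-bit, and 32-bit cases.

```python
def code_unit_offset(subject: str, char_index: int, width: int) -> int:
    """Count the code units that precede character char_index in UTF mode."""
    encoding = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[width]
    prefix = subject[:char_index].encode(encoding)
    return len(prefix) // (width // 8)


# U+00E9, then U+1F600, then "x"; start matching at the "x".
subject = "\u00e9\U0001F600x"
print(code_unit_offset(subject, 2, 8))    # 6
print(code_unit_offset(subject, 2, 16))   # 3
print(code_unit_offset(subject, 2, 32))   # 2
```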
Setting an offset limit
|
Setting an offset limit
|
||||||
|
|
||||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||||
match cannot be found starting at or before this offset in the subject,
|
match cannot be found starting at or before this offset in the subject,
|
||||||
a "no match" return is given. The data value is a number of code units,
|
a "no match" return is given. The data value is a number of code units,
|
||||||
not characters. When this modifier is used, the use_offset_limit modi-
|
not characters. When this modifier is used, the use_offset_limit modi-
|
||||||
fier must have been set for the pattern; if not, an error is generated.
|
fier must have been set for the pattern; if not, an error is generated.
|
||||||
|
|
||||||
Setting the size of the output vector
|
Setting the size of the output vector
|
||||||
|
|
||||||
The ovector modifier applies only to the subject line in which it
|
The ovector modifier applies only to the subject line in which it
|
||||||
appears, though of course it can also be used to set a default in a
|
appears, though of course it can also be used to set a default in a
|
||||||
#subject command. It specifies the number of pairs of offsets that are
|
#subject command. It specifies the number of pairs of offsets that are
|
||||||
available for storing matching information. The default is 15.
|
available for storing matching information. The default is 15.
|
||||||
|
|
||||||
A value of zero is useful when testing the POSIX API because it causes
|
A value of zero is useful when testing the POSIX API because it causes
|
||||||
regexec() to be called with a NULL capture vector. When not testing the
|
regexec() to be called with a NULL capture vector. When not testing the
|
||||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||||
ate_from_pattern() to be called, in order to create a match block of
|
ate_from_pattern() to be called, in order to create a match block of
|
||||||
exactly the right size for the pattern. (It is not possible to create a
|
exactly the right size for the pattern. (It is not possible to create a
|
||||||
match block with a zero-length ovector; there is always at least one
|
match block with a zero-length ovector; there is always at least one
|
||||||
pair of offsets.)
|
pair of offsets.)
|
||||||
|
|
||||||
Passing the subject as zero-terminated
|
Passing the subject as zero-terminated
|
||||||
|
|
||||||
By default, the subject string is passed to a native API matching func-
|
By default, the subject string is passed to a native API matching func-
|
||||||
tion with its correct length. In order to test the facility for passing
|
tion with its correct length. In order to test the facility for passing
|
||||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
|
||||||
via the POSIX interface, this modifier has no effect, as there is no
|
via the POSIX interface, this modifier is ignored, with a warning.
|
||||||
facility for passing a length.)
|
|
||||||
|
|
||||||
When testing pcre2_substitute(), this modifier also has the effect of
|
When testing pcre2_substitute(), this modifier also has the effect of
|
||||||
passing the replacement string as zero-terminated.
|
passing the replacement string as zero-terminated.
|
||||||
|
@ -1513,8 +1538,8 @@ CALLOUTS
|
||||||
position, which can happen if the callout is in a lookbehind assertion.
|
position, which can happen if the callout is in a lookbehind assertion.
|
||||||
|
|
||||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||||
a result of the /auto_callout pattern modifier. In this case, instead
|
a result of the auto_callout pattern modifier. In this case, instead of
|
||||||
of showing the callout number, the offset in the pattern, preceded by a
|
showing the callout number, the offset in the pattern, preceded by a
|
||||||
plus, is output. For example:
|
plus, is output. For example:
|
||||||
|
|
||||||
re> /\d?[A-E]\*/auto_callout
|
re> /\d?[A-E]\*/auto_callout
|
||||||
|
@ -1662,5 +1687,5 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 03 June 2017
|
Last updated: 16 June 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
|
|