Documentation update.
parent a083420cac
commit c92bfc3d21
|
@ -47,7 +47,7 @@ system stack size checking, or to change one or more of these parameters:
|
|||
The newline character sequence;
|
||||
The compile time nested parentheses limit;
|
||||
The maximum pattern length (in code units) that is allowed.
|
||||
The additional options bits
|
||||
The additional options bits (see pcre2_set_compile_extra_options())
|
||||
</pre>
|
||||
The option bits are:
|
||||
<pre>
|
||||
|
@ -64,6 +64,7 @@ The option bits are:
|
|||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
|
|
|
@ -32,6 +32,8 @@ options are:
|
|||
<pre>
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
|
||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
|
||||
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||
</pre>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
|
|
|
@ -1453,6 +1453,19 @@ continue over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a
|
|||
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
|
||||
a match must occur in the first line and also within the offset limit. In other
|
||||
words, whichever limit comes first is used.
|
||||
<pre>
|
||||
PCRE2_LITERAL
|
||||
</pre>
|
||||
If this option is set, all meta-characters in the pattern are disabled, and it
|
||||
is treated as a literal string. Matching literal strings with a regular
|
||||
expression engine is not the most efficient way of doing it. If you are doing a
|
||||
lot of literal matching and are worried about efficiency, you should consider
|
||||
using other approaches. The only other main options that are allowed with
|
||||
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
||||
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
|
||||
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
|
@ -1724,6 +1737,24 @@ treated as single-character escapes. For example, \j is a literal "j" and
|
|||
\x{2z} is treated as the literal string "x{2z}". Setting this option means
|
||||
that typos in patterns may go undetected and have unexpected results. This is a
|
||||
dangerous option. Use with care.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_LINE
|
||||
</pre>
|
||||
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match complete lines. This is achieved by
|
||||
automatically inserting the code for "^(?:" at the start of the compiled
|
||||
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
|
||||
line may be in the middle of the subject string. This option can be used with
|
||||
PCRE2_LITERAL.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_WORD
|
||||
</pre>
|
||||
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match strings that have a word boundary at the start
|
||||
and the end. This is achieved by automatically inserting the code for "\b(?:"
|
||||
at the start of the compiled pattern and ")\b" at the end. The option may be
|
||||
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
|
||||
also set.
|
||||
</P>
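As a rough illustration of PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD as described above, the following C sketch builds the pattern text that corresponds to what the compiler inserts. It is an approximation only: PCRE2 inserts compiled code rather than editing the pattern string, so the textual form does not carry over to a pattern compiled with PCRE2_LITERAL. The names sketch_wrap_pattern, SKETCH_MATCH_LINE and SKETCH_MATCH_WORD are invented for this example and are not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags for this sketch only. They are NOT the real
   PCRE2_EXTRA_* values from pcre2.h. */
enum { SKETCH_MATCH_LINE = 1, SKETCH_MATCH_WORD = 2 };

/* Writes into out the pattern text equivalent to the wrapped pattern.
   Returns the length written (excluding the terminating zero), or -1 if
   out is too small. */
int sketch_wrap_pattern(const char *pattern, int flags, char *out,
                        size_t outsize)
{
    const char *prefix = "";
    const char *suffix = "";
    size_t plen, len, slen;

    assert(pattern != NULL && out != NULL);

    if (flags & SKETCH_MATCH_LINE) {
        /* Whole-line matching; the word option is ignored in this case. */
        prefix = "^(?:";
        suffix = ")$";
    } else if (flags & SKETCH_MATCH_WORD) {
        prefix = "\\b(?:";
        suffix = ")\\b";
    }

    plen = strlen(prefix);
    len = strlen(pattern);
    slen = strlen(suffix);
    if (plen + len + slen + 1 > outsize) return -1;

    memcpy(out, prefix, plen);
    memcpy(out + plen, pattern, len);
    memcpy(out + plen + len, suffix, slen + 1); /* includes the final zero */
    return (int)(plen + len + slen);
}
```

The non-capturing group matters for patterns with top-level alternation: wrapping "cat|dog" without it would bind the assertion only to the first or last alternative.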
|
||||
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
|
||||
<P>
|
||||
|
@ -3489,7 +3520,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 01 June 2017
|
||||
Last updated: 16 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -94,7 +94,7 @@ The function <b>regcomp()</b> is called to compile a pattern into an
|
|||
internal form. By default, the pattern is a C string terminated by a binary
|
||||
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
|
||||
<b>regex_t</b> structure that is used as a base for storing information about
|
||||
the compiled regular expression. (It is also used for input when REG_PEND is
|
||||
the compiled regular expression. (It is also used for input when REG_PEND is
|
||||
set.)
|
||||
</P>
|
||||
<P>
|
||||
|
@ -117,6 +117,14 @@ compilation to the native function.
|
|||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
<pre>
|
||||
REG_NOSPEC
|
||||
</pre>
|
||||
The PCRE2_LITERAL option is set when the regular expression is passed for
|
||||
compilation to the native function. This disables all meta characters in the
|
||||
pattern, causing it to be treated as a literal string. The only other options
|
||||
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
|
||||
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
|
||||
<pre>
|
||||
REG_NOSUB
|
||||
</pre>
|
||||
|
@ -128,8 +136,8 @@ because it disables the use of back references.
|
|||
<pre>
|
||||
REG_PEND
|
||||
</pre>
|
||||
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||
(which has the type const char *) must be set to point to the character beyond
|
||||
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||
(which has the type const char *) must be set to point to the character beyond
|
||||
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
|
||||
now contain binary zeroes, which are treated as data characters. Without
|
||||
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
||||
|
@ -242,8 +250,8 @@ function.
|
|||
</pre>
|
||||
When this option is set, the subject string starts at <i>string</i> +
|
||||
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
||||
should point to the first character beyond the string. There may be binary
|
||||
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||
should point to the first character beyond the string. There may be binary
|
||||
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||
way to pass a subject string that contains a binary zero.
|
||||
</P>
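The REG_STARTEND convention described above can be sketched as a small helper that fills in the first element of the match vector before the call to regexec(). This is a minimal illustration, not code from the library: the helper name sketch_set_startend is hypothetical, and the system header regex.h is used here only for the regmatch_t type (a program using the PCRE2 wrapper would include pcre2posix.h instead).

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* for regmatch_t; with PCRE2 use pcre2posix.h */

/* Fills pmatch[0] so that a regexec() call with REG_STARTEND set would
   treat the length bytes starting at string + start as the subject. */
void sketch_set_startend(regmatch_t *pmatch, size_t start, size_t length)
{
    assert(pmatch != NULL);
    pmatch[0].rm_so = (regoff_t)start;            /* first byte of subject */
    pmatch[0].rm_eo = (regoff_t)(start + length); /* one past the last byte */
}
```

Because the end is given as an offset rather than found by scanning for a terminator, a buffer such as "ab\0cd" can be passed in full by giving its length of 5.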
|
||||
<P>
|
||||
|
@ -314,7 +322,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 June 2017
|
||||
Last updated: 15 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -96,12 +96,12 @@ want that action.
|
|||
</P>
|
||||
<P>
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||
contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
|
||||
treats any bytes other than newline as data characters. An error is generated
|
||||
if a binary zero is encountered. Subject lines are processed for backslash
|
||||
escapes, which makes it possible to include any data value in strings that are
|
||||
passed to the library for matching. For patterns, there is a facility for
|
||||
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
if a binary zero is encountered. By default subject lines are processed for
|
||||
backslash escapes, which makes it possible to include any data value in strings
|
||||
that are passed to the library for matching. For patterns, there is a facility
|
||||
for specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
which makes it possible to include binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
|
@ -382,8 +382,9 @@ of the standard test input files.
|
|||
<P>
|
||||
When the POSIX API is being tested there is no way to override the default
|
||||
newline convention, though it is possible to set the newline convention from
|
||||
within the pattern. A warning is given if the <b>posix</b> modifier is used when
|
||||
<b>#newline_default</b> would set a default for the non-POSIX API.
|
||||
within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
|
||||
modifier is used when <b>#newline_default</b> would set a default for the
|
||||
non-POSIX API.
|
||||
<pre>
|
||||
#pattern <modifier-list>
|
||||
</pre>
|
||||
|
@ -479,8 +480,9 @@ A pattern can be followed by a modifier list (details below).
|
|||
<P>
|
||||
Before each subject line is passed to <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
|
||||
line is scanned for backslash escapes. The following provide a means of
|
||||
encoding non-printing characters in a visible way:
|
||||
line is scanned for backslash escapes, unless the <b>subject_literal</b>
|
||||
modifier was set for the pattern. The following provide a means of encoding
|
||||
non-printing characters in a visible way:
|
||||
<pre>
|
||||
\a alarm (BEL, \x07)
|
||||
\b backspace (\x08)
|
||||
|
@ -548,6 +550,12 @@ the very last character in the line is a backslash (and there is no modifier
|
|||
list), it is ignored. This gives a way of passing an empty line as data, since
|
||||
a real empty line terminates the data input.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
|
||||
that follow are treated as literals, with no special treatment of backslashes.
|
||||
No replication is possible, and any subject modifiers must be set as defaults
|
||||
by a <b>#subject</b> command.
|
||||
</P>
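The difference between normal and literal subject lines can be sketched in C as follows. This is a deliberately reduced illustration and not the pcre2test implementation: it translates only the \a and \b escapes from the table above, copies everything else unchanged, and the function name sketch_decode_subject is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Copies line into out and returns the number of bytes written. Unless
   literal is non-zero, the two-character sequences \a and \b are
   replaced by BEL (0x07) and backspace (0x08). The output is never
   longer than the input, so out needs room for strlen(line) bytes. */
size_t sketch_decode_subject(const char *line, int literal, char *out)
{
    size_t n = 0;

    assert(line != NULL && out != NULL);

    while (*line != '\0') {
        if (!literal && line[0] == '\\' && line[1] == 'a') {
            out[n++] = '\x07';
            line += 2;
        } else if (!literal && line[0] == '\\' && line[1] == 'b') {
            out[n++] = '\x08';
            line += 2;
        } else {
            out[n++] = *line++;   /* literal mode: backslash is plain data */
        }
    }
    return n;
}
```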
|
||||
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
|
||||
<P>
|
||||
There are several types of modifier that can appear in pattern lines. Except
|
||||
|
@ -586,7 +594,10 @@ for a description of the effects of these options.
|
|||
/x extended set PCRE2_EXTENDED
|
||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
|
@ -638,6 +649,7 @@ heavily used in the test files.
|
|||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
subject_literal treat all subject lines as literal
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
|
@ -728,18 +740,6 @@ testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
|
|||
default values).
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying the pattern's length
|
||||
</b><br>
|
||||
<P>
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings. When using the POSIX wrapper API, there is no other option. However,
|
||||
when using PCRE2's native API, patterns can be passed by length instead of
|
||||
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
|
||||
Using a length happens automatically (whether or not <b>use_length</b> is set)
|
||||
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
|
||||
binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying pattern characters in hexadecimal
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -761,11 +761,20 @@ Either single or double quotes may be used. There is no way of including
|
|||
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
||||
mutually exclusive.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying the pattern's length
|
||||
</b><br>
|
||||
<P>
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal because
|
||||
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
|
||||
requirement for a zero-terminated string. Such patterns are always passed to
|
||||
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings but can be passed by length instead of being zero-terminated. The
|
||||
<b>use_length</b> modifier causes this to happen. Using a length happens
|
||||
automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
|
||||
because patterns specified in hexadecimal may contain binary zeros.
|
||||
</P>
|
||||
<P>
|
||||
If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
|
||||
<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
|
||||
below), the REG_PEND extension is used to pass the pattern's length.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying wide characters in 16-bit and 32-bit modes
|
||||
|
@ -826,7 +835,7 @@ modifier in "Subject Modifiers"
|
|||
for details of how these options are specified for each match attempt.
|
||||
</P>
|
||||
<P>
|
||||
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
|
||||
JIT compilation is requested by the <b>jit</b> pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to 7.
|
||||
The three bits that make up the number specify which of the three JIT operating
|
||||
modes are to be compiled:
|
||||
|
@ -850,7 +859,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
|
|||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
||||
match; the options enable the possibility of a partial match, but do not
|
||||
require it. Note also that if you request JIT compilation only for partial
|
||||
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
|
||||
matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
|
||||
subject line, that match will not use JIT code because none was compiled for
|
||||
non-partial matching.
|
||||
</P>
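The note about jit=2 can be checked with a small bit test. The sketch below assumes the bit assignment given in the full pcre2test documentation (1 for non-partial matching, 2 for soft partial matching, 4 for hard partial matching); the enum and function names are hypothetical and exist only for this example.

```c
#include <assert.h>

/* One bit per JIT operating mode, as assumed in the lead-in above. */
enum sketch_match_kind {
    SKETCH_NORMAL       = 1,  /* non-partial matching */
    SKETCH_PARTIAL_SOFT = 2,  /* soft partial matching */
    SKETCH_PARTIAL_HARD = 4   /* hard partial matching */
};

/* Returns non-zero if the number given to the jit modifier requested
   compiled code for the given kind of match. */
int sketch_jit_code_compiled(int jit_number, enum sketch_match_kind kind)
{
    assert(jit_number >= 0 && jit_number <= 7);
    return (jit_number & (int)kind) != 0;
}
```

With jit=2, the call for SKETCH_NORMAL yields zero, which corresponds to the case where a subject line without the partial modifier falls back to the interpreter.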
|
||||
|
@ -927,12 +936,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
|
|||
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
|
||||
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
||||
variable can hold (essentially unlimited).
|
||||
</P>
|
||||
<a name="posixwrapper"></a></P>
|
||||
<br><b>
|
||||
Using the POSIX wrapper API
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||
The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||
PCRE2 via the POSIX wrapper API rather than its native API. When
|
||||
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
|
||||
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
|
||||
|
@ -962,6 +971,11 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
|
|||
below. All other modifiers are either ignored, with a warning message, or cause
|
||||
an error.
|
||||
</P>
|
||||
<P>
|
||||
The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
|
||||
default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
|
||||
REG_PEND extension is used to pass it by length.
|
||||
</P>
|
||||
<br><b>
|
||||
Testing the stack guard feature
|
||||
</b><br>
|
||||
|
@ -999,17 +1013,18 @@ are mutually exclusive.
|
|||
Setting certain match controls
|
||||
</b><br>
|
||||
<P>
|
||||
The following modifiers are really subject modifiers, and are described below.
|
||||
However, they may be included in a pattern's modifier list, in which case they
|
||||
are applied to every subject line that is processed with that pattern. They may
|
||||
not appear in <b>#pattern</b> commands. These modifiers do not affect the
|
||||
compilation process.
|
||||
The following modifiers are really subject modifiers, and are described under
|
||||
"Subject Modifiers" below. However, they may be included in a pattern's
|
||||
modifier list, in which case they are applied to every subject line that is
|
||||
processed with that pattern. They may not appear in <b>#pattern</b> commands.
|
||||
These modifiers do not affect the compilation process.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
|
@ -1022,6 +1037,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
|||
defaults, set them in a <b>#subject</b> command.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying literal subject lines
|
||||
</b><br>
|
||||
<P>
|
||||
If the <b>subject_literal</b> modifier is present on a pattern, all the subject
|
||||
lines that it matches are taken as literal strings, with no interpretation of
|
||||
backslashes. It is not possible to set subject modifiers on such lines, but any
|
||||
that are set as defaults by a <b>#subject</b> command are recognized.
|
||||
</P>
|
||||
<br><b>
|
||||
Saving a compiled pattern
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -1072,11 +1096,11 @@ The partial matching modifiers are provided with abbreviations because they
|
|||
appear frequently in tests.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
||||
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
||||
The other modifiers are ignored, with a warning message.
|
||||
If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
|
||||
causing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||
that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
|
||||
causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||
<b>regexec()</b>. The other modifiers are ignored, with a warning message.
|
||||
</P>
|
||||
<P>
|
||||
There is one additional modifier that can be used with the POSIX wrapper. It is
|
||||
|
@ -1085,11 +1109,13 @@ ignored (with a warning) if used for non-POSIX matching.
|
|||
posix_startend=<n>[:<m>]
|
||||
</pre>
|
||||
This causes the subject string to be passed to <b>regexec()</b> using the
|
||||
REG_STARTEND option, which uses offsets to restrict which part of the string is
|
||||
REG_STARTEND option, which uses offsets to specify which part of the string is
|
||||
searched. If only one number is given, the end offset is passed as the end of
|
||||
the subject string. For more detail of REG_STARTEND, see the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
documentation.
|
||||
documentation. If the subject string contains binary zeros (coded as escapes
|
||||
such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
|
||||
its input), you must use <b>posix_startend</b> to specify its length.
|
||||
</P>
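How the value of posix_startend might be turned into the two REG_STARTEND offsets can be sketched as below. This is an assumption-laden illustration rather than the pcre2test parser: the function name sketch_parse_startend is hypothetical and the error handling is simplified. It does follow the rule stated above that a missing second number means the end of the subject string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Parses "<n>" or "<n>:<m>". On success stores the offsets and returns 0;
   returns -1 for malformed input. When <m> is absent, the end offset is
   the length of the subject. */
int sketch_parse_startend(const char *value, size_t subject_length,
                          size_t *start, size_t *end)
{
    char *rest;
    unsigned long n, m;

    assert(value != NULL && start != NULL && end != NULL);

    if (*value < '0' || *value > '9') return -1;
    n = strtoul(value, &rest, 10);

    if (*rest == '\0') {
        m = (unsigned long)subject_length;      /* only <n> was given */
    } else if (*rest == ':' && rest[1] >= '0' && rest[1] <= '9') {
        m = strtoul(rest + 1, &rest, 10);
        if (*rest != '\0') return -1;           /* trailing junk */
    } else {
        return -1;
    }

    *start = (size_t)n;
    *end = (size_t)m;
    return 0;
}
```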
|
||||
<br><b>
|
||||
Setting match controls
|
||||
|
@ -1355,9 +1381,11 @@ Setting the JIT stack size
|
|||
<P>
|
||||
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if JIT
|
||||
optimization is not being used. The value is a number of kilobytes. Providing a
|
||||
stack that is larger than the default 32K is necessary only for very
|
||||
complicated patterns.
|
||||
optimization is not being used. The value is a number of kilobytes. Setting
|
||||
zero reverts to the default of 32K. Providing a stack that is larger than the
|
||||
default is necessary only for very complicated patterns. If <b>jitstack</b> is
|
||||
set non-zero on a subject line it overrides any value that was set on the
|
||||
pattern.
|
||||
</P>
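The precedence rules for <b>jitstack</b> stated above amount to a two-step choice, sketched here in C. The function and macro names are hypothetical; only the rules themselves (kilobyte units, zero meaning the 32K default, a non-zero subject value overriding the pattern value) come from the text.

```c
#include <assert.h>

#define SKETCH_DEFAULT_JITSTACK_KB 32u

/* Returns the JIT stack size in kilobytes that applies to a match, given
   the jitstack values from the pattern line and the subject line (zero
   means "not set" in both cases). */
unsigned sketch_effective_jitstack_kb(unsigned pattern_kb, unsigned subject_kb)
{
    /* A non-zero value on the subject line overrides the pattern's. */
    unsigned kb = (subject_kb != 0) ? subject_kb : pattern_kb;

    /* Zero reverts to the default. */
    if (kb == 0) kb = SKETCH_DEFAULT_JITSTACK_KB;

    assert(kb != 0);
    return kb;
}
```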
|
||||
<br><b>
|
||||
Setting heap, match, and depth limits
|
||||
|
@ -1461,8 +1489,8 @@ Passing the subject as zero-terminated
|
|||
By default, the subject string is passed to a native API matching function with
|
||||
its correct length. In order to test the facility for passing a zero-terminated
|
||||
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
|
||||
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
|
||||
this modifier has no effect, as there is no facility for passing a length.)
|
||||
be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
|
||||
this modifier is ignored, with a warning.
|
||||
</P>
|
||||
<P>
|
||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||
|
@ -1675,7 +1703,7 @@ callout is in a lookbehind assertion.
|
|||
</P>
|
||||
<P>
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
|
||||
result of the <b>auto_callout</b> pattern modifier. In this case, instead of
|
||||
showing the callout number, the offset in the pattern, preceded by a plus, is
|
||||
output. For example:
|
||||
<pre>
|
||||
|
@ -1830,7 +1858,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 03 June 2017
|
||||
Last updated: 16 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
189
doc/pcre2.txt
|
@ -1441,6 +1441,20 @@ COMPILING A PATTERN
|
|||
first line and also within the offset limit. In other words, whichever
|
||||
limit comes first is used.
|
||||
|
||||
PCRE2_LITERAL
|
||||
|
||||
If this option is set, all meta-characters in the pattern are disabled,
|
||||
and it is treated as a literal string. Matching literal strings with a
|
||||
regular expression engine is not the most efficient way of doing it. If
|
||||
you are doing a lot of literal matching and are worried about effi-
|
||||
ciency, you should consider using other approaches. The only other main
|
||||
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
|
||||
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
|
||||
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
|
||||
If this option is set, a back reference to an unset subpattern group
|
||||
|
@ -1706,6 +1720,24 @@ COMPILING A PATTERN
|
|||
option means that typos in patterns may go undetected and have unex-
|
||||
pected results. This is a dangerous option. Use with care.
|
||||
|
||||
PCRE2_EXTRA_MATCH_LINE
|
||||
|
||||
This option is provided for use by the -x option of pcre2grep. It
|
||||
causes the pattern only to match complete lines. This is achieved by
|
||||
automatically inserting the code for "^(?:" at the start of the com-
|
||||
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
|
||||
the matched line may be in the middle of the subject string. This
|
||||
option can be used with PCRE2_LITERAL.
|
||||
|
||||
PCRE2_EXTRA_MATCH_WORD
|
||||
|
||||
This option is provided for use by the -w option of pcre2grep. It
|
||||
causes the pattern only to match strings that have a word boundary at
|
||||
the start and the end. This is achieved by automatically inserting the
|
||||
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
|
||||
end. The option may be used with PCRE2_LITERAL. However, it is ignored
|
||||
if PCRE2_EXTRA_MATCH_LINE is also set.
|
||||
|
||||
|
||||
COMPILATION ERROR CODES
|
||||
|
||||
|
@ -3368,7 +3400,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 01 June 2017
|
||||
Last updated: 16 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -9036,60 +9068,69 @@ COMPILING A PATTERN
|
|||
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
|
||||
tion).
|
||||
|
||||
REG_NOSPEC
|
||||
|
||||
The PCRE2_LITERAL option is set when the regular expression is passed
|
||||
for compilation to the native function. This disables all meta charac-
|
||||
ters in the pattern, causing it to be treated as a literal string. The
|
||||
only other options that are allowed with REG_NOSPEC are REG_ICASE,
|
||||
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
|
||||
the POSIX standard.
|
||||
|
||||
REG_NOSUB
|
||||
|
||||
When a pattern that is compiled with this flag is passed to regexec()
|
||||
for matching, the nmatch and pmatch arguments are ignored, and no cap-
|
||||
When a pattern that is compiled with this flag is passed to regexec()
|
||||
for matching, the nmatch and pmatch arguments are ignored, and no cap-
|
||||
tured strings are returned. Versions of the PCRE library prior to 10.22
|
||||
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
||||
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
|
||||
longer happens because it disables the use of back references.
|
||||
|
||||
REG_PEND
|
||||
|
||||
If this option is set, the re_endp field in the preg structure (which
|
||||
If this option is set, the re_endp field in the preg structure (which
|
||||
has the type const char *) must be set to point to the character beyond
|
||||
the end of the pattern before calling regcomp(). The pattern itself may
|
||||
now contain binary zeroes, which are treated as data characters. With-
|
||||
out REG_PEND, a binary zero terminates the pattern and the re_endp
|
||||
field is ignored. This is a GNU extension to the POSIX standard and
|
||||
should be used with caution in software intended to be portable to
|
||||
now contain binary zeroes, which are treated as data characters. With-
|
||||
out REG_PEND, a binary zero terminates the pattern and the re_endp
|
||||
field is ignored. This is a GNU extension to the POSIX standard and
|
||||
should be used with caution in software intended to be portable to
|
||||
other systems.
|
||||
|
||||
REG_UCP
|
||||
|
||||
The PCRE2_UCP option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes PCRE2 to use Unicode
|
||||
properties when matching \d, \w, etc., instead of just recognizing
|
||||
The PCRE2_UCP option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes PCRE2 to use Unicode
|
||||
properties when matching \d, \w, etc., instead of just recognizing
|
||||
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
||||
|
||||
REG_UNGREEDY
|
||||
|
||||
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
||||
for compilation to the native function. Note that REG_UNGREEDY is not
|
||||
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
||||
for compilation to the native function. Note that REG_UNGREEDY is not
|
||||
part of the POSIX standard.
|
||||
|
||||
REG_UTF
|
||||
|
||||
The PCRE2_UTF option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes the pattern itself and
|
||||
all data strings used for matching it to be treated as UTF-8 strings.
|
||||
The PCRE2_UTF option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes the pattern itself and
|
||||
all data strings used for matching it to be treated as UTF-8 strings.
|
||||
Note that REG_UTF is not part of the POSIX standard.
|
||||
|
||||
In the absence of these flags, no options are passed to the native
|
||||
function. This means the regex is compiled with PCRE2 default
|
||||
semantics. In particular, the way it handles newline characters in the
|
||||
subject string is the Perl way, not the POSIX way. Note that setting
|
||||
In the absence of these flags, no options are passed to the native
|
||||
function. This means the regex is compiled with PCRE2 default
|
||||
semantics. In particular, the way it handles newline characters in the
|
||||
subject string is the Perl way, not the POSIX way. Note that setting
|
||||
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
||||
It does not affect the way newlines are matched by the dot metacharac-
|
||||
It does not affect the way newlines are matched by the dot metacharac-
|
||||
ter (they are not) or by a negative class such as [^a] (they are).
|
||||
|
||||
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
||||
preg structure is filled in on success, and one other member of the
|
||||
structure (as well as re_endp) is public: re_nsub contains the number
|
||||
The yield of regcomp() is zero on success, and non-zero otherwise. The
|
||||
preg structure is filled in on success, and one other member of the
|
||||
structure (as well as re_endp) is public: re_nsub contains the number
|
||||
of capturing subpatterns in the regular expression. Various error codes
|
||||
are defined in the header file.
|
||||
|
||||
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
||||
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
|
||||
use the contents of the preg structure. If, for example, you pass it to
|
||||
regexec(), the result is undefined and your program is likely to crash.
|
||||
|
||||
|
@ -9097,9 +9138,9 @@ COMPILING A PATTERN
|
|||
MATCHING NEWLINE CHARACTERS
|
||||
|
||||
This area is not simple, because POSIX and Perl take different views of
|
||||
things. It is not possible to get PCRE2 to obey POSIX semantics, but
|
||||
things. It is not possible to get PCRE2 to obey POSIX semantics, but
|
||||
then PCRE2 was never intended to be a POSIX engine. The following table
|
||||
lists the different possibilities for matching newline characters in
|
||||
lists the different possibilities for matching newline characters in
|
||||
Perl and PCRE2:
|
||||
|
||||
Default Change with
|
||||
|
@ -9120,25 +9161,25 @@ MATCHING NEWLINE CHARACTERS
|
|||
$ matches \n in middle no REG_NEWLINE
|
||||
^ matches \n in middle no REG_NEWLINE
|
||||
|
||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||
API. By default, PCRE2's behaviour is the same as Perl's, except that
|
||||
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
|
||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||
API. By default, PCRE2's behaviour is the same as Perl's, except that
|
||||
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
|
||||
and Perl, there is no way to stop newline from matching [^a].
|
||||
|
||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
|
||||
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
|
||||
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
|
||||
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
|
||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
|
||||
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
|
||||
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
|
||||
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
|
||||
comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
|
||||
and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
|
||||
and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
|
||||
LAR_ENDONLY.
|
||||
|
||||
|
||||
MATCHING A PATTERN
|
||||
|
||||
The function regexec() is called to match a compiled pattern preg
|
||||
against a given string, which is by default terminated by a zero byte
|
||||
(but see REG_STARTEND below), subject to the options in eflags. These
|
||||
The function regexec() is called to match a compiled pattern preg
|
||||
against a given string, which is by default terminated by a zero byte
|
||||
(but see REG_STARTEND below), subject to the options in eflags. These
|
||||
can be:
|
||||
|
||||
REG_NOTBOL
|
||||
|
@ -9148,9 +9189,9 @@ MATCHING A PATTERN
|
|||
|
||||
REG_NOTEMPTY
|
||||
|
||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
|
||||
matching function. Note that REG_NOTEMPTY is not part of the POSIX
|
||||
standard. However, setting this option can give more POSIX-like behav-
|
||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
|
||||
matching function. Note that REG_NOTEMPTY is not part of the POSIX
|
||||
standard. However, setting this option can give more POSIX-like behav-
|
||||
iour in some situations.
|
||||
|
||||
REG_NOTEOL
|
||||
|
@ -9160,66 +9201,66 @@ MATCHING A PATTERN
|
|||
|
||||
REG_STARTEND
|
||||
|
||||
When this option is set, the subject string starts at string +
|
||||
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
||||
point to the first character beyond the string. There may be binary
|
||||
When this option is set, the subject string starts at string +
|
||||
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
||||
point to the first character beyond the string. There may be binary
|
||||
zeroes within the subject string, and indeed, using REG_STARTEND is the
|
||||
only way to pass a subject string that contains a binary zero.
|
||||
|
||||
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
||||
string and any captured substrings are still given relative to the
|
||||
start of string itself. (Before PCRE2 release 10.30 these were given
|
||||
relative to string + pmatch[0].rm_so, but this differs from other
|
||||
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
||||
string and any captured substrings are still given relative to the
|
||||
start of string itself. (Before PCRE2 release 10.30 these were given
|
||||
relative to string + pmatch[0].rm_so, but this differs from other
|
||||
implementations.)
|
||||
|
||||
This is a BSD extension, compatible with but not specified by IEEE
|
||||
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero rm_so
|
||||
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
||||
length of the string, not how it is matched. Setting REG_STARTEND and
|
||||
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||
This is a BSD extension, compatible with but not specified by IEEE
|
||||
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||
intended to be portable to other systems. Note that a non-zero rm_so
|
||||
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
||||
length of the string, not how it is matched. Setting REG_STARTEND and
|
||||
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
||||
returned.
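
The REG_STARTEND convention described above can be sketched as follows. This is a minimal illustration, not part of the PCRE2 sources: it assumes a system <regex.h> that defines REG_STARTEND (the PCRE2 wrapper would use pcre2posix.h instead), and match_in_range() is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match re against buf[start..end), where the
   range may contain binary zeros. On entry pm[0] carries the range;
   on success pm[0] holds the match offsets, which are relative to
   buf itself and not to buf + start. */
static int match_in_range(const regex_t *re, const char *buf,
                          size_t start, size_t end, regmatch_t *pm)
{
    assert(start <= end);
    pm[0].rm_so = (regoff_t)start;   /* first character of the subject */
    pm[0].rm_eo = (regoff_t)end;     /* first character beyond it */
    return regexec(re, buf, 1, pm, REG_STARTEND);
}
```

Because pmatch doubles as input here, it cannot be NULL when REG_STARTEND is set, which is why the helper always takes a regmatch_t array.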
|
||||
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||
matched strings is returned. The nmatch and pmatch arguments of
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
||||
matched strings is returned. The nmatch and pmatch arguments of
|
||||
regexec() are ignored (except possibly as input for REG_STARTEND).
|
||||
|
||||
The value of nmatch may be zero, and the value of pmatch may be NULL
|
||||
(unless REG_STARTEND is set); in both these cases no data about any
|
||||
The value of nmatch may be zero, and the value of pmatch may be NULL
|
||||
(unless REG_STARTEND is set); in both these cases no data about any
|
||||
matched strings is returned.
|
||||
|
||||
Otherwise, the portion of the string that was matched, and also any
|
||||
Otherwise, the portion of the string that was matched, and also any
|
||||
captured substrings, are returned via the pmatch argument, which points
|
||||
to an array of nmatch structures of type regmatch_t, containing the
|
||||
members rm_so and rm_eo. These contain the byte offset to the first
|
||||
to an array of nmatch structures of type regmatch_t, containing the
|
||||
members rm_so and rm_eo. These contain the byte offset to the first
|
||||
character of each substring and the offset to the first character after
|
||||
the end of each substring, respectively. The 0th element of the vector
|
||||
relates to the entire portion of string that was matched; subsequent
|
||||
the end of each substring, respectively. The 0th element of the vector
|
||||
relates to the entire portion of string that was matched; subsequent
|
||||
elements relate to the capturing subpatterns of the regular expression.
|
||||
Unused entries in the array have both structure members set to -1.
|
||||
|
||||
A successful match yields a zero return; various error codes are
|
||||
defined in the header file, of which REG_NOMATCH is the "expected"
|
||||
A successful match yields a zero return; various error codes are
|
||||
defined in the header file, of which REG_NOMATCH is the "expected"
|
||||
failure code.
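
As a hedged sketch of how the rm_so and rm_eo offsets are typically consumed, the following copies the text of the first capturing subpattern out of the subject. It assumes a system <regex.h>; with the PCRE2 wrapper the header would be pcre2posix.h. The helper name first_capture() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the text of capture group 1 into out.
   Returns 0 on success, or the non-zero code from regcomp() or
   regexec() (REG_NOMATCH being the expected failure). */
static int first_capture(const char *pattern, const char *subject,
                         char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[2];   /* pm[0] = whole match, pm[1] = group 1 */
    int rc;

    assert(outsz > 0);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;      /* re must not be used after a failed compile */

    rc = regexec(&re, subject, 2, pm, 0);
    if (rc == 0 && pm[1].rm_so != -1) {
        size_t len = (size_t)(pm[1].rm_eo - pm[1].rm_so);
        if (len >= outsz)
            len = outsz - 1;          /* truncate rather than overflow */
        memcpy(out, subject + pm[1].rm_so, len);
        out[len] = '\0';
    } else if (rc == 0) {
        out[0] = '\0';                /* group 1 is unset: both members are -1 */
    }
    regfree(&re);       /* release the memory attached to re */
    return rc;
}
```

The length of a substring is simply rm_eo minus rm_so, since rm_eo is the offset of the first character after the substring.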
|
||||
|
||||
|
||||
ERROR MESSAGES
|
||||
|
||||
The regerror() function maps a non-zero errorcode from either regcomp()
|
||||
or regexec() to a printable message. If preg is not NULL, the error
|
||||
or regexec() to a printable message. If preg is not NULL, the error
|
||||
should have arisen from the use of that structure. A message terminated
|
||||
by a binary zero is placed in errbuf. If the buffer is too short, only
|
||||
by a binary zero is placed in errbuf. If the buffer is too short, only
|
||||
the first errbuf_size - 1 characters of the error message are used. The
|
||||
yield of the function is the size of buffer needed to hold the whole
|
||||
message, including the terminating zero. This value is greater than
|
||||
yield of the function is the size of buffer needed to hold the whole
|
||||
message, including the terminating zero. This value is greater than
|
||||
errbuf_size if the message was truncated.
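
The truncation rule above suggests a two-pass pattern: call regerror() once with a small buffer to learn the required size, then again with a buffer of that size. The sketch below assumes a system <regex.h>; error_text() is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the complete message
   for errcode, or NULL if allocation fails. preg may be NULL. The
   caller frees the result. */
static char *error_text(int errcode, const regex_t *preg)
{
    char probe[8];      /* deliberately small; only the yield matters */
    size_t need = regerror(errcode, preg, probe, sizeof probe);
    char *msg;

    assert(need > 0);   /* the yield counts the terminating zero */
    msg = malloc(need);
    if (msg == NULL)
        return NULL;
    regerror(errcode, preg, msg, need);   /* now fits without truncation */
    return msg;
}
```

A fixed large buffer works as well; the two-pass form just avoids guessing a size.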
|
||||
|
||||
|
||||
MEMORY USAGE
|
||||
|
||||
Compiling a regular expression causes memory to be allocated and asso-
|
||||
ciated with the preg structure. The function regfree() frees all such
|
||||
memory, after which preg may no longer be used as a compiled expres-
|
||||
Compiling a regular expression causes memory to be allocated and asso-
|
||||
ciated with the preg structure. The function regfree() frees all such
|
||||
memory, after which preg may no longer be used as a compiled expres-
|
||||
sion.
|
||||
|
||||
|
||||
|
@ -9232,7 +9273,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 June 2017
|
||||
Last updated: 15 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_COMPILE 3 "17 May 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -35,7 +35,7 @@ system stack size checking, or to change one or more of these parameters:
|
|||
The newline character sequence;
|
||||
The compile time nested parentheses limit;
|
||||
The maximum pattern length (in code units) that is allowed.
|
||||
The additional options bits
|
||||
The additional options bits (see pcre2_set_compile_extra_options())
|
||||
.sp
|
||||
The option bits are:
|
||||
.sp
|
||||
|
@ -52,6 +52,7 @@ The option bits are:
|
|||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -24,6 +24,8 @@ options are:
|
|||
.\" JOIN
|
||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
|
||||
a literal following character
|
||||
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||
.sp
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
|
|
|
@ -64,12 +64,12 @@ INPUT ENCODING
|
|||
unless you really want that action.
|
||||
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, fgets()
|
||||
contain binary zeros, even though in Unix-like environments, fgets()
|
||||
treats any bytes other than newline as data characters. An error is
|
||||
generated if a binary zero is encountered. Subject lines are processed
|
||||
for backslash escapes, which makes it possible to include any data
|
||||
value in strings that are passed to the library for matching. For pat-
|
||||
terns, there is a facility for specifying some or all of the 8-bit
|
||||
generated if a binary zero is encountered. By default subject lines are
|
||||
processed for backslash escapes, which makes it possible to include any
|
||||
data value in strings that are passed to the library for matching. For
|
||||
patterns, there is a facility for specifying some or all of the 8-bit
|
||||
input characters as hexadecimal pairs, which makes it possible to
|
||||
include binary zeros.
|
||||
|
||||
|
@ -319,9 +319,9 @@ COMMAND LINES
|
|||
|
||||
When the POSIX API is being tested there is no way to override the
|
||||
default newline convention, though it is possible to set the newline
|
||||
convention from within the pattern. A warning is given if the posix
|
||||
modifier is used when #newline_default would set a default for the non-
|
||||
POSIX API.
|
||||
convention from within the pattern. A warning is given if the posix or
|
||||
posix_nosub modifier is used when #newline_default would set a default
|
||||
for the non-POSIX API.
|
||||
|
||||
#pattern <modifier-list>
|
||||
|
||||
|
@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
|
|||
|
||||
Before each subject line is passed to pcre2_match() or
|
||||
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
||||
line is scanned for backslash escapes. The following provide a means of
|
||||
encoding non-printing characters in a visible way:
|
||||
line is scanned for backslash escapes, unless the subject_literal modi-
|
||||
fier was set for the pattern. The following provide a means of encoding
|
||||
non-printing characters in a visible way:
|
||||
|
||||
\a alarm (BEL, \x07)
|
||||
\b backspace (\x08)
|
||||
|
@ -442,23 +443,23 @@ SUBJECT LINE SYNTAX
|
|||
\x{hh...} hexadecimal character (any number of hex digits)
|
||||
|
||||
The use of \x{hh...} is not dependent on the use of the utf modifier on
|
||||
the pattern. It is recognized always. There may be any number of hexa-
|
||||
decimal digits inside the braces; invalid values provoke error mes-
|
||||
the pattern. It is recognized always. There may be any number of hexa-
|
||||
decimal digits inside the braces; invalid values provoke error mes-
|
||||
sages.
|
||||
|
||||
Note that \xhh specifies one byte rather than one character in UTF-8
|
||||
mode; this makes it possible to construct invalid UTF-8 sequences for
|
||||
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
||||
character in UTF-8 mode, generating more than one byte if the value is
|
||||
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
||||
Note that \xhh specifies one byte rather than one character in UTF-8
|
||||
mode; this makes it possible to construct invalid UTF-8 sequences for
|
||||
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
|
||||
character in UTF-8 mode, generating more than one byte if the value is
|
||||
greater than 127. When testing the 8-bit library not in UTF-8 mode,
|
||||
\x{hh} generates one byte for values less than 256, and causes an error
|
||||
for greater values.
|
||||
|
||||
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
|
||||
possible to construct invalid UTF-16 sequences for testing purposes.
|
||||
|
||||
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
||||
makes it possible to construct invalid UTF-32 sequences for testing
|
||||
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
|
||||
makes it possible to construct invalid UTF-32 sequences for testing
|
||||
purposes.
|
||||
|
||||
There is a special backslash sequence that specifies replication of one
|
||||
|
@ -466,33 +467,38 @@ SUBJECT LINE SYNTAX
|
|||
|
||||
\[<characters>]{<count>}
|
||||
|
||||
This makes it possible to test long strings without having to provide
|
||||
This makes it possible to test long strings without having to provide
|
||||
them as part of the file. For example:
|
||||
|
||||
\[abc]{4}
|
||||
|
||||
is converted to "abcabcabcabc". This feature does not support nesting.
|
||||
is converted to "abcabcabcabc". This feature does not support nesting.
|
||||
To include a closing square bracket in the characters, code it as \x5D.
|
||||
|
||||
A backslash followed by an equals sign marks the end of the subject
|
||||
A backslash followed by an equals sign marks the end of the subject
|
||||
string and the start of a modifier list. For example:
|
||||
|
||||
abc\=notbol,notempty
|
||||
|
||||
If the subject string is empty and \= is followed by whitespace, the
|
||||
line is treated as a comment line, and is not used for matching. For
|
||||
If the subject string is empty and \= is followed by whitespace, the
|
||||
line is treated as a comment line, and is not used for matching. For
|
||||
example:
|
||||
|
||||
\= This is a comment.
|
||||
abc\= This is an invalid modifier list.
|
||||
|
||||
A backslash followed by any other non-alphanumeric character just
|
||||
A backslash followed by any other non-alphanumeric character just
|
||||
escapes that character. A backslash followed by anything else causes an
|
||||
error. However, if the very last character in the line is a backslash
|
||||
(and there is no modifier list), it is ignored. This gives a way of
|
||||
passing an empty line as data, since a real empty line terminates the
|
||||
error. However, if the very last character in the line is a backslash
|
||||
(and there is no modifier list), it is ignored. This gives a way of
|
||||
passing an empty line as data, since a real empty line terminates the
|
||||
data input.
|
||||
|
||||
If the subject_literal modifier is set for a pattern, all subject lines
|
||||
that follow are treated as literals, with no special treatment of back-
|
||||
slashes. No replication is possible, and any subject modifiers must be
|
||||
set as defaults by a #subject command.
|
||||
|
||||
|
||||
PATTERN MODIFIERS
|
||||
|
||||
|
@ -530,7 +536,10 @@ PATTERN MODIFIERS
|
|||
/x extended set PCRE2_EXTENDED
|
||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
|
@ -580,6 +589,7 @@ PATTERN MODIFIERS
|
|||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
subject_literal treat all subject lines as literal
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
|
@ -659,16 +669,6 @@ PATTERN MODIFIERS
|
|||
testing that pcre2_compile() behaves correctly in this case (it uses
|
||||
default values).
|
||||
|
||||
Specifying the pattern's length
|
||||
|
||||
By default, patterns are passed to the compiling functions as zero-ter-
|
||||
minated strings. When using the POSIX wrapper API, there is no other
|
||||
option. However, when using PCRE2's native API, patterns can be passed
|
||||
by length instead of being zero-terminated. The use_length modifier
|
||||
causes this to happen. Using a length happens automatically (whether
|
||||
or not use_length is set) when hex is set, because patterns specified
|
||||
in hexadecimal may contain binary zeros.
|
||||
|
||||
Specifying pattern characters in hexadecimal
|
||||
|
||||
The hex modifier specifies that the characters of the pattern, except
|
||||
|
@ -690,61 +690,68 @@ PATTERN MODIFIERS
|
|||
ing the delimiter within a substring. The hex and expand modifiers are
|
||||
mutually exclusive.
|
||||
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal
|
||||
because they may contain binary zeros, which conflicts with regcomp()'s
|
||||
requirement for a zero-terminated string. Such patterns are always
|
||||
passed to pcre2_compile() as a string with a length, not as zero-termi-
|
||||
nated.
|
||||
Specifying the pattern's length
|
||||
|
||||
By default, patterns are passed to the compiling functions as zero-ter-
|
||||
minated strings but can be passed by length instead of being zero-ter-
|
||||
minated. The use_length modifier causes this to happen. Using a length
|
||||
happens automatically (whether or not use_length is set) when hex is
|
||||
set, because patterns specified in hexadecimal may contain binary
|
||||
zeros.
|
||||
|
||||
If hex or use_length is used with the POSIX wrapper API (see "Using the
|
||||
POSIX wrapper API" below), the REG_PEND extension is used to pass the
|
||||
pattern's length.
|
||||
|
||||
Specifying wide characters in 16-bit and 32-bit modes
|
||||
|
||||
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
|
||||
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
||||
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
|
||||
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
|
||||
modifier can be used. It is mutually exclusive with utf. Input lines
|
||||
modifier can be used. It is mutually exclusive with utf. Input lines
|
||||
are interpreted as UTF-8 as a means of specifying wide characters. More
|
||||
details are given in "Input encoding" above.
|
||||
|
||||
Generating long repetitive patterns
|
||||
|
||||
Some tests use long patterns that are very repetitive. Instead of cre-
|
||||
ating a very long input line for such a pattern, you can use a special
|
||||
repetition feature, similar to the one described for subject lines
|
||||
above. If the expand modifier is present on a pattern, parts of the
|
||||
Some tests use long patterns that are very repetitive. Instead of cre-
|
||||
ating a very long input line for such a pattern, you can use a special
|
||||
repetition feature, similar to the one described for subject lines
|
||||
above. If the expand modifier is present on a pattern, parts of the
|
||||
pattern that have the form
|
||||
|
||||
\[<characters>]{<count>}
|
||||
|
||||
are expanded before the pattern is passed to pcre2_compile(). For exam-
|
||||
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
|
||||
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
||||
followed by decimal digits and "}" is found later in the pattern. If
|
||||
cannot be nested. An initial "\[" sequence is recognized only if "]{"
|
||||
followed by decimal digits and "}" is found later in the pattern. If
|
||||
not, the characters remain in the pattern unaltered. The expand and hex
|
||||
modifiers are mutually exclusive.
|
||||
|
||||
If part of an expanded pattern looks like an expansion, but is really
|
||||
If part of an expanded pattern looks like an expansion, but is really
|
||||
part of the actual pattern, unwanted expansion can be avoided by giving
|
||||
two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
|
||||
ognized as an expansion item.
|
||||
|
||||
If the info modifier is set on an expanded pattern, the result of the
|
||||
If the info modifier is set on an expanded pattern, the result of the
|
||||
expansion is included in the information that is output.
|
||||
|
||||
JIT compilation
|
||||
|
||||
Just-in-time (JIT) compiling is a heavyweight optimization that can
|
||||
greatly speed up pattern matching. See the pcre2jit documentation for
|
||||
details. JIT compiling happens, optionally, after a pattern has been
|
||||
successfully compiled into an internal form. The JIT compiler converts
|
||||
Just-in-time (JIT) compiling is a heavyweight optimization that can
|
||||
greatly speed up pattern matching. See the pcre2jit documentation for
|
||||
details. JIT compiling happens, optionally, after a pattern has been
|
||||
successfully compiled into an internal form. The JIT compiler converts
|
||||
this to optimized machine code. It needs to know whether the match-time
|
||||
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
|
||||
because different code is generated for the different cases. See the
|
||||
partial modifier in "Subject Modifiers" below for details of how these
|
||||
because different code is generated for the different cases. See the
|
||||
partial modifier in "Subject Modifiers" below for details of how these
|
||||
options are specified for each match attempt.
|
||||
|
||||
JIT compilation is requested by the /jit pattern modifier, which may
|
||||
JIT compilation is requested by the jit pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to
|
||||
7. The three bits that make up the number specify which of the three
|
||||
7. The three bits that make up the number specify which of the three
|
||||
JIT operating modes are to be compiled:
|
||||
|
||||
1 compile JIT code for non-partial matching
|
||||
|
@ -761,31 +768,31 @@ PATTERN MODIFIERS
|
|||
6 soft and hard partial matching only
|
||||
7 all three modes
|
||||
|
||||
If no number is given, 7 is assumed. The phrase "partial matching"
|
||||
If no number is given, 7 is assumed. The phrase "partial matching"
|
||||
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
|
||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
|
||||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
|
||||
plete match; the options enable the possibility of a partial match, but
|
||||
do not require it. Note also that if you request JIT compilation only
|
||||
for partial matching (for example, /jit=2) but do not set the partial
|
||||
modifier on a subject line, that match will not use JIT code because
|
||||
do not require it. Note also that if you request JIT compilation only
|
||||
for partial matching (for example, jit=2) but do not set the partial
|
||||
modifier on a subject line, that match will not use JIT code because
|
||||
none was compiled for non-partial matching.
|
||||
|
||||
If JIT compilation is successful, the compiled JIT code will automati-
|
||||
cally be used when an appropriate type of match is run, except when
|
||||
incompatible run-time options are specified. For more details, see the
|
||||
pcre2jit documentation. See also the jitstack modifier below for a way
|
||||
If JIT compilation is successful, the compiled JIT code will automati-
|
||||
cally be used when an appropriate type of match is run, except when
|
||||
incompatible run-time options are specified. For more details, see the
|
||||
pcre2jit documentation. See also the jitstack modifier below for a way
|
||||
of setting the size of the JIT stack.
|
||||
|
||||
If the jitfast modifier is specified, matching is done using the JIT
|
||||
"fast path" interface, pcre2_jit_match(), which skips some of the san-
|
||||
ity checks that are done by pcre2_match(), and of course does not work
|
||||
when JIT is not supported. If jitfast is specified without jit, jit=7
|
||||
If the jitfast modifier is specified, matching is done using the JIT
|
||||
"fast path" interface, pcre2_jit_match(), which skips some of the san-
|
||||
ity checks that are done by pcre2_match(), and of course does not work
|
||||
when JIT is not supported. If jitfast is specified without jit, jit=7
|
||||
is assumed.
|
||||
|
||||
If the jitverify modifier is specified, information about the compiled
|
||||
pattern shows whether JIT compilation was or was not successful. If
|
||||
jitverify is specified without jit, jit=7 is assumed. If JIT compila-
|
||||
tion is successful when jitverify is set, the text "(JIT)" is added to
|
||||
If the jitverify modifier is specified, information about the compiled
|
||||
pattern shows whether JIT compilation was or was not successful. If
|
||||
jitverify is specified without jit, jit=7 is assumed. If JIT compila-
|
||||
tion is successful when jitverify is set, the text "(JIT)" is added to
|
||||
the first output line after a match or non-match when JIT-compiled code
|
||||
was actually used in the match.
|
||||
|
||||
|
@ -796,19 +803,19 @@ PATTERN MODIFIERS
|
|||
/pattern/locale=fr_FR
|
||||
|
||||
The given locale is set, pcre2_maketables() is called to build a set of
|
||||
character tables for the locale, and this is then passed to pcre2_com-
|
||||
pile() when compiling the regular expression. The same tables are used
|
||||
when matching the following subject lines. The locale modifier applies
|
||||
character tables for the locale, and this is then passed to pcre2_com-
|
||||
pile() when compiling the regular expression. The same tables are used
|
||||
when matching the following subject lines. The locale modifier applies
|
||||
only to the pattern on which it appears, but can be given in a #pattern
|
||||
command if a default is needed. Setting a locale and alternate charac-
|
||||
command if a default is needed. Setting a locale and alternate charac-
|
||||
ter tables are mutually exclusive.
|
||||
|
||||
Showing pattern memory
|
||||
|
||||
The memory modifier causes the size in bytes of the memory used to hold
|
||||
the compiled pattern to be output. This does not include the size of
|
||||
the pcre2_code block; it is just the actual compiled data. If the pat-
|
||||
tern is subsequently passed to the JIT compiler, the size of the JIT
|
||||
the compiled pattern to be output. This does not include the size of
|
||||
the pcre2_code block; it is just the actual compiled data. If the pat-
|
||||
tern is subsequently passed to the JIT compiler, the size of the JIT
|
||||
compiled code is also output. Here is an example:
|
||||
|
||||
re> /a(b)c/jit,memory
|
||||
|
@ -818,27 +825,27 @@ PATTERN MODIFIERS
|
|||
|
||||
Limiting nested parentheses
|
||||
|
||||
The parens_nest_limit modifier sets a limit on the depth of nested
|
||||
parentheses in a pattern. Breaching the limit causes a compilation
|
||||
error. The default for the library is set when PCRE2 is built, but
|
||||
pcre2test sets its own default of 220, which is required for running
|
||||
The parens_nest_limit modifier sets a limit on the depth of nested
|
||||
parentheses in a pattern. Breaching the limit causes a compilation
|
||||
error. The default for the library is set when PCRE2 is built, but
|
||||
pcre2test sets its own default of 220, which is required for running
|
||||
the standard test suite.
|
||||
|
||||
Limiting the pattern length
|
||||
|
||||
The max_pattern_length modifier sets a limit, in code units, to the
|
||||
The max_pattern_length modifier sets a limit, in code units, to the
|
||||
length of pattern that pcre2_compile() will accept. Breaching the limit
|
||||
causes a compilation error. The default is the largest number a
|
||||
causes a compilation error. The default is the largest number a
|
||||
PCRE2_SIZE variable can hold (essentially unlimited).
|
||||
|
||||
Using the POSIX wrapper API
|
||||
|
||||
The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
||||
the POSIX wrapper API rather than its native API. When posix_nosub is
|
||||
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
|
||||
wrapper supports only the 8-bit library. Note that it does not imply
|
||||
The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
||||
the POSIX wrapper API rather than its native API. When posix_nosub is
|
||||
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
|
||||
wrapper supports only the 8-bit library. Note that it does not imply
|
||||
POSIX matching semantics; for more detail see the pcre2posix documenta-
|
||||
tion. The following pattern modifiers set options for the regcomp()
|
||||
tion. The following pattern modifiers set options for the regcomp()
|
||||
function:
|
||||
|
||||
caseless REG_ICASE
|
||||
|
@ -848,35 +855,39 @@ PATTERN MODIFIERS
|
|||
ucp REG_UCP ) the POSIX standard
|
||||
utf REG_UTF8 )
|
||||
|
||||
The regerror_buffsize modifier specifies a size for the error buffer
|
||||
that is passed to regerror() in the event of a compilation error. For
|
||||
The regerror_buffsize modifier specifies a size for the error buffer
|
||||
that is passed to regerror() in the event of a compilation error. For
|
||||
example:
|
||||
|
||||
/abc/posix,regerror_buffsize=20
|
||||
|
||||
This provides a means of testing the behaviour of regerror() when the
|
||||
buffer is too small for the error message. If this modifier has not
|
||||
This provides a means of testing the behaviour of regerror() when the
|
||||
buffer is too small for the error message. If this modifier has not
|
||||
been set, a large buffer is used.
|
||||
|
||||
The aftertext and allaftertext subject modifiers work as described
|
||||
below. All other modifiers are either ignored, with a warning message,
|
||||
The aftertext and allaftertext subject modifiers work as described
|
||||
below. All other modifiers are either ignored, with a warning message,
|
||||
or cause an error.
|
||||
|
||||
The pattern is passed to regcomp() as a zero-terminated string by
|
||||
default, but if the use_length or hex modifiers are set, the REG_PEND
|
||||
extension is used to pass it by length.
|
||||
|
||||
Testing the stack guard feature
|
||||
|
||||
The stackguard modifier is used to test the use of pcre2_set_com-
|
||||
pile_recursion_guard(), a function that is provided to enable stack
|
||||
availability to be checked during compilation (see the pcre2api docu-
|
||||
mentation for details). If the number specified by the modifier is
|
||||
The stackguard modifier is used to test the use of pcre2_set_com-
|
||||
pile_recursion_guard(), a function that is provided to enable stack
|
||||
availability to be checked during compilation (see the pcre2api docu-
|
||||
mentation for details). If the number specified by the modifier is
|
||||
greater than zero, pcre2_set_compile_recursion_guard() is called to set
|
||||
up a callback from pcre2_compile() to a local function. The argument it
|
||||
receives is the current nesting parenthesis depth; if this is greater
|
||||
up a callback from pcre2_compile() to a local function. The argument it
|
||||
receives is the current nesting parenthesis depth; if this is greater
|
||||
than the value given by the modifier, non-zero is returned, causing the
|
||||
compilation to be aborted.
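
The guard logic described above can be sketched as follows. This is a minimal illustration in Python, not pcre2test's own code: the names make_guard and guard are hypothetical, and the real C callback also receives a user-data pointer alongside the depth.

```python
def make_guard(limit):
    """Return a guard callback that rejects nesting deeper than `limit`.

    `limit` plays the role of the number given to the stackguard modifier.
    """
    def guard(depth):
        # Non-zero asks the compiler to abort; zero lets compilation continue.
        return 1 if depth > limit else 0
    return guard


# With stackguard=3, depths 1..3 are accepted and deeper nesting is refused.
guard = make_guard(3)
decisions = [guard(depth) for depth in range(1, 6)]
```

Returning non-zero only when the depth exceeds the limit means that a pattern nested exactly as deeply as the modifier value still compiles.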
|
||||
|
||||
Using alternative character tables
|
||||
|
||||
The value specified for the tables modifier must be one of the digits
|
||||
The value specified for the tables modifier must be one of the digits
|
||||
0, 1, or 2. It causes a specific set of built-in character tables to be
|
||||
passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
|
||||
haviour with different character tables. The digit specifies the tables
|
||||
|
@ -887,23 +898,25 @@ PATTERN MODIFIERS
|
|||
pcre2_chartables.c.dist
|
||||
2 a set of tables defining ISO 8859 characters
|
||||
|
||||
In table 2, some characters whose codes are greater than 128 are iden-
|
||||
tified as letters, digits, spaces, etc. Setting alternate character
|
||||
In table 2, some characters whose codes are greater than 128 are iden-
|
||||
tified as letters, digits, spaces, etc. Setting alternate character
|
||||
tables and a locale are mutually exclusive.
|
||||
|
||||
Setting certain match controls
|
||||
|
||||
The following modifiers are really subject modifiers, and are described
|
||||
below. However, they may be included in a pattern's modifier list, in
|
||||
which case they are applied to every subject line that is processed
|
||||
with that pattern. They may not appear in #pattern commands. These mod-
|
||||
ifiers do not affect the compilation process.
|
||||
under "Subject Modifiers" below. However, they may be included in a
|
||||
pattern's modifier list, in which case they are applied to every sub-
|
||||
ject line that is processed with that pattern. They may not appear in
|
||||
#pattern commands. These modifiers do not affect the compilation
|
||||
process.
|
||||
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
|
@ -915,6 +928,14 @@ PATTERN MODIFIERS
|
|||
These modifiers may not appear in a #pattern command. If you want them
|
||||
as defaults, set them in a #subject command.
|
||||
|
||||
Specifying literal subject lines
|
||||
|
||||
If the subject_literal modifier is present on a pattern, all the sub-
|
||||
ject lines that it matches are taken as literal strings, with no inter-
|
||||
pretation of backslashes. It is not possible to set subject modifiers
|
||||
on such lines, but any that are set as defaults by a #subject command
|
||||
are recognized.
|
||||
|
||||
Saving a compiled pattern
|
||||
|
||||
When a pattern with the push modifier is successfully compiled, it is
|
||||
|
@ -959,11 +980,11 @@ SUBJECT MODIFIERS
|
|||
The partial matching modifiers are provided with abbreviations because
|
||||
they appear frequently in tests.
|
||||
|
||||
If the posix modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any
|
||||
effect are notbol, notempty, and noteol, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
|
||||
The other modifiers are ignored, with a warning message.
|
||||
If the posix or posix_nosub modifier was present on the pattern, caus-
|
||||
ing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||
that have any effect are notbol, notempty, and noteol, causing REG_NOT-
|
||||
BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||
regexec(). The other modifiers are ignored, with a warning message.
|
||||
|
||||
There is one additional modifier that can be used with the POSIX wrap-
|
||||
per. It is ignored (with a warning) if used for non-POSIX matching.
|
||||
|
@ -971,16 +992,19 @@ SUBJECT MODIFIERS
|
|||
posix_startend=<n>[:<m>]
|
||||
|
||||
This causes the subject string to be passed to regexec() using the
|
||||
REG_STARTEND option, which uses offsets to restrict which part of the
|
||||
REG_STARTEND option, which uses offsets to specify which part of the
|
||||
string is searched. If only one number is given, the end offset is
|
||||
passed as the end of the subject string. For more detail of REG_STAR-
|
||||
TEND, see the pcre2posix documentation.
|
||||
TEND, see the pcre2posix documentation. If the subject string contains
|
||||
binary zeros (coded as escapes such as \x{00} because pcre2test does
|
||||
not support actual binary zeros in its input), you must use posix_star-
|
||||
tend to specify its length.
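
As a rough sketch of how the posix_startend value maps onto a pair of offsets, assuming the <n>[:<m>] syntax shown above (parse_startend is a hypothetical helper, not part of pcre2test):

```python
def parse_startend(value, subject):
    """Parse '<n>' or '<n>:<m>' into (start, end) offsets.

    When only one number is given, the end offset defaults to the end
    of the subject string, as the text above describes.
    """
    start_text, separator, end_text = value.partition(":")
    start = int(start_text)
    end = int(end_text) if separator else len(subject)
    return start, end


subject = "abcdef"
start, end = parse_startend("2:4", subject)
searched = subject[start:end]  # the part of the subject that is searched
```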
|
||||
|
||||
Setting match controls
|
||||
|
||||
The following modifiers affect the matching process or request addi-
|
||||
tional information. Some of them may also be specified on a pattern
|
||||
line (see above), in which case they apply to every subject line that
|
||||
The following modifiers affect the matching process or request addi-
|
||||
tional information. Some of them may also be specified on a pattern
|
||||
line (see above), in which case they apply to every subject line that
|
||||
is matched against that pattern.
|
||||
|
||||
aftertext show text after match
|
||||
|
@ -1020,29 +1044,29 @@ SUBJECT MODIFIERS
|
|||
zero_terminate pass the subject as zero-terminated
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||
and ovector subject modifiers work as described below. All other modi-
|
||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||
and ovector subject modifiers work as described below. All other modi-
|
||||
fiers are either ignored, with a warning message, or cause an error.
|
||||
|
||||
Showing more text
|
||||
|
||||
The aftertext modifier requests that as well as outputting the part of
|
||||
The aftertext modifier requests that as well as outputting the part of
|
||||
the subject string that matched the entire pattern, pcre2test should in
|
||||
addition output the remainder of the subject string. This is useful for
|
||||
tests where the subject contains multiple copies of the same substring.
|
||||
The allaftertext modifier requests the same action for captured sub-
|
||||
The allaftertext modifier requests the same action for captured sub-
|
||||
strings as well as the main matched substring. In each case the remain-
|
||||
der is output on the following line with a plus character following the
|
||||
capture number.
|
||||
|
||||
The allusedtext modifier requests that all the text that was consulted
|
||||
during a successful pattern match by the interpreter should be shown.
|
||||
This feature is not supported for JIT matching, and if requested with
|
||||
JIT it is ignored (with a warning message). Setting this modifier
|
||||
The allusedtext modifier requests that all the text that was consulted
|
||||
during a successful pattern match by the interpreter should be shown.
|
||||
This feature is not supported for JIT matching, and if requested with
|
||||
JIT it is ignored (with a warning message). Setting this modifier
|
||||
affects the output if there is a lookbehind at the start of a match, or
|
||||
a lookahead at the end, or if \K is used in the pattern. Characters
|
||||
that precede or follow the start and end of the actual match are indi-
|
||||
cated in the output by '<' or '>' characters underneath them. Here is
|
||||
a lookahead at the end, or if \K is used in the pattern. Characters
|
||||
that precede or follow the start and end of the actual match are indi-
|
||||
cated in the output by '<' or '>' characters underneath them. Here is
|
||||
an example:
|
||||
|
||||
re> /(?<=pqr)abc(?=xyz)/
|
||||
|
@ -1050,16 +1074,16 @@ SUBJECT MODIFIERS
|
|||
0: pqrabcxyz
|
||||
<<< >>>
|
||||
|
||||
This shows that the matched string is "abc", with the preceding and
|
||||
following strings "pqr" and "xyz" having been consulted during the
|
||||
This shows that the matched string is "abc", with the preceding and
|
||||
following strings "pqr" and "xyz" having been consulted during the
|
||||
match (when processing the assertions).
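
The marker line in that output can be reproduced with a small sketch. The function name and its four offset parameters are assumptions made for illustration; pcre2test derives the equivalent positions internally, and the real output is also indented to line up under the text after "0: ".

```python
def used_text_markers(used_start, match_start, match_end, used_end):
    """Build the marker line printed under the consulted text.

    '<' marks consulted characters before the match, '>' marks consulted
    characters after it, and the matched characters get spaces.
    """
    before = "<" * (match_start - used_start)
    matched = " " * (match_end - match_start)
    after = ">" * (used_end - match_end)
    return before + matched + after


# "pqrabcxyz": "abc" matched at offsets 3..6, with "pqr" and "xyz" consulted.
markers = used_text_markers(0, 3, 6, 9)
```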
|
||||
|
||||
The startchar modifier requests that the starting character for the
|
||||
match be indicated, if it is different to the start of the matched
|
||||
The startchar modifier requests that the starting character for the
|
||||
match be indicated, if it is different to the start of the matched
|
||||
string. The only time when this occurs is when \K has been processed as
|
||||
part of the match. In this situation, the output for the matched string
|
||||
is displayed from the starting character instead of from the match
|
||||
point, with circumflex characters under the earlier characters. For
|
||||
is displayed from the starting character instead of from the match
|
||||
point, with circumflex characters under the earlier characters. For
|
||||
example:
|
||||
|
||||
re> /abc\Kxyz/
|
||||
|
@ -1067,7 +1091,7 @@ SUBJECT MODIFIERS
|
|||
0: abcxyz
|
||||
^^^
|
||||
|
||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||
ever, these two modifiers are mutually exclusive.
|
||||
|
||||
Showing the value of all capture groups
|
||||
|
@ -1075,98 +1099,98 @@ SUBJECT MODIFIERS
|
|||
The allcaptures modifier requests that the values of all potential cap-
|
||||
tured parentheses be output after a match. By default, only those up to
|
||||
the highest one actually used in the match are output (corresponding to
|
||||
the return code from pcre2_match()). Groups that did not take part in
|
||||
the match are output as "<unset>". This modifier is not relevant for
|
||||
DFA matching (which does no capturing); it is ignored, with a warning
|
||||
the return code from pcre2_match()). Groups that did not take part in
|
||||
the match are output as "<unset>". This modifier is not relevant for
|
||||
DFA matching (which does no capturing); it is ignored, with a warning
|
||||
message, if present.
|
||||
|
||||
Testing callouts
|
||||
|
||||
A callout function is supplied when pcre2test calls the library match-
|
||||
ing functions, unless callout_none is specified. If callout_capture is
|
||||
set, the current captured groups are output when a callout occurs. The
|
||||
A callout function is supplied when pcre2test calls the library match-
|
||||
ing functions, unless callout_none is specified. If callout_capture is
|
||||
set, the current captured groups are output when a callout occurs. The
|
||||
default return from the callout function is zero, which allows matching
|
||||
to continue.
|
||||
|
||||
The callout_fail modifier can be given one or two numbers. If there is
|
||||
only one number, 1 is returned instead of 0 (causing matching to back-
|
||||
track) when a callout of that number is reached. If two numbers
|
||||
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
||||
there have been at least <m> callouts. The callout_error modifier is
|
||||
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
||||
entire matching process to be aborted. If both these modifiers are set
|
||||
The callout_fail modifier can be given one or two numbers. If there is
|
||||
only one number, 1 is returned instead of 0 (causing matching to back-
|
||||
track) when a callout of that number is reached. If two numbers
|
||||
(<n>:<m>) are given, 1 is returned when callout <n> is reached and
|
||||
there have been at least <m> callouts. The callout_error modifier is
|
||||
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
|
||||
entire matching process to be aborted. If both these modifiers are set
|
||||
for the same callout number, callout_error takes precedence.
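
The decision rule for callout_fail and callout_error might be sketched like this. It is an interpretation of the description above, not pcre2test's source: callout_return and its parameters are hypothetical, a single number is modelled as (n, 0), `count` is assumed to be the number of callouts seen so far, and ERROR_CALLOUT is a symbolic stand-in for the value that pcre2.h defines.

```python
CONTINUE = 0                           # default return: matching carries on
BACKTRACK = 1                          # returned by callout_fail
ERROR_CALLOUT = "PCRE2_ERROR_CALLOUT"  # stand-in for the real constant


def callout_return(number, count, fail=None, error=None):
    """Choose the callout return value.

    `fail` and `error` are (callout number, minimum count) pairs or None.
    """
    def triggered(spec):
        if spec is None:
            return False
        wanted, minimum = spec
        return number == wanted and count >= minimum

    # callout_error takes precedence when both apply to the same callout.
    if triggered(error):
        return ERROR_CALLOUT
    if triggered(fail):
        return BACKTRACK
    return CONTINUE
```

For instance, callout_return(2, 1, fail=(2, 3)) lets matching continue because fewer than three callouts have happened, while callout_return(2, 3, fail=(2, 3)) forces a backtrack.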
|
||||
|
||||
Note that callouts with string arguments are always given the number
|
||||
Note that callouts with string arguments are always given the number
|
||||
zero. See "Callouts" below for a description of the output when a call-
|
||||
out is taken.
|
||||
|
||||
The callout_data modifier can be given an unsigned or a negative num-
|
||||
ber. This is set as the "user data" that is passed to the matching
|
||||
function, and passed back when the callout function is invoked. Any
|
||||
value other than zero is used as a return from pcre2test's callout
|
||||
The callout_data modifier can be given an unsigned or a negative num-
|
||||
ber. This is set as the "user data" that is passed to the matching
|
||||
function, and passed back when the callout function is invoked. Any
|
||||
value other than zero is used as a return from pcre2test's callout
|
||||
function.
|
||||
|
||||
Finding all matches in a string
|
||||
|
||||
Searching for all possible matches within a subject can be requested by
|
||||
the global or altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
the global or altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened subject. This makes a
|
||||
difference to the matching process if the pattern begins with a lookbe-
|
||||
hind assertion (including \b or \B).
|
||||
|
||||
If an empty string is matched, the next match is done with the
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||
for another, non-empty, match at the same point in the subject. If this
|
||||
match fails, the start offset is advanced, and the normal match is
|
||||
retried. This imitates the way Perl handles such cases when using the
|
||||
/g modifier or the split() function. Normally, the start offset is
|
||||
advanced by one character, but if the newline convention recognizes
|
||||
CRLF as a newline, and the current character is CR followed by LF, an
|
||||
match fails, the start offset is advanced, and the normal match is
|
||||
retried. This imitates the way Perl handles such cases when using the
|
||||
/g modifier or the split() function. Normally, the start offset is
|
||||
advanced by one character, but if the newline convention recognizes
|
||||
CRLF as a newline, and the current character is CR followed by LF, an
|
||||
advance of two characters occurs.
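
The advance rule in the last two sentences can be written out as a short sketch. It treats each element of a Python string as one character and ignores code unit widths; next_start_offset and crlf_is_newline are hypothetical names.

```python
def next_start_offset(subject, offset, crlf_is_newline):
    """Advance the start offset after the anchored non-empty retry fails."""
    # Step over CR LF as a unit when the newline convention recognizes CRLF.
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    return offset + 1
```

Stepping over the whole CRLF pair avoids restarting a match between the CR and the LF.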
|
||||
|
||||
Testing substring extraction functions
|
||||
|
||||
The copy and get modifiers can be used to test the pcre2_sub-
|
||||
The copy and get modifiers can be used to test the pcre2_sub-
|
||||
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
||||
given more than once, and each can specify a group name or number, for
|
||||
given more than once, and each can specify a group name or number, for
|
||||
example:
|
||||
|
||||
abcd\=copy=1,copy=3,get=G1
|
||||
|
||||
If the #subject command is used to set default copy and/or get lists,
|
||||
these can be unset by specifying a negative number to cancel all num-
|
||||
If the #subject command is used to set default copy and/or get lists,
|
||||
these can be unset by specifying a negative number to cancel all num-
|
||||
bered groups and an empty name to cancel all named groups.
|
||||
|
||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||
all captured substrings.
|
||||
|
||||
If the subject line is successfully matched, the substrings extracted
|
||||
by the convenience functions are output with C, G, or L after the
|
||||
string number instead of a colon. This is in addition to the normal
|
||||
full list. The string length (that is, the return from the extraction
|
||||
If the subject line is successfully matched, the substrings extracted
|
||||
by the convenience functions are output with C, G, or L after the
|
||||
string number instead of a colon. This is in addition to the normal
|
||||
full list. The string length (that is, the return from the extraction
|
||||
function) is given in parentheses after each substring, followed by the
|
||||
name when the extraction was by name.
|
||||
|
||||
Testing the substitution function
|
||||
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions. Note that replacement
|
||||
strings cannot contain commas, because a comma signifies the end of a
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions. Note that replacement
|
||||
strings cannot contain commas, because a comma signifies the end of a
|
||||
modifier. This is not thought to be an issue in a test program.
|
||||
|
||||
Unlike subject strings, pcre2test does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to
|
||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||
a UTF string of the appropriate code unit width. If it is not a valid
|
||||
UTF-8 string, the individual code units are copied directly. This pro-
|
||||
Unlike subject strings, pcre2test does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to
|
||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||
a UTF string of the appropriate code unit width. If it is not a valid
|
||||
UTF-8 string, the individual code units are copied directly. This pro-
|
||||
vides a means of passing an invalid UTF-8 string for testing purposes.
|
||||
|
||||
The following modifiers set options (in addition to the normal match
|
||||
The following modifiers set options (in addition to the normal match
|
||||
options) for pcre2_substitute():
|
||||
|
||||
global PCRE2_SUBSTITUTE_GLOBAL
|
||||
|
@ -1176,8 +1200,8 @@ SUBJECT MODIFIERS
|
|||
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
|
||||
|
||||
|
||||
After a successful substitution, the modified string is output, pre-
|
||||
ceded by the number of replacements. This may be zero if there were no
|
||||
After a successful substitution, the modified string is output, pre-
|
||||
ceded by the number of replacements. This may be zero if there were no
|
||||
matches. Here is a simple example of a substitution test:
|
||||
|
||||
/abc/replace=xxx
|
||||
|
@ -1186,12 +1210,12 @@ SUBJECT MODIFIERS
|
|||
=abc=abc=\=global
|
||||
2: =xxx=xxx=
|
||||
|
||||
Subject and replacement strings should be kept relatively short (fewer
|
||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||
used. To make it easy to test for buffer overflow, if the replacement
|
||||
string starts with a number in square brackets, that number is passed
|
||||
to pcre2_substitute() as the size of the output buffer, with the
|
||||
replacement string starting at the next character. Here is an example
|
||||
Subject and replacement strings should be kept relatively short (fewer
|
||||
than 256 characters) for substitution tests, as fixed-size buffers are
|
||||
used. To make it easy to test for buffer overflow, if the replacement
|
||||
string starts with a number in square brackets, that number is passed
|
||||
to pcre2_substitute() as the size of the output buffer, with the
|
||||
replacement string starting at the next character. Here is an example
|
||||
that tests the edge case:
|
||||
|
||||
/abc/
|
||||
|
@ -1200,11 +1224,11 @@ SUBJECT MODIFIERS
|
|||
123abc123\=replace=[9]XYZ
|
||||
Failed: error -47: no more memory
|
||||
|
||||
The default action of pcre2_substitute() is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||
through the motions of matching and substituting, in order to compute
|
||||
The default action of pcre2_substitute() is to return
|
||||
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
|
||||
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
|
||||
stitute_overflow_length modifier), pcre2_substitute() continues to go
|
||||
through the motions of matching and substituting, in order to compute
|
||||
the size of buffer that is required. When this happens, pcre2test shows
|
||||
the required buffer length (which includes space for the trailing zero)
|
||||
as part of the error message. For example:
|
||||
|
@ -1214,105 +1238,106 @@ SUBJECT MODIFIERS
|
|||
Failed: error -47: no more memory: 10 code units are needed
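
The arithmetic behind the two buffer examples can be checked with a small sketch. It is not how pcre2_substitute() is implemented: a literal, first-occurrence replacement stands in for pattern matching, len() stands in for a count of code units (true here because the text is ASCII), and both function names are hypothetical.

```python
def parse_replacement(text):
    """Split an optional '[<size>]' prefix from a replacement string."""
    if text.startswith("[") and "]" in text:
        size_text, rest = text[1:].split("]", 1)
        if size_text.isdigit():
            return int(size_text), rest
    return None, text


def substitute_first(subject, literal, replacement, buffer_size,
                     overflow_length=False):
    """Replace the first occurrence and apply the output buffer check."""
    result = subject.replace(literal, replacement, 1)
    needed = len(result) + 1  # one extra code unit for the trailing zero
    if buffer_size is not None and needed > buffer_size:
        if overflow_length:
            return f"no more memory: {needed} code units are needed"
        return "no more memory"
    return result


size, replacement = parse_replacement("[9]XYZ")
outcome = substitute_first("123abc123", "abc", replacement, size)
```

The result "123XYZ123" is nine code units long, so a buffer of 9 leaves no room for the trailing zero, which is why 10 code units are reported as needed.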
|
||||
|
||||
A replacement string is ignored with POSIX and DFA matching. Specifying
|
||||
partial matching provokes an error return ("bad option value") from
|
||||
partial matching provokes an error return ("bad option value") from
|
||||
pcre2_substitute().
|
||||
|
||||
Setting the JIT stack size
|
||||
|
||||
The jitstack modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if
|
||||
The jitstack modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if
|
||||
JIT optimization is not being used. The value is a number of kilobytes.
|
||||
Providing a stack that is larger than the default 32K is necessary only
|
||||
for very complicated patterns.
|
||||
Setting zero reverts to the default of 32K. Providing a stack that is
|
||||
larger than the default is necessary only for very complicated pat-
|
||||
terns. If jitstack is set non-zero on a subject line it overrides any
|
||||
value that was set on the pattern.
|
||||
|
||||
Setting heap, match, and depth limits
|
||||
|
||||
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
||||
priate limits in the match context. These values are ignored when the
|
||||
The heap_limit, match_limit, and depth_limit modifiers set the appro-
|
||||
priate limits in the match context. These values are ignored when the
|
||||
find_limits modifier is specified.
|
||||
|
||||
Finding minimum limits
|
||||
|
||||
If the find_limits modifier is present on a subject line, pcre2test
|
||||
calls the relevant matching function several times, setting different
|
||||
values in the match context via pcre2_set_heap_limit(),
|
||||
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
||||
minimum values for each parameter that allows the match to complete
|
||||
If the find_limits modifier is present on a subject line, pcre2test
|
||||
calls the relevant matching function several times, setting different
|
||||
values in the match context via pcre2_set_heap_limit(),
|
||||
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
|
||||
minimum values for each parameter that allows the match to complete
|
||||
without error.
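
One plausible way to find such a minimum is a binary search, sketched below. This is an assumption about the general technique, not a description of the search pcre2test actually performs; `completes` is a hypothetical callable that runs the match with the given limit and reports whether it finished without a limit error.

```python
def find_minimum_limit(completes, upper=1 << 32):
    """Return the smallest limit in [1, upper] for which completes() is true.

    Assumes that a match which completes at some limit also completes at
    every larger limit, and that it completes at `upper`.
    """
    low, high = 1, upper
    while low < high:
        middle = (low + high) // 2
        if completes(middle):
            high = middle       # middle is enough; try smaller values
        else:
            low = middle + 1    # middle is too small
    return low


# Stand-in for a match that needs a limit of at least 37 to complete.
minimum = find_minimum_limit(lambda limit: limit >= 37)
```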
|
||||
|
||||
If JIT is being used, only the match limit is relevant. If DFA matching
|
||||
is being used, only the depth limit is relevant.
|
||||
|
||||
The match_limit number is a measure of the amount of backtracking that
|
||||
takes place, and learning the minimum value can be instructive. For
|
||||
most simple matches, the number is quite small, but for patterns with
|
||||
very large numbers of matching possibilities, it can become large very
|
||||
The match_limit number is a measure of the amount of backtracking that
|
||||
takes place, and learning the minimum value can be instructive. For
|
||||
most simple matches, the number is quite small, but for patterns with
|
||||
very large numbers of matching possibilities, it can become large very
|
||||
quickly with increasing length of subject string.
|
||||
|
||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||
For non-DFA matching, the minimum depth_limit number is a measure of
|
||||
how much nested backtracking happens (that is, how deeply the pattern's
|
||||
tree is searched). In the case of DFA matching, depth_limit controls
|
||||
the depth of recursive calls of the internal function that is used for
|
||||
tree is searched). In the case of DFA matching, depth_limit controls
|
||||
the depth of recursive calls of the internal function that is used for
|
||||
handling pattern recursion, lookaround assertions, and atomic groups.
|
||||
|
||||
Showing MARK names
|
||||
|
||||
|
||||
The mark modifier causes the names from backtracking control verbs that
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
are returned from calls to pcre2_match() to be displayed. If a mark is
|
||||
returned for a match, non-match, or partial match, pcre2test shows it.
|
||||
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
|
||||
it is added to the non-match message.
|
||||
|
||||
Showing memory usage
|
||||
|
||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||
ory allocation and freeing calls that occur during a call to
|
||||
pcre2_match(). These occur only when a match requires a bigger vector
|
||||
than the default for remembering backtracking points. In many cases
|
||||
there will be no heap memory used and therefore no additional output.
|
||||
No heap memory is allocated during matching with pcre2_dfa_match() or
|
||||
with JIT, so in those cases the memory modifier never has any effect.
|
||||
The memory modifier causes pcre2test to log the sizes of all heap mem-
|
||||
ory allocation and freeing calls that occur during a call to
|
||||
pcre2_match(). These occur only when a match requires a bigger vector
|
||||
than the default for remembering backtracking points. In many cases
|
||||
there will be no heap memory used and therefore no additional output.
|
||||
No heap memory is allocated during matching with pcre2_dfa_match() or
|
||||
with JIT, so in those cases the memory modifier never has any effect.
|
||||
For this modifier to work, the null_context modifier must not be set on
|
||||
both the pattern and the subject, though it can be set on one or the
|
||||
both the pattern and the subject, though it can be set on one or the
|
||||
other.
|
||||
|
||||
Setting a starting offset
|
||||
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
The offset modifier sets an offset in the subject string at which
|
||||
matching starts. Its value is a number of code units, not characters.
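
Because the value counts code units, a character position has to be converted before it is used as an offset in UTF mode. The sketch below shows the conversion for the three code unit widths; code_unit_offset is a hypothetical helper and the sample strings are illustrative only.

```python
def code_unit_offset(subject, char_index, width=8):
    """Convert a character index into an offset in code units.

    `width` is the code unit width in bits (8, 16, or 32).
    """
    encoding = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[width]
    prefix = subject[:char_index].encode(encoding)
    return len(prefix) // (width // 8)


# 'a' takes 1 UTF-8 code unit, 'é' takes 2, and '€' takes 3.
offset_8 = code_unit_offset("aé€b", 3, 8)
offset_16 = code_unit_offset("aé€b", 3, 16)
```

In the 8-bit library the fourth character of that string starts at offset 6, whereas in the 16-bit library it starts at offset 3.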
|
||||
|
||||
Setting an offset limit
|
||||
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
The offset_limit modifier sets a limit for unanchored matches. If a
|
||||
match cannot be found starting at or before this offset in the subject,
|
||||
a "no match" return is given. The data value is a number of code units,
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
not characters. When this modifier is used, the use_offset_limit modi-
|
||||
fier must have been set for the pattern; if not, an error is generated.
|
||||
|
||||
Setting the size of the output vector
|
||||
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
The ovector modifier applies only to the subject line in which it
|
||||
appears, though of course it can also be used to set a default in a
|
||||
#subject command. It specifies the number of pairs of offsets that are
|
||||
available for storing matching information. The default is 15.
|
||||
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
A value of zero is useful when testing the POSIX API because it causes
|
||||
regexec() to be called with a NULL capture vector. When not testing the
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
POSIX API, a value of zero is used to cause pcre2_match_data_cre-
|
||||
ate_from_pattern() to be called, in order to create a match block of
|
||||
exactly the right size for the pattern. (It is not possible to create a
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
match block with a zero-length ovector; there is always at least one
|
||||
pair of offsets.)
|
||||
|
||||
Passing the subject as zero-terminated
|
||||
|
||||
By default, the subject string is passed to a native API matching func-
|
||||
tion with its correct length. In order to test the facility for passing
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
||||
via the POSIX interface, this modifier has no effect, as there is no
|
||||
facility for passing a length.)
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
|
||||
via the POSIX interface, this modifier is ignored, with a warning.
|
||||
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
|
@ -1513,8 +1538,8 @@ CALLOUTS
|
|||
position, which can happen if the callout is in a lookbehind assertion.
|
||||
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||
a result of the /auto_callout pattern modifier. In this case, instead
|
||||
of showing the callout number, the offset in the pattern, preceded by a
|
||||
a result of the auto_callout pattern modifier. In this case, instead of
|
||||
showing the callout number, the offset in the pattern, preceded by a
|
||||
plus, is output. For example:
|
||||
|
||||
re> /\d?[A-E]\*/auto_callout
|
||||
|
@ -1662,5 +1687,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 June 2017
|
||||
Last updated: 16 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
|
|