Documentation update.

Philip.Hazel 2017-06-16 17:57:18 +00:00
parent a083420cac
commit c92bfc3d21
9 changed files with 522 additions and 383 deletions


@ -47,7 +47,7 @@ system stack size checking, or to change one or more of these parameters:
The newline character sequence; The newline character sequence;
The compile time nested parentheses limit; The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed. The maximum pattern length (in code units) that is allowed.
The additional options bits The additional options bits (see pcre2_set_compile_extra_options())
</pre> </pre>
The option bits are: The option bits are:
<pre> <pre>
@ -64,6 +64,7 @@ The option bits are:
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns


@ -32,6 +32,8 @@ options are:
<pre> <pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
</pre> </pre>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>


@ -1453,6 +1453,19 @@ continue over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
a match must occur in the first line and also within the offset limit. In other a match must occur in the first line and also within the offset limit. In other
words, whichever limit comes first is used. words, whichever limit comes first is used.
<pre>
PCRE2_LITERAL
</pre>
If this option is set, all meta-characters in the pattern are disabled, and it
is treated as a literal string. Matching literal strings with a regular
expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
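As a rough illustration of what treating a pattern as a literal string means, the sketch below uses Python's re module as a stand-in for PCRE2. This is an assumption made to keep the example self-contained; it does not call PCRE2. Here re.escape() plays the role of PCRE2_LITERAL, re.IGNORECASE plays the role of PCRE2_CASELESS, and the helper name literal_search is hypothetical.

```python
import re

def literal_search(pattern, subject, caseless=False):
    """Search for `pattern` as a literal string; no character is special."""
    flags = re.IGNORECASE if caseless else 0
    # re.escape() disables every metacharacter, emulating the effect the
    # text describes for PCRE2_LITERAL (this is Python's re, not PCRE2).
    return re.search(re.escape(pattern), subject, flags)

# As a regex, "a.c" would match "abc"; as a literal it matches only "a.c".
m = literal_search("a.c", "abc a.c")
print(m.span())                 # → (4, 7)

# A plain substring search avoids the regex engine altogether, which is
# one possible "other approach" when efficiency matters.
print("abc a.c".find("a.c"))    # → 4
```

The substring search at the end finds the same position without compiling anything, which is why a regex engine is rarely the best tool for bulk literal matching.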
<pre> <pre>
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
</pre> </pre>
@ -1724,6 +1737,24 @@ treated as single-character escapes. For example, \j is a literal "j" and
\x{2z} is treated as the literal string "x{2z}". Setting this option means \x{2z} is treated as the literal string "x{2z}". Setting this option means
that typos in patterns may go undetected and have unexpected results. This is a that typos in patterns may go undetected and have unexpected results. This is a
dangerous option. Use with care. dangerous option. Use with care.
<pre>
PCRE2_EXTRA_MATCH_LINE
</pre>
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
causes the pattern only to match complete lines. This is achieved by
automatically inserting the code for "^(?:" at the start of the compiled
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
line may be in the middle of the subject string. This option can be used with
PCRE2_LITERAL.
<pre>
PCRE2_EXTRA_MATCH_WORD
</pre>
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
causes the pattern only to match strings that have a word boundary at the start
and the end. This is achieved by automatically inserting the code for "\b(?:"
at the start of the compiled pattern and ")\b" at the end. The option may be
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
also set.
</P> </P>
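The wrapping described for PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD can be sketched at the pattern-text level. The example below again uses Python's re module purely as a stand-in (an assumption; PCRE2 inserts compiled code rather than editing the pattern string), and wrap_pattern is a hypothetical helper name.

```python
import re

def wrap_pattern(pattern, match_line=False, match_word=False):
    """Return pattern text equivalent to the described automatic wrapping."""
    if match_line:
        # MATCH_LINE wins: MATCH_WORD is ignored when both are set.
        return "^(?:" + pattern + ")$"
    if match_word:
        return r"\b(?:" + pattern + r")\b"
    return pattern

# re.MULTILINE stands in for PCRE2_MULTILINE in this sketch.
line_re = re.compile(wrap_pattern("cat|dog", match_line=True), re.MULTILINE)
word_re = re.compile(wrap_pattern("cat|dog", match_word=True))

# With multiline matching, the matched line may sit mid-subject.
print(line_re.search("hotdog\ndog\ncatalog").span())   # → (7, 10)

# "dog" inside "hotdog" has no word boundary before it, so it is skipped.
print(word_re.search("hotdog dog").span())             # → (7, 10)
```

The non-capturing group matters when the pattern contains alternation: without it, "^cat|dog$" would anchor each alternative separately and match the "dog" at the end of "hotdog".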
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br> <br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
<P> <P>
@ -3489,7 +3520,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 01 June 2017 Last updated: 16 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -94,7 +94,7 @@ The function <b>regcomp()</b> is called to compile a pattern into an
internal form. By default, the pattern is a C string terminated by a binary internal form. By default, the pattern is a C string terminated by a binary
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
<b>regex_t</b> structure that is used as a base for storing information about <b>regex_t</b> structure that is used as a base for storing information about
the compiled regular expression. (It is also used for input when REG_PEND is the compiled regular expression. (It is also used for input when REG_PEND is
set.) set.)
</P> </P>
<P> <P>
@ -117,6 +117,14 @@ compilation to the native function.
The PCRE2_MULTILINE option is set when the regular expression is passed for The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section). defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSPEC
</pre>
The PCRE2_LITERAL option is set when the regular expression is passed for
compilation to the native function. This disables all meta characters in the
pattern, causing it to be treated as a literal string. The only other options
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
<pre> <pre>
REG_NOSUB REG_NOSUB
</pre> </pre>
@ -128,8 +136,8 @@ because it disables the use of back references.
<pre> <pre>
REG_PEND REG_PEND
</pre> </pre>
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
(which has the type const char *) must be set to point to the character beyond (which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
now contain binary zeroes, which are treated as data characters. Without now contain binary zeroes, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
@ -242,8 +250,8 @@ function.
</pre> </pre>
When this option is set, the subject string starts at <i>string</i> + When this option is set, the subject string starts at <i>string</i> +
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which <i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
should point to the first character beyond the string. There may be binary should point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the only zeroes within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero. way to pass a subject string that contains a binary zero.
</P> </P>
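The REG_STARTEND convention (subject runs from string + rm_so up to, but not including, string + rm_eo) can be modelled as a slice of a byte buffer. The sketch below uses Python's re module on bytes as a stand-in for regexec(); the helper name search_startend is hypothetical, and reporting offsets relative to the start of the whole buffer is an assumption of this sketch.

```python
import re

def search_startend(pattern, buf, rm_so, rm_eo):
    """Match against buf[rm_so:rm_eo]; report offsets relative to buf."""
    subject = buf[rm_so:rm_eo]      # rm_eo is one past the last byte
    m = re.search(pattern, subject)
    if m is None:
        return None
    # Shift the offsets back so that they index into the original buffer.
    return (m.start() + rm_so, m.end() + rm_so)

# The buffer contains binary zeros; an explicit end offset is what lets
# the bytes after the first zero take part in the match.
buf = b"skip\x00ab\x00cd"
print(search_startend(rb"b\x00c", buf, 5, len(buf)))   # → (6, 9)
print(search_startend(rb"skip", buf, 5, len(buf)))     # → None
```

The second call returns None because "skip" lies before the start offset and is therefore not part of the subject at all.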
<P> <P>
@ -314,7 +322,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br> <br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 05 June 2017 Last updated: 15 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -96,12 +96,12 @@ want that action.
</P> </P>
<P> <P>
The input is processed using C's string functions, so must not The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b> contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. An error is generated treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash if a binary zero is encountered. By default subject lines are processed for
escapes, which makes it possible to include any data value in strings that are backslash escapes, which makes it possible to include any data value in strings
passed to the library for matching. For patterns, there is a facility for that are passed to the library for matching. For patterns, there is a facility
specifying some or all of the 8-bit input characters as hexadecimal pairs, for specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros. which makes it possible to include binary zeros.
</P> </P>
<br><b> <br><b>
@ -382,8 +382,9 @@ of the standard test input files.
<P> <P>
When the POSIX API is being tested there is no way to override the default When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> modifier is used when within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
<b>#newline_default</b> would set a default for the non-POSIX API. modifier is used when <b>#newline_default</b> would set a default for the
non-POSIX API.
<pre> <pre>
#pattern &#60;modifier-list&#62; #pattern &#60;modifier-list&#62;
</pre> </pre>
@ -479,8 +480,9 @@ A pattern can be followed by a modifier list (details below).
<P> <P>
Before each subject line is passed to <b>pcre2_match()</b> or Before each subject line is passed to <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the <b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of line is scanned for backslash escapes, unless the <b>subject_literal</b>
encoding non-printing characters in a visible way: modifier was set for the pattern. The following provide a means of encoding
non-printing characters in a visible way:
<pre> <pre>
\a alarm (BEL, \x07) \a alarm (BEL, \x07)
\b backspace (\x08) \b backspace (\x08)
@ -548,6 +550,12 @@ the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input. a real empty line terminates the data input.
</P> </P>
<P>
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of backslashes.
No replication is possible, and any subject modifiers must be set as defaults
by a <b>#subject</b> command.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br> <br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P> <P>
There are several types of modifier that can appear in pattern lines. Except There are several types of modifier that can appear in pattern lines. Except
@ -586,7 +594,10 @@ for a description of the effects of these options.
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
@ -638,6 +649,7 @@ heavily used in the test files.
push push compiled pattern onto the stack push push compiled pattern onto the stack
pushcopy push a copy onto the stack pushcopy push a copy onto the stack
stackguard=&#60;number&#62; test the stackguard feature stackguard=&#60;number&#62; test the stackguard feature
subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8 utf8_input treat input as UTF-8
@ -728,18 +740,6 @@ testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
default values). default values).
</P> </P>
<br><b> <br><b>
Specifying the pattern's length
</b><br>
<P>
By default, patterns are passed to the compiling functions as zero-terminated
strings. When using the POSIX wrapper API, there is no other option. However,
when using PCRE2's native API, patterns can be passed by length instead of
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
Using a length happens automatically (whether or not <b>use_length</b> is set)
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
binary zeros.
</P>
<br><b>
Specifying pattern characters in hexadecimal Specifying pattern characters in hexadecimal
</b><br> </b><br>
<P> <P>
@ -761,11 +761,20 @@ Either single or double quotes may be used. There is no way of including
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
mutually exclusive. mutually exclusive.
</P> </P>
<br><b>
Specifying the pattern's length
</b><br>
<P> <P>
The POSIX API cannot be used with patterns specified in hexadecimal because By default, patterns are passed to the compiling functions as zero-terminated
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s strings but can be passed by length instead of being zero-terminated. The
requirement for a zero-terminated string. Such patterns are always passed to <b>use_length</b> modifier causes this to happen. Using a length happens
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated. automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
because patterns specified in hexadecimal may contain binary zeros.
</P>
<P>
If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
below), the REG_PEND extension is used to pass the pattern's length.
</P> </P>
<br><b> <br><b>
Specifying wide characters in 16-bit and 32-bit modes Specifying wide characters in 16-bit and 32-bit modes
@ -826,7 +835,7 @@ modifier in "Subject Modifiers"
for details of how these options are specified for each match attempt. for details of how these options are specified for each match attempt.
</P> </P>
<P> <P>
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may JIT compilation is requested by the <b>jit</b> pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7. optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating The three bits that make up the number specify which of the three JIT operating
modes are to be compiled: modes are to be compiled:
@ -850,7 +859,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial require it. Note also that if you request JIT compilation only for partial
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
subject line, that match will not use JIT code because none was compiled for subject line, that match will not use JIT code because none was compiled for
non-partial matching. non-partial matching.
</P> </P>
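The way the jit number selects operating modes can be sketched as a bit test. The bit meanings used below (1 for non-partial, 2 for soft partial, 4 for hard partial) follow the table in the full pcre2test documentation, which this hunk does not show; the function names are hypothetical and the code is only a model of the rule, not part of pcre2test.

```python
# Bit values for the three JIT operating modes.
JIT_MODES = {1: "non-partial", 2: "soft partial", 4: "hard partial"}

def jit_modes(n):
    """List the JIT modes that would be compiled for jit=<n>."""
    if not 0 <= n <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if n & bit]

def match_uses_jit(jit_value, partial=None):
    """True if JIT code exists for the kind of match being requested.

    `partial` is None for a normal match, or "soft" / "hard".
    """
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(jit_value & needed)

print(jit_modes(2))               # → ['soft partial']
print(match_uses_jit(2))          # → False: nothing compiled for non-partial
print(match_uses_jit(2, "soft"))  # → True
```

The middle call mirrors the case in the text: jit=2 without the partial modifier on the subject line leaves the match running without JIT code.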
@ -927,12 +936,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited). variable can hold (essentially unlimited).
</P> <a name="posixwrapper"></a></P>
<br><b> <br><b>
Using the POSIX wrapper API Using the POSIX wrapper API
</b><br> </b><br>
<P> <P>
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
PCRE2 via the POSIX wrapper API rather than its native API. When PCRE2 via the POSIX wrapper API rather than its native API. When
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to <b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that <b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
@ -962,6 +971,11 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, or cause below. All other modifiers are either ignored, with a warning message, or cause
an error. an error.
</P> </P>
<P>
The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
REG_PEND extension is used to pass it by length.
</P>
<br><b> <br><b>
Testing the stack guard feature Testing the stack guard feature
</b><br> </b><br>
@ -999,17 +1013,18 @@ are mutually exclusive.
Setting certain match controls Setting certain match controls
</b><br> </b><br>
<P> <P>
The following modifiers are really subject modifiers, and are described below. The following modifiers are really subject modifiers, and are described under
However, they may be included in a pattern's modifier list, in which case they "Subject Modifiers" below. However, they may be included in a pattern's
are applied to every subject line that is processed with that pattern. They may modifier list, in which case they are applied to every subject line that is
not appear in <b>#pattern</b> commands. These modifiers do not affect the processed with that pattern. They may not appear in <b>#pattern</b> commands.
compilation process. These modifiers do not affect the compilation process.
<pre> <pre>
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
jitstack=&#60;n&#62; set size of JIT stack
mark show mark values mark show mark values
replace=&#60;string&#62; specify a replacement string replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
@ -1022,6 +1037,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command. defaults, set them in a <b>#subject</b> command.
</P> </P>
<br><b> <br><b>
Specifying literal subject lines
</b><br>
<P>
If the <b>subject_literal</b> modifier is present on a pattern, all the subject
lines that it matches are taken as literal strings, with no interpretation of
backslashes. It is not possible to set subject modifiers on such lines, but any
that are set as defaults by a <b>#subject</b> command are recognized.
</P>
<br><b>
Saving a compiled pattern Saving a compiled pattern
</b><br> </b><br>
<P> <P>
@ -1072,11 +1096,11 @@ The partial matching modifiers are provided with abbreviations because they
appear frequently in tests. appear frequently in tests.
</P> </P>
<P> <P>
If the <b>posix</b> modifier was present on the pattern, causing the POSIX If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
wrapper API to be used, the only option-setting modifiers that have any effect causing the POSIX wrapper API to be used, the only option-setting modifiers
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL, that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>. causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
The other modifiers are ignored, with a warning message. <b>regexec()</b>. The other modifiers are ignored, with a warning message.
</P> </P>
<P> <P>
There is one additional modifier that can be used with the POSIX wrapper. It is There is one additional modifier that can be used with the POSIX wrapper. It is
@ -1085,11 +1109,13 @@ ignored (with a warning) if used for non-POSIX matching.
posix_startend=&#60;n&#62;[:&#60;m&#62;] posix_startend=&#60;n&#62;[:&#60;m&#62;]
</pre> </pre>
This causes the subject string to be passed to <b>regexec()</b> using the This causes the subject string to be passed to <b>regexec()</b> using the
REG_STARTEND option, which uses offsets to restrict which part of the string is REG_STARTEND option, which uses offsets to specify which part of the string is
searched. If only one number is given, the end offset is passed as the end of searched. If only one number is given, the end offset is passed as the end of
the subject string. For more detail of REG_STARTEND, see the the subject string. For more detail of REG_STARTEND, see the
<a href="pcre2posix.html"><b>pcre2posix</b></a> <a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. documentation. If the subject string contains binary zeros (coded as escapes
such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
its input), you must use <b>posix_startend</b> to specify its length.
</P> </P>
<br><b> <br><b>
Setting match controls Setting match controls
@ -1355,9 +1381,11 @@ Setting the JIT stack size
<P> <P>
The <b>jitstack</b> modifier provides a way of setting the maximum stack size The <b>jitstack</b> modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT that is used by the just-in-time optimization code. It is ignored if JIT
optimization is not being used. The value is a number of kilobytes. Providing a optimization is not being used. The value is a number of kilobytes. Setting
stack that is larger than the default 32K is necessary only for very zero reverts to the default of 32K. Providing a stack that is larger than the
complicated patterns. default is necessary only for very complicated patterns. If <b>jitstack</b> is
set non-zero on a subject line it overrides any value that was set on the
pattern.
</P> </P>
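The precedence rules for jitstack in the updated text can be modelled in a few lines. This is a sketch of the stated rules only: a non-zero value on the subject line overrides the pattern's value, and zero means the 32K default. The function name effective_jitstack_kb is hypothetical.

```python
DEFAULT_JIT_STACK_KB = 32   # default maximum JIT stack size, in kilobytes

def effective_jitstack_kb(pattern_kb=0, subject_kb=0):
    """Resolve the JIT stack size from pattern and subject modifiers."""
    # A non-zero subject-line value overrides whatever the pattern set.
    chosen = subject_kb if subject_kb != 0 else pattern_kb
    # Zero (or nothing set) falls back to the default.
    return chosen if chosen != 0 else DEFAULT_JIT_STACK_KB

print(effective_jitstack_kb())                # → 32
print(effective_jitstack_kb(pattern_kb=64))   # → 64
print(effective_jitstack_kb(64, 256))         # → 256
print(effective_jitstack_kb(64, 0))           # → 64
```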
<br><b> <br><b>
Setting heap, match, and depth limits Setting heap, match, and depth limits
@ -1461,8 +1489,8 @@ Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching function with By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated its correct length. In order to test the facility for passing a zero-terminated
string, the <b>zero_terminate</b> modifier is provided. It causes the length to string, the <b>zero_terminate</b> modifier is provided. It causes the length to
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier is ignored, with a warning.
</P> </P>
<P> <P>
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
@ -1675,7 +1703,7 @@ callout is in a lookbehind assertion.
</P> </P>
<P> <P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of result of the <b>auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example: output. For example:
<pre> <pre>
@ -1830,7 +1858,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 June 2017 Last updated: 16 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -1441,6 +1441,20 @@ COMPILING A PATTERN
first line and also within the offset limit. In other words, whichever first line and also within the offset limit. In other words, whichever
limit comes first is used. limit comes first is used.
PCRE2_LITERAL
If this option is set, all meta-characters in the pattern are disabled,
and it is treated as a literal string. Matching literal strings with a
regular expression engine is not the most efficient way of doing it. If
you are doing a lot of literal matching and are worried about effi-
ciency, you should consider using other approaches. The only other main
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
If this option is set, a back reference to an unset subpattern group If this option is set, a back reference to an unset subpattern group
@ -1706,6 +1720,24 @@ COMPILING A PATTERN
option means that typos in patterns may go undetected and have unex- option means that typos in patterns may go undetected and have unex-
pected results. This is a dangerous option. Use with care. pected results. This is a dangerous option. Use with care.
PCRE2_EXTRA_MATCH_LINE
This option is provided for use by the -x option of pcre2grep. It
causes the pattern only to match complete lines. This is achieved by
automatically inserting the code for "^(?:" at the start of the com-
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
the matched line may be in the middle of the subject string. This
option can be used with PCRE2_LITERAL.
PCRE2_EXTRA_MATCH_WORD
This option is provided for use by the -w option of pcre2grep. It
causes the pattern only to match strings that have a word boundary at
the start and the end. This is achieved by automatically inserting the
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
end. The option may be used with PCRE2_LITERAL. However, it is ignored
if PCRE2_EXTRA_MATCH_LINE is also set.
COMPILATION ERROR CODES COMPILATION ERROR CODES
@ -3368,7 +3400,7 @@ AUTHOR
REVISION REVISION
Last updated: 01 June 2017 Last updated: 16 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -9036,60 +9068,69 @@ COMPILING A PATTERN
the defined POSIX behaviour for REG_NEWLINE (see the following sec- the defined POSIX behaviour for REG_NEWLINE (see the following sec-
tion). tion).
REG_NOSPEC
The PCRE2_LITERAL option is set when the regular expression is passed
for compilation to the native function. This disables all meta charac-
ters in the pattern, causing it to be treated as a literal string. The
only other options that are allowed with REG_NOSPEC are REG_ICASE,
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
the POSIX standard.
REG_NOSUB REG_NOSUB
When a pattern that is compiled with this flag is passed to regexec() When a pattern that is compiled with this flag is passed to regexec()
for matching, the nmatch and pmatch arguments are ignored, and no cap- for matching, the nmatch and pmatch arguments are ignored, and no cap-
tured strings are returned. Versions of the PCRE library prior to 10.22 tured strings are returned. Versions of the PCRE library prior to 10.22
used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
longer happens because it disables the use of back references. longer happens because it disables the use of back references.
REG_PEND REG_PEND
If this option is set, the re_endp field in the preg structure (which If this option is set, the re_endp field in the preg structure (which
has the type const char *) must be set to point to the character beyond has the type const char *) must be set to point to the character beyond
the end of the pattern before calling regcomp(). The pattern itself may the end of the pattern before calling regcomp(). The pattern itself may
now contain binary zeroes, which are treated as data characters. With- now contain binary zeroes, which are treated as data characters. With-
out REG_PEND, a binary zero terminates the pattern and the re_endp out REG_PEND, a binary zero terminates the pattern and the re_endp
field is ignored. This is a GNU extension to the POSIX standard and field is ignored. This is a GNU extension to the POSIX standard and
should be used with caution in software intended to be portable to should be used with caution in software intended to be portable to
other systems. other systems.
REG_UCP REG_UCP
The PCRE2_UCP option is set when the regular expression is passed for The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode compilation to the native function. This causes PCRE2 to use Unicode
properties when matching \d, \w, etc., instead of just recognizing properties when matching \d, \w, etc., instead of just recognizing
ASCII values. Note that REG_UCP is not part of the POSIX standard. ASCII values. Note that REG_UCP is not part of the POSIX standard.
REG_UNGREEDY REG_UNGREEDY
The PCRE2_UNGREEDY option is set when the regular expression is passed The PCRE2_UNGREEDY option is set when the regular expression is passed
for compilation to the native function. Note that REG_UNGREEDY is not for compilation to the native function. Note that REG_UNGREEDY is not
part of the POSIX standard. part of the POSIX standard.
REG_UTF REG_UTF
The PCRE2_UTF option is set when the regular expression is passed for The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and compilation to the native function. This causes the pattern itself and
all data strings used for matching it to be treated as UTF-8 strings. all data strings used for matching it to be treated as UTF-8 strings.
Note that REG_UTF is not part of the POSIX standard. Note that REG_UTF is not part of the POSIX standard.
In the absence of these flags, no options are passed to the native In the absence of these flags, no options are passed to the native
function. This means that the regex is compiled with PCRE2 default function. This means that the regex is compiled with PCRE2 default
semantics. In particular, the way it handles newline characters in the semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting subject string is the Perl way, not the POSIX way. Note that setting
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE. PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
It does not affect the way newlines are matched by the dot metacharac- It does not affect the way newlines are matched by the dot metacharac-
ter (they are not) or by a negative class such as [^a] (they are). ter (they are not) or by a negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero otherwise. The The yield of regcomp() is zero on success, and non-zero otherwise. The
preg structure is filled in on success, and one other member of the preg structure is filled in on success, and one other member of the
structure (as well as re_endp) is public: re_nsub contains the number structure (as well as re_endp) is public: re_nsub contains the number
of capturing subpatterns in the regular expression. Various error codes of capturing subpatterns in the regular expression. Various error codes
are defined in the header file. are defined in the header file.
NOTE: If the yield of regcomp() is non-zero, you must not attempt to NOTE: If the yield of regcomp() is non-zero, you must not attempt to
use the contents of the preg structure. If, for example, you pass it to use the contents of the preg structure. If, for example, you pass it to
regexec(), the result is undefined and your program is likely to crash. regexec(), the result is undefined and your program is likely to crash.
@ -9097,9 +9138,9 @@ COMPILING A PATTERN
MATCHING NEWLINE CHARACTERS MATCHING NEWLINE CHARACTERS
This area is not simple, because POSIX and Perl take different views of This area is not simple, because POSIX and Perl take different views of
things. It is not possible to get PCRE2 to obey POSIX semantics, but things. It is not possible to get PCRE2 to obey POSIX semantics, but
then PCRE2 was never intended to be a POSIX engine. The following table then PCRE2 was never intended to be a POSIX engine. The following table
lists the different possibilities for matching newline characters in lists the different possibilities for matching newline characters in
Perl and PCRE2: Perl and PCRE2:
Default Change with Default Change with
@ -9120,25 +9161,25 @@ MATCHING NEWLINE CHARACTERS
$ matches \n in middle no REG_NEWLINE $ matches \n in middle no REG_NEWLINE
^ matches \n in middle no REG_NEWLINE ^ matches \n in middle no REG_NEWLINE
This behaviour is not what happens when PCRE2 is called via its POSIX This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that API. By default, PCRE2's behaviour is the same as Perl's, except that
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
and Perl, there is no way to stop newline from matching [^a]. and Perl, there is no way to stop newline from matching [^a].
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg- action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(), comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL- and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
LAR_ENDONLY. LAR_ENDONLY.
MATCHING A PATTERN MATCHING A PATTERN
The function regexec() is called to match a compiled pattern preg The function regexec() is called to match a compiled pattern preg
against a given string, which is by default terminated by a zero byte against a given string, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in eflags. These (but see REG_STARTEND below), subject to the options in eflags. These
can be: can be:
REG_NOTBOL REG_NOTBOL
@ -9148,9 +9189,9 @@ MATCHING A PATTERN
REG_NOTEMPTY REG_NOTEMPTY
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
matching function. Note that REG_NOTEMPTY is not part of the POSIX matching function. Note that REG_NOTEMPTY is not part of the POSIX
standard. However, setting this option can give more POSIX-like behav- standard. However, setting this option can give more POSIX-like behav-
iour in some situations. iour in some situations.
REG_NOTEOL REG_NOTEOL
@ -9160,66 +9201,66 @@ MATCHING A PATTERN
REG_STARTEND REG_STARTEND
When this option is set, the subject string starts at string + When this option is set, the subject string starts at string +
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
point to the first character beyond the string. There may be binary point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the zeroes within the subject string, and indeed, using REG_STARTEND is the
only way to pass a subject string that contains a binary zero. only way to pass a subject string that contains a binary zero.
Whatever the value of pmatch[0].rm_so, the offsets of the matched Whatever the value of pmatch[0].rm_so, the offsets of the matched
string and any captured substrings are still given relative to the string and any captured substrings are still given relative to the
start of string itself. (Before PCRE2 release 10.30 these were given start of string itself. (Before PCRE2 release 10.30 these were given
relative to string + pmatch[0].rm_so, but this differs from other relative to string + pmatch[0].rm_so, but this differs from other
implementations.) implementations.)
This is a BSD extension, compatible with but not specified by IEEE This is a BSD extension, compatible with but not specified by IEEE
Standard 1003.2 (POSIX.2), and should be used with caution in software Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero rm_so intended to be portable to other systems. Note that a non-zero rm_so
does not imply REG_NOTBOL; REG_STARTEND affects only the location and does not imply REG_NOTBOL; REG_STARTEND affects only the location and
length of the string, not how it is matched. Setting REG_STARTEND and length of the string, not how it is matched. Setting REG_STARTEND and
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
returned. returned.
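The REG_STARTEND behaviour described above can be sketched as follows. This is a minimal illustration, not part of the PCRE2 sources: the helper name match_in_range() is hypothetical, and the sketch includes the system <regex.h> on the assumption that the platform defines REG_STARTEND; when building against the PCRE2 wrapper, pcre2posix.h would be included instead.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match preg against the bytes string[start..end).
 * The range is passed in through pmatch[0], as REG_STARTEND requires,
 * so pmatch must not be NULL. On success the offsets written to *out are
 * relative to the start of string itself, not to string + start. */
int match_in_range(const regex_t *preg, const char *string,
                   size_t start, size_t end, regmatch_t *out)
{
    regmatch_t pm[1];
    int rc;

    pm[0].rm_so = (regoff_t)start;  /* first byte of the subject */
    pm[0].rm_eo = (regoff_t)end;    /* first byte beyond the subject */

    rc = regexec(preg, string, 1, pm, REG_STARTEND);
    if (rc == 0)
        *out = pm[0];
    return rc;
}
```

Because the subject is delimited by the two offsets rather than by a terminating zero, the same call shape can be used for data that contains binary zeroes.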
If the pattern was compiled with the REG_NOSUB flag, no data about any If the pattern was compiled with the REG_NOSUB flag, no data about any
matched strings is returned. The nmatch and pmatch arguments of matched strings is returned. The nmatch and pmatch arguments of
regexec() are ignored (except possibly as input for REG_STARTEND). regexec() are ignored (except possibly as input for REG_STARTEND).
The value of nmatch may be zero, and the value of pmatch may be NULL The value of nmatch may be zero, and the value of pmatch may be NULL
(unless REG_STARTEND is set); in both these cases no data about any (unless REG_STARTEND is set); in both these cases no data about any
matched strings is returned. matched strings is returned.
Otherwise, the portion of the string that was matched, and also any Otherwise, the portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument, which points captured substrings, are returned via the pmatch argument, which points
to an array of nmatch structures of type regmatch_t, containing the to an array of nmatch structures of type regmatch_t, containing the
members rm_so and rm_eo. These contain the byte offset to the first members rm_so and rm_eo. These contain the byte offset to the first
character of each substring and the offset to the first character after character of each substring and the offset to the first character after
the end of each substring, respectively. The 0th element of the vector the end of each substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched; subsequent relates to the entire portion of string that was matched; subsequent
elements relate to the capturing subpatterns of the regular expression. elements relate to the capturing subpatterns of the regular expression.
Unused entries in the array have both structure members set to -1. Unused entries in the array have both structure members set to -1.
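As an illustration of how the pmatch vector might be consumed, the sketch below copies one captured substring into a caller-supplied buffer. The function name copy_capture() and its return convention are assumptions made for this example; only regmatch_t, rm_so and rm_eo come from the API described above.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy capture n (0 = the whole match) from subject
 * into out, truncating if necessary and always zero-terminating.
 * Returns the number of bytes copied, or -1 if entry n is outside the
 * vector or unused (both members set to -1). */
int copy_capture(const char *subject, const regmatch_t *pmatch,
                 size_t nmatch, size_t n, char *out, size_t outsize)
{
    size_t len;

    if (n >= nmatch || outsize == 0)
        return -1;
    if (pmatch[n].rm_so == -1)      /* unused entry */
        return -1;

    /* rm_eo is the offset of the first byte after the substring */
    len = (size_t)(pmatch[n].rm_eo - pmatch[n].rm_so);
    if (len >= outsize)
        len = outsize - 1;

    memcpy(out, subject + pmatch[n].rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

A caller would run regexec() with an array of nmatch entries first, and then call the helper once per subpattern of interest.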
A successful match yields a zero return; various error codes are A successful match yields a zero return; various error codes are
defined in the header file, of which REG_NOMATCH is the "expected" defined in the header file, of which REG_NOMATCH is the "expected"
failure code. failure code.
ERROR MESSAGES ERROR MESSAGES
The regerror() function maps a non-zero errorcode from either regcomp() The regerror() function maps a non-zero errorcode from either regcomp()
or regexec() to a printable message. If preg is not NULL, the error or regexec() to a printable message. If preg is not NULL, the error
should have arisen from the use of that structure. A message terminated should have arisen from the use of that structure. A message terminated
by a binary zero is placed in errbuf. If the buffer is too short, only by a binary zero is placed in errbuf. If the buffer is too short, only
the first errbuf_size - 1 characters of the error message are used. The the first errbuf_size - 1 characters of the error message are used. The
yield of the function is the size of buffer needed to hold the whole yield of the function is the size of buffer needed to hold the whole
message, including the terminating zero. This value is greater than message, including the terminating zero. This value is greater than
errbuf_size if the message was truncated. errbuf_size if the message was truncated.
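The truncation rule above suggests a two-step pattern for retrieving a complete message: try a small buffer, then use the returned size to allocate one that is large enough. The following is a sketch under that assumption; the helper name regex_error_message() is hypothetical, and the system <regex.h> stands in for pcre2posix.h.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the full error message
 * for errcode, or NULL if memory cannot be obtained. The caller frees
 * the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    char probe[16];
    size_t needed;
    char *msg;

    /* The yield is the size needed for the whole message, including the
     * terminating zero, even when the probe buffer was too short. */
    needed = regerror(errcode, preg, probe, sizeof probe);
    if (needed == 0)
        needed = 1;                 /* defensive: room for a zero byte */

    msg = malloc(needed);
    if (msg == NULL)
        return NULL;

    msg[0] = '\0';
    regerror(errcode, preg, msg, needed);
    return msg;
}
```

The probe call is only there to learn the required size; its truncated contents are discarded.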
MEMORY USAGE MEMORY USAGE
Compiling a regular expression causes memory to be allocated and asso- Compiling a regular expression causes memory to be allocated and asso-
ciated with the preg structure. The function regfree() frees all such ciated with the preg structure. The function regfree() frees all such
memory, after which preg may no longer be used as a compiled expres- memory, after which preg may no longer be used as a compiled expres-
sion. sion.
@ -9232,7 +9273,7 @@ AUTHOR
REVISION REVISION
Last updated: 05 June 2017 Last updated: 15 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "17 May 2017" "PCRE2 10.30" .TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -35,7 +35,7 @@ system stack size checking, or to change one or more of these parameters:
The newline character sequence; The newline character sequence;
The compile time nested parentheses limit; The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed. The maximum pattern length (in code units) that is allowed.
The additional options bits The additional options bits (see pcre2_set_compile_extra_options())
.sp .sp
The option bits are: The option bits are:
.sp .sp
@ -52,6 +52,7 @@ The option bits are:
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns


@ -1,4 +1,4 @@
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30" .TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -24,6 +24,8 @@ options are:
.\" JOIN .\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character a literal following character
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
.sp .sp
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
.\" HREF .\" HREF


@ -64,12 +64,12 @@ INPUT ENCODING
unless you really want that action. unless you really want that action.
The input is processed using C's string functions, so must not The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, fgets() contain binary zeros, even though in Unix-like environments, fgets()
treats any bytes other than newline as data characters. An error is treats any bytes other than newline as data characters. An error is
generated if a binary zero is encountered. Subject lines are processed generated if a binary zero is encountered. By default subject lines are
for backslash escapes, which makes it possible to include any data processed for backslash escapes, which makes it possible to include any
value in strings that are passed to the library for matching. For pat- data value in strings that are passed to the library for matching. For
terns, there is a facility for specifying some or all of the 8-bit patterns, there is a facility for specifying some or all of the 8-bit
input characters as hexadecimal pairs, which makes it possible to input characters as hexadecimal pairs, which makes it possible to
include binary zeros. include binary zeros.
@ -319,9 +319,9 @@ COMMAND LINES
When the POSIX API is being tested there is no way to override the When the POSIX API is being tested there is no way to override the
default newline convention, though it is possible to set the newline default newline convention, though it is possible to set the newline
convention from within the pattern. A warning is given if the posix convention from within the pattern. A warning is given if the posix or
modifier is used when #newline_default would set a default for the non- posix_nosub modifier is used when #newline_default would set a default
POSIX API. for the non-POSIX API.
#pattern <modifier-list> #pattern <modifier-list>
@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or Before each subject line is passed to pcre2_match() or
pcre2_dfa_match(), leading and trailing white space is removed, and the pcre2_dfa_match(), leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of line is scanned for backslash escapes, unless the subject_literal modi-
encoding non-printing characters in a visible way: fier was set for the pattern. The following provide a means of encoding
non-printing characters in a visible way:
\a alarm (BEL, \x07) \a alarm (BEL, \x07)
\b backspace (\x08) \b backspace (\x08)
@ -442,23 +443,23 @@ SUBJECT LINE SYNTAX
\x{hh...} hexadecimal character (any number of hex digits) \x{hh...} hexadecimal character (any number of hex digits)
The use of \x{hh...} is not dependent on the use of the utf modifier on The use of \x{hh...} is not dependent on the use of the utf modifier on
the pattern. It is recognized always. There may be any number of hexa- the pattern. It is recognized always. There may be any number of hexa-
decimal digits inside the braces; invalid values provoke error mes- decimal digits inside the braces; invalid values provoke error mes-
sages. sages.
Note that \xhh specifies one byte rather than one character in UTF-8 Note that \xhh specifies one byte rather than one character in UTF-8
mode; this makes it possible to construct invalid UTF-8 sequences for mode; this makes it possible to construct invalid UTF-8 sequences for
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8 testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
character in UTF-8 mode, generating more than one byte if the value is character in UTF-8 mode, generating more than one byte if the value is
greater than 127. When testing the 8-bit library not in UTF-8 mode, greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error \x{hh} generates one byte for values less than 256, and causes an error
for greater values. for greater values.
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes. possible to construct invalid UTF-16 sequences for testing purposes.
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
makes it possible to construct invalid UTF-32 sequences for testing makes it possible to construct invalid UTF-32 sequences for testing
purposes. purposes.
There is a special backslash sequence that specifies replication of one There is a special backslash sequence that specifies replication of one
@ -466,33 +467,38 @@ SUBJECT LINE SYNTAX
\[<characters>]{<count>} \[<characters>]{<count>}
This makes it possible to test long strings without having to provide This makes it possible to test long strings without having to provide
them as part of the file. For example: them as part of the file. For example:
\[abc]{4} \[abc]{4}
is converted to "abcabcabcabc". This feature does not support nesting. is converted to "abcabcabcabc". This feature does not support nesting.
To include a closing square bracket in the characters, code it as \x5D. To include a closing square bracket in the characters, code it as \x5D.
A backslash followed by an equals sign marks the end of the subject A backslash followed by an equals sign marks the end of the subject
string and the start of a modifier list. For example: string and the start of a modifier list. For example:
abc\=notbol,notempty abc\=notbol,notempty
If the subject string is empty and \= is followed by whitespace, the If the subject string is empty and \= is followed by whitespace, the
line is treated as a comment line, and is not used for matching. For line is treated as a comment line, and is not used for matching. For
example: example:
\= This is a comment. \= This is a comment.
abc\= This is an invalid modifier list. abc\= This is an invalid modifier list.
A backslash followed by any other non-alphanumeric character just A backslash followed by any other non-alphanumeric character just
escapes that character. A backslash followed by anything else causes an escapes that character. A backslash followed by anything else causes an
error. However, if the very last character in the line is a backslash error. However, if the very last character in the line is a backslash
(and there is no modifier list), it is ignored. This gives a way of (and there is no modifier list), it is ignored. This gives a way of
passing an empty line as data, since a real empty line terminates the passing an empty line as data, since a real empty line terminates the
data input. data input.
If the subject_literal modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of back-
slashes. No replication is possible, and any subject modifiers must be
set as defaults by a #subject command.
PATTERN MODIFIERS PATTERN MODIFIERS
@ -530,7 +536,10 @@ PATTERN MODIFIERS
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
@ -580,6 +589,7 @@ PATTERN MODIFIERS
push push compiled pattern onto the stack push push compiled pattern onto the stack
pushcopy push a copy onto the stack pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature stackguard=<number> test the stackguard feature
subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8 utf8_input treat input as UTF-8
@ -659,16 +669,6 @@ PATTERN MODIFIERS
testing that pcre2_compile() behaves correctly in this case (it uses testing that pcre2_compile() behaves correctly in this case (it uses
default values). default values).
Specifying the pattern's length
By default, patterns are passed to the compiling functions as zero-ter-
minated strings. When using the POSIX wrapper API, there is no other
option. However, when using PCRE2's native API, patterns can be passed
by length instead of being zero-terminated. The use_length modifier
causes this to happen. Using a length happens automatically (whether
or not use_length is set) when hex is set, because patterns specified
in hexadecimal may contain binary zeros.
Specifying pattern characters in hexadecimal Specifying pattern characters in hexadecimal
The hex modifier specifies that the characters of the pattern, except The hex modifier specifies that the characters of the pattern, except
@ -690,61 +690,68 @@ PATTERN MODIFIERS
ing the delimiter within a substring. The hex and expand modifiers are ing the delimiter within a substring. The hex and expand modifiers are
mutually exclusive. mutually exclusive.
The POSIX API cannot be used with patterns specified in hexadecimal Specifying the pattern's length
because they may contain binary zeros, which conflicts with regcomp()'s
requirement for a zero-terminated string. Such patterns are always By default, patterns are passed to the compiling functions as zero-ter-
passed to pcre2_compile() as a string with a length, not as zero-termi- minated strings but can be passed by length instead of being zero-ter-
nated. minated. The use_length modifier causes this to happen. Using a length
happens automatically (whether or not use_length is set) when hex is
set, because patterns specified in hexadecimal may contain binary
zeros.
If hex or use_length is used with the POSIX wrapper API (see "Using the
POSIX wrapper API" below), the REG_PEND extension is used to pass the
pattern's length.
Specifying wide characters in 16-bit and 32-bit modes Specifying wide characters in 16-bit and 32-bit modes
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
and translated to UTF-16 or UTF-32 when the utf modifier is set. For and translated to UTF-16 or UTF-32 when the utf modifier is set. For
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
modifier can be used. It is mutually exclusive with utf. Input lines modifier can be used. It is mutually exclusive with utf. Input lines
are interpreted as UTF-8 as a means of specifying wide characters. More are interpreted as UTF-8 as a means of specifying wide characters. More
details are given in "Input encoding" above. details are given in "Input encoding" above.
Generating long repetitive patterns Generating long repetitive patterns
Some tests use long patterns that are very repetitive. Instead of cre- Some tests use long patterns that are very repetitive. Instead of cre-
ating a very long input line for such a pattern, you can use a special ating a very long input line for such a pattern, you can use a special
repetition feature, similar to the one described for subject lines repetition feature, similar to the one described for subject lines
above. If the expand modifier is present on a pattern, parts of the above. If the expand modifier is present on a pattern, parts of the
pattern that have the form pattern that have the form
\[<characters>]{<count>} \[<characters>]{<count>}
are expanded before the pattern is passed to pcre2_compile(). For exam- are expanded before the pattern is passed to pcre2_compile(). For exam-
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" cannot be nested. An initial "\[" sequence is recognized only if "]{"
followed by decimal digits and "}" is found later in the pattern. If followed by decimal digits and "}" is found later in the pattern. If
not, the characters remain in the pattern unaltered. The expand and hex not, the characters remain in the pattern unaltered. The expand and hex
modifiers are mutually exclusive. modifiers are mutually exclusive.
If part of an expanded pattern looks like an expansion, but is really If part of an expanded pattern looks like an expansion, but is really
part of the actual pattern, unwanted expansion can be avoided by giving part of the actual pattern, unwanted expansion can be avoided by giving
two values in the quantifier. For example, \[AB]{6000,6000} is not rec- two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
ognized as an expansion item. ognized as an expansion item.
If the info modifier is set on an expanded pattern, the result of the If the info modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output. expansion is included in the information that is output.
JIT compilation JIT compilation
Just-in-time (JIT) compiling is a heavyweight optimization that can Just-in-time (JIT) compiling is a heavyweight optimization that can
greatly speed up pattern matching. See the pcre2jit documentation for greatly speed up pattern matching. See the pcre2jit documentation for
details. JIT compiling happens, optionally, after a pattern has been details. JIT compiling happens, optionally, after a pattern has been
successfully compiled into an internal form. The JIT compiler converts successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time this to optimized machine code. It needs to know whether the match-time
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
because different code is generated for the different cases. See the because different code is generated for the different cases. See the
partial modifier in "Subject Modifiers" below for details of how these partial modifier in "Subject Modifiers" below for details of how these
options are specified for each match attempt. options are specified for each match attempt.
JIT compilation is requested by the /jit pattern modifier, which may JIT compilation is requested by the jit pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to optionally be followed by an equals sign and a number in the range 0 to
7. The three bits that make up the number specify which of the three 7. The three bits that make up the number specify which of the three
JIT operating modes are to be compiled: JIT operating modes are to be compiled:
1 compile JIT code for non-partial matching 1 compile JIT code for non-partial matching
@ -761,31 +768,31 @@ PATTERN MODIFIERS
6 soft and hard partial matching only 6 soft and hard partial matching only
7 all three modes 7 all three modes
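The number given to the jit modifier is a three-bit mask, so it can be tested one mode at a time. The sketch below is illustrative only: the enumerator names are hypothetical, and it assumes that the table rows not shown in this hunk assign 2 to soft partial matching and 4 to hard partial matching, which is consistent with 6 meaning "soft and hard partial matching only".

```c
#include <stdbool.h>

/* Hypothetical names for the three bits of the jit=<number> value. */
enum {
    JIT_NON_PARTIAL  = 1,   /* non-partial matching */
    JIT_SOFT_PARTIAL = 2,   /* assumed: soft partial matching */
    JIT_HARD_PARTIAL = 4    /* assumed: hard partial matching */
};

/* Returns true if a pattern compiled with jit=<jit_number> has JIT code
 * for the kind of match identified by mode_bit. */
bool jit_mode_compiled(unsigned jit_number, unsigned mode_bit)
{
    return (jit_number & mode_bit) != 0;
}
```

For instance, jit_mode_compiled(2, JIT_NON_PARTIAL) is false, which corresponds to a pattern compiled only for partial matching not using JIT code for an ordinary match.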
If no number is given, 7 is assumed. The phrase "partial matching" If no number is given, 7 is assumed. The phrase "partial matching"
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com- PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
plete match; the options enable the possibility of a partial match, but plete match; the options enable the possibility of a partial match, but
do not require it. Note also that if you request JIT compilation only do not require it. Note also that if you request JIT compilation only
for partial matching (for example, /jit=2) but do not set the partial for partial matching (for example, jit=2) but do not set the partial
modifier on a subject line, that match will not use JIT code because modifier on a subject line, that match will not use JIT code because
none was compiled for non-partial matching. none was compiled for non-partial matching.
If JIT compilation is successful, the compiled JIT code will automati- If JIT compilation is successful, the compiled JIT code will automati-
cally be used when an appropriate type of match is run, except when cally be used when an appropriate type of match is run, except when
incompatible run-time options are specified. For more details, see the incompatible run-time options are specified. For more details, see the
pcre2jit documentation. See also the jitstack modifier below for a way pcre2jit documentation. See also the jitstack modifier below for a way
of setting the size of the JIT stack. of setting the size of the JIT stack.
If the jitfast modifier is specified, matching is done using the JIT If the jitfast modifier is specified, matching is done using the JIT
"fast path" interface, pcre2_jit_match(), which skips some of the san- "fast path" interface, pcre2_jit_match(), which skips some of the san-
ity checks that are done by pcre2_match(), and of course does not work ity checks that are done by pcre2_match(), and of course does not work
when JIT is not supported. If jitfast is specified without jit, jit=7 when JIT is not supported. If jitfast is specified without jit, jit=7
is assumed. is assumed.
If the jitverify modifier is specified, information about the compiled If the jitverify modifier is specified, information about the compiled
pattern shows whether JIT compilation was or was not successful. If pattern shows whether JIT compilation was or was not successful. If
jitverify is specified without jit, jit=7 is assumed. If JIT compila- jitverify is specified without jit, jit=7 is assumed. If JIT compila-
tion is successful when jitverify is set, the text "(JIT)" is added to tion is successful when jitverify is set, the text "(JIT)" is added to
the first output line after a match or non-match when JIT-compiled code the first output line after a match or non-match when JIT-compiled code
was actually used in the match. was actually used in the match.
@ -796,19 +803,19 @@ PATTERN MODIFIERS
/pattern/locale=fr_FR /pattern/locale=fr_FR
The given locale is set, pcre2_maketables() is called to build a set of The given locale is set, pcre2_maketables() is called to build a set of
character tables for the locale, and this is then passed to pcre2_com- character tables for the locale, and this is then passed to pcre2_com-
pile() when compiling the regular expression. The same tables are used pile() when compiling the regular expression. The same tables are used
when matching the following subject lines. The locale modifier applies when matching the following subject lines. The locale modifier applies
only to the pattern on which it appears, but can be given in a #pattern only to the pattern on which it appears, but can be given in a #pattern
command if a default is needed. Setting a locale and alternate charac- command if a default is needed. Setting a locale and alternate charac-
ter tables are mutually exclusive. ter tables are mutually exclusive.
Showing pattern memory Showing pattern memory
The memory modifier causes the size in bytes of the memory used to hold The memory modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the compiled pattern to be output. This does not include the size of
the pcre2_code block; it is just the actual compiled data. If the pat- the pcre2_code block; it is just the actual compiled data. If the pat-
tern is subsequently passed to the JIT compiler, the size of the JIT tern is subsequently passed to the JIT compiler, the size of the JIT
compiled code is also output. Here is an example: compiled code is also output. Here is an example:
re> /a(b)c/jit,memory re> /a(b)c/jit,memory
@ -818,27 +825,27 @@ PATTERN MODIFIERS
Limiting nested parentheses Limiting nested parentheses
The parens_nest_limit modifier sets a limit on the depth of nested The parens_nest_limit modifier sets a limit on the depth of nested
parentheses in a pattern. Breaching the limit causes a compilation parentheses in a pattern. Breaching the limit causes a compilation
error. The default for the library is set when PCRE2 is built, but error. The default for the library is set when PCRE2 is built, but
pcre2test sets its own default of 220, which is required for running pcre2test sets its own default of 220, which is required for running
the standard test suite. the standard test suite.
Limiting the pattern length Limiting the pattern length
The max_pattern_length modifier sets a limit, in code units, to the The max_pattern_length modifier sets a limit, in code units, to the
length of pattern that pcre2_compile() will accept. Breaching the limit length of pattern that pcre2_compile() will accept. Breaching the limit
causes a compilation error. The default is the largest number a causes a compilation error. The default is the largest number a
PCRE2_SIZE variable can hold (essentially unlimited). PCRE2_SIZE variable can hold (essentially unlimited).
Using the POSIX wrapper API Using the POSIX wrapper API
The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
the POSIX wrapper API rather than its native API. When posix_nosub is the POSIX wrapper API rather than its native API. When posix_nosub is
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
wrapper supports only the 8-bit library. Note that it does not imply wrapper supports only the 8-bit library. Note that it does not imply
POSIX matching semantics; for more detail see the pcre2posix documenta- POSIX matching semantics; for more detail see the pcre2posix documenta-
tion. The following pattern modifiers set options for the regcomp() tion. The following pattern modifiers set options for the regcomp()
function: function:
caseless REG_ICASE caseless REG_ICASE
@ -848,35 +855,39 @@ PATTERN MODIFIERS
ucp REG_UCP ) the POSIX standard ucp REG_UCP ) the POSIX standard
utf REG_UTF8 ) utf REG_UTF8 )
The regerror_buffsize modifier specifies a size for the error buffer The regerror_buffsize modifier specifies a size for the error buffer
that is passed to regerror() in the event of a compilation error. For that is passed to regerror() in the event of a compilation error. For
example: example:
/abc/posix,regerror_buffsize=20 /abc/posix,regerror_buffsize=20
This provides a means of testing the behaviour of regerror() when the This provides a means of testing the behaviour of regerror() when the
buffer is too small for the error message. If this modifier has not buffer is too small for the error message. If this modifier has not
been set, a large buffer is used. been set, a large buffer is used.
The aftertext and allaftertext subject modifiers work as described The aftertext and allaftertext subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, below. All other modifiers are either ignored, with a warning message,
or cause an error. or cause an error.
The pattern is passed to regcomp() as a zero-terminated string by
default, but if the use_length or hex modifiers are set, the REG_PEND
extension is used to pass it by length.
Testing the stack guard feature Testing the stack guard feature
The stackguard modifier is used to test the use of pcre2_set_com- The stackguard modifier is used to test the use of pcre2_set_com-
pile_recursion_guard(), a function that is provided to enable stack pile_recursion_guard(), a function that is provided to enable stack
availability to be checked during compilation (see the pcre2api docu- availability to be checked during compilation (see the pcre2api docu-
mentation for details). If the number specified by the modifier is mentation for details). If the number specified by the modifier is
greater than zero, pcre2_set_compile_recursion_guard() is called to set greater than zero, pcre2_set_compile_recursion_guard() is called to set
up a callback from pcre2_compile() to a local function. The argument it up a callback from pcre2_compile() to a local function. The argument it
receives is the current nesting parenthesis depth; if this is greater receives is the current nesting parenthesis depth; if this is greater
than the value given by the modifier, non-zero is returned, causing the than the value given by the modifier, non-zero is returned, causing the
compilation to be aborted. compilation to be aborted.
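The guard logic described above can be modelled in a few lines. This is a sketch under the assumption that the local function only compares the depth with the modifier value; make_stack_guard is a hypothetical name, not part of pcre2test.

```python
def make_stack_guard(limit):
    """Return a guard callback, or None when limit is not greater than zero.

    The callback receives the current parenthesis nesting depth and returns
    non-zero to abort compilation, zero to let it continue.
    """
    if limit <= 0:
        return None  # no guard is installed

    def guard(depth):
        return 1 if depth > limit else 0

    return guard
```

For stackguard=3, a pattern nested three deep compiles, while a fourth level makes the guard return non-zero.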
Using alternative character tables Using alternative character tables
The value specified for the tables modifier must be one of the digits The value specified for the tables modifier must be one of the digits
0, 1, or 2. It causes a specific set of built-in character tables to be 0, 1, or 2. It causes a specific set of built-in character tables to be
passed to pcre2_compile(). This is used in the PCRE2 tests to check be- passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
haviour with different character tables. The digit specifies the tables haviour with different character tables. The digit specifies the tables
@ -887,23 +898,25 @@ PATTERN MODIFIERS
pcre2_chartables.c.dist pcre2_chartables.c.dist
2 a set of tables defining ISO 8859 characters 2 a set of tables defining ISO 8859 characters
In table 2, some characters whose codes are greater than 128 are iden- In table 2, some characters whose codes are greater than 128 are iden-
tified as letters, digits, spaces, etc. Setting alternate character tified as letters, digits, spaces, etc. Setting alternate character
tables and a locale are mutually exclusive. tables and a locale are mutually exclusive.
Setting certain match controls Setting certain match controls
The following modifiers are really subject modifiers, and are described The following modifiers are really subject modifiers, and are described
below. However, they may be included in a pattern's modifier list, in under "Subject Modifiers" below. However, they may be included in a
which case they are applied to every subject line that is processed pattern's modifier list, in which case they are applied to every sub-
with that pattern. They may not appear in #pattern commands. These mod- ject line that is processed with that pattern. They may not appear in
ifiers do not affect the compilation process. #pattern commands. These modifiers do not affect the compilation
process.
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
@ -915,6 +928,14 @@ PATTERN MODIFIERS
These modifiers may not appear in a #pattern command. If you want them These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command. as defaults, set them in a #subject command.
Specifying literal subject lines
If the subject_literal modifier is present on a pattern, all the sub-
ject lines that it matches are taken as literal strings, with no inter-
pretation of backslashes. It is not possible to set subject modifiers
on such lines, but any that are set as defaults by a #subject command
are recognized.
Saving a compiled pattern Saving a compiled pattern
When a pattern with the push modifier is successfully compiled, it is When a pattern with the push modifier is successfully compiled, it is
@ -959,11 +980,11 @@ SUBJECT MODIFIERS
The partial matching modifiers are provided with abbreviations because The partial matching modifiers are provided with abbreviations because
they appear frequently in tests. they appear frequently in tests.
If the posix modifier was present on the pattern, causing the POSIX If the posix or posix_nosub modifier was present on the pattern, caus-
wrapper API to be used, the only option-setting modifiers that have any ing the POSIX wrapper API to be used, the only option-setting modifiers
effect are notbol, notempty, and noteol, causing REG_NOTBOL, that have any effect are notbol, notempty, and noteol, causing REG_NOT-
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
The other modifiers are ignored, with a warning message. regexec(). The other modifiers are ignored, with a warning message.
There is one additional modifier that can be used with the POSIX wrap- There is one additional modifier that can be used with the POSIX wrap-
per. It is ignored (with a warning) if used for non-POSIX matching. per. It is ignored (with a warning) if used for non-POSIX matching.
@ -971,16 +992,19 @@ SUBJECT MODIFIERS
posix_startend=<n>[:<m>] posix_startend=<n>[:<m>]
This causes the subject string to be passed to regexec() using the This causes the subject string to be passed to regexec() using the
REG_STARTEND option, which uses offsets to restrict which part of the REG_STARTEND option, which uses offsets to specify which part of the
string is searched. If only one number is given, the end offset is string is searched. If only one number is given, the end offset is
passed as the end of the subject string. For more detail of REG_STAR- passed as the end of the subject string. For more detail of REG_STAR-
TEND, see the pcre2posix documentation. TEND, see the pcre2posix documentation. If the subject string contains
binary zeros (coded as escapes such as \x{00} because pcre2test does
not support actual binary zeros in its input), you must use posix_star-
tend to specify its length.
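As an illustration of how the two forms of the value map to offsets, here is a minimal sketch. The function name is hypothetical and error handling for malformed values is left out.

```python
def parse_posix_startend(value, subject_length):
    """Parse '<n>[:<m>]' into (start, end) offsets.

    When only one number is given, the end offset is the end of the subject.
    """
    start_text, sep, end_text = value.partition(":")
    start = int(start_text)
    end = int(end_text) if sep else subject_length
    return start, end
```

For a 10-unit subject, posix_startend=2 gives the pair (2, 10) and posix_startend=2:5 gives (2, 5).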
Setting match controls Setting match controls
The following modifiers affect the matching process or request addi- The following modifiers affect the matching process or request addi-
tional information. Some of them may also be specified on a pattern tional information. Some of them may also be specified on a pattern
line (see above), in which case they apply to every subject line that line (see above), in which case they apply to every subject line that
is matched against that pattern. is matched against that pattern.
aftertext show text after match aftertext show text after match
@ -1020,29 +1044,29 @@ SUBJECT MODIFIERS
zero_terminate pass the subject as zero-terminated zero_terminate pass the subject as zero-terminated
The effects of these modifiers are described in the following sections. The effects of these modifiers are described in the following sections.
When matching via the POSIX wrapper API, the aftertext, allaftertext, When matching via the POSIX wrapper API, the aftertext, allaftertext,
and ovector subject modifiers work as described below. All other modi- and ovector subject modifiers work as described below. All other modi-
fiers are either ignored, with a warning message, or cause an error. fiers are either ignored, with a warning message, or cause an error.
Showing more text Showing more text
The aftertext modifier requests that as well as outputting the part of The aftertext modifier requests that as well as outputting the part of
the subject string that matched the entire pattern, pcre2test should in the subject string that matched the entire pattern, pcre2test should in
addition output the remainder of the subject string. This is useful for addition output the remainder of the subject string. This is useful for
tests where the subject contains multiple copies of the same substring. tests where the subject contains multiple copies of the same substring.
The allaftertext modifier requests the same action for captured sub- The allaftertext modifier requests the same action for captured sub-
strings as well as the main matched substring. In each case the remain- strings as well as the main matched substring. In each case the remain-
der is output on the following line with a plus character following the der is output on the following line with a plus character following the
capture number. capture number.
The allusedtext modifier requests that all the text that was consulted The allusedtext modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown. during a successful pattern match by the interpreter should be shown.
This feature is not supported for JIT matching, and if requested with This feature is not supported for JIT matching, and if requested with
JIT it is ignored (with a warning message). Setting this modifier JIT it is ignored (with a warning message). Setting this modifier
affects the output if there is a lookbehind at the start of a match, or affects the output if there is a lookbehind at the start of a match, or
a lookahead at the end, or if \K is used in the pattern. Characters a lookahead at the end, or if \K is used in the pattern. Characters
that precede or follow the start and end of the actual match are indi- that precede or follow the start and end of the actual match are indi-
cated in the output by '<' or '>' characters underneath them. Here is cated in the output by '<' or '>' characters underneath them. Here is
an example: an example:
re> /(?<=pqr)abc(?=xyz)/ re> /(?<=pqr)abc(?=xyz)/
@ -1050,16 +1074,16 @@ SUBJECT MODIFIERS
0: pqrabcxyz 0: pqrabcxyz
<<< >>> <<< >>>
This shows that the matched string is "abc", with the preceding and This shows that the matched string is "abc", with the preceding and
following strings "pqr" and "xyz" having been consulted during the following strings "pqr" and "xyz" having been consulted during the
match (when processing the assertions). match (when processing the assertions).
The startchar modifier requests that the starting character for the The startchar modifier requests that the starting character for the
match be indicated, if it is different to the start of the matched match be indicated, if it is different to the start of the matched
string. The only time when this occurs is when \K has been processed as string. The only time when this occurs is when \K has been processed as
part of the match. In this situation, the output for the matched string part of the match. In this situation, the output for the matched string
is displayed from the starting character instead of from the match is displayed from the starting character instead of from the match
point, with circumflex characters under the earlier characters. For point, with circumflex characters under the earlier characters. For
example: example:
re> /abc\Kxyz/ re> /abc\Kxyz/
@ -1067,7 +1091,7 @@ SUBJECT MODIFIERS
0: abcxyz 0: abcxyz
^^^ ^^^
Unlike allusedtext, the startchar modifier can be used with JIT. How- Unlike allusedtext, the startchar modifier can be used with JIT. How-
ever, these two modifiers are mutually exclusive. ever, these two modifiers are mutually exclusive.
Showing the value of all capture groups Showing the value of all capture groups
@ -1075,98 +1099,98 @@ SUBJECT MODIFIERS
The allcaptures modifier requests that the values of all potential cap- The allcaptures modifier requests that the values of all potential cap-
tured parentheses be output after a match. By default, only those up to tured parentheses be output after a match. By default, only those up to
the highest one actually used in the match are output (corresponding to the highest one actually used in the match are output (corresponding to
the return code from pcre2_match()). Groups that did not take part in the return code from pcre2_match()). Groups that did not take part in
the match are output as "<unset>". This modifier is not relevant for the match are output as "<unset>". This modifier is not relevant for
DFA matching (which does no capturing); it is ignored, with a warning DFA matching (which does no capturing); it is ignored, with a warning
message, if present. message, if present.
Testing callouts Testing callouts
A callout function is supplied when pcre2test calls the library match- A callout function is supplied when pcre2test calls the library match-
ing functions, unless callout_none is specified. If callout_capture is ing functions, unless callout_none is specified. If callout_capture is
set, the current captured groups are output when a callout occurs. The set, the current captured groups are output when a callout occurs. The
default return from the callout function is zero, which allows matching default return from the callout function is zero, which allows matching
to continue. to continue.
The callout_fail modifier can be given one or two numbers. If there is The callout_fail modifier can be given one or two numbers. If there is
only one number, 1 is returned instead of 0 (causing matching to back- only one number, 1 is returned instead of 0 (causing matching to back-
track) when a callout of that number is reached. If two numbers track) when a callout of that number is reached. If two numbers
(<n>:<m>) are given, 1 is returned when callout <n> is reached and (<n>:<m>) are given, 1 is returned when callout <n> is reached and
there have been at least <m> callouts. The callout_error modifier is there have been at least <m> callouts. The callout_error modifier is
similar, except that PCRE2_ERROR_CALLOUT is returned, causing the similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
entire matching process to be aborted. If both these modifiers are set entire matching process to be aborted. If both these modifiers are set
for the same callout number, callout_error takes precedence. for the same callout number, callout_error takes precedence.
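The decision described above can be sketched as a small function. This is a model, not the pcre2test source. It assumes that "at least <m> callouts" counts all callouts seen so far, including the current one, and it uses a symbolic stand-in for PCRE2_ERROR_CALLOUT rather than its numeric value.

```python
CALLOUT_ERROR = "PCRE2_ERROR_CALLOUT"  # symbolic stand-in for the error code


def callout_return(number, seen, fail=None, error=None):
    """Return the value the callout function would give back.

    number -- the callout number that was reached
    seen   -- how many callouts have happened so far (assumption: total count)
    fail   -- (n, m) from callout_fail, or None; use m=1 for the one-number form
    error  -- (n, m) from callout_error, or None
    """
    def triggered(spec):
        if spec is None:
            return False
        n, m = spec
        return number == n and seen >= m

    if triggered(error):      # callout_error takes precedence
        return CALLOUT_ERROR
    if triggered(fail):
        return 1              # forces backtracking
    return 0                  # default: matching continues
```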
Note that callouts with string arguments are always given the number Note that callouts with string arguments are always given the number
zero. See "Callouts" below for a description of the output when a zero. See "Callouts" below for a description of the output when a
callout is taken. callout is taken.
The callout_data modifier can be given an unsigned or a negative num- The callout_data modifier can be given an unsigned or a negative num-
ber. This is set as the "user data" that is passed to the matching ber. This is set as the "user data" that is passed to the matching
function, and passed back when the callout function is invoked. Any function, and passed back when the callout function is invoked. Any
value other than zero is used as a return from pcre2test's callout value other than zero is used as a return from pcre2test's callout
function. function.
Finding all matches in a string Finding all matches in a string
Searching for all possible matches within a subject can be requested by Searching for all possible matches within a subject can be requested by
the global or altglobal modifier. After finding a match, the matching the global or altglobal modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The function is called again to search the remainder of the subject. The
difference between global and altglobal is that the former uses the difference between global and altglobal is that the former uses the
start_offset argument to pcre2_match() or pcre2_dfa_match() to start start_offset argument to pcre2_match() or pcre2_dfa_match() to start
searching at a new point within the entire string (which is what Perl searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened subject. This makes a does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbe- difference to the matching process if the pattern begins with a lookbe-
hind assertion (including \b or \B). hind assertion (including \b or \B).
If an empty string is matched, the next match is done with the If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this for another, non-empty, match at the same point in the subject. If this
match fails, the start offset is advanced, and the normal match is match fails, the start offset is advanced, and the normal match is
retried. This imitates the way Perl handles such cases when using the retried. This imitates the way Perl handles such cases when using the
/g modifier or the split() function. Normally, the start offset is /g modifier or the split() function. Normally, the start offset is
advanced by one character, but if the newline convention recognizes advanced by one character, but if the newline convention recognizes
CRLF as a newline, and the current character is CR followed by LF, an CRLF as a newline, and the current character is CR followed by LF, an
advance of two characters occurs. advance of two characters occurs.
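The advance rule in the last sentence can be written out as follows. This sketch covers only the step taken after the anchored non-empty retry has failed. The name advance_offset and the crlf_is_newline flag are assumptions made for the illustration.

```python
def advance_offset(subject, offset, crlf_is_newline):
    """Return the next start offset after an empty match could not be extended.

    Normally the offset moves on by one character; when CRLF counts as a
    newline and the current character is CR followed by LF, it moves by two.
    """
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    return offset + 1
```

Python strings are indexed by character, so the sketch sidesteps the question of how many code units a character occupies in UTF mode.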
Testing substring extraction functions Testing substring extraction functions
The copy and get modifiers can be used to test the pcre2_sub- The copy and get modifiers can be used to test the pcre2_sub-
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
given more than once, and each can specify a group name or number, for given more than once, and each can specify a group name or number, for
example: example:
abcd\=copy=1,copy=3,get=G1 abcd\=copy=1,copy=3,get=G1
If the #subject command is used to set default copy and/or get lists, If the #subject command is used to set default copy and/or get lists,
these can be unset by specifying a negative number to cancel all num- these can be unset by specifying a negative number to cancel all num-
bered groups and an empty name to cancel all named groups. bered groups and an empty name to cancel all named groups.
The getall modifier tests pcre2_substring_list_get(), which extracts The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings. all captured substrings.
If the subject line is successfully matched, the substrings extracted If the subject line is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L after the by the convenience functions are output with C, G, or L after the
string number instead of a colon. This is in addition to the normal string number instead of a colon. This is in addition to the normal
full list. The string length (that is, the return from the extraction full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring, followed by the function) is given in parentheses after each substring, followed by the
name when the extraction was by name. name when the extraction was by name.
Testing the substitution function Testing the substitution function
If the replace modifier is set, the pcre2_substitute() function is If the replace modifier is set, the pcre2_substitute() function is
called instead of one of the matching functions. Note that replacement called instead of one of the matching functions. Note that replacement
strings cannot contain commas, because a comma signifies the end of a strings cannot contain commas, because a comma signifies the end of a
modifier. This is not thought to be an issue in a test program. modifier. This is not thought to be an issue in a test program.
Unlike subject strings, pcre2test does not process replacement strings Unlike subject strings, pcre2test does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to for escape sequences. In UTF mode, a replacement string is checked to
see if it is a valid UTF-8 string. If so, it is correctly converted to see if it is a valid UTF-8 string. If so, it is correctly converted to
a UTF string of the appropriate code unit width. If it is not a valid a UTF string of the appropriate code unit width. If it is not a valid
UTF-8 string, the individual code units are copied directly. This pro- UTF-8 string, the individual code units are copied directly. This pro-
vides a means of passing an invalid UTF-8 string for testing purposes. vides a means of passing an invalid UTF-8 string for testing purposes.
The following modifiers set options (in addition to the normal match The following modifiers set options (in addition to the normal match
options) for pcre2_substitute(): options) for pcre2_substitute():
global PCRE2_SUBSTITUTE_GLOBAL global PCRE2_SUBSTITUTE_GLOBAL
@ -1176,8 +1200,8 @@ SUBJECT MODIFIERS
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
After a successful substitution, the modified string is output, pre- After a successful substitution, the modified string is output, pre-
ceded by the number of replacements. This may be zero if there were no ceded by the number of replacements. This may be zero if there were no
matches. Here is a simple example of a substitution test: matches. Here is a simple example of a substitution test:
/abc/replace=xxx /abc/replace=xxx
@ -1186,12 +1210,12 @@ SUBJECT MODIFIERS
=abc=abc=\=global =abc=abc=\=global
2: =xxx=xxx= 2: =xxx=xxx=
Subject and replacement strings should be kept relatively short (fewer Subject and replacement strings should be kept relatively short (fewer
than 256 characters) for substitution tests, as fixed-size buffers are than 256 characters) for substitution tests, as fixed-size buffers are
used. To make it easy to test for buffer overflow, if the replacement used. To make it easy to test for buffer overflow, if the replacement
string starts with a number in square brackets, that number is passed string starts with a number in square brackets, that number is passed
to pcre2_substitute() as the size of the output buffer, with the to pcre2_substitute() as the size of the output buffer, with the
replacement string starting at the next character. Here is an example replacement string starting at the next character. Here is an example
that tests the edge case: that tests the edge case:
/abc/ /abc/
@ -1200,11 +1224,11 @@ SUBJECT MODIFIERS
123abc123\=replace=[9]XYZ 123abc123\=replace=[9]XYZ
Failed: error -47: no more memory Failed: error -47: no more memory
The default action of pcre2_substitute() is to return The default action of pcre2_substitute() is to return
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub- the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
stitute_overflow_length modifier), pcre2_substitute() continues to go stitute_overflow_length modifier), pcre2_substitute() continues to go
through the motions of matching and substituting, in order to compute through the motions of matching and substituting, in order to compute
the size of buffer that is required. When this happens, pcre2test shows the size of buffer that is required. When this happens, pcre2test shows
the required buffer length (which includes space for the trailing zero) the required buffer length (which includes space for the trailing zero)
as part of the error message. For example: as part of the error message. For example:
@ -1214,105 +1238,106 @@ SUBJECT MODIFIERS
Failed: error -47: no more memory: 10 code units are needed Failed: error -47: no more memory: 10 code units are needed
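The arithmetic behind the "10 code units are needed" message can be checked with a short sketch. It models only the bracketed size prefix and the space for the trailing zero. The helper names are hypothetical, and character counts equal code unit counts here only because the strings are ASCII.

```python
import re


def split_buffer_size(replacement):
    """Split a leading '[<n>]' buffer size from a replacement string.

    Returns (size, text); size is None when there is no bracketed prefix.
    """
    m = re.match(r"\[(\d+)\]", replacement)
    if m is None:
        return None, replacement
    return int(m.group(1)), replacement[m.end():]


def units_needed(result):
    """Code units needed for the output, including the trailing zero."""
    return len(result) + 1


size, text = split_buffer_size("[9]XYZ")
result = "123abc123".replace("abc", text)  # stands in for the substitution
overflow = units_needed(result) > size     # 10 > 9, so the buffer is too small
```

The result "123XYZ123" is 9 code units long, so a buffer of 9 overflows by exactly the trailing zero.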
A replacement string is ignored with POSIX and DFA matching. Specifying A replacement string is ignored with POSIX and DFA matching. Specifying
partial matching provokes an error return ("bad option value") from partial matching provokes an error return ("bad option value") from
pcre2_substitute(). pcre2_substitute().
Setting the JIT stack size Setting the JIT stack size
The jitstack modifier provides a way of setting the maximum stack size The jitstack modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kilobytes. JIT optimization is not being used. The value is a number of kilobytes.
Providing a stack that is larger than the default 32K is necessary only Setting zero reverts to the default of 32K. Providing a stack that is
for very complicated patterns. larger than the default is necessary only for very complicated pat-
terns. If jitstack is set non-zero on a subject line it overrides any
value that was set on the pattern.
Setting heap, match, and depth limits Setting heap, match, and depth limits
The heap_limit, match_limit, and depth_limit modifiers set the appro- The heap_limit, match_limit, and depth_limit modifiers set the appro-
priate limits in the match context. These values are ignored when the priate limits in the match context. These values are ignored when the
find_limits modifier is specified. find_limits modifier is specified.
Finding minimum limits Finding minimum limits
If the find_limits modifier is present on a subject line, pcre2test If the find_limits modifier is present on a subject line, pcre2test
calls the relevant matching function several times, setting different calls the relevant matching function several times, setting different
values in the match context via pcre2_set_heap_limit(), values in the match context via pcre2_set_heap_limit(),
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
minimum value for each parameter that allows the match to complete minimum value for each parameter that allows the match to complete
without error. without error.
If JIT is being used, only the match limit is relevant. If DFA matching If JIT is being used, only the match limit is relevant. If DFA matching
is being used, only the depth limit is relevant. is being used, only the depth limit is relevant.
The match_limit number is a measure of the amount of backtracking that The match_limit number is a measure of the amount of backtracking that
takes place, and learning the minimum value can be instructive. For takes place, and learning the minimum value can be instructive. For
most simple matches, the number is quite small, but for patterns with most simple matches, the number is quite small, but for patterns with
very large numbers of matching possibilities, it can become large very very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string. quickly with increasing length of subject string.
For non-DFA matching, the minimum depth_limit number is a measure of For non-DFA matching, the minimum depth_limit number is a measure of
how much nested backtracking happens (that is, how deeply the pattern's how much nested backtracking happens (that is, how deeply the pattern's
tree is searched). In the case of DFA matching, depth_limit controls tree is searched). In the case of DFA matching, depth_limit controls
the depth of recursive calls of the internal function that is used for the depth of recursive calls of the internal function that is used for
handling pattern recursion, lookaround assertions, and atomic groups. handling pattern recursion, lookaround assertions, and atomic groups.
Showing MARK names Showing MARK names
The mark modifier causes the names from backtracking control verbs that The mark modifier causes the names from backtracking control verbs that
are returned from calls to pcre2_match() to be displayed. If a mark is are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it. returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise, For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message. it is added to the non-match message.
Showing memory usage Showing memory usage
The memory modifier causes pcre2test to log the sizes of all heap mem- The memory modifier causes pcre2test to log the sizes of all heap mem-
ory allocation and freeing calls that occur during a call to ory allocation and freeing calls that occur during a call to
pcre2_match(). These occur only when a match requires a bigger vector pcre2_match(). These occur only when a match requires a bigger vector
than the default for remembering backtracking points. In many cases than the default for remembering backtracking points. In many cases
there will be no heap memory used and therefore no additional output. there will be no heap memory used and therefore no additional output.
No heap memory is allocated during matching with pcre2_dfa_match() or No heap memory is allocated during matching with pcre2_dfa_match() or
with JIT, so in those cases the memory modifier never has any effect. with JIT, so in those cases the memory modifier never has any effect.
For this modifier to work, the null_context modifier must not be set on For this modifier to work, the null_context modifier must not be set on
both the pattern and the subject, though it can be set on one or the both the pattern and the subject, though it can be set on one or the
other. other.
Setting a starting offset Setting a starting offset
The offset modifier sets an offset in the subject string at which The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters. matching starts. Its value is a number of code units, not characters.
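
Because the offset is counted in code units, a character position has to be converted before it is used whenever the subject contains multi-unit characters. The following sketch shows the conversion for the 8-bit library in UTF-8 mode, where a code unit is one byte. The function name utf8_char_to_code_unit_offset() is hypothetical, and the code assumes a valid, zero-terminated UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>

/* Return the code unit (byte) offset of the character with index
   char_index in the valid, zero-terminated UTF-8 string s. If the string
   has fewer characters, the length of the string is returned. */
size_t utf8_char_to_code_unit_offset(const char *s, size_t char_index)
{
    size_t offset = 0;

    assert(s != NULL);

    while (s[offset] != '\0' && char_index > 0) {
        offset++;                                   /* lead code unit */
        /* Skip continuation code units, which look like 10xxxxxx. */
        while (((unsigned char)s[offset] & 0xC0) == 0x80)
            offset++;
        char_index--;
    }
    return offset;
}
```

For a subject consisting of "caf", a two-byte e-acute, and "!", the exclamation mark is character 4 but lies at code unit offset 5, so 5 is the value to give as the offset.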
Setting an offset limit Setting an offset limit
The offset_limit modifier sets a limit for unanchored matches. If a The offset_limit modifier sets a limit for unanchored matches. If a
match cannot be found starting at or before this offset in the subject, match cannot be found starting at or before this offset in the subject,
a "no match" return is given. The data value is a number of code units, a "no match" return is given. The data value is a number of code units,
not characters. When this modifier is used, the use_offset_limit modi- not characters. When this modifier is used, the use_offset_limit modi-
fier must have been set for the pattern; if not, an error is generated. fier must have been set for the pattern; if not, an error is generated.
Setting the size of the output vector Setting the size of the output vector
The ovector modifier applies only to the subject line in which it The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are #subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15. available for storing matching information. The default is 15.
A value of zero is useful when testing the POSIX API because it causes A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause pcre2_match_data_cre- POSIX API, a value of zero is used to cause pcre2_match_data_cre-
ate_from_pattern() to be called, in order to create a match block of ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a exactly the right size for the pattern. (It is not possible to create a
match block with a zero-length ovector; there is always at least one match block with a zero-length ovector; there is always at least one
pair of offsets.) pair of offsets.)
Passing the subject as zero-terminated Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching func- By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It a zero-terminated string, the zero_terminate modifier is provided. It
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
via the POSIX interface, this modifier has no effect, as there is no via the POSIX interface, this modifier is ignored, with a warning.
facility for passing a length.)
When testing pcre2_substitute(), this modifier also has the effect of When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
@ -1513,8 +1538,8 @@ CALLOUTS
position, which can happen if the callout is in a lookbehind assertion. position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the /auto_callout pattern modifier. In this case, instead a result of the auto_callout pattern modifier. In this case, instead of
of showing the callout number, the offset in the pattern, preceded by a showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example: plus, is output. For example:
re> /\d?[A-E]\*/auto_callout re> /\d?[A-E]\*/auto_callout
@ -1662,5 +1687,5 @@ AUTHOR
REVISION REVISION
Last updated: 03 June 2017 Last updated: 16 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.