Documentation update.

Philip.Hazel 2017-06-16 17:57:18 +00:00
parent a083420cac
commit c92bfc3d21
9 changed files with 522 additions and 383 deletions


@ -47,7 +47,7 @@ system stack size checking, or to change one or more of these parameters:
The newline character sequence; The newline character sequence;
The compile time nested parentheses limit; The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed. The maximum pattern length (in code units) that is allowed.
The additional options bits The additional options bits (see pcre2_set_compile_extra_options())
</pre> </pre>
The option bits are: The option bits are:
<pre> <pre>
@ -64,6 +64,7 @@ The option bits are:
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns


@ -32,6 +32,8 @@ options are:
<pre> <pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
</pre> </pre>
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a> <a href="pcre2api.html"><b>pcre2api</b></a>


@ -1453,6 +1453,19 @@ continue over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
a match must occur in the first line and also within the offset limit. In other a match must occur in the first line and also within the offset limit. In other
words, whichever limit comes first is used. words, whichever limit comes first is used.
<pre>
PCRE2_LITERAL
</pre>
If this option is set, all meta-characters in the pattern are disabled, and it
is treated as a literal string. Matching literal strings with a regular
expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
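The restriction on which main options may accompany PCRE2_LITERAL can be pictured as a simple bit-mask test. The sketch below is only an illustration of that rule: the `OPT_*` names and their bit values are placeholders invented for this example, not the real values from `pcre2.h`, and `literal_options_ok()` is a hypothetical helper, not a PCRE2 function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; real code would use the
   PCRE2_xxx constants from pcre2.h. */
#define OPT_ANCHORED           (1u << 0)
#define OPT_ENDANCHORED        (1u << 1)
#define OPT_AUTO_CALLOUT       (1u << 2)
#define OPT_CASELESS           (1u << 3)
#define OPT_FIRSTLINE          (1u << 4)
#define OPT_NO_START_OPTIMIZE  (1u << 5)
#define OPT_NO_UTF_CHECK       (1u << 6)
#define OPT_UTF                (1u << 7)
#define OPT_USE_OFFSET_LIMIT   (1u << 8)
#define OPT_LITERAL            (1u << 9)
#define OPT_MULTILINE          (1u << 10)  /* an example of a disallowed option */

/* The main options that the text lists as permitted alongside LITERAL. */
#define LITERAL_ALLOWED (OPT_ANCHORED | OPT_ENDANCHORED | OPT_AUTO_CALLOUT | \
                         OPT_CASELESS | OPT_FIRSTLINE | OPT_NO_START_OPTIMIZE | \
                         OPT_NO_UTF_CHECK | OPT_UTF | OPT_USE_OFFSET_LIMIT)

/* Hypothetical check: returns true if the option word is acceptable. */
static bool literal_options_ok(uint32_t options)
{
    if ((options & OPT_LITERAL) == 0)
        return true;                       /* rule applies only with LITERAL */
    return (options & ~(LITERAL_ALLOWED | OPT_LITERAL)) == 0;
}
```

With these placeholder values, `literal_options_ok(OPT_LITERAL | OPT_CASELESS)` is true, whereas adding `OPT_MULTILINE` makes it false, which corresponds to the case where the library reports an error.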
<pre> <pre>
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
</pre> </pre>
@ -1724,6 +1737,24 @@ treated as single-character escapes. For example, \j is a literal "j" and
\x{2z} is treated as the literal string "x{2z}". Setting this option means \x{2z} is treated as the literal string "x{2z}". Setting this option means
that typos in patterns may go undetected and have unexpected results. This is a that typos in patterns may go undetected and have unexpected results. This is a
dangerous option. Use with care. dangerous option. Use with care.
<pre>
PCRE2_EXTRA_MATCH_LINE
</pre>
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
causes the pattern only to match complete lines. This is achieved by
automatically inserting the code for "^(?:" at the start of the compiled
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
line may be in the middle of the subject string. This option can be used with
PCRE2_LITERAL.
<pre>
PCRE2_EXTRA_MATCH_WORD
</pre>
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
causes the pattern only to match strings that have a word boundary at the start
and the end. This is achieved by automatically inserting the code for "\b(?:"
at the start of the compiled pattern and ")\b" at the end. The option may be
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
also set.
</P> </P>
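For a non-literal pattern, the effect of the two extra options can be approximated by wrapping the pattern text. The sketch below is a textual model only: the library inserts compiled code rather than editing the pattern string, so this is not how PCRE2 implements the options, and it does not apply when PCRE2_LITERAL is set. `wrap_pattern()` is a hypothetical helper written for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the textual equivalent of the wrapping into
   out and returns the value from snprintf(). match_word is ignored when
   match_line is set, mirroring the precedence described in the text. */
static int wrap_pattern(const char *pat, int match_line, int match_word,
                        char *out, size_t outsz)
{
    if (match_line)
        return snprintf(out, outsz, "^(?:%s)$", pat);
    if (match_word)
        return snprintf(out, outsz, "\\b(?:%s)\\b", pat);
    return snprintf(out, outsz, "%s", pat);
}
```

The non-capturing group matters for patterns with top-level alternation: without it, `cat|dog` wrapped for whole-line matching would anchor only `cat` at the start and only `dog` at the end.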
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br> <br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
<P> <P>
@ -3489,7 +3520,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br> <br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 01 June 2017 Last updated: 16 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -117,6 +117,14 @@ compilation to the native function.
The PCRE2_MULTILINE option is set when the regular expression is passed for The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section). defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSPEC
</pre>
The PCRE2_LITERAL option is set when the regular expression is passed for
compilation to the native function. This disables all meta characters in the
pattern, causing it to be treated as a literal string. The only other options
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
<pre> <pre>
REG_NOSUB REG_NOSUB
</pre> </pre>
@ -314,7 +322,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br> <br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 05 June 2017 Last updated: 15 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -96,12 +96,12 @@ want that action.
</P> </P>
<P> <P>
The input is processed using C's string functions, so must not The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b> contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. An error is generated treats any bytes other than newline as data characters. An error is generated
if a binary zero is encountered. Subject lines are processed for backslash if a binary zero is encountered. By default subject lines are processed for
escapes, which makes it possible to include any data value in strings that are backslash escapes, which makes it possible to include any data value in strings
passed to the library for matching. For patterns, there is a facility for that are passed to the library for matching. For patterns, there is a facility
specifying some or all of the 8-bit input characters as hexadecimal pairs, for specifying some or all of the 8-bit input characters as hexadecimal pairs,
which makes it possible to include binary zeros. which makes it possible to include binary zeros.
</P> </P>
<br><b> <br><b>
@ -382,8 +382,9 @@ of the standard test input files.
<P> <P>
When the POSIX API is being tested there is no way to override the default When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> modifier is used when within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
<b>#newline_default</b> would set a default for the non-POSIX API. modifier is used when <b>#newline_default</b> would set a default for the
non-POSIX API.
<pre> <pre>
#pattern &#60;modifier-list&#62; #pattern &#60;modifier-list&#62;
</pre> </pre>
@ -479,8 +480,9 @@ A pattern can be followed by a modifier list (details below).
<P> <P>
Before each subject line is passed to <b>pcre2_match()</b> or Before each subject line is passed to <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the <b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of line is scanned for backslash escapes, unless the <b>subject_literal</b>
encoding non-printing characters in a visible way: modifier was set for the pattern. The following provide a means of encoding
non-printing characters in a visible way:
<pre> <pre>
\a alarm (BEL, \x07) \a alarm (BEL, \x07)
\b backspace (\x08) \b backspace (\x08)
@ -548,6 +550,12 @@ the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input. a real empty line terminates the data input.
</P> </P>
<P>
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of backslashes.
No replication is possible, and any subject modifiers must be set as defaults
by a <b>#subject</b> command.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br> <br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P> <P>
There are several types of modifier that can appear in pattern lines. Except There are several types of modifier that can appear in pattern lines. Except
@ -586,7 +594,10 @@ for a description of the effects of these options.
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
@ -638,6 +649,7 @@ heavily used in the test files.
push push compiled pattern onto the stack push push compiled pattern onto the stack
pushcopy push a copy onto the stack pushcopy push a copy onto the stack
stackguard=&#60;number&#62; test the stackguard feature stackguard=&#60;number&#62; test the stackguard feature
subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8 utf8_input treat input as UTF-8
@ -728,18 +740,6 @@ testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
default values). default values).
</P> </P>
<br><b> <br><b>
Specifying the pattern's length
</b><br>
<P>
By default, patterns are passed to the compiling functions as zero-terminated
strings. When using the POSIX wrapper API, there is no other option. However,
when using PCRE2's native API, patterns can be passed by length instead of
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
Using a length happens automatically (whether or not <b>use_length</b> is set)
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
binary zeros.
</P>
<br><b>
Specifying pattern characters in hexadecimal Specifying pattern characters in hexadecimal
</b><br> </b><br>
<P> <P>
@ -761,11 +761,20 @@ Either single or double quotes may be used. There is no way of including
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
mutually exclusive. mutually exclusive.
</P> </P>
<br><b>
Specifying the pattern's length
</b><br>
<P> <P>
The POSIX API cannot be used with patterns specified in hexadecimal because By default, patterns are passed to the compiling functions as zero-terminated
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s strings but can be passed by length instead of being zero-terminated. The
requirement for a zero-terminated string. Such patterns are always passed to <b>use_length</b> modifier causes this to happen. Using a length happens
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated. automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
because patterns specified in hexadecimal may contain binary zeros.
</P>
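The reason a length is needed for patterns that may contain binary zeros can be seen with plain C string handling. The following is a small sketch using only the standard library; the array and function names are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* A three-code-unit pattern with a binary zero in the middle. */
static const char pattern_with_zero[] = { 'a', '\0', 'b' };

/* Length seen by anything that treats the pattern as zero-terminated:
   scanning stops at the embedded zero. */
static size_t length_if_zero_terminated(void)
{
    return strlen(pattern_with_zero);
}

/* The true length, which has to be passed explicitly. */
static size_t true_length(void)
{
    return sizeof pattern_with_zero;
}
```

Here the zero-terminated view sees one code unit while the real pattern has three, so the part after the zero would be lost unless the pattern is passed by length.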
<P>
If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
below), the REG_PEND extension is used to pass the pattern's length.
</P> </P>
<br><b> <br><b>
Specifying wide characters in 16-bit and 32-bit modes Specifying wide characters in 16-bit and 32-bit modes
@ -826,7 +835,7 @@ modifier in "Subject Modifiers"
for details of how these options are specified for each match attempt. for details of how these options are specified for each match attempt.
</P> </P>
<P> <P>
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may JIT compilation is requested by the <b>jit</b> pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7. optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating The three bits that make up the number specify which of the three JIT operating
modes are to be compiled: modes are to be compiled:
@ -850,7 +859,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial require it. Note also that if you request JIT compilation only for partial
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
subject line, that match will not use JIT code because none was compiled for subject line, that match will not use JIT code because none was compiled for
non-partial matching. non-partial matching.
</P> </P>
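The numeric value given to the <b>jit</b> modifier can be read as a three-bit mask. The sketch below assumes the bit assignment used in the <b>pcre2test</b> documentation (1 for non-partial matching, 2 for soft partial matching, 4 for hard partial matching); the structure and function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Which JIT operating modes a jit=<n> value asks to be compiled. */
struct jit_modes {
    bool complete;      /* bit value 1: non-partial matching  */
    bool partial_soft;  /* bit value 2: soft partial matching */
    bool partial_hard;  /* bit value 4: hard partial matching */
};

/* Hypothetical decoder: returns false if n is outside the range 0 to 7. */
static bool decode_jit(int n, struct jit_modes *m)
{
    if (n < 0 || n > 7)
        return false;
    m->complete     = (n & 1) != 0;
    m->partial_soft = (n & 2) != 0;
    m->partial_hard = (n & 4) != 0;
    return true;
}
```

Decoding 2 sets only `partial_soft`, which is the situation described above where a non-partial match cannot use JIT code because none was compiled for that mode.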
@ -927,12 +936,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited). variable can hold (essentially unlimited).
</P> <a name="posixwrapper"></a></P>
<br><b> <br><b>
Using the POSIX wrapper API Using the POSIX wrapper API
</b><br> </b><br>
<P> <P>
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
PCRE2 via the POSIX wrapper API rather than its native API. When PCRE2 via the POSIX wrapper API rather than its native API. When
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to <b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that <b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
@ -962,6 +971,11 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, or cause below. All other modifiers are either ignored, with a warning message, or cause
an error. an error.
</P> </P>
<P>
The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
REG_PEND extension is used to pass it by length.
</P>
<br><b> <br><b>
Testing the stack guard feature Testing the stack guard feature
</b><br> </b><br>
@ -999,17 +1013,18 @@ are mutually exclusive.
Setting certain match controls Setting certain match controls
</b><br> </b><br>
<P> <P>
The following modifiers are really subject modifiers, and are described below. The following modifiers are really subject modifiers, and are described under
However, they may be included in a pattern's modifier list, in which case they "Subject Modifiers" below. However, they may be included in a pattern's
are applied to every subject line that is processed with that pattern. They may modifier list, in which case they are applied to every subject line that is
not appear in <b>#pattern</b> commands. These modifiers do not affect the processed with that pattern. They may not appear in <b>#pattern</b> commands.
compilation process. These modifiers do not affect the compilation process.
<pre> <pre>
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
jitstack=&#60;n&#62; set size of JIT stack
mark show mark values mark show mark values
replace=&#60;string&#62; specify a replacement string replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
@ -1022,6 +1037,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command. defaults, set them in a <b>#subject</b> command.
</P> </P>
<br><b> <br><b>
Specifying literal subject lines
</b><br>
<P>
If the <b>subject_literal</b> modifier is present on a pattern, all the subject
lines that it matches are taken as literal strings, with no interpretation of
backslashes. It is not possible to set subject modifiers on such lines, but any
that are set as defaults by a <b>#subject</b> command are recognized.
</P>
<br><b>
Saving a compiled pattern Saving a compiled pattern
</b><br> </b><br>
<P> <P>
@ -1072,11 +1096,11 @@ The partial matching modifiers are provided with abbreviations because they
appear frequently in tests. appear frequently in tests.
</P> </P>
<P> <P>
If the <b>posix</b> modifier was present on the pattern, causing the POSIX If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
wrapper API to be used, the only option-setting modifiers that have any effect causing the POSIX wrapper API to be used, the only option-setting modifiers
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL, that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>. causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
The other modifiers are ignored, with a warning message. <b>regexec()</b>. The other modifiers are ignored, with a warning message.
</P> </P>
<P> <P>
There is one additional modifier that can be used with the POSIX wrapper. It is There is one additional modifier that can be used with the POSIX wrapper. It is
@ -1085,11 +1109,13 @@ ignored (with a warning) if used for non-POSIX matching.
posix_startend=&#60;n&#62;[:&#60;m&#62;] posix_startend=&#60;n&#62;[:&#60;m&#62;]
</pre> </pre>
This causes the subject string to be passed to <b>regexec()</b> using the This causes the subject string to be passed to <b>regexec()</b> using the
REG_STARTEND option, which uses offsets to restrict which part of the string is REG_STARTEND option, which uses offsets to specify which part of the string is
searched. If only one number is given, the end offset is passed as the end of searched. If only one number is given, the end offset is passed as the end of
the subject string. For more detail of REG_STARTEND, see the the subject string. For more detail of REG_STARTEND, see the
<a href="pcre2posix.html"><b>pcre2posix</b></a> <a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. documentation. If the subject string contains binary zeros (coded as escapes
such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
its input), you must use <b>posix_startend</b> to specify its length.
</P> </P>
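The <b>posix_startend</b> argument syntax and its default for the end offset can be sketched as follows. This is an illustrative parser, not the code in <b>pcre2test</b>; `parse_startend()` is a hypothetical name, and error handling is reduced to a single failure code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical parser for "<n>" or "<n>:<m>". When only <n> is given, the
   end offset defaults to the subject length. Returns 0 on success and -1
   on a malformed argument. */
static int parse_startend(const char *arg, size_t subject_len,
                          size_t *start, size_t *end)
{
    char *p;
    unsigned long n = strtoul(arg, &p, 10);
    if (p == arg)
        return -1;                    /* no digits for <n> */
    if (*p == '\0') {                 /* only one number given */
        *start = (size_t)n;
        *end = subject_len;
        return 0;
    }
    if (*p != ':')
        return -1;
    const char *q = p + 1;
    unsigned long m = strtoul(q, &p, 10);
    if (p == q || *p != '\0')
        return -1;                    /* missing or malformed <m> */
    *start = (size_t)n;
    *end = (size_t)m;
    return 0;
}
```

In a real call, the two offsets would be placed in the first `regmatch_t` element passed to <b>regexec()</b> together with REG_STARTEND, as described in the <b>pcre2posix</b> documentation.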
<br><b> <br><b>
Setting match controls Setting match controls
@ -1355,9 +1381,11 @@ Setting the JIT stack size
<P> <P>
The <b>jitstack</b> modifier provides a way of setting the maximum stack size The <b>jitstack</b> modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT that is used by the just-in-time optimization code. It is ignored if JIT
optimization is not being used. The value is a number of kilobytes. Providing a optimization is not being used. The value is a number of kilobytes. Setting
stack that is larger than the default 32K is necessary only for very zero reverts to the default of 32K. Providing a stack that is larger than the
complicated patterns. default is necessary only for very complicated patterns. If <b>jitstack</b> is
set non-zero on a subject line it overrides any value that was set on the
pattern.
</P> </P>
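One way to read the precedence between a pattern-level and a subject-level <b>jitstack</b> setting is sketched below. It is an interpretation of the paragraph above rather than the actual <b>pcre2test</b> logic, and `effective_jitstack_kb()` is a hypothetical helper.

```c
/* Default JIT stack size in kilobytes, as stated in the text. */
#define DEFAULT_JITSTACK_KB 32u

/* Hypothetical helper: a non-zero subject value wins, then a non-zero
   pattern value, and zero everywhere means the default. */
static unsigned effective_jitstack_kb(unsigned pattern_kb, unsigned subject_kb)
{
    if (subject_kb != 0)
        return subject_kb;
    if (pattern_kb != 0)
        return pattern_kb;
    return DEFAULT_JITSTACK_KB;
}
```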
<br><b> <br><b>
Setting heap, match, and depth limits Setting heap, match, and depth limits
@ -1461,8 +1489,8 @@ Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching function with By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated its correct length. In order to test the facility for passing a zero-terminated
string, the <b>zero_terminate</b> modifier is provided. It causes the length to string, the <b>zero_terminate</b> modifier is provided. It causes the length to
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
this modifier has no effect, as there is no facility for passing a length.) this modifier is ignored, with a warning.
</P> </P>
<P> <P>
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
@ -1675,7 +1703,7 @@ callout is in a lookbehind assertion.
</P> </P>
<P> <P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of result of the <b>auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example: output. For example:
<pre> <pre>
@ -1830,7 +1858,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 03 June 2017 Last updated: 16 June 2017
<br> <br>
Copyright &copy; 1997-2017 University of Cambridge. Copyright &copy; 1997-2017 University of Cambridge.
<br> <br>


@ -1441,6 +1441,20 @@ COMPILING A PATTERN
first line and also within the offset limit. In other words, whichever first line and also within the offset limit. In other words, whichever
limit comes first is used. limit comes first is used.
PCRE2_LITERAL
If this option is set, all meta-characters in the pattern are disabled,
and it is treated as a literal string. Matching literal strings with a
regular expression engine is not the most efficient way of doing it. If
you are doing a lot of literal matching and are worried about effi-
ciency, you should consider using other approaches. The only other main
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_MATCH_UNSET_BACKREF PCRE2_MATCH_UNSET_BACKREF
If this option is set, a back reference to an unset subpattern group If this option is set, a back reference to an unset subpattern group
@ -1706,6 +1720,24 @@ COMPILING A PATTERN
option means that typos in patterns may go undetected and have unex- option means that typos in patterns may go undetected and have unex-
pected results. This is a dangerous option. Use with care. pected results. This is a dangerous option. Use with care.
PCRE2_EXTRA_MATCH_LINE
This option is provided for use by the -x option of pcre2grep. It
causes the pattern only to match complete lines. This is achieved by
automatically inserting the code for "^(?:" at the start of the com-
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
the matched line may be in the middle of the subject string. This
option can be used with PCRE2_LITERAL.
PCRE2_EXTRA_MATCH_WORD
This option is provided for use by the -w option of pcre2grep. It
causes the pattern only to match strings that have a word boundary at
the start and the end. This is achieved by automatically inserting the
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
end. The option may be used with PCRE2_LITERAL. However, it is ignored
if PCRE2_EXTRA_MATCH_LINE is also set.
COMPILATION ERROR CODES COMPILATION ERROR CODES
@ -3368,7 +3400,7 @@ AUTHOR
REVISION REVISION
Last updated: 01 June 2017 Last updated: 16 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -9036,6 +9068,15 @@ COMPILING A PATTERN
the defined POSIX behaviour for REG_NEWLINE (see the following sec- the defined POSIX behaviour for REG_NEWLINE (see the following sec-
tion). tion).
REG_NOSPEC
The PCRE2_LITERAL option is set when the regular expression is passed
for compilation to the native function. This disables all meta charac-
ters in the pattern, causing it to be treated as a literal string. The
only other options that are allowed with REG_NOSPEC are REG_ICASE,
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
the POSIX standard.
REG_NOSUB REG_NOSUB
When a pattern that is compiled with this flag is passed to regexec() When a pattern that is compiled with this flag is passed to regexec()
@ -9232,7 +9273,7 @@ AUTHOR
REVISION REVISION
Last updated: 05 June 2017 Last updated: 15 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_COMPILE 3 "17 May 2017" "PCRE2 10.30" .TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -35,7 +35,7 @@ system stack size checking, or to change one or more of these parameters:
The newline character sequence; The newline character sequence;
The compile time nested parentheses limit; The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed. The maximum pattern length (in code units) that is allowed.
The additional options bits The additional options bits (see pcre2_set_compile_extra_options())
.sp .sp
The option bits are: The option bits are:
.sp .sp
@ -52,6 +52,7 @@ The option bits are:
PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns


@ -1,4 +1,4 @@
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30" .TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS .SH SYNOPSIS
@ -24,6 +24,8 @@ options are:
.\" JOIN .\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character a literal following character
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
.sp .sp
There is a complete description of the PCRE2 native API in the There is a complete description of the PCRE2 native API in the
.\" HREF .\" HREF


@ -64,12 +64,12 @@ INPUT ENCODING
unless you really want that action. unless you really want that action.
The input is processed using C's string functions, so must not The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, fgets() contain binary zeros, even though in Unix-like environments, fgets()
treats any bytes other than newline as data characters. An error is treats any bytes other than newline as data characters. An error is
generated if a binary zero is encountered. Subject lines are processed generated if a binary zero is encountered. By default subject lines are
for backslash escapes, which makes it possible to include any data processed for backslash escapes, which makes it possible to include any
value in strings that are passed to the library for matching. For pat- data value in strings that are passed to the library for matching. For
terns, there is a facility for specifying some or all of the 8-bit patterns, there is a facility for specifying some or all of the 8-bit
input characters as hexadecimal pairs, which makes it possible to input characters as hexadecimal pairs, which makes it possible to
include binary zeros. include binary zeros.
@ -319,9 +319,9 @@ COMMAND LINES
When the POSIX API is being tested there is no way to override the When the POSIX API is being tested there is no way to override the
default newline convention, though it is possible to set the newline default newline convention, though it is possible to set the newline
convention from within the pattern. A warning is given if the posix convention from within the pattern. A warning is given if the posix or
modifier is used when #newline_default would set a default for the non- posix_nosub modifier is used when #newline_default would set a default
POSIX API. for the non-POSIX API.
#pattern <modifier-list> #pattern <modifier-list>
@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or Before each subject line is passed to pcre2_match() or
pcre2_dfa_match(), leading and trailing white space is removed, and the pcre2_dfa_match(), leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of line is scanned for backslash escapes, unless the subject_literal modi-
encoding non-printing characters in a visible way: fier was set for the pattern. The following provide a means of encoding
non-printing characters in a visible way:
\a alarm (BEL, \x07) \a alarm (BEL, \x07)
\b backspace (\x08) \b backspace (\x08)
@ -493,6 +494,11 @@ SUBJECT LINE SYNTAX
passing an empty line as data, since a real empty line terminates the passing an empty line as data, since a real empty line terminates the
data input. data input.
If the subject_literal modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of back-
slashes. No replication is possible, and any subject modifiers must be
set as defaults by a #subject command.
PATTERN MODIFIERS PATTERN MODIFIERS
@ -530,7 +536,10 @@ PATTERN MODIFIERS
/x extended set PCRE2_EXTENDED /x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
literal set PCRE2_LITERAL
match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
@ -580,6 +589,7 @@ PATTERN MODIFIERS
push push compiled pattern onto the stack push push compiled pattern onto the stack
pushcopy push a copy onto the stack pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature stackguard=<number> test the stackguard feature
subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8 utf8_input treat input as UTF-8
@ -659,16 +669,6 @@ PATTERN MODIFIERS
testing that pcre2_compile() behaves correctly in this case (it uses testing that pcre2_compile() behaves correctly in this case (it uses
default values). default values).
Specifying the pattern's length
By default, patterns are passed to the compiling functions as zero-ter-
minated strings. When using the POSIX wrapper API, there is no other
option. However, when using PCRE2's native API, patterns can be passed
by length instead of being zero-terminated. The use_length modifier
causes this to happen. Using a length happens automatically (whether
or not use_length is set) when hex is set, because patterns specified
in hexadecimal may contain binary zeros.
Specifying pattern characters in hexadecimal Specifying pattern characters in hexadecimal
The hex modifier specifies that the characters of the pattern, except The hex modifier specifies that the characters of the pattern, except
@ -690,11 +690,18 @@ PATTERN MODIFIERS
ing the delimiter within a substring. The hex and expand modifiers are ing the delimiter within a substring. The hex and expand modifiers are
mutually exclusive. mutually exclusive.
The POSIX API cannot be used with patterns specified in hexadecimal Specifying the pattern's length
because they may contain binary zeros, which conflicts with regcomp()'s
requirement for a zero-terminated string. Such patterns are always By default, patterns are passed to the compiling functions as zero-ter-
passed to pcre2_compile() as a string with a length, not as zero-termi- minated strings but can be passed by length instead of being zero-ter-
nated. minated. The use_length modifier causes this to happen. Using a length
happens automatically (whether or not use_length is set) when hex is
set, because patterns specified in hexadecimal may contain binary
zeros.
If hex or use_length is used with the POSIX wrapper API (see "Using the
POSIX wrapper API" below), the REG_PEND extension is used to pass the
pattern's length.
Specifying wide characters in 16-bit and 32-bit modes Specifying wide characters in 16-bit and 32-bit modes
@ -742,7 +749,7 @@ PATTERN MODIFIERS
partial modifier in "Subject Modifiers" below for details of how these partial modifier in "Subject Modifiers" below for details of how these
options are specified for each match attempt. options are specified for each match attempt.
JIT compilation is requested by the /jit pattern modifier, which may JIT compilation is requested by the jit pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to optionally be followed by an equals sign and a number in the range 0 to
7. The three bits that make up the number specify which of the three 7. The three bits that make up the number specify which of the three
JIT operating modes are to be compiled: JIT operating modes are to be compiled:
@ -766,7 +773,7 @@ PATTERN MODIFIERS
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com- PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
plete match; the options enable the possibility of a partial match, but plete match; the options enable the possibility of a partial match, but
do not require it. Note also that if you request JIT compilation only do not require it. Note also that if you request JIT compilation only
for partial matching (for example, /jit=2) but do not set the partial for partial matching (for example, jit=2) but do not set the partial
modifier on a subject line, that match will not use JIT code because modifier on a subject line, that match will not use JIT code because
none was compiled for non-partial matching. none was compiled for non-partial matching.
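
The relationship between the jit number and the kind of match can be sketched in C as below. This is only an illustration, not pcre2test's own code: the names JIT_BIT_COMPLETE, JIT_BIT_PARTIAL_SOFT, JIT_BIT_PARTIAL_HARD and jit_code_compiled_for() are hypothetical, and the bit values 1, 2 and 4 are assumed to stand for non-partial, soft partial and hard partial matching respectively.

```c
#include <assert.h>

/* Hypothetical names for the three mode bits of the jit=<number> modifier.
   The assignment of values to modes is an assumption of this sketch. */
enum {
    JIT_BIT_COMPLETE     = 1,  /* non-partial matching */
    JIT_BIT_PARTIAL_SOFT = 2,  /* soft partial matching */
    JIT_BIT_PARTIAL_HARD = 4   /* hard partial matching */
};

/* Return non-zero if a pattern compiled with jit=<jit_number> has JIT
   code for the matching mode identified by mode_bit. */
int jit_code_compiled_for(int jit_number, int mode_bit)
{
    assert(jit_number >= 0 && jit_number <= 7);  /* documented range */
    return (jit_number & mode_bit) != 0;
}
```

Under these assumptions, jit_code_compiled_for(2, JIT_BIT_COMPLETE) is zero, which corresponds to the case described above: a pattern compiled with jit=2 has no JIT code for a subject line that does not set the partial modifier.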
@ -833,7 +840,7 @@ PATTERN MODIFIERS
Using the POSIX wrapper API Using the POSIX wrapper API
The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
the POSIX wrapper API rather than its native API. When posix_nosub is the POSIX wrapper API rather than its native API. When posix_nosub is
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
wrapper supports only the 8-bit library. Note that it does not imply wrapper supports only the 8-bit library. Note that it does not imply
@ -862,6 +869,10 @@ PATTERN MODIFIERS
below. All other modifiers are either ignored, with a warning message, below. All other modifiers are either ignored, with a warning message,
or cause an error. or cause an error.
The pattern is passed to regcomp() as a zero-terminated string by
default, but if the use_length or hex modifiers are set, the REG_PEND
extension is used to pass it by length.
Testing the stack guard feature Testing the stack guard feature
The stackguard modifier is used to test the use of pcre2_set_com- The stackguard modifier is used to test the use of pcre2_set_com-
@ -894,16 +905,18 @@ PATTERN MODIFIERS
Setting certain match controls Setting certain match controls
The following modifiers are really subject modifiers, and are described The following modifiers are really subject modifiers, and are described
below. However, they may be included in a pattern's modifier list, in under "Subject Modifiers" below. However, they may be included in a
which case they are applied to every subject line that is processed pattern's modifier list, in which case they are applied to every sub-
with that pattern. They may not appear in #pattern commands. These mod- ject line that is processed with that pattern. They may not appear in
ifiers do not affect the compilation process. #pattern commands. These modifiers do not affect the compilation
process.
aftertext show text after match aftertext show text after match
allaftertext show text after captures allaftertext show text after captures
allcaptures show all captures allcaptures show all captures
allusedtext show all consulted text allusedtext show all consulted text
/g global global matching /g global global matching
jitstack=<n> set size of JIT stack
mark show mark values mark show mark values
replace=<string> specify a replacement string replace=<string> specify a replacement string
startchar show starting character when relevant startchar show starting character when relevant
@ -915,6 +928,14 @@ PATTERN MODIFIERS
These modifiers may not appear in a #pattern command. If you want them These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command. as defaults, set them in a #subject command.
Specifying literal subject lines
If the subject_literal modifier is present on a pattern, all the sub-
ject lines that it matches are taken as literal strings, with no inter-
pretation of backslashes. It is not possible to set subject modifiers
on such lines, but any that are set as defaults by a #subject command
are recognized.
Saving a compiled pattern Saving a compiled pattern
When a pattern with the push modifier is successfully compiled, it is When a pattern with the push modifier is successfully compiled, it is
@ -959,11 +980,11 @@ SUBJECT MODIFIERS
The partial matching modifiers are provided with abbreviations because The partial matching modifiers are provided with abbreviations because
they appear frequently in tests. they appear frequently in tests.
If the posix modifier was present on the pattern, causing the POSIX If the posix or posix_nosub modifier was present on the pattern, caus-
wrapper API to be used, the only option-setting modifiers that have any ing the POSIX wrapper API to be used, the only option-setting modifiers
effect are notbol, notempty, and noteol, causing REG_NOTBOL, that have any effect are notbol, notempty, and noteol, causing REG_NOT-
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
The other modifiers are ignored, with a warning message. regexec(). The other modifiers are ignored, with a warning message.
There is one additional modifier that can be used with the POSIX wrap- There is one additional modifier that can be used with the POSIX wrap-
per. It is ignored (with a warning) if used for non-POSIX matching. per. It is ignored (with a warning) if used for non-POSIX matching.
@ -971,10 +992,13 @@ SUBJECT MODIFIERS
posix_startend=<n>[:<m>] posix_startend=<n>[:<m>]
This causes the subject string to be passed to regexec() using the This causes the subject string to be passed to regexec() using the
REG_STARTEND option, which uses offsets to restrict which part of the REG_STARTEND option, which uses offsets to specify which part of the
string is searched. If only one number is given, the end offset is string is searched. If only one number is given, the end offset is
passed as the end of the subject string. For more detail of REG_STAR- passed as the end of the subject string. For more detail of REG_STAR-
TEND, see the pcre2posix documentation. TEND, see the pcre2posix documentation. If the subject string contains
binary zeros (coded as escapes such as \x{00} because pcre2test does
not support actual binary zeros in its input), you must use posix_star-
tend to specify its length.
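
As a rough sketch of the posix_startend=<n>[:<m>] syntax, the following C function turns the modifier value into a start and end offset, using the subject length as the end when only one number is given. The function name parse_startend() and its return convention are hypothetical; this is not how pcre2test itself is implemented, only an illustration of the rule stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parse "<n>[:<m>]" into start and end offsets.
   If only <n> is present, the end offset is the end of the subject.
   Returns 0 on success, -1 if no number could be read. */
int parse_startend(const char *value, size_t subject_len,
                   size_t *start, size_t *end)
{
    unsigned long n = 0, m = 0;
    int fields;

    assert(value != NULL && start != NULL && end != NULL);

    fields = sscanf(value, "%lu:%lu", &n, &m);
    if (fields < 1) return -1;           /* nothing usable was given */

    *start = (size_t)n;
    *end = (fields == 2) ? (size_t)m : subject_len;
    return 0;
}
```

With REG_STARTEND, offsets such as these are conventionally supplied to regexec() in the rm_so and rm_eo fields of the first regmatch_t element; see the pcre2posix documentation for the details that apply to PCRE2's wrapper.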
Setting match controls Setting match controls
@ -1222,8 +1246,10 @@ SUBJECT MODIFIERS
The jitstack modifier provides a way of setting the maximum stack size The jitstack modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kilobytes. JIT optimization is not being used. The value is a number of kilobytes.
Providing a stack that is larger than the default 32K is necessary only Setting zero reverts to the default of 32K. Providing a stack that is
for very complicated patterns. larger than the default is necessary only for very complicated pat-
terns. If jitstack is set non-zero on a subject line it overrides any
value that was set on the pattern.
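
The precedence between a jitstack value on the pattern and one on the subject line might be pictured as follows. This is a sketch of the rule as worded above, not pcre2test's implementation; effective_jitstack_kb() and DEFAULT_JITSTACK_KB are hypothetical names, and the handling of a zero subject value (fall back to the pattern's value) is an assumption.

```c
#include <assert.h>

/* Default JIT stack size in kilobytes, as stated in the text above. */
#define DEFAULT_JITSTACK_KB 32

/* Hypothetical helper: choose the JIT stack size in kilobytes.
   A non-zero subject value overrides the pattern value; a zero
   value means "use the default". */
unsigned int effective_jitstack_kb(unsigned int pattern_kb,
                                   unsigned int subject_kb)
{
    unsigned int chosen = (subject_kb != 0) ? subject_kb : pattern_kb;
    unsigned int result = (chosen != 0) ? chosen : DEFAULT_JITSTACK_KB;
    assert(result != 0);  /* a size is always selected */
    return result;
}
```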
Setting heap, match, and depth limits Setting heap, match, and depth limits
@ -1310,9 +1336,8 @@ SUBJECT MODIFIERS
By default, the subject string is passed to a native API matching func- By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It a zero-terminated string, the zero_terminate modifier is provided. It
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
via the POSIX interface, this modifier has no effect, as there is no via the POSIX interface, this modifier is ignored, with a warning.
facility for passing a length.)
When testing pcre2_substitute(), this modifier also has the effect of When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated. passing the replacement string as zero-terminated.
@ -1513,8 +1538,8 @@ CALLOUTS
position, which can happen if the callout is in a lookbehind assertion. position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the /auto_callout pattern modifier. In this case, instead a result of the auto_callout pattern modifier. In this case, instead of
of showing the callout number, the offset in the pattern, preceded by a showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example: plus, is output. For example:
re> /\d?[A-E]\*/auto_callout re> /\d?[A-E]\*/auto_callout
@ -1662,5 +1687,5 @@ AUTHOR
REVISION REVISION
Last updated: 03 June 2017 Last updated: 16 June 2017
Copyright (c) 1997-2017 University of Cambridge. Copyright (c) 1997-2017 University of Cambridge.