Documentation update.
parent a083420cac
commit c92bfc3d21
|
@ -47,7 +47,7 @@ system stack size checking, or to change one or more of these parameters:
|
|||
The newline character sequence;
|
||||
The compile time nested parentheses limit;
|
||||
The maximum pattern length (in code units) that is allowed.
|
||||
The additional options bits
|
||||
The additional options bits (see pcre2_set_compile_extra_options())
|
||||
</pre>
|
||||
The option bits are:
|
||||
<pre>
|
||||
|
@ -64,6 +64,7 @@ The option bits are:
|
|||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
|
|
|
@ -32,6 +32,8 @@ options are:
|
|||
<pre>
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
|
||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
|
||||
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||
</pre>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
|
|
|
@ -1453,6 +1453,19 @@ continue over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a
|
|||
more general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit,
|
||||
a match must occur in the first line and also within the offset limit. In other
|
||||
words, whichever limit comes first is used.
|
||||
<pre>
|
||||
PCRE2_LITERAL
|
||||
</pre>
|
||||
If this option is set, all meta-characters in the pattern are disabled, and it
|
||||
is treated as a literal string. Matching literal strings with a regular
|
||||
expression engine is not the most efficient way of doing it. If you are doing a
|
||||
lot of literal matching and are worried about efficiency, you should consider
|
||||
using other approaches. The only other main options that are allowed with
|
||||
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
||||
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
|
||||
and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
|
@ -1724,6 +1737,24 @@ treated as single-character escapes. For example, \j is a literal "j" and
|
|||
\x{2z} is treated as the literal string "x{2z}". Setting this option means
|
||||
that typos in patterns may go undetected and have unexpected results. This is a
|
||||
dangerous option. Use with care.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_LINE
|
||||
</pre>
|
||||
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match complete lines. This is achieved by
|
||||
automatically inserting the code for "^(?:" at the start of the compiled
|
||||
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
|
||||
line may be in the middle of the subject string. This option can be used with
|
||||
PCRE2_LITERAL.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_WORD
|
||||
</pre>
|
||||
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match strings that have a word boundary at the start
|
||||
and the end. This is achieved by automatically inserting the code for "\b(?:"
|
||||
at the start of the compiled pattern and ")\b" at the end. The option may be
|
||||
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
|
||||
also set.
|
||||
</P>
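<P>
As a rough illustration of the wrapping described for PCRE2_EXTRA_MATCH_LINE
and PCRE2_EXTRA_MATCH_WORD, the sketch below builds the textual equivalent of
what the options insert. The helper <b>wrap_pattern()</b> is hypothetical and
is not part of PCRE2. The library inserts compiled code rather than pattern
text, so this equivalence holds only for non-literal patterns; with
PCRE2_LITERAL the pattern body would not be interpreted as a regular
expression.
</P>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: writes into out the pattern text
   that corresponds to compiling `pattern` with the given extra options.
   Returns 0 on success, -1 if the buffer is too small. */
static int wrap_pattern(const char *pattern, int match_line, int match_word,
                        char *out, size_t outsize)
{
    const char *prefix = "";
    const char *suffix = "";

    if (match_line) {
        /* MATCH_LINE takes precedence; MATCH_WORD is ignored when both are set. */
        prefix = "^(?:";
        suffix = ")$";
    } else if (match_word) {
        prefix = "\\b(?:";
        suffix = ")\\b";
    }

    int n = snprintf(out, outsize, "%s%s%s", prefix, pattern, suffix);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

<P>
The non-capturing group keeps a top-level alternation such as "cat|dog" inside
the anchors, which is why the inserted code is "^(?:" and ")$" rather than a
bare "^" and "$".
</P>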
|
||||
<br><a name="SEC20" href="#TOC1">COMPILATION ERROR CODES</a><br>
|
||||
<P>
|
||||
|
@ -3489,7 +3520,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 01 June 2017
|
||||
Last updated: 16 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -117,6 +117,14 @@ compilation to the native function.
|
|||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
<pre>
|
||||
REG_NOSPEC
|
||||
</pre>
|
||||
The PCRE2_LITERAL option is set when the regular expression is passed for
|
||||
compilation to the native function. This disables all meta characters in the
|
||||
pattern, causing it to be treated as a literal string. The only other options
|
||||
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
|
||||
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
|
||||
<pre>
|
||||
REG_NOSUB
|
||||
</pre>
|
||||
|
@ -314,7 +322,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 05 June 2017
|
||||
Last updated: 15 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -96,12 +96,12 @@ want that action.
|
|||
</P>
|
||||
<P>
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||
contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
|
||||
treats any bytes other than newline as data characters. An error is generated
|
||||
if a binary zero is encountered. Subject lines are processed for backslash
|
||||
escapes, which makes it possible to include any data value in strings that are
|
||||
passed to the library for matching. For patterns, there is a facility for
|
||||
specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
if a binary zero is encountered. By default subject lines are processed for
|
||||
backslash escapes, which makes it possible to include any data value in strings
|
||||
that are passed to the library for matching. For patterns, there is a facility
|
||||
for specifying some or all of the 8-bit input characters as hexadecimal pairs,
|
||||
which makes it possible to include binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
|
@ -382,8 +382,9 @@ of the standard test input files.
|
|||
<P>
|
||||
When the POSIX API is being tested there is no way to override the default
|
||||
newline convention, though it is possible to set the newline convention from
|
||||
within the pattern. A warning is given if the <b>posix</b> modifier is used when
|
||||
<b>#newline_default</b> would set a default for the non-POSIX API.
|
||||
within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
|
||||
modifier is used when <b>#newline_default</b> would set a default for the
|
||||
non-POSIX API.
|
||||
<pre>
|
||||
#pattern <modifier-list>
|
||||
</pre>
|
||||
|
@ -479,8 +480,9 @@ A pattern can be followed by a modifier list (details below).
|
|||
<P>
|
||||
Before each subject line is passed to <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
|
||||
line is scanned for backslash escapes. The following provide a means of
|
||||
encoding non-printing characters in a visible way:
|
||||
line is scanned for backslash escapes, unless the <b>subject_literal</b>
|
||||
modifier was set for the pattern. The following provide a means of encoding
|
||||
non-printing characters in a visible way:
|
||||
<pre>
|
||||
\a alarm (BEL, \x07)
|
||||
\b backspace (\x08)
|
||||
|
@ -548,6 +550,12 @@ the very last character in the line is a backslash (and there is no modifier
|
|||
list), it is ignored. This gives a way of passing an empty line as data, since
|
||||
a real empty line terminates the data input.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
|
||||
that follow are treated as literals, with no special treatment of backslashes.
|
||||
No replication is possible, and any subject modifiers must be set as defaults
|
||||
by a <b>#subject</b> command.
|
||||
</P>
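<P>
The difference between escaped and literal subject lines can be sketched as
follows. This is a simplified model, not <b>pcre2test</b>'s own code: the
helper <b>decode_subject()</b> is hypothetical and handles only \a, \b, \\ and
\xhh (up to two hex digits); any other escaped character is copied through
unchanged.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Converts one hexadecimal digit character to its value. */
static unsigned hex_value(int c)
{
    return isdigit(c) ? (unsigned)(c - '0') : (unsigned)(tolower(c) - 'a' + 10);
}

/* Hypothetical helper: decodes a subject line into out, which must have room
   for at least strlen(line) bytes. Returns the number of bytes written.
   When subject_literal is nonzero, backslashes get no special treatment. */
static size_t decode_subject(const char *line, int subject_literal,
                             unsigned char *out)
{
    size_t n = 0;

    while (*line != '\0') {
        if (subject_literal || *line != '\\') {
            out[n++] = (unsigned char)*line++;
            continue;
        }
        line++;                                  /* skip the backslash */
        switch (*line) {
        case 'a':  out[n++] = 0x07; line++; break;   /* alarm (BEL) */
        case 'b':  out[n++] = 0x08; line++; break;   /* backspace */
        case '\\': out[n++] = '\\'; line++; break;
        case 'x': {
            unsigned value = 0;
            int digits = 0;
            line++;
            while (digits < 2 && isxdigit((unsigned char)*line)) {
                value = value * 16 + hex_value((unsigned char)*line);
                line++;
                digits++;
            }
            out[n++] = (unsigned char)value;     /* may be a binary zero */
            break;
        }
        case '\0': break;                        /* trailing backslash is ignored */
        default:   out[n++] = (unsigned char)*line++; break;  /* not modelled */
        }
    }
    return n;
}
```

<P>
Because the decoded data may contain a zero byte, the function returns a
length instead of relying on a terminator.
</P>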
|
||||
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
|
||||
<P>
|
||||
There are several types of modifier that can appear in pattern lines. Except
|
||||
|
@ -586,7 +594,10 @@ for a description of the effects of these options.
|
|||
/x extended set PCRE2_EXTENDED
|
||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
|
@ -638,6 +649,7 @@ heavily used in the test files.
|
|||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
subject_literal treat all subject lines as literal
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
|
@ -728,18 +740,6 @@ testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
|
|||
default values).
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying the pattern's length
|
||||
</b><br>
|
||||
<P>
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings. When using the POSIX wrapper API, there is no other option. However,
|
||||
when using PCRE2's native API, patterns can be passed by length instead of
|
||||
being zero-terminated. The <b>use_length</b> modifier causes this to happen.
|
||||
Using a length happens automatically (whether or not <b>use_length</b> is set)
|
||||
when <b>hex</b> is set, because patterns specified in hexadecimal may contain
|
||||
binary zeros.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying pattern characters in hexadecimal
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -761,11 +761,20 @@ Either single or double quotes may be used. There is no way of including
|
|||
the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
|
||||
mutually exclusive.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying the pattern's length
|
||||
</b><br>
|
||||
<P>
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal because
|
||||
they may contain binary zeros, which conflicts with <b>regcomp()</b>'s
|
||||
requirement for a zero-terminated string. Such patterns are always passed to
|
||||
<b>pcre2_compile()</b> as a string with a length, not as zero-terminated.
|
||||
By default, patterns are passed to the compiling functions as zero-terminated
|
||||
strings but can be passed by length instead of being zero-terminated. The
|
||||
<b>use_length</b> modifier causes this to happen. Using a length happens
|
||||
automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
|
||||
because patterns specified in hexadecimal may contain binary zeros.
|
||||
</P>
|
||||
<P>
|
||||
If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
|
||||
<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
|
||||
below), the REG_PEND extension is used to pass the pattern's length.
|
||||
</P>
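<P>
To see why a hexadecimal pattern has to be passed by length, consider this
minimal sketch of decoding hex pairs into bytes. The helper
<b>decode_hex_pattern()</b> is hypothetical; it accepts only pairs of hex
digits separated by optional white space and does not model quoted substrings.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Converts one hexadecimal digit character to its value. */
static int hex_digit(int c)
{
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decodes text such as "41 00 42" into out.
   Returns the number of bytes written, or -1 on a malformed pair or if out
   is too small. The result is NOT zero-terminated; the caller must carry
   the returned length alongside the buffer. */
static long decode_hex_pattern(const char *text, unsigned char *out,
                               size_t outsize)
{
    size_t n = 0;

    while (*text != '\0') {
        if (isspace((unsigned char)*text)) {     /* white space between pairs */
            text++;
            continue;
        }
        if (!isxdigit((unsigned char)text[0]) ||
            !isxdigit((unsigned char)text[1]))
            return -1;                           /* no white space within a pair */
        if (n >= outsize)
            return -1;
        out[n++] = (unsigned char)(hex_digit((unsigned char)text[0]) * 16 +
                                   hex_digit((unsigned char)text[1]));
        text += 2;
    }
    return (long)n;
}
```

<P>
Decoding "41 00 42" yields three bytes, the second of which is a binary zero.
A function that expects a zero-terminated string would see only the first
byte, so the length has to be supplied separately.
</P>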
|
||||
<br><b>
|
||||
Specifying wide characters in 16-bit and 32-bit modes
|
||||
|
@ -826,7 +835,7 @@ modifier in "Subject Modifiers"
|
|||
for details of how these options are specified for each match attempt.
|
||||
</P>
|
||||
<P>
|
||||
JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
|
||||
JIT compilation is requested by the <b>jit</b> pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to 7.
|
||||
The three bits that make up the number specify which of the three JIT operating
|
||||
modes are to be compiled:
|
||||
|
@ -850,7 +859,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
|
|||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
|
||||
match; the options enable the possibility of a partial match, but do not
|
||||
require it. Note also that if you request JIT compilation only for partial
|
||||
matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
|
||||
matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
|
||||
subject line, that match will not use JIT code because none was compiled for
|
||||
non-partial matching.
|
||||
</P>
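<P>
The interaction between the <b>jit</b> number and the <b>partial</b> modifiers
can be sketched as a bit test. The list of bit meanings is not shown in this
hunk; the sketch assumes the assignments 1 (non-partial), 2 (soft partial) and
4 (hard partial), which correspond to PCRE2_JIT_COMPLETE,
PCRE2_JIT_PARTIAL_SOFT and PCRE2_JIT_PARTIAL_HARD. The names below are local
stand-ins so that the example compiles without the library.
</P>

```c
/* Local stand-ins for the three JIT mode bits (assumed values 1, 2, 4). */
enum {
    JIT_COMPLETE     = 1,   /* non-partial matching */
    JIT_PARTIAL_SOFT = 2,   /* soft partial matching */
    JIT_PARTIAL_HARD = 4    /* hard partial matching */
};

/* Hypothetical helper: returns nonzero if JIT code compiled with the modes
   in jit_modes can serve a match attempt with the given partial settings.
   Hard partial is checked first, then soft partial, else complete. */
static int jit_code_available(unsigned jit_modes, int partial_soft,
                              int partial_hard)
{
    unsigned needed = partial_hard ? JIT_PARTIAL_HARD
                    : partial_soft ? JIT_PARTIAL_SOFT
                    : JIT_COMPLETE;
    return (jit_modes & needed) != 0;
}
```

<P>
With jit=2 and no <b>partial</b> modifier on the subject line, the test fails
because only the soft-partial bit was compiled, which is the situation the
paragraph above describes.
</P>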
|
||||
|
@ -927,12 +936,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
|
|||
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
|
||||
causes a compilation error. The default is the largest number a PCRE2_SIZE
|
||||
variable can hold (essentially unlimited).
|
||||
</P>
|
||||
<a name="posixwrapper"></a></P>
|
||||
<br><b>
|
||||
Using the POSIX wrapper API
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||
The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
|
||||
PCRE2 via the POSIX wrapper API rather than its native API. When
|
||||
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
|
||||
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
|
||||
|
@ -962,6 +971,11 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
|
|||
below. All other modifiers are either ignored, with a warning message, or cause
|
||||
an error.
|
||||
</P>
|
||||
<P>
|
||||
The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
|
||||
default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
|
||||
REG_PEND extension is used to pass it by length.
|
||||
</P>
|
||||
<br><b>
|
||||
Testing the stack guard feature
|
||||
</b><br>
|
||||
|
@ -999,17 +1013,18 @@ are mutually exclusive.
|
|||
Setting certain match controls
|
||||
</b><br>
|
||||
<P>
|
||||
The following modifiers are really subject modifiers, and are described below.
|
||||
However, they may be included in a pattern's modifier list, in which case they
|
||||
are applied to every subject line that is processed with that pattern. They may
|
||||
not appear in <b>#pattern</b> commands. These modifiers do not affect the
|
||||
compilation process.
|
||||
The following modifiers are really subject modifiers, and are described under
|
||||
"Subject Modifiers" below. However, they may be included in a pattern's
|
||||
modifier list, in which case they are applied to every subject line that is
|
||||
processed with that pattern. They may not appear in <b>#pattern</b> commands.
|
||||
These modifiers do not affect the compilation process.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
|
@ -1022,6 +1037,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
|
|||
defaults, set them in a <b>#subject</b> command.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying literal subject lines
|
||||
</b><br>
|
||||
<P>
|
||||
If the <b>subject_literal</b> modifier is present on a pattern, all the subject
|
||||
lines that it matches are taken as literal strings, with no interpretation of
|
||||
backslashes. It is not possible to set subject modifiers on such lines, but any
|
||||
that are set as defaults by a <b>#subject</b> command are recognized.
|
||||
</P>
|
||||
<br><b>
|
||||
Saving a compiled pattern
|
||||
</b><br>
|
||||
<P>
|
||||
|
@ -1072,11 +1096,11 @@ The partial matching modifiers are provided with abbreviations because they
|
|||
appear frequently in tests.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>posix</b> modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any effect
|
||||
are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
|
||||
The other modifiers are ignored, with a warning message.
|
||||
If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
|
||||
causing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||
that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
|
||||
causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||
<b>regexec()</b>. The other modifiers are ignored, with a warning message.
|
||||
</P>
|
||||
<P>
|
||||
There is one additional modifier that can be used with the POSIX wrapper. It is
|
||||
|
@ -1085,11 +1109,13 @@ ignored (with a warning) if used for non-POSIX matching.
|
|||
posix_startend=<n>[:<m>]
|
||||
</pre>
|
||||
This causes the subject string to be passed to <b>regexec()</b> using the
|
||||
REG_STARTEND option, which uses offsets to restrict which part of the string is
|
||||
REG_STARTEND option, which uses offsets to specify which part of the string is
|
||||
searched. If only one number is given, the end offset is passed as the end of
|
||||
the subject string. For more detail of REG_STARTEND, see the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
documentation.
|
||||
documentation. If the subject string contains binary zeros (coded as escapes
|
||||
such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
|
||||
its input), you must use <b>posix_startend</b> to specify its length.
|
||||
</P>
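<P>
The value syntax of <b>posix_startend</b> can be illustrated with a small
parser. This is a sketch only: <b>parse_startend()</b> is a hypothetical
helper, not <b>pcre2test</b>'s implementation. With REG_STARTEND, the two
offsets would be placed in the rm_so and rm_eo fields of the first match
vector element before calling <b>regexec()</b>.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parses "<n>" or "<n>:<m>" into *start and *end.
   If only one number is given, *end defaults to subject_len.
   Returns 0 on success, -1 on a syntax error. */
static int parse_startend(const char *value, size_t subject_len,
                          size_t *start, size_t *end)
{
    char *p;
    unsigned long n, m;

    if (!isdigit((unsigned char)*value))
        return -1;
    n = strtoul(value, &p, 10);

    if (*p == ':') {
        const char *q = p + 1;
        if (!isdigit((unsigned char)*q))
            return -1;                 /* ':' must be followed by a number */
        m = strtoul(q, &p, 10);
    } else {
        m = (unsigned long)subject_len;
    }

    if (*p != '\0')
        return -1;                     /* trailing characters */

    *start = (size_t)n;
    *end = (size_t)m;
    return 0;
}
```

<P>
Passing an explicit end offset is what allows a subject containing binary
zeros to be matched in full, because the end of the data no longer depends on
a zero terminator.
</P>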
|
||||
<br><b>
|
||||
Setting match controls
|
||||
|
@ -1355,9 +1381,11 @@ Setting the JIT stack size
|
|||
<P>
|
||||
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if JIT
|
||||
optimization is not being used. The value is a number of kilobytes. Providing a
|
||||
stack that is larger than the default 32K is necessary only for very
|
||||
complicated patterns.
|
||||
optimization is not being used. The value is a number of kilobytes. Setting
|
||||
zero reverts to the default of 32K. Providing a stack that is larger than the
|
||||
default is necessary only for very complicated patterns. If <b>jitstack</b> is
|
||||
set non-zero on a subject line it overrides any value that was set on the
|
||||
pattern.
|
||||
</P>
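<P>
The precedence rules for <b>jitstack</b> can be summarized in a few lines. The
function below is a hypothetical model of the rules stated above, not code
taken from <b>pcre2test</b>.
</P>

```c
#include <stddef.h>

enum { DEFAULT_JITSTACK_KB = 32 };   /* default stated in the text above */

/* Hypothetical helper: returns the JIT stack size in bytes, given the
   jitstack values (in kilobytes) from the pattern and the subject line.
   A value of zero means "not set". */
static size_t effective_jitstack_bytes(unsigned pattern_kb, unsigned subject_kb)
{
    /* A non-zero subject-line value overrides the pattern's value. */
    unsigned kb = (subject_kb != 0) ? subject_kb : pattern_kb;

    /* Zero reverts to the default. */
    if (kb == 0)
        kb = DEFAULT_JITSTACK_KB;

    return (size_t)kb * 1024;
}
```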
|
||||
<br><b>
|
||||
Setting heap, match, and depth limits
|
||||
|
@ -1461,8 +1489,8 @@ Passing the subject as zero-terminated
|
|||
By default, the subject string is passed to a native API matching function with
|
||||
its correct length. In order to test the facility for passing a zero-terminated
|
||||
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
|
||||
be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
|
||||
this modifier has no effect, as there is no facility for passing a length.)
|
||||
be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
|
||||
this modifier is ignored, with a warning.
|
||||
</P>
|
||||
<P>
|
||||
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
|
||||
|
@ -1675,7 +1703,7 @@ callout is in a lookbehind assertion.
|
|||
</P>
|
||||
<P>
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
|
||||
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
|
||||
result of the <b>auto_callout</b> pattern modifier. In this case, instead of
|
||||
showing the callout number, the offset in the pattern, preceded by a plus, is
|
||||
output. For example:
|
||||
<pre>
|
||||
|
@ -1830,7 +1858,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 03 June 2017
|
||||
Last updated: 16 June 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -1441,6 +1441,20 @@ COMPILING A PATTERN
|
|||
first line and also within the offset limit. In other words, whichever
|
||||
limit comes first is used.
|
||||
|
||||
PCRE2_LITERAL
|
||||
|
||||
If this option is set, all meta-characters in the pattern are disabled,
|
||||
and it is treated as a literal string. Matching literal strings with a
|
||||
regular expression engine is not the most efficient way of doing it. If
|
||||
you are doing a lot of literal matching and are worried about effi-
|
||||
ciency, you should consider using other approaches. The only other main
|
||||
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
|
||||
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
|
||||
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
|
||||
error.
|
||||
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
|
||||
If this option is set, a back reference to an unset subpattern group
|
||||
|
@ -1706,6 +1720,24 @@ COMPILING A PATTERN
|
|||
option means that typos in patterns may go undetected and have unex-
|
||||
pected results. This is a dangerous option. Use with care.
|
||||
|
||||
PCRE2_EXTRA_MATCH_LINE
|
||||
|
||||
This option is provided for use by the -x option of pcre2grep. It
|
||||
causes the pattern only to match complete lines. This is achieved by
|
||||
automatically inserting the code for "^(?:" at the start of the com-
|
||||
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
|
||||
the matched line may be in the middle of the subject string. This
|
||||
option can be used with PCRE2_LITERAL.
|
||||
|
||||
PCRE2_EXTRA_MATCH_WORD
|
||||
|
||||
This option is provided for use by the -w option of pcre2grep. It
|
||||
causes the pattern only to match strings that have a word boundary at
|
||||
the start and the end. This is achieved by automatically inserting the
|
||||
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
|
||||
end. The option may be used with PCRE2_LITERAL. However, it is ignored
|
||||
if PCRE2_EXTRA_MATCH_LINE is also set.
|
||||
|
||||
|
||||
COMPILATION ERROR CODES
|
||||
|
||||
|
@ -3368,7 +3400,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 01 June 2017
|
||||
Last updated: 16 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -9036,6 +9068,15 @@ COMPILING A PATTERN
|
|||
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
|
||||
tion).
|
||||
|
||||
REG_NOSPEC
|
||||
|
||||
The PCRE2_LITERAL option is set when the regular expression is passed
|
||||
for compilation to the native function. This disables all meta charac-
|
||||
ters in the pattern, causing it to be treated as a literal string. The
|
||||
only other options that are allowed with REG_NOSPEC are REG_ICASE,
|
||||
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
|
||||
the POSIX standard.
|
||||
|
||||
REG_NOSUB
|
||||
|
||||
When a pattern that is compiled with this flag is passed to regexec()
|
||||
|
@ -9232,7 +9273,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 05 June 2017
|
||||
Last updated: 15 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_COMPILE 3 "17 May 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -35,7 +35,7 @@ system stack size checking, or to change one or more of these parameters:
|
|||
The newline character sequence;
|
||||
The compile time nested parentheses limit;
|
||||
The maximum pattern length (in code units) that is allowed.
|
||||
The additional options bits
|
||||
The additional options bits (see pcre2_set_compile_extra_options())
|
||||
.sp
|
||||
The option bits are:
|
||||
.sp
|
||||
|
@ -52,6 +52,7 @@ The option bits are:
|
|||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "01 June 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -24,6 +24,8 @@ options are:
|
|||
.\" JOIN
|
||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
|
||||
a literal following character
|
||||
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
|
||||
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
|
||||
.sp
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
.\" HREF
|
||||
|
|
|
@ -64,12 +64,12 @@ INPUT ENCODING
|
|||
unless you really want that action.
|
||||
|
||||
The input is processed using C's string functions, so must not
|
||||
contain binary zeroes, even though in Unix-like environments, fgets()
|
||||
contain binary zeros, even though in Unix-like environments, fgets()
|
||||
treats any bytes other than newline as data characters. An error is
|
||||
generated if a binary zero is encountered. Subject lines are processed
|
||||
for backslash escapes, which makes it possible to include any data
|
||||
value in strings that are passed to the library for matching. For pat-
|
||||
terns, there is a facility for specifying some or all of the 8-bit
|
||||
generated if a binary zero is encountered. By default subject lines are
|
||||
processed for backslash escapes, which makes it possible to include any
|
||||
data value in strings that are passed to the library for matching. For
|
||||
patterns, there is a facility for specifying some or all of the 8-bit
|
||||
input characters as hexadecimal pairs, which makes it possible to
|
||||
include binary zeros.
|
||||
|
||||
|
@ -319,9 +319,9 @@ COMMAND LINES
|
|||
|
||||
When the POSIX API is being tested there is no way to override the
|
||||
default newline convention, though it is possible to set the newline
|
||||
convention from within the pattern. A warning is given if the posix
|
||||
modifier is used when #newline_default would set a default for the non-
|
||||
POSIX API.
|
||||
convention from within the pattern. A warning is given if the posix or
|
||||
posix_nosub modifier is used when #newline_default would set a default
|
||||
for the non-POSIX API.
|
||||
|
||||
#pattern <modifier-list>
|
||||
|
||||
|
@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
|
|||
|
||||
Before each subject line is passed to pcre2_match() or
|
||||
pcre2_dfa_match(), leading and trailing white space is removed, and the
|
||||
line is scanned for backslash escapes. The following provide a means of
|
||||
encoding non-printing characters in a visible way:
|
||||
line is scanned for backslash escapes, unless the subject_literal modi-
|
||||
fier was set for the pattern. The following provide a means of encoding
|
||||
non-printing characters in a visible way:
|
||||
|
||||
\a alarm (BEL, \x07)
|
||||
\b backspace (\x08)
|
||||
|
@ -493,6 +494,11 @@ SUBJECT LINE SYNTAX
|
|||
passing an empty line as data, since a real empty line terminates the
|
||||
data input.
|
||||
|
||||
If the subject_literal modifier is set for a pattern, all subject lines
|
||||
that follow are treated as literals, with no special treatment of back-
|
||||
slashes. No replication is possible, and any subject modifiers must be
|
||||
set as defaults by a #subject command.
|
||||
|
||||
|
||||
PATTERN MODIFIERS
|
||||
|
||||
|
@ -530,7 +536,10 @@ PATTERN MODIFIERS
|
|||
/x extended set PCRE2_EXTENDED
|
||||
/xx extended_more set PCRE2_EXTENDED_MORE
|
||||
firstline set PCRE2_FIRSTLINE
|
||||
literal set PCRE2_LITERAL
|
||||
match_line set PCRE2_EXTRA_MATCH_LINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
match_word set PCRE2_EXTRA_MATCH_WORD
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
|
@ -580,6 +589,7 @@ PATTERN MODIFIERS
|
|||
push push compiled pattern onto the stack
|
||||
pushcopy push a copy onto the stack
|
||||
stackguard=<number> test the stackguard feature
|
||||
subject_literal treat all subject lines as literal
|
||||
tables=[0|1|2] select internal tables
|
||||
use_length do not zero-terminate the pattern
|
||||
utf8_input treat input as UTF-8
|
||||
|
@ -659,16 +669,6 @@ PATTERN MODIFIERS
|
|||
testing that pcre2_compile() behaves correctly in this case (it uses
|
||||
default values).
|
||||
|
||||
Specifying the pattern's length
|
||||
|
||||
By default, patterns are passed to the compiling functions as zero-ter-
|
||||
minated strings. When using the POSIX wrapper API, there is no other
|
||||
option. However, when using PCRE2's native API, patterns can be passed
|
||||
by length instead of being zero-terminated. The use_length modifier
|
||||
causes this to happen. Using a length happens automatically (whether
|
||||
or not use_length is set) when hex is set, because patterns specified
|
||||
in hexadecimal may contain binary zeros.
|
||||
|
||||
Specifying pattern characters in hexadecimal
|
||||
|
||||
The hex modifier specifies that the characters of the pattern, except
|
||||
|
@ -690,11 +690,18 @@ PATTERN MODIFIERS
|
|||
ing the delimiter within a substring. The hex and expand modifiers are
|
||||
mutually exclusive.
|
||||
|
||||
The POSIX API cannot be used with patterns specified in hexadecimal
|
||||
because they may contain binary zeros, which conflicts with regcomp()'s
|
||||
requirement for a zero-terminated string. Such patterns are always
|
||||
passed to pcre2_compile() as a string with a length, not as zero-termi-
|
||||
nated.
|
||||
Specifying the pattern's length
|
||||
|
||||
By default, patterns are passed to the compiling functions as zero-ter-
|
||||
minated strings but can be passed by length instead of being zero-ter-
|
||||
minated. The use_length modifier causes this to happen. Using a length
|
||||
happens automatically (whether or not use_length is set) when hex is
|
||||
set, because patterns specified in hexadecimal may contain binary
|
||||
zeros.
|
||||
|
||||
If hex or use_length is used with the POSIX wrapper API (see "Using the
|
||||
POSIX wrapper API" below), the REG_PEND extension is used to pass the
|
||||
pattern's length.
|
||||
|
||||
Specifying wide characters in 16-bit and 32-bit modes
|
||||
|
||||
|
@ -742,7 +749,7 @@ PATTERN MODIFIERS
|
|||
partial modifier in "Subject Modifiers" below for details of how these
|
||||
options are specified for each match attempt.
|
||||
|
||||
JIT compilation is requested by the /jit pattern modifier, which may
|
||||
JIT compilation is requested by the jit pattern modifier, which may
|
||||
optionally be followed by an equals sign and a number in the range 0 to
|
||||
7. The three bits that make up the number specify which of the three
|
||||
JIT operating modes are to be compiled:
|
||||
|
@ -766,7 +773,7 @@ PATTERN MODIFIERS
|
|||
PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
|
||||
plete match; the options enable the possibility of a partial match, but
|
||||
do not require it. Note also that if you request JIT compilation only
|
||||
for partial matching (for example, /jit=2) but do not set the partial
|
||||
for partial matching (for example, jit=2) but do not set the partial
|
||||
modifier on a subject line, that match will not use JIT code because
|
||||
none was compiled for non-partial matching.
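
The bit test implied by the paragraph above can be sketched in C. This is an illustration only, not pcre2test's code. The enum names and the helper jit_mode_compiled are hypothetical, and the bit values (1 for non-partial, 2 for soft partial, 4 for hard partial) are an assumption that should be checked against the full list of JIT modes in this document.

```c
/* Assumed bit values for the number given to the jit modifier. */
enum {
  JIT_BIT_COMPLETE     = 1,   /* non-partial matching */
  JIT_BIT_PARTIAL_SOFT = 2,   /* soft partial matching */
  JIT_BIT_PARTIAL_HARD = 4    /* hard partial matching */
};

/* Returns 1 if a pattern compiled with jit=<jit_number> has JIT code for
   the mode identified by needed_bit, otherwise 0. */
static int jit_mode_compiled(unsigned jit_number, unsigned needed_bit)
{
  return (jit_number & needed_bit) != 0;
}
```

Under these assumptions, jit_mode_compiled(2, JIT_BIT_COMPLETE) is 0, which corresponds to the case described above where a non-partial match cannot use JIT code.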
|
||||
|
||||
|
@ -833,7 +840,7 @@ PATTERN MODIFIERS
|
|||
|
||||
Using the POSIX wrapper API
|
||||
|
||||
The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
||||
The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
|
||||
the POSIX wrapper API rather than its native API. When posix_nosub is
|
||||
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
|
||||
wrapper supports only the 8-bit library. Note that it does not imply
|
||||
|
@ -862,6 +869,10 @@ PATTERN MODIFIERS
|
|||
below. All other modifiers are either ignored, with a warning message,
|
||||
or cause an error.
|
||||
|
||||
The pattern is passed to regcomp() as a zero-terminated string by
|
||||
default, but if the use_length or hex modifiers are set, the REG_PEND
|
||||
extension is used to pass it by length.
|
||||
|
||||
Testing the stack guard feature
|
||||
|
||||
The stackguard modifier is used to test the use of pcre2_set_com-
|
||||
|
@ -894,16 +905,18 @@ PATTERN MODIFIERS
|
|||
Setting certain match controls
|
||||
|
||||
The following modifiers are really subject modifiers, and are described
|
||||
below. However, they may be included in a pattern's modifier list, in
|
||||
which case they are applied to every subject line that is processed
|
||||
with that pattern. They may not appear in #pattern commands. These mod-
|
||||
ifiers do not affect the compilation process.
|
||||
under "Subject Modifiers" below. However, they may be included in a
|
||||
pattern's modifier list, in which case they are applied to every sub-
|
||||
ject line that is processed with that pattern. They may not appear in
|
||||
#pattern commands. These modifiers do not affect the compilation
|
||||
process.
|
||||
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
allcaptures show all captures
|
||||
allusedtext show all consulted text
|
||||
/g global global matching
|
||||
jitstack=<n> set size of JIT stack
|
||||
mark show mark values
|
||||
replace=<string> specify a replacement string
|
||||
startchar show starting character when relevant
|
||||
|
@ -915,6 +928,14 @@ PATTERN MODIFIERS
|
|||
These modifiers may not appear in a #pattern command. If you want them
|
||||
as defaults, set them in a #subject command.
|
||||
|
||||
Specifying literal subject lines
|
||||
|
||||
If the subject_literal modifier is present on a pattern, all the sub-
|
||||
ject lines that it matches are taken as literal strings, with no inter-
|
||||
pretation of backslashes. It is not possible to set subject modifiers
|
||||
on such lines, but any that are set as defaults by a #subject command
|
||||
are recognized.
|
||||
|
||||
Saving a compiled pattern
|
||||
|
||||
When a pattern with the push modifier is successfully compiled, it is
|
||||
|
@ -959,11 +980,11 @@ SUBJECT MODIFIERS
|
|||
The partial matching modifiers are provided with abbreviations because
|
||||
they appear frequently in tests.
|
||||
|
||||
If the posix modifier was present on the pattern, causing the POSIX
|
||||
wrapper API to be used, the only option-setting modifiers that have any
|
||||
effect are notbol, notempty, and noteol, causing REG_NOTBOL,
|
||||
REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
|
||||
The other modifiers are ignored, with a warning message.
|
||||
If the posix or posix_nosub modifier was present on the pattern, caus-
|
||||
ing the POSIX wrapper API to be used, the only option-setting modifiers
|
||||
that have any effect are notbol, notempty, and noteol, causing REG_NOT-
|
||||
BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
|
||||
regexec(). The other modifiers are ignored, with a warning message.
|
||||
|
||||
There is one additional modifier that can be used with the POSIX wrap-
|
||||
per. It is ignored (with a warning) if used for non-POSIX matching.
|
||||
|
@ -971,10 +992,13 @@ SUBJECT MODIFIERS
|
|||
posix_startend=<n>[:<m>]
|
||||
|
||||
This causes the subject string to be passed to regexec() using the
|
||||
REG_STARTEND option, which uses offsets to restrict which part of the
|
||||
REG_STARTEND option, which uses offsets to specify which part of the
|
||||
string is searched. If only one number is given, the end offset is
|
||||
passed as the end of the subject string. For more detail of REG_STAR-
|
||||
TEND, see the pcre2posix documentation.
|
||||
TEND, see the pcre2posix documentation. If the subject string contains
|
||||
binary zeros (coded as escapes such as \x{00} because pcre2test does
|
||||
not support actual binary zeros in its input), you must use posix_star-
|
||||
tend to specify its length.
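
As a rough sketch of what REG_STARTEND does at the C level, the following helper searches only part of a subject. It is not pcre2test's code, and the name search_range is hypothetical. It assumes that the platform's <regex.h> defines REG_STARTEND, which is a non-standard extension; with PCRE2, pcre2posix.h would be included instead.

```c
#include <regex.h>
#include <stddef.h>

/* Searches only subject[start..end) for pattern. Returns 0 on a match,
   REG_NOMATCH if there is none, or another error code from regcomp() or
   regexec(). Assumes that REG_STARTEND is available. */
static int search_range(const char *pattern, const char *subject,
                        size_t start, size_t end)
{
  regex_t re;
  regmatch_t m[1];
  int rc = regcomp(&re, pattern, REG_EXTENDED);
  if (rc != 0) return rc;

  m[0].rm_so = (regoff_t)start;   /* offset at which searching begins */
  m[0].rm_eo = (regoff_t)end;     /* offset treated as the end of subject */
  rc = regexec(&re, subject, 1, m, REG_STARTEND);
  regfree(&re);
  return rc;
}
```

For example, search_range("b+", "aabbbcc", 0, 2) finds no match because only "aa" is searched, whereas a range of 2 to 7 does match. Because the end of the subject is given by an offset rather than by a terminating zero, this is the only way to pass a subject that contains binary zeros.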
|
||||
|
||||
Setting match controls
|
||||
|
||||
|
@ -1222,8 +1246,10 @@ SUBJECT MODIFIERS
|
|||
The jitstack modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if
|
||||
JIT optimization is not being used. The value is a number of kilobytes.
|
||||
Providing a stack that is larger than the default 32K is necessary only
|
||||
for very complicated patterns.
|
||||
Setting zero reverts to the default of 32K. Providing a stack that is
|
||||
larger than the default is necessary only for very complicated pat-
|
||||
terns. If jitstack is set non-zero on a subject line it overrides any
|
||||
value that was set on the pattern.
|
||||
|
||||
Setting heap, match, and depth limits
|
||||
|
||||
|
@ -1310,9 +1336,8 @@ SUBJECT MODIFIERS
|
|||
By default, the subject string is passed to a native API matching func-
|
||||
tion with its correct length. In order to test the facility for passing
|
||||
a zero-terminated string, the zero_terminate modifier is provided. It
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
|
||||
via the POSIX interface, this modifier has no effect, as there is no
|
||||
facility for passing a length.)
|
||||
causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
|
||||
via the POSIX interface, this modifier is ignored, with a warning.
|
||||
|
||||
When testing pcre2_substitute(), this modifier also has the effect of
|
||||
passing the replacement string as zero-terminated.
|
||||
|
@ -1513,8 +1538,8 @@ CALLOUTS
|
|||
position, which can happen if the callout is in a lookbehind assertion.
|
||||
|
||||
Callouts numbered 255 are assumed to be automatic callouts, inserted as
|
||||
a result of the /auto_callout pattern modifier. In this case, instead
|
||||
of showing the callout number, the offset in the pattern, preceded by a
|
||||
a result of the auto_callout pattern modifier. In this case, instead of
|
||||
showing the callout number, the offset in the pattern, preceded by a
|
||||
plus, is output. For example:
|
||||
|
||||
re> /\d?[A-E]\*/auto_callout
|
||||
|
@ -1662,5 +1687,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 June 2017
|
||||
Last updated: 16 June 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
|
|