Typos in documentation and comments noted by Jason Hood.
parent fa58ac6734
commit fabea723cf
@@ -146,7 +146,7 @@ SET(PCRE2_PARENS_NEST_LIMIT "250" CACHE STRING
 "Default nested parentheses limit. See PARENS_NEST_LIMIT in config.h.in for details.")

 SET(PCRE2_HEAP_LIMIT "20000000" CACHE STRING
-"Default limit on heap memory (kilobytes). See HEAP_LIMIT in config.h.in for details.")
+"Default limit on heap memory (kibibytes). See HEAP_LIMIT in config.h.in for details.")

 SET(PCRE2_MATCH_LIMIT "10000000" CACHE STRING
 "Default limit on internal looping. See MATCH_LIMIT in config.h.in for details.")
@@ -17,7 +17,7 @@ groups altogether. Now it shows those that come before any actual captures as
 3. Running "pcre2test -C" always stated "\R matches CR, LF, or CRLF only",
 whatever the build configuration was. It now correctly says "\R matches all
 Unicode newlines" in the default case when --enable-bsr-anycrlf has not been
-specified. Similarly, running "pcfre2test -C bsr" never produced the result
+specified. Similarly, running "pcre2test -C bsr" never produced the result
 ANY.

 4. Matching the pattern /(*UTF)\C[^\v]+\x80/ against an 8-bit string containing
@@ -370,7 +370,7 @@ tests to improve coverage.
 31. If more than one of "push", "pushcopy", or "pushtablescopy" were set in
 pcre2test, a crash could occur.

-32. Make -bigstack in RunTest allocate a 64Mb stack (instead of 16 MB) so that
+32. Make -bigstack in RunTest allocate a 64MB stack (instead of 16 MB) so that
 all the tests can run with clang's sanitizing options.

 33. Implement extra compile options in the compile context and add the first
HACKING (4 changed lines)
@@ -348,7 +348,7 @@ The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
 others) may be changed in the middle of patterns by items such as (?i). Their
 processing is handled entirely at compile time by generating different opcodes
 for the different settings. The runtime functions do not need to keep track of
-an options state.
+an option's state.

 PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
 are tracked and processed during the parsing pre-pass. The others are handled
@@ -764,7 +764,7 @@ OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
 bracket from the start of the whole pattern. OP_RECURSE is also used for
 "subroutine" calls, even though they are not strictly a recursion. Up till
 release 10.30 recursions were treated as atomic groups, making them
-incompatible with Perl (but PCRE had then well before Perl did). From 10.30,
+incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
 backtracking into recursions is supported.

 Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
NEWS (4 changed lines)
@@ -31,7 +31,7 @@ remembering backtracking positions. This makes --disable-stack-for-recursion a
 NOOP. The new implementation allows backtracking into recursive group calls in
 patterns, making it more compatible with Perl, and also fixes some other
 previously hard-to-do issues. For patterns that have a lot of backtracking, the
-heap is now used, and there is explicit limit on the amount, settable by
+heap is now used, and there is an explicit limit on the amount, settable by
 pcre2_set_heap_limit() or (*LIMIT_HEAP=xxx). The "recursion limit" is retained,
 but is renamed as "depth limit" (though the old names remain for
 compatibility).
@@ -53,7 +53,7 @@ also supported.

 5. Additional compile options in the compile context are now available, and the
 first two are: PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES and
-PCRE2_EXTRA_BAD_ESCAPE_IS LITERAL.
+PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL.

 6. The newline type PCRE2_NEWLINE_NUL is now available.
@@ -127,7 +127,7 @@ can skip ahead to the CMake section.
 src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
 these yourself.

-Not also that the pcre2_fuzzsupport.c file contains special code that is
+Note also that the pcre2_fuzzsupport.c file contains special code that is
 useful to those who want to run fuzzing tests on the PCRE2 library. Unless
 you are doing that, you can ignore it.
@@ -186,7 +186,7 @@ can skip ahead to the CMake section.

 STACK SIZE IN WINDOWS ENVIRONMENTS

-Prior to release 10.30 the default system stack size of 1Mb in some Windows
+Prior to release 10.30 the default system stack size of 1MB in some Windows
 environments caused issues with some tests. This should no longer be the case
 for 10.30 and later releases.
README (17 changed lines)
@@ -257,9 +257,10 @@ library. They are also documented in the pcre2build man page.

 --with-heap-limit=500

-The units are kilobytes. This limit does not apply when the JIT optimization
-(which has its own memory control features) is used. There is more discussion
-on the pcre2api man page (search for pcre2_set_heap_limit).
+The units are kibibytes (units of 1024 bytes). This limit does not apply when
+the JIT optimization (which has its own memory control features) is used.
+There is more discussion on the pcre2api man page (search for
+pcre2_set_heap_limit).

 . In the 8-bit library, the default maximum compiled pattern size is around
 64K bytes. You can increase this by adding --with-link-size=3 to the
@@ -319,10 +320,10 @@ library. They are also documented in the pcre2build man page.
 . When JIT support is enabled, pcre2grep automatically makes use of it, unless
 you add --disable-pcre2grep-jit to the "configure" command.

-. On non-Windows sytems there is support for calling external scripts during
-matching in the pcre2grep command via PCRE2's callout facility with string
-arguments. This support can be disabled by adding --disable-pcre2grep-callout
-to the "configure" command.
+. There is support for calling external programs during matching in the
+pcre2grep command, using PCRE2's callout facility with string arguments. This
+support can be disabled by adding --disable-pcre2grep-callout to the
+"configure" command.

 . The pcre2grep program currently supports only 8-bit data files, and so
 requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
@@ -887,4 +888,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 27 April 2018
+Last updated: 17 June 2018
@@ -708,7 +708,7 @@ $valgrind $vjs $pcre2grep -n --newline=any "^(abc|def|ghi|jkl)" testNinputgrep >
 printf "%c--------------------------- Test N6 ------------------------------\r\n" - >>testtrygrep
 $valgrind $vjs $pcre2grep -n --newline=anycrlf "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep

-# It seems inpossible to handle NUL characters easily in Solaris (aka SunOS).
+# It seems impossible to handle NUL characters easily in Solaris (aka SunOS).
 # The version of sed explicitly doesn't like them. For the moment, we just
 # don't run this test under SunOS. Fudge the output so that the comparison
 # works. A similar problem has also been reported for MacOS (Darwin).
RunTest (2 changed lines)
@@ -843,7 +843,7 @@ for bmode in "$test8" "$test16" "$test32"; do
 checkresult $? 24 ""
 fi

-# UTF pattern converson tests
+# UTF pattern conversion tests

 if [ "$do25" = yes ] ; then
 echo $title25
@@ -288,7 +288,7 @@ AC_ARG_WITH(parens-nest-limit,
 # Handle --with-heap-limit
 AC_ARG_WITH(heap-limit,
 AS_HELP_STRING([--with-heap-limit=N],
-[default limit on heap memory (kilobytes, default=20000000)]),
+[default limit on heap memory (kibibytes, default=20000000)]),
 , with_heap_limit=20000000)

 # Handle --with-match-limit=N
@@ -754,7 +754,7 @@ AC_DEFINE_UNQUOTED([MATCH_LIMIT_DEPTH], [$with_match_limit_depth], [
 AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [
 This limits the amount of memory that may be used while matching
 a pattern. It applies to both pcre2_match() and pcre2_dfa_match(). It does
-not apply to JIT matching. The value is in kilobytes.])
+not apply to JIT matching. The value is in kibibytes (units of 1024 bytes).])

 AC_DEFINE([MAX_NAME_SIZE], [32], [
 This limit is parameterized just in case anybody ever wants to
@@ -1017,7 +1017,7 @@ $PACKAGE-$VERSION configuration summary:
 Rebuild char tables ................ : ${enable_rebuild_chartables}
 Internal link size ................. : ${with_link_size}
 Nested parentheses limit ........... : ${with_parens_nest_limit}
-Heap limit ......................... : ${with_heap_limit} kilobytes
+Heap limit ......................... : ${with_heap_limit} kibibytes
 Match limit ........................ : ${with_match_limit}
 Match depth limit .................. : ${with_match_limit_depth}
 Build shared libs .................. : ${enable_shared}
@@ -127,7 +127,7 @@ can skip ahead to the CMake section.
 src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
 these yourself.

-Not also that the pcre2_fuzzsupport.c file contains special code that is
+Note also that the pcre2_fuzzsupport.c file contains special code that is
 useful to those who want to run fuzzing tests on the PCRE2 library. Unless
 you are doing that, you can ignore it.
@@ -186,7 +186,7 @@ can skip ahead to the CMake section.

 STACK SIZE IN WINDOWS ENVIRONMENTS

-Prior to release 10.30 the default system stack size of 1Mb in some Windows
+Prior to release 10.30 the default system stack size of 1MB in some Windows
 environments caused issues with some tests. This should no longer be the case
 for 10.30 and later releases.
@@ -257,9 +257,10 @@ library. They are also documented in the pcre2build man page.

 --with-heap-limit=500

-The units are kilobytes. This limit does not apply when the JIT optimization
-(which has its own memory control features) is used. There is more discussion
-on the pcre2api man page (search for pcre2_set_heap_limit).
+The units are kibibytes (units of 1024 bytes). This limit does not apply when
+the JIT optimization (which has its own memory control features) is used.
+There is more discussion on the pcre2api man page (search for
+pcre2_set_heap_limit).

 . In the 8-bit library, the default maximum compiled pattern size is around
 64K bytes. You can increase this by adding --with-link-size=3 to the
@@ -319,10 +320,10 @@ library. They are also documented in the pcre2build man page.
 . When JIT support is enabled, pcre2grep automatically makes use of it, unless
 you add --disable-pcre2grep-jit to the "configure" command.

-. On non-Windows sytems there is support for calling external scripts during
-matching in the pcre2grep command via PCRE2's callout facility with string
-arguments. This support can be disabled by adding --disable-pcre2grep-callout
-to the "configure" command.
+. There is support for calling external programs during matching in the
+pcre2grep command, using PCRE2's callout facility with string arguments. This
+support can be disabled by adding --disable-pcre2grep-callout to the
+"configure" command.

 . The pcre2grep program currently supports only 8-bit data files, and so
 requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
@@ -887,4 +888,4 @@ The distribution should contain the files listed below.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 27 April 2018
+Last updated: 17 June 2018
@@ -65,7 +65,7 @@ The option bits are:
 PCRE2_EXTENDED Ignore white space and # comments
 PCRE2_FIRSTLINE Force matching to be before newline
 PCRE2_LITERAL Pattern characters are all literal
-PCRE2_MATCH_UNSET_BACKREF Match unset back references
+PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
 PCRE2_MULTILINE ^ and $ match newlines within data
 PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
 PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
@@ -36,7 +36,7 @@ request are as follows:
 <pre>
 PCRE2_INFO_ALLOPTIONS Final options after compiling
 PCRE2_INFO_ARGOPTIONS Options passed to <b>pcre2_compile()</b>
-PCRE2_INFO_BACKREFMAX Number of highest back reference
+PCRE2_INFO_BACKREFMAX Number of highest backreference
 PCRE2_INFO_BSR What \R matches:
 PCRE2_BSR_UNICODE: Unicode line endings
 PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
@@ -28,7 +28,7 @@ DESCRIPTION
 <P>
 This function is part of an experimental set of pattern conversion functions.
 It sets the component separator character that is used when converting globs.
-The second argument must one of the characters forward slash, backslash, or
+The second argument must be one of the characters forward slash, backslash, or
 dot. The default is backslash when running under Windows, otherwise forward
 slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
 the second argument is invalid.
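Note on the hunk above: the separator rule it documents (forward slash, backslash, or dot, with a platform-dependent default) can be sketched in C as follows. This is only an illustration under stated assumptions, not PCRE2's source. The names `is_valid_glob_separator` and `default_glob_separator` are hypothetical, and the `_WIN32` test is an assumed way of detecting "running under Windows".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check, not taken from PCRE2: the documentation says the
   separator must be a forward slash, a backslash, or a dot. */
static bool is_valid_glob_separator(uint32_t c)
{
  return c == '/' || c == '\\' || c == '.';
}

/* Hypothetical helper: the documented default is backslash under Windows,
   otherwise forward slash. Using _WIN32 here is an assumption. */
static uint32_t default_glob_separator(void)
{
#ifdef _WIN32
  return '\\';
#else
  return '/';
#endif
}
```

A caller of the real function would pass one of these three characters and treat PCRE2_ERROR_BADDATA as the rejection case described in the text.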
@@ -562,10 +562,10 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
 <P>
 Each of the first three conventions is used by at least one operating system as
 its standard newline sequence. When PCRE2 is built, a default can be specified.
-The default default is LF, which is the Unix standard. However, the newline
-convention can be changed by an application when calling <b>pcre2_compile()</b>,
-or it can be specified by special text at the start of the pattern itself; this
-overrides any other settings. See the
+If it is not, the default is set to LF, which is the Unix standard. However,
+the newline convention can be changed by an application when calling
+<b>pcre2_compile()</b>, or it can be specified by special text at the start of
+the pattern itself; this overrides any other settings. See the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 page for details of the special character sequences.
 </P>
@@ -949,17 +949,18 @@ offset limit. In other words, whichever limit comes first is used.
 <b> uint32_t <i>value</i>);</b>
 <br>
 <br>
-The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum
-amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking
-information when running an interpretive match. This limit also applies to
-<b>pcre2_dfa_match()</b>, which may use the heap when processing patterns with a
-lot of nested pattern recursion or lookarounds or atomic groups. This limit
-does not apply to matching with the JIT optimization, which has its own memory
-control arrangements (see the
+The <i>heap_limit</i> parameter specifies, in units of kibibytes (1024 bytes),
+the maximum amount of heap memory that <b>pcre2_match()</b> may use to hold
+backtracking information when running an interpretive match. This limit also
+applies to <b>pcre2_dfa_match()</b>, which may use the heap when processing
+patterns with a lot of nested pattern recursion or lookarounds or atomic
+groups. This limit does not apply to matching with the JIT optimization, which
+has its own memory control arrangements (see the
 <a href="pcre2jit.html"><b>pcre2jit</b></a>
 documentation for more details). If the limit is reached, the negative error
-code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
-built; the default default is very large and is essentially "unlimited".
+code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
+is built; if it is not, the default is set very large and is essentially
+"unlimited".
 </P>
 <P>
 A value for the heap limit may also be supplied by an item at the start of a
@@ -1044,7 +1045,7 @@ The depth limit is not relevant, and is ignored, when matching is done using
 JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
 uses it to limit the depth of nested internal recursive function calls that
 implement atomic groups, lookaround assertions, and pattern recursions. This
-limits, indirectly, the amount of system stack this is used. It was more useful
+limits, indirectly, the amount of system stack that is used. It was more useful
 in versions before 10.32, when stack memory was used for local workspace
 vectors for recursive function calls. From version 10.32, only local variables
 are allocated on the stack and as each call uses only a few hundred bytes, even
@@ -1060,11 +1061,11 @@ probably better to limit heap usage directly by calling
 <b>pcre2_set_heap_limit()</b>.
 </P>
 <P>
-The default value for the depth limit can be set when PCRE2 is built; the
-default default is the same value as the default for the match limit. If the
-limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> returns
-PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
-item at the start of a pattern of the form
+The default value for the depth limit can be set when PCRE2 is built; if it is
+not, the default is set to the same value as the default for the match limit.
+If the limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
+returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
+supplied by an item at the start of a pattern of the form
 <pre>
 (*LIMIT_DEPTH=ddd)
 </pre>
@@ -1120,7 +1121,7 @@ given with <b>pcre2_set_depth_limit()</b> above.
 <pre>
 PCRE2_CONFIG_HEAPLIMIT
 </pre>
-The output is a uint32_t integer that gives, in kilobytes, the default limit
+The output is a uint32_t integer that gives, in kibibytes, the default limit
 for the amount of heap memory used by <b>pcre2_match()</b> or
 <b>pcre2_dfa_match()</b>. Further details are given with
 <b>pcre2_set_heap_limit()</b> above.
@@ -1431,7 +1432,7 @@ If this bit is set, letters in the pattern match both upper and lower case
 letters in the subject. It is equivalent to Perl's /i option, and it can be
 changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
 properties are used for all characters with more than one other case, and for
-all characters whose code points are greater than U+007f. For lower valued
+all characters whose code points are greater than U+007F. For lower valued
 characters with only one other case, a lookup table is used for speed. When
 PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
 and higher code points (available only in 16-bit or 32-bit mode) are treated as
@@ -1551,7 +1552,7 @@ error.
 <pre>
 PCRE2_MATCH_UNSET_BACKREF
 </pre>
-If this option is set, a back reference to an unset subpattern group matches an
+If this option is set, a backreference to an unset subpattern group matches an
 empty string (by default this causes the current matching alternative to fail).
 A pattern such as (\1)(a) succeeds when this option is set (assuming it can
 find an "a" in the subject), whereas it fails by default, for Perl
@@ -1613,8 +1614,8 @@ If this option is set, it disables the use of numbered capturing parentheses in
 the pattern. Any opening parenthesis that is not followed by ? behaves as if it
 were followed by ?: but named parentheses can still be used for capturing (and
 they acquire numbers in the usual way). This is the same as Perl's /n option.
-Note that, when this option is set, references to capturing groups (back
-references or recursion/subroutine calls) may only refer to named groups,
+Note that, when this option is set, references to capturing groups
+(backreferences or recursion/subroutine calls) may only refer to named groups,
 though the reference can be by name or by number.
 <pre>
 PCRE2_NO_AUTO_POSSESS
@@ -1633,7 +1634,7 @@ If this option is set, it disables an optimization that is applied when .* is
 the first significant item in a top-level branch of a pattern, and all the
 other branches also start with .* or with \A or \G or ^. The optimization is
 automatically disabled for .* if it is inside an atomic group or a capturing
-group that is the subject of a back reference, or if the pattern contains
+group that is the subject of a backreference, or if the pattern contains
 (*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
 automatically anchored if PCRE2_DOTALL is set for all the .* items and
 PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
@@ -1999,7 +2000,7 @@ When .* is the first significant item, anchoring is possible only when all the
 following are true:
 <pre>
 .* is not in an atomic group
-.* is not in a capturing group that is the subject of a back reference
+.* is not in a capturing group that is the subject of a backreference
 PCRE2_DOTALL is in force for .*
 Neither (*PRUNE) nor (*SKIP) appears in the pattern
 PCRE2_NO_DOTSTAR_ANCHOR is not set
@@ -2009,20 +2010,20 @@ options returned for PCRE2_INFO_ALLOPTIONS.
 <pre>
 PCRE2_INFO_BACKREFMAX
 </pre>
-Return the number of the highest back reference in the pattern. The third
+Return the number of the highest backreference in the pattern. The third
 argument should point to an <b>uint32_t</b> variable. Named subpatterns acquire
-numbers as well as names, and these count towards the highest back reference.
-Back references such as \4 or \g{12} match the captured characters of the
+numbers as well as names, and these count towards the highest backreference.
+Backreferences such as \4 or \g{12} match the captured characters of the
 given group, but in addition, the check that a capturing group is set in a
-conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
-returned if there are no back references.
+conditional subpattern such as (?(3)a|b) is also a backreference. Zero is
+returned if there are no backreferences.
 <pre>
 PCRE2_INFO_BSR
 </pre>
-The output is a uint32_t whose value indicates what character sequences the \R
-escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
-any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
-matches only CR, LF, or CRLF.
+The output is a uint32_t integer whose value indicates what character sequences
+the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R
+matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
+that \R matches only CR, LF, or CRLF.
 <pre>
 PCRE2_INFO_CAPTURECOUNT
 </pre>
@@ -2034,10 +2035,10 @@ The third argument should point to an <b>uint32_t</b> variable.
 </pre>
 If the pattern set a backtracking depth limit by including an item of the form
 (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
-should point to an unsigned 32-bit integer. If no such value has been set, the
-call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
-that this limit will only be used during matching if it is less than the limit
-set or defaulted by the caller of the match function.
+should point to a uint32_t integer. If no such value has been set, the call to
+<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
+limit will only be used during matching if it is less than the limit set or
+defaulted by the caller of the match function.
 <pre>
 PCRE2_INFO_FIRSTBITMAP
 </pre>
@@ -2047,7 +2048,7 @@ values for the first code unit in any match. For example, a pattern that starts
 with [abc] results in a table with three bits set. When code unit values
 greater than 255 are supported, the flag bit for 255 means "any code unit of
 value 255 or above". If such a table was constructed, a pointer to it is
-returned. Otherwise NULL is returned. The third argument should point to an
+returned. Otherwise NULL is returned. The third argument should point to a
 <b>const uint8_t *</b> variable.
 <pre>
 PCRE2_INFO_FIRSTCODETYPE
@@ -2074,7 +2075,7 @@ and up to 0xffffffff when not using UTF-32 mode.
 </pre>
 Return the size (in bytes) of the data frames that are used to remember
 backtracking positions when the pattern is processed by <b>pcre2_match()</b>
-without the use of JIT. The third argument should point to an <b>size_t</b>
+without the use of JIT. The third argument should point to a <b>size_t</b>
 variable. The frame size depends on the number of capturing parentheses in the
 pattern. Each additional capturing group adds two PCRE2_SIZE variables.
 <pre>
@@ -2094,10 +2095,10 @@ the equivalent hexadecimal or octal escape sequences.
 </pre>
 If the pattern set a heap memory limit by including an item of the form
 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
-should point to an unsigned 32-bit integer. If no such value has been set, the
-call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
-that this limit will only be used during matching if it is less than the limit
-set or defaulted by the caller of the match function.
+should point to a uint32_t integer. If no such value has been set, the call to
+<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
+limit will only be used during matching if it is less than the limit set or
+defaulted by the caller of the match function.
 <pre>
 PCRE2_INFO_JCHANGED
 </pre>
@@ -2141,15 +2142,15 @@ in such cases.
 </pre>
 If the pattern set a match limit by including an item of the form
 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
-should point to an unsigned 32-bit integer. If no such value has been set, the
-call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
-that this limit will only be used during matching if it is less than the limit
-set or defaulted by the caller of the match function.
+should point to a uint32_t integer. If no such value has been set, the call to
+<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
+limit will only be used during matching if it is less than the limit set or
+defaulted by the caller of the match function.
 <pre>
 PCRE2_INFO_MAXLOOKBEHIND
 </pre>
 Return the number of characters (not code units) in the longest lookbehind
-assertion in the pattern. The third argument should point to an unsigned 32-bit
+assertion in the pattern. The third argument should point to a uint32_t
 integer. This information is useful when doing multi-segment matching using the
 partial matching facilities. Note that the simple assertions \b and \B
 require a one-character lookbehind. \A also registers a one-character
@@ -2417,7 +2418,7 @@ zero, the search for a match starts at the beginning of the subject, and this
 is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
 must point to the start of a character, or to the end of the subject (in UTF-32
 mode, one code unit equals one character, so all offsets are valid). Like the
-pattern string, the subject may contain binary zeroes.
+pattern string, the subject may contain binary zeros.
 </P>
 <P>
 A non-zero starting offset is useful when searching for another match in the
@@ -3559,12 +3560,12 @@ There are in addition the following errors that are specific to
 </pre>
 This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
 pattern that it does not support, for instance, the use of \C in a UTF mode or
-a back reference.
+a backreference.
 <pre>
 PCRE2_ERROR_DFA_UCOND
 </pre>
 This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
-that uses a back reference for the condition, or a test for recursion in a
+that uses a backreference for the condition, or a test for recursion in a
 specific group. These are not supported.
 <pre>
 PCRE2_ERROR_DFA_WSSIZE
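Note on the hunks above: several of them repeat the rule that a limit set inside a pattern, for example with (*LIMIT_MATCH=nnnn), is only used if it is less than the limit set or defaulted by the caller. A minimal sketch of that rule, assuming it amounts to taking the smaller of the two values, is shown below. The function name `effective_limit` is hypothetical and is not part of the PCRE2 API.

```c
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: a limit given in the pattern
   takes effect only when it is lower than the caller's limit, so the value
   actually applied is the smaller of the two. */
static uint32_t effective_limit(uint32_t caller_limit, uint32_t pattern_limit)
{
  return (pattern_limit < caller_limit) ? pattern_limit : caller_limit;
}
```

With a caller limit of 10000000 and a pattern limit of 5000 the sketch yields 5000; with a caller limit of 1000 the pattern's 5000 is ignored and 1000 applies.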
@@ -227,7 +227,7 @@ separator, U+2028), and PS (paragraph separator, U+2029). The final option is
 <pre>
 --enable-newline-is-nul
 </pre>
-which causes NUL (binary zero) is set as the default line-ending character.
+which causes NUL (binary zero) to be set as the default line-ending character.
 </P>
 <P>
 Whatever default line ending convention is selected when PCRE2 is built can be
@@ -286,15 +286,15 @@ The <b>pcre2_match()</b> function starts out using a 20K vector on the system
 stack to record backtracking points. The more nested backtracking points there
 are (that is, the deeper the search tree), the more memory is needed. If the
 initial vector is not large enough, heap memory is used, up to a certain limit,
-which is specified in kilobytes. The limit can be changed at run time, as
-described in the
+which is specified in kibibytes (units of 1024 bytes). The limit can be changed
+at run time, as described in the
 <a href="pcre2api.html"><b>pcre2api</b></a>
 documentation. The default limit (in effect unlimited) is 20 million. You can
 change this by a setting such as
 <pre>
 --with-heap-limit=500
 </pre>
-which limits the amount of heap to 500 kilobytes. This limit applies only to
+which limits the amount of heap to 500 KiB. This limit applies only to
 interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
 may also use the heap for internal workspace when processing complicated
 patterns. This limit does not apply when JIT (which has its own memory
@@ -542,7 +542,7 @@ generated from the string.
 Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
 to be created. This is normally run under valgrind or used when PCRE2 is
 compiled with address sanitizing enabled. It calls the fuzzing function and
-outputs information about it is doing. The input strings are specified by
+outputs information about what it is doing. The input strings are specified by
 arguments: if an argument starts with "=" the rest of it is a literal input
 string. Otherwise, it is assumed to be a file name, and the contents of the
 file are the test string.
@ -143,7 +143,7 @@ branch, automatic anchoring occurs if all branches are anchorable.
|
|||
</P>
|
||||
<P>
|
||||
This optimization is disabled, however, if .* is in an atomic group or if there
|
||||
is a back reference to the capturing group in which it appears. It is also
|
||||
is a backreference to the capturing group in which it appears. It is also
|
||||
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
|
||||
callouts does not affect it.
|
||||
</P>
|
||||
|
|
|
@ -31,7 +31,7 @@ page.
|
|||
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
|
||||
they do not mean what you might think. For example, (?!a){3} does not assert
|
||||
that the next three characters are not "a". It just asserts that the next
|
||||
character is not "a" three times (in principle: PCRE2 optimizes this to run the
|
||||
character is not "a" three times (in principle; PCRE2 optimizes this to run the
|
||||
assertion just once). Perl allows some repeat quantifiers on other assertions,
|
||||
for example, \b* (but not \b{3}), but these do not seem to have any use.
|
||||
</P>
|
||||
|
@ -77,8 +77,8 @@ The \Q...\E sequence is recognized both inside and outside character classes.
|
|||
</P>
|
||||
<P>
|
||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
||||
constructions. However, there is support PCRE2's "callout" feature, which
|
||||
allows an external function to be called during pattern matching. See the
|
||||
constructions. However, PCRE2 does have a "callout" feature, which allows an
|
||||
external function to be called during pattern matching. See the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details.
|
||||
</P>
|
||||
|
@ -156,7 +156,7 @@ each alternative branch of a lookbehind assertion can match a different length
|
|||
of string. Perl requires them all to have the same length.
|
||||
<br>
|
||||
<br>
|
||||
(b) From PCRE2 10.23, back references to groups of fixed length are supported
|
||||
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
|
||||
in lookbehinds, provided that there is no possibility of referencing a
|
||||
non-unique number or name. Perl does not support backreferences in lookbehinds.
|
||||
<br>
|
||||
|
|
|
@ -86,9 +86,10 @@ controlled by parameters that can be set by the <b>--buffer-size</b> and
|
|||
that is obtained at the start of processing. If an input file contains very
|
||||
long lines, a larger buffer may be needed; this is handled by automatically
|
||||
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
|
||||
default values for these parameters are specified when <b>pcre2grep</b> is
|
||||
built, with the default defaults being 20K and 1M respectively. An error occurs
|
||||
if a line is too long and the buffer can no longer be expanded.
|
||||
default values for these parameters can be set when <b>pcre2grep</b> is
|
||||
built; if nothing is specified, the defaults are set to 20K and 1M
|
||||
respectively. An error occurs if a line is too long and the buffer can no
|
||||
longer be expanded.
|
||||
</P>
|
||||
<P>
|
||||
The block of memory that is actually used is three times the "buffer size", to
|
||||
|
@ -500,13 +501,13 @@ short form for this option.
|
|||
When this option is given, non-compressed input is read and processed line by
|
||||
line, and the output is flushed after each write. By default, input is read in
|
||||
large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
|
||||
terminal (which is currently possible only in Unix-like environments). Output
|
||||
to terminal is normally automatically flushed by the operating system. This
|
||||
option can be useful when the input or output is attached to a pipe and you do
|
||||
not want <b>pcre2grep</b> to buffer up large amounts of data. However, its use
|
||||
will affect performance, and the <b>-M</b> (multiline) option ceases to work.
|
||||
When input is from a compressed .gz or .bz2 file, <b>--line-buffered</b> is
|
||||
ignored.
|
||||
terminal (which is currently possible only in Unix-like environments or
|
||||
Windows). Output to terminal is normally automatically flushed by the operating
|
||||
system. This option can be useful when the input or output is attached to a
|
||||
pipe and you do not want <b>pcre2grep</b> to buffer up large amounts of data.
|
||||
However, its use will affect performance, and the <b>-M</b> (multiline) option
|
||||
ceases to work. When input is from a compressed .gz or .bz2 file,
|
||||
<b>--line-buffered</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>--line-offsets</b>
|
||||
|
@ -541,11 +542,11 @@ counter that is incremented each time around its main processing loop. If the
|
|||
value set by <b>--match-limit</b> is reached, an error occurs.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--heap-limit</b> option specifies, as a number of kilobytes, the amount
|
||||
of heap memory that may be used for matching. Heap memory is needed only if
|
||||
matching the pattern requires a significant number of nested backtracking
|
||||
points to be remembered. This parameter can be set to zero to forbid the use of
|
||||
heap memory altogether.
|
||||
The <b>--heap-limit</b> option specifies, as a number of kibibytes (units of
|
||||
1024 bytes), the amount of heap memory that may be used for matching. Heap
|
||||
memory is needed only if matching the pattern requires a significant number of
|
||||
nested backtracking points to be remembered. This parameter can be set to zero
|
||||
to forbid the use of heap memory altogether.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
|
||||
|
@ -556,9 +557,9 @@ limit acts varies from pattern to pattern. This limit is of use only if it is
|
|||
set smaller than <b>--match-limit</b>.
|
||||
<br>
|
||||
<br>
|
||||
There are no short forms for these options. The default settings are specified
|
||||
when the PCRE2 library is compiled, with the default defaults being very large
|
||||
and so effectively unlimited.
|
||||
There are no short forms for these options. The default limits can be set
|
||||
when the PCRE2 library is compiled; if they are not specified, the defaults
|
||||
are very large and so effectively unlimited.
|
||||
</P>
|
||||
<P>
|
||||
<b>--max-buffer-size</b>=<i>number</i>
|
||||
|
|
|
@ -54,9 +54,9 @@ There is no limit to the number of parenthesized subpatterns, but there can be
|
|||
no more than 65535 capturing subpatterns. There is, however, a limit to the
|
||||
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
|
||||
order to limit the amount of system stack used at compile time. The default
|
||||
limit can be specified when PCRE2 is built; the default default is 250. An
|
||||
application can change this limit by calling pcre2_set_parens_nest_limit() to
|
||||
set the limit in a compile context.
|
||||
limit can be specified when PCRE2 is built; if not, the default is set to 250.
|
||||
An application can change this limit by calling pcre2_set_parens_nest_limit()
|
||||
to set the limit in a compile context.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of name for a named subpattern is 32 code units, and the
|
||||
|
|
|
@ -85,7 +85,7 @@ ungreedy repetition quantifiers are specified in the pattern.
|
|||
Because it ends up with a single path through the tree, it is relatively
|
||||
straightforward for this algorithm to keep track of the substrings that are
|
||||
matched by portions of the pattern in parentheses. This provides support for
|
||||
capturing parentheses and back references.
|
||||
capturing parentheses and backreferences.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
|
||||
<P>
|
||||
|
@ -158,7 +158,7 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
|
|||
do this. This means that no captured substrings are available.
|
||||
</P>
|
||||
<P>
|
||||
3. Because no substrings are captured, back references within the pattern are
|
||||
3. Because no substrings are captured, backreferences within the pattern are
|
||||
not supported, and cause errors if encountered.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -215,7 +215,7 @@ because it has to search for all possible matches, but is also because it is
|
|||
less susceptible to optimization.
|
||||
</P>
|
||||
<P>
|
||||
2. Capturing parentheses and back references are not supported.
|
||||
2. Capturing parentheses and backreferences are not supported.
|
||||
</P>
|
||||
<P>
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
|
|
|
@ -31,7 +31,7 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a>
|
||||
<li><a name="TOC17" href="#SEC17">REPETITION</a>
|
||||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC19" href="#SEC19">BACK REFERENCES</a>
|
||||
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
|
||||
|
@ -196,7 +196,7 @@ be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
|||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used. The heap limit is
|
||||
specified in kilobytes.
|
||||
specified in kibibytes (units of 1024 bytes).
|
||||
</P>
|
||||
<P>
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
|
@ -342,7 +342,7 @@ In particular, if you want to match a backslash, you write \\.
|
|||
</P>
|
||||
<P>
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||
backslash. All other characters (in particular, those whose codepoints are
|
||||
backslash. All other characters (in particular, those whose code points are
|
||||
greater than 127) are treated as literals.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -390,7 +390,7 @@ these escapes are as follows:
|
|||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or back reference
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (default mode)
|
||||
|
@ -438,13 +438,13 @@ follows is itself an octal digit.
|
|||
The escape \o must be followed by a sequence of octal digits, enclosed in
|
||||
braces. An error occurs if this is not the case. This escape is a recent
|
||||
addition to Perl; it provides a way of specifying character code points as octal
|
||||
numbers greater than 0777, and it also allows octal numbers and back references
|
||||
numbers greater than 0777, and it also allows octal numbers and backreferences
|
||||
to be unambiguously specified.
|
||||
</P>
|
||||
<P>
|
||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||
digit greater than zero. Instead, use \o{} or \x{} to specify character
|
||||
numbers, and \g{} to specify back references. The following paragraphs
|
||||
numbers, and \g{} to specify backreferences. The following paragraphs
|
||||
describe the old, ambiguous syntax.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -455,7 +455,7 @@ and Perl has changed over time, causing PCRE2 also to change.
|
|||
Outside a character class, PCRE2 reads the digit and any following digits as a
|
||||
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
|
||||
if there are at least that many previous capturing left parentheses in the
|
||||
expression, the entire sequence is taken as a <i>back reference</i>. A
|
||||
expression, the entire sequence is taken as a <i>backreference</i>. A
|
||||
description of how this works is given
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
|
@ -470,13 +470,13 @@ for themselves. For example, outside a character class:
|
|||
<pre>
|
||||
\040 is another way of writing an ASCII space
|
||||
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
|
||||
\7 is always a back reference
|
||||
\11 might be a back reference, or another way of writing a tab
|
||||
\7 is always a backreference
|
||||
\11 might be a backreference, or another way of writing a tab
|
||||
\011 is always a tab
|
||||
\0113 is a tab followed by the character "3"
|
||||
\113 might be a back reference, otherwise the character with octal code 113
|
||||
\377 might be a back reference, otherwise the value 255 (decimal)
|
||||
\81 is always a back reference
|
||||
\113 might be a backreference, otherwise the character with octal code 113
|
||||
\377 might be a backreference, otherwise the value 255 (decimal)
|
||||
\81 is always a backreference
|
||||
</pre>
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
|
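The rule for digits after a backslash (outside a character class) can be sketched as a small decision function. This is an illustration of the documented rule only, not PCRE2's actual code: the function name `classify_escape`, its return shape, and the `prior_groups` parameter are assumptions made for the example, and limits on the resulting character value in each mode are not modelled.

```python
OCTAL_DIGITS = "01234567"


def _read_octal(digits: str):
    """Read up to three leading octal digits; return (value, leftover digits)."""
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return int(digits[:n], 8), digits[n:]


def classify_escape(digits: str, prior_groups: int):
    """Classify the digit string that follows a backslash.

    digits       -- the run of decimal digits after the backslash
    prior_groups -- number of capturing left parentheses seen so far
    Returns (kind, value, leftover); leftover digits stand for themselves.
    """
    if not digits or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected one or more decimal digits")
    if digits[0] == "0":
        # \0 followed by up to two further octal digits is always a character.
        value, rest = _read_octal(digits)
        return ("octal", value, rest)
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= prior_groups:
        return ("backreference", number, "")
    value, rest = _read_octal(digits)
    return ("octal", value, rest)


print(classify_escape("040", 0))    # ('octal', 32, '')
print(classify_escape("40", 39))    # ('octal', 32, '')
print(classify_escape("40", 40))    # ('backreference', 40, '')
print(classify_escape("0113", 0))   # ('octal', 9, '3')
print(classify_escape("81", 0))     # ('backreference', 81, '')
```

The outputs line up with the table above: \040 and \40 are a space unless 40 groups precede the escape, \0113 is a tab followed by "3", and \81 is always a backreference.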
@ -512,10 +512,10 @@ limited to certain values, as follows:
|
|||
8-bit non-UTF mode no greater than 0xff
|
||||
16-bit non-UTF mode no greater than 0xffff
|
||||
32-bit non-UTF mode no greater than 0xffffffff
|
||||
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||
All UTF modes no greater than 0x10ffff and a valid code point
|
||||
</pre>
|
||||
Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
|
||||
so-called "surrogate" codepoints). The check for these can be disabled by the
|
||||
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
|
||||
so-called "surrogate" code points). The check for these can be disabled by the
|
||||
caller of <b>pcre2_compile()</b> by setting the option
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
|
||||
</P>
|
||||
|
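The limits in the table above can be expressed as a short check. This is a sketch under stated assumptions: `code_point_allowed` and its parameters are names invented for the example, and the `allow_surrogates` flag merely stands in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES without modelling any further restrictions on when that option may be used.

```python
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
UNICODE_MAX = 0x10FFFF

# Largest character value in non-UTF mode, keyed by code unit width in bits.
NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def code_point_allowed(value: int, width: int, utf: bool,
                       allow_surrogates: bool = False) -> bool:
    """Return True if a character specified by number fits the given mode."""
    if value < 0:
        return False
    if not utf:
        return value <= NON_UTF_MAX[width]
    if value > UNICODE_MAX:
        return False
    if SURROGATE_FIRST <= value <= SURROGATE_LAST:
        # Surrogates are rejected unless the check has been disabled.
        return allow_surrogates
    return True


print(code_point_allowed(0x100, 8, utf=False))      # False
print(code_point_allowed(0x110000, 32, utf=False))  # True
print(code_point_allowed(0xD800, 8, utf=True))      # False
print(code_point_allowed(0xD800, 8, utf=True, allow_surrogates=True))  # True
```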
@ -544,12 +544,12 @@ is set, \U matches a "U" character, and \u can be used to define a character
|
|||
by code point, as described above.
|
||||
</P>
|
||||
<br><b>
|
||||
Absolute and relative back references
|
||||
Absolute and relative backreferences
|
||||
</b><br>
|
||||
<P>
|
||||
The sequence \g followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \g{name}. Back references are discussed
|
||||
in braces, is an absolute or relative backreference. A named backreference
|
||||
can be coded as \g{name}. Backreferences are discussed
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
|
@ -563,7 +563,7 @@ a number enclosed either in angle brackets or single quotes, is an alternative
|
|||
syntax for referencing a subpattern as a "subroutine". Details are discussed
|
||||
<a href="#onigurumasubroutines">later.</a>
|
||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a back reference; the latter is a
|
||||
synonymous. The former is a backreference; the latter is a
|
||||
<a href="#subpatternsassubroutines">subroutine</a>
|
||||
call.
|
||||
<a name="genericchartypes"></a></P>
|
||||
|
@ -694,7 +694,7 @@ line, U+0085). Because this is an atomic group, the two-character sequence is
|
|||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
In other modes, two additional characters whose code points are greater than 255
|
||||
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
|
||||
Unicode support is not needed for these characters to be recognized.
|
||||
</P>
|
||||
|
@ -729,8 +729,8 @@ Unicode character properties
|
|||
When PCRE2 is built with Unicode support (the default), three additional escape
|
||||
sequences that match characters with specific properties are available. In
|
||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose codepoints are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||
characters whose code points are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
<pre>
|
||||
|
@ -1037,7 +1037,7 @@ joiner" characters. Characters with the "mark" property always have the
|
|||
modifier). Extending characters are allowed before the modifier.
|
||||
</P>
|
||||
<P>
|
||||
7. Do not break within emoji zwj sequences (zero-width jointer followed by
|
||||
7. Do not break within emoji zwj sequences (zero-width joiner followed by
|
||||
"glue after ZWJ" or "base glue after ZWJ").
|
||||
</P>
|
||||
<P>
|
||||
|
@ -1731,7 +1731,7 @@ numbers underneath show in which buffer the captured content will be stored.
|
|||
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
|
||||
# 1 2 2 3 2 3 4
|
||||
</pre>
|
||||
A back reference to a numbered subpattern uses the most recent value that is
|
||||
A backreference to a numbered subpattern uses the most recent value that is
|
||||
set for that number by any subpattern. The following pattern matches "abcabc"
|
||||
or "defdef":
|
||||
<pre>
|
||||
|
@ -1771,7 +1771,7 @@ have different names, but PCRE2 does not.
|
|||
In PCRE2, a subpattern can be named in one of three ways: (?<name>...) or
|
||||
(?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing
|
||||
parentheses from other parts of the pattern, such as
|
||||
<a href="#backreferences">back references,</a>
|
||||
<a href="#backreferences">backreferences,</a>
|
||||
<a href="#recursion">recursion,</a>
|
||||
and
|
||||
<a href="#conditions">conditions,</a>
|
||||
|
@ -1811,7 +1811,7 @@ for the first (and in this example, the only) subpattern of that name that
|
|||
matched. This saves searching to find which numbered subpattern it was.
|
||||
</P>
|
||||
<P>
|
||||
If you make a back reference to a non-unique named subpattern from elsewhere in
|
||||
If you make a backreference to a non-unique named subpattern from elsewhere in
|
||||
the pattern, the subpatterns to which the name refers are checked in the order
|
||||
in which they appear in the overall pattern. The first one that is set is used
|
||||
for the reference. For example, this pattern matches both "foofoo" and
|
||||
|
@ -1859,7 +1859,7 @@ items:
|
|||
the \R escape sequence
|
||||
an escape such as \d or \pL that matches a single character
|
||||
a character class
|
||||
a back reference
|
||||
a backreference
|
||||
a parenthesized subpattern (including most assertions)
|
||||
a subroutine call to a subpattern (recursive or otherwise)
|
||||
</pre>
|
||||
|
@ -1980,7 +1980,7 @@ alternatively, using ^ to indicate anchoring explicitly.
|
|||
</P>
|
||||
<P>
|
||||
However, there are some cases where the optimization cannot be used. When .*
|
||||
is inside capturing parentheses that are the subject of a back reference
|
||||
is inside capturing parentheses that are the subject of a backreference
|
||||
elsewhere in the pattern, a match at the start may fail where a later one
|
||||
succeeds. Consider, for example:
|
||||
<pre>
|
||||
|
@ -2121,30 +2121,30 @@ an atomic group, like this:
|
|||
</pre>
|
||||
sequences of non-digits cannot be broken, and failure happens quickly.
|
||||
<a name="backreferences"></a></P>
|
||||
<br><a name="SEC19" href="#TOC1">BACK REFERENCES</a><br>
|
||||
<br><a name="SEC19" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
Outside a character class, a backslash followed by a digit greater than 0 (and
|
||||
possibly further digits) is a back reference to a capturing subpattern earlier
|
||||
possibly further digits) is a backreference to a capturing subpattern earlier
|
||||
(that is, to its left) in the pattern, provided there have been that many
|
||||
previous capturing left parentheses.
|
||||
</P>
|
||||
<P>
|
||||
However, if the decimal number following the backslash is less than 8, it is
|
||||
always taken as a back reference, and causes an error only if there are not
|
||||
always taken as a backreference, and causes an error only if there are not
|
||||
that many capturing left parentheses in the entire pattern. In other words, the
|
||||
parentheses that are referenced need not be to the left of the reference for
|
||||
numbers less than 8. A "forward back reference" of this type can make sense
|
||||
numbers less than 8. A "forward backreference" of this type can make sense
|
||||
when a repetition is involved and the subpattern to the right has participated
|
||||
in an earlier iteration.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have a numerical "forward back reference" to a subpattern
|
||||
It is not possible to have a numerical "forward backreference" to a subpattern
|
||||
whose number is 8 or more using this syntax because a sequence such as \50 is
|
||||
interpreted as a character defined in octal. See the subsection entitled
|
||||
"Non-printing characters"
|
||||
<a href="#digitsafterbackslash">above</a>
|
||||
for further details of the handling of digits following a backslash. There is
|
||||
no such problem when named parentheses are used. A back reference to any
|
||||
no such problem when named parentheses are used. A backreference to any
|
||||
subpattern is possible using named parentheses (see below).
|
||||
</P>
|
||||
<P>
|
||||
|
@ -2175,7 +2175,7 @@ of forward reference can be useful in patterns that repeat. Perl does not
|
|||
support the use of + in this way.
|
||||
</P>
|
||||
<P>
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
A backreference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
|
||||
|
@ -2185,7 +2185,7 @@ below for a way of doing that). So the pattern
|
|||
</pre>
|
||||
matches "sense and sensibility" and "response and responsibility", but not
|
||||
"sense and responsibility". If caseful matching is in force at the time of the
|
||||
back reference, the case of letters is relevant. For example,
|
||||
backreference, the case of letters is relevant. For example,
|
||||
<pre>
|
||||
((?i)rah)\s+\1
|
||||
</pre>
|
||||
|
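The caseful backreference in this example can be tried out in a different engine. The sketch below uses Python's standard `re` module rather than PCRE2, so it only illustrates the same idea and says nothing about PCRE2's internals. The scoped form `(?i:rah)` is an adaptation: recent versions of Python reject a global `(?i)` that is not at the start of the pattern. The helper `matches` is a name made up for the example.

```python
import re

# The group is matched caselessly, but the backreference sits outside the
# caseless scope, so it must repeat the captured text with identical case.
rah = re.compile(r"((?i:rah))\s+\1")


def matches(pattern: "re.Pattern[str]", subject: str) -> bool:
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


print(matches(rah, "rah rah"))  # True
print(matches(rah, "RAH RAH"))  # True
print(matches(rah, "RAH rah"))  # False
```

The third subject fails because the group captured "RAH" and the backreference compares against exactly that text, not against the pattern inside the group.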
@ -2193,10 +2193,10 @@ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
|
|||
capturing subpattern is matched caselessly.
|
||||
</P>
|
||||
<P>
|
||||
There are several different ways of writing back references to named
|
||||
There are several different ways of writing backreferences to named
|
||||
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
|
||||
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
|
||||
back reference syntax, in which \g can be used for both numeric and named
|
||||
backreference syntax, in which \g can be used for both numeric and named
|
||||
references, is also supported. We could rewrite the above example in any of
|
||||
the following ways:
|
||||
<pre>
|
||||
|
@ -2209,30 +2209,30 @@ A subpattern that is referenced by name may appear in the pattern before or
|
|||
after the reference.
|
||||
</P>
|
||||
<P>
|
||||
There may be more than one back reference to the same subpattern. If a
|
||||
subpattern has not actually been used in a particular match, any back
|
||||
references to it always fail by default. For example, the pattern
|
||||
There may be more than one backreference to the same subpattern. If a
|
||||
subpattern has not actually been used in a particular match, any backreferences
|
||||
to it always fail by default. For example, the pattern
|
||||
<pre>
|
||||
(a|(bc))\2
|
||||
</pre>
|
||||
always fails if it starts to match "a" rather than "bc". However, if the
|
||||
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an
|
||||
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
|
||||
unset value matches an empty string.
|
||||
</P>
|
||||
<P>
|
||||
Because there may be many capturing parentheses in a pattern, all digits
|
||||
following a backslash are taken as part of a potential back reference number.
|
||||
following a backslash are taken as part of a potential backreference number.
|
||||
If the pattern continues with a digit character, some delimiter must be used to
|
||||
terminate the back reference. If the PCRE2_EXTENDED option is set, this can be
|
||||
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
|
||||
white space. Otherwise, the \g{ syntax or an empty comment (see
|
||||
<a href="#comments">"Comments"</a>
|
||||
below) can be used.
|
||||
</P>
|
||||
<br><b>
|
||||
Recursive back references
|
||||
Recursive backreferences
|
||||
</b><br>
|
||||
<P>
|
||||
A back reference that occurs inside the parentheses to which it refers fails
|
||||
A backreference that occurs inside the parentheses to which it refers fails
|
||||
when the subpattern is first used, so, for example, (a\1) never matches.
|
||||
However, such references can be useful inside repeated subpatterns. For
|
||||
example, the pattern
|
||||
|
@ -2240,14 +2240,14 @@ example, the pattern
|
|||
(a|b\1)+
|
||||
</pre>
|
||||
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
|
||||
the subpattern, the back reference matches the character string corresponding
|
||||
the subpattern, the backreference matches the character string corresponding
|
||||
to the previous iteration. In order for this to work, the pattern must be such
|
||||
that the first iteration does not need to match the back reference. This can be
|
||||
that the first iteration does not need to match the backreference. This can be
|
||||
done using alternation, as in the example above, or by a quantifier with a
|
||||
minimum of zero.
|
||||
</P>
|
||||
<P>
|
||||
Back references of this type cause the group that they reference to be treated
|
||||
Backreferences of this type cause the group that they reference to be treated
|
||||
as an
|
||||
<a href="#atomicgroup">atomic group.</a>
|
||||
Once the whole group has been matched, a subsequent matching failure cannot
|
||||
|
@ -2397,10 +2397,10 @@ that is, a "subroutine" call into a group that is already active,
|
|||
is not supported.
|
||||
</P>
|
||||
<P>
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
Perl does not support backreferences in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
duplicate subpattern numbers), and if the backreference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
|
@ -2882,7 +2882,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
|
|||
^(.)(\1|a(?2))
|
||||
</pre>
|
||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||
the second group, when the back reference \1 fails to match "b", the second
|
||||
the second group, when the backreference \1 fails to match "b", the second
|
||||
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
|
@ -2943,7 +2943,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
|||
(abc)(?i:\g<-1>)
|
||||
</pre>
|
||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a back reference; the latter is a subroutine call.
|
||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
|
|
|
@ -132,14 +132,14 @@ When a pattern that is compiled with this flag is passed to <b>regexec()</b> for
|
|||
matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
|
||||
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||
because it disables the use of back references.
|
||||
because it disables the use of backreferences.
|
||||
<pre>
|
||||
REG_PEND
|
||||
</pre>
|
||||
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||
(which has the type const char *) must be set to point to the character beyond
|
||||
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
|
||||
now contain binary zeroes, which are treated as data characters. Without
|
||||
now contain binary zeros, which are treated as data characters. Without
|
||||
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
||||
ignored. This is a GNU extension to the POSIX standard and should be used with
|
||||
caution in software intended to be portable to other systems.
|
||||
|
@ -248,10 +248,10 @@ function.
|
|||
<pre>
|
||||
REG_STARTEND
|
||||
</pre>
|
||||
When this option is set, the subject string is starts at <i>string</i> +
|
||||
When this option is set, the subject string starts at <i>string</i> +
|
||||
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
||||
should point to the first character beyond the string. There may be binary
|
||||
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||
zeros within the subject string, and indeed, using REG_STARTEND is the only
|
||||
way to pass a subject string that contains a binary zero.
|
||||
</P>
|
||||
<P>
|
||||
|
|
|
@ -442,7 +442,7 @@ of the newline or \R options with similar syntax. More than one of them may
|
|||
appear. For the first three, d is a decimal number.
|
||||
<pre>
|
||||
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
||||
(*LIMIT_HEAP=d) set the heap size limit to d kilobytes
|
||||
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
|
||||
(*LIMIT_MATCH=d) set the match limit to d
|
||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
|
|
|
@ -129,7 +129,7 @@ to occur).
|
|||
UTF-8 (in its original definition) is not capable of encoding values greater
|
||||
than 0x7fffffff, but such values can be handled by the 32-bit library. When
|
||||
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
|
||||
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
|
||||
character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
|
||||
0x80000000 is added to the character's value. This is the only way of passing
|
||||
such code points in a pattern string. For subject strings, using an escape
|
||||
sequence is preferable.
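The 0xff convention described in this hunk can be sketched in Python. This is only an illustration under stated assumptions: the function name `decode_test_input` is hypothetical, and the code is not taken from pcre2test. It decodes UTF-8 in its original definition (sequences of up to six bytes, values up to 0x7fffffff) and adds 0x80000000 to any character that is preceded by the byte 0xff.

```python
def decode_test_input(data: bytes) -> list:
    """Decode original-definition UTF-8, honouring a 0xff prefix byte.

    Hypothetical sketch: a character preceded by 0xff has 0x80000000
    added to its value, as the documentation text describes.
    """
    code_points = []
    i = 0
    while i < len(data):
        offset = 0
        if data[i] == 0xFF:  # marker byte; never valid in UTF-8 itself
            offset = 0x80000000
            i += 1
            if i >= len(data):
                raise ValueError("0xff must be followed by a character")
        lead = data[i]
        if lead < 0x80:
            value, extra = lead, 0
        elif lead < 0xC0:
            raise ValueError("unexpected continuation byte")
        elif lead < 0xE0:
            value, extra = lead & 0x1F, 1
        elif lead < 0xF0:
            value, extra = lead & 0x0F, 2
        elif lead < 0xF8:
            value, extra = lead & 0x07, 3
        elif lead < 0xFC:
            value, extra = lead & 0x03, 4
        elif lead < 0xFE:
            value, extra = lead & 0x01, 5
        else:
            raise ValueError("invalid lead byte")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)  # six payload bits per byte
        code_points.append(value + offset)
        i += 1 + extra
    return code_points


if __name__ == "__main__":
    print([hex(cp) for cp in decode_test_input(b"A\xffA")])
```

With this scheme the largest six-byte sequence (0x7fffffff) combined with the 0xff prefix reaches 0xffffffff, which is why the prefix is enough to cover the full 32-bit range.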
|
||||
|
@ -264,7 +264,7 @@ Do not output the version number of <b>pcre2test</b> at the start of execution.
|
|||
<P>
|
||||
<b>-S</b> <i>size</i>
|
||||
On Unix-like systems, set the size of the run-time stack to <i>size</i>
|
||||
megabytes.
|
||||
mebibytes (units of 1024*1024 bytes).
|
||||
</P>
|
||||
<P>
|
||||
<b>-subject</b> <i>modifier-list</i>
|
||||
|
@ -679,8 +679,8 @@ Newline and \R handling
|
|||
<P>
|
||||
The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
|
||||
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
|
||||
\R matches any Unicode newline sequence. The default is specified when PCRE2
|
||||
is built, with the default default being Unicode.
|
||||
\R matches any Unicode newline sequence. The default can be specified when
|
||||
PCRE2 is built; if it is not, the default is set to Unicode.
|
||||
</P>
|
||||
<P>
|
||||
The <b>newline</b> modifier specifies which characters are to be interpreted as
|
||||
|
@ -1418,11 +1418,11 @@ Setting the JIT stack size
|
|||
<P>
|
||||
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if JIT
|
||||
optimization is not being used. The value is a number of kilobytes. Setting
|
||||
zero reverts to the default of 32K. Providing a stack that is larger than the
|
||||
default is necessary only for very complicated patterns. If <b>jitstack</b> is
|
||||
set non-zero on a subject line it overrides any value that was set on the
|
||||
pattern.
|
||||
optimization is not being used. The value is a number of kibibytes (units of
|
||||
1024 bytes). Setting zero reverts to the default of 32KiB. Providing a stack
|
||||
that is larger than the default is necessary only for very complicated
|
||||
patterns. If <b>jitstack</b> is set non-zero on a subject line it overrides any
|
||||
value that was set on the pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting heap, match, and depth limits
|
||||
|
@ -1468,10 +1468,10 @@ and non-recursive, to the internal matching function, thus controlling the
|
|||
overall amount of computing resource that is used.
|
||||
</P>
|
||||
<P>
|
||||
For both kinds of matching, the <i>heap_limit</i> number (which is in kilobytes)
|
||||
limits the amount of heap memory used for matching. A value of zero disables
|
||||
the use of any heap memory; many simple pattern matches can be done without
|
||||
using the heap, so this is not an unreasonable setting.
|
||||
For both kinds of matching, the <i>heap_limit</i> number, which is in kibibytes
|
||||
(units of 1024 bytes), limits the amount of heap memory used for matching. A
|
||||
value of zero disables the use of any heap memory; many simple pattern matches
|
||||
can be done without using the heap, so zero is not an unreasonable setting.
|
||||
</P>
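The unit change in this hunk matters numerically, as a small Python sketch shows. The helper name `heap_limit_to_bytes` is hypothetical and is not part of PCRE2 or pcre2test; it only spells out the arithmetic implied by "kibibytes (units of 1024 bytes)".

```python
KIBIBYTE = 1024  # bytes per kibibyte, as stated in the revised text


def heap_limit_to_bytes(heap_limit: int) -> int:
    """Convert a heap_limit value (in kibibytes) into bytes.

    Hypothetical helper for illustration only; zero means that no heap
    memory may be used at all.
    """
    if heap_limit < 0:
        raise ValueError("heap_limit must be non-negative")
    return heap_limit * KIBIBYTE


if __name__ == "__main__":
    for limit in (0, 500, 20000000):
        print(limit, "KiB =", heap_limit_to_bytes(limit), "bytes")
```

A limit of 500 therefore allows 512000 bytes, not the 500000 that the old wording "kilobytes" might suggest.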
|
||||
<br><b>
|
||||
Showing MARK names
|
||||
|
|
|
@ -53,7 +53,7 @@ compatibility with Perl 5.6. PCRE2 does not support this.
|
|||
WIDE CHARACTERS AND UTF MODES
|
||||
</b><br>
|
||||
<P>
|
||||
Codepoints less than 256 can be specified in patterns by either braced or
|
||||
Code points less than 256 can be specified in patterns by either braced or
|
||||
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger
|
||||
values have to use braced sequences. Unbraced octal code points up to \777 are
|
||||
also recognized; larger ones can be coded using \o{...}.
|
||||
|
@ -116,7 +116,7 @@ CASE-EQUIVALENCE IN UTF MODES
|
|||
Case-insensitive matching in a UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two codepoints that
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated as such.
|
||||
</P>
|
||||
<br><b>
|
||||
|
|
2629
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -53,7 +53,7 @@ The option bits are:
|
|||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
|
|
|
@ -65,7 +65,7 @@ subject that is terminated by a binary zero code unit. The options are:
|
|||
match even if there is a full match
|
||||
.\" JOIN
|
||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
|
||||
match if no full matches are found
|
||||
match if no full matches are found
|
||||
.sp
|
||||
For details of partial matching, see the
|
||||
.\" HREF
|
||||
|
|
|
@ -24,7 +24,7 @@ request are as follows:
|
|||
.sp
|
||||
PCRE2_INFO_ALLOPTIONS Final options after compiling
|
||||
PCRE2_INFO_ARGOPTIONS Options passed to \fBpcre2_compile()\fP
|
||||
PCRE2_INFO_BACKREFMAX Number of highest back reference
|
||||
PCRE2_INFO_BACKREFMAX Number of highest backreference
|
||||
PCRE2_INFO_BSR What \eR matches:
|
||||
PCRE2_BSR_UNICODE: Unicode line endings
|
||||
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "16 June 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
|
|
@ -16,7 +16,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
|
|||
.sp
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It sets the component separator character that is used when converting globs.
|
||||
The second argument must one of the characters forward slash, backslash, or
|
||||
The second argument must be one of the characters forward slash, backslash, or
|
||||
dot. The default is backslash when running under Windows, otherwise forward
|
||||
slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
|
||||
the second argument is invalid.
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_SET_DEPTH_LIMIT 3 "11 April 2017" "PCRE2 10.30"
|
||||
.TH PCRE2_SET_HEAP_LIMIT 3 "11 April 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
|
107
doc/pcre2api.3
|
@ -497,10 +497,10 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
|
|||
.P
|
||||
Each of the first three conventions is used by at least one operating system as
|
||||
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
||||
The default default is LF, which is the Unix standard. However, the newline
|
||||
convention can be changed by an application when calling \fBpcre2_compile()\fP,
|
||||
or it can be specified by special text at the start of the pattern itself; this
|
||||
overrides any other settings. See the
|
||||
If it is not, the default is set to LF, which is the Unix standard. However,
|
||||
the newline convention can be changed by an application when calling
|
||||
\fBpcre2_compile()\fP, or it can be specified by special text at the start of
|
||||
the pattern itself; this overrides any other settings. See the
|
||||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
|
@ -885,19 +885,20 @@ offset limit. In other words, whichever limit comes first is used.
|
|||
.B " uint32_t \fIvalue\fP);"
|
||||
.fi
|
||||
.sp
|
||||
The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum
|
||||
amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking
|
||||
information when running an interpretive match. This limit also applies to
|
||||
\fBpcre2_dfa_match()\fP, which may use the heap when processing patterns with a
|
||||
lot of nested pattern recursion or lookarounds or atomic groups. This limit
|
||||
does not apply to matching with the JIT optimization, which has its own memory
|
||||
control arrangements (see the
|
||||
The \fIheap_limit\fP parameter specifies, in units of kibibytes (1024 bytes),
|
||||
the maximum amount of heap memory that \fBpcre2_match()\fP may use to hold
|
||||
backtracking information when running an interpretive match. This limit also
|
||||
applies to \fBpcre2_dfa_match()\fP, which may use the heap when processing
|
||||
patterns with a lot of nested pattern recursion or lookarounds or atomic
|
||||
groups. This limit does not apply to matching with the JIT optimization, which
|
||||
has its own memory control arrangements (see the
|
||||
.\" HREF
|
||||
\fBpcre2jit\fP
|
||||
.\"
|
||||
documentation for more details). If the limit is reached, the negative error
|
||||
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
|
||||
built; the default default is very large and is essentially "unlimited".
|
||||
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
|
||||
is built; if it is not, the default is set very large and is essentially
|
||||
"unlimited".
|
||||
.P
|
||||
A value for the heap limit may also be supplied by an item at the start of a
|
||||
pattern of the form
|
||||
|
@ -975,7 +976,7 @@ The depth limit is not relevant, and is ignored, when matching is done using
|
|||
JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
|
||||
uses it to limit the depth of nested internal recursive function calls that
|
||||
implement atomic groups, lookaround assertions, and pattern recursions. This
|
||||
limits, indirectly, the amount of system stack this is used. It was more useful
|
||||
limits, indirectly, the amount of system stack that is used. It was more useful
|
||||
in versions before 10.32, when stack memory was used for local workspace
|
||||
vectors for recursive function calls. From version 10.32, only local variables
|
||||
are allocated on the stack and as each call uses only a few hundred bytes, even
|
||||
|
@ -989,11 +990,11 @@ using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is
|
|||
probably better to limit heap usage directly by calling
|
||||
\fBpcre2_set_heap_limit()\fP.
|
||||
.P
|
||||
The default value for the depth limit can be set when PCRE2 is built; the
|
||||
default default is the same value as the default for the match limit. If the
|
||||
limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP returns
|
||||
PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
|
||||
item at the start of a pattern of the form
|
||||
The default value for the depth limit can be set when PCRE2 is built; if it is
|
||||
not, the default is set to the same value as the default for the match limit.
|
||||
If the limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
|
||||
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
|
||||
supplied by an item at the start of a pattern of the form
|
||||
.sp
|
||||
(*LIMIT_DEPTH=ddd)
|
||||
.sp
|
||||
|
@ -1050,7 +1051,7 @@ given with \fBpcre2_set_depth_limit()\fP above.
|
|||
.sp
|
||||
PCRE2_CONFIG_HEAPLIMIT
|
||||
.sp
|
||||
The output is a uint32_t integer that gives, in kilobytes, the default limit
|
||||
The output is a uint32_t integer that gives, in kibibytes, the default limit
|
||||
for the amount of heap memory used by \fBpcre2_match()\fP or
|
||||
\fBpcre2_dfa_match()\fP. Further details are given with
|
||||
\fBpcre2_set_heap_limit()\fP above.
|
||||
|
@ -1367,7 +1368,7 @@ If this bit is set, letters in the pattern match both upper and lower case
|
|||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
|
||||
properties are used for all characters with more than one other case, and for
|
||||
all characters whose code points are greater than U+007f. For lower valued
|
||||
all characters whose code points are greater than U+007F. For lower valued
|
||||
characters with only one other case, a lookup table is used for speed. When
|
||||
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
|
||||
and higher code points (available only in 16-bit or 32-bit mode) are treated as
|
||||
|
@ -1489,7 +1490,7 @@ error.
|
|||
.sp
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
.sp
|
||||
If this option is set, a back reference to an unset subpattern group matches an
|
||||
If this option is set, a backreference to an unset subpattern group matches an
|
||||
empty string (by default this causes the current matching alternative to fail).
|
||||
A pattern such as (\e1)(a) succeeds when this option is set (assuming it can
|
||||
find an "a" in the subject), whereas it fails by default, for Perl
|
||||
|
@ -1550,8 +1551,8 @@ If this option is set, it disables the use of numbered capturing parentheses in
|
|||
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
|
||||
were followed by ?: but named parentheses can still be used for capturing (and
|
||||
they acquire numbers in the usual way). This is the same as Perl's /n option.
|
||||
Note that, when this option is set, references to capturing groups (back
|
||||
references or recursion/subroutine calls) may only refer to named groups,
|
||||
Note that, when this option is set, references to capturing groups
|
||||
(backreferences or recursion/subroutine calls) may only refer to named groups,
|
||||
though the reference can be by name or by number.
|
||||
.sp
|
||||
PCRE2_NO_AUTO_POSSESS
|
||||
|
@ -1570,7 +1571,7 @@ If this option is set, it disables an optimization that is applied when .* is
|
|||
the first significant item in a top-level branch of a pattern, and all the
|
||||
other branches also start with .* or with \eA or \eG or ^. The optimization is
|
||||
automatically disabled for .* if it is inside an atomic group or a capturing
|
||||
group that is the subject of a back reference, or if the pattern contains
|
||||
group that is the subject of a backreference, or if the pattern contains
|
||||
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
|
||||
automatically anchored if PCRE2_DOTALL is set for all the .* items and
|
||||
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
|
||||
|
@ -1956,7 +1957,7 @@ following are true:
|
|||
.* is not in an atomic group
|
||||
.\" JOIN
|
||||
.* is not in a capturing group that is the subject
|
||||
of a back reference
|
||||
of a backreference
|
||||
PCRE2_DOTALL is in force for .*
|
||||
Neither (*PRUNE) nor (*SKIP) appears in the pattern
|
||||
PCRE2_NO_DOTSTAR_ANCHOR is not set
|
||||
|
@ -1966,20 +1967,20 @@ options returned for PCRE2_INFO_ALLOPTIONS.
|
|||
.sp
|
||||
PCRE2_INFO_BACKREFMAX
|
||||
.sp
|
||||
Return the number of the highest back reference in the pattern. The third
|
||||
Return the number of the highest backreference in the pattern. The third
|
||||
argument should point to an \fBuint32_t\fP variable. Named subpatterns acquire
|
||||
numbers as well as names, and these count towards the highest back reference.
|
||||
Back references such as \e4 or \eg{12} match the captured characters of the
|
||||
numbers as well as names, and these count towards the highest backreference.
|
||||
Backreferences such as \e4 or \eg{12} match the captured characters of the
|
||||
given group, but in addition, the check that a capturing group is set in a
|
||||
conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
|
||||
returned if there are no back references.
|
||||
conditional subpattern such as (?(3)a|b) is also a backreference. Zero is
|
||||
returned if there are no backreferences.
|
||||
.sp
|
||||
PCRE2_INFO_BSR
|
||||
.sp
|
||||
The output is a uint32_t whose value indicates what character sequences the \eR
|
||||
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR matches
|
||||
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR
|
||||
matches only CR, LF, or CRLF.
|
||||
The output is a uint32_t integer whose value indicates what character sequences
|
||||
the \eR escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR
|
||||
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
|
||||
that \eR matches only CR, LF, or CRLF.
|
||||
.sp
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
.sp
|
||||
|
@ -1991,10 +1992,10 @@ The third argument should point to an \fBuint32_t\fP variable.
|
|||
.sp
|
||||
If the pattern set a backtracking depth limit by including an item of the form
|
||||
(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
|
||||
should point to an unsigned 32-bit integer. If no such value has been set, the
|
||||
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
|
||||
that this limit will only be used during matching if it is less than the limit
|
||||
set or defaulted by the caller of the match function.
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
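The rule in the last sentence (a limit set in the pattern is used only if it is less than the caller's limit) amounts to taking a minimum. The Python sketch below illustrates this; `effective_limit` is a hypothetical name, not a PCRE2 function, and `None` stands in for the PCRE2_ERROR_UNSET case where the pattern sets no limit.

```python
from typing import Optional


def effective_limit(caller_limit: int, pattern_limit: Optional[int]) -> int:
    """Return the limit that would apply during matching.

    Hypothetical sketch: pattern_limit is None when the pattern contains
    no (*LIMIT_...=nnnn) item; otherwise the smaller value wins.
    """
    if pattern_limit is None:
        return caller_limit
    return min(caller_limit, pattern_limit)


if __name__ == "__main__":
    print(effective_limit(10000000, None))
    print(effective_limit(10000000, 500))
    print(effective_limit(100, 500))
```

A pattern can therefore tighten a limit but never loosen one that the caller has set or defaulted.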
|
||||
.sp
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
.sp
|
||||
|
@ -2004,7 +2005,7 @@ values for the first code unit in any match. For example, a pattern that starts
|
|||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to an
|
||||
returned. Otherwise NULL is returned. The third argument should point to a
|
||||
\fBconst uint8_t *\fP variable.
|
||||
.sp
|
||||
PCRE2_INFO_FIRSTCODETYPE
|
||||
|
@ -2031,7 +2032,7 @@ and up to 0xffffffff when not using UTF-32 mode.
|
|||
.sp
|
||||
Return the size (in bytes) of the data frames that are used to remember
|
||||
backtracking positions when the pattern is processed by \fBpcre2_match()\fP
|
||||
without the use of JIT. The third argument should point to an \fBsize_t\fP
|
||||
without the use of JIT. The third argument should point to a \fBsize_t\fP
|
||||
variable. The frame size depends on the number of capturing parentheses in the
|
||||
pattern. Each additional capturing group adds two PCRE2_SIZE variables.
|
||||
.sp
|
||||
|
@ -2051,10 +2052,10 @@ the equivalent hexadecimal or octal escape sequences.
|
|||
.sp
|
||||
If the pattern set a heap memory limit by including an item of the form
|
||||
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
|
||||
should point to an unsigned 32-bit integer. If no such value has been set, the
|
||||
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
|
||||
that this limit will only be used during matching if it is less than the limit
|
||||
set or defaulted by the caller of the match function.
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
|
||||
.sp
|
||||
PCRE2_INFO_JCHANGED
|
||||
.sp
|
||||
|
@ -2098,15 +2099,15 @@ in such cases.
|
|||
.sp
|
||||
If the pattern set a match limit by including an item of the form
|
||||
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
|
||||
should point to an unsigned 32-bit integer. If no such value has been set, the
|
||||
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
|
||||
that this limit will only be used during matching if it is less than the limit
|
||||
set or defaulted by the caller of the match function.
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
|
||||
.sp
|
||||
PCRE2_INFO_MAXLOOKBEHIND
|
||||
.sp
|
||||
Return the number of characters (not code units) in the longest lookbehind
|
||||
assertion in the pattern. The third argument should point to an unsigned 32-bit
|
||||
assertion in the pattern. The third argument should point to a uint32_t
|
||||
integer. This information is useful when doing multi-segment matching using the
|
||||
partial matching facilities. Note that the simple assertions \eb and \eB
|
||||
require a one-character lookbehind. \eA also registers a one-character
|
||||
|
@ -2393,7 +2394,7 @@ zero, the search for a match starts at the beginning of the subject, and this
|
|||
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
|
||||
must point to the start of a character, or to the end of the subject (in UTF-32
|
||||
mode, one code unit equals one character, so all offsets are valid). Like the
|
||||
pattern string, the subject may contain binary zeroes.
|
||||
pattern string, the subject may contain binary zeros.
|
||||
.P
|
||||
A non-zero starting offset is useful when searching for another match in the
|
||||
same subject by calling \fBpcre2_match()\fP again after a previous success.
|
||||
|
@ -3562,12 +3563,12 @@ There are in addition the following errors that are specific to
|
|||
.sp
|
||||
This return is given if \fBpcre2_dfa_match()\fP encounters an item in the
|
||||
pattern that it does not support, for instance, the use of \eC in a UTF mode or
|
||||
a back reference.
|
||||
a backreference.
|
||||
.sp
|
||||
PCRE2_ERROR_DFA_UCOND
|
||||
.sp
|
||||
This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
|
||||
that uses a back reference for the condition, or a test for recursion in a
|
||||
that uses a backreference for the condition, or a test for recursion in a
|
||||
specific group. These are not supported.
|
||||
.sp
|
||||
PCRE2_ERROR_DFA_WSSIZE
|
||||
|
|
|
@ -216,7 +216,7 @@ separator, U+2028), and PS (paragraph separator, U+2029). The final option is
|
|||
.sp
|
||||
--enable-newline-is-nul
|
||||
.sp
|
||||
which causes NUL (binary zero) is set as the default line-ending character.
|
||||
which causes NUL (binary zero) to be set as the default line-ending character.
|
||||
.P
|
||||
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||
overridden by applications that use the library. At build time it is
|
||||
|
@ -281,8 +281,8 @@ The \fBpcre2_match()\fP function starts out using a 20K vector on the system
|
|||
stack to record backtracking points. The more nested backtracking points there
|
||||
are (that is, the deeper the search tree), the more memory is needed. If the
|
||||
initial vector is not large enough, heap memory is used, up to a certain limit,
|
||||
which is specified in kilobytes. The limit can be changed at run time, as
|
||||
described in the
|
||||
which is specified in kibibytes (units of 1024 bytes). The limit can be changed
|
||||
at run time, as described in the
|
||||
.\" HREF
|
||||
\fBpcre2api\fP
|
||||
.\"
|
||||
|
@ -291,7 +291,7 @@ change this by a setting such as
|
|||
.sp
|
||||
--with-heap-limit=500
|
||||
.sp
|
||||
which limits the amount of heap to 500 kilobytes. This limit applies only to
|
||||
which limits the amount of heap to 500 KiB. This limit applies only to
|
||||
interpretive matching in \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, which
|
||||
may also use the heap for internal workspace when processing complicated
|
||||
patterns. This limit does not apply when JIT (which has its own memory
|
||||
|
@ -552,7 +552,7 @@ generated from the string.
|
|||
Setting --enable-fuzz-support also causes a binary called \fBpcre2fuzzcheck\fP
|
||||
to be created. This is normally run under valgrind or used when PCRE2 is
|
||||
compiled with address sanitizing enabled. It calls the fuzzing function and
|
||||
outputs information about it is doing. The input strings are specified by
|
||||
outputs information about what it is doing. The input strings are specified by
|
||||
arguments: if an argument starts with "=" the rest of it is a literal input
|
||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||
file are the test string.
|
||||
|
|
|
@ -128,7 +128,7 @@ start only after an internal newline or at the beginning of the subject, and
|
|||
branch, automatic anchoring occurs if all branches are anchorable.
|
||||
.P
|
||||
This optimization is disabled, however, if .* is in an atomic group or if there
|
||||
is a back reference to the capturing group in which it appears. It is also
|
||||
is a backreference to the capturing group in which it appears. It is also
|
||||
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
|
||||
callouts does not affect it.
|
||||
.P
|
||||
|
|
|
@ -19,7 +19,7 @@ page.
|
|||
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
|
||||
they do not mean what you might think. For example, (?!a){3} does not assert
|
||||
that the next three characters are not "a". It just asserts that the next
|
||||
character is not "a" three times (in principle: PCRE2 optimizes this to run the
|
||||
character is not "a" three times (in principle; PCRE2 optimizes this to run the
|
||||
assertion just once). Perl allows some repeat quantifiers on other assertions,
|
||||
for example, \eb* (but not \eb{3}), but these do not seem to have any use.
|
||||
.P
|
||||
|
@ -62,8 +62,8 @@ Note the following examples:
|
|||
The \eQ...\eE sequence is recognized both inside and outside character classes.
|
||||
.P
|
||||
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
||||
constructions. However, there is support PCRE2's "callout" feature, which
|
||||
allows an external function to be called during pattern matching. See the
|
||||
constructions. However, PCRE2 does have a "callout" feature, which allows an
|
||||
external function to be called during pattern matching. See the
|
||||
.\" HREF
|
||||
\fBpcre2callout\fP
|
||||
.\"
|
||||
|
@ -131,7 +131,7 @@ list is with respect to Perl 5.26:
|
|||
each alternative branch of a lookbehind assertion can match a different length
|
||||
of string. Perl requires them all to have the same length.
|
||||
.sp
|
||||
(b) From PCRE2 10.23, back references to groups of fixed length are supported
|
||||
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
|
||||
in lookbehinds, provided that there is no possibility of referencing a
|
||||
non-unique number or name. Perl does not support backreferences in lookbehinds.
|
||||
.sp
|
||||
|
|
|
@ -57,9 +57,10 @@ controlled by parameters that can be set by the \fB--buffer-size\fP and
|
|||
that is obtained at the start of processing. If an input file contains very
|
||||
long lines, a larger buffer may be needed; this is handled by automatically
|
||||
extending the buffer, up to the limit specified by \fB--max-buffer-size\fP. The
|
||||
default values for these parameters are specified when \fBpcre2grep\fP is
|
||||
built, with the default defaults being 20K and 1M respectively. An error occurs
|
||||
if a line is too long and the buffer can no longer be expanded.
|
||||
default values for these parameters can be set when \fBpcre2grep\fP is
|
||||
built; if nothing is specified, the defaults are set to 20K and 1M
|
||||
respectively. An error occurs if a line is too long and the buffer can no
|
||||
longer be expanded.
|
||||
.P
|
||||
The block of memory that is actually used is three times the "buffer size", to
|
||||
allow for buffering "before" and "after" lines. If the buffer size is too
|
||||
|
@ -434,13 +435,13 @@ short form for this option.
|
|||
When this option is given, non-compressed input is read and processed line by
|
||||
line, and the output is flushed after each write. By default, input is read in
|
||||
large chunks, unless \fBpcre2grep\fP can determine that it is reading from a
|
||||
terminal (which is currently possible only in Unix-like environments). Output
|
||||
to terminal is normally automatically flushed by the operating system. This
|
||||
option can be useful when the input or output is attached to a pipe and you do
|
||||
not want \fBpcre2grep\fP to buffer up large amounts of data. However, its use
|
||||
will affect performance, and the \fB-M\fP (multiline) option ceases to work.
|
||||
When input is from a compressed .gz or .bz2 file, \fB--line-buffered\fP is
|
||||
ignored.
|
||||
terminal (which is currently possible only in Unix-like environments or
|
||||
Windows). Output to terminal is normally automatically flushed by the operating
|
||||
system. This option can be useful when the input or output is attached to a
|
||||
pipe and you do not want \fBpcre2grep\fP to buffer up large amounts of data.
|
||||
However, its use will affect performance, and the \fB-M\fP (multiline) option
|
||||
ceases to work. When input is from a compressed .gz or .bz2 file,
|
||||
\fB--line-buffered\fP is ignored.
|
||||
.TP
|
||||
\fB--line-offsets\fP
|
||||
Instead of showing lines or parts of lines that match, show each match as a
|
||||
|
@ -470,11 +471,11 @@ is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
|
|||
counter that is incremented each time around its main processing loop. If the
|
||||
value set by \fB--match-limit\fP is reached, an error occurs.
|
||||
.sp
|
||||
The \fB--heap-limit\fP option specifies, as a number of kilobytes, the amount
|
||||
of heap memory that may be used for matching. Heap memory is needed only if
|
||||
matching the pattern requires a significant number of nested backtracking
|
||||
points to be remembered. This parameter can be set to zero to forbid the use of
|
||||
heap memory altogether.
|
||||
The \fB--heap-limit\fP option specifies, as a number of kibibytes (units of
|
||||
1024 bytes), the amount of heap memory that may be used for matching. Heap
|
||||
memory is needed only if matching the pattern requires a significant number of
|
||||
nested backtracking points to be remembered. This parameter can be set to zero
|
||||
to forbid the use of heap memory altogether.
|
||||
.sp
|
||||
The \fB--depth-limit\fP option limits the depth of nested backtracking points,
|
||||
which indirectly limits the amount of memory that is used. The amount of memory
|
||||
|
@ -483,9 +484,9 @@ parentheses in the pattern, so the amount of memory that is used before this
|
|||
limit acts varies from pattern to pattern. This limit is of use only if it is
|
||||
set smaller than \fB--match-limit\fP.
|
||||
.sp
|
||||
There are no short forms for these options. The default settings are specified
|
||||
when the PCRE2 library is compiled, with the default defaults being very large
|
||||
and so effectively unlimited.
|
||||
There are no short forms for these options. The default limits can be set
|
||||
when the PCRE2 library is compiled; if they are not specified, the defaults
|
||||
are very large and so effectively unlimited.
|
||||
.TP
|
||||
\fB--max-buffer-size=\fInumber\fP
|
||||
This limits the expansion of the processing buffer, whose initial size can be
|
||||
|
|
|
@ -56,10 +56,10 @@ DESCRIPTION
|
|||
that is obtained at the start of processing. If an input file contains
|
||||
very long lines, a larger buffer may be needed; this is handled by
|
||||
automatically extending the buffer, up to the limit specified by --max-
|
||||
buffer-size. The default values for these parameters are specified when
|
||||
pcre2grep is built, with the default defaults being 20K and 1M respec-
|
||||
tively. An error occurs if a line is too long and the buffer can no
|
||||
longer be expanded.
|
||||
buffer-size. The default values for these parameters can be set when
|
||||
pcre2grep is built; if nothing is specified, the defaults are set to
|
||||
20K and 1M respectively. An error occurs if a line is too long and the
|
||||
buffer can no longer be expanded.
|
||||
|
||||
The block of memory that is actually used is three times the "buffer
|
||||
size", to allow for buffering "before" and "after" lines. If the buffer
|
||||
|
@ -475,14 +475,14 @@ OPTIONS
|
|||
processed line by line, and the output is flushed after each
|
||||
write. By default, input is read in large chunks, unless
|
||||
pcre2grep can determine that it is reading from a terminal
|
||||
(which is currently possible only in Unix-like environments).
|
||||
Output to terminal is normally automatically flushed by the
|
||||
operating system. This option can be useful when the input or
|
||||
output is attached to a pipe and you do not want pcre2grep to
|
||||
buffer up large amounts of data. However, its use will affect
|
||||
performance, and the -M (multiline) option ceases to work.
|
||||
When input is from a compressed .gz or .bz2 file, --line-
|
||||
buffered is ignored.
|
||||
(which is currently possible only in Unix-like environments
|
||||
or Windows). Output to terminal is normally automatically
|
||||
flushed by the operating system. This option can be useful
|
||||
when the input or output is attached to a pipe and you do not
|
||||
want pcre2grep to buffer up large amounts of data. However,
|
||||
its use will affect performance, and the -M (multiline)
|
||||
option ceases to work. When input is from a compressed .gz or
|
||||
.bz2 file, --line-buffered is ignored.
|
||||
|
||||
--line-offsets
|
||||
Instead of showing lines or parts of lines that match, show
|
||||
|
@ -517,12 +517,12 @@ OPTIONS
|
|||
processing loop. If the value set by --match-limit is
|
||||
reached, an error occurs.
|
||||
|
||||
The --heap-limit option specifies, as a number of kilobytes,
|
||||
the amount of heap memory that may be used for matching. Heap
|
||||
memory is needed only if matching the pattern requires a sig-
|
||||
nificant number of nested backtracking points to be remem-
|
||||
bered. This parameter can be set to zero to forbid the use of
|
||||
heap memory altogether.
|
||||
The --heap-limit option specifies, as a number of kibibytes
|
||||
(units of 1024 bytes), the amount of heap memory that may be
|
||||
used for matching. Heap memory is needed only if matching the
|
||||
pattern requires a significant number of nested backtracking
|
||||
points to be remembered. This parameter can be set to zero to
|
||||
forbid the use of heap memory altogether.
|
||||
|
||||
The --depth-limit option limits the depth of nested back-
|
||||
tracking points, which indirectly limits the amount of memory
|
||||
|
@ -532,10 +532,10 @@ OPTIONS
|
|||
limit acts varies from pattern to pattern. This limit is of
|
||||
use only if it is set smaller than --match-limit.
|
||||
|
||||
There are no short forms for these options. The default set-
|
||||
tings are specified when the PCRE2 library is compiled, with
|
||||
the default defaults being very large and so effectively
|
||||
unlimited.
|
||||
There are no short forms for these options. The default lim-
|
||||
its can be set when the PCRE2 library is compiled; if they
|
||||
are not specified, the defaults are very large and so effec-
|
||||
tively unlimited.
|
||||
|
||||
--max-buffer-size=number
|
||||
This limits the expansion of the processing buffer, whose
|
||||
|
|
|
@ -38,9 +38,9 @@ There is no limit to the number of parenthesized subpatterns, but there can be
|
|||
no more than 65535 capturing subpatterns. There is, however, a limit to the
|
||||
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
|
||||
order to limit the amount of system stack used at compile time. The default
|
||||
limit can be specified when PCRE2 is built; the default default is 250. An
|
||||
application can change this limit by calling pcre2_set_parens_nest_limit() to
|
||||
set the limit in a compile context.
|
||||
limit can be specified when PCRE2 is built; if not, the default is set to 250.
|
||||
An application can change this limit by calling pcre2_set_parens_nest_limit()
|
||||
to set the limit in a compile context.
|
||||
.P
|
||||
The maximum length of name for a named subpattern is 32 code units, and the
|
||||
maximum number of named subpatterns is 10000.
|
||||
|
|
|
@ -67,7 +67,7 @@ ungreedy repetition quantifiers are specified in the pattern.
|
|||
Because it ends up with a single path through the tree, it is relatively
|
||||
straightforward for this algorithm to keep track of the substrings that are
|
||||
matched by portions of the pattern in parentheses. This provides support for
|
||||
capturing parentheses and back references.
|
||||
capturing parentheses and backreferences.
|
||||
.
|
||||
.
|
||||
.SH "THE ALTERNATIVE MATCHING ALGORITHM"
|
||||
|
@ -134,7 +134,7 @@ straightforward to keep track of captured substrings for the different matching
|
|||
possibilities, and PCRE2's implementation of this algorithm does not attempt to
|
||||
do this. This means that no captured substrings are available.
|
||||
.P
|
||||
3. Because no substrings are captured, back references within the pattern are
|
||||
3. Because no substrings are captured, backreferences within the pattern are
|
||||
not supported, and cause errors if encountered.
|
||||
.P
|
||||
4. For the same reason, conditional expressions that use a backreference as the
|
||||
|
@ -188,7 +188,7 @@ The alternative algorithm suffers from a number of disadvantages:
|
|||
because it has to search for all possible matches, but is also because it is
|
||||
less susceptible to optimization.
|
||||
.P
|
||||
2. Capturing parentheses and back references are not supported.
|
||||
2. Capturing parentheses and backreferences are not supported.
|
||||
.P
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
|
|
|
@ -163,7 +163,7 @@ be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
|
|||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used. The heap limit is
|
||||
specified in kilobytes.
|
||||
specified in kibibytes (units of 1024 bytes).
|
||||
.P
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
still recognized for backwards compatibility.
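The rule described in this hunk (a pattern can lower a limit but never raise it, the lowest setting wins, and the heap limit is counted in units of 1024 bytes) can be sketched in Python. The function and parameter names are hypothetical and chosen only for illustration; they are not part of the PCRE2 API.

```python
KIB = 1024  # the heap limit is expressed in kibibytes (units of 1024 bytes)


def effective_heap_limit_kib(caller_limit_kib, pattern_limits_kib):
    """Return the heap limit that applies to a match, in KiB.

    caller_limit_kib:   limit set (or defaulted) by the caller of the matcher.
    pattern_limits_kib: values given by (*LIMIT_HEAP=d) items in the pattern.

    A pattern setting only has an effect when it is below the caller's
    limit, and when there are several settings the lowest one is used,
    so the result is simply the minimum of all the values.
    """
    return min([caller_limit_kib, *pattern_limits_kib])


def kib_to_bytes(limit_kib):
    """Convert a limit expressed in kibibytes to bytes."""
    return limit_kib * KIB
```

For instance, under these assumptions a caller limit of 1000 combined with (*LIMIT_HEAP=5000) in the pattern still gives 1000, because the pattern cannot raise the limit.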
|
||||
|
@ -318,7 +318,7 @@ precede a non-alphanumeric with backslash to specify that it stands for itself.
|
|||
In particular, if you want to match a backslash, you write \e\e.
|
||||
.P
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning after a
|
||||
backslash. All other characters (in particular, those whose codepoints are
|
||||
backslash. All other characters (in particular, those whose code points are
|
||||
greater than 127) are treated as literals.
|
||||
.P
|
||||
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
|
||||
|
@ -367,7 +367,7 @@ these escapes are as follows:
|
|||
\er carriage return (hex 0D)
|
||||
\et tab (hex 09)
|
||||
\e0dd character with octal code 0dd
|
||||
\eddd character with octal code ddd, or back reference
|
||||
\eddd character with octal code ddd, or backreference
|
||||
\eo{ddd..} character with octal code ddd..
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||
|
@ -410,12 +410,12 @@ follows is itself an octal digit.
|
|||
The escape \eo must be followed by a sequence of octal digits, enclosed in
|
||||
braces. An error occurs if this is not the case. This escape is a recent
|
||||
addition to Perl; it provides a way of specifying character code points as octal
|
||||
numbers greater than 0777, and it also allows octal numbers and back references
|
||||
numbers greater than 0777, and it also allows octal numbers and backreferences
|
||||
to be unambiguously specified.
|
||||
.P
|
||||
For greater clarity and unambiguity, it is best to avoid following \e by a
|
||||
digit greater than zero. Instead, use \eo{} or \ex{} to specify character
|
||||
numbers, and \eg{} to specify back references. The following paragraphs
|
||||
numbers, and \eg{} to specify backreferences. The following paragraphs
|
||||
describe the old, ambiguous syntax.
|
||||
.P
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
|
@ -424,7 +424,7 @@ and Perl has changed over time, causing PCRE2 also to change.
|
|||
Outside a character class, PCRE2 reads the digit and any following digits as a
|
||||
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
|
||||
if there are at least that many previous capturing left parentheses in the
|
||||
expression, the entire sequence is taken as a \fIback reference\fP. A
|
||||
expression, the entire sequence is taken as a \fIbackreference\fP. A
|
||||
description of how this works is given
|
||||
.\" HTML <a href="#backreferences">
|
||||
.\" </a>
|
||||
|
@ -446,20 +446,20 @@ for themselves. For example, outside a character class:
|
|||
.\" JOIN
|
||||
\e40 is the same, provided there are fewer than 40
|
||||
previous capturing subpatterns
|
||||
\e7 is always a back reference
|
||||
\e7 is always a backreference
|
||||
.\" JOIN
|
||||
\e11 might be a back reference, or another way of
|
||||
\e11 might be a backreference, or another way of
|
||||
writing a tab
|
||||
\e011 is always a tab
|
||||
\e0113 is a tab followed by the character "3"
|
||||
.\" JOIN
|
||||
\e113 might be a back reference, otherwise the
|
||||
\e113 might be a backreference, otherwise the
|
||||
character with octal code 113
|
||||
.\" JOIN
|
||||
\e377 might be a back reference, otherwise
|
||||
\e377 might be a backreference, otherwise
|
||||
the value 255 (decimal)
|
||||
.\" JOIN
|
||||
\e81 is always a back reference
|
||||
\e81 is always a backreference
|
||||
.sp
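The decision rule for a backslash followed by a non-zero digit, outside a character class, can be modelled with a short Python sketch. The function name and return shape are hypothetical. The octal branch (read up to three octal digits, with any remaining digits standing for themselves) reflects my reading of the surrounding documentation rather than text shown in this hunk.

```python
def classify_backslash_digits(digits, previous_groups):
    """Classify the digits that follow a backslash, outside a character class.

    digits:          the decimal digits after the backslash, first digit 1-9.
    previous_groups: number of capturing left parentheses seen so far.

    Returns (kind, value, literal_tail), where kind is "backreference"
    or "octal" and literal_tail holds digits that stand for themselves.
    """
    if not (digits.isascii() and digits.isdigit()) or digits[0] == "0":
        raise ValueError("expected decimal digits starting with 1-9")

    number = int(digits)
    # Less than 10, starting with 8 or 9, or enough earlier groups:
    # the whole sequence is a backreference.
    if number < 10 or digits[0] in "89" or previous_groups >= number:
        return ("backreference", number, "")

    # Otherwise read up to three octal digits as a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("octal", int(octal, 8), digits[len(octal):])
```

With no earlier capturing groups, `classify_backslash_digits("11", 0)` yields the octal value 9 (a tab), while the same digits after eleven groups are classified as a backreference.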
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
|
@ -492,10 +492,10 @@ limited to certain values, as follows:
|
|||
8-bit non-UTF mode no greater than 0xff
|
||||
16-bit non-UTF mode no greater than 0xffff
|
||||
32-bit non-UTF mode no greater than 0xffffffff
|
||||
All UTF modes no greater than 0x10ffff and a valid codepoint
|
||||
All UTF modes no greater than 0x10ffff and a valid code point
|
||||
.sp
|
||||
Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
|
||||
so-called "surrogate" codepoints). The check for these can be disabled by the
|
||||
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
|
||||
so-called "surrogate" code points). The check for these can be disabled by the
|
||||
caller of \fBpcre2_compile()\fP by setting the option
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
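The limits in the table above can be expressed as a small Python check. This is an illustrative sketch only: the function name is hypothetical, the allow_surrogates parameter merely stands in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES, and any further restrictions that option may have in particular modes are ignored.

```python
# Largest value allowed in each non-UTF mode, keyed by code unit width.
NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def escape_value_allowed(value, width, utf, allow_surrogates=False):
    """Return True if a hexadecimal escape value is acceptable.

    value:            the numeric value written in the escape.
    width:            code unit width of the library (8, 16 or 32).
    utf:              True when a UTF mode is in force.
    allow_surrogates: models disabling the surrogate check (assumption).
    """
    if value < 0:
        return False
    if not utf:
        return value <= NON_UTF_MAX[width]
    # All UTF modes share the Unicode limit, whatever the width.
    if value > 0x10FFFF:
        return False
    is_surrogate = 0xD800 <= value <= 0xDFFF
    return allow_surrogates or not is_surrogate
```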
|
||||
.
|
||||
|
@ -523,12 +523,12 @@ is set, \eU matches a "U" character, and \eu can be used to define a character
|
|||
by code point, as described above.
|
||||
.
|
||||
.
|
||||
.SS "Absolute and relative back references"
|
||||
.SS "Absolute and relative backreferences"
|
||||
.rs
|
||||
.sp
|
||||
The sequence \eg followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative back reference. A named back reference
|
||||
can be coded as \eg{name}. Back references are discussed
|
||||
in braces, is an absolute or relative backreference. A named backreference
|
||||
can be coded as \eg{name}. Backreferences are discussed
|
||||
.\" HTML <a href="#backreferences">
|
||||
.\" </a>
|
||||
later,
|
||||
|
@ -551,7 +551,7 @@ syntax for referencing a subpattern as a "subroutine". Details are discussed
|
|||
later.
|
||||
.\"
|
||||
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
|
||||
synonymous. The former is a back reference; the latter is a
|
||||
synonymous. The former is a backreference; the latter is a
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
.\" </a>
|
||||
subroutine
|
||||
|
@ -692,7 +692,7 @@ U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
|||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
.P
|
||||
In other modes, two additional characters whose codepoints are greater than 255
|
||||
In other modes, two additional characters whose code points are greater than 255
|
||||
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
|
||||
Unicode support is not needed for these characters to be recognized.
|
||||
.P
|
||||
|
@ -727,8 +727,8 @@ an error.
|
|||
When PCRE2 is built with Unicode support (the default), three additional escape
|
||||
sequences that match characters with specific properties are available. In
|
||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose codepoints are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
|
||||
characters whose code points are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
.sp
|
||||
|
@ -1026,7 +1026,7 @@ joiner" characters. Characters with the "mark" property always have the
|
|||
6. Do not break within emoji modifier sequences (a base character followed by a
|
||||
modifier). Extending characters are allowed before the modifier.
|
||||
.P
|
||||
7. Do not break within emoji zwj sequences (zero-width jointer followed by
|
||||
7. Do not break within emoji zwj sequences (zero-width joiner followed by
|
||||
"glue after ZWJ" or "base glue after ZWJ").
|
||||
.P
|
||||
8. Do not break within emoji flag sequences. That is, do not break between
|
||||
|
@ -1724,7 +1724,7 @@ numbers underneath show in which buffer the captured content will be stored.
|
|||
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
|
||||
# 1 2 2 3 2 3 4
|
||||
.sp
|
||||
A back reference to a numbered subpattern uses the most recent value that is
|
||||
A backreference to a numbered subpattern uses the most recent value that is
|
||||
set for that number by any subpattern. The following pattern matches "abcabc"
|
||||
or "defdef":
|
||||
.sp
|
||||
|
@ -1768,7 +1768,7 @@ In PCRE2, a subpattern can be named in one of three ways: (?<name>...) or
|
|||
parentheses from other parts of the pattern, such as
|
||||
.\" HTML <a href="#backreferences">
|
||||
.\" </a>
|
||||
back references,
|
||||
backreferences,
|
||||
.\"
|
||||
.\" HTML <a href="#recursion">
|
||||
.\" </a>
|
||||
|
@ -1811,7 +1811,7 @@ The convenience functions for extracting the data by name returns the substring
|
|||
for the first (and in this example, the only) subpattern of that name that
|
||||
matched. This saves searching to find which numbered subpattern it was.
|
||||
.P
|
||||
If you make a back reference to a non-unique named subpattern from elsewhere in
|
||||
If you make a backreference to a non-unique named subpattern from elsewhere in
|
||||
the pattern, the subpatterns to which the name refers are checked in the order
|
||||
in which they appear in the overall pattern. The first one that is set is used
|
||||
for the reference. For example, this pattern matches both "foofoo" and
|
||||
|
@ -1863,7 +1863,7 @@ items:
|
|||
the \eR escape sequence
|
||||
an escape such as \ed or \epL that matches a single character
|
||||
a character class
|
||||
a back reference
|
||||
a backreference
|
||||
a parenthesized subpattern (including most assertions)
|
||||
a subroutine call to a subpattern (recursive or otherwise)
|
||||
.sp
|
||||
|
@ -1980,7 +1980,7 @@ worth setting PCRE2_DOTALL in order to obtain this optimization, or
|
|||
alternatively, using ^ to indicate anchoring explicitly.
|
||||
.P
|
||||
However, there are some cases where the optimization cannot be used. When .*
|
||||
is inside capturing parentheses that are the subject of a back reference
|
||||
is inside capturing parentheses that are the subject of a backreference
|
||||
elsewhere in the pattern, a match at the start may fail where a later one
|
||||
succeeds. Consider, for example:
|
||||
.sp
|
||||
|
@ -2116,23 +2116,23 @@ sequences of non-digits cannot be broken, and failure happens quickly.
|
|||
.
|
||||
.
|
||||
.\" HTML <a name="backreferences"></a>
|
||||
.SH "BACK REFERENCES"
|
||||
.SH "BACKREFERENCES"
|
||||
.rs
|
||||
.sp
|
||||
Outside a character class, a backslash followed by a digit greater than 0 (and
|
||||
possibly further digits) is a back reference to a capturing subpattern earlier
|
||||
possibly further digits) is a backreference to a capturing subpattern earlier
|
||||
(that is, to its left) in the pattern, provided there have been that many
|
||||
previous capturing left parentheses.
|
||||
.P
|
||||
However, if the decimal number following the backslash is less than 8, it is
|
||||
always taken as a back reference, and causes an error only if there are not
|
||||
always taken as a backreference, and causes an error only if there are not
|
||||
that many capturing left parentheses in the entire pattern. In other words, the
|
||||
parentheses that are referenced need not be to the left of the reference for
|
||||
numbers less than 8. A "forward back reference" of this type can make sense
|
||||
numbers less than 8. A "forward backreference" of this type can make sense
|
||||
when a repetition is involved and the subpattern to the right has participated
|
||||
in an earlier iteration.
|
||||
.P
|
||||
It is not possible to have a numerical "forward back reference" to a subpattern
|
||||
It is not possible to have a numerical "forward backreference" to a subpattern
|
||||
whose number is 8 or more using this syntax because a sequence such as \e50 is
|
||||
interpreted as a character defined in octal. See the subsection entitled
|
||||
"Non-printing characters"
|
||||
|
@ -2141,7 +2141,7 @@ interpreted as a character defined in octal. See the subsection entitled
|
|||
above
|
||||
.\"
|
||||
for further details of the handling of digits following a backslash. There is
|
||||
no such problem when named parentheses are used. A back reference to any
|
||||
no such problem when named parentheses are used. A backreference to any
|
||||
subpattern is possible using named parentheses (see below).
|
||||
.P
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
|
@ -2169,7 +2169,7 @@ The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
|
|||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
.P
|
||||
A back reference matches whatever actually matched the capturing subpattern in
|
||||
A backreference matches whatever actually matched the capturing subpattern in
|
||||
the current subject string, rather than anything matching the subpattern
|
||||
itself (see
|
||||
.\" HTML <a href="#subpatternsassubroutines">
|
||||
|
@ -2182,17 +2182,17 @@ below for a way of doing that). So the pattern
|
|||
.sp
|
||||
matches "sense and sensibility" and "response and responsibility", but not
|
||||
"sense and responsibility". If caseful matching is in force at the time of the
|
||||
back reference, the case of letters is relevant. For example,
|
||||
backreference, the case of letters is relevant. For example,
|
||||
.sp
|
||||
((?i)rah)\es+\e1
|
||||
.sp
|
||||
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
|
||||
capturing subpattern is matched caselessly.
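The behaviour described here can be tried out with Python's re module, which is used purely as a stand-in: it is not PCRE2, although it treats these two patterns the same way. The first pattern is quoted from my recollection of the full documentation (it falls outside this hunk). The second is rewritten with the scoped form (?i:...) because recent Python versions reject a mid-pattern (?i).

```python
import re

# A backreference matches what the group actually matched in this subject,
# not anything else that the group's subpattern could match.
SENSE = re.compile(r"(sens|respons)e and \1ibility")

# The group is matched caselessly, but the backreference sits outside the
# scoped (?i:...) and is therefore compared casefully.
RAH = re.compile(r"((?i:rah))\s+\1")


def matches(regex, subject):
    """Return True if the whole subject is matched by regex."""
    return regex.fullmatch(subject) is not None
```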
|
||||
.P
|
||||
There are several different ways of writing back references to named
|
||||
There are several different ways of writing backreferences to named
|
||||
subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
|
||||
\ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
|
||||
back reference syntax, in which \eg can be used for both numeric and named
|
||||
backreference syntax, in which \eg can be used for both numeric and named
|
||||
references, is also supported. We could rewrite the above example in any of
|
||||
the following ways:
|
||||
.sp
|
||||
|
@ -2204,20 +2204,20 @@ the following ways:
|
|||
A subpattern that is referenced by name may appear in the pattern before or
|
||||
after the reference.
|
||||
.P
|
||||
There may be more than one back reference to the same subpattern. If a
|
||||
subpattern has not actually been used in a particular match, any back
|
||||
references to it always fail by default. For example, the pattern
|
||||
There may be more than one backreference to the same subpattern. If a
|
||||
subpattern has not actually been used in a particular match, any backreferences
|
||||
to it always fail by default. For example, the pattern
|
||||
.sp
|
||||
(a|(bc))\e2
|
||||
.sp
|
||||
always fails if it starts to match "a" rather than "bc". However, if the
|
||||
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an
|
||||
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
|
||||
unset value matches an empty string.
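A rough Python illustration of the default behaviour follows. Python's re module is again only a stand-in for PCRE2, and it has no equivalent of PCRE2_MATCH_UNSET_BACKREF, so the second pattern merely emulates "an unset reference matches an empty string" with a conditional group. That emulation is an assumption made for illustration, not how PCRE2 implements the option.

```python
import re

# Default behaviour: a backreference to a group that did not take part
# in the match always fails.
DEFAULT = re.compile(r"(a|(bc))\2")

# Emulation of the "unset matches empty" behaviour: only require the
# backreference when group 2 was actually set.
UNSET_AS_EMPTY = re.compile(r"(a|(bc))(?(2)\2)")


def full_match(regex, subject):
    """Return True if the whole subject is matched by regex."""
    return regex.fullmatch(subject) is not None
```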
|
||||
.P
|
||||
Because there may be many capturing parentheses in a pattern, all digits
|
||||
following a backslash are taken as part of a potential back reference number.
|
||||
following a backslash are taken as part of a potential backreference number.
|
||||
If the pattern continues with a digit character, some delimiter must be used to
|
||||
terminate the back reference. If the PCRE2_EXTENDED option is set, this can be
|
||||
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
|
||||
white space. Otherwise, the \eg{ syntax or an empty comment (see
|
||||
.\" HTML <a href="#comments">
|
||||
.\" </a>
|
||||
|
@ -2226,10 +2226,10 @@ white space. Otherwise, the \eg{ syntax or an empty comment (see
|
|||
below) can be used.
|
||||
.
|
||||
.
|
||||
.SS "Recursive back references"
|
||||
.SS "Recursive backreferences"
|
||||
.rs
|
||||
.sp
|
||||
A back reference that occurs inside the parentheses to which it refers fails
|
||||
A backreference that occurs inside the parentheses to which it refers fails
|
||||
when the subpattern is first used, so, for example, (a\e1) never matches.
|
||||
However, such references can be useful inside repeated subpatterns. For
|
||||
example, the pattern
|
||||
|
@ -2237,13 +2237,13 @@ example, the pattern
|
|||
(a|b\e1)+
|
||||
.sp
|
||||
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
|
||||
the subpattern, the back reference matches the character string corresponding
|
||||
the subpattern, the backreference matches the character string corresponding
|
||||
to the previous iteration. In order for this to work, the pattern must be such
|
||||
that the first iteration does not need to match the back reference. This can be
|
||||
that the first iteration does not need to match the backreference. This can be
|
||||
done using alternation, as in the example above, or by a quantifier with a
|
||||
minimum of zero.
|
||||
.P
|
||||
Back references of this type cause the group that they reference to be treated
|
||||
Backreferences of this type cause the group that they reference to be treated
|
||||
as an
|
||||
.\" HTML <a href="#atomicgroup">
|
||||
.\" </a>
|
||||
|
@ -2406,10 +2406,10 @@ recursion,
|
|||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
.P
|
||||
Perl does not support back references in lookbehinds. PCRE2 does support them,
|
||||
Perl does not support backreferences in lookbehinds. PCRE2 does support them,
|
||||
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
|
||||
must not be set, there must be no use of (?| in the pattern (it creates
|
||||
duplicate subpattern numbers), and if the back reference is by name, the name
|
||||
duplicate subpattern numbers), and if the backreference is by name, the name
|
||||
must be unique. Of course, the referenced subpattern must itself be of fixed
|
||||
length. The following pattern matches words containing at least two characters
|
||||
that begin and end with the same character:
|
||||
|
@ -2899,7 +2899,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
|
|||
^(.)(\e1|a(?2))
|
||||
.sp
|
||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||
the second group, when the back reference \e1 fails to match "b", the second
|
||||
the second group, when the backreference \e1 fails to match "b", the second
|
||||
alternative matches "a" and then recurses. In the recursion, \e1 does now match
|
||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
|
@ -2964,7 +2964,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
|||
(abc)(?i:\eg<-1>)
|
||||
.sp
|
||||
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
|
||||
synonymous. The former is a back reference; the latter is a subroutine call.
|
||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||
.
|
||||
.
|
||||
.SH CALLOUTS
|
||||
|
|
|
@ -108,14 +108,14 @@ When a pattern that is compiled with this flag is passed to \fBregexec()\fP for
|
|||
matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
|
||||
captured strings are returned. Versions of the PCRE library prior to 10.22 used
|
||||
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
|
||||
because it disables the use of back references.
|
||||
because it disables the use of backreferences.
|
||||
.sp
|
||||
REG_PEND
|
||||
.sp
|
||||
If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
|
||||
(which has the type const char *) must be set to point to the character beyond
|
||||
the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
|
||||
now contain binary zeroes, which are treated as data characters. Without
|
||||
now contain binary zeros, which are treated as data characters. Without
|
||||
REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
|
||||
ignored. This is a GNU extension to the POSIX standard and should be used with
|
||||
caution in software intended to be portable to other systems.
|
||||
|
@ -224,10 +224,10 @@ function.
|
|||
.sp
|
||||
REG_STARTEND
|
||||
.sp
|
||||
When this option is set, the subject string is starts at \fIstring\fP +
|
||||
When this option is set, the subject string starts at \fIstring\fP +
|
||||
\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
|
||||
should point to the first character beyond the string. There may be binary
|
||||
zeroes within the subject string, and indeed, using REG_STARTEND is the only
|
||||
zeros within the subject string, and indeed, using REG_STARTEND is the only
|
||||
way to pass a subject string that contains a binary zero.
|
||||
.P
|
||||
Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
|
||||
|
|
|
@ -419,7 +419,7 @@ of the newline or \eR options with similar syntax. More than one of them may
|
|||
appear. For the first three, d is a decimal number.
|
||||
.sp
|
||||
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
||||
(*LIMIT_HEAP=d) set the heap size limit to d kilobytes
|
||||
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
|
||||
(*LIMIT_MATCH=d) set the match limit to d
|
||||
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||
|
|
|
@ -101,7 +101,7 @@ to occur).
|
|||
UTF-8 (in its original definition) is not capable of encoding values greater
|
||||
than 0x7fffffff, but such values can be handled by the 32-bit library. When
|
||||
testing this library in non-UTF mode with \fButf8_input\fP set, if any
|
||||
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
|
||||
character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
|
||||
0x80000000 is added to the character's value. This is the only way of passing
|
||||
such code points in a pattern string. For subject strings, using an escape
|
||||
sequence is preferable.
|
||||
|
@ -220,7 +220,7 @@ Do not output the version number of \fBpcre2test\fP at the start of execution.
|
|||
.TP 10
|
||||
\fB-S\fP \fIsize\fP
|
||||
On Unix-like systems, set the size of the run-time stack to \fIsize\fP
|
||||
megabytes.
|
||||
mebibytes (units of 1024*1024 bytes).
|
||||
.TP 10
|
||||
\fB-subject\fP \fImodifier-list\fP
|
||||
Behave as if each subject line contains the given modifiers.
|
||||
|
@ -639,8 +639,8 @@ The effects of these modifiers are described in the following sections.
|
|||
.sp
|
||||
The \fBbsr\fP modifier specifies what \eR in a pattern should match. If it is
|
||||
set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
|
||||
\eR matches any Unicode newline sequence. The default is specified when PCRE2
|
||||
is built, with the default default being Unicode.
|
||||
\eR matches any Unicode newline sequence. The default can be specified when
|
||||
PCRE2 is built; if it is not, the default is set to Unicode.
|
||||
.P
|
||||
The \fBnewline\fP modifier specifies which characters are to be interpreted as
|
||||
newlines, both in the pattern and in subject lines. The type must be one of CR,
|
||||
|
@ -1381,11 +1381,11 @@ matching provokes an error return ("bad option value") from
|
|||
.sp
|
||||
The \fBjitstack\fP modifier provides a way of setting the maximum stack size
|
||||
that is used by the just-in-time optimization code. It is ignored if JIT
|
||||
optimization is not being used. The value is a number of kilobytes. Setting
|
||||
zero reverts to the default of 32K. Providing a stack that is larger than the
|
||||
default is necessary only for very complicated patterns. If \fBjitstack\fP is
|
||||
set non-zero on a subject line it overrides any value that was set on the
|
||||
pattern.
|
||||
optimization is not being used. The value is a number of kibibytes (units of
|
||||
1024 bytes). Setting zero reverts to the default of 32KiB. Providing a stack
|
||||
that is larger than the default is necessary only for very complicated
|
||||
patterns. If \fBjitstack\fP is set non-zero on a subject line it overrides any
|
||||
value that was set on the pattern.
|
||||
.
|
||||
.
|
||||
.SS "Setting heap, match, and depth limits"
|
||||
|
@ -1427,10 +1427,10 @@ matching, \fImatch_limit\fP controls the total number of calls, both recursive
|
|||
and non-recursive, to the internal matching function, thus controlling the
|
||||
overall amount of computing resource that is used.
|
||||
.P
|
||||
For both kinds of matching, the \fIheap_limit\fP number (which is in kilobytes)
|
||||
limits the amount of heap memory used for matching. A value of zero disables
|
||||
the use of any heap memory; many simple pattern matches can be done without
|
||||
using the heap, so this is not an unreasonable setting.
|
||||
For both kinds of matching, the \fIheap_limit\fP number, which is in kibibytes
|
||||
(units of 1024 bytes), limits the amount of heap memory used for matching. A
|
||||
value of zero disables the use of any heap memory; many simple pattern matches
|
||||
can be done without using the heap, so zero is not an unreasonable setting.
|
||||
.
|
||||
.
|
||||
.SS "Showing MARK names"
|
||||
|
|
File diff suppressed because it is too large
|
@ -46,7 +46,7 @@ compatibility with Perl 5.6. PCRE2 does not support this.
|
|||
.SH "WIDE CHARACTERS AND UTF MODES"
|
||||
.rs
|
||||
.sp
|
||||
Codepoints less than 256 can be specified in patterns by either braced or
|
||||
Code points less than 256 can be specified in patterns by either braced or
|
||||
unbraced hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger
|
||||
values have to use braced sequences. Unbraced octal code points up to \e777 are
|
||||
also recognized; larger ones can be coded using \eo{...}.
|
||||
|
@ -109,7 +109,7 @@ not PCRE2_UCP is set.
|
|||
Case-insensitive matching in a UTF mode makes use of Unicode properties except
|
||||
for characters whose code points are less than 128 and that have at most two
|
||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two codepoints that
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated as such.
|
||||
.
|
||||
.
|
||||
|
|
|
@ -51,7 +51,7 @@ fi
|
|||
# utf invoke UTF-8 functionality
|
||||
#
|
||||
# The data lines must not have any pcre2test modifiers. Unless
|
||||
# "subject_litersl" is on the pattern, data lines are processed as
|
||||
# "subject_literal" is on the pattern, data lines are processed as
|
||||
# Perl double-quoted strings, so if they contain " $ or @ characters, these
|
||||
# have to be escaped. For this reason, all such characters in the
|
||||
# Perl-compatible testinput1 and testinput4 files are escaped so that they can
|
||||
|
|
|
@ -132,8 +132,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
/* Define to 1 if you have the <zlib.h> header file. */
|
||||
/* #undef HAVE_ZLIB_H */
|
||||
|
||||
/* This limits the amount of memory that pcre2_match() may use while matching
|
||||
a pattern. The value is in kilobytes. */
|
||||
/* This limits the amount of memory that may be used while matching a pattern.
|
||||
It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
|
||||
to JIT matching. The value is in kilobytes. */
|
||||
#ifndef HEAP_LIMIT
|
||||
#define HEAP_LIMIT 20000000
|
||||
#endif
|
||||
|
@ -155,7 +156,8 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
|
||||
/* The value of MATCH_LIMIT determines the default number of times the
|
||||
pcre2_match() function can record a backtrack position during a single
|
||||
matching attempt. There is a runtime interface for setting a different
|
||||
matching attempt. The value is also used to limit a loop counter in
|
||||
pcre2_dfa_match(). There is a runtime interface for setting a different
|
||||
limit. The limit exists in order to catch runaway regular expressions that
|
||||
take for ever to determine that they do not match. The default is set very
|
||||
large so that it does not accidentally catch legitimate cases. */
|
||||
|
@ -170,7 +172,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
|
||||
must be less than the value of MATCH_LIMIT. The default is to use the same
|
||||
value as MATCH_LIMIT. There is a runtime method for setting a different
|
||||
limit. */
|
||||
limit. In the case of pcre2_dfa_match(), this limit controls the depth of
|
||||
the internal nested function calls that are used for pattern recursions,
|
||||
lookarounds, and atomic groups. */
|
||||
#ifndef MATCH_LIMIT_DEPTH
|
||||
#define MATCH_LIMIT_DEPTH MATCH_LIMIT
|
||||
#endif
|
||||
|
@ -210,7 +214,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_NAME "PCRE2"
|
||||
|
||||
/* Define to the full name and version of this package. */
|
||||
#define PACKAGE_STRING "PCRE2 10.31"
|
||||
#define PACKAGE_STRING "PCRE2 10.32-RC1"
|
||||
|
||||
/* Define to the one symbol short name of this package. */
|
||||
#define PACKAGE_TARNAME "pcre2"
|
||||
|
@ -219,7 +223,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_URL ""
|
||||
|
||||
/* Define to the version of this package. */
|
||||
#define PACKAGE_VERSION "10.31"
|
||||
#define PACKAGE_VERSION "10.32-RC1"
|
||||
|
||||
/* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested
|
||||
parentheses (of any kind) in a pattern. This limits the amount of system
|
||||
|
@ -339,7 +343,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#endif
|
||||
|
||||
/* Version number of package */
|
||||
#define VERSION "10.31"
|
||||
#define VERSION "10.32-RC1"
|
||||
|
||||
/* Define to 1 if on MINIX. */
|
||||
/* #undef _MINIX */
|
||||
|
|
|
@ -134,7 +134,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
|
||||
/* This limits the amount of memory that may be used while matching a pattern.
|
||||
It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
|
||||
to JIT matching. The value is in kilobytes. */
|
||||
to JIT matching. The value is in kibibytes (units of 1024 bytes). */
|
||||
#undef HEAP_LIMIT
|
||||
|
||||
/* The value of LINK_SIZE determines the number of bytes used to store links
|
||||
|
|
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
/* The current PCRE version information. */
|
||||
|
||||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 31
|
||||
#define PCRE2_PRERELEASE
|
||||
#define PCRE2_DATE 2018-02-12
|
||||
#define PCRE2_MINOR 32
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2018-02-19
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
|
|
|
@ -4261,11 +4261,11 @@ goto FAILED;
|
|||
|
||||
|
||||
/*************************************************
|
||||
* Find first significant op code *
|
||||
* Find first significant opcode *
|
||||
*************************************************/
|
||||
|
||||
/* This is called by several functions that scan a compiled expression looking
|
||||
for a fixed first character, or an anchoring op code etc. It skips over things
|
||||
for a fixed first character, or an anchoring opcode etc. It skips over things
|
||||
that do not influence this. For some calls, it makes sense to skip negative
|
||||
forward and all backward assertions, and also the \b assertion; for others it
|
||||
does not.
|
||||
|
@ -5472,7 +5472,7 @@ for (;; pptr++)
|
|||
set xclass = TRUE. Then, in the pre-compile phase, accumulate the length
|
||||
of the extra data and reset the pointer. This is so that very large
|
||||
classes that contain a zillion wide characters or Unicode property tests
|
||||
do not overwrite the work space (which is on the stack). */
|
||||
do not overwrite the workspace (which is on the stack). */
|
||||
|
||||
if (class_uchardata > class_uchardata_base)
|
||||
{
|
||||
|
@ -7460,7 +7460,7 @@ length of the BRA and KET and any extra code units that are required at the
|
|||
beginning. We accumulate in a local variable to save frequent testing of
|
||||
lengthptr for NULL. We cannot do this by looking at the value of 'code' at the
|
||||
start and end of each alternative, because compiled items are discarded during
|
||||
the pre-compile phase so that the work space is not exceeded. */
|
||||
the pre-compile phase so that the workspace is not exceeded. */
|
||||
|
||||
length = 2 + 2*LINK_SIZE + skipunits;
|
||||
|
||||
|
|
|
@ -387,8 +387,8 @@ return (mb->callout)(cb, mb->callout_data);
|
|||
*************************************************/
|
||||
|
||||
/* This function is called when internal_dfa_match() is about to be called
|
||||
recursively and there is insufficient workingspace left in the current work
|
||||
space block. If there's an existing next block, use it; otherwise get a new
|
||||
recursively and there is insufficient working space left in the current
|
||||
workspace block. If there's an existing next block, use it; otherwise get a new
|
||||
block unless the heap limit is reached.
|
||||
|
||||
Arguments:
|
||||
|
@ -2800,7 +2800,7 @@ for (;;)
|
|||
local_workspace, /* workspace vector */
|
||||
RWS_RSIZE, /* size of same */
|
||||
rlevel, /* function recursion level */
|
||||
RWS); /* recursion work space */
|
||||
RWS); /* recursion workspace */
|
||||
|
||||
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
|
||||
|
||||
|
|
|
@ -43,7 +43,7 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
#include "config.h"
|
||||
#endif
|
||||
|
||||
/* These defines enables debugging code */
|
||||
/* These defines enable debugging code */
|
||||
|
||||
//#define DEBUG_FRAMES_DISPLAY
|
||||
//#define DEBUG_SHOW_OPS
|
||||
|
@ -1776,7 +1776,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
|
||||
|
||||
/* ===================================================================== */
|
||||
/* Match a bit-mapped character class, possibly repeatedly. These op codes
|
||||
/* Match a bit-mapped character class, possibly repeatedly. These opcodes
|
||||
are used when all the characters in the class have values in the range
|
||||
0-255, and either the matching is caseful, or the characters are in the
|
||||
range 0-127 when UTF processing is enabled. The only difference between
|
||||
|
@ -2464,7 +2464,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
|
||||
/* ===================================================================== */
|
||||
/* Match a single character type repeatedly. Note that the property type
|
||||
does not need to be in a stack frame as it not used within an RMATCH()
|
||||
does not need to be in a stack frame as it is not used within an RMATCH()
|
||||
loop. */
|
||||
|
||||
#define Lstart_eptr F->temp_sptr[0]
|
||||
|
@ -4143,7 +4143,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
}
|
||||
break;
|
||||
|
||||
/* The "byte" (i.e. "code unit") case is the same as non-UTF */
|
||||
/* The "byte" (i.e. "code unit") case is the same as non-UTF */
|
||||
|
||||
case OP_ANYBYTE:
|
||||
fc = Lmax - Lmin;
|
||||
|
@ -5424,7 +5424,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
Feptr -= number;
|
||||
}
|
||||
|
||||
/* Save the earliest consulted character, then skip to next op code */
|
||||
/* Save the earliest consulted character, then skip to next opcode */
|
||||
|
||||
if (Feptr < mb->start_used_ptr) mb->start_used_ptr = Feptr;
|
||||
Fecode += 1 + LINK_SIZE;
|
||||
|
@ -5929,7 +5929,7 @@ in rrc. */
|
|||
|
||||
RETURN_SWITCH:
|
||||
if (Frdepth == 0) return rrc; /* Exit from the top level */
|
||||
F = (heapframe *)((char *)F - Fback_frame); /* Back track */
|
||||
F = (heapframe *)((char *)F - Fback_frame); /* Backtrack */
|
||||
mb->cb->callout_flags |= PCRE2_CALLOUT_BACKTRACK; /* Note for callouts */
|
||||
|
||||
#ifdef DEBUG_SHOW_RMATCH
|
||||
|
|
|
@ -1274,7 +1274,7 @@ do
|
|||
break;
|
||||
|
||||
/* Single character types set the bits and stop. Note that if PCRE2_UCP
|
||||
is set, we do not see these op codes because \d etc are converted to
|
||||
is set, we do not see these opcodes because \d etc are converted to
|
||||
properties. Therefore, these apply in the case when only characters less
|
||||
than 256 are recognized to match the types. */
|
||||
|
||||
|
|
|
@ -170,7 +170,7 @@ are implementing).
|
|||
by E_Modifier). Extend characters are allowed before the modifier; this
|
||||
cannot be represented in this table, the code has to deal with it.
|
||||
|
||||
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
|
||||
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
|
||||
E_Base_GAZ).
|
||||
|
||||
9. Do not break within emoji flag sequences. That is, do not break between
|
||||
|
|
|
@ -492,7 +492,7 @@ so many of them that they are split into two fields. */
|
|||
|
||||
/* These are the matching controls that may be set either on a pattern or on a
|
||||
data line. They are copied from the pattern controls as initial settings for
|
||||
data line controls Note that CTL_MEMORY is not included here, because it does
|
||||
data line controls. Note that CTL_MEMORY is not included here, because it does
|
||||
different things in the two cases. */
|
||||
|
||||
#define CTL_ALLPD (CTL_AFTERTEXT|\
|
||||
|
@ -5411,7 +5411,7 @@ switch(errorcode)
|
|||
|
||||
/* The pattern is now in pbuffer[8|16|32], with the length in code units in
|
||||
patlen. If it is to be converted, copy the result back afterwards so that it
|
||||
it ends up back in the usual place. */
|
||||
ends up back in the usual place. */
|
||||
|
||||
if (pat_patctl.convert_type != CONVERT_UNSET)
|
||||
{
|
||||
|
@ -5735,7 +5735,7 @@ return PR_OK;
|
|||
*************************************************/
|
||||
|
||||
/* This is used for DFA, normal, and JIT fast matching. For DFA matching it
|
||||
should only called with the third argument set to PCRE2_ERROR_DEPTHLIMIT.
|
||||
should only be called with the third argument set to PCRE2_ERROR_DEPTHLIMIT.
|
||||
|
||||
Arguments:
|
||||
pp the subject string
|
||||
|
@ -7766,7 +7766,7 @@ printf(" -LM list pattern and subject modifiers, then exit\n");
|
|||
printf(" -q quiet: do not output PCRE2 version number at start\n");
|
||||
printf(" -pattern <s> set default pattern modifier fields\n");
|
||||
printf(" -subject <s> set default subject modifier fields\n");
|
||||
printf(" -S <n> set stack size to <n> megabytes\n");
|
||||
printf(" -S <n> set stack size to <n> mebibytes\n");
|
||||
printf(" -t [<n>] time compilation and execution, repeating <n> times\n");
|
||||
printf(" -tm [<n>] time execution (matching) only, repeating <n> times\n");
|
||||
printf(" -T same as -t, but show total times at the end\n");
|
||||
|
|