Typos in documentation and comments noted by Jason Hood.

Philip.Hazel 2018-06-17 14:13:28 +00:00
parent fa58ac6734
commit fabea723cf
57 changed files with 2128 additions and 2118 deletions


@ -146,7 +146,7 @@ SET(PCRE2_PARENS_NEST_LIMIT "250" CACHE STRING
"Default nested parentheses limit. See PARENS_NEST_LIMIT in config.h.in for details.")
SET(PCRE2_HEAP_LIMIT "20000000" CACHE STRING
"Default limit on heap memory (kilobytes). See HEAP_LIMIT in config.h.in for details.")
"Default limit on heap memory (kibibytes). See HEAP_LIMIT in config.h.in for details.")
SET(PCRE2_MATCH_LIMIT "10000000" CACHE STRING
"Default limit on internal looping. See MATCH_LIMIT in config.h.in for details.")


@ -17,7 +17,7 @@ groups altogether. Now it shows those that come before any actual captures as
3. Running "pcre2test -C" always stated "\R matches CR, LF, or CRLF only",
whatever the build configuration was. It now correctly says "\R matches all
Unicode newlines" in the default case when --enable-bsr-anycrlf has not been
specified. Similarly, running "pcfre2test -C bsr" never produced the result
specified. Similarly, running "pcre2test -C bsr" never produced the result
ANY.
4. Matching the pattern /(*UTF)\C[^\v]+\x80/ against an 8-bit string containing
@ -370,7 +370,7 @@ tests to improve coverage.
31. If more than one of "push", "pushcopy", or "pushtablescopy" were set in
pcre2test, a crash could occur.
32. Make -bigstack in RunTest allocate a 64Mb stack (instead of 16 MB) so that
32. Make -bigstack in RunTest allocate a 64MB stack (instead of 16 MB) so that
all the tests can run with clang's sanitizing options.
33. Implement extra compile options in the compile context and add the first


@ -348,7 +348,7 @@ The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
others) may be changed in the middle of patterns by items such as (?i). Their
processing is handled entirely at compile time by generating different opcodes
for the different settings. The runtime functions do not need to keep track of
an options state.
an option's state.
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled
@ -764,7 +764,7 @@ OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is also used for
"subroutine" calls, even though they are not strictly a recursion. Up till
release 10.30 recursions were treated as atomic groups, making them
incompatible with Perl (but PCRE had then well before Perl did). From 10.30,
incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
backtracking into recursions is supported.
Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only

NEWS

@ -31,7 +31,7 @@ remembering backtracking positions. This makes --disable-stack-for-recursion a
NOOP. The new implementation allows backtracking into recursive group calls in
patterns, making it more compatible with Perl, and also fixes some other
previously hard-to-do issues. For patterns that have a lot of backtracking, the
heap is now used, and there is explicit limit on the amount, settable by
heap is now used, and there is an explicit limit on the amount, settable by
pcre2_set_heap_limit() or (*LIMIT_HEAP=xxx). The "recursion limit" is retained,
but is renamed as "depth limit" (though the old names remain for
compatibility).
@ -53,7 +53,7 @@ also supported.
5. Additional compile options in the compile context are now available, and the
first two are: PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES and
PCRE2_EXTRA_BAD_ESCAPE_IS LITERAL.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL.
6. The newline type PCRE2_NEWLINE_NUL is now available.


@ -127,7 +127,7 @@ can skip ahead to the CMake section.
src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
these yourself.
Not also that the pcre2_fuzzsupport.c file contains special code that is
Note also that the pcre2_fuzzsupport.c file contains special code that is
useful to those who want to run fuzzing tests on the PCRE2 library. Unless
you are doing that, you can ignore it.
@ -186,7 +186,7 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS
Prior to release 10.30 the default system stack size of 1Mb in some Windows
Prior to release 10.30 the default system stack size of 1MB in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.

README

@ -257,9 +257,10 @@ library. They are also documented in the pcre2build man page.
--with-heap-limit=500
The units are kilobytes. This limit does not apply when the JIT optimization
(which has its own memory control features) is used. There is more discussion
on the pcre2api man page (search for pcre2_set_heap_limit).
The units are kibibytes (units of 1024 bytes). This limit does not apply when
the JIT optimization (which has its own memory control features) is used.
There is more discussion on the pcre2api man page (search for
pcre2_set_heap_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@ -319,10 +320,10 @@ library. They are also documented in the pcre2build man page.
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
you add --disable-pcre2grep-jit to the "configure" command.
. On non-Windows sytems there is support for calling external scripts during
matching in the pcre2grep command via PCRE2's callout facility with string
arguments. This support can be disabled by adding --disable-pcre2grep-callout
to the "configure" command.
. There is support for calling external programs during matching in the
pcre2grep command, using PCRE2's callout facility with string arguments. This
support can be disabled by adding --disable-pcre2grep-callout to the
"configure" command.
. The pcre2grep program currently supports only 8-bit data files, and so
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
@ -887,4 +888,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 27 April 2018
Last updated: 17 June 2018


@ -708,7 +708,7 @@ $valgrind $vjs $pcre2grep -n --newline=any "^(abc|def|ghi|jkl)" testNinputgrep >
printf "%c--------------------------- Test N6 ------------------------------\r\n" - >>testtrygrep
$valgrind $vjs $pcre2grep -n --newline=anycrlf "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep
# It seems inpossible to handle NUL characters easily in Solaris (aka SunOS).
# It seems impossible to handle NUL characters easily in Solaris (aka SunOS).
# The version of sed explicitly doesn't like them. For the moment, we just
# don't run this test under SunOS. Fudge the output so that the comparison
# works. A similar problem has also been reported for MacOS (Darwin).


@ -843,7 +843,7 @@ for bmode in "$test8" "$test16" "$test32"; do
checkresult $? 24 ""
fi
# UTF pattern converson tests
# UTF pattern conversion tests
if [ "$do25" = yes ] ; then
echo $title25


@ -288,7 +288,7 @@ AC_ARG_WITH(parens-nest-limit,
# Handle --with-heap-limit
AC_ARG_WITH(heap-limit,
AS_HELP_STRING([--with-heap-limit=N],
[default limit on heap memory (kilobytes, default=20000000)]),
[default limit on heap memory (kibibytes, default=20000000)]),
, with_heap_limit=20000000)
# Handle --with-match-limit=N
@ -754,7 +754,7 @@ AC_DEFINE_UNQUOTED([MATCH_LIMIT_DEPTH], [$with_match_limit_depth], [
AC_DEFINE_UNQUOTED([HEAP_LIMIT], [$with_heap_limit], [
This limits the amount of memory that may be used while matching
a pattern. It applies to both pcre2_match() and pcre2_dfa_match(). It does
not apply to JIT matching. The value is in kilobytes.])
not apply to JIT matching. The value is in kibibytes (units of 1024 bytes).])
AC_DEFINE([MAX_NAME_SIZE], [32], [
This limit is parameterized just in case anybody ever wants to
@ -1017,7 +1017,7 @@ $PACKAGE-$VERSION configuration summary:
Rebuild char tables ................ : ${enable_rebuild_chartables}
Internal link size ................. : ${with_link_size}
Nested parentheses limit ........... : ${with_parens_nest_limit}
Heap limit ......................... : ${with_heap_limit} kilobytes
Heap limit ......................... : ${with_heap_limit} kibibytes
Match limit ........................ : ${with_match_limit}
Match depth limit .................. : ${with_match_limit_depth}
Build shared libs .................. : ${enable_shared}
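
The wording change from kilobytes to kibibytes in the hunks above is more than cosmetic: the same numeric setting corresponds to a different byte count under each reading. The C sketch below only does that arithmetic. The helper names `kib_to_bytes`, `kb_to_bytes` and `reading_gap_bytes` are hypothetical and are not part of PCRE2 or its build system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of PCRE2. */

/* Bytes in a limit given in kibibytes (units of 1024 bytes). */
uint64_t kib_to_bytes(uint32_t kib)
{
  return (uint64_t)kib * 1024u;
}

/* Bytes the same number would mean if read as decimal kilobytes. */
uint64_t kb_to_bytes(uint32_t kb)
{
  return (uint64_t)kb * 1000u;
}

/* Difference between the two readings of the same number, in bytes. */
uint64_t reading_gap_bytes(uint32_t n)
{
  assert(kib_to_bytes(n) >= kb_to_bytes(n));
  return kib_to_bytes(n) - kb_to_bytes(n);
}
```

With the example setting --with-heap-limit=500, `kib_to_bytes(500)` is 512000 bytes rather than 500000. For the default of 20000000, the two readings differ by 480000000 bytes.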


@ -127,7 +127,7 @@ can skip ahead to the CMake section.
src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
these yourself.
Not also that the pcre2_fuzzsupport.c file contains special code that is
Note also that the pcre2_fuzzsupport.c file contains special code that is
useful to those who want to run fuzzing tests on the PCRE2 library. Unless
you are doing that, you can ignore it.
@ -186,7 +186,7 @@ can skip ahead to the CMake section.
STACK SIZE IN WINDOWS ENVIRONMENTS
Prior to release 10.30 the default system stack size of 1Mb in some Windows
Prior to release 10.30 the default system stack size of 1MB in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.


@ -257,9 +257,10 @@ library. They are also documented in the pcre2build man page.
--with-heap-limit=500
The units are kilobytes. This limit does not apply when the JIT optimization
(which has its own memory control features) is used. There is more discussion
on the pcre2api man page (search for pcre2_set_heap_limit).
The units are kibibytes (units of 1024 bytes). This limit does not apply when
the JIT optimization (which has its own memory control features) is used.
There is more discussion on the pcre2api man page (search for
pcre2_set_heap_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
64K bytes. You can increase this by adding --with-link-size=3 to the
@ -319,10 +320,10 @@ library. They are also documented in the pcre2build man page.
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
you add --disable-pcre2grep-jit to the "configure" command.
. On non-Windows sytems there is support for calling external scripts during
matching in the pcre2grep command via PCRE2's callout facility with string
arguments. This support can be disabled by adding --disable-pcre2grep-callout
to the "configure" command.
. There is support for calling external programs during matching in the
pcre2grep command, using PCRE2's callout facility with string arguments. This
support can be disabled by adding --disable-pcre2grep-callout to the
"configure" command.
. The pcre2grep program currently supports only 8-bit data files, and so
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
@ -887,4 +888,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 27 April 2018
Last updated: 17 June 2018


@ -65,7 +65,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)


@ -36,7 +36,7 @@ request are as follows:
<pre>
PCRE2_INFO_ALLOPTIONS Final options after compiling
PCRE2_INFO_ARGOPTIONS Options passed to <b>pcre2_compile()</b>
PCRE2_INFO_BACKREFMAX Number of highest back reference
PCRE2_INFO_BACKREFMAX Number of highest backreference
PCRE2_INFO_BSR What \R matches:
PCRE2_BSR_UNICODE: Unicode line endings
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only


@ -28,7 +28,7 @@ DESCRIPTION
<P>
This function is part of an experimental set of pattern conversion functions.
It sets the component separator character that is used when converting globs.
The second argument must one of the characters forward slash, backslash, or
The second argument must be one of the characters forward slash, backslash, or
dot. The default is backslash when running under Windows, otherwise forward
slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
the second argument is invalid.


@ -562,10 +562,10 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
<P>
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. However, the newline
convention can be changed by an application when calling <b>pcre2_compile()</b>,
or it can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the
If it is not, the default is set to LF, which is the Unix standard. However,
the newline convention can be changed by an application when calling
<b>pcre2_compile()</b>, or it can be specified by special text at the start of
the pattern itself; this overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences.
</P>
@ -949,17 +949,18 @@ offset limit. In other words, whichever limit comes first is used.
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum
amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking
information when running an interpretive match. This limit also applies to
<b>pcre2_dfa_match()</b>, which may use the heap when processing patterns with a
lot of nested pattern recursion or lookarounds or atomic groups. This limit
does not apply to matching with the JIT optimization, which has its own memory
control arrangements (see the
The <i>heap_limit</i> parameter specifies, in units of kibibytes (1024 bytes),
the maximum amount of heap memory that <b>pcre2_match()</b> may use to hold
backtracking information when running an interpretive match. This limit also
applies to <b>pcre2_dfa_match()</b>, which may use the heap when processing
patterns with a lot of nested pattern recursion or lookarounds or atomic
groups. This limit does not apply to matching with the JIT optimization, which
has its own memory control arrangements (see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details). If the limit is reached, the negative error
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
built; the default default is very large and is essentially "unlimited".
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
is built; if it is not, the default is set very large and is essentially
"unlimited".
</P>
<P>
A value for the heap limit may also be supplied by an item at the start of a
@ -1044,7 +1045,7 @@ The depth limit is not relevant, and is ignored, when matching is done using
JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
uses it to limit the depth of nested internal recursive function calls that
implement atomic groups, lookaround assertions, and pattern recursions. This
limits, indirectly, the amount of system stack this is used. It was more useful
limits, indirectly, the amount of system stack that is used. It was more useful
in versions before 10.32, when stack memory was used for local workspace
vectors for recursive function calls. From version 10.32, only local variables
are allocated on the stack and as each call uses only a few hundred bytes, even
@ -1060,11 +1061,11 @@ probably better to limit heap usage directly by calling
<b>pcre2_set_heap_limit()</b>.
</P>
<P>
The default value for the depth limit can be set when PCRE2 is built; the
default default is the same value as the default for the match limit. If the
limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> returns
PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
item at the start of a pattern of the form
The default value for the depth limit can be set when PCRE2 is built; if it is
not, the default is set to the same value as the default for the match limit.
If the limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
supplied by an item at the start of a pattern of the form
<pre>
(*LIMIT_DEPTH=ddd)
</pre>
@ -1120,7 +1121,7 @@ given with <b>pcre2_set_depth_limit()</b> above.
<pre>
PCRE2_CONFIG_HEAPLIMIT
</pre>
The output is a uint32_t integer that gives, in kilobytes, the default limit
The output is a uint32_t integer that gives, in kibibytes, the default limit
for the amount of heap memory used by <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>. Further details are given with
<b>pcre2_set_heap_limit()</b> above.
@ -1431,7 +1432,7 @@ If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
properties are used for all characters with more than one other case, and for
all characters whose code points are greater than U+007f. For lower valued
all characters whose code points are greater than U+007F. For lower valued
characters with only one other case, a lookup table is used for speed. When
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
and higher code points (available only in 16-bit or 32-bit mode) are treated as
@ -1551,7 +1552,7 @@ error.
<pre>
PCRE2_MATCH_UNSET_BACKREF
</pre>
If this option is set, a back reference to an unset subpattern group matches an
If this option is set, a backreference to an unset subpattern group matches an
empty string (by default this causes the current matching alternative to fail).
A pattern such as (\1)(a) succeeds when this option is set (assuming it can
find an "a" in the subject), whereas it fails by default, for Perl
@ -1613,8 +1614,8 @@ If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). This is the same as Perl's /n option.
Note that, when this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
Note that, when this option is set, references to capturing groups
(backreferences or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
<pre>
PCRE2_NO_AUTO_POSSESS
@ -1633,7 +1634,7 @@ If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
group that is the subject of a backreference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
@ -1999,7 +2000,7 @@ When .* is the first significant item, anchoring is possible only when all the
following are true:
<pre>
.* is not in an atomic group
.* is not in a capturing group that is the subject of a back reference
.* is not in a capturing group that is the subject of a backreference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
@ -2009,20 +2010,20 @@ options returned for PCRE2_INFO_ALLOPTIONS.
<pre>
PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest back reference in the pattern. The third
Return the number of the highest backreference in the pattern. The third
argument should point to an <b>uint32_t</b> variable. Named subpatterns acquire
numbers as well as names, and these count towards the highest back reference.
Back references such as \4 or \g{12} match the captured characters of the
numbers as well as names, and these count towards the highest backreference.
Backreferences such as \4 or \g{12} match the captured characters of the
given group, but in addition, the check that a capturing group is set in a
conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
returned if there are no back references.
conditional subpattern such as (?(3)a|b) is also a backreference. Zero is
returned if there are no backreferences.
<pre>
PCRE2_INFO_BSR
</pre>
The output is a uint32_t whose value indicates what character sequences the \R
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
matches only CR, LF, or CRLF.
The output is a uint32_t integer whose value indicates what character sequences
the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
that \R matches only CR, LF, or CRLF.
<pre>
PCRE2_INFO_CAPTURECOUNT
</pre>
@ -2034,10 +2035,10 @@ The third argument should point to an <b>uint32_t</b> variable.
</pre>
If the pattern set a backtracking depth limit by including an item of the form
(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_FIRSTBITMAP
</pre>
@ -2047,7 +2048,7 @@ values for the first code unit in any match. For example, a pattern that starts
with [abc] results in a table with three bits set. When code unit values
greater than 255 are supported, the flag bit for 255 means "any code unit of
value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to an
returned. Otherwise NULL is returned. The third argument should point to a
<b>const uint8_t *</b> variable.
<pre>
PCRE2_INFO_FIRSTCODETYPE
@ -2074,7 +2075,7 @@ and up to 0xffffffff when not using UTF-32 mode.
</pre>
Return the size (in bytes) of the data frames that are used to remember
backtracking positions when the pattern is processed by <b>pcre2_match()</b>
without the use of JIT. The third argument should point to an <b>size_t</b>
without the use of JIT. The third argument should point to a <b>size_t</b>
variable. The frame size depends on the number of capturing parentheses in the
pattern. Each additional capturing group adds two PCRE2_SIZE variables.
<pre>
@ -2094,10 +2095,10 @@ the equivalent hexadecimal or octal escape sequences.
</pre>
If the pattern set a heap memory limit by including an item of the form
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_JCHANGED
</pre>
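
The hunk above states that a limit embedded in a pattern, such as (*LIMIT_HEAP=nnnn), takes effect only when it is lower than the limit set or defaulted by the caller of the match function. A minimal C sketch of that rule follows. It illustrates the documented behaviour only and is not PCRE2's internal code; `effective_limit` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration; not taken from the PCRE2 sources.
   caller_limit:  limit set or defaulted by the caller of the match function
   pattern_limit: value from an item such as (*LIMIT_HEAP=nnnn)
   pattern_set:   false when the pattern contains no such item */
uint32_t effective_limit(uint32_t caller_limit, uint32_t pattern_limit,
                         bool pattern_set)
{
  /* A pattern can lower the limit but never raise it. */
  if (pattern_set && pattern_limit < caller_limit)
    return pattern_limit;
  return caller_limit;
}
```

Under this reading, a pattern item asking for more than the caller allows is simply ignored, which is why the same wording recurs in the depth, heap and match limit hunks.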
@ -2141,15 +2142,15 @@ in such cases.
</pre>
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to an unsigned 32-bit
assertion in the pattern. The third argument should point to a uint32_t
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \b and \B
require a one-character lookbehind. \A also registers a one-character
@ -2417,7 +2418,7 @@ zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
must point to the start of a character, or to the end of the subject (in UTF-32
mode, one code unit equals one character, so all offsets are valid). Like the
pattern string, the subject may contain binary zeroes.
pattern string, the subject may contain binary zeros.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
@ -3559,12 +3560,12 @@ There are in addition the following errors that are specific to
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF mode or
a back reference.
a backreference.
<pre>
PCRE2_ERROR_DFA_UCOND
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a back reference for the condition, or a test for recursion in a
that uses a backreference for the condition, or a test for recursion in a
specific group. These are not supported.
<pre>
PCRE2_ERROR_DFA_WSSIZE


@ -227,7 +227,7 @@ separator, U+2028), and PS (paragraph separator, U+2029). The final option is
<pre>
--enable-newline-is-nul
</pre>
which causes NUL (binary zero) is set as the default line-ending character.
which causes NUL (binary zero) to be set as the default line-ending character.
</P>
<P>
Whatever default line ending convention is selected when PCRE2 is built can be
@ -286,15 +286,15 @@ The <b>pcre2_match()</b> function starts out using a 20K vector on the system
stack to record backtracking points. The more nested backtracking points there
are (that is, the deeper the search tree), the more memory is needed. If the
initial vector is not large enough, heap memory is used, up to a certain limit,
which is specified in kilobytes. The limit can be changed at run time, as
described in the
which is specified in kibibytes (units of 1024 bytes). The limit can be changed
at run time, as described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The default limit (in effect unlimited) is 20 million. You can
change this by a setting such as
<pre>
--with-heap-limit=500
</pre>
which limits the amount of heap to 500 kilobytes. This limit applies only to
which limits the amount of heap to 500 KiB. This limit applies only to
interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
may also use the heap for internal workspace when processing complicated
patterns. This limit does not apply when JIT (which has its own memory
@ -542,7 +542,7 @@ generated from the string.
Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
to be created. This is normally run under valgrind or used when PCRE2 is
compiled with address sanitizing enabled. It calls the fuzzing function and
outputs information about it is doing. The input strings are specified by
outputs information about what it is doing. The input strings are specified by
arguments: if an argument starts with "=" the rest of it is a literal input
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.


@ -143,7 +143,7 @@ branch, automatic anchoring occurs if all branches are anchorable.
</P>
<P>
This optimization is disabled, however, if .* is in an atomic group or if there
is a back reference to the capturing group in which it appears. It is also
is a backreference to the capturing group in which it appears. It is also
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
callouts does not affect it.
</P>


@ -31,7 +31,7 @@ page.
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle: PCRE2 optimizes this to run the
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \b* (but not \b{3}), but these do not seem to have any use.
</P>
@ -77,8 +77,8 @@ The \Q...\E sequence is recognized both inside and outside character classes.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, there is support PCRE2's "callout" feature, which
allows an external function to be called during pattern matching. See the
constructions. However, PCRE2 does have a "callout" feature, which allows an
external function to be called during pattern matching. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
</P>
@ -156,7 +156,7 @@ each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
<br>
<br>
(b) From PCRE2 10.23, back references to groups of fixed length are supported
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
<br>


@ -86,9 +86,10 @@ controlled by parameters that can be set by the <b>--buffer-size</b> and
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
default values for these parameters are specified when <b>pcre2grep</b> is
built, with the default defaults being 20K and 1M respectively. An error occurs
if a line is too long and the buffer can no longer be expanded.
default values for these parameters can be set when <b>pcre2grep</b> is
built; if nothing is specified, the defaults are set to 20K and 1M
respectively. An error occurs if a line is too long and the buffer can no
longer be expanded.
</P>
<P>
The block of memory that is actually used is three times the "buffer size", to
@ -500,13 +501,13 @@ short form for this option.
When this option is given, non-compressed input is read and processed line by
line, and the output is flushed after each write. By default, input is read in
large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
terminal (which is currently possible only in Unix-like environments). Output
to terminal is normally automatically flushed by the operating system. This
option can be useful when the input or output is attached to a pipe and you do
not want <b>pcre2grep</b> to buffer up large amounts of data. However, its use
will affect performance, and the <b>-M</b> (multiline) option ceases to work.
When input is from a compressed .gz or .bz2 file, <b>--line-buffered</b> is
ignored.
terminal (which is currently possible only in Unix-like environments or
Windows). Output to terminal is normally automatically flushed by the operating
system. This option can be useful when the input or output is attached to a
pipe and you do not want <b>pcre2grep</b> to buffer up large amounts of data.
However, its use will affect performance, and the <b>-M</b> (multiline) option
ceases to work. When input is from a compressed .gz or .bz2 file,
<b>--line-buffered</b> is ignored.
</P>
<P>
<b>--line-offsets</b>
@ -541,11 +542,11 @@ counter that is incremented each time around its main processing loop. If the
value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
The <b>--heap-limit</b> option specifies, as a number of kilobytes, the amount
of heap memory that may be used for matching. Heap memory is needed only if
matching the pattern requires a significant number of nested backtracking
points to be remembered. This parameter can be set to zero to forbid the use of
heap memory altogether.
The <b>--heap-limit</b> option specifies, as a number of kibibytes (units of
1024 bytes), the amount of heap memory that may be used for matching. Heap
memory is needed only if matching the pattern requires a significant number of
nested backtracking points to be remembered. This parameter can be set to zero
to forbid the use of heap memory altogether.
<br>
<br>
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
@ -556,9 +557,9 @@ limit acts varies from pattern to pattern. This limit is of use only if it is
set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
when the PCRE2 library is compiled, with the default defaults being very large
and so effectively unlimited.
There are no short forms for these options. The default limits can be set
when the PCRE2 library is compiled; if they are not specified, the defaults
are very large and so effectively unlimited.
</P>
<P>
\fB--max-buffer-size=<i>number</i>


@ -54,9 +54,9 @@ There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The default
limit can be specified when PCRE2 is built; the default default is 250. An
application can change this limit by calling pcre2_set_parens_nest_limit() to
set the limit in a compile context.
limit can be specified when PCRE2 is built; if not, the default is set to 250.
An application can change this limit by calling pcre2_set_parens_nest_limit()
to set the limit in a compile context.
</P>
<P>
The maximum length of name for a named subpattern is 32 code units, and the


@ -85,7 +85,7 @@ ungreedy repetition quantifiers are specified in the pattern.
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
capturing parentheses and backreferences.
</P>
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
<P>
@ -158,7 +158,7 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
</P>
<P>
3. Because no substrings are captured, back references within the pattern are
3. Because no substrings are captured, backreferences within the pattern are
not supported, and cause errors if encountered.
</P>
<P>
@ -215,7 +215,7 @@ because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses and back references are not supported.
2. Capturing parentheses and backreferences are not supported.
</P>
<P>
3. Although atomic groups are supported, their use does not provide the


@ -31,7 +31,7 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a>
<li><a name="TOC17" href="#SEC17">REPETITION</a>
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACK REFERENCES</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
@ -196,7 +196,7 @@ be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kilobytes.
specified in kibibytes (units of 1024 bytes).
</P>
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
@ -342,7 +342,7 @@ In particular, if you want to match a backslash, you write \\.
</P>
<P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
backslash. All other characters (in particular, those whose code points are
greater than 127) are treated as literals.
</P>
<P>
@ -390,7 +390,7 @@ these escapes are as follows:
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (default mode)
@ -438,13 +438,13 @@ follows is itself an octal digit.
The escape \o must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as octal
numbers greater than 0777, and it also allows octal numbers and back references
numbers greater than 0777, and it also allows octal numbers and backreferences
to be unambiguously specified.
</P>
<P>
For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{} or \x{} to specify character
numbers, and \g{} to specify back references. The following paragraphs
numbers, and \g{} to specify backreferences. The following paragraphs
describe the old, ambiguous syntax.
</P>
<P>
@ -455,7 +455,7 @@ and Perl has changed over time, causing PCRE2 also to change.
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a <i>back reference</i>. A
expression, the entire sequence is taken as a <i>backreference</i>. A
description of how this works is given
<a href="#backreferences">later,</a>
following the discussion of
@ -470,13 +470,13 @@ for themselves. For example, outside a character class:
<pre>
\040 is another way of writing an ASCII space
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
\7 is always a back reference
\11 might be a back reference, or another way of writing a tab
\7 is always a backreference
\11 might be a backreference, or another way of writing a tab
\011 is always a tab
\0113 is a tab followed by the character "3"
\113 might be a back reference, otherwise the character with octal code 113
\377 might be a back reference, otherwise the value 255 (decimal)
\81 is always a back reference .sp
\113 might be a backreference, otherwise the character with octal code 113
\377 might be a backreference, otherwise the value 255 (decimal)
\81 is always a backreference
</pre>
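The rule stated above for a backslash followed by a non-zero digit outside a character class can be sketched in C as follows. This is only an illustration of the documented rule, not PCRE2's actual parsing code; the function name and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative helper, not part of the PCRE2 API.
 * digits:      the decimal digits that follow the backslash; the first
 *              digit is assumed to be in the range 1-9.
 * prev_groups: number of capturing left parentheses seen so far.
 * Returns true if the whole sequence is taken as a backreference;
 * false means it falls back to the octal interpretation. */
static bool is_backreference(const char *digits, unsigned long prev_groups)
{
    unsigned long n = strtoul(digits, NULL, 10);

    if (n < 10)
        return true;                 /* \1 to \9 are always backreferences */
    if (digits[0] == '8' || digits[0] == '9')
        return true;                 /* cannot start an octal number */
    return prev_groups >= n;         /* enough previous capturing groups */
}
```

With this sketch, "7" and "81" are backreferences however many groups precede them, while "11" is a backreference only when at least 11 capturing groups have already been opened.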
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
@ -512,10 +512,10 @@ limited to certain values, as follows:
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid codepoint
All UTF modes no greater than 0x10ffff and a valid code point
</pre>
Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" codepoints). The check for these can be disabled by the
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of <b>pcre2_compile()</b> by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
</P>
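As a rough sketch, the limits in the table above and the surrogate check could be expressed like this in C. The enum, the function name, and the allow_surrogates flag (standing in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) are hypothetical names chosen for illustration; they are not PCRE2 identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mode names for illustration only. */
typedef enum {
    MODE_8BIT_NON_UTF,
    MODE_16BIT_NON_UTF,
    MODE_32BIT_NON_UTF,
    MODE_UTF
} cu_mode;

/* Returns true if a character value given in hexadecimal is within the
 * limit for the mode. In UTF modes, surrogates (0xd800 to 0xdfff) are
 * rejected unless allow_surrogates is set. */
static bool hex_value_allowed(uint32_t value, cu_mode mode,
                              bool allow_surrogates)
{
    switch (mode) {
    case MODE_8BIT_NON_UTF:
        return value <= 0xffu;
    case MODE_16BIT_NON_UTF:
        return value <= 0xffffu;
    case MODE_32BIT_NON_UTF:
        return true;                 /* every uint32_t is <= 0xffffffff */
    case MODE_UTF:
        if (value > 0x10ffffu)
            return false;
        if (value >= 0xd800u && value <= 0xdfffu)
            return allow_surrogates;
        return true;
    }
    return false;
}
```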
@ -544,12 +544,12 @@ is set, \U matches a "U" character, and \u can be used to define a character
by code point, as described above.
</P>
<br><b>
Absolute and relative back references
Absolute and relative backreferences
</b><br>
<P>
The sequence \g followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative back reference. A named back reference
can be coded as \g{name}. Back references are discussed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \g{name}. Backreferences are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
@ -563,7 +563,7 @@ a number enclosed either in angle brackets or single quotes, is an alternative
syntax for referencing a subpattern as a "subroutine". Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a back reference; the latter is a
synonymous. The former is a backreference; the latter is a
<a href="#subpatternsassubroutines">subroutine</a>
call.
<a name="genericchartypes"></a></P>
@ -694,7 +694,7 @@ line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
</P>
<P>
In other modes, two additional characters whose codepoints are greater than 255
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
</P>
@ -729,8 +729,8 @@ Unicode character properties
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
characters whose code points are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
<pre>
@ -1037,7 +1037,7 @@ joiner" characters. Characters with the "mark" property always have the
modifier). Extending characters are allowed before the modifier.
</P>
<P>
7. Do not break within emoji zwj sequences (zero-width jointer followed by
7. Do not break within emoji zwj sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
</P>
<P>
@ -1731,7 +1731,7 @@ numbers underneath show in which buffer the captured content will be stored.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
</pre>
A back reference to a numbered subpattern uses the most recent value that is
A backreference to a numbered subpattern uses the most recent value that is
set for that number by any subpattern. The following pattern matches "abcabc"
or "defdef":
<pre>
@ -1771,7 +1771,7 @@ have different names, but PCRE2 does not.
In PCRE2, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or
(?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. References to capturing
parentheses from other parts of the pattern, such as
<a href="#backreferences">back references,</a>
<a href="#backreferences">backreferences,</a>
<a href="#recursion">recursion,</a>
and
<a href="#conditions">conditions,</a>
@ -1811,7 +1811,7 @@ for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was.
</P>
<P>
If you make a back reference to a non-unique named subpattern from elsewhere in
If you make a backreference to a non-unique named subpattern from elsewhere in
the pattern, the subpatterns to which the name refers are checked in the order
in which they appear in the overall pattern. The first one that is set is used
for the reference. For example, this pattern matches both "foofoo" and
@ -1859,7 +1859,7 @@ items:
the \R escape sequence
an escape such as \d or \pL that matches a single character
a character class
a back reference
a backreference
a parenthesized subpattern (including most assertions)
a subroutine call to a subpattern (recursive or otherwise)
</pre>
@ -1980,7 +1980,7 @@ alternatively, using ^ to indicate anchoring explicitly.
</P>
<P>
However, there are some cases where the optimization cannot be used. When .*
is inside capturing parentheses that are the subject of a back reference
is inside capturing parentheses that are the subject of a backreference
elsewhere in the pattern, a match at the start may fail where a later one
succeeds. Consider, for example:
<pre>
@ -2121,30 +2121,30 @@ an atomic group, like this:
</pre>
sequences of non-digits cannot be broken, and failure happens quickly.
<a name="backreferences"></a></P>
<br><a name="SEC19" href="#TOC1">BACK REFERENCES</a><br>
<br><a name="SEC19" href="#TOC1">BACKREFERENCES</a><br>
<P>
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
possibly further digits) is a backreference to a capturing subpattern earlier
(that is, to its left) in the pattern, provided there have been that many
previous capturing left parentheses.
</P>
<P>
However, if the decimal number following the backslash is less than 8, it is
always taken as a back reference, and causes an error only if there are not
always taken as a backreference, and causes an error only if there are not
that many capturing left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of the reference for
numbers less than 8. A "forward back reference" of this type can make sense
numbers less than 8. A "forward backreference" of this type can make sense
when a repetition is involved and the subpattern to the right has participated
in an earlier iteration.
</P>
<P>
It is not possible to have a numerical "forward back reference" to a subpattern
It is not possible to have a numerical "forward backreference" to a subpattern
whose number is 8 or more using this syntax because a sequence such as \50 is
interpreted as a character defined in octal. See the subsection entitled
"Non-printing characters"
<a href="#digitsafterbackslash">above</a>
for further details of the handling of digits following a backslash. There is
no such problem when named parentheses are used. A back reference to any
no such problem when named parentheses are used. A backreference to any
subpattern is possible using named parentheses (see below).
</P>
<P>
@ -2175,7 +2175,7 @@ of forward reference can be useful it patterns that repeat. Perl does not
support the use of + in this way.
</P>
<P>
A back reference matches whatever actually matched the capturing subpattern in
A backreference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
@ -2185,7 +2185,7 @@ below for a way of doing that). So the pattern
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
back reference, the case of letters is relevant. For example,
backreference, the case of letters is relevant. For example,
<pre>
((?i)rah)\s+\1
</pre>
@ -2193,10 +2193,10 @@ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capturing subpattern is matched caselessly.
</P>
<P>
There are several different ways of writing back references to named
There are several different ways of writing backreferences to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k&#60;name&#62; or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
back reference syntax, in which \g can be used for both numeric and named
backreference syntax, in which \g can be used for both numeric and named
references, is also supported. We could rewrite the above example in any of
the following ways:
<pre>
@ -2209,30 +2209,30 @@ A subpattern that is referenced by name may appear in the pattern before or
after the reference.
</P>
<P>
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
There may be more than one backreference to the same subpattern. If a
subpattern has not actually been used in a particular match, any backreferences
to it always fail by default. For example, the pattern
<pre>
(a|(bc))\2
</pre>
always fails if it starts to match "a" rather than "bc". However, if the
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
unset value matches an empty string.
</P>
<P>
Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential back reference number.
following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to
terminate the back reference. If the PCRE2_EXTENDED option is set, this can be
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
white space. Otherwise, the \g{ syntax or an empty comment (see
<a href="#comments">"Comments"</a>
below) can be used.
</P>
<br><b>
Recursive back references
Recursive backreferences
</b><br>
<P>
A back reference that occurs inside the parentheses to which it refers fails
A backreference that occurs inside the parentheses to which it refers fails
when the subpattern is first used, so, for example, (a\1) never matches.
However, such references can be useful inside repeated subpatterns. For
example, the pattern
@ -2240,14 +2240,14 @@ example, the pattern
(a|b\1)+
</pre>
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
the subpattern, the backreference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
that the first iteration does not need to match the backreference. This can be
done using alternation, as in the example above, or by a quantifier with a
minimum of zero.
</P>
<P>
Back references of this type cause the group that they reference to be treated
Backreferences of this type cause the group that they reference to be treated
as an
<a href="#atomicgroup">atomic group.</a>
Once the whole group has been matched, a subsequent matching failure cannot
@ -2397,10 +2397,10 @@ that is, a "subroutine" call into a group that is already active,
is not supported.
</P>
<P>
Perl does not support back references in lookbehinds. PCRE2 does support them,
Perl does not support backreferences in lookbehinds. PCRE2 does support them,
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
must not be set, there must be no use of (?| in the pattern (it creates
duplicate subpattern numbers), and if the back reference is by name, the name
duplicate subpattern numbers), and if the backreference is by name, the name
must be unique. Of course, the referenced subpattern must itself be of fixed
length. The following pattern matches words containing at least two characters
that begin and end with the same character:
@ -2882,7 +2882,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
^(.)(\1|a(?2))
</pre>
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \1 fails to match "b", the second
the second group, when the backreference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
@ -2943,7 +2943,7 @@ plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g&#60;-1&#62;)
</pre>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a back reference; the latter is a subroutine call.
synonymous. The former is a backreference; the latter is a subroutine call.
</P>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<P>


@ -132,14 +132,14 @@ When a pattern that is compiled with this flag is passed to <b>regexec()</b> for
matching, the <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
because it disables the use of backreferences.
<pre>
REG_PEND
</pre>
If this option is set, the <b>reg_endp</b> field in the <i>preg</i> structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
now contain binary zeroes, which are treated as data characters. Without
now contain binary zeros, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
@ -248,10 +248,10 @@ function.
<pre>
REG_STARTEND
</pre>
When this option is set, the subject string is starts at <i>string</i> +
When this option is set, the subject string starts at <i>string</i> +
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
should point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the only
zeros within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
</P>
<P>


@ -442,7 +442,7 @@ of the newline or \R options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
<pre>
(*LIMIT_DEPTH=d) set the backtracking limit to d
(*LIMIT_HEAP=d) set the heap size limit to d kilobytes
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
(*LIMIT_MATCH=d) set the match limit to d
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching


@ -129,7 +129,7 @@ to occur).
UTF-8 (in its original definition) is not capable of encoding values greater
than 0x7fffffff, but such values can be handled by the 32-bit library. When
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
0x80000000 is added to the character's value. This is the only way of passing
such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
@ -264,7 +264,7 @@ Do not output the version number of <b>pcre2test</b> at the start of execution.
<P>
<b>-S</b> <i>size</i>
On Unix-like systems, set the size of the run-time stack to <i>size</i>
megabytes.
mebibytes (units of 1024*1024 bytes).
</P>
<P>
<b>-subject</b> <i>modifier-list</i>
@ -679,8 +679,8 @@ Newline and \R handling
<P>
The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
\R matches any Unicode newline sequence. The default is specified when PCRE2
is built, with the default default being Unicode.
\R matches any Unicode newline sequence. The default can be specified when
PCRE2 is built; if it is not, the default is set to Unicode.
</P>
<P>
The <b>newline</b> modifier specifies which characters are to be interpreted as
@ -1418,11 +1418,11 @@ Setting the JIT stack size
<P>
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
optimization is not being used. The value is a number of kilobytes. Setting
zero reverts to the default of 32K. Providing a stack that is larger than the
default is necessary only for very complicated patterns. If <b>jitstack</b> is
set non-zero on a subject line it overrides any value that was set on the
pattern.
optimization is not being used. The value is a number of kibibytes (units of
1024 bytes). Setting zero reverts to the default of 32KiB. Providing a stack
that is larger than the default is necessary only for very complicated
patterns. If <b>jitstack</b> is set non-zero on a subject line it overrides any
value that was set on the pattern.
</P>
<br><b>
Setting heap, match, and depth limits
@ -1468,10 +1468,10 @@ and non-recursive, to the internal matching function, thus controlling the
overall amount of computing resource that is used.
</P>
<P>
For both kinds of matching, the <i>heap_limit</i> number (which is in kilobytes)
limits the amount of heap memory used for matching. A value of zero disables
the use of any heap memory; many simple pattern matches can be done without
using the heap, so this is not an unreasonable setting.
For both kinds of matching, the <i>heap_limit</i> number, which is in kibibytes
(units of 1024 bytes), limits the amount of heap memory used for matching. A
value of zero disables the use of any heap memory; many simple pattern matches
can be done without using the heap, so zero is not an unreasonable setting.
</P>
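Because these limits are given in kibibytes, converting one to a byte count needs some care with integer width. The following C sketch (the helper name is hypothetical and not part of PCRE2) widens to 64 bits before multiplying. The value 20000000, which this commit's CMake file uses as the build default for the heap limit, corresponds to 20480000000 bytes, which does not fit in 32 bits.

```c
#include <stdint.h>

/* Illustrative helper, not part of PCRE2: convert a limit given in
 * kibibytes (units of 1024 bytes) to bytes. The operand is widened to
 * 64 bits first so that large limits cannot overflow. */
static uint64_t kib_to_bytes(uint32_t kib)
{
    return (uint64_t)kib * 1024u;
}
```

For example, kib_to_bytes(32) gives 32768, the default JIT stack size described earlier.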
<br><b>
Showing MARK names


@ -53,7 +53,7 @@ compatibility with Perl 5.6. PCRE2 does not support this.
WIDE CHARACTERS AND UTF MODES
</b><br>
<P>
Codepoints less than 256 can be specified in patterns by either braced or
Code points less than 256 can be specified in patterns by either braced or
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger
values have to use braced sequences. Unbraced octal code points up to \777 are
also recognized; larger ones can be coded using \o{...}.
@ -116,7 +116,7 @@ CASE-EQUIVALENCE IN UTF MODES
Case-insensitive matching in a UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two codepoints that
few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
</P>
<br><b>

File diff suppressed because it is too large


@ -53,7 +53,7 @@ The option bits are:
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)


@ -65,7 +65,7 @@ subject that is terminated by a binary zero code unit. The options are:
match even if there is a full match
.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
match if no full matches are found
.sp
For details of partial matching, see the
.\" HREF


@ -24,7 +24,7 @@ request are as follows:
.sp
PCRE2_INFO_ALLOPTIONS Final options after compiling
PCRE2_INFO_ARGOPTIONS Options passed to \fBpcre2_compile()\fP
PCRE2_INFO_BACKREFMAX Number of highest back reference
PCRE2_INFO_BACKREFMAX Number of highest backreference
PCRE2_INFO_BSR What \eR matches:
PCRE2_BSR_UNICODE: Unicode line endings
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only


@ -1,4 +1,4 @@
.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "16 June 2017" "PCRE2 10.30"
.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "16 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS


@ -16,7 +16,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function is part of an experimental set of pattern conversion functions.
It sets the component separator character that is used when converting globs.
The second argument must one of the characters forward slash, backslash, or
The second argument must be one of the characters forward slash, backslash, or
dot. The default is backslash when running under Windows, otherwise forward
slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
the second argument is invalid.


@ -1,4 +1,4 @@
.TH PCRE2_SET_DEPTH_LIMIT 3 "11 April 2017" "PCRE2 10.30"
.TH PCRE2_SET_HEAP_LIMIT 3 "11 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS


@ -497,10 +497,10 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
.P
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. However, the newline
convention can be changed by an application when calling \fBpcre2_compile()\fP,
or it can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the
If it is not, the default is set to LF, which is the Unix standard. However,
the newline convention can be changed by an application when calling
\fBpcre2_compile()\fP, or it can be specified by special text at the start of
the pattern itself; this overrides any other settings. See the
.\" HREF
\fBpcre2pattern\fP
.\"
@ -885,19 +885,20 @@ offset limit. In other words, whichever limit comes first is used.
.B " uint32_t \fIvalue\fP);"
.fi
.sp
The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum
amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking
information when running an interpretive match. This limit also applies to
\fBpcre2_dfa_match()\fP, which may use the heap when processing patterns with a
lot of nested pattern recursion or lookarounds or atomic groups. This limit
does not apply to matching with the JIT optimization, which has its own memory
control arrangements (see the
The \fIheap_limit\fP parameter specifies, in units of kibibytes (1024 bytes),
the maximum amount of heap memory that \fBpcre2_match()\fP may use to hold
backtracking information when running an interpretive match. This limit also
applies to \fBpcre2_dfa_match()\fP, which may use the heap when processing
patterns with a lot of nested pattern recursion or lookarounds or atomic
groups. This limit does not apply to matching with the JIT optimization, which
has its own memory control arrangements (see the
.\" HREF
\fBpcre2jit\fP
.\"
documentation for more details). If the limit is reached, the negative error
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit is set when PCRE2 is
built; the default default is very large and is essentially "unlimited".
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
is built; if it is not, the default is set very large and is essentially
"unlimited".
.P
A value for the heap limit may also be supplied by an item at the start of a
pattern of the form
@ -975,7 +976,7 @@ The depth limit is not relevant, and is ignored, when matching is done using
JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
uses it to limit the depth of nested internal recursive function calls that
implement atomic groups, lookaround assertions, and pattern recursions. This
limits, indirectly, the amount of system stack this is used. It was more useful
limits, indirectly, the amount of system stack that is used. It was more useful
in versions before 10.32, when stack memory was used for local workspace
vectors for recursive function calls. From version 10.32, only local variables
are allocated on the stack and as each call uses only a few hundred bytes, even
@ -989,11 +990,11 @@ using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is
probably better to limit heap usage directly by calling
\fBpcre2_set_heap_limit()\fP.
.P
The default value for the depth limit can be set when PCRE2 is built; the
default default is the same value as the default for the match limit. If the
limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP returns
PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
item at the start of a pattern of the form
The default value for the depth limit can be set when PCRE2 is built; if it is
not, the default is set to the same value as the default for the match limit.
If the limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
supplied by an item at the start of a pattern of the form
.sp
(*LIMIT_DEPTH=ddd)
.sp
@ -1050,7 +1051,7 @@ given with \fBpcre2_set_depth_limit()\fP above.
.sp
PCRE2_CONFIG_HEAPLIMIT
.sp
The output is a uint32_t integer that gives, in kilobytes, the default limit
The output is a uint32_t integer that gives, in kibibytes, the default limit
for the amount of heap memory used by \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP. Further details are given with
\fBpcre2_set_heap_limit()\fP above.
@ -1367,7 +1368,7 @@ If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
properties are used for all characters with more than one other case, and for
all characters whose code points are greater than U+007f. For lower valued
all characters whose code points are greater than U+007F. For lower valued
characters with only one other case, a lookup table is used for speed. When
PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
and higher code points (available only in 16-bit or 32-bit mode) are treated as
@ -1489,7 +1490,7 @@ error.
.sp
PCRE2_MATCH_UNSET_BACKREF
.sp
If this option is set, a back reference to an unset subpattern group matches an
If this option is set, a backreference to an unset subpattern group matches an
empty string (by default this causes the current matching alternative to fail).
A pattern such as (\e1)(a) succeeds when this option is set (assuming it can
find an "a" in the subject), whereas it fails by default, for Perl
@ -1550,8 +1551,8 @@ If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). This is the same as Perl's /n option.
Note that, when this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
Note that, when this option is set, references to capturing groups
(backreferences or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
.sp
PCRE2_NO_AUTO_POSSESS
@ -1570,7 +1571,7 @@ If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \eA or \eG or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
group that is the subject of a backreference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
@ -1956,7 +1957,7 @@ following are true:
.* is not in an atomic group
.\" JOIN
.* is not in a capturing group that is the subject
of a back reference
of a backreference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
@ -1966,20 +1967,20 @@ options returned for PCRE2_INFO_ALLOPTIONS.
.sp
PCRE2_INFO_BACKREFMAX
.sp
Return the number of the highest back reference in the pattern. The third
Return the number of the highest backreference in the pattern. The third
argument should point to a \fBuint32_t\fP variable. Named subpatterns acquire
numbers as well as names, and these count towards the highest back reference.
Back references such as \e4 or \eg{12} match the captured characters of the
numbers as well as names, and these count towards the highest backreference.
Backreferences such as \e4 or \eg{12} match the captured characters of the
given group, but in addition, the check that a capturing group is set in a
conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
returned if there are no back references.
conditional subpattern such as (?(3)a|b) is also a backreference. Zero is
returned if there are no backreferences.
.sp
PCRE2_INFO_BSR
.sp
The output is a uint32_t whose value indicates what character sequences the \eR
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR matches
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR
matches only CR, LF, or CRLF.
The output is a uint32_t integer whose value indicates what character sequences
the \eR escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
that \eR matches only CR, LF, or CRLF.
.sp
PCRE2_INFO_CAPTURECOUNT
.sp
@ -1991,10 +1992,10 @@ The third argument should point to an \fBuint32_t\fP variable.
.sp
If the pattern set a backtracking depth limit by including an item of the form
(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
.sp
PCRE2_INFO_FIRSTBITMAP
.sp
@ -2004,7 +2005,7 @@ values for the first code unit in any match. For example, a pattern that starts
with [abc] results in a table with three bits set. When code unit values
greater than 255 are supported, the flag bit for 255 means "any code unit of
value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to an
returned. Otherwise NULL is returned. The third argument should point to a
\fBconst uint8_t *\fP variable.
.sp
PCRE2_INFO_FIRSTCODETYPE
@ -2031,7 +2032,7 @@ and up to 0xffffffff when not using UTF-32 mode.
.sp
Return the size (in bytes) of the data frames that are used to remember
backtracking positions when the pattern is processed by \fBpcre2_match()\fP
without the use of JIT. The third argument should point to an \fBsize_t\fP
without the use of JIT. The third argument should point to a \fBsize_t\fP
variable. The frame size depends on the number of capturing parentheses in the
pattern. Each additional capturing group adds two PCRE2_SIZE variables.
.sp
@ -2051,10 +2052,10 @@ the equivalent hexadecimal or octal escape sequences.
.sp
If the pattern set a heap memory limit by including an item of the form
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
.sp
PCRE2_INFO_JCHANGED
.sp
@ -2098,15 +2099,15 @@ in such cases.
.sp
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
that this limit will only be used during matching if it is less than the limit
set or defaulted by the caller of the match function.
should point to a uint32_t integer. If no such value has been set, the call to
\fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
.sp
PCRE2_INFO_MAXLOOKBEHIND
.sp
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to an unsigned 32-bit
assertion in the pattern. The third argument should point to a uint32_t
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \eb and \eB
require a one-character lookbehind. \eA also registers a one-character
@ -2393,7 +2394,7 @@ zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
must point to the start of a character, or to the end of the subject (in UTF-32
mode, one code unit equals one character, so all offsets are valid). Like the
pattern string, the subject may contain binary zeroes.
pattern string, the subject may contain binary zeros.
.P
A non-zero starting offset is useful when searching for another match in the
same subject by calling \fBpcre2_match()\fP again after a previous success.
@ -3562,12 +3563,12 @@ There are in addition the following errors that are specific to
.sp
This return is given if \fBpcre2_dfa_match()\fP encounters an item in the
pattern that it does not support, for instance, the use of \eC in a UTF mode or
a back reference.
a backreference.
.sp
PCRE2_ERROR_DFA_UCOND
.sp
This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
that uses a back reference for the condition, or a test for recursion in a
that uses a backreference for the condition, or a test for recursion in a
specific group. These are not supported.
.sp
PCRE2_ERROR_DFA_WSSIZE


@ -216,7 +216,7 @@ separator, U+2028), and PS (paragraph separator, U+2029). The final option is
.sp
--enable-newline-is-nul
.sp
which causes NUL (binary zero) is set as the default line-ending character.
which causes NUL (binary zero) to be set as the default line-ending character.
.P
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
@ -281,8 +281,8 @@ The \fBpcre2_match()\fP function starts out using a 20K vector on the system
stack to record backtracking points. The more nested backtracking points there
are (that is, the deeper the search tree), the more memory is needed. If the
initial vector is not large enough, heap memory is used, up to a certain limit,
which is specified in kilobytes. The limit can be changed at run time, as
described in the
which is specified in kibibytes (units of 1024 bytes). The limit can be changed
at run time, as described in the
.\" HREF
\fBpcre2api\fP
.\"
@ -291,7 +291,7 @@ change this by a setting such as
.sp
--with-heap-limit=500
.sp
which limits the amount of heap to 500 kilobytes. This limit applies only to
which limits the amount of heap to 500 KiB. This limit applies only to
interpretive matching in \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, which
may also use the heap for internal workspace when processing complicated
patterns. This limit does not apply when JIT (which has its own memory
@ -552,7 +552,7 @@ generated from the string.
Setting --enable-fuzz-support also causes a binary called \fBpcre2fuzzcheck\fP
to be created. This is normally run under valgrind or used when PCRE2 is
compiled with address sanitizing enabled. It calls the fuzzing function and
outputs information about it is doing. The input strings are specified by
outputs information about what it is doing. The input strings are specified by
arguments: if an argument starts with "=" the rest of it is a literal input
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.


@ -128,7 +128,7 @@ start only after an internal newline or at the beginning of the subject, and
branch, automatic anchoring occurs if all branches are anchorable.
.P
This optimization is disabled, however, if .* is in an atomic group or if there
is a back reference to the capturing group in which it appears. It is also
is a backreference to the capturing group in which it appears. It is also
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
callouts does not affect it.
.P


@ -19,7 +19,7 @@ page.
2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle: PCRE2 optimizes this to run the
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \eb* (but not \eb{3}), but these do not seem to have any use.
.P
@ -62,8 +62,8 @@ Note the following examples:
The \eQ...\eE sequence is recognized both inside and outside character classes.
.P
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, there is support PCRE2's "callout" feature, which
allows an external function to be called during pattern matching. See the
constructions. However, PCRE2 does have a "callout" feature, which allows an
external function to be called during pattern matching. See the
.\" HREF
\fBpcre2callout\fP
.\"
@ -131,7 +131,7 @@ list is with respect to Perl 5.26:
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
.sp
(b) From PCRE2 10.23, back references to groups of fixed length are supported
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
.sp


@ -57,9 +57,10 @@ controlled by parameters that can be set by the \fB--buffer-size\fP and
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by \fB--max-buffer-size\fP. The
default values for these parameters are specified when \fBpcre2grep\fP is
built, with the default defaults being 20K and 1M respectively. An error occurs
if a line is too long and the buffer can no longer be expanded.
default values for these parameters can be set when \fBpcre2grep\fP is
built; if nothing is specified, the defaults are set to 20K and 1M
respectively. An error occurs if a line is too long and the buffer can no
longer be expanded.
.P
The block of memory that is actually used is three times the "buffer size", to
allow for buffering "before" and "after" lines. If the buffer size is too
@ -434,13 +435,13 @@ short form for this option.
When this option is given, non-compressed input is read and processed line by
line, and the output is flushed after each write. By default, input is read in
large chunks, unless \fBpcre2grep\fP can determine that it is reading from a
terminal (which is currently possible only in Unix-like environments). Output
to terminal is normally automatically flushed by the operating system. This
option can be useful when the input or output is attached to a pipe and you do
not want \fBpcre2grep\fP to buffer up large amounts of data. However, its use
will affect performance, and the \fB-M\fP (multiline) option ceases to work.
When input is from a compressed .gz or .bz2 file, \fB--line-buffered\fP is
ignored.
terminal (which is currently possible only in Unix-like environments or
Windows). Output to terminal is normally automatically flushed by the operating
system. This option can be useful when the input or output is attached to a
pipe and you do not want \fBpcre2grep\fP to buffer up large amounts of data.
However, its use will affect performance, and the \fB-M\fP (multiline) option
ceases to work. When input is from a compressed .gz or .bz2 file,
\fB--line-buffered\fP is ignored.
.TP
\fB--line-offsets\fP
Instead of showing lines or parts of lines that match, show each match as a
@ -470,11 +471,11 @@ is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
counter that is incremented each time around its main processing loop. If the
value set by \fB--match-limit\fP is reached, an error occurs.
.sp
The \fB--heap-limit\fP option specifies, as a number of kilobytes, the amount
of heap memory that may be used for matching. Heap memory is needed only if
matching the pattern requires a significant number of nested backtracking
points to be remembered. This parameter can be set to zero to forbid the use of
heap memory altogether.
The \fB--heap-limit\fP option specifies, as a number of kibibytes (units of
1024 bytes), the amount of heap memory that may be used for matching. Heap
memory is needed only if matching the pattern requires a significant number of
nested backtracking points to be remembered. This parameter can be set to zero
to forbid the use of heap memory altogether.
.sp
The \fB--depth-limit\fP option limits the depth of nested backtracking points,
which indirectly limits the amount of memory that is used. The amount of memory
@ -483,9 +484,9 @@ parentheses in the pattern, so the amount of memory that is used before this
limit acts varies from pattern to pattern. This limit is of use only if it is
set smaller than \fB--match-limit\fP.
.sp
There are no short forms for these options. The default settings are specified
when the PCRE2 library is compiled, with the default defaults being very large
and so effectively unlimited.
There are no short forms for these options. The default limits can be set
when the PCRE2 library is compiled; if they are not specified, the defaults
are very large and so effectively unlimited.
.TP
\fB--max-buffer-size=\fInumber\fP
This limits the expansion of the processing buffer, whose initial size can be


@ -56,10 +56,10 @@ DESCRIPTION
that is obtained at the start of processing. If an input file contains
very long lines, a larger buffer may be needed; this is handled by
automatically extending the buffer, up to the limit specified by --max-
buffer-size. The default values for these parameters are specified when
pcre2grep is built, with the default defaults being 20K and 1M respec-
tively. An error occurs if a line is too long and the buffer can no
longer be expanded.
buffer-size. The default values for these parameters can be set when
pcre2grep is built; if nothing is specified, the defaults are set to
20K and 1M respectively. An error occurs if a line is too long and the
buffer can no longer be expanded.
The block of memory that is actually used is three times the "buffer
size", to allow for buffering "before" and "after" lines. If the buffer
@ -475,14 +475,14 @@ OPTIONS
processed line by line, and the output is flushed after each
write. By default, input is read in large chunks, unless
pcre2grep can determine that it is reading from a terminal
(which is currently possible only in Unix-like environments).
Output to terminal is normally automatically flushed by the
operating system. This option can be useful when the input or
output is attached to a pipe and you do not want pcre2grep to
buffer up large amounts of data. However, its use will affect
performance, and the -M (multiline) option ceases to work.
When input is from a compressed .gz or .bz2 file, --line-
buffered is ignored.
(which is currently possible only in Unix-like environments
or Windows). Output to terminal is normally automatically
flushed by the operating system. This option can be useful
when the input or output is attached to a pipe and you do not
want pcre2grep to buffer up large amounts of data. However,
its use will affect performance, and the -M (multiline)
option ceases to work. When input is from a compressed .gz or
.bz2 file, --line-buffered is ignored.
--line-offsets
Instead of showing lines or parts of lines that match, show
@ -517,12 +517,12 @@ OPTIONS
processing loop. If the value set by --match-limit is
reached, an error occurs.
The --heap-limit option specifies, as a number of kilobytes,
the amount of heap memory that may be used for matching. Heap
memory is needed only if matching the pattern requires a sig-
nificant number of nested backtracking points to be remem-
bered. This parameter can be set to zero to forbid the use of
heap memory altogether.
The --heap-limit option specifies, as a number of kibibytes
(units of 1024 bytes), the amount of heap memory that may be
used for matching. Heap memory is needed only if matching the
pattern requires a significant number of nested backtracking
points to be remembered. This parameter can be set to zero to
forbid the use of heap memory altogether.
The --depth-limit option limits the depth of nested back-
tracking points, which indirectly limits the amount of memory
@ -532,10 +532,10 @@ OPTIONS
limit acts varies from pattern to pattern. This limit is of
use only if it is set smaller than --match-limit.
There are no short forms for these options. The default set-
tings are specified when the PCRE2 library is compiled, with
the default defaults being very large and so effectively
unlimited.
There are no short forms for these options. The default lim-
its can be set when the PCRE2 library is compiled; if they
are not specified, the defaults are very large and so effec-
tively unlimited.
--max-buffer-size=number
This limits the expansion of the processing buffer, whose


@ -38,9 +38,9 @@ There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The default
limit can be specified when PCRE2 is built; the default default is 250. An
application can change this limit by calling pcre2_set_parens_nest_limit() to
set the limit in a compile context.
limit can be specified when PCRE2 is built; if not, the default is set to 250.
An application can change this limit by calling pcre2_set_parens_nest_limit()
to set the limit in a compile context.
.P
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.


@ -67,7 +67,7 @@ ungreedy repetition quantifiers are specified in the pattern.
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
capturing parentheses and backreferences.
.
.
.SH "THE ALTERNATIVE MATCHING ALGORITHM"
@ -134,7 +134,7 @@ straightforward to keep track of captured substrings for the different matching
possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
.P
3. Because no substrings are captured, back references within the pattern are
3. Because no substrings are captured, backreferences within the pattern are
not supported, and cause errors if encountered.
.P
4. For the same reason, conditional expressions that use a backreference as the
@ -188,7 +188,7 @@ The alternative algorithm suffers from a number of disadvantages:
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
.P
2. Capturing parentheses and back references are not supported.
2. Capturing parentheses and backreferences are not supported.
.P
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.


@ -163,7 +163,7 @@ be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kilobytes.
specified in kibibytes (units of 1024 bytes).
.P
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
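
The two points in the paragraph above (a pattern can lower a limit but not raise it, and the heap limit is counted in units of 1024 bytes) can be illustrated with a short sketch. The function names are hypothetical and exist only for this example; they are not part of the PCRE2 API.

```python
KIB = 1024  # one kibibyte, the unit used for the heap limit


def heap_limit_bytes(limit_kib):
    """Convert a heap limit given in kibibytes to bytes."""
    return limit_kib * KIB


def effective_limit(caller_limit, pattern_limits):
    """Combine the caller's limit with any limits set inside the pattern.

    A pattern setting only has an effect if it is lower than the caller's
    value, and with several settings the lowest one wins.
    """
    limit = caller_limit
    for pattern_limit in pattern_limits:
        limit = min(limit, pattern_limit)
    return limit
```

For instance, a heap limit of 500 corresponds to 512000 bytes, not 500000, which is the distinction the change from "kilobytes" to "kibibytes" makes explicit.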
@ -318,7 +318,7 @@ precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \e\e.
.P
In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
backslash. All other characters (in particular, those whose code points are
greater than 127) are treated as literals.
.P
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
@ -367,7 +367,7 @@ these escapes are as follows:
\er carriage return (hex 0D)
\et tab (hex 09)
\e0dd character with octal code 0dd
\eddd character with octal code ddd, or back reference
\eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd..
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode)
@ -410,12 +410,12 @@ follows is itself an octal digit.
The escape \eo must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as octal
numbers greater than 0777, and it also allows octal numbers and back references
numbers greater than 0777, and it also allows octal numbers and backreferences
to be unambiguously specified.
.P
For greater clarity and unambiguity, it is best to avoid following \e by a
digit greater than zero. Instead, use \eo{} or \ex{} to specify character
numbers, and \eg{} to specify back references. The following paragraphs
numbers, and \eg{} to specify backreferences. The following paragraphs
describe the old, ambiguous syntax.
.P
The handling of a backslash followed by a digit other than 0 is complicated,
@ -424,7 +424,7 @@ and Perl has changed over time, causing PCRE2 also to change.
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a \fIback reference\fP. A
expression, the entire sequence is taken as a \fIbackreference\fP. A
description of how this works is given
.\" HTML <a href="#backreferences">
.\" </a>
@ -446,20 +446,20 @@ for themselves. For example, outside a character class:
.\" JOIN
\e40 is the same, provided there are fewer than 40
previous capturing subpatterns
\e7 is always a back reference
\e7 is always a backreference
.\" JOIN
\e11 might be a back reference, or another way of
\e11 might be a backreference, or another way of
writing a tab
\e011 is always a tab
\e0113 is a tab followed by the character "3"
.\" JOIN
\e113 might be a back reference, otherwise the
\e113 might be a backreference, otherwise the
character with octal code 113
.\" JOIN
\e377 might be a back reference, otherwise
\e377 might be a backreference, otherwise
the value 255 (decimal)
.\" JOIN
\e81 is always a back reference
\e81 is always a backreference
.sp
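
The decision rule behind the examples above can be sketched as a small function. It follows only what the text states for a backslash followed by a non-zero digit outside a character class; the function name and return shape are assumptions for illustration, and this is not how PCRE2 itself is implemented.

```python
def classify_escape(digits, groups_before):
    """Classify a backslash followed by `digits` outside a character class.

    `digits` is the run of decimal digits after the backslash and is assumed
    not to start with "0". `groups_before` is the number of capturing left
    parentheses that precede the escape in the pattern.
    """
    number = int(digits)
    # Backreference if the number is below 10, starts with 8 or 9, or there
    # are at least that many earlier capturing groups.
    if number < 10 or digits[0] in "89" or groups_before >= number:
        return ("backreference", number)
    # Otherwise read up to three octal digits; the rest stand for themselves.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("octal", int(octal, 8), digits[len(octal):])
```

With no earlier groups, `classify_escape("11", 0)` yields octal value 9 (a tab), while the same escape after eleven capturing groups is a backreference to group 11. This dependence on context is why the documentation recommends \eo{}, \ex{}, and \eg{} instead.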
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
@ -492,10 +492,10 @@ limited to certain values, as follows:
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid codepoint
All UTF modes no greater than 0x10ffff and a valid code point
.sp
Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" codepoints). The check for these can be disabled by the
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of \fBpcre2_compile()\fP by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
.
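
The limits in the table above can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, and the `allow_surrogates` flag merely models the effect of PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES as described in the text, without covering any mode-specific restrictions the real option may have.

```python
def max_code_point(code_unit_width, utf):
    """Largest value allowed for the given code unit width and UTF setting."""
    if utf:
        return 0x10FFFF
    return {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[code_unit_width]


def is_valid_code_point(value, code_unit_width, utf, allow_surrogates=False):
    """Return True if `value` is acceptable under the rules in the table."""
    if value < 0 or value > max_code_point(code_unit_width, utf):
        return False
    # In UTF modes, surrogates (0xd800 to 0xdfff) are rejected unless the
    # check has been disabled.
    if utf and not allow_surrogates and 0xD800 <= value <= 0xDFFF:
        return False
    return True
```

Note that the surrogate check applies only in UTF modes; in 16-bit non-UTF mode a value such as 0xd800 is an ordinary code unit.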
@ -523,12 +523,12 @@ is set, \eU matches a "U" character, and \eu can be used to define a character
by code point, as described above.
.
.
.SS "Absolute and relative back references"
.SS "Absolute and relative backreferences"
.rs
.sp
The sequence \eg followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative back reference. A named back reference
can be coded as \eg{name}. Back references are discussed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \eg{name}. Backreferences are discussed
.\" HTML <a href="#backreferences">
.\" </a>
later,
@ -551,7 +551,7 @@ syntax for referencing a subpattern as a "subroutine". Details are discussed
later.
.\"
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
synonymous. The former is a back reference; the latter is a
synonymous. The former is a backreference; the latter is a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
subroutine
@ -692,7 +692,7 @@ U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
.P
In other modes, two additional characters whose codepoints are greater than 255
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
.P
@ -727,8 +727,8 @@ an error.
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
characters whose code points are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
with an unassigned type. The extra escape sequences are:
.sp
@ -1026,7 +1026,7 @@ joiner" characters. Characters with the "mark" property always have the
6. Do not break within emoji modifier sequences (a base character followed by a
modifier). Extending characters are allowed before the modifier.
.P
7. Do not break within emoji zwj sequences (zero-width jointer followed by
7. Do not break within emoji zwj sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
.P
8. Do not break within emoji flag sequences. That is, do not break between
@ -1724,7 +1724,7 @@ numbers underneath show in which buffer the captured content will be stored.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
.sp
A back reference to a numbered subpattern uses the most recent value that is
A backreference to a numbered subpattern uses the most recent value that is
set for that number by any subpattern. The following pattern matches "abcabc"
or "defdef":
.sp
@ -1768,7 +1768,7 @@ In PCRE2, a subpattern can be named in one of three ways: (?<name>...) or
parentheses from other parts of the pattern, such as
.\" HTML <a href="#backreferences">
.\" </a>
back references,
backreferences,
.\"
.\" HTML <a href="#recursion">
.\" </a>
@ -1811,7 +1811,7 @@ The convenience functions for extracting the data by name returns the substring
for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was.
.P
If you make a back reference to a non-unique named subpattern from elsewhere in
If you make a backreference to a non-unique named subpattern from elsewhere in
the pattern, the subpatterns to which the name refers are checked in the order
in which they appear in the overall pattern. The first one that is set is used
for the reference. For example, this pattern matches both "foofoo" and
@ -1863,7 +1863,7 @@ items:
the \eR escape sequence
an escape such as \ed or \epL that matches a single character
a character class
a back reference
a backreference
a parenthesized subpattern (including most assertions)
a subroutine call to a subpattern (recursive or otherwise)
.sp
@ -1980,7 +1980,7 @@ worth setting PCRE2_DOTALL in order to obtain this optimization, or
alternatively, using ^ to indicate anchoring explicitly.
.P
However, there are some cases where the optimization cannot be used. When .*
is inside capturing parentheses that are the subject of a back reference
is inside capturing parentheses that are the subject of a backreference
elsewhere in the pattern, a match at the start may fail where a later one
succeeds. Consider, for example:
.sp
@ -2116,23 +2116,23 @@ sequences of non-digits cannot be broken, and failure happens quickly.
.
.
.\" HTML <a name="backreferences"></a>
.SH "BACK REFERENCES"
.SH "BACKREFERENCES"
.rs
.sp
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
possibly further digits) is a backreference to a capturing subpattern earlier
(that is, to its left) in the pattern, provided there have been that many
previous capturing left parentheses.
.P
However, if the decimal number following the backslash is less than 8, it is
always taken as a back reference, and causes an error only if there are not
always taken as a backreference, and causes an error only if there are not
that many capturing left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of the reference for
numbers less than 8. A "forward back reference" of this type can make sense
numbers less than 8. A "forward backreference" of this type can make sense
when a repetition is involved and the subpattern to the right has participated
in an earlier iteration.
.P
It is not possible to have a numerical "forward back reference" to a subpattern
It is not possible to have a numerical "forward backreference" to a subpattern
whose number is 8 or more using this syntax because a sequence such as \e50 is
interpreted as a character defined in octal. See the subsection entitled
"Non-printing characters"
@ -2141,7 +2141,7 @@ interpreted as a character defined in octal. See the subsection entitled
above
.\"
for further details of the handling of digits following a backslash. There is
no such problem when named parentheses are used. A back reference to any
no such problem when named parentheses are used. A backreference to any
subpattern is possible using named parentheses (see below).
.P
Another way of avoiding the ambiguity inherent in the use of digits following a
@ -2169,7 +2169,7 @@ The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
of forward reference can be useful in patterns that repeat. Perl does not
support the use of + in this way.
.P
A back reference matches whatever actually matched the capturing subpattern in
A backreference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
.\" HTML <a href="#subpatternsassubroutines">
@ -2182,17 +2182,17 @@ below for a way of doing that). So the pattern
.sp
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
back reference, the case of letters is relevant. For example,
backreference, the case of letters is relevant. For example,
.sp
((?i)rah)\es+\e1
.sp
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capturing subpattern is matched caselessly.
.P
There are several different ways of writing back references to named
There are several different ways of writing backreferences to named
subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
\ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
back reference syntax, in which \eg can be used for both numeric and named
backreference syntax, in which \eg can be used for both numeric and named
references, is also supported. We could rewrite the above example in any of
the following ways:
.sp
@ -2204,20 +2204,20 @@ the following ways:
A subpattern that is referenced by name may appear in the pattern before or
after the reference.
.P
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
There may be more than one backreference to the same subpattern. If a
subpattern has not actually been used in a particular match, any backreferences
to it always fail by default. For example, the pattern
.sp
(a|(bc))\e2
.sp
always fails if it starts to match "a" rather than "bc". However, if the
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
unset value matches an empty string.
.P
Because there may be many capturing parentheses in a pattern, all digits
following a backslash are taken as part of a potential back reference number.
following a backslash are taken as part of a potential backreference number.
If the pattern continues with a digit character, some delimiter must be used to
terminate the back reference. If the PCRE2_EXTENDED option is set, this can be
terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
white space. Otherwise, the \eg{ syntax or an empty comment (see
.\" HTML <a href="#comments">
.\" </a>
@ -2226,10 +2226,10 @@ white space. Otherwise, the \eg{ syntax or an empty comment (see
below) can be used.
.
.
.SS "Recursive back references"
.SS "Recursive backreferences"
.rs
.sp
A back reference that occurs inside the parentheses to which it refers fails
A backreference that occurs inside the parentheses to which it refers fails
when the subpattern is first used, so, for example, (a\e1) never matches.
However, such references can be useful inside repeated subpatterns. For
example, the pattern
@ -2237,13 +2237,13 @@ example, the pattern
(a|b\e1)+
.sp
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
the subpattern, the backreference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
that the first iteration does not need to match the backreference. This can be
done using alternation, as in the example above, or by a quantifier with a
minimum of zero.
.P
Back references of this type cause the group that they reference to be treated
Backreferences of this type cause the group that they reference to be treated
as an
.\" HTML <a href="#atomicgroup">
.\" </a>
@ -2406,10 +2406,10 @@ recursion,
that is, a "subroutine" call into a group that is already active,
is not supported.
.P
Perl does not support back references in lookbehinds. PCRE2 does support them,
Perl does not support backreferences in lookbehinds. PCRE2 does support them,
but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
must not be set, there must be no use of (?| in the pattern (it creates
duplicate subpattern numbers), and if the back reference is by name, the name
duplicate subpattern numbers), and if the backreference is by name, the name
must be unique. Of course, the referenced subpattern must itself be of fixed
length. The following pattern matches words containing at least two characters
that begin and end with the same character:
@ -2899,7 +2899,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
^(.)(\e1|a(?2))
.sp
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \e1 fails to match "b", the second
the second group, when the backreference \e1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \e1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
@ -2964,7 +2964,7 @@ plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\eg<-1>)
.sp
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
synonymous. The former is a back reference; the latter is a subroutine call.
synonymous. The former is a backreference; the latter is a subroutine call.
.
.
.SH CALLOUTS

View File

@ -108,14 +108,14 @@ When a pattern that is compiled with this flag is passed to \fBregexec()\fP for
matching, the \fInmatch\fP and \fIpmatch\fP arguments are ignored, and no
captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
because it disables the use of backreferences.
.sp
REG_PEND
.sp
If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
now contain binary zeroes, which are treated as data characters. Without
now contain binary zeros, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
@ -224,10 +224,10 @@ function.
.sp
REG_STARTEND
.sp
When this option is set, the subject string is starts at \fIstring\fP +
When this option is set, the subject string starts at \fIstring\fP +
\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
should point to the first character beyond the string. There may be binary
zeroes within the subject string, and indeed, using REG_STARTEND is the only
zeros within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
.P
Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string

View File

@ -419,7 +419,7 @@ of the newline or \eR options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
.sp
(*LIMIT_DEPTH=d) set the backtracking limit to d
(*LIMIT_HEAP=d) set the heap size limit to d kilobytes
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
(*LIMIT_MATCH=d) set the match limit to d
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching

View File

@ -101,7 +101,7 @@ to occur).
UTF-8 (in its original definition) is not capable of encoding values greater
than 0x7fffffff, but such values can be handled by the 32-bit library. When
testing this library in non-UTF mode with \fButf8_input\fP set, if any
character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
0x80000000 is added to the character's value. This is the only way of passing
such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
@ -220,7 +220,7 @@ Do not output the version number of \fBpcre2test\fP at the start of execution.
.TP 10
\fB-S\fP \fIsize\fP
On Unix-like systems, set the size of the run-time stack to \fIsize\fP
megabytes.
mebibytes (units of 1024*1024 bytes).
.TP 10
\fB-subject\fP \fImodifier-list\fP
Behave as if each subject line contains the given modifiers.
@ -639,8 +639,8 @@ The effects of these modifiers are described in the following sections.
.sp
The \fBbsr\fP modifier specifies what \eR in a pattern should match. If it is
set to "anycrlf", \eR matches CR, LF, or CRLF only. If it is set to "unicode",
\eR matches any Unicode newline sequence. The default is specified when PCRE2
is built, with the default default being Unicode.
\eR matches any Unicode newline sequence. The default can be specified when
PCRE2 is built; if it is not, the default is set to Unicode.
.P
The \fBnewline\fP modifier specifies which characters are to be interpreted as
newlines, both in the pattern and in subject lines. The type must be one of CR,
@ -1381,11 +1381,11 @@ matching provokes an error return ("bad option value") from
.sp
The \fBjitstack\fP modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
optimization is not being used. The value is a number of kilobytes. Setting
zero reverts to the default of 32K. Providing a stack that is larger than the
default is necessary only for very complicated patterns. If \fBjitstack\fP is
set non-zero on a subject line it overrides any value that was set on the
pattern.
optimization is not being used. The value is a number of kibibytes (units of
1024 bytes). Setting zero reverts to the default of 32KiB. Providing a stack
that is larger than the default is necessary only for very complicated
patterns. If \fBjitstack\fP is set non-zero on a subject line it overrides any
value that was set on the pattern.
.
.
.SS "Setting heap, match, and depth limits"
@ -1427,10 +1427,10 @@ matching, \fImatch_limit\fP controls the total number of calls, both recursive
and non-recursive, to the internal matching function, thus controlling the
overall amount of computing resource that is used.
.P
For both kinds of matching, the \fIheap_limit\fP number (which is in kilobytes)
limits the amount of heap memory used for matching. A value of zero disables
the use of any heap memory; many simple pattern matches can be done without
using the heap, so this is not an unreasonable setting.
For both kinds of matching, the \fIheap_limit\fP number, which is in kibibytes
(units of 1024 bytes), limits the amount of heap memory used for matching. A
value of zero disables the use of any heap memory; many simple pattern matches
can be done without using the heap, so zero is not an unreasonable setting.
.
.
.SS "Showing MARK names"

File diff suppressed because it is too large

View File

@ -46,7 +46,7 @@ compatibility with Perl 5.6. PCRE2 does not support this.
.SH "WIDE CHARACTERS AND UTF MODES"
.rs
.sp
Codepoints less than 256 can be specified in patterns by either braced or
Code points less than 256 can be specified in patterns by either braced or
unbraced hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger
values have to use braced sequences. Unbraced octal code points up to \e777 are
also recognized; larger ones can be coded using \eo{...}.
@ -109,7 +109,7 @@ not PCRE2_UCP is set.
Case-insensitive matching in a UTF mode makes use of Unicode properties except
for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two codepoints that
few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
.
.

View File

@ -51,7 +51,7 @@ fi
# utf invoke UTF-8 functionality
#
# The data lines must not have any pcre2test modifiers. Unless
# "subject_litersl" is on the pattern, data lines are processed as
# "subject_literal" is on the pattern, data lines are processed as
# Perl double-quoted strings, so if they contain " $ or @ characters, these
# have to be escaped. For this reason, all such characters in the
# Perl-compatible testinput1 and testinput4 files are escaped so that they can

View File

@ -132,8 +132,9 @@ sure both macros are undefined; an emulation function will then be used. */
/* Define to 1 if you have the <zlib.h> header file. */
/* #undef HAVE_ZLIB_H */
/* This limits the amount of memory that pcre2_match() may use while matching
a pattern. The value is in kilobytes. */
/* This limits the amount of memory that may be used while matching a pattern.
It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
to JIT matching. The value is in kibibytes (units of 1024 bytes). */
#ifndef HEAP_LIMIT
#define HEAP_LIMIT 20000000
#endif
@ -155,7 +156,8 @@ sure both macros are undefined; an emulation function will then be used. */
/* The value of MATCH_LIMIT determines the default number of times the
pcre2_match() function can record a backtrack position during a single
matching attempt. There is a runtime interface for setting a different
matching attempt. The value is also used to limit a loop counter in
pcre2_dfa_match(). There is a runtime interface for setting a different
limit. The limit exists in order to catch runaway regular expressions that
take for ever to determine that they do not match. The default is set very
large so that it does not accidentally catch legitimate cases. */
@ -170,7 +172,9 @@ sure both macros are undefined; an emulation function will then be used. */
MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
must be less than the value of MATCH_LIMIT. The default is to use the same
value as MATCH_LIMIT. There is a runtime method for setting a different
limit. */
limit. In the case of pcre2_dfa_match(), this limit controls the depth of
the internal nested function calls that are used for pattern recursions,
lookarounds, and atomic groups. */
#ifndef MATCH_LIMIT_DEPTH
#define MATCH_LIMIT_DEPTH MATCH_LIMIT
#endif
@ -210,7 +214,7 @@ sure both macros are undefined; an emulation function will then be used. */
#define PACKAGE_NAME "PCRE2"
/* Define to the full name and version of this package. */
#define PACKAGE_STRING "PCRE2 10.31"
#define PACKAGE_STRING "PCRE2 10.32-RC1"
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre2"
@ -219,7 +223,7 @@ sure both macros are undefined; an emulation function will then be used. */
#define PACKAGE_URL ""
/* Define to the version of this package. */
#define PACKAGE_VERSION "10.31"
#define PACKAGE_VERSION "10.32-RC1"
/* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested
parentheses (of any kind) in a pattern. This limits the amount of system
@ -339,7 +343,7 @@ sure both macros are undefined; an emulation function will then be used. */
#endif
/* Version number of package */
#define VERSION "10.31"
#define VERSION "10.32-RC1"
/* Define to 1 if on MINIX. */
/* #undef _MINIX */

View File

@ -134,7 +134,7 @@ sure both macros are undefined; an emulation function will then be used. */
/* This limits the amount of memory that may be used while matching a pattern.
It applies to both pcre2_match() and pcre2_dfa_match(). It does not apply
to JIT matching. The value is in kilobytes. */
to JIT matching. The value is in kibibytes (units of 1024 bytes). */
#undef HEAP_LIMIT
/* The value of LINK_SIZE determines the number of bytes used to store links

View File

@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE2_MAJOR 10
#define PCRE2_MINOR 31
#define PCRE2_PRERELEASE
#define PCRE2_DATE 2018-02-12
#define PCRE2_MINOR 32
#define PCRE2_PRERELEASE -RC1
#define PCRE2_DATE 2018-02-19
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE2, the appropriate

View File

@ -4261,11 +4261,11 @@ goto FAILED;
/*************************************************
* Find first significant op code *
* Find first significant opcode *
*************************************************/
/* This is called by several functions that scan a compiled expression looking
for a fixed first character, or an anchoring op code etc. It skips over things
for a fixed first character, or an anchoring opcode etc. It skips over things
that do not influence this. For some calls, it makes sense to skip negative
forward and all backward assertions, and also the \b assertion; for others it
does not.
@ -5472,7 +5472,7 @@ for (;; pptr++)
set xclass = TRUE. Then, in the pre-compile phase, accumulate the length
of the extra data and reset the pointer. This is so that very large
classes that contain a zillion wide characters or Unicode property tests
do not overwrite the work space (which is on the stack). */
do not overwrite the workspace (which is on the stack). */
if (class_uchardata > class_uchardata_base)
{
@ -7460,7 +7460,7 @@ length of the BRA and KET and any extra code units that are required at the
beginning. We accumulate in a local variable to save frequent testing of
lengthptr for NULL. We cannot do this by looking at the value of 'code' at the
start and end of each alternative, because compiled items are discarded during
the pre-compile phase so that the work space is not exceeded. */
the pre-compile phase so that the workspace is not exceeded. */
length = 2 + 2*LINK_SIZE + skipunits;

View File

@ -387,8 +387,8 @@ return (mb->callout)(cb, mb->callout_data);
*************************************************/
/* This function is called when internal_dfa_match() is about to be called
recursively and there is insufficient workingspace left in the current work
space block. If there's an existing next block, use it; otherwise get a new
recursively and there is insufficient working space left in the current
workspace block. If there's an existing next block, use it; otherwise get a new
block unless the heap limit is reached.
Arguments:
@ -2800,7 +2800,7 @@ for (;;)
local_workspace, /* workspace vector */
RWS_RSIZE, /* size of same */
rlevel, /* function recursion level */
RWS); /* recursion work space */
RWS); /* recursion workspace */
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;

View File

@ -43,7 +43,7 @@ POSSIBILITY OF SUCH DAMAGE.
#include "config.h"
#endif
/* These defines enables debugging code */
/* These defines enable debugging code */
//#define DEBUG_FRAMES_DISPLAY
//#define DEBUG_SHOW_OPS
@ -1776,7 +1776,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
/* ===================================================================== */
/* Match a bit-mapped character class, possibly repeatedly. These op codes
/* Match a bit-mapped character class, possibly repeatedly. These opcodes
are used when all the characters in the class have values in the range
0-255, and either the matching is caseful, or the characters are in the
range 0-127 when UTF processing is enabled. The only difference between
@ -2464,7 +2464,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
/* ===================================================================== */
/* Match a single character type repeatedly. Note that the property type
does not need to be in a stack frame as it not used within an RMATCH()
does not need to be in a stack frame as it is not used within an RMATCH()
loop. */
#define Lstart_eptr F->temp_sptr[0]
@ -4143,7 +4143,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
}
break;
/* The "byte" (i.e. "code unit") case is the same as non-UTF */
/* The "byte" (i.e. "code unit") case is the same as non-UTF */
case OP_ANYBYTE:
fc = Lmax - Lmin;
@ -5424,7 +5424,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
Feptr -= number;
}
/* Save the earliest consulted character, then skip to next op code */
/* Save the earliest consulted character, then skip to next opcode */
if (Feptr < mb->start_used_ptr) mb->start_used_ptr = Feptr;
Fecode += 1 + LINK_SIZE;
@ -5929,7 +5929,7 @@ in rrc. */
RETURN_SWITCH:
if (Frdepth == 0) return rrc; /* Exit from the top level */
F = (heapframe *)((char *)F - Fback_frame); /* Back track */
F = (heapframe *)((char *)F - Fback_frame); /* Backtrack */
mb->cb->callout_flags |= PCRE2_CALLOUT_BACKTRACK; /* Note for callouts */
#ifdef DEBUG_SHOW_RMATCH

View File

@ -1274,7 +1274,7 @@ do
break;
/* Single character types set the bits and stop. Note that if PCRE2_UCP
is set, we do not see these op codes because \d etc are converted to
is set, we do not see these opcodes because \d etc are converted to
properties. Therefore, these apply in the case when only characters less
than 256 are recognized to match the types. */

View File

@ -170,7 +170,7 @@ are implementing).
by E_Modifier). Extend characters are allowed before the modifier; this
cannot be represented in this table, the code has to deal with it.
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
E_Base_GAZ).
9. Do not break within emoji flag sequences. That is, do not break between

View File

@ -492,7 +492,7 @@ so many of them that they are split into two fields. */
/* These are the matching controls that may be set either on a pattern or on a
data line. They are copied from the pattern controls as initial settings for
data line controls Note that CTL_MEMORY is not included here, because it does
data line controls. Note that CTL_MEMORY is not included here, because it does
different things in the two cases. */
#define CTL_ALLPD (CTL_AFTERTEXT|\
@ -5411,7 +5411,7 @@ switch(errorcode)
/* The pattern is now in pbuffer[8|16|32], with the length in code units in
patlen. If it is to be converted, copy the result back afterwards so that it
it ends up back in the usual place. */
ends up back in the usual place. */
if (pat_patctl.convert_type != CONVERT_UNSET)
{
@ -5735,7 +5735,7 @@ return PR_OK;
*************************************************/
/* This is used for DFA, normal, and JIT fast matching. For DFA matching it
should only called with the third argument set to PCRE2_ERROR_DEPTHLIMIT.
should only be called with the third argument set to PCRE2_ERROR_DEPTHLIMIT.
Arguments:
pp the subject string
@ -7766,7 +7766,7 @@ printf(" -LM list pattern and subject modifiers, then exit\n");
printf(" -q quiet: do not output PCRE2 version number at start\n");
printf(" -pattern <s> set default pattern modifier fields\n");
printf(" -subject <s> set default subject modifier fields\n");
printf(" -S <n> set stack size to <n> megabytes\n");
printf(" -S <n> set stack size to <n> mebibytes\n");
printf(" -t [<n>] time compilation and execution, repeating <n> times\n");
printf(" -tm [<n>] time execution (matching) only, repeating <n> times\n");
printf(" -T same as -t, but show total times at the end\n");