Source and document file tidies for 10.20-RC1.
This commit is contained in:
parent
a68ddd48b5
commit
07a8fdce25
26
ChangeLog
|
@ -1,8 +1,8 @@
|
|||
Change Log for PCRE2
|
||||
--------------------
|
||||
|
||||
Version 10.20 xx-xx-2015
|
||||
------------------------
|
||||
Version 10.20 16-June-2015
|
||||
--------------------------
|
||||
|
||||
1. Callouts with string arguments have been added.
|
||||
|
||||
|
@ -123,27 +123,27 @@ This bug was discovered by the LLVM fuzzer.
|
|||
current group, for example in this pattern: /(?|(\k'Pm')|(?'Pm'))/, caused a
|
||||
buffer overflow at compile time. This bug was discovered by the LLVM fuzzer.
|
||||
|
||||
31. Fix -fsanitize=undefined warnings for left shifts of 1 by 31 (it treats 1
|
||||
31. Fix -fsanitize=undefined warnings for left shifts of 1 by 31 (it treats 1
|
||||
as an int; fixed by writing it as 1u).
|
||||
|
||||
32. Fix pcre2grep compile when -std=c99 is used with gcc, though it still gives
|
||||
32. Fix pcre2grep compile when -std=c99 is used with gcc, though it still gives
|
||||
a warning for "fileno" unless -std=gnu99 is used.
|
||||
|
||||
33. A lookbehind assertion within a set of mutually recursive subpatterns could
|
||||
33. A lookbehind assertion within a set of mutually recursive subpatterns could
|
||||
provoke a buffer overflow. This bug was discovered by the LLVM fuzzer.
|
||||
|
||||
34. Give an error for an empty subpattern name such as (?'').
|
||||
|
||||
35. Make pcre2test give an error if a pattern that follows #forbid_utf contains
|
||||
35. Make pcre2test give an error if a pattern that follows #forbid_utf contains
|
||||
\P, \p, or \X.
|
||||
|
||||
36. The way named subpatterns are handled has been refactored. There is now a
|
||||
36. The way named subpatterns are handled has been refactored. There is now a
|
||||
pre-pass over the regex which does nothing other than identify named
|
||||
subpatterns and count the total captures. This means that information about
|
||||
named patterns is known before the rest of the compile. In particular, it means
|
||||
that forward references can be checked as they are encountered. Previously, the
|
||||
code for handling forward references was contorted and led to several errors in
|
||||
computing the memory requirements for some patterns, leading to buffer
|
||||
named patterns is known before the rest of the compile. In particular, it means
|
||||
that forward references can be checked as they are encountered. Previously, the
|
||||
code for handling forward references was contorted and led to several errors in
|
||||
computing the memory requirements for some patterns, leading to buffer
|
||||
overflows.
|
||||
|
||||
37. There was no check for integer overflow in subroutine calls such as (?123).
|
||||
|
@ -152,11 +152,11 @@ overflows.
|
|||
being treated as a literal 'l' instead of causing an error.
|
||||
|
||||
39. If a non-capturing group containing a conditional group that could match
|
||||
an empty string was repeated, it was not identified as matching an empty string
|
||||
an empty string was repeated, it was not identified as matching an empty string
|
||||
itself. For example: /^(?:(?(1)x|)+)+$()/.
|
||||
|
||||
40. In an EBCDIC environment, pcre2test was mishandling the escape sequences
|
||||
\a and \e in test subject lines.
|
||||
\a and \e in test subject lines.
|
||||
|
||||
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
|
||||
instead of the EBCDIC value.
|
||||
|
|
30
HACKING
|
@ -104,6 +104,21 @@ system stack used by the compile function, which uses recursive function calls
|
|||
for nested parenthesized groups. This is a safety feature for environments with
|
||||
small stacks where the patterns are provided by users.
|
||||
|
||||
History repeated itself for release 10.20. A number of bugs relating to named
|
||||
subpatterns had been discovered by fuzzers. Most of these were related to the
|
||||
handling of forward references when it was not known if the named pattern was
|
||||
unique. (References to non-unique names use a different opcode and more
|
||||
memory.) The use of duplicate group numbers (the (?| facility) also caused
|
||||
issues.
|
||||
|
||||
To get around these problems I adopted a new approach by adding a third pass,
|
||||
really a "pre-pass", over the pattern, which does nothing other than identify
|
||||
all the named subpatterns and their corresponding group numbers. This means
|
||||
that the actual compile (both pre-pass and real compile) has full knowledge of
|
||||
group names and numbers throughout. Several dozen lines of messy code were
|
||||
eliminated, though the new pre-pass is not short (skipping over [] classes is
|
||||
complicated).
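
The idea of a name-collecting pre-pass can be sketched as follows. This is a deliberately simplified illustration, not the code in pcre2_compile.c: the function prepass(), the struct named_group, and the two limits are hypothetical names invented for the example, and the sketch ignores \Q...\E, POSIX classes such as [:alpha:], extended-mode comments, and the (?| facility, which are the cases that make the real pre-pass long.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES    16   /* illustrative limits, not PCRE2's */
#define MAX_NAME_LEN 32

struct named_group {
    char name[MAX_NAME_LEN + 1];
    int  number;
};

/* Hypothetical pre-pass: scan the pattern once, record each named group with
   its group number, and return the total number of capturing groups. */
int prepass(const char *p, struct named_group *names, int *name_count)
{
    int captures = 0;

    assert(p != NULL && names != NULL && name_count != NULL);
    *name_count = 0;

    while (*p != '\0') {
        if (*p == '\\') {                   /* escape: skip the escaped char too */
            p += (p[1] != '\0') ? 2 : 1;
        } else if (*p == '[') {             /* class: parentheses inside are literal */
            p++;
            if (*p == '^') p++;
            if (*p == ']') p++;             /* a leading ] is a literal */
            while (*p != '\0' && *p != ']') {
                if (*p == '\\' && p[1] != '\0') p++;
                p++;
            }
        } else if (*p == '(') {
            char term = '\0';
            size_t len = 0;

            p++;
            if (*p != '?') {                /* plain ( captures; (* starts a verb */
                if (*p != '*') captures++;
                continue;
            }
            p++;
            if (*p == '(' && p[1] != '?') { /* condition such as (?(1)... */
                while (*p != '\0' && *p != ')') p++;
                continue;
            }
            if (p[0] == 'P' && p[1] == '<') p++;      /* (?P<name> */
            if (*p == '\'') term = '\'';              /* (?'name'  */
            else if (*p == '<' && p[1] != '=' && p[1] != '!') term = '>';
            if (term == '\0') continue;     /* (?:, lookarounds, options, ... */

            p++;
            captures++;
            while (p[len] != '\0' && p[len] != term) len++;
            if (*name_count < MAX_NAMES && len <= MAX_NAME_LEN) {
                memcpy(names[*name_count].name, p, len);
                names[*name_count].name[len] = '\0';
                names[*name_count].number = captures;
                (*name_count)++;
            }
            p += len;
        } else {
            p++;
        }
    }
    return captures;
}
```

Because the table of names and numbers is complete before any opcode is emitted, a later pass can resolve a forward reference by a simple lookup instead of patching it up afterwards.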
|
||||
|
||||
|
||||
Traditional matching function
|
||||
-----------------------------
|
||||
|
@ -343,8 +358,9 @@ do.
|
|||
|
||||
For classes containing characters with values greater than 255 or that contain
|
||||
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
|
||||
code points are less than 256, followed by a list of pairs (for a range) and
|
||||
single characters. In caseless mode, both cases are explicitly listed.
|
||||
code points are less than 256, followed by a list of pairs (for a range) and/or
|
||||
single characters and/or properties. In caseless mode, both cases are
|
||||
explicitly listed.
|
||||
|
||||
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
|
||||
opcode and its data. This is followed by a code unit containing flag bits:
|
||||
|
@ -431,7 +447,7 @@ bracket opcode.
|
|||
If a subpattern is quantified such that it is permitted to match zero times, it
|
||||
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
||||
single-unit opcodes that tell the matcher that skipping the following
|
||||
subpattern entirely is a valid branch. In the case of the first two, not
|
||||
subpattern entirely is a valid match. In the case of the first two, not
|
||||
skipping the pattern is also valid (greedy and non-greedy). The third is used
|
||||
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
|
||||
because it may be called as a subroutine from elsewhere in the pattern.
|
||||
|
@ -487,9 +503,9 @@ Forward assertions are also just like other subpatterns, but starting with one
|
|||
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
|
||||
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
|
||||
is OP_REVERSE, followed by a count of the number of characters to move back the
|
||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is a number
|
||||
of code units, but in UTF-8/16 mode each character may occupy more than one
|
||||
code unit. A separate count is present in each alternative of a lookbehind
|
||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
|
||||
number of code units, but in UTF-8/16 mode each character may occupy more than
|
||||
one code unit. A separate count is present in each alternative of a lookbehind
|
||||
assertion, allowing them to have different (but fixed) lengths.
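
The difference between a character count and a code unit count can be illustrated with a small sketch for the 8-bit case. The function back_chars_utf8() is a hypothetical helper written for this note, not the matcher's own code, and it assumes the subject is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Move p back by nchars characters within a UTF-8 buffer that begins at
   start. Returns NULL if fewer than nchars characters precede p, which is
   the situation in which a lookbehind cannot match. */
const unsigned char *back_chars_utf8(const unsigned char *start,
                                     const unsigned char *p, size_t nchars)
{
    assert(start != NULL && p != NULL && p >= start);

    while (nchars-- > 0) {
        if (p == start) return NULL;
        p--;
        /* Continuation bytes have the form 10xxxxxx; step over them to
           reach the first code unit of the character. */
        while (p > start && (*p & 0xC0) == 0x80) p--;
    }
    return p;
}
```

In a mode where every character is one code unit, the inner loop never runs and the character count is also the code unit count.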
|
||||
|
||||
|
||||
|
@ -585,4 +601,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
|
|||
correct length, in order to catch updating errors.
|
||||
|
||||
Philip Hazel
|
||||
March 2015
|
||||
June 2015
|
||||
|
|
20
NEWS
|
@ -1,6 +1,26 @@
|
|||
News about PCRE2 releases
|
||||
-------------------------
|
||||
|
||||
Version 10.20 16-June-2015
|
||||
--------------------------
|
||||
|
||||
1. Callouts with string arguments and the pcre2_callout_enumerate() function
|
||||
have been implemented.
|
||||
|
||||
2. The PCRE2_NEVER_BACKSLASH_C option, which locks out the use of \C, is added.
|
||||
|
||||
3. The PCRE2_ALT_CIRCUMFLEX option lets ^ match after a newline at the end of a
|
||||
subject in multiline mode.
|
||||
|
||||
4. The way named subpatterns are handled has been refactored. The previous
|
||||
approach had several bugs.
|
||||
|
||||
5. The handling of \c in EBCDIC environments has been changed to conform to the
|
||||
perlebcdic document. This is an incompatible change.
|
||||
|
||||
6. Bugs have been mended, many of them discovered by fuzzers.
|
||||
|
||||
|
||||
Version 10.10 06-March-2015
|
||||
---------------------------
|
||||
|
||||
|
|
4
README
|
@ -293,9 +293,9 @@ library. They are also documented in the pcre2build man page.
|
|||
both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
|
||||
which specifies that the code value for the EBCDIC NL character is 0x25
|
||||
instead of the default 0x15.
|
||||
|
||||
|
||||
. If you specify --enable-debug, additional debugging code is included in the
|
||||
build. This option is intended for use by the PCRE2 maintainers.
|
||||
build. This option is intended for use by the PCRE2 maintainers.
|
||||
|
||||
. In environments where valgrind is installed, if you specify
|
||||
|
||||
|
|
4
RunTest
|
@ -24,8 +24,8 @@
|
|||
# example, if JIT support is not compiled, test 16 is skipped, whereas if JIT
|
||||
# support is compiled, test 15 is skipped.
|
||||
#
|
||||
# Other arguments can be one of the words "-valgrind", "-valgrind-log", or
|
||||
# "-sim" followed by an argument to run cross-compiled executables under a
|
||||
# Other arguments can be one of the words "-valgrind", "-valgrind-log", or
|
||||
# "-sim" followed by an argument to run cross-compiled executables under a
|
||||
# simulator, for example:
|
||||
#
|
||||
# RunTest 3 -sim "qemu-arm -s 8388608"
|
||||
|
|
16
configure.ac
|
@ -11,15 +11,15 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
|
|||
m4_define(pcre2_major, [10])
|
||||
m4_define(pcre2_minor, [20])
|
||||
m4_define(pcre2_prerelease, [-RC1])
|
||||
m4_define(pcre2_date, [2015-03-11])
|
||||
m4_define(pcre2_date, [2015-06-16])
|
||||
|
||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||
# 50 lines of this file. Please update that if the variables above are moved.
|
||||
|
||||
# Libtool shared library interface versions (current:revision:age)
|
||||
m4_define(libpcre2_8_version, [1:0:1])
|
||||
m4_define(libpcre2_16_version, [1:0:1])
|
||||
m4_define(libpcre2_32_version, [1:0:1])
|
||||
m4_define(libpcre2_8_version, [2:0:0])
|
||||
m4_define(libpcre2_16_version, [2:0:0])
|
||||
m4_define(libpcre2_32_version, [2:0:0])
|
||||
m4_define(libpcre2_posix_version, [0:0:0])
|
||||
|
||||
AC_PREREQ(2.57)
|
||||
|
@ -134,14 +134,14 @@ AC_SUBST(enable_pcre2_32)
|
|||
AC_ARG_ENABLE(debug,
|
||||
AS_HELP_STRING([--enable-debug],
|
||||
[enable debugging code]),
|
||||
, enable_debug=no)
|
||||
, enable_debug=no)
|
||||
|
||||
# Handle --enable-jit (disabled by default)
|
||||
AC_ARG_ENABLE(jit,
|
||||
AS_HELP_STRING([--enable-jit],
|
||||
[enable Just-In-Time compiling support]),
|
||||
, enable_jit=no)
|
||||
|
||||
|
||||
# Handle --disable-pcre2grep-jit (enabled by default)
|
||||
AC_ARG_ENABLE(pcre2grep-jit,
|
||||
AS_HELP_STRING([--disable-pcre2grep-jit],
|
||||
|
@ -514,7 +514,7 @@ fi
|
|||
if test "$enable_debug" = "yes"; then
|
||||
AC_DEFINE([PCRE2_DEBUG], [], [
|
||||
Define to any value to include debugging code.])
|
||||
fi
|
||||
fi
|
||||
|
||||
# Unless running under Windows, JIT support requires pthreads.
|
||||
|
||||
|
@ -876,7 +876,7 @@ $PACKAGE-$VERSION configuration summary:
|
|||
Build 8-bit pcre2 library ....... : ${enable_pcre2_8}
|
||||
Build 16-bit pcre2 library ...... : ${enable_pcre2_16}
|
||||
Build 32-bit pcre2 library ...... : ${enable_pcre2_32}
|
||||
Include debugging code .......... : ${enable_debug}
|
||||
Include debugging code .......... : ${enable_debug}
|
||||
Enable JIT compiling support .... : ${enable_jit}
|
||||
Enable Unicode support .......... : ${enable_unicode}
|
||||
Newline char/sequence ........... : ${enable_newline}
|
||||
|
|
|
@ -294,6 +294,9 @@ library. They are also documented in the pcre2build man page.
|
|||
which specifies that the code value for the EBCDIC NL character is 0x25
|
||||
instead of the default 0x15.
|
||||
|
||||
. If you specify --enable-debug, additional debugging code is included in the
|
||||
build. This option is intended for use by the PCRE2 maintainers.
|
||||
|
||||
. In environments where valgrind is installed, if you specify
|
||||
|
||||
--enable-valgrind
|
||||
|
@ -829,4 +832,4 @@ The distribution should contain the files listed below.
|
|||
Philip Hazel
|
||||
Email local part: ph10
|
||||
Email domain: cam.ac.uk
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
|
|
|
@ -108,8 +108,14 @@ lose performance.
|
|||
<P>
|
||||
One way of guarding against this possibility is to use the
|
||||
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
||||
UTF. Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
|
||||
This causes a compile-time error if a pattern contains a UTF-setting sequence.
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||
<b>pcre2_compile()</b>. This causes a compile-time error if a pattern contains
|
||||
a UTF-setting sequence.
|
||||
</P>
|
||||
<P>
|
||||
The use of Unicode properties for character types such as \d can also be
|
||||
enabled from within the pattern, by specifying "(*UCP)". This feature can be
|
||||
disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
</P>
|
||||
<P>
|
||||
If your application is one that supports UTF, be aware that validity checking
|
||||
|
@ -118,6 +124,12 @@ the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
|
|||
running redundant checks.
|
||||
</P>
|
||||
<P>
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
|
||||
problems, because it may leave the current matching point in the middle of a
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
|
||||
lock out the use of \C, causing a compile-time error if it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Another way that performance can be hit is by running a pattern that has a very
|
||||
large search tree against a string that will never match. Nested unlimited
|
||||
repeats in a pattern are a common example. PCRE2 provides some protection
|
||||
|
@ -175,9 +187,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
|||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 18 November 2014
|
||||
Last updated: 13 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -33,7 +33,7 @@ for success and non-zero otherwise. The arguments are:
|
|||
<pre>
|
||||
<i>code</i> Points to the compiled pattern
|
||||
<i>callback</i> The callback function
|
||||
<i>callout_data</i> User data that is passed to the callback
|
||||
<i>callout_data</i> User data that is passed to the callback
|
||||
</pre>
|
||||
The <i>callback()</i> function is passed a pointer to a data block containing
|
||||
the following fields:
|
||||
|
@ -46,9 +46,9 @@ the following fields:
|
|||
<i>callout_string_length</i> Length of callout string
|
||||
<i>callout_string</i> Points to callout string or is NULL
|
||||
</pre>
|
||||
The second argument is the callout data that was passed to
|
||||
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
|
||||
for success. Any other value causes the pattern scan to stop, with the value
|
||||
The second argument is the callout data that was passed to
|
||||
<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
|
||||
for success. Any other value causes the pattern scan to stop, with the value
|
||||
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
|
||||
</P>
|
||||
<P>
|
||||
|
|
|
@ -49,6 +49,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
<pre>
|
||||
PCRE2_ANCHORED Force pattern anchoring
|
||||
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||
PCRE2_CASELESS Do caseless matching
|
||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||
|
@ -58,6 +59,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
|
||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
|
|
|
@ -1074,6 +1074,15 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
|
|||
to match. By default, as in Perl, a hexadecimal number is always expected after
|
||||
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
|
||||
binary zero character followed by z).
|
||||
<pre>
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
</pre>
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
|
||||
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
|
||||
after any internal newline. However, it does not match after a newline at the
|
||||
end of the subject, for compatibility with Perl. If you want a multiline
|
||||
circumflex also to match after a terminating newline, you must set
|
||||
PCRE2_ALT_CIRCUMFLEX.
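
The rule described above can be modelled in a few lines of C. This is only a sketch of the documented behaviour, not library code: circumflex_matches() is a hypothetical function, and it assumes that PCRE2_MULTILINE is set, PCRE2_NOTBOL is not set, and the newline convention is a single linefeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Report whether a multiline circumflex could match at offset pos in a
   subject of the given length. alt_circumflex models PCRE2_ALT_CIRCUMFLEX. */
bool circumflex_matches(const char *subject, size_t length, size_t pos,
                        bool alt_circumflex)
{
    assert(subject != NULL && pos <= length);

    if (pos == 0) return true;                  /* start of the subject */
    if (subject[pos - 1] != '\n') return false; /* not just after a newline */
    if (pos < length) return true;              /* after an internal newline */
    return alt_circumflex;                      /* after a terminating newline */
}
```

For the subject "ab\ncd\n" the only offset whose result depends on the option is 6, the position after the final newline.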
|
||||
<pre>
|
||||
PCRE2_AUTO_CALLOUT
|
||||
</pre>
|
||||
|
@ -1174,8 +1183,19 @@ When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
|
|||
constructs match immediately following or immediately before internal newlines
|
||||
in the subject string, respectively, as well as at the very start and end. This
|
||||
is equivalent to Perl's /m option, and it can be changed within a pattern by a
|
||||
(?m) option setting. If there are no newlines in a subject string, or no
|
||||
occurrences of ^ or $ in a pattern, setting PCRE2_MULTILINE has no effect.
|
||||
(?m) option setting. Note that the "start of line" metacharacter does not match
|
||||
after a newline at the end of the subject, for compatibility with Perl.
|
||||
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
|
||||
there are no newlines in a subject string, or no occurrences of ^ or $ in a
|
||||
pattern, setting PCRE2_MULTILINE has no effect.
|
||||
<pre>
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
</pre>
|
||||
This option locks out the use of \C in the pattern that is being compiled.
|
||||
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
|
||||
it may leave the current matching point in the middle of a multi-code-unit
|
||||
character. This option may be useful in applications that process patterns from
|
||||
external sources.
|
||||
<pre>
|
||||
PCRE2_NEVER_UCP
|
||||
</pre>
|
||||
|
@ -1183,17 +1203,17 @@ This option locks out the use of Unicode properties for handling \B, \b, \D,
|
|||
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
|
||||
for the PCRE2_UCP option below. In particular, it prevents the creator of the
|
||||
pattern from enabling this facility by starting the pattern with (*UCP). This
|
||||
may be useful in applications that process patterns from external sources. The
|
||||
option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
|
||||
option may be useful in applications that process patterns from external
|
||||
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
|
||||
<pre>
|
||||
PCRE2_NEVER_UTF
|
||||
</pre>
|
||||
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
|
||||
UTF-32, depending on which library is in use. In particular, it prevents the
|
||||
creator of the pattern from switching to UTF interpretation by starting the
|
||||
pattern with (*UTF). This may be useful in applications that process patterns
|
||||
from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes
|
||||
an error.
|
||||
pattern with (*UTF). This option may be useful in applications that process
|
||||
patterns from external sources. The combination of PCRE2_UTF and
|
||||
PCRE2_NEVER_UTF causes an error.
|
||||
<pre>
|
||||
PCRE2_NO_AUTO_CAPTURE
|
||||
</pre>
|
||||
|
@ -1735,14 +1755,14 @@ compiler does not alter the value returned by this option.
|
|||
<b> void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
a pointer to a callout enumeration block, and its second argument is the
|
||||
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
|
||||
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
|
||||
contents of the callout enumeration block are described in the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation, which also gives further details about callouts.
|
||||
|
@ -2273,7 +2293,7 @@ of the subject.
|
|||
PCRE2_ERROR_CALLOUT
|
||||
</pre>
|
||||
This error is never generated by <b>pcre2_match()</b> itself. It is provided for
|
||||
use by callout functions that want to cause <b>pcre2_match()</b> or
|
||||
use by callout functions that want to cause <b>pcre2_match()</b> or
|
||||
<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details.
|
||||
|
@ -2863,7 +2883,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 March 2015
|
||||
Last updated: 22 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -29,11 +29,12 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
|
||||
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||
<li><a name="TOC17" href="#SEC17">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC18" href="#SEC18">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
|
||||
<li><a name="TOC20" href="#SEC20">AUTHOR</a>
|
||||
<li><a name="TOC21" href="#SEC21">REVISION</a>
|
||||
<li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
|
||||
<li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
|
||||
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
|
||||
<li><a name="TOC22" href="#SEC22">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
|
||||
<P>
|
||||
|
@ -147,6 +148,12 @@ properties. The application can request that they do by setting the PCRE2_UCP
|
|||
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||
request this by starting with (*UCP).
|
||||
</P>
|
||||
<P>
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
Just-in-time compiler support is included in the build by specifying
|
||||
|
@ -397,7 +404,16 @@ automatically included, you may need to add something like
|
|||
</pre>
|
||||
immediately before the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
--enable-debug
|
||||
</pre>
|
||||
to the <b>configure</b> command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
|
@ -407,7 +423,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
|||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<P>
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
||||
code coverage report for its test suite. To enable this, you must install
|
||||
|
@ -464,11 +480,11 @@ This cleans all coverage data including the generated coverage report. For more
|
|||
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -477,9 +493,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -219,11 +219,11 @@ documentation). The callout block structure contains the following fields:
|
|||
PCRE2_SIZE <i>pattern_position</i>;
|
||||
PCRE2_SIZE <i>next_item_length</i>;
|
||||
PCRE2_SIZE <i>callout_string_offset</i>;
|
||||
PCRE2_SIZE <i>callout_string_length</i>;
|
||||
PCRE2_SPTR <i>callout_string</i>;
|
||||
PCRE2_SIZE <i>callout_string_length</i>;
|
||||
PCRE2_SPTR <i>callout_string</i>;
|
||||
</pre>
|
||||
The <i>version</i> field contains the version number of the block format. The
|
||||
current version is 1; the three callout string fields were added for this
|
||||
current version is 1; the three callout string fields were added for this
|
||||
version. If you are writing an application that might use an earlier release of
|
||||
PCRE2, you should check the version number before accessing any of these
|
||||
fields. The version number will increase in future if more fields are added,
|
||||
|
@ -263,7 +263,7 @@ need to report errors in the callout string within the pattern.
|
|||
Fields for all callouts
|
||||
</b><br>
|
||||
<P>
|
||||
The remaining fields in the callout block are the same for both kinds of
|
||||
The remaining fields in the callout block are the same for both kinds of
|
||||
callout.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -306,7 +306,7 @@ always the case for the DFA matching functions.
|
|||
</P>
|
||||
<P>
|
||||
The <i>pattern_position</i> field contains the offset in the pattern string to
|
||||
the next item to be matched.
|
||||
the next item to be matched.
|
||||
</P>
|
||||
<P>
|
||||
The <i>next_item_length</i> field contains the length of the next item to be
|
||||
|
@ -318,8 +318,8 @@ of the entire subpattern.
|
|||
<P>
|
||||
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
|
||||
help in distinguishing between different automatic callouts, which all have the
|
||||
same callout number. However, they are set for all callouts, and are used by
|
||||
<b>pcre2test</b> to show the next item to be matched when displaying callout
|
||||
same callout number. However, they are set for all callouts, and are used by
|
||||
<b>pcre2test</b> to show the next item to be matched when displaying callout
|
||||
information.
|
||||
</P>
|
||||
<P>
|
||||
|
@ -351,9 +351,9 @@ functions; it will never be used by PCRE2 itself.
|
|||
<b> void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
|
@ -369,7 +369,7 @@ data block contains the following fields:
|
|||
<i>callout_string_length</i> Length of callout string
|
||||
<i>callout_string</i> Points to callout string or is NULL
|
||||
</pre>
|
||||
The version number is currently 0. It will increase if new fields are ever
|
||||
The version number is currently 0. It will increase if new fields are ever
|
||||
added to the block. The remaining fields are the same as their namesakes in the
|
||||
<b>pcre2_callout</b> block that is used for callouts during matching, as
|
||||
described
|
||||
|
@ -384,8 +384,8 @@ pattern. For example, a pattern such as /(a){2}/ is compiled as if it were
|
|||
with the same value for <i>pattern_position</i> in each case.
|
||||
</P>
|
||||
<P>
|
||||
The callback function should normally return zero. If it returns a non-zero
|
||||
value, scanning the pattern stops, and that value is returned from
|
||||
The callback function should normally return zero. If it returns a non-zero
|
||||
value, scanning the pattern stops, and that value is returned from
|
||||
<b>pcre2_callout_enumerate()</b>.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
|
|
|
@ -357,10 +357,11 @@ A second use of backslash provides a way of encoding non-printing characters
|
|||
in patterns in a visible manner. There is no restriction on the appearance of
|
||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||
text editing, it is often easier to use one of the following escape sequences
|
||||
than the binary character it represents:
|
||||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any ASCII character
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
|
@ -377,23 +378,38 @@ The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
|||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
|
||||
code unit following \c has a value greater than 127, a compile-time error
|
||||
occurs. This locks out non-ASCII characters in all modes.
|
||||
code unit following \c has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
</P>
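The ASCII rule for \cx described above can be sketched in C as follows. This is a minimal illustration of the documented arithmetic, not PCRE2's own code; the function name control_escape_ascii is hypothetical, and the character codes are written as hex constants so that the sketch does not depend on the host character set.

```c
#include <stdbool.h>

/* Hypothetical helper illustrating the documented ASCII-mode \cx rule.
 * x is the code unit that follows \c. Returns false where the text says
 * a compile-time error occurs (x < 32 or x > 126); otherwise stores the
 * resulting character code in *out and returns true. */
bool control_escape_ascii(unsigned int x, unsigned int *out)
{
    if (x < 32 || x > 126)
        return false;              /* not a printable ASCII character */
    if (x >= 0x61 && x <= 0x7A)
        x -= 0x20;                 /* lower case letter: convert to upper case */
    *out = x ^ 0x40u;              /* invert bit 6 (hex 40) */
    return true;
}
```

With this sketch, 0x41 (A) yields hex 01, 0x7B ({) yields hex 3B, and 0x3B (;) yields hex 7B, matching the values quoted in the paragraph.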
|
||||
<P>
|
||||
The \c facility was designed for use with ASCII characters, but with the
|
||||
extension to Unicode it is even less useful than it once was. It is, however,
|
||||
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
|
||||
bytes. In this mode, all values are valid after \c. If the next character is a
|
||||
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
|
||||
byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
|
||||
the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
|
||||
characters also generate different values.
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \c@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
|
||||
<P>
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \cG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
|
||||
</P>
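The EBCDIC-mode mapping for \c can be sketched as a small lookup. This is only an illustration of the rules stated in the paragraphs above, not the library's implementation; the name control_escape_ebcdic and the apc_value parameter are assumptions made for the example. Letters are located by searching a string rather than by arithmetic, because EBCDIC letter codes are not contiguous.

```c
#include <string.h>

/* Hypothetical helper: returns the code value generated by \c followed
 * by x under the EBCDIC-mode rules quoted above, or -1 where the text
 * says a compile-time error occurs. apc_value is supplied by the caller:
 * 255 for most EBCDIC variants, 95 for the one Perl calls POSIX-BC. */
int control_escape_ebcdic(char x, int apc_value)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char punct[] = "[\\]^_";
    const char *p;

    if (x == '\0') return -1;          /* strchr would match the terminator */
    if (x == '@') return 0;            /* character code 0 */
    if (x == '?') return apc_value;    /* APC: 255 or 95 */
    if ((p = strchr(upper, x)) != NULL) return (int)(p - upper) + 1;   /* 1-26 */
    if ((p = strchr(lower, x)) != NULL) return (int)(p - lower) + 1;   /* 1-26 */
    if ((p = strchr(punct, x)) != NULL) return (int)(p - punct) + 27;  /* 27-31 */
    return -1;                         /* any other character is an error */
}
```

Passing the APC value as a parameter keeps the sketch neutral about how a real build decides which EBCDIC variant is in use.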
|
||||
<P>
|
||||
After \0 up to two further octal digits are read. If there are fewer than two
|
||||
digits, just those that are present are used. Thus the sequence \0\x\07
|
||||
specifies two binary zeros followed by a BEL character (code value 7). Make
|
||||
digits, just those that are present are used. Thus the sequence \0\x\015
|
||||
specifies two binary zeros followed by a CR character (code value 13). Make
|
||||
sure you supply two digits after the initial zero if the pattern character that
|
||||
follows is itself an octal digit.
|
||||
</P>
|
||||
|
@ -412,21 +428,24 @@ describe the old, ambiguous syntax.
|
|||
</P>
|
||||
<P>
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
and Perl has changed in recent releases, causing PCRE2 also to change. Outside
|
||||
a character class, PCRE2 reads the digit and any following digits as a decimal
|
||||
number. If the number is less than 8, or if there have been at least that many
|
||||
previous capturing left parentheses in the expression, the entire sequence is
|
||||
taken as a <i>back reference</i>. A description of how this works is given
|
||||
and Perl has changed over time, causing PCRE2 also to change.
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, PCRE2 reads the digit and any following digits as a
|
||||
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
|
||||
if there are at least that many previous capturing left parentheses in the
|
||||
expression, the entire sequence is taken as a <i>back reference</i>. A
|
||||
description of how this works is given
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
Otherwise, up to three octal digits are read to form a character code.
|
||||
</P>
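The decision described above for a backslash followed by a non-zero digit, outside a character class, can be sketched as follows. This is a hedged illustration of the documented rule only; the type digit_escape and the function classify_digit_escape are hypothetical names, and the caller is assumed to pass a pointer to the first digit after the backslash (1-9) together with the number of capturing parentheses seen so far.

```c
#include <ctype.h>
#include <stddef.h>

typedef struct {
    int is_backref;      /* 1: back reference, 0: octal character code */
    unsigned int value;  /* group number or character code */
    size_t consumed;     /* number of digits used after the backslash */
} digit_escape;

/* p points at the first digit after the backslash (assumed to be 1-9). */
digit_escape classify_digit_escape(const char *p, unsigned int capture_count)
{
    digit_escape r = {0, 0, 0};
    unsigned int n = 0;
    size_t len = 0;

    /* Read the digit and any following digits as a decimal number.
     * Accumulation stops once n is large, to avoid overflow; such a
     * value is still far above any realistic capture count. */
    while (isdigit((unsigned char)p[len])) {
        if (n < 100000u)
            n = n * 10u + (unsigned int)(p[len] - '0');
        len++;
    }

    /* Back reference: number below 10, first digit 8 or 9, or at least
     * that many previous capturing parentheses. */
    if (n < 10 || p[0] == '8' || p[0] == '9' || n <= capture_count) {
        r.is_backref = 1;
        r.value = n;
        r.consumed = len;
        return r;
    }

    /* Otherwise read up to three octal digits as a character code. */
    while (r.consumed < 3 && p[r.consumed] >= '0' && p[r.consumed] <= '7') {
        r.value = r.value * 8u + (unsigned int)(p[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

For instance, with no capturing groups the digits "40" are classified as the octal code 040 (decimal 32), whereas with 40 or more groups the same digits are classified as back reference 40.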
|
||||
<P>
|
||||
Inside a character class, or if the decimal number following \ is greater than
|
||||
7 and there have not been that many capturing subpatterns, PCRE2 handles \8
|
||||
and \9 as the literal characters "8" and "9", and otherwise re-reads up to
|
||||
three octal digits following the backslash, using them to generate a data
|
||||
character. Any subsequent digits stand for themselves. For example:
|
||||
Inside a character class, PCRE2 handles \8 and \9 as the literal characters
|
||||
"8" and "9", and otherwise reads up to three octal digits following the
|
||||
backslash, using them to generate a data character. Any subsequent digits stand
|
||||
for themselves. For example, outside a character class:
|
||||
<pre>
|
||||
\040 is another way of writing an ASCII space
|
||||
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
|
||||
|
@ -436,7 +455,7 @@ character. Any subsequent digits stand for themselves. For example:
|
|||
\0113 is a tab followed by the character "3"
|
||||
\113 might be a back reference, otherwise the character with octal code 113
|
||||
\377 might be a back reference, otherwise the value 255 (decimal)
|
||||
\81 is either a back reference, or the two characters "8" and "1"
|
||||
\81 is always a back reference
|
||||
</pre>
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
|
@ -1105,15 +1124,19 @@ regular expression.
|
|||
<P>
|
||||
The circumflex and dollar metacharacters are zero-width assertions. That is,
|
||||
they test for a particular condition being true without consuming any
|
||||
characters from the subject string.
|
||||
characters from the subject string. These two metacharacters are concerned with
|
||||
matching the starts and ends of lines. If the newline convention is set so that
|
||||
only the two-character sequence CRLF is recognized as a newline, isolated CR
|
||||
and LF characters are treated as ordinary data characters, and are not
|
||||
recognized as newlines.
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, in the default matching mode, the circumflex
|
||||
character is an assertion that is true only if the current matching point is at
|
||||
the start of the subject string. If the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> is non-zero, circumflex can never match if the
|
||||
PCRE2_MULTILINE option is unset. Inside a character class, circumflex has an
|
||||
entirely different meaning
|
||||
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
|
||||
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
|
||||
circumflex has an entirely different meaning
|
||||
<a href="#characterclass">(see below).</a>
|
||||
</P>
|
||||
<P>
|
||||
|
@ -1128,10 +1151,11 @@ to be anchored.)
|
|||
<P>
|
||||
The dollar character is an assertion that is true only if the current matching
|
||||
point is at the end of the subject string, or immediately before a newline at
|
||||
the end of the string (by default). Note, however, that it does not actually
|
||||
match the newline. Dollar need not be the last character of the pattern if a
|
||||
number of alternatives are involved, but it should be the last item in any
|
||||
branch in which it appears. Dollar has no special meaning in a character class.
|
||||
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
|
||||
that it does not actually match the newline. Dollar need not be the last
|
||||
character of the pattern if a number of alternatives are involved, but it
|
||||
should be the last item in any branch in which it appears. Dollar has no
|
||||
special meaning in a character class.
|
||||
</P>
|
||||
<P>
|
||||
The meaning of dollar can be changed so that it matches only at the very end of
|
||||
|
@ -1139,13 +1163,13 @@ the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
|
|||
does not affect the \Z assertion.
|
||||
</P>
|
||||
<P>
|
||||
The meanings of the circumflex and dollar characters are changed if the
|
||||
PCRE2_MULTILINE option is set. When this is the case, a circumflex matches
|
||||
immediately after internal newlines as well as at the start of the subject
|
||||
string. It does not match after a newline that ends the string. A dollar
|
||||
matches before any newlines in the string, as well as at the very end, when
|
||||
PCRE2_MULTILINE is set. When newline is specified as the two-character
|
||||
sequence CRLF, isolated CR and LF characters do not indicate newlines.
|
||||
The meanings of the circumflex and dollar metacharacters are changed if the
|
||||
PCRE2_MULTILINE option is set. When this is the case, a dollar character
|
||||
matches before any newlines in the string, as well as at the very end, and a
|
||||
circumflex matches immediately after internal newlines as well as at the start
|
||||
of the subject string. It does not match after a newline that ends the string,
|
||||
for compatibility with Perl. However, this can be changed by setting the
|
||||
PCRE2_ALT_CIRCUMFLEX option.
|
||||
</P>
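The circumflex rules above can be condensed into a small predicate. This is a sketch under simplifying assumptions, not PCRE2 code: the newline convention is taken to be a single LF, the start offset is zero, PCRE2_NOTBOL is not set, and the function name circumflex_matches and its boolean parameters are hypothetical stand-ins for the PCRE2_MULTILINE and PCRE2_ALT_CIRCUMFLEX options.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if a circumflex assertion would succeed at offset pos of
 * subject (length code units), under the assumptions stated above. */
bool circumflex_matches(const char *subject, size_t length, size_t pos,
                        bool multiline, bool alt_circumflex)
{
    if (pos > length) return false;
    if (pos == 0) return true;                    /* start of subject */
    if (!multiline) return false;                 /* default mode: start only */
    if (subject[pos - 1] != '\n') return false;   /* must follow a newline */
    if (pos == length) return alt_circumflex;     /* newline that ends the string */
    return true;                                  /* after an internal newline */
}
```

For the subject "def\nabc", offset 4 succeeds only in multiline mode; for "abc\n", offset 4 succeeds only when the alternative circumflex behaviour is also requested.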
|
||||
<P>
|
||||
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
|
||||
|
@ -1198,12 +1222,16 @@ whether or not a UTF mode is set. In the 8-bit library, one code unit is one
|
|||
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
|
||||
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
|
||||
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
|
||||
but it is unclear how it can usefully be used. Because \C breaks up characters
|
||||
into individual code units, matching one unit with \C in a UTF mode means that
|
||||
the rest of the string may start with a malformed UTF character. This has
|
||||
undefined results, because PCRE2 assumes that it is dealing with valid UTF
|
||||
strings (and by default it checks this at the start of processing unless the
|
||||
PCRE2_NO_UTF_CHECK option is used).
|
||||
but it is unclear how it can usefully be used.
|
||||
</P>
|
||||
<P>
|
||||
Because \C breaks up characters into individual code units, matching one unit
|
||||
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 does not allow \C to appear in lookbehind assertions
|
||||
|
@ -1475,7 +1503,8 @@ unset these options by preceding the letter with a hyphen, and a combined
|
|||
setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
|
||||
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
||||
permitted. If a letter appears both before and after the hyphen, the option is
|
||||
unset.
|
||||
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
|
||||
effect.
|
||||
</P>
|
||||
<P>
|
||||
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
|
||||
|
@ -1508,11 +1537,20 @@ option settings happen at compile time. There would be some very weird
|
|||
behaviour otherwise.
|
||||
</P>
|
||||
<P>
|
||||
As a convenient shorthand, if any option settings are required at the start of
|
||||
a non-capturing subpattern (see the next section), the option letters may
|
||||
appear between the "?" and the ":". Thus the two patterns
|
||||
<pre>
|
||||
(?i:saturday|sunday)
|
||||
(?:(?i)saturday|sunday)
|
||||
</pre>
|
||||
match exactly the same set of strings.
|
||||
</P>
|
||||
<P>
|
||||
<b>Note:</b> There are other PCRE2-specific options that can be set by the
|
||||
application when the compiling function is called.
|
||||
The pattern can contain special leading sequences such as (*CRLF) to override
|
||||
what the application has set or what has been defaulted. Details are given in
|
||||
the section entitled
|
||||
application when the compiling function is called. The pattern can contain
|
||||
special leading sequences such as (*CRLF) to override what the application has
|
||||
set or what has been defaulted. Details are given in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
|
||||
to set UTF and Unicode property modes; they are equivalent to setting the
|
||||
|
@ -2841,10 +2879,10 @@ condition.
|
|||
Callouts with string arguments
|
||||
</b><br>
|
||||
<P>
|
||||
A delimited string may be used instead of a number as a callout argument. The
|
||||
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
|
||||
the same as the start, except for {, where the ending delimiter is }. If the
|
||||
ending delimiter is needed within the string, it must be doubled. For
|
||||
A delimited string may be used instead of a number as a callout argument. The
|
||||
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
|
||||
the same as the start, except for {, where the ending delimiter is }. If the
|
||||
ending delimiter is needed within the string, it must be doubled. For
|
||||
example:
|
||||
<pre>
|
||||
(?C'ab ''c'' d')xyz(?C{any text})pqr
|
||||
|
@ -3285,7 +3323,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 15 March 2015
|
||||
Last updated: 13 June 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
|
|||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
|
||||
<li><a name="TOC2" href="#SEC2">QUOTING</a>
|
||||
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
|
||||
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
|
||||
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
|
@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
|
|||
\Q...\E treat enclosed characters as literal
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
|
||||
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
|
||||
<P>
|
||||
This table applies to ASCII and Unicode environments.
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any ASCII character
|
||||
\cx "control-x", where x is any ASCII printing character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n newline (hex 0A)
|
||||
|
@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
|
|||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
</pre>
|
||||
Note that \0dd is always an octal code, and that \8 and \9 are the literal
|
||||
characters "8" and "9".
|
||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||
a non-zero digit is complicated; for details see the section
|
||||
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
</P>
|
||||
<P>
|
||||
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
||||
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
|
||||
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
|
||||
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
|
||||
it matches a literal "u".
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
. any character except newline;
|
||||
in dotall mode, any character whatsoever
|
||||
\C one data unit, even in UTF mode (best avoided)
|
||||
\C one code unit, even in UTF mode (best avoided)
|
||||
\d a decimal digit
|
||||
\D a character that is not a decimal digit
|
||||
\h a horizontal white space character
|
||||
|
@ -96,6 +111,11 @@ characters "8" and "9".
|
|||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
The application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
|
||||
current matching point in the middle of a UTF-8 or UTF-16 character.
|
||||
</P>
|
||||
<P>
|
||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
happening, \s and \w may also match characters with code points in the range
|
||||
|
@ -348,13 +368,14 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
|||
\b word boundary
|
||||
\B not a word boundary
|
||||
^ start of subject
|
||||
also after internal newline in multiline mode
|
||||
also after an internal newline in multiline mode
|
||||
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
|
||||
\A start of subject
|
||||
$ end of subject
|
||||
also before newline at end of subject
|
||||
also before internal newline in multiline mode
|
||||
also before newline at end of subject
|
||||
also before internal newline in multiline mode
|
||||
\Z end of subject
|
||||
also before newline at end of subject
|
||||
also before newline at end of subject
|
||||
\z end of subject
|
||||
\G first matching position in subject
|
||||
</PRE>
|
||||
|
@ -423,7 +444,9 @@ appear.
|
|||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
</pre>
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
limits set by the caller of pcre2_match(), not increase them. The application
|
||||
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
|
||||
PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
|
@ -539,9 +562,9 @@ pattern is not anchored.
|
|||
(?Cn) callout with numerical data n
|
||||
(?C"text") callout with string data
|
||||
</pre>
|
||||
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
||||
start and the end), and the starting delimiter { matched with the ending
|
||||
delimiter }. To encode the ending delimiter within the string, double it.
|
||||
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
||||
start and the end), and the starting delimiter { matched with the ending
|
||||
delimiter }. To encode the ending delimiter within the string, double it.
|
||||
</P>
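The delimiter rule for callout strings can be sketched as a small parser. This is an illustrative reading of the rule stated above, not the routine PCRE2 uses; the function name parse_callout_string and its buffer-based interface are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* in points at the starting delimiter of a callout string argument.
 * The decoded string is written to out (capacity outsize, including
 * the terminating zero). Returns the number of input characters
 * consumed, including both delimiters, or 0 on error (unknown
 * delimiter, unterminated string, or output buffer too small). */
size_t parse_callout_string(const char *in, char *out, size_t outsize)
{
    static const char starts[] = "`'\"^%#${";
    char end;
    size_t i = 1, o = 0;

    if (outsize == 0 || in[0] == '\0' || strchr(starts, in[0]) == NULL)
        return 0;
    end = (in[0] == '{') ? '}' : in[0];   /* { is closed by }, others by themselves */

    for (;;) {
        char c = in[i];
        if (c == '\0') return 0;          /* no ending delimiter found */
        if (c == end) {
            if (in[i + 1] == end) {
                i += 2;                   /* doubled delimiter: one literal copy */
            } else {
                i++;                      /* single delimiter: end of string */
                break;
            }
        } else {
            i++;
        }
        if (o + 1 >= outsize) return 0;   /* keep room for the terminator */
        out[o++] = c;
    }
    out[o] = '\0';
    return i;
}
```

Applied to the text 'ab ''c'' d' the sketch yields the string ab 'c' d, and applied to {any text} it yields any text.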
|
||||
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
|
@ -559,7 +582,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 15 March 2015
|
||||
Last updated: 13 June 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -94,7 +94,7 @@ below). The input is processed using C's string functions, so must not
|
|||
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
|
||||
treats any bytes other than newline as data characters. In some Windows
|
||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
||||
further data is read.
|
||||
further data is read.
|
||||
</P>
|
||||
<P>
|
||||
For maximum portability, therefore, it is safest to avoid non-printing
|
||||
|
@ -284,13 +284,20 @@ following commands are recognized:
|
|||
#forbid_utf
|
||||
</pre>
|
||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
|
||||
options set, which locks out the use of UTF and Unicode property features. This
|
||||
is a trigger guard that is used in test files to ensure that UTF or Unicode
|
||||
property tests are not accidentally added to files that are used when Unicode
|
||||
support is not included in the library. This effect can also be obtained by the
|
||||
use of <b>#pattern</b>; the difference is that <b>#forbid_utf</b> cannot be
|
||||
unset, and the automatic options are not displayed in pattern information, to
|
||||
avoid cluttering up test output.
|
||||
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
|
||||
the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
|
||||
an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
|
||||
which are still supported when PCRE2_UTF is not set, but which require Unicode
|
||||
property support to be included in the library.
|
||||
</P>
|
||||
<P>
|
||||
This is a trigger guard that is used in test files to ensure that UTF or
|
||||
Unicode property tests are not accidentally added to files that are used when
|
||||
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
|
||||
PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
|
||||
the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
|
||||
options are not displayed in pattern information, to avoid cluttering up test
|
||||
output.
|
||||
<pre>
|
||||
#load <filename>
|
||||
</pre>
|
||||
|
@ -471,6 +478,7 @@ for a description of their effects.
|
|||
<pre>
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
|
@ -481,6 +489,7 @@ for a description of their effects.
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
|
@ -506,7 +515,7 @@ about the pattern:
|
|||
<pre>
|
||||
bsr=[anycrlf|unicode] specify \R handling
|
||||
/B bincode show binary code without lengths
|
||||
callout_info show callout information
|
||||
callout_info show callout information
|
||||
debug same as info,fullbincode
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
|
@ -589,9 +598,9 @@ not necessarily the last character. These lines are omitted if no starting or
|
|||
ending code units are recorded.
|
||||
</P>
|
||||
<P>
|
||||
The <b>callout_info</b> modifier requests information about all the callouts in
|
||||
the pattern. A list of them is output at the end of any other information that
|
||||
is requested. For each callout, either its number or string is given, followed
|
||||
The <b>callout_info</b> modifier requests information about all the callouts in
|
||||
the pattern. A list of them is output at the end of any other information that
|
||||
is requested. For each callout, either its number or string is given, followed
|
||||
by the item that follows it in the pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
|
@ -1460,7 +1469,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 March 2015
|
||||
Last updated: 20 May 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -103,12 +103,12 @@ lose performance.
|
|||
.P
|
||||
One way of guarding against this possibility is to use the
|
||||
\fBpcre2_pattern_info()\fP function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||
\fBpcre2_compile()\fP. This causes a compile-time error if a pattern contains
|
||||
a UTF-setting sequence.
|
||||
.P
|
||||
The use of Unicode properties for character types such as \ed can also be
|
||||
enabled from within the pattern, by specifying "(*UCP)". This feature can be
|
||||
The use of Unicode properties for character types such as \ed can also be
|
||||
enabled from within the pattern, by specifying "(*UCP)". This feature can be
|
||||
disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
.P
|
||||
If your application is one that supports UTF, be aware that validity checking
|
||||
|
|
359
doc/pcre2.txt
|
@ -87,16 +87,26 @@ SECURITY CONSIDERATIONS
|
|||
mance.
|
||||
|
||||
One way of guarding against this possibility is to use the pcre2_pat-
|
||||
tern_info() function to check the compiled pattern's options for UTF.
|
||||
Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
|
||||
This causes a compile-time error if a pattern contains a UTF-setting
|
||||
sequence.
|
||||
tern_info() function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
||||
calling pcre2_compile(). This causes a compile-time error if a pattern
|
||||
contains a UTF-setting sequence.
|
||||
|
||||
The use of Unicode properties for character types such as \d can also
|
||||
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
||||
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
|
||||
If your application is one that supports UTF, be aware that validity
|
||||
checking can take time. If the same data string is to be matched many
|
||||
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
|
||||
subsequent matches to avoid running redundant checks.
|
||||
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
|
||||
to problems, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
|
||||
option can be used to lock out the use of \C, causing a compile-time
|
||||
error if it is encountered.
|
||||
|
||||
Another way that performance can be hit is by running a pattern that
|
||||
has a very large search tree against a string that will never match.
|
||||
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
||||
|
@ -155,11 +165,11 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 18 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 13 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
||||
|
||||
|
||||
|
@ -1109,104 +1119,124 @@ COMPILING A PATTERN
|
|||
always expected after \x, but it may have zero, one, or two digits (so,
|
||||
for example, \xz matches a binary zero character followed by z).
|
||||
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
|
||||
metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
|
||||
is set), and also after any internal newline. However, it does not
|
||||
match after a newline at the end of the subject, for compatibility with
|
||||
Perl. If you want a multiline circumflex also to match after a termi-
|
||||
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
|
||||
|
||||
PCRE2_AUTO_CALLOUT
|
||||
|
||||
If this bit is set, pcre2_compile() automatically inserts callout
|
||||
If this bit is set, pcre2_compile() automatically inserts callout
|
||||
items, all with number 255, before each pattern item. For discussion of
|
||||
the callout facility, see the pcre2callout documentation.
|
||||
|
||||
PCRE2_CASELESS
|
||||
|
||||
If this bit is set, letters in the pattern match both upper and lower
|
||||
case letters in the subject. It is equivalent to Perl's /i option, and
|
||||
If this bit is set, letters in the pattern match both upper and lower
|
||||
case letters in the subject. It is equivalent to Perl's /i option, and
|
||||
it can be changed within a pattern by a (?i) option setting.
|
||||
|
||||
PCRE2_DOLLAR_ENDONLY
|
||||
|
||||
If this bit is set, a dollar metacharacter in the pattern matches only
|
||||
at the end of the subject string. Without this option, a dollar also
|
||||
matches immediately before a newline at the end of the string (but not
|
||||
before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
|
||||
if PCRE2_MULTILINE is set. There is no equivalent to this option in
|
||||
If this bit is set, a dollar metacharacter in the pattern matches only
|
||||
at the end of the subject string. Without this option, a dollar also
|
||||
matches immediately before a newline at the end of the string (but not
|
||||
before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
|
||||
if PCRE2_MULTILINE is set. There is no equivalent to this option in
|
||||
Perl, and no way to set it within a pattern.
|
||||
|
||||
PCRE2_DOTALL
|
||||
|
||||
If this bit is set, a dot metacharacter in the pattern matches any
|
||||
character, including one that indicates a newline. However, it only
|
||||
If this bit is set, a dot metacharacter in the pattern matches any
|
||||
character, including one that indicates a newline. However, it only
|
||||
ever matches one character, even if newlines are coded as CRLF. Without
|
||||
this option, a dot does not match when the current position in the sub-
|
||||
ject is at a newline. This option is equivalent to Perl's /s option,
|
||||
ject is at a newline. This option is equivalent to Perl's /s option,
|
||||
and it can be changed within a pattern by a (?s) option setting. A neg-
|
||||
ative class such as [^a] always matches newline characters, independent
|
||||
of the setting of this option.
|
||||
|
||||
PCRE2_DUPNAMES
|
||||
|
||||
If this bit is set, names used to identify capturing subpatterns need
|
||||
If this bit is set, names used to identify capturing subpatterns need
|
||||
not be unique. This can be helpful for certain types of pattern when it
|
||||
is known that only one instance of the named subpattern can ever be
|
||||
matched. There are more details of named subpatterns below; see also
|
||||
is known that only one instance of the named subpattern can ever be
|
||||
matched. There are more details of named subpatterns below; see also
|
||||
the pcre2pattern documentation.
|
||||
|
||||
PCRE2_EXTENDED
|
||||
|
||||
If this bit is set, most white space characters in the pattern are
|
||||
totally ignored except when escaped or inside a character class. How-
|
||||
ever, white space is not allowed within sequences such as (?> that
|
||||
If this bit is set, most white space characters in the pattern are
|
||||
totally ignored except when escaped or inside a character class. How-
|
||||
ever, white space is not allowed within sequences such as (?> that
|
||||
introduce various parenthesized subpatterns, nor within numerical quan-
|
||||
tifiers such as {1,3}. Ignorable white space is permitted between an
|
||||
item and a following quantifier and between a quantifier and a follow-
|
||||
tifiers such as {1,3}. Ignorable white space is permitted between an
|
||||
item and a following quantifier and between a quantifier and a follow-
|
||||
ing + that indicates possessiveness.
|
||||
|
||||
PCRE2_EXTENDED also causes characters between an unescaped # outside a
|
||||
character class and the next newline, inclusive, to be ignored, which
|
||||
PCRE2_EXTENDED also causes characters between an unescaped # outside a
|
||||
character class and the next newline, inclusive, to be ignored, which
|
||||
makes it possible to include comments inside complicated patterns. Note
|
||||
that the end of this type of comment is a literal newline sequence in
|
||||
that the end of this type of comment is a literal newline sequence in
|
||||
the pattern; escape sequences that happen to represent a newline do not
|
||||
count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
|
||||
count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
|
||||
changed within a pattern by a (?x) option setting.
|
||||
|
||||
Which characters are interpreted as newlines can be specified by a set-
|
||||
ting in the compile context that is passed to pcre2_compile() or by a
|
||||
special sequence at the start of the pattern, as described in the sec-
|
||||
tion entitled "Newline conventions" in the pcre2pattern documentation.
|
||||
ting in the compile context that is passed to pcre2_compile() or by a
|
||||
special sequence at the start of the pattern, as described in the sec-
|
||||
tion entitled "Newline conventions" in the pcre2pattern documentation.
|
||||
A default is defined when PCRE2 is built.
|
||||
|
||||
PCRE2_FIRSTLINE
|
||||
|
||||
If this option is set, an unanchored pattern is required to match
|
||||
before or at the first newline in the subject string, though the
|
||||
If this option is set, an unanchored pattern is required to match
|
||||
before or at the first newline in the subject string, though the
|
||||
matched text may continue over the newline.
|
||||
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
|
||||
If this option is set, a back reference to an unset subpattern group
|
||||
matches an empty string (by default this causes the current matching
|
||||
alternative to fail). A pattern such as (\1)(a) succeeds when this
|
||||
option is set (assuming it can find an "a" in the subject), whereas it
|
||||
fails by default, for Perl compatibility. Setting this option makes
|
||||
If this option is set, a back reference to an unset subpattern group
|
||||
matches an empty string (by default this causes the current matching
|
||||
alternative to fail). A pattern such as (\1)(a) succeeds when this
|
||||
option is set (assuming it can find an "a" in the subject), whereas it
|
||||
fails by default, for Perl compatibility. Setting this option makes
|
||||
PCRE2 behave more like ECMAScript (aka JavaScript).
|
||||
|
||||
PCRE2_MULTILINE
|
||||
|
||||
By default, for the purposes of matching "start of line" and "end of
|
||||
line", PCRE2 treats the subject string as consisting of a single line
|
||||
of characters, even if it actually contains newlines. The "start of
|
||||
line" metacharacter (^) matches only at the start of the string, and
|
||||
the "end of line" metacharacter ($) matches only at the end of the
|
||||
By default, for the purposes of matching "start of line" and "end of
|
||||
line", PCRE2 treats the subject string as consisting of a single line
|
||||
of characters, even if it actually contains newlines. The "start of
|
||||
line" metacharacter (^) matches only at the start of the string, and
|
||||
the "end of line" metacharacter ($) matches only at the end of the
|
||||
string, or before a terminating newline (except when PCRE2_DOL-
|
||||
LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
|
||||
LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
|
||||
the "any character" metacharacter (.) does not match at a newline. This
|
||||
behaviour (for ^, $, and dot) is the same as Perl.
|
||||
|
||||
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
|
||||
constructs match immediately following or immediately before internal
|
||||
newlines in the subject string, respectively, as well as at the very
|
||||
start and end. This is equivalent to Perl's /m option, and it can be
|
||||
changed within a pattern by a (?m) option setting. If there are no new-
|
||||
lines in a subject string, or no occurrences of ^ or $ in a pattern,
|
||||
setting PCRE2_MULTILINE has no effect.
|
||||
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
|
||||
constructs match immediately following or immediately before internal
|
||||
newlines in the subject string, respectively, as well as at the very
|
||||
start and end. This is equivalent to Perl's /m option, and it can be
|
||||
changed within a pattern by a (?m) option setting. Note that the "start
|
||||
of line" metacharacter does not match after a newline at the end of the
|
||||
subject, for compatibility with Perl. However, you can change this by
|
||||
setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
|
||||
subject string, or no occurrences of ^ or $ in a pattern, setting
|
||||
PCRE2_MULTILINE has no effect.
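
The circumflex rules above can be sketched as a small predicate. This is an
illustration only, not PCRE2 code: the function name and its parameters are
hypothetical, LF is assumed to be the only newline, and PCRE2_NOTBOL is
assumed not to be set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API. Reports whether a
   circumflex could match at offset pos in subject[0..len), assuming LF is
   the only newline sequence and PCRE2_NOTBOL is not set. */
bool circumflex_matches(const char *subject, size_t len, size_t pos,
                        bool multiline, bool alt_circumflex)
{
    assert(subject != NULL);
    if (pos > len) return false;
    if (pos == 0) return true;                   /* very start of the subject */
    if (!multiline) return false;                /* single-line mode: start only */
    if (subject[pos - 1] != '\n') return false;  /* not just after a newline */
    if (pos < len) return true;                  /* after an internal newline */
    return alt_circumflex;                       /* after a final newline */
}
```

With the subject "ab\ncd\n", offset 3 qualifies only in multiline mode, and
offset 6 (after the final newline) qualifies only when the alternative
circumflex behaviour is also requested.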
|
||||
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
|
||||
This option locks out the use of \C in the pattern that is being com-
|
||||
piled. This escape can cause unpredictable behaviour in UTF-8 or
|
||||
UTF-16 modes, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. This option may be useful in
|
||||
applications that process patterns from external sources.
|
||||
|
||||
PCRE2_NEVER_UCP
|
||||
|
||||
|
@ -1214,18 +1244,18 @@ COMPILING A PATTERN
|
|||
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
|
||||
described for the PCRE2_UCP option below. In particular, it prevents
|
||||
the creator of the pattern from enabling this facility by starting the
|
||||
pattern with (*UCP). This may be useful in applications that process
|
||||
patterns from external sources. The option combination PCRE2_UCP and
|
||||
PCRE2_NEVER_UCP causes an error.
|
||||
pattern with (*UCP). This option may be useful in applications that
|
||||
process patterns from external sources. The option combination PCRE2_UCP
|
||||
and PCRE2_NEVER_UCP causes an error.
|
||||
|
||||
PCRE2_NEVER_UTF
|
||||
|
||||
This option locks out interpretation of the pattern as UTF-8, UTF-16,
|
||||
or UTF-32, depending on which library is in use. In particular, it pre-
|
||||
vents the creator of the pattern from switching to UTF interpretation
|
||||
by starting the pattern with (*UTF). This may be useful in applications
|
||||
that process patterns from external sources. The combination of
|
||||
PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
||||
by starting the pattern with (*UTF). This option may be useful in
|
||||
applications that process patterns from external sources. The combina-
|
||||
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
||||
|
||||
PCRE2_NO_AUTO_CAPTURE
|
||||
|
||||
|
@ -2796,11 +2826,11 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 23 March 2015
|
||||
Last updated: 22 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
||||
|
||||
|
||||
|
@ -2916,6 +2946,11 @@ UNICODE AND UTF SUPPORT
|
|||
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
|
||||
pattern may also request this by starting with (*UCP).
|
||||
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF
|
||||
mode, can cause unpredictable behaviour because it may leave the cur-
|
||||
rent matching point in the middle of a multi-code-unit character. It
|
||||
can be locked out by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
|
||||
|
||||
JUST-IN-TIME COMPILER SUPPORT
|
||||
|
||||
|
@ -2923,10 +2958,10 @@ JUST-IN-TIME COMPILER SUPPORT
|
|||
|
||||
--enable-jit
|
||||
|
||||
This support is available only for certain hardware architectures. If
|
||||
this option is set for an unsupported architecture, a building error
|
||||
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
||||
When JIT support is enabled, pcre2grep automatically makes use of it,
|
||||
This support is available only for certain hardware architectures. If
|
||||
this option is set for an unsupported architecture, a building error
|
||||
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
||||
When JIT support is enabled, pcre2grep automatically makes use of it,
|
||||
unless you add
|
||||
|
||||
--disable-pcre2grep-jit
|
||||
|
@ -2936,14 +2971,14 @@ JUST-IN-TIME COMPILER SUPPORT
|
|||
|
||||
NEWLINE RECOGNITION
|
||||
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
||||
the end of a line. This is the normal newline character on Unix-like
|
||||
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
||||
the end of a line. This is the normal newline character on Unix-like
|
||||
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
||||
adding
|
||||
|
||||
--enable-newline-is-cr
|
||||
|
||||
to the configure command. There is also an --enable-newline-is-lf
|
||||
to the configure command. There is also an --enable-newline-is-lf
|
||||
option, which explicitly specifies linefeed as the newline character.
|
||||
|
||||
Alternatively, you can specify that line endings are to be indicated by
|
||||
|
@ -2956,76 +2991,76 @@ NEWLINE RECOGNITION
|
|||
|
||||
--enable-newline-is-anycrlf
|
||||
|
||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
||||
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
||||
|
||||
--enable-newline-is-any
|
||||
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
||||
newline sequences are the three just mentioned, plus the single charac-
|
||||
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
||||
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
||||
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
||||
U+2029).
|
||||
|
||||
Whatever default line ending convention is selected when PCRE2 is built
|
||||
can be overridden by applications that use the library. At build time
|
||||
can be overridden by applications that use the library. At build time
|
||||
it is conventional to use the standard for your operating system.
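
As a rough sketch of how the five conventions differ, the function below
classifies a newline at a given position in an array of code points. The
enum and function names are invented for this illustration and are not
PCRE2 identifiers. The CRLF mode is assumed to correspond to a two-character
CR LF ending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; these are not PCRE2 identifiers. */
enum newline_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the number of code points forming a newline at s[pos] under the
   given convention, or 0 if there is no newline at that position. */
size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                      enum newline_mode mode)
{
    assert(s != NULL);
    if (pos >= len) return 0;
    uint32_t c = s[pos];
    int crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (mode) {
    case NL_CR:   return c == 0x0D ? 1 : 0;
    case NL_LF:   return c == 0x0A ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        switch (c) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:   /* LF, VT, FF, CR */
        case 0x85: case 0x2028: case 0x2029:          /* NEL, LS, PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

Treating CRLF as a single two-unit newline before testing the single
characters keeps the "any" modes from counting one line ending twice.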
|
||||
|
||||
|
||||
WHAT \R MATCHES
|
||||
|
||||
By default, the sequence \R in a pattern matches any Unicode newline
|
||||
sequence, independently of what has been selected as the line ending
|
||||
By default, the sequence \R in a pattern matches any Unicode newline
|
||||
sequence, independently of what has been selected as the line ending
|
||||
sequence. If you specify
|
||||
|
||||
--enable-bsr-anycrlf
|
||||
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||
ever is selected when PCRE2 is built can be overridden by applications
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
||||
ever is selected when PCRE2 is built can be overridden by applications
|
||||
that use the library.
|
||||
|
||||
|
||||
HANDLING VERY LARGE PATTERNS
|
||||
|
||||
Within a compiled pattern, offset values are used to point from one
|
||||
part to another (for example, from an opening parenthesis to an alter-
|
||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||
two-byte values are used for these offsets, leading to a maximum size
|
||||
for a compiled pattern of around 64K code units. This is sufficient to
|
||||
Within a compiled pattern, offset values are used to point from one
|
||||
part to another (for example, from an opening parenthesis to an alter-
|
||||
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
||||
two-byte values are used for these offsets, leading to a maximum size
|
||||
for a compiled pattern of around 64K code units. This is sufficient to
|
||||
handle all but the most gigantic patterns. Nevertheless, some people do
|
||||
want to process truly enormous patterns, so it is possible to compile
|
||||
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
||||
want to process truly enormous patterns, so it is possible to compile
|
||||
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
||||
as
|
||||
|
||||
--with-link-size=3
|
||||
|
||||
to the configure command. The value given must be 2, 3, or 4. For the
|
||||
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
||||
using longer offsets slows down the operation of PCRE2 because it has
|
||||
to load additional data when handling them. For the 32-bit library the
|
||||
value is always 4 and cannot be overridden; the value of --with-link-
|
||||
to the configure command. The value given must be 2, 3, or 4. For the
|
||||
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
||||
using longer offsets slows down the operation of PCRE2 because it has
|
||||
to load additional data when handling them. For the 32-bit library the
|
||||
value is always 4 and cannot be overridden; the value of --with-link-
|
||||
size is ignored.
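
The rules for --with-link-size can be summarized in a few lines of C. This
is a sketch under stated assumptions: the function is hypothetical, and
PCRE2 itself fixes the link size when the library is built rather than
through a run-time call like this one.

```c
#include <assert.h>

/* Hypothetical helper mirroring the rules in the text. code_unit_width is
   8, 16, or 32; requested is the value given to --with-link-size. */
int effective_link_size(int code_unit_width, int requested)
{
    assert(requested >= 2 && requested <= 4);
    assert(code_unit_width == 8 || code_unit_width == 16 ||
           code_unit_width == 32);

    if (code_unit_width == 32)
        return 4;                      /* always 4; the setting is ignored */
    if (code_unit_width == 16 && requested == 3)
        return 4;                      /* 3 is rounded up to 4 */
    return requested;                  /* 8-bit library, or 2 or 4 for 16-bit */
}
```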
|
||||
|
||||
|
||||
AVOIDING EXCESSIVE STACK USAGE
|
||||
|
||||
When matching with the pcre2_match() function, PCRE2 implements back-
|
||||
tracking by making recursive calls to an internal function called
|
||||
match(). In environments where the size of the stack is limited, this
|
||||
can severely limit PCRE2's operation. (The Unix environment does not
|
||||
usually suffer from this problem, but it may sometimes be necessary to
|
||||
When matching with the pcre2_match() function, PCRE2 implements back-
|
||||
tracking by making recursive calls to an internal function called
|
||||
match(). In environments where the size of the stack is limited, this
|
||||
can severely limit PCRE2's operation. (The Unix environment does not
|
||||
usually suffer from this problem, but it may sometimes be necessary to
|
||||
increase the maximum stack size. There is a discussion in the
|
||||
pcre2stack documentation.) An alternative approach to recursion that
|
||||
uses memory from the heap to remember data, instead of using recursive
|
||||
function calls, has been implemented to work round the problem of lim-
|
||||
ited stack size. If you want to build a version of PCRE2 that works
|
||||
pcre2stack documentation.) An alternative approach to recursion that
|
||||
uses memory from the heap to remember data, instead of using recursive
|
||||
function calls, has been implemented to work round the problem of lim-
|
||||
ited stack size. If you want to build a version of PCRE2 that works
|
||||
this way, add
|
||||
|
||||
--disable-stack-for-recursion
|
||||
|
||||
to the configure command. By default, the system functions malloc() and
|
||||
free() are called to manage the heap memory that is required, but cus-
|
||||
tom memory management functions can be called instead. PCRE2 runs
|
||||
free() are called to manage the heap memory that is required, but cus-
|
||||
tom memory management functions can be called instead. PCRE2 runs
|
||||
noticeably more slowly when built in this way. This option affects only
|
||||
the pcre2_match() function; it is not relevant for pcre2_dfa_match().
|
||||
|
||||
|
@ -3033,30 +3068,30 @@ AVOIDING EXCESSIVE STACK USAGE
|
|||
LIMITING PCRE2 RESOURCE USAGE
|
||||
|
||||
Internally, PCRE2 has a function called match(), which it calls repeat-
|
||||
edly (sometimes recursively) when matching a pattern with the
|
||||
edly (sometimes recursively) when matching a pattern with the
|
||||
pcre2_match() function. By controlling the maximum number of times this
|
||||
function may be called during a single matching operation, a limit can
|
||||
be placed on the resources used by a single call to pcre2_match(). The
|
||||
function may be called during a single matching operation, a limit can
|
||||
be placed on the resources used by a single call to pcre2_match(). The
|
||||
limit can be changed at run time, as described in the pcre2api documen-
|
||||
tation. The default is 10 million, but this can be changed by adding a
|
||||
tation. The default is 10 million, but this can be changed by adding a
|
||||
setting such as
|
||||
|
||||
--with-match-limit=500000
|
||||
|
||||
to the configure command. This setting has no effect on the
|
||||
to the configure command. This setting has no effect on the
|
||||
pcre2_dfa_match() matching function.
|
||||
|
||||
In some environments it is desirable to limit the depth of recursive
|
||||
In some environments it is desirable to limit the depth of recursive
|
||||
calls of match() more strictly than the total number of calls, in order
|
||||
to restrict the maximum amount of stack (or heap, if --disable-stack-
|
||||
to restrict the maximum amount of stack (or heap, if --disable-stack-
|
||||
for-recursion is specified) that is used. A second limit controls this;
|
||||
it defaults to the value that is set for --with-match-limit, which
|
||||
imposes no additional constraints. However, you can set a lower limit
|
||||
it defaults to the value that is set for --with-match-limit, which
|
||||
imposes no additional constraints. However, you can set a lower limit
|
||||
by adding, for example,
|
||||
|
||||
--with-match-limit-recursion=10000
|
||||
|
||||
to the configure command. This value can also be overridden at run
|
||||
to the configure command. This value can also be overridden at run
|
||||
time.
|
||||
|
||||
|
||||
|
@ -3064,45 +3099,45 @@ CREATING CHARACTER TABLES AT BUILD TIME
|
|||
|
||||
PCRE2 uses fixed tables for processing characters whose code points are
|
||||
less than 256. By default, PCRE2 is built with a set of tables that are
|
||||
distributed in the file src/pcre2_chartables.c.dist. These tables are
|
||||
distributed in the file src/pcre2_chartables.c.dist. These tables are
|
||||
for ASCII codes only. If you add
|
||||
|
||||
--enable-rebuild-chartables
|
||||
|
||||
to the configure command, the distributed tables are no longer used.
|
||||
Instead, a program called dftables is compiled and run. This outputs
|
||||
to the configure command, the distributed tables are no longer used.
|
||||
Instead, a program called dftables is compiled and run. This outputs
|
||||
the source for a new set of tables, created in the default locale of your
|
||||
C run-time system. (This method of replacing the tables does not work
|
||||
if you are cross compiling, because dftables is run on the local host.
|
||||
C run-time system. (This method of replacing the tables does not work
|
||||
if you are cross compiling, because dftables is run on the local host.
|
||||
If you need to create alternative tables when cross compiling, you will
|
||||
have to do so "by hand".)
|
||||
|
||||
|
||||
USING EBCDIC CODE
|
||||
|
||||
PCRE2 assumes by default that it will run in an environment where the
|
||||
character code is ASCII or Unicode, which is a superset of ASCII. This
|
||||
PCRE2 assumes by default that it will run in an environment where the
|
||||
character code is ASCII or Unicode, which is a superset of ASCII. This
|
||||
is the case for most computer operating systems. PCRE2 can, however, be
|
||||
compiled to run in an 8-bit EBCDIC environment by adding
|
||||
|
||||
--enable-ebcdic --disable-unicode
|
||||
|
||||
to the configure command. This setting implies --enable-rebuild-charta-
|
||||
bles. You should only use it if you know that you are in an EBCDIC
|
||||
bles. You should only use it if you know that you are in an EBCDIC
|
||||
environment (for example, an IBM mainframe operating system).
|
||||
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
||||
version of the library. Consequently, --enable-unicode and --enable-
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
||||
version of the library. Consequently, --enable-unicode and --enable-
|
||||
ebcdic are mutually exclusive.
|
||||
|
||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
||||
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
||||
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
||||
is used. In such an environment you should use
|
||||
|
||||
--enable-ebcdic-nl25
|
||||
|
||||
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
||||
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
||||
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
||||
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
||||
acter (which, in Unicode, is 0x85).
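
The LF/NEL assignment described above amounts to a two-way swap, sketched
here with a hypothetical struct and function (neither is part of PCRE2).
The nl25 argument stands for whether --enable-ebcdic-nl25 was given.

```c
#include <assert.h>

/* Hypothetical illustration of the EBCDIC newline code assignment. */
struct ebcdic_newlines {
    unsigned char cr;   /* same value as in ASCII */
    unsigned char lf;   /* 0x15 by default, 0x25 with nl25 */
    unsigned char nel;  /* whichever of 0x15 and 0x25 is not LF */
};

struct ebcdic_newlines ebcdic_newline_codes(int nl25)
{
    assert(nl25 == 0 || nl25 == 1);
    struct ebcdic_newlines n;
    n.cr  = 0x0d;
    n.lf  = nl25 ? 0x25 : 0x15;
    n.nel = nl25 ? 0x15 : 0x25;
    return n;
}
```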
|
||||
|
||||
|
@ -3113,31 +3148,31 @@ USING EBCDIC CODE
|
|||
|
||||
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
||||
|
||||
By default, pcre2grep reads all files as plain text. You can build it
|
||||
so that it recognizes files whose names end in .gz or .bz2, and reads
|
||||
By default, pcre2grep reads all files as plain text. You can build it
|
||||
so that it recognizes files whose names end in .gz or .bz2, and reads
|
||||
them with libz or libbz2, respectively, by adding one or both of
|
||||
|
||||
--enable-pcre2grep-libz
|
||||
--enable-pcre2grep-libbz2
|
||||
|
||||
to the configure command. These options naturally require that the rel-
|
||||
evant libraries are installed on your system. Configuration will fail
|
||||
evant libraries are installed on your system. Configuration will fail
|
||||
if they are not.
|
||||
|
||||
|
||||
PCRE2GREP BUFFER SIZE
|
||||
|
||||
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
||||
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
||||
scanning, in order to be able to output "before" and "after" lines when
|
||||
it finds a match. The size of the buffer is controlled by a parameter
|
||||
it finds a match. The size of the buffer is controlled by a parameter
|
||||
whose default value is 20K. The buffer itself is three times this size,
|
||||
but because of the way it is used for holding "before" lines, the long-
|
||||
est line that is guaranteed to be processable is the parameter size.
|
||||
est line that is guaranteed to be processable is the parameter size.
|
||||
You can change the default parameter value by adding, for example,
|
||||
|
||||
--with-pcre2grep-bufsize=50K
|
||||
|
||||
to the configure command. The caller of pcre2grep can override this
|
||||
to the configure command. The caller of pcre2grep can override this
|
||||
value by using --buffer-size on the command line.
|
||||
|
||||
|
||||
|
@ -3148,26 +3183,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|||
--enable-pcre2test-libreadline
|
||||
--enable-pcre2test-libedit
|
||||
|
||||
to the configure command, pcre2test is linked with the libreadline
|
||||
to the configure command, pcre2test is linked with the libreadline
|
||||
or libedit library, respectively, and when its input is from a terminal,
|
||||
it reads it using the readline() function. This provides line-editing
|
||||
and history facilities. Note that libreadline is GPL-licensed, so if
|
||||
you distribute a binary of pcre2test linked in this way, there may be
|
||||
it reads it using the readline() function. This provides line-editing
|
||||
and history facilities. Note that libreadline is GPL-licensed, so if
|
||||
you distribute a binary of pcre2test linked in this way, there may be
|
||||
licensing issues. These can be avoided by linking instead with libedit,
|
||||
which has a BSD licence.
|
||||
|
||||
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
||||
be added to the pcre2test build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
||||
be added to the pcre2test build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
environments (e.g. if an unmodified distribution version of readline is
|
||||
in use), some extra configuration may be necessary. The INSTALL file
|
||||
in use), some extra configuration may be necessary. The INSTALL file
|
||||
for libreadline says this:
|
||||
|
||||
"Readline uses the termcap functions, but does not link with
|
||||
the termcap or curses library itself, allowing applications
|
||||
which link with readline the to choose an appropriate library."
|
||||
|
||||
If your environment has not been set up so that an appropriate library
|
||||
If your environment has not been set up so that an appropriate library
|
||||
is automatically included, you may need to add something like
|
||||
|
||||
LIBS="-lncurses"
|
||||
|
@ -3175,6 +3210,16 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|||
immediately before the configure command.
|
||||
|
||||
|
||||
INCLUDING DEBUGGING CODE
|
||||
|
||||
If you add
|
||||
|
||||
--enable-debug
|
||||
|
||||
to the configure command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
|
||||
|
||||
DEBUGGING WITH VALGRIND SUPPORT
|
||||
|
||||
If you add
|
||||
|
@ -3257,11 +3302,11 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
||||
|
||||
|
||||
|
@ -3624,8 +3669,8 @@ REVISION
|
|||
Last updated: 23 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
||||
|
||||
|
||||
|
@ -3809,8 +3854,8 @@ REVISION
|
|||
Last updated: 15 March 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
||||
|
||||
|
||||
|
@ -4192,8 +4237,8 @@ REVISION
|
|||
Last updated: 27 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
||||
|
||||
|
||||
|
@ -4264,8 +4309,8 @@ REVISION
|
|||
Last updated: 25 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
||||
|
||||
|
||||
|
@ -4483,8 +4528,8 @@ REVISION
|
|||
Last updated: 29 September 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
||||
|
||||
|
||||
|
@ -4923,8 +4968,8 @@ REVISION
|
|||
Last updated: 22 December 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
||||
|
||||
|
||||
|
@ -5150,5 +5195,5 @@ REVISION
|
|||
Last updated: 23 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -21,7 +21,7 @@ for success and non-zero otherwise. The arguments are:
|
|||
.sp
|
||||
\fIcode\fP Points to the compiled pattern
|
||||
\fIcallback\fP The callback function
|
||||
\fIcallout_data\fP User data that is passed to the callback
|
||||
\fIcallout_data\fP User data that is passed to the callback
|
||||
.sp
|
||||
The \fIcallback()\fP function is passed a pointer to a data block containing
|
||||
the following fields:
|
||||
|
@ -34,9 +34,9 @@ the following fields:
|
|||
\fIcallout_string_length\fP Length of callout string
|
||||
\fIcallout_string\fP Points to callout string or is NULL
|
||||
.sp
|
||||
The second argument is the callout data that was passed to
|
||||
\fBpcre2_callout_enumerate()\fP. The \fBcallback()\fP function must return zero
|
||||
for success. Any other value causes the pattern scan to stop, with the value
|
||||
The second argument is the callout data that was passed to
|
||||
\fBpcre2_callout_enumerate()\fP. The \fBcallback()\fP function must return zero
|
||||
for success. Any other value causes the pattern scan to stop, with the value
|
||||
being passed back as the result of \fBpcre2_callout_enumerate()\fP.
|
||||
.P
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
|
|
|
@ -37,7 +37,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
.sp
|
||||
PCRE2_ANCHORED Force pattern anchoring
|
||||
PCRE2_ALT_BSUX Alternative handling of \eu, \eU, and \ex
|
||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||
PCRE2_CASELESS Do caseless matching
|
||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||
|
@ -47,7 +47,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
|
||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
|
|
|
@ -1045,7 +1045,7 @@ to match. By default, as in Perl, a hexadecimal number is always expected after
|
|||
binary zero character followed by z).
|
||||
.sp
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
.sp
|
||||
.sp
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
|
||||
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
|
||||
after any internal newline. However, it does not match after a newline at the
|
||||
|
@ -1161,11 +1161,10 @@ after a newline at the end of the subject, for compatibility with Perl.
|
|||
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
|
||||
there are no newlines in a subject string, or no occurrences of ^ or $ in a
|
||||
pattern, setting PCRE2_MULTILINE has no effect.
|
||||
|
||||
.sp
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
.sp
|
||||
This option locks out the use of \eC in the pattern that is being compiled.
|
||||
This option locks out the use of \eC in the pattern that is being compiled.
|
||||
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
|
||||
it may leave the current matching point in the middle of a multi-code-unit
|
||||
character. This option may be useful in applications that process patterns from
|
||||
|
@ -1756,14 +1755,14 @@ compiler does not alter the value returned by this option.
|
|||
.B " void *\fIuser_data\fP);"
|
||||
.fi
|
||||
.sp
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
a pointer to a callout enumeration block, and its second argument is the
|
||||
\fIuser_data\fP value that was passed to \fBpcre2_callout_enumerate()\fP. The
|
||||
\fIuser_data\fP value that was passed to \fBpcre2_callout_enumerate()\fP. The
|
||||
contents of the callout enumeration block are described in the
|
||||
.\" HREF
|
||||
\fBpcre2callout\fP
|
||||
|
@ -2330,7 +2329,7 @@ of the subject.
|
|||
PCRE2_ERROR_CALLOUT
|
||||
.sp
|
||||
This error is never generated by \fBpcre2_match()\fP itself. It is provided for
|
||||
use by callout functions that want to cause \fBpcre2_match()\fP or
|
||||
use by callout functions that want to cause \fBpcre2_match()\fP or
|
||||
\fBpcre2_callout_enumerate()\fP to return a distinctive error code. See the
|
||||
.\" HREF
|
||||
\fBpcre2callout\fP
|
||||
|
|
|
@ -134,8 +134,8 @@ option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
|||
request this by starting with (*UCP).
|
||||
.P
|
||||
The \eC escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
.
|
||||
.
|
||||
|
@ -417,8 +417,8 @@ If you add
|
|||
.sp
|
||||
--enable-debug
|
||||
.sp
|
||||
to the \fBconfigure\fP command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
to the \fBconfigure\fP command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
.
|
||||
.
|
||||
.SH "DEBUGGING WITH VALGRIND SUPPORT"
|
||||
|
|
|
@ -204,11 +204,11 @@ documentation). The callout block structure contains the following fields:
|
|||
PCRE2_SIZE \fIpattern_position\fP;
|
||||
PCRE2_SIZE \fInext_item_length\fP;
|
||||
PCRE2_SIZE \fIcallout_string_offset\fP;
|
||||
PCRE2_SIZE \fIcallout_string_length\fP;
|
||||
PCRE2_SPTR \fIcallout_string\fP;
|
||||
PCRE2_SIZE \fIcallout_string_length\fP;
|
||||
PCRE2_SPTR \fIcallout_string\fP;
|
||||
.sp
|
||||
The \fIversion\fP field contains the version number of the block format. The
|
||||
current version is 1; the three callout string fields were added for this
|
||||
current version is 1; the three callout string fields were added for this
|
||||
version. If you are writing an application that might use an earlier release of
|
||||
PCRE2, you should check the version number before accessing any of these
|
||||
fields. The version number will increase in future if more fields are added,
|
||||
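The version check recommended in this hunk might look like the sketch below. `demo_callout_block` is a hypothetical stand-in that carries only the fields needed here; real code would include pcre2.h and receive a `pcre2_callout_block`. The helper name is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of pcre2_callout_block used here. */
typedef struct demo_callout_block {
  uint32_t version;
  size_t callout_string_offset;
  size_t callout_string_length;
  const unsigned char *callout_string;
} demo_callout_block;

/* Returns the callout string length, or 0 when the block predates version 1
   or the callout has a numeric argument (callout_string is NULL). */
static size_t safe_callout_string_length(const demo_callout_block *cb)
{
  assert(cb != NULL);
  if (cb->version < 1) return 0;            /* string fields not provided */
  if (cb->callout_string == NULL) return 0; /* numeric callout */
  return cb->callout_string_length;
}
```

Testing `version` first means the three string fields are never read from a block laid out by an earlier release.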
|
@ -247,7 +247,7 @@ need to report errors in the callout string within the pattern.
|
|||
.SS "Fields for all callouts"
|
||||
.rs
|
||||
.sp
|
||||
The remaining fields in the callout block are the same for both kinds of
|
||||
The remaining fields in the callout block are the same for both kinds of
|
||||
callout.
|
||||
.P
|
||||
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
|
||||
|
@ -283,7 +283,7 @@ substrings have been captured, the value of \fIcapture_last\fP is 0. This is
|
|||
always the case for the DFA matching functions.
|
||||
.P
|
||||
The \fIpattern_position\fP field contains the offset in the pattern string to
|
||||
the next item to be matched.
|
||||
the next item to be matched.
|
||||
.P
|
||||
The \fInext_item_length\fP field contains the length of the next item to be
|
||||
matched in the pattern string. When the callout immediately precedes an
|
||||
|
@ -293,8 +293,8 @@ of the entire subpattern.
|
|||
.P
|
||||
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
|
||||
help in distinguishing between different automatic callouts, which all have the
|
||||
same callout number. However, they are set for all callouts, and are used by
|
||||
\fBpcre2test\fP to show the next item to be matched when displaying callout
|
||||
same callout number. However, they are set for all callouts, and are used by
|
||||
\fBpcre2test\fP to show the next item to be matched when displaying callout
|
||||
information.
|
||||
.P
|
||||
In callouts from \fBpcre2_match()\fP the \fImark\fP field contains a pointer to
|
||||
|
@ -329,9 +329,9 @@ functions; it will never be used by PCRE2 itself.
|
|||
.B " void *\fIuser_data\fP);"
|
||||
.fi
|
||||
.sp
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
|
@ -347,7 +347,7 @@ data block contains the following fields:
|
|||
\fIcallout_string_length\fP Length of callout string
|
||||
\fIcallout_string\fP Points to callout string or is NULL
|
||||
.sp
|
||||
The version number is currently 0. It will increase if new fields are ever
|
||||
The version number is currently 0. It will increase if new fields are ever
|
||||
added to the block. The remaining fields are the same as their namesakes in the
|
||||
\fBpcre2_callout\fP block that is used for callouts during matching, as
|
||||
described
|
||||
|
@ -363,8 +363,8 @@ pattern. For example, a pattern such as /(a){2}/ is compiled as if it were
|
|||
/(a)(a)/. This means that the callout will be enumerated more than once, but
|
||||
with the same value for \fIpattern_position\fP in each case.
|
||||
.P
|
||||
The callback function should normally return zero. If it returns a non-zero
|
||||
value, scanning the pattern stops, and that value is returned from
|
||||
The callback function should normally return zero. If it returns a non-zero
|
||||
value, scanning the pattern stops, and that value is returned from
|
||||
\fBpcre2_callout_enumerate()\fP.
|
||||
.
|
||||
.
|
||||
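The callback contract described above (return zero to continue, non-zero to stop and have that value passed back) can be sketched without linking against the library. Everything prefixed `demo_` is a hypothetical stand-in: the struct mirrors the field names of `pcre2_callout_enumerate_block`, and `demo_enumerate` only models the documented behaviour of \fBpcre2_callout_enumerate()\fP. The `reject_255` policy and its return code 7 are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pcre2_callout_enumerate_block. */
typedef struct demo_enumerate_block {
  uint32_t version;
  size_t pattern_position;
  size_t next_item_length;
  uint32_t callout_number;
  size_t callout_string_offset;
  size_t callout_string_length;
  const unsigned char *callout_string;
} demo_enumerate_block;

typedef int (*demo_callback)(demo_enumerate_block *, void *);

/* Models the documented contract: call the callback for each callout in
   order, stop at the first non-zero return, and pass that value back. */
static int demo_enumerate(demo_enumerate_block *blocks, size_t count,
                          demo_callback callback, void *user_data)
{
  assert(callback != NULL);
  for (size_t i = 0; i < count; i++)
  {
    int rc = callback(&blocks[i], user_data);
    if (rc != 0) return rc;
  }
  return 0;
}

/* Example callback: counts accepted callouts in *user_data and rejects
   callout number 255 with the arbitrary code 7. */
static int reject_255(demo_enumerate_block *block, void *user_data)
{
  if (block->callout_number == 255) return 7;
  ++*(int *)user_data;
  return 0;
}
```

A script language binding could use the same shape to validate every callout argument before the first match is attempted.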
|
|
|
@ -337,7 +337,7 @@ A second use of backslash provides a way of encoding non-printing characters
|
|||
in patterns in a visible manner. There is no restriction on the appearance of
|
||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||
text editing, it is often easier to use one of the following escape sequences
|
||||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
.sp
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
|
@ -372,15 +372,15 @@ to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
|||
\e? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
.P
|
||||
Thus, apart from \e?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \eG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
.P
|
||||
The sequence \e? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \e? generate 95; otherwise it generates 255.
|
||||
.P
|
||||
After \e0 up to two further octal digits are read. If there are fewer than two
|
||||
|
@ -415,7 +415,7 @@ later,
|
|||
following the discussion of
|
||||
.\" HTML <a href="#subpattern">
|
||||
.\" </a>
|
||||
parenthesized subpatterns.
|
||||
parenthesized subpatterns.
|
||||
.\"
|
||||
Otherwise, up to three octal digits are read to form a character code.
|
||||
.P
|
||||
|
@ -1128,7 +1128,7 @@ they test for a particular condition being true without consuming any
|
|||
characters from the subject string. These two metacharacters are concerned with
|
||||
matching the starts and ends of lines. If the newline convention is set so that
|
||||
only the two-character sequence CRLF is recognized as a newline, isolated CR
|
||||
and LF characters are treated as ordinary data characters, and are not
|
||||
and LF characters are treated as ordinary data characters, and are not
|
||||
recognized as newlines.
|
||||
.P
|
||||
Outside a character class, in the default matching mode, the circumflex
|
||||
|
@ -1220,14 +1220,14 @@ whether or not a UTF mode is set. In the 8-bit library, one code unit is one
|
|||
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
|
||||
32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
|
||||
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
|
||||
but it is unclear how it can usefully be used.
|
||||
but it is unclear how it can usefully be used.
|
||||
.P
|
||||
Because \eC breaks up characters into individual code units, matching one unit
|
||||
with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \eC by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
.P
|
||||
PCRE2 does not allow \eC to appear in lookbehind assertions
|
||||
|
@ -1505,7 +1505,7 @@ unset these options by preceding the letter with a hyphen, and a combined
|
|||
setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
|
||||
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
||||
permitted. If a letter appears both before and after the hyphen, the option is
|
||||
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
|
||||
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
|
||||
effect.
|
||||
.P
|
||||
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
|
||||
|
@ -1542,7 +1542,7 @@ appear between the "?" and the ":". Thus the two patterns
|
|||
(?i:saturday|sunday)
|
||||
(?:(?i)saturday|sunday)
|
||||
.sp
|
||||
match exactly the same set of strings.
|
||||
match exactly the same set of strings.
|
||||
.P
|
||||
\fBNote:\fP There are other PCRE2-specific options that can be set by the
|
||||
application when the compiling function is called. The pattern can contain
|
||||
|
@ -2907,14 +2907,14 @@ condition.
|
|||
.SS "Callouts with string arguments"
|
||||
.rs
|
||||
.sp
|
||||
A delimited string may be used instead of a number as a callout argument. The
|
||||
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
|
||||
the same as the start, except for {, where the ending delimiter is }. If the
|
||||
ending delimiter is needed within the string, it must be doubled. For
|
||||
A delimited string may be used instead of a number as a callout argument. The
|
||||
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
|
||||
the same as the start, except for {, where the ending delimiter is }. If the
|
||||
ending delimiter is needed within the string, it must be doubled. For
|
||||
example:
|
||||
.sp
|
||||
(?C'ab ''c'' d')xyz(?C{any text})pqr
|
||||
.sp
|
||||
.sp
|
||||
The doubling is removed before the string is passed to the callout function.
|
||||
.
|
||||
.
|
||||
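The delimiter rules in this hunk can be illustrated with a small standalone routine. This is a sketch of the documented behaviour, not the code PCRE2 uses: `undouble_callout_string` is a hypothetical name, and the input is assumed to be a NUL-terminated string that begins at the opening delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the callout string starting at src[0] (the opening delimiter) into
   dst, collapsing each doubled closing delimiter into one. Returns the
   result length, or (size_t)-1 if the delimiter is not recognized, the
   string is unterminated, or dst is too small. */
static size_t undouble_callout_string(const char *src, char *dst,
                                      size_t dstsize)
{
  assert(src != NULL && dst != NULL);
  if (src[0] == '\0' || strchr("`'\"^%#${", src[0]) == NULL)
    return (size_t)-1;
  char close = (src[0] == '{') ? '}' : src[0];
  size_t n = 0;
  for (const char *p = src + 1; *p != '\0'; p++)
  {
    if (*p == close)
    {
      if (p[1] != close)          /* a single delimiter ends the string */
      {
        if (n >= dstsize) return (size_t)-1;
        dst[n] = '\0';
        return n;
      }
      p++;                        /* doubled: skip one, keep the other */
    }
    if (n + 1 >= dstsize) return (size_t)-1;
    dst[n++] = *p;
  }
  return (size_t)-1;              /* no closing delimiter found */
}
```

Applied to the text following `(?C` in the example pattern, `'ab ''c'' d'` yields `ab 'c' d` and `{any text}` yields `any text`.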
|
|
|
@ -49,7 +49,7 @@ in the
|
|||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
.P
|
||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||
|
|
|
@ -65,7 +65,7 @@ below). The input is processed using C's string functions, so must not
|
|||
contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
|
||||
treats any bytes other than newline as data characters. In some Windows
|
||||
environments character 26 (hex 1A) causes an immediate end of file, and no
|
||||
further data is read.
|
||||
further data is read.
|
||||
.P
|
||||
For maximum portability, therefore, it is safest to avoid non-printing
|
||||
characters in \fBpcre2test\fP input files. There is a facility for specifying a
|
||||
|
@ -237,7 +237,7 @@ following commands are recognized:
|
|||
#forbid_utf
|
||||
.sp
|
||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
|
||||
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
|
||||
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
|
||||
the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
|
||||
an error if a subsequent pattern contains any occurrences of \eP, \ep, or \eX,
|
||||
which are still supported when PCRE2_UTF is not set, but which require Unicode
|
||||
|
@ -245,7 +245,7 @@ property support to be included in the library.
|
|||
.P
|
||||
This is a trigger guard that is used in test files to ensure that UTF or
|
||||
Unicode property tests are not accidentally added to files that are used when
|
||||
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
|
||||
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
|
||||
PCRE2_NEVER_UCP as a default can also be obtained by the use of \fB#pattern\fP;
|
||||
the difference is that \fB#forbid_utf\fP cannot be unset, and the automatic
|
||||
options are not displayed in pattern information, to avoid cluttering up test
|
||||
|
@ -443,7 +443,7 @@ for a description of their effects.
|
|||
.sp
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
|
@ -454,7 +454,7 @@ for a description of their effects.
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
|
@ -481,7 +481,7 @@ about the pattern:
|
|||
.sp
|
||||
bsr=[anycrlf|unicode] specify \eR handling
|
||||
/B bincode show binary code without lengths
|
||||
callout_info show callout information
|
||||
callout_info show callout information
|
||||
debug same as info,fullbincode
|
||||
fullbincode show binary code with lengths
|
||||
/I info show info about compiled pattern
|
||||
|
@ -559,9 +559,9 @@ unit" is the last literal code unit that must be present in any match. This is
|
|||
not necessarily the last character. These lines are omitted if no starting or
|
||||
ending code units are recorded.
|
||||
.P
|
||||
The \fBcallout_info\fP modifier requests information about all the callouts in
|
||||
the pattern. A list of them is output at the end of any other information that
|
||||
is requested. For each callout, either its number or string is given, followed
|
||||
The \fBcallout_info\fP modifier requests information about all the callouts in
|
||||
the pattern. A list of them is output at the end of any other information that
|
||||
is requested. For each callout, either its number or string is given, followed
|
||||
by the item that follows it in the pattern.
|
||||
.
|
||||
.
|
||||
|
|
|
@ -226,14 +226,20 @@ COMMAND LINES
|
|||
#forbid_utf
|
||||
|
||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and
|
||||
PCRE2_NEVER_UCP options set, which locks out the use of UTF and Unicode
|
||||
property features. This is a trigger guard that is used in test files
|
||||
to ensure that UTF or Unicode property tests are not accidentally added
|
||||
to files that are used when Unicode support is not included in the
|
||||
library. This effect can also be obtained by the use of #pattern; the
|
||||
difference is that #forbid_utf cannot be unset, and the automatic
|
||||
options are not displayed in pattern information, to avoid cluttering
|
||||
up test output.
|
||||
PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
|
||||
and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
|
||||
patterns. This command also forces an error if a subsequent pattern
|
||||
contains any occurrences of \P, \p, or \X, which are still supported
|
||||
when PCRE2_UTF is not set, but which require Unicode property support
|
||||
to be included in the library.
|
||||
|
||||
This is a trigger guard that is used in test files to ensure that UTF
|
||||
or Unicode property tests are not accidentally added to files that are
|
||||
used when Unicode support is not included in the library. Setting
|
||||
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
|
||||
by the use of #pattern; the difference is that #forbid_utf cannot be
|
||||
unset, and the automatic options are not displayed in pattern informa-
|
||||
tion, to avoid cluttering up test output.
|
||||
|
||||
#load <filename>
|
||||
|
||||
|
@ -417,6 +423,7 @@ PATTERN MODIFIERS
|
|||
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
|
@ -427,6 +434,7 @@ PATTERN MODIFIERS
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
|
@ -1322,5 +1330,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 22 March 2015
|
||||
Last updated: 20 May 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
|
|
|
@ -200,7 +200,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_NAME "PCRE2"
|
||||
|
||||
/* Define to the full name and version of this package. */
|
||||
#define PACKAGE_STRING "PCRE2 10.10"
|
||||
#define PACKAGE_STRING "PCRE2 10.20-RC1"
|
||||
|
||||
/* Define to the one symbol short name of this package. */
|
||||
#define PACKAGE_TARNAME "pcre2"
|
||||
|
@ -209,7 +209,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_URL ""
|
||||
|
||||
/* Define to the version of this package. */
|
||||
#define PACKAGE_VERSION "10.10"
|
||||
#define PACKAGE_VERSION "10.20-RC1"
|
||||
|
||||
/* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested
|
||||
parentheses (of any kind) in a pattern. This limits the amount of system
|
||||
|
@ -227,6 +227,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PCRE2GREP_BUFSIZE 20480
|
||||
#endif
|
||||
|
||||
/* Define to any value to include debugging code. */
|
||||
/* #undef PCRE2_DEBUG */
|
||||
|
||||
/* If you are compiling for a system other than a Unix-like system or
|
||||
Win32, and it needs some magic to be inserted before the definition
|
||||
of a function that is exported by the library, define this macro to
|
||||
|
@ -287,7 +290,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
/* #undef SUPPORT_VALGRIND */
|
||||
|
||||
/* Version number of package */
|
||||
#define VERSION "10.10"
|
||||
#define VERSION "10.20-RC1"
|
||||
|
||||
/* Define to empty if `const' does not conform to ANSI C. */
|
||||
/* #undef const */
|
||||
|
|
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
/* The current PCRE version information. */
|
||||
|
||||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 10
|
||||
#define PCRE2_PRERELEASE
|
||||
#define PCRE2_DATE 2015-03-06
|
||||
#define PCRE2_MINOR 20
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2015-06-16
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
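Because PCRE2_PRERELEASE is defined as bare tokens rather than a string literal, a version string has to be assembled with preprocessor stringification. The sketch below shows one way to do that; the `DEMO_` macros and `demo_version` are hypothetical names that copy the values from this hunk, and the approach is not claimed to be how the library itself builds its version string.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAJOR      10
#define DEMO_MINOR      20
#define DEMO_PRERELEASE -RC1

/* Two-level expansion so that macro arguments are expanded before '#'. */
#define DEMO_STR(x)  #x
#define DEMO_XSTR(x) DEMO_STR(x)

/* Compile-time check that the numeric parts are what the string will say. */
static_assert(DEMO_MAJOR == 10 && DEMO_MINOR == 20, "unexpected version");

/* Adjacent string literals are concatenated by the compiler. */
static const char *demo_version(void)
{
  return DEMO_XSTR(DEMO_MAJOR) "." DEMO_XSTR(DEMO_MINOR)
         DEMO_XSTR(DEMO_PRERELEASE);
}

/* A release candidate is recognizable by the '-' in the suffix. */
static int demo_is_prerelease(void)
{
  return strchr(demo_version(), '-') != NULL;
}
```

Note that this simple form assumes the prerelease macro is non-empty; an empty definition would need separate handling.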
|
@ -118,6 +118,8 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_UCP 0x00020000u /* C J M D */
|
||||
#define PCRE2_UNGREEDY 0x00040000u /* C */
|
||||
#define PCRE2_UTF 0x00080000u /* C J M D */
|
||||
#define PCRE2_NEVER_BACKSLASH_C 0x00100000u /* C */
|
||||
#define PCRE2_ALT_CIRCUMFLEX 0x00200000u /* J M D */
|
||||
|
||||
/* These are for pcre2_jit_compile(). */
|
||||
|
||||
|
@ -125,9 +127,10 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_JIT_PARTIAL_SOFT 0x00000002u
|
||||
#define PCRE2_JIT_PARTIAL_HARD 0x00000004u
|
||||
|
||||
/* These are for pcre2_match() and pcre2_dfa_match(). Note that PCRE2_ANCHORED,
|
||||
and PCRE2_NO_UTF_CHECK can also be passed to these functions, so take care not
|
||||
to define synonyms by mistake. */
|
||||
/* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). Note
|
||||
that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these
|
||||
functions (though pcre2_jit_match() ignores the latter since it bypasses all
|
||||
sanity checks). */
|
||||
|
||||
#define PCRE2_NOTBOL 0x00000001u
|
||||
#define PCRE2_NOTEOL 0x00000002u
|
||||
|
@ -337,8 +340,24 @@ typedef struct pcre2_callout_block { \
|
|||
PCRE2_SIZE current_position; /* Where we currently are in the subject */ \
|
||||
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
|
||||
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
|
||||
/* ------------------- Added for Version 1 -------------------------- */ \
|
||||
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
|
||||
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
|
||||
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
|
||||
/* ------------------------------------------------------------------ */ \
|
||||
} pcre2_callout_block;
|
||||
} pcre2_callout_block; \
|
||||
\
|
||||
typedef struct pcre2_callout_enumerate_block { \
|
||||
uint32_t version; /* Identifies version of block */ \
|
||||
/* ------------------------ Version 0 ------------------------------- */ \
|
||||
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
|
||||
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
|
||||
uint32_t callout_number; /* Number compiled into pattern */ \
|
||||
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
|
||||
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
|
||||
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
|
||||
/* ------------------------------------------------------------------ */ \
|
||||
} pcre2_callout_enumerate_block;
|
||||
|
||||
|
||||
/* List the generic forms of all other functions in macros, which will be
|
||||
|
@ -406,6 +425,9 @@ PCRE2_EXP_DECL void pcre2_code_free(pcre2_code *);
|
|||
|
||||
#define PCRE2_PATTERN_INFO_FUNCTIONS \
|
||||
PCRE2_EXP_DECL int pcre2_pattern_info(const pcre2_code *, uint32_t, \
|
||||
void *); \
|
||||
PCRE2_EXP_DECL int pcre2_callout_enumerate(const pcre2_code *, \
|
||||
int (*)(pcre2_callout_enumerate_block *, void *), \
|
||||
void *);
|
||||
|
||||
|
||||
|
@ -534,15 +556,17 @@ pcre2_compile are called by application code. */
|
|||
|
||||
/* Data blocks */
|
||||
|
||||
#define pcre2_callout_block PCRE2_SUFFIX(pcre2_callout_block_)
|
||||
#define pcre2_general_context PCRE2_SUFFIX(pcre2_general_context_)
|
||||
#define pcre2_compile_context PCRE2_SUFFIX(pcre2_compile_context_)
|
||||
#define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_)
|
||||
#define pcre2_match_data PCRE2_SUFFIX(pcre2_match_data_)
|
||||
#define pcre2_callout_block PCRE2_SUFFIX(pcre2_callout_block_)
|
||||
#define pcre2_callout_enumerate_block PCRE2_SUFFIX(pcre2_callout_enumerate_block_)
|
||||
#define pcre2_general_context PCRE2_SUFFIX(pcre2_general_context_)
|
||||
#define pcre2_compile_context PCRE2_SUFFIX(pcre2_compile_context_)
|
||||
#define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_)
|
||||
#define pcre2_match_data PCRE2_SUFFIX(pcre2_match_data_)
|
||||
|
||||
|
||||
/* Functions: the complete list in alphabetical order */
|
||||
|
||||
#define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_)
|
||||
#define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_)
|
||||
#define pcre2_compile PCRE2_SUFFIX(pcre2_compile_)
|
||||
#define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_)
|
||||
|
@ -550,7 +574,6 @@ pcre2_compile are called by application code. */
|
|||
#define pcre2_compile_context_free PCRE2_SUFFIX(pcre2_compile_context_free_)
|
||||
#define pcre2_config PCRE2_SUFFIX(pcre2_config_)
|
||||
#define pcre2_dfa_match PCRE2_SUFFIX(pcre2_dfa_match_)
|
||||
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
|
||||
#define pcre2_general_context_copy PCRE2_SUFFIX(pcre2_general_context_copy_)
|
||||
#define pcre2_general_context_create PCRE2_SUFFIX(pcre2_general_context_create_)
|
||||
#define pcre2_general_context_free PCRE2_SUFFIX(pcre2_general_context_free_)
|
||||
|
@ -566,6 +589,7 @@ pcre2_compile are called by application code. */
|
|||
#define pcre2_jit_stack_create PCRE2_SUFFIX(pcre2_jit_stack_create_)
|
||||
#define pcre2_jit_stack_free PCRE2_SUFFIX(pcre2_jit_stack_free_)
|
||||
#define pcre2_maketables PCRE2_SUFFIX(pcre2_maketables_)
|
||||
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
|
||||
#define pcre2_match_context_copy PCRE2_SUFFIX(pcre2_match_context_copy_)
|
||||
#define pcre2_match_context_create PCRE2_SUFFIX(pcre2_match_context_create_)
|
||||
#define pcre2_match_context_free PCRE2_SUFFIX(pcre2_match_context_free_)
|
||||
|
|
|
@ -129,7 +129,7 @@ D is inspected during pcre2_dfa_match() execution
|
|||
|
||||
/* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). Note
|
||||
that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these
|
||||
functions (though pcre2_jit_match() ignores the latter since it bypasses all
|
||||
functions (though pcre2_jit_match() ignores the latter since it bypasses all
|
||||
sanity checks). */
|
||||
|
||||
#define PCRE2_NOTBOL 0x00000001u
|
||||
|
|
|
@ -562,7 +562,7 @@ Arguments:
|
|||
cb compile data block
|
||||
base_list the data list of the base opcode
|
||||
base_end the end of the data list
|
||||
rec_limit points to recursion depth counter
|
||||
rec_limit points to recursion depth counter
|
||||
|
||||
Returns: TRUE if the auto-possessification is possible
|
||||
*/
|
||||
|
@ -664,7 +664,7 @@ for(;;)
|
|||
|
||||
while (*next_code == OP_ALT)
|
||||
{
|
||||
if (!compare_opcodes(code, utf, cb, base_list, base_end, rec_limit))
|
||||
if (!compare_opcodes(code, utf, cb, base_list, base_end, rec_limit))
|
||||
return FALSE;
|
||||
code = next_code + 1 + LINK_SIZE;
|
||||
next_code += GET(next_code, 1);
|
||||
|
|
|
@ -2632,14 +2632,14 @@ for (;;)
|
|||
if (code[LINK_SIZE + 1] == OP_CALLOUT)
|
||||
{
|
||||
cb.callout_number = code[2 + 3*LINK_SIZE];
|
||||
cb.callout_string_offset = 0;
|
||||
cb.callout_string_offset = 0;
|
||||
cb.callout_string = NULL;
|
||||
cb.callout_string_length = 0;
|
||||
}
|
||||
else
|
||||
{
|
||||
cb.callout_number = 0;
|
||||
cb.callout_string_offset = GET(code, 2 + 4*LINK_SIZE);
|
||||
cb.callout_string_offset = GET(code, 2 + 4*LINK_SIZE);
|
||||
cb.callout_string = code + (2 + 5*LINK_SIZE) + 1;
|
||||
cb.callout_string_length =
|
||||
callout_length - (1 + 4*LINK_SIZE) - 2;
|
||||
|
@ -2663,7 +2663,7 @@ for (;;)
|
|||
|
||||
/* The DEFINE condition is always false, and the assertion (?!) is
|
||||
converted to OP_FAIL. */
|
||||
|
||||
|
||||
if (condcode == OP_FALSE || condcode == OP_FAIL)
|
||||
{ ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
|
||||
|
||||
|
@ -3001,14 +3001,14 @@ for (;;)
|
|||
if (*code == OP_CALLOUT)
|
||||
{
|
||||
cb.callout_number = code[1 + 2*LINK_SIZE];
|
||||
cb.callout_string_offset = 0;
|
||||
cb.callout_string_offset = 0;
|
||||
cb.callout_string = NULL;
|
||||
cb.callout_string_length = 0;
|
||||
}
|
||||
else
|
||||
{
|
||||
cb.callout_number = 0;
|
||||
cb.callout_string_offset = GET(code, 1 + 3*LINK_SIZE);
|
||||
cb.callout_string_offset = GET(code, 1 + 3*LINK_SIZE);
|
||||
cb.callout_string = code + (1 + 4*LINK_SIZE) + 1;
|
||||
cb.callout_string_length =
|
||||
callout_length - (1 + 4*LINK_SIZE) - 2;
|
||||
|
|
|
@ -145,9 +145,9 @@ static const char compile_error_texts[] =
|
|||
"different names for subpatterns of the same number are not allowed\0"
|
||||
"(*MARK) must have an argument\0"
|
||||
"non-hex character in \\x{} (closing brace missing?)\0"
|
||||
#ifndef EBCDIC
|
||||
#ifndef EBCDIC
|
||||
"\\c must be followed by a printable ASCII character\0"
|
||||
#else
|
||||
#else
|
||||
"\\c must be followed by a letter or one of [\\]^_?\0"
|
||||
#endif
|
||||
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
|
||||
|
@ -168,7 +168,7 @@ static const char compile_error_texts[] =
|
|||
"missing terminating delimiter for callout with string argument\0"
|
||||
"unrecognized string delimiter follows (?C\0"
|
||||
"using \\C is disabled by the application\0"
|
||||
"(?| and/or (?J: or (?x: parentheses are too deeply nested\0"
|
||||
"(?| and/or (?J: or (?x: parentheses are too deeply nested\0"
|
||||
;
|
||||
|
||||
/* Match-time and UTF error texts are in the same format. */
|
||||
|
|
|
@ -1230,7 +1230,7 @@ contain characters with values greater than 255. */
|
|||
#define XCL_PROP 3 /* Unicode property (2-byte property code follows) */
|
||||
#define XCL_NOTPROP 4 /* Unicode inverted property (ditto) */
|
||||
|
||||
/* Escape items that are just an encoding of a particular data value. These
|
||||
/* Escape items that are just an encoding of a particular data value. These
|
||||
appear in the escapes[] table in pcre2_compile.c as positive numbers. */
|
||||
|
||||
#ifndef ESC_a
|
||||
|
@ -1262,7 +1262,7 @@ appear in the escapes[] table in pcre2_compile.c as positive numbers. */
|
|||
|
||||
/* These are escaped items that aren't just an encoding of a particular data
|
||||
value such as \n. They must have non-zero values, as check_escape() returns 0
|
||||
for a data character. In the escapes[] table in pcre2_compile.c their values
|
||||
for a data character. In the escapes[] table in pcre2_compile.c their values
|
||||
are negated in order to distinguish them from data values.
|
||||
|
||||
They must appear here in the same order as in the opcode definitions below, up
|
||||
|
|
|
@ -662,7 +662,7 @@ typedef struct named_group {
|
|||
PCRE2_SPTR name; /* Points to the name in the pattern */
|
||||
uint32_t number; /* Group number */
|
||||
uint16_t length; /* Length of the name */
|
||||
uint16_t isdup; /* TRUE if a duplicate */
|
||||
uint16_t isdup; /* TRUE if a duplicate */
|
||||
} named_group;
|
||||
|
||||
/* Structure for passing "static" information around between the functions
|
||||
|
|
|
@ -6432,13 +6432,13 @@ if (*cc == OP_CALLOUT)
|
|||
{
|
||||
value1 = 0;
|
||||
value2 = 0;
|
||||
value3 = 0;
|
||||
value3 = 0;
|
||||
}
|
||||
else
|
||||
{
|
||||
value1 = (sljit_sw) (cc + (1 + 4*LINK_SIZE) + 1);
|
||||
value2 = (callout_length - (1 + 4*LINK_SIZE + 2));
|
||||
value3 = (sljit_sw) (GET(cc, 1 + 3*LINK_SIZE));
|
||||
value3 = (sljit_sw) (GET(cc, 1 + 3*LINK_SIZE));
|
||||
}
|
||||
|
||||
OP1(SLJIT_MOV, SLJIT_MEM1(STACK_TOP), CALLOUT_ARG_OFFSET(callout_string), SLJIT_IMM, value1);
|
||||
|
|
|
@ -2156,14 +2156,14 @@ for (;;)
|
|||
ecode++;
|
||||
break;
|
||||
|
||||
/* Multiline mode: start of subject unless notbol, or after any newline
|
||||
/* Multiline mode: start of subject unless notbol, or after any newline
|
||||
except for one at the very end, unless PCRE2_ALT_CIRCUMFLEX is set. */
|
||||
|
||||
case OP_CIRCM:
|
||||
if ((mb->moptions & PCRE2_NOTBOL) != 0 && eptr == mb->start_subject)
|
||||
RRETURN(MATCH_NOMATCH);
|
||||
if (eptr != mb->start_subject &&
|
||||
((eptr == mb->end_subject &&
|
||||
((eptr == mb->end_subject &&
|
||||
(mb->poptions & PCRE2_ALT_CIRCUMFLEX) == 0) ||
|
||||
!WAS_NEWLINE(eptr)))
|
||||
RRETURN(MATCH_NOMATCH);
|
||||
|
|
|
@ -239,7 +239,7 @@ Arguments:
|
|||
|
||||
Returns: 0 when successfully completed
|
||||
< 0 on local error
|
||||
!= 0 for callback error
|
||||
!= 0 for callback error
|
||||
*/
|
||||
|
||||
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
|
||||
|
@ -270,7 +270,7 @@ cc = (PCRE2_SPTR)((uint8_t *)re + sizeof(pcre2_real_code))
|
|||
|
||||
while (TRUE)
|
||||
{
|
||||
int rc;
|
||||
int rc;
|
||||
switch (*cc)
|
||||
{
|
||||
case OP_END:
|
||||
|
@ -378,7 +378,7 @@ while (TRUE)
|
|||
cb.callout_string_length = 0;
|
||||
cb.callout_string = NULL;
|
||||
rc = callback(&cb, callout_data);
|
||||
if (rc != 0) return rc;
|
||||
if (rc != 0) return rc;
|
||||
cc += PRIV(OP_lengths)[*cc];
|
||||
break;
|
||||
|
||||
|
@ -391,7 +391,7 @@ while (TRUE)
|
|||
GET(cc, 1 + 2*LINK_SIZE) - (1 + 4*LINK_SIZE) - 2;
|
||||
cb.callout_string = cc + (1 + 4*LINK_SIZE) + 1;
|
||||
rc = callback(&cb, callout_data);
|
||||
if (rc != 0) return rc;
|
||||
if (rc != 0) return rc;
|
||||
cc += GET(cc, 1 + 2*LINK_SIZE);
|
||||
break;
|
||||
|
||||
|
|
|
@ -67,18 +67,18 @@ const uint32_t PRIV(hspace_list)[] = { HSPACE_LIST };
|
|||
const uint32_t PRIV(vspace_list)[] = { VSPACE_LIST };
|
||||
|
||||
/* These tables are the pairs of delimiters that are valid for callout string
|
||||
arguments. For each starting delimiter there must be a matching ending
|
||||
arguments. For each starting delimiter there must be a matching ending
|
||||
delimiter, which in fact is different only for bracket-like delimiters. */
|
||||
|
||||
const uint32_t PRIV(callout_start_delims)[] = {
|
||||
CHAR_GRAVE_ACCENT, CHAR_APOSTROPHE, CHAR_QUOTATION_MARK,
|
||||
CHAR_CIRCUMFLEX_ACCENT, CHAR_PERCENT_SIGN, CHAR_NUMBER_SIGN,
|
||||
CHAR_DOLLAR_SIGN, CHAR_LEFT_CURLY_BRACKET, 0 };
|
||||
CHAR_DOLLAR_SIGN, CHAR_LEFT_CURLY_BRACKET, 0 };
|
||||
|
||||
const uint32_t PRIV(callout_end_delims[]) = {
|
||||
CHAR_GRAVE_ACCENT, CHAR_APOSTROPHE, CHAR_QUOTATION_MARK,
|
||||
CHAR_CIRCUMFLEX_ACCENT, CHAR_PERCENT_SIGN, CHAR_NUMBER_SIGN,
|
||||
CHAR_DOLLAR_SIGN, CHAR_RIGHT_CURLY_BRACKET, 0 };
|
||||
CHAR_DOLLAR_SIGN, CHAR_RIGHT_CURLY_BRACKET, 0 };
|
||||
|
||||
|
||||
/*************************************************
|
||||
|
|
|
@ -4492,9 +4492,9 @@ if (TEST(compiled_code, ==, NULL))
|
|||
fprintf(outfile, "\n");
|
||||
return PR_SKIP;
|
||||
}
|
||||
|
||||
/* If forbid_utf is non-zero, we are running a non-UTF test. UTF and UCP are
|
||||
locked out at compile time, but we must also check for occurrences of \P, \p,
|
||||
|
||||
/* If forbid_utf is non-zero, we are running a non-UTF test. UTF and UCP are
|
||||
locked out at compile time, but we must also check for occurrences of \P, \p,
|
||||
and \X, which are only supported when Unicode is supported. */
|
||||
|
||||
if (forbid_utf != 0)
|
||||
|
@ -4503,9 +4503,9 @@ if (forbid_utf != 0)
|
|||
{
|
||||
fprintf(outfile, "** \\P, \\p, and \\X are not allowed after the "
|
||||
"#forbid_utf command\n");
|
||||
return PR_SKIP;
|
||||
}
|
||||
}
|
||||
return PR_SKIP;
|
||||
}
|
||||
}
|
||||
|
||||
/* Remember the maximum lookbehind, for partial matching. */
|
||||
|
||||
|
@ -5095,7 +5095,7 @@ if (dbuffer != NULL)
|
|||
#endif
|
||||
|
||||
/* Allocate a buffer to hold the data line; len+1 is an upper bound on
|
||||
the number of code units that will be needed (though the buffer may have to be
|
||||
the number of code units that will be needed (though the buffer may have to be
|
||||
extended if replication is involved). */
|
||||
|
||||
needlen = (size_t)(len * code_unit_size);
|
||||
|
@ -5145,7 +5145,7 @@ while ((c = *p++) != 0)
|
|||
|
||||
replen = CAST8VAR(q) - start_rep;
|
||||
needlen += replen * i;
|
||||
|
||||
|
||||
if (needlen >= dbuffer_size)
|
||||
{
|
||||
while (needlen >= dbuffer_size) dbuffer_size *= 2;
|
||||
|
@ -5158,7 +5158,7 @@ while ((c = *p++) != 0)
|
|||
SETCASTPTR(q, dbuffer + qoffset);
|
||||
start_rep = dbuffer + rep_offset;
|
||||
}
|
||||
|
||||
|
||||
while (i-- > 0)
|
||||
{
|
||||
memcpy(CAST8VAR(q), start_rep, replen);
|
||||
|
|