Source and document file tidies for 10.20-RC1.
parent
a68ddd48b5
commit
07a8fdce25
|
@ -1,8 +1,8 @@
|
|||
Change Log for PCRE2
|
||||
--------------------
|
||||
|
||||
Version 10.20 xx-xx-2015
|
||||
------------------------
|
||||
Version 10.20 16-June-2015
|
||||
--------------------------
|
||||
|
||||
1. Callouts with string arguments have been added.
|
||||
|
||||
|
|
30
HACKING
|
@ -104,6 +104,21 @@ system stack used by the compile function, which uses recursive function calls
|
|||
for nested parenthesized groups. This is a safety feature for environments with
|
||||
small stacks where the patterns are provided by users.
|
||||
|
||||
History repeated itself for release 10.20. A number of bugs relating to named
|
||||
subpatterns had been discovered by fuzzers. Most of these were related to the
|
||||
handling of forward references when it was not known if the named pattern was
|
||||
unique. (References to non-unique names use a different opcode and more
|
||||
memory.) The use of duplicate group numbers (the (?| facility) also caused
|
||||
issues.
|
||||
|
||||
To get around these problems I adopted a new approach by adding a third pass,
|
||||
really a "pre-pass", over the pattern, which does nothing other than identify
|
||||
all the named subpatterns and their corresponding group numbers. This means
|
||||
that the actual compile (both pre-pass and real compile) has full knowledge of
|
||||
group names and numbers throughout. Several dozen lines of messy code were
|
||||
eliminated, though the new pre-pass is not short (skipping over [] classes is
|
||||
complicated).
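The idea of such a pre-pass can be illustrated with a much-simplified sketch. This is not PCRE2's code: the function name find_named_groups, the regular expression used to spot names, and the reduced handling of escapes and [] classes are all assumptions made for illustration. Constructs such as \Q...\E, POSIX class names, (?| groups, and extended-mode comments are deliberately ignored.

```python
import re

# Hypothetical helper: start of a named group, e.g. ?<name>, ?'name', ?P<name>
NAME_START = re.compile(r"\?(?:P?<|')(\w+)[>']")

def find_named_groups(pattern):
    """Map each group name to the list of group numbers that carry it."""
    names = {}
    group = 0
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\":              # escaped character: skip it with its backslash
            i += 2
            continue
        if c == "[":               # character class: parentheses inside are literal
            i += 1
            if i < n and pattern[i] == "^":
                i += 1
            if i < n and pattern[i] == "]":   # a leading ] is a literal
                i += 1
            while i < n and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            i += 1
            continue
        if c == "(":
            rest = pattern[i + 1:]
            m = NAME_START.match(rest)
            if m:
                group += 1
                names.setdefault(m.group(1), []).append(group)
            elif not rest.startswith(("?", "*")):
                group += 1         # plain capturing group
        i += 1
    return names
```

A later pass could consult the returned table whenever it meets a reference to a name, including a forward reference, and would know at once whether the name is unique.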
|
||||
|
||||
|
||||
Traditional matching function
|
||||
-----------------------------
|
||||
|
@ -343,8 +358,9 @@ do.
|
|||
|
||||
For classes containing characters with values greater than 255 or that contain
|
||||
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
|
||||
code points are less than 256, followed by a list of pairs (for a range) and
|
||||
single characters. In caseless mode, both cases are explicitly listed.
|
||||
code points are less than 256, followed by a list of pairs (for a range) and/or
|
||||
single characters and/or properties. In caseless mode, both cases are
|
||||
explicitly listed.
|
||||
|
||||
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
|
||||
opcode and its data. This is followed by a code unit containing flag bits:
|
||||
|
@ -431,7 +447,7 @@ bracket opcode.
|
|||
If a subpattern is quantified such that it is permitted to match zero times, it
|
||||
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
||||
single-unit opcodes that tell the matcher that skipping the following
|
||||
subpattern entirely is a valid branch. In the case of the first two, not
|
||||
subpattern entirely is a valid match. In the case of the first two, not
|
||||
skipping the pattern is also valid (greedy and non-greedy). The third is used
|
||||
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
|
||||
because it may be called as a subroutine from elsewhere in the pattern.
|
||||
|
@ -487,9 +503,9 @@ Forward assertions are also just like other subpatterns, but starting with one
|
|||
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
|
||||
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
|
||||
is OP_REVERSE, followed by a count of the number of characters to move back the
|
||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is a number
|
||||
of code units, but in UTF-8/16 mode each character may occupy more than one
|
||||
code unit. A separate count is present in each alternative of a lookbehind
|
||||
pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
|
||||
number of code units, but in UTF-8/16 mode each character may occupy more than
|
||||
one code unit. A separate count is present in each alternative of a lookbehind
|
||||
assertion, allowing them to have different (but fixed) lengths.
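The difference between counting characters and counting code units can be seen with a short sketch. It uses Python's own codecs rather than PCRE2, and the function name code_units is hypothetical.

```python
def code_units(text):
    """Return the number of code units text occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")) // 2,
        "utf-32": len(text.encode("utf-32-le")) // 4,
    }

# Four characters: a, e-acute, euro sign, and one character outside the BMP
sample = "a\u00e9\u20ac\U0001F600"
units = code_units(sample)
```

Only in the 32-bit case does the code unit count equal the character count, which is why moving back a given number of characters in UTF-8 or UTF-16 mode cannot be done by simple pointer arithmetic.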
|
||||
|
||||
|
||||
|
@ -585,4 +601,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
|
|||
correct length, in order to catch updating errors.
|
||||
|
||||
Philip Hazel
|
||||
March 2015
|
||||
June 2015
|
||||
|
|
20
NEWS
|
@ -1,6 +1,26 @@
|
|||
News about PCRE2 releases
|
||||
-------------------------
|
||||
|
||||
Version 10.20 16-June-2015
|
||||
--------------------------
|
||||
|
||||
1. Callouts with string arguments and the pcre2_callout_enumerate() function
|
||||
have been implemented.
|
||||
|
||||
2. The PCRE2_NEVER_BACKSLASH_C option, which locks out the use of \C, is added.
|
||||
|
||||
3. The PCRE2_ALT_CIRCUMFLEX option lets ^ match after a newline at the end of a
|
||||
subject in multiline mode.
|
||||
|
||||
4. The way named subpatterns are handled has been refactored. The previous
|
||||
approach had several bugs.
|
||||
|
||||
5. The handling of \c in EBCDIC environments has been changed to conform to the
|
||||
perlebcdic document. This is an incompatible change.
|
||||
|
||||
6. Bugs have been mended, many of them discovered by fuzzers.
|
||||
|
||||
|
||||
Version 10.10 06-March-2015
|
||||
---------------------------
|
||||
|
||||
|
|
|
@ -11,15 +11,15 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
|
|||
m4_define(pcre2_major, [10])
|
||||
m4_define(pcre2_minor, [20])
|
||||
m4_define(pcre2_prerelease, [-RC1])
|
||||
m4_define(pcre2_date, [2015-03-11])
|
||||
m4_define(pcre2_date, [2015-06-16])
|
||||
|
||||
# NOTE: The CMakeLists.txt file searches for the above variables in the first
|
||||
# 50 lines of this file. Please update that if the variables above are moved.
|
||||
|
||||
# Libtool shared library interface versions (current:revision:age)
|
||||
m4_define(libpcre2_8_version, [1:0:1])
|
||||
m4_define(libpcre2_16_version, [1:0:1])
|
||||
m4_define(libpcre2_32_version, [1:0:1])
|
||||
m4_define(libpcre2_8_version, [2:0:0])
|
||||
m4_define(libpcre2_16_version, [2:0:0])
|
||||
m4_define(libpcre2_32_version, [2:0:0])
|
||||
m4_define(libpcre2_posix_version, [0:0:0])
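As a reading aid for the current:revision:age triples above, the following sketch converts a triple to the numeric suffix that libtool conventionally gives a shared library on Linux, namely (current - age).age.revision. Other platforms use different schemes, and the function name linux_so_suffix is an assumption made for this example.

```python
def linux_so_suffix(version_info):
    """Convert a libtool current:revision:age triple to the suffix used on Linux."""
    current, revision, age = (int(part) for part in version_info.split(":"))
    if age > current:
        raise ValueError("age must not exceed current")
    return f"{current - age}.{age}.{revision}"
```

Under that convention, 1:0:1 maps to 0.1.0 and 2:0:0 maps to 2.0.0.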
|
||||
|
||||
AC_PREREQ(2.57)
|
||||
|
|
|
@ -294,6 +294,9 @@ library. They are also documented in the pcre2build man page.
|
|||
which specifies that the code value for the EBCDIC NL character is 0x25
|
||||
instead of the default 0x15.
|
||||
|
||||
. If you specify --enable-debug, additional debugging code is included in the
|
||||
build. This option is intended for use by the PCRE2 maintainers.
|
||||
|
||||
. In environments where valgrind is installed, if you specify
|
||||
|
||||
--enable-valgrind
|
||||
|
@ -829,4 +832,4 @@ The distribution should contain the files listed below.
|
|||
Philip Hazel
|
||||
Email local part: ph10
|
||||
Email domain: cam.ac.uk
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
|
|
|
@ -108,8 +108,14 @@ lose performance.
|
|||
<P>
|
||||
One way of guarding against this possibility is to use the
|
||||
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
||||
UTF. Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
|
||||
This causes a compile-time error if a pattern contains a UTF-setting sequence.
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||
<b>pcre2_compile()</b>. This causes a compile-time error if a pattern contains
|
||||
a UTF-setting sequence.
|
||||
</P>
|
||||
<P>
|
||||
The use of Unicode properties for character types such as \d can also be
|
||||
enabled from within the pattern, by specifying "(*UCP)". This feature can be
|
||||
disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
</P>
|
||||
<P>
|
||||
If your application is one that supports UTF, be aware that validity checking
|
||||
|
@ -118,6 +124,12 @@ the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
|
|||
running redundant checks.
|
||||
</P>
|
||||
<P>
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
|
||||
problems, because it may leave the current matching point in the middle of a
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
|
||||
lock out the use of \C, causing a compile-time error if it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Another way that performance can be hit is by running a pattern that has a very
|
||||
large search tree against a string that will never match. Nested unlimited
|
||||
repeats in a pattern are a common example. PCRE2 provides some protection
|
||||
|
@ -175,9 +187,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
|||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 18 November 2014
|
||||
Last updated: 13 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -49,6 +49,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
<pre>
|
||||
PCRE2_ANCHORED Force pattern anchoring
|
||||
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||
PCRE2_CASELESS Do caseless matching
|
||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||
|
@ -58,6 +59,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
|
||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
|
|
|
@ -1074,6 +1074,15 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
|
|||
to match. By default, as in Perl, a hexadecimal number is always expected after
|
||||
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
|
||||
binary zero character followed by z).
|
||||
<pre>
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
</pre>
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
|
||||
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
|
||||
after any internal newline. However, it does not match after a newline at the
|
||||
end of the subject, for compatibility with Perl. If you want a multiline
|
||||
circumflex also to match after a terminating newline, you must set
|
||||
PCRE2_ALT_CIRCUMFLEX.
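The rule just described can be modelled in a few lines. This is only a sketch of the documented behaviour, not PCRE2's implementation: it assumes that newline is the single character LF, it does not model PCRE2_NOTBOL, and the function name circumflex_positions is hypothetical.

```python
def circumflex_positions(subject, multiline=False, alt_circumflex=False):
    """Offsets at which a circumflex can match, assuming LF is the newline."""
    positions = [0]                       # the start of the subject
    if multiline:
        for i, ch in enumerate(subject):
            if ch != "\n":
                continue
            after = i + 1
            # Always after an internal newline; after a final one only with the option
            if after < len(subject) or alt_circumflex:
                positions.append(after)
    return positions
```

For the subject "abc\ndef\n" the model gives offsets 0 and 4 in multiline mode, and adds offset 8 only when the alternative behaviour is requested.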
|
||||
<pre>
|
||||
PCRE2_AUTO_CALLOUT
|
||||
</pre>
|
||||
|
@ -1174,8 +1183,19 @@ When PCRE2_MULTILINE is set, the "start of line" and "end of line"
|
|||
constructs match immediately following or immediately before internal newlines
|
||||
in the subject string, respectively, as well as at the very start and end. This
|
||||
is equivalent to Perl's /m option, and it can be changed within a pattern by a
|
||||
(?m) option setting. If there are no newlines in a subject string, or no
|
||||
occurrences of ^ or $ in a pattern, setting PCRE2_MULTILINE has no effect.
|
||||
(?m) option setting. Note that the "start of line" metacharacter does not match
|
||||
after a newline at the end of the subject, for compatibility with Perl.
|
||||
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
|
||||
there are no newlines in a subject string, or no occurrences of ^ or $ in a
|
||||
pattern, setting PCRE2_MULTILINE has no effect.
|
||||
<pre>
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
</pre>
|
||||
This option locks out the use of \C in the pattern that is being compiled.
|
||||
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
|
||||
it may leave the current matching point in the middle of a multi-code-unit
|
||||
character. This option may be useful in applications that process patterns from
|
||||
external sources.
|
||||
<pre>
|
||||
PCRE2_NEVER_UCP
|
||||
</pre>
|
||||
|
@ -1183,17 +1203,17 @@ This option locks out the use of Unicode properties for handling \B, \b, \D,
|
|||
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
|
||||
for the PCRE2_UCP option below. In particular, it prevents the creator of the
|
||||
pattern from enabling this facility by starting the pattern with (*UCP). This
|
||||
may be useful in applications that process patterns from external sources. The
|
||||
option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
|
||||
option may be useful in applications that process patterns from external
|
||||
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
|
||||
<pre>
|
||||
PCRE2_NEVER_UTF
|
||||
</pre>
|
||||
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
|
||||
UTF-32, depending on which library is in use. In particular, it prevents the
|
||||
creator of the pattern from switching to UTF interpretation by starting the
|
||||
pattern with (*UTF). This may be useful in applications that process patterns
|
||||
from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes
|
||||
an error.
|
||||
pattern with (*UTF). This option may be useful in applications that process
|
||||
patterns from external sources. The combination of PCRE2_UTF and
|
||||
PCRE2_NEVER_UTF causes an error.
|
||||
<pre>
|
||||
PCRE2_NO_AUTO_CAPTURE
|
||||
</pre>
|
||||
|
@ -2863,7 +2883,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 23 March 2015
|
||||
Last updated: 22 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -29,11 +29,12 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
|
||||
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||
<li><a name="TOC17" href="#SEC17">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC18" href="#SEC18">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
|
||||
<li><a name="TOC20" href="#SEC20">AUTHOR</a>
|
||||
<li><a name="TOC21" href="#SEC21">REVISION</a>
|
||||
<li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
|
||||
<li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
|
||||
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
|
||||
<li><a name="TOC22" href="#SEC22">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
|
||||
<P>
|
||||
|
@ -147,6 +148,12 @@ properties. The application can request that they do by setting the PCRE2_UCP
|
|||
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||
request this by starting with (*UCP).
|
||||
</P>
|
||||
<P>
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. It can be locked out by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
Just-in-time compiler support is included in the build by specifying
|
||||
|
@ -397,7 +404,16 @@ automatically included, you may need to add something like
|
|||
</pre>
|
||||
immediately before the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
--enable-debug
|
||||
</pre>
|
||||
to the <b>configure</b> command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
|
@ -407,7 +423,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
|||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<P>
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
||||
code coverage report for its test suite. To enable this, you must install
|
||||
|
@ -464,11 +480,11 @@ This cleans all coverage data including the generated coverage report. For more
|
|||
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -477,9 +493,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -357,10 +357,11 @@ A second use of backslash provides a way of encoding non-printing characters
|
|||
in patterns in a visible manner. There is no restriction on the appearance of
|
||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||
text editing, it is often easier to use one of the following escape sequences
|
||||
than the binary character it represents:
|
||||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any ASCII character
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
|
@ -377,23 +378,38 @@ The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
|||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
|
||||
code unit following \c has a value greater than 127, a compile-time error
|
||||
occurs. This locks out non-ASCII characters in all modes.
|
||||
code unit following \c has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
</P>
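The arithmetic of \cx on ASCII characters, with the revised range check, can be written out as a short sketch. The function name control_escape_ascii is hypothetical, and the code models the description in the text rather than reproducing PCRE2's source.

```python
def control_escape_ascii(x):
    """Code value produced by \\cx in an ASCII or Unicode environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code -= 32      # convert lower case to upper case
    return code ^ 0x40  # invert bit 6 (hex 40)
```

With this model, A and a both give hex 01, Z gives hex 1A, { gives hex 3B, and ; gives hex 7B, in agreement with the examples in the text.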
|
||||
<P>
|
||||
The \c facility was designed for use with ASCII characters, but with the
|
||||
extension to Unicode it is even less useful than it once was. It is, however,
|
||||
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
|
||||
bytes. In this mode, all values are valid after \c. If the next character is a
|
||||
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
|
||||
byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
|
||||
the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
|
||||
characters also generate different values.
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \c@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
|
||||
<P>
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \cG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
|
||||
</P>
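The EBCDIC rules for \c described above amount to a small lookup, sketched below. The character after \c is represented here by its ordinary name as a Python string, not by its EBCDIC code point. The function name control_escape_ebcdic is hypothetical, and the posix_bc parameter is an assumption that stands in for PCRE2's own detection of the POSIX-BC variant.

```python
def control_escape_ebcdic(x, posix_bc=False):
    """Code value produced by \\cx when following the perlebcdic rules."""
    if len(x) == 1 and x.isascii() and x.isalpha():
        return ord(x.upper()) - ord("A") + 1       # letters give 1-26
    specials = {"@": 0, "[": 27, "\\": 28, "]": 29, "^": 30, "_": 31}
    if x in specials:
        return specials[x]
    if x == "?":
        return 95 if posix_bc else 255             # the APC character
    raise ValueError("character not allowed after \\c")
```

Apart from the final case, the values are the same as those produced in an ASCII environment, which is the point made in the text.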
|
||||
<P>
|
||||
After \0 up to two further octal digits are read. If there are fewer than two
|
||||
digits, just those that are present are used. Thus the sequence \0\x\07
|
||||
specifies two binary zeros followed by a BEL character (code value 7). Make
|
||||
digits, just those that are present are used. Thus the sequence \0\x\015
|
||||
specifies two binary zeros followed by a CR character (code value 13). Make
|
||||
sure you supply two digits after the initial zero if the pattern character that
|
||||
follows is itself an octal digit.
|
||||
</P>
|
||||
|
@ -412,21 +428,24 @@ describe the old, ambiguous syntax.
|
|||
</P>
|
||||
<P>
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
and Perl has changed in recent releases, causing PCRE2 also to change. Outside
|
||||
a character class, PCRE2 reads the digit and any following digits as a decimal
|
||||
number. If the number is less than 8, or if there have been at least that many
|
||||
previous capturing left parentheses in the expression, the entire sequence is
|
||||
taken as a <i>back reference</i>. A description of how this works is given
|
||||
and Perl has changed over time, causing PCRE2 also to change.
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, PCRE2 reads the digit and any following digits as a
|
||||
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
|
||||
if there are at least that many previous capturing left parentheses in the
|
||||
expression, the entire sequence is taken as a <i>back reference</i>. A
|
||||
description of how this works is given
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
Otherwise, up to three octal digits are read to form a character code.
|
||||
</P>
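The revised rule for a backslash followed by a non-zero digit, outside a character class, can be summarized in a sketch. It follows the wording of the paragraph above and is not taken from PCRE2's source. The function name backslash_digits and the shape of its return value are assumptions, and limits on the resulting character value are not modelled.

```python
def backslash_digits(digits, group_count):
    """Classify the digits after a backslash outside a character class.

    digits must start with 1-9; group_count is the number of capturing
    left parentheses seen so far.
    """
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return ("backref", number)
    octal = ""
    for d in digits[:3]:           # at most three octal digits are read
        if d not in "01234567":
            break
        octal += d
    # Digits beyond the octal run stand for themselves
    return ("char", int(octal, 8), digits[len(octal):])
```

For instance, the digits 40 give a character with code 32 (an ASCII space) when fewer than 40 groups have been seen, and a back reference otherwise, while 81 is a back reference regardless of the group count.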
|
||||
<P>
|
||||
Inside a character class, or if the decimal number following \ is greater than
|
||||
7 and there have not been that many capturing subpatterns, PCRE2 handles \8
|
||||
and \9 as the literal characters "8" and "9", and otherwise re-reads up to
|
||||
three octal digits following the backslash, using them to generate a data
|
||||
character. Any subsequent digits stand for themselves. For example:
|
||||
Inside a character class, PCRE2 handles \8 and \9 as the literal characters
|
||||
"8" and "9", and otherwise reads up to three octal digits following the
|
||||
backslash, using them to generate a data character. Any subsequent digits stand
|
||||
for themselves. For example, outside a character class:
|
||||
<pre>
|
||||
\040 is another way of writing an ASCII space
|
||||
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
|
||||
|
@ -436,7 +455,7 @@ character. Any subsequent digits stand for themselves. For example:
|
|||
\0113 is a tab followed by the character "3"
|
||||
\113 might be a back reference, otherwise the character with octal code 113
|
||||
\377 might be a back reference, otherwise the value 255 (decimal)
|
||||
\81 is either a back reference, or the two characters "8" and "1"
|
||||
\81 is always a back reference
|
||||
</pre>
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
|
@ -1105,15 +1124,19 @@ regular expression.
|
|||
<P>
|
||||
The circumflex and dollar metacharacters are zero-width assertions. That is,
|
||||
they test for a particular condition being true without consuming any
|
||||
characters from the subject string.
|
||||
characters from the subject string. These two metacharacters are concerned with
|
||||
matching the starts and ends of lines. If the newline convention is set so that
|
||||
only the two-character sequence CRLF is recognized as a newline, isolated CR
|
||||
and LF characters are treated as ordinary data characters, and are not
|
||||
recognized as newlines.
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, in the default matching mode, the circumflex
|
||||
character is an assertion that is true only if the current matching point is at
|
||||
the start of the subject string. If the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> is non-zero, circumflex can never match if the
|
||||
PCRE2_MULTILINE option is unset. Inside a character class, circumflex has an
|
||||
entirely different meaning
|
||||
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
|
||||
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
|
||||
circumflex has an entirely different meaning
|
||||
<a href="#characterclass">(see below).</a>
|
||||
</P>
|
||||
<P>
|
||||
|
@ -1128,10 +1151,11 @@ to be anchored.)
|
|||
<P>
|
||||
The dollar character is an assertion that is true only if the current matching
|
||||
point is at the end of the subject string, or immediately before a newline at
|
||||
the end of the string (by default). Note, however, that it does not actually
|
||||
match the newline. Dollar need not be the last character of the pattern if a
|
||||
number of alternatives are involved, but it should be the last item in any
|
||||
branch in which it appears. Dollar has no special meaning in a character class.
|
||||
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
|
||||
that it does not actually match the newline. Dollar need not be the last
|
||||
character of the pattern if a number of alternatives are involved, but it
|
||||
should be the last item in any branch in which it appears. Dollar has no
|
||||
special meaning in a character class.
|
||||
</P>
|
||||
<P>
|
||||
The meaning of dollar can be changed so that it matches only at the very end of
|
||||
|
@ -1139,13 +1163,13 @@ the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
|
|||
does not affect the \Z assertion.
|
||||
</P>
|
||||
<P>
|
||||
The meanings of the circumflex and dollar characters are changed if the
|
||||
PCRE2_MULTILINE option is set. When this is the case, a circumflex matches
|
||||
immediately after internal newlines as well as at the start of the subject
|
||||
string. It does not match after a newline that ends the string. A dollar
|
||||
matches before any newlines in the string, as well as at the very end, when
|
||||
PCRE2_MULTILINE is set. When newline is specified as the two-character
|
||||
sequence CRLF, isolated CR and LF characters do not indicate newlines.
|
||||
The meanings of the circumflex and dollar metacharacters are changed if the
|
||||
PCRE2_MULTILINE option is set. When this is the case, a dollar character
|
||||
matches before any newlines in the string, as well as at the very end, and a
|
||||
circumflex matches immediately after internal newlines as well as at the start
|
||||
of the subject string. It does not match after a newline that ends the string,
|
||||
for compatibility with Perl. However, this can be changed by setting the
|
||||
PCRE2_ALT_CIRCUMFLEX option.
|
||||
</P>
|
||||
<P>
|
||||
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
|
||||
|
@ -1198,12 +1222,16 @@ whether or not a UTF mode is set. In the 8-bit library, one code unit is one
|
|||
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
|
||||
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
|
||||
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
|
||||
but it is unclear how it can usefully be used. Because \C breaks up characters
|
||||
into individual code units, matching one unit with \C in a UTF mode means that
|
||||
the rest of the string may start with a malformed UTF character. This has
|
||||
undefined results, because PCRE2 assumes that it is dealing with valid UTF
|
||||
strings (and by default it checks this at the start of processing unless the
|
||||
PCRE2_NO_UTF_CHECK option is used).
|
||||
but it is unclear how it can usefully be used.
|
||||
</P>
|
||||
<P>
|
||||
Because \C breaks up characters into individual code units, matching one unit
|
||||
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
|
||||
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
</P>
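The hazard can be demonstrated without PCRE2 at all. The sketch below consumes a given number of UTF-8 code units from the front of a string, as a run of \C items would, and checks whether what remains is still valid UTF-8. The function name remainder_is_valid_utf8 is hypothetical, and Python's decoder is used only as a convenient validity check.

```python
def remainder_is_valid_utf8(subject, units_consumed):
    """True if what follows the first units_consumed code units is valid UTF-8."""
    rest = subject.encode("utf-8")[units_consumed:]
    try:
        rest.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Consuming one code unit of a subject that begins with a two-unit character leaves a remainder that starts with a continuation byte, so the check fails. Consuming both units, or one unit of an ASCII character, leaves a valid remainder.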
|
||||
<P>
|
||||
PCRE2 does not allow \C to appear in lookbehind assertions
|
||||
|
@ -1475,7 +1503,8 @@ unset these options by preceding the letter with a hyphen, and a combined
|
|||
setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
|
||||
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
||||
permitted. If a letter appears both before and after the hyphen, the option is
|
||||
unset.
|
||||
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
|
||||
effect.
|
||||
</P>
|
||||
<P>
|
||||
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
|
||||
|
@ -1508,11 +1537,20 @@ option settings happen at compile time. There would be some very weird
|
|||
behaviour otherwise.
|
||||
</P>
|
||||
<P>
|
||||
As a convenient shorthand, if any option settings are required at the start of
|
||||
a non-capturing subpattern (see the next section), the option letters may
|
||||
appear between the "?" and the ":". Thus the two patterns
|
||||
<pre>
|
||||
(?i:saturday|sunday)
|
||||
(?:(?i)saturday|sunday)
|
||||
</pre>
|
||||
match exactly the same set of strings.
|
||||
</P>
|
||||
<P>
|
||||
<b>Note:</b> There are other PCRE2-specific options that can be set by the
|
||||
application when the compiling function is called.
|
||||
The pattern can contain special leading sequences such as (*CRLF) to override
|
||||
what the application has set or what has been defaulted. Details are given in
|
||||
the section entitled
|
||||
application when the compiling function is called. The pattern can contain
|
||||
special leading sequences such as (*CRLF) to override what the application has
|
||||
set or what has been defaulted. Details are given in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
|
||||
to set UTF and Unicode property modes; they are equivalent to setting the
|
||||
|
@ -3285,7 +3323,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 15 March 2015
|
||||
Last updated: 13 June 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
|
|||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
|
||||
<li><a name="TOC2" href="#SEC2">QUOTING</a>
|
||||
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
|
||||
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
|
||||
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
|
@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
|
|||
\Q...\E treat enclosed characters as literal
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
|
||||
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
|
||||
<P>
|
||||
This table applies to ASCII and Unicode environments.
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any ASCII character
|
||||
\cx "control-x", where x is any ASCII printing character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n newline (hex 0A)
|
||||
|
@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
|
|||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
</pre>
|
||||
Note that \0dd is always an octal code, and that \8 and \9 are the literal
|
||||
characters "8" and "9".
|
||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||
a non-zero digit is complicated; for details see the section
|
||||
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
</P>
|
||||
<P>
|
||||
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
||||
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
|
||||
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
|
||||
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
|
||||
it matches a literal "u".
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
. any character except newline;
|
||||
in dotall mode, any character whatsoever
|
||||
\C one data unit, even in UTF mode (best avoided)
|
||||
\C one code unit, even in UTF mode (best avoided)
|
||||
\d a decimal digit
|
||||
\D a character that is not a decimal digit
|
||||
\h a horizontal white space character
|
||||
|
@ -96,6 +111,11 @@ characters "8" and "9".
|
|||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
The application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
|
||||
current matching point in the middle of a UTF-8 or UTF-16 character.
|
||||
</P>
|
||||
<P>
|
||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
happening, \s and \w may also match characters with code points in the range
|
||||
|
@ -348,7 +368,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
|||
\b word boundary
|
||||
\B not a word boundary
|
||||
^ start of subject
|
||||
also after internal newline in multiline mode
|
||||
also after an internal newline in multiline mode
|
||||
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
|
||||
\A start of subject
|
||||
$ end of subject
|
||||
also before newline at end of subject
|
||||
|
@ -423,7 +444,9 @@ appear.
|
|||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
</pre>
|
||||
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||
limits set by the caller of pcre2_match(), not increase them.
|
||||
limits set by the caller of pcre2_match(), not increase them. The application
|
||||
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
|
||||
PCRE2_NEVER_UCP options, respectively, at compile time.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||
<P>
|
||||
|
@ -559,7 +582,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 15 March 2015
|
||||
Last updated: 13 June 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -284,13 +284,20 @@ following commands are recognized:
|
|||
#forbid_utf
|
||||
</pre>
|
||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
|
||||
options set, which locks out the use of UTF and Unicode property features. This
|
||||
is a trigger guard that is used in test files to ensure that UTF or Unicode
|
||||
property tests are not accidentally added to files that are used when Unicode
|
||||
support is not included in the library. This effect can also be obtained by the
|
||||
use of <b>#pattern</b>; the difference is that <b>#forbid_utf</b> cannot be
|
||||
unset, and the automatic options are not displayed in pattern information, to
|
||||
avoid cluttering up test output.
|
||||
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
|
||||
the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
|
||||
an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
|
||||
which are still supported when PCRE2_UTF is not set, but which require Unicode
|
||||
property support to be included in the library.
|
||||
</P>
|
||||
<P>
|
||||
This is a trigger guard that is used in test files to ensure that UTF or
|
||||
Unicode property tests are not accidentally added to files that are used when
|
||||
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
|
||||
PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
|
||||
the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
|
||||
options are not displayed in pattern information, to avoid cluttering up test
|
||||
output.
|
||||
<pre>
|
||||
#load <filename>
|
||||
</pre>
|
||||
|
@ -471,6 +478,7 @@ for a description of their effects.
|
|||
<pre>
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
|
@ -481,6 +489,7 @@ for a description of their effects.
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
|
@ -1460,7 +1469,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 March 2015
|
||||
Last updated: 20 May 2015
|
||||
<br>
|
||||
Copyright © 1997-2015 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -87,16 +87,26 @@ SECURITY CONSIDERATIONS
|
|||
mance.
|
||||
|
||||
One way of guarding against this possibility is to use the pcre2_pat-
|
||||
tern_info() function to check the compiled pattern's options for UTF.
|
||||
Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
|
||||
This causes an compile time error if a pattern contains a UTF-setting
|
||||
sequence.
|
||||
tern_info() function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
||||
calling pcre2_compile(). This causes a compile-time error if a pattern
|
||||
contains a UTF-setting sequence.
|
||||
|
||||
The use of Unicode properties for character types such as \d can also
|
||||
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
||||
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
|
||||
If your application is one that supports UTF, be aware that validity
|
||||
checking can take time. If the same data string is to be matched many
|
||||
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
|
||||
subsequent matches to avoid running redundant checks.
|
||||
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
|
||||
to problems, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
|
||||
option can be used to lock out the use of \C, causing a compile-time
|
||||
error if it is encountered.
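
The hazard described above can be illustrated without the library. The
following is a minimal sketch for the UTF-8 case only; the helper names
at_utf8_boundary() and advance_one_code_unit() are hypothetical and are
not part of the PCRE2 API. It shows that stepping forward by one code
unit, as \C does, from the start of a two-byte character lands on a
continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Returns 1 if offset off in
   the UTF-8 string s[0..len) is a character boundary (the start of a
   character, or the end of the string), and 0 if it points at a
   continuation byte of the form 10xxxxxx. */
static int at_utf8_boundary(const unsigned char *s, size_t len, size_t off)
{
    assert(s != NULL && off <= len);
    if (off == len) return 1;
    return (s[off] & 0xC0) != 0x80;
}

/* Models the effect of \C on the matching point: advance by exactly one
   code unit, which is one byte in UTF-8, regardless of character length. */
static size_t advance_one_code_unit(size_t off)
{
    return off + 1;
}
```

For the three-byte subject 0xC3 0xA9 0x78 (an e-acute followed by "x"),
offset 0 is a boundary, but advance_one_code_unit(0) gives offset 1, for
which at_utf8_boundary() returns 0. An application that accepts patterns
from untrusted sources can avoid this situation altogether by setting
PCRE2_NEVER_BACKSLASH_C.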
|
||||
|
||||
Another way that performance can be hit is by running a pattern that
|
||||
has a very large search tree against a string that will never match.
|
||||
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
||||
|
@ -155,8 +165,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 18 November 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 13 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -1109,6 +1119,15 @@ COMPILING A PATTERN
|
|||
always expected after \x, but it may have zero, one, or two digits (so,
|
||||
for example, \xz matches a binary zero character followed by z).
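
The two readings of \x can be sketched as a small stand-alone function.
This is an illustration of the rule as documented, not the code used by
pcre2_compile(); the name parse_backslash_x() and its interface are
assumptions made for this example, and the \x{hhh..} form is not
handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. p points just past "\x",
   and the next character is assumed not to be '{'. The resulting
   character value is stored in *value and the number of hex digits
   consumed is returned. When alt_bsux is non-zero, fewer than two hex
   digits means that "\x" is a literal 'x' and nothing is consumed. */
static size_t parse_backslash_x(const char *p, int alt_bsux, unsigned *value)
{
    size_t n = 0;
    unsigned v = 0;

    assert(p != NULL && value != NULL);

    /* Read at most two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)p[n])) {
        unsigned char c = (unsigned char)p[n];
        unsigned d = isdigit(c) ? (unsigned)(c - '0')
                                : (unsigned)(tolower(c) - 'a' + 10);
        v = v * 16 + d;
        n++;
    }

    if (alt_bsux && n < 2) {
        *value = 'x';   /* PCRE2_ALT_BSUX behaviour: literal "x" */
        return 0;
    }

    *value = v;         /* default behaviour: zero digits gives value 0 */
    return n;
}
```

With alt_bsux set to 0, the input "z" consumes no digits and yields the
value 0, which corresponds to the \xz example above. With alt_bsux set
to 1, the input "4z" yields a literal 'x' and leaves "4z" to be matched
as ordinary characters.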
|
||||
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
|
||||
metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
|
||||
is set), and also after any internal newline. However, it does not
|
||||
match after a newline at the end of the subject, for compatibility with
|
||||
Perl. If you want a multiline circumflex also to match after a termi-
|
||||
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
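
The difference can be modelled in a few lines. The sketch below is not
taken from the PCRE2 sources: circumflex_matches() is a hypothetical
function that assumes multiline mode, a single linefeed as the newline
convention, and PCRE2_NOTBOL not set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not PCRE2 code. Returns 1 if a multiline
   circumflex would match at offset pos of subject[0..len), and 0
   otherwise. alt_circumflex stands for the PCRE2_ALT_CIRCUMFLEX option. */
static int circumflex_matches(const char *subject, size_t len, size_t pos,
                              int alt_circumflex)
{
    assert(subject != NULL && pos <= len);

    if (pos == 0) return 1;                  /* start of the subject */
    if (subject[pos - 1] != '\n') return 0;  /* not preceded by a newline */
    if (pos < len) return 1;                 /* after an internal newline */
    return alt_circumflex != 0;              /* after a terminating newline */
}
```

For the 8-character subject "abc\ndef\n", this model reports a match at
offsets 0 and 4 in both modes, and at offset 8 (the very end, after the
terminating newline) only when alt_circumflex is non-zero.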
|
||||
|
||||
PCRE2_AUTO_CALLOUT
|
||||
|
||||
If this bit is set, pcre2_compile() automatically inserts callout
|
||||
|
@ -1204,9 +1223,20 @@ COMPILING A PATTERN
|
|||
constructs match immediately following or immediately before internal
|
||||
newlines in the subject string, respectively, as well as at the very
|
||||
start and end. This is equivalent to Perl's /m option, and it can be
|
||||
changed within a pattern by a (?m) option setting. If there are no new-
|
||||
lines in a subject string, or no occurrences of ^ or $ in a pattern,
|
||||
setting PCRE2_MULTILINE has no effect.
|
||||
changed within a pattern by a (?m) option setting. Note that the "start
|
||||
of line" metacharacter does not match after a newline at the end of the
|
||||
subject, for compatibility with Perl. However, you can change this by
|
||||
setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
|
||||
subject string, or no occurrences of ^ or $ in a pattern, setting
|
||||
PCRE2_MULTILINE has no effect.
|
||||
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
|
||||
This option locks out the use of \C in the pattern that is being com-
|
||||
piled. This escape can cause unpredictable behaviour in UTF-8 or
|
||||
UTF-16 modes, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. This option may be useful in
|
||||
applications that process patterns from external sources.
|
||||
|
||||
PCRE2_NEVER_UCP
|
||||
|
||||
|
@ -1214,18 +1244,18 @@ COMPILING A PATTERN
|
|||
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
|
||||
described for the PCRE2_UCP option below. In particular, it prevents
|
||||
the creator of the pattern from enabling this facility by starting the
|
||||
pattern with (*UCP). This may be useful in applications that process
|
||||
patterns from external sources. The option combination PCRE_UCP and
|
||||
PCRE_NEVER_UCP causes an error.
|
||||
pattern with (*UCP). This option may be useful in applications that
|
||||
process patterns from external sources. The option combination PCRE2_UCP
|
||||
and PCRE2_NEVER_UCP causes an error.
|
||||
|
||||
PCRE2_NEVER_UTF
|
||||
|
||||
This option locks out interpretation of the pattern as UTF-8, UTF-16,
|
||||
or UTF-32, depending on which library is in use. In particular, it pre-
|
||||
vents the creator of the pattern from switching to UTF interpretation
|
||||
by starting the pattern with (*UTF). This may be useful in applications
|
||||
that process patterns from external sources. The combination of
|
||||
PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
||||
by starting the pattern with (*UTF). This option may be useful in
|
||||
applications that process patterns from external sources. The combina-
|
||||
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
||||
|
||||
PCRE2_NO_AUTO_CAPTURE
|
||||
|
||||
|
@ -2796,7 +2826,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 23 March 2015
|
||||
Last updated: 22 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -2916,6 +2946,11 @@ UNICODE AND UTF SUPPORT
|
|||
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
|
||||
pattern may also request this by starting with (*UCP).
|
||||
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF
|
||||
mode, can cause unpredictable behaviour because it may leave the cur-
|
||||
rent matching point in the middle of a multi-code-unit character. It
|
||||
can be locked out by setting the PCRE2_NEVER_BACKSLASH_C option.
|
||||
|
||||
|
||||
JUST-IN-TIME COMPILER SUPPORT
|
||||
|
||||
|
@ -3175,6 +3210,16 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|||
immediately before the configure command.
|
||||
|
||||
|
||||
INCLUDING DEBUGGING CODE
|
||||
|
||||
If you add
|
||||
|
||||
--enable-debug
|
||||
|
||||
to the configure command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
|
||||
|
||||
DEBUGGING WITH VALGRIND SUPPORT
|
||||
|
||||
If you add
|
||||
|
@ -3257,7 +3302,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 26 January 2015
|
||||
Last updated: 24 April 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -47,7 +47,7 @@ or provide an external function for stack size checking. The option bits are:
|
|||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset back references
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
|
||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
|
|
|
@ -1161,7 +1161,6 @@ after a newline at the end of the subject, for compatibility with Perl.
|
|||
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
|
||||
there are no newlines in a subject string, or no occurrences of ^ or $ in a
|
||||
pattern, setting PCRE2_MULTILINE has no effect.
|
||||
|
||||
.sp
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
.sp
|
||||
|
|
|
@ -226,14 +226,20 @@ COMMAND LINES
|
|||
#forbid_utf
|
||||
|
||||
Subsequent patterns automatically have the PCRE2_NEVER_UTF and
|
||||
PCRE2_NEVER_UCP options set, which locks out the use of UTF and Unicode
|
||||
property features. This is a trigger guard that is used in test files
|
||||
to ensure that UTF or Unicode property tests are not accidentally added
|
||||
to files that are used when Unicode support is not included in the
|
||||
library. This effect can also be obtained by the use of #pattern; the
|
||||
difference is that #forbid_utf cannot be unset, and the automatic
|
||||
options are not displayed in pattern information, to avoid cluttering
|
||||
up test output.
|
||||
PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
|
||||
and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
|
||||
patterns. This command also forces an error if a subsequent pattern
|
||||
contains any occurrences of \P, \p, or \X, which are still supported
|
||||
when PCRE2_UTF is not set, but which require Unicode property support
|
||||
to be included in the library.
|
||||
|
||||
This is a trigger guard that is used in test files to ensure that UTF
|
||||
or Unicode property tests are not accidentally added to files that are
|
||||
used when Unicode support is not included in the library. Setting
|
||||
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
|
||||
by the use of #pattern; the difference is that #forbid_utf cannot be
|
||||
unset, and the automatic options are not displayed in pattern informa-
|
||||
tion, to avoid cluttering up test output.
|
||||
|
||||
#load <filename>
|
||||
|
||||
|
@ -417,6 +423,7 @@ PATTERN MODIFIERS
|
|||
|
||||
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
|
||||
alt_bsux set PCRE2_ALT_BSUX
|
||||
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
|
||||
anchored set PCRE2_ANCHORED
|
||||
auto_callout set PCRE2_AUTO_CALLOUT
|
||||
/i caseless set PCRE2_CASELESS
|
||||
|
@ -427,6 +434,7 @@ PATTERN MODIFIERS
|
|||
firstline set PCRE2_FIRSTLINE
|
||||
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
|
||||
/m multiline set PCRE2_MULTILINE
|
||||
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
|
||||
never_ucp set PCRE2_NEVER_UCP
|
||||
never_utf set PCRE2_NEVER_UTF
|
||||
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
|
||||
|
@ -1322,5 +1330,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 22 March 2015
|
||||
Last updated: 20 May 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
|
|
|
@ -200,7 +200,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_NAME "PCRE2"
|
||||
|
||||
/* Define to the full name and version of this package. */
|
||||
#define PACKAGE_STRING "PCRE2 10.10"
|
||||
#define PACKAGE_STRING "PCRE2 10.20-RC1"
|
||||
|
||||
/* Define to the one symbol short name of this package. */
|
||||
#define PACKAGE_TARNAME "pcre2"
|
||||
|
@ -209,7 +209,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PACKAGE_URL ""
|
||||
|
||||
/* Define to the version of this package. */
|
||||
#define PACKAGE_VERSION "10.10"
|
||||
#define PACKAGE_VERSION "10.20-RC1"
|
||||
|
||||
/* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested
|
||||
parentheses (of any kind) in a pattern. This limits the amount of system
|
||||
|
@ -227,6 +227,9 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
#define PCRE2GREP_BUFSIZE 20480
|
||||
#endif
|
||||
|
||||
/* Define to any value to include debugging code. */
|
||||
/* #undef PCRE2_DEBUG */
|
||||
|
||||
/* If you are compiling for a system other than a Unix-like system or
|
||||
Win32, and it needs some magic to be inserted before the definition
|
||||
of a function that is exported by the library, define this macro to
|
||||
|
@ -287,7 +290,7 @@ sure both macros are undefined; an emulation function will then be used. */
|
|||
/* #undef SUPPORT_VALGRIND */
|
||||
|
||||
/* Version number of package */
|
||||
#define VERSION "10.10"
|
||||
#define VERSION "10.20-RC1"
|
||||
|
||||
/* Define to empty if `const' does not conform to ANSI C. */
|
||||
/* #undef const */
|
||||
|
|
|
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
|||
/* The current PCRE version information. */
|
||||
|
||||
#define PCRE2_MAJOR 10
|
||||
#define PCRE2_MINOR 10
|
||||
#define PCRE2_PRERELEASE
|
||||
#define PCRE2_DATE 2015-03-06
|
||||
#define PCRE2_MINOR 20
|
||||
#define PCRE2_PRERELEASE -RC1
|
||||
#define PCRE2_DATE 2015-06-16
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE2, the appropriate
|
||||
|
@ -118,6 +118,8 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_UCP 0x00020000u /* C J M D */
|
||||
#define PCRE2_UNGREEDY 0x00040000u /* C */
|
||||
#define PCRE2_UTF 0x00080000u /* C J M D */
|
||||
#define PCRE2_NEVER_BACKSLASH_C 0x00100000u /* C */
|
||||
#define PCRE2_ALT_CIRCUMFLEX 0x00200000u /* J M D */
|
||||
|
||||
/* These are for pcre2_jit_compile(). */
|
||||
|
||||
|
@ -125,9 +127,10 @@ D is inspected during pcre2_dfa_match() execution
|
|||
#define PCRE2_JIT_PARTIAL_SOFT 0x00000002u
|
||||
#define PCRE2_JIT_PARTIAL_HARD 0x00000004u
|
||||
|
||||
/* These are for pcre2_match() and pcre2_dfa_match(). Note that PCRE2_ANCHORED,
|
||||
and PCRE2_NO_UTF_CHECK can also be passed to these functions, so take care not
|
||||
to define synonyms by mistake. */
|
||||
/* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). Note
|
||||
that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these
|
||||
functions (though pcre2_jit_match() ignores the latter since it bypasses all
|
||||
sanity checks). */
|
||||
|
||||
#define PCRE2_NOTBOL 0x00000001u
|
||||
#define PCRE2_NOTEOL 0x00000002u
|
||||
|
@ -337,8 +340,24 @@ typedef struct pcre2_callout_block { \
|
|||
PCRE2_SIZE current_position; /* Where we currently are in the subject */ \
|
||||
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
|
||||
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
|
||||
/* ------------------- Added for Version 1 -------------------------- */ \
|
||||
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
|
||||
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
|
||||
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
|
||||
/* ------------------------------------------------------------------ */ \
|
||||
} pcre2_callout_block;
|
||||
} pcre2_callout_block; \
|
||||
\
|
||||
typedef struct pcre2_callout_enumerate_block { \
|
||||
uint32_t version; /* Identifies version of block */ \
|
||||
/* ------------------------ Version 0 ------------------------------- */ \
|
||||
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
|
||||
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
|
||||
uint32_t callout_number; /* Number compiled into pattern */ \
|
||||
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
|
||||
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
|
||||
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
|
||||
/* ------------------------------------------------------------------ */ \
|
||||
} pcre2_callout_enumerate_block;
|
||||
|
||||
|
||||
/* List the generic forms of all other functions in macros, which will be
|
||||
|
@ -406,6 +425,9 @@ PCRE2_EXP_DECL void pcre2_code_free(pcre2_code *);
|
|||
|
||||
#define PCRE2_PATTERN_INFO_FUNCTIONS \
|
||||
PCRE2_EXP_DECL int pcre2_pattern_info(const pcre2_code *, uint32_t, \
|
||||
void *); \
|
||||
PCRE2_EXP_DECL int pcre2_callout_enumerate(const pcre2_code *, \
|
||||
int (*)(pcre2_callout_enumerate_block *, void *), \
|
||||
void *);
|
||||
|
||||
|
||||
|
@ -535,6 +557,7 @@ pcre2_compile are called by application code. */
|
|||
/* Data blocks */
|
||||
|
||||
#define pcre2_callout_block PCRE2_SUFFIX(pcre2_callout_block_)
|
||||
#define pcre2_callout_enumerate_block PCRE2_SUFFIX(pcre2_callout_enumerate_block_)
|
||||
#define pcre2_general_context PCRE2_SUFFIX(pcre2_general_context_)
|
||||
#define pcre2_compile_context PCRE2_SUFFIX(pcre2_compile_context_)
|
||||
#define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_)
|
||||
|
@ -543,6 +566,7 @@ pcre2_compile are called by application code. */
|
|||
|
||||
/* Functions: the complete list in alphabetical order */
|
||||
|
||||
#define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_)
|
||||
#define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_)
|
||||
#define pcre2_compile PCRE2_SUFFIX(pcre2_compile_)
|
||||
#define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_)
|
||||
|
@ -550,7 +574,6 @@ pcre2_compile are called by application code. */
|
|||
#define pcre2_compile_context_free PCRE2_SUFFIX(pcre2_compile_context_free_)
|
||||
#define pcre2_config PCRE2_SUFFIX(pcre2_config_)
|
||||
#define pcre2_dfa_match PCRE2_SUFFIX(pcre2_dfa_match_)
|
||||
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
|
||||
#define pcre2_general_context_copy PCRE2_SUFFIX(pcre2_general_context_copy_)
|
||||
#define pcre2_general_context_create PCRE2_SUFFIX(pcre2_general_context_create_)
|
||||
#define pcre2_general_context_free PCRE2_SUFFIX(pcre2_general_context_free_)
|
||||
|
@ -566,6 +589,7 @@ pcre2_compile are called by application code. */
|
|||
#define pcre2_jit_stack_create PCRE2_SUFFIX(pcre2_jit_stack_create_)
|
||||
#define pcre2_jit_stack_free PCRE2_SUFFIX(pcre2_jit_stack_free_)
|
||||
#define pcre2_maketables PCRE2_SUFFIX(pcre2_maketables_)
|
||||
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
|
||||
#define pcre2_match_context_copy PCRE2_SUFFIX(pcre2_match_context_copy_)
|
||||
#define pcre2_match_context_create PCRE2_SUFFIX(pcre2_match_context_create_)
|
||||
#define pcre2_match_context_free PCRE2_SUFFIX(pcre2_match_context_free_)
|
||||
|
|