Source and document file tidies for 10.20-RC1.

Philip.Hazel 2015-06-18 16:39:25 +00:00
parent a68ddd48b5
commit 07a8fdce25
40 changed files with 677 additions and 439 deletions

@ -1,8 +1,8 @@
Change Log for PCRE2 Change Log for PCRE2
-------------------- --------------------
Version 10.20 xx-xx-2015 Version 10.20 16-June-2015
------------------------ --------------------------
1. Callouts with string arguments have been added. 1. Callouts with string arguments have been added.

HACKING

@ -104,6 +104,21 @@ system stack used by the compile function, which uses recursive function calls
for nested parenthesized groups. This is a safety feature for environments with for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users. small stacks where the patterns are provided by users.
History repeated itself for release 10.20. A number of bugs relating to named
subpatterns had been discovered by fuzzers. Most of these were related to the
handling of forward references when it was not known if the named pattern was
unique. (References to non-unique names use a different opcode and more
memory.) The use of duplicate group numbers (the (?| facility) also caused
issues.
To get around these problems I adopted a new approach by adding a third pass,
really a "pre-pass", over the pattern, which does nothing other than identify
all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both pre-pass and real compile) has full knowledge of
group names and numbers throughout. Several dozen lines of messy code were
eliminated, though the new pre-pass is not short (skipping over [] classes is
complicated).
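
The pre-pass idea described above can be sketched in C as follows. This is only an illustration, not the code in the PCRE2 compiler: the names scan_named_groups and named_group and the fixed limits are hypothetical, and the sketch ignores \Q...\E, extended-mode comments, POSIX classes such as [[:alpha:]], and the (?| duplicate-number facility.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 16          /* hypothetical limits for the sketch */
#define MAX_NAME_LEN 32

typedef struct {
  char name[MAX_NAME_LEN + 1];
  int number;                 /* capturing group number, from 1 */
} named_group;

/* One scan over the pattern: returns the number of capturing groups and
   records each named group with its number in names[]. */
int scan_named_groups(const char *p, named_group *names, int *name_count)
{
  int groups = 0;
  assert(p != NULL && names != NULL && name_count != NULL);
  *name_count = 0;
  while (*p != '\0') {
    if (*p == '\\') {                    /* skip an escaped character */
      p += (p[1] != '\0') ? 2 : 1;
    } else if (*p == '[') {              /* skip a character class */
      p++;
      if (*p == '^') p++;
      if (*p == ']') p++;                /* a leading ] is a literal */
      while (*p != '\0' && *p != ']') {
        if (*p == '\\' && p[1] != '\0') p++;
        p++;
      }
      if (*p == ']') p++;
    } else if (*p == '(') {
      char term = 0;                     /* name terminator, if any */
      const char *end;
      size_t len;
      p++;
      if (*p == '*') continue;           /* (*VERB), not a group */
      if (*p != '?') { groups++; continue; }   /* plain capturing group */
      p++;
      if (*p == '(' && p[1] != '?') {    /* condition such as (?(1)...) */
        while (*p != '\0' && *p != ')') p++;
        continue;
      }
      if (*p == 'P' && p[1] == '<') { term = '>'; p += 2; }
      else if (*p == '<' && p[1] != '=' && p[1] != '!') { term = '>'; p++; }
      else if (*p == '\'') { term = '\''; p++; }
      if (term == 0) continue;           /* non-capturing or assertion */
      groups++;
      end = strchr(p, term);
      if (end == NULL) break;            /* malformed name: give up */
      len = (size_t)(end - p);
      if (*name_count < MAX_NAMES && len <= MAX_NAME_LEN) {
        memcpy(names[*name_count].name, p, len);
        names[*name_count].name[len] = '\0';
        names[*name_count].number = groups;
        (*name_count)++;
      }
      p = end + 1;
    } else {
      p++;
    }
  }
  return groups;
}
```

Because the table is complete before the real compile starts, a forward reference to a name can be resolved to a group number immediately.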
Traditional matching function Traditional matching function
----------------------------- -----------------------------
@ -343,8 +358,9 @@ do.
For classes containing characters with values greater than 255 or that contain For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
code points are less than 256, followed by a list of pairs (for a range) and code points are less than 256, followed by a list of pairs (for a range) and/or
single characters. In caseless mode, both cases are explicitly listed. single characters and/or properties. In caseless mode, both cases are
explicitly listed.
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits: opcode and its data. This is followed by a code unit containing flag bits:
@ -431,7 +447,7 @@ bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-unit opcodes that tell the matcher that skipping the following single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid branch. In the case of the first two, not subpattern entirely is a valid match. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded, when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
because it may be called as a subroutine from elsewhere in the pattern. because it may be called as a subroutine from elsewhere in the pattern.
@ -487,9 +503,9 @@ Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a count of the number of characters to move back the is OP_REVERSE, followed by a count of the number of characters to move back the
pointer in the subject string. In ASCII or UTF-32 mode, the count is a number pointer in the subject string. In ASCII or UTF-32 mode, the count is also the
of code units, but in UTF-8/16 mode each character may occupy more than one number of code units, but in UTF-8/16 mode each character may occupy more than
code unit. A separate count is present in each alternative of a lookbehind one code unit. A separate count is present in each alternative of a lookbehind
assertion, allowing them to have different (but fixed) lengths. assertion, allowing them to have different (but fixed) lengths.
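
As an illustration of why a character count differs from a code unit count in UTF-8, the following C sketch moves a subject offset back by a number of characters. It is a minimal model that assumes a valid UTF-8 subject; the function name utf8_move_back is hypothetical and this is not the matcher's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Move back `count` characters from byte offset `pos` in a valid UTF-8
   subject. Returns the new offset, or (size_t)-1 if fewer than `count`
   characters precede `pos`. */
size_t utf8_move_back(const unsigned char *subject, size_t pos, size_t count)
{
  assert(subject != NULL);
  while (count-- > 0) {
    if (pos == 0) return (size_t)-1;     /* ran off the start */
    pos--;
    /* Continuation bytes look like 10xxxxxx; keep going to the lead byte. */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
  }
  return pos;
}
```

In an ASCII or UTF-32 subject the inner loop never runs, so the character count and the code unit count coincide.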
@ -585,4 +601,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
correct length, in order to catch updating errors. correct length, in order to catch updating errors.
Philip Hazel Philip Hazel
March 2015 June 2015

NEWS

@ -1,6 +1,26 @@
News about PCRE2 releases News about PCRE2 releases
------------------------- -------------------------
Version 10.20 16-June-2015
--------------------------
1. Callouts with string arguments and the pcre2_callout_enumerate() function
have been implemented.
2. The PCRE2_NEVER_BACKSLASH_C option, which locks out the use of \C, is added.
3. The PCRE2_ALT_CIRCUMFLEX option lets ^ match after a newline at the end of a
subject in multiline mode.
4. The way named subpatterns are handled has been refactored. The previous
approach had several bugs.
5. The handling of \c in EBCDIC environments has been changed to conform to the
perlebcdic document. This is an incompatible change.
6. Bugs have been mended, many of them discovered by fuzzers.
Version 10.10 06-March-2015 Version 10.10 06-March-2015
--------------------------- ---------------------------

@ -11,15 +11,15 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre2_major, [10]) m4_define(pcre2_major, [10])
m4_define(pcre2_minor, [20]) m4_define(pcre2_minor, [20])
m4_define(pcre2_prerelease, [-RC1]) m4_define(pcre2_prerelease, [-RC1])
m4_define(pcre2_date, [2015-03-11]) m4_define(pcre2_date, [2015-06-16])
# NOTE: The CMakeLists.txt file searches for the above variables in the first # NOTE: The CMakeLists.txt file searches for the above variables in the first
# 50 lines of this file. Please update that if the variables above are moved. # 50 lines of this file. Please update that if the variables above are moved.
# Libtool shared library interface versions (current:revision:age) # Libtool shared library interface versions (current:revision:age)
m4_define(libpcre2_8_version, [1:0:1]) m4_define(libpcre2_8_version, [2:0:0])
m4_define(libpcre2_16_version, [1:0:1]) m4_define(libpcre2_16_version, [2:0:0])
m4_define(libpcre2_32_version, [1:0:1]) m4_define(libpcre2_32_version, [2:0:0])
m4_define(libpcre2_posix_version, [0:0:0]) m4_define(libpcre2_posix_version, [0:0:0])
AC_PREREQ(2.57) AC_PREREQ(2.57)

@ -294,6 +294,9 @@ library. They are also documented in the pcre2build man page.
which specifies that the code value for the EBCDIC NL character is 0x25 which specifies that the code value for the EBCDIC NL character is 0x25
instead of the default 0x15. instead of the default 0x15.
. If you specify --enable-debug, additional debugging code is included in the
build. This option is intended for use by the PCRE2 maintainers.
. In environments where valgrind is installed, if you specify . In environments where valgrind is installed, if you specify
--enable-valgrind --enable-valgrind
@ -829,4 +832,4 @@ The distribution should contain the files listed below.
Philip Hazel Philip Hazel
Email local part: ph10 Email local part: ph10
Email domain: cam.ac.uk Email domain: cam.ac.uk
Last updated: 26 January 2015 Last updated: 24 April 2015

@ -108,8 +108,14 @@ lose performance.
<P> <P>
One way of guarding against this possibility is to use the One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for <b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
UTF. Alternatively, you can set the PCRE2_NEVER_UTF option at compile time. PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
This causes a compile-time error if a pattern contains a UTF-setting sequence. <b>pcre2_compile()</b>. This causes a compile-time error if a pattern contains
a UTF-setting sequence.
</P>
<P>
The use of Unicode properties for character types such as \d can also be
enabled from within the pattern, by specifying "(*UCP)". This feature can be
disallowed by setting the PCRE2_NEVER_UCP option.
</P> </P>
<P> <P>
If your application is one that supports UTF, be aware that validity checking If your application is one that supports UTF, be aware that validity checking
@ -118,6 +124,12 @@ the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks. running redundant checks.
</P> </P>
<P> <P>
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
lock out the use of \C, causing a compile-time error if it is encountered.
</P>
<P>
Another way that performance can be hit is by running a pattern that has a very Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection repeats in a pattern are a common example. PCRE2 provides some protection
@ -175,9 +187,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P> </P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br> <br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 18 November 2014 Last updated: 13 April 2015
<br> <br>
Copyright &copy; 1997-2014 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>
<p> <p>
Return to the <a href="index.html">PCRE2 index page</a>. Return to the <a href="index.html">PCRE2 index page</a>.

@ -49,6 +49,7 @@ or provide an external function for stack size checking. The option bits are:
<pre> <pre>
PCRE2_ANCHORED Force pattern anchoring PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_AUTO_CALLOUT Compile automatic callouts PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@ -58,6 +59,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP) PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF) PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-

@ -1074,6 +1074,15 @@ hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a \x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z). binary zero character followed by z).
<pre>
PCRE2_ALT_CIRCUMFLEX
</pre>
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
after any internal newline. However, it does not match after a newline at the
end of the subject, for compatibility with Perl. If you want a multiline
circumflex also to match after a terminating newline, you must set
PCRE2_ALT_CIRCUMFLEX.
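
The rule above can be modelled with a small C predicate. This is a sketch of the documented behaviour only, assuming LF is the newline convention; the function name and its boolean parameters are hypothetical stand-ins for PCRE2_NOTBOL and PCRE2_ALT_CIRCUMFLEX, and no PCRE2 function is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Can a multiline circumflex match at offset `pos` of `subject`? */
bool multiline_circumflex_matches(const char *subject, size_t length,
                                  size_t pos, bool notbol,
                                  bool alt_circumflex)
{
  assert(subject != NULL && pos <= length);
  if (pos == 0) return !notbol;                  /* start of subject */
  if (subject[pos - 1] != '\n') return false;    /* not after a newline */
  if (pos == length) return alt_circumflex;      /* after a final newline */
  return true;                                   /* after an internal one */
}
```

For the subject "abc\ndef\n", the predicate accepts offsets 0 and 4 by default, and also offset 8 when the alternative behaviour is requested.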
<pre> <pre>
PCRE2_AUTO_CALLOUT PCRE2_AUTO_CALLOUT
</pre> </pre>
@ -1174,8 +1183,19 @@ When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal newlines constructs match immediately following or immediately before internal newlines
in the subject string, respectively, as well as at the very start and end. This in the subject string, respectively, as well as at the very start and end. This
is equivalent to Perl's /m option, and it can be changed within a pattern by a is equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. If there are no newlines in a subject string, or no (?m) option setting. Note that the "start of line" metacharacter does not match
occurrences of ^ or $ in a pattern, setting PCRE2_MULTILINE has no effect. after a newline at the end of the subject, for compatibility with Perl.
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
there are no newlines in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE2_MULTILINE has no effect.
<pre>
PCRE2_NEVER_BACKSLASH_C
</pre>
This option locks out the use of \C in the pattern that is being compiled.
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
it may leave the current matching point in the middle of a multi-code-unit
character. This option may be useful in applications that process patterns from
external sources.
<pre> <pre>
PCRE2_NEVER_UCP PCRE2_NEVER_UCP
</pre> </pre>
@ -1183,17 +1203,17 @@ This option locks out the use of Unicode properties for handling \B, \b, \D,
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described \d, \S, \s, \W, \w, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This pattern from enabling this facility by starting the pattern with (*UCP). This
may be useful in applications that process patterns from external sources. The option may be useful in applications that process patterns from external
option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error. sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
<pre> <pre>
PCRE2_NEVER_UTF PCRE2_NEVER_UTF
</pre> </pre>
This option locks out interpretation of the pattern as UTF-8, UTF-16, or This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This may be useful in applications that process patterns pattern with (*UTF). This option may be useful in applications that process
from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes patterns from external sources. The combination of PCRE2_UTF and
an error. PCRE2_NEVER_UTF causes an error.
<pre> <pre>
PCRE2_NO_AUTO_CAPTURE PCRE2_NO_AUTO_CAPTURE
</pre> </pre>
@ -2863,7 +2883,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC40" href="#TOC1">REVISION</a><br> <br><a name="SEC40" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 23 March 2015 Last updated: 22 April 2015
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>

@ -29,11 +29,12 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a> <li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a> <li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a> <li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">DEBUGGING WITH VALGRIND SUPPORT</a> <li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
<li><a name="TOC18" href="#SEC18">CODE COVERAGE REPORTING</a> <li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC19" href="#SEC19">SEE ALSO</a> <li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
<li><a name="TOC20" href="#SEC20">AUTHOR</a> <li><a name="TOC20" href="#SEC20">SEE ALSO</a>
<li><a name="TOC21" href="#SEC21">REVISION</a> <li><a name="TOC21" href="#SEC21">AUTHOR</a>
<li><a name="TOC22" href="#SEC22">REVISION</a>
</ul> </ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br> <br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P> <P>
@ -147,6 +148,12 @@ properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP). request this by starting with (*UCP).
</P> </P>
<P>
The \C escape sequence, which matches a single code unit, even in a UTF mode,
can cause unpredictable behaviour because it may leave the current matching
point in the middle of a multi-code-unit character. It can be locked out by
setting the PCRE2_NEVER_BACKSLASH_C option.
</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br> <br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P> <P>
Just-in-time compiler support is included in the build by specifying Just-in-time compiler support is included in the build by specifying
@ -397,7 +404,16 @@ automatically included, you may need to add something like
</pre> </pre>
immediately before the <b>configure</b> command. immediately before the <b>configure</b> command.
</P> </P>
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br> <br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
<P>
If you add
<pre>
--enable-debug
</pre>
to the <b>configure</b> command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
</P>
<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P> <P>
If you add If you add
<pre> <pre>
@ -407,7 +423,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself. memory accesses, and is mostly useful for debugging PCRE2 itself.
</P> </P>
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br> <br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P> <P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install code coverage report for its test suite. To enable this, you must install
@ -464,11 +480,11 @@ This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b> information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation. documentation.
</P> </P>
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br> <br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
<P> <P>
<b>pcre2api</b>(3), <b>pcre2-config</b>(3). <b>pcre2api</b>(3), <b>pcre2-config</b>(3).
</P> </P>
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br> <br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
<P> <P>
Philip Hazel Philip Hazel
<br> <br>
@ -477,9 +493,9 @@ University Computing Service
Cambridge, England. Cambridge, England.
<br> <br>
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 26 January 2015 Last updated: 24 April 2015
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>

@ -357,10 +357,11 @@ A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences text editing, it is often easier to use one of the following escape sequences
than the binary character it represents: than the binary character it represents. In an ASCII or Unicode environment,
these escapes are as follows:
<pre> <pre>
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character \cx "control-x", where x is any printable ASCII character
\e escape (hex 1B) \e escape (hex 1B)
\f form feed (hex 0C) \f form feed (hex 0C)
\n linefeed (hex 0A) \n linefeed (hex 0A)
@ -377,23 +378,38 @@ The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
code unit following \c has a value greater than 127, a compile-time error code unit following \c has a value less than 32 or greater than 126, a
occurs. This locks out non-ASCII characters in all modes. compile-time error occurs. This locks out non-printable ASCII characters in all
modes.
</P> </P>
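
The ASCII rule for \cx in the revised text can be written out as a few lines of C. This is a sketch of the documented arithmetic, assuming an ASCII execution character set; the function name is hypothetical and a return of -1 stands for the compile-time error.

```c
#include <assert.h>

/* Value generated by \cx in an ASCII environment, or -1 on error. */
int ascii_control_escape(int x)
{
  if (x < 32 || x > 126) return -1;          /* only printable ASCII allowed */
  if (x >= 'a' && x <= 'z') x -= 'a' - 'A';  /* lower case to upper case */
  return x ^ 0x40;                           /* invert bit 6 (hex 40) */
}

/* The examples given in the text. */
void ascii_control_escape_examples(void)
{
  assert(ascii_control_escape('A') == 0x01);
  assert(ascii_control_escape('Z') == 0x1A);
  assert(ascii_control_escape('{') == 0x3B);
  assert(ascii_control_escape(';') == 0x7B);
}
```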
<P> <P>
The \c facility was designed for use with ASCII characters, but with the When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
extension to Unicode it is even less useful than it once was. It is, however, generate the appropriate EBCDIC code values. The \c escape is processed
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always as specified for Perl in the <b>perlebcdic</b> document. The only characters
bytes. In this mode, all values are valid after \c. If the next character is a that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
lower case letter, it is converted to upper case. Then the 0xc0 bits of the other character provokes a compile-time error. The sequence \c@ encodes
byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because character code 0; the letters (in either case) encode characters 1-26 (hex 01
the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
characters also generate different values. \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \cG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
</P>
<P>
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P> </P>
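
The EBCDIC mapping described in the new text can be sketched as a table lookup, which avoids relying on the letters being contiguous in the execution character set. The function name and the posix_bc flag are hypothetical; the text says PCRE2 decides between 95 and 255 by inspecting other character values, which this sketch replaces with an explicit parameter.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Value generated by \c followed by x under the perlebcdic rules,
   or -1 for a character that is not allowed after \c. */
int ebcdic_control_escape(int x, bool posix_bc)
{
  /* The position in this string is the generated code value (0 to 31). */
  static const char order[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
  const char *hit;
  if (x == '?') return posix_bc ? 95 : 255;  /* APC differs by variant */
  if (x <= 0 || x > 255) return -1;
  hit = strchr(order, toupper(x));           /* letters in either case */
  return (hit == NULL) ? -1 : (int)(hit - order);
}

/* Values stated in the text. */
void ebcdic_control_escape_examples(void)
{
  assert(ebcdic_control_escape('@', false) == 0);
  assert(ebcdic_control_escape('G', false) == 7);
  assert(ebcdic_control_escape('z', false) == 26);
  assert(ebcdic_control_escape('[', false) == 27);
  assert(ebcdic_control_escape('_', false) == 31);
}
```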
<P> <P>
After \0 up to two further octal digits are read. If there are fewer than two After \0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \0\x\07 digits, just those that are present are used. Thus the sequence \0\x\015
specifies two binary zeros followed by a BEL character (code value 7). Make specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit. follows is itself an octal digit.
</P> </P>
@ -412,21 +428,24 @@ describe the old, ambiguous syntax.
</P> </P>
<P> <P>
The handling of a backslash followed by a digit other than 0 is complicated, The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed in recent releases, causing PCRE2 also to change. Outside and Perl has changed over time, causing PCRE2 also to change.
a character class, PCRE2 reads the digit and any following digits as a decimal </P>
number. If the number is less than 8, or if there have been at least that many <P>
previous capturing left parentheses in the expression, the entire sequence is Outside a character class, PCRE2 reads the digit and any following digits as a
taken as a <i>back reference</i>. A description of how this works is given decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a <i>back reference</i>. A
description of how this works is given
<a href="#backreferences">later,</a> <a href="#backreferences">later,</a>
following the discussion of following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a> <a href="#subpattern">parenthesized subpatterns.</a>
Otherwise, up to three octal digits are read to form a character code.
</P> </P>
<P> <P>
Inside a character class, or if the decimal number following \ is greater than Inside a character class, PCRE2 handles \8 and \9 as the literal characters
7 and there have not been that many capturing subpatterns, PCRE2 handles \8 "8" and "9", and otherwise reads up to three octal digits following the
and \9 as the literal characters "8" and "9", and otherwise re-reads up to backslash, using them to generate a data character. Any subsequent digits stand
three octal digits following the backslash, using them to generate a data for themselves. For example, outside a character class:
character. Any subsequent digits stand for themselves. For example:
<pre> <pre>
\040 is another way of writing an ASCII space \040 is another way of writing an ASCII space
\40 is the same, provided there are fewer than 40 previous capturing subpatterns \40 is the same, provided there are fewer than 40 previous capturing subpatterns
@ -436,7 +455,7 @@ character. Any subsequent digits stand for themselves. For example:
\0113 is a tab followed by the character "3" \0113 is a tab followed by the character "3"
\113 might be a back reference, otherwise the character with octal code 113 \113 might be a back reference, otherwise the character with octal code 113
\377 might be a back reference, otherwise the value 255 (decimal) \377 might be a back reference, otherwise the value 255 (decimal)
\81 is either a back reference, or the two characters "8" and "1" \81 is always a back reference
</pre> </pre>
Note that octal values of 100 or greater that are specified using this syntax Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal must not be introduced by a leading zero, because no more than three octal
@ -1105,15 +1124,19 @@ regular expression.
<P> <P>
The circumflex and dollar metacharacters are zero-width assertions. That is, The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any they test for a particular condition being true without consuming any
characters from the subject string. characters from the subject string. These two metacharacters are concerned with
matching the starts and ends of lines. If the newline convention is set so that
only the two-character sequence CRLF is recognized as a newline, isolated CR
and LF characters are treated as ordinary data characters, and are not
recognized as newlines.
</P> </P>
<P> <P>
Outside a character class, in the default matching mode, the circumflex Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at character is an assertion that is true only if the current matching point is at
the start of the subject string. If the <i>startoffset</i> argument of the start of the subject string. If the <i>startoffset</i> argument of
<b>pcre2_match()</b> is non-zero, circumflex can never match if the <b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
PCRE2_MULTILINE option is unset. Inside a character class, circumflex has an never match if the PCRE2_MULTILINE option is unset. Inside a character class,
entirely different meaning circumflex has an entirely different meaning
<a href="#characterclass">(see below).</a> <a href="#characterclass">(see below).</a>
</P> </P>
<P> <P>
@ -1128,10 +1151,11 @@ to be anchored.)
<P> <P>
The dollar character is an assertion that is true only if the current matching The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at point is at the end of the subject string, or immediately before a newline at
the end of the string (by default). Note, however, that it does not actually the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
match the newline. Dollar need not be the last character of the pattern if a that it does not actually match the newline. Dollar need not be the last
number of alternatives are involved, but it should be the last item in any character of the pattern if a number of alternatives are involved, but it
branch in which it appears. Dollar has no special meaning in a character class. should be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
</P> </P>
<P> <P>
The meaning of dollar can be changed so that it matches only at the very end of The meaning of dollar can be changed so that it matches only at the very end of
@ -1139,13 +1163,13 @@ the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion. does not affect the \Z assertion.
</P> </P>
<P> <P>
The meanings of the circumflex and dollar characters are changed if the The meanings of the circumflex and dollar metacharacters are changed if the
PCRE2_MULTILINE option is set. When this is the case, a circumflex matches PCRE2_MULTILINE option is set. When this is the case, a dollar character
immediately after internal newlines as well as at the start of the subject matches before any newlines in the string, as well as at the very end, and a
string. It does not match after a newline that ends the string. A dollar circumflex matches immediately after internal newlines as well as at the start
matches before any newlines in the string, as well as at the very end, when of the subject string. It does not match after a newline that ends the string,
PCRE2_MULTILINE is set. When newline is specified as the two-character for compatibility with Perl. However, this can be changed by setting the
sequence CRLF, isolated CR and LF characters do not indicate newlines. PCRE2_ALT_CIRCUMFLEX option.
</P> </P>
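The circumflex rules described in this hunk can be modelled in a few lines of C. This is only a sketch of the documented behaviour, not PCRE2's own code: the function name `circumflex_true_at` and its boolean parameters are hypothetical stand-ins for PCRE2_MULTILINE, PCRE2_ALT_CIRCUMFLEX, and PCRE2_NOTBOL, and it assumes that the newline convention is a single LF character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not PCRE2 code: reports whether a circumflex
   assertion would be true at offset pos of subject[0..length).
   Assumption: a newline is the single character LF. */
bool circumflex_true_at(const char *subject, size_t length, size_t pos,
                        bool multiline, bool alt_circumflex, bool notbol)
{
  assert(subject != NULL && pos <= length);

  /* Start of the subject: true unless the caller says this is not the
     beginning of a line. */
  if (pos == 0) return !notbol;

  /* Without multiline mode, the start of the subject is the only place. */
  if (!multiline) return false;

  /* In multiline mode the previous character must be a newline. */
  if (subject[pos - 1] != '\n') return false;

  /* After a newline that ends the subject: only with the alternative
     circumflex behaviour. */
  if (pos == length) return alt_circumflex;

  return true;
}
```

With the 8-character subject "abc\ndef\n", offset 4 qualifies only in multiline mode, and offset 8 qualifies only when the alternative behaviour is also requested.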
<P> <P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
@ -1198,12 +1222,16 @@ whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \C always matches line-ending characters. The 32-bit unit. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode, feature is provided in Perl in order to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used. Because \C breaks up characters but it is unclear how it can usefully be used.
into individual code units, matching one unit with \C in a UTF mode means that </P>
the rest of the string may start with a malformed UTF character. This has <P>
undefined results, because PCRE2 assumes that it is dealing with valid UTF Because \C breaks up characters into individual code units, matching one unit
strings (and by default it checks this at the start of processing unless the with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
PCRE2_NO_UTF_CHECK option is used). with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
</P> </P>
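To see why matching a single code unit can leave the matching point inside a character, consider UTF-8, where every byte after the first byte of a multi-byte character has the bit pattern 10xxxxxx. The sketch below is illustrative only and does not use the PCRE2 API; the helper names are hypothetical, and it assumes the subject is valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
bool is_utf8_continuation(unsigned char byte)
{
  return (byte & 0xC0u) == 0x80u;
}

/* Hypothetical helper: true if offset pos falls inside a multi-byte
   UTF-8 character instead of on a character boundary.
   Assumption: subject[0..length) is valid UTF-8. */
bool splits_utf8_character(const unsigned char *subject, size_t length,
                           size_t pos)
{
  assert(subject != NULL && pos <= length);
  return pos < length && is_utf8_continuation(subject[pos]);
}
```

For the three-byte subject 0xC3 0xA9 0x21 (an e-acute followed by "!"), advancing by one code unit from offset 0 lands on offset 1, which `splits_utf8_character` reports as being inside a character. That is the situation the text describes as the rest of the string starting with a malformed UTF character.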
<P> <P>
PCRE2 does not allow \C to appear in lookbehind assertions PCRE2 does not allow \C to appear in lookbehind assertions
@ -1475,7 +1503,8 @@ unset these options by preceding the letter with a hyphen, and a combined
setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is permitted. If a letter appears both before and after the hyphen, the option is
unset. unset. An empty options setting "(?)" is allowed. Needless to say, it has no
effect.
</P> </P>
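The set/unset rule for option letters can be sketched as follows. This is a simplified model rather than PCRE2's parser: the `OPT_*` constants and `apply_option_letters` are hypothetical names, only the letters i, m, s, and x are handled, and rejecting a second hyphen is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; they are not PCRE2's bits. */
#define OPT_CASELESS  0x00000001u
#define OPT_MULTILINE 0x00000002u
#define OPT_DOTALL    0x00000004u
#define OPT_EXTENDED  0x00000008u
#define OPT_INVALID   0x80000000u  /* returned for input the sketch rejects */

/* Applies the letters found between "(?" and ")" (for example "im-sx")
   to options and returns the new option bits. */
uint32_t apply_option_letters(const char *letters, uint32_t options)
{
  uint32_t set = 0, unset = 0;
  int after_hyphen = 0;

  assert(letters != NULL);
  for (const char *p = letters; *p != '\0'; p++) {
    uint32_t bit;
    switch (*p) {
      case '-':
        if (after_hyphen) return OPT_INVALID;  /* sketch-only restriction */
        after_hyphen = 1;
        continue;
      case 'i': bit = OPT_CASELESS;  break;
      case 'm': bit = OPT_MULTILINE; break;
      case 's': bit = OPT_DOTALL;    break;
      case 'x': bit = OPT_EXTENDED;  break;
      default:  return OPT_INVALID;
    }
    if (after_hyphen) unset |= bit; else set |= bit;
  }

  /* Unsetting is applied last, so a letter that appears on both sides
     of the hyphen ends up unset. */
  return (options | set) & ~unset;
}
```

Because the unset mask is applied after the set mask, "i-i" leaves caseless matching off, and an empty string (the "(?)" case) returns the options unchanged.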
<P> <P>
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
@ -1508,11 +1537,20 @@ option settings happen at compile time. There would be some very weird
behaviour otherwise. behaviour otherwise.
</P> </P>
<P> <P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing subpattern (see the next section), the option letters may
appear between the "?" and the ":". Thus the two patterns
<pre>
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings.
</P>
<P>
<b>Note:</b> There are other PCRE2-specific options that can be set by the <b>Note:</b> There are other PCRE2-specific options that can be set by the
application when the compiling function is called. application when the compiling function is called. The pattern can contain
The pattern can contain special leading sequences such as (*CRLF) to override special leading sequences such as (*CRLF) to override what the application has
what the application has set or what has been defaulted. Details are given in set or what has been defaulted. Details are given in the section entitled
the section entitled
<a href="#newlineseq">"Newline sequences"</a> <a href="#newlineseq">"Newline sequences"</a>
above. There are also the (*UTF) and (*UCP) leading sequences that can be used above. There are also the (*UTF) and (*UCP) leading sequences that can be used
to set UTF and Unicode property modes; they are equivalent to setting the to set UTF and Unicode property modes; they are equivalent to setting the
@ -3285,7 +3323,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 15 March 2015 Last updated: 13 June 2015
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>


@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul> <ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a> <li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a> <li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a> <li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a> <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
\Q...\E treat enclosed characters as literal \Q...\E treat enclosed characters as literal
</PRE> </PRE>
</P> </P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br> <br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P> <P>
This table applies to ASCII and Unicode environments.
<pre> <pre>
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character \cx "control-x", where x is any ASCII printing character
\e escape (hex 1B) \e escape (hex 1B)
\f form feed (hex 0C) \f form feed (hex 0C)
\n newline (hex 0A) \n newline (hex 0A)
@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
\0dd character with octal code 0dd \0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. \x{hhh..} character with hex code hhh..
</pre> </pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal Note that \0dd is always an octal code. The treatment of backslash followed by
characters "8" and "9". a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given.
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P> </P>
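The two behaviours of \x without a following brace can be compared with a small model. This is a sketch of the rule as stated in the new text, not code taken from PCRE2; `x_escape`, `parse_x_escape`, and the `alt_bsux` flag are hypothetical names, and `rest` is assumed to point just past the "x".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  unsigned int value;   /* character code that the escape stands for */
  size_t digits_used;   /* hexadecimal digits consumed after the 'x' */
} x_escape;

/* Converts one character already known to be a hexadecimal digit. */
static unsigned int hex_digit_value(char c)
{
  if (c >= '0' && c <= '9') return (unsigned int)(c - '0');
  return (unsigned int)(tolower((unsigned char)c) - 'a') + 10u;
}

/* Hypothetical model of \x when it is not followed by '{'. */
x_escape parse_x_escape(const char *rest, bool alt_bsux)
{
  x_escape result = { 0, 0 };
  size_t n = 0;

  assert(rest != NULL);

  /* Count up to two hexadecimal digits; the NUL terminator stops the scan. */
  while (n < 2 && isxdigit((unsigned char)rest[n])) n++;

  /* Alternative mode: without exactly two digits, \x is a literal "x". */
  if (alt_bsux && n < 2) {
    result.value = (unsigned int)'x';
    return result;
  }

  /* Default mode: zero, one, or two digits are all accepted. */
  for (size_t i = 0; i < n; i++)
    result.value = result.value * 16u + hex_digit_value(rest[i]);
  result.digits_used = n;
  return result;
}
```

In the default mode the input "z" gives value 0 with no digits consumed, so the "z" is left to be matched as a literal. In the alternative mode the input "4z" gives a literal "x", with "4z" left to follow it.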
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br> <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P> <P>
<pre> <pre>
. any character except newline; . any character except newline;
in dotall mode, any character whatsoever in dotall mode, any character whatsoever
\C one data unit, even in UTF mode (best avoided) \C one code unit, even in UTF mode (best avoided)
\d a decimal digit \d a decimal digit
\D a character that is not a decimal digit \D a character that is not a decimal digit
\h a horizontal white space character \h a horizontal white space character
@ -96,6 +111,11 @@ characters "8" and "9".
\W a "non-word" character \W a "non-word" character
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
</pre> </pre>
The application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
current matching point in the middle of a UTF-8 or UTF-16 character.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range happening, \s and \w may also match characters with code points in the range
@ -348,7 +368,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\b word boundary \b word boundary
\B not a word boundary \B not a word boundary
^ start of subject ^ start of subject
also after internal newline in multiline mode also after an internal newline in multiline mode
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\A start of subject \A start of subject
$ end of subject $ end of subject
also before newline at end of subject also before newline at end of subject
@ -423,7 +444,9 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc) (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre> </pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
</P> </P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br> <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P> <P>
@ -559,7 +582,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 15 March 2015 Last updated: 13 June 2015
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>


@ -284,13 +284,20 @@ following commands are recognized:
#forbid_utf #forbid_utf
</pre> </pre>
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
options set, which locks out the use of UTF and Unicode property features. This options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
is a trigger guard that is used in test files to ensure that UTF or Unicode the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
property tests are not accidentally added to files that are used when Unicode an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
support is not included in the library. This effect can also be obtained by the which are still supported when PCRE2_UTF is not set, but which require Unicode
use of <b>#pattern</b>; the difference is that <b>#forbid_utf</b> cannot be property support to be included in the library.
unset, and the automatic options are not displayed in pattern information, to </P>
avoid cluttering up test output. <P>
This is a trigger guard that is used in test files to ensure that UTF or
Unicode property tests are not accidentally added to files that are used when
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
options are not displayed in pattern information, to avoid cluttering up test
output.
<pre> <pre>
#load &#60;filename&#62; #load &#60;filename&#62;
</pre> </pre>
@ -471,6 +478,7 @@ for a description of their effects.
<pre> <pre>
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
alt_bsux set PCRE2_ALT_BSUX alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
anchored set PCRE2_ANCHORED anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT auto_callout set PCRE2_AUTO_CALLOUT
/i caseless set PCRE2_CASELESS /i caseless set PCRE2_CASELESS
@ -481,6 +489,7 @@ for a description of their effects.
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_capture set PCRE2_NO_AUTO_CAPTURE
@ -1460,7 +1469,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br> <br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 22 March 2015 Last updated: 20 May 2015
<br> <br>
Copyright &copy; 1997-2015 University of Cambridge. Copyright &copy; 1997-2015 University of Cambridge.
<br> <br>


@ -87,16 +87,26 @@ SECURITY CONSIDERATIONS
mance. mance.
One way of guarding against this possibility is to use the pcre2_pat- One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for UTF. tern_info() function to check the compiled pattern's options for
Alternatively, you can set the PCRE2_NEVER_UTF option at compile time. PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
This causes a compile-time error if a pattern contains a UTF-setting calling pcre2_compile(). This causes a compile-time error if a pattern
sequence. contains a UTF-setting sequence.
The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea-
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
If your application is one that supports UTF, be aware that validity If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many checking can take time. If the same data string is to be matched many
times, you can use the PCRE2_NO_UTF_CHECK option for the second and times, you can use the PCRE2_NO_UTF_CHECK option for the second and
subsequent matches to avoid running redundant checks. subsequent matches to avoid running redundant checks.
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
to problems, because it may leave the current matching point in the
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
option can be used to lock out the use of \C, causing a compile-time
error if it is encountered.
Another way that performance can be hit is by running a pattern that Another way that performance can be hit is by running a pattern that
has a very large search tree against a string that will never match. has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro- Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
@ -155,8 +165,8 @@ AUTHOR
REVISION REVISION
Last updated: 18 November 2014 Last updated: 13 April 2015
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -1109,6 +1119,15 @@ COMPILING A PATTERN
always expected after \x, but it may have zero, one, or two digits (so, always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z). for example, \xz matches a binary zero character followed by z).
PCRE2_ALT_CIRCUMFLEX
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
is set), and also after any internal newline. However, it does not
match after a newline at the end of the subject, for compatibility with
Perl. If you want a multiline circumflex also to match after a termi-
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
PCRE2_AUTO_CALLOUT PCRE2_AUTO_CALLOUT
If this bit is set, pcre2_compile() automatically inserts callout If this bit is set, pcre2_compile() automatically inserts callout
@ -1204,9 +1223,20 @@ COMPILING A PATTERN
constructs match immediately following or immediately before internal constructs match immediately following or immediately before internal
newlines in the subject string, respectively, as well as at the very newlines in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no new- changed within a pattern by a (?m) option setting. Note that the "start
lines in a subject string, or no occurrences of ^ or $ in a pattern, of line" metacharacter does not match after a newline at the end of the
setting PCRE2_MULTILINE has no effect. subject, for compatibility with Perl. However, you can change this by
setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
subject string, or no occurrences of ^ or $ in a pattern, setting
PCRE2_MULTILINE has no effect.
PCRE2_NEVER_BACKSLASH_C
This option locks out the use of \C in the pattern that is being com-
piled. This escape can cause unpredictable behaviour in UTF-8 or
UTF-16 modes, because it may leave the current matching point in the
middle of a multi-code-unit character. This option may be useful in
applications that process patterns from external sources.
PCRE2_NEVER_UCP PCRE2_NEVER_UCP
@ -1214,18 +1244,18 @@ COMPILING A PATTERN
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
described for the PCRE2_UCP option below. In particular, it prevents described for the PCRE2_UCP option below. In particular, it prevents
the creator of the pattern from enabling this facility by starting the the creator of the pattern from enabling this facility by starting the
pattern with (*UCP). This may be useful in applications that process pattern with (*UCP). This option may be useful in applications that
patterns from external sources. The option combination PCRE2_UCP and process patterns from external sources. The option combination PCRE2_UCP
PCRE2_NEVER_UCP causes an error. and PCRE2_NEVER_UCP causes an error.
PCRE2_NEVER_UTF PCRE2_NEVER_UTF
This option locks out interpretation of the pattern as UTF-8, UTF-16, This option locks out interpretation of the pattern as UTF-8, UTF-16,
or UTF-32, depending on which library is in use. In particular, it pre- or UTF-32, depending on which library is in use. In particular, it pre-
vents the creator of the pattern from switching to UTF interpretation vents the creator of the pattern from switching to UTF interpretation
by starting the pattern with (*UTF). This may be useful in applications by starting the pattern with (*UTF). This option may be useful in
that process patterns from external sources. The combination of applications that process patterns from external sources. The combina-
PCRE2_UTF and PCRE2_NEVER_UTF causes an error. tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
PCRE2_NO_AUTO_CAPTURE PCRE2_NO_AUTO_CAPTURE
@ -2796,7 +2826,7 @@ AUTHOR
REVISION REVISION
Last updated: 23 March 2015 Last updated: 22 April 2015
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -2916,6 +2946,11 @@ UNICODE AND UTF SUPPORT
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
pattern may also request this by starting with (*UCP). pattern may also request this by starting with (*UCP).
The \C escape sequence, which matches a single code unit, even in a UTF
mode, can cause unpredictable behaviour because it may leave the cur-
rent matching point in the middle of a multi-code-unit character. It
can be locked out by setting the PCRE2_NEVER_BACKSLASH_C option.
JUST-IN-TIME COMPILER SUPPORT JUST-IN-TIME COMPILER SUPPORT
@ -3175,6 +3210,16 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
immediately before the configure command. immediately before the configure command.
INCLUDING DEBUGGING CODE
If you add
--enable-debug
to the configure command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
DEBUGGING WITH VALGRIND SUPPORT DEBUGGING WITH VALGRIND SUPPORT
If you add If you add
@ -3257,7 +3302,7 @@ AUTHOR
REVISION REVISION
Last updated: 26 January 2015 Last updated: 24 April 2015
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -47,7 +47,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_FIRSTLINE Force matching to be before newline PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_MATCH_UNSET_BACKREF Match unset back references PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP) PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF) PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren- PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-


@ -1161,7 +1161,6 @@ after a newline at the end of the subject, for compatibility with Perl.
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
there are no newlines in a subject string, or no occurrences of ^ or $ in a there are no newlines in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE2_MULTILINE has no effect. pattern, setting PCRE2_MULTILINE has no effect.
.sp .sp
PCRE2_NEVER_BACKSLASH_C PCRE2_NEVER_BACKSLASH_C
.sp .sp


@ -226,14 +226,20 @@ COMMAND LINES
#forbid_utf #forbid_utf
Subsequent patterns automatically have the PCRE2_NEVER_UTF and Subsequent patterns automatically have the PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP options set, which locks out the use of UTF and Unicode PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
property features. This is a trigger guard that is used in test files and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
to ensure that UTF or Unicode property tests are not accidentally added patterns. This command also forces an error if a subsequent pattern
to files that are used when Unicode support is not included in the contains any occurrences of \P, \p, or \X, which are still supported
library. This effect can also be obtained by the use of #pattern; the when PCRE2_UTF is not set, but which require Unicode property support
difference is that #forbid_utf cannot be unset, and the automatic to be included in the library.
options are not displayed in pattern information, to avoid cluttering
up test output. This is a trigger guard that is used in test files to ensure that UTF
or Unicode property tests are not accidentally added to files that are
used when Unicode support is not included in the library. Setting
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
by the use of #pattern; the difference is that #forbid_utf cannot be
unset, and the automatic options are not displayed in pattern informa-
tion, to avoid cluttering up test output.
#load <filename> #load <filename>
@ -417,6 +423,7 @@ PATTERN MODIFIERS
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
alt_bsux set PCRE2_ALT_BSUX alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
anchored set PCRE2_ANCHORED anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT auto_callout set PCRE2_AUTO_CALLOUT
/i caseless set PCRE2_CASELESS /i caseless set PCRE2_CASELESS
@ -427,6 +434,7 @@ PATTERN MODIFIERS
firstline set PCRE2_FIRSTLINE firstline set PCRE2_FIRSTLINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
/m multiline set PCRE2_MULTILINE /m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE no_auto_capture set PCRE2_NO_AUTO_CAPTURE
@ -1322,5 +1330,5 @@ AUTHOR
REVISION REVISION
Last updated: 22 March 2015 Last updated: 20 May 2015
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.


@ -200,7 +200,7 @@ sure both macros are undefined; an emulation function will then be used. */
#define PACKAGE_NAME "PCRE2" #define PACKAGE_NAME "PCRE2"
/* Define to the full name and version of this package. */ /* Define to the full name and version of this package. */
#define PACKAGE_STRING "PCRE2 10.10" #define PACKAGE_STRING "PCRE2 10.20-RC1"
/* Define to the one symbol short name of this package. */ /* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre2" #define PACKAGE_TARNAME "pcre2"
@ -209,7 +209,7 @@ sure both macros are undefined; an emulation function will then be used. */
#define PACKAGE_URL "" #define PACKAGE_URL ""
/* Define to the version of this package. */ /* Define to the version of this package. */
#define PACKAGE_VERSION "10.10" #define PACKAGE_VERSION "10.20-RC1"
/* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested /* The value of PARENS_NEST_LIMIT specifies the maximum depth of nested
parentheses (of any kind) in a pattern. This limits the amount of system parentheses (of any kind) in a pattern. This limits the amount of system
@ -227,6 +227,9 @@ sure both macros are undefined; an emulation function will then be used. */
#define PCRE2GREP_BUFSIZE 20480 #define PCRE2GREP_BUFSIZE 20480
#endif #endif
/* Define to any value to include debugging code. */
/* #undef PCRE2_DEBUG */
/* If you are compiling for a system other than a Unix-like system or /* If you are compiling for a system other than a Unix-like system or
Win32, and it needs some magic to be inserted before the definition Win32, and it needs some magic to be inserted before the definition
of a function that is exported by the library, define this macro to of a function that is exported by the library, define this macro to
@ -287,7 +290,7 @@ sure both macros are undefined; an emulation function will then be used. */
/* #undef SUPPORT_VALGRIND */ /* #undef SUPPORT_VALGRIND */
/* Version number of package */ /* Version number of package */
#define VERSION "10.10" #define VERSION "10.20-RC1"
/* Define to empty if `const' does not conform to ANSI C. */ /* Define to empty if `const' does not conform to ANSI C. */
/* #undef const */ /* #undef const */


@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */ /* The current PCRE version information. */
#define PCRE2_MAJOR 10 #define PCRE2_MAJOR 10
#define PCRE2_MINOR 10 #define PCRE2_MINOR 20
#define PCRE2_PRERELEASE #define PCRE2_PRERELEASE -RC1
#define PCRE2_DATE 2015-03-06 #define PCRE2_DATE 2015-06-16
/* When an application links to a PCRE DLL in Windows, the symbols that are /* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE2, the appropriate imported have to be identified as such. When building PCRE2, the appropriate
@ -118,6 +118,8 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_UCP 0x00020000u /* C J M D */ #define PCRE2_UCP 0x00020000u /* C J M D */
#define PCRE2_UNGREEDY 0x00040000u /* C */ #define PCRE2_UNGREEDY 0x00040000u /* C */
#define PCRE2_UTF 0x00080000u /* C J M D */ #define PCRE2_UTF 0x00080000u /* C J M D */
#define PCRE2_NEVER_BACKSLASH_C 0x00100000u /* C */
#define PCRE2_ALT_CIRCUMFLEX 0x00200000u /* J M D */
/* These are for pcre2_jit_compile(). */ /* These are for pcre2_jit_compile(). */
@ -125,9 +127,10 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_JIT_PARTIAL_SOFT 0x00000002u #define PCRE2_JIT_PARTIAL_SOFT 0x00000002u
#define PCRE2_JIT_PARTIAL_HARD 0x00000004u #define PCRE2_JIT_PARTIAL_HARD 0x00000004u
/* These are for pcre2_match() and pcre2_dfa_match(). Note that PCRE2_ANCHORED, /* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). Note
and PCRE2_NO_UTF_CHECK can also be passed to these functions, so take care not that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these
to define synonyms by mistake. */ functions (though pcre2_jit_match() ignores the latter since it bypasses all
sanity checks). */
#define PCRE2_NOTBOL 0x00000001u #define PCRE2_NOTBOL 0x00000001u
#define PCRE2_NOTEOL 0x00000002u #define PCRE2_NOTEOL 0x00000002u
@ -337,8 +340,24 @@ typedef struct pcre2_callout_block { \
PCRE2_SIZE current_position; /* Where we currently are in the subject */ \ PCRE2_SIZE current_position; /* Where we currently are in the subject */ \
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \ PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \ PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
/* ------------------- Added for Version 1 -------------------------- */ \
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
/* ------------------------------------------------------------------ */ \ /* ------------------------------------------------------------------ */ \
} pcre2_callout_block; } pcre2_callout_block; \
\
typedef struct pcre2_callout_enumerate_block { \
uint32_t version; /* Identifies version of block */ \
/* ------------------------ Version 0 ------------------------------- */ \
PCRE2_SIZE pattern_position; /* Offset to next item in the pattern */ \
PCRE2_SIZE next_item_length; /* Length of next item in the pattern */ \
uint32_t callout_number; /* Number compiled into pattern */ \
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
/* ------------------------------------------------------------------ */ \
} pcre2_callout_enumerate_block;
/* List the generic forms of all other functions in macros, which will be /* List the generic forms of all other functions in macros, which will be
@ -406,6 +425,9 @@ PCRE2_EXP_DECL void pcre2_code_free(pcre2_code *);
#define PCRE2_PATTERN_INFO_FUNCTIONS \ #define PCRE2_PATTERN_INFO_FUNCTIONS \
PCRE2_EXP_DECL int pcre2_pattern_info(const pcre2_code *, uint32_t, \ PCRE2_EXP_DECL int pcre2_pattern_info(const pcre2_code *, uint32_t, \
void *); \
PCRE2_EXP_DECL int pcre2_callout_enumerate(const pcre2_code *, \
int (*)(pcre2_callout_enumerate_block *, void *), \
void *); void *);
@ -535,6 +557,7 @@ pcre2_compile are called by application code. */
/* Data blocks */ /* Data blocks */
#define pcre2_callout_block PCRE2_SUFFIX(pcre2_callout_block_) #define pcre2_callout_block PCRE2_SUFFIX(pcre2_callout_block_)
#define pcre2_callout_enumerate_block PCRE2_SUFFIX(pcre2_callout_enumerate_block_)
#define pcre2_general_context PCRE2_SUFFIX(pcre2_general_context_) #define pcre2_general_context PCRE2_SUFFIX(pcre2_general_context_)
#define pcre2_compile_context PCRE2_SUFFIX(pcre2_compile_context_) #define pcre2_compile_context PCRE2_SUFFIX(pcre2_compile_context_)
#define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_) #define pcre2_match_context PCRE2_SUFFIX(pcre2_match_context_)
@ -543,6 +566,7 @@ pcre2_compile are called by application code. */
/* Functions: the complete list in alphabetical order */ /* Functions: the complete list in alphabetical order */
#define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_)
#define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_) #define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_)
#define pcre2_compile PCRE2_SUFFIX(pcre2_compile_) #define pcre2_compile PCRE2_SUFFIX(pcre2_compile_)
#define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_) #define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_)
@ -550,7 +574,6 @@ pcre2_compile are called by application code. */
#define pcre2_compile_context_free PCRE2_SUFFIX(pcre2_compile_context_free_) #define pcre2_compile_context_free PCRE2_SUFFIX(pcre2_compile_context_free_)
#define pcre2_config PCRE2_SUFFIX(pcre2_config_) #define pcre2_config PCRE2_SUFFIX(pcre2_config_)
#define pcre2_dfa_match PCRE2_SUFFIX(pcre2_dfa_match_) #define pcre2_dfa_match PCRE2_SUFFIX(pcre2_dfa_match_)
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
#define pcre2_general_context_copy PCRE2_SUFFIX(pcre2_general_context_copy_) #define pcre2_general_context_copy PCRE2_SUFFIX(pcre2_general_context_copy_)
#define pcre2_general_context_create PCRE2_SUFFIX(pcre2_general_context_create_) #define pcre2_general_context_create PCRE2_SUFFIX(pcre2_general_context_create_)
#define pcre2_general_context_free PCRE2_SUFFIX(pcre2_general_context_free_) #define pcre2_general_context_free PCRE2_SUFFIX(pcre2_general_context_free_)
@ -566,6 +589,7 @@ pcre2_compile are called by application code. */
#define pcre2_jit_stack_create PCRE2_SUFFIX(pcre2_jit_stack_create_) #define pcre2_jit_stack_create PCRE2_SUFFIX(pcre2_jit_stack_create_)
#define pcre2_jit_stack_free PCRE2_SUFFIX(pcre2_jit_stack_free_) #define pcre2_jit_stack_free PCRE2_SUFFIX(pcre2_jit_stack_free_)
#define pcre2_maketables PCRE2_SUFFIX(pcre2_maketables_) #define pcre2_maketables PCRE2_SUFFIX(pcre2_maketables_)
#define pcre2_match PCRE2_SUFFIX(pcre2_match_)
#define pcre2_match_context_copy PCRE2_SUFFIX(pcre2_match_context_copy_) #define pcre2_match_context_copy PCRE2_SUFFIX(pcre2_match_context_copy_)
#define pcre2_match_context_create PCRE2_SUFFIX(pcre2_match_context_create_) #define pcre2_match_context_create PCRE2_SUFFIX(pcre2_match_context_create_)
#define pcre2_match_context_free PCRE2_SUFFIX(pcre2_match_context_free_) #define pcre2_match_context_free PCRE2_SUFFIX(pcre2_match_context_free_)
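
The generic names added and reordered in this hunk (pcre2_callout_enumerate, pcre2_match, and so on) all resolve through PCRE2_SUFFIX. The following is a minimal sketch of how such a name could expand, assuming PCRE2_SUFFIX is a two-level token paste driven by PCRE2_CODE_UNIT_WIDTH. The macro bodies are reproduced from memory rather than taken from this diff, and the SKETCH_* macros and sketch_* functions are hypothetical helpers that exist only to make the expansion visible as a string.

```c
#include <string.h>

/* Assumption: the application selects the code unit width before the
   generic names are defined (8 here; 16 or 32 would work the same way). */
#define PCRE2_CODE_UNIT_WIDTH 8

/* Two levels of indirection so that PCRE2_CODE_UNIT_WIDTH is expanded to
   its value before the ## operator pastes the tokens together. */
#define PCRE2_JOIN(a,b) a ## b
#define PCRE2_GLUE(a,b) PCRE2_JOIN(a,b)
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a,PCRE2_CODE_UNIT_WIDTH)

/* Same shape as the definitions shown in the diff above. */
#define pcre2_match             PCRE2_SUFFIX(pcre2_match_)
#define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_)

/* Hypothetical helpers: stringify a name after full macro expansion. */
#define SKETCH_STR(x)  #x
#define SKETCH_XSTR(x) SKETCH_STR(x)

/* Return the width-specific symbol that each generic name expands to. */
static const char *sketch_match_name(void)
{
  return SKETCH_XSTR(pcre2_match);
}

static const char *sketch_enumerate_name(void)
{
  return SKETCH_XSTR(pcre2_callout_enumerate);
}
```

With the width set to 8, sketch_match_name() yields the string for the 8-bit symbol, which is why application code can call pcre2_match() without spelling out the width. Changing PCRE2_CODE_UNIT_WIDTH to 16 or 32 in the sketch changes the suffix accordingly.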