diff --git a/ChangeLog b/ChangeLog index 32e6f3a..8ed6e83 100644 --- a/ChangeLog +++ b/ChangeLog @@ -181,6 +181,9 @@ wrong name. 27. In pcre2test, give some offset information for errors in hex patterns. +28. Implemented pcre2_code_copy_with_tables(), and added pushtablescopy to +pcre2test for testing it. + Version 10.22 29-July-2016 -------------------------- @@ -250,7 +253,7 @@ a report of compiler warnings from Visual Studio 2013 and a few tests with gcc's -Wconversion (which still throws up a lot). 15. Implemented pcre2_code_copy(), and added pushcopy and #popcopy to pcre2test -for testing it. +for testing it. 16. Change 66 for 10.21 introduced the use of snprintf() in PCRE2's version of regerror(). When the error buffer is too small, my version of snprintf() puts a diff --git a/Makefile.am b/Makefile.am index 6386f81..21e1626 100644 --- a/Makefile.am +++ b/Makefile.am @@ -25,6 +25,7 @@ dist_html_DATA = \ doc/html/pcre2.html \ doc/html/pcre2_callout_enumerate.html \ doc/html/pcre2_code_copy.html \ + doc/html/pcre2_code_copy_with_tables.html \ doc/html/pcre2_code_free.html \ doc/html/pcre2_compile.html \ doc/html/pcre2_compile_context_copy.html \ @@ -107,6 +108,7 @@ dist_man_MANS = \ doc/pcre2.3 \ doc/pcre2_callout_enumerate.3 \ doc/pcre2_code_copy.3 \ + doc/pcre2_code_copy_with_tables.3 \ doc/pcre2_code_free.3 \ doc/pcre2_compile.3 \ doc/pcre2_compile_context_copy.3 \ diff --git a/doc/html/NON-AUTOTOOLS-BUILD.txt b/doc/html/NON-AUTOTOOLS-BUILD.txt index ceb9245..e3cf813 100644 --- a/doc/html/NON-AUTOTOOLS-BUILD.txt +++ b/doc/html/NON-AUTOTOOLS-BUILD.txt @@ -174,7 +174,11 @@ can skip ahead to the CMake section. (11) If you want to use the pcre2grep command, compile and link src/pcre2grep.c; it uses only the basic 8-bit PCRE2 library (it does not - need the pcre2posix library). + need the pcre2posix library). 
If you have built the PCRE2 library with JIT + support by defining SUPPORT_JIT in src/config.h, you can also define + SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless + it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without + defining SUPPORT_JIT, pcre2grep does not try to make use of JIT. STACK SIZE IN WINDOWS ENVIRONMENTS @@ -389,4 +393,4 @@ and executable, is in EBCDIC and native z/OS file formats and this is the recommended download site. ============================= -Last Updated: 16 July 2015 +Last Updated: 13 October 2016 diff --git a/doc/html/README.txt b/doc/html/README.txt index 03d67f6..18cfcfc 100644 --- a/doc/html/README.txt +++ b/doc/html/README.txt @@ -44,7 +44,7 @@ wrappers. The distribution does contain a set of C wrapper functions for the 8-bit library that are based on the POSIX regular expression API (see the pcre2posix -man page). These can be found in a library called libpcre2posix. Note that this +man page). These can be found in a library called libpcre2-posix. Note that this just provides a POSIX calling interface to PCRE2; the regular expressions themselves still follow Perl syntax and semantics. The POSIX API is restricted, and does not give full access to all of PCRE2's facilities. @@ -58,8 +58,8 @@ renamed or pointed at by a link. If you are using the POSIX interface to PCRE2 and there is already a POSIX regex library installed on your system, as well as worrying about the regex.h header file (as mentioned above), you must also take care when linking programs -to ensure that they link with PCRE2's libpcre2posix library. Otherwise they may -pick up the POSIX functions of the same name from the other library. +to ensure that they link with PCRE2's libpcre2-posix library. Otherwise they +may pick up the POSIX functions of the same name from the other library. 
One way of avoiding this confusion is to compile PCRE2 with the addition of -Dregcomp=PCRE2regcomp (and similarly for the other POSIX functions) to the @@ -204,13 +204,6 @@ library. They are also documented in the pcre2build man page. --enable-newline-is-crlf, --enable-newline-is-anycrlf, or --enable-newline-is-any to the "configure" command, respectively. - If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of - the standard tests will fail, because the lines in the test files end with - LF. Even if the files are edited to change the line endings, there are likely - to be some failures. With --enable-newline-is-anycrlf or - --enable-newline-is-any, many tests should succeed, but there may be some - failures. - . By default, the sequence \R in a pattern matches any Unicode line ending sequence. This is independent of the option specifying what PCRE2 considers to be the end of a line (see above). However, the caller of PCRE2 can @@ -253,13 +246,13 @@ library. They are also documented in the pcre2build man page. sizes in the pcre2stack man page. . In the 8-bit library, the default maximum compiled pattern size is around - 64K. You can increase this by adding --with-link-size=3 to the "configure" - command. PCRE2 then uses three bytes instead of two for offsets to different - parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is - the same as --with-link-size=4, which (in both libraries) uses four-byte - offsets. Increasing the internal link size reduces performance in the 8-bit - and 16-bit libraries. In the 32-bit library, the link size setting is - ignored, as 4-byte offsets are always used. + 64K bytes. You can increase this by adding --with-link-size=3 to the + "configure" command. PCRE2 then uses three bytes instead of two for offsets + to different parts of the compiled pattern. In the 16-bit library, + --with-link-size=3 is the same as --with-link-size=4, which (in both + libraries) uses four-byte offsets. 
Increasing the internal link size reduces + performance in the 8-bit and 16-bit libraries. In the 32-bit library, the + link size setting is ignored, as 4-byte offsets are always used. . You can build PCRE2 so that its internal match() function that is called from pcre2_match() does not call itself recursively. Instead, it uses memory @@ -339,12 +332,23 @@ library. They are also documented in the pcre2build man page. Of course, the relevant libraries must be installed on your system. -. The default size (in bytes) of the internal buffer used by pcre2grep can be - set by, for example: +. The default starting size (in bytes) of the internal buffer used by pcre2grep + can be set by, for example: --with-pcre2grep-bufsize=51200 - The value must be a plain integer. The default is 20480. + The value must be a plain integer. The default is 20480. The amount of memory + used by pcre2grep is actually three times this number, to allow for "before" + and "after" lines. If very long lines are encountered, the buffer is + automatically enlarged, up to a fixed maximum size. + +. The default maximum size of pcre2grep's internal buffer can be set by, for + example: + + --with-pcre2grep-max-bufsize=2097152 + + The default is either 1048576 or the value of --with-pcre2grep-bufsize, + whichever is the larger. . It is possible to compile pcre2test so that it links with the libreadline or libedit libraries, by specifying, respectively, @@ -368,6 +372,22 @@ library. They are also documented in the pcre2build man page. If you get error messages about missing functions tgetstr, tgetent, tputs, tgetflag, or tgoto, this is the problem, and linking with the ncurses library should fix it. + +. There is a special option called --enable-fuzz-support for use by people who + want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit + library. If set, it causes an extra library called libpcre2-fuzzsupport.a to + be built, but not installed. 
 This contains a single function called + LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the + length of the string. When called, this function tries to compile the string + as a pattern, and if that succeeds, to match it. This is done both with no + options and with some random option bits that are generated from the string. + Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to + be created. This is normally run under valgrind or used when PCRE2 is + compiled with address sanitizing enabled. It calls the fuzzing function and + outputs information about what it is doing. The input strings are specified by + arguments: if an argument starts with "=" the rest of it is a literal input + string. Otherwise, it is assumed to be a file name, and the contents of the + file are the test string. The "configure" script builds the following files for the basic C library: @@ -543,7 +563,7 @@ script creates the .txt and HTML forms of the documentation from the man pages. Testing PCRE2 ------------- +------------- To test the basic PCRE2 library on a Unix-like system, run the RunTest script. There is another script called RunGrepTest that tests the pcre2grep command. @@ -757,6 +777,7 @@ The distribution should contain the files listed below. src/pcre2_xclass.c ) src/pcre2_printint.c debugging function that is used by pcre2test, + src/pcre2_fuzzsupport.c function for (optional) fuzzing support src/config.h.in template for config.h, when built by "configure" src/pcre2.h.in template for pcre2.h when built by "configure" @@ -814,7 +835,7 @@ The distribution should contain the files listed below. 
libpcre2-8.pc.in template for libpcre2-8.pc for pkg-config libpcre2-16.pc.in template for libpcre2-16.pc for pkg-config libpcre2-32.pc.in template for libpcre2-32.pc for pkg-config - libpcre2posix.pc.in template for libpcre2posix.pc for pkg-config + libpcre2-posix.pc.in template for libpcre2-posix.pc for pkg-config ltmain.sh file used to build a libtool script missing ) common stub for a few missing GNU programs while ) installing, generated by automake @@ -845,4 +866,4 @@ The distribution should contain the files listed below. Philip Hazel Email local part: ph10 Email domain: cam.ac.uk -Last updated: 01 April 2016 +Last updated: 01 November 2016 diff --git a/doc/html/index.html b/doc/html/index.html index 703c298..eebb80b 100644 --- a/doc/html/index.html +++ b/doc/html/index.html @@ -94,6 +94,9 @@ in the library. pcre2_code_copy   Copy a compiled pattern +pcre2_code_copy_with_tables +   Copy a compiled pattern and its character tables + pcre2_code_free   Free a compiled pattern diff --git a/doc/html/pcre2_code_copy.html b/doc/html/pcre2_code_copy.html index 5b68282..667d7b7 100644 --- a/doc/html/pcre2_code_copy.html +++ b/doc/html/pcre2_code_copy.html @@ -28,8 +28,9 @@ DESCRIPTION This function makes a copy of the memory used for a compiled pattern, excluding any memory used by the JIT compiler. Without a subsequent call to pcre2_jit_compile(), the copy can be used only for non-JIT matching. The -yield of the function is NULL if code is NULL or if sufficient memory -cannot be obtained. +pointer to the character tables is copied, not the tables themselves (see +pcre2_code_copy_with_tables()). The yield of the function is NULL if +code is NULL or if sufficient memory cannot be obtained.

There is a complete description of the PCRE2 native API in the diff --git a/doc/html/pcre2_code_copy_with_tables.html b/doc/html/pcre2_code_copy_with_tables.html new file mode 100644 index 0000000..99cb022 --- /dev/null +++ b/doc/html/pcre2_code_copy_with_tables.html @@ -0,0 +1,44 @@ + + +pcre2_code_copy_with_tables specification + + +

pcre2_code_copy_with_tables man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+
+SYNOPSIS +
+

+#include <pcre2.h> +

+

+pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); +

+
+DESCRIPTION +
+

+This function makes a copy of the memory used for a compiled pattern, excluding +any memory used by the JIT compiler. Without a subsequent call to +pcre2_jit_compile(), the copy can be used only for non-JIT matching. +Unlike pcre2_code_copy(), a separate copy of the character tables is also +made, with the new code pointing to it. This memory will be automatically freed +when pcre2_code_free() is called. The yield of the function is NULL if +code is NULL or if sufficient memory cannot be obtained. +

+

+There is a complete description of the PCRE2 native API in the +pcre2api +page and a description of the POSIX API in the +pcre2posix +page. +

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2_set_max_pattern_length.html b/doc/html/pcre2_set_max_pattern_length.html index 055d1ec..f6e422a 100644 --- a/doc/html/pcre2_set_max_pattern_length.html +++ b/doc/html/pcre2_set_max_pattern_length.html @@ -26,8 +26,11 @@ SYNOPSIS DESCRIPTION

-This function sets, in a compile context, the maximum length (in code units) of -the pattern that can be compiled. The result is always zero. +This function sets, in a compile context, the maximum text length (in code +units) of the pattern that can be compiled. The result is always zero. If a +longer pattern is passed to pcre2_compile() there is an immediate error +return. The default is effectively unlimited, being the largest value a +PCRE2_SIZE variable can hold.

There is a complete description of the PCRE2 native API in the diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html index fa9f342..629f1d0 100644 --- a/doc/html/pcre2api.html +++ b/doc/html/pcre2api.html @@ -294,6 +294,9 @@ document for an overview of all the PCRE2 documentation. pcre2_code *pcre2_code_copy(const pcre2_code *code);

+pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); +
+
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, PCRE2_SIZE bufflen);
@@ -567,8 +570,9 @@ If JIT is being used, but the JIT compilation is not being done immediately, (perhaps waiting to see if the pattern is used often enough) similar logic is required. JIT compilation updates a pointer within the compiled code block, so a thread must gain unique write access to the pointer before calling -pcre2_jit_compile(). Alternatively, pcre2_code_copy() can be used -to obtain a private copy of the compiled code. +pcre2_jit_compile(). Alternatively, pcre2_code_copy() or +pcre2_code_copy_with_tables() can be used to obtain a private copy of the +compiled code.


Context blocks @@ -736,7 +740,8 @@ functions, pcre2_match() and pcre2_dfa_match().
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the depth of parenthesis nesting in a pattern. This limit stops rogue patterns -using up too much system stack when being compiled. +using up too much system stack when being compiled. The limit applies to +parentheses of all kinds, not just capturing parentheses. int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, int (*guard_function)(uint32_t, void *), void *user_data);
@@ -1058,6 +1063,9 @@ zero.

pcre2_code *pcre2_code_copy(const pcre2_code *code); +
+
+pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

The pcre2_compile() function compiles a pattern into an internal form. @@ -1079,9 +1087,22 @@ if the code has been processed by the JIT compiler (see below), the JIT information cannot be copied (because it is position-dependent). The new copy can initially be used only for non-JIT matching, though it can be -passed to pcre2_jit_compile() if required. The pcre2_code_copy() -function provides a way for individual threads in a multithreaded application -to acquire a private copy of shared compiled code. +passed to pcre2_jit_compile() if required. +

+

+The pcre2_code_copy() function provides a way for individual threads in a +multithreaded application to acquire a private copy of shared compiled code. +However, it does not make a copy of the character tables used by the compiled +pattern; the new pattern code points to the same tables as the original code. +(See +"Locale Support" +below for details of these character tables.) In many applications the same +tables are used throughout, so this behaviour is appropriate. Nevertheless, +there are occasions when a copy of a compiled pattern and the relevant tables +are needed. The pcre2_code_copy_with_tables() provides this facility. +Copies of both the code and the tables are made, with the new code pointing to +the new tables. The memory for the new tables is automatically freed when +pcre2_code_free() is called for the new copy of the compiled code.

NOTE: When one of the matching functions is called, pointers to the compiled @@ -1119,7 +1140,14 @@ NULL immediately. Otherwise, the variables to which these point are set to an error code and an offset (number of code units) within the pattern, respectively, when pcre2_compile() returns NULL because a compilation error has occurred. The values are not defined when compilation is successful -and pcre2_compile() returns a non-NULL value. +and pcre2_compile() returns a non-NULL value. +

+

+The value returned in erroroffset is an indication of where in the +pattern the error occurred. It is not necessarily the furthest point in the +pattern that was read. For example, after the error "lookbehind assertion is +not fixed length", the error offset points to the start of the failing +assertion.

The pcre2_get_error_message() function (see "Obtaining a textual error @@ -1215,8 +1243,8 @@ recognized, exactly as in the rest of the pattern. PCRE2_AUTO_CALLOUT If this bit is set, pcre2_compile() automatically inserts callout items, -all with number 255, before each pattern item. For discussion of the callout -facility, see the +all with number 255, before each pattern item, except immediately before or +after a callout in the pattern. For discussion of the callout facility, see the pcre2callout documentation.

@@ -3235,7 +3263,7 @@ Cambridge, England.
 


REVISION

-Last updated: 17 June 2016 +Last updated: 22 November 2016
Copyright © 1997-2016 University of Cambridge.
diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html index 6c8e1de..13ae9cb 100644 --- a/doc/html/pcre2build.html +++ b/doc/html/pcre2build.html @@ -34,9 +34,10 @@ please consult the man page, in case the conversion went wrong.

  • INCLUDING DEBUGGING CODE
  • DEBUGGING WITH VALGRIND SUPPORT
  • CODE COVERAGE REPORTING -
  • SEE ALSO -
  • AUTHOR -
  • REVISION +
  • SUPPORT FOR FUZZERS +
  • SEE ALSO +
  • AUTHOR +
  • REVISION
    BUILDING PCRE2

    @@ -376,16 +377,19 @@ they are not.

    pcre2grep uses an internal buffer to hold a "window" on the file it is scanning, in order to be able to output "before" and "after" lines when it -finds a match. The size of the buffer is controlled by a parameter whose -default value is 20K. The buffer itself is three times this size, but because -of the way it is used for holding "before" lines, the longest line that is -guaranteed to be processable is the parameter size. You can change the default -parameter value by adding, for example, +finds a match. The starting size of the buffer is controlled by a parameter +whose default value is 20K. The buffer itself is three times this size, but +because of the way it is used for holding "before" lines, the longest line that +is guaranteed to be processable is the parameter size. If a longer line is +encountered, pcre2grep automatically expands the buffer, up to a +specified maximum size, whose default is 1M or the starting size, whichever is +the larger. You can change the default parameter values by adding, for example,

    -  --with-pcre2grep-bufsize=50K
    +  --with-pcre2grep-bufsize=51200
    +  --with-pcre2grep-max-bufsize=2097152 
     
    -to the configure command. The caller of \fPpcre2grep\fP can override this -value by using --buffer-size on the command line. +to the configure command. The caller of pcre2grep can override +these values by using --buffer-size and --max-buffer-size on the command line.
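The sizing rules described above can be sketched in C. This is only an illustration of the arithmetic, not code taken from pcre2grep: the function names grep_block_bytes() and grep_default_max() and both macro names are hypothetical, and the constants are the documented defaults (20480 and 1048576).

```c
#include <assert.h>
#include <limits.h>

/* Documented defaults; the macro names themselves are hypothetical. */
#define STARTING_BUFSIZE_DEFAULT 20480UL
#define MAX_BUFSIZE_FLOOR        1048576UL

/* The block of memory actually obtained is three times the buffer-size
   parameter, to leave room for "before" and "after" lines. */
unsigned long grep_block_bytes(unsigned long bufsize)
{
    assert(bufsize <= ULONG_MAX / 3UL);   /* guard against overflow */
    return 3UL * bufsize;
}

/* The default maximum is 1048576 or the starting size, whichever is
   the larger. */
unsigned long grep_default_max(unsigned long bufsize)
{
    return (bufsize > MAX_BUFSIZE_FLOOR) ? bufsize : MAX_BUFSIZE_FLOOR;
}
```

With the default starting size of 20480, the block obtained is 61440 bytes. With --with-pcre2grep-bufsize=51200 and no explicit maximum, the default maximum stays at 1048576.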


    PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

    @@ -497,11 +501,32 @@ This cleans all coverage data including the generated coverage report. For more information about code coverage, see the gcov and lcov documentation.

    -
    SEE ALSO
    +
    SUPPORT FOR FUZZERS
    +

    +There is a special option for use by people who want to run fuzzing tests on +PCRE2: +

    +  --enable-fuzz-support
    +
    +At present this applies only to the 8-bit library. If set, it causes an extra +library called libpcre2-fuzzsupport.a to be built, but not installed. This +contains a single function called LLVMFuzzerTestOneInput() whose arguments are +a pointer to a string and the length of the string. When called, this function +tries to compile the string as a pattern, and if that succeeds, to match it. +This is done both with no options and with some random option bits that are +generated from the string. Setting --enable-fuzz-support also causes a binary +called pcre2fuzzcheck to be created. This is normally run under valgrind +or used when PCRE2 is compiled with address sanitizing enabled. It calls the +fuzzing function and outputs information about what it is doing. The input strings +are specified by arguments: if an argument starts with "=" the rest of it is a +literal input string. Otherwise, it is assumed to be a file name, and the +contents of the file are the test string. +
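The argument convention described for pcre2fuzzcheck can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual pcre2fuzzcheck source: the type input_kind and the function classify_fuzz_arg() are hypothetical names, and the sketch only decides how an argument would be interpreted, without reading any file or calling the fuzzing function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: how one command-line argument is to be treated. */
typedef enum { INPUT_LITERAL, INPUT_FILE } input_kind;

/* If the argument starts with "=", the rest of it is a literal input
   string; otherwise the whole argument is taken to be a file name.
   On return, *payload points at the literal text or at the file name. */
input_kind classify_fuzz_arg(const char *arg, const char **payload)
{
    assert(arg != NULL && payload != NULL);
    if (arg[0] == '=') {
        *payload = arg + 1;     /* skip the leading "=" */
        return INPUT_LITERAL;
    }
    *payload = arg;             /* file whose contents are the test string */
    return INPUT_FILE;
}
```

Under this sketch, the argument "=a+b" yields the literal input string "a+b", whereas "pattern.txt" is treated as the name of a file to read.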

    +
    SEE ALSO

    pcre2api(3), pcre2-config(3).

    -
    AUTHOR
    +
    AUTHOR

    Philip Hazel
    @@ -510,9 +535,9 @@ University Computing Service Cambridge, England.

    -
    REVISION
    +
    REVISION

    -Last updated: 01 April 2016 +Last updated: 01 November 2016
    Copyright © 1997-2016 University of Cambridge.
    diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html index 7e85c9a..8a4fa6c 100644 --- a/doc/html/pcre2callout.html +++ b/doc/html/pcre2callout.html @@ -57,11 +57,20 @@ two callout points:

  • If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2 automatically inserts callouts, all with number 255, before each item in the -pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern +pattern except for immediately before or after a callout item in the pattern. +For example, if PCRE2_AUTO_CALLOUT is used with the pattern +
    +  A(?C3)B
    +
    +it is processed as if it were +
    +  (?C255)A(?C3)B(?C255)   
    +
    +Here is a more complicated example:
       A(\d{2}|--)
     
    -it is processed as if it were +With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

    (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) @@ -107,10 +116,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string No match This indicates that when matching [bc] fails, there is no backtracking into a+ -and therefore the callouts that would be taken for the backtracks do not occur. -You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to -pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In this -case, the output changes to this: +(because it is being treated as a++) and therefore the callouts that would be +taken for the backtracks do not occur. You can disable the auto-possessify +feature by passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting +the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
       --->aaaa
        +0 ^        a+
    @@ -235,8 +244,8 @@ Fields for numerical callouts
     

    For a numerical callout, callout_string is NULL, and callout_number contains the number of the callout, in the range 0-255. This is the number -that follows (?C for manual callouts; it is 255 for automatically generated -callouts. +that follows (?C for callouts that are part of the pattern; it is 255 for +automatically generated callouts.


    Fields for string callouts @@ -310,10 +319,15 @@ the next item to be matched.

    The next_item_length field contains the length of the next item to be -matched in the pattern string. When the callout immediately precedes an -alternation bar, a closing parenthesis, or the end of the pattern, the length -is zero. When the callout precedes an opening parenthesis, the length is that -of the entire subpattern. +processed in the pattern string. When the callout is at the end of the pattern, +the length is zero. When the callout precedes an opening parenthesis, the +length includes meta characters that follow the parenthesis. For example, in a +callout before an assertion such as (?=ab) the length is 3. For an +alternation bar or a closing parenthesis, the length is one, unless a closing +parenthesis is followed by a quantifier, in which case its length is included. +(This changed in release 10.23. In earlier releases, before an opening +parenthesis the length was that of the entire subpattern, and before an +alternation bar or a closing parenthesis the length was zero.)

    The pattern_position and next_item_length fields are intended to @@ -399,9 +413,9 @@ Cambridge, England.


    REVISION

    -Last updated: 23 March 2015 +Last updated: 29 September 2016
    -Copyright © 1997-2015 University of Cambridge. +Copyright © 1997-2016 University of Cambridge.

    Return to the PCRE2 index page. diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html index 3b29e6f..361eb3b 100644 --- a/doc/html/pcre2compat.html +++ b/doc/html/pcre2compat.html @@ -107,7 +107,7 @@ processed as anchored at the point where they are tested. one that is backtracked onto acts. For example, in the pattern A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the -same as PCRE2, but there are examples where it differs. +same as PCRE2, but there are cases where it differs.

    11. Most backtracking verbs in assertions have their normal actions. They are @@ -123,7 +123,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to 13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern names is not as general as Perl's. This is a consequence of the fact that PCRE2 works internally just with numbers, using an external table to translate -between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b)B), +between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have the same number but different names, is not supported, and causes an error at compile time. If it were allowed, it would not be possible to distinguish which parentheses matched, because both @@ -131,10 +131,11 @@ names map to capturing subpattern number 1. To avoid this confusing situation, an error is given at compile time.

    -14. Perl recognizes comments in some places that PCRE2 does not, for example, -between the ( and ? at the start of a subpattern. If the /x modifier is set, -Perl allows white space between ( and ? (though current Perls warn that this is -deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set. +14. Perl used to recognize comments in some places that PCRE2 does not, for +example, between the ( and ? at the start of a subpattern. If the /x modifier +is set, Perl allowed white space between ( and ? though the latest Perls give +an error (for a while it was just deprecated). There may still be some cases +where Perl behaves differently.

    15. Perl, when in warning mode, gives warnings for character classes such as @@ -158,45 +159,50 @@ list is with respect to Perl 5.10:
    (a) Although lookbehind assertions in PCRE2 must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length -of string. Perl requires them all to have the same length. +of string. Perl requires them all to have the same length.

    -(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $ +(b) From PCRE2 10.23, back references to groups of fixed length are supported +in lookbehinds, provided that there is no possibility of referencing a +non-unique number or name. Perl does not support backreferences in lookbehinds. +
    +
    +(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $ meta-character matches only at the very end of the string.

    -(c) A backslash followed by a letter with no special meaning is faulted. (Perl +(d) A backslash followed by a letter with no special meaning is faulted. (Perl can be made to issue a warning.)

    -(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is +(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.

    -(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried +(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried only at the first matching position in the subject string.

    -(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and +(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.

    -(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF +(h) The \R escape sequence can be restricted to match only CR, LF, or CRLF by the PCRE2_BSR_ANYCRLF option.

    -(h) The callout facility is PCRE2-specific. +(i) The callout facility is PCRE2-specific.

    -(i) The partial matching facility is PCRE2-specific. +(j) The partial matching facility is PCRE2-specific.

    -(j) The alternative matching function (pcre2_dfa_match() matches in a +(k) The alternative matching function (pcre2_dfa_match()) matches in a different way and is not Perl-compatible.

    -(k) PCRE2 recognizes some special sequences such as (*CR) at the start of +(l) PCRE2 recognizes some special sequences such as (*CR) at the start of a pattern that set overall options that cannot be changed within the pattern.


    @@ -214,9 +220,9 @@ Cambridge, England. REVISION

    -Last updated: 15 March 2015 +Last updated: 18 October 2016
    -Copyright © 1997-2015 University of Cambridge. +Copyright © 1997-2016 University of Cambridge.

    Return to the PCRE2 index page. diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html index d02d365..c1a982c 100644 --- a/doc/html/pcre2grep.html +++ b/doc/html/pcre2grep.html @@ -80,11 +80,19 @@ span line boundaries. What defines a line boundary is controlled by the

    The amount of memory used for buffering files that are being scanned is -controlled by a parameter that can be set by the --buffer-size option. -The default value for this parameter is specified when pcre2grep is -built, with the default default being 20K. A block of memory three times this -size is used (to allow for buffering "before" and "after" lines). An error -occurs if a line overflows the buffer. +controlled by parameters that can be set by the --buffer-size and +--max-buffer-size options. The first of these sets the size of buffer +that is obtained at the start of processing. If an input file contains very +long lines, a larger buffer may be needed; this is handled by automatically +extending the buffer, up to the limit specified by --max-buffer-size. The +default values for these parameters are specified when pcre2grep is +built, with the default defaults being 20K and 1M respectively. An error occurs +if a line is too long and the buffer can no longer be expanded. +

    +

    +The block of memory that is actually used is three times the "buffer size", to +allow for buffering "before" and "after" lines. If the buffer size is too +small, fewer than requested "before" and "after" lines may be output.

    Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater. @@ -155,12 +163,13 @@ processing of patterns and file names that start with hyphens.

    -A number, --after-context=number -Output number lines of context after each matching line. If file names -and/or line numbers are being output, a hyphen separator is used instead of a -colon for the context lines. A line containing "--" is output between each -group of lines, unless they are in fact contiguous in the input file. The value -of number is expected to be relatively small. However, pcre2grep -guarantees to have up to 8K of following text available for context output. +Output up to number lines of context after each matching line. Fewer +lines are output if the next match or the end of the file is reached, or if the +processing buffer size has been set too small. If file names and/or line +numbers are being output, a hyphen separator is used instead of a colon for the +context lines. A line containing "--" is output between each group of lines, +unless they are in fact contiguous in the input file. The value of number +is expected to be relatively small. When -c is used, -A is ignored.

    -a, --text @@ -169,12 +178,14 @@ Treat binary files as text. This is equivalent to

    -B number, --before-context=number -Output number lines of context before each matching line. If file names -and/or line numbers are being output, a hyphen separator is used instead of a -colon for the context lines. A line containing "--" is output between each -group of lines, unless they are in fact contiguous in the input file. The value -of number is expected to be relatively small. However, pcre2grep -guarantees to have up to 8K of preceding text available for context output. +Output up to number lines of context before each matching line. Fewer +lines are output if the previous match or the start of the file is within +number lines, or if the processing buffer size has been set too small. If +file names and/or line numbers are being output, a hyphen separator is used +instead of a colon for the context lines. A line containing "--" is output +between each group of lines, unless they are in fact contiguous in the input +file. The value of number is expected to be relatively small. When +-c is used, -B is ignored.

    --binary-files=word @@ -191,8 +202,9 @@ return code.

    --buffer-size=number -Set the parameter that controls how much memory is used for buffering files -that are being scanned. +Set the parameter that controls how much memory is obtained at the start of +processing for buffering files that are being scanned. See also +--max-buffer-size below.

    -C number, --context=number @@ -202,14 +214,16 @@ This is equivalent to setting both -A and -B to the same value.

    -c, --count Do not output lines from the files that are being scanned; instead output the -number of matches (or non-matches if -v is used) that would otherwise -have caused lines to be shown. By default, this count is the same as the number -of suppressed lines, but if the -M (multiline) option is used (without --v), there may be more suppressed lines than the number of matches. +number of lines that would have been shown, either because they matched, or, if +-v is set, because they failed to match. By default, this count is +exactly the same as the number of lines that would have been output, but if the +-M (multiline) option is used (without -v), there may be more +suppressed lines than the count (that is, the number of matches).

    If no lines are selected, the number zero is output. If several files are are -being scanned, a count is output for each of them. However, if the +being scanned, a count is output for each of them and the -t option can +be used to cause a total to be output at the end. However, if the --files-with-matches option is also used, only those files whose counts are greater than zero are listed. When -c is used, the -A, -B, and -C options are ignored. @@ -232,11 +246,12 @@ just one, in order to colour them all.

    The colour that is used can be specified by setting the environment variable -PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a -string of two numbers, separated by a semicolon. They are copied directly into -the control string for setting colour on a terminal, so it is your -responsibility to ensure that they make sense. If neither of the environment -variables is set, the default is "1;31", which gives red. +PCRE2GREP_COLOUR or PCRE2GREP_COLOR. If neither of these is set, +pcre2grep looks for GREP_COLOUR or GREP_COLOR. The value of the variable +should be a string of two numbers, separated by a semicolon. They are copied +directly into the control string for setting colour on a terminal, so it is +your responsibility to ensure that they make sense. If none of these +environment variables is set, the default is "1;31", which gives red.

    -D action, --devices=action @@ -321,24 +336,24 @@ files; it does not apply to patterns specified by any of the --include or

    -f filename, --file=filename -Read patterns from the file, one per line, and match them against -each line of input. What constitutes a newline when reading the file is the -operating system's default. The --newline option has no effect on this -option. Trailing white space is removed from each line, and blank lines are -ignored. An empty file contains no patterns and therefore matches nothing. See -also the comments about multiple patterns versus a single pattern with -alternatives in the description of -e above. +Read patterns from the file, one per line, and match them against each line of +input. What constitutes a newline when reading the file is the operating +system's default. The --newline option has no effect on this option. +Trailing white space is removed from each line, and blank lines are ignored. An +empty file contains no patterns and therefore matches nothing. See also the +comments about multiple patterns versus a single pattern with alternatives in +the description of -e above.

    -If this option is given more than once, all the specified files are -read. A data line is output if any of the patterns match it. A file name can -be given as "-" to refer to the standard input. When -f is used, patterns +If this option is given more than once, all the specified files are read. A +data line is output if any of the patterns match it. A file name can be given +as "-" to refer to the standard input. When -f is used, patterns specified on the command line using -e may also be present; they are tested before the file's patterns. However, no other pattern is taken from the command line; all arguments are treated as the names of paths to be searched.

    ---file-list=filename +--file-list=filename Read a list of files and/or directories that are to be scanned from the given file, one per line. Trailing white space is removed from each line, and blank lines are ignored. These paths are processed before any that are listed on the @@ -502,22 +517,24 @@ There are no short forms for these options. The default settings are specified when the PCRE2 library is compiled, with the default default being 10 million.

    +--max-buffer-size=number +This limits the expansion of the processing buffer, whose initial size can be +set by --buffer-size. The maximum buffer size is silently forced to be no +smaller than the starting buffer size. +

    +

    -M, --multiline -Allow patterns to match more than one line. When this option is given, patterns -may usefully contain literal newline characters and internal occurrences of ^ -and $ characters. The output for a successful match may consist of more than -one line. The first is the line in which the match started, and the last is the -line in which the match ended. If the matched string ends with a newline -sequence the output ends at the end of that line. -
    -
    -When this option is set, the PCRE2 library is called in "multiline" mode. This -allows a matched string to extend past the end of a line and continue on one or -more subsequent lines. However, pcre2grep still processes the input line -by line. Once a match has been handled, scanning restarts at the beginning of -the next line, just as it does when -M is not present. This means that it -is possible for the second or subsequent lines in a multiline match to be -output again as part of another match. +Allow patterns to match more than one line. When this option is set, the PCRE2 +library is called in "multiline" mode. This allows a matched string to extend +past the end of a line and continue on one or more subsequent lines. Patterns +used with -M may usefully contain literal newline characters and internal +occurrences of ^ and $ characters. The output for a successful match may +consist of more than one line. The first line is the line in which the match +started, and the last line is the line in which the match ended. If the matched +string ends with a newline sequence, the output ends at the end of that line. +If -v is set, none of the lines in a multi-line match are output. Once a +match has been handled, scanning restarts at the beginning of the line after +the one in which the match ended.

    The newline sequence that separates multiple lines must be matched as part of @@ -533,11 +550,8 @@ well as possibly handling a two-character newline sequence.

    There is a limit to the number of lines that can be matched, imposed by the way -that pcre2grep buffers the input file as it scans it. However, -pcre2grep ensures that at least 8K characters or the rest of the file -(whichever is the shorter) are available for forward matching, and similarly -the previous 8K characters (or all the previous characters, if fewer than 8K) -are guaranteed to be available for lookbehind assertions. The -M option +that pcre2grep buffers the input file as it scans it. With a sufficiently +large processing buffer, this should not be a problem, but the -M option does not work when input is read line by line (see \fP--line-buffered\fP.)

    @@ -585,12 +599,13 @@ It should never be needed in normal use. Show only the part of the line that matched a pattern instead of the whole line. In this mode, no context is shown. That is, the -A, -B, and -C options are ignored. If there is more than one match in a line, each -of them is shown separately. If -o is combined with -v (invert the -sense of the match to find non-matching lines), no output is generated, but the -return code is set appropriately. If the matched portion of the line is empty, -nothing is output unless the file name or line number are being printed, in -which case they are shown on an otherwise empty line. This option is mutually -exclusive with --file-offsets and --line-offsets. +of them is shown separately, on a separate line of output. If -o is +combined with -v (invert the sense of the match to find non-matching +lines), no output is generated, but the return code is set appropriately. If +the matched portion of the line is empty, nothing is output unless the file +name or line number are being printed, in which case they are shown on an +otherwise empty line. This option is mutually exclusive with +--file-offsets and --line-offsets.

    -onumber, --only-matching=number @@ -604,10 +619,11 @@ capturing parentheses do not exist in the pattern, or were not set in the match, nothing is output unless the file name or line number are being output.

    -If this option is given multiple times, multiple substrings are output, in the -order the options are given. For example, -o3 -o1 -o3 causes the substrings -matched by capturing parentheses 3 and 1 and then 3 again to be output. By -default, there is no separator (but see the next option). +If this option is given multiple times, multiple substrings are output for each +match, in the order the options are given, and all on one line. For example, +-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and +then 3 again to be output. By default, there is no separator (but see the next +option).

    --om-separator=text @@ -638,6 +654,18 @@ quietly skipped. However, the return code is still 2, even if matches were found in other files.

    +-t, --total-count +This option is useful when scanning more than one file. If used on its own, +-t suppresses all output except for a grand total number of matching +lines (or non-matching lines if -v is used) in all the files. If -t +is used with -c, a grand total is output except when the previous output +is just one line. In other words, it is not output when just one file's count +is listed. If file names are being output, the grand total is preceded by +"TOTAL:". Otherwise, it appears as just another number. The -t option is +ignored when used with -L (list files without matches), because the grand +total would always be zero. +

    +

    -u, --utf-8 Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled with UTF-8 support. All patterns (including those for any --exclude and @@ -665,11 +693,12 @@ specified by any of the --include or --exclude options.

    -x, --line-regex, --line-regexp Force the patterns to be anchored (each must start matching at the beginning of -a line) and in addition, require them to match entire lines. This is equivalent -to having ^ and $ characters at the start and end of each alternative top-level -branch in every pattern. This option applies only to the patterns that are -matched against the contents of files; it does not apply to patterns specified -by any of the --include or --exclude options. +a line) and in addition, require them to match entire lines. In multiline mode +the match may be more than one line. This is equivalent to having \A and \Z +characters at the start and end of each alternative top-level branch in every +pattern. This option applies only to the patterns that are matched against the +contents of files; it does not apply to patterns specified by any of the +--include or --exclude options.


    ENVIRONMENT VARIABLES

    @@ -831,7 +860,7 @@ Cambridge, England.


    REVISION

    -Last updated: 19 June 2016 +Last updated: 31 October 2016
    Copyright © 1997-2016 University of Cambridge.
    diff --git a/doc/html/pcre2limits.html b/doc/html/pcre2limits.html index e227a30..c077e6c 100644 --- a/doc/html/pcre2limits.html +++ b/doc/html/pcre2limits.html @@ -61,14 +61,10 @@ The maximum length of a lookbehind assertion is 65535 characters. There is no limit to the number of parenthesized subpatterns, but there can be no more than 65535 capturing subpatterns. There is, however, a limit to the depth of nesting of parenthesized subpatterns of all kinds. This is imposed in -order to limit the amount of system stack used at compile time. The limit can -be specified when PCRE2 is built; the default is 250. -

    -

    -There is a limit to the number of forward references to subsequent subpatterns -of around 200,000. Repeated forward references with fixed upper limits, for -example, (?2){0,100} when subpattern number 2 is to the right, are included in -the count. There is no limit to the number of backward references. +order to limit the amount of system stack used at compile time. The default +limit can be specified when PCRE2 is built; the default default is 250. An +application can change this limit by calling pcre2_set_parens_nest_limit() to +set the limit in a compile context.

    The maximum length of name for a named subpattern is 32 code units, and the @@ -76,7 +72,12 @@ maximum number of named subpatterns is 10000.

    The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb -is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries. +is 255 code units for the 8-bit library and 65535 code units for the 16-bit and +32-bit libraries. +

    +

    +The maximum length of a string argument to a callout is the largest number a +32-bit unsigned integer can hold.


    AUTHOR @@ -93,9 +94,9 @@ Cambridge, England. REVISION

    -Last updated: 05 November 2015 +Last updated: 26 October 2016
    -Copyright © 1997-2015 University of Cambridge. +Copyright © 1997-2016 University of Cambridge.

    Return to the PCRE2 index page. diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 797690a..8ddb994 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -379,32 +379,31 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the code unit following \c has a value less than 32 or greater than 126, a -compile-time error occurs. This locks out non-printable ASCII characters in all -modes. +compile-time error occurs.
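The ASCII-mode rule for \c quoted in this hunk (convert a lower case letter to upper case, then invert bit 6, hex 40) can be sketched as a standalone C function. The name ctrl_escape_value is hypothetical and is not part of the PCRE2 API; the sketch assumes an ASCII environment and returns -1 where the text says a compile-time error occurs.

```c
/* Hypothetical helper, not a PCRE2 function: returns the code value that
   \c followed by ch stands for in an ASCII environment, or -1 when ch is
   less than 32 or greater than 126 (a compile-time error in PCRE2). */
static int ctrl_escape_value(int ch)
{
  if (ch < 32 || ch > 126) return -1;          /* outside the permitted range */
  if (ch >= 'a' && ch <= 'z') ch -= 'a' - 'A'; /* lower case -> upper case (ASCII) */
  return ch ^ 0x40;                            /* invert bit 6 (hex 40) */
}
```

With this helper, 'A' and 'Z' map to hex 01 and hex 1A, '{' maps to hex 3B, ';' maps to hex 7B, and '?' maps to hex 7F (DEL), which agrees with the values given in the surrounding text.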

    When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c escape is processed as specified for Perl in the perlebcdic document. The only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any -other character provokes a compile-time error. The sequence \@ encodes -character code 0; the letters (in either case) encode characters 1-26 (hex 01 -to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and -\? becomes either 255 (hex FF) or 95 (hex 5F). +other character provokes a compile-time error. The sequence \c@ encodes +character code 0; after \c the letters (in either case) encode characters 1-26 +(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex +1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).

    -Thus, apart from \?, these escapes generate the same character code values as +Thus, apart from \c?, these escapes generate the same character code values as they do in an ASCII environment, though the meanings of the values mostly -differ. For example, \G always generates code value 7, which is BEL in ASCII +differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC.

    -The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but +The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but because 127 is not a control character in EBCDIC, Perl makes it generate the APC character. Unfortunately, there are several variants of EBCDIC. In most of them the APC character has the value 255 (hex FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC -values, PCRE2 makes \? generate 95; otherwise it generates 255. +values, PCRE2 makes \c? generate 95; otherwise it generates 255.

    After \0 up to two further octal digits are read. If there are fewer than two @@ -526,9 +525,9 @@ by code point, as described in the previous section. Absolute and relative back references

    -The sequence \g followed by an unsigned or a negative number, optionally -enclosed in braces, is an absolute or relative back reference. A named back -reference can be coded as \g{name}. Back references are discussed +The sequence \g followed by a signed or unsigned number, optionally enclosed +in braces, is an absolute or relative back reference. A named back reference +can be coded as \g{name}. Back references are discussed later, following the discussion of parenthesized subpatterns. @@ -1326,13 +1325,32 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A class such as [^a] always matches one of these characters.

    +The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, +\V, \w, and \W may appear in a character class, and add the characters that +they match to the class. For example, [\dABCDEF] matches any hexadecimal +digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w +and their upper case partners, just as it does when they appear outside a +character class, as described in the section entitled +"Generic character types" +above. The escape sequence \b has a different meaning inside a character +class; it matches the backspace character. The sequences \B, \N, \R, and \X +are not special inside a character class. Like any other unrecognized escape +sequences, they cause an error. +

    +

    The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as -indicating a range, typically as the first or last character in the class, or -immediately after a range. For example, [b-d-z] matches letters in the range b -to d, a hyphen character, or z. +indicating a range, typically as the first or last character in the class, +or immediately after a range. For example, [b-d-z] matches letters in the range +b to d, a hyphen character, or z. +

    +

    +Perl treats a hyphen as a literal if it appears before a POSIX class (see +below) or a character type escape such as \d, but gives a warning in its +warning mode, as this is most likely a user error. As PCRE2 has no facility for +warning, an error is given in these cases.

    It is not possible to have the literal character "]" as the end character of a @@ -1344,12 +1362,6 @@ followed by two other characters. The octal or hexadecimal representation of "]" can also be used to end a range.

    -An error is generated if a POSIX character class (see below) or an escape -sequence other than one that defines a single character appears at a point -where a range ending character is expected. For example, [z-\xff] is valid, -but [A-\d] and [A-[:digit:]] are not. -

    -

    Ranges normally include all code points between the start and end characters, inclusive. They can also be used for code points specified numerically, for example [\000-\037]. Ranges can include any characters that are valid for the @@ -1372,19 +1384,6 @@ tables for a French locale are in use, [\xc8-\xcb] matches accented E characters in both cases.

    -The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, -\V, \w, and \W may appear in a character class, and add the characters that -they match to the class. For example, [\dABCDEF] matches any hexadecimal -digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w -and their upper case partners, just as it does when they appear outside a -character class, as described in the section entitled -"Generic character types" -above. The escape sequence \b has a different meaning inside a character -class; it matches the backspace character. The sequences \B, \N, \R, and \X -are not special inside a character class. Like any other unrecognized escape -sequences, they cause an error. -

    -

    A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore, @@ -1552,13 +1551,8 @@ respectively.

    When one of these option changes occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern -that follows. If the change is placed right at the start of a pattern, PCRE2 -extracts it into the global options (and it will therefore show up in data -extracted by the pcre2_pattern_info() function). -

    -

    -An option change within a subpattern (see below for a description of -subpatterns) affects only that part of the subpattern that follows it, so +that follows. An option change within a subpattern (see below for a description +of subpatterns) affects only that part of the subpattern that follows it, so

       (a(?i)b)c
     
    @@ -2093,9 +2087,9 @@ subpattern is possible using named parentheses (see below).

    Another way of avoiding the ambiguity inherent in the use of digits following a -backslash is to use the \g escape sequence. This escape must be followed by an -unsigned number or a negative number, optionally enclosed in braces. These -examples are all identical: +backslash is to use the \g escape sequence. This escape must be followed by a +signed or unsigned number, optionally enclosed in braces. These examples are +all identical:

       (ring), \1
       (ring), \g1
    @@ -2103,8 +2097,7 @@ examples are all identical:
     
    An unsigned number specifies an absolute reference without the ambiguity that is present in the older syntax. It is also useful when literal digits follow -the reference. A negative number is a relative reference. Consider this -example: +the reference. A signed number is a relative reference. Consider this example:
       (abc(def)ghi)\g{-1}
     
    @@ -2115,6 +2108,11 @@ can be helpful in long patterns, and also in patterns that are created by joining together fragments that contain references within themselves.

    +The sequence \g{+1} is a reference to the next capturing subpattern. This kind +of forward reference can be useful in patterns that repeat. Perl does not +support the use of + in this way. +

    +

    A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself (see @@ -2214,6 +2212,14 @@ capturing is carried out only for positive assertions. (Perl sometimes, but not always, does do capturing in negative assertions.)

    +WARNING: If a positive assertion containing one or more capturing subpatterns +succeeds, but failure to match later in the pattern causes backtracking over +this assertion, the captures within the assertion are reset only if no higher +numbered captures are already set. This is, unfortunately, a fundamental +limitation of the current implementation; it may get removed in a future +reworking. +

    +

    For compatibility with Perl, most assertion subpatterns may be repeated; though it makes no sense to assert the same thing several times, the side effect of capturing parentheses may occasionally be useful. However, an assertion that @@ -2310,18 +2316,31 @@ match. If there are insufficient characters before the current position, the assertion fails.

    -In a UTF mode, PCRE2 does not allow the \C escape (which matches a single code -unit even in a UTF mode) to appear in lookbehind assertions, because it makes -it impossible to calculate the length of the lookbehind. The \X and \R -escapes, which can match different numbers of code units, are also not -permitted. +In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a +single code unit even in a UTF mode) to appear in lookbehind assertions, +because it makes it impossible to calculate the length of the lookbehind. The +\X and \R escapes, which can match different numbers of code units, are never +permitted in lookbehinds.

    "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long -as the subpattern matches a fixed-length string. -Recursion, -however, is not supported. +as the subpattern matches a fixed-length string. However, +recursion, +that is, a "subroutine" call into a group that is already active, +is not supported. +

    +

    +Perl does not support back references in lookbehinds. PCRE2 does support them, +but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option +must not be set, there must be no use of (?| in the pattern (it creates +duplicate subpattern numbers), and if the back reference is by name, the name +must be unique. Of course, the referenced subpattern must itself be of fixed +length. The following pattern matches words containing at least two characters +that begin and end with the same character: +

    +   \b(\w)\w++(?<=\1)
    +

    Possessive quantifiers can be used in conjunction with lookbehind assertions to @@ -2459,7 +2478,9 @@ Checking for a used subpattern by name

    Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used subpattern by name. For compatibility with earlier versions of PCRE1, which had -this facility before Perl, the syntax (?(name)...) is also recognized. +this facility before Perl, the syntax (?(name)...) is also recognized. Note, +however, that undelimited names consisting of the letter R followed by digits +are ambiguous (see the following section).

    Rewriting the above example to use a named subpattern gives this: @@ -2474,30 +2495,52 @@ matched. Checking for pattern recursion

    -If the condition is the string (R), and there is no subpattern with the name R, -the condition is true if a recursive call to the whole pattern or any -subpattern has been made. If digits or a name preceded by ampersand follow the -letter R, for example: +"Recursion" in this sense refers to any subroutine-like call from one part of +the pattern to another, whether or not it is actually recursive. See the +sections entitled +"Recursive patterns" +and +"Subpatterns as subroutines" +below for details of recursion and subpattern calls. +

    +

    +If a condition is the string (R), and there is no subpattern with the name R, +the condition is true if matching is currently in a recursion or subroutine +call to the whole pattern or any subpattern. If digits follow the letter R, and +there is no subpattern with that name, the condition is true if the most recent +call is into a subpattern with the given number, which must exist somewhere in +the overall pattern. This is a contrived example that is equivalent to a+b:

    -  (?(R3)...) or (?(R&name)...)
    +  ((?(R1)a+|(?1)b))
     
    -the condition is true if the most recent recursion is into a subpattern whose -number or name is given. This condition does not check the entire recursion -stack. If the name used in a condition of this kind is a duplicate, the test is -applied to all subpatterns of the same name, and is true if any one of them is -the most recent recursion. +However, in both cases, if there is a subpattern with a matching name, the +condition tests for its being set, as described in the section above, instead +of testing for recursion. For example, creating a group with the name R1 by +adding (?<R1>) to the above pattern completely changes its meaning. +

    +

    +If a name preceded by ampersand follows the letter R, for example: +

    +  (?(R&name)...)
    +
    +the condition is true if the most recent recursion is into a subpattern of that +name (which must exist within the pattern). +

    +

    +This condition does not check the entire recursion stack. It tests only the +current level. If the name used in a condition of this kind is a duplicate, the +test is applied to all subpatterns of the same name, and is true if any one of +them is the most recent recursion.

    At "top level", all these recursion test conditions are false. -The syntax for recursive patterns -is described below.


    Defining subpatterns for use by reference only

    -If the condition is the string (DEFINE), and there is no subpattern with the -name DEFINE, the condition is always false. In this case, there may be only one +If the condition is the string (DEFINE), the condition is always false, even if +there is a group with the name DEFINE. In this case, there may be only one alternative in the subpattern. It is always skipped if control reaches this point in the pattern; the idea of DEFINE is that it can be used to define subroutines that can be referenced from elsewhere. (The use of @@ -2965,12 +3008,22 @@ depending on whether or not a name is present. By default, for compatibility with Perl, a name is any sequence of characters that does not include a closing parenthesis. The name is not processed in any way, and it is not possible to include a closing parenthesis in the name. -However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing -is applied to verb names and only an unescaped closing parenthesis terminates -the name. A closing parenthesis can be included in a name either as \) or -between \Q and \E. If the PCRE2_EXTENDED option is set, unescaped whitespace -in verb names is skipped and #-comments are recognized, exactly as in the rest -of the pattern. +This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result +is no longer Perl-compatible. +

    +

    +When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names +and only an unescaped closing parenthesis terminates the name. However, the +only backslash items that are permitted are \Q, \E, and sequences such as +\x{100} that define character code points. Character type escapes such as \d +are faulted. +

    +

    +A closing parenthesis can be included in a name either as \) or between \Q +and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is +also set, unescaped whitespace in verb names is skipped, and #-comments are +recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not +affect verb names unless PCRE2_ALT_VERBNAMES is also set.

    The maximum length of a name is 255 in the 8-bit library and 65535 in the @@ -3393,7 +3446,7 @@ Cambridge, England.


    REVISION

    -Last updated: 20 June 2016 +Last updated: 23 October 2016
    Copyright © 1997-2016 University of Cambridge.
    diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index 7fdc0dc..8d20353 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -492,6 +492,9 @@ Each top-level branch of a look behind must be of a fixed length. \n reference by number (can be ambiguous) \gn reference by number \g{n} reference by number + \g+n relative reference by number (PCRE2 extension) + \g-n relative reference by number + \g{+n} relative reference by number (PCRE2 extension) \g{-n} relative reference by number \k<name> reference by name (Perl) \k'name' reference by name (Perl) @@ -530,14 +533,17 @@ Each top-level branch of a look behind must be of a fixed length. (?(-n) relative reference condition (?(<name>) named reference condition (Perl) (?('name') named reference condition (Perl) - (?(name) named reference condition (PCRE2) + (?(name) named reference condition (PCRE2, deprecated) (?(R) overall recursion condition - (?(Rn) specific group recursion condition - (?(R&name) specific recursion condition + (?(Rn) specific numbered group recursion condition + (?(R&name) specific named group recursion condition (?(DEFINE) define subpattern for reference (?(VERSION[>]=n.m) test PCRE2 version (?(assert) assertion condition -

    + +Note the ambiguity of (?(R) and (?(Rn) which might be named reference +conditions or recursion tests. Such a condition is interpreted as a reference +condition if the relevant named group exists.


    BACKTRACKING CONTROL

    @@ -589,9 +595,9 @@ Cambridge, England.


    REVISION

    -Last updated: 16 October 2015 +Last updated: 28 September 2016
    -Copyright © 1997-2015 University of Cambridge. +Copyright © 1997-2016 University of Cambridge.

    Return to the PCRE2 index page. diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html index 7509b37..dc1b1dd 100644 --- a/doc/html/pcre2test.html +++ b/doc/html/pcre2test.html @@ -615,6 +615,7 @@ about the pattern: pushcopy push a copy onto the stack stackguard=<number> test the stackguard feature tables=[0|1|2] select internal tables + use_length do not zero-terminate the pattern utf8_input treat input as UTF-8 The effects of these modifiers are described in the following sections. @@ -698,6 +699,18 @@ testing that pcre2_compile() behaves correctly in this case (it uses default values).


    +Specifying the pattern's length +
    +

    +By default, patterns are passed to the compiling functions as zero-terminated +strings. When using the POSIX wrapper API, there is no other option. However, +when using PCRE2's native API, patterns can be passed by length instead of +being zero-terminated. The use_length modifier causes this to happen. +Using a length happens automatically (whether or not use_length is set) +when hex is set, because patterns specified in hexadecimal may contain +binary zeros. +
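The reason a length is needed for hexadecimal patterns can be shown with a small C sketch. It is a simplification made for illustration: `decode_hex_pattern` is a hypothetical helper, not pcre2test's code, and it accepts only pairs of hex digits separated by spaces.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode pairs of hex digits (spaces are skipped) into
   out[]. Returns the number of bytes written, or -1 on malformed input or
   if the output buffer is too small. */
long decode_hex_pattern(const char *hex, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    while (*hex != '\0') {
        if (*hex == ' ') { hex++; continue; }
        int hi = hex_value(hex[0]);
        int lo = (hi < 0) ? -1 : hex_value(hex[1]);
        if (hi < 0 || lo < 0 || n >= outsize) return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        hex += 2;
    }
    return (long)n;
}
```

Decoding `41 00 42` yields three bytes, but `strlen()` on the result stops at the embedded zero and reports one. A caller that relied on zero termination would therefore compile only part of the pattern, which is why the decoded length has to be passed explicitly.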

    +
    Specifying pattern characters in hexadecimal

    @@ -720,10 +733,10 @@ the delimiter within a substring. The hex and expand modifiers are mutually exclusive.

    -By default, pcre2test passes patterns as zero-terminated strings to -pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However, for -patterns specified with the hex modifier, the actual length of the -pattern is passed. +The POSIX API cannot be used with patterns specified in hexadecimal because +they may contain binary zeros, which conflicts with regcomp()'s +requirement for a zero-terminated string. Such patterns are always passed to +pcre2_compile() as a string with a length, not as zero-terminated.


    Specifying wide characters in 16-bit and 32-bit modes @@ -1753,7 +1766,7 @@ Cambridge, England.


    REVISION

    -Last updated: 02 August 2016 +Last updated: 04 November 2016
    Copyright © 1997-2016 University of Cambridge.
    diff --git a/doc/index.html.src b/doc/index.html.src index 703c298..eebb80b 100644 --- a/doc/index.html.src +++ b/doc/index.html.src @@ -94,6 +94,9 @@ in the library. pcre2_code_copy   Copy a compiled pattern +pcre2_code_copy_with_tables +   Copy a compiled pattern and its character tables + pcre2_code_free   Free a compiled pattern diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 40efeea..33e8ddd 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -379,6 +379,8 @@ PCRE2 NATIVE API AUXILIARY FUNCTIONS pcre2_code *pcre2_code_copy(const pcre2_code *code); + pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); + int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, PCRE2_SIZE bufflen); @@ -626,8 +628,8 @@ MULTITHREADING similar logic is required. JIT compilation updates a pointer within the compiled code block, so a thread must gain unique write access to the pointer before calling pcre2_jit_compile(). Alternatively, - pcre2_code_copy() can be used to obtain a private copy of the compiled - code. + pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to + obtain a private copy of the compiled code. Context blocks @@ -789,7 +791,9 @@ PCRE2 CONTEXTS This parameter ajusts the limit, set when PCRE2 is built (default 250), on the depth of parenthesis nesting in a pattern. This limit stops - rogue patterns using up too much system stack when being compiled. + rogue patterns using up too much system stack when being compiled. The + limit applies to parentheses of all kinds, not just capturing parenthe- + ses. int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, int (*guard_function)(uint32_t, void *), void *user_data); @@ -1102,6 +1106,8 @@ COMPILING A PATTERN pcre2_code *pcre2_code_copy(const pcre2_code *code); + pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); + The pcre2_compile() function compiles a pattern into an internal form. The pattern is defined by a pointer to a string of code units and a length. 
If the pattern is zero-terminated, the length can be specified @@ -1120,54 +1126,71 @@ COMPILING A PATTERN However, if the code has been processed by the JIT compiler (see below), the JIT information cannot be copied (because it is position- dependent). The new copy can initially be used only for non-JIT match- - ing, though it can be passed to pcre2_jit_compile() if required. The - pcre2_code_copy() function provides a way for individual threads in a - multithreaded application to acquire a private copy of shared compiled - code. + ing, though it can be passed to pcre2_jit_compile() if required. - NOTE: When one of the matching functions is called, pointers to the + The pcre2_code_copy() function provides a way for individual threads in + a multithreaded application to acquire a private copy of shared com- + piled code. However, it does not make a copy of the character tables + used by the compiled pattern; the new pattern code points to the same + tables as the original code. (See "Locale Support" below for details + of these character tables.) In many applications the same tables are + used throughout, so this behaviour is appropriate. Nevertheless, there + are occasions when a copy of a compiled pattern and the relevant tables + are needed. The pcre2_code_copy_with_tables() provides this facility. + Copies of both the code and the tables are made, with the new code + pointing to the new tables. The memory for the new tables is automati- + cally freed when pcre2_code_free() is called for the new copy of the + compiled code. + + NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block - so that they can be referenced by the substring extraction functions. - After running a match, you must not free a compiled pattern (or a sub- - ject string) until after all operations on the match data block have + so that they can be referenced by the substring extraction functions. 
+ After running a match, you must not free a compiled pattern (or a sub- + ject string) until after all operations on the match data block have taken place. - The options argument for pcre2_compile() contains various bit settings - that affect the compilation. It should be zero if no options are - required. The available options are described below. Some of them (in - particular, those that are compatible with Perl, but some others as - well) can also be set and unset from within the pattern (see the + The options argument for pcre2_compile() contains various bit settings + that affect the compilation. It should be zero if no options are + required. The available options are described below. Some of them (in + particular, those that are compatible with Perl, but some others as + well) can also be set and unset from within the pattern (see the detailed description in the pcre2pattern documentation). - For those options that can be different in different parts of the pat- - tern, the contents of the options argument specifies their settings at - the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK + For those options that can be different in different parts of the pat- + tern, the contents of the options argument specifies their settings at + the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at the time of matching as well as at compile time. - Other, less frequently required compile-time parameters (for example, + Other, less frequently required compile-time parameters (for example, the newline setting) can be provided in a compile context (as described above). If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- - diately. Otherwise, the variables to which these point are set to an - error code and an offset (number of code units) within the pattern, - respectively, when pcre2_compile() returns NULL because a compilation + diately. 
Otherwise, the variables to which these point are set to an + error code and an offset (number of code units) within the pattern, + respectively, when pcre2_compile() returns NULL because a compilation error has occurred. The values are not defined when compilation is suc- cessful and pcre2_compile() returns a non-NULL value. - The pcre2_get_error_message() function (see "Obtaining a textual error - message" below) provides a textual message for each error code. Compi- + The value returned in erroroffset is an indication of where in the pat- + tern the error occurred. It is not necessarily the furthest point in + the pattern that was read. For example, after the error "lookbehind + assertion is not fixed length", the error offset points to the start of + the failing assertion. + + The pcre2_get_error_message() function (see "Obtaining a textual error + message" below) provides a textual message for each error code. Compi- lation errors have positive error codes; UTF formatting error codes are - negative. For an invalid UTF-8 or UTF-16 string, the offset is that of + negative. For an invalid UTF-8 or UTF-16 string, the offset is that of the first code unit of the failing character. - Some errors are not detected until the whole pattern has been scanned; - in these cases, the offset passed back is the length of the pattern. - Note that the offset is in code units, not characters, even in a UTF + Some errors are not detected until the whole pattern has been scanned; + in these cases, the offset passed back is the length of the pattern. + Note that the offset is in code units, not characters, even in a UTF mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- acter. 
- This code fragment shows a typical straightforward call to pcre2_com- + This code fragment shows a typical straightforward call to pcre2_com- pile(): pcre2_code *re; @@ -1181,71 +1204,72 @@ COMPILING A PATTERN &erroffset, /* for error offset */ NULL); /* no compile context */ - The following names for option bits are defined in the pcre2.h header + The following names for option bits are defined in the pcre2.h header file: PCRE2_ANCHORED If this bit is set, the pattern is forced to be "anchored", that is, it - is constrained to match only at the first matching point in the string - that is being searched (the "subject string"). This effect can also be - achieved by appropriate constructs in the pattern itself, which is the + is constrained to match only at the first matching point in the string + that is being searched (the "subject string"). This effect can also be + achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl. PCRE2_ALLOW_EMPTY_CLASS - By default, for compatibility with Perl, a closing square bracket that - immediately follows an opening one is treated as a data character for - the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the + By default, for compatibility with Perl, a closing square bracket that + immediately follows an opening one is treated as a data character for + the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which therefore contains no characters and so can never match. PCRE2_ALT_BSUX - This option request alternative handling of three escape sequences, - which makes PCRE2's behaviour more like ECMAscript (aka JavaScript). + This option request alternative handling of three escape sequences, + which makes PCRE2's behaviour more like ECMAscript (aka JavaScript). When it is set: (1) \U matches an upper case "U" character; by default \U causes a com- pile time error (Perl uses \U to upper case subsequent characters). 
(2) \u matches a lower case "u" character unless it is followed by four - hexadecimal digits, in which case the hexadecimal number defines the - code point to match. By default, \u causes a compile time error (Perl + hexadecimal digits, in which case the hexadecimal number defines the + code point to match. By default, \u causes a compile time error (Perl uses it to upper case the following character). - (3) \x matches a lower case "x" character unless it is followed by two - hexadecimal digits, in which case the hexadecimal number defines the - code point to match. By default, as in Perl, a hexadecimal number is + (3) \x matches a lower case "x" character unless it is followed by two + hexadecimal digits, in which case the hexadecimal number defines the + code point to match. By default, as in Perl, a hexadecimal number is always expected after \x, but it may have zero, one, or two digits (so, for example, \xz matches a binary zero character followed by z). PCRE2_ALT_CIRCUMFLEX In multiline mode (when PCRE2_MULTILINE is set), the circumflex - metacharacter matches at the start of the subject (unless PCRE2_NOTBOL - is set), and also after any internal newline. However, it does not + metacharacter matches at the start of the subject (unless PCRE2_NOTBOL + is set), and also after any internal newline. However, it does not match after a newline at the end of the subject, for compatibility with - Perl. If you want a multiline circumflex also to match after a termi- + Perl. If you want a multiline circumflex also to match after a termi- nating newline, you must set PCRE2_ALT_CIRCUMFLEX. PCRE2_ALT_VERBNAMES - By default, for compatibility with Perl, the name in any verb sequence - such as (*MARK:NAME) is any sequence of characters that does not - include a closing parenthesis. The name is not processed in any way, - and it is not possible to include a closing parenthesis in the name. 
- However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash - processing is applied to verb names and only an unescaped closing - parenthesis terminates the name. A closing parenthesis can be included - in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED + By default, for compatibility with Perl, the name in any verb sequence + such as (*MARK:NAME) is any sequence of characters that does not + include a closing parenthesis. The name is not processed in any way, + and it is not possible to include a closing parenthesis in the name. + However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash + processing is applied to verb names and only an unescaped closing + parenthesis terminates the name. A closing parenthesis can be included + in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED option is set, unescaped whitespace in verb names is skipped and #-com- ments are recognized, exactly as in the rest of the pattern. PCRE2_AUTO_CALLOUT - If this bit is set, pcre2_compile() automatically inserts callout - items, all with number 255, before each pattern item. For discussion of - the callout facility, see the pcre2callout documentation. + If this bit is set, pcre2_compile() automatically inserts callout + items, all with number 255, before each pattern item, except immedi- + ately before or after a callout in the pattern. For discussion of the + callout facility, see the pcre2callout documentation. PCRE2_CASELESS @@ -3151,7 +3175,7 @@ AUTHOR REVISION - Last updated: 17 June 2016 + Last updated: 22 November 2016 Copyright (c) 1997-2016 University of Cambridge. ------------------------------------------------------------------------------ @@ -3506,16 +3530,21 @@ PCRE2GREP BUFFER SIZE pcre2grep uses an internal buffer to hold a "window" on the file it is scanning, in order to be able to output "before" and "after" lines when - it finds a match. The size of the buffer is controlled by a parameter - whose default value is 20K. 
The buffer itself is three times this size, - but because of the way it is used for holding "before" lines, the long- - est line that is guaranteed to be processable is the parameter size. - You can change the default parameter value by adding, for example, + it finds a match. The starting size of the buffer is controlled by a + parameter whose default value is 20K. The buffer itself is three times + this size, but because of the way it is used for holding "before" + lines, the longest line that is guaranteed to be processable is the + parameter size. If a longer line is encountered, pcre2grep automati- + cally expands the buffer, up to a specified maximum size, whose default + is 1M or the starting size, whichever is the larger. You can change the + default parameter values by adding, for example, - --with-pcre2grep-bufsize=50K + --with-pcre2grep-bufsize=51200 + --with-pcre2grep-max-bufsize=2097152 - to the configure command. The caller of pcre2grep can override this - value by using --buffer-size on the command line. + to the configure command. The caller of pcre2grep can override these + values by using --buffer-size and --max-buffer-size on the command + line. PCRE2TEST OPTION FOR LIBREADLINE SUPPORT @@ -3630,6 +3659,29 @@ CODE COVERAGE REPORTING mentation. +SUPPORT FOR FUZZERS + + There is a special option for use by people who want to run fuzzing + tests on PCRE2: + + --enable-fuzz-support + + At present this applies only to the 8-bit library. If set, it causes an + extra library called libpcre2-fuzzsupport.a to be built, but not + installed. This contains a single function called LLVMFuzzerTestOneIn- + put() whose arguments are a pointer to a string and the length of the + string. When called, this function tries to compile the string as a + pattern, and if that succeeds, to match it. This is done both with no + options and with some random options bits that are generated from the + string. 
Setting --enable-fuzz-support also causes a binary called + pcre2fuzzcheck to be created. This is normally run under valgrind or + used when PCRE2 is compiled with address sanitizing enabled. It calls + the fuzzing function and outputs information about what it is doing. The + input strings are specified by arguments: if an argument starts with + "=" the rest of it is a literal input string. Otherwise, it is assumed + to be a file name, and the contents of the file are the test string. + + SEE ALSO pcre2api(3), pcre2-config(3). @@ -3644,7 +3696,7 @@ AUTHOR REVISION - Last updated: 01 April 2016 + Last updated: 01 November 2016 Copyright (c) 1997-2016 University of Cambridge. ------------------------------------------------------------------------------ @@ -3689,45 +3741,54 @@ DESCRIPTION If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2 automatically inserts callouts, all with number 255, before each - item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with + item in the pattern except for immediately before or after a callout + item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern - A(\d{2}|--) + A(?C3)B it is processed as if it were + (?C255)A(?C3)B(?C255) + + Here is a more complicated example: + + A(\d{2}|--) + + With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were + (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) - Notice that there is a callout before and after each parenthesis and + Notice that there is a callout before and after each parenthesis and alternation bar. If the pattern contains a conditional group whose con- - dition is an assertion, an automatic callout is inserted immediately + dition is an assertion, an automatic callout is inserted immediately before the condition.
Such a callout may also be inserted explicitly, for example: (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de) - This applies only to assertion conditions (because they are themselves + This applies only to assertion conditions (because they are themselves independent groups). - Callouts can be useful for tracking the progress of pattern matching. + Callouts can be useful for tracking the progress of pattern matching. The pcre2test program has a pattern qualifier (/auto_callout) that sets - automatic callouts. When any callouts are present, the output from - pcre2test indicates how the pattern is being matched. This is useful - information when you are trying to optimize the performance of a par- + automatic callouts. When any callouts are present, the output from + pcre2test indicates how the pattern is being matched. This is useful + information when you are trying to optimize the performance of a par- ticular pattern. MISSING CALLOUTS - You should be aware that, because of optimizations in the way PCRE2 + You should be aware that, because of optimizations in the way PCRE2 compiles and matches patterns, callouts sometimes do not happen exactly as you might expect. Auto-possessification At compile time, PCRE2 "auto-possessifies" repeated items when it knows - that what follows cannot be part of the repeat. For example, a+[bc] is - compiled as if it were a++[bc]. The pcre2test output when this pattern + that what follows cannot be part of the repeat. For example, a+[bc] is + compiled as if it were a++[bc]. The pcre2test output when this pattern is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string "aaaa" is: @@ -3736,11 +3797,12 @@ MISSING CALLOUTS +2 ^ ^ [bc] No match - This indicates that when matching [bc] fails, there is no backtracking - into a+ and therefore the callouts that would be taken for the back- - tracks do not occur. 
You can disable the auto-possessify feature by - passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat- - tern with (*NO_AUTO_POSSESS). In this case, the output changes to this: + This indicates that when matching [bc] fails, there is no backtracking + into a+ (because it is being treated as a++) and therefore the callouts + that would be taken for the backtracks do not occur. You can disable + the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to + pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In + this case, the output changes to this: --->aaaa +0 ^ a+ @@ -3859,8 +3921,8 @@ THE CALLOUT INTERFACE For a numerical callout, callout_string is NULL, and callout_number contains the number of the callout, in the range 0-255. This is the - number that follows (?C for manual callouts; it is 255 for automati- - cally generated callouts. + number that follows (?C for callouts that are part of the pattern; it is + 255 for automatically generated callouts. Fields for string callouts @@ -3921,10 +3983,16 @@ THE CALLOUT INTERFACE the next item to be matched. The next_item_length field contains the length of the next item to be - matched in the pattern string. When the callout immediately precedes an - alternation bar, a closing parenthesis, or the end of the pattern, the - length is zero. When the callout precedes an opening parenthesis, the - length is that of the entire subpattern. + processed in the pattern string. When the callout is at the end of the + pattern, the length is zero. When the callout precedes an opening + parenthesis, the length includes meta characters that follow the paren- + thesis. For example, in a callout before an assertion such as (?=ab) + the length is 3. For an alternation bar or a closing parenthesis, + the length is one, unless a closing parenthesis is followed by a quan- + tifier, in which case its length is included. (This changed in release + 10.23.
In earlier releases, before an opening parenthesis the length + was that of the entire subpattern, and before an alternation bar or a + closing parenthesis the length was zero.) The pattern_position and next_item_length fields are intended to help in distinguishing between different automatic callouts, which all have @@ -4008,8 +4076,8 @@ AUTHOR REVISION - Last updated: 23 March 2015 - Copyright (c) 1997-2015 University of Cambridge. + Last updated: 29 September 2016 + Copyright (c) 1997-2016 University of Cambridge. ------------------------------------------------------------------------------ @@ -4103,7 +4171,7 @@ DIFFERENCES BETWEEN PCRE2 AND PERL first one that is backtracked onto acts. For example, in the pattern A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases - it is the same as PCRE2, but there are examples where it differs. + it is the same as PCRE2, but there are cases where it differs. 11. Most backtracking verbs in assertions have their normal actions. They are not confined to the assertion. @@ -4117,18 +4185,18 @@ DIFFERENCES BETWEEN PCRE2 AND PERL pattern names is not as general as Perl's. This is a consequence of the fact the PCRE2 works internally just with numbers, using an external table to translate between numbers and names. In particular, a pattern - such as (?|(?<A>A)|(?<B)B), where the two capturing + such as (?|(?<A>A)|(?<B>B), where the two capturing parentheses have the same number but different names, is not supported, and causes an error at compile time. If it were allowed, it would not be possible to distinguish which parentheses matched, because both names map to cap- turing subpattern number 1. To avoid this confusing situation, an error is given at compile time. - 14. Perl recognizes comments in some places that PCRE2 does not, for - example, between the ( and ? at the start of a subpattern. If the /x - modifier is set, Perl allows white space between ( and ?
(though cur- - rent Perls warn that this is deprecated) but PCRE2 never does, even if - the PCRE2_EXTENDED option is set. + 14. Perl used to recognize comments in some places that PCRE2 does not, + for example, between the ( and ? at the start of a subpattern. If the + /x modifier is set, Perl allowed white space between ( and ? though the + latest Perls give an error (for a while it was just deprecated). There + may still be some cases where Perl behaves differently. 15. Perl, when in warning mode, gives warnings for character classes such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- @@ -4152,34 +4220,39 @@ DIFFERENCES BETWEEN PCRE2 AND PERL different length of string. Perl requires them all to have the same length. - (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the + (b) From PCRE2 10.23, back references to groups of fixed length are + supported in lookbehinds, provided that there is no possibility of ref- + erencing a non-unique number or name. Perl does not support backrefer- + ences in lookbehinds. + + (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $ meta-character matches only at the very end of the string. - (c) A backslash followed by a letter with no special meaning is + (d) A backslash followed by a letter with no special meaning is faulted. (Perl can be made to issue a warning.) - (d) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti- + (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti- fiers is inverted, that is, by default they are not greedy, but if fol- lowed by a question mark they are. - (e) PCRE2_ANCHORED can be used at matching time to force a pattern to + (f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried only at the first matching position in the subject string. 
- (f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, - PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl + (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, + PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents. - (g) The \R escape sequence can be restricted to match only CR, LF, or + (h) The \R escape sequence can be restricted to match only CR, LF, or CRLF by the PCRE2_BSR_ANYCRLF option. - (h) The callout facility is PCRE2-specific. + (i) The callout facility is PCRE2-specific. - (i) The partial matching facility is PCRE2-specific. + (j) The partial matching facility is PCRE2-specific. - (j) The alternative matching function (pcre2_dfa_match() matches in a + (k) The alternative matching function (pcre2_dfa_match() matches in a different way and is not Perl-compatible. - (k) PCRE2 recognizes some special sequences such as (*CR) at the start + (l) PCRE2 recognizes some special sequences such as (*CR) at the start of a pattern that set overall options that cannot be changed within the pattern. @@ -4193,8 +4266,8 @@ AUTHOR REVISION - Last updated: 15 March 2015 - Copyright (c) 1997-2015 University of Cambridge. + Last updated: 18 October 2016 + Copyright (c) 1997-2016 University of Cambridge. ------------------------------------------------------------------------------ @@ -4642,21 +4715,20 @@ SIZE AND OTHER LIMITATIONS can be no more than 65535 capturing subpatterns. There is, however, a limit to the depth of nesting of parenthesized subpatterns of all kinds. This is imposed in order to limit the amount of system stack - used at compile time. The limit can be specified when PCRE2 is built; - the default is 250. - - There is a limit to the number of forward references to subsequent sub- - patterns of around 200,000. Repeated forward references with fixed - upper limits, for example, (?2){0,100} when subpattern number 2 is to - the right, are included in the count. 
There is no limit to the number - of backward references. + used at compile time. The default limit can be specified when PCRE2 is + built; the default default is 250. An application can change this limit + by calling pcre2_set_parens_nest_limit() to set the limit in a compile + context. The maximum length of name for a named subpattern is 32 code units, and the maximum number of named subpatterns is 10000. The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or - (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and - 32-bit libraries. + (*THEN) verb is 255 code units for the 8-bit library and 65535 code + units for the 16-bit and 32-bit libraries. + + The maximum length of a string argument to a callout is the largest + number a 32-bit unsigned integer can hold. AUTHOR @@ -4668,8 +4740,8 @@ AUTHOR REVISION - Last updated: 05 November 2015 - Copyright (c) 1997-2015 University of Cambridge. + Last updated: 26 October 2016 + Copyright (c) 1997-2016 University of Cambridge. ------------------------------------------------------------------------------ @@ -5644,29 +5716,29 @@ BACKSLASH character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the code unit following \c has a value less than - 32 or greater than 126, a compile-time error occurs. This locks out - non-printable ASCII characters in all modes. + 32 or greater than 126, a compile-time error occurs. - When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen- + When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen- erate the appropriate EBCDIC code values. The \c escape is processed as specified for Perl in the perlebcdic document. The only characters that - are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. - Any other character provokes a compile-time error. 
The sequence \@ - encodes character code 0; the letters (in either case) encode charac- - ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 - (hex 1B to hex 1F), and \? becomes either 255 (hex FF) or 95 (hex 5F). + are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. + Any other character provokes a compile-time error. The sequence \c@ + encodes character code 0; after \c the letters (in either case) encode + characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters + 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 + (hex 5F). - Thus, apart from \?, these escapes generate the same character code + Thus, apart from \c?, these escapes generate the same character code values as they do in an ASCII environment, though the meanings of the - values mostly differ. For example, \G always generates code value 7, + values mostly differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC. - The sequence \? generates DEL (127, hex 7F) in an ASCII environment, + The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but because 127 is not a control character in EBCDIC, Perl makes it generate the APC character. Unfortunately, there are several variants of EBCDIC. In most of them the APC character has the value 255 (hex FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If - certain other characters have POSIX-BC values, PCRE2 makes \? generate + certain other characters have POSIX-BC values, PCRE2 makes \c? generate 95; otherwise it generates 255. After \0 up to two further octal digits are read. If there are fewer @@ -5776,10 +5848,10 @@ BACKSLASH Absolute and relative back references - The sequence \g followed by an unsigned or a negative number, option- - ally enclosed in braces, is an absolute or relative back reference. A - named back reference can be coded as \g{name}. 
Back references are dis- - cussed later, following the discussion of parenthesized subpatterns. + The sequence \g followed by a signed or unsigned number, optionally + enclosed in braces, is an absolute or relative back reference. A named + back reference can be coded as \g{name}. Back references are discussed + later, following the discussion of parenthesized subpatterns. Absolute and relative subroutine calls @@ -6404,6 +6476,18 @@ SQUARE BRACKETS AND CHARACTER CLASSES PCRE2_MULTILINE options is used. A class such as [^a] always matches one of these characters. + The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, + \w, and \W may appear in a character class, and add the characters that + they match to the class. For example, [\dABCDEF] matches any hexadeci- + mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of + \d, \s, \w and their upper case partners, just as it does when they + appear outside a character class, as described in the section entitled + "Generic character types" above. The escape sequence \b has a different + meaning inside a character class; it matches the backspace character. + The sequences \B, \N, \R, and \X are not special inside a character + class. Like any other unrecognized escape sequences, they cause an + error. + The minus (hyphen) character can be used to specify a range of charac- ters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a @@ -6413,19 +6497,19 @@ SQUARE BRACKETS AND CHARACTER CLASSES example, [b-d-z] matches letters in the range b to d, a hyphen charac- ter, or z. - It is not possible to have the literal character "]" as the end charac- - ter of a range. A pattern such as [W-]46] is interpreted as a class of - two characters ("W" and "-") followed by a literal string "46]", so it - would match "W46]" or "-46]". 
However, if the "]" is escaped with a - backslash it is interpreted as the end of range, so [W-\]46] is inter- - preted as a class containing a range followed by two other characters. - The octal or hexadecimal representation of "]" can also be used to end - a range. + Perl treats a hyphen as a literal if it appears before a POSIX class + (see below) or a character type escape such as as \d, but gives a warn- + ing in its warning mode, as this is most likely a user error. As PCRE2 + has no facility for warning, an error is given in these cases. - An error is generated if a POSIX character class (see below) or an - escape sequence other than one that defines a single character appears - at a point where a range ending character is expected. For example, - [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not. + It is not possible to have the literal character "]" as the end charac- + ter of a range. A pattern such as [W-]46] is interpreted as a class of + two characters ("W" and "-") followed by a literal string "46]", so it + would match "W46]" or "-46]". However, if the "]" is escaped with a + backslash it is interpreted as the end of range, so [W-\]46] is inter- + preted as a class containing a range followed by two other characters. + The octal or hexadecimal representation of "]" can also be used to end + a range. Ranges normally include all code points between the start and end char- acters, inclusive. They can also be used for code points specified @@ -6446,18 +6530,6 @@ SQUARE BRACKETS AND CHARACTER CLASSES character tables for a French locale are in use, [\xc8-\xcb] matches accented E characters in both cases. - The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, - \w, and \W may appear in a character class, and add the characters that - they match to the class. For example, [\dABCDEF] matches any hexadeci- - mal digit. 
In UTF modes, the PCRE2_UCP option affects the meanings of - \d, \s, \w and their upper case partners, just as it does when they - appear outside a character class, as described in the section entitled - "Generic character types" above. The escape sequence \b has a different - meaning inside a character class; it matches the backspace character. - The sequences \B, \N, \R, and \X are not special inside a character - class. Like any other unrecognized escape sequences, they cause an - error. - A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or @@ -6618,32 +6690,27 @@ INTERNAL OPTION SETTING When one of these option changes occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of - the pattern that follows. If the change is placed right at the start of - a pattern, PCRE2 extracts it into the global options (and it will - therefore show up in data extracted by the pcre2_pattern_info() func- - tion). - - An option change within a subpattern (see below for a description of - subpatterns) affects only that part of the subpattern that follows it, - so + the pattern that follows. An option change within a subpattern (see + below for a description of subpatterns) affects only that part of the + subpattern that follows it, so (a(?i)b)c - matches abc and aBc and no other strings (assuming PCRE2_CASELESS is - not used). By this means, options can be made to have different set- + matches abc and aBc and no other strings (assuming PCRE2_CASELESS is + not used). By this means, options can be made to have different set- tings in different parts of the pattern. Any changes made in one alter- native do carry on into subsequent branches within the same subpattern. 
For example, (a(?i)b|c) - matches "ab", "aB", "c", and "C", even though when matching "C" the - first branch is abandoned before the option setting. This is because - the effects of option settings happen at compile time. There would be + matches "ab", "aB", "c", and "C", even though when matching "C" the + first branch is abandoned before the option setting. This is because + the effects of option settings happen at compile time. There would be some very weird behaviour otherwise. - As a convenient shorthand, if any option settings are required at the - start of a non-capturing subpattern (see the next section), the option + As a convenient shorthand, if any option settings are required at the + start of a non-capturing subpattern (see the next section), the option letters may appear between the "?" and the ":". Thus the two patterns (?i:saturday|sunday) @@ -6651,14 +6718,14 @@ INTERNAL OPTION SETTING match exactly the same set of strings. - Note: There are other PCRE2-specific options that can be set by the + Note: There are other PCRE2-specific options that can be set by the application when the compiling function is called. The pattern can con- - tain special leading sequences such as (*CRLF) to override what the - application has set or what has been defaulted. Details are given in - the section entitled "Newline sequences" above. There are also the - (*UTF) and (*UCP) leading sequences that can be used to set UTF and - Unicode property modes; they are equivalent to setting the PCRE2_UTF - and PCRE2_UCP options, respectively. However, the application can set + tain special leading sequences such as (*CRLF) to override what the + application has set or what has been defaulted. Details are given in + the section entitled "Newline sequences" above. There are also the + (*UTF) and (*UCP) leading sequences that can be used to set UTF and + Unicode property modes; they are equivalent to setting the PCRE2_UTF + and PCRE2_UCP options, respectively. 
However, the application can set the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and (*UCP) sequences. @@ -6672,18 +6739,18 @@ SUBPATTERNS cat(aract|erpillar|) - matches "cataract", "caterpillar", or "cat". Without the parentheses, + matches "cataract", "caterpillar", or "cat". Without the parentheses, it would match "cataract", "erpillar" or an empty string. - 2. It sets up the subpattern as a capturing subpattern. This means + 2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, the portion of the subject string - that matched the subpattern is passed back to the caller, separately - from the portion that matched the whole pattern. (This applies only to - the traditional matching function; the DFA matching function does not + that matched the subpattern is passed back to the caller, separately + from the portion that matched the whole pattern. (This applies only to + the traditional matching function; the DFA matching function does not support capturing.) Opening parentheses are counted from left to right (starting from 1) to - obtain numbers for the capturing subpatterns. For example, if the + obtain numbers for the capturing subpatterns. For example, if the string "the red king" is matched against the pattern the ((red|white) (king|queen)) @@ -6691,12 +6758,12 @@ SUBPATTERNS the captured substrings are "red king", "red", and "king", and are num- bered 1, 2, and 3, respectively. - The fact that plain parentheses fulfil two functions is not always - helpful. There are often times when a grouping subpattern is required - without a capturing requirement. If an opening parenthesis is followed - by a question mark and a colon, the subpattern does not do any captur- - ing, and is not counted when computing the number of any subsequent - capturing subpatterns. 
For example, if the string "the white queen" is + The fact that plain parentheses fulfil two functions is not always + helpful. There are often times when a grouping subpattern is required + without a capturing requirement. If an opening parenthesis is followed + by a question mark and a colon, the subpattern does not do any captur- + ing, and is not counted when computing the number of any subsequent + capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) @@ -6704,37 +6771,37 @@ SUBPATTERNS the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capturing subpatterns is 65535. - As a convenient shorthand, if any option settings are required at the - start of a non-capturing subpattern, the option letters may appear + As a convenient shorthand, if any option settings are required at the + start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns (?i:saturday|sunday) (?:(?i)saturday|sunday) match exactly the same set of strings. Because alternative branches are - tried from left to right, and options are not reset until the end of - the subpattern is reached, an option setting in one branch does affect - subsequent branches, so the above patterns match "SUNDAY" as well as + tried from left to right, and options are not reset until the end of + the subpattern is reached, an option setting in one branch does affect + subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". DUPLICATE SUBPATTERN NUMBERS Perl 5.10 introduced a feature whereby each alternative in a subpattern - uses the same numbers for its capturing parentheses. Such a subpattern - starts with (?| and is itself a non-capturing subpattern. For example, + uses the same numbers for its capturing parentheses. Such a subpattern + starts with (?| and is itself a non-capturing subpattern. 
For example, consider this pattern: (?|(Sat)ur|(Sun))day - Because the two alternatives are inside a (?| group, both sets of cap- - turing parentheses are numbered one. Thus, when the pattern matches, - you can look at captured substring number one, whichever alternative - matched. This construct is useful when you want to capture part, but + Because the two alternatives are inside a (?| group, both sets of cap- + turing parentheses are numbered one. Thus, when the pattern matches, + you can look at captured substring number one, whichever alternative + matched. This construct is useful when you want to capture part, but not all, of one of a number of alternatives. Inside a (?| group, paren- - theses are numbered as usual, but the number is reset at the start of - each branch. The numbers of any capturing parentheses that follow the - subpattern start after the highest number used in any branch. The fol- + theses are numbered as usual, but the number is reset at the start of + each branch. The numbers of any capturing parentheses that follow the + subpattern start after the highest number used in any branch. The fol- lowing example is taken from the Perl documentation. The numbers under- neath show in which buffer the captured content will be stored. @@ -6742,14 +6809,14 @@ DUPLICATE SUBPATTERN NUMBERS / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x # 1 2 2 3 2 3 4 - A back reference to a numbered subpattern uses the most recent value - that is set for that number by any subpattern. The following pattern + A back reference to a numbered subpattern uses the most recent value + that is set for that number by any subpattern. The following pattern matches "abcabc" or "defdef": /(?|(abc)|(def))\1/ - In contrast, a subroutine call to a numbered subpattern always refers - to the first one in the pattern with the given number. 
The following + In contrast, a subroutine call to a numbered subpattern always refers + to the first one in the pattern with the given number. The following pattern matches "abcabc" or "defabc": /(?|(abc)|(def))(?1)/ @@ -6757,47 +6824,47 @@ DUPLICATE SUBPATTERN NUMBERS A relative reference such as (?-1) is no different: it is just a conve- nient way of computing an absolute group number. - If a condition test for a subpattern's having matched refers to a non- - unique number, the test is true if any of the subpatterns of that num- + If a condition test for a subpattern's having matched refers to a non- + unique number, the test is true if any of the subpatterns of that num- ber have matched. - An alternative approach to using this "branch reset" feature is to use + An alternative approach to using this "branch reset" feature is to use duplicate named subpatterns, as described in the next section. NAMED SUBPATTERNS - Identifying capturing parentheses by number is simple, but it can be - very hard to keep track of the numbers in complicated regular expres- - sions. Furthermore, if an expression is modified, the numbers may + Identifying capturing parentheses by number is simple, but it can be + very hard to keep track of the numbers in complicated regular expres- + sions. Furthermore, if an expression is modified, the numbers may change. To help with this difficulty, PCRE2 supports the naming of sub- patterns. This feature was not added to Perl until release 5.10. Python - had the feature earlier, and PCRE1 introduced it at release 4.0, using - the Python syntax. PCRE2 supports both the Perl and the Python syntax. - Perl allows identically numbered subpatterns to have different names, + had the feature earlier, and PCRE1 introduced it at release 4.0, using + the Python syntax. PCRE2 supports both the Perl and the Python syntax. + Perl allows identically numbered subpatterns to have different names, but PCRE2 does not. 
- In PCRE2, a subpattern can be named in one of three ways: (?<name>...) - or (?'name'...) as in Perl, or (?P<name>...) as in Python. References - to capturing parentheses from other parts of the pattern, such as back - references, recursion, and conditions, can be made by name as well as + In PCRE2, a subpattern can be named in one of three ways: (?<name>...) + or (?'name'...) as in Perl, or (?P<name>...) as in Python. References + to capturing parentheses from other parts of the pattern, such as back + references, recursion, and conditions, can be made by name as well as by number. - Names consist of up to 32 alphanumeric characters and underscores, but - must start with a non-digit. Named capturing parentheses are still - allocated numbers as well as names, exactly as if the names were not + Names consist of up to 32 alphanumeric characters and underscores, but + must start with a non-digit. Named capturing parentheses are still + allocated numbers as well as names, exactly as if the names were not present. The PCRE2 API provides function calls for extracting the name- - to-number translation table from a compiled pattern. There are also + to-number translation table from a compiled pattern. There are also convenience functions for extracting a captured substring by name. - By default, a name must be unique within a pattern, but it is possible - to relax this constraint by setting the PCRE2_DUPNAMES option at com- - pile time. (Duplicate names are also always permitted for subpatterns - with the same number, set up as described in the previous section.)
+ Duplicate names can be useful for patterns where only one instance of the named parentheses can match. Suppose you want to match the name of - a weekday, either as a 3-letter abbreviation or as the full name, and - in both cases you want to extract the abbreviation. This pattern + a weekday, either as a 3-letter abbreviation or as the full name, and + in both cases you want to extract the abbreviation. This pattern (ignoring the line breaks) does the job: (?Mon|Fri|Sun)(?:day)?| @@ -6806,18 +6873,18 @@ NAMED SUBPATTERNS (?Thu)(?:rsday)?| (?Sat)(?:urday)? - There are five capturing substrings, but only one is ever set after a + There are five capturing substrings, but only one is ever set after a match. (An alternative way of solving this problem is to use a "branch reset" subpattern, as described in the previous section.) - The convenience functions for extracting the data by name returns the - substring for the first (and in this example, the only) subpattern of - that name that matched. This saves searching to find which numbered + The convenience functions for extracting the data by name returns the + substring for the first (and in this example, the only) subpattern of + that name that matched. This saves searching to find which numbered subpattern it was. - If you make a back reference to a non-unique named subpattern from - elsewhere in the pattern, the subpatterns to which the name refers are - checked in the order in which they appear in the overall pattern. The + If you make a back reference to a non-unique named subpattern from + elsewhere in the pattern, the subpatterns to which the name refers are + checked in the order in which they appear in the overall pattern. The first one that is set is used for the reference. 
For example, this pat- tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo": @@ -6825,29 +6892,29 @@ NAMED SUBPATTERNS If you make a subroutine call to a non-unique named subpattern, the one - that corresponds to the first occurrence of the name is used. In the + that corresponds to the first occurrence of the name is used. In the absence of duplicate numbers (see the previous section) this is the one with the lowest number. If you use a named reference in a condition test (see the section about conditions below), either to check whether a subpattern has matched, or - to check for recursion, all subpatterns with the same name are tested. - If the condition is true for any one of them, the overall condition is - true. This is the same behaviour as testing by number. For further - details of the interfaces for handling named subpatterns, see the + to check for recursion, all subpatterns with the same name are tested. + If the condition is true for any one of them, the overall condition is + true. This is the same behaviour as testing by number. For further + details of the interfaces for handling named subpatterns, see the pcre2api documentation. Warning: You cannot use different names to distinguish between two sub- - patterns with the same number because PCRE2 uses only the numbers when + patterns with the same number because PCRE2 uses only the numbers when matching. For this reason, an error is given at compile time if differ- - ent names are given to subpatterns with the same number. However, you + ent names are given to subpatterns with the same number. However, you can always give the same name to subpatterns with the same number, even when PCRE2_DUPNAMES is not set. 
REPETITION - Repetition is specified by quantifiers, which can follow any of the + Repetition is specified by quantifiers, which can follow any of the following items: a literal data character @@ -6861,17 +6928,17 @@ REPETITION a parenthesized subpattern (including most assertions) a subroutine call to a subpattern (recursive or otherwise) - The general repetition quantifier specifies a minimum and maximum num- - ber of permitted matches, by giving the two numbers in curly brackets - (braces), separated by a comma. The numbers must be less than 65536, + The general repetition quantifier specifies a minimum and maximum num- + ber of permitted matches, by giving the two numbers in curly brackets + (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} - matches "zz", "zzz", or "zzzz". A closing brace on its own is not a - special character. If the second number is omitted, but the comma is - present, there is no upper limit; if the second number and the comma - are both omitted, the quantifier specifies an exact number of required + matches "zz", "zzz", or "zzzz". A closing brace on its own is not a + special character. If the second number is omitted, but the comma is + present, there is no upper limit; if the second number and the comma + are both omitted, the quantifier specifies an exact number of required matches. Thus [aeiou]{3,} @@ -6880,50 +6947,50 @@ REPETITION \d{8} - matches exactly 8 digits. An opening curly bracket that appears in a - position where a quantifier is not allowed, or one that does not match - the syntax of a quantifier, is taken as a literal character. For exam- + matches exactly 8 digits. An opening curly bracket that appears in a + position where a quantifier is not allowed, or one that does not match + the syntax of a quantifier, is taken as a literal character. For exam- ple, {,6} is not a quantifier, but a literal string of four characters. 
In UTF modes, quantifiers apply to characters rather than to individual - code units. Thus, for example, \x{100}{2} matches two characters, each + code units. Thus, for example, \x{100}{2} matches two characters, each of which is represented by a two-byte sequence in a UTF-8 string. Simi- - larly, \X{3} matches three Unicode extended grapheme clusters, each of - which may be several code units long (and they may be of different + larly, \X{3} matches three Unicode extended grapheme clusters, each of + which may be several code units long (and they may be of different lengths). The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. This may be use- - ful for subpatterns that are referenced as subroutines from elsewhere + ful for subpatterns that are referenced as subroutines from elsewhere in the pattern (but see also the section entitled "Defining subpatterns - for use by reference only" below). Items other than subpatterns that + for use by reference only" below). Items other than subpatterns that have a {0} quantifier are omitted from the compiled pattern. - For convenience, the three most common quantifiers have single-charac- + For convenience, the three most common quantifiers have single-charac- ter abbreviations: * is equivalent to {0,} + is equivalent to {1,} ? is equivalent to {0,1} - It is possible to construct infinite loops by following a subpattern + It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: (a?)* - Earlier versions of Perl and PCRE1 used to give an error at compile + Earlier versions of Perl and PCRE1 used to give an error at compile time for such patterns. 
However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the - subpattern does in fact match no characters, the loop is forcibly bro- + subpattern does in fact match no characters, the loop is forcibly bro- ken. - By default, the quantifiers are "greedy", that is, they match as much - as possible (up to the maximum number of permitted times), without - causing the rest of the pattern to fail. The classic example of where + By default, the quantifiers are "greedy", that is, they match as much + as possible (up to the maximum number of permitted times), without + causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These - appear between /* and */ and within the comment, individual * and / - characters may appear. An attempt to match C comments by applying the + appear between /* and */ and within the comment, individual * and / + characters may appear. An attempt to match C comments by applying the pattern /\*.*\*/ @@ -6932,19 +6999,19 @@ REPETITION /* first comment */ not comment /* second comment */ - fails, because it matches the entire string owing to the greediness of + fails, because it matches the entire string owing to the greediness of the .* item. If a quantifier is followed by a question mark, it ceases to be greedy, - and instead matches the minimum number of times possible, so the pat- + and instead matches the minimum number of times possible, so the pat- tern /\*.*?\*/ - does the right thing with the C comments. The meaning of the various - quantifiers is not otherwise changed, just the preferred number of - matches. Do not confuse this use of question mark with its use as a - quantifier in its own right. Because it has two uses, it can sometimes + does the right thing with the C comments. The meaning of the various + quantifiers is not otherwise changed, just the preferred number of + matches. 
Do not confuse this use of question mark with its use as a + quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in \d??\d @@ -6953,45 +7020,45 @@ REPETITION only way the rest of the pattern matches. If the PCRE2_UNGREEDY option is set (an option that is not available in - Perl), the quantifiers are not greedy by default, but individual ones - can be made greedy by following them with a question mark. In other + Perl), the quantifiers are not greedy by default, but individual ones + can be made greedy by following them with a question mark. In other words, it inverts the default behaviour. - When a parenthesized subpattern is quantified with a minimum repeat - count that is greater than 1 or with a limited maximum, more memory is - required for the compiled pattern, in proportion to the size of the + When a parenthesized subpattern is quantified with a minimum repeat + count that is greater than 1 or with a limited maximum, more memory is + required for the compiled pattern, in proportion to the size of the minimum or maximum. - If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option - (equivalent to Perl's /s) is set, thus allowing the dot to match new- - lines, the pattern is implicitly anchored, because whatever follows - will be tried against every character position in the subject string, - so there is no point in retrying the overall match at any position + If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option + (equivalent to Perl's /s) is set, thus allowing the dot to match new- + lines, the pattern is implicitly anchored, because whatever follows + will be tried against every character position in the subject string, + so there is no point in retrying the overall match at any position after the first. PCRE2 normally treats such a pattern as though it were preceded by \A. 
- In cases where it is known that the subject string contains no new- - lines, it is worth setting PCRE2_DOTALL in order to obtain this opti- + In cases where it is known that the subject string contains no new- + lines, it is worth setting PCRE2_DOTALL in order to obtain this opti- mization, or alternatively, using ^ to indicate anchoring explicitly. - However, there are some cases where the optimization cannot be used. + However, there are some cases where the optimization cannot be used. When .* is inside capturing parentheses that are the subject of a back reference elsewhere in the pattern, a match at the start may fail where a later one succeeds. Consider, for example: (.*)abc\1 - If the subject is "xyz123abc123" the match point is the fourth charac- + If the subject is "xyz123abc123" the match point is the fourth charac- ter. For this reason, such a pattern is not implicitly anchored. - Another case where implicit anchoring is not applied is when the lead- - ing .* is inside an atomic group. Once again, a match at the start may + Another case where implicit anchoring is not applied is when the lead- + ing .* is inside an atomic group. Once again, a match at the start may fail where a later one succeeds. Consider this pattern: (?>.*?a)b - It matches "ab" in the subject "aab". The use of the backtracking con- - trol verbs (*PRUNE) and (*SKIP) also disable this optimization, and + It matches "ab" in the subject "aab". The use of the backtracking con- + trol verbs (*PRUNE) and (*SKIP) also disable this optimization, and there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. When a capturing subpattern is repeated, the value captured is the sub- @@ -7000,8 +7067,8 @@ REPETITION (tweedle[dume]{3}\s*)+ has matched "tweedledum tweedledee" the value of the captured substring - is "tweedledee". However, if there are nested capturing subpatterns, - the corresponding captured values may have been set in previous itera- + is "tweedledee". 
However, if there are nested capturing subpatterns, + the corresponding captured values may have been set in previous itera- tions. For example, after (a|(b))+ @@ -7011,53 +7078,53 @@ REPETITION ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS - With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") - repetition, failure of what follows normally causes the repeated item - to be re-evaluated to see if a different number of repeats allows the - rest of the pattern to match. Sometimes it is useful to prevent this, - either to change the nature of the match, or to cause it fail earlier - than it otherwise might, when the author of the pattern knows there is + With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") + repetition, failure of what follows normally causes the repeated item + to be re-evaluated to see if a different number of repeats allows the + rest of the pattern to match. Sometimes it is useful to prevent this, + either to change the nature of the match, or to cause it fail earlier + than it otherwise might, when the author of the pattern knows there is no point in carrying on. - Consider, for example, the pattern \d+foo when applied to the subject + Consider, for example, the pattern \d+foo when applied to the subject line 123456bar After matching all 6 digits and then failing to match "foo", the normal - action of the matcher is to try again with only 5 digits matching the - \d+ item, and then with 4, and so on, before ultimately failing. - "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides - the means for specifying that once a subpattern has matched, it is not + action of the matcher is to try again with only 5 digits matching the + \d+ item, and then with 4, and so on, before ultimately failing. + "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides + the means for specifying that once a subpattern has matched, it is not to be re-evaluated in this way. 
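The retry behaviour described above can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the function name `attempts_at_start` is hypothetical, the pattern \d+foo is hard-coded, and the model ignores the optimizations a real PCRE2 matcher applies, so the counts show the idea rather than PCRE2's actual work.

```python
def attempts_at_start(subject: str, start: int, atomic: bool) -> int:
    # Toy model (not PCRE2): count how many digit-run lengths are tried
    # for the hard-coded pattern \d+foo at one starting position.
    end = start
    while end < len(subject) and subject[end].isdigit():
        end += 1
    longest = end - start
    if longest == 0:
        return 0  # \d+ cannot match at all here

    tried = 0
    # Greedy: try the longest run of digits first, then give digits back.
    for length in range(longest, 0, -1):
        tried += 1
        if subject.startswith("foo", start + length):
            break  # the rest of the pattern matched
        if atomic:
            break  # an atomic group is never re-evaluated
    return tried


if __name__ == "__main__":
    print(attempts_at_start("123456bar", 0, atomic=False))
    print(attempts_at_start("123456bar", 0, atomic=True))
```

In this model the ordinary greedy repeat tries every run length from 6 down to 1 before failing, while the atomic variant stops after the first failure.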
- If we use atomic grouping for the previous example, the matcher gives - up immediately on failing to match "foo" the first time. The notation + If we use atomic grouping for the previous example, the matcher gives + up immediately on failing to match "foo" the first time. The notation is a kind of special parenthesis, starting with (?> as in this example: (?>\d+)foo - This kind of parenthesis "locks up" the part of the pattern it con- - tains once it has matched, and a failure further into the pattern is - prevented from backtracking into it. Backtracking past it to previous + This kind of parenthesis "locks up" the part of the pattern it con- + tains once it has matched, and a failure further into the pattern is + prevented from backtracking into it. Backtracking past it to previous items, however, works as normal. - An alternative description is that a subpattern of this type matches - exactly the string of characters that an identical standalone pattern + An alternative description is that a subpattern of this type matches + exactly the string of characters that an identical standalone pattern would match, if anchored at the current point in the subject string. Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as the above example can be thought of as a maximizing repeat that - must swallow everything it can. So, while both \d+ and \d+? are pre- - pared to adjust the number of digits they match in order to make the + must swallow everything it can. So, while both \d+ and \d+? are pre- + pared to adjust the number of digits they match in order to make the rest of the pattern match, (?>\d+) can only match an entire sequence of digits. - Atomic groups in general can of course contain arbitrarily complicated - subpatterns, and can be nested. However, when the subpattern for an + Atomic groups in general can of course contain arbitrarily complicated + subpatterns, and can be nested. 
However, when the subpattern for an atomic group is just a single repeated item, as in the example above, a - simpler notation, called a "possessive quantifier" can be used. This - consists of an additional + character following a quantifier. Using + simpler notation, called a "possessive quantifier" can be used. This + consists of an additional + character following a quantifier. Using this notation, the previous example can be rewritten as \d++foo @@ -7067,46 +7134,46 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS (abc|xyz){2,3}+ - Possessive quantifiers are always greedy; the setting of the - PCRE2_UNGREEDY option is ignored. They are a convenient notation for - the simpler forms of atomic group. However, there is no difference in + Possessive quantifiers are always greedy; the setting of the + PCRE2_UNGREEDY option is ignored. They are a convenient notation for + the simpler forms of atomic group. However, there is no difference in the meaning of a possessive quantifier and the equivalent atomic group, - though there may be a performance difference; possessive quantifiers + though there may be a performance difference; possessive quantifiers should be slightly faster. - The possessive quantifier syntax is an extension to the Perl 5.8 syn- - tax. Jeffrey Friedl originated the idea (and the name) in the first + The possessive quantifier syntax is an extension to the Perl 5.8 syn- + tax. Jeffrey Friedl originated the idea (and the name) in the first edition of his book. Mike McCloskey liked it, so implemented it when he built Sun's Java package, and PCRE1 copied it from there. It ultimately found its way into Perl at release 5.10. - PCRE2 has an optimization that automatically "possessifies" certain - simple pattern constructs. For example, the sequence A+B is treated as - A++B because there is no point in backtracking into a sequence of A's + PCRE2 has an optimization that automatically "possessifies" certain + simple pattern constructs. 
For example, the sequence A+B is treated as + A++B because there is no point in backtracking into a sequence of A's when B must follow. This feature can be disabled by the PCRE2_NO_AUTO- POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS). - When a pattern contains an unlimited repeat inside a subpattern that - can itself be repeated an unlimited number of times, the use of an - atomic group is the only way to avoid some failing matches taking a + When a pattern contains an unlimited repeat inside a subpattern that + can itself be repeated an unlimited number of times, the use of an + atomic group is the only way to avoid some failing matches taking a very long time indeed. The pattern (\D+|<\d+>)*[!?] - matches an unlimited number of substrings that either consist of non- - digits, or digits enclosed in <>, followed by either ! or ?. When it + matches an unlimited number of substrings that either consist of non- + digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - it takes a long time before reporting failure. This is because the - string can be divided between the internal \D+ repeat and the external - * repeat in a large number of ways, and all have to be tried. (The - example uses [!?] rather than a single character at the end, because - both PCRE2 and Perl have an optimization that allows for fast failure - when a single character is used. They remember the last single charac- - ter that is required for a match, and fail early if it is not present - in the string.) If the pattern is changed so that it uses an atomic + it takes a long time before reporting failure. This is because the + string can be divided between the internal \D+ repeat and the external + * repeat in a large number of ways, and all have to be tried. (The + example uses [!?] 
rather than a single character at the end, because + both PCRE2 and Perl have an optimization that allows for fast failure + when a single character is used. They remember the last single charac- + ter that is required for a match, and fail early if it is not present + in the string.) If the pattern is changed so that it uses an atomic group, like this: ((?>\D+)|<\d+>)*[!?] @@ -7118,71 +7185,75 @@ BACK REFERENCES Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing sub- - pattern earlier (that is, to its left) in the pattern, provided there + pattern earlier (that is, to its left) in the pattern, provided there have been that many previous capturing left parentheses. - However, if the decimal number following the backslash is less than 8, - it is always taken as a back reference, and causes an error only if - there are not that many capturing left parentheses in the entire pat- - tern. In other words, the parentheses that are referenced need not be - to the left of the reference for numbers less than 8. A "forward back - reference" of this type can make sense when a repetition is involved - and the subpattern to the right has participated in an earlier itera- + However, if the decimal number following the backslash is less than 8, + it is always taken as a back reference, and causes an error only if + there are not that many capturing left parentheses in the entire pat- + tern. In other words, the parentheses that are referenced need not be + to the left of the reference for numbers less than 8. A "forward back + reference" of this type can make sense when a repetition is involved + and the subpattern to the right has participated in an earlier itera- tion. - It is not possible to have a numerical "forward back reference" to a - subpattern whose number is 8 or more using this syntax because a - sequence such as \50 is interpreted as a character defined in octal. 
+ It is not possible to have a numerical "forward back reference" to a + subpattern whose number is 8 or more using this syntax because a + sequence such as \50 is interpreted as a character defined in octal. See the subsection entitled "Non-printing characters" above for further - details of the handling of digits following a backslash. There is no - such problem when named parentheses are used. A back reference to any + details of the handling of digits following a backslash. There is no + such problem when named parentheses are used. A back reference to any subpattern is possible using named parentheses (see below). - Another way of avoiding the ambiguity inherent in the use of digits - following a backslash is to use the \g escape sequence. This escape - must be followed by an unsigned number or a negative number, optionally - enclosed in braces. These examples are all identical: + Another way of avoiding the ambiguity inherent in the use of digits + following a backslash is to use the \g escape sequence. This escape + must be followed by a signed or unsigned number, optionally enclosed in + braces. These examples are all identical: (ring), \1 (ring), \g1 (ring), \g{1} - An unsigned number specifies an absolute reference without the ambigu- + An unsigned number specifies an absolute reference without the ambigu- ity that is present in the older syntax. It is also useful when literal - digits follow the reference. A negative number is a relative reference. + digits follow the reference. A signed number is a relative reference. Consider this example: (abc(def)ghi)\g{-1} The sequence \g{-1} is a reference to the most recently started captur- ing subpattern before \g, that is, is it equivalent to \2 in this exam- - ple. Similarly, \g{-2} would be equivalent to \1. The use of relative - references can be helpful in long patterns, and also in patterns that - are created by joining together fragments that contain references + ple. 
Similarly, \g{-2} would be equivalent to \1. The use of relative - references can be helpful in long patterns, and also in patterns that - are created by joining together fragments that contain references + ple. Similarly, \g{-2} would be equivalent to \1. The use of relative + references can be helpful in long patterns, and also in patterns that + are created by joining together fragments that contain references within themselves. - A back reference matches whatever actually matched the capturing sub- - pattern in the current subject string, rather than anything matching + The sequence \g{+1} is a reference to the next capturing subpattern. + This kind of forward reference can be useful in patterns that repeat. + Perl does not support the use of + in this way. + + A back reference matches whatever actually matched the capturing sub- + pattern in the current subject string, rather than anything matching the subpattern itself (see "Subpatterns as subroutines" below for a way of doing that). So the pattern (sens|respons)e and \1ibility - matches "sense and sensibility" and "response and responsibility", but - not "sense and responsibility". If caseful matching is in force at the - time of the back reference, the case of letters is relevant. For exam- + matches "sense and sensibility" and "response and responsibility", but + not "sense and responsibility". If caseful matching is in force at the + time of the back reference, the case of letters is relevant. For exam- ple, ((?i)rah)\s+\1 - matches "rah rah" and "RAH RAH", but not "RAH rah", even though the + matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly. - There are several different ways of writing back references to named - subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or - \k'name' are supported, as is the Python syntax (?P=name). + There are several different ways of writing back references to named + subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or + \k'name' are supported, as is the Python syntax (?P=name).
Perl 5.10's unified back reference syntax, in which \g can be used for both numeric - and named references, is also supported. We could rewrite the above + and named references, is also supported. We could rewrite the above example in any of the following ways: (?<p1>(?i)rah)\s+\k<p1> @@ -7190,68 +7261,75 @@ BACK REFERENCES (?P<p1>(?i)rah)\s+(?P=p1) (?<p1>(?i)rah)\s+\g{p1} - A subpattern that is referenced by name may appear in the pattern + A subpattern that is referenced by name may appear in the pattern before or after the reference. - There may be more than one back reference to the same subpattern. If a - subpattern has not actually been used in a particular match, any back + There may be more than one back reference to the same subpattern. If a + subpattern has not actually been used in a particular match, any back references to it always fail by default. For example, the pattern (a|(bc))\2 - always fails if it starts to match "a" rather than "bc". However, if - the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back + always fails if it starts to match "a" rather than "bc". However, if + the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an unset value matches an empty string. - Because there may be many capturing parentheses in a pattern, all dig- - its following a backslash are taken as part of a potential back refer- - ence number. If the pattern continues with a digit character, some - delimiter must be used to terminate the back reference. If the - PCRE2_EXTENDED option is set, this can be white space. Otherwise, the + Because there may be many capturing parentheses in a pattern, all dig- + its following a backslash are taken as part of a potential back refer- + ence number. If the pattern continues with a digit character, some + delimiter must be used to terminate the back reference. If the + PCRE2_EXTENDED option is set, this can be white space. Otherwise, the \g{ syntax or an empty comment (see "Comments" below) can be used.
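The back reference behaviour described above can be checked with a short Python sketch. It rests on an assumption: Python's built-in `re` module stands in for PCRE2 here because the two engines agree on these particular constructs. It is not PCRE2 itself, and the helper name `matches` is hypothetical.

```python
import re

# Assumption: Python's re module is used as a stand-in for PCRE2. Both
# engines treat these numbered back references the same way by default.

# \1 must repeat the text that group 1 actually matched.
SENSE = r"(sens|respons)e and \1ibility"

# \2 refers to a group that is unset when the "a" alternative is taken.
UNSET = r"(a|(bc))\2"


def matches(pattern: str, subject: str) -> bool:
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


if __name__ == "__main__":
    for subject in ("sense and sensibility",
                    "response and responsibility",
                    "sense and responsibility"):
        print(subject, matches(SENSE, subject))
    for subject in ("bcbc", "a"):
        print(subject, matches(UNSET, subject))
```

Python's `re` has no counterpart to PCRE2_MATCH_UNSET_BACKREF, so the sketch shows only the default case, in which a back reference to an unset group fails.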
Recursive back references - A back reference that occurs inside the parentheses to which it refers - fails when the subpattern is first used, so, for example, (a\1) never - matches. However, such references can be useful inside repeated sub- + A back reference that occurs inside the parentheses to which it refers + fails when the subpattern is first used, so, for example, (a\1) never + matches. However, such references can be useful inside repeated sub- patterns. For example, the pattern (a|b\1)+ matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- - ation of the subpattern, the back reference matches the character - string corresponding to the previous iteration. In order for this to - work, the pattern must be such that the first iteration does not need - to match the back reference. This can be done using alternation, as in + ation of the subpattern, the back reference matches the character + string corresponding to the previous iteration. In order for this to + work, the pattern must be such that the first iteration does not need + to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero. - Back references of this type cause the group that they reference to be - treated as an atomic group. Once the whole group has been matched, a - subsequent matching failure cannot cause backtracking into the middle + Back references of this type cause the group that they reference to be + treated as an atomic group. Once the whole group has been matched, a + subsequent matching failure cannot cause backtracking into the middle of the group. ASSERTIONS - An assertion is a test on the characters following or preceding the + An assertion is a test on the characters following or preceding the current matching point that does not consume any characters. 
The simple - assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described + assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above. - More complicated assertions are coded as subpatterns. There are two - kinds: those that look ahead of the current position in the subject - string, and those that look behind it. An assertion subpattern is - matched in the normal way, except that it does not cause the current + More complicated assertions are coded as subpatterns. There are two + kinds: those that look ahead of the current position in the subject + string, and those that look behind it. An assertion subpattern is + matched in the normal way, except that it does not cause the current matching position to be changed. - Assertion subpatterns are not capturing subpatterns. If such an asser- - tion contains capturing subpatterns within it, these are counted for - the purposes of numbering the capturing subpatterns in the whole pat- - tern. However, substring capturing is carried out only for positive + Assertion subpatterns are not capturing subpatterns. If such an asser- + tion contains capturing subpatterns within it, these are counted for + the purposes of numbering the capturing subpatterns in the whole pat- + tern. However, substring capturing is carried out only for positive assertions. (Perl sometimes, but not always, does do capturing in nega- tive assertions.) + WARNING: If a positive assertion containing one or more capturing sub- + patterns succeeds, but failure to match later in the pattern causes + backtracking over this assertion, the captures within the assertion are + reset only if no higher numbered captures are already set. This is, + unfortunately, a fundamental limitation of the current implementation; + it may get removed in a future reworking. 
+ For compatibility with Perl, most assertion subpatterns may be repeated; though it makes no sense to assert the same thing several times, the side effect of capturing parentheses may occasionally be @@ -7340,15 +7418,27 @@ ASSERTIONS then try to match. If there are insufficient characters before the cur- rent position, the assertion fails. - In a UTF mode, PCRE2 does not allow the \C escape (which matches a sin- - gle code unit even in a UTF mode) to appear in lookbehind assertions, - because it makes it impossible to calculate the length of the lookbe- - hind. The \X and \R escapes, which can match different numbers of code - units, are also not permitted. + In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which + matches a single code unit even in a UTF mode) to appear in lookbehind + assertions, because it makes it impossible to calculate the length of + the lookbehind. The \X and \R escapes, which can match different num- + bers of code units, are never permitted in lookbehinds. "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long as the subpattern matches a fixed-length string. - Recursion, however, is not supported. + However, recursion, that is, a "subroutine" call into a group that is + already active, is not supported. + + Perl does not support back references in lookbehinds. PCRE2 does sup- + port them, but only if certain conditions are met. The + PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use + of (?| in the pattern (it creates duplicate subpattern numbers), and if + the back reference is by name, the name must be unique. Of course, the + referenced subpattern must itself be of fixed length. 
The following + pattern matches words containing at least two characters that begin and + end with the same character: + + \b(\w)\w++(?<=\1) Possessive quantifiers can be used in conjunction with lookbehind assertions to specify efficient matching of fixed-length strings at the @@ -7482,7 +7572,9 @@ CONDITIONAL SUBPATTERNS Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used subpattern by name. For compatibility with earlier versions of PCRE1, which had this facility before Perl, the syntax (?(name)...) is - also recognized. + also recognized. Note, however, that undelimited names consisting of + the letter R followed by digits are ambiguous (see the following sec- + tion). Rewriting the above example to use a named subpattern gives this: @@ -7494,120 +7586,139 @@ CONDITIONAL SUBPATTERNS Checking for pattern recursion - If the condition is the string (R), and there is no subpattern with the - name R, the condition is true if a recursive call to the whole pattern - or any subpattern has been made. If digits or a name preceded by amper- - sand follow the letter R, for example: + "Recursion" in this sense refers to any subroutine-like call from one + part of the pattern to another, whether or not it is actually recur- + sive. See the sections entitled "Recursive patterns" and "Subpatterns + as subroutines" below for details of recursion and subpattern calls. - (?(R3)...) or (?(R&name)...) + If a condition is the string (R), and there is no subpattern with the + name R, the condition is true if matching is currently in a recursion + or subroutine call to the whole pattern or any subpattern. If digits + follow the letter R, and there is no subpattern with that name, the + condition is true if the most recent call is into a subpattern with the + given number, which must exist somewhere in the overall pattern.
This + is a contrived example that is equivalent to a+b: + + ((?(R1)a+|(?1)b)) + + However, in both cases, if there is a subpattern with a matching name, + the condition tests for its being set, as described in the section + above, instead of testing for recursion. For example, creating a group + with the name R1 by adding (?<R1>) to the above pattern completely + changes its meaning. + + If a name preceded by ampersand follows the letter R, for example: + + (?(R&name)...) the condition is true if the most recent recursion is into a subpattern - whose number or name is given. This condition does not check the entire - recursion stack. If the name used in a condition of this kind is a + of that name (which must exist within the pattern). + + This condition does not check the entire recursion stack. It tests only + the current level. If the name used in a condition of this kind is a duplicate, the test is applied to all subpatterns of the same name, and is true if any one of them is the most recent recursion. - At "top level", all these recursion test conditions are false. The - syntax for recursive patterns is described below. + At "top level", all these recursion test conditions are false. Defining subpatterns for use by reference only - If the condition is the string (DEFINE), and there is no subpattern - with the name DEFINE, the condition is always false. In this case, - there may be only one alternative in the subpattern. It is always - skipped if control reaches this point in the pattern; the idea of - DEFINE is that it can be used to define subroutines that can be refer- - enced from elsewhere. (The use of subroutines is described below.) For - example, a pattern to match an IPv4 address such as "192.168.23.245" - could be written like this (ignore white space and line breaks): + If the condition is the string (DEFINE), the condition is always false, + even if there is a group with the name DEFINE.
In this case, there may + be only one alternative in the subpattern. It is always skipped if con- + trol reaches this point in the pattern; the idea of DEFINE is that it + can be used to define subroutines that can be referenced from else- + where. (The use of subroutines is described below.) For example, a pat- + tern to match an IPv4 address such as "192.168.23.245" could be written + like this (ignore white space and line breaks): (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b - The first part of the pattern is a DEFINE group inside which another - group named "byte" is defined. This matches an individual component of - an IPv4 address (a number less than 256). When matching takes place, - this part of the pattern is skipped because DEFINE acts like a false - condition. The rest of the pattern uses references to the named group - to match the four dot-separated components of an IPv4 address, insist- + The first part of the pattern is a DEFINE group inside which another + group named "byte" is defined. This matches an individual component of + an IPv4 address (a number less than 256). When matching takes place, + this part of the pattern is skipped because DEFINE acts like a false + condition. The rest of the pattern uses references to the named group + to match the four dot-separated components of an IPv4 address, insist- ing on a word boundary at each end. Checking the PCRE2 version - Programs that link with a PCRE2 library can check the version by call- - ing pcre2_config() with appropriate arguments. Users of applications - that do not have access to the underlying code cannot do this. + Programs that link with a PCRE2 library can check the version by call- + ing pcre2_config() with appropriate arguments. Users of applications + that do not have access to the underlying code cannot do this.
A spe- + cial "condition" called VERSION exists to allow such users to discover which version of PCRE2 they are dealing with by using this condition to - match a string such as "yesno". VERSION must be followed either by "=" + match a string such as "yesno". VERSION must be followed either by "=" or ">=" and a version number. For example: (?(VERSION>=10.4)yes|no) - This pattern matches "yes" if the PCRE2 version is greater or equal to - 10.4, or "no" otherwise. The fractional part of the version number may + This pattern matches "yes" if the PCRE2 version is greater or equal to + 10.4, or "no" otherwise. The fractional part of the version number may not contain more than two digits. Assertion conditions - If the condition is not in any of the above formats, it must be an - assertion. This may be a positive or negative lookahead or lookbehind - assertion. Consider this pattern, again containing non-significant + If the condition is not in any of the above formats, it must be an + assertion. This may be a positive or negative lookahead or lookbehind + assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line: (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) - The condition is a positive lookahead assertion that matches an - optional sequence of non-letters followed by a letter. In other words, - it tests for the presence of at least one letter in the subject. If a - letter is found, the subject is matched against the first alternative; - otherwise it is matched against the second. This pattern matches - strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are + The condition is a positive lookahead assertion that matches an + optional sequence of non-letters followed by a letter. In other words, + it tests for the presence of at least one letter in the subject. 
If a + letter is found, the subject is matched against the first alternative; + otherwise it is matched against the second. This pattern matches + strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. COMMENTS There are two ways of including comments in patterns that are processed - by PCRE2. In both cases, the start of the comment must not be in a - character class, nor in the middle of any other sequence of related - characters such as (?: or a subpattern name or number. The characters + by PCRE2. In both cases, the start of the comment must not be in a + character class, nor in the middle of any other sequence of related + characters such as (?: or a subpattern name or number. The characters that make up a comment play no part in the pattern matching. - The sequence (?# marks the start of a comment that continues up to the - next closing parenthesis. Nested parentheses are not permitted. If the - PCRE2_EXTENDED option is set, an unescaped # character also introduces - a comment, which in this case continues to immediately after the next - newline character or character sequence in the pattern. Which charac- - ters are interpreted as newlines is controlled by an option passed to - the compiling function or by a special sequence at the start of the - pattern, as described in the section entitled "Newline conventions" - above. Note that the end of this type of comment is a literal newline - sequence in the pattern; escape sequences that happen to represent a - newline do not count. For example, consider this pattern when - PCRE2_EXTENDED is set, and the default newline convention (a single + The sequence (?# marks the start of a comment that continues up to the + next closing parenthesis. Nested parentheses are not permitted. 
If the + PCRE2_EXTENDED option is set, an unescaped # character also introduces + a comment, which in this case continues to immediately after the next + newline character or character sequence in the pattern. Which charac- + ters are interpreted as newlines is controlled by an option passed to + the compiling function or by a special sequence at the start of the + pattern, as described in the section entitled "Newline conventions" + above. Note that the end of this type of comment is a literal newline + sequence in the pattern; escape sequences that happen to represent a + newline do not count. For example, consider this pattern when + PCRE2_EXTENDED is set, and the default newline convention (a single linefeed character) is in force: abc #comment \n still comment - On encountering the # character, pcre2_compile() skips along, looking - for a newline in the pattern. The sequence \n is still literal at this - stage, so it does not terminate the comment. Only an actual character + On encountering the # character, pcre2_compile() skips along, looking + for a newline in the pattern. The sequence \n is still literal at this + stage, so it does not terminate the comment. Only an actual character with the code value 0x0a (the default newline) does so. RECURSIVE PATTERNS - Consider the problem of matching a string in parentheses, allowing for - unlimited nested parentheses. Without the use of recursion, the best - that can be done is to use a pattern that matches up to some fixed - depth of nesting. It is not possible to handle an arbitrary nesting + Consider the problem of matching a string in parentheses, allowing for + unlimited nested parentheses. Without the use of recursion, the best + that can be done is to use a pattern that matches up to some fixed + depth of nesting. It is not possible to handle an arbitrary nesting depth. For some time, Perl has provided a facility that allows regular expres- - sions to recurse (amongst other things). 
It does this by interpolating - Perl code in the expression at run time, and the code can refer to the + sions to recurse (amongst other things). It does this by interpolating + Perl code in the expression at run time, and the code can refer to the expression itself. A Perl pattern using code interpolation to solve the parentheses problem can be created like this: @@ -7617,214 +7728,214 @@ RECURSIVE PATTERNS refers recursively to the pattern in which it appears. Obviously, PCRE2 cannot support the interpolation of Perl code. - Instead, it supports special syntax for recursion of the entire pat- + Instead, it supports special syntax for recursion of the entire pat- tern, and also for individual subpattern recursion. After its introduc- - tion in PCRE1 and Python, this kind of recursion was subsequently + tion in PCRE1 and Python, this kind of recursion was subsequently introduced into Perl at release 5.10. - A special item that consists of (? followed by a number greater than - zero and a closing parenthesis is a recursive subroutine call of the - subpattern of the given number, provided that it occurs inside that - subpattern. (If not, it is a non-recursive subroutine call, which is - described in the next section.) The special item (?R) or (?0) is a + A special item that consists of (? followed by a number greater than + zero and a closing parenthesis is a recursive subroutine call of the + subpattern of the given number, provided that it occurs inside that + subpattern. (If not, it is a non-recursive subroutine call, which is + described in the next section.) The special item (?R) or (?0) is a recursive call of the entire regular expression. - This PCRE2 pattern solves the nested parentheses problem (assume the + This PCRE2 pattern solves the nested parentheses problem (assume the PCRE2_EXTENDED option is set so that white space is ignored): \( ( [^()]++ | (?R) )* \) - First it matches an opening parenthesis. 
Then it matches any number of - substrings which can either be a sequence of non-parentheses, or a - recursive match of the pattern itself (that is, a correctly parenthe- + First it matches an opening parenthesis. Then it matches any number of + substrings which can either be a sequence of non-parentheses, or a + recursive match of the pattern itself (that is, a correctly parenthe- sized substring). Finally there is a closing parenthesis. Note the use of a possessive quantifier to avoid backtracking into sequences of non- parentheses. - If this were part of a larger pattern, you would not want to recurse + If this were part of a larger pattern, you would not want to recurse the entire pattern, so instead you could use this: ( \( ( [^()]++ | (?1) )* \) ) - We have put the pattern into parentheses, and caused the recursion to + We have put the pattern into parentheses, and caused the recursion to refer to them instead of the whole pattern. - In a larger pattern, keeping track of parenthesis numbers can be - tricky. This is made easier by the use of relative references. Instead + In a larger pattern, keeping track of parenthesis numbers can be + tricky. This is made easier by the use of relative references. Instead of (?1) in the pattern above you can write (?-2) to refer to the second - most recently opened parentheses preceding the recursion. In other - words, a negative number counts capturing parentheses leftwards from + most recently opened parentheses preceding the recursion. In other + words, a negative number counts capturing parentheses leftwards from the point at which it is encountered. Be aware however, that if duplicate subpattern numbers are in use, rel- - ative references refer to the earliest subpattern with the appropriate + ative references refer to the earliest subpattern with the appropriate number. Consider, for example: (?|(a)|(b)) (c) (?-2) - The first two capturing groups (a) and (b) are both numbered 1, and - group (c) is number 2. 
When the reference (?-2) is encountered, the + The first two capturing groups (a) and (b) are both numbered 1, and + group (c) is number 2. When the reference (?-2) is encountered, the second most recently opened parentheses has the number 1, but it is the - first such group (the (a) group) to which the recursion refers. This - would be the same if an absolute reference (?1) was used. In other - words, relative references are just a shorthand for computing a group + first such group (the (a) group) to which the recursion refers. This + would be the same if an absolute reference (?1) was used. In other + words, relative references are just a shorthand for computing a group number. - It is also possible to refer to subsequently opened parentheses, by - writing references such as (?+2). However, these cannot be recursive - because the reference is not inside the parentheses that are refer- - enced. They are always non-recursive subroutine calls, as described in + It is also possible to refer to subsequently opened parentheses, by + writing references such as (?+2). However, these cannot be recursive + because the reference is not inside the parentheses that are refer- + enced. They are always non-recursive subroutine calls, as described in the next section. - An alternative approach is to use named parentheses. The Perl syntax - for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup- + An alternative approach is to use named parentheses. The Perl syntax + for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup- ported. We could rewrite the above example as follows: (?<pn> \( ( [^()]++ | (?&pn) )* \) ) - If there is more than one subpattern with the same name, the earliest + If there is more than one subpattern with the same name, the earliest one is used.
The example pattern that we have been looking at contains nested unlim- - ited repeats, and so the use of a possessive quantifier for matching - strings of non-parentheses is important when applying the pattern to + ited repeats, and so the use of a possessive quantifier for matching + strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when this pattern is applied to (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() - it yields "no match" quickly. However, if a possessive quantifier is - not used, the match runs for a very long time indeed because there are - so many different ways the + and * repeats can carve up the subject, + it yields "no match" quickly. However, if a possessive quantifier is + not used, the match runs for a very long time indeed because there are + so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported. - At the end of a match, the values of capturing parentheses are those - from the outermost level. If you want to obtain intermediate values, a + At the end of a match, the values of capturing parentheses are those + from the outermost level. If you want to obtain intermediate values, a callout function can be used (see below and the pcre2callout documenta- tion). If the pattern above is matched against (ab(cd)ef) - the value for the inner capturing parentheses (numbered 2) is "ef", - which is the last value taken on at the top level. If a capturing sub- - pattern is not matched at the top level, its final captured value is - unset, even if it was (temporarily) set at a deeper level during the + the value for the inner capturing parentheses (numbered 2) is "ef", + which is the last value taken on at the top level. If a capturing sub- + pattern is not matched at the top level, its final captured value is + unset, even if it was (temporarily) set at a deeper level during the matching process. 
If there are more than 15 capturing parentheses in a pattern, PCRE2 has - to obtain extra memory from the heap to store data during a recursion. - If no memory can be obtained, the match fails with the + to obtain extra memory from the heap to store data during a recursion. + If no memory can be obtained, the match fails with the PCRE2_ERROR_NOMEMORY error. - Do not confuse the (?R) item with the condition (R), which tests for - recursion. Consider this pattern, which matches text in angle brack- - ets, allowing for arbitrary nesting. Only digits are allowed in nested - brackets (that is, when recursing), whereas any characters are permit- + Do not confuse the (?R) item with the condition (R), which tests for + recursion. Consider this pattern, which matches text in angle brack- + ets, allowing for arbitrary nesting. Only digits are allowed in nested + brackets (that is, when recursing), whereas any characters are permit- ted at the outer level. < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > - In this pattern, (?(R) is the start of a conditional subpattern, with - two different alternatives for the recursive and non-recursive cases. + In this pattern, (?(R) is the start of a conditional subpattern, with + two different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call. Differences in recursion processing between PCRE2 and Perl - Recursion processing in PCRE2 differs from Perl in two important ways. + Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried - alternatives and there is a subsequent matching failure. 
This can be - illustrated by the following pattern, which purports to match a palin- - dromic string that contains an odd number of characters (for example, + alternatives and there is a subsequent matching failure. This can be + illustrated by the following pattern, which purports to match a palin- + dromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"): ^(.|(.)(?1)\2)$ The idea is that it either matches a single character, or two identical - characters surrounding a sub-palindrome. In Perl, this pattern works; - in PCRE2 it does not if the pattern is longer than three characters. + characters surrounding a sub-palindrome. In Perl, this pattern works; + in PCRE2 it does not if the pattern is longer than three characters. Consider the subject string "abcba": - At the top level, the first character is matched, but as it is not at + At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alterna- tive is taken and the recursion kicks in. The recursive call to subpat- - tern 1 successfully matches the next character ("b"). (Note that the + tern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion). - Back at the top level, the next character ("c") is compared with what - subpattern 2 matched, which was "a". This fails. Because the recursion - is treated as an atomic group, there are now no backtracking points, - and so the entire match fails. (Perl is able, at this point, to re- - enter the recursion and try the second alternative.) However, if the + Back at the top level, the next character ("c") is compared with what + subpattern 2 matched, which was "a". This fails. Because the recursion + is treated as an atomic group, there are now no backtracking points, + and so the entire match fails. 
(Perl is able, at this point, to re- + enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different: ^((.)(?1)\2|.)$ - This time, the recursing alternative is tried first, and continues to - recurse until it runs out of characters, at which point the recursion - fails. But this time we do have another alternative to try at the - higher level. That is the big difference: in the previous case the - remaining alternative is at a deeper recursion level, which PCRE2 can- + This time, the recursing alternative is tried first, and continues to + recurse until it runs out of characters, at which point the recursion + fails. But this time we do have another alternative to try at the + higher level. That is the big difference: in the previous case the + remaining alternative is at a deeper recursion level, which PCRE2 can- not use. - To change the pattern so that it matches all palindromic strings, not - just those with an odd number of characters, it is tempting to change + To change the pattern so that it matches all palindromic strings, not + just those with an odd number of characters, it is tempting to change the pattern to this: ^((.)(?1)\2|.?)$ - Again, this works in Perl, but not in PCRE2, and for the same reason. - When a deeper recursion has matched a single character, it cannot be - entered again in order to match an empty string. The solution is to - separate the two cases, and write out the odd and even cases as alter- + Again, this works in Perl, but not in PCRE2, and for the same reason. + When a deeper recursion has matched a single character, it cannot be + entered again in order to match an empty string. 
The solution is to + separate the two cases, and write out the odd and even cases as alter- natives at the higher level: ^(?:((.)(?1)\2|)|((.)(?3)\4|.)) - If you want to match typical palindromic phrases, the pattern has to + If you want to match typical palindromic phrases, the pattern has to ignore all non-word characters, which can be done like this: ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ - If run with the PCRE2_CASELESS option, this pattern matches phrases - such as "A man, a plan, a canal: Panama!" and it works in both PCRE2 - and Perl. Note the use of the possessive quantifier *+ to avoid back- - tracking into sequences of non-word characters. Without this, PCRE2 + If run with the PCRE2_CASELESS option, this pattern matches phrases + such as "A man, a plan, a canal: Panama!" and it works in both PCRE2 + and Perl. Note the use of the possessive quantifier *+ to avoid back- + tracking into sequences of non-word characters. Without this, PCRE2 takes a great deal longer (ten times or more) to match typical phrases, and Perl takes so long that you think it has gone into a loop. - WARNING: The palindrome-matching patterns above work only if the sub- - ject string does not start with a palindrome that is shorter than the - entire string. For example, although "abcba" is correctly matched, if - the subject is "ababa", PCRE2 finds the palindrome "aba" at the start, - then fails at top level because the end of the string does not follow. - Once again, it cannot jump back into the recursion to try other alter- + WARNING: The palindrome-matching patterns above work only if the sub- + ject string does not start with a palindrome that is shorter than the + entire string. For example, although "abcba" is correctly matched, if + the subject is "ababa", PCRE2 finds the palindrome "aba" at the start, + then fails at top level because the end of the string does not follow. 
+ Once again, it cannot jump back into the recursion to try other alter- natives, so the entire match fails. - The second way in which PCRE2 and Perl differ in their recursion pro- - cessing is in the handling of captured values. In Perl, when a subpat- - tern is called recursively or as a subpattern (see the next section), - it has no access to any values that were captured outside the recur- - sion, whereas in PCRE2 these values can be referenced. Consider this + The second way in which PCRE2 and Perl differ in their recursion pro- + cessing is in the handling of captured values. In Perl, when a subpat- + tern is called recursively or as a subpattern (see the next section), + it has no access to any values that were captured outside the recur- + sion, whereas in PCRE2 these values can be referenced. Consider this pattern: ^(.)(\1|a(?2)) - In PCRE2, this pattern matches "bab". The first capturing parentheses - match "b", then in the second group, when the back reference \1 fails - to match "b", the second alternative matches "a" and then recurses. In - the recursion, \1 does now match "b" and so the whole match succeeds. - In Perl, the pattern fails to match because inside the recursive call + In PCRE2, this pattern matches "bab". The first capturing parentheses + match "b", then in the second group, when the back reference \1 fails + to match "b", the second alternative matches "a" and then recurses. In + the recursion, \1 does now match "b" and so the whole match succeeds. + In Perl, the pattern fails to match because inside the recursive call \1 cannot access the externally set value. SUBPATTERNS AS SUBROUTINES - If the syntax for a recursive subpattern call (either by number or by - name) is used outside the parentheses to which it refers, it operates - like a subroutine in a programming language. The called subpattern may - be defined before or after the reference. 
A numbered reference can be + If the syntax for a recursive subpattern call (either by number or by + name) is used outside the parentheses to which it refers, it operates + like a subroutine in a programming language. The called subpattern may + be defined before or after the reference. A numbered reference can be absolute or relative, as in these examples: (...(absolute)...)...(?2)... @@ -7835,104 +7946,104 @@ SUBPATTERNS AS SUBROUTINES (sens|respons)e and \1ibility - matches "sense and sensibility" and "response and responsibility", but + matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility - is used, it does match "sense and responsibility" as well as the other - two strings. Another example is given in the discussion of DEFINE + is used, it does match "sense and responsibility" as well as the other + two strings. Another example is given in the discussion of DEFINE above. - All subroutine calls, whether recursive or not, are always treated as - atomic groups. That is, once a subroutine has matched some of the sub- + All subroutine calls, whether recursive or not, are always treated as + atomic groups. That is, once a subroutine has matched some of the sub- ject string, it is never re-entered, even if it contains untried alter- - natives and there is a subsequent matching failure. Any capturing - parentheses that are set during the subroutine call revert to their + natives and there is a subsequent matching failure. Any capturing + parentheses that are set during the subroutine call revert to their previous values afterwards. - Processing options such as case-independence are fixed when a subpat- - tern is defined, so if it is used as a subroutine, such options cannot + Processing options such as case-independence are fixed when a subpat- + tern is defined, so if it is used as a subroutine, such options cannot be changed for different calls. 
For example, consider this pattern: (abc)(?i:(?-1)) - It matches "abcabc". It does not match "abcABC" because the change of + It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called subpattern. ONIGURUMA SUBROUTINE SYNTAX - For compatibility with Oniguruma, the non-Perl syntax \g followed by a + For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is - an alternative syntax for referencing a subpattern as a subroutine, - possibly recursively. Here are two of the examples used above, rewrit- + an alternative syntax for referencing a subpattern as a subroutine, + possibly recursively. Here are two of the examples used above, rewrit- ten using this syntax: (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) (sens|respons)e and \g'1'ibility - PCRE2 supports an extension to Oniguruma: if a number is preceded by a + PCRE2 supports an extension to Oniguruma: if a number is preceded by a plus or a minus sign it is taken as a relative reference. For example: (abc)(?i:\g<-1>) - Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not - synonymous. The former is a back reference; the latter is a subroutine + Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not + synonymous. The former is a back reference; the latter is a subroutine call. CALLOUTS Perl has a feature whereby using the sequence (?{...}) causes arbitrary - Perl code to be obeyed in the middle of matching a regular expression. + Perl code to be obeyed in the middle of matching a regular expression. This makes it possible, amongst other things, to extract different sub- strings that match the same pair of parentheses when there is a repeti- tion. - PCRE2 provides a similar feature, but of course it cannot obey arbi- - trary Perl code. The feature is called "callout".
The caller of PCRE2 - provides an external function by putting its entry point in a match - context using the function pcre2_set_callout(), and then passing that - context to pcre2_match() or pcre2_dfa_match(). If no match context is + PCRE2 provides a similar feature, but of course it cannot obey arbi- + trary Perl code. The feature is called "callout". The caller of PCRE2 + provides an external function by putting its entry point in a match + context using the function pcre2_set_callout(), and then passing that + context to pcre2_match() or pcre2_dfa_match(). If no match context is passed, or if the callout entry point is set to NULL, callouts are dis- abled. - Within a regular expression, (?C) indicates a point at which the - external function is to be called. There are two kinds of callout: - those with a numerical argument and those with a string argument. (?C) - on its own with no argument is treated as (?C0). A numerical argument - allows the application to distinguish between different callouts. - String arguments were added for release 10.20 to make it possible for - script languages that use PCRE2 to embed short scripts within patterns + Within a regular expression, (?C) indicates a point at which the + external function is to be called. There are two kinds of callout: + those with a numerical argument and those with a string argument. (?C) + on its own with no argument is treated as (?C0). A numerical argument + allows the application to distinguish between different callouts. + String arguments were added for release 10.20 to make it possible for + script languages that use PCRE2 to embed short scripts within patterns in a similar way to Perl. During matching, when PCRE2 reaches a callout point, the external func- - tion is called. It is provided with the number or string argument of - the callout, the position in the pattern, and one item of data that is + tion is called. 
It is provided with the number or string argument of + the callout, the position in the pattern, and one item of data that is also set in the match block. The callout function may cause matching to proceed, to backtrack, or to fail. - By default, PCRE2 implements a number of optimizations at matching - time, and one side-effect is that sometimes callouts are skipped. If - you need all possible callouts to happen, you need to set options that - disable the relevant optimizations. More details, including a complete - description of the programming interface to the callout function, are + By default, PCRE2 implements a number of optimizations at matching + time, and one side-effect is that sometimes callouts are skipped. If + you need all possible callouts to happen, you need to set options that + disable the relevant optimizations. More details, including a complete + description of the programming interface to the callout function, are given in the pcre2callout documentation. Callouts with numerical arguments - If you just want to have a means of identifying different callout - points, put a number less than 256 after the letter C. For example, + If you just want to have a means of identifying different callout + points, put a number less than 256 after the letter C. For example, this pattern has two callout points: (?C1)abc(?C2)def - If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical - callouts are automatically installed before each item in the pattern. - They are all numbered 255. If there is a conditional group in the pat- + If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical + callouts are automatically installed before each item in the pattern. + They are all numbered 255. If there is a conditional group in the pat- tern whose condition is an assertion, an additional callout is inserted - just before the condition. An explicit callout may also be set at this + just before the condition. 
An explicit callout may also be set at this position, as in this example: (?(?C9)(?=a)abc|def) @@ -7942,42 +8053,51 @@ CALLOUTS Callouts with string arguments - A delimited string may be used instead of a number as a callout argu- - ment. The starting delimiter must be one of ` ' " ^ % # $ { and the + A delimited string may be used instead of a number as a callout argu- + ment. The starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the start, except for {, where the end- - ing delimiter is }. If the ending delimiter is needed within the + ing delimiter is }. If the ending delimiter is needed within the string, it must be doubled. For example: (?C'ab ''c'' d')xyz(?C{any text})pqr - The doubling is removed before the string is passed to the callout + The doubling is removed before the string is passed to the callout function. BACKTRACKING CONTROL - Perl 5.10 introduced a number of "Special Backtracking Control Verbs", - which are still described in the Perl documentation as "experimental - and subject to change or removal in a future version of Perl". It goes - on to say: "Their usage in production code should be noted to avoid + Perl 5.10 introduced a number of "Special Backtracking Control Verbs", + which are still described in the Perl documentation as "experimental + and subject to change or removal in a future version of Perl". It goes + on to say: "Their usage in production code should be noted to avoid problems during upgrades." The same remarks apply to the PCRE2 features described in this section. - The new verbs make use of what was previously invalid syntax: an open- + The new verbs make use of what was previously invalid syntax: an open- ing parenthesis followed by an asterisk. They are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving differently depending on whether or not a name is present. 
- By default, for compatibility with Perl, a name is any sequence of + By default, for compatibility with Perl, a name is any sequence of characters that does not include a closing parenthesis. The name is not - processed in any way, and it is not possible to include a closing - parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES option is - set, normal backslash processing is applied to verb names and only an - unescaped closing parenthesis terminates the name. A closing parenthe- - sis can be included in a name either as \) or between \Q and \E. If the - PCRE2_EXTENDED option is set, unescaped whitespace in verb names is - skipped and #-comments are recognized, exactly as in the rest of the - pattern. + processed in any way, and it is not possible to include a closing + parenthesis in the name. This can be changed by setting the + PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- + ble. + + When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to + verb names and only an unescaped closing parenthesis terminates the + name. However, the only backslash items that are permitted are \Q, \E, + and sequences such as \x{100} that define character code points. Char- + acter type escapes such as \d are faulted. + + A closing parenthesis can be included in a name either as \) or between + \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED + option is also set, unescaped whitespace in verb names is skipped, and + #-comments are recognized, exactly as in the rest of the pattern. + PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is + also set. The maximum length of a name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit libraries. If the name is empty, that is, if the @@ -8367,7 +8487,7 @@ AUTHOR REVISION - Last updated: 20 June 2016 + Last updated: 23 October 2016 Copyright (c) 1997-2016 University of Cambridge. 
------------------------------------------------------------------------------ @@ -9589,6 +9709,9 @@ BACKREFERENCES \n reference by number (can be ambiguous) \gn reference by number \g{n} reference by number + \g+n relative reference by number (PCRE2 extension) + \g-n relative reference by number + \g{+n} relative reference by number (PCRE2 extension) \g{-n} relative reference by number \k<name> reference by name (Perl) \k'name' reference by name (Perl) @@ -9625,14 +9748,18 @@ CONDITIONAL PATTERNS (?(-n) relative reference condition (?(<name>) named reference condition (Perl) (?('name') named reference condition (Perl) - (?(name) named reference condition (PCRE2) + (?(name) named reference condition (PCRE2, deprecated) (?(R) overall recursion condition - (?(Rn) specific group recursion condition - (?(R&name) specific recursion condition + (?(Rn) specific numbered group recursion condition + (?(R&name) specific named group recursion condition (?(DEFINE) define subpattern for reference (?(VERSION[>]=n.m) test PCRE2 version (?(assert) assertion condition + Note the ambiguity of (?(R) and (?(Rn) which might be named reference + conditions or recursion tests. Such a condition is interpreted as a + reference condition if the relevant named group exists. + BACKTRACKING CONTROL @@ -9684,8 +9811,8 @@ AUTHOR REVISION - Last updated: 16 October 2015 - Copyright (c) 1997-2015 University of Cambridge. + Last updated: 28 September 2016 + Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------ diff --git a/doc/pcre2_code_copy.3 b/doc/pcre2_code_copy.3 index 270b3a6..09b4705 100644 --- a/doc/pcre2_code_copy.3 +++ b/doc/pcre2_code_copy.3 @@ -1,4 +1,4 @@ -.TH PCRE2_CODE_COPY 3 "26 February 2016" "PCRE2 10.22" +.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -16,8 +16,9 @@ PCRE2 - Perl-compatible regular expressions (revised API) This function makes a copy of the memory used for a compiled pattern, excluding any memory used by the JIT compiler. Without a subsequent call to \fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching. The -yield of the function is NULL if \fIcode\fP is NULL or if sufficient memory -cannot be obtained. +pointer to the character tables is copied, not the tables themselves (see +\fBpcre2_code_copy_with_tables()\fP). The yield of the function is NULL if +\fIcode\fP is NULL or if sufficient memory cannot be obtained. .P There is a complete description of the PCRE2 native API in the .\" HREF diff --git a/doc/pcre2_code_copy_with_tables.3 b/doc/pcre2_code_copy_with_tables.3 new file mode 100644 index 0000000..c529c57 --- /dev/null +++ b/doc/pcre2_code_copy_with_tables.3 @@ -0,0 +1,32 @@ +.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH SYNOPSIS +.rs +.sp +.B #include <pcre2.h> +.PP +.nf +.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP); +.fi +. +.SH DESCRIPTION +.rs +.sp +This function makes a copy of the memory used for a compiled pattern, excluding +any memory used by the JIT compiler. Without a subsequent call to +\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching. +Unlike \fBpcre2_code_copy()\fP, a separate copy of the character tables is also +made, with the new code pointing to it.
This memory will be automatically freed +when \fBpcre2_code_free()\fP is called. The yield of the function is NULL if +\fIcode\fP is NULL or if sufficient memory cannot be obtained. +.P +There is a complete description of the PCRE2 native API in the +.\" HREF +\fBpcre2api\fP +.\" +page and a description of the POSIX API in the +.\" HREF +\fBpcre2posix\fP +.\" +page. diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index 3fbf87a..6baf88e 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "30 September 2016" "PCRE2 10.23" +.TH PCRE2API 3 "22 November 2016" "PCRE2 10.23" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -235,6 +235,8 @@ document for an overview of all the PCRE2 documentation. .nf .B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP); .sp +.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP); +.sp .B int pcre2_get_error_message(int \fIerrorcode\fP, PCRE2_UCHAR *\fIbuffer\fP, .B " PCRE2_SIZE \fIbufflen\fP);" .sp @@ -509,8 +511,9 @@ If JIT is being used, but the JIT compilation is not being done immediately, (perhaps waiting to see if the pattern is used often enough) similar logic is required. JIT compilation updates a pointer within the compiled code block, so a thread must gain unique write access to the pointer before calling -\fBpcre2_jit_compile()\fP. Alternatively, \fBpcre2_code_copy()\fP can be used -to obtain a private copy of the compiled code. +\fBpcre2_jit_compile()\fP. Alternatively, \fBpcre2_code_copy()\fP or +\fBpcre2_code_copy_with_tables()\fP can be used to obtain a private copy of the +compiled code. . . .SS "Context blocks" @@ -1027,6 +1030,8 @@ zero. .B void pcre2_code_free(pcre2_code *\fIcode\fP); .sp .B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP); +.sp +.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP); .fi .P The \fBpcre2_compile()\fP function compiles a pattern into an internal form. 
@@ -1049,9 +1054,24 @@ below), .\" the JIT information cannot be copied (because it is position-dependent). The new copy can initially be used only for non-JIT matching, though it can be -passed to \fBpcre2_jit_compile()\fP if required. The \fBpcre2_code_copy()\fP -function provides a way for individual threads in a multithreaded application -to acquire a private copy of shared compiled code. +passed to \fBpcre2_jit_compile()\fP if required. +.P +The \fBpcre2_code_copy()\fP function provides a way for individual threads in a +multithreaded application to acquire a private copy of shared compiled code. +However, it does not make a copy of the character tables used by the compiled +pattern; the new pattern code points to the same tables as the original code. +(See +.\" HTML +.\" +"Locale Support" +.\" +below for details of these character tables.) In many applications the same +tables are used throughout, so this behaviour is appropriate. Nevertheless, +there are occasions when a copy of a compiled pattern and the relevant tables +are needed. The \fBpcre2_code_copy_with_tables()\fP provides this facility. +Copies of both the code and the tables are made, with the new code pointing to +the new tables. The memory for the new tables is automatically freed when +\fBpcre2_code_free()\fP is called for the new copy of the compiled code. .P NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block so that they can @@ -3299,6 +3319,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 30 September 2016 +Last updated: 22 November 2016 Copyright (c) 1997-2016 University of Cambridge. .fi diff --git a/doc/pcre2grep.txt b/doc/pcre2grep.txt index 31aa610..9005792 100644 --- a/doc/pcre2grep.txt +++ b/doc/pcre2grep.txt @@ -51,103 +51,115 @@ DESCRIPTION boundary is controlled by the -N (--newline) option. 
The amount of memory used for buffering files that are being scanned is - controlled by a parameter that can be set by the --buffer-size option. - The default value for this parameter is specified when pcre2grep is - built, with the default default being 20K. A block of memory three - times this size is used (to allow for buffering "before" and "after" - lines). An error occurs if a line overflows the buffer. + controlled by parameters that can be set by the --buffer-size and + --max-buffer-size options. The first of these sets the size of buffer + that is obtained at the start of processing. If an input file contains + very long lines, a larger buffer may be needed; this is handled by + automatically extending the buffer, up to the limit specified by --max- + buffer-size. The default values for these parameters are specified when + pcre2grep is built, with the default defaults being 20K and 1M respec- + tively. An error occurs if a line is too long and the buffer can no + longer be expanded. - Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the - greater. BUFSIZ is defined in <stdio.h>. When there is more than one + The block of memory that is actually used is three times the "buffer + size", to allow for buffering "before" and "after" lines. If the buffer + size is too small, fewer than requested "before" and "after" lines may + be output. + + Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the + greater. BUFSIZ is defined in <stdio.h>. When there is more than one pattern (specified by the use of -e and/or -f), each pattern is applied - to each line in the order in which they are defined, except that all + to each line in the order in which they are defined, except that all the -e patterns are tried before the -f patterns. - By default, as soon as one pattern matches a line, no further patterns + By default, as soon as one pattern matches a line, no further patterns are considered. 
However, if --colour (or --color) is used to colour the - matching substrings, or if --only-matching, --file-offsets, or --line- - offsets is used to output only the part of the line that matched + matching substrings, or if --only-matching, --file-offsets, or --line- + offsets is used to output only the part of the line that matched (either shown literally, or as an offset), scanning resumes immediately - following the match, so that further matches on the same line can be - found. If there are multiple patterns, they are all tried on the - remainder of the line, but patterns that follow the one that matched + following the match, so that further matches on the same line can be + found. If there are multiple patterns, they are all tried on the + remainder of the line, but patterns that follow the one that matched are not tried on the earlier part of the line. - This behaviour means that the order in which multiple patterns are - specified can affect the output when one of the above options is used. - This is no longer the same behaviour as GNU grep, which now manages to - display earlier matches for later patterns (as long as there is no + This behaviour means that the order in which multiple patterns are + specified can affect the output when one of the above options is used. + This is no longer the same behaviour as GNU grep, which now manages to + display earlier matches for later patterns (as long as there is no overlap). - Patterns that can match an empty string are accepted, but empty string + Patterns that can match an empty string are accepted, but empty string matches are never recognized. An example is the pattern - "(super)?(man)?", in which all components are optional. This pattern - finds all occurrences of both "super" and "man"; the output differs - from matching with "super|man" when only the matching substrings are + "(super)?(man)?", in which all components are optional. 
This pattern + finds all occurrences of both "super" and "man"; the output differs + from matching with "super|man" when only the matching substrings are being shown. - If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses + If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses the value to set a locale when calling the PCRE2 library. The --locale option can be used to override this. SUPPORT FOR COMPRESSED FILES - It is possible to compile pcre2grep so that it uses libz or libbz2 to - read files whose names end in .gz or .bz2, respectively. You can find + It is possible to compile pcre2grep so that it uses libz or libbz2 to + read files whose names end in .gz or .bz2, respectively. You can find out whether your binary has support for one or both of these file types by running it with the --help option. If the appropriate support is not - present, files are treated as plain text. The standard input is always + present, files are treated as plain text. The standard input is always so treated. BINARY FILES - By default, a file that contains a binary zero byte within the first - 1024 bytes is identified as a binary file, and is processed specially. - (GNU grep also identifies binary files in this manner.) See the - --binary-files option for a means of changing the way binary files are + By default, a file that contains a binary zero byte within the first + 1024 bytes is identified as a binary file, and is processed specially. + (GNU grep also identifies binary files in this manner.) See the + --binary-files option for a means of changing the way binary files are handled. OPTIONS - The order in which some of the options appear can affect the output. - For example, both the -h and -l options affect the printing of file - names. Whichever comes later in the command line will be the one that - takes effect. Similarly, except where noted below, if an option is - given twice, the later setting is used. 
Numerical values for options - may be followed by K or M, to signify multiplication by 1024 or + The order in which some of the options appear can affect the output. + For example, both the -h and -l options affect the printing of file + names. Whichever comes later in the command line will be the one that + takes effect. Similarly, except where noted below, if an option is + given twice, the later setting is used. Numerical values for options + may be followed by K or M, to signify multiplication by 1024 or 1024*1024 respectively. -- This terminates the list of options. It is useful if the next - item on the command line starts with a hyphen but is not an - option. This allows for the processing of patterns and file + item on the command line starts with a hyphen but is not an + option. This allows for the processing of patterns and file names that start with hyphens. -A number, --after-context=number - Output number lines of context after each matching line. If - file names and/or line numbers are being output, a hyphen - separator is used instead of a colon for the context lines. A - line containing "--" is output between each group of lines, - unless they are in fact contiguous in the input file. The - value of number is expected to be relatively small. However, - pcre2grep guarantees to have up to 8K of following text - available for context output. + Output up to number lines of context after each matching + line. Fewer lines are output if the next match or the end of + the file is reached, or if the processing buffer size has + been set too small. If file names and/or line numbers are + being output, a hyphen separator is used instead of a colon + for the context lines. A line containing "--" is output + between each group of lines, unless they are in fact contigu- + ous in the input file. The value of number is expected to be + relatively small. When -c is used, -A is ignored. -a, --text Treat binary files as text. 
This is equivalent to --binary- files=text. -B number, --before-context=number - Output number lines of context before each matching line. If - file names and/or line numbers are being output, a hyphen - separator is used instead of a colon for the context lines. A - line containing "--" is output between each group of lines, - unless they are in fact contiguous in the input file. The - value of number is expected to be relatively small. However, - pcre2grep guarantees to have up to 8K of preceding text - available for context output. + Output up to number lines of context before each matching + line. Fewer lines are output if the previous match or the + start of the file is within number lines, or if the process- + ing buffer size has been set too small. If file names and/or + line numbers are being output, a hyphen separator is used + instead of a colon for the context lines. A line containing + "--" is output between each group of lines, unless they are + in fact contiguous in the input file. The value of number is + expected to be relatively small. When -c is used, -B is + ignored. --binary-files=word Specify how binary files are to be processed. If the word is @@ -164,54 +176,58 @@ OPTIONS any output or affecting the return code. --buffer-size=number - Set the parameter that controls how much memory is used for - buffering files that are being scanned. + Set the parameter that controls how much memory is obtained + at the start of processing for buffering files that are being + scanned. See also --max-buffer-size below. -C number, --context=number - Output number lines of context both before and after each - matching line. This is equivalent to setting both -A and -B + Output number lines of context both before and after each + matching line. This is equivalent to setting both -A and -B to the same value. 
-c, --count - Do not output lines from the files that are being scanned; - instead output the number of matches (or non-matches if -v is - used) that would otherwise have caused lines to be shown. By - default, this count is the same as the number of suppressed - lines, but if the -M (multiline) option is used (without -v), - there may be more suppressed lines than the number of - matches. + Do not output lines from the files that are being scanned; + instead output the number of lines that would have been + shown, either because they matched, or, if -v is set, because + they failed to match. By default, this count is exactly the + same as the number of lines that would have been output, but + if the -M (multiline) option is used (without -v), there may + be more suppressed lines than the count (that is, the number + of matches). If no lines are selected, the number zero is output. If sev- eral files are are being scanned, a count is output for each - of them. However, if the --files-with-matches option is also - used, only those files whose counts are greater than zero are - listed. When -c is used, the -A, -B, and -C options are - ignored. + of them and the -t option can be used to cause a total to be + output at the end. However, if the --files-with-matches + option is also used, only those files whose counts are + greater than zero are listed. When -c is used, the -A, -B, + and -C options are ignored. --colour, --color If this option is given without any data, it is equivalent to - "--colour=auto". If data is required, it must be given in + "--colour=auto". If data is required, it must be given in the same shell item, separated by an equals sign. --colour=value, --color=value This option specifies under what circumstances the parts of a line that matched a pattern should be coloured in the output. - By default, the output is not coloured. The value (which is - optional, see above) may be "never", "always", or "auto". 
In - the latter case, colouring happens only if the standard out- - put is connected to a terminal. More resources are used when + By default, the output is not coloured. The value (which is + optional, see above) may be "never", "always", or "auto". In + the latter case, colouring happens only if the standard out- + put is connected to a terminal. More resources are used when colouring is enabled, because pcre2grep has to search for all - possible matches in a line, not just one, in order to colour + possible matches in a line, not just one, in order to colour them all. The colour that is used can be specified by setting the envi- - ronment variable PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The - value of this variable should be a string of two numbers, - separated by a semicolon. They are copied directly into the - control string for setting colour on a terminal, so it is - your responsibility to ensure that they make sense. If nei- - ther of the environment variables is set, the default is - "1;31", which gives red. + ronment variable PCRE2GREP_COLOUR or PCRE2GREP_COLOR. If nei- + ther of these are set, pcre2grep looks for GREP_COLOUR or + GREP_COLOR. The value of the variable should be a string of + two numbers, separated by a semicolon. They are copied + directly into the control string for setting colour on a ter- + minal, so it is your responsibility to ensure that they make + sense. If neither of the environment variables is set, the + default is "1;31", which gives red. -D action, --devices=action If an input path is not a regular file or a directory, @@ -299,12 +315,12 @@ OPTIONS Read patterns from the file, one per line, and match them against each line of input. What constitutes a newline when reading the file is the operating system's default. The - --newline option has no effect on this option. Trailing white - space is removed from each line, and blank lines are ignored. - An empty file contains no patterns and therefore matches - nothing. 
See also the comments about multiple patterns versus - a single pattern with alternatives in the description of -e - above. + --newline option has no effect on this option. Trailing + white space is removed from each line, and blank lines are + ignored. An empty file contains no patterns and therefore + matches nothing. See also the comments about multiple pat- + terns versus a single pattern with alternatives in the + description of -e above. If this option is given more than once, all the specified files are read. A data line is output if any of the patterns @@ -482,102 +498,101 @@ OPTIONS tings are specified when the PCRE2 library is compiled, with the default default being 10 million. + --max-buffer-size=number + This limits the expansion of the processing buffer, whose + initial size can be set by --buffer-size. The maximum buffer + size is silently forced to be no smaller than the starting + buffer size. + -M, --multiline - Allow patterns to match more than one line. When this option - is given, patterns may usefully contain literal newline char- - acters and internal occurrences of ^ and $ characters. The - output for a successful match may consist of more than one - line. The first is the line in which the match started, and - the last is the line in which the match ended. If the matched - string ends with a newline sequence the output ends at the - end of that line. + Allow patterns to match more than one line. When this option + is set, the PCRE2 library is called in "multiline" mode. This + allows a matched string to extend past the end of a line and + continue on one or more subsequent lines. Patterns used with + -M may usefully contain literal newline characters and inter- + nal occurrences of ^ and $ characters. The output for a suc- + cessful match may consist of more than one line. The first + line is the line in which the match started, and the last + line is the line in which the match ended. 
If the matched + string ends with a newline sequence, the output ends at the + end of that line. If -v is set, none of the lines in a + multi-line match are output. Once a match has been handled, + scanning restarts at the beginning of the line after the one + in which the match ended. - When this option is set, the PCRE2 library is called in "mul- - tiline" mode. This allows a matched string to extend past the - end of a line and continue on one or more subsequent lines. - However, pcre2grep still processes the input line by line. - Once a match has been handled, scanning restarts at the - beginning of the next line, just as it does when -M is not - present. This means that it is possible for the second or - subsequent lines in a multiline match to be output again as - part of another match. - - The newline sequence that separates multiple lines must be - matched as part of the pattern. For example, to find the - phrase "regular expression" in a file where "regular" might - be at the end of a line and "expression" at the start of the + The newline sequence that separates multiple lines must be + matched as part of the pattern. For example, to find the + phrase "regular expression" in a file where "regular" might + be at the end of a line and "expression" at the start of the next line, you could use this command: pcre2grep -M 'regular\s+expression' - The \s escape sequence matches any white space character, - including newlines, and is followed by + so as to match - trailing white space on the first line as well as possibly + The \s escape sequence matches any white space character, + including newlines, and is followed by + so as to match + trailing white space on the first line as well as possibly handling a two-character newline sequence. - There is a limit to the number of lines that can be matched, - imposed by the way that pcre2grep buffers the input file as - it scans it. 
However, pcre2grep ensures that at least 8K - characters or the rest of the file (whichever is the shorter) - are available for forward matching, and similarly the previ- - ous 8K characters (or all the previous characters, if fewer - than 8K) are guaranteed to be available for lookbehind asser- - tions. The -M option does not work when input is read line by - line (see --line-buffered.) + There is a limit to the number of lines that can be matched, + imposed by the way that pcre2grep buffers the input file as + it scans it. With a sufficiently large processing buffer, + this should not be a problem, but the -M option does not work + when input is read line by line (see --line-buffered.) -N newline-type, --newline=newline-type - The PCRE2 library supports five different conventions for - indicating the ends of lines. They are the single-character - sequences CR (carriage return) and LF (linefeed), the two- - character sequence CRLF, an "anycrlf" convention, which rec- - ognizes any of the preceding three types, and an "any" con- + The PCRE2 library supports five different conventions for + indicating the ends of lines. They are the single-character + sequences CR (carriage return) and LF (linefeed), the two- + character sequence CRLF, an "anycrlf" convention, which rec- + ognizes any of the preceding three types, and an "any" con- vention, in which any Unicode line ending sequence is assumed - to end a line. The Unicode sequences are the three just men- - tioned, plus VT (vertical tab, U+000B), FF (form feed, - U+000C), NEL (next line, U+0085), LS (line separator, + to end a line. The Unicode sequences are the three just men- + tioned, plus VT (vertical tab, U+000B), FF (form feed, + U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029). - When the PCRE2 library is built, a default line-ending - sequence is specified. 
This is normally the standard + When the PCRE2 library is built, a default line-ending + sequence is specified. This is normally the standard sequence for the operating system. Unless otherwise specified - by this option, pcre2grep uses the library's default. The + by this option, pcre2grep uses the library's default. The possible values for this option are CR, LF, CRLF, ANYCRLF, or - ANY. This makes it possible to use pcre2grep to scan files + ANY. This makes it possible to use pcre2grep to scan files that have come from other environments without having to mod- - ify their line endings. If the data that is being scanned - does not agree with the convention set by this option, - pcre2grep may behave in strange ways. Note that this option - does not apply to files specified by the -f, --exclude-from, - or --include-from options, which are expected to use the + ify their line endings. If the data that is being scanned + does not agree with the convention set by this option, + pcre2grep may behave in strange ways. Note that this option + does not apply to files specified by the -f, --exclude-from, + or --include-from options, which are expected to use the operating system's standard newline sequence. -n, --line-number Precede each output line by its line number in the file, fol- - lowed by a colon for matching lines or a hyphen for context + lowed by a colon for matching lines or a hyphen for context lines. If the file name is also being output, it precedes the - line number. When the -M option causes a pattern to match - more than one line, only the first is preceded by its line + line number. When the -M option causes a pattern to match + more than one line, only the first is preceded by its line number. This option is forced if --line-offsets is used. 
- --no-jit If the PCRE2 library is built with support for just-in-time + --no-jit If the PCRE2 library is built with support for just-in-time compiling (which speeds up matching), pcre2grep automatically makes use of this, unless it was explicitly disabled at build - time. This option can be used to disable the use of JIT at - run time. It is provided for testing and working round prob- + time. This option can be used to disable the use of JIT at + run time. It is provided for testing and working round prob- lems. It should never be needed in normal use. -o, --only-matching Show only the part of the line that matched a pattern instead - of the whole line. In this mode, no context is shown. That - is, the -A, -B, and -C options are ignored. If there is more - than one match in a line, each of them is shown separately. - If -o is combined with -v (invert the sense of the match to - find non-matching lines), no output is generated, but the - return code is set appropriately. If the matched portion of - the line is empty, nothing is output unless the file name or - line number are being printed, in which case they are shown - on an otherwise empty line. This option is mutually exclusive - with --file-offsets and --line-offsets. + of the whole line. In this mode, no context is shown. That + is, the -A, -B, and -C options are ignored. If there is more + than one match in a line, each of them is shown separately, + on a separate line of output. If -o is combined with -v + (invert the sense of the match to find non-matching lines), + no output is generated, but the return code is set appropri- + ately. If the matched portion of the line is empty, nothing + is output unless the file name or line number are being + printed, in which case they are shown on an otherwise empty + line. This option is mutually exclusive with --file-offsets + and --line-offsets. 
-onumber, --only-matching=number Show only the part of the line that matched the capturing @@ -593,65 +608,80 @@ OPTIONS put. If this option is given multiple times, multiple substrings - are output, in the order the options are given. For example, - -o3 -o1 -o3 causes the substrings matched by capturing paren- - theses 3 and 1 and then 3 again to be output. By default, - there is no separator (but see the next option). + are output for each match, in the order the options are + given, and all on one line. For example, -o3 -o1 -o3 causes + the substrings matched by capturing parentheses 3 and 1 and + then 3 again to be output. By default, there is no separator + (but see the next option). --om-separator=text - Specify a separating string for multiple occurrences of -o. - The default is an empty string. Separating strings are never + Specify a separating string for multiple occurrences of -o. + The default is an empty string. Separating strings are never coloured. -q, --quiet Work quietly, that is, display nothing except error messages. - The exit status indicates whether or not any matches were + The exit status indicates whether or not any matches were found. -r, --recursive - If any given path is a directory, recursively scan the files - it contains, taking note of any --include and --exclude set- - tings. By default, a directory is read as a normal file; in - some operating systems this gives an immediate end-of-file. - This option is a shorthand for setting the -d option to + If any given path is a directory, recursively scan the files + it contains, taking note of any --include and --exclude set- + tings. By default, a directory is read as a normal file; in + some operating systems this gives an immediate end-of-file. + This option is a shorthand for setting the -d option to "recurse". --recursion-limit=number See --match-limit above. -s, --no-messages - Suppress error messages about non-existent or unreadable - files. Such files are quietly skipped. 
However, the return + Suppress error messages about non-existent or unreadable + files. Such files are quietly skipped. However, the return code is still 2, even if matches were found in other files. + -t, --total-count + This option is useful when scanning more than one file. If + used on its own, -t suppresses all output except for a grand + total number of matching lines (or non-matching lines if -v + is used) in all the files. If -t is used with -c, a grand + total is output except when the previous output is just one + line. In other words, it is not output when just one file's + count is listed. If file names are being output, the grand + total is preceded by "TOTAL:". Otherwise, it appears as just + another number. The -t option is ignored when used with -L + (list files without matches), because the grand total would + always be zero. + -u, --utf-8 Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled with UTF-8 support. All patterns (including - those for any --exclude and --include options) and all sub- - ject lines that are scanned must be valid strings of UTF-8 + those for any --exclude and --include options) and all sub- + ject lines that are scanned must be valid strings of UTF-8 characters. -V, --version - Write the version numbers of pcre2grep and the PCRE2 library - to the standard output and then exit. Anything else on the + Write the version numbers of pcre2grep and the PCRE2 library + to the standard output and then exit. Anything else on the command line is ignored. -v, --invert-match - Invert the sense of the match, so that lines which do not + Invert the sense of the match, so that lines which do not match any of the patterns are the ones that are found. -w, --word-regex, --word-regexp Force the patterns to match only whole words. This is equiva- - lent to having \b at the start and end of the pattern. 
This - option applies only to the patterns that are matched against - the contents of files; it does not apply to patterns speci- + lent to having \b at the start and end of the pattern. This + option applies only to the patterns that are matched against + the contents of files; it does not apply to patterns speci- fied by any of the --include or --exclude options. -x, --line-regex, --line-regexp - Force the patterns to be anchored (each must start matching - at the beginning of a line) and in addition, require them to - match entire lines. This is equivalent to having ^ and $ - characters at the start and end of each alternative top-level + Force the patterns to be anchored (each must start matching + at the beginning of a line) and in addition, require them to + match entire lines. In multiline mode the match may be more + than one line. This is equivalent to having \A and \Z charac- + ters at the start and end of each alternative top-level branch in every pattern. This option applies only to the pat- terns that are matched against the contents of files; it does not apply to patterns specified by any of the --include or @@ -822,5 +852,5 @@ AUTHOR REVISION - Last updated: 19 June 2016 + Last updated: 31 October 2016 Copyright (c) 1997-2016 University of Cambridge. diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt index 233acae..a678333 100644 --- a/doc/pcre2test.txt +++ b/doc/pcre2test.txt @@ -558,6 +558,7 @@ PATTERN MODIFIERS pushcopy push a copy onto the stack stackguard= test the stackguard feature tables=[0|1|2] select internal tables + use_length do not zero-terminate the pattern utf8_input treat input as UTF-8 The effects of these modifiers are described in the following sections. @@ -631,6 +632,16 @@ PATTERN MODIFIERS testing that pcre2_compile() behaves correctly in this case (it uses default values). + Specifying the pattern's length + + By default, patterns are passed to the compiling functions as zero-ter- + minated strings. 
When using the POSIX wrapper API, there is no other + option. However, when using PCRE2's native API, patterns can be passed + by length instead of being zero-terminated. The use_length modifier + causes this to happen. Using a length happens automatically (whether + or not use_length is set) when hex is set, because patterns specified + in hexadecimal may contain binary zeros. + Specifying pattern characters in hexadecimal The hex modifier specifies that the characters of the pattern, except @@ -652,60 +663,61 @@ PATTERN MODIFIERS ing the delimiter within a substring. The hex and expand modifiers are mutually exclusive. - By default, pcre2test passes patterns as zero-terminated strings to - pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However, - for patterns specified with the hex modifier, the actual length of the - pattern is passed. + The POSIX API cannot be used with patterns specified in hexadecimal + because they may contain binary zeros, which conflicts with regcomp()'s + requirement for a zero-terminated string. Such patterns are always + passed to pcre2_compile() as a string with a length, not as zero-termi- + nated. Specifying wide characters in 16-bit and 32-bit modes In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 - and translated to UTF-16 or UTF-32 when the utf modifier is set. For + and translated to UTF-16 or UTF-32 when the utf modifier is set. For testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input - modifier can be used. It is mutually exclusive with utf. Input lines + modifier can be used. It is mutually exclusive with utf. Input lines are interpreted as UTF-8 as a means of specifying wide characters. More details are given in "Input encoding" above. Generating long repetitive patterns - Some tests use long patterns that are very repetitive. 
Instead of cre- - ating a very long input line for such a pattern, you can use a special - repetition feature, similar to the one described for subject lines - above. If the expand modifier is present on a pattern, parts of the + Some tests use long patterns that are very repetitive. Instead of cre- + ating a very long input line for such a pattern, you can use a special + repetition feature, similar to the one described for subject lines + above. If the expand modifier is present on a pattern, parts of the pattern that have the form \[<characters>]{<count>} are expanded before the pattern is passed to pcre2_compile(). For exam- ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction - cannot be nested. An initial "\[" sequence is recognized only if "]{" - followed by decimal digits and "}" is found later in the pattern. If + cannot be nested. An initial "\[" sequence is recognized only if "]{" + followed by decimal digits and "}" is found later in the pattern. If not, the characters remain in the pattern unaltered. The expand and hex modifiers are mutually exclusive. - If part of an expanded pattern looks like an expansion, but is really + If part of an expanded pattern looks like an expansion, but is really part of the actual pattern, unwanted expansion can be avoided by giving two values in the quantifier. For example, \[AB]{6000,6000} is not rec- ognized as an expansion item. - If the info modifier is set on an expanded pattern, the result of the + If the info modifier is set on an expanded pattern, the result of the expansion is included in the information that is output. JIT compilation - Just-in-time (JIT) compiling is a heavyweight optimization that can - greatly speed up pattern matching. See the pcre2jit documentation for - details. JIT compiling happens, optionally, after a pattern has been - successfully compiled into an internal form.
The JIT compiler converts + Just-in-time (JIT) compiling is a heavyweight optimization that can + greatly speed up pattern matching. See the pcre2jit documentation for + details. JIT compiling happens, optionally, after a pattern has been + successfully compiled into an internal form. The JIT compiler converts this to optimized machine code. It needs to know whether the match-time options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, - because different code is generated for the different cases. See the - partial modifier in "Subject Modifiers" below for details of how these + because different code is generated for the different cases. See the + partial modifier in "Subject Modifiers" below for details of how these options are specified for each match attempt. - JIT compilation is requested by the /jit pattern modifier, which may + JIT compilation is requested by the /jit pattern modifier, which may optionally be followed by an equals sign and a number in the range 0 to - 7. The three bits that make up the number specify which of the three + 7. The three bits that make up the number specify which of the three JIT operating modes are to be compiled: 1 compile JIT code for non-partial matching @@ -722,31 +734,31 @@ PATTERN MODIFIERS 6 soft and hard partial matching only 7 all three modes - If no number is given, 7 is assumed. The phrase "partial matching" + If no number is given, 7 is assumed. The phrase "partial matching" means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the - PCRE2_PARTIAL_HARD option set. Note that such a call may return a com- + PCRE2_PARTIAL_HARD option set. Note that such a call may return a com- plete match; the options enable the possibility of a partial match, but - do not require it. Note also that if you request JIT compilation only - for partial matching (for example, /jit=2) but do not set the partial - modifier on a subject line, that match will not use JIT code because + do not require it. 
Note also that if you request JIT compilation only + for partial matching (for example, /jit=2) but do not set the partial + modifier on a subject line, that match will not use JIT code because none was compiled for non-partial matching. - If JIT compilation is successful, the compiled JIT code will automati- - cally be used when an appropriate type of match is run, except when - incompatible run-time options are specified. For more details, see the - pcre2jit documentation. See also the jitstack modifier below for a way + If JIT compilation is successful, the compiled JIT code will automati- + cally be used when an appropriate type of match is run, except when + incompatible run-time options are specified. For more details, see the + pcre2jit documentation. See also the jitstack modifier below for a way of setting the size of the JIT stack. - If the jitfast modifier is specified, matching is done using the JIT - "fast path" interface, pcre2_jit_match(), which skips some of the san- - ity checks that are done by pcre2_match(), and of course does not work - when JIT is not supported. If jitfast is specified without jit, jit=7 + If the jitfast modifier is specified, matching is done using the JIT + "fast path" interface, pcre2_jit_match(), which skips some of the san- + ity checks that are done by pcre2_match(), and of course does not work + when JIT is not supported. If jitfast is specified without jit, jit=7 is assumed. - If the jitverify modifier is specified, information about the compiled - pattern shows whether JIT compilation was or was not successful. If - jitverify is specified without jit, jit=7 is assumed. If JIT compila- - tion is successful when jitverify is set, the text "(JIT)" is added to + If the jitverify modifier is specified, information about the compiled + pattern shows whether JIT compilation was or was not successful. If + jitverify is specified without jit, jit=7 is assumed. 
If JIT compila- + tion is successful when jitverify is set, the text "(JIT)" is added to the first output line after a match or non match when JIT-compiled code was actually used in the match. @@ -757,18 +769,18 @@ PATTERN MODIFIERS /pattern/locale=fr_FR The given locale is set, pcre2_maketables() is called to build a set of - character tables for the locale, and this is then passed to pcre2_com- - pile() when compiling the regular expression. The same tables are used + character tables for the locale, and this is then passed to pcre2_com- + pile() when compiling the regular expression. The same tables are used when matching the following subject lines. The /locale modifier applies only to the pattern on which it appears, but can be given in a #pattern - command if a default is needed. Setting a locale and alternate charac- + command if a default is needed. Setting a locale and alternate charac- ter tables are mutually exclusive. Showing pattern memory - The /memory modifier causes the size in bytes of the memory used to - hold the compiled pattern to be output. This does not include the size - of the pcre2_code block; it is just the actual compiled data. If the + The /memory modifier causes the size in bytes of the memory used to + hold the compiled pattern to be output. This does not include the size + of the pcre2_code block; it is just the actual compiled data. If the pattern is subsequently passed to the JIT compiler, the size of the JIT compiled code is also output. Here is an example: @@ -779,27 +791,27 @@ PATTERN MODIFIERS Limiting nested parentheses - The parens_nest_limit modifier sets a limit on the depth of nested - parentheses in a pattern. Breaching the limit causes a compilation - error. The default for the library is set when PCRE2 is built, but - pcre2test sets its own default of 220, which is required for running + The parens_nest_limit modifier sets a limit on the depth of nested + parentheses in a pattern. 
Breaching the limit causes a compilation + error. The default for the library is set when PCRE2 is built, but + pcre2test sets its own default of 220, which is required for running the standard test suite. Limiting the pattern length - The max_pattern_length modifier sets a limit, in code units, to the + The max_pattern_length modifier sets a limit, in code units, to the length of pattern that pcre2_compile() will accept. Breaching the limit - causes a compilation error. The default is the largest number a + causes a compilation error. The default is the largest number a PCRE2_SIZE variable can hold (essentially unlimited). Using the POSIX wrapper API - The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via - the POSIX wrapper API rather than its native API. When posix_nosub is - used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX - wrapper supports only the 8-bit library. Note that it does not imply + The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via + the POSIX wrapper API rather than its native API. When posix_nosub is + used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX + wrapper supports only the 8-bit library. Note that it does not imply POSIX matching semantics; for more detail see the pcre2posix documenta- - tion. The following pattern modifiers set options for the regcomp() + tion. The following pattern modifiers set options for the regcomp() function: caseless REG_ICASE @@ -809,35 +821,35 @@ PATTERN MODIFIERS ucp REG_UCP ) the POSIX standard utf REG_UTF8 ) - The regerror_buffsize modifier specifies a size for the error buffer - that is passed to regerror() in the event of a compilation error. For + The regerror_buffsize modifier specifies a size for the error buffer + that is passed to regerror() in the event of a compilation error. 
For example: /abc/posix,regerror_buffsize=20 - This provides a means of testing the behaviour of regerror() when the - buffer is too small for the error message. If this modifier has not + This provides a means of testing the behaviour of regerror() when the + buffer is too small for the error message. If this modifier has not been set, a large buffer is used. - The aftertext and allaftertext subject modifiers work as described - below. All other modifiers are either ignored, with a warning message, + The aftertext and allaftertext subject modifiers work as described + below. All other modifiers are either ignored, with a warning message, or cause an error. Testing the stack guard feature - The /stackguard modifier is used to test the use of pcre2_set_com- - pile_recursion_guard(), a function that is provided to enable stack - availability to be checked during compilation (see the pcre2api docu- - mentation for details). If the number specified by the modifier is + The /stackguard modifier is used to test the use of pcre2_set_com- + pile_recursion_guard(), a function that is provided to enable stack + availability to be checked during compilation (see the pcre2api docu- + mentation for details). If the number specified by the modifier is greater than zero, pcre2_set_compile_recursion_guard() is called to set - up callback from pcre2_compile() to a local function. The argument it - receives is the current nesting parenthesis depth; if this is greater + up callback from pcre2_compile() to a local function. The argument it + receives is the current nesting parenthesis depth; if this is greater than the value given by the modifier, non-zero is returned, causing the compilation to be aborted. Using alternative character tables - The value specified for the /tables modifier must be one of the digits + The value specified for the /tables modifier must be one of the digits 0, 1, or 2. It causes a specific set of built-in character tables to be passed to pcre2_compile(). 
This is used in the PCRE2 tests to check be- haviour with different character tables. The digit specifies the tables @@ -848,15 +860,15 @@ PATTERN MODIFIERS pcre2_chartables.c.dist 2 a set of tables defining ISO 8859 characters - In table 2, some characters whose codes are greater than 128 are iden- - tified as letters, digits, spaces, etc. Setting alternate character + In table 2, some characters whose codes are greater than 128 are iden- + tified as letters, digits, spaces, etc. Setting alternate character tables and a locale are mutually exclusive. Setting certain match controls The following modifiers are really subject modifiers, and are described - below. However, they may be included in a pattern's modifier list, in - which case they are applied to every subject line that is processed + below. However, they may be included in a pattern's modifier list, in + which case they are applied to every subject line that is processed with that pattern. They may not appear in #pattern commands. These mod- ifiers do not affect the compilation process. @@ -873,24 +885,24 @@ PATTERN MODIFIERS substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY - These modifiers may not appear in a #pattern command. If you want them + These modifiers may not appear in a #pattern command. If you want them as defaults, set them in a #subject command. Saving a compiled pattern - When a pattern with the push modifier is successfully compiled, it is - pushed onto a stack of compiled patterns, and pcre2test expects the - next line to contain a new pattern (or a command) instead of a subject + When a pattern with the push modifier is successfully compiled, it is + pushed onto a stack of compiled patterns, and pcre2test expects the + next line to contain a new pattern (or a command) instead of a subject line. 
This facility is used when saving compiled patterns to a file, as - described in the section entitled "Saving and restoring compiled pat- - terns" below. If pushcopy is used instead of push, a copy of the com- - piled pattern is stacked, leaving the original as current, ready to - match the following input lines. This provides a way of testing the - pcre2_code_copy() function. The push and pushcopy modifiers are - incompatible with compilation modifiers such as global that act at - match time. Any that are specified are ignored (for the stacked copy), + described in the section entitled "Saving and restoring compiled pat- + terns" below. If pushcopy is used instead of push, a copy of the com- + piled pattern is stacked, leaving the original as current, ready to + match the following input lines. This provides a way of testing the + pcre2_code_copy() function. The push and pushcopy modifiers are + incompatible with compilation modifiers such as global that act at + match time. Any that are specified are ignored (for the stacked copy), with a warning message, except for replace, which causes an error. Note - that jitverify, which is allowed, does not carry through to any subse- + that jitverify, which is allowed, does not carry through to any subse- quent matching that uses a stacked pattern. @@ -901,7 +913,7 @@ SUBJECT MODIFIERS Setting match options - The following modifiers set options for pcre2_match() or + The following modifiers set options for pcre2_match() or pcre2_dfa_match(). See pcreapi for a description of their effects. anchored set PCRE2_ANCHORED @@ -916,20 +928,20 @@ SUBJECT MODIFIERS partial_hard (or ph) set PCRE2_PARTIAL_HARD partial_soft (or ps) set PCRE2_PARTIAL_SOFT - The partial matching modifiers are provided with abbreviations because + The partial matching modifiers are provided with abbreviations because they appear frequently in tests. 
- If the /posix modifier was present on the pattern, causing the POSIX + If the /posix modifier was present on the pattern, causing the POSIX wrapper API to be used, the only option-setting modifiers that have any - effect are notbol, notempty, and noteol, causing REG_NOTBOL, - REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). + effect are notbol, notempty, and noteol, causing REG_NOTBOL, + REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec(). The other modifiers are ignored, with a warning message. Setting match controls - The following modifiers affect the matching process or request addi- - tional information. Some of them may also be specified on a pattern - line (see above), in which case they apply to every subject line that + The following modifiers affect the matching process or request addi- + tional information. Some of them may also be specified on a pattern + line (see above), in which case they apply to every subject line that is matched against that pattern. aftertext show text after match @@ -966,29 +978,29 @@ SUBJECT MODIFIERS zero_terminate pass the subject as zero-terminated The effects of these modifiers are described in the following sections. - When matching via the POSIX wrapper API, the aftertext, allaftertext, - and ovector subject modifiers work as described below. All other modi- + When matching via the POSIX wrapper API, the aftertext, allaftertext, + and ovector subject modifiers work as described below. All other modi- fiers are either ignored, with a warning message, or cause an error. Showing more text - The aftertext modifier requests that as well as outputting the part of + The aftertext modifier requests that as well as outputting the part of the subject string that matched the entire pattern, pcre2test should in addition output the remainder of the subject string. This is useful for tests where the subject contains multiple copies of the same substring. 
- The allaftertext modifier requests the same action for captured sub- + The allaftertext modifier requests the same action for captured sub- strings as well as the main matched substring. In each case the remain- der is output on the following line with a plus character following the capture number. - The allusedtext modifier requests that all the text that was consulted - during a successful pattern match by the interpreter should be shown. - This feature is not supported for JIT matching, and if requested with - JIT it is ignored (with a warning message). Setting this modifier + The allusedtext modifier requests that all the text that was consulted + during a successful pattern match by the interpreter should be shown. + This feature is not supported for JIT matching, and if requested with + JIT it is ignored (with a warning message). Setting this modifier affects the output if there is a lookbehind at the start of a match, or - a lookahead at the end, or if \K is used in the pattern. Characters - that precede or follow the start and end of the actual match are indi- - cated in the output by '<' or '>' characters underneath them. Here is + a lookahead at the end, or if \K is used in the pattern. Characters + that precede or follow the start and end of the actual match are indi- + cated in the output by '<' or '>' characters underneath them. Here is an example: re> /(?<=pqr)abc(?=xyz)/ @@ -996,16 +1008,16 @@ SUBJECT MODIFIERS 0: pqrabcxyz <<< >>> - This shows that the matched string is "abc", with the preceding and - following strings "pqr" and "xyz" having been consulted during the + This shows that the matched string is "abc", with the preceding and + following strings "pqr" and "xyz" having been consulted during the match (when processing the assertions). 
- The startchar modifier requests that the starting character for the - match be indicated, if it is different to the start of the matched + The startchar modifier requests that the starting character for the + match be indicated, if it is different to the start of the matched string. The only time when this occurs is when \K has been processed as part of the match. In this situation, the output for the matched string - is displayed from the starting character instead of from the match - point, with circumflex characters under the earlier characters. For + is displayed from the starting character instead of from the match + point, with circumflex characters under the earlier characters. For example: re> /abc\Kxyz/ @@ -1013,7 +1025,7 @@ SUBJECT MODIFIERS 0: abcxyz ^^^ - Unlike allusedtext, the startchar modifier can be used with JIT. How- + Unlike allusedtext, the startchar modifier can be used with JIT. How- ever, these two modifiers are mutually exclusive. Showing the value of all capture groups @@ -1021,90 +1033,90 @@ SUBJECT MODIFIERS The allcaptures modifier requests that the values of all potential cap- tured parentheses be output after a match. By default, only those up to the highest one actually used in the match are output (corresponding to - the return code from pcre2_match()). Groups that did not take part in - the match are output as "<unset>". This modifier is not relevant for - DFA matching (which does no capturing); it is ignored, with a warning + the return code from pcre2_match()). Groups that did not take part in + the match are output as "<unset>". This modifier is not relevant for + DFA matching (which does no capturing); it is ignored, with a warning message, if present. Testing callouts - A callout function is supplied when pcre2test calls the library match- - ing functions, unless callout_none is specified. If callout_capture is + A callout function is supplied when pcre2test calls the library match- + ing functions, unless callout_none is specified.
If callout_capture is set, the current captured groups are output when a callout occurs. - The callout_fail modifier can be given one or two numbers. If there is + The callout_fail modifier can be given one or two numbers. If there is only one number, 1 is returned instead of 0 when a callout of that num- - ber is reached. If two numbers are given, 1 is returned when callout + ber is reached. If two numbers are given, 1 is returned when callout <n> is reached for the <m>th time. Note that callouts with string argu- - ments are always given the number zero. See "Callouts" below for a + ments are always given the number zero. See "Callouts" below for a description of the output when a callout it taken. - The callout_data modifier can be given an unsigned or a negative num- - ber. This is set as the "user data" that is passed to the matching - function, and passed back when the callout function is invoked. Any - value other than zero is used as a return from pcre2test's callout + The callout_data modifier can be given an unsigned or a negative num- + ber. This is set as the "user data" that is passed to the matching + function, and passed back when the callout function is invoked. Any + value other than zero is used as a return from pcre2test's callout function. Finding all matches in a string Searching for all possible matches within a subject can be requested by - the global or /altglobal modifier. After finding a match, the matching - function is called again to search the remainder of the subject. The - difference between global and altglobal is that the former uses the - start_offset argument to pcre2_match() or pcre2_dfa_match() to start - searching at a new point within the entire string (which is what Perl + the global or /altglobal modifier. After finding a match, the matching + function is called again to search the remainder of the subject.
The + difference between global and altglobal is that the former uses the + start_offset argument to pcre2_match() or pcre2_dfa_match() to start + searching at a new point within the entire string (which is what Perl does), whereas the latter passes over a shortened subject. This makes a difference to the matching process if the pattern begins with a lookbe- hind assertion (including \b or \B). - If an empty string is matched, the next match is done with the + If an empty string is matched, the next match is done with the PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for another, non-empty, match at the same point in the subject. If this - match fails, the start offset is advanced, and the normal match is - retried. This imitates the way Perl handles such cases when using the - /g modifier or the split() function. Normally, the start offset is - advanced by one character, but if the newline convention recognizes - CRLF as a newline, and the current character is CR followed by LF, an + match fails, the start offset is advanced, and the normal match is + retried. This imitates the way Perl handles such cases when using the + /g modifier or the split() function. Normally, the start offset is + advanced by one character, but if the newline convention recognizes + CRLF as a newline, and the current character is CR followed by LF, an advance of two characters occurs. Testing substring extraction functions - The copy and get modifiers can be used to test the pcre2_sub- + The copy and get modifiers can be used to test the pcre2_sub- string_copy_xxx() and pcre2_substring_get_xxx() functions. 
They can be - given more than once, and each can specify a group name or number, for + given more than once, and each can specify a group name or number, for example: abcd\=copy=1,copy=3,get=G1 - If the #subject command is used to set default copy and/or get lists, - these can be unset by specifying a negative number to cancel all num- + If the #subject command is used to set default copy and/or get lists, + these can be unset by specifying a negative number to cancel all num- bered groups and an empty name to cancel all named groups. - The getall modifier tests pcre2_substring_list_get(), which extracts + The getall modifier tests pcre2_substring_list_get(), which extracts all captured substrings. - If the subject line is successfully matched, the substrings extracted - by the convenience functions are output with C, G, or L after the - string number instead of a colon. This is in addition to the normal - full list. The string length (that is, the return from the extraction + If the subject line is successfully matched, the substrings extracted + by the convenience functions are output with C, G, or L after the + string number instead of a colon. This is in addition to the normal + full list. The string length (that is, the return from the extraction function) is given in parentheses after each substring, followed by the name when the extraction was by name. Testing the substitution function - If the replace modifier is set, the pcre2_substitute() function is - called instead of one of the matching functions. Note that replacement - strings cannot contain commas, because a comma signifies the end of a + If the replace modifier is set, the pcre2_substitute() function is + called instead of one of the matching functions. Note that replacement + strings cannot contain commas, because a comma signifies the end of a modifier. This is not thought to be an issue in a test program. 
- Unlike subject strings, pcre2test does not process replacement strings - for escape sequences. In UTF mode, a replacement string is checked to - see if it is a valid UTF-8 string. If so, it is correctly converted to - a UTF string of the appropriate code unit width. If it is not a valid - UTF-8 string, the individual code units are copied directly. This pro- + Unlike subject strings, pcre2test does not process replacement strings + for escape sequences. In UTF mode, a replacement string is checked to + see if it is a valid UTF-8 string. If so, it is correctly converted to + a UTF string of the appropriate code unit width. If it is not a valid + UTF-8 string, the individual code units are copied directly. This pro- vides a means of passing an invalid UTF-8 string for testing purposes. - The following modifiers set options (in additional to the normal match + The following modifiers set options (in additional to the normal match options) for pcre2_substitute(): global PCRE2_SUBSTITUTE_GLOBAL @@ -1114,8 +1126,8 @@ SUBJECT MODIFIERS substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY - After a successful substitution, the modified string is output, pre- - ceded by the number of replacements. This may be zero if there were no + After a successful substitution, the modified string is output, pre- + ceded by the number of replacements. This may be zero if there were no matches. Here is a simple example of a substitution test: /abc/replace=xxx @@ -1124,12 +1136,12 @@ SUBJECT MODIFIERS =abc=abc=\=global 2: =xxx=xxx= - Subject and replacement strings should be kept relatively short (fewer - than 256 characters) for substitution tests, as fixed-size buffers are - used. To make it easy to test for buffer overflow, if the replacement - string starts with a number in square brackets, that number is passed - to pcre2_substitute() as the size of the output buffer, with the - replacement string starting at the next character. 
Here is an example + Subject and replacement strings should be kept relatively short (fewer + than 256 characters) for substitution tests, as fixed-size buffers are + used. To make it easy to test for buffer overflow, if the replacement + string starts with a number in square brackets, that number is passed + to pcre2_substitute() as the size of the output buffer, with the + replacement string starting at the next character. Here is an example that tests the edge case: /abc/ @@ -1138,11 +1150,11 @@ SUBJECT MODIFIERS 123abc123\=replace=[9]XYZ Failed: error -47: no more memory - The default action of pcre2_substitute() is to return - PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if - the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub- - stitute_overflow_length modifier), pcre2_substitute() continues to go - through the motions of matching and substituting, in order to compute + The default action of pcre2_substitute() is to return + PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if + the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub- + stitute_overflow_length modifier), pcre2_substitute() continues to go + through the motions of matching and substituting, in order to compute the size of buffer that is required. When this happens, pcre2test shows the required buffer length (which includes space for the trailing zero) as part of the error message. For example: @@ -1152,140 +1164,140 @@ SUBJECT MODIFIERS Failed: error -47: no more memory: 10 code units are needed A replacement string is ignored with POSIX and DFA matching. Specifying - partial matching provokes an error return ("bad option value") from + partial matching provokes an error return ("bad option value") from pcre2_substitute(). Setting the JIT stack size - The jitstack modifier provides a way of setting the maximum stack size - that is used by the just-in-time optimization code. 
It is ignored if + The jitstack modifier provides a way of setting the maximum stack size + that is used by the just-in-time optimization code. It is ignored if JIT optimization is not being used. The value is a number of kilobytes. Providing a stack that is larger than the default 32K is necessary only for very complicated patterns. Setting match and recursion limits - The match_limit and recursion_limit modifiers set the appropriate lim- + The match_limit and recursion_limit modifiers set the appropriate lim- its in the match context. These values are ignored when the find_limits modifier is specified. Finding minimum limits - If the find_limits modifier is present, pcre2test calls pcre2_match() - several times, setting different values in the match context via - pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds - the minimum values for each parameter that allow pcre2_match() to com- + If the find_limits modifier is present, pcre2test calls pcre2_match() + several times, setting different values in the match context via + pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds + the minimum values for each parameter that allow pcre2_match() to com- plete without error. If JIT is being used, only the match limit is relevant. If DFA matching - is being used, neither limit is relevant, and this modifier is ignored + is being used, neither limit is relevant, and this modifier is ignored (with a warning message). - The match_limit number is a measure of the amount of backtracking that - takes place, and learning the minimum value can be instructive. For - most simple matches, the number is quite small, but for patterns with - very large numbers of matching possibilities, it can become large very - quickly with increasing length of subject string. 
The - match_limit_recursion number is a measure of how much stack (or, if - PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to + The match_limit number is a measure of the amount of backtracking that + takes place, and learning the minimum value can be instructive. For + most simple matches, the number is quite small, but for patterns with + very large numbers of matching possibilities, it can become large very + quickly with increasing length of subject string. The + match_limit_recursion number is a measure of how much stack (or, if + PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to complete the match attempt. Showing MARK names The mark modifier causes the names from backtracking control verbs that - are returned from calls to pcre2_match() to be displayed. If a mark is - returned for a match, non-match, or partial match, pcre2test shows it. - For a match, it is on a line by itself, tagged with "MK:". Otherwise, + are returned from calls to pcre2_match() to be displayed. If a mark is + returned for a match, non-match, or partial match, pcre2test shows it. + For a match, it is on a line by itself, tagged with "MK:". Otherwise, it is added to the non-match message. Showing memory usage - The memory modifier causes pcre2test to log all memory allocation and + The memory modifier causes pcre2test to log all memory allocation and freeing calls that occur during a match operation. Setting a starting offset - The offset modifier sets an offset in the subject string at which + The offset modifier sets an offset in the subject string at which matching starts. Its value is a number of code units, not characters. Setting an offset limit - The offset_limit modifier sets a limit for unanchored matches. If a + The offset_limit modifier sets a limit for unanchored matches. If a match cannot be found starting at or before this offset in the subject, a "no match" return is given. The data value is a number of code units, - not characters. 
When this modifier is used, the use_offset_limit modi- + not characters. When this modifier is used, the use_offset_limit modi- fier must have been set for the pattern; if not, an error is generated. Setting the size of the output vector - The ovector modifier applies only to the subject line in which it - appears, though of course it can also be used to set a default in a - #subject command. It specifies the number of pairs of offsets that are + The ovector modifier applies only to the subject line in which it + appears, though of course it can also be used to set a default in a + #subject command. It specifies the number of pairs of offsets that are available for storing matching information. The default is 15. - A value of zero is useful when testing the POSIX API because it causes + A value of zero is useful when testing the POSIX API because it causes regexec() to be called with a NULL capture vector. When not testing the - POSIX API, a value of zero is used to cause pcre2_match_data_cre- - ate_from_pattern() to be called, in order to create a match block of + POSIX API, a value of zero is used to cause pcre2_match_data_cre- + ate_from_pattern() to be called, in order to create a match block of exactly the right size for the pattern. (It is not possible to create a - match block with a zero-length ovector; there is always at least one + match block with a zero-length ovector; there is always at least one pair of offsets.) Passing the subject as zero-terminated By default, the subject string is passed to a native API matching func- tion with its correct length. In order to test the facility for passing - a zero-terminated string, the zero_terminate modifier is provided. It + a zero-terminated string, the zero_terminate modifier is provided. It causes the length to be passed as PCRE2_ZERO_TERMINATED. 
(When matching - via the POSIX interface, this modifier has no effect, as there is no + via the POSIX interface, this modifier has no effect, as there is no facility for passing a length.) - When testing pcre2_substitute(), this modifier also has the effect of + When testing pcre2_substitute(), this modifier also has the effect of passing the replacement string as zero-terminated. Passing a NULL context - Normally, pcre2test passes a context block to pcre2_match(), + Normally, pcre2test passes a context block to pcre2_match(), pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is - set, however, NULL is passed. This is for testing that the matching + set, however, NULL is passed. This is for testing that the matching functions behave correctly in this case (they use default values). This - modifier cannot be used with the find_limits modifier or when testing + modifier cannot be used with the find_limits modifier or when testing the substitution function. THE ALTERNATIVE MATCHING FUNCTION - By default, pcre2test uses the standard PCRE2 matching function, + By default, pcre2test uses the standard PCRE2 matching function, pcre2_match() to match each subject line. PCRE2 also supports an alter- - native matching function, pcre2_dfa_match(), which operates in a dif- - ferent way, and has some restrictions. The differences between the two + native matching function, pcre2_dfa_match(), which operates in a dif- + ferent way, and has some restrictions. The differences between the two functions are described in the pcre2matching documentation. - If the dfa modifier is set, the alternative matching function is used. - This function finds all possible matches at a given point in the sub- - ject. If, however, the dfa_shortest modifier is set, processing stops - after the first match is found. This is always the shortest possible + If the dfa modifier is set, the alternative matching function is used. 
+ This function finds all possible matches at a given point in the sub- + ject. If, however, the dfa_shortest modifier is set, processing stops + after the first match is found. This is always the shortest possible match. DEFAULT OUTPUT FROM pcre2test - This section describes the output when the normal matching function, + This section describes the output when the normal matching function, pcre2_match(), is being used. - When a match succeeds, pcre2test outputs the list of captured sub- - strings, starting with number 0 for the string that matched the whole - pattern. Otherwise, it outputs "No match" when the return is - PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially - matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that - this is the entire substring that was inspected during the partial - match; it may include characters before the actual match start if a + When a match succeeds, pcre2test outputs the list of captured sub- + strings, starting with number 0 for the string that matched the whole + pattern. Otherwise, it outputs "No match" when the return is + PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially + matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that + this is the entire substring that was inspected during the partial + match; it may include characters before the actual match start if a lookbehind assertion, \K, \b, or \B was involved.) For any other return, pcre2test outputs the PCRE2 negative error number - and a short descriptive phrase. If the error is a failed UTF string - check, the code unit offset of the start of the failing character is + and a short descriptive phrase. If the error is a failed UTF string + check, the code unit offset of the start of the failing character is also output. Here is an example of an interactive pcre2test run. 
$ pcre2test @@ -1301,8 +1313,8 @@ DEFAULT OUTPUT FROM pcre2test Unset capturing substrings that are not followed by one that is set are not shown by pcre2test unless the allcaptures modifier is specified. In the following example, there are two capturing substrings, but when the - first data line is matched, the second, unset substring is not shown. - An "internal" unset substring is shown as "", as for the second + first data line is matched, the second, unset substring is not shown. + An "internal" unset substring is shown as "", as for the second data line. re> /(a)|(b)/ @@ -1314,11 +1326,11 @@ DEFAULT OUTPUT FROM pcre2test 1: 2: b - If the strings contain any non-printing characters, they are output as - \xhh escapes if the value is less than 256 and UTF mode is not set. + If the strings contain any non-printing characters, they are output as + \xhh escapes if the value is less than 256 and UTF mode is not set. Otherwise they are output as \x{hh...} escapes. See below for the defi- - nition of non-printing characters. If the /aftertext modifier is set, - the output for substring 0 is followed by the the rest of the subject + nition of non-printing characters. If the /aftertext modifier is set, + the output for substring 0 is followed by the the rest of the subject string, identified by "0+" like this: re> /cat/aftertext @@ -1326,7 +1338,7 @@ DEFAULT OUTPUT FROM pcre2test 0: cat 0+ aract - If global matching is requested, the results of successive matching + If global matching is requested, the results of successive matching attempts are output in sequence, like this: re> /\Bi(\w\w)/g @@ -1338,8 +1350,8 @@ DEFAULT OUTPUT FROM pcre2test 0: ipp 1: pp - "No match" is output only if the first match attempt fails. Here is an - example of a failure message (the offset 4 that is specified by the + "No match" is output only if the first match attempt fails. 
Here is an + example of a failure message (the offset 4 that is specified by the offset modifier is past the end of the subject string): re> /xyz/ @@ -1347,7 +1359,7 @@ DEFAULT OUTPUT FROM pcre2test Error -24 (bad offset value) Note that whereas patterns can be continued over several lines (a plain - ">" prompt is used for continuations), subject lines may not. However + ">" prompt is used for continuations), subject lines may not. However newlines can be included in a subject by means of the \n escape (or \r, \r\n, etc., depending on the newline sequence setting). @@ -1355,7 +1367,7 @@ DEFAULT OUTPUT FROM pcre2test OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION When the alternative matching function, pcre2_dfa_match(), is used, the - output consists of a list of all the matches that start at the first + output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example: re> /(tang|tangerine|tan)/ @@ -1364,11 +1376,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tang 2: tan - Using the normal matching function on this data finds only "tang". The - longest matching string is always given first (and numbered zero). - After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", - followed by the partially matching substring. Note that this is the - entire substring that was inspected during the partial match; it may + Using the normal matching function on this data finds only "tang". The + longest matching string is always given first (and numbered zero). + After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", + followed by the partially matching substring. Note that this is the + entire substring that was inspected during the partial match; it may include characters before the actual match start if a lookbehind asser- tion, \b, or \B was involved. (\K is not supported for DFA matching.) 
@@ -1384,16 +1396,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tan 0: tan - The alternative matching function does not support substring capture, - so the modifiers that are concerned with captured substrings are not + The alternative matching function does not support substring capture, + so the modifiers that are concerned with captured substrings are not relevant. RESTARTING AFTER A PARTIAL MATCH - When the alternative matching function has given the PCRE2_ERROR_PAR- + When the alternative matching function has given the PCRE2_ERROR_PAR- TIAL return, indicating that the subject partially matched the pattern, - you can restart the match with additional subject data by means of the + you can restart the match with additional subject data by means of the dfa_restart modifier. For example: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ @@ -1402,45 +1414,45 @@ RESTARTING AFTER A PARTIAL MATCH data> n05\=dfa,dfa_restart 0: n05 - For further information about partial matching, see the pcre2partial + For further information about partial matching, see the pcre2partial documentation. CALLOUTS If the pattern contains any callout requests, pcre2test's callout func- - tion is called during matching unless callout_none is specified. This + tion is called during matching unless callout_none is specified. This works with both matching functions. - The callout function in pcre2test returns zero (carry on matching) by - default, but you can use a callout_fail modifier in a subject line (as + The callout function in pcre2test returns zero (carry on matching) by + default, but you can use a callout_fail modifier in a subject line (as described above) to change this and other parameters of the callout. Inserting callouts can be helpful when using pcre2test to check compli- - cated regular expressions. For further information about callouts, see + cated regular expressions. For further information about callouts, see the pcre2callout documentation. 
- The output for callouts with numerical arguments and those with string + The output for callouts with numerical arguments and those with string arguments is slightly different. Callouts with numerical arguments By default, the callout function displays the callout number, the start - and current positions in the subject text at the callout time, and the + and current positions in the subject text at the callout time, and the next pattern item to be tested. For example: --->pqrabcdef 0 ^ ^ \d - This output indicates that callout number 0 occurred for a match - attempt starting at the fourth character of the subject string, when - the pointer was at the seventh character, and when the next pattern - item was \d. Just one circumflex is output if the start and current - positions are the same, or if the current position precedes the start + This output indicates that callout number 0 occurred for a match + attempt starting at the fourth character of the subject string, when + the pointer was at the seventh character, and when the next pattern + item was \d. Just one circumflex is output if the start and current + positions are the same, or if the current position precedes the start position, which can happen if the callout is in a lookbehind assertion. Callouts numbered 255 are assumed to be automatic callouts, inserted as - a result of the /auto_callout pattern modifier. In this case, instead + a result of the /auto_callout pattern modifier. In this case, instead of showing the callout number, the offset in the pattern, preceded by a plus, is output. For example: @@ -1454,7 +1466,7 @@ CALLOUTS 0: E* If a pattern contains (*MARK) items, an additional line is output when- - ever a change of latest mark is passed to the callout function. For + ever a change of latest mark is passed to the callout function. 
For example: re> /a(*MARK:X)bc/auto_callout @@ -1468,17 +1480,17 @@ CALLOUTS +12 ^ ^ 0: abc - The mark changes between matching "a" and "b", but stays the same for - the rest of the match, so nothing more is output. If, as a result of - backtracking, the mark reverts to being unset, the text "" is + The mark changes between matching "a" and "b", but stays the same for + the rest of the match, so nothing more is output. If, as a result of + backtracking, the mark reverts to being unset, the text "" is output. Callouts with string arguments The output for a callout with a string argument is similar, except that - instead of outputting a callout number before the position indicators, - the callout string and its offset in the pattern string are output - before the reflection of the subject string, and the subject string is + instead of outputting a callout number before the position indicators, + the callout string and its offset in the pattern string are output + before the reflection of the subject string, and the subject string is reflected for each callout. For example: re> /^ab(?C'first')cd(?C"second")ef/ @@ -1495,43 +1507,43 @@ CALLOUTS NON-PRINTING CHARACTERS When pcre2test is outputting text in the compiled version of a pattern, - bytes other than 32-126 are always treated as non-printing characters + bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes. - When pcre2test is outputting text that is a matched part of a subject - string, it behaves in the same way, unless a different locale has been - set for the pattern (using the /locale modifier). In this case, the - isprint() function is used to distinguish printing and non-printing + When pcre2test is outputting text that is a matched part of a subject + string, it behaves in the same way, unless a different locale has been + set for the pattern (using the /locale modifier). 
In this case, the + isprint() function is used to distinguish printing and non-printing characters. SAVING AND RESTORING COMPILED PATTERNS - It is possible to save compiled patterns on disc or elsewhere, and + It is possible to save compiled patterns on disc or elsewhere, and reload them later, subject to a number of restrictions. JIT data cannot - be saved. The host on which the patterns are reloaded must be running + be saved. The host on which the patterns are reloaded must be running the same version of PCRE2, with the same code unit width, and must also - have the same endianness, pointer width and PCRE2_SIZE type. Before - compiled patterns can be saved they must be serialized, that is, con- - verted to a stream of bytes. A single byte stream may contain any num- - ber of compiled patterns, but they must all use the same character + have the same endianness, pointer width and PCRE2_SIZE type. Before + compiled patterns can be saved they must be serialized, that is, con- + verted to a stream of bytes. A single byte stream may contain any num- + ber of compiled patterns, but they must all use the same character tables. A single copy of the tables is included in the byte stream (its size is 1088 bytes). - The functions whose names begin with pcre2_serialize_ are used for - serializing and de-serializing. They are described in the pcre2serial- + The functions whose names begin with pcre2_serialize_ are used for + serializing and de-serializing. They are described in the pcre2serial- ize documentation. In this section we describe the features of pcre2test that can be used to test these functions. - When a pattern with push modifier is successfully compiled, it is - pushed onto a stack of compiled patterns, and pcre2test expects the - next line to contain a new pattern (or command) instead of a subject - line. By contrast, the pushcopy modifier causes a copy of the compiled - pattern to be stacked, leaving the original available for immediate - matching. 
By using push and/or pushcopy, a number of patterns can be + When a pattern with push modifier is successfully compiled, it is + pushed onto a stack of compiled patterns, and pcre2test expects the + next line to contain a new pattern (or command) instead of a subject + line. By contrast, the pushcopy modifier causes a copy of the compiled + pattern to be stacked, leaving the original available for immediate + matching. By using push and/or pushcopy, a number of patterns can be compiled and retained. These modifiers are incompatible with posix, and - control modifiers that act at match time are ignored (with a message) - for the stacked patterns. The jitverify modifier applies only at com- + control modifiers that act at match time are ignored (with a message) + for the stacked patterns. The jitverify modifier applies only at com- pile time. The command @@ -1539,21 +1551,21 @@ SAVING AND RESTORING COMPILED PATTERNS #save causes all the stacked patterns to be serialized and the result written - to the named file. Afterwards, all the stacked patterns are freed. The + to the named file. Afterwards, all the stacked patterns are freed. The command #load - reads the data in the file, and then arranges for it to be de-serial- - ized, with the resulting compiled patterns added to the pattern stack. - The pattern on the top of the stack can be retrieved by the #pop com- - mand, which must be followed by lines of subjects that are to be - matched with the pattern, terminated as usual by an empty line or end - of file. This command may be followed by a modifier list containing - only control modifiers that act after a pattern has been compiled. In + reads the data in the file, and then arranges for it to be de-serial- + ized, with the resulting compiled patterns added to the pattern stack. 
+ The pattern on the top of the stack can be retrieved by the #pop com- + mand, which must be followed by lines of subjects that are to be + matched with the pattern, terminated as usual by an empty line or end + of file. This command may be followed by a modifier list containing + only control modifiers that act after a pattern has been compiled. In particular, hex, posix, posix_nosub, push, and pushcopy are not - allowed, nor are any option-setting modifiers. The JIT modifiers are, - however permitted. Here is an example that saves and reloads two pat- + allowed, nor are any option-setting modifiers. The JIT modifiers are, + however permitted. Here is an example that saves and reloads two pat- terns. /abc/push @@ -1566,10 +1578,10 @@ SAVING AND RESTORING COMPILED PATTERNS #pop jit,bincode abc - If jitverify is used with #pop, it does not automatically imply jit, + If jitverify is used with #pop, it does not automatically imply jit, which is different behaviour from when it is used on a pattern. - The #popcopy command is analagous to the pushcopy modifier in that it + The #popcopy command is analagous to the pushcopy modifier in that it makes current a copy of the topmost stack pattern, leaving the original still on the stack. @@ -1589,5 +1601,5 @@ AUTHOR REVISION - Last updated: 02 August 2016 + Last updated: 04 November 2016 Copyright (c) 1997-2016 University of Cambridge. diff --git a/src/pcre2.h b/src/pcre2.h index 6da2ec0..8827a70 100644 --- a/src/pcre2.h +++ b/src/pcre2.h @@ -465,7 +465,9 @@ PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ PCRE2_EXP_DECL void PCRE2_CALL_CONVENTION \ pcre2_code_free(pcre2_code *); \ PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ - *pcre2_code_copy(const pcre2_code *); + *pcre2_code_copy(const pcre2_code *); \ +PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ + *pcre2_code_copy_with_tables(const pcre2_code *); /* Functions that give information about a compiled pattern. 
*/ @@ -629,6 +631,7 @@ pcre2_compile are called by application code. */ #define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_) #define pcre2_code_copy PCRE2_SUFFIX(pcre2_code_copy_) +#define pcre2_code_copy_with_tables PCRE2_SUFFIX(pcre2_code_copy_with_tables_) #define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_) #define pcre2_compile PCRE2_SUFFIX(pcre2_compile_) #define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_) diff --git a/src/pcre2.h.in b/src/pcre2.h.in index 9f4c4eb..96c29ff 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -465,7 +465,9 @@ PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ PCRE2_EXP_DECL void PCRE2_CALL_CONVENTION \ pcre2_code_free(pcre2_code *); \ PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ - *pcre2_code_copy(const pcre2_code *); + *pcre2_code_copy(const pcre2_code *); \ +PCRE2_EXP_DECL pcre2_code PCRE2_CALL_CONVENTION \ + *pcre2_code_copy_with_tables(const pcre2_code *); /* Functions that give information about a compiled pattern. */ @@ -629,6 +631,7 @@ pcre2_compile are called by application code. 
*/ #define pcre2_callout_enumerate PCRE2_SUFFIX(pcre2_callout_enumerate_) #define pcre2_code_copy PCRE2_SUFFIX(pcre2_code_copy_) +#define pcre2_code_copy_with_tables PCRE2_SUFFIX(pcre2_code_copy_with_tables_) #define pcre2_code_free PCRE2_SUFFIX(pcre2_code_free_) #define pcre2_compile PCRE2_SUFFIX(pcre2_compile_) #define pcre2_compile_context_copy PCRE2_SUFFIX(pcre2_compile_context_copy_) diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index 2482f48..4f72872 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -1036,7 +1036,46 @@ if ((code->flags & PCRE2_DEREF_TABLES) != 0) ref_count = (PCRE2_SIZE *)(code->tables + tables_length); (*ref_count)++; } + +return newcode; +} + + +/************************************************* +* Copy compiled code and character tables * +*************************************************/ + +/* Compiled JIT code cannot be copied, so the new compiled block has no +associated JIT data. This version of code_copy also makes a separate copy of +the character tables. */ + +PCRE2_EXP_DEFN pcre2_code * PCRE2_CALL_CONVENTION +pcre2_code_copy_with_tables(const pcre2_code *code) +{ +PCRE2_SIZE* ref_count; +pcre2_code *newcode; +uint8_t *newtables; + +if (code == NULL) return NULL; +newcode = code->memctl.malloc(code->blocksize, code->memctl.memory_data); +if (newcode == NULL) return NULL; +memcpy(newcode, code, code->blocksize); +newcode->executable_jit = NULL; + +newtables = code->memctl.malloc(tables_length + sizeof(PCRE2_SIZE), + code->memctl.memory_data); +if (newtables == NULL) + { + code->memctl.free((void *)newcode, code->memctl.memory_data); + return NULL; + } +memcpy(newtables, code->tables, tables_length); +ref_count = (PCRE2_SIZE *)(newtables + tables_length); +*ref_count = 1; + +newcode->tables = newtables; +newcode->flags |= PCRE2_DEREF_TABLES; return newcode; } @@ -2367,7 +2406,7 @@ while (ptr < ptrend) assertion, possibly preceded by a callout. 
If the value is 1, we have just had the callout and expect an assertion. There must be at least 3 more characters in all cases. We know that the current character is an opening - parenthesis, as otherwise we wouldn't be here. Note that expect_cond_assert + parenthesis, as otherwise we wouldn't be here. Note that expect_cond_assert may be negative, since all callouts just decrement it. */ if (expect_cond_assert > 0) @@ -2377,23 +2416,23 @@ while (ptr < ptrend) { case CHAR_C: ok = expect_cond_assert == 2; - break; - + break; + case CHAR_EQUALS_SIGN: case CHAR_EXCLAMATION_MARK: break; - + case CHAR_LESS_THAN_SIGN: ok = ptr[2] == CHAR_EQUALS_SIGN || ptr[2] == CHAR_EXCLAMATION_MARK; break; - + default: - ok = FALSE; - } + ok = FALSE; + } if (!ok) { - ptr--; /* Adjust error offset */ + ptr--; /* Adjust error offset */ errorcode = ERR28; goto FAILED; } @@ -3559,7 +3598,7 @@ while (ptr < ptrend) if (*ptr == CHAR_QUESTION_MARK) { *parsed_pattern++ = META_COND_ASSERT; - ptr--; /* Pull pointer back to the opening parenthesis. */ + ptr--; /* Pull pointer back to the opening parenthesis. */ expect_cond_assert = 2; break; /* End of conditional */ } diff --git a/src/pcre2test.c b/src/pcre2test.c index a257deb..f23dedf 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -427,15 +427,13 @@ so many of them that they are split into two fields. */ #define CTL_NULLCONTEXT 0x00200000u #define CTL_POSIX 0x00400000u #define CTL_POSIX_NOSUB 0x00800000u -#define CTL_PUSH 0x01000000u -#define CTL_PUSHCOPY 0x02000000u -#define CTL_STARTCHAR 0x04000000u -#define CTL_USE_LENGTH 0x08000000u /* Same word as HEXPAT */ -#define CTL_UTF8_INPUT 0x10000000u -#define CTL_ZERO_TERMINATE 0x20000000u - -#define CTL_NL_SET 0x40000000u /* Informational */ -#define CTL_BSR_SET 0x80000000u /* Informational */ +#define CTL_PUSH 0x01000000u /* These three must be */ +#define CTL_PUSHCOPY 0x02000000u /* all in the same */ +#define CTL_PUSHTABLESCOPY 0x04000000u /* word. 
*/ +#define CTL_STARTCHAR 0x08000000u +#define CTL_USE_LENGTH 0x10000000u /* Same word as HEXPAT */ +#define CTL_UTF8_INPUT 0x20000000u +#define CTL_ZERO_TERMINATE 0x40000000u /* Second control word */ @@ -444,6 +442,9 @@ so many of them that they are split into two fields. */ #define CTL2_SUBSTITUTE_UNKNOWN_UNSET 0x00000004u #define CTL2_SUBSTITUTE_UNSET_EMPTY 0x00000008u +#define CTL_NL_SET 0x40000000u /* Informational */ +#define CTL_BSR_SET 0x80000000u /* Informational */ + /* Combinations */ #define CTL_DEBUG (CTL_FULLBINCODE|CTL_INFO) /* For setting */ @@ -607,7 +608,8 @@ static modstruct modlist[] = { { "posix_nosub", MOD_PAT, MOD_CTL, CTL_POSIX|CTL_POSIX_NOSUB, PO(control) }, { "ps", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_SOFT, DO(options) }, { "push", MOD_PAT, MOD_CTL, CTL_PUSH, PO(control) }, - { "pushcopy", MOD_PAT, MOD_CTL, CTL_PUSHCOPY, PO(control) }, + { "pushcopy", MOD_PAT, MOD_CTL, CTL_PUSHCOPY, PO(control) }, + { "pushtablescopy", MOD_PAT, MOD_CTL, CTL_PUSHTABLESCOPY, PO(control) }, { "recursion_limit", MOD_CTM, MOD_INT, 0, MO(recursion_limit) }, { "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) }, { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) }, @@ -651,10 +653,10 @@ static modstruct modlist[] = { #define PUSH_SUPPORTED_COMPILE_CONTROLS ( \ CTL_BINCODE|CTL_CALLOUT_INFO|CTL_FULLBINCODE|CTL_HEXPAT|CTL_INFO| \ - CTL_JITVERIFY|CTL_MEMORY|CTL_PUSH|CTL_PUSHCOPY|CTL_BSR_SET|CTL_NL_SET| \ + CTL_JITVERIFY|CTL_MEMORY|CTL_PUSH|CTL_PUSHCOPY|CTL_PUSHTABLESCOPY| \ CTL_USE_LENGTH) -#define PUSH_SUPPORTED_COMPILE_CONTROLS2 (0) +#define PUSH_SUPPORTED_COMPILE_CONTROLS2 (CTL_BSR_SET|CTL_NL_SET) /* Controls that apply only at compile time with 'push'. */ @@ -664,7 +666,7 @@ static modstruct modlist[] = { /* Controls that are forbidden with #pop or #popcopy. 
*/ #define NOTPOP_CONTROLS (CTL_HEXPAT|CTL_POSIX|CTL_POSIX_NOSUB|CTL_PUSH| \ - CTL_PUSHCOPY|CTL_USE_LENGTH) + CTL_PUSHCOPY|CTL_PUSHTABLESCOPY|CTL_USE_LENGTH) /* Pattern controls that are mutually exclusive. At present these are all in the first control word. Note that CTL_POSIX_NOSUB is always accompanied by @@ -674,6 +676,7 @@ static uint32_t exclusive_pat_controls[] = { CTL_POSIX | CTL_HEXPAT, CTL_POSIX | CTL_PUSH, CTL_POSIX | CTL_PUSHCOPY, + CTL_POSIX | CTL_PUSHTABLESCOPY, CTL_POSIX | CTL_USE_LENGTH, CTL_EXPAND | CTL_HEXPAT }; @@ -973,6 +976,14 @@ are supported. */ else \ a = (void *)pcre2_code_copy_32(G(b,32)) +#define PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(a,b) \ + if (test_mode == PCRE8_MODE) \ + a = (void *)pcre2_code_copy_with_tables_8(G(b,8)); \ + else if (test_mode == PCRE16_MODE) \ + a = (void *)pcre2_code_copy_with_tables_16(G(b,16)); \ + else \ + a = (void *)pcre2_code_copy_with_tables_32(G(b,32)) + #define PCRE2_COMPILE(a,b,c,d,e,f,g) \ if (test_mode == PCRE8_MODE) \ G(a,8) = pcre2_compile_8(G(b,8),c,d,e,f,g); \ @@ -1436,6 +1447,12 @@ the three different cases. */ else \ a = (void *)G(pcre2_code_copy_,BITTWO)(G(b,BITTWO)) +#define PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(a,b) \ + if (test_mode == G(G(PCRE,BITONE),_MODE)) \ + a = (void *)G(pcre2_code_copy_with_tables_,BITONE)(G(b,BITONE)); \ + else \ + a = (void *)G(pcre2_code_copy_with_tables_,BITTWO)(G(b,BITTWO)) + #define PCRE2_COMPILE(a,b,c,d,e,f,g) \ if (test_mode == G(G(PCRE,BITONE),_MODE)) \ G(a,BITONE) = G(pcre2_compile_,BITONE)(G(b,BITONE),c,d,e,f,g); \ @@ -1773,6 +1790,7 @@ the three different cases. 
*/ (int (*)(struct pcre2_callout_enumerate_block_8 *, void *))b,c) #define PCRE2_CODE_COPY_FROM_VOID(a,b) G(a,8) = pcre2_code_copy_8(b) #define PCRE2_CODE_COPY_TO_VOID(a,b) a = (void *)pcre2_code_copy_8(G(b,8)) +#define PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(a,b) a = (void *)pcre2_code_copy_with_tables_8(G(b,8)) #define PCRE2_COMPILE(a,b,c,d,e,f,g) \ G(a,8) = pcre2_compile_8(G(b,8),c,d,e,f,g) #define PCRE2_DFA_MATCH(a,b,c,d,e,f,g,h,i,j) \ @@ -1868,6 +1886,7 @@ the three different cases. */ (int (*)(struct pcre2_callout_enumerate_block_16 *, void *))b,c) #define PCRE2_CODE_COPY_FROM_VOID(a,b) G(a,16) = pcre2_code_copy_16(b) #define PCRE2_CODE_COPY_TO_VOID(a,b) a = (void *)pcre2_code_copy_16(G(b,16)) +#define PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(a,b) a = (void *)pcre2_code_copy_with_tables_16(G(b,16)) #define PCRE2_COMPILE(a,b,c,d,e,f,g) \ G(a,16) = pcre2_compile_16(G(b,16),c,d,e,f,g) #define PCRE2_DFA_MATCH(a,b,c,d,e,f,g,h,i,j) \ @@ -1963,6 +1982,7 @@ the three different cases. */ (int (*)(struct pcre2_callout_enumerate_block_32 *, void *))b,c) #define PCRE2_CODE_COPY_FROM_VOID(a,b) G(a,32) = pcre2_code_copy_32(b) #define PCRE2_CODE_COPY_TO_VOID(a,b) a = (void *)pcre2_code_copy_32(G(b,32)) +#define PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(a,b) a = (void *)pcre2_code_copy_with_tables_32(G(b,32)) #define PCRE2_COMPILE(a,b,c,d,e,f,g) \ G(a,32) = pcre2_compile_32(G(b,32),c,d,e,f,g) #define PCRE2_DFA_MATCH(a,b,c,d,e,f,g,h,i,j) \ @@ -3435,8 +3455,8 @@ for (;;) #else *((uint16_t *)field) = PCRE2_BSR_UNICODE; #endif - if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control &= ~CTL_BSR_SET; - else dctl->control &= ~CTL_BSR_SET; + if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control2 &= ~CTL_BSR_SET; + else dctl->control2 &= ~CTL_BSR_SET; } else { @@ -3445,8 +3465,8 @@ for (;;) else if (len == 7 && strncmpic(pp, (const uint8_t *)"unicode", 7) == 0) *((uint16_t *)field) = PCRE2_BSR_UNICODE; else goto INVALID_VALUE; - if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control |= 
CTL_BSR_SET; - else dctl->control |= CTL_BSR_SET; + if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control2 |= CTL_BSR_SET; + else dctl->control2 |= CTL_BSR_SET; } pp = ep; break; @@ -3513,14 +3533,14 @@ for (;;) if (i == 0) { *((uint16_t *)field) = NEWLINE_DEFAULT; - if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control &= ~CTL_NL_SET; - else dctl->control &= ~CTL_NL_SET; + if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control2 &= ~CTL_NL_SET; + else dctl->control2 &= ~CTL_NL_SET; } else { *((uint16_t *)field) = i; - if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control |= CTL_NL_SET; - else dctl->control |= CTL_NL_SET; + if (ctx == CTX_PAT || ctx == CTX_DEFPAT) pctl->control2 |= CTL_NL_SET; + else dctl->control2 |= CTL_NL_SET; } pp = ep; break; @@ -3691,7 +3711,7 @@ Returns: nothing static void show_controls(uint32_t controls, uint32_t controls2, const char *before) { -fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", +fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", before, ((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "", ((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "", @@ -3699,7 +3719,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s ((controls & CTL_ALLUSEDTEXT) != 0)? " allusedtext" : "", ((controls & CTL_ALTGLOBAL) != 0)? " altglobal" : "", ((controls & CTL_BINCODE) != 0)? " bincode" : "", - ((controls & CTL_BSR_SET) != 0)? " bsr" : "", + ((controls2 & CTL_BSR_SET) != 0)? " bsr" : "", ((controls & CTL_CALLOUT_CAPTURE) != 0)? " callout_capture" : "", ((controls & CTL_CALLOUT_INFO) != 0)? " callout_info" : "", ((controls & CTL_CALLOUT_NONE) != 0)? " callout_none" : "", @@ -3715,12 +3735,13 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s ((controls & CTL_JITVERIFY) != 0)? " jitverify" : "", ((controls & CTL_MARK) != 0)? " mark" : "", ((controls & CTL_MEMORY) != 0)? 
" memory" : "", - ((controls & CTL_NL_SET) != 0)? " newline" : "", + ((controls2 & CTL_NL_SET) != 0)? " newline" : "", ((controls & CTL_NULLCONTEXT) != 0)? " null_context" : "", ((controls & CTL_POSIX) != 0)? " posix" : "", ((controls & CTL_POSIX_NOSUB) != 0)? " posix_nosub" : "", ((controls & CTL_PUSH) != 0)? " push" : "", ((controls & CTL_PUSHCOPY) != 0)? " pushcopy" : "", + ((controls & CTL_PUSHTABLESCOPY) != 0)? " pushtablescopy" : "", ((controls & CTL_STARTCHAR) != 0)? " startchar" : "", ((controls2 & CTL2_SUBSTITUTE_EXTENDED) != 0)? " substitute_extended" : "", ((controls2 & CTL2_SUBSTITUTE_OVERFLOW_LENGTH) != 0)? " substitute_overflow_length" : "", @@ -4061,7 +4082,7 @@ if ((pat_patctl.control & CTL_INFO) != 0) if (jchanged) fprintf(outfile, "Duplicate name status changes\n"); - if ((pat_patctl.control & CTL_BSR_SET) != 0 || + if ((pat_patctl.control2 & CTL_BSR_SET) != 0 || (FLD(compiled_code, flags) & PCRE2_BSR_SET) != 0) fprintf(outfile, "\\R matches %s\n", (bsr_convention == PCRE2_BSR_UNICODE)? "any Unicode newline" : "CR, LF, or CRLF"); @@ -4930,7 +4951,7 @@ if ((pat_patctl.control & CTL_POSIX) != 0) /* Handle compiling via the native interface. Controls that act later are ignored with "push". Replacements are locked out. */ -if ((pat_patctl.control & (CTL_PUSH|CTL_PUSHCOPY)) != 0) +if ((pat_patctl.control & (CTL_PUSH|CTL_PUSHCOPY|CTL_PUSHTABLESCOPY)) != 0) { if (pat_patctl.replacement[0] != 0) { @@ -5031,7 +5052,7 @@ if (test_mode == PCRE32_MODE && pbuffer32 != NULL) appropriate default newline setting, local_newline_default will be non-zero. We use this if there is no explicit newline modifier. 
*/ -if ((pat_patctl.control & CTL_NL_SET) == 0 && local_newline_default != 0) +if ((pat_patctl.control2 & CTL_NL_SET) == 0 && local_newline_default != 0) { SETFLD(pat_context, newline_convention, local_newline_default); } @@ -5163,7 +5184,7 @@ if (pattern_info(PCRE2_INFO_MAXLOOKBEHIND, &maxlookbehind, FALSE) != 0) /* If an explicit newline modifier was given, set the information flag in the pattern so that it is preserved over push/pop. */ -if ((pat_patctl.control & CTL_NL_SET) != 0) +if ((pat_patctl.control2 & CTL_NL_SET) != 0) { SETFLD(compiled_code, flags, FLD(compiled_code, flags) | PCRE2_NL_SET); } @@ -5191,17 +5212,25 @@ if ((pat_patctl.control & CTL_PUSH) != 0) SET(compiled_code, NULL); } -/* The "pushcopy" control is similar, but pushes a copy of the pattern. This -tests the pcre2_code_copy() function. */ +/* The "pushcopy" and "pushtablescopy" controls are similar, but push a +copy of the pattern, the latter with a copy of its character tables. This tests +the pcre2_code_copy() and pcre2_code_copy_with_tables() functions. 
*/ -if ((pat_patctl.control & CTL_PUSHCOPY) != 0) +if ((pat_patctl.control & (CTL_PUSHCOPY|CTL_PUSHTABLESCOPY)) != 0) { if (patstacknext >= PATSTACKSIZE) { fprintf(outfile, "** Too many pushed patterns (max %d)\n", PATSTACKSIZE); return PR_ABEND; } - PCRE2_CODE_COPY_TO_VOID(patstack[patstacknext++], compiled_code); + if ((pat_patctl.control & CTL_PUSHCOPY) != 0) + { + PCRE2_CODE_COPY_TO_VOID(patstack[patstacknext++], compiled_code); + } + else + { + PCRE2_CODE_COPY_WITH_TABLES_TO_VOID(patstack[patstacknext++], + compiled_code); } } return PR_OK; diff --git a/testdata/testinput20 b/testdata/testinput20 index c920e2a..c87a07e 100644 --- a/testdata/testinput20 +++ b/testdata/testinput20 @@ -88,4 +88,13 @@ #pop should give an error +/abcd/pushtablescopy + abcd + +#popcopy + abcd + +#pop + abcd + # End of testinput20 diff --git a/testdata/testoutput20 b/testdata/testoutput20 index 952b0bb..db99866 100644 --- a/testdata/testoutput20 +++ b/testdata/testoutput20 @@ -135,4 +135,16 @@ Serialization failed: error -30: patterns do not all use the same character tabl #pop should give an error ** Can't pop off an empty stack +/abcd/pushtablescopy + abcd + 0: abcd + +#popcopy + abcd + 0: abcd + +#pop + abcd + 0: abcd + # End of testinput20
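Review note on the CTL_BSR_SET/CTL_NL_SET move: in the definitions above, CTL_NL_SET (second control word) has the same value, 0x40000000u, as CTL_ZERO_TERMINATE (first control word). Any test left pointing at `control` instead of `control2` would therefore silently read the zero_terminate bit. The following is a minimal sketch of that hazard, not code from pcre2test: the `demo_ctl` structure and both function names are hypothetical stand-ins, and only the three flag values are taken from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the diff. */
#define CTL_ZERO_TERMINATE 0x40000000u  /* first control word */
#define CTL_NL_SET         0x40000000u  /* second control word, informational */
#define CTL_BSR_SET        0x80000000u  /* second control word, informational */

/* Hypothetical stand-in for the two control fields of a pattern control
block; the real pcre2test structure has many more members. */
typedef struct {
  uint32_t control;
  uint32_t control2;
} demo_ctl;

/* Check as written after this change: consult the second control word. */
static bool newline_was_set(const demo_ctl *c)
{
return (c->control2 & CTL_NL_SET) != 0;
}

/* What a stale check against the first control word would report. Because
the bit value is shared, this actually tests CTL_ZERO_TERMINATE. */
static bool newline_was_set_stale(const demo_ctl *c)
{
return (c->control & CTL_NL_SET) != 0;
}
```

With only zero_terminate set in the first word, `newline_was_set()` reports false while the stale variant reports true, which is why every `control & CTL_NL_SET` and `control & CTL_BSR_SET` test in the patch has to change together with the definitions.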