From 4352f00bb91afab5c159ede32e90a2a65f924144 Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Mon, 20 Oct 2014 16:48:14 +0000 Subject: [PATCH] More documentation --- Makefile.am | 24 +- doc/html/pcre2perform.html | 196 +++++++++++++ doc/html/pcre2posix.html | 292 +++++++++++++++++++ doc/html/pcre2sample.html | 106 +++++++ doc/html/pcre2stack.html | 203 ++++++++++++++ doc/html/pcre2syntax.html | 561 +++++++++++++++++++++++++++++++++++++ doc/pcre2perform.3 | 178 ++++++++++++ doc/pcre2posix.3 | 268 ++++++++++++++++++ doc/pcre2sample.3 | 94 +++++++ doc/pcre2stack.3 | 199 +++++++++++++ doc/pcre2syntax.3 | 540 +++++++++++++++++++++++++++++++++++ 11 files changed, 2649 insertions(+), 12 deletions(-) create mode 100644 doc/html/pcre2perform.html create mode 100644 doc/html/pcre2posix.html create mode 100644 doc/html/pcre2sample.html create mode 100644 doc/html/pcre2stack.html create mode 100644 doc/html/pcre2syntax.html create mode 100644 doc/pcre2perform.3 create mode 100644 doc/pcre2posix.3 create mode 100644 doc/pcre2sample.3 create mode 100644 doc/pcre2stack.3 create mode 100644 doc/pcre2syntax.3 diff --git a/Makefile.am b/Makefile.am index 57ab2fc..5b62573 100644 --- a/Makefile.am +++ b/Makefile.am @@ -36,6 +36,11 @@ dist_html_DATA = \ doc/html/pcre2matching.html \ doc/html/pcre2partial.html \ doc/html/pcre2pattern.html \ + doc/html/pcre2perform.html \ + doc/html/pcre2posix.html \ + doc/html/pcre2sample.html \ + doc/html/pcre2stack.html \ + doc/html/pcre2syntax.html \ doc/html/pcre2test.html \ doc/html/pcre2unicode.html @@ -66,12 +71,7 @@ dist_html_DATA = \ # doc/html/pcre2_utf16_to_host_byte_order.html \ # doc/html/pcre2_utf32_to_host_byte_order.html \ # doc/html/pcre2_version.html \ -# doc/html/pcre2perform.html \ -# doc/html/pcre2posix.html \ -# doc/html/pcre2precompile.html \ -# doc/html/pcre2sample.html \ -# doc/html/pcre2stack.html \ -# doc/html/pcre2syntax.html +# doc/html/pcre2precompile.html # FIXME dist_man_MANS = \ @@ -88,6 +88,11 @@ dist_man_MANS = \ doc/pcre2matching.3 \ doc/pcre2partial.3 \ doc/pcre2pattern.3 \ + doc/pcre2perform.3 \ + doc/pcre2posix.3 \ + doc/pcre2sample.3 \ + doc/pcre2stack.3 \ + doc/pcre2syntax.3 \ doc/pcre2test.1 \ doc/pcre2unicode.3 @@ -120,12 +125,7 @@ dist_man_MANS = \ # doc/pcre2_utf16_to_host_byte_order.3 \ # doc/pcre2_utf32_to_host_byte_order.3 \ # doc/pcre2_version.3 \ -# doc/pcre2perform.3 \ -# doc/pcre2posix.3 \ -# doc/pcre2precompile.3 \ -# doc/pcre2sample.3 \ -# doc/pcre2stack.3 \ -# doc/pcre2syntax.3 +# doc/pcre2precompile.3 # The Libtool libraries to install. We'll add to this later. diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html new file mode 100644 index 0000000..0e182e7 --- /dev/null +++ b/doc/html/pcre2perform.html @@ -0,0 +1,196 @@ + + +pcre2perform specification + + +

pcre2perform man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+
+PCRE2 PERFORMANCE +
+

+Two aspects of performance are discussed below: memory usage and processing +time. The way you express your pattern as a regular expression can affect both +of them. +

+
+COMPILED PATTERN MEMORY USAGE +
+

+Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, +so that most simple patterns do not use much memory. However, there is one case +where the memory usage of a compiled pattern can be unexpectedly large. If a +parenthesized subpattern has a quantifier with a minimum greater than 1 and/or +a limited maximum, the whole subpattern is repeated in the compiled code. For +example, the pattern +

+  (abc|def){2,4}
+
+is compiled as if it were +
+  (abc|def)(abc|def)((abc|def)(abc|def)?)?
+
+(Technical aside: It is done this way so that backtrack points within each of +the repetitions can be independently maintained.) +

+

+For regular expressions whose quantifiers use only small numbers, this is not +usually a problem. However, if the numbers are large, and particularly if such +repetitions are nested, the memory usage can become an embarrassment. For +example, the very simple pattern +

+  ((ab){1,1000}c){1,3}
+
+uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled +with its default internal pointer size of two bytes, the size limit on a +compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this +is reached with the above pattern if the outer repetition is increased from 3 +to 4. PCRE2 can be compiled to use larger internal pointers and thus handle +larger compiled patterns, but it is better to try to rewrite your pattern to +use less memory if you can. +

+

+One way of reducing the memory usage for such patterns is to make use of +PCRE2's +"subroutine" +facility. Re-writing the above pattern as +

+  ((ab)(?2){0,999}c)(?1){0,2}
+
+reduces the memory requirements to 18K, and indeed it remains under 20K even +with the outer repetition increased to 100. However, this pattern is not +exactly equivalent, because the "subroutine" calls are treated as +atomic groups +into which there can be no backtracking if there is a subsequent matching +failure. Therefore, PCRE2 cannot do this kind of rewriting automatically. +Furthermore, there is a noticeable loss of speed when executing the modified +pattern. Nevertheless, if the atomic grouping is not a problem and the loss of +speed is acceptable, this kind of rewriting will allow you to process patterns +that PCRE2 cannot otherwise handle. +

+
+STACK USAGE AT RUN TIME +
+

+When pcre2_match() is used for matching, certain kinds of pattern can +cause it to use large amounts of the process stack. In some environments the +default process stack is quite small, and if it runs out the result is often +SIGSEGV. Rewriting your pattern can often help. The +pcre2stack +documentation discusses this issue in detail. +

+
+PROCESSING TIME +
+

+Certain items in regular expression patterns are processed more efficiently +than others. It is more efficient to use a character class like [aeiou] than a +set of single-character alternatives such as (a|e|i|o|u). In general, the +simplest construction that provides the required behaviour is usually the most +efficient. Jeffrey Friedl's book contains a lot of useful general discussion +about optimizing regular expressions for efficient performance. This document +contains a few observations about PCRE2. +

+

+Using Unicode character properties (the \p, \P, and \X escapes) is slow, +because PCRE2 has to use a multi-stage table lookup whenever it needs a +character's property. If you can find an alternative pattern that does not use +character properties, it will probably be faster. +

+

+By default, the escape sequences \b, \d, \s, and \w, and the POSIX +character classes such as [:alpha:] do not use Unicode properties, partly for +backwards compatibility, and partly for performance reasons. However, you can +set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode +character properties to be used. This can double the matching time for items +such as \d, when matched with pcre2_match(); the performance loss is +less with a DFA matching function, and in both cases there is not much +difference for \b. +

+

+When a pattern begins with .* not in parentheses, or in parentheses that are +not the subject of a backreference, and the PCRE2_DOTALL option is set, the +pattern is implicitly anchored by PCRE2, since it can match only at the start +of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make +this optimization, because the dot metacharacter does not then match a newline, +and if the subject string contains newlines, the pattern may match from the +character immediately following one of them instead of from the very start. For +example, the pattern +

+  .*second
+
+matches the subject "first\nand second" (where \n stands for a newline +character), with the match starting at the seventh character. In order to do +this, PCRE2 has to retry the match starting after every newline in the subject. +

+

+If you are using such a pattern with subject strings that do not contain +newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting +the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2 +from having to scan along the subject looking for a newline to restart at. +

+

+Beware of patterns that contain nested indefinite repeats. These can take a +long time to run when applied to a string that does not match. Consider the +pattern fragment +

+  ^(a+)*
+
+This can match "aaaa" in 16 different ways, and this number increases very +rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 +times, and for each of those cases other than 0 or 4, the + repeats can match +different numbers of times.) When the remainder of the pattern is such that the +entire match is going to fail, PCRE2 has in principle to try every possible +variation, and this can take an extremely long time, even for relatively short +strings. +

+

+An optimization catches some of the more simple cases such as +

+  (a+)*b
+
+where a literal character follows. Before embarking on the standard matching +procedure, PCRE2 checks that there is a "b" later in the subject string, and if +there is not, it fails the match immediately. However, when there is no +following literal this optimization cannot be used. You can see the difference +by comparing the behaviour of +
+  (a+)*\d
+
+with the pattern above. The former gives a failure almost instantly when +applied to a whole line of "a" characters, whereas the latter takes an +appreciable time with strings longer than about 20 characters. +

+

+In many cases, the solution to this kind of performance issue is to use an +atomic group or a possessive quantifier. +

+
+AUTHOR +
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +
+

+
+REVISION +
+

+Last updated: 20 October 2014 +
+Copyright © 1997-2014 University of Cambridge. +
+

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2posix.html b/doc/html/pcre2posix.html new file mode 100644 index 0000000..93829b3 --- /dev/null +++ b/doc/html/pcre2posix.html @@ -0,0 +1,292 @@ + + +pcre2posix specification + + +

pcre2posix man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+

+
SYNOPSIS
+

+#include <pcre2posix.h> +

+

+int regcomp(regex_t *preg, const char *pattern, + int cflags); +
+
+int regexec(const regex_t *preg, const char *string, + size_t nmatch, regmatch_t pmatch[], int eflags); +
+
+size_t regerror(int errcode, const regex_t *preg, + char *errbuf, size_t errbuf_size); +
+
+void regfree(regex_t *preg); +

+
DESCRIPTION
+

+This set of functions provides a POSIX-style API for the PCRE2 regular +expression 8-bit library. See the +pcre2api +documentation for a description of PCRE2's native API, which contains much +additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit +and 32-bit libraries. +

+

+The functions described here are just wrapper functions that ultimately call +the PCRE2 native API. Their prototypes are defined in the pcre2posix.h +header file, and on Unix systems the library itself is called +libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to the +command for linking an application that uses them. Because the POSIX functions +call the native ones, it is also necessary to add -lpcre2-8. +

+

+Those POSIX option bits that can reasonably be mapped to PCRE2 native options +have been implemented. In addition, the option REG_EXTENDED is defined with the +value zero. This has no effect, but since programs that are written to the +POSIX interface often use it, this makes it easier to slot in PCRE2 as a +replacement library. Other POSIX options are not even defined. +

+

+There are also some other options that are not defined by POSIX. These have +been added at the request of users who want to make use of certain +PCRE2-specific features via the POSIX calling interface. +

+

+When PCRE2 is called via these functions, it is only the API that is POSIX-like +in style. The syntax and semantics of the regular expressions themselves are +still those of Perl, subject to the setting of various PCRE2 options, as +described below. "POSIX-like in style" means that the API approximates to the +POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding +domains it is probably even less compatible. +

+

+The header for these functions is supplied as pcre2posix.h to avoid any +potential clash with other POSIX libraries. It can, of course, be renamed or +aliased as regex.h, which is the "correct" name. It provides two +structure types, regex_t for compiled internal forms, and +regmatch_t for returning captured substrings. It also defines some +constants whose names start with "REG_"; these are used for setting options and +identifying error codes. +

+
COMPILING A PATTERN
+

+The function regcomp() is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument pattern. The preg argument is a pointer +to a regex_t structure that is used as a base for storing information +about the compiled regular expression. +

+

+The argument cflags is either zero, or contains one or more of the bits +defined by the following macros: +

+  REG_DOTALL
+
+The PCRE2_DOTALL option is set when the regular expression is passed for +compilation to the native function. Note that REG_DOTALL is not part of the +POSIX standard. +
+  REG_ICASE
+
+The PCRE2_CASELESS option is set when the regular expression is passed for +compilation to the native function. +
+  REG_NEWLINE
+
+The PCRE2_MULTILINE option is set when the regular expression is passed for +compilation to the native function. Note that this does not mimic the +defined POSIX behaviour for REG_NEWLINE (see the following section). +
+  REG_NOSUB
+
+The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed +for compilation to the native function. In addition, when a pattern that is +compiled with this flag is passed to regexec() for matching, the +nmatch and pmatch arguments are ignored, and no captured strings +are returned. +
+  REG_UCP
+
+The PCRE2_UCP option is set when the regular expression is passed for +compilation to the native function. This causes PCRE2 to use Unicode properties +when matchine \d, \w, etc., instead of just recognizing ASCII values. Note +that REG_UCP is not part of the POSIX standard. +
+  REG_UNGREEDY
+
+The PCRE2_UNGREEDY option is set when the regular expression is passed for +compilation to the native function. Note that REG_UNGREEDY is not part of the +POSIX standard. +
+  REG_UTF
+
+The PCRE2_UTF option is set when the regular expression is passed for +compilation to the native function. This causes the pattern itself and all data +strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF +is not part of the POSIX standard. +

+

+In the absence of these flags, no options are passed to the native function. +This means the the regex is compiled with PCRE2 default semantics. In +particular, the way it handles newline characters in the subject string is the +Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only +some of the effects specified for REG_NEWLINE. It does not affect the way +newlines are matched by the dot metacharacter (they are not) or by a negative +class such as [^a] (they are). +

+

+The yield of regcomp() is zero on success, and non-zero otherwise. The +preg structure is filled in on success, and one member of the structure +is public: re_nsub contains the number of capturing subpatterns in +the regular expression. Various error codes are defined in the header file. +

+

+NOTE: If the yield of regcomp() is non-zero, you must not attempt to +use the contents of the preg structure. If, for example, you pass it to +regexec(), the result is undefined and your program is likely to crash. +

+
MATCHING NEWLINE CHARACTERS
+

+This area is not simple, because POSIX and Perl take different views of things. +It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was +never intended to be a POSIX engine. The following table lists the different +possibilities for matching newline characters in PCRE2: +

+                          Default   Change with
+
+  . matches newline          no     PCRE2_DOTALL
+  newline matches [^a]       yes    not changeable
+  $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
+  $ matches \n in middle     no     PCRE2_MULTILINE
+  ^ matches \n in middle     no     PCRE2_MULTILINE
+
+This is the equivalent table for POSIX: +
+                          Default   Change with
+
+  . matches newline          yes    REG_NEWLINE
+  newline matches [^a]       yes    REG_NEWLINE
+  $ matches \n at end        no     REG_NEWLINE
+  $ matches \n in middle     no     REG_NEWLINE
+  ^ matches \n in middle     no     REG_NEWLINE
+
+PCRE2's behaviour is the same as Perl's, except that there is no equivalent for +PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop +newline from matching [^a]. +

+

+The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and +PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for +the REG_NEWLINE action. +

+
MATCHING A PATTERN
+

+The function regexec() is called to match a compiled pattern preg +against a given string, which is by default terminated by a zero byte +(but see REG_STARTEND below), subject to the options in eflags. These can +be: +

+  REG_NOTBOL
+
+The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching +function. +
+  REG_NOTEMPTY
+
+The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching +function. Note that REG_NOTEMPTY is not part of the POSIX standard. However, +setting this option can give more POSIX-like behaviour in some situations. +
+  REG_NOTEOL
+
+The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching +function. +
+  REG_STARTEND
+
+The string is considered to start at string + pmatch[0].rm_so and +to have a terminating NUL located at string + pmatch[0].rm_eo +(there need not actually be a NUL at that location), regardless of the value of +nmatch. This is a BSD extension, compatible with but not specified by +IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software +intended to be portable to other systems. Note that a non-zero rm_so does +not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not +how it is matched. +

+

+If the pattern was compiled with the REG_NOSUB flag, no data about any matched +strings is returned. The nmatch and pmatch arguments of +regexec() are ignored. +

+

+If the value of nmatch is zero, or if the value pmatch is NULL, +no data about any matched strings is returned. +

+

+Otherwise,the portion of the string that was matched, and also any captured +substrings, are returned via the pmatch argument, which points to an +array of nmatch structures of type regmatch_t, containing the +members rm_so and rm_eo. These contain the byte offset to the first +character of each substring and the offset to the first character after the end +of each substring, respectively. The 0th element of the vector relates to the +entire portion of string that was matched; subsequent elements relate to +the capturing subpatterns of the regular expression. Unused entries in the +array have both structure members set to -1. +

+

+A successful match yields a zero return; various error codes are defined in the +header file, of which REG_NOMATCH is the "expected" failure code. +

+
ERROR MESSAGES
+

+The regerror() function maps a non-zero errorcode from either +regcomp() or regexec() to a printable message. If preg is not +NULL, the error should have arisen from the use of that structure. A message +terminated by a binary zero is placed in errbuf. The length of the +message, including the zero, is limited to errbuf_size. The yield of the +function is the size of buffer needed to hold the whole message. +

+
MEMORY USAGE
+

+Compiling a regular expression causes memory to be allocated and associated +with the preg structure. The function regfree() frees all such +memory, after which preg may no longer be used as a compiled expression. +

+
AUTHOR
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +
+

+
REVISION
+

+Last updated: 20 October 2014 +
+Copyright © 1997-2014 University of Cambridge. +
+

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2sample.html b/doc/html/pcre2sample.html new file mode 100644 index 0000000..1cba300 --- /dev/null +++ b/doc/html/pcre2sample.html @@ -0,0 +1,106 @@ + + +pcre2sample specification + + +

pcre2sample man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+
+PCRE2 SAMPLE PROGRAM +
+

+A simple, complete demonstration program to get you started with using PCRE2 is +supplied in the file pcre2demo.c in the src directory in the PCRE2 +distribution. A listing of this program is given in the +pcre2demo +documentation. If you do not have a copy of the PCRE2 distribution, you can +save this listing to re-create the contents of pcre2demo.c. +

+

+The demonstration program, which uses the PCRE2 8-bit library, compiles the +regular expression that is its first argument, and matches it against the +subject string in its second argument. No PCRE2 options are set, and default +character tables are used. If matching succeeds, the program outputs the +portion of the subject that matched, together with the contents of any captured +substrings. +

+

+If the -g option is given on the command line, the program then goes on to +check for further matches of the same regular expression in the same subject +string. The logic is a little bit tricky because of the possibility of matching +an empty string. Comments in the code explain what is going on. +

+

+If PCRE2 is installed in the standard include and library directories for your +operating system, you should be able to compile the demonstration program using +this command: +

+  gcc -o pcre2demo pcre2demo.c -lpcre2-8
+
+If PCRE2 is installed elsewhere, you may need to add additional options to the +command line. For example, on a Unix-like system that has PCRE2 installed in +/usr/local, you can compile the demonstration program using a command +like this: +
+  gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
+
+
+

+

+Once you have compiled and linked the demonstration program, you can run simple +tests like this: +

+  ./pcre2demo 'cat|dog' 'the cat sat on the mat'
+  ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
+
+Note that there is a much more comprehensive test program, called +pcre2test, +which supports many more facilities for testing regular expressions using the +PCRE2 libraries. The +pcre2demo +program is provided as a simple coding example. +

+

+If you try to run +pcre2demo +when PCRE2 is not installed in the standard library directory, you may get an +error like this on some operating systems (e.g. Solaris): +

+  ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
+
+This is caused by the way shared library support works on those systems. You +need to add +
+  -R/usr/local/lib
+
+(for example) to the compile command to get round this problem. +

+
+AUTHOR +
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +
+

+
+REVISION +
+

+Last updated: 20 October 2014 +
+Copyright © 1997-2014 University of Cambridge. +
+

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2stack.html b/doc/html/pcre2stack.html new file mode 100644 index 0000000..aff072f --- /dev/null +++ b/doc/html/pcre2stack.html @@ -0,0 +1,203 @@ + + +pcre2stack specification + + +

pcre2stack man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+
+PCRE2 DISCUSSION OF STACK USAGE +
+

+When you call pcre2_match(), it makes use of an internal function called +match(). This calls itself recursively at branch points in the pattern, +in order to remember the state of the match so that it can back up and try a +different alternative after a failure. As matching proceeds deeper and deeper +into the tree of possibilities, the recursion depth increases. The +match() function is also called in other circumstances, for example, +whenever a parenthesized sub-pattern is entered, and in certain cases of +repetition. +

+

+Not all calls of match() increase the recursion depth; for an item such +as a* it may be called several times at the same level, after matching +different numbers of a's. Furthermore, in a number of cases where the result of +the recursive call would immediately be passed back as the result of the +current call (a "tail recursion"), the function is just restarted instead. +

+

+The above comments apply when pcre2_match() is run in its normal +interpretive manner. If the compiled pattern was processed by +pcre2_jit_compile(), and just-in-time compiling was successful, and the +options passed to pcre2_match() were not incompatible, the matching +process uses the JIT-compiled code instead of the match() function. In +this case, the memory requirements are handled entirely differently. See the +pcre2jit +documentation for details. +

+

+The pcre2_dfa_match() function operates in a different way to +pcre2_match(), and uses recursion only when there is a regular expression +recursion or subroutine call in the pattern. This includes the processing of +assertion and "once-only" subpatterns, which are handled like subroutine calls. +Normally, these are never very deep, and the limit on the complexity of +pcre2_dfa_match() is controlled by the amount of workspace it is given. +However, it is possible to write patterns with runaway infinite recursions; +such patterns will cause pcre2_dfa_match() to run out of stack. At +present, there is no protection against this. +

+

+The comments that follow do NOT apply to pcre2_dfa_match(); they are +relevant only for pcre2_match() without the JIT optimization. +

+
+Reducing pcre2_match()'s stack usage +
+

+Each time that the internal match() function is called recursively, it +uses memory from the process stack. For certain kinds of pattern and data, very +large amounts of stack may be needed, despite the recognition of "tail +recursion". You can often reduce the amount of recursion, and therefore the +amount of stack used, by modifying the pattern that is being matched. Consider, +for example, this pattern: +

+  ([^<]|<(?!inet))+
+
+It matches from wherever it starts until it encounters "<inet" or the end of +the data, and is the kind of pattern that might be used when processing an XML +file. Each iteration of the outer parentheses matches either one character that +is not "<" or a "<" that is not followed by "inet". However, each time a +parenthesis is processed, a recursion occurs, so this formulation uses a stack +frame for each matched character. For a long string, a lot of stack is +required. Consider now this rewritten pattern, which matches exactly the same +strings: +
+  ([^<]++|<(?!inet))+
+
+This uses very much less stack, because runs of characters that do not contain +"<" are "swallowed" in one item inside the parentheses. Recursion happens only +when a "<" character that is not followed by "inet" is encountered (and we +assume this is relatively rare). A possessive quantifier is used to stop any +backtracking into the runs of non-"<" characters, but that is not related to +stack usage. +

+

+This example shows that one way of avoiding stack problems when matching long +subject strings is to write repeated parenthesized subpatterns to match more +than one character whenever possible. +

+
+Compiling PCRE2 to use heap instead of stack for pcre2_match() +
+

+In environments where stack memory is constrained, you might want to compile +PCRE2 to use heap memory instead of stack for remembering back-up points when +pcre2_match() is running. This makes it run more slowly, however. Details +of how to do this are given in the +pcre2build +documentation. When built in this way, instead of using the stack, PCRE2 +gets memory for remembering backup points from the heap. By default, the memory +is obtained by calling the system malloc() function, but you can arrange +to supply your own memory management function. For details, see the section +entitled +"The match context" +in the +pcre2api +documentation. Since the block sizes are always the same, it may be possible to +implement customized a memory handler that is more efficient than the standard +function. The memory blocks obtained for this purpose are retained and re-used +if possible while pcre2_match() is running. They are all freed just +before it exits. +

+
+Limiting pcre2_match()'s stack usage +
+

+You can set limits on the number of times the internal match() function +is called, both in total and recursively. If a limit is exceeded, +pcre2_match() returns an error code. Setting suitable limits should +prevent it from running out of stack. The default values of the limits are very +large, and unlikely ever to operate. They can be changed when PCRE2 is built, +and they can also be set when pcre2_match() is called. For details of +these interfaces, see the +pcre2build +documentation and the section entitled +"The match context" +in the +pcre2api +documentation. +

+

+As a very rough rule of thumb, you should reckon on about 500 bytes per +recursion. Thus, if you want to limit your stack usage to 8Mb, you should set +the limit at 16000 recursions. A 64Mb stack, on the other hand, can support +around 128000 recursions. +

+

+The pcre2test test program has a modifier called "find_limits" which, if +applied to a subject line, causes it to find the smallest limits that allow a a +pattern to match. This is done by calling pcre2_match() repeatedly with +different limits. +

+
+Changing stack size in Unix-like systems +
+

+In Unix-like environments, there is not often a problem with the stack unless +very long strings are involved, though the default limit on stack size varies +from system to system. Values from 8Mb to 64Mb are common. You can find your +default limit by running the command: +

+  ulimit -s
+
+Unfortunately, the effect of running out of stack is often SIGSEGV, though +sometimes a more explicit error message is given. You can normally increase the +limit on stack size by code such as this: +
+  struct rlimit rlim;
+  getrlimit(RLIMIT_STACK, &rlim);
+  rlim.rlim_cur = 100*1024*1024;
+  setrlimit(RLIMIT_STACK, &rlim);
+
+This reads the current limits (soft and hard) using getrlimit(), then +attempts to increase the soft limit to 100Mb using setrlimit(). You must +do this before calling pcre2_match(). +

+
+Changing stack size in Mac OS X +
+

+Using setrlimit(), as described above, should also work on Mac OS X. It +is also possible to set a stack size when linking a program. There is a +discussion about stack sizes in Mac OS X at this web site: +http://developer.apple.com/qa/qa2005/qa1419.html. +

+
+AUTHOR +
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +
+

+
+REVISION +
+

+Last updated: 20 October 2014 +
+Copyright © 1997-2014 University of Cambridge. +
+

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html new file mode 100644 index 0000000..c240ca9 --- /dev/null +++ b/doc/html/pcre2syntax.html @@ -0,0 +1,561 @@ + + +pcre2syntax specification + + +

pcre2syntax man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+

+
PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
+

+The full syntax and semantics of the regular expressions that are supported by +PCRE2 are described in the +pcre2pattern +documentation. This document contains a quick-reference summary of the syntax. +

+
QUOTING
+

+

+  \x         where x is non-alphanumeric is a literal x
+  \Q...\E    treat enclosed characters as literal
+
+

+
CHARACTERS
+

+

+  \a         alarm, that is, the BEL character (hex 07)
+  \cx        "control-x", where x is any ASCII character
+  \e         escape (hex 1B)
+  \f         form feed (hex 0C)
+  \n         newline (hex 0A)
+  \r         carriage return (hex 0D)
+  \t         tab (hex 09)
+  \0dd       character with octal code 0dd
+  \ddd       character with octal code ddd, or backreference
+  \o{ddd..}  character with octal code ddd..
+  \xhh       character with hex code hh
+  \x{hhh..}  character with hex code hhh..
+
+Note that \0dd is always an octal code, and that \8 and \9 are the literal +characters "8" and "9". +

+
CHARACTER TYPES
+

+

+  .          any character except newline;
+               in dotall mode, any character whatsoever
+  \C         one data unit, even in UTF mode (best avoided)
+  \d         a decimal digit
+  \D         a character that is not a decimal digit
+  \h         a horizontal white space character
+  \H         a character that is not a horizontal white space character
+  \N         a character that is not a newline
+  \p{xx}     a character with the xx property
+  \P{xx}     a character without the xx property
+  \R         a newline sequence
+  \s         a white space character
+  \S         a character that is not a white space character
+  \v         a vertical white space character
+  \V         a character that is not a vertical white space character
+  \w         a "word" character
+  \W         a "non-word" character
+  \X         a Unicode extended grapheme cluster
+
+By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode +or in the 16-bit and 32-bit libraries. However, if locale-specific matching is +happening, \s and \w may also match characters with code points in the range +128-255. If the PCRE2_UCP option is set, the behaviour of these escape +sequences is changed to use Unicode properties and they match many more +characters. +

+
GENERAL CATEGORY PROPERTIES FOR \p and \P
+

+

+  C          Other
+  Cc         Control
+  Cf         Format
+  Cn         Unassigned
+  Co         Private use
+  Cs         Surrogate
+
+  L          Letter
+  Ll         Lower case letter
+  Lm         Modifier letter
+  Lo         Other letter
+  Lt         Title case letter
+  Lu         Upper case letter
+  L&         Ll, Lu, or Lt
+
+  M          Mark
+  Mc         Spacing mark
+  Me         Enclosing mark
+  Mn         Non-spacing mark
+
+  N          Number
+  Nd         Decimal number
+  Nl         Letter number
+  No         Other number
+
+  P          Punctuation
+  Pc         Connector punctuation
+  Pd         Dash punctuation
+  Pe         Close punctuation
+  Pf         Final punctuation
+  Pi         Initial punctuation
+  Po         Other punctuation
+  Ps         Open punctuation
+
+  S          Symbol
+  Sc         Currency symbol
+  Sk         Modifier symbol
+  Sm         Mathematical symbol
+  So         Other symbol
+
+  Z          Separator
+  Zl         Line separator
+  Zp         Paragraph separator
+  Zs         Space separator
+
+

+
PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
+

+

+  Xan        Alphanumeric: union of properties L and N
+  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
+  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
+  Xuc        Univerally-named character: one that can be
+               represented by a Universal Character Name
+  Xwd        Perl word: property Xan or underscore
+
+Perl and POSIX space are now the same. Perl added VT to its space character set +at release 5.18. +

+
SCRIPT NAMES FOR \p AND \P
+

+Arabic, +Armenian, +Avestan, +Balinese, +Bamum, +Bassa_Vah, +Batak, +Bengali, +Bopomofo, +Brahmi, +Braille, +Buginese, +Buhid, +Canadian_Aboriginal, +Carian, +Caucasian_Albanian, +Chakma, +Cham, +Cherokee, +Common, +Coptic, +Cuneiform, +Cypriot, +Cyrillic, +Deseret, +Devanagari, +Duployan, +Egyptian_Hieroglyphs, +Elbasan, +Ethiopic, +Georgian, +Glagolitic, +Gothic, +Grantha, +Greek, +Gujarati, +Gurmukhi, +Han, +Hangul, +Hanunoo, +Hebrew, +Hiragana, +Imperial_Aramaic, +Inherited, +Inscriptional_Pahlavi, +Inscriptional_Parthian, +Javanese, +Kaithi, +Kannada, +Katakana, +Kayah_Li, +Kharoshthi, +Khmer, +Khojki, +Khudawadi, +Lao, +Latin, +Lepcha, +Limbu, +Linear_A, +Linear_B, +Lisu, +Lycian, +Lydian, +Mahajani, +Malayalam, +Mandaic, +Manichaean, +Meetei_Mayek, +Mende_Kikakui, +Meroitic_Cursive, +Meroitic_Hieroglyphs, +Miao, +Modi, +Mongolian, +Mro, +Myanmar, +Nabataean, +New_Tai_Lue, +Nko, +Ogham, +Ol_Chiki, +Old_Italic, +Old_North_Arabian, +Old_Permic, +Old_Persian, +Old_South_Arabian, +Old_Turkic, +Oriya, +Osmanya, +Pahawh_Hmong, +Palmyrene, +Pau_Cin_Hau, +Phags_Pa, +Phoenician, +Psalter_Pahlavi, +Rejang, +Runic, +Samaritan, +Saurashtra, +Sharada, +Shavian, +Siddham, +Sinhala, +Sora_Sompeng, +Sundanese, +Syloti_Nagri, +Syriac, +Tagalog, +Tagbanwa, +Tai_Le, +Tai_Tham, +Tai_Viet, +Takri, +Tamil, +Telugu, +Thaana, +Thai, +Tibetan, +Tifinagh, +Tirhuta, +Ugaritic, +Vai, +Warang_Citi, +Yi. +

+
CHARACTER CLASSES
+

+

+  [...]       positive character class
+  [^...]      negative character class
+  [x-y]       range (can be used for hex characters)
+  [[:xxx:]]   positive POSIX named set
+  [[:^xxx:]]  negative POSIX named set
+
+  alnum       alphanumeric
+  alpha       alphabetic
+  ascii       0-127
+  blank       space or tab
+  cntrl       control character
+  digit       decimal digit
+  graph       printing, excluding space
+  lower       lower case letter
+  print       printing, including space
+  punct       printing, excluding alphanumeric
+  space       white space
+  upper       upper case letter
+  word        same as \w
+  xdigit      hexadecimal digit
+
+In PCRE2, POSIX character set names recognize only ASCII characters by default, +but some of them use Unicode properties if PCRE2_UCP is set. You can use +\Q...\E inside a character class. +

+
QUANTIFIERS
+

+

+  ?           0 or 1, greedy
+  ?+          0 or 1, possessive
+  ??          0 or 1, lazy
+  *           0 or more, greedy
+  *+          0 or more, possessive
+  *?          0 or more, lazy
+  +           1 or more, greedy
+  ++          1 or more, possessive
+  +?          1 or more, lazy
+  {n}         exactly n
+  {n,m}       at least n, no more than m, greedy
+  {n,m}+      at least n, no more than m, possessive
+  {n,m}?      at least n, no more than m, lazy
+  {n,}        n or more, greedy
+  {n,}+       n or more, possessive
+  {n,}?       n or more, lazy
+
+

+
ANCHORS AND SIMPLE ASSERTIONS
+

+

+  \b          word boundary
+  \B          not a word boundary
+  ^           start of subject
+               also after internal newline in multiline mode
+  \A          start of subject
+  $           end of subject
+               also before newline at end of subject
+               also before internal newline in multiline mode
+  \Z          end of subject
+               also before newline at end of subject
+  \z          end of subject
+  \G          first matching position in subject
+
+

+
MATCH POINT RESET
+

+

+  \K          reset start of match
+
+\K is honoured in positive assertions, but ignored in negative ones. +

+
ALTERNATION
+

+

+  expr|expr|expr...
+
+

+
CAPTURING
+

+

+  (...)           capturing group
+  (?<name>...)    named capturing group (Perl)
+  (?'name'...)    named capturing group (Perl)
+  (?P<name>...)   named capturing group (Python)
+  (?:...)         non-capturing group
+  (?|...)         non-capturing group; reset group numbers for
+                   capturing groups in each alternative
+
+

+
ATOMIC GROUPS
+

+

+  (?>...)         atomic, non-capturing group
+
+

+
COMMENT
+

+

+  (?#....)        comment (not nestable)
+
+

+
OPTION SETTING
+

+

+  (?i)            caseless
+  (?J)            allow duplicate names
+  (?m)            multiline
+  (?s)            single line (dotall)
+  (?U)            default ungreedy (lazy)
+  (?x)            extended (ignore white space)
+  (?-...)         unset option(s)
+
+The following are recognized only at the very start of a pattern or after one +of the newline or \R options with similar syntax. More than one of them may +appear. +
+  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
+  (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
+  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching 
+  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
+  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
+  (*UTF)          set appropriate UTF mode for the library in use
+  (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
+
+Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the +limits set by the caller of pcre2_exec(), not increase them. +

+
NEWLINE CONVENTION
+

+These are recognized only at the very start of the pattern or after option +settings with a similar syntax. +

+  (*CR)           carriage return only
+  (*LF)           linefeed only
+  (*CRLF)         carriage return followed by linefeed
+  (*ANYCRLF)      all three of the above
+  (*ANY)          any Unicode newline sequence
+
+

+
WHAT \R MATCHES
+

+These are recognized only at the very start of the pattern or after option +setting with a similar syntax. +

+  (*BSR_ANYCRLF)  CR, LF, or CRLF
+  (*BSR_UNICODE)  any Unicode newline sequence
+
+

+
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
+

+

+  (?=...)         positive look ahead
+  (?!...)         negative look ahead
+  (?<=...)        positive look behind
+  (?<!...)        negative look behind
+
+Each top-level branch of a look behind must be of a fixed length. +

+
BACKREFERENCES
+

+

+  \n              reference by number (can be ambiguous)
+  \gn             reference by number
+  \g{n}           reference by number
+  \g{-n}          relative reference by number
+  \k<name>        reference by name (Perl)
+  \k'name'        reference by name (Perl)
+  \g{name}        reference by name (Perl)
+  \k{name}        reference by name (.NET)
+  (?P=name)       reference by name (Python)
+
+

+
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
+

+

+  (?R)            recurse whole pattern
+  (?n)            call subpattern by absolute number
+  (?+n)           call subpattern by relative number
+  (?-n)           call subpattern by relative number
+  (?&name)        call subpattern by name (Perl)
+  (?P>name)       call subpattern by name (Python)
+  \g<name>        call subpattern by name (Oniguruma)
+  \g'name'        call subpattern by name (Oniguruma)
+  \g<n>           call subpattern by absolute number (Oniguruma)
+  \g'n'           call subpattern by absolute number (Oniguruma)
+  \g<+n>          call subpattern by relative number (PCRE2 extension)
+  \g'+n'          call subpattern by relative number (PCRE2 extension)
+  \g<-n>          call subpattern by relative number (PCRE2 extension)
+  \g'-n'          call subpattern by relative number (PCRE2 extension)
+
+

+
CONDITIONAL PATTERNS
+

+

+  (?(condition)yes-pattern)
+  (?(condition)yes-pattern|no-pattern)
+
+  (?(n)...        absolute reference condition
+  (?(+n)...       relative reference condition
+  (?(-n)...       relative reference condition
+  (?(<name>)...   named reference condition (Perl)
+  (?('name')...   named reference condition (Perl)
+  (?(name)...     named reference condition (PCRE2)
+  (?(R)...        overall recursion condition
+  (?(Rn)...       specific group recursion condition
+  (?(R&name)...   specific recursion condition
+  (?(DEFINE)...   define subpattern for reference
+  (?(assert)...   assertion condition
+
+

+
BACKTRACKING CONTROL
+

+The following act immediately they are reached: +

+  (*ACCEPT)       force successful match
+  (*FAIL)         force backtrack; synonym (*F)
+  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
+
+The following act only when a subsequent match failure causes a backtrack to +reach them. They all force a match failure, but they differ in what happens +afterwards. Those that advance the start-of-match point do so only if the +pattern is not anchored. +
+  (*COMMIT)       overall failure, no advance of starting point
+  (*PRUNE)        advance to next starting character
+  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
+  (*SKIP)         advance to current matching position
+  (*SKIP:NAME)    advance to position corresponding to an earlier
+                  (*MARK:NAME); if not found, the (*SKIP) is ignored
+  (*THEN)         local failure, backtrack to next alternation
+  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
+
+

+
CALLOUTS
+

+

+  (?C)      callout
+  (?Cn)     callout with data n
+
+

+
SEE ALSO
+

+pcre2pattern(3), pcre2api(3), pcre2callout(3), +pcre2matching(3), pcre2(3). +

+
AUTHOR
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +
+

+
REVISION
+

+Last updated: 20 October 2014 +
+Copyright © 1997-2014 University of Cambridge. +
+

+Return to the PCRE2 index page. +

diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3 new file mode 100644 index 0000000..d01b2ed --- /dev/null +++ b/doc/pcre2perform.3 @@ -0,0 +1,178 @@ +.TH PCRE2PERFORM 3 "20 Ocbober 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 PERFORMANCE" +.rs +.sp +Two aspects of performance are discussed below: memory usage and processing +time. The way you express your pattern as a regular expression can affect both +of them. +. +.SH "COMPILED PATTERN MEMORY USAGE" +.rs +.sp +Patterns are compiled by PCRE2 into a reasonably efficient interpretive code, +so that most simple patterns do not use much memory. However, there is one case +where the memory usage of a compiled pattern can be unexpectedly large. If a +parenthesized subpattern has a quantifier with a minimum greater than 1 and/or +a limited maximum, the whole subpattern is repeated in the compiled code. For +example, the pattern +.sp + (abc|def){2,4} +.sp +is compiled as if it were +.sp + (abc|def)(abc|def)((abc|def)(abc|def)?)? +.sp +(Technical aside: It is done this way so that backtrack points within each of +the repetitions can be independently maintained.) +.P +For regular expressions whose quantifiers use only small numbers, this is not +usually a problem. However, if the numbers are large, and particularly if such +repetitions are nested, the memory usage can become an embarrassment. For +example, the very simple pattern +.sp + ((ab){1,1000}c){1,3} +.sp +uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled +with its default internal pointer size of two bytes, the size limit on a +compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this +is reached with the above pattern if the outer repetition is increased from 3 +to 4. PCRE2 can be compiled to use larger internal pointers and thus handle +larger compiled patterns, but it is better to try to rewrite your pattern to +use less memory if you can. +.P +One way of reducing the memory usage for such patterns is to make use of +PCRE2's +.\" HTML +.\" +"subroutine" +.\" +facility. Re-writing the above pattern as +.sp + ((ab)(?2){0,999}c)(?1){0,2} +.sp +reduces the memory requirements to 18K, and indeed it remains under 20K even +with the outer repetition increased to 100. However, this pattern is not +exactly equivalent, because the "subroutine" calls are treated as +.\" HTML +.\" +atomic groups +.\" +into which there can be no backtracking if there is a subsequent matching +failure. Therefore, PCRE2 cannot do this kind of rewriting automatically. +Furthermore, there is a noticeable loss of speed when executing the modified +pattern. Nevertheless, if the atomic grouping is not a problem and the loss of +speed is acceptable, this kind of rewriting will allow you to process patterns +that PCRE2 cannot otherwise handle. +. +. +.SH "STACK USAGE AT RUN TIME" +.rs +.sp +When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can +cause it to use large amounts of the process stack. In some environments the +default process stack is quite small, and if it runs out the result is often +SIGSEGV. Rewriting your pattern can often help. The +.\" HREF +\fBpcre2stack\fP +.\" +documentation discusses this issue in detail. +. +. +.SH "PROCESSING TIME" +.rs +.sp +Certain items in regular expression patterns are processed more efficiently +than others. It is more efficient to use a character class like [aeiou] than a +set of single-character alternatives such as (a|e|i|o|u). In general, the +simplest construction that provides the required behaviour is usually the most +efficient. Jeffrey Friedl's book contains a lot of useful general discussion +about optimizing regular expressions for efficient performance. This document +contains a few observations about PCRE2. +.P +Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow, +because PCRE2 has to use a multi-stage table lookup whenever it needs a +character's property. If you can find an alternative pattern that does not use +character properties, it will probably be faster. +.P +By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX +character classes such as [:alpha:] do not use Unicode properties, partly for +backwards compatibility, and partly for performance reasons. However, you can +set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode +character properties to be used. This can double the matching time for items +such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is +less with a DFA matching function, and in both cases there is not much +difference for \eb. +.P +When a pattern begins with .* not in parentheses, or in parentheses that are +not the subject of a backreference, and the PCRE2_DOTALL option is set, the +pattern is implicitly anchored by PCRE2, since it can match only at the start +of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make +this optimization, because the dot metacharacter does not then match a newline, +and if the subject string contains newlines, the pattern may match from the +character immediately following one of them instead of from the very start. For +example, the pattern +.sp + .*second +.sp +matches the subject "first\enand second" (where \en stands for a newline +character), with the match starting at the seventh character. In order to do +this, PCRE2 has to retry the match starting after every newline in the subject. +.P +If you are using such a pattern with subject strings that do not contain +newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting +the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2 +from having to scan along the subject looking for a newline to restart at. +.P +Beware of patterns that contain nested indefinite repeats. These can take a +long time to run when applied to a string that does not match. Consider the +pattern fragment +.sp + ^(a+)* +.sp +This can match "aaaa" in 16 different ways, and this number increases very +rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 +times, and for each of those cases other than 0 or 4, the + repeats can match +different numbers of times.) When the remainder of the pattern is such that the +entire match is going to fail, PCRE2 has in principle to try every possible +variation, and this can take an extremely long time, even for relatively short +strings. +.P +An optimization catches some of the more simple cases such as +.sp + (a+)*b +.sp +where a literal character follows. Before embarking on the standard matching +procedure, PCRE2 checks that there is a "b" later in the subject string, and if +there is not, it fails the match immediately. However, when there is no +following literal this optimization cannot be used. You can see the difference +by comparing the behaviour of +.sp + (a+)*\ed +.sp +with the pattern above. The former gives a failure almost instantly when +applied to a whole line of "a" characters, whereas the latter takes an +appreciable time with strings longer than about 20 characters. +.P +In many cases, the solution to this kind of performance issue is to use an +atomic group or a possessive quantifier. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi diff --git a/doc/pcre2posix.3 b/doc/pcre2posix.3 new file mode 100644 index 0000000..5d4a721 --- /dev/null +++ b/doc/pcre2posix.3 @@ -0,0 +1,268 @@ +.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "SYNOPSIS" +.rs +.sp +.B #include +.PP +.nf +.B int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP, +.B " int \fIcflags\fP);" +.sp +.B int regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP, +.B " size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);" +.sp +.B "size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP," +.B " char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);" +.sp +.B void regfree(regex_t *\fIpreg\fP); +.fi +. +.SH DESCRIPTION +.rs +.sp +This set of functions provides a POSIX-style API for the PCRE2 regular +expression 8-bit library. See the +.\" HREF +\fBpcre2api\fP +.\" +documentation for a description of PCRE2's native API, which contains much +additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit +and 32-bit libraries. +.P +The functions described here are just wrapper functions that ultimately call +the PCRE2 native API. Their prototypes are defined in the \fBpcre2posix.h\fP +header file, and on Unix systems the library itself is called +\fBlibpcre2-posix.a\fP, so can be accessed by adding \fB-lpcre2-posix\fP to the +command for linking an application that uses them. Because the POSIX functions +call the native ones, it is also necessary to add \fB-lpcre2-8\fP. +.P +Those POSIX option bits that can reasonably be mapped to PCRE2 native options +have been implemented. In addition, the option REG_EXTENDED is defined with the +value zero. This has no effect, but since programs that are written to the +POSIX interface often use it, this makes it easier to slot in PCRE2 as a +replacement library. Other POSIX options are not even defined. +.P +There are also some other options that are not defined by POSIX. These have +been added at the request of users who want to make use of certain +PCRE2-specific features via the POSIX calling interface. +.P +When PCRE2 is called via these functions, it is only the API that is POSIX-like +in style. The syntax and semantics of the regular expressions themselves are +still those of Perl, subject to the setting of various PCRE2 options, as +described below. "POSIX-like in style" means that the API approximates to the +POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding +domains it is probably even less compatible. +.P +The header for these functions is supplied as \fBpcre2posix.h\fP to avoid any +potential clash with other POSIX libraries. It can, of course, be renamed or +aliased as \fBregex.h\fP, which is the "correct" name. It provides two +structure types, \fIregex_t\fP for compiled internal forms, and +\fIregmatch_t\fP for returning captured substrings. It also defines some +constants whose names start with "REG_"; these are used for setting options and +identifying error codes. +. +. +.SH "COMPILING A PATTERN" +.rs +.sp +The function \fBregcomp()\fP is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer +to a \fBregex_t\fP structure that is used as a base for storing information +about the compiled regular expression. +.P +The argument \fIcflags\fP is either zero, or contains one or more of the bits +defined by the following macros: +.sp + REG_DOTALL +.sp +The PCRE2_DOTALL option is set when the regular expression is passed for +compilation to the native function. Note that REG_DOTALL is not part of the +POSIX standard. +.sp + REG_ICASE +.sp +The PCRE2_CASELESS option is set when the regular expression is passed for +compilation to the native function. +.sp + REG_NEWLINE +.sp +The PCRE2_MULTILINE option is set when the regular expression is passed for +compilation to the native function. Note that this does \fInot\fP mimic the +defined POSIX behaviour for REG_NEWLINE (see the following section). +.sp + REG_NOSUB +.sp +The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed +for compilation to the native function. In addition, when a pattern that is +compiled with this flag is passed to \fBregexec()\fP for matching, the +\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings +are returned. +.sp + REG_UCP +.sp +The PCRE2_UCP option is set when the regular expression is passed for +compilation to the native function. This causes PCRE2 to use Unicode properties +when matchine \ed, \ew, etc., instead of just recognizing ASCII values. Note +that REG_UCP is not part of the POSIX standard. +.sp + REG_UNGREEDY +.sp +The PCRE2_UNGREEDY option is set when the regular expression is passed for +compilation to the native function. Note that REG_UNGREEDY is not part of the +POSIX standard. +.sp + REG_UTF +.sp +The PCRE2_UTF option is set when the regular expression is passed for +compilation to the native function. This causes the pattern itself and all data +strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF +is not part of the POSIX standard. +.P +In the absence of these flags, no options are passed to the native function. +This means the the regex is compiled with PCRE2 default semantics. In +particular, the way it handles newline characters in the subject string is the +Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only +\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way +newlines are matched by the dot metacharacter (they are not) or by a negative +class such as [^a] (they are). +.P +The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The +\fIpreg\fP structure is filled in on success, and one member of the structure +is public: \fIre_nsub\fP contains the number of capturing subpatterns in +the regular expression. Various error codes are defined in the header file. +.P +NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to +use the contents of the \fIpreg\fP structure. If, for example, you pass it to +\fBregexec()\fP, the result is undefined and your program is likely to crash. +. +. +.SH "MATCHING NEWLINE CHARACTERS" +.rs +.sp +This area is not simple, because POSIX and Perl take different views of things. +It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was +never intended to be a POSIX engine. The following table lists the different +possibilities for matching newline characters in PCRE2: +.sp + Default Change with +.sp + . matches newline no PCRE2_DOTALL + newline matches [^a] yes not changeable + $ matches \en at end yes PCRE2_DOLLAR_ENDONLY + $ matches \en in middle no PCRE2_MULTILINE + ^ matches \en in middle no PCRE2_MULTILINE +.sp +This is the equivalent table for POSIX: +.sp + Default Change with +.sp + . matches newline yes REG_NEWLINE + newline matches [^a] yes REG_NEWLINE + $ matches \en at end no REG_NEWLINE + $ matches \en in middle no REG_NEWLINE + ^ matches \en in middle no REG_NEWLINE +.sp +PCRE2's behaviour is the same as Perl's, except that there is no equivalent for +PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop +newline from matching [^a]. +.P +The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and +PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for +the REG_NEWLINE action. +. +. +.SH "MATCHING A PATTERN" +.rs +.sp +The function \fBregexec()\fP is called to match a compiled pattern \fIpreg\fP +against a given \fIstring\fP, which is by default terminated by a zero byte +(but see REG_STARTEND below), subject to the options in \fIeflags\fP. These can +be: +.sp + REG_NOTBOL +.sp +The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching +function. +.sp + REG_NOTEMPTY +.sp +The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching +function. Note that REG_NOTEMPTY is not part of the POSIX standard. However, +setting this option can give more POSIX-like behaviour in some situations. +.sp + REG_NOTEOL +.sp +The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching +function. +.sp + REG_STARTEND +.sp +The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and +to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP +(there need not actually be a NUL at that location), regardless of the value of +\fInmatch\fP. This is a BSD extension, compatible with but not specified by +IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software +intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does +not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not +how it is matched. +.P +If the pattern was compiled with the REG_NOSUB flag, no data about any matched +strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of +\fBregexec()\fP are ignored. +.P +If the value of \fInmatch\fP is zero, or if the value \fIpmatch\fP is NULL, +no data about any matched strings is returned. +.P +Otherwise,the portion of the string that was matched, and also any captured +substrings, are returned via the \fIpmatch\fP argument, which points to an +array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the +members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first +character of each substring and the offset to the first character after the end +of each substring, respectively. The 0th element of the vector relates to the +entire portion of \fIstring\fP that was matched; subsequent elements relate to +the capturing subpatterns of the regular expression. Unused entries in the +array have both structure members set to -1. +.P +A successful match yields a zero return; various error codes are defined in the +header file, of which REG_NOMATCH is the "expected" failure code. +. +. +.SH "ERROR MESSAGES" +.rs +.sp +The \fBregerror()\fP function maps a non-zero errorcode from either +\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not +NULL, the error should have arisen from the use of that structure. A message +terminated by a binary zero is placed in \fIerrbuf\fP. The length of the +message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the +function is the size of buffer needed to hold the whole message. +. +. +.SH MEMORY USAGE +.rs +.sp +Compiling a regular expression causes memory to be allocated and associated +with the \fIpreg\fP structure. The function \fBregfree()\fP frees all such +memory, after which \fIpreg\fP may no longer be used as a compiled expression. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi diff --git a/doc/pcre2sample.3 b/doc/pcre2sample.3 new file mode 100644 index 0000000..3173f3e --- /dev/null +++ b/doc/pcre2sample.3 @@ -0,0 +1,94 @@ +.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 SAMPLE PROGRAM" +.rs +.sp +A simple, complete demonstration program to get you started with using PCRE2 is +supplied in the file \fIpcre2demo.c\fP in the \fBsrc\fP directory in the PCRE2 +distribution. A listing of this program is given in the +.\" HREF +\fBpcre2demo\fP +.\" +documentation. If you do not have a copy of the PCRE2 distribution, you can +save this listing to re-create the contents of \fIpcre2demo.c\fP. +.P +The demonstration program, which uses the PCRE2 8-bit library, compiles the +regular expression that is its first argument, and matches it against the +subject string in its second argument. No PCRE2 options are set, and default +character tables are used. If matching succeeds, the program outputs the +portion of the subject that matched, together with the contents of any captured +substrings. +.P +If the -g option is given on the command line, the program then goes on to +check for further matches of the same regular expression in the same subject +string. The logic is a little bit tricky because of the possibility of matching +an empty string. Comments in the code explain what is going on. +.P +If PCRE2 is installed in the standard include and library directories for your +operating system, you should be able to compile the demonstration program using +this command: +.sp + gcc -o pcre2demo pcre2demo.c -lpcre2-8 +.sp +If PCRE2 is installed elsewhere, you may need to add additional options to the +command line. For example, on a Unix-like system that has PCRE2 installed in +\fI/usr/local\fP, you can compile the demonstration program using a command +like this: +.sp +.\" JOINSH + gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e + -L/usr/local/lib -lpcre2-8 +.sp +.P +Once you have compiled and linked the demonstration program, you can run simple +tests like this: +.sp + ./pcre2demo 'cat|dog' 'the cat sat on the mat' + ./pcre2demo -g 'cat|dog' 'the dog sat on the cat' +.sp +Note that there is a much more comprehensive test program, called +.\" HREF +\fBpcre2test\fP, +.\" +which supports many more facilities for testing regular expressions using the +PCRE2 libraries. The +.\" HREF +\fBpcre2demo\fP +.\" +program is provided as a simple coding example. +.P +If you try to run +.\" HREF +\fBpcre2demo\fP +.\" +when PCRE2 is not installed in the standard library directory, you may get an +error like this on some operating systems (e.g. Solaris): +.sp + ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory +.sp +This is caused by the way shared library support works on those systems. You +need to add +.sp + -R/usr/local/lib +.sp +(for example) to the compile command to get round this problem. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi diff --git a/doc/pcre2stack.3 b/doc/pcre2stack.3 new file mode 100644 index 0000000..a20bb0c --- /dev/null +++ b/doc/pcre2stack.3 @@ -0,0 +1,199 @@ +.TH PCRE2STACK 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 DISCUSSION OF STACK USAGE" +.rs +.sp +When you call \fBpcre2_match()\fP, it makes use of an internal function called +\fBmatch()\fP. This calls itself recursively at branch points in the pattern, +in order to remember the state of the match so that it can back up and try a +different alternative after a failure. As matching proceeds deeper and deeper +into the tree of possibilities, the recursion depth increases. The +\fBmatch()\fP function is also called in other circumstances, for example, +whenever a parenthesized sub-pattern is entered, and in certain cases of +repetition. +.P +Not all calls of \fBmatch()\fP increase the recursion depth; for an item such +as a* it may be called several times at the same level, after matching +different numbers of a's. Furthermore, in a number of cases where the result of +the recursive call would immediately be passed back as the result of the +current call (a "tail recursion"), the function is just restarted instead. +.P +The above comments apply when \fBpcre2_match()\fP is run in its normal +interpretive manner. If the compiled pattern was processed by +\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the +options passed to \fBpcre2_match()\fP were not incompatible, the matching +process uses the JIT-compiled code instead of the \fBmatch()\fP function. In +this case, the memory requirements are handled entirely differently. See the +.\" HREF +\fBpcre2jit\fP +.\" +documentation for details. +.P +The \fBpcre2_dfa_match()\fP function operates in a different way to +\fBpcre2_match()\fP, and uses recursion only when there is a regular expression +recursion or subroutine call in the pattern. This includes the processing of +assertion and "once-only" subpatterns, which are handled like subroutine calls. +Normally, these are never very deep, and the limit on the complexity of +\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given. +However, it is possible to write patterns with runaway infinite recursions; +such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At +present, there is no protection against this. +.P +The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are +relevant only for \fBpcre2_match()\fP without the JIT optimization. +. +. +.SS "Reducing \fBpcre2_match()\fP's stack usage" +.rs +.sp +Each time that the internal \fBmatch()\fP function is called recursively, it +uses memory from the process stack. For certain kinds of pattern and data, very +large amounts of stack may be needed, despite the recognition of "tail +recursion". You can often reduce the amount of recursion, and therefore the +amount of stack used, by modifying the pattern that is being matched. Consider, +for example, this pattern: +.sp + ([^<]|<(?!inet))+ +.sp +It matches from wherever it starts until it encounters " +.\" +"The match context" +.\" +in the +.\" HREF +\fBpcre2api\fP +.\" +documentation. Since the block sizes are always the same, it may be possible to +implement customized a memory handler that is more efficient than the standard +function. The memory blocks obtained for this purpose are retained and re-used +if possible while \fBpcre2_match()\fP is running. They are all freed just +before it exits. +. +. +.SS "Limiting \fBpcre2_match()\fP's stack usage" +.rs +.sp +You can set limits on the number of times the internal \fBmatch()\fP function +is called, both in total and recursively. If a limit is exceeded, +\fBpcre2_match()\fP returns an error code. Setting suitable limits should +prevent it from running out of stack. The default values of the limits are very +large, and unlikely ever to operate. They can be changed when PCRE2 is built, +and they can also be set when \fBpcre2_match()\fP is called. For details of +these interfaces, see the +.\" HREF +\fBpcre2build\fP +.\" +documentation and the section entitled +.\" HTML +.\" +"The match context" +.\" +in the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +.P +As a very rough rule of thumb, you should reckon on about 500 bytes per +recursion. Thus, if you want to limit your stack usage to 8Mb, you should set +the limit at 16000 recursions. A 64Mb stack, on the other hand, can support +around 128000 recursions. +.P +The \fBpcre2test\fP test program has a modifier called "find_limits" which, if +applied to a subject line, causes it to find the smallest limits that allow a a +pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with +different limits. +. +. +.SS "Changing stack size in Unix-like systems" +.rs +.sp +In Unix-like environments, there is not often a problem with the stack unless +very long strings are involved, though the default limit on stack size varies +from system to system. Values from 8Mb to 64Mb are common. You can find your +default limit by running the command: +.sp + ulimit -s +.sp +Unfortunately, the effect of running out of stack is often SIGSEGV, though +sometimes a more explicit error message is given. You can normally increase the +limit on stack size by code such as this: +.sp + struct rlimit rlim; + getrlimit(RLIMIT_STACK, &rlim); + rlim.rlim_cur = 100*1024*1024; + setrlimit(RLIMIT_STACK, &rlim); +.sp +This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then +attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must +do this before calling \fBpcre2_match()\fP. +. +. +.SS "Changing stack size in Mac OS X" +.rs +.sp +Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It +is also possible to set a stack size when linking a program. There is a +discussion about stack sizes in Mac OS X at this web site: +.\" HTML +.\" +http://developer.apple.com/qa/qa2005/qa1419.html. +.\" +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 new file mode 100644 index 0000000..1b886e6 --- /dev/null +++ b/doc/pcre2syntax.3 @@ -0,0 +1,540 @@ +.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" +.rs +.sp +The full syntax and semantics of the regular expressions that are supported by +PCRE2 are described in the +.\" HREF +\fBpcre2pattern\fP +.\" +documentation. This document contains a quick-reference summary of the syntax. +. +. +.SH "QUOTING" +.rs +.sp + \ex where x is non-alphanumeric is a literal x + \eQ...\eE treat enclosed characters as literal +. +. +.SH "CHARACTERS" +.rs +.sp + \ea alarm, that is, the BEL character (hex 07) + \ecx "control-x", where x is any ASCII character + \ee escape (hex 1B) + \ef form feed (hex 0C) + \en newline (hex 0A) + \er carriage return (hex 0D) + \et tab (hex 09) + \e0dd character with octal code 0dd + \eddd character with octal code ddd, or backreference + \eo{ddd..} character with octal code ddd.. + \exhh character with hex code hh + \ex{hhh..} character with hex code hhh.. +.sp +Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal +characters "8" and "9". +. +. +.SH "CHARACTER TYPES" +.rs +.sp + . any character except newline; + in dotall mode, any character whatsoever + \eC one data unit, even in UTF mode (best avoided) + \ed a decimal digit + \eD a character that is not a decimal digit + \eh a horizontal white space character + \eH a character that is not a horizontal white space character + \eN a character that is not a newline + \ep{\fIxx\fP} a character with the \fIxx\fP property + \eP{\fIxx\fP} a character without the \fIxx\fP property + \eR a newline sequence + \es a white space character + \eS a character that is not a white space character + \ev a vertical white space character + \eV a character that is not a vertical white space character + \ew a "word" character + \eW a "non-word" character + \eX a Unicode extended grapheme cluster +.sp +By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode +or in the 16-bit and 32-bit libraries. However, if locale-specific matching is +happening, \es and \ew may also match characters with code points in the range +128-255. If the PCRE2_UCP option is set, the behaviour of these escape +sequences is changed to use Unicode properties and they match many more +characters. +. +. +.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" +.rs +.sp + C Other + Cc Control + Cf Format + Cn Unassigned + Co Private use + Cs Surrogate +.sp + L Letter + Ll Lower case letter + Lm Modifier letter + Lo Other letter + Lt Title case letter + Lu Upper case letter + L& Ll, Lu, or Lt +.sp + M Mark + Mc Spacing mark + Me Enclosing mark + Mn Non-spacing mark +.sp + N Number + Nd Decimal number + Nl Letter number + No Other number +.sp + P Punctuation + Pc Connector punctuation + Pd Dash punctuation + Pe Close punctuation + Pf Final punctuation + Pi Initial punctuation + Po Other punctuation + Ps Open punctuation +.sp + S Symbol + Sc Currency symbol + Sk Modifier symbol + Sm Mathematical symbol + So Other symbol +.sp + Z Separator + Zl Line separator + Zp Paragraph separator + Zs Space separator +. +. +.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP" +.rs +.sp + Xan Alphanumeric: union of properties L and N + Xps POSIX space: property Z or tab, NL, VT, FF, CR + Xsp Perl space: property Z or tab, NL, VT, FF, CR + Xuc Univerally-named character: one that can be + represented by a Universal Character Name + Xwd Perl word: property Xan or underscore +.sp +Perl and POSIX space are now the same. Perl added VT to its space character set +at release 5.18. +. +. +.SH "SCRIPT NAMES FOR \ep AND \eP" +.rs +.sp +Arabic, +Armenian, +Avestan, +Balinese, +Bamum, +Bassa_Vah, +Batak, +Bengali, +Bopomofo, +Brahmi, +Braille, +Buginese, +Buhid, +Canadian_Aboriginal, +Carian, +Caucasian_Albanian, +Chakma, +Cham, +Cherokee, +Common, +Coptic, +Cuneiform, +Cypriot, +Cyrillic, +Deseret, +Devanagari, +Duployan, +Egyptian_Hieroglyphs, +Elbasan, +Ethiopic, +Georgian, +Glagolitic, +Gothic, +Grantha, +Greek, +Gujarati, +Gurmukhi, +Han, +Hangul, +Hanunoo, +Hebrew, +Hiragana, +Imperial_Aramaic, +Inherited, +Inscriptional_Pahlavi, +Inscriptional_Parthian, +Javanese, +Kaithi, +Kannada, +Katakana, +Kayah_Li, +Kharoshthi, +Khmer, +Khojki, +Khudawadi, +Lao, +Latin, +Lepcha, +Limbu, +Linear_A, +Linear_B, +Lisu, +Lycian, +Lydian, +Mahajani, +Malayalam, +Mandaic, +Manichaean, +Meetei_Mayek, +Mende_Kikakui, +Meroitic_Cursive, +Meroitic_Hieroglyphs, +Miao, +Modi, +Mongolian, +Mro, +Myanmar, +Nabataean, +New_Tai_Lue, +Nko, +Ogham, +Ol_Chiki, +Old_Italic, +Old_North_Arabian, +Old_Permic, +Old_Persian, +Old_South_Arabian, +Old_Turkic, +Oriya, +Osmanya, +Pahawh_Hmong, +Palmyrene, +Pau_Cin_Hau, +Phags_Pa, +Phoenician, +Psalter_Pahlavi, +Rejang, +Runic, +Samaritan, +Saurashtra, +Sharada, +Shavian, +Siddham, +Sinhala, +Sora_Sompeng, +Sundanese, +Syloti_Nagri, +Syriac, +Tagalog, +Tagbanwa, +Tai_Le, +Tai_Tham, +Tai_Viet, +Takri, +Tamil, +Telugu, +Thaana, +Thai, +Tibetan, +Tifinagh, +Tirhuta, +Ugaritic, +Vai, +Warang_Citi, +Yi. +. +. +.SH "CHARACTER CLASSES" +.rs +.sp + [...] positive character class + [^...] negative character class + [x-y] range (can be used for hex characters) + [[:xxx:]] positive POSIX named set + [[:^xxx:]] negative POSIX named set +.sp + alnum alphanumeric + alpha alphabetic + ascii 0-127 + blank space or tab + cntrl control character + digit decimal digit + graph printing, excluding space + lower lower case letter + print printing, including space + punct printing, excluding alphanumeric + space white space + upper upper case letter + word same as \ew + xdigit hexadecimal digit +.sp +In PCRE2, POSIX character set names recognize only ASCII characters by default, +but some of them use Unicode properties if PCRE2_UCP is set. You can use +\eQ...\eE inside a character class. +. +. +.SH "QUANTIFIERS" +.rs +.sp + ? 0 or 1, greedy + ?+ 0 or 1, possessive + ?? 0 or 1, lazy + * 0 or more, greedy + *+ 0 or more, possessive + *? 0 or more, lazy + + 1 or more, greedy + ++ 1 or more, possessive + +? 1 or more, lazy + {n} exactly n + {n,m} at least n, no more than m, greedy + {n,m}+ at least n, no more than m, possessive + {n,m}? at least n, no more than m, lazy + {n,} n or more, greedy + {n,}+ n or more, possessive + {n,}? n or more, lazy +. +. +.SH "ANCHORS AND SIMPLE ASSERTIONS" +.rs +.sp + \eb word boundary + \eB not a word boundary + ^ start of subject + also after internal newline in multiline mode + \eA start of subject + $ end of subject + also before newline at end of subject + also before internal newline in multiline mode + \eZ end of subject + also before newline at end of subject + \ez end of subject + \eG first matching position in subject +. +. +.SH "MATCH POINT RESET" +.rs +.sp + \eK reset start of match +.sp +\eK is honoured in positive assertions, but ignored in negative ones. +. +. +.SH "ALTERNATION" +.rs +.sp + expr|expr|expr... +. +. +.SH "CAPTURING" +.rs +.sp + (...) capturing group + (?...) named capturing group (Perl) + (?'name'...) named capturing group (Perl) + (?P...) named capturing group (Python) + (?:...) non-capturing group + (?|...) non-capturing group; reset group numbers for + capturing groups in each alternative +. +. +.SH "ATOMIC GROUPS" +.rs +.sp + (?>...) atomic, non-capturing group +. +. +. +. +.SH "COMMENT" +.rs +.sp + (?#....) comment (not nestable) +. +. +.SH "OPTION SETTING" +.rs +.sp + (?i) caseless + (?J) allow duplicate names + (?m) multiline + (?s) single line (dotall) + (?U) default ungreedy (lazy) + (?x) extended (ignore white space) + (?-...) unset option(s) +.sp +The following are recognized only at the very start of a pattern or after one +of the newline or \eR options with similar syntax. More than one of them may +appear. +.sp + (*LIMIT_MATCH=d) set the match limit to d (decimal number) + (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) + (*NOTEMPTY) set PCRE2_NOTEMPTY when matching + (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching + (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) + (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) + (*UTF) set appropriate UTF mode for the library in use + (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) +.sp +Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the +limits set by the caller of pcre2_exec(), not increase them. +. +. +.SH "NEWLINE CONVENTION" +.rs +.sp +These are recognized only at the very start of the pattern or after option +settings with a similar syntax. +.sp + (*CR) carriage return only + (*LF) linefeed only + (*CRLF) carriage return followed by linefeed + (*ANYCRLF) all three of the above + (*ANY) any Unicode newline sequence +. +. +.SH "WHAT \eR MATCHES" +.rs +.sp +These are recognized only at the very start of the pattern or after option +setting with a similar syntax. +.sp + (*BSR_ANYCRLF) CR, LF, or CRLF + (*BSR_UNICODE) any Unicode newline sequence +. +. +.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" +.rs +.sp + (?=...) positive look ahead + (?!...) negative look ahead + (?<=...) positive look behind + (? reference by name (Perl) + \ek'name' reference by name (Perl) + \eg{name} reference by name (Perl) + \ek{name} reference by name (.NET) + (?P=name) reference by name (Python) +. +. +.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)" +.rs +.sp + (?R) recurse whole pattern + (?n) call subpattern by absolute number + (?+n) call subpattern by relative number + (?-n) call subpattern by relative number + (?&name) call subpattern by name (Perl) + (?P>name) call subpattern by name (Python) + \eg call subpattern by name (Oniguruma) + \eg'name' call subpattern by name (Oniguruma) + \eg call subpattern by absolute number (Oniguruma) + \eg'n' call subpattern by absolute number (Oniguruma) + \eg<+n> call subpattern by relative number (PCRE2 extension) + \eg'+n' call subpattern by relative number (PCRE2 extension) + \eg<-n> call subpattern by relative number (PCRE2 extension) + \eg'-n' call subpattern by relative number (PCRE2 extension) +. +. +.SH "CONDITIONAL PATTERNS" +.rs +.sp + (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) +.sp + (?(n)... absolute reference condition + (?(+n)... relative reference condition + (?(-n)... relative reference condition + (?()... named reference condition (Perl) + (?('name')... named reference condition (Perl) + (?(name)... named reference condition (PCRE2) + (?(R)... overall recursion condition + (?(Rn)... specific group recursion condition + (?(R&name)... specific recursion condition + (?(DEFINE)... define subpattern for reference + (?(assert)... assertion condition +. +. +.SH "BACKTRACKING CONTROL" +.rs +.sp +The following act immediately they are reached: +.sp + (*ACCEPT) force successful match + (*FAIL) force backtrack; synonym (*F) + (*MARK:NAME) set name to be passed back; synonym (*:NAME) +.sp +The following act only when a subsequent match failure causes a backtrack to +reach them. They all force a match failure, but they differ in what happens +afterwards. Those that advance the start-of-match point do so only if the +pattern is not anchored. +.sp + (*COMMIT) overall failure, no advance of starting point + (*PRUNE) advance to next starting character + (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) + (*SKIP) advance to current matching position + (*SKIP:NAME) advance to position corresponding to an earlier + (*MARK:NAME); if not found, the (*SKIP) is ignored + (*THEN) local failure, backtrack to next alternation + (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) +. +. +.SH "CALLOUTS" +.rs +.sp + (?C) callout + (?Cn) callout with data n +. +. +.SH "SEE ALSO" +.rs +.sp +\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3), +\fBpcre2matching\fP(3), \fBpcre2\fP(3). +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 20 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi