More documentation

Philip.Hazel 2014-10-20 16:48:14 +00:00
parent 0dfe4e5e98
commit 4352f00bb9
11 changed files with 2649 additions and 12 deletions


@ -36,6 +36,11 @@ dist_html_DATA = \
 doc/html/pcre2matching.html \
 doc/html/pcre2partial.html \
 doc/html/pcre2pattern.html \
+doc/html/pcre2perform.html \
+doc/html/pcre2posix.html \
+doc/html/pcre2sample.html \
+doc/html/pcre2stack.html \
+doc/html/pcre2syntax.html \
 doc/html/pcre2test.html \
 doc/html/pcre2unicode.html
@ -66,12 +71,7 @@ dist_html_DATA = \
 # doc/html/pcre2_utf16_to_host_byte_order.html \
 # doc/html/pcre2_utf32_to_host_byte_order.html \
 # doc/html/pcre2_version.html \
-# doc/html/pcre2perform.html \
-# doc/html/pcre2posix.html \
-# doc/html/pcre2precompile.html \
-# doc/html/pcre2sample.html \
-# doc/html/pcre2stack.html \
-# doc/html/pcre2syntax.html
+# doc/html/pcre2precompile.html
 # FIXME
 dist_man_MANS = \
@ -88,6 +88,11 @@ dist_man_MANS = \
 doc/pcre2matching.3 \
 doc/pcre2partial.3 \
 doc/pcre2pattern.3 \
+doc/pcre2perform.3 \
+doc/pcre2posix.3 \
+doc/pcre2sample.3 \
+doc/pcre2stack.3 \
+doc/pcre2syntax.3 \
 doc/pcre2test.1 \
 doc/pcre2unicode.3
@ -120,12 +125,7 @@ dist_man_MANS = \
 # doc/pcre2_utf16_to_host_byte_order.3 \
 # doc/pcre2_utf32_to_host_byte_order.3 \
 # doc/pcre2_version.3 \
-# doc/pcre2perform.3 \
-# doc/pcre2posix.3 \
-# doc/pcre2precompile.3 \
-# doc/pcre2sample.3 \
-# doc/pcre2stack.3 \
-# doc/pcre2syntax.3
+# doc/pcre2precompile.3
 # The Libtool libraries to install. We'll add to this later.

196
doc/html/pcre2perform.html Normal file

@ -0,0 +1,196 @@
<html>
<head>
<title>pcre2perform specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2perform man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
PCRE2 PERFORMANCE
</b><br>
<P>
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><b>
COMPILED PATTERN MEMORY USAGE
</b><br>
<P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
<pre>
(abc|def){2,4}
</pre>
is compiled as if it were
<pre>
(abc|def)(abc|def)((abc|def)(abc|def)?)?
</pre>
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
</P>
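<P>
The expansion described above can be sketched as a small C function. This is
only a textual illustration of the "as if" form shown in the example, not
PCRE2's internal representation, and it handles only a limited maximum. The
names <b>expand_repeat()</b>, <b>append()</b>, and <b>append_optional()</b> are
hypothetical helpers invented for this sketch.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst (capacity cap).
   Returns 0 on success, -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);
    if (need >= cap - used)
        return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Append the nested optional tail for k remaining optional copies:
   k == 1 gives G?, k == 2 gives (GG?)?, and so on. */
static int append_optional(char *dst, size_t cap, const char *group, unsigned k)
{
    if (k == 0)
        return 0;
    if (k == 1)
        return (append(dst, cap, group) || append(dst, cap, "?")) ? -1 : 0;
    if (append(dst, cap, "(") || append(dst, cap, group))
        return -1;
    if (append_optional(dst, cap, group, k - 1))
        return -1;
    return append(dst, cap, ")?");
}

/* Write the repeated form of group{min,max} into dst.
   Returns 0 on success, -1 on bad arguments or if dst is too small. */
int expand_repeat(char *dst, size_t cap, const char *group,
                  unsigned min, unsigned max)
{
    if (cap == 0 || max < min)
        return -1;
    dst[0] = '\0';
    for (unsigned i = 0; i < min; i++)      /* the mandatory copies */
        if (append(dst, cap, group))
            return -1;
    return append_optional(dst, cap, group, max - min);
}
```

<P>
For the group "(abc|def)" with a minimum of 2 and a maximum of 4, the function
writes the second form shown in the example above. The length of the output
grows in proportion to the maximum, which is why large or nested counts inflate
the compiled pattern.
</P>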
<P>
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
<pre>
((ab){1,1000}c){1,3}
</pre>
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
is reached with the above pattern if the outer repetition is increased from 3
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
larger compiled patterns, but it is better to try to rewrite your pattern to
use less memory if you can.
</P>
<P>
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
facility. Re-writing the above pattern as
<pre>
((ab)(?2){0,999}c)(?1){0,2}
</pre>
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
</P>
<br><b>
STACK USAGE AT RUN TIME
</b><br>
<P>
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environments the
default process stack is quite small, and if it runs out the result is often
SIGSEGV. Rewriting your pattern can often help. The
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation discusses this issue in detail.
</P>
<br><b>
PROCESSING TIME
</b><br>
<P>
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book "Mastering Regular Expressions" contains a lot
of useful general discussion about optimizing regular expressions for efficient
performance. This document contains a few observations about PCRE2.
</P>
<P>
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
because PCRE2 has to use a multi-stage table lookup whenever it needs a
character's property. If you can find an alternative pattern that does not use
character properties, it will probably be faster.
</P>
<P>
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
character properties to be used. This can double the matching time for items
such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
less with a DFA matching function, and in both cases there is not much
difference for \b.
</P>
<P>
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
pattern is implicitly anchored by PCRE2, since it can match only at the start
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
this optimization, because the dot metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may match from the
character immediately following one of them instead of from the very start. For
example, the pattern
<pre>
.*second
</pre>
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE2 has to retry the match starting after every newline in the subject.
</P>
<P>
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
from having to scan along the subject looking for a newline to restart at.
</P>
<P>
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
<pre>
^(a+)*
</pre>
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE2 has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
</P>
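<P>
One way to arrive at the figure of 16 is to count the choices directly. The
sketch below assumes that a "way" means a choice of how many characters each
iteration of a+ takes, including the choice to stop repeating early. It does
not call PCRE2, and the name <b>count_ways()</b> is hypothetical.
</P>

```c
#include <stdint.h>

/* Number of distinct ways ^(a+)* can match at the start of a run of n 'a's.
   At each point the group either stops repeating (one way), or the next
   a+ iteration takes k characters (1 <= k <= n) and counting continues on
   the remaining n - k characters. */
uint64_t count_ways(unsigned n)
{
    uint64_t total = 1;                 /* the group stops repeating here */
    for (unsigned k = 1; k <= n; k++)
        total += count_ways(n - k);     /* next iteration takes k characters */
    return total;
}
```

<P>
Calling count_ways(4) yields 16, and each additional character doubles the
result. The function is deliberately written as a naive recursion, so its own
running time grows in the same way as the number of variations it counts; keep
the argument small.
</P>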
<P>
An optimization catches some of the more simple cases such as
<pre>
(a+)*b
</pre>
where a literal character follows. Before embarking on the standard matching
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
<pre>
(a+)*\d
</pre>
with the pattern above. The pattern with the literal "b" fails almost instantly
when applied to a whole line of "a" characters, whereas the one ending in \d
takes an appreciable time with strings longer than about 20 characters.
</P>
<P>
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 20 October 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

292
doc/html/pcre2posix.html Normal file
View File

@ -0,0 +1,292 @@
<html>
<head>
<title>pcre2posix specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2posix man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a>
<li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a>
<li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a>
<li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a>
<li><a name="TOC7" href="#SEC7">MEMORY USAGE</a>
<li><a name="TOC8" href="#SEC8">AUTHOR</a>
<li><a name="TOC9" href="#SEC9">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>#include &#60;pcre2posix.h&#62;</b>
</P>
<P>
<b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
<b> int <i>cflags</i>);</b>
<br>
<br>
<b>int regexec(const regex_t *<i>preg</i>, const char *<i>string</i>,</b>
<b> size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
<br>
<br>
<b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
<b> char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
<br>
<br>
<b>void regfree(regex_t *<i>preg</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. See the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation for a description of PCRE2's native API, which contains much
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
and 32-bit libraries.
</P>
<P>
The functions described here are just wrapper functions that ultimately call
the PCRE2 native API. Their prototypes are defined in the <b>pcre2posix.h</b>
header file, and on Unix systems the library itself is called
<b>libpcre2-posix.a</b>, so can be accessed by adding <b>-lpcre2-posix</b> to the
command for linking an application that uses them. Because the POSIX functions
call the native ones, it is also necessary to add <b>-lpcre2-8</b>.
</P>
<P>
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
have been implemented. In addition, the option REG_EXTENDED is defined with the
value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
</P>
<P>
There are also some other options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface.
</P>
<P>
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE2 options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
domains it is probably even less compatible.
</P>
<P>
The header for these functions is supplied as <b>pcre2posix.h</b> to avoid any
potential clash with other POSIX libraries. It can, of course, be renamed or
aliased as <b>regex.h</b>, which is the "correct" name. It provides two
structure types, <i>regex_t</i> for compiled internal forms, and
<i>regmatch_t</i> for returning captured substrings. It also defines some
constants whose names start with "REG_"; these are used for setting options and
identifying error codes.
</P>
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
The function <b>regcomp()</b> is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
to a <b>regex_t</b> structure that is used as a base for storing information
about the compiled regular expression.
</P>
<P>
The argument <i>cflags</i> is either zero, or contains one or more of the bits
defined by the following macros:
<pre>
REG_DOTALL
</pre>
The PCRE2_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
<pre>
REG_ICASE
</pre>
The PCRE2_CASELESS option is set when the regular expression is passed for
compilation to the native function.
<pre>
REG_NEWLINE
</pre>
The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSUB
</pre>
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to <b>regexec()</b> for matching, the
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
are returned.
<pre>
REG_UCP
</pre>
The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode properties
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
that REG_UCP is not part of the POSIX standard.
<pre>
REG_UNGREEDY
</pre>
The PCRE2_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
POSIX standard.
<pre>
REG_UTF
</pre>
The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
is not part of the POSIX standard.
</P>
<P>
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE2 default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
</P>
<P>
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
<i>preg</i> structure is filled in on success, and one member of the structure
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
</P>
<P>
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
use the contents of the <i>preg</i> structure. If, for example, you pass it to
<b>regexec()</b>, the result is undefined and your program is likely to crash.
</P>
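<P>
A minimal sketch of checking the yield of <b>regcomp()</b> before touching
<i>preg</i>, and of reading <i>re_nsub</i>, might look like this. As an
assumption made so that it builds without PCRE2 installed, it includes the
system &#60;regex.h&#62;; to use the PCRE2 wrapper you would include
<b>pcre2posix.h</b> instead and link with the libraries named in the
description above. The function name <b>count_capture_groups()</b> is
hypothetical.
</P>

```c
#include <regex.h>    /* assumption: system header; swap for pcre2posix.h */
#include <stddef.h>

/* Compile pattern, report the number of capturing subpatterns, free it.
   Returns 0 on success, or the non-zero regcomp() error code. */
int count_capture_groups(const char *pattern, int cflags, size_t *nsub)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);
    if (rc != 0)
        return rc;            /* re must not be used after a failure */
    *nsub = re.re_nsub;       /* the one public member of regex_t */
    regfree(&re);
    return 0;
}
```

<P>
REG_EXTENDED can be passed in <i>cflags</i> so that the same source works with
a system library; as noted above, the PCRE2 wrapper defines it as zero.
</P>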
<br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
<P>
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE2:
<pre>
                           Default   Change with
  . matches newline        no        PCRE2_DOTALL
  newline matches [^a]     yes       not changeable
  $ matches \n at end      yes       PCRE2_DOLLAR_ENDONLY
  $ matches \n in middle   no        PCRE2_MULTILINE
  ^ matches \n in middle   no        PCRE2_MULTILINE
</pre>
This is the equivalent table for POSIX:
<pre>
                           Default   Change with
  . matches newline        yes       REG_NEWLINE
  newline matches [^a]     yes       REG_NEWLINE
  $ matches \n at end      no        REG_NEWLINE
  $ matches \n in middle   no        REG_NEWLINE
  ^ matches \n in middle   no        REG_NEWLINE
</pre>
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
newline from matching [^a].
</P>
<P>
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
the REG_NEWLINE action.
</P>
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
<P>
The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i>
against a given <i>string</i>, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can
be:
<pre>
REG_NOTBOL
</pre>
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
function.
<pre>
REG_NOTEMPTY
</pre>
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
setting this option can give more POSIX-like behaviour in some situations.
<pre>
REG_NOTEOL
</pre>
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
function.
<pre>
REG_STARTEND
</pre>
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
(there need not actually be a NUL at that location), regardless of the value of
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
</P>
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
<b>regexec()</b> are ignored.
</P>
<P>
If the value of <i>nmatch</i> is zero, or if the value of <i>pmatch</i> is NULL,
no data about any matched strings is returned.
</P>
<P>
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the <i>pmatch</i> argument, which points to an
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of <i>string</i> that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
</P>
<P>
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
</P>
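<P>
The use of <i>pmatch</i> described above can be sketched as follows. Again the
system &#60;regex.h&#62; is assumed, so that the code builds without PCRE2, and
only the 0th element of the vector is requested. The function name
<b>find_first()</b> is hypothetical.
</P>

```c
#include <regex.h>    /* assumption: system header; swap for pcre2posix.h */

/* Find the first match of pattern in subject.
   On success returns 0 and stores the byte offsets of the match.
   Otherwise returns the non-zero code from regcomp() or regexec(),
   which is REG_NOMATCH when the pattern simply does not match. */
int find_first(const char *pattern, const char *subject,
               long *start, long *end)
{
    regex_t re;
    regmatch_t pmatch[1];
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, subject, 1, pmatch, 0);
    if (rc == 0) {
        *start = (long)pmatch[0].rm_so;   /* offset of first matched byte */
        *end = (long)pmatch[0].rm_eo;     /* offset just past the match */
    }
    regfree(&re);
    return rc;
}
```

<P>
For the pattern "cat|dog" and the subject "the cat sat on the mat", the offsets
returned are 4 and 7, so the matched text has length <i>rm_eo</i> minus
<i>rm_so</i>.
</P>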
<br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br>
<P>
The <b>regerror()</b> function maps a non-zero errorcode from either
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
function is the size of buffer needed to hold the whole message.
</P>
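<P>
Because the yield is the size needed for the whole message, a caller can ask
for the size first and then allocate a buffer of exactly that size. The sketch
below relies on the POSIX rule that a zero <i>errbuf_size</i> causes
<i>errbuf</i> to be ignored; whether the PCRE2 wrapper accepts a zero-size
buffer in the same way is an assumption that should be checked. The system
&#60;regex.h&#62; is assumed, and <b>error_message_dup()</b> is a hypothetical
name.
</P>

```c
#include <regex.h>    /* assumption: system header; swap for pcre2posix.h */
#include <stdlib.h>

/* Return a malloc()ed copy of the message for errcode, or NULL if the
   allocation fails. The caller is responsible for freeing it. */
char *error_message_dup(int errcode, const regex_t *preg)
{
    /* Size query: the yield includes the terminating zero. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed > 0 ? needed : 1);
    if (msg == NULL)
        return NULL;
    msg[0] = '\0';
    if (needed > 0)
        regerror(errcode, preg, msg, needed);
    return msg;
}
```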
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
<P>
Compiling a regular expression causes memory to be allocated and associated
with the <i>preg</i> structure. The function <b>regfree()</b> frees all such
memory, after which <i>preg</i> may no longer be used as a compiled expression.
</P>
<br><a name="SEC8" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 October 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

106
doc/html/pcre2sample.html Normal file
View File

@ -0,0 +1,106 @@
<html>
<head>
<title>pcre2sample specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2sample man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
PCRE2 SAMPLE PROGRAM
</b><br>
<P>
A simple, complete demonstration program to get you started with using PCRE2 is
supplied in the file <i>pcre2demo.c</i> in the <b>src</b> directory in the PCRE2
distribution. A listing of this program is given in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
documentation. If you do not have a copy of the PCRE2 distribution, you can
save this listing to re-create the contents of <i>pcre2demo.c</i>.
</P>
<P>
The demonstration program, which uses the PCRE2 8-bit library, compiles the
regular expression that is its first argument, and matches it against the
subject string in its second argument. No PCRE2 options are set, and default
character tables are used. If matching succeeds, the program outputs the
portion of the subject that matched, together with the contents of any captured
substrings.
</P>
<P>
If the -g option is given on the command line, the program then goes on to
check for further matches of the same regular expression in the same subject
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
</P>
<P>
If PCRE2 is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
this command:
<pre>
gcc -o pcre2demo pcre2demo.c -lpcre2-8
</pre>
If PCRE2 is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE2 installed in
<i>/usr/local</i>, you can compile the demonstration program using a command
like this:
<pre>
gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
</PRE>
</P>
<P>
Once you have compiled and linked the demonstration program, you can run simple
tests like this:
<pre>
./pcre2demo 'cat|dog' 'the cat sat on the mat'
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
</pre>
Note that there is a much more comprehensive test program, called
<a href="pcre2test.html"><b>pcre2test</b></a>,
which supports many more facilities for testing regular expressions using the
PCRE2 libraries. The
<a href="pcre2demo.html"><b>pcre2demo</b></a>
program is provided as a simple coding example.
</P>
<P>
If you try to run
<a href="pcre2demo.html"><b>pcre2demo</b></a>
when PCRE2 is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
<pre>
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
</pre>
This is caused by the way shared library support works on those systems. You
need to add
<pre>
-R/usr/local/lib
</pre>
(for example) to the compile command to get round this problem.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 20 October 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

203
doc/html/pcre2stack.html Normal file
View File

@ -0,0 +1,203 @@
<html>
<head>
<title>pcre2stack specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2stack man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
PCRE2 DISCUSSION OF STACK USAGE
</b><br>
<P>
When you call <b>pcre2_match()</b>, it makes use of an internal function called
<b>match()</b>. This calls itself recursively at branch points in the pattern,
in order to remember the state of the match so that it can back up and try a
different alternative after a failure. As matching proceeds deeper and deeper
into the tree of possibilities, the recursion depth increases. The
<b>match()</b> function is also called in other circumstances, for example,
whenever a parenthesized sub-pattern is entered, and in certain cases of
repetition.
</P>
<P>
Not all calls of <b>match()</b> increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the result of
the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead.
</P>
<P>
The above comments apply when <b>pcre2_match()</b> is run in its normal
interpretive manner. If the compiled pattern was processed by
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
options passed to <b>pcre2_match()</b> were not incompatible, the matching
process uses the JIT-compiled code instead of the <b>match()</b> function. In
this case, the memory requirements are handled entirely differently. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for details.
</P>
<P>
The <b>pcre2_dfa_match()</b> function operates in a different way to
<b>pcre2_match()</b>, and uses recursion only when there is a regular expression
recursion or subroutine call in the pattern. This includes the processing of
assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
present, there is no protection against this.
</P>
<P>
The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
relevant only for <b>pcre2_match()</b> without the JIT optimization.
</P>
<br><b>
Reducing <b>pcre2_match()</b>'s stack usage
</b><br>
<P>
Each time that the internal <b>match()</b> function is called recursively, it
uses memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern:
<pre>
([^&#60;]|&#60;(?!inet))+
</pre>
It matches from wherever it starts until it encounters "&#60;inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
parenthesis is processed, a recursion occurs, so this formulation uses a stack
frame for each matched character. For a long string, a lot of stack is
required. Consider now this rewritten pattern, which matches exactly the same
strings:
<pre>
([^&#60;]++|&#60;(?!inet))+
</pre>
This uses very much less stack, because runs of characters that do not contain
"&#60;" are "swallowed" in one item inside the parentheses. Recursion happens only
when a "&#60;" character that is not followed by "inet" is encountered (and we
assume this is relatively rare). A possessive quantifier is used to stop any
backtracking into the runs of non-"&#60;" characters, but that is not related to
stack usage.
</P>
<P>
This example shows that one way of avoiding stack problems when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
</P>
<br><b>
Compiling PCRE2 to use heap instead of stack for <b>pcre2_match()</b>
</b><br>
<P>
In environments where stack memory is constrained, you might want to compile
PCRE2 to use heap memory instead of stack for remembering back-up points when
<b>pcre2_match()</b> is running. This makes it run more slowly, however. Details
of how to do this are given in the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation. When built in this way, instead of using the stack, PCRE2
gets memory for remembering backup points from the heap. By default, the memory
is obtained by calling the system <b>malloc()</b> function, but you can arrange
to supply your own memory management function. For details, see the section
entitled
<a href="pcre2api.html#matchcontext">"The match context"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. Since the block sizes are always the same, it may be possible to
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while <b>pcre2_match()</b> is running. They are all freed just
before it exits.
</P>
<br><b>
Limiting <b>pcre2_match()</b>'s stack usage
</b><br>
<P>
You can set limits on the number of times the internal <b>match()</b> function
is called, both in total and recursively. If a limit is exceeded,
<b>pcre2_match()</b> returns an error code. Setting suitable limits should
prevent it from running out of stack. The default values of the limits are very
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
and they can also be set when <b>pcre2_match()</b> is called. For details of
these interfaces, see the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation and the section entitled
<a href="pcre2api.html#matchcontext">"The match context"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
around 128000 recursions.
</P>
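<P>
The arithmetic behind these figures can be sketched as a small helper. The
function name and the constant are hypothetical; the 500-byte figure is only
the rough estimate quoted above, and real frame sizes vary with the compiler
and platform.
</P>

```c
#include <stddef.h>

/* Rough estimate from the text: about 500 bytes of stack per recursion. */
#define ASSUMED_BYTES_PER_RECURSION 500u

/* Returns a recursion limit whose estimated stack use fits in the budget. */
unsigned long recursion_limit_for_stack(size_t stack_budget_bytes)
{
    return (unsigned long)(stack_budget_bytes / ASSUMED_BYTES_PER_RECURSION);
}
```

<P>
With a budget of 8000000 bytes this gives 16000, and with 64000000 bytes it
gives 128000. In practice it is prudent to pass a budget somewhat smaller than
the real stack size, because the rest of the program needs stack as well.
</P>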
<P>
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
different limits.
</P>
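<P>
One way such a search could be organised is sketched below: double the limit
until a match succeeds, then bisect between the last failure and the first
success. This is an illustration of the repeated-call idea, not necessarily
what <b>pcre2test</b> does. The names <b>limit_probe</b>,
<b>find_smallest_limit()</b>, and <b>threshold_probe()</b> are hypothetical,
and the probe here is a stand-in for a real call to <b>pcre2_match()</b> with
the limit applied.
</P>

```c
/* Hypothetical callback: returns nonzero if the match succeeds with `limit`. */
typedef int (*limit_probe)(unsigned long limit, void *data);

/* Returns the smallest limit in [1, max_limit] for which probe succeeds, or 0
   if it fails even at max_limit. Assumes that once a limit is large enough,
   every larger limit also succeeds. */
unsigned long find_smallest_limit(limit_probe probe, void *data,
                                  unsigned long max_limit)
{
    unsigned long lo = 0;   /* largest limit known to fail (0 = none tried) */
    unsigned long hi = 1;   /* current candidate */

    /* Phase 1: double until the probe succeeds or the cap is reached. */
    while (!probe(hi, data)) {
        if (hi >= max_limit) return 0;
        lo = hi;
        hi = (hi > max_limit / 2) ? max_limit : hi * 2;
    }

    /* Phase 2: bisect between the last failure and the first success. */
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (probe(mid, data)) hi = mid; else lo = mid;
    }
    return hi;
}

/* Stand-in probe for demonstration: succeeds once limit reaches *data. */
int threshold_probe(unsigned long limit, void *data)
{
    return limit >= *(const unsigned long *)data;
}
```

<P>
The number of probes grows only logarithmically with the answer, which matters
because each probe is a complete matching attempt.
</P>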
<br><b>
Changing stack size in Unix-like systems
</b><br>
<P>
In Unix-like environments, there is not often a problem with the stack unless
very long strings are involved, though the default limit on stack size varies
from system to system. Values from 8Mb to 64Mb are common. You can find your
default limit by running the command:
<pre>
ulimit -s
</pre>
Unfortunately, the effect of running out of stack is often SIGSEGV, though
sometimes a more explicit error message is given. You can normally increase the
limit on stack size by code such as this:
<pre>
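#include &#60;sys/resource.h&#62;

struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);   /* read the current soft and hard limits */
rlim.rlim_cur = 100*1024*1024;    /* ask for a 100Mb soft limit */
setrlimit(RLIMIT_STACK, &rlim);   /* fails if this exceeds the hard limit */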
</pre>
This reads the current limits (soft and hard) using <b>getrlimit()</b>, then
attempts to increase the soft limit to 100Mb using <b>setrlimit()</b>. You must
do this before calling <b>pcre2_match()</b>.
</P>
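<P>
A slightly more defensive variant of the same calls is sketched below. It
checks the return values, leaves an already sufficient limit alone, and clamps
the request to the hard limit. The function name
<b>raise_stack_soft_limit()</b> is hypothetical, and the clamping policy is an
assumption about what a caller would want.
</P>

```c
#include <sys/resource.h>

/* Tries to raise the soft stack limit to `wanted` bytes. Returns 0 on
   success and -1 if getrlimit() or setrlimit() fails. If `wanted` exceeds
   the hard limit, the soft limit is raised to the hard limit instead, so a
   zero return does not guarantee the full amount. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

    /* An unprivileged process cannot set the soft limit above the hard one. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim) == 0 ? 0 : -1;
}
```

<P>
A caller that needs the full amount can call <b>getrlimit()</b> afterwards and
compare the soft limit with the value it asked for.
</P>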
<br><b>
Changing stack size in Mac OS X
</b><br>
<P>
Using <b>setrlimit()</b>, as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
<a href="http://developer.apple.com/qa/qa2005/qa1419.html">http://developer.apple.com/qa/qa2005/qa1419.html</a>.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 20 October 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

561
doc/html/pcre2syntax.html Normal file
View File

@ -0,0 +1,561 @@
<html>
<head>
<title>pcre2syntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2syntax man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
\x where x is non-alphanumeric is a literal x
\Q...\E treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
. any character except newline;
in dotall mode, any character whatsoever
\C one data unit, even in UTF mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal white space character
\H a character that is not a horizontal white space character
\N a character that is not a newline
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
\R a newline sequence
\s a white space character
\S a character that is not a white space character
\v a vertical white space character
\V a character that is not a vertical white space character
\w a "word" character
\W a "non-word" character
\X a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
Xan Alphanumeric: union of properties L and N
Xps POSIX space: property Z or tab, NL, VT, FF, CR
Xsp Perl space: property Z or tab, NL, VT, FF, CR
Xuc Universally-named character: one that can be
represented by a Universal Character Name
Xwd Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
[...] positive character class
[^...] negative character class
[x-y] range (can be used for hex characters)
[[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set
alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space white space
upper upper case letter
word same as \w
xdigit hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
\b word boundary
\B not a word boundary
^ start of subject
also after internal newline in multiline mode
\A start of subject
$ end of subject
also before newline at end of subject
also before internal newline in multiline mode
\Z end of subject
also before newline at end of subject
\z end of subject
\G first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
\K reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
(...) capturing group
(?&#60;name&#62;...) named capturing group (Perl)
(?'name'...) named capturing group (Perl)
(?P&#60;name&#62;...) named capturing group (Python)
(?:...) non-capturing group
(?|...) non-capturing group; reset group numbers for
capturing groups in each alternative
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
(?&#62;...) atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
(?#....) comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
(?i) caseless
(?J) allow duplicate names
(?m) multiline
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) extended (ignore white space)
(?-...) unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
(*CR) carriage return only
(*LF) linefeed only
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
setting with a similar syntax.
<pre>
(*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) positive look ahead
(?!...) negative look ahead
(?&#60;=...) positive look behind
(?&#60;!...) negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g{-n} relative reference by number
\k&#60;name&#62; reference by name (Perl)
\k'name' reference by name (Perl)
\g{name} reference by name (Perl)
\k{name} reference by name (.NET)
(?P=name) reference by name (Python)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
(?n) call subpattern by absolute number
(?+n) call subpattern by relative number
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P&#62;name) call subpattern by name (Python)
\g&#60;name&#62; call subpattern by name (Oniguruma)
\g'name' call subpattern by name (Oniguruma)
\g&#60;n&#62; call subpattern by absolute number (Oniguruma)
\g'n' call subpattern by absolute number (Oniguruma)
\g&#60;+n&#62; call subpattern by relative number (PCRE2 extension)
\g'+n' call subpattern by relative number (PCRE2 extension)
\g&#60;-n&#62; call subpattern by relative number (PCRE2 extension)
\g'-n' call subpattern by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n)... absolute reference condition
(?(+n)... relative reference condition
(?(-n)... relative reference condition
(?(&#60;name&#62;)... named reference condition (Perl)
(?('name')... named reference condition (Perl)
(?(name)... named reference condition (PCRE2)
(?(R)... overall recursion condition
(?(Rn)... specific group recursion condition
(?(R&name)... specific recursion condition
(?(DEFINE)... define subpattern for reference
(?(assert)... assertion condition
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act immediately they are reached:
<pre>
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
(*SKIP) advance to current matching position
(*SKIP:NAME) advance to position corresponding to an earlier
(*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout
(?Cn) callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 20 October 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

178
doc/pcre2perform.3 Normal file
View File

@ -0,0 +1,178 @@
.TH PCRE2PERFORM 3 "20 October 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE"
.rs
.sp
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
.
.SH "COMPILED PATTERN MEMORY USAGE"
.rs
.sp
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
.sp
(abc|def){2,4}
.sp
is compiled as if it were
.sp
(abc|def)(abc|def)((abc|def)(abc|def)?)?
.sp
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
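.P
The shape of this expansion can be sketched with a small helper that writes
out the textual equivalent. It is only an illustration: PCRE2 produces
compiled code rather than text, and the function names are hypothetical. The
point is that the group appears once for every count up to the maximum, so the
size grows with the maximum and nested repeats multiply.

```c
#include <string.h>

/* Appends src to dst if it fits; returns 0 on success, -1 on overflow. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst), add = strlen(src);
    if (used + add + 1 > cap) return -1;
    memcpy(dst + used, src, add + 1);
    return 0;
}

/* Writes the conceptual expansion of group{min,max} into out.
   Requires max >= min and max >= 1; returns 0 on success, -1 otherwise. */
int expand_bounded_repeat(const char *group, unsigned min, unsigned max,
                          char *out, size_t cap)
{
    unsigned i, optional;

    if (cap == 0 || max < min || max == 0) return -1;
    out[0] = '\0';
    optional = max - min;

    for (i = 0; i < min; i++)            /* mandatory copies */
        if (append(out, cap, group) != 0) return -1;

    for (i = 1; i < optional; i++)       /* open the nested optional groups */
        if (append(out, cap, "(") != 0 || append(out, cap, group) != 0)
            return -1;

    if (optional > 0)                    /* innermost optional copy */
        if (append(out, cap, group) != 0 || append(out, cap, "?") != 0)
            return -1;

    for (i = 1; i < optional; i++)       /* close the nested groups */
        if (append(out, cap, ")?") != 0) return -1;

    return 0;
}
```

Called with the group (abc|def), a minimum of 2 and a maximum of 4, this
writes the same expansion as shown above.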
.P
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
.sp
((ab){1,1000}c){1,3}
.sp
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
is reached with the above pattern if the outer repetition is increased from 3
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
larger compiled patterns, but it is better to try to rewrite your pattern to
use less memory if you can.
.P
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
.\" HTML <a href="pcre2pattern.html#subpatternsassubroutines">
.\" </a>
"subroutine"
.\"
facility. Re-writing the above pattern as
.sp
((ab)(?2){0,999}c)(?1){0,2}
.sp
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
.\" HTML <a href="pcre2pattern.html#atomicgroup">
.\" </a>
atomic groups
.\"
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE2 cannot otherwise handle.
.
.
.SH "STACK USAGE AT RUN TIME"
.rs
.sp
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environments the
default process stack is quite small, and if it runs out the result is often
SIGSEGV. Rewriting your pattern can often help. The
.\" HREF
\fBpcre2stack\fP
.\"
documentation discusses this issue in detail.
.
.
.SH "PROCESSING TIME"
.rs
.sp
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
about optimizing regular expressions for efficient performance. This document
contains a few observations about PCRE2.
.P
Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,
because PCRE2 has to use a multi-stage table lookup whenever it needs a
character's property. If you can find an alternative pattern that does not use
character properties, it will probably be faster.
.P
By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
character properties to be used. This can double the matching time for items
such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
less with a DFA matching function, and in both cases there is not much
difference for \eb.
.P
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
pattern is implicitly anchored by PCRE2, since it can match only at the start
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
this optimization, because the dot metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may match from the
character immediately following one of them instead of from the very start. For
example, the pattern
.sp
  .*second
.sp
matches the subject "first\enand second" (where \en stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE2 has to retry the match starting after every newline in the subject.
.P
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
from having to scan along the subject looking for a newline to restart at.
.P
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
.sp
^(a+)*
.sp
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE2 has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
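.P
The figure of 16 can be reproduced by counting: at each point the * repeat
either stops, or the next a+ takes between one and all of the remaining
characters and the count continues on what is left. The sketch below does
exactly that. The function name is hypothetical, and it counts alternatives
only; it does not model the order in which PCRE2 tries them.

```c
/* Counts the ways ^(a+)* can match at the start of a run of n "a"s. */
unsigned long long count_match_ways(unsigned n)
{
    unsigned long long ways = 1;   /* the * repeat stops at this point */
    unsigned len;

    for (len = 1; len <= n; len++)          /* next a+ takes len characters */
        ways += count_match_ways(n - len);
    return ways;
}
```

The count doubles with every additional character (16 for four characters,
1024 for ten), which is why a failing match can take so long on even a
moderately long string.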
.P
An optimization catches some of the more simple cases such as
.sp
(a+)*b
.sp
where a literal character follows. Before embarking on the standard matching
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
.sp
(a+)*\ed
.sp
with the pattern above. The pattern above, (a+)*b, gives a failure almost
instantly when applied to a whole line of "a" characters, whereas (a+)*\ed
takes an appreciable time with strings longer than about 20 characters.
.P
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

268
doc/pcre2posix.3 Normal file
View File

@ -0,0 +1,268 @@
.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SYNOPSIS"
.rs
.sp
.B #include <pcre2posix.h>
.PP
.nf
.B int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP,
.B " int \fIcflags\fP);"
.sp
.B int regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP,
.B " size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);"
.sp
.B "size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,"
.B " char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);"
.sp
.B void regfree(regex_t *\fIpreg\fP);
.fi
.
.SH DESCRIPTION
.rs
.sp
This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. See the
.\" HREF
\fBpcre2api\fP
.\"
documentation for a description of PCRE2's native API, which contains much
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
and 32-bit libraries.
.P
The functions described here are just wrapper functions that ultimately call
the PCRE2 native API. Their prototypes are defined in the \fBpcre2posix.h\fP
header file, and on Unix systems the library itself is called
\fBlibpcre2-posix.a\fP, so can be accessed by adding \fB-lpcre2-posix\fP to the
command for linking an application that uses them. Because the POSIX functions
call the native ones, it is also necessary to add \fB-lpcre2-8\fP.
.P
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
have been implemented. In addition, the option REG_EXTENDED is defined with the
value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
.P
There are also some other options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
PCRE2-specific features via the POSIX calling interface.
.P
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE2 options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
domains it is probably even less compatible.
.P
The header for these functions is supplied as \fBpcre2posix.h\fP to avoid any
potential clash with other POSIX libraries. It can, of course, be renamed or
aliased as \fBregex.h\fP, which is the "correct" name. It provides two
structure types, \fIregex_t\fP for compiled internal forms, and
\fIregmatch_t\fP for returning captured substrings. It also defines some
constants whose names start with "REG_"; these are used for setting options and
identifying error codes.
.
.
.SH "COMPILING A PATTERN"
.rs
.sp
The function \fBregcomp()\fP is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
to a \fBregex_t\fP structure that is used as a base for storing information
about the compiled regular expression.
.P
The argument \fIcflags\fP is either zero, or contains one or more of the bits
defined by the following macros:
.sp
REG_DOTALL
.sp
The PCRE2_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
.sp
REG_ICASE
.sp
The PCRE2_CASELESS option is set when the regular expression is passed for
compilation to the native function.
.sp
REG_NEWLINE
.sp
The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does \fInot\fP mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
.sp
REG_NOSUB
.sp
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to \fBregexec()\fP for matching, the
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
are returned.
.sp
REG_UCP
.sp
The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode properties
when matching \ed, \ew, etc., instead of just recognizing ASCII values. Note
that REG_UCP is not part of the POSIX standard.
.sp
REG_UNGREEDY
.sp
The PCRE2_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
POSIX standard.
.sp
REG_UTF
.sp
The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
is not part of the POSIX standard.
.P
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE2 default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
.P
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
\fIpreg\fP structure is filled in on success, and one member of the structure
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
.P
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
\fBregexec()\fP, the result is undefined and your program is likely to crash.
.
.
.SH "MATCHING NEWLINE CHARACTERS"
.rs
.sp
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE2:
.sp
                        Default   Change with
.sp
  . matches newline        no     PCRE2_DOTALL
  newline matches [^a]     yes    not changeable
  $ matches \en at end      yes    PCRE2_DOLLAR_ENDONLY
  $ matches \en in middle   no     PCRE2_MULTILINE
  ^ matches \en in middle   no     PCRE2_MULTILINE
.sp
This is the equivalent table for POSIX:
.sp
                        Default   Change with
.sp
  . matches newline        yes    REG_NEWLINE
  newline matches [^a]     yes    REG_NEWLINE
  $ matches \en at end      no     REG_NEWLINE
  $ matches \en in middle   no     REG_NEWLINE
  ^ matches \en in middle   no     REG_NEWLINE
.sp
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
newline from matching [^a].
.P
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
the REG_NEWLINE action.
.
.
.SH "MATCHING A PATTERN"
.rs
.sp
The function \fBregexec()\fP is called to match a compiled pattern \fIpreg\fP
against a given \fIstring\fP, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in \fIeflags\fP. These can
be:
.sp
REG_NOTBOL
.sp
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
function.
.sp
REG_NOTEMPTY
.sp
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
setting this option can give more POSIX-like behaviour in some situations.
.sp
REG_NOTEOL
.sp
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
function.
.sp
REG_STARTEND
.sp
The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and
to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
(there need not actually be a NUL at that location), regardless of the value of
\fInmatch\fP. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
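.P
As a minimal sketch of how REG_STARTEND might be used, the hypothetical helper
below (\fBmatch_in_range()\fP is an illustrative name, not a library function)
searches only the bytes from \fIstart\fP up to, but not including, \fIend\fP.
It assumes the system \fI<regex.h>\fP header on a platform that provides this
BSD extension; the PCRE2 wrapper declares the same interface in
\fIpcre2posix.h\fP. It also assumes that the returned offsets are measured from
the start of \fIsubject\fP rather than from the start of the range.
.sp
.nf
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search only subject[start..end) for an already
   compiled pattern. Returns the regexec() result; on success the offsets
   of the match are left in *match. */
int match_in_range(const regex_t *preg, const char *subject,
                   size_t start, size_t end, regmatch_t *match)
{
  match->rm_so = (regoff_t)start;   /* first byte to be examined */
  match->rm_eo = (regoff_t)end;     /* treated as the terminating NUL */
  return regexec(preg, subject, 1, match, REG_STARTEND);
}
```
.fi
.sp
Because the range is passed in through \fIpmatch[0]\fP, the same structure is
used for both input and output, so the caller must reset it before each call.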
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
\fBregexec()\fP are ignored.
.P
If the value of \fInmatch\fP is zero, or if the value of \fIpmatch\fP is NULL,
no data about any matched strings is returned.
.P
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the \fIpmatch\fP argument, which points to an
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of \fIstring\fP that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
.P
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
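.P
The calling sequence described above can be sketched as follows. This is an
illustration only: \fBmatch_once()\fP is a hypothetical helper, and the code
assumes the system \fI<regex.h>\fP header, whose library needs REG_EXTENDED
for parentheses and alternation to be treated as metacharacters. A program
using the PCRE2 wrapper would include \fIpcre2posix.h\fP instead.
.sp
.nf
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern, match it once against subject, and
   store up to nmatch offset pairs in pmatch. Returns 0 on a match,
   REG_NOMATCH if there is none, or another non-zero code on error. */
int match_once(const char *pattern, const char *subject,
               size_t nmatch, regmatch_t *pmatch)
{
  regex_t preg;
  int rc = regcomp(&preg, pattern, REG_EXTENDED);
  if (rc != 0) return rc;           /* preg must not be used after failure */
  rc = regexec(&preg, subject, nmatch, pmatch, 0);
  regfree(&preg);                   /* release the compiled pattern */
  return rc;
}
```
.fi
.sp
After a successful call, the matched text is the \fIrm_eo\fP - \fIrm_so\fP
bytes that start at \fIsubject\fP + \fIrm_so\fP of element 0, and a capturing
subpattern that took no part in the match has both members set to -1.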
.
.
.SH "ERROR MESSAGES"
.rs
.sp
The \fBregerror()\fP function maps a non-zero errorcode from either
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
function is the size of buffer needed to hold the whole message.
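.P
Because the yield is the size of buffer needed, one possible approach is to
call \fBregerror()\fP twice: once with a zero-length buffer to learn the size,
and again to fetch the text. The sketch below assumes the system
\fI<regex.h>\fP header and the POSIX rule that \fIerrbuf\fP is ignored when
\fIerrbuf_size\fP is zero; \fBerror_text()\fP is a hypothetical name.
.sp
.nf
```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'ed copy of the message for errcode,
   or NULL if memory cannot be obtained. The caller frees the result. */
char *error_text(int errcode, const regex_t *preg)
{
  size_t needed = regerror(errcode, preg, NULL, 0);  /* includes the zero */
  char *buf = malloc(needed);
  if (buf != NULL) regerror(errcode, preg, buf, needed);
  return buf;
}
```
.fi
.sp
A fixed-size buffer is simpler and is usually adequate; the two-call form only
guarantees that the message is never truncated.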
.
.
.SH MEMORY USAGE
.rs
.sp
Compiling a regular expression causes memory to be allocated and associated
with the \fIpreg\fP structure. The function \fBregfree()\fP frees all such
memory, after which \fIpreg\fP may no longer be used as a compiled expression.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

94
doc/pcre2sample.3 Normal file

@ -0,0 +1,94 @@
.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 SAMPLE PROGRAM"
.rs
.sp
A simple, complete demonstration program to get you started with using PCRE2 is
supplied in the file \fIpcre2demo.c\fP in the \fBsrc\fP directory in the PCRE2
distribution. A listing of this program is given in the
.\" HREF
\fBpcre2demo\fP
.\"
documentation. If you do not have a copy of the PCRE2 distribution, you can
save this listing to re-create the contents of \fIpcre2demo.c\fP.
.P
The demonstration program, which uses the PCRE2 8-bit library, compiles the
regular expression that is its first argument, and matches it against the
subject string in its second argument. No PCRE2 options are set, and default
character tables are used. If matching succeeds, the program outputs the
portion of the subject that matched, together with the contents of any captured
substrings.
.P
If the -g option is given on the command line, the program then goes on to
check for further matches of the same regular expression in the same subject
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
.P
If PCRE2 is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
this command:
.sp
gcc -o pcre2demo pcre2demo.c -lpcre2-8
.sp
If PCRE2 is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE2 installed in
\fI/usr/local\fP, you can compile the demonstration program using a command
like this:
.sp
.\" JOINSH
gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e
-L/usr/local/lib -lpcre2-8
.sp
.P
Once you have compiled and linked the demonstration program, you can run simple
tests like this:
.sp
./pcre2demo 'cat|dog' 'the cat sat on the mat'
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
.sp
Note that there is a much more comprehensive test program, called
.\" HREF
\fBpcre2test\fP,
.\"
which supports many more facilities for testing regular expressions using the
PCRE2 libraries. The
.\" HREF
\fBpcre2demo\fP
.\"
program is provided as a simple coding example.
.P
If you try to run
.\" HREF
\fBpcre2demo\fP
.\"
when PCRE2 is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
.sp
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
.sp
This is caused by the way shared library support works on those systems. You
need to add
.sp
-R/usr/local/lib
.sp
(for example) to the compile command to get round this problem.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

199
doc/pcre2stack.3 Normal file

@ -0,0 +1,199 @@
.TH PCRE2STACK 3 "20 October 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 DISCUSSION OF STACK USAGE"
.rs
.sp
When you call \fBpcre2_match()\fP, it makes use of an internal function called
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
in order to remember the state of the match so that it can back up and try a
different alternative after a failure. As matching proceeds deeper and deeper
into the tree of possibilities, the recursion depth increases. The
\fBmatch()\fP function is also called in other circumstances, for example,
whenever a parenthesized sub-pattern is entered, and in certain cases of
repetition.
.P
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the result of
the recursive call would immediately be passed back as the result of the
current call (a "tail recursion"), the function is just restarted instead.
.P
The above comments apply when \fBpcre2_match()\fP is run in its normal
interpretive manner. If the compiled pattern was processed by
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
options passed to \fBpcre2_match()\fP were not incompatible, the matching
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
this case, the memory requirements are handled entirely differently. See the
.\" HREF
\fBpcre2jit\fP
.\"
documentation for details.
.P
The \fBpcre2_dfa_match()\fP function operates in a different way to
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
recursion or subroutine call in the pattern. This includes the processing of
assertion and "once-only" subpatterns, which are handled like subroutine calls.
Normally, these are never very deep, and the limit on the complexity of
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
However, it is possible to write patterns with runaway infinite recursions;
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
present, there is no protection against this.
.P
The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
relevant only for \fBpcre2_match()\fP without the JIT optimization.
.
.
.SS "Reducing \fBpcre2_match()\fP's stack usage"
.rs
.sp
Each time that the internal \fBmatch()\fP function is called recursively, it
uses memory from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". You can often reduce the amount of recursion, and therefore the
amount of stack used, by modifying the pattern that is being matched. Consider,
for example, this pattern:
.sp
([^<]|<(?!inet))+
.sp
It matches from wherever it starts until it encounters "<inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "<" or a "<" that is not followed by "inet". However, each time a
parenthesis is processed, a recursion occurs, so this formulation uses a stack
frame for each matched character. For a long string, a lot of stack is
required. Consider now this rewritten pattern, which matches exactly the same
strings:
.sp
([^<]++|<(?!inet))+
.sp
This uses very much less stack, because runs of characters that do not contain
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
when a "<" character that is not followed by "inet" is encountered (and we
assume this is relatively rare). A possessive quantifier is used to stop any
backtracking into the runs of non-"<" characters, but that is not related to
stack usage.
.P
This example shows that one way of avoiding stack problems when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
.
.
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
.rs
.sp
In environments where stack memory is constrained, you might want to compile
PCRE2 to use heap memory instead of stack for remembering back-up points when
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
of how to do this are given in the
.\" HREF
\fBpcre2build\fP
.\"
documentation. When built in this way, instead of using the stack, PCRE2
gets memory for remembering backup points from the heap. By default, the memory
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
to supply your own memory management function. For details, see the section
entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation. Since the block sizes are always the same, it may be possible to
implement a customized memory handler that is more efficient than the standard
function. The memory blocks obtained for this purpose are retained and re-used
if possible while \fBpcre2_match()\fP is running. They are all freed just
before it exits.
.
.
.SS "Limiting \fBpcre2_match()\fP's stack usage"
.rs
.sp
You can set limits on the number of times the internal \fBmatch()\fP function
is called, both in total and recursively. If a limit is exceeded,
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
prevent it from running out of stack. The default values of the limits are very
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
and they can also be set when \fBpcre2_match()\fP is called. For details of
these interfaces, see the
.\" HREF
\fBpcre2build\fP
.\"
documentation and the section entitled
.\" HTML <a href="pcre2api.html#matchcontext">
.\" </a>
"The match context"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.P
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
around 128000 recursions.
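.P
The arithmetic behind these figures can be sketched as below. The helper name
\fBrecursion_limit_for()\fP is hypothetical, the 500-byte figure is only the
rough estimate quoted above, and "Mb" is taken to mean one million bytes, which
is what reproduces the values 16000 and 128000.
.sp
.nf
```c
#include <stddef.h>

/* Rough per-recursion stack cost quoted in the text; an estimate only. */
#define BYTES_PER_RECURSION 500

/* Hypothetical helper: largest recursion limit that fits in stack_bytes. */
unsigned long recursion_limit_for(unsigned long stack_bytes)
{
  return stack_bytes / BYTES_PER_RECURSION;
}
```
.fi
.sp
For example, recursion_limit_for(8000000) gives 16000. It is prudent to choose
a somewhat smaller value in practice, because the rest of the program also
needs stack.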
.P
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
different limits.
.
.
.SS "Changing stack size in Unix-like systems"
.rs
.sp
In Unix-like environments, there is not often a problem with the stack unless
very long strings are involved, though the default limit on stack size varies
from system to system. Values from 8Mb to 64Mb are common. You can find your
default limit by running the command:
.sp
ulimit -s
.sp
Unfortunately, the effect of running out of stack is often SIGSEGV, though
sometimes a more explicit error message is given. You can normally increase the
limit on stack size by code such as this:
.sp
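  #include <sys/resource.h>
  struct rlimit rlim;
  getrlimit(RLIMIT_STACK, &rlim);    /* read the soft and hard limits */
  rlim.rlim_cur = 100*1024*1024;     /* ask for a 100Mb soft limit */
  setrlimit(RLIMIT_STACK, &rlim);    /* fails if above the hard limit */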
.sp
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
do this before calling \fBpcre2_match()\fP.
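.P
A slightly more defensive variant is sketched below. It checks the return
values and never asks for more than the hard limit, since such a request would
be refused. The name \fBraise_stack_limit()\fP is hypothetical, and the sketch
assumes a system that provides \fI<sys/resource.h>\fP.
.sp
.nf
```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft stack limit to wanted bytes, but never
   above the hard limit. Returns 0 on success and -1 on failure. */
int raise_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;
  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;         /* the soft limit cannot exceed the hard */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
    return 0;                       /* already large enough */
  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_STACK, &rlim);
}
```
.fi
.sp
A program might call raise_stack_limit(100*1024*1024) early in its main
function. If the hard limit is lower than the requested value, the soft limit
is raised only as far as the hard limit.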
.
.
.SS "Changing stack size in Mac OS X"
.rs
.sp
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
.\" </a>
http://developer.apple.com/qa/qa2005/qa1419.html.
.\"
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

540
doc/pcre2syntax.3 Normal file

@ -0,0 +1,540 @@
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
\ex where x is non-alphanumeric is a literal x
\eQ...\eE treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
\ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII character
\ee escape (hex 1B)
\ef form feed (hex 0C)
\en newline (hex 0A)
\er carriage return (hex 0D)
\et tab (hex 09)
\e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd..
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
.rs
.sp
. any character except newline;
in dotall mode, any character whatsoever
\eC one data unit, even in UTF mode (best avoided)
\ed a decimal digit
\eD a character that is not a decimal digit
\eh a horizontal white space character
\eH a character that is not a horizontal white space character
\eN a character that is not a newline
\ep{\fIxx\fP} a character with the \fIxx\fP property
\eP{\fIxx\fP} a character without the \fIxx\fP property
\eR a newline sequence
\es a white space character
\eS a character that is not a white space character
\ev a vertical white space character
\eV a character that is not a vertical white space character
\ew a "word" character
\eW a "non-word" character
\eX a Unicode extended grapheme cluster
.sp
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
.sp
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
.sp
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
.sp
N Number
Nd Decimal number
Nl Letter number
No Other number
.sp
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
.sp
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
.sp
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
.
.
.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
Xan Alphanumeric: union of properties L and N
Xps POSIX space: property Z or tab, NL, VT, FF, CR
Xsp Perl space: property Z or tab, NL, VT, FF, CR
Xuc Universally-named character: one that can be
represented by a Universal Character Name
Xwd Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
[...] positive character class
[^...] negative character class
[x-y] range (can be used for hex characters)
[[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set
.sp
alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space white space
upper upper case letter
word same as \ew
xdigit hexadecimal digit
.sp
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
\eb word boundary
\eB not a word boundary
^ start of subject
also after internal newline in multiline mode
\eA start of subject
$ end of subject
also before newline at end of subject
also before internal newline in multiline mode
\eZ end of subject
also before newline at end of subject
\ez end of subject
\eG first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
\eK reset start of match
.sp
\eK is honoured in positive assertions, but ignored in negative ones.
.
.
.SH "ALTERNATION"
.rs
.sp
expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
(...) capturing group
(?<name>...) named capturing group (Perl)
(?'name'...) named capturing group (Perl)
(?P<name>...) named capturing group (Python)
(?:...) non-capturing group
(?|...) non-capturing group; reset group numbers for
capturing groups in each alternative
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
(?>...) atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
(?#....) comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
(?i) caseless
(?J) allow duplicate names
(?m) multiline
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) extended (ignore white space)
(?-...) unset option(s)
.sp
The following are recognized only at the very start of a pattern or after one
of the newline or \eR options with similar syntax. More than one of them may
appear.
.sp
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of \fBpcre2_match()\fP, not increase them.
.
.
.SH "NEWLINE CONVENTION"
.rs
.sp
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
(*CR) carriage return only
(*LF) linefeed only
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
(*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
.
.
.SH "BACKREFERENCES"
.rs
.sp
\en reference by number (can be ambiguous)
\egn reference by number
\eg{n} reference by number
\eg{-n} relative reference by number
\ek<name> reference by name (Perl)
\ek'name' reference by name (Perl)
\eg{name} reference by name (Perl)
\ek{name} reference by name (.NET)
(?P=name) reference by name (Python)
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
(?R) recurse whole pattern
(?n) call subpattern by absolute number
(?+n) call subpattern by relative number
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P>name) call subpattern by name (Python)
\eg<name> call subpattern by name (Oniguruma)
\eg'name' call subpattern by name (Oniguruma)
\eg<n> call subpattern by absolute number (Oniguruma)
\eg'n' call subpattern by absolute number (Oniguruma)
\eg<+n> call subpattern by relative number (PCRE2 extension)
\eg'+n' call subpattern by relative number (PCRE2 extension)
\eg<-n> call subpattern by relative number (PCRE2 extension)
\eg'-n' call subpattern by relative number (PCRE2 extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
.sp
(?(n)... absolute reference condition
(?(+n)... relative reference condition
(?(-n)... relative reference condition
(?(<name>)... named reference condition (Perl)
(?('name')... named reference condition (Perl)
(?(name)... named reference condition (PCRE2)
(?(R)... overall recursion condition
(?(Rn)... specific group recursion condition
(?(R&name)... specific recursion condition
(?(DEFINE)... define subpattern for reference
(?(assert)... assertion condition
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
(*SKIP) advance to current matching position
(*SKIP:NAME) advance to position corresponding to an earlier
(*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "CALLOUTS"
.rs
.sp
(?C) callout
(?Cn) callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
\fBpcre2matching\fP(3), \fBpcre2\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 20 October 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi