More documentation
parent
0dfe4e5e98
commit
4352f00bb9
24
Makefile.am
|
@ -36,6 +36,11 @@ dist_html_DATA = \
|
||||||
doc/html/pcre2matching.html \
|
doc/html/pcre2matching.html \
|
||||||
doc/html/pcre2partial.html \
|
doc/html/pcre2partial.html \
|
||||||
doc/html/pcre2pattern.html \
|
doc/html/pcre2pattern.html \
|
||||||
|
doc/html/pcre2perform.html \
|
||||||
|
doc/html/pcre2posix.html \
|
||||||
|
doc/html/pcre2sample.html \
|
||||||
|
doc/html/pcre2stack.html \
|
||||||
|
doc/html/pcre2syntax.html \
|
||||||
doc/html/pcre2test.html \
|
doc/html/pcre2test.html \
|
||||||
doc/html/pcre2unicode.html
|
doc/html/pcre2unicode.html
|
||||||
|
|
||||||
|
@ -66,12 +71,7 @@ dist_html_DATA = \
|
||||||
# doc/html/pcre2_utf16_to_host_byte_order.html \
|
# doc/html/pcre2_utf16_to_host_byte_order.html \
|
||||||
# doc/html/pcre2_utf32_to_host_byte_order.html \
|
# doc/html/pcre2_utf32_to_host_byte_order.html \
|
||||||
# doc/html/pcre2_version.html \
|
# doc/html/pcre2_version.html \
|
||||||
# doc/html/pcre2perform.html \
|
# doc/html/pcre2precompile.html
|
||||||
# doc/html/pcre2posix.html \
|
|
||||||
# doc/html/pcre2precompile.html \
|
|
||||||
# doc/html/pcre2sample.html \
|
|
||||||
# doc/html/pcre2stack.html \
|
|
||||||
# doc/html/pcre2syntax.html
|
|
||||||
|
|
||||||
# FIXME
|
# FIXME
|
||||||
dist_man_MANS = \
|
dist_man_MANS = \
|
||||||
|
@ -88,6 +88,11 @@ dist_man_MANS = \
|
||||||
doc/pcre2matching.3 \
|
doc/pcre2matching.3 \
|
||||||
doc/pcre2partial.3 \
|
doc/pcre2partial.3 \
|
||||||
doc/pcre2pattern.3 \
|
doc/pcre2pattern.3 \
|
||||||
|
doc/pcre2perform.3 \
|
||||||
|
doc/pcre2posix.3 \
|
||||||
|
doc/pcre2sample.3 \
|
||||||
|
doc/pcre2stack.3 \
|
||||||
|
doc/pcre2syntax.3 \
|
||||||
doc/pcre2test.1 \
|
doc/pcre2test.1 \
|
||||||
doc/pcre2unicode.3
|
doc/pcre2unicode.3
|
||||||
|
|
||||||
|
@ -120,12 +125,7 @@ dist_man_MANS = \
|
||||||
# doc/pcre2_utf16_to_host_byte_order.3 \
|
# doc/pcre2_utf16_to_host_byte_order.3 \
|
||||||
# doc/pcre2_utf32_to_host_byte_order.3 \
|
# doc/pcre2_utf32_to_host_byte_order.3 \
|
||||||
# doc/pcre2_version.3 \
|
# doc/pcre2_version.3 \
|
||||||
# doc/pcre2perform.3 \
|
# doc/pcre2precompile.3
|
||||||
# doc/pcre2posix.3 \
|
|
||||||
# doc/pcre2precompile.3 \
|
|
||||||
# doc/pcre2sample.3 \
|
|
||||||
# doc/pcre2stack.3 \
|
|
||||||
# doc/pcre2syntax.3
|
|
||||||
|
|
||||||
# The Libtool libraries to install. We'll add to this later.
|
# The Libtool libraries to install. We'll add to this later.
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,196 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>pcre2perform specification</title>
|
||||||
|
</head>
|
||||||
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||||
|
<h1>pcre2perform man page</h1>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This page is part of the PCRE2 HTML documentation. It was generated
|
||||||
|
automatically from the original man page. If there is any nonsense in it,
|
||||||
|
please consult the man page, in case the conversion went wrong.
|
||||||
|
<br>
|
||||||
|
<br><b>
|
||||||
|
PCRE2 PERFORMANCE
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Two aspects of performance are discussed below: memory usage and processing
|
||||||
|
time. The way you express your pattern as a regular expression can affect both
|
||||||
|
of them.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
COMPILED PATTERN MEMORY USAGE
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||||
|
so that most simple patterns do not use much memory. However, there is one case
|
||||||
|
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||||
|
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||||
|
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||||
|
example, the pattern
|
||||||
|
<pre>
|
||||||
|
(abc|def){2,4}
|
||||||
|
</pre>
|
||||||
|
is compiled as if it were
|
||||||
|
<pre>
|
||||||
|
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||||
|
</pre>
|
||||||
|
(Technical aside: It is done this way so that backtrack points within each of
|
||||||
|
the repetitions can be independently maintained.)
|
||||||
|
</P>
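<P>
The expansion described above can be modelled with a short C sketch. It is only
an illustration of the "compiled as if it were" rewriting for a group with a
limited maximum, not PCRE2's actual compiler; the names <i>append</i> and
<i>expand_repeat</i> are hypothetical.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Append s to out at *pos, keeping room for the terminating zero.
   Returns 0 on success, -1 if the buffer is too small. */
static int append(char *out, size_t outsz, size_t *pos, const char *s)
{
    size_t n = strlen(s);
    if (*pos + n >= outsz)
        return -1;
    memcpy(out + *pos, s, n + 1);   /* copies the terminating zero too */
    *pos += n;
    return 0;
}

/* Hypothetical helper: write the expanded form of group{min,max} into out.
   The group is copied min times, followed by (max - min) nested optional
   copies. Returns 0 on success, -1 on bad bounds or a short buffer. */
int expand_repeat(const char *group, unsigned min, unsigned max,
                  char *out, size_t outsz)
{
    size_t pos = 0;
    unsigned i, optional;

    if (outsz == 0 || min > max)
        return -1;
    out[0] = '\0';
    optional = max - min;

    /* Mandatory copies. */
    for (i = 0; i < min; i++)
        if (append(out, outsz, &pos, group) != 0)
            return -1;

    /* Optional copies: all but the last open a new enclosing group. */
    for (i = 0; i < optional; i++) {
        if (i + 1 < optional) {
            if (append(out, outsz, &pos, "(") != 0 ||
                append(out, outsz, &pos, group) != 0)
                return -1;
        } else {
            if (append(out, outsz, &pos, group) != 0 ||
                append(out, outsz, &pos, "?") != 0)
                return -1;
        }
    }

    /* Close the enclosing groups opened above. */
    for (i = 1; i < optional; i++)
        if (append(out, outsz, &pos, ")?") != 0)
            return -1;

    return 0;
}
```

<P>
Calling <i>expand_repeat("(abc|def)", 2, 4, buf, sizeof buf)</i> reproduces the
expanded pattern shown above. The length of the result grows in proportion to
the maximum, which is why large or nested limits are costly.
</P>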
|
||||||
|
<P>
|
||||||
|
For regular expressions whose quantifiers use only small numbers, this is not
|
||||||
|
usually a problem. However, if the numbers are large, and particularly if such
|
||||||
|
repetitions are nested, the memory usage can become an embarrassment. For
|
||||||
|
example, the very simple pattern
|
||||||
|
<pre>
|
||||||
|
((ab){1,1000}c){1,3}
|
||||||
|
</pre>
|
||||||
|
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||||
|
with its default internal pointer size of two bytes, the size limit on a
|
||||||
|
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||||
|
is reached with the above pattern if the outer repetition is increased from 3
|
||||||
|
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||||
|
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||||
|
use less memory if you can.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
One way of reducing the memory usage for such patterns is to make use of
|
||||||
|
PCRE2's
|
||||||
|
<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
|
||||||
|
facility. Re-writing the above pattern as
|
||||||
|
<pre>
|
||||||
|
((ab)(?2){0,999}c)(?1){0,2}
|
||||||
|
</pre>
|
||||||
|
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||||
|
with the outer repetition increased to 100. However, this pattern is not
|
||||||
|
exactly equivalent, because the "subroutine" calls are treated as
|
||||||
|
<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
|
||||||
|
into which there can be no backtracking if there is a subsequent matching
|
||||||
|
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||||
|
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||||
|
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||||
|
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||||
|
that PCRE2 cannot otherwise handle.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
STACK USAGE AT RUN TIME
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
|
||||||
|
cause it to use large amounts of the process stack. In some environments the
|
||||||
|
default process stack is quite small, and if it runs out the result is often
|
||||||
|
SIGSEGV. Rewriting your pattern can often help. The
|
||||||
|
<a href="pcre2stack.html"><b>pcre2stack</b></a>
|
||||||
|
documentation discusses this issue in detail.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
PROCESSING TIME
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Certain items in regular expression patterns are processed more efficiently
|
||||||
|
than others. It is more efficient to use a character class like [aeiou] than a
|
||||||
|
set of single-character alternatives such as (a|e|i|o|u). In general, the
|
||||||
|
simplest construction that provides the required behaviour is usually the most
|
||||||
|
efficient. Jeffrey Friedl's book "Mastering Regular Expressions" contains a lot of useful general discussion
|
||||||
|
about optimizing regular expressions for efficient performance. This document
|
||||||
|
contains a few observations about PCRE2.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
|
||||||
|
because PCRE2 has to use a multi-stage table lookup whenever it needs a
|
||||||
|
character's property. If you can find an alternative pattern that does not use
|
||||||
|
character properties, it will probably be faster.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||||
|
character classes such as [:alpha:] do not use Unicode properties, partly for
|
||||||
|
backwards compatibility, and partly for performance reasons. However, you can
|
||||||
|
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
|
||||||
|
character properties to be used. This can double the matching time for items
|
||||||
|
such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
|
||||||
|
less with a DFA matching function, and in both cases there is not much
|
||||||
|
difference for \b.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||||
|
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||||
|
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||||
|
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||||
|
this optimization, because the dot metacharacter does not then match a newline,
|
||||||
|
and if the subject string contains newlines, the pattern may match from the
|
||||||
|
character immediately following one of them instead of from the very start. For
|
||||||
|
example, the pattern
|
||||||
|
<pre>
|
||||||
|
.*second
|
||||||
|
</pre>
|
||||||
|
matches the subject "first\nand second" (where \n stands for a newline
|
||||||
|
character), with the match starting at the seventh character. In order to do
|
||||||
|
this, PCRE2 has to retry the match starting after every newline in the subject.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If you are using such a pattern with subject strings that do not contain
|
||||||
|
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
|
||||||
|
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
|
||||||
|
from having to scan along the subject looking for a newline to restart at.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Beware of patterns that contain nested indefinite repeats. These can take a
|
||||||
|
long time to run when applied to a string that does not match. Consider the
|
||||||
|
pattern fragment
|
||||||
|
<pre>
|
||||||
|
^(a+)*
|
||||||
|
</pre>
|
||||||
|
This can match "aaaa" in 16 different ways, and this number increases very
|
||||||
|
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
|
||||||
|
times, and for each of those cases other than 0 or 4, the + repeats can match
|
||||||
|
different numbers of times.) When the remainder of the pattern is such that the
|
||||||
|
entire match is going to fail, PCRE2 has in principle to try every possible
|
||||||
|
variation, and this can take an extremely long time, even for relatively short
|
||||||
|
strings.
|
||||||
|
</P>
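<P>
The count of 16 can be checked with a small C sketch that enumerates the
choices directly. It is a counting model only, not a description of how PCRE2's
matcher is implemented; the function name <i>count_ways</i> is hypothetical.
</P>

```c
/* Hypothetical helper: number of distinct ways in which ^(a+)* can match
   at the start of a run of `remaining` "a" characters. At each point the
   * repeat either stops, or the + repeat takes k characters (k >= 1) and
   the choice is made again for what is left. */
unsigned long count_ways(unsigned remaining)
{
    unsigned long total = 1;    /* the * repeat stops here */
    unsigned k;

    for (k = 1; k <= remaining; k++)
        total += count_ways(remaining - k);
    return total;
}
```

<P>
Under this model the count doubles with each extra character, so a subject of
n characters gives 2 to the power n possibilities. That is the rapid growth
described above.
</P>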
|
||||||
|
<P>
|
||||||
|
An optimization catches some of the simpler cases such as
|
||||||
|
<pre>
|
||||||
|
(a+)*b
|
||||||
|
</pre>
|
||||||
|
where a literal character follows. Before embarking on the standard matching
|
||||||
|
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
|
||||||
|
there is not, it fails the match immediately. However, when there is no
|
||||||
|
following literal this optimization cannot be used. You can see the difference
|
||||||
|
by comparing the behaviour of
|
||||||
|
<pre>
|
||||||
|
(a+)*\d
|
||||||
|
</pre>
|
||||||
|
with (a+)*b above. The pattern ending in b gives a failure almost instantly when
|
||||||
|
applied to a whole line of "a" characters, whereas the one ending in \d takes an
|
||||||
|
appreciable time with strings longer than about 20 characters.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In many cases, the solution to this kind of performance issue is to use an
|
||||||
|
atomic group or a possessive quantifier.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
AUTHOR
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Philip Hazel
|
||||||
|
<br>
|
||||||
|
University Computing Service
|
||||||
|
<br>
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
<br>
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
REVISION
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
<br>
|
||||||
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
|
<br>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
|
@ -0,0 +1,292 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>pcre2posix specification</title>
|
||||||
|
</head>
|
||||||
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||||
|
<h1>pcre2posix man page</h1>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This page is part of the PCRE2 HTML documentation. It was generated
|
||||||
|
automatically from the original man page. If there is any nonsense in it,
|
||||||
|
please consult the man page, in case the conversion went wrong.
|
||||||
|
<br>
|
||||||
|
<ul>
|
||||||
|
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||||
|
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||||
|
<li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a>
|
||||||
|
<li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a>
|
||||||
|
<li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a>
|
||||||
|
<li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a>
|
||||||
|
<li><a name="TOC7" href="#SEC7">MEMORY USAGE</a>
|
||||||
|
<li><a name="TOC8" href="#SEC8">AUTHOR</a>
|
||||||
|
<li><a name="TOC9" href="#SEC9">REVISION</a>
|
||||||
|
</ul>
|
||||||
|
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||||
|
<P>
|
||||||
|
<b>#include <pcre2posix.h></b>
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
|
||||||
|
<b> int <i>cflags</i>);</b>
|
||||||
|
<br>
|
||||||
|
<br>
|
||||||
|
<b>int regexec(const regex_t *<i>preg</i>, const char *<i>string</i>,</b>
|
||||||
|
<b> size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
|
||||||
|
<br>
|
||||||
|
<br>
|
||||||
|
<b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
|
||||||
|
<b> char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
|
||||||
|
<br>
|
||||||
|
<br>
|
||||||
|
<b>void regfree(regex_t *<i>preg</i>);</b>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||||
|
<P>
|
||||||
|
This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||||
|
expression 8-bit library. See the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation for a description of PCRE2's native API, which contains much
|
||||||
|
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||||
|
and 32-bit libraries.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The functions described here are just wrapper functions that ultimately call
|
||||||
|
the PCRE2 native API. Their prototypes are defined in the <b>pcre2posix.h</b>
|
||||||
|
header file, and on Unix systems the library itself is called
|
||||||
|
<b>libpcre2-posix.a</b>, so it can be accessed by adding <b>-lpcre2-posix</b> to the
|
||||||
|
command for linking an application that uses them. Because the POSIX functions
|
||||||
|
call the native ones, it is also necessary to add <b>-lpcre2-8</b>.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
|
||||||
|
have been implemented. In addition, the option REG_EXTENDED is defined with the
|
||||||
|
value zero. This has no effect, but since programs that are written to the
|
||||||
|
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||||
|
replacement library. Other POSIX options are not even defined.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
There are also some other options that are not defined by POSIX. These have
|
||||||
|
been added at the request of users who want to make use of certain
|
||||||
|
PCRE2-specific features via the POSIX calling interface.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||||
|
in style. The syntax and semantics of the regular expressions themselves are
|
||||||
|
still those of Perl, subject to the setting of various PCRE2 options, as
|
||||||
|
described below. "POSIX-like in style" means that the API approximates to the
|
||||||
|
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
|
||||||
|
domains it is probably even less compatible.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The header for these functions is supplied as <b>pcre2posix.h</b> to avoid any
|
||||||
|
potential clash with other POSIX libraries. It can, of course, be renamed or
|
||||||
|
aliased as <b>regex.h</b>, which is the "correct" name. It provides two
|
||||||
|
structure types, <i>regex_t</i> for compiled internal forms, and
|
||||||
|
<i>regmatch_t</i> for returning captured substrings. It also defines some
|
||||||
|
constants whose names start with "REG_"; these are used for setting options and
|
||||||
|
identifying error codes.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
|
||||||
|
<P>
|
||||||
|
The function <b>regcomp()</b> is called to compile a pattern into an
|
||||||
|
internal form. The pattern is a C string terminated by a binary zero, and
|
||||||
|
is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
|
||||||
|
to a <b>regex_t</b> structure that is used as a base for storing information
|
||||||
|
about the compiled regular expression.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The argument <i>cflags</i> is either zero, or contains one or more of the bits
|
||||||
|
defined by the following macros:
|
||||||
|
<pre>
|
||||||
|
REG_DOTALL
|
||||||
|
</pre>
|
||||||
|
The PCRE2_DOTALL option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that REG_DOTALL is not part of the
|
||||||
|
POSIX standard.
|
||||||
|
<pre>
|
||||||
|
REG_ICASE
|
||||||
|
</pre>
|
||||||
|
The PCRE2_CASELESS option is set when the regular expression is passed for
|
||||||
|
compilation to the native function.
|
||||||
|
<pre>
|
||||||
|
REG_NEWLINE
|
||||||
|
</pre>
|
||||||
|
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||||
|
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||||
|
<pre>
|
||||||
|
REG_NOSUB
|
||||||
|
</pre>
|
||||||
|
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||||
|
for compilation to the native function. In addition, when a pattern that is
|
||||||
|
compiled with this flag is passed to <b>regexec()</b> for matching, the
|
||||||
|
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
|
||||||
|
are returned.
|
||||||
|
<pre>
|
||||||
|
REG_UCP
|
||||||
|
</pre>
|
||||||
|
The PCRE2_UCP option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. This causes PCRE2 to use Unicode properties
|
||||||
|
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
|
||||||
|
that REG_UCP is not part of the POSIX standard.
|
||||||
|
<pre>
|
||||||
|
REG_UNGREEDY
|
||||||
|
</pre>
|
||||||
|
The PCRE2_UNGREEDY option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that REG_UNGREEDY is not part of the
|
||||||
|
POSIX standard.
|
||||||
|
<pre>
|
||||||
|
REG_UTF
|
||||||
|
</pre>
|
||||||
|
The PCRE2_UTF option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. This causes the pattern itself and all data
|
||||||
|
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
|
||||||
|
is not part of the POSIX standard.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In the absence of these flags, no options are passed to the native function.
|
||||||
|
This means that the regex is compiled with PCRE2 default semantics. In
|
||||||
|
particular, the way it handles newline characters in the subject string is the
|
||||||
|
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
|
||||||
|
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
|
||||||
|
newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||||
|
class such as [^a] (they are).
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
|
||||||
|
<i>preg</i> structure is filled in on success, and one member of the structure
|
||||||
|
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
|
||||||
|
the regular expression. Various error codes are defined in the header file.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
|
||||||
|
use the contents of the <i>preg</i> structure. If, for example, you pass it to
|
||||||
|
<b>regexec()</b>, the result is undefined and your program is likely to crash.
|
||||||
|
</P>
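<P>
A minimal sketch of the checking that the note above calls for is given below.
It assumes the system's POSIX <b>regex.h</b> header so that it builds without
PCRE2 installed; with the PCRE2 wrapper you would include <b>pcre2posix.h</b>
instead and link with <b>-lpcre2-posix -lpcre2-8</b>. The helper name
<i>compile_checked</i> is hypothetical.
</P>

```c
#include <regex.h>   /* assumption: system header; use pcre2posix.h for PCRE2 */
#include <stddef.h>

/* Hypothetical helper: compile pattern into *re.
   Returns 0 on success. On failure it returns regcomp()'s non-zero code,
   writes a printable message into errbuf, and *re must not be passed to
   regexec(). */
int compile_checked(regex_t *re, const char *pattern, int cflags,
                    char *errbuf, size_t errbuf_size)
{
    int rc = regcomp(re, pattern, cflags);

    if (rc != 0)
        regerror(rc, re, errbuf, errbuf_size);
    return rc;
}
```

<P>
On success the caller can read <i>re_nsub</i> and must eventually call
<b>regfree()</b>. On failure the only safe action is to report the message.
</P>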
|
||||||
|
<br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
|
||||||
|
<P>
|
||||||
|
This area is not simple, because POSIX and Perl take different views of things.
|
||||||
|
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||||
|
never intended to be a POSIX engine. The following table lists the different
|
||||||
|
possibilities for matching newline characters in PCRE2:
|
||||||
|
<pre>
|
||||||
|
Default Change with
|
||||||
|
|
||||||
|
. matches newline no PCRE2_DOTALL
|
||||||
|
newline matches [^a] yes not changeable
|
||||||
|
$ matches \n at end yes PCRE2_DOLLAR_ENDONLY
|
||||||
|
$ matches \n in middle no PCRE2_MULTILINE
|
||||||
|
^ matches \n in middle no PCRE2_MULTILINE
|
||||||
|
</pre>
|
||||||
|
This is the equivalent table for POSIX:
|
||||||
|
<pre>
|
||||||
|
Default Change with
|
||||||
|
|
||||||
|
. matches newline yes REG_NEWLINE
|
||||||
|
newline matches [^a] yes REG_NEWLINE
|
||||||
|
$ matches \n at end no REG_NEWLINE
|
||||||
|
$ matches \n in middle no REG_NEWLINE
|
||||||
|
^ matches \n in middle no REG_NEWLINE
|
||||||
|
</pre>
|
||||||
|
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||||
|
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||||
|
newline from matching [^a].
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||||
|
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||||
|
the REG_NEWLINE action.
|
||||||
|
</P>
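<P>
The rows of the two tables can be probed with a small test helper such as the
sketch below. It assumes the system's POSIX <b>regex.h</b> header; building the
same code against <b>pcre2posix.h</b> is one way to see the differences
between the tables. The helper name <i>matches</i> is hypothetical.
</P>

```c
#include <regex.h>   /* assumption: system header; use pcre2posix.h for PCRE2 */
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches somewhere in subject,
   -1 if the pattern fails to compile, and 0 otherwise. */
int matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<P>
For example, compare <i>matches("a.b", "a\nb", REG_EXTENDED)</i> with and
without REG_NEWLINE added to the flags. The POSIX table predicts a match only
without REG_NEWLINE, whereas the PCRE2 table predicts that the dot does not
match the newline in either case unless REG_DOTALL is set.
</P>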
|
||||||
|
<br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br>
|
||||||
|
<P>
|
||||||
|
The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i>
|
||||||
|
against a given <i>string</i>, which is by default terminated by a zero byte
|
||||||
|
(but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can
|
||||||
|
be:
|
||||||
|
<pre>
|
||||||
|
REG_NOTBOL
|
||||||
|
</pre>
|
||||||
|
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
|
||||||
|
function.
|
||||||
|
<pre>
|
||||||
|
REG_NOTEMPTY
|
||||||
|
</pre>
|
||||||
|
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
|
||||||
|
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
|
||||||
|
setting this option can give more POSIX-like behaviour in some situations.
|
||||||
|
<pre>
|
||||||
|
REG_NOTEOL
|
||||||
|
</pre>
|
||||||
|
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
|
||||||
|
function.
|
||||||
|
<pre>
|
||||||
|
REG_STARTEND
|
||||||
|
</pre>
|
||||||
|
The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
|
||||||
|
to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
|
||||||
|
(there need not actually be a NUL at that location), regardless of the value of
|
||||||
|
<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
|
||||||
|
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||||
|
intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
|
||||||
|
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||||
|
how it is matched.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||||
|
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
|
||||||
|
<b>regexec()</b> are ignored.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If the value of <i>nmatch</i> is zero, or if the value of <i>pmatch</i> is NULL,
|
||||||
|
no data about any matched strings is returned.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Otherwise, the portion of the string that was matched, and also any captured
|
||||||
|
substrings, are returned via the <i>pmatch</i> argument, which points to an
|
||||||
|
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
|
||||||
|
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
|
||||||
|
character of each substring and the offset to the first character after the end
|
||||||
|
of each substring, respectively. The 0th element of the vector relates to the
|
||||||
|
entire portion of <i>string</i> that was matched; subsequent elements relate to
|
||||||
|
the capturing subpatterns of the regular expression. Unused entries in the
|
||||||
|
array have both structure members set to -1.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
A successful match yields a zero return; various error codes are defined in the
|
||||||
|
header file, of which REG_NOMATCH is the "expected" failure code.
|
||||||
|
</P>
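<P>
The use of the <i>pmatch</i> vector and the return code can be sketched as
follows. The sketch assumes the system's POSIX <b>regex.h</b> header; with the
PCRE2 wrapper, <b>pcre2posix.h</b> would be included instead. The helper name
<i>match_once</i> is hypothetical.
</P>

```c
#include <regex.h>   /* assumption: system header; use pcre2posix.h for PCRE2 */
#include <stddef.h>

/* Hypothetical helper: compile pattern, match it once against subject, and
   fill pm[0] to pm[n-1] with byte offsets. Returns 0 on a match,
   REG_NOMATCH if there is none, or another non-zero error code. */
int match_once(const char *pattern, const char *subject,
               regmatch_t *pm, size_t n)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;              /* re must not be used or freed here */
    rc = regexec(&re, subject, n, pm, 0);
    regfree(&re);
    return rc;
}
```

<P>
After a successful call, the matched text runs from <i>subject + pm[0].rm_so</i>
up to, but not including, <i>subject + pm[0].rm_eo</i>; <i>pm[1]</i> onwards
describe the capturing subpatterns in the same way.
</P>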
|
||||||
|
<br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br>
|
||||||
|
<P>
|
||||||
|
The <b>regerror()</b> function maps a non-zero errorcode from either
|
||||||
|
<b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not
|
||||||
|
NULL, the error should have arisen from the use of that structure. A message
|
||||||
|
terminated by a binary zero is placed in <i>errbuf</i>. The length of the
|
||||||
|
message, including the zero, is limited to <i>errbuf_size</i>. The yield of the
|
||||||
|
function is the size of buffer needed to hold the whole message.
|
||||||
|
</P>
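<P>
Because the yield is the size of buffer needed, <b>regerror()</b> can be called
twice: once to find the size and once to fetch the message. The sketch below
assumes that a zero <i>errbuf_size</i> is accepted for the first call, as POSIX
specifies, and uses the system's <b>regex.h</b> header. The helper name
<i>error_text</i> is hypothetical.
</P>

```c
#include <regex.h>   /* assumption: system header; use pcre2posix.h for PCRE2 */
#include <stdlib.h>

/* Hypothetical helper: return a heap-allocated message for an error code
   from regcomp() or regexec(). re may be NULL. The caller frees the result;
   NULL is returned if memory cannot be allocated. */
char *error_text(int code, const regex_t *re)
{
    size_t need = regerror(code, re, NULL, 0);  /* size, including the zero */
    char *buf = malloc(need);

    if (buf != NULL)
        regerror(code, re, buf, need);
    return buf;
}
```

<P>
A fixed-size buffer is simpler and is usually enough; the two-call form only
matters when a truncated message would be unacceptable.
</P>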
|
||||||
|
<br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br>
|
||||||
|
<P>
|
||||||
|
Compiling a regular expression causes memory to be allocated and associated
|
||||||
|
with the <i>preg</i> structure. The function <b>regfree()</b> frees all such
|
||||||
|
memory, after which <i>preg</i> may no longer be used as a compiled expression.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC8" href="#TOC1">AUTHOR</a><br>
|
||||||
|
<P>
|
||||||
|
Philip Hazel
|
||||||
|
<br>
|
||||||
|
University Computing Service
|
||||||
|
<br>
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
<br>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
|
||||||
|
<P>
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
<br>
|
||||||
|
Copyright © 1997-2014 University of Cambridge.
|
||||||
|
<br>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
|
@ -0,0 +1,106 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>pcre2sample specification</title>
|
||||||
|
</head>
|
||||||
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||||
|
<h1>pcre2sample man page</h1>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This page is part of the PCRE2 HTML documentation. It was generated
|
||||||
|
automatically from the original man page. If there is any nonsense in it,
|
||||||
|
please consult the man page, in case the conversion went wrong.
|
||||||
|
<br>
|
||||||
|
<br><b>
|
||||||
|
PCRE2 SAMPLE PROGRAM
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
A simple, complete demonstration program to get you started with using PCRE2 is
|
||||||
|
supplied in the file <i>pcre2demo.c</i> in the <b>src</b> directory in the PCRE2
|
||||||
|
distribution. A listing of this program is given in the
|
||||||
|
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||||
|
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||||
|
save this listing to re-create the contents of <i>pcre2demo.c</i>.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||||
|
regular expression that is its first argument, and matches it against the
|
||||||
|
subject string in its second argument. No PCRE2 options are set, and default
|
||||||
|
character tables are used. If matching succeeds, the program outputs the
|
||||||
|
portion of the subject that matched, together with the contents of any captured
|
||||||
|
substrings.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If the -g option is given on the command line, the program then goes on to
|
||||||
|
check for further matches of the same regular expression in the same subject
|
||||||
|
string. The logic is a little bit tricky because of the possibility of matching
|
||||||
|
an empty string. Comments in the code explain what is going on.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If PCRE2 is installed in the standard include and library directories for your
|
||||||
|
operating system, you should be able to compile the demonstration program using
|
||||||
|
this command:
|
||||||
|
<pre>
|
||||||
|
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||||
|
</pre>
|
||||||
|
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||||
|
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||||
|
<i>/usr/local</i>, you can compile the demonstration program using a command
|
||||||
|
like this:
|
||||||
|
<pre>
|
||||||
|
gcc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
|
||||||
|
|
||||||
|
</pre>
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Once you have compiled and linked the demonstration program, you can run simple
|
||||||
|
tests like this:
|
||||||
|
<pre>
|
||||||
|
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||||
|
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||||
|
</pre>
|
||||||
|
Note that there is a much more comprehensive test program, called
|
||||||
|
<a href="pcre2test.html"><b>pcre2test</b>,</a>
|
||||||
|
which supports many more facilities for testing regular expressions using the
|
||||||
|
PCRE2 libraries. The
|
||||||
|
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||||
|
program is provided as a simple coding example.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If you try to run
|
||||||
|
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||||
|
when PCRE2 is not installed in the standard library directory, you may get an
|
||||||
|
error like this on some operating systems (e.g. Solaris):
|
||||||
|
<pre>
|
||||||
|
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||||
|
</pre>
|
||||||
|
This is caused by the way shared library support works on those systems. You
|
||||||
|
need to add
|
||||||
|
<pre>
|
||||||
|
-R/usr/local/lib
|
||||||
|
</pre>
|
||||||
|
(for example) to the compile command to get round this problem.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
AUTHOR
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Philip Hazel
|
||||||
|
<br>
|
||||||
|
University Computing Service
|
||||||
|
<br>
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
<br>
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
REVISION
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
<br>
|
||||||
|
Copyright &copy; 1997-2014 University of Cambridge.
|
||||||
|
<br>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
|
@ -0,0 +1,203 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>pcre2stack specification</title>
|
||||||
|
</head>
|
||||||
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||||
|
<h1>pcre2stack man page</h1>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This page is part of the PCRE2 HTML documentation. It was generated
|
||||||
|
automatically from the original man page. If there is any nonsense in it,
|
||||||
|
please consult the man page, in case the conversion went wrong.
|
||||||
|
<br>
|
||||||
|
<br><b>
|
||||||
|
PCRE2 DISCUSSION OF STACK USAGE
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
When you call <b>pcre2_match()</b>, it makes use of an internal function called
|
||||||
|
<b>match()</b>. This calls itself recursively at branch points in the pattern,
|
||||||
|
in order to remember the state of the match so that it can back up and try a
|
||||||
|
different alternative after a failure. As matching proceeds deeper and deeper
|
||||||
|
into the tree of possibilities, the recursion depth increases. The
|
||||||
|
<b>match()</b> function is also called in other circumstances, for example,
|
||||||
|
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
||||||
|
repetition.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Not all calls of <b>match()</b> increase the recursion depth; for an item such
|
||||||
|
as a* it may be called several times at the same level, after matching
|
||||||
|
different numbers of a's. Furthermore, in a number of cases where the result of
|
||||||
|
the recursive call would immediately be passed back as the result of the
|
||||||
|
current call (a "tail recursion"), the function is just restarted instead.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The above comments apply when <b>pcre2_match()</b> is run in its normal
|
||||||
|
interpretive manner. If the compiled pattern was processed by
|
||||||
|
<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
|
||||||
|
options passed to <b>pcre2_match()</b> were not incompatible, the matching
|
||||||
|
process uses the JIT-compiled code instead of the <b>match()</b> function. In
|
||||||
|
this case, the memory requirements are handled entirely differently. See the
|
||||||
|
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||||
|
documentation for details.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The <b>pcre2_dfa_match()</b> function operates in a different way to
|
||||||
|
<b>pcre2_match()</b>, and uses recursion only when there is a regular expression
|
||||||
|
recursion or subroutine call in the pattern. This includes the processing of
|
||||||
|
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||||
|
Normally, these are never very deep, and the limit on the complexity of
|
||||||
|
<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
|
||||||
|
However, it is possible to write patterns with runaway infinite recursions;
|
||||||
|
such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
|
||||||
|
present, there is no protection against this.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
|
||||||
|
relevant only for <b>pcre2_match()</b> without the JIT optimization.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Reducing <b>pcre2_match()</b>'s stack usage
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Each time that the internal <b>match()</b> function is called recursively, it
|
||||||
|
uses memory from the process stack. For certain kinds of pattern and data, very
|
||||||
|
large amounts of stack may be needed, despite the recognition of "tail
|
||||||
|
recursion". You can often reduce the amount of recursion, and therefore the
|
||||||
|
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||||
|
for example, this pattern:
|
||||||
|
<pre>
|
||||||
|
([^<]|<(?!inet))+
|
||||||
|
</pre>
|
||||||
|
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||||
|
the data, and is the kind of pattern that might be used when processing an XML
|
||||||
|
file. Each iteration of the outer parentheses matches either one character that
|
||||||
|
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||||
|
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
||||||
|
frame for each matched character. For a long string, a lot of stack is
|
||||||
|
required. Consider now this rewritten pattern, which matches exactly the same
|
||||||
|
strings:
|
||||||
|
<pre>
|
||||||
|
([^<]++|<(?!inet))+
|
||||||
|
</pre>
|
||||||
|
This uses very much less stack, because runs of characters that do not contain
|
||||||
|
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
||||||
|
when a "<" character that is not followed by "inet" is encountered (and we
|
||||||
|
assume this is relatively rare). A possessive quantifier is used to stop any
|
||||||
|
backtracking into the runs of non-"<" characters, but that is not related to
|
||||||
|
stack usage.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
This example shows that one way of avoiding stack problems when matching long
|
||||||
|
subject strings is to write repeated parenthesized subpatterns to match more
|
||||||
|
than one character whenever possible.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Compiling PCRE2 to use heap instead of stack for <b>pcre2_match()</b>
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
In environments where stack memory is constrained, you might want to compile
|
||||||
|
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
||||||
|
<b>pcre2_match()</b> is running. This makes it run more slowly, however. Details
|
||||||
|
of how to do this are given in the
|
||||||
|
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||||
|
documentation. When built in this way, instead of using the stack, PCRE2
|
||||||
|
gets memory for remembering backup points from the heap. By default, the memory
|
||||||
|
is obtained by calling the system <b>malloc()</b> function, but you can arrange
|
||||||
|
to supply your own memory management function. For details, see the section
|
||||||
|
entitled
|
||||||
|
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||||
|
in the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation. Since the block sizes are always the same, it may be possible to
|
||||||
|
implement a customized memory handler that is more efficient than the standard
|
||||||
|
function. The memory blocks obtained for this purpose are retained and re-used
|
||||||
|
if possible while <b>pcre2_match()</b> is running. They are all freed just
|
||||||
|
before it exits.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Limiting <b>pcre2_match()</b>'s stack usage
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
You can set limits on the number of times the internal <b>match()</b> function
|
||||||
|
is called, both in total and recursively. If a limit is exceeded,
|
||||||
|
<b>pcre2_match()</b> returns an error code. Setting suitable limits should
|
||||||
|
prevent it from running out of stack. The default values of the limits are very
|
||||||
|
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
||||||
|
and they can also be set when <b>pcre2_match()</b> is called. For details of
|
||||||
|
these interfaces, see the
|
||||||
|
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||||
|
documentation and the section entitled
|
||||||
|
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||||
|
in the
|
||||||
|
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
||||||
|
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
|
||||||
|
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
|
||||||
|
around 128000 recursions.
|
||||||
|
</P>
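<P>
The arithmetic behind this rule of thumb can be sketched in C as follows. This
is only an illustration: the helper name <b>recursion_limit_for_stack()</b> and
the constant are hypothetical and not part of PCRE2, and the 500-byte figure is
the rough estimate quoted above, not a measured value. The figures in the text
(16000 and 128000) are the same estimates rounded down.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed average stack cost of one match() recursion, taken from the
   rule of thumb in the text. Real frame sizes vary by platform and build. */
#define ASSUMED_BYTES_PER_RECURSION 500u

/* Hypothetical helper: estimate how many recursions fit in a stack budget
   given in bytes. The result is clamped to the range of uint32_t. */
static uint32_t recursion_limit_for_stack(size_t stack_bytes)
{
  size_t n = stack_bytes / ASSUMED_BYTES_PER_RECURSION;
  return (n > UINT32_MAX) ? UINT32_MAX : (uint32_t)n;
}
```

<P>
A value computed this way is only a starting point for the recursion limit; it
is wise to leave some headroom for the stack that the rest of the program uses.
</P>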
|
||||||
|
<P>
|
||||||
|
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
|
||||||
|
applied to a subject line, causes it to find the smallest limits that allow a
|
||||||
|
pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
|
||||||
|
different limits.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Changing stack size in Unix-like systems
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
In Unix-like environments, there is not often a problem with the stack unless
|
||||||
|
very long strings are involved, though the default limit on stack size varies
|
||||||
|
from system to system. Values from 8Mb to 64Mb are common. You can find your
|
||||||
|
default limit by running the command:
|
||||||
|
<pre>
|
||||||
|
ulimit -s
|
||||||
|
</pre>
|
||||||
|
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
||||||
|
sometimes a more explicit error message is given. You can normally increase the
|
||||||
|
limit on stack size by code such as this:
|
||||||
|
<pre>
|
||||||
|
struct rlimit rlim;
|
||||||
|
getrlimit(RLIMIT_STACK, &rlim);
|
||||||
|
rlim.rlim_cur = 100*1024*1024;
|
||||||
|
setrlimit(RLIMIT_STACK, &rlim);
|
||||||
|
</pre>
|
||||||
|
This reads the current limits (soft and hard) using <b>getrlimit()</b>, then
|
||||||
|
attempts to increase the soft limit to 100Mb using <b>setrlimit()</b>. You must
|
||||||
|
do this before calling <b>pcre2_match()</b>.
|
||||||
|
</P>
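<P>
A slightly fuller sketch of the fragment above, assuming a POSIX system that
provides the sys/resource.h header. The helper name <b>raise_stack_limit()</b>
is hypothetical. Unlike the fragment, it checks the return values and never
asks for more than the hard limit, which an unprivileged process cannot exceed.
</P>

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft stack limit towards `wanted` bytes.
   Returns 0 if the soft limit was already sufficient or has been raised as
   far as the hard limit permits, and -1 if getrlimit() or setrlimit() failed. */
static int raise_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Nothing to do if the soft limit is unlimited or already big enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

  /* The soft limit cannot be set above the hard limit. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_STACK, &rlim);
}
```

<P>
For the 100Mb example above, the call would be
raise_stack_limit(100*1024*1024), made before the first call to
<b>pcre2_match()</b>.
</P>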
|
||||||
|
<br><b>
|
||||||
|
Changing stack size in Mac OS X
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Using <b>setrlimit()</b>, as described above, should also work on Mac OS X. It
|
||||||
|
is also possible to set a stack size when linking a program. There is a
|
||||||
|
discussion about stack sizes in Mac OS X at this web site:
|
||||||
|
<a href="http://developer.apple.com/qa/qa2005/qa1419.html">http://developer.apple.com/qa/qa2005/qa1419.html</a>.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
AUTHOR
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Philip Hazel
|
||||||
|
<br>
|
||||||
|
University Computing Service
|
||||||
|
<br>
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
<br>
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
REVISION
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
<br>
|
||||||
|
Copyright &copy; 1997-2014 University of Cambridge.
|
||||||
|
<br>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
|
@ -0,0 +1,561 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>pcre2syntax specification</title>
|
||||||
|
</head>
|
||||||
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||||
|
<h1>pcre2syntax man page</h1>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This page is part of the PCRE2 HTML documentation. It was generated
|
||||||
|
automatically from the original man page. If there is any nonsense in it,
|
||||||
|
please consult the man page, in case the conversion went wrong.
|
||||||
|
<br>
|
||||||
|
<ul>
|
||||||
|
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
|
||||||
|
<li><a name="TOC2" href="#SEC2">QUOTING</a>
|
||||||
|
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
|
||||||
|
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
|
||||||
|
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||||
|
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||||
|
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
|
||||||
|
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
|
||||||
|
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
|
||||||
|
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||||
|
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
|
||||||
|
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
|
||||||
|
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
|
||||||
|
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
|
||||||
|
<li><a name="TOC15" href="#SEC15">COMMENT</a>
|
||||||
|
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
|
||||||
|
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||||
|
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||||
|
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||||
|
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
|
||||||
|
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||||
|
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
|
||||||
|
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
|
||||||
|
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
|
||||||
|
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
|
||||||
|
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
|
||||||
|
<li><a name="TOC27" href="#SEC27">REVISION</a>
|
||||||
|
</ul>
|
||||||
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||||
|
<P>
|
||||||
|
The full syntax and semantics of the regular expressions that are supported by
|
||||||
|
PCRE2 are described in the
|
||||||
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
|
documentation. This document contains a quick-reference summary of the syntax.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\x where x is non-alphanumeric is a literal x
|
||||||
|
\Q...\E treat enclosed characters as literal
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\a alarm, that is, the BEL character (hex 07)
|
||||||
|
\cx "control-x", where x is any ASCII character
|
||||||
|
\e escape (hex 1B)
|
||||||
|
\f form feed (hex 0C)
|
||||||
|
\n newline (hex 0A)
|
||||||
|
\r carriage return (hex 0D)
|
||||||
|
\t tab (hex 09)
|
||||||
|
\0dd character with octal code 0dd
|
||||||
|
\ddd character with octal code ddd, or backreference
|
||||||
|
\o{ddd..} character with octal code ddd..
|
||||||
|
\xhh character with hex code hh
|
||||||
|
\x{hhh..} character with hex code hhh..
|
||||||
|
</pre>
|
||||||
|
Note that \0dd is always an octal code, and that \8 and \9 are the literal
|
||||||
|
characters "8" and "9".
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
. any character except newline;
|
||||||
|
in dotall mode, any character whatsoever
|
||||||
|
\C one data unit, even in UTF mode (best avoided)
|
||||||
|
\d a decimal digit
|
||||||
|
\D a character that is not a decimal digit
|
||||||
|
\h a horizontal white space character
|
||||||
|
\H a character that is not a horizontal white space character
|
||||||
|
\N a character that is not a newline
|
||||||
|
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||||
|
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||||
|
\R a newline sequence
|
||||||
|
\s a white space character
|
||||||
|
\S a character that is not a white space character
|
||||||
|
\v a vertical white space character
|
||||||
|
\V a character that is not a vertical white space character
|
||||||
|
\w a "word" character
|
||||||
|
\W a "non-word" character
|
||||||
|
\X a Unicode extended grapheme cluster
|
||||||
|
</pre>
|
||||||
|
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||||
|
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||||
|
happening, \s and \w may also match characters with code points in the range
|
||||||
|
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
|
||||||
|
sequences is changed to use Unicode properties and they match many more
|
||||||
|
characters.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
C Other
|
||||||
|
Cc Control
|
||||||
|
Cf Format
|
||||||
|
Cn Unassigned
|
||||||
|
Co Private use
|
||||||
|
Cs Surrogate
|
||||||
|
|
||||||
|
L Letter
|
||||||
|
Ll Lower case letter
|
||||||
|
Lm Modifier letter
|
||||||
|
Lo Other letter
|
||||||
|
Lt Title case letter
|
||||||
|
Lu Upper case letter
|
||||||
|
L& Ll, Lu, or Lt
|
||||||
|
|
||||||
|
M Mark
|
||||||
|
Mc Spacing mark
|
||||||
|
Me Enclosing mark
|
||||||
|
Mn Non-spacing mark
|
||||||
|
|
||||||
|
N Number
|
||||||
|
Nd Decimal number
|
||||||
|
Nl Letter number
|
||||||
|
No Other number
|
||||||
|
|
||||||
|
P Punctuation
|
||||||
|
Pc Connector punctuation
|
||||||
|
Pd Dash punctuation
|
||||||
|
Pe Close punctuation
|
||||||
|
Pf Final punctuation
|
||||||
|
Pi Initial punctuation
|
||||||
|
Po Other punctuation
|
||||||
|
Ps Open punctuation
|
||||||
|
|
||||||
|
S Symbol
|
||||||
|
Sc Currency symbol
|
||||||
|
Sk Modifier symbol
|
||||||
|
Sm Mathematical symbol
|
||||||
|
So Other symbol
|
||||||
|
|
||||||
|
Z Separator
|
||||||
|
Zl Line separator
|
||||||
|
Zp Paragraph separator
|
||||||
|
Zs Space separator
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
Xan Alphanumeric: union of properties L and N
|
||||||
|
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
||||||
|
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
||||||
|
Xuc Universally-named character: one that can be
|
||||||
|
represented by a Universal Character Name
|
||||||
|
Xwd Perl word: property Xan or underscore
|
||||||
|
</pre>
|
||||||
|
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||||
|
at release 5.18.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
|
||||||
|
<P>
|
||||||
|
Arabic,
|
||||||
|
Armenian,
|
||||||
|
Avestan,
|
||||||
|
Balinese,
|
||||||
|
Bamum,
|
||||||
|
Bassa_Vah,
|
||||||
|
Batak,
|
||||||
|
Bengali,
|
||||||
|
Bopomofo,
|
||||||
|
Brahmi,
|
||||||
|
Braille,
|
||||||
|
Buginese,
|
||||||
|
Buhid,
|
||||||
|
Canadian_Aboriginal,
|
||||||
|
Carian,
|
||||||
|
Caucasian_Albanian,
|
||||||
|
Chakma,
|
||||||
|
Cham,
|
||||||
|
Cherokee,
|
||||||
|
Common,
|
||||||
|
Coptic,
|
||||||
|
Cuneiform,
|
||||||
|
Cypriot,
|
||||||
|
Cyrillic,
|
||||||
|
Deseret,
|
||||||
|
Devanagari,
|
||||||
|
Duployan,
|
||||||
|
Egyptian_Hieroglyphs,
|
||||||
|
Elbasan,
|
||||||
|
Ethiopic,
|
||||||
|
Georgian,
|
||||||
|
Glagolitic,
|
||||||
|
Gothic,
|
||||||
|
Grantha,
|
||||||
|
Greek,
|
||||||
|
Gujarati,
|
||||||
|
Gurmukhi,
|
||||||
|
Han,
|
||||||
|
Hangul,
|
||||||
|
Hanunoo,
|
||||||
|
Hebrew,
|
||||||
|
Hiragana,
|
||||||
|
Imperial_Aramaic,
|
||||||
|
Inherited,
|
||||||
|
Inscriptional_Pahlavi,
|
||||||
|
Inscriptional_Parthian,
|
||||||
|
Javanese,
|
||||||
|
Kaithi,
|
||||||
|
Kannada,
|
||||||
|
Katakana,
|
||||||
|
Kayah_Li,
|
||||||
|
Kharoshthi,
|
||||||
|
Khmer,
|
||||||
|
Khojki,
|
||||||
|
Khudawadi,
|
||||||
|
Lao,
|
||||||
|
Latin,
|
||||||
|
Lepcha,
|
||||||
|
Limbu,
|
||||||
|
Linear_A,
|
||||||
|
Linear_B,
|
||||||
|
Lisu,
|
||||||
|
Lycian,
|
||||||
|
Lydian,
|
||||||
|
Mahajani,
|
||||||
|
Malayalam,
|
||||||
|
Mandaic,
|
||||||
|
Manichaean,
|
||||||
|
Meetei_Mayek,
|
||||||
|
Mende_Kikakui,
|
||||||
|
Meroitic_Cursive,
|
||||||
|
Meroitic_Hieroglyphs,
|
||||||
|
Miao,
|
||||||
|
Modi,
|
||||||
|
Mongolian,
|
||||||
|
Mro,
|
||||||
|
Myanmar,
|
||||||
|
Nabataean,
|
||||||
|
New_Tai_Lue,
|
||||||
|
Nko,
|
||||||
|
Ogham,
|
||||||
|
Ol_Chiki,
|
||||||
|
Old_Italic,
|
||||||
|
Old_North_Arabian,
|
||||||
|
Old_Permic,
|
||||||
|
Old_Persian,
|
||||||
|
Old_South_Arabian,
|
||||||
|
Old_Turkic,
|
||||||
|
Oriya,
|
||||||
|
Osmanya,
|
||||||
|
Pahawh_Hmong,
|
||||||
|
Palmyrene,
|
||||||
|
Pau_Cin_Hau,
|
||||||
|
Phags_Pa,
|
||||||
|
Phoenician,
|
||||||
|
Psalter_Pahlavi,
|
||||||
|
Rejang,
|
||||||
|
Runic,
|
||||||
|
Samaritan,
|
||||||
|
Saurashtra,
|
||||||
|
Sharada,
|
||||||
|
Shavian,
|
||||||
|
Siddham,
|
||||||
|
Sinhala,
|
||||||
|
Sora_Sompeng,
|
||||||
|
Sundanese,
|
||||||
|
Syloti_Nagri,
|
||||||
|
Syriac,
|
||||||
|
Tagalog,
|
||||||
|
Tagbanwa,
|
||||||
|
Tai_Le,
|
||||||
|
Tai_Tham,
|
||||||
|
Tai_Viet,
|
||||||
|
Takri,
|
||||||
|
Tamil,
|
||||||
|
Telugu,
|
||||||
|
Thaana,
|
||||||
|
Thai,
|
||||||
|
Tibetan,
|
||||||
|
Tifinagh,
|
||||||
|
Tirhuta,
|
||||||
|
Ugaritic,
|
||||||
|
Vai,
|
||||||
|
Warang_Citi,
|
||||||
|
Yi.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
[...] positive character class
|
||||||
|
[^...] negative character class
|
||||||
|
[x-y] range (can be used for hex characters)
|
||||||
|
[[:xxx:]] positive POSIX named set
|
||||||
|
[[:^xxx:]] negative POSIX named set
|
||||||
|
|
||||||
|
alnum alphanumeric
|
||||||
|
alpha alphabetic
|
||||||
|
ascii 0-127
|
||||||
|
blank space or tab
|
||||||
|
cntrl control character
|
||||||
|
digit decimal digit
|
||||||
|
graph printing, excluding space
|
||||||
|
lower lower case letter
|
||||||
|
print printing, including space
|
||||||
|
punct printing, excluding alphanumeric
|
||||||
|
space white space
|
||||||
|
upper upper case letter
|
||||||
|
word same as \w
|
||||||
|
xdigit hexadecimal digit
|
||||||
|
</pre>
|
||||||
|
In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
||||||
|
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
|
\Q...\E inside a character class.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
? 0 or 1, greedy
|
||||||
|
?+ 0 or 1, possessive
|
||||||
|
?? 0 or 1, lazy
|
||||||
|
* 0 or more, greedy
|
||||||
|
*+ 0 or more, possessive
|
||||||
|
*? 0 or more, lazy
|
||||||
|
+ 1 or more, greedy
|
||||||
|
++ 1 or more, possessive
|
||||||
|
+? 1 or more, lazy
|
||||||
|
{n} exactly n
|
||||||
|
{n,m} at least n, no more than m, greedy
|
||||||
|
{n,m}+ at least n, no more than m, possessive
|
||||||
|
{n,m}? at least n, no more than m, lazy
|
||||||
|
{n,} n or more, greedy
|
||||||
|
{n,}+ n or more, possessive
|
||||||
|
{n,}? n or more, lazy
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\b word boundary
|
||||||
|
\B not a word boundary
|
||||||
|
^ start of subject
|
||||||
|
also after internal newline in multiline mode
|
||||||
|
\A start of subject
|
||||||
|
$ end of subject
|
||||||
|
also before newline at end of subject
|
||||||
|
also before internal newline in multiline mode
|
||||||
|
\Z end of subject
|
||||||
|
also before newline at end of subject
|
||||||
|
\z end of subject
|
||||||
|
\G first matching position in subject
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\K reset start of match
|
||||||
|
</pre>
|
||||||
|
\K is honoured in positive assertions, but ignored in negative ones.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
expr|expr|expr...
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(...) capturing group
|
||||||
|
(?<name>...) named capturing group (Perl)
|
||||||
|
(?'name'...) named capturing group (Perl)
|
||||||
|
(?P<name>...) named capturing group (Python)
|
||||||
|
(?:...) non-capturing group
|
||||||
|
(?|...) non-capturing group; reset group numbers for
|
||||||
|
capturing groups in each alternative
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?>...) atomic, non-capturing group
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?#....) comment (not nestable)
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?i) caseless
|
||||||
|
(?J) allow duplicate names
|
||||||
|
(?m) multiline
|
||||||
|
(?s) single line (dotall)
|
||||||
|
(?U) default ungreedy (lazy)
|
||||||
|
(?x) extended (ignore white space)
|
||||||
|
(?-...) unset option(s)
|
||||||
|
</pre>
|
||||||
|
The following are recognized only at the very start of a pattern or after one
|
||||||
|
of the newline or \R options with similar syntax. More than one of them may
|
||||||
|
appear.
|
||||||
|
<pre>
|
||||||
|
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
|
||||||
|
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
|
||||||
|
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||||
|
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||||
|
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||||
|
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||||
|
(*UTF) set appropriate UTF mode for the library in use
|
||||||
|
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||||
|
</pre>
|
||||||
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||||
|
limits set by the caller of pcre2_match(), not increase them.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
|
||||||
|
<P>
|
||||||
|
These are recognized only at the very start of the pattern or after option
|
||||||
|
settings with a similar syntax.
|
||||||
|
<pre>
|
||||||
|
(*CR) carriage return only
|
||||||
|
(*LF) linefeed only
|
||||||
|
(*CRLF) carriage return followed by linefeed
|
||||||
|
(*ANYCRLF) all three of the above
|
||||||
|
(*ANY) any Unicode newline sequence
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||||
|
<P>
|
||||||
|
These are recognized only at the very start of the pattern or after option
|
||||||
|
settings with a similar syntax.
|
||||||
|
<pre>
|
||||||
|
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||||
|
(*BSR_UNICODE) any Unicode newline sequence
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?=...) positive look ahead
|
||||||
|
(?!...) negative look ahead
|
||||||
|
(?<=...) positive look behind
|
||||||
|
(?<!...) negative look behind
|
||||||
|
</pre>
|
||||||
|
Each top-level branch of a look behind must be of a fixed length.
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
\n reference by number (can be ambiguous)
|
||||||
|
\gn reference by number
|
||||||
|
\g{n} reference by number
|
||||||
|
\g{-n} relative reference by number
|
||||||
|
\k<name> reference by name (Perl)
|
||||||
|
\k'name' reference by name (Perl)
|
||||||
|
\g{name} reference by name (Perl)
|
||||||
|
\k{name} reference by name (.NET)
|
||||||
|
(?P=name) reference by name (Python)
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?R) recurse whole pattern
|
||||||
|
(?n) call subpattern by absolute number
|
||||||
|
(?+n) call subpattern by relative number
|
||||||
|
(?-n) call subpattern by relative number
|
||||||
|
(?&name) call subpattern by name (Perl)
|
||||||
|
(?P>name) call subpattern by name (Python)
|
||||||
|
\g<name> call subpattern by name (Oniguruma)
|
||||||
|
\g'name' call subpattern by name (Oniguruma)
|
||||||
|
\g<n> call subpattern by absolute number (Oniguruma)
|
||||||
|
\g'n' call subpattern by absolute number (Oniguruma)
|
||||||
|
\g<+n> call subpattern by relative number (PCRE2 extension)
|
||||||
|
\g'+n' call subpattern by relative number (PCRE2 extension)
|
||||||
|
\g<-n> call subpattern by relative number (PCRE2 extension)
|
||||||
|
\g'-n' call subpattern by relative number (PCRE2 extension)
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?(condition)yes-pattern)
|
||||||
|
(?(condition)yes-pattern|no-pattern)
|
||||||
|
|
||||||
|
(?(n)... absolute reference condition
|
||||||
|
(?(+n)... relative reference condition
|
||||||
|
(?(-n)... relative reference condition
|
||||||
|
(?(<name>)... named reference condition (Perl)
|
||||||
|
(?('name')... named reference condition (Perl)
|
||||||
|
(?(name)... named reference condition (PCRE2)
|
||||||
|
(?(R)... overall recursion condition
|
||||||
|
(?(Rn)... specific group recursion condition
|
||||||
|
(?(R&name)... specific recursion condition
|
||||||
|
(?(DEFINE)... define subpattern for reference
|
||||||
|
(?(assert)... assertion condition
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
|
<P>
|
||||||
|
The following act immediately they are reached:
|
||||||
|
<pre>
|
||||||
|
(*ACCEPT) force successful match
|
||||||
|
(*FAIL) force backtrack; synonym (*F)
|
||||||
|
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||||
|
</pre>
|
||||||
|
The following act only when a subsequent match failure causes a backtrack to
|
||||||
|
reach them. They all force a match failure, but they differ in what happens
|
||||||
|
afterwards. Those that advance the start-of-match point do so only if the
|
||||||
|
pattern is not anchored.
|
||||||
|
<pre>
|
||||||
|
(*COMMIT) overall failure, no advance of starting point
|
||||||
|
(*PRUNE) advance to next starting character
|
||||||
|
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
|
||||||
|
(*SKIP) advance to current matching position
|
||||||
|
(*SKIP:NAME) advance to position corresponding to an earlier
|
||||||
|
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||||
|
(*THEN) local failure, backtrack to next alternation
|
||||||
|
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(?C) callout
|
||||||
|
(?Cn) callout with data n
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
||||||
|
<P>
|
||||||
|
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||||
|
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
|
||||||
|
<P>
|
||||||
|
Philip Hazel
|
||||||
|
<br>
|
||||||
|
University Computing Service
|
||||||
|
<br>
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
<br>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
|
<P>
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
<br>
|
||||||
|
Copyright &copy; 1997-2014 University of Cambridge.
|
||||||
|
<br>
|
||||||
|
<p>
|
||||||
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
</p>
|
|
@ -0,0 +1,178 @@
|
||||||
|
.TH PCRE2PERFORM 3 "20 October 2014" "PCRE2 10.00"
|
||||||
|
.SH NAME
|
||||||
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
|
.SH "PCRE2 PERFORMANCE"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Two aspects of performance are discussed below: memory usage and processing
|
||||||
|
time. The way you express your pattern as a regular expression can affect both
|
||||||
|
of them.
|
||||||
|
.
|
||||||
|
.SH "COMPILED PATTERN MEMORY USAGE"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||||
|
so that most simple patterns do not use much memory. However, there is one case
|
||||||
|
where the memory usage of a compiled pattern can be unexpectedly large. If a
|
||||||
|
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
|
||||||
|
a limited maximum, the whole subpattern is repeated in the compiled code. For
|
||||||
|
example, the pattern
|
||||||
|
.sp
|
||||||
|
(abc|def){2,4}
|
||||||
|
.sp
|
||||||
|
is compiled as if it were
|
||||||
|
.sp
|
||||||
|
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||||
|
.sp
|
||||||
|
(Technical aside: It is done this way so that backtrack points within each of
|
||||||
|
the repetitions can be independently maintained.)
|
||||||
|
.P
|
||||||
|
For regular expressions whose quantifiers use only small numbers, this is not
|
||||||
|
usually a problem. However, if the numbers are large, and particularly if such
|
||||||
|
repetitions are nested, the memory usage can become an embarrassment. For
|
||||||
|
example, the very simple pattern
|
||||||
|
.sp
|
||||||
|
((ab){1,1000}c){1,3}
|
||||||
|
.sp
|
||||||
|
uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
|
||||||
|
with its default internal pointer size of two bytes, the size limit on a
|
||||||
|
compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
|
||||||
|
is reached with the above pattern if the outer repetition is increased from 3
|
||||||
|
to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
|
||||||
|
larger compiled patterns, but it is better to try to rewrite your pattern to
|
||||||
|
use less memory if you can.
|
||||||
|
.P
|
||||||
|
One way of reducing the memory usage for such patterns is to make use of
|
||||||
|
PCRE2's
|
||||||
|
.\" HTML <a href="pcre2pattern.html#subpatternsassubroutines">
|
||||||
|
.\" </a>
|
||||||
|
"subroutine"
|
||||||
|
.\"
|
||||||
|
facility. Re-writing the above pattern as
|
||||||
|
.sp
|
||||||
|
((ab)(?2){0,999}c)(?1){0,2}
|
||||||
|
.sp
|
||||||
|
reduces the memory requirements to 18K, and indeed it remains under 20K even
|
||||||
|
with the outer repetition increased to 100. However, this pattern is not
|
||||||
|
exactly equivalent, because the "subroutine" calls are treated as
|
||||||
|
.\" HTML <a href="pcre2pattern.html#atomicgroup">
|
||||||
|
.\" </a>
|
||||||
|
atomic groups
|
||||||
|
.\"
|
||||||
|
into which there can be no backtracking if there is a subsequent matching
|
||||||
|
failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
|
||||||
|
Furthermore, there is a noticeable loss of speed when executing the modified
|
||||||
|
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
|
||||||
|
speed is acceptable, this kind of rewriting will allow you to process patterns
|
||||||
|
that PCRE2 cannot otherwise handle.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "STACK USAGE AT RUN TIME"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
|
||||||
|
cause it to use large amounts of the process stack. In some environments the
|
||||||
|
default process stack is quite small, and if it runs out the result is often
|
||||||
|
SIGSEGV. Rewriting your pattern can often help. The
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2stack\fP
|
||||||
|
.\"
|
||||||
|
documentation discusses this issue in detail.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "PROCESSING TIME"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Certain items in regular expression patterns are processed more efficiently
|
||||||
|
than others. It is more efficient to use a character class like [aeiou] than a
|
||||||
|
set of single-character alternatives such as (a|e|i|o|u). In general, the
|
||||||
|
simplest construction that provides the required behaviour is usually the most
|
||||||
|
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
|
||||||
|
about optimizing regular expressions for efficient performance. This document
|
||||||
|
contains a few observations about PCRE2.
|
||||||
|
.P
|
||||||
|
Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,
|
||||||
|
because PCRE2 has to use a multi-stage table lookup whenever it needs a
|
||||||
|
character's property. If you can find an alternative pattern that does not use
|
||||||
|
character properties, it will probably be faster.
|
||||||
|
.P
|
||||||
|
By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX
|
||||||
|
character classes such as [:alpha:] do not use Unicode properties, partly for
|
||||||
|
backwards compatibility, and partly for performance reasons. However, you can
|
||||||
|
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
|
||||||
|
character properties to be used. This can double the matching time for items
|
||||||
|
such as \ed, when matched with \fBpcre2_match()\fP; the performance loss is
|
||||||
|
less with a DFA matching function, and in both cases there is not much
|
||||||
|
difference for \eb.
|
||||||
|
.P
|
||||||
|
When a pattern begins with .* not in parentheses, or in parentheses that are
|
||||||
|
not the subject of a backreference, and the PCRE2_DOTALL option is set, the
|
||||||
|
pattern is implicitly anchored by PCRE2, since it can match only at the start
|
||||||
|
of a subject string. However, if PCRE2_DOTALL is not set, PCRE2 cannot make
|
||||||
|
this optimization, because the dot metacharacter does not then match a newline,
|
||||||
|
and if the subject string contains newlines, the pattern may match from the
|
||||||
|
character immediately following one of them instead of from the very start. For
|
||||||
|
example, the pattern
|
||||||
|
.sp
|
||||||
|
  .*second
|
||||||
|
.sp
|
||||||
|
matches the subject "first\enand second" (where \en stands for a newline
|
||||||
|
character), with the match starting at the seventh character. In order to do
|
||||||
|
this, PCRE2 has to retry the match starting after every newline in the subject.
|
||||||
|
.P
|
||||||
|
If you are using such a pattern with subject strings that do not contain
|
||||||
|
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
|
||||||
|
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
|
||||||
|
from having to scan along the subject looking for a newline to restart at.
|
||||||
|
.P
|
||||||
|
Beware of patterns that contain nested indefinite repeats. These can take a
|
||||||
|
long time to run when applied to a string that does not match. Consider the
|
||||||
|
pattern fragment
|
||||||
|
.sp
|
||||||
|
^(a+)*
|
||||||
|
.sp
|
||||||
|
This can match "aaaa" in 16 different ways, and this number increases very
|
||||||
|
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
|
||||||
|
times, and for each of those cases other than 0 or 4, the + repeats can match
|
||||||
|
different numbers of times.) When the remainder of the pattern is such that the
|
||||||
|
entire match is going to fail, PCRE2 has in principle to try every possible
|
||||||
|
variation, and this can take an extremely long time, even for relatively short
|
||||||
|
strings.
|
||||||
|
.P
|
||||||
|
An optimization catches some of the simpler cases such as
|
||||||
|
.sp
|
||||||
|
(a+)*b
|
||||||
|
.sp
|
||||||
|
where a literal character follows. Before embarking on the standard matching
|
||||||
|
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
|
||||||
|
there is not, it fails the match immediately. However, when there is no
|
||||||
|
following literal this optimization cannot be used. You can see the difference
|
||||||
|
by comparing the behaviour of
|
||||||
|
.sp
|
||||||
|
(a+)*\ed
|
||||||
|
.sp
|
||||||
|
with the pattern above. The pattern above, with its literal "b", fails almost
|
||||||
|
instantly when applied to a whole line of "a" characters, whereas the pattern
|
||||||
|
ending in \ed takes an appreciable time with strings longer than about 20 characters.
|
||||||
|
.P
|
||||||
|
In many cases, the solution to this kind of performance issue is to use an
|
||||||
|
atomic group or a possessive quantifier.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH AUTHOR
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Philip Hazel
|
||||||
|
University Computing Service
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH REVISION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
.fi
|
|
@ -0,0 +1,268 @@
|
||||||
|
.TH PCRE2POSIX 3 "20 October 2014" "PCRE2 10.00"
|
||||||
|
.SH NAME
|
||||||
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
|
.SH "SYNOPSIS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.B #include <pcre2posix.h>
|
||||||
|
.PP
|
||||||
|
.nf
|
||||||
|
.B int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP,
|
||||||
|
.B " int \fIcflags\fP);"
|
||||||
|
.sp
|
||||||
|
.B int regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP,
|
||||||
|
.B " size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);"
|
||||||
|
.sp
|
||||||
|
.B "size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,"
|
||||||
|
.B " char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);"
|
||||||
|
.sp
|
||||||
|
.B void regfree(regex_t *\fIpreg\fP);
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.SH DESCRIPTION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||||
|
expression 8-bit library. See the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation for a description of PCRE2's native API, which contains much
|
||||||
|
additional functionality. There is no POSIX-style wrapper for PCRE2's 16-bit
|
||||||
|
and 32-bit libraries.
|
||||||
|
.P
|
||||||
|
The functions described here are just wrapper functions that ultimately call
|
||||||
|
the PCRE2 native API. Their prototypes are defined in the \fBpcre2posix.h\fP
|
||||||
|
header file, and on Unix systems the library itself is called
|
||||||
|
\fBlibpcre2-posix.a\fP, so can be accessed by adding \fB-lpcre2-posix\fP to the
|
||||||
|
command for linking an application that uses them. Because the POSIX functions
|
||||||
|
call the native ones, it is also necessary to add \fB-lpcre2-8\fP.
|
||||||
|
.P
|
||||||
|
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
|
||||||
|
have been implemented. In addition, the option REG_EXTENDED is defined with the
|
||||||
|
value zero. This has no effect, but since programs that are written to the
|
||||||
|
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||||
|
replacement library. Other POSIX options are not even defined.
|
||||||
|
.P
|
||||||
|
There are also some other options that are not defined by POSIX. These have
|
||||||
|
been added at the request of users who want to make use of certain
|
||||||
|
PCRE2-specific features via the POSIX calling interface.
|
||||||
|
.P
|
||||||
|
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||||
|
in style. The syntax and semantics of the regular expressions themselves are
|
||||||
|
still those of Perl, subject to the setting of various PCRE2 options, as
|
||||||
|
described below. "POSIX-like in style" means that the API approximates to the
|
||||||
|
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
|
||||||
|
domains it is probably even less compatible.
|
||||||
|
.P
|
||||||
|
The header for these functions is supplied as \fBpcre2posix.h\fP to avoid any
|
||||||
|
potential clash with other POSIX libraries. It can, of course, be renamed or
|
||||||
|
aliased as \fBregex.h\fP, which is the "correct" name. It provides two
|
||||||
|
structure types, \fIregex_t\fP for compiled internal forms, and
|
||||||
|
\fIregmatch_t\fP for returning captured substrings. It also defines some
|
||||||
|
constants whose names start with "REG_"; these are used for setting options and
|
||||||
|
identifying error codes.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "COMPILING A PATTERN"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The function \fBregcomp()\fP is called to compile a pattern into an
|
||||||
|
internal form. The pattern is a C string terminated by a binary zero, and
|
||||||
|
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
|
||||||
|
to a \fBregex_t\fP structure that is used as a base for storing information
|
||||||
|
about the compiled regular expression.
|
||||||
|
.P
|
||||||
|
The argument \fIcflags\fP is either zero, or contains one or more of the bits
|
||||||
|
defined by the following macros:
|
||||||
|
.sp
|
||||||
|
REG_DOTALL
|
||||||
|
.sp
|
||||||
|
The PCRE2_DOTALL option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that REG_DOTALL is not part of the
|
||||||
|
POSIX standard.
|
||||||
|
.sp
|
||||||
|
REG_ICASE
|
||||||
|
.sp
|
||||||
|
The PCRE2_CASELESS option is set when the regular expression is passed for
|
||||||
|
compilation to the native function.
|
||||||
|
.sp
|
||||||
|
REG_NEWLINE
|
||||||
|
.sp
|
||||||
|
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that this does \fInot\fP mimic the
|
||||||
|
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||||
|
.sp
|
||||||
|
REG_NOSUB
|
||||||
|
.sp
|
||||||
|
The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is passed
|
||||||
|
for compilation to the native function. In addition, when a pattern that is
|
||||||
|
compiled with this flag is passed to \fBregexec()\fP for matching, the
|
||||||
|
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
|
||||||
|
are returned.
|
||||||
|
.sp
|
||||||
|
REG_UCP
|
||||||
|
.sp
|
||||||
|
The PCRE2_UCP option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. This causes PCRE2 to use Unicode properties
|
||||||
|
when matching \ed, \ew, etc., instead of just recognizing ASCII values. Note
|
||||||
|
that REG_UCP is not part of the POSIX standard.
|
||||||
|
.sp
|
||||||
|
REG_UNGREEDY
|
||||||
|
.sp
|
||||||
|
The PCRE2_UNGREEDY option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. Note that REG_UNGREEDY is not part of the
|
||||||
|
POSIX standard.
|
||||||
|
.sp
|
||||||
|
REG_UTF
|
||||||
|
.sp
|
||||||
|
The PCRE2_UTF option is set when the regular expression is passed for
|
||||||
|
compilation to the native function. This causes the pattern itself and all data
|
||||||
|
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
|
||||||
|
is not part of the POSIX standard.
|
||||||
|
.P
|
||||||
|
In the absence of these flags, no options are passed to the native function.
|
||||||
|
This means that the regex is compiled with PCRE2 default semantics. In
|
||||||
|
particular, the way it handles newline characters in the subject string is the
|
||||||
|
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
|
||||||
|
\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way
|
||||||
|
newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||||
|
class such as [^a] (they are).
|
||||||
|
.P
|
||||||
|
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
|
||||||
|
\fIpreg\fP structure is filled in on success, and one member of the structure
|
||||||
|
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
|
||||||
|
the regular expression. Various error codes are defined in the header file.
|
||||||
|
.P
|
||||||
|
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
|
||||||
|
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
|
||||||
|
\fBregexec()\fP, the result is undefined and your program is likely to crash.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "MATCHING NEWLINE CHARACTERS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
This area is not simple, because POSIX and Perl take different views of things.
|
||||||
|
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||||
|
never intended to be a POSIX engine. The following table lists the different
|
||||||
|
possibilities for matching newline characters in PCRE2:
|
||||||
|
.sp
|
||||||
|
                             Default   Change with
|
||||||
|
.sp
|
||||||
|
  . matches newline          no        PCRE2_DOTALL
|
||||||
|
  newline matches [^a]       yes       not changeable
|
||||||
|
  $ matches \en at end       yes       PCRE2_DOLLAR_ENDONLY
|
||||||
|
  $ matches \en in middle    no        PCRE2_MULTILINE
|
||||||
|
  ^ matches \en in middle    no        PCRE2_MULTILINE
|
||||||
|
.sp
|
||||||
|
This is the equivalent table for POSIX:
|
||||||
|
.sp
|
||||||
|
                             Default   Change with
|
||||||
|
.sp
|
||||||
|
  . matches newline          yes       REG_NEWLINE
|
||||||
|
  newline matches [^a]       yes       REG_NEWLINE
|
||||||
|
  $ matches \en at end       no        REG_NEWLINE
|
||||||
|
  $ matches \en in middle    no        REG_NEWLINE
|
||||||
|
  ^ matches \en in middle    no        REG_NEWLINE
|
||||||
|
.sp
|
||||||
|
PCRE2's behaviour is the same as Perl's, except that there is no equivalent for
|
||||||
|
PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there is no way to stop
|
||||||
|
newline from matching [^a].
|
||||||
|
.P
|
||||||
|
The default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||||
|
PCRE2_DOLLAR_ENDONLY, but there is no way to make PCRE2 behave exactly as for
|
||||||
|
the REG_NEWLINE action.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "MATCHING A PATTERN"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The function \fBregexec()\fP is called to match a compiled pattern \fIpreg\fP
|
||||||
|
against a given \fIstring\fP, which is by default terminated by a zero byte
|
||||||
|
(but see REG_STARTEND below), subject to the options in \fIeflags\fP. These can
|
||||||
|
be:
|
||||||
|
.sp
|
||||||
|
REG_NOTBOL
|
||||||
|
.sp
|
||||||
|
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
|
||||||
|
function.
|
||||||
|
.sp
|
||||||
|
REG_NOTEMPTY
|
||||||
|
.sp
|
||||||
|
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
|
||||||
|
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
|
||||||
|
setting this option can give more POSIX-like behaviour in some situations.
|
||||||
|
.sp
|
||||||
|
REG_NOTEOL
|
||||||
|
.sp
|
||||||
|
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
|
||||||
|
function.
|
||||||
|
.sp
|
||||||
|
REG_STARTEND
|
||||||
|
.sp
|
||||||
|
The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and
|
||||||
|
to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
|
||||||
|
(there need not actually be a NUL at that location), regardless of the value of
|
||||||
|
\fInmatch\fP. This is a BSD extension, compatible with but not specified by
|
||||||
|
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
|
||||||
|
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
|
||||||
|
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
|
||||||
|
how it is matched.
|
||||||
|
.P
|
||||||
|
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||||
|
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
|
||||||
|
\fBregexec()\fP are ignored.
|
||||||
|
.P
|
||||||
|
If the value of \fInmatch\fP is zero, or if the value of \fIpmatch\fP is NULL,
|
||||||
|
no data about any matched strings is returned.
|
||||||
|
.P
|
||||||
|
Otherwise, the portion of the string that was matched, and also any captured
|
||||||
|
substrings, are returned via the \fIpmatch\fP argument, which points to an
|
||||||
|
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
|
||||||
|
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
|
||||||
|
character of each substring and the offset to the first character after the end
|
||||||
|
of each substring, respectively. The 0th element of the vector relates to the
|
||||||
|
entire portion of \fIstring\fP that was matched; subsequent elements relate to
|
||||||
|
the capturing subpatterns of the regular expression. Unused entries in the
|
||||||
|
array have both structure members set to -1.
|
||||||
|
.P
|
||||||
|
A successful match yields a zero return; various error codes are defined in the
|
||||||
|
header file, of which REG_NOMATCH is the "expected" failure code.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "ERROR MESSAGES"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The \fBregerror()\fP function maps a non-zero errorcode from either
|
||||||
|
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
|
||||||
|
NULL, the error should have arisen from the use of that structure. A message
|
||||||
|
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
|
||||||
|
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
|
||||||
|
function is the size of buffer needed to hold the whole message.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH MEMORY USAGE
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Compiling a regular expression causes memory to be allocated and associated
|
||||||
|
with the \fIpreg\fP structure. The function \fBregfree()\fP frees all such
|
||||||
|
memory, after which \fIpreg\fP may no longer be used as a compiled expression.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH AUTHOR
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Philip Hazel
|
||||||
|
University Computing Service
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH REVISION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
.fi
|
|
@ -0,0 +1,94 @@
|
||||||
|
.TH PCRE2SAMPLE 3 "20 October 2014" "PCRE2 10.00"
|
||||||
|
.SH NAME
|
||||||
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
|
.SH "PCRE2 SAMPLE PROGRAM"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
A simple, complete demonstration program to get you started with using PCRE2 is
|
||||||
|
supplied in the file \fIpcre2demo.c\fP in the \fBsrc\fP directory in the PCRE2
|
||||||
|
distribution. A listing of this program is given in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2demo\fP
|
||||||
|
.\"
|
||||||
|
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||||
|
save this listing to re-create the contents of \fIpcre2demo.c\fP.
|
||||||
|
.P
|
||||||
|
The demonstration program, which uses the PCRE2 8-bit library, compiles the
|
||||||
|
regular expression that is its first argument, and matches it against the
|
||||||
|
subject string in its second argument. No PCRE2 options are set, and default
|
||||||
|
character tables are used. If matching succeeds, the program outputs the
|
||||||
|
portion of the subject that matched, together with the contents of any captured
|
||||||
|
substrings.
|
||||||
|
.P
|
||||||
|
If the -g option is given on the command line, the program then goes on to
|
||||||
|
check for further matches of the same regular expression in the same subject
|
||||||
|
string. The logic is a little bit tricky because of the possibility of matching
|
||||||
|
an empty string. Comments in the code explain what is going on.
|
||||||
|
.P
|
||||||
|
If PCRE2 is installed in the standard include and library directories for your
|
||||||
|
operating system, you should be able to compile the demonstration program using
|
||||||
|
this command:
|
||||||
|
.sp
|
||||||
|
gcc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||||
|
.sp
|
||||||
|
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||||
|
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||||
|
\fI/usr/local\fP, you can compile the demonstration program using a command
|
||||||
|
like this:
|
||||||
|
.sp
|
||||||
|
.\" JOINSH
|
||||||
|
gcc -o pcre2demo -I/usr/local/include pcre2demo.c \e
|
||||||
|
-L/usr/local/lib -lpcre2-8
|
||||||
|
.sp
|
||||||
|
.P
|
||||||
|
Once you have compiled and linked the demonstration program, you can run simple
|
||||||
|
tests like this:
|
||||||
|
.sp
|
||||||
|
  ./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||||
|
  ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||||
|
.sp
|
||||||
|
Note that there is a much more comprehensive test program, called
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2test\fP,
|
||||||
|
.\"
|
||||||
|
which supports many more facilities for testing regular expressions using the
|
||||||
|
PCRE2 libraries. The
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2demo\fP
|
||||||
|
.\"
|
||||||
|
program is provided as a simple coding example.
|
||||||
|
.P
|
||||||
|
If you try to run
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2demo\fP
|
||||||
|
.\"
|
||||||
|
when PCRE2 is not installed in the standard library directory, you may get an
|
||||||
|
error like this on some operating systems (e.g. Solaris):
|
||||||
|
.sp
|
||||||
|
ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or directory
|
||||||
|
.sp
|
||||||
|
This is caused by the way shared library support works on those systems. You
|
||||||
|
need to add
|
||||||
|
.sp
|
||||||
|
-R/usr/local/lib
|
||||||
|
.sp
|
||||||
|
(for example) to the compile command to get round this problem.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH AUTHOR
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Philip Hazel
|
||||||
|
University Computing Service
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH REVISION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
.fi
|
|
@ -0,0 +1,199 @@
|
||||||
|
.TH PCRE2STACK 3 "20 October 2014" "PCRE2 10.00"
|
||||||
|
.SH NAME
|
||||||
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
|
.SH "PCRE2 DISCUSSION OF STACK USAGE"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
When you call \fBpcre2_match()\fP, it makes use of an internal function called
|
||||||
|
\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
|
||||||
|
in order to remember the state of the match so that it can back up and try a
|
||||||
|
different alternative after a failure. As matching proceeds deeper and deeper
|
||||||
|
into the tree of possibilities, the recursion depth increases. The
|
||||||
|
\fBmatch()\fP function is also called in other circumstances, for example,
|
||||||
|
whenever a parenthesized sub-pattern is entered, and in certain cases of
|
||||||
|
repetition.
|
||||||
|
.P
|
||||||
|
Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
|
||||||
|
as a* it may be called several times at the same level, after matching
|
||||||
|
different numbers of a's. Furthermore, in a number of cases where the result of
|
||||||
|
the recursive call would immediately be passed back as the result of the
|
||||||
|
current call (a "tail recursion"), the function is just restarted instead.
|
||||||
|
.P
|
||||||
|
The above comments apply when \fBpcre2_match()\fP is run in its normal
|
||||||
|
interpretive manner. If the compiled pattern was processed by
|
||||||
|
\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
|
||||||
|
options passed to \fBpcre2_match()\fP were not incompatible, the matching
|
||||||
|
process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
|
||||||
|
this case, the memory requirements are handled entirely differently. See the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2jit\fP
|
||||||
|
.\"
|
||||||
|
documentation for details.
|
||||||
|
.P
|
||||||
|
The \fBpcre2_dfa_match()\fP function operates in a different way to
|
||||||
|
\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
|
||||||
|
recursion or subroutine call in the pattern. This includes the processing of
|
||||||
|
assertion and "once-only" subpatterns, which are handled like subroutine calls.
|
||||||
|
Normally, these are never very deep, and the limit on the complexity of
|
||||||
|
\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
|
||||||
|
However, it is possible to write patterns with runaway infinite recursions;
|
||||||
|
such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
|
||||||
|
present, there is no protection against this.
|
||||||
|
.P
|
||||||
|
The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
|
||||||
|
relevant only for \fBpcre2_match()\fP without the JIT optimization.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Reducing \fBpcre2_match()\fP's stack usage"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Each time that the internal \fBmatch()\fP function is called recursively, it
|
||||||
|
uses memory from the process stack. For certain kinds of pattern and data, very
|
||||||
|
large amounts of stack may be needed, despite the recognition of "tail
|
||||||
|
recursion". You can often reduce the amount of recursion, and therefore the
|
||||||
|
amount of stack used, by modifying the pattern that is being matched. Consider,
|
||||||
|
for example, this pattern:
|
||||||
|
.sp
|
||||||
|
([^<]|<(?!inet))+
|
||||||
|
.sp
|
||||||
|
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||||
|
the data, and is the kind of pattern that might be used when processing an XML
|
||||||
|
file. Each iteration of the outer parentheses matches either one character that
|
||||||
|
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||||
|
parenthesis is processed, a recursion occurs, so this formulation uses a stack
|
||||||
|
frame for each matched character. For a long string, a lot of stack is
|
||||||
|
required. Consider now this rewritten pattern, which matches exactly the same
|
||||||
|
strings:
|
||||||
|
.sp
|
||||||
|
([^<]++|<(?!inet))+
|
||||||
|
.sp
|
||||||
|
This uses very much less stack, because runs of characters that do not contain
|
||||||
|
"<" are "swallowed" in one item inside the parentheses. Recursion happens only
|
||||||
|
when a "<" character that is not followed by "inet" is encountered (and we
|
||||||
|
assume this is relatively rare). A possessive quantifier is used to stop any
|
||||||
|
backtracking into the runs of non-"<" characters, but that is not related to
|
||||||
|
stack usage.
|
||||||
|
.P
|
||||||
|
This example shows that one way of avoiding stack problems when matching long
|
||||||
|
subject strings is to write repeated parenthesized subpatterns to match more
|
||||||
|
than one character whenever possible.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
In environments where stack memory is constrained, you might want to compile
|
||||||
|
PCRE2 to use heap memory instead of stack for remembering back-up points when
|
||||||
|
\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
|
||||||
|
of how to do this are given in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2build\fP
|
||||||
|
.\"
|
||||||
|
documentation. When built in this way, instead of using the stack, PCRE2
|
||||||
|
gets memory for remembering backup points from the heap. By default, the memory
|
||||||
|
is obtained by calling the system \fBmalloc()\fP function, but you can arrange
|
||||||
|
to supply your own memory management function. For details, see the section
|
||||||
|
entitled
|
||||||
|
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||||
|
.\" </a>
|
||||||
|
"The match context"
|
||||||
|
.\"
|
||||||
|
in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation. Since the block sizes are always the same, it may be possible to
|
||||||
|
implement a customized memory handler that is more efficient than the standard
|
||||||
|
function. The memory blocks obtained for this purpose are retained and re-used
|
||||||
|
if possible while \fBpcre2_match()\fP is running. They are all freed just
|
||||||
|
before it exits.
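Because every block has the same size, a simple free-list pool is one possible shape for such a handler. The following is a minimal sketch, not part of PCRE2: the type and function names are hypothetical, and the (size, user data) and (block, user data) signatures are modelled on the custom memory functions described in the pcre2api documentation, which should be checked before use. Requests larger than the pool's block size are refused rather than served.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size block pool; all names are illustrative only. */
typedef struct pool_block { struct pool_block *next; } pool_block;

typedef struct {
    size_t block_size;      /* size of every block handed out */
    pool_block *free_list;  /* released blocks waiting to be re-used */
} block_pool;

void pool_init(block_pool *p, size_t block_size)
{
    /* A block must be able to hold the free-list link. */
    p->block_size = block_size < sizeof(pool_block) ? sizeof(pool_block)
                                                    : block_size;
    p->free_list = NULL;
}

/* Allocation function: re-use a released block if one is available. */
void *pool_malloc(size_t size, void *data)
{
    block_pool *p = data;
    if (size > p->block_size) return NULL;  /* this sketch serves one size only */
    if (p->free_list != NULL) {
        pool_block *b = p->free_list;
        p->free_list = b->next;
        return b;
    }
    return malloc(p->block_size);
}

/* Free function: keep the block on the free list instead of releasing it. */
void pool_free(void *block, void *data)
{
    block_pool *p = data;
    pool_block *b = block;
    if (b == NULL) return;
    b->next = p->free_list;
    p->free_list = b;
}

/* Return all retained blocks to the system. */
void pool_destroy(block_pool *p)
{
    while (p->free_list != NULL) {
        pool_block *b = p->free_list;
        p->free_list = b->next;
        free(b);
    }
}
```

The design avoids a call to the system allocator whenever a previously released block is available. Whether that is faster than the standard function in practice depends on the platform and should be measured.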
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Limiting \fBpcre2_match()\fP's stack usage"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
You can set limits on the number of times the internal \fBmatch()\fP function
|
||||||
|
is called, both in total and recursively. If a limit is exceeded,
|
||||||
|
\fBpcre2_match()\fP returns an error code. Setting suitable limits should
|
||||||
|
prevent it from running out of stack. The default values of the limits are very
|
||||||
|
large, and unlikely ever to operate. They can be changed when PCRE2 is built,
|
||||||
|
and they can also be set when \fBpcre2_match()\fP is called. For details of
|
||||||
|
these interfaces, see the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2build\fP
|
||||||
|
.\"
|
||||||
|
documentation and the section entitled
|
||||||
|
.\" HTML <a href="pcre2api.html#matchcontext">
|
||||||
|
.\" </a>
|
||||||
|
"The match context"
|
||||||
|
.\"
|
||||||
|
in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2api\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
|
.P
|
||||||
|
As a very rough rule of thumb, you should reckon on about 500 bytes per
|
||||||
|
recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
|
||||||
|
the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
|
||||||
|
around 128000 recursions.
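The arithmetic behind these figures is a single division. The helper below is a hypothetical sketch that applies the 500-byte rule of thumb; the constant is only the rough estimate given above, and the function name is invented for this illustration.

```c
/* Rough estimate from the text: about 500 bytes of stack per recursion. */
#define BYTES_PER_RECURSION 500UL

/* Suggest a recursion limit for a given stack budget in bytes. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

With a budget of 8000000 bytes this gives 16000, and with 64000000 bytes it gives 128000, matching the figures above. The result would then be used as the recursion limit in the match context, as described in the pcre2api documentation.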
|
||||||
|
.P
|
||||||
|
The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
|
||||||
|
applied to a subject line, causes it to find the smallest limits that allow a
|
||||||
|
pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
|
||||||
|
different limits.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Changing stack size in Unix-like systems"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
In Unix-like environments, there is not often a problem with the stack unless
|
||||||
|
very long strings are involved, though the default limit on stack size varies
|
||||||
|
from system to system. Values from 8Mb to 64Mb are common. You can find your
|
||||||
|
default limit by running the command:
|
||||||
|
.sp
|
||||||
|
ulimit -s
|
||||||
|
.sp
|
||||||
|
Unfortunately, the effect of running out of stack is often SIGSEGV, though
|
||||||
|
sometimes a more explicit error message is given. You can normally increase the
|
||||||
|
limit on stack size by code such as this:
|
||||||
|
.sp
|
||||||
|
struct rlimit rlim;
|
||||||
|
getrlimit(RLIMIT_STACK, &rlim);
|
||||||
|
rlim.rlim_cur = 100*1024*1024;
|
||||||
|
setrlimit(RLIMIT_STACK, &rlim);
|
||||||
|
.sp
|
||||||
|
This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
|
||||||
|
attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
|
||||||
|
do this before calling \fBpcre2_match()\fP.
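The fragment above omits the header and ignores return values. A slightly more defensive variant might look like the sketch below. The function name is hypothetical, and clamping the request to the hard limit is a design choice of this sketch rather than something the text prescribes.

```c
#include <sys/resource.h>

/* Try to raise the soft stack limit to at least "wanted" bytes.
   Returns 0 on success or if the limit is already sufficient,
   -1 if getrlimit() or setrlimit() fails (errno is set by them). */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

    /* Nothing to do if the soft limit is unlimited or already big enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

    /* An unprivileged process cannot exceed the hard limit, so clamp. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A caller would invoke, for example, raise_stack_soft_limit((rlim_t)100*1024*1024) early in the program, before any call to pcre2_match(), and check the result before relying on the larger stack.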
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Changing stack size in Mac OS X"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
|
||||||
|
is also possible to set a stack size when linking a program. There is a
|
||||||
|
discussion about stack sizes in Mac OS X at this web site:
|
||||||
|
.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
|
||||||
|
.\" </a>
|
||||||
|
http://developer.apple.com/qa/qa2005/qa1419.html.
|
||||||
|
.\"
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH AUTHOR
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Philip Hazel
|
||||||
|
University Computing Service
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH REVISION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
.fi
|
|
@ -0,0 +1,540 @@
|
||||||
|
.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00"
|
||||||
|
.SH NAME
|
||||||
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The full syntax and semantics of the regular expressions that are supported by
|
||||||
|
PCRE2 are described in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2pattern\fP
|
||||||
|
.\"
|
||||||
|
documentation. This document contains a quick-reference summary of the syntax.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "QUOTING"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\ex where x is non-alphanumeric is a literal x
|
||||||
|
\eQ...\eE treat enclosed characters as literal
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CHARACTERS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\ea alarm, that is, the BEL character (hex 07)
|
||||||
|
\ecx "control-x", where x is any ASCII character
|
||||||
|
\ee escape (hex 1B)
|
||||||
|
\ef form feed (hex 0C)
|
||||||
|
\en newline (hex 0A)
|
||||||
|
\er carriage return (hex 0D)
|
||||||
|
\et tab (hex 09)
|
||||||
|
\e0dd character with octal code 0dd
|
||||||
|
\eddd character with octal code ddd, or backreference
|
||||||
|
\eo{ddd..} character with octal code ddd..
|
||||||
|
\exhh character with hex code hh
|
||||||
|
\ex{hhh..} character with hex code hhh..
|
||||||
|
.sp
|
||||||
|
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
|
||||||
|
characters "8" and "9".
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CHARACTER TYPES"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
. any character except newline;
|
||||||
|
in dotall mode, any character whatsoever
|
||||||
|
\eC one data unit, even in UTF mode (best avoided)
|
||||||
|
\ed a decimal digit
|
||||||
|
\eD a character that is not a decimal digit
|
||||||
|
\eh a horizontal white space character
|
||||||
|
\eH a character that is not a horizontal white space character
|
||||||
|
\eN a character that is not a newline
|
||||||
|
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||||
|
\eP{\fIxx\fP} a character without the \fIxx\fP property
|
||||||
|
\eR a newline sequence
|
||||||
|
\es a white space character
|
||||||
|
\eS a character that is not a white space character
|
||||||
|
\ev a vertical white space character
|
||||||
|
\eV a character that is not a vertical white space character
|
||||||
|
\ew a "word" character
|
||||||
|
\eW a "non-word" character
|
||||||
|
\eX a Unicode extended grapheme cluster
|
||||||
|
.sp
|
||||||
|
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
|
||||||
|
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||||
|
happening, \es and \ew may also match characters with code points in the range
|
||||||
|
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
|
||||||
|
sequences is changed to use Unicode properties and they match many more
|
||||||
|
characters.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
C Other
|
||||||
|
Cc Control
|
||||||
|
Cf Format
|
||||||
|
Cn Unassigned
|
||||||
|
Co Private use
|
||||||
|
Cs Surrogate
|
||||||
|
.sp
|
||||||
|
L Letter
|
||||||
|
Ll Lower case letter
|
||||||
|
Lm Modifier letter
|
||||||
|
Lo Other letter
|
||||||
|
Lt Title case letter
|
||||||
|
Lu Upper case letter
|
||||||
|
L& Ll, Lu, or Lt
|
||||||
|
.sp
|
||||||
|
M Mark
|
||||||
|
Mc Spacing mark
|
||||||
|
Me Enclosing mark
|
||||||
|
Mn Non-spacing mark
|
||||||
|
.sp
|
||||||
|
N Number
|
||||||
|
Nd Decimal number
|
||||||
|
Nl Letter number
|
||||||
|
No Other number
|
||||||
|
.sp
|
||||||
|
P Punctuation
|
||||||
|
Pc Connector punctuation
|
||||||
|
Pd Dash punctuation
|
||||||
|
Pe Close punctuation
|
||||||
|
Pf Final punctuation
|
||||||
|
Pi Initial punctuation
|
||||||
|
Po Other punctuation
|
||||||
|
Ps Open punctuation
|
||||||
|
.sp
|
||||||
|
S Symbol
|
||||||
|
Sc Currency symbol
|
||||||
|
Sk Modifier symbol
|
||||||
|
Sm Mathematical symbol
|
||||||
|
So Other symbol
|
||||||
|
.sp
|
||||||
|
Z Separator
|
||||||
|
Zl Line separator
|
||||||
|
Zp Paragraph separator
|
||||||
|
Zs Space separator
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Xan Alphanumeric: union of properties L and N
|
||||||
|
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
||||||
|
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
||||||
|
Xuc Universally-named character: one that can be
|
||||||
|
represented by a Universal Character Name
|
||||||
|
Xwd Perl word: property Xan or underscore
|
||||||
|
.sp
|
||||||
|
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||||
|
at release 5.18.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "SCRIPT NAMES FOR \ep AND \eP"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Arabic,
|
||||||
|
Armenian,
|
||||||
|
Avestan,
|
||||||
|
Balinese,
|
||||||
|
Bamum,
|
||||||
|
Bassa_Vah,
|
||||||
|
Batak,
|
||||||
|
Bengali,
|
||||||
|
Bopomofo,
|
||||||
|
Brahmi,
|
||||||
|
Braille,
|
||||||
|
Buginese,
|
||||||
|
Buhid,
|
||||||
|
Canadian_Aboriginal,
|
||||||
|
Carian,
|
||||||
|
Caucasian_Albanian,
|
||||||
|
Chakma,
|
||||||
|
Cham,
|
||||||
|
Cherokee,
|
||||||
|
Common,
|
||||||
|
Coptic,
|
||||||
|
Cuneiform,
|
||||||
|
Cypriot,
|
||||||
|
Cyrillic,
|
||||||
|
Deseret,
|
||||||
|
Devanagari,
|
||||||
|
Duployan,
|
||||||
|
Egyptian_Hieroglyphs,
|
||||||
|
Elbasan,
|
||||||
|
Ethiopic,
|
||||||
|
Georgian,
|
||||||
|
Glagolitic,
|
||||||
|
Gothic,
|
||||||
|
Grantha,
|
||||||
|
Greek,
|
||||||
|
Gujarati,
|
||||||
|
Gurmukhi,
|
||||||
|
Han,
|
||||||
|
Hangul,
|
||||||
|
Hanunoo,
|
||||||
|
Hebrew,
|
||||||
|
Hiragana,
|
||||||
|
Imperial_Aramaic,
|
||||||
|
Inherited,
|
||||||
|
Inscriptional_Pahlavi,
|
||||||
|
Inscriptional_Parthian,
|
||||||
|
Javanese,
|
||||||
|
Kaithi,
|
||||||
|
Kannada,
|
||||||
|
Katakana,
|
||||||
|
Kayah_Li,
|
||||||
|
Kharoshthi,
|
||||||
|
Khmer,
|
||||||
|
Khojki,
|
||||||
|
Khudawadi,
|
||||||
|
Lao,
|
||||||
|
Latin,
|
||||||
|
Lepcha,
|
||||||
|
Limbu,
|
||||||
|
Linear_A,
|
||||||
|
Linear_B,
|
||||||
|
Lisu,
|
||||||
|
Lycian,
|
||||||
|
Lydian,
|
||||||
|
Mahajani,
|
||||||
|
Malayalam,
|
||||||
|
Mandaic,
|
||||||
|
Manichaean,
|
||||||
|
Meetei_Mayek,
|
||||||
|
Mende_Kikakui,
|
||||||
|
Meroitic_Cursive,
|
||||||
|
Meroitic_Hieroglyphs,
|
||||||
|
Miao,
|
||||||
|
Modi,
|
||||||
|
Mongolian,
|
||||||
|
Mro,
|
||||||
|
Myanmar,
|
||||||
|
Nabataean,
|
||||||
|
New_Tai_Lue,
|
||||||
|
Nko,
|
||||||
|
Ogham,
|
||||||
|
Ol_Chiki,
|
||||||
|
Old_Italic,
|
||||||
|
Old_North_Arabian,
|
||||||
|
Old_Permic,
|
||||||
|
Old_Persian,
|
||||||
|
Old_South_Arabian,
|
||||||
|
Old_Turkic,
|
||||||
|
Oriya,
|
||||||
|
Osmanya,
|
||||||
|
Pahawh_Hmong,
|
||||||
|
Palmyrene,
|
||||||
|
Pau_Cin_Hau,
|
||||||
|
Phags_Pa,
|
||||||
|
Phoenician,
|
||||||
|
Psalter_Pahlavi,
|
||||||
|
Rejang,
|
||||||
|
Runic,
|
||||||
|
Samaritan,
|
||||||
|
Saurashtra,
|
||||||
|
Sharada,
|
||||||
|
Shavian,
|
||||||
|
Siddham,
|
||||||
|
Sinhala,
|
||||||
|
Sora_Sompeng,
|
||||||
|
Sundanese,
|
||||||
|
Syloti_Nagri,
|
||||||
|
Syriac,
|
||||||
|
Tagalog,
|
||||||
|
Tagbanwa,
|
||||||
|
Tai_Le,
|
||||||
|
Tai_Tham,
|
||||||
|
Tai_Viet,
|
||||||
|
Takri,
|
||||||
|
Tamil,
|
||||||
|
Telugu,
|
||||||
|
Thaana,
|
||||||
|
Thai,
|
||||||
|
Tibetan,
|
||||||
|
Tifinagh,
|
||||||
|
Tirhuta,
|
||||||
|
Ugaritic,
|
||||||
|
Vai,
|
||||||
|
Warang_Citi,
|
||||||
|
Yi.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CHARACTER CLASSES"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
[...] positive character class
|
||||||
|
[^...] negative character class
|
||||||
|
[x-y] range (can be used for hex characters)
|
||||||
|
[[:xxx:]] positive POSIX named set
|
||||||
|
[[:^xxx:]] negative POSIX named set
|
||||||
|
.sp
|
||||||
|
alnum alphanumeric
|
||||||
|
alpha alphabetic
|
||||||
|
ascii 0-127
|
||||||
|
blank space or tab
|
||||||
|
cntrl control character
|
||||||
|
digit decimal digit
|
||||||
|
graph printing, excluding space
|
||||||
|
lower lower case letter
|
||||||
|
print printing, including space
|
||||||
|
punct printing, excluding alphanumeric
|
||||||
|
space white space
|
||||||
|
upper upper case letter
|
||||||
|
word same as \ew
|
||||||
|
xdigit hexadecimal digit
|
||||||
|
.sp
|
||||||
|
In PCRE2, POSIX character set names recognize only ASCII characters by default,
|
||||||
|
but some of them use Unicode properties if PCRE2_UCP is set. You can use
|
||||||
|
\eQ...\eE inside a character class.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "QUANTIFIERS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
? 0 or 1, greedy
|
||||||
|
?+ 0 or 1, possessive
|
||||||
|
?? 0 or 1, lazy
|
||||||
|
* 0 or more, greedy
|
||||||
|
*+ 0 or more, possessive
|
||||||
|
*? 0 or more, lazy
|
||||||
|
+ 1 or more, greedy
|
||||||
|
++ 1 or more, possessive
|
||||||
|
+? 1 or more, lazy
|
||||||
|
{n} exactly n
|
||||||
|
{n,m} at least n, no more than m, greedy
|
||||||
|
{n,m}+ at least n, no more than m, possessive
|
||||||
|
{n,m}? at least n, no more than m, lazy
|
||||||
|
{n,} n or more, greedy
|
||||||
|
{n,}+ n or more, possessive
|
||||||
|
{n,}? n or more, lazy
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "ANCHORS AND SIMPLE ASSERTIONS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\eb word boundary
|
||||||
|
\eB not a word boundary
|
||||||
|
^ start of subject
|
||||||
|
also after internal newline in multiline mode
|
||||||
|
\eA start of subject
|
||||||
|
$ end of subject
|
||||||
|
also before newline at end of subject
|
||||||
|
also before internal newline in multiline mode
|
||||||
|
\eZ end of subject
|
||||||
|
also before newline at end of subject
|
||||||
|
\ez end of subject
|
||||||
|
\eG first matching position in subject
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "MATCH POINT RESET"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\eK reset start of match
|
||||||
|
.sp
|
||||||
|
\eK is honoured in positive assertions, but ignored in negative ones.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "ALTERNATION"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
expr|expr|expr...
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CAPTURING"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(...) capturing group
|
||||||
|
(?<name>...) named capturing group (Perl)
|
||||||
|
(?'name'...) named capturing group (Perl)
|
||||||
|
(?P<name>...) named capturing group (Python)
|
||||||
|
(?:...) non-capturing group
|
||||||
|
(?|...) non-capturing group; reset group numbers for
|
||||||
|
capturing groups in each alternative
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "ATOMIC GROUPS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?>...) atomic, non-capturing group
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "COMMENT"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?#....) comment (not nestable)
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "OPTION SETTING"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?i) caseless
|
||||||
|
(?J) allow duplicate names
|
||||||
|
(?m) multiline
|
||||||
|
(?s) single line (dotall)
|
||||||
|
(?U) default ungreedy (lazy)
|
||||||
|
(?x) extended (ignore white space)
|
||||||
|
(?-...) unset option(s)
|
||||||
|
.sp
|
||||||
|
The following are recognized only at the very start of a pattern or after one
|
||||||
|
of the newline or \eR options with similar syntax. More than one of them may
|
||||||
|
appear.
|
||||||
|
.sp
|
||||||
|
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
|
||||||
|
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
|
||||||
|
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
||||||
|
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
||||||
|
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
||||||
|
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
||||||
|
(*UTF) set appropriate UTF mode for the library in use
|
||||||
|
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
|
||||||
|
.sp
|
||||||
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
|
||||||
|
limits set by the caller of pcre2_match(), not increase them.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "NEWLINE CONVENTION"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
These are recognized only at the very start of the pattern or after option
|
||||||
|
settings with a similar syntax.
|
||||||
|
.sp
|
||||||
|
(*CR) carriage return only
|
||||||
|
(*LF) linefeed only
|
||||||
|
(*CRLF) carriage return followed by linefeed
|
||||||
|
(*ANYCRLF) all three of the above
|
||||||
|
(*ANY) any Unicode newline sequence
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "WHAT \eR MATCHES"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
These are recognized only at the very start of the pattern or after option
|
||||||
|
settings with a similar syntax.
|
||||||
|
.sp
|
||||||
|
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||||
|
(*BSR_UNICODE) any Unicode newline sequence
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?=...) positive look ahead
|
||||||
|
(?!...) negative look ahead
|
||||||
|
(?<=...) positive look behind
|
||||||
|
(?<!...) negative look behind
|
||||||
|
.sp
|
||||||
|
Each top-level branch of a look behind must be of a fixed length.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "BACKREFERENCES"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\en reference by number (can be ambiguous)
|
||||||
|
\egn reference by number
|
||||||
|
\eg{n} reference by number
|
||||||
|
\eg{-n} relative reference by number
|
||||||
|
\ek<name> reference by name (Perl)
|
||||||
|
\ek'name' reference by name (Perl)
|
||||||
|
\eg{name} reference by name (Perl)
|
||||||
|
\ek{name} reference by name (.NET)
|
||||||
|
(?P=name) reference by name (Python)
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?R) recurse whole pattern
|
||||||
|
(?n) call subpattern by absolute number
|
||||||
|
(?+n) call subpattern by relative number
|
||||||
|
(?-n) call subpattern by relative number
|
||||||
|
(?&name) call subpattern by name (Perl)
|
||||||
|
(?P>name) call subpattern by name (Python)
|
||||||
|
\eg<name> call subpattern by name (Oniguruma)
|
||||||
|
\eg'name' call subpattern by name (Oniguruma)
|
||||||
|
\eg<n> call subpattern by absolute number (Oniguruma)
|
||||||
|
\eg'n' call subpattern by absolute number (Oniguruma)
|
||||||
|
\eg<+n> call subpattern by relative number (PCRE2 extension)
|
||||||
|
\eg'+n' call subpattern by relative number (PCRE2 extension)
|
||||||
|
\eg<-n> call subpattern by relative number (PCRE2 extension)
|
||||||
|
\eg'-n' call subpattern by relative number (PCRE2 extension)
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CONDITIONAL PATTERNS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?(condition)yes-pattern)
|
||||||
|
(?(condition)yes-pattern|no-pattern)
|
||||||
|
.sp
|
||||||
|
(?(n)... absolute reference condition
|
||||||
|
(?(+n)... relative reference condition
|
||||||
|
(?(-n)... relative reference condition
|
||||||
|
(?(<name>)... named reference condition (Perl)
|
||||||
|
(?('name')... named reference condition (Perl)
|
||||||
|
(?(name)... named reference condition (PCRE2)
|
||||||
|
(?(R)... overall recursion condition
|
||||||
|
(?(Rn)... specific group recursion condition
|
||||||
|
(?(R&name)... specific recursion condition
|
||||||
|
(?(DEFINE)... define subpattern for reference
|
||||||
|
(?(assert)... assertion condition
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "BACKTRACKING CONTROL"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The following act immediately they are reached:
|
||||||
|
.sp
|
||||||
|
(*ACCEPT) force successful match
|
||||||
|
(*FAIL) force backtrack; synonym (*F)
|
||||||
|
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||||
|
.sp
|
||||||
|
The following act only when a subsequent match failure causes a backtrack to
|
||||||
|
reach them. They all force a match failure, but they differ in what happens
|
||||||
|
afterwards. Those that advance the start-of-match point do so only if the
|
||||||
|
pattern is not anchored.
|
||||||
|
.sp
|
||||||
|
(*COMMIT) overall failure, no advance of starting point
|
||||||
|
(*PRUNE) advance to next starting character
|
||||||
|
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
|
||||||
|
(*SKIP) advance to current matching position
|
||||||
|
(*SKIP:NAME) advance to position corresponding to an earlier
|
||||||
|
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||||
|
(*THEN) local failure, backtrack to next alternation
|
||||||
|
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "CALLOUTS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(?C) callout
|
||||||
|
(?Cn) callout with data n
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH "SEE ALSO"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
|
||||||
|
\fBpcre2matching\fP(3), \fBpcre2\fP(3).
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH AUTHOR
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Philip Hazel
|
||||||
|
University Computing Service
|
||||||
|
Cambridge CB2 3QH, England.
|
||||||
|
.fi
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SH REVISION
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
.nf
|
||||||
|
Last updated: 20 October 2014
|
||||||
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
|
.fi
|