More documentation.

This commit is contained in:
Philip.Hazel 2014-09-29 16:45:37 +00:00
parent e15b64ef03
commit 5543597741
13 changed files with 2159 additions and 69 deletions

View File

@ -24,64 +24,71 @@ dist_html_DATA = \
doc/html/README.txt \
doc/html/index.html \
doc/html/pcre2-config.html \
doc/html/pcre2.html \
doc/html/pcre2api.html \
doc/html/pcre2build.html \
doc/html/pcre2callout.html \
doc/html/pcre2compat.html \
doc/html/pcre2demo.html \
doc/html/pcre2grep.html \
doc/html/pcre2jit.html \
doc/html/pcre2limits.html \
doc/html/pcre2matching.html \
doc/html/pcre2test.html \
doc/html/pcre2unicode.html
# doc/html/pcre.html \
# doc/html/pcre_assign_jit_stack.html \
# doc/html/pcre_compile.html \
# doc/html/pcre_compile2.html \
# doc/html/pcre_config.html \
# doc/html/pcre_copy_named_substring.html \
# doc/html/pcre_copy_substring.html \
# doc/html/pcre_dfa_match.html \
# doc/html/pcre_match.html \
# doc/html/pcre_free_study.html \
# doc/html/pcre_free_substring.html \
# doc/html/pcre_free_substring_list.html \
# doc/html/pcre_fullinfo.html \
# doc/html/pcre_get_named_substring.html \
# doc/html/pcre_get_stringnumber.html \
# doc/html/pcre_get_stringtable_entries.html \
# doc/html/pcre_get_substring.html \
# doc/html/pcre_get_substring_list.html \
# doc/html/pcre_jit_match.html \
# doc/html/pcre_jit_stack_alloc.html \
# doc/html/pcre_jit_stack_free.html \
# doc/html/pcre_maketables.html \
# doc/html/pcre_pattern_to_host_byte_order.html \
# doc/html/pcre_refcount.html \
# doc/html/pcre_study.html \
# doc/html/pcre_utf16_to_host_byte_order.html \
# doc/html/pcre_utf32_to_host_byte_order.html \
# doc/html/pcre_version.html \
# doc/html/pcrebuild.html \
# doc/html/pcrecompat.html \
# doc/html/pcregrep.html \
# doc/html/pcrejit.html \
# doc/html/pcrelimits.html \
# doc/html/pcrematching.html \
# doc/html/pcrepartial.html \
# doc/html/pcrepattern.html \
# doc/html/pcreperform.html \
# doc/html/pcreposix.html \
# doc/html/pcreprecompile.html \
# doc/html/pcresample.html \
# doc/html/pcrestack.html \
# doc/html/pcresyntax.html
# doc/html/pcre2_assign_jit_stack.html \
# doc/html/pcre2_compile.html \
# doc/html/pcre2_compile2.html \
# doc/html/pcre2_config.html \
# doc/html/pcre2_copy_named_substring.html \
# doc/html/pcre2_copy_substring.html \
# doc/html/pcre2_dfa_match.html \
# doc/html/pcre2_match.html \
# doc/html/pcre2_free_study.html \
# doc/html/pcre2_free_substring.html \
# doc/html/pcre2_free_substring_list.html \
# doc/html/pcre2_fullinfo.html \
# doc/html/pcre2_get_named_substring.html \
# doc/html/pcre2_get_stringnumber.html \
# doc/html/pcre2_get_stringtable_entries.html \
# doc/html/pcre2_get_substring.html \
# doc/html/pcre2_get_substring_list.html \
# doc/html/pcre2_jit_match.html \
# doc/html/pcre2_jit_stack_alloc.html \
# doc/html/pcre2_jit_stack_free.html \
# doc/html/pcre2_maketables.html \
# doc/html/pcre2_pattern_to_host_byte_order.html \
# doc/html/pcre2_refcount.html \
# doc/html/pcre2_study.html \
# doc/html/pcre2_utf16_to_host_byte_order.html \
# doc/html/pcre2_utf32_to_host_byte_order.html \
# doc/html/pcre2_version.html \
# doc/html/pcre2partial.html \
# doc/html/pcre2pattern.html \
# doc/html/pcre2perform.html \
# doc/html/pcre2posix.html \
# doc/html/pcre2precompile.html \
# doc/html/pcre2sample.html \
# doc/html/pcre2stack.html \
# doc/html/pcre2syntax.html
# FIXME
dist_man_MANS = \
doc/pcre2-config.1 \
doc/pcre2.3 \
doc/pcre2api.3 \
doc/pcre2build.3 \
doc/pcre2callout.3 \
doc/pcre2compat.3 \
doc/pcre2demo.3 \
doc/pcre2grep.1 \
doc/pcre2jit.3 \
doc/pcre2limits.3 \
doc/pcre2matching.3 \
doc/pcre2test.1 \
doc/pcre2unicode.3
# doc/pcre2.3 \
# doc/pcre2-16.3 \
# doc/pcre2-32.3 \
# doc/pcre2_assign_jit_stack.3 \
@ -111,13 +118,6 @@ dist_man_MANS = \
# doc/pcre2_utf16_to_host_byte_order.3 \
# doc/pcre2_utf32_to_host_byte_order.3 \
# doc/pcre2_version.3 \
# doc/pcre2build.3 \
# doc/pcre2compat.3 \
# doc/pcre2demo.3 \
# doc/pcre2grep.1 \
# doc/pcre2jit.3 \
# doc/pcre2limits.3 \
# doc/pcre2matching.3 \
# doc/pcre2partial.3 \
# doc/pcre2pattern.3 \
# doc/pcre2perform.3 \

478
doc/html/pcre2build.html Normal file
View File

@ -0,0 +1,478 @@
<html>
<head>
<title>pcre2build specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2build man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">BUILDING PCRE2</a>
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC5" href="#SEC5">Unicode and UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC7" href="#SEC7">CODE VALUE OF NEWLINE</a>
<li><a name="TOC8" href="#SEC8">WHAT \R MATCHES</a>
<li><a name="TOC9" href="#SEC9">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC10" href="#SEC10">AVOIDING EXCESSIVE STACK USAGE</a>
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC18" href="#SEC18">CODE COVERAGE REPORTING</a>
<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
<li><a name="TOC20" href="#SEC20">AUTHOR</a>
<li><a name="TOC21" href="#SEC21">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P>
PCRE2 is distributed with a <b>configure</b> script that can be used to build
the library in Unix-like environments using the applications known as
Autotools. Also in the distribution are files to support building using
<b>CMake</b> instead of <b>configure</b>. The text file
<a href="README.txt"><b>README</b></a>
contains general information about building with Autotools (some of which is
repeated below), and also has some comments about building on various operating
systems. There is a lot more information about building PCRE2 without using
Autotools (including information about using <b>CMake</b> and building "by
hand") in the text file called
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
You should consult this file as well as the
<a href="README.txt"><b>README</b></a>
file if you are building in a non-Unix-like environment.
</P>
<br><a name="SEC2" href="#TOC1">PCRE2 BUILD-TIME OPTIONS</a><br>
<P>
The rest of this document describes the optional features of PCRE2 that can be
selected when the library is compiled. It assumes use of the <b>configure</b>
script, where the optional features are selected or deselected by providing
options to <b>configure</b> before running the <b>make</b> command. However, the
same options can be selected in both Unix-like and non-Unix-like environments
if you are using <b>CMake</b> instead of <b>configure</b> to build PCRE2.
</P>
<P>
If you are not using Autotools or <b>CMake</b>, option selection can be done by
editing the <b>config.h</b> file, or by passing parameter settings to the
compiler, as described in
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
</P>
<P>
The complete list of options for <b>configure</b> (which includes the standard
ones such as the selection of the installation directory) can be obtained by
running
<pre>
./configure --help
</pre>
The following sections include descriptions of options whose names begin with
--enable or --disable. These settings specify changes to the defaults for the
<b>configure</b> command. Because of the way that <b>configure</b> works,
--enable and --disable always come in pairs, so the complementary option always
exists as well, but as it specifies the default, it is not described.
</P>
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
By default, a library called <b>libpcre2-8</b> is built, containing functions
that take string arguments contained in vectors of bytes, interpreted either as
single-byte characters, or UTF-8 strings. You can also build two other
libraries, called <b>libpcre2-16</b> and <b>libpcre2-32</b>, which process
strings that are contained in vectors of 16-bit and 32-bit code units,
respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command:
<pre>
--enable-pcre16
--enable-pcre32
</pre>
If you do not want the 8-bit library, add
<pre>
--disable-pcre8
</pre>
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
program. Neither of these are built if you select only the 16-bit or 32-bit
libraries.
</P>
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P>
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
and static libraries by default. You can suppress one of these by adding one of
<pre>
--disable-shared
--disable-static
</pre>
to the <b>configure</b> command, as required.
</P>
<br><a name="SEC5" href="#TOC1">Unicode and UTF SUPPORT</a><br>
<P>
To build PCRE2 with support for Unicode and UTF character strings, add
<pre>
--enable-unicode
</pre>
to the <b>configure</b> command. This setting applies to all three libraries,
adding support for UTF-8 to the 8-bit library, support for UTF-16 to the 16-bit
library, and support for UTF-32 to the to the 32-bit library.
It is not possible to build one library with
UTF support and another without in the same configuration.
</P>
<P>
Of itself, this setting does not make PCRE2 treat strings as UTF-8, UTF-16 or
UTF-32. As well as compiling PCRE2 with this option, you also have have to set
the PCRE2_UTF option when you call <b>pcre2_compile()</b> to compile a pattern.
</P>
<P>
If you set --enable-unicode when compiling in an EBCDIC environment, PCRE2
expects its input to be either ASCII or UTF-8 (depending on the run-time
option). It is not possible to support both EBCDIC and UTF-8 codes in the same
version of the library. Consequently, --enable-unicode and --enable-ebcdic are
mutually exclusive.
</P>
<P>
UTF support allows the libraries to process character codepoints up to 0x10ffff
in the strings that they handle. It also provides support for accessing the
properties of such characters, using pattern escapes such as \P, \p, and \X.
Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
supported. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
<pre>
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a compile time error occurs.
See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
pcre2grep automatically makes use of it, unless you add
<pre>
--disable-pcre2grep-jit
</pre>
to the "configure" command.
</P>
<br><a name="SEC7" href="#TOC1">CODE VALUE OF NEWLINE</a><br>
<P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
compile PCRE2 to use carriage return (CR) instead, by adding
<pre>
--enable-newline-is-cr
</pre>
to the <b>configure</b> command. There is also a --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
<br>
<br>
Alternatively, you can specify that line endings are to be indicated by the two
character sequence CRLF. If you want this, add
<pre>
--enable-newline-is-crlf
</pre>
to the <b>configure</b> command. There is a fourth option, specified by
<pre>
--enable-newline-is-anycrlf
</pre>
which causes PCRE2 to recognize any of the three sequences CR, LF, or CRLF as
indicating a line ending. Finally, a fifth option, specified by
<pre>
--enable-newline-is-any
</pre>
causes PCRE2 to recognize any Unicode newline sequence.
</P>
<P>
Whatever line ending convention is selected when PCRE2 is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
</P>
<br><a name="SEC8" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
whatever has been selected as the line ending sequence. If you specify
<pre>
--enable-bsr-anycrlf
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden when the library functions are
called.
</P>
<br><a name="SEC9" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<P>
Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process truly enormous patterns, so it is
possible to compile PCRE2 to use three-byte or four-byte offsets by adding a
setting such as
<pre>
--with-link-size=3
</pre>
to the <b>configure</b> command. The value given must be 2, 3, or 4. For the
16-bit library, a value of 3 is rounded up to 4. In these libraries, using
longer offsets slows down the operation of PCRE2 because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
</P>
<br><a name="SEC10" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
<P>
When matching with the <b>pcre2_match()</b> function, PCRE2 implements
backtracking by making recursive calls to an internal function called
<b>match()</b>. In environments where the size of the stack is limited, this can
severely limit PCRE2's operation. (The Unix environment does not usually suffer
from this problem, but it may sometimes be necessary to increase the maximum
stack size. There is a discussion in the
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation.) An alternative approach to recursion that uses memory from the
heap to remember data, instead of using recursive function calls, has been
implemented to work round the problem of limited stack size. If you want to
build a version of PCRE2 that works this way, add
<pre>
--disable-stack-for-recursion
</pre>
to the <b>configure</b> command. By default, the system functions <b>malloc()</b>
and <b>free()</b> are called to manage the heap memory that is required, but
custom memory management functions can be called instead. PCRE2 runs noticeably
more slowly when built in this way. This option affects only the
<b>pcre2_match()</b> function; it is not relevant for <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<P>
Internally, PCRE2 has a function called <b>match()</b>, which it calls
repeatedly (sometimes recursively) when matching a pattern with the
<b>pcre2_match()</b> function. By controlling the maximum number of times this
function may be called during a single matching operation, a limit can be
placed on the resources used by a single call to <b>pcre2_match()</b>. The limit
can be changed at run time, as described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The default is 10 million, but this can be changed by adding a
setting such as
<pre>
--with-match-limit=500000
</pre>
to the <b>configure</b> command. This setting has no effect on the
<b>pcre2_dfa_match()</b> matching function.
</P>
<P>
In some environments it is desirable to limit the depth of recursive calls of
<b>match()</b> more strictly than the total number of calls, in order to
restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
is specified) that is used. A second limit controls this; it defaults to the
value that is set for --with-match-limit, which imposes no additional
constraints. However, you can set a lower limit by adding, for example,
<pre>
--with-match-limit-recursion=10000
</pre>
to the <b>configure</b> command. This value can also be overridden at run time.
</P>
<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<P>
PCRE2 uses fixed tables for processing characters whose code points are less
than 256. By default, PCRE2 is built with a set of tables that are distributed
in the file <i>src/pcre2_chartables.c.dist</i>. These tables are for ASCII codes
only. If you add
<pre>
--enable-rebuild-chartables
</pre>
to the <b>configure</b> command, the distributed tables are no longer used.
Instead, a program called <b>dftables</b> is compiled and run. This outputs the
source for new set of tables, created in the default locale of your C run-time
system. (This method of replacing the tables does not work if you are cross
compiling, because <b>dftables</b> is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
hand".)
</P>
<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding
<pre>
--enable-ebcdic
</pre>
to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-unicode.
</P>
<P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
such an environment you should use
<pre>
--enable-ebcdic-nl25
</pre>
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR has the
same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is <i>not</i>
chosen as LF is made to correspond to the Unicode NEL character (which, in
Unicode, is 0x85).
</P>
<P>
The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
</P>
<br><a name="SEC14" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<P>
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
them with <b>libz</b> or <b>libbz2</b>, respectively, by adding one or both of
<pre>
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
</pre>
to the <b>configure</b> command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
</P>
<br><a name="SEC15" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
<P>
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
finds a match. The size of the buffer is controlled by a parameter whose
default value is 20K. The buffer itself is three times this size, but because
of the way it is used for holding "before" lines, the longest line that is
guaranteed to be processable is the parameter size. You can change the default
parameter value by adding, for example,
<pre>
--with-pcre2grep-bufsize=50K
</pre>
to the <b>configure</b> command. The caller of \fPpcre2grep\fP can, however,
override this value by specifying a run-time option.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
If you add one of
<pre>
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
</pre>
to the <b>configure</b> command, <b>pcre2test</b> is linked with the
<b>libreadline</b> or<b>libedit</b> library, respectively, and when its input is
from a terminal, it reads it using the <b>readline()</b> function. This provides
line-editing and history facilities. Note that <b>libreadline</b> is
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
way, there may be licensing issues. These can be avoided by linking with
<b>libedit</b> (which has a BSD licence) instead.
</P>
<P>
Setting this option causes the <b>-lreadline</b> option to be added to the
<b>pcre2test</b> build. In many operating environments with a sytem-installed
readline library this is sufficient. However, in some environments (e.g. if an
unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for <b>libreadline</b> says
this:
<pre>
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
</pre>
If your environment has not been set up so that an appropriate library is
automatically included, you may need to add something like
<pre>
LIBS="-ncurses"
</pre>
immediately before the <b>configure</b> command.
</P>
<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
By adding the
<pre>
--enable-valgrind
</pre>
option to to the <b>configure</b> command, PCRE2 will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install
<b>lcov</b> version 1.6 or above. Then specify
<pre>
--enable-coverage
</pre>
to the <b>configure</b> command and build PCRE2 in the usual way.
</P>
<P>
Note that using <b>ccache</b> (a caching C compiler) is incompatible with code
coverage reporting. If you have configured <b>ccache</b> to run automatically
on your system, you must set the environment variable
<pre>
CCACHE_DISABLE=1
</pre>
before running <b>make</b> to build PCRE2, so that <b>ccache</b> is not used.
</P>
<P>
When --enable-coverage is used, the following addition targets are added to the
<i>Makefile</i>:
<pre>
make coverage
</pre>
This creates a fresh coverage report for the PCRE2 test suite. It is equivalent
to running "make coverage-reset", "make coverage-baseline", "make check", and
then "make coverage-report".
<pre>
make coverage-reset
</pre>
This zeroes the coverage counters, but does nothing else.
<pre>
make coverage-baseline
</pre>
This captures baseline coverage information.
<pre>
make coverage-report
</pre>
This creates the coverage report.
<pre>
make coverage-clean-report
</pre>
This removes the generated coverage report without cleaning the coverage data
itself.
<pre>
make coverage-clean-data
</pre>
This removes the captured coverage data without removing the coverage files
created at compile time (*.gcno).
<pre>
make coverage-clean
</pre>
This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2_config</b>(3).
</P>
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 September 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

223
doc/html/pcre2compat.html Normal file
View File

@ -0,0 +1,223 @@
<html>
<head>
<title>pcre2compat specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2compat man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
DIFFERENCES BETWEEN PCRE2 AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
versions 5.10 and above.
</P>
<P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
do not mean what you might think. For example, (?!a){3} does not assert that
the next three characters are not "a". It just asserts that the next character
is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
just once). Perl allows repeat quantifiers on other assertions such as \b, but
these do not seem to have any use.
</P>
<P>
3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sometimes
(but not always) sets its numerical variables from inside negative assertions.
</P>
<P>
4. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N when followed by a character name or Unicode value. (\N on its
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if the PCRE2_ALT_BSUX option is set,
\U and \u are interpreted as ECMAScript interprets them.
</P>
<P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support. The properties that can be tested with \p and \P
are limited to the general category properties such as Lu and Nd, script names
such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
the Cs (surrogate) property, which Perl does not; the Perl documentation says
"Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement the
somewhat messy concept of surrogates."
</P>
<P>
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
in between are treated as literals. This is slightly different from Perl in
that $ and @ are also handled as literals inside the quotes. In Perl, they
cause variable interpolation (but of course PCRE2 does not have variables).
Note the following examples:
<pre>
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
feature allows an external function to be called during pattern matching. See
the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
</P>
<P>
8. Subpatterns that are called as subroutines (whether or not recursively) are
always treated as atomic groups in PCRE2. This is like Python, but unlike Perl.
Captured values that are set outside a subroutine call can be reference from
inside in PCRE2, but not in Perl. There is a discussion that explains these
differences in more detail in the
<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page.
</P>
<P>
9. If any of the backtracking control verbs are used in a subpattern that is
called as a subroutine (whether or not recursively), their effect is confined
to that subpattern; it does not extend to the surrounding pattern. This is not
always the case in Perl. In particular, if (*THEN) is present in a group that
is called as a subroutine, its action is limited to that group, even if the
group does not contain any | characters. Note that such subpatterns are
processed as anchored at the point where they are tested.
</P>
<P>
10. If a pattern contains more than one backtracking control verb, the first
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are examples where it differs.
</P>
<P>
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
</P>
<P>
12. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
</P>
<P>
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact the PCRE2
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b)B),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
</P>
<P>
14. Perl recognizes comments in some places that PCRE2 does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
Perl allows white space between ( and ? (though current Perls warn that this is
deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
</P>
<P>
15. Perl, when in warning mode, gives warnings for character classes such as
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
</P>
<P>
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
</P>
<P>
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
of which (such as named parentheses) have been in PCRE2 for some time. This
list is with respect to Perl 5.10:
<br>
<br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
<br>
<br>
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
(c) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
<br>
<br>
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
<br>
<br>
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
<br>
<br>
(h) The callout facility is PCRE2-specific.
<br>
<br>
(i) The partial matching facility is PCRE2-specific.
<br>
<br>
(j) The alternative matching function (<b>pcre2_dfa_match()</b> matches in a
different way and is not Perl-compatible.
<br>
<br>
(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 28 September 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -141,7 +141,7 @@ subject_length = strlen((char *)subject);
re = pcre2_compile(
pattern, /* the pattern */
-1, /* indicates pattern is zero-terminated */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&amp;errornumber, /* for error number */
&amp;erroroffset, /* for error offset */

401
doc/html/pcre2jit.html Normal file
View File

@ -0,0 +1,401 @@
<html>
<head>
<title>pcre2jit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2jit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
<li><a name="TOC4" href="#SEC4">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT MATCHING</a>
<li><a name="TOC6" href="#SEC6">CONTROLLING THE JIT STACK</a>
<li><a name="TOC7" href="#SEC7">JIT STACK FAQ</a>
<li><a name="TOC8" href="#SEC8">EXAMPLE CODE</a>
<li><a name="TOC9" href="#SEC9">JIT FAST PATH API</a>
<li><a name="TOC10" href="#SEC10">SEE ALSO</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
FIXME: This needs checking over once JIT support is implemented.
</P>
<P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
going to be matched many times. This does not necessarily mean many calls of a
matching function; if the pattern is not anchored, matching attempts may take
place many times at various positions in the subject, even for a single call.
Therefore, if the subject string is very long, it may still pay to use JIT for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
</P>
<P>
JIT support applies only to the traditional Perl-compatible matching function.
It does not apply when the DFA matching function is being used. The code for
this support was written by Zoltan Herczeg.
</P>
<br><a name="SEC2" href="#TOC1">AVAILABILITY OF JIT SUPPORT</a><br>
<P>
JIT support is an optional feature of PCRE2. The "configure" option
--enable-jit (or equivalent CMake option) must be set when PCRE2 is built if
you want to use JIT. The support is limited to the following hardware
platforms:
<pre>
ARM v5, v7, and Thumb2
Intel x86 32-bit and 64-bit
MIPS 32-bit
Power PC 32-bit and 64-bit
SPARC 32-bit (experimental)
</pre>
If --enable-jit is set on an unsupported platform, compilation fails.
</P>
<P>
A program can tell if JIT support is available by calling <b>pcre2_config()</b>
with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is available, and 0
otherwise. However, a simple program does not need to check this in order to
use JIT. The API is implemented in a way that falls back to the interpretive
code if JIT is not available. For programs that need the best possible
performance, there is also a "fast path" API that is JIT-specific.
</P>
<br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br>
<P>
To make use of the JIT support in the simplest way, all you have to do is to
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
<b>pcre2_compile()</b>. This function has two arguments: the first is the
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
second is a set of option bits, which must include at least one of
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
</P>
<P>
The returned value from <b>pcre2_jit_compile()</b> is FIXME FIXME.
</P>
<P>
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
of the other options as well as, or instead of PCRE2_JIT_COMPLETE. The JIT
compiler generates different optimized code for each of the three modes
(normal, soft partial, hard partial). When <b>pcre2_match()</b> is called, the
appropriate code is run if it is available. Otherwise, the pattern is matched
using interpretive code.
</P>
<P>
In some circumstances you may need to call additional functions. These are
described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below.
</P>
<P>
If JIT support is not available, a call to <b>pcre2_jit_comple()</b> does
nothing and returns FIXME. Otherwise, the compiled pattern is passed to the JIT
compiler, which turns it into machine code that executes much faster than the
normal interpretive code, but yields exactly the same results.
</P>
<P>
There are some <b>pcre2_match()</b> options that are not supported by JIT, and
there are also some pattern items that JIT cannot handle. Details are given
below. In both cases, matching automatically falls back to the interpretive
code. If you want to know whether JIT was actually used for a particular match,
you should arrange for a JIT callback function to be set up as described in the
section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below, even if you do not need to supply a non-default JIT stack. Such a
callback function is called whenever JIT code is about to be obeyed. If the
match-time options are not right for JIT execution, the callback function is
not obeyed.
</P>
<P>
If the JIT compiler finds an unsupported item, no JIT data is generated. You
can find out if JIT matching is available after compiling a pattern by calling
<b>pcre2_pattern_info()</b> with the PCRE2_INFO_JIT option. A result of 1 means
that JIT compilation was successful. A result of 0 means that JIT support is
not available, or the pattern was not processed by <b>pcre2_jit_compile()</b>,
or the JIT compiler was not able to handle the pattern.
</P>
<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
The <b>pcre2_match()</b> options that are supported for JIT matching are
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The options
that are not supported at match time are PCRE2_ANCHORED and
PCRE2_NO_START_OPTIMIZE, though they are supported if given at compile time.
</P>
<P>
The only unsupported pattern items are \C (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
</P>
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
<P>
When a pattern is matched using JIT matching, the return values are the same
as those given by the interpretive <b>pcre2_match()</b> code, with the addition
of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that the memory
used for the JIT stack was insufficient. See
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below for a discussion of JIT stack usage.
</P>
<P>
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
when JIT matching is used.
<a name="stackcontrol"></a></P>
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
<P>
When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE2_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
managing blocks of memory for use as JIT stacks. There is further discussion
about the use of JIT stacks in the section entitled
<a href="#stackcontrol">"JIT stack FAQ"</a>
below.
</P>
<P>
The <b>pcre2_jit_stack_alloc()</b> function creates a JIT stack. Its arguments
are a general context (for memory allocation functions, or NULL for standard
memory allocation), a starting size and a maximum size, and it returns a
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
that is no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.) FIXME Is this right?
</P>
<P>
JIT uses far less memory for recursion than the interpretive code,
and a maximum stack size of 512K to 1M should be more than enough for any
pattern.
</P>
<P>
The <b>pcre2_jit_stack_assign()</b> function specifies which stack JIT code
should use. Its arguments are as follows:
<pre>
pcre2_code *code
pcre2_jit_callback callback
void *data
</pre>
The <i>code</i> argument is a pointer to a compiled pattern, after it has been
processed by <b>pcre2_jit_compile()</b>. There are three cases for the values of
the other two options:
<pre>
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
on the machine stack is used.
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
a valid JIT stack, the result of calling <b>pcre2_jit_stack_alloc()</b>.
(3) If <i>callback</i> is not NULL, it must point to a function that is
called with <i>data</i> as an argument at the start of matching, in
order to set up a JIT stack. If the return from the callback
function is NULL, the internal 32K stack is used; otherwise the
return value must be a valid JIT stack, the result of calling
<b>pcre2_jit_stack_alloc()</b>.
</pre>
A callback function is obeyed whenever JIT code is about to be run; it is not
obeyed when <b>pcre2_match()</b> is called with options that are incompatible
for JIT matching. A callback function can therefore be used to determine
whether a match operation was executed by JIT or by the interpreter.
</P>
<P>
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are all matched
sequentially in the same thread. In a multithread application, if you do not
specify a JIT stack, or if you assign or pass back NULL from a callback, that
is thread-safe, because each thread has its own machine stack. However, if you
assign or pass back a non-NULL JIT stack, this must be a different stack for
each thread so that the application is thread-safe.
</P>
<P>
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
to any number of patterns as long as they are not used for matching by multiple
threads at the same time. For example, you can assign the same stack to all
compiled patterns, and use a global mutex in the callback to wait until the
stack is available for use. However, this is an inefficient solution, and not
recommended.
</P>
<P>
This is a suggestion for how a multithreaded program that needs to set up
non-default JIT stacks might operate:
<pre>
During thread initalization
thread_local_var = pcre2_jit_stack_alloc(...)
During thread exit
pcre2_jit_stack_free(thread_local_var)
Use a one-line callback function
return thread_local_var
</pre>
All the functions described in this section do nothing if JIT is not available,
and <b>pcre2_jit_stack_assign()</b> does nothing unless the <b>code</b> argument
is non-NULL and points to a <b>pcre2_code</b> block that has been successfully
processed by <b>pcre2_jit_compile()</b>.
<a name="stackfaq"></a></P>
<br><a name="SEC7" href="#TOC1">JIT STACK FAQ</a><br>
<P>
(1) Why do we need JIT stacks?
<br>
<br>
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack where
the local data of the current node is pushed before checking its child nodes.
Allocating real machine stack on some platforms is difficult. For example, the
stack chain needs to be updated every time if we extend the stack on PowerPC.
Although it is possible, its updating time overhead decreases performance. So
we do the recursion in memory.
</P>
<P>
(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
<br>
<br>
Modern operating systems have a nice feature: they can reserve an address space
instead of allocating memory. We can safely allocate memory pages inside this
address space, so the stack could grow without moving memory data (this is
important because of pointers). Thus we can allocate 1M address space, and use
only a single memory page (usually 4K) if that is enough. However, we can still
grow up to 1M anytime if needed.
</P>
<P>
(3) Who "owns" a JIT stack?
<br>
<br>
The owner of the stack is the user program, not the JIT studied pattern or
anything else. The user program must ensure that if a stack is used by
<b>pcre2_match()</b>, (that is, it is assigned to the pattern currently
running), that stack must not be used by any other threads (to avoid
overwriting the same memory area). The best practice for multithreaded programs
is to allocate a stack for each thread, and return this stack through the JIT
callback function.
</P>
<P>
(4) When should a JIT stack be freed?
<br>
<br>
You can free a JIT stack at any time, as long as it will not be used by
<b>pcre2_match()</b> again. When you assign the stack to a pattern, only a
pointer is set. There is no reference counting or any other magic. You can free
the patterns and stacks in any order, anytime. Just <i>do not</i> call
<b>pcre2_match()</b> with a pattern pointing to an already freed stack, as that
will cause SEGFAULT. (Also, do not free a stack currently used by
<b>pcre2_match()</b> in another thread). You can also replace the stack for a
pattern at any time. You can even free the previous stack before assigning a
replacement.
</P>
<P>
(5) Should I allocate/free a stack every time before/after calling
<b>pcre2_match()</b>?
<br>
<br>
No, because this is too costly in terms of resources. However, you could
implement some clever idea which release the stack if it is not used in let's
say two minutes. The JIT callback can help to achieve this without keeping a
list of the currently JIT studied patterns.
</P>
<P>
(6) OK, the stack is for long term memory allocation. But what happens if a
pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
stack is freed?
<br>
<br>
Especially on embedded sytems, it might be a good idea to release memory
sometimes without freeing the stack. There is no API for this at the moment.
Probably a function call which returns with the currently allocated memory for
any stack and another which allows releasing memory (shrinking the stack) would
be a good idea if someone needs this.
</P>
<P>
(7) This is too much of a headache. Isn't there any better solution for JIT
stack handling?
<br>
<br>
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
out this complicated API.
</P>
<br><a name="SEC8" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback.
<pre>
int rc;
pcre2_code *re;
pcre2_match_data *match_data;
pcre2_jit_stack *jit_stack;
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
&errornumber, &erroffset, NULL);
/* Check for errors */
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
/* Check for errors */
jit_stack = pcre2_jit_stack_alloc(NULL, 32*1024, 512*1024);
/* Check for error (NULL) */
pcre2_jit_stack_assign(re, NULL, jit_stack);
match_data = pcre2_match_data_create(re, 10);
rc = pcre2_match(re, subject, length, 0, 0, match_data, NULL);
/* Check results */
pcre2_free(re);
pcre2_jit_stack_free(jit_stack);
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">JIT FAST PATH API</a><br>
<P>
Because the API described above falls back to interpreted matching when JIT is
not available, it is convenient for programs that are written for general use
in many environments. However, calling JIT via <b>pcre2_match()</b> does have a
performance impact. Programs that are written for use where JIT is known to be
available, and which need the best possible performance, can instead use a
"fast path" API to call JIT matching directly instead of calling
<b>pcre2_match()</b> (obviously only for patterns that have been successfully
processed by <b>pcre2_jit_compile()</b>).
</P>
<P>
The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
the same arguments as <b>pcre2_match()</b>, plus one additional argument that
must point to a JIT stack. The JIT stack arrangements described above do not
apply. The return values are the same as for <b>pcre2_match()</b>.
</P>
<P>
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
the subject pointer is NULL, an immediate error is given. Also, unless
PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
interests of speed, these checks do not happen on the JIT fast path, and if
invalid data is passed, the result is undefined.
</P>
<P>
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
speedups of more than 10%.
</P>
<br><a name="SEC10" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3)
</P>
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel (FAQ by Zoltan Herczeg)
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 29 September 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

89
doc/html/pcre2limits.html Normal file
View File

@ -0,0 +1,89 @@
<html>
<head>
<title>pcre2limits specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2limits man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SIZE AND OTHER LIMITATIONS
</b><br>
<P>
There are some size limitations in PCRE2 but it is hoped that they will never
in practice be relevant.
</P>
<P>
The maximum size of a compiled pattern is approximately 64K code units for the
8-bit and 16-bit libraries if PCRE2 is compiled with the default internal
linkage size, which is 2 bytes for these libraries. If you want to process
regular expressions that are truly enormous, you can compile PCRE2 with an
internal linkage size of 3 or 4 (when building the 16-bit library, 3 is rounded
up to 4). See the <b>README</b> file in the source distribution and the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower. In the 32-bit library, the internal
linkage size is always 4.
</P>
<P>
All values in repeating quantifiers must be less than 65536.
</P>
<P>
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The limit can
be specified when PCRE2 is built; the default is 250.
</P>
<P>
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
</P>
<P>
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.
</P>
<P>
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
</P>
<P>
The maximum length of a subject string is the largest number a PCRE2_SIZE
variable can hold. PCRE2_SIZE is an unsigned integer type, usually defined as
size_t. However, when using the traditional matching function, PCRE2 uses
recursion to handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can be
processed by certain patterns. For a discussion of stack issues, see the
<a href="pcre2stack.html"><b>pcre2stack</b></a>
documentation.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 29 September 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

241
doc/html/pcre2matching.html Normal file
View File

@ -0,0 +1,241 @@
<html>
<head>
<title>pcre2matching specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2matching man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 MATCHING ALGORITHMS</a>
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 MATCHING ALGORITHMS</a><br>
<P>
This document describes the two different algorithms that are available in
PCRE2 for matching a compiled regular expression against a given subject
string. The "standard" algorithm is the one provided by the <b>pcre2_match()</b>
function. This works in the same as as Perl's matching function, and provide a
Perl-compatible matching operation. The just-in-time (JIT) optimization that is
described in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation is compatible with this function.
</P>
<P>
An alternative algorithm is provided by the <b>pcre2_dfa_match()</b> function;
it operates in a different way, and is not Perl-compatible. This alternative
has advantages and disadvantages compared with the standard algorithm, and
these are described below.
</P>
<P>
When there is only one possible way in which a given subject string can match a
pattern, the two algorithms give the same answer. A difference arises, however,
when there are multiple possibilities. For example, if the pattern
<pre>
^&#60;.*&#62;
</pre>
is matched against the string
<pre>
&#60;something&#62; &#60;something else&#62; &#60;something further&#62;
</pre>
there are three possible answers. The standard algorithm finds only one of
them, whereas the alternative algorithm finds all three.
</P>
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
<P>
The set of strings that are matched by a regular expression can be represented
as a tree structure. An unlimited repetition in the pattern makes the tree of
infinite size, but it is still a tree. Matching the pattern to a given subject
string (from a given starting point) can be thought of as a search of the tree.
There are two ways to search a tree: depth-first and breadth-first, and these
correspond to the two matching algorithms provided by PCRE2.
</P>
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
<P>
In the terminology of Jeffrey Friedl's book "Mastering Regular Expressions",
the standard algorithm is an "NFA algorithm". It conducts a depth-first search
of the pattern tree. That is, it proceeds along a single path through the tree,
checking that the subject matches what is required. When there is a mismatch,
the algorithm tries any alternatives at the current point, and if they all
fail, it backs up to the previous branch point in the tree, and tries the next
alternative branch at that level. This often involves backing up (moving to the
left) in the subject string as well. The order in which repetition branches are
tried is controlled by the greedy or ungreedy nature of the quantifier.
</P>
<P>
If a leaf node is reached, a matching string has been found, and at that point
the algorithm stops. Thus, if there is more than one possible match, this
algorithm returns the first one that it finds. Whether this is the shortest,
the longest, or some intermediate length depends on the way the greedy and
ungreedy repetition quantifiers are specified in the pattern.
</P>
<P>
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
</P>
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
<P>
This algorithm conducts a breadth-first search of the tree. Starting from the
first matching point in the subject, it scans the subject string from left to
right, once, character by character, and as it does this, it remembers all the
paths through the tree that represent valid matches. In Friedl's terminology,
this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
</P>
<P>
Although the general principle of this matching algorithm is that it scans the
subject string only once, without backtracking, there is one exception: when a
lookaround assertion is encountered, the characters following or preceding the
current point have to be independently inspected.
</P>
<P>
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
them, and in particular, it finds the longest. The matches are returned in
decreasing order of length. There is an option to stop the algorithm after the
first match (which is necessarily the shortest) is found.
</P>
<P>
Note that all the matches that are found start at the same point in the
subject. If the pattern
<pre>
cat(er(pillar)?)?
</pre>
is matched against the string "the caterpillar catchment", the result is the
three strings "caterpillar", "cater", and "cat" that start at the fifth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
<P>
PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++" because there is no point
even considering the possibility of backtracking into the repeated digits. For
DFA matching, this means that only one possible match is found. If you really
do want multiple matches in such cases, either use an ungreedy repeat
("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<P>
There are a number of features of PCRE2 regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
</P>
<P>
1. Because the algorithm finds all possible matches, the greedy or ungreedy
nature of repetition quantifiers is not relevant (though it may affect
auto-possessification, as just described). During matching, greedy and ungreedy
quantifiers are treated in exactly the same way. However, possessive
quantifiers can make a difference when what follows could also match what is
quantified, for example in a pattern like this:
<pre>
^a++\w!
</pre>
This pattern matches "aaab!" but not "aaa!", which would be matched by a
non-possessive quantifier. Similarly, if an atomic group is present, it is
matched as if it were a standalone pattern at the current point, and the
longest match is then "locked in" for the rest of the overall pattern.
</P>
<P>
2. When dealing with multiple paths through the tree simultaneously, it is not
straightforward to keep track of captured substrings for the different matching
possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
</P>
<P>
3. Because no substrings are captured, back references within the pattern are
not supported, and cause errors if encountered.
</P>
<P>
4. For the same reason, conditional expressions that use a backreference as the
condition or test for a specific group recursion are not supported.
</P>
<P>
5. Because many paths through the tree may be active, the \K escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported. It causes an error if encountered.
</P>
<P>
6. Callouts are supported, but the value of the <i>capture_top</i> field is
always 1, and the value of the <i>capture_last</i> field is always 0.
</P>
<P>
7. The \C escape sequence, which (in the standard algorithm) always matches a
single code unit, even in a UTF mode, is not supported in these modes, because
the alternative algorithm moves through the subject string one character (not
code unit) at a time, for all active paths through the tree.
</P>
<P>
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
Using the alternative matching algorithm provides the following advantages:
</P>
<P>
1. All possible matches (at a single point in the subject) are automatically
found, and in particular, the longest match is found. To find more than one
match using the standard algorithm, you have to do kludgy things with
callouts.
</P>
<P>
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack (except for lookbehinds), it is possible to pass very
long subject strings to the matching function in several pieces, checking for
partial matching each time. Although it is also possible to do multi-segment
matching using the standard algorithm, by retaining partially matched
substrings, it is more complicated. The
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation gives details of partial matching and discusses multi-segment
matching.
</P>
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
The alternative algorithm suffers from a number of disadvantages:
</P>
<P>
1. It is substantially slower than the standard algorithm. This is partly
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses and back references are not supported.
</P>
<P>
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 29 September 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

View File

@ -141,7 +141,7 @@ subject_length = strlen((char *)subject);
re = pcre2_compile(
pattern, /* the pattern */
-1, /* indicates pattern is zero-terminated */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&errornumber, /* for error number */
&erroroffset, /* for error offset */

375
doc/pcre2jit.3 Normal file
View File

@ -0,0 +1,375 @@
.TH PCRE2JIT 3 "29 September 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
.rs
.sp
FIXME: This needs checking over once JIT support is implemented.
.P
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed. Therefore, it is of most benefit when the same pattern is
going to be matched many times. This does not necessarily mean many calls of a
matching function; if the pattern is not anchored, matching attempts may take
place many times at various positions in the subject, even for a single call.
Therefore, if the subject string is very long, it may still pay to use JIT for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
.P
JIT support applies only to the traditional Perl-compatible matching function.
It does not apply when the DFA matching function is being used. The code for
this support was written by Zoltan Herczeg.
.
.
.SH "AVAILABILITY OF JIT SUPPORT"
.rs
.sp
JIT support is an optional feature of PCRE2. The "configure" option
--enable-jit (or equivalent CMake option) must be set when PCRE2 is built if
you want to use JIT. The support is limited to the following hardware
platforms:
.sp
ARM v5, v7, and Thumb2
Intel x86 32-bit and 64-bit
MIPS 32-bit
Power PC 32-bit and 64-bit
SPARC 32-bit (experimental)
.sp
If --enable-jit is set on an unsupported platform, compilation fails.
.P
A program can tell if JIT support is available by calling \fBpcre2_config()\fP
with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is available, and 0
otherwise. However, a simple program does not need to check this in order to
use JIT. The API is implemented in a way that falls back to the interpretive
code if JIT is not available. For programs that need the best possible
performance, there is also a "fast path" API that is JIT-specific.
.
.
.SH "SIMPLE USE OF JIT"
.rs
.sp
To make use of the JIT support in the simplest way, all you have to do is to
call \fBpcre2_jit_compile()\fP after successfully compiling a pattern with
\fBpcre2_compile()\fP. This function has two arguments: the first is the
compiled pattern pointer that was returned by \fBpcre2_compile()\fP, and the
second is a set of option bits, which must include at least one of
PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
.P
The returned value from \fBpcre2_jit_compile()\fP is FIXME FIXME.
.P
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
PCRE2_PARTIAL_SOFT options of \fBpcre2_match()\fP, you should set one or both
of the other options as well as, or instead of PCRE2_JIT_COMPLETE. The JIT
compiler generates different optimized code for each of the three modes
(normal, soft partial, hard partial). When \fBpcre2_match()\fP is called, the
appropriate code is run if it is available. Otherwise, the pattern is matched
using interpretive code.
.P
In some circumstances you may need to call additional functions. These are
described in the section entitled
.\" HTML <a href="#stackcontrol">
.\" </a>
"Controlling the JIT stack"
.\"
below.
.P
If JIT support is not available, a call to \fBpcre2_jit_comple()\fP does
nothing and returns FIXME. Otherwise, the compiled pattern is passed to the JIT
compiler, which turns it into machine code that executes much faster than the
normal interpretive code, but yields exactly the same results.
.P
There are some \fBpcre2_match()\fP options that are not supported by JIT, and
there are also some pattern items that JIT cannot handle. Details are given
below. In both cases, matching automatically falls back to the interpretive
code. If you want to know whether JIT was actually used for a particular match,
you should arrange for a JIT callback function to be set up as described in the
section entitled
.\" HTML <a href="#stackcontrol">
.\" </a>
"Controlling the JIT stack"
.\"
below, even if you do not need to supply a non-default JIT stack. Such a
callback function is called whenever JIT code is about to be obeyed. If the
match-time options are not right for JIT execution, the callback function is
not obeyed.
.P
If the JIT compiler finds an unsupported item, no JIT data is generated. You
can find out if JIT matching is available after compiling a pattern by calling
\fBpcre2_pattern_info()\fP with the PCRE2_INFO_JIT option. A result of 1 means
that JIT compilation was successful. A result of 0 means that JIT support is
not available, or the pattern was not processed by \fBpcre2_jit_compile()\fP,
or the JIT compiler was not able to handle the pattern.
.
.
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
.rs
.sp
The \fBpcre2_match()\fP options that are supported for JIT matching are
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The options
that are not supported at match time are PCRE2_ANCHORED and
PCRE2_NO_START_OPTIMIZE, though they are supported if given at compile time.
.P
The only unsupported pattern items are \eC (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
.
.
.SH "RETURN VALUES FROM JIT MATCHING"
.rs
.sp
When a pattern is matched using JIT matching, the return values are the same
as those given by the interpretive \fBpcre2_match()\fP code, with the addition
of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that the memory
used for the JIT stack was insufficient. See
.\" HTML <a href="#stackcontrol">
.\" </a>
"Controlling the JIT stack"
.\"
below for a discussion of JIT stack usage.
.P
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
when JIT matching is used.
.
.
.\" HTML <a name="stackcontrol"></a>
.SH "CONTROLLING THE JIT STACK"
.rs
.sp
When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE2_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
managing blocks of memory for use as JIT stacks. There is further discussion
about the use of JIT stacks in the section entitled
.\" HTML <a href="#stackcontrol">
.\" </a>
"JIT stack FAQ"
.\"
below.
.P
The \fBpcre2_jit_stack_alloc()\fP function creates a JIT stack. Its arguments
are a general context (for memory allocation functions, or NULL for standard
memory allocation), a starting size and a maximum size, and it returns a
pointer to an opaque structure of type \fBpcre2_jit_stack\fP, or NULL if there
is an error. The \fBpcre2_jit_stack_free()\fP function is used to free a stack
that is no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.) FIXME Is this right?
.P
JIT uses far less memory for recursion than the interpretive code,
and a maximum stack size of 512K to 1M should be more than enough for any
pattern.
.P
The \fBpcre2_jit_stack_assign()\fP function specifies which stack JIT code
should use. Its arguments are as follows:
.sp
pcre2_code *code
pcre2_jit_callback callback
void *data
.sp
The \fIcode\fP argument is a pointer to a compiled pattern, after it has been
processed by \fBpcre2_jit_compile()\fP. There are three cases for the values of
the other two options:
.sp
(1) If \fIcallback\fP is NULL and \fIdata\fP is NULL, an internal 32K block
on the machine stack is used.
.sp
(2) If \fIcallback\fP is NULL and \fIdata\fP is not NULL, \fIdata\fP must be
a valid JIT stack, the result of calling \fBpcre2_jit_stack_alloc()\fP.
.sp
(3) If \fIcallback\fP is not NULL, it must point to a function that is
called with \fIdata\fP as an argument at the start of matching, in
order to set up a JIT stack. If the return from the callback
function is NULL, the internal 32K stack is used; otherwise the
return value must be a valid JIT stack, the result of calling
\fBpcre2_jit_stack_alloc()\fP.
.sp
A callback function is obeyed whenever JIT code is about to be run; it is not
obeyed when \fBpcre2_match()\fP is called with options that are incompatible
for JIT matching. A callback function can therefore be used to determine
whether a match operation was executed by JIT or by the interpreter.
.P
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are all matched
sequentially in the same thread. In a multithread application, if you do not
specify a JIT stack, or if you assign or pass back NULL from a callback, that
is thread-safe, because each thread has its own machine stack. However, if you
assign or pass back a non-NULL JIT stack, this must be a different stack for
each thread so that the application is thread-safe.
.P
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
to any number of patterns as long as they are not used for matching by multiple
threads at the same time. For example, you can assign the same stack to all
compiled patterns, and use a global mutex in the callback to wait until the
stack is available for use. However, this is an inefficient solution, and not
recommended.
.P
This is a suggestion for how a multithreaded program that needs to set up
non-default JIT stacks might operate:
.sp
During thread initalization
thread_local_var = pcre2_jit_stack_alloc(...)
.sp
During thread exit
pcre2_jit_stack_free(thread_local_var)
.sp
Use a one-line callback function
return thread_local_var
.sp
All the functions described in this section do nothing if JIT is not available,
and \fBpcre2_jit_stack_assign()\fP does nothing unless the \fBcode\fP argument
is non-NULL and points to a \fBpcre2_code\fP block that has been successfully
processed by \fBpcre2_jit_compile()\fP.
.
.
.\" HTML <a name="stackfaq"></a>
.SH "JIT STACK FAQ"
.rs
.sp
(1) Why do we need JIT stacks?
.sp
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack where
the local data of the current node is pushed before checking its child nodes.
Allocating real machine stack on some platforms is difficult. For example, the
stack chain needs to be updated every time if we extend the stack on PowerPC.
Although it is possible, its updating time overhead decreases performance. So
we do the recursion in memory.
.P
(2) Why don't we simply allocate blocks of memory with \fBmalloc()\fP?
.sp
Modern operating systems have a nice feature: they can reserve an address space
instead of allocating memory. We can safely allocate memory pages inside this
address space, so the stack could grow without moving memory data (this is
important because of pointers). Thus we can allocate 1M address space, and use
only a single memory page (usually 4K) if that is enough. However, we can still
grow up to 1M anytime if needed.
.P
(3) Who "owns" a JIT stack?
.sp
The owner of the stack is the user program, not the JIT studied pattern or
anything else. The user program must ensure that if a stack is used by
\fBpcre2_match()\fP, (that is, it is assigned to the pattern currently
running), that stack must not be used by any other threads (to avoid
overwriting the same memory area). The best practice for multithreaded programs
is to allocate a stack for each thread, and return this stack through the JIT
callback function.
.P
(4) When should a JIT stack be freed?
.sp
You can free a JIT stack at any time, as long as it will not be used by
\fBpcre2_match()\fP again. When you assign the stack to a pattern, only a
pointer is set. There is no reference counting or any other magic. You can free
the patterns and stacks in any order, anytime. Just \fIdo not\fP call
\fBpcre2_match()\fP with a pattern pointing to an already freed stack, as that
will cause SEGFAULT. (Also, do not free a stack currently used by
\fBpcre2_match()\fP in another thread). You can also replace the stack for a
pattern at any time. You can even free the previous stack before assigning a
replacement.
.P
(5) Should I allocate/free a stack every time before/after calling
\fBpcre2_match()\fP?
.sp
No, because this is too costly in terms of resources. However, you could
implement some clever idea which release the stack if it is not used in let's
say two minutes. The JIT callback can help to achieve this without keeping a
list of the currently JIT studied patterns.
.P
(6) OK, the stack is for long term memory allocation. But what happens if a
pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
stack is freed?
.sp
Especially on embedded sytems, it might be a good idea to release memory
sometimes without freeing the stack. There is no API for this at the moment.
Probably a function call which returns with the currently allocated memory for
any stack and another which allows releasing memory (shrinking the stack) would
be a good idea if someone needs this.
.P
(7) This is too much of a headache. Isn't there any better solution for JIT
stack handling?
.sp
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
out this complicated API.
.
.
.SH "EXAMPLE CODE"
.rs
.sp
This is a single-threaded example that specifies a JIT stack without using a
callback.
.sp
int rc;
pcre2_code *re;
pcre2_match_data *match_data;
pcre2_jit_stack *jit_stack;
.sp
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
&errornumber, &erroffset, NULL);
/* Check for errors */
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
/* Check for errors */
jit_stack = pcre2_jit_stack_alloc(NULL, 32*1024, 512*1024);
/* Check for error (NULL) */
pcre2_jit_stack_assign(re, NULL, jit_stack);
match_data = pcre2_match_data_create(re, 10);
rc = pcre2_match(re, subject, length, 0, 0, match_data, NULL);
/* Check results */
pcre2_free(re);
pcre2_jit_stack_free(jit_stack);
.sp
.
.
.SH "JIT FAST PATH API"
.rs
.sp
Because the API described above falls back to interpreted matching when JIT is
not available, it is convenient for programs that are written for general use
in many environments. However, calling JIT via \fBpcre2_match()\fP does have a
performance impact. Programs that are written for use where JIT is known to be
available, and which need the best possible performance, can instead use a
"fast path" API to call JIT matching directly instead of calling
\fBpcre2_match()\fP (obviously only for patterns that have been successfully
processed by \fBpcre2_jit_compile()\fP).
.P
The fast path function is called \fBpcre2_jit_match()\fP, and it takes exactly
the same arguments as \fBpcre2_match()\fP, plus one additional argument that
must point to a JIT stack. The JIT stack arrangements described above do not
apply. The return values are the same as for \fBpcre2_match()\fP.
.P
When you call \fBpcre2_match()\fP, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
the subject pointer is NULL, an immediate error is given. Also, unless
PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the
interests of speed, these checks do not happen on the JIT fast path, and if
invalid data is passed, the result is undefined.
.P
Bypassing the sanity checks and the \fBpcre2_match()\fP wrapping can give
speedups of more than 10%.
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcre2api\fP(3)
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 29 September 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

70
doc/pcre2limits.3 Normal file
View File

@ -0,0 +1,70 @@
.TH PCRE2LIMITS 3 "29 September 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SIZE AND OTHER LIMITATIONS"
.rs
.sp
There are some size limitations in PCRE2 but it is hoped that they will never
in practice be relevant.
.P
The maximum size of a compiled pattern is approximately 64K code units for the
8-bit and 16-bit libraries if PCRE2 is compiled with the default internal
linkage size, which is 2 bytes for these libraries. If you want to process
regular expressions that are truly enormous, you can compile PCRE2 with an
internal linkage size of 3 or 4 (when building the 16-bit library, 3 is rounded
up to 4). See the \fBREADME\fP file in the source distribution and the
.\" HREF
\fBpcre2build\fP
.\"
documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower. In the 32-bit library, the internal
linkage size is always 4.
.P
All values in repeating quantifiers must be less than 65536.
.P
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
order to limit the amount of system stack used at compile time. The limit can
be specified when PCRE2 is built; the default is 250.
.P
There is a limit to the number of forward references to subsequent subpatterns
of around 200,000. Repeated forward references with fixed upper limits, for
example, (?2){0,100} when subpattern number 2 is to the right, are included in
the count. There is no limit to the number of backward references.
.P
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.
.P
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
.P
The maximum length of a subject string is the largest number a PCRE2_SIZE
variable can hold. PCRE2_SIZE is an unsigned integer type, usually defined as
size_t. However, when using the traditional matching function, PCRE2 uses
recursion to handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can be
processed by certain patterns. For a discussion of stack issues, see the
.\" HREF
\fBpcre2stack\fP
.\"
documentation.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 29 September 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

213
doc/pcre2matching.3 Normal file
View File

@ -0,0 +1,213 @@
.TH PCRE2MATCHING 3 "29 September 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 MATCHING ALGORITHMS"
.rs
.sp
This document describes the two different algorithms that are available in
PCRE2 for matching a compiled regular expression against a given subject
string. The "standard" algorithm is the one provided by the \fBpcre2_match()\fP
function. This works in the same as as Perl's matching function, and provide a
Perl-compatible matching operation. The just-in-time (JIT) optimization that is
described in the
.\" HREF
\fBpcre2jit\fP
.\"
documentation is compatible with this function.
.P
An alternative algorithm is provided by the \fBpcre2_dfa_match()\fP function;
it operates in a different way, and is not Perl-compatible. This alternative
has advantages and disadvantages compared with the standard algorithm, and
these are described below.
.P
When there is only one possible way in which a given subject string can match a
pattern, the two algorithms give the same answer. A difference arises, however,
when there are multiple possibilities. For example, if the pattern
.sp
^<.*>
.sp
is matched against the string
.sp
<something> <something else> <something further>
.sp
there are three possible answers. The standard algorithm finds only one of
them, whereas the alternative algorithm finds all three.
.
.
.SH "REGULAR EXPRESSIONS AS TREES"
.rs
.sp
The set of strings that are matched by a regular expression can be represented
as a tree structure. An unlimited repetition in the pattern makes the tree of
infinite size, but it is still a tree. Matching the pattern to a given subject
string (from a given starting point) can be thought of as a search of the tree.
There are two ways to search a tree: depth-first and breadth-first, and these
correspond to the two matching algorithms provided by PCRE2.
.
.
.SH "THE STANDARD MATCHING ALGORITHM"
.rs
.sp
In the terminology of Jeffrey Friedl's book "Mastering Regular Expressions",
the standard algorithm is an "NFA algorithm". It conducts a depth-first search
of the pattern tree. That is, it proceeds along a single path through the tree,
checking that the subject matches what is required. When there is a mismatch,
the algorithm tries any alternatives at the current point, and if they all
fail, it backs up to the previous branch point in the tree, and tries the next
alternative branch at that level. This often involves backing up (moving to the
left) in the subject string as well. The order in which repetition branches are
tried is controlled by the greedy or ungreedy nature of the quantifier.
.P
If a leaf node is reached, a matching string has been found, and at that point
the algorithm stops. Thus, if there is more than one possible match, this
algorithm returns the first one that it finds. Whether this is the shortest,
the longest, or some intermediate length depends on the way the greedy and
ungreedy repetition quantifiers are specified in the pattern.
.P
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and back references.
.
.
.SH "THE ALTERNATIVE MATCHING ALGORITHM"
.rs
.sp
This algorithm conducts a breadth-first search of the tree. Starting from the
first matching point in the subject, it scans the subject string from left to
right, once, character by character, and as it does this, it remembers all the
paths through the tree that represent valid matches. In Friedl's terminology,
this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
.P
Although the general principle of this matching algorithm is that it scans the
subject string only once, without backtracking, there is one exception: when a
lookaround assertion is encountered, the characters following or preceding the
current point have to be independently inspected.
.P
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
them, and in particular, it finds the longest. The matches are returned in
decreasing order of length. There is an option to stop the algorithm after the
first match (which is necessarily the shortest) is found.
.P
Note that all the matches that are found start at the same point in the
subject. If the pattern
.sp
cat(er(pillar)?)?
.sp
is matched against the string "the caterpillar catchment", the result is the
three strings "caterpillar", "cater", and "cat" that start at the fifth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
.P
PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\ed+" is compiled as if it were "a\ed++" because there is no point
even considering the possibility of backtracking into the repeated digits. For
DFA matching, this means that only one possible match is found. If you really
do want multiple matches in such cases, either use an ungreedy repeat
("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
.P
There are a number of features of PCRE2 regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
.P
1. Because the algorithm finds all possible matches, the greedy or ungreedy
nature of repetition quantifiers is not relevant (though it may affect
auto-possessification, as just described). During matching, greedy and ungreedy
quantifiers are treated in exactly the same way. However, possessive
quantifiers can make a difference when what follows could also match what is
quantified, for example in a pattern like this:
.sp
^a++\ew!
.sp
This pattern matches "aaab!" but not "aaa!", which would be matched by a
non-possessive quantifier. Similarly, if an atomic group is present, it is
matched as if it were a standalone pattern at the current point, and the
longest match is then "locked in" for the rest of the overall pattern.
.P
2. When dealing with multiple paths through the tree simultaneously, it is not
straightforward to keep track of captured substrings for the different matching
possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
.P
3. Because no substrings are captured, back references within the pattern are
not supported, and cause errors if encountered.
.P
4. For the same reason, conditional expressions that use a backreference as the
condition or test for a specific group recursion are not supported.
.P
5. Because many paths through the tree may be active, the \eK escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported. It causes an error if encountered.
.P
6. Callouts are supported, but the value of the \fIcapture_top\fP field is
always 1, and the value of the \fIcapture_last\fP field is always 0.
.P
7. The \eC escape sequence, which (in the standard algorithm) always matches a
single code unit, even in a UTF mode, is not supported in these modes, because
the alternative algorithm moves through the subject string one character (not
code unit) at a time, for all active paths through the tree.
.P
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
.
.
.SH "ADVANTAGES OF THE ALTERNATIVE ALGORITHM"
.rs
.sp
Using the alternative matching algorithm provides the following advantages:
.P
1. All possible matches (at a single point in the subject) are automatically
found, and in particular, the longest match is found. To find more than one
match using the standard algorithm, you have to do kludgy things with
callouts.
.P
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack (except for lookbehinds), it is possible to pass very
long subject strings to the matching function in several pieces, checking for
partial matching each time. Although it is also possible to do multi-segment
matching using the standard algorithm, by retaining partially matched
substrings, it is more complicated. The
.\" HREF
\fBpcre2partial\fP
.\"
documentation gives details of partial matching and discusses multi-segment
matching.
.
.
.SH "DISADVANTAGES OF THE ALTERNATIVE ALGORITHM"
.rs
.sp
The alternative algorithm suffers from a number of disadvantages:
.P
1. It is substantially slower than the standard algorithm. This is partly
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
.P
2. Capturing parentheses and back references are not supported.
.P
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 29 September 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi

View File

@ -120,7 +120,7 @@ D is inspected during pcre2_dfa_match() execution
/* These are for pcre2_jit_compile(). */
#define PCRE2_JIT 0x00000001u /* For full matching */
#define PCRE2_JIT_COMPLETE 0x00000001u /* For full matching */
#define PCRE2_JIT_PARTIAL_SOFT 0x00000002u
#define PCRE2_JIT_PARTIAL_HARD 0x00000004u

View File

@ -124,7 +124,7 @@ subject_length = strlen((char *)subject);
re = pcre2_compile(
pattern, /* the pattern */
-1, /* indicates pattern is zero-terminated */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&errornumber, /* for error number */
&erroroffset, /* for error offset */