-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3) Library Functions Manual PCRE2(3)



NAME
PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

PCRE2 is the name used for a revised API for the PCRE library, which is
a set of functions, written in C, that implement regular expression
pattern matching using the same syntax and semantics as Perl, with just
a few differences. Some features that appeared in Python and the origi-
nal PCRE before they appeared in Perl are also available using the
Python syntax. There is also some support for one or two .NET and Onig-
uruma syntax items, and there are options for requesting some minor
changes that give better ECMAScript (aka JavaScript) compatibility.

The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
32-bit code units, which means that up to three separate libraries may
be installed. The original work to extend PCRE to 16-bit and 32-bit
code units was done by Zoltan Herczeg and Christian Persch, respec-
tively. In all three cases, strings can be interpreted either as one
character per code unit, or as UTF-encoded Unicode, with support for
Unicode general category properties. Unicode support is optional at
build time (but is the default). However, processing strings as UTF
code units must be enabled explicitly at run time. The version of Uni-
code in use can be discovered by running

  pcre2test -C

The three libraries contain identical sets of functions, with names
ending in _8, _16, or _32, respectively (for example, pcre2_com-
pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
32, a program that uses just one code unit width can be written using
generic names such as pcre2_compile(), and the documentation is written
assuming that this is the case.

In addition to the Perl-compatible matching function, PCRE2 contains an
alternative function that matches the same compiled patterns in a dif-
ferent way. In certain circumstances, the alternative function has some
advantages. For a discussion of the two matching algorithms, see the
pcre2matching page.

Details of exactly which Perl regular expression features are and are
not supported by PCRE2 are given in separate documents. See the
pcre2pattern and pcre2compat pages. There is a syntax summary in the
pcre2syntax page.

Some features of PCRE2 can be included, excluded, or changed when the
library is built. The pcre2_config() function makes it possible for a
client to discover which features are available. The features them-
selves are described in the pcre2build page. Documentation about build-
ing PCRE2 for various operating systems can be found in the README and
NON-AUTOTOOLS_BUILD files in the source distribution.

The libraries contain a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre2", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

If you are using PCRE2 in a non-UTF application that permits users to
supply arbitrary patterns for compilation, you should be aware of a
feature that allows users to turn on UTF support from within a pattern.
For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
mode, which interprets patterns and subjects as strings of UTF-8 code
units instead of individual 8-bit characters. This causes both the pat-
tern and any data against which it is matched to be checked for UTF-8
validity. If the data string is very long, such a check might use suf-
ficiently many resources as to cause your application to lose perfor-
mance.

One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
calling pcre2_compile(). This causes a compile-time error if a pattern
contains a UTF-setting sequence.

The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea-
ture can be disallowed by setting the PCRE2_NEVER_UCP option.

If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
subsequent matches to avoid running redundant checks.

The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
to problems, because it may leave the current matching point in the
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
option can be used to lock out the use of \C, causing a compile-time
error if it is encountered.

Another way that performance can be hit is by running a pattern that
has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
vides some protection against this: see the pcre2_set_match_limit()
function in the pcre2api page.


USER DOCUMENTATION

The user documentation for PCRE2 comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
In the plain text format, the descriptions of the pcre2grep and
pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
respectively. The remaining sections, except for the pcre2demo section
(which is a program listing), and the short pages for individual func-
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
tions are as follows:

  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
  pcre2callout       details of the callout feature
  pcre2compat        discussion of Perl compatibility
  pcre2demo          a demonstration C program that uses PCRE2
  pcre2grep          description of the pcre2grep command (8-bit only)
  pcre2jit           discussion of just-in-time optimization support
  pcre2limits        details of size and other limits
  pcre2matching      discussion of the two matching algorithms
  pcre2partial       details of the partial matching facility
  pcre2pattern       syntax and semantics of supported regular
                       expression patterns
  pcre2perform       discussion of performance issues
  pcre2posix         the POSIX-compatible C API for the 8-bit library
  pcre2sample        discussion of the pcre2demo program
  pcre2stack         discussion of stack usage
  pcre2syntax        quick syntax reference
  pcre2test          description of the pcre2test command
  pcre2unicode       discussion of Unicode and UTF support

In the "man" and HTML formats, there is also a short page for each C
library function, listing its arguments and results.


AUTHOR

Philip Hazel
University Computing Service
Cambridge, England.

Putting an actual email address here is a spam magnet. If you want to
email me, use my two initials, followed by the two digits 10, at the
domain cam.ac.uk.


REVISION

Last updated: 13 April 2015
Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3) Library Functions Manual PCRE2API(3)



NAME
PCRE2 - Perl-compatible regular expressions (revised API)

#include <pcre2.h>

PCRE2 is a new API for PCRE. This document contains a description of
all its functions. See the pcre2 document for an overview of all the
PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
  uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
  pcre2_compile_context *ccontext);

void pcre2_code_free(pcre2_code *code);

pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
  pcre2_general_context *gcontext);

pcre2_match_data *pcre2_match_data_create_from_pattern(
  const pcre2_code *code, pcre2_general_context *gcontext);

int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext);

int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext,
  int *workspace, PCRE2_SIZE wscount);

void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

pcre2_general_context *pcre2_general_context_create(
  void *(*private_malloc)(PCRE2_SIZE, void *),
  void (*private_free)(void *, void *), void *memory_data);

pcre2_general_context *pcre2_general_context_copy(
  pcre2_general_context *gcontext);

void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

pcre2_compile_context *pcre2_compile_context_create(
  pcre2_general_context *gcontext);

pcre2_compile_context *pcre2_compile_context_copy(
  pcre2_compile_context *ccontext);

void pcre2_compile_context_free(pcre2_compile_context *ccontext);

int pcre2_set_bsr(pcre2_compile_context *ccontext,
  uint32_t value);

int pcre2_set_character_tables(pcre2_compile_context *ccontext,
  const unsigned char *tables);

int pcre2_set_newline(pcre2_compile_context *ccontext,
  uint32_t value);

int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
  uint32_t value);

int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
  int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

pcre2_match_context *pcre2_match_context_create(
  pcre2_general_context *gcontext);

pcre2_match_context *pcre2_match_context_copy(
  pcre2_match_context *mcontext);

void pcre2_match_context_free(pcre2_match_context *mcontext);

int pcre2_set_callout(pcre2_match_context *mcontext,
  int (*callout_function)(pcre2_callout_block *, void *),
  void *callout_data);

int pcre2_set_match_limit(pcre2_match_context *mcontext,
  uint32_t value);

int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
  uint32_t value);

int pcre2_set_recursion_memory_management(
  pcre2_match_context *mcontext,
  void *(*private_malloc)(PCRE2_SIZE, void *),
  void (*private_free)(void *, void *), void *memory_data);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

int pcre2_substring_copy_byname(pcre2_match_data *match_data,
  PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_UCHAR *buffer,
  PCRE2_SIZE *bufflen);

void pcre2_substring_free(PCRE2_UCHAR *buffer);

int pcre2_substring_get_byname(pcre2_match_data *match_data,
  PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_UCHAR **bufferptr,
  PCRE2_SIZE *bufflen);

int pcre2_substring_length_byname(pcre2_match_data *match_data,
  PCRE2_SPTR name, PCRE2_SIZE *length);

int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
  uint32_t number, PCRE2_SIZE *length);

int pcre2_substring_nametable_scan(const pcre2_code *code,
  PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

int pcre2_substring_number_from_name(const pcre2_code *code,
  PCRE2_SPTR name);

void pcre2_substring_list_free(PCRE2_SPTR *list);

int pcre2_substring_list_get(pcre2_match_data *match_data,
  PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext, PCRE2_SPTR replacement,
  PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
  PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

int pcre2_jit_compile(pcre2_code *code, uint32_t options);

int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
  PCRE2_SIZE length, PCRE2_SIZE startoffset,
  uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext);

void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
  PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
  pcre2_jit_callback callback_function, void *callback_data);

void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

int32_t pcre2_serialize_decode(pcre2_code **codes,
  int32_t number_of_codes, const uint8_t *bytes,
  pcre2_general_context *gcontext);

int32_t pcre2_serialize_encode(pcre2_code **codes,
  int32_t number_of_codes, uint8_t **serialized_bytes,
  PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

void pcre2_serialize_free(uint8_t *bytes);

int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
  PCRE2_SIZE bufflen);

const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);

int pcre2_callout_enumerate(const pcre2_code *code,
  int (*callback)(pcre2_callout_enumerate_block *, void *),
  void *user_data);

int pcre2_config(uint32_t what, void *where);


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
code units, respectively. However, there is just one header file,
pcre2.h. This contains the function prototypes and other definitions
for all three libraries. One, two, or all three can be installed simul-
taneously. On Unix-like systems the libraries are called libpcre2-8,
libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
inal PCRE libraries.

Character strings are passed to and from a PCRE2 library as a sequence
of unsigned integers in code units of the appropriate width. Every
PCRE2 function comes in three different forms, one for each library,
for example:

  pcre2_compile_8()
  pcre2_compile_16()
  pcre2_compile_32()

There are also three different sets of data types:

  PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
  PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32

The UCHAR types define unsigned code units of the appropriate widths.
For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
types are constant pointers to the equivalent UCHAR types, that is,
they are pointers to vectors of unsigned code units.

Many applications use only one code unit width. For their convenience,
macros are defined whose names are the generic forms such as pcre2_com-
pile() and PCRE2_SPTR. These macros use the value of the macro
PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
An application must define it to be 8, 16, or 32 before including
pcre2.h in order to make use of the generic names.

Applications that use more than one code unit width can be linked with
more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
be 0 before including pcre2.h, and then use the real function names.
Any code that is to be included in an environment where the value of
PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
names. (Unfortunately, it is not possible in C code to save and restore
the value of a macro.)

If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
compiler error occurs.

When using multiple libraries in an application, you must take care
when processing any particular pattern to use only functions from a
single library. For example, if you want to run a match using a pat-
tern that was compiled with pcre2_compile_16(), you must do so with
pcre2_match_16(), not pcre2_match_8().

In the function summaries above, and in the rest of this document and
other PCRE2 documents, functions and data types are described using
their generic names, without the 8, 16, or 32 suffix.


PCRE2 API OVERVIEW

PCRE2 has its own native API, which is described in this document.
There are also some wrapper functions for the 8-bit library that corre-
spond to the POSIX regular expression API, but they do not give access
to all the functionality. They are described in the pcre2posix documen-
tation. Both these APIs define a set of C function calls.

The native API C data types, function prototypes, option values, and
error codes are defined in the header file pcre2.h, which contains def-
initions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
numbers for the library. Applications can use these to include support
for different releases of PCRE2.

In a Windows environment, if you want to statically link an application
program against a non-dll PCRE2 library, you must define PCRE2_STATIC
before including pcre2.h.

The functions pcre2_compile() and pcre2_match() are used for compiling
and matching regular expressions in a Perl-compatible manner. A sample
program that demonstrates the simplest way of using them is provided in
the file called pcre2demo.c in the PCRE2 source distribution. A listing
of this program is given in the pcre2demo documentation, and the
pcre2sample documentation describes how to compile and run it.

Just-in-time compiler support is an optional feature of PCRE2 that can
be built in appropriate hardware environments. It greatly speeds up the
matching performance of many patterns. Programs can request that it be
used if available, by calling pcre2_jit_compile() after a pattern has
been successfully compiled by pcre2_compile(). This does nothing if JIT
support is not available.

More complicated programs might need to make use of the specialist
functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
pcre2_jit_stack_assign() in order to control the JIT code's memory
usage.

JIT matching is automatically used by pcre2_match() if it is available.
There is also a direct interface for JIT matching, which gives improved
performance. The JIT-specific functions are discussed in the pcre2jit
documentation.

A second matching function, pcre2_dfa_match(), which is not Perl-com-
patible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a
given point in the subject), and scans the subject just once (unless
there are lookbehind assertions). However, this algorithm does not
return captured substrings. A description of the two matching algo-
rithms and their advantages and disadvantages is given in the
pcre2matching documentation. There is no JIT support for
pcre2_dfa_match().

In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that has been matched by pcre2_match(). They are:

  pcre2_substring_copy_byname()
  pcre2_substring_copy_bynumber()
  pcre2_substring_get_byname()
  pcre2_substring_get_bynumber()
  pcre2_substring_list_get()
  pcre2_substring_length_byname()
  pcre2_substring_length_bynumber()
  pcre2_substring_nametable_scan()
  pcre2_substring_number_from_name()

pcre2_substring_free() and pcre2_substring_list_free() are also pro-
vided, to free the memory used for extracted strings.

The function pcre2_substitute() can be called to match a pattern and
return a copy of the subject string with substitutions for parts that
were matched.

Finally, there are functions for finding out information about a com-
piled pattern (pcre2_pattern_info()) and about the configuration with
which PCRE2 was built (pcre2_config()).


STRING LENGTHS AND OFFSETS

The PCRE2 API uses string lengths and offsets into strings of code
units in several places. These values are always of type PCRE2_SIZE,
which is an unsigned integer type, currently always defined as size_t.
The largest value that can be stored in such a type (that is
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
strings and unset offsets. Therefore, the longest string that can be
handled is one less than this maximum.
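
The reserved value can be illustrated without the library at all. The
sketch below assumes that PCRE2_SIZE is size_t, as stated above; the
macro SKETCH_SIZE_MAX and both function names are hypothetical
stand-ins. (pcre2.h provides its own names for this value,
PCRE2_ZERO_TERMINATED and PCRE2_UNSET.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ~(PCRE2_SIZE)0, assuming PCRE2_SIZE is
   size_t: every bit set, which is the largest representable value. */
#define SKETCH_SIZE_MAX (~(size_t)0)

/* Resolve a length argument. The reserved value means "the string is
   zero-terminated, so measure it"; any other value is used as given. */
size_t resolve_length(const char *s, size_t length)
{
    assert(s != NULL);
    return (length == SKETCH_SIZE_MAX) ? strlen(s) : length;
}

/* An offset equal to the reserved value means "unset", as for a
   capturing group that took no part in the match. */
int offset_is_set(size_t offset)
{
    return offset != SKETCH_SIZE_MAX;
}
```

Because one value is taken for this purpose, a genuine length can never
be equal to it, which is why the longest usable string is one code unit
shorter than the maximum.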


NEWLINES

PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
ceding, or any Unicode newline sequence. The Unicode newline sequences
are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
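
The five conventions can be summarized by a small recognizer. This is a
sketch of the rules as stated above, not PCRE2's own code: the enum, its
constants, and newline_length() are hypothetical names, and the input is
taken to be an array of code points so that LS and PS can be
represented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the five conventions. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the length, in code points, of the newline that starts at
   s[0] under the given convention, or 0 if there is none. n is the
   number of code points available at s. */
size_t newline_length(const uint32_t *s, size_t n, enum nl_convention conv)
{
    int crlf;

    assert(s != NULL || n == 0);
    if (n == 0) return 0;
    crlf = (n >= 2 && s[0] == 0x0D && s[1] == 0x0A);

    switch (conv) {
    case NL_CR:
        return (s[0] == 0x0D) ? 1 : 0;
    case NL_LF:
        return (s[0] == 0x0A) ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (s[0] == 0x0D || s[0] == 0x0A) ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        switch (s[0]) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:   /* LF VT FF CR */
        case 0x85: case 0x2028: case 0x2029:          /* NEL LS PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

Note that under the CR convention the sequence CRLF begins with a
one-character newline, whereas under the CRLF, ANYCRLF, and ANY
conventions the pair is treated as a single newline of length two.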

Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE2 is built, a default
can be specified. If none is given, the default is LF, which is the
Unix standard. However, the newline convention can be changed by an
application when calling pcre2_compile(), or it can be specified by
special text at the start of the pattern itself; this overrides any
other settings. See the pcre2pattern page for details of the special
character sequences.

In the PCRE2 documentation the word "newline" is used to mean "the
character or pair of characters that indicate a line break". The choice
of newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre2_match() options below.

The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this
has its own separate convention.


MULTITHREADING

In a multithreaded application it is important to keep thread-specific
data separate from data that can be shared between threads. The PCRE2
library code itself is thread-safe: it contains no static or global
variables. The API is designed to be fairly simple for non-threaded
applications while at the same time ensuring that multithreaded appli-
cations can use it.

There are several different blocks of data that are used to pass infor-
mation between the application and the PCRE2 libraries.

(1) A pointer to the compiled form of a pattern is returned to the user
when pcre2_compile() is successful. The data in the compiled pattern is
fixed, and does not change when the pattern is matched. Therefore, it
is thread-safe, that is, the same compiled pattern can be used by more
than one thread simultaneously. An application can compile all its pat-
terns at the start, before forking off multiple threads that use them.
However, if the just-in-time optimization feature is being used, it
needs separate memory stack areas for each thread. See the pcre2jit
documentation for more details.

(2) The next section below introduces the idea of "contexts" in which
PCRE2 functions are called. A context is nothing more than a collection
of parameters that control the way PCRE2 operates. Grouping a number of
parameters together in a context is a convenient way of passing them to
a PCRE2 function without using lots of arguments. The parameters that
are stored in contexts are in some sense "advanced features" of the
API. Many straightforward applications will not need to use contexts.

In a multithreaded application, if the parameters in a context are val-
ues that are never changed, the same context can be used by all the
threads. However, if any thread needs to change any value in a context,
it must make its own thread-specific copy.

(3) The matching functions need a block of memory for working space and
for storing the results of a match. This includes details of what was
matched, as well as additional information such as the name of a
(*MARK) setting. Each thread must provide its own version of this mem-
ory.


PCRE2 CONTEXTS

Some PCRE2 functions have a lot of parameters, many of which are used
only by specialist applications, for example, those that use custom
memory management or non-standard character tables. To keep function
argument lists at a reasonable size, and at the same time to keep the
API extensible, "uncommon" parameters are passed to certain functions
in a context instead of directly. A context is just a block of memory
that holds the parameter values. Applications that do not need to
adjust any of the context parameters can pass NULL when a context
pointer is required.

There are three different types of context: a general context that is
relevant for several PCRE2 operations, a compile-time context, and a
match-time context.

The general context

At present, this context just contains pointers to (and data for)
external memory management functions that are called from several
places in the PCRE2 library. The context is named `general' rather than
specifically `memory' because in future other fields may be added. If
you do not want to supply your own custom memory management functions,
you do not need to bother with a general context. A general context is
created by:

pcre2_general_context *pcre2_general_context_create(
  void *(*private_malloc)(PCRE2_SIZE, void *),
  void (*private_free)(void *, void *), void *memory_data);

The two function pointers specify custom memory management functions,
whose prototypes are:

  void *private_malloc(PCRE2_SIZE, void *);
  void private_free(void *, void *);

Whenever code in PCRE2 calls these functions, the final argument is the
value of memory_data. Either of the first two arguments of the creation
function may be NULL, in which case the system memory management func-
tions malloc() and free() are used. (This is not currently useful, as
there are no other fields in a general context, but in future there
might be.) The private_malloc() function is used (if supplied) to
obtain memory for storing the context, and all three values are saved
as part of the context.

Whenever PCRE2 creates a data block of any kind, the block contains a
pointer to the free() function that matches the malloc() function that
was used. When the time comes to free the block, this function is
called.

A general context can be copied by calling:

pcre2_general_context *pcre2_general_context_copy(
  pcre2_general_context *gcontext);

The memory used for a general context should be freed by calling:

void pcre2_general_context_free(pcre2_general_context *gcontext);


The compile context

A compile context is required if you want to change the default values
of any of the following compile-time parameters:

  What \R matches (Unicode newlines or CR, LF, CRLF only)
  PCRE2's character tables
  The newline character sequence
  The compile time nested parentheses limit
  An external function for stack checking

A compile context is also required if you are using custom memory man-
agement. If none of these apply, just pass NULL as the context argu-
ment of pcre2_compile().

A compile context is created, copied, and freed by the following func-
tions:

pcre2_compile_context *pcre2_compile_context_create(
  pcre2_general_context *gcontext);

pcre2_compile_context *pcre2_compile_context_copy(
  pcre2_compile_context *ccontext);

void pcre2_compile_context_free(pcre2_compile_context *ccontext);

A compile context is created with default values for its parameters.
These can be changed by calling the following functions, which return 0
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

int pcre2_set_bsr(pcre2_compile_context *ccontext,
  uint32_t value);

The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
Unicode line ending sequence. The value is used by the JIT compiler and
by the two interpreted matching functions, pcre2_match() and
pcre2_dfa_match().

int pcre2_set_character_tables(pcre2_compile_context *ccontext,
  const unsigned char *tables);

The value must be the result of a call to pcre2_maketables(), whose
only argument is a general context. This function builds a set of char-
acter tables in the current locale.

int pcre2_set_newline(pcre2_compile_context *ccontext,
  uint32_t value);

This specifies which characters or character sequences are to be recog-
nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).

When a pattern is compiled with the PCRE2_EXTENDED option, the value of
this parameter affects the recognition of white space and the end of
internal comments starting with #. The value is saved with the compiled
pattern for subsequent use by the JIT compiler and by the two inter-
preted matching functions, pcre2_match() and pcre2_dfa_match().
|
|
|
|
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
This parameter adjusts the limit, set when PCRE2 is built (default 250),
|
|
on the depth of parenthesis nesting in a pattern. This limit stops
|
|
rogue patterns using up too much system stack when being compiled.
|
|
|
|
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
|
|
int (*guard_function)(uint32_t, void *), void *user_data);
|
|
|
|
There is at least one application that runs PCRE2 in threads with very
|
|
limited system stack, where running out of stack is to be avoided at
|
|
all costs. The parenthesis limit above cannot take account of how much
|
|
stack is actually available. For a finer control, you can supply a
|
|
function that is called whenever pcre2_compile() starts to compile a
|
|
parenthesized part of a pattern. This function can check the actual
|
|
stack size (or anything else that it wants to, of course).
|
|
|
|
The first argument to the guard function gives the current depth of
nesting, and the second is user data that is set up by the last
argument of pcre2_set_compile_recursion_guard(). The guard function
should return zero if all is well, or non-zero to force an error.
|
|
|
|
The match context
|
|
|
|
A match context is required if you want to change the default values of
|
|
any of the following match-time parameters:
|
|
|
|
A callout function
|
|
The limit for calling match()
|
|
The limit for calling match() recursively
|
|
|
|
A match context is also required if you are using custom memory manage-
|
|
ment. If none of these apply, just pass NULL as the context argument
|
|
of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
|
|
|
|
A match context is created, copied, and freed by the following func-
|
|
tions:
|
|
|
|
pcre2_match_context *pcre2_match_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_context *pcre2_match_context_copy(
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_match_context_free(pcre2_match_context *mcontext);
|
|
|
|
A match context is created with default values for its parameters.
|
|
These can be changed by calling the following functions, which return 0
|
|
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
|
|
|
|
int pcre2_set_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
This sets up a "callout" function, which PCRE2 will call at specified
|
|
points during a matching operation. Details are given in the pcre2call-
|
|
out documentation.
|
|
|
|
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
The match_limit parameter provides a means of preventing PCRE2 from
|
|
using up too many resources when processing patterns that are not going
|
|
to match, but which have a very large number of possibilities in their
|
|
search trees. The classic example is a pattern that uses nested unlim-
|
|
ited repeats.
|
|
|
|
Internally, pcre2_match() uses a function called match(), which it
|
|
calls repeatedly (sometimes recursively). The limit set by match_limit
|
|
is imposed on the number of times this function is called during a
|
|
match, which has the effect of limiting the amount of backtracking that
|
|
can take place. For patterns that are not anchored, the count restarts
|
|
from zero for each position in the subject string. This limit is not
|
|
relevant to pcre2_dfa_match(), which ignores it.
|
|
|
|
When pcre2_match() is called with a pattern that was successfully pro-
|
|
cessed by pcre2_jit_compile(), the way in which matching is executed is
|
|
entirely different. However, there is still the possibility of runaway
|
|
matching that goes on for a very long time, and so the match_limit
|
|
value is also used in this case (but in a different way) to limit how
|
|
long the matching can continue.
|
|
|
|
The default value for the limit can be set when PCRE2 is built; the
|
|
default default is 10 million, which handles all but the most extreme
|
|
cases. If the limit is exceeded, pcre2_match() returns
|
|
PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be sup-
|
|
plied by an item at the start of a pattern of the form
|
|
|
|
(*LIMIT_MATCH=ddd)
|
|
|
|
where ddd is a decimal number. However, such a setting is ignored
|
|
unless ddd is less than the limit set by the caller of pcre2_match()
|
|
or, if no such limit is set, less than the default.
|
|
|
|
int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
The recursion_limit parameter is similar to match_limit, but instead of
|
|
limiting the total number of times that match() is called, it limits
|
|
the depth of recursion. The recursion depth is a smaller number than
|
|
the total number of calls, because not all calls to match() are recur-
|
|
sive. This limit is of use only if it is set smaller than match_limit.
|
|
|
|
Limiting the recursion depth limits the amount of system stack that can
|
|
be used, or, when PCRE2 has been compiled to use memory on the heap
|
|
instead of the stack, the amount of heap memory that can be used. This
|
|
limit is not relevant, and is ignored, when matching is done using JIT
|
|
compiled code or by the pcre2_dfa_match() function.
|
|
|
|
The default value for recursion_limit can be set when PCRE2 is built;
|
|
the default default is the same value as the default for match_limit.
|
|
If the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION-
|
|
LIMIT. A value for the recursion limit may also be supplied by an item
|
|
at the start of a pattern of the form
|
|
|
|
(*LIMIT_RECURSION=ddd)
|
|
|
|
where ddd is a decimal number. However, such a setting is ignored
|
|
unless ddd is less than the limit set by the caller of pcre2_match()
|
|
or, if no such limit is set, less than the default.
|
|
|
|
int pcre2_set_recursion_memory_management(
|
|
pcre2_match_context *mcontext,
|
|
void *(*private_malloc)(PCRE2_SIZE, void *),
|
|
void (*private_free)(void *, void *), void *memory_data);
|
|
|
|
This function sets up two additional custom memory management functions
|
|
for use by pcre2_match() when PCRE2 is compiled to use the heap for
|
|
remembering backtracking data, instead of recursive function calls that
|
|
use the system stack. There is a discussion about PCRE2's stack usage
|
|
in the pcre2stack documentation. See the pcre2build documentation for
|
|
details of how to build PCRE2.
|
|
|
|
Using the heap for recursion is a non-standard way of building PCRE2,
|
|
for use in environments that have limited stacks. Because of the
|
|
greater use of memory management, pcre2_match() runs more slowly. Func-
|
|
tions that are different to the general custom memory functions are
|
|
provided so that special-purpose external code can be used for this
|
|
case, because the memory blocks are all the same size. The blocks are
|
|
retained by pcre2_match() until it is about to exit so that they can be
|
|
re-used when possible during the match. In the absence of these func-
|
|
tions, the normal custom memory management functions are used, if sup-
|
|
plied, otherwise the system functions.
|
|
|
|
|
|
CHECKING BUILD-TIME OPTIONS
|
|
|
|
int pcre2_config(uint32_t what, void *where);
|
|
|
|
The function pcre2_config() makes it possible for a PCRE2 client to
|
|
discover which optional features have been compiled into the PCRE2
|
|
library. The pcre2build documentation has more details about these
|
|
optional features.
|
|
|
|
The first argument for pcre2_config() specifies which information is
|
|
required. The second argument is a pointer to memory into which the
|
|
information is placed. If NULL is passed, the function returns the
|
|
amount of memory that is needed for the requested information. For
|
|
calls that return numerical values, the amount is given in bytes; when
|
|
requesting these values, where should point to appropriately aligned
|
|
memory. For calls that return strings, the required length is given in
|
|
code units, not counting the terminating zero.
|
|
|
|
When requesting information, the returned value from pcre2_config() is
|
|
non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
|
|
TION if the value in the first argument is not recognized. The follow-
|
|
ing information is available:
|
|
|
|
PCRE2_CONFIG_BSR
|
|
|
|
The output is a uint32_t integer whose value indicates what character
|
|
sequences the \R escape sequence matches by default. A value of
|
|
PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
|
|
sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
|
|
LF, or CRLF. The default can be overridden when a pattern is compiled.
|
|
|
|
PCRE2_CONFIG_JIT
|
|
|
|
The output is a uint32_t integer that is set to one if support for
|
|
just-in-time compiling is available; otherwise it is set to zero.
|
|
|
|
PCRE2_CONFIG_JITTARGET
|
|
|
|
The where argument should point to a buffer that is at least 48 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) The buffer is filled with a
|
|
string that contains the name of the architecture for which the JIT
|
|
compiler is configured, for example "x86 32bit (little endian +
|
|
unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
|
|
returned, otherwise the number of code units used is returned. This is
|
|
the length of the string, plus one unit for the terminating zero.
|
|
|
|
PCRE2_CONFIG_LINKSIZE
|
|
|
|
The output is a uint32_t integer that contains the number of bytes used
|
|
for internal linkage in compiled regular expressions. When PCRE2 is
|
|
configured, the value can be set to 2, 3, or 4, with the default being
|
|
2. This is the value that is returned by pcre2_config(). However, when
|
|
the 16-bit library is compiled, a value of 3 is rounded up to 4, and
|
|
when the 32-bit library is compiled, internal linkages always use 4
|
|
bytes, so the configured value is not relevant.
|
|
|
|
The default value of 2 for the 8-bit and 16-bit libraries is sufficient
|
|
for all but the most massive patterns, since it allows the size of the
|
|
compiled pattern to be up to 64K code units. Larger values allow larger
|
|
regular expressions to be compiled by those two libraries, but at the
|
|
expense of slower matching.
|
|
|
|
PCRE2_CONFIG_MATCHLIMIT
|
|
|
|
The output is a uint32_t integer that gives the default limit for the
|
|
number of internal matching function calls in a pcre2_match() execu-
|
|
tion. Further details are given with pcre2_match() below.
|
|
|
|
PCRE2_CONFIG_NEWLINE
|
|
|
|
The output is a uint32_t integer whose value specifies the default
|
|
character sequence that is recognized as meaning "newline". The values
|
|
are:
|
|
|
|
  PCRE2_NEWLINE_CR       Carriage return (CR)
  PCRE2_NEWLINE_LF       Linefeed (LF)
  PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
  PCRE2_NEWLINE_ANY      Any Unicode line ending
  PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
|
|
|
|
The default should normally correspond to the standard sequence for
|
|
your operating system.
|
|
|
|
PCRE2_CONFIG_PARENSLIMIT
|
|
|
|
The output is a uint32_t integer that gives the maximum depth of nest-
|
|
ing of parentheses (of any kind) in a pattern. This limit is imposed to
|
|
cap the amount of system stack used when a pattern is compiled. It is
|
|
specified when PCRE2 is built; the default is 250. This limit does not
|
|
take into account the stack that may already be used by the calling
|
|
application. For finer control over compilation stack usage, see
|
|
pcre2_set_compile_recursion_guard().
|
|
|
|
PCRE2_CONFIG_RECURSIONLIMIT
|
|
|
|
The output is a uint32_t integer that gives the default limit for the
|
|
depth of recursion when calling the internal matching function in a
|
|
pcre2_match() execution. Further details are given with pcre2_match()
|
|
below.
|
|
|
|
PCRE2_CONFIG_STACKRECURSE
|
|
|
|
The output is a uint32_t integer that is set to one if internal recur-
|
|
sion when running pcre2_match() is implemented by recursive function
|
|
calls that use the system stack to remember their state. This is the
|
|
usual way that PCRE2 is compiled. The output is zero if PCRE2 was com-
|
|
piled to use blocks of data on the heap instead of recursive function
|
|
calls.
|
|
|
|
PCRE2_CONFIG_UNICODE_VERSION
|
|
|
|
The where argument should point to a buffer that is at least 24 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) If PCRE2 has been compiled
|
|
without Unicode support, the buffer is filled with the text "Unicode
|
|
not supported". Otherwise, the Unicode version string (for example,
|
|
"8.0.0") is inserted. The number of code units used is returned. This
|
|
is the length of the string plus one unit for the terminating zero.
|
|
|
|
PCRE2_CONFIG_UNICODE
|
|
|
|
The output is a uint32_t integer that is set to one if Unicode support
|
|
is available; otherwise it is set to zero. Unicode support implies UTF
|
|
support.
|
|
|
|
PCRE2_CONFIG_VERSION
|
|
|
|
The where argument should point to a buffer that is at least 12 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) The buffer is filled with the
|
|
PCRE2 version string, zero-terminated. The number of code units used is
|
|
returned. This is the length of the string plus one unit for the termi-
|
|
nating zero.
|
|
|
|
|
|
COMPILING A PATTERN
|
|
|
|
pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
|
|
uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
|
|
pcre2_compile_context *ccontext);
|
|
|
|
void pcre2_code_free(pcre2_code *code);
|
|
|
|
The pcre2_compile() function compiles a pattern into an internal form.
|
|
The pattern is defined by a pointer to a string of code units and a
|
|
length. If the pattern is zero-terminated, the length can be specified
|
|
as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
|
|
memory that contains the compiled pattern and related data. The caller
|
|
must free the memory by calling pcre2_code_free() when it is no longer
|
|
needed.
|
|
|
|
NOTE: When one of the matching functions is called, pointers to the
|
|
compiled pattern and the subject string are set in the match data block
|
|
so that they can be referenced by the extraction functions. After run-
|
|
ning a match, you must not free a compiled pattern (or a subject
|
|
string) until after all operations on the match data block have taken
|
|
place.
|
|
|
|
If the compile context argument ccontext is NULL, memory for the com-
|
|
piled pattern is obtained by calling malloc(). Otherwise, it is
|
|
obtained from the same memory function that was used for the compile
|
|
context.
|
|
|
|
The options argument contains various bit settings that affect the com-
|
|
pilation. It should be zero if no options are required. The available
|
|
options are described below. Some of them (in particular, those that
|
|
are compatible with Perl, but some others as well) can also be set and
|
|
unset from within the pattern (see the detailed description in the
|
|
pcre2pattern documentation).
|
|
|
|
For those options that can be different in different parts of the pat-
|
|
tern, the contents of the options argument specifies their settings at
|
|
the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
|
|
options can be set at the time of matching as well as at compile time.
|
|
|
|
Other, less frequently required compile-time parameters (for example,
|
|
the newline setting) can be provided in a compile context (as described
|
|
above).
|
|
|
|
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
|
|
diately. Otherwise, if compilation of a pattern fails, pcre2_compile()
|
|
returns NULL, having set these variables to an error code and an offset
|
|
(number of code units) within the pattern, respectively. The
|
|
pcre2_get_error_message() function provides a textual message for each
|
|
error code. Compilation errors are positive numbers, but UTF formatting
|
|
errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the
|
|
offset is that of the first code unit of the failing character.
|
|
|
|
Some errors are not detected until the whole pattern has been scanned;
|
|
in these cases, the offset passed back is the length of the pattern.
|
|
Note that the offset is in code units, not characters, even in a UTF
|
|
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
|
|
acter.
|
|
|
|
This code fragment shows a typical straightforward call to pcre2_com-
|
|
pile():
|
|
|
|
  pcre2_code *re;
  PCRE2_SIZE erroffset;
  int errorcode;
  re = pcre2_compile(
    (PCRE2_SPTR)"^A.*Z",    /* the pattern */
    PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
    0,                      /* default options */
    &errorcode,             /* for error code */
    &erroffset,             /* for error offset */
    NULL);                  /* no compile context */
|
|
|
|
The following names for option bits are defined in the pcre2.h header
|
|
file:
|
|
|
|
PCRE2_ANCHORED
|
|
|
|
If this bit is set, the pattern is forced to be "anchored", that is, it
|
|
is constrained to match only at the first matching point in the string
|
|
that is being searched (the "subject string"). This effect can also be
|
|
achieved by appropriate constructs in the pattern itself, which is the
|
|
only way to do it in Perl.
|
|
|
|
PCRE2_ALLOW_EMPTY_CLASS
|
|
|
|
By default, for compatibility with Perl, a closing square bracket that
|
|
immediately follows an opening one is treated as a data character for
|
|
the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
|
|
class, which therefore contains no characters and so can never match.
|
|
|
|
PCRE2_ALT_BSUX
|
|
|
|
This option requests alternative handling of three escape sequences,
which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
|
|
When it is set:
|
|
|
|
(1) \U matches an upper case "U" character; by default \U causes a com-
|
|
pile time error (Perl uses \U to upper case subsequent characters).
|
|
|
|
(2) \u matches a lower case "u" character unless it is followed by four
|
|
hexadecimal digits, in which case the hexadecimal number defines the
|
|
code point to match. By default, \u causes a compile time error (Perl
|
|
uses it to upper case the following character).
|
|
|
|
(3) \x matches a lower case "x" character unless it is followed by two
|
|
hexadecimal digits, in which case the hexadecimal number defines the
|
|
code point to match. By default, as in Perl, a hexadecimal number is
|
|
always expected after \x, but it may have zero, one, or two digits (so,
|
|
for example, \xz matches a binary zero character followed by z).
|
|
|
|
PCRE2_ALT_CIRCUMFLEX
|
|
|
|
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
|
|
metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
|
|
is set), and also after any internal newline. However, it does not
|
|
match after a newline at the end of the subject, for compatibility with
|
|
Perl. If you want a multiline circumflex also to match after a termi-
|
|
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
|
|
|
|
PCRE2_ALT_VERBNAMES
|
|
|
|
By default, for compatibility with Perl, the name in any verb sequence
|
|
such as (*MARK:NAME) is any sequence of characters that does not
|
|
include a closing parenthesis. The name is not processed in any way,
|
|
and it is not possible to include a closing parenthesis in the name.
|
|
However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
|
|
processing is applied to verb names and only an unescaped closing
|
|
parenthesis terminates the name.
|
|
|
|
PCRE2_AUTO_CALLOUT
|
|
|
|
If this bit is set, pcre2_compile() automatically inserts callout
|
|
items, all with number 255, before each pattern item. For discussion of
|
|
the callout facility, see the pcre2callout documentation.
|
|
|
|
PCRE2_CASELESS
|
|
|
|
If this bit is set, letters in the pattern match both upper and lower
|
|
case letters in the subject. It is equivalent to Perl's /i option, and
|
|
it can be changed within a pattern by a (?i) option setting.
|
|
|
|
PCRE2_DOLLAR_ENDONLY
|
|
|
|
If this bit is set, a dollar metacharacter in the pattern matches only
|
|
at the end of the subject string. Without this option, a dollar also
|
|
matches immediately before a newline at the end of the string (but not
|
|
before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
|
|
if PCRE2_MULTILINE is set. There is no equivalent to this option in
|
|
Perl, and no way to set it within a pattern.
|
|
|
|
PCRE2_DOTALL
|
|
|
|
If this bit is set, a dot metacharacter in the pattern matches any
|
|
character, including one that indicates a newline. However, it only
|
|
ever matches one character, even if newlines are coded as CRLF. Without
|
|
this option, a dot does not match when the current position in the sub-
|
|
ject is at a newline. This option is equivalent to Perl's /s option,
|
|
and it can be changed within a pattern by a (?s) option setting. A neg-
|
|
ative class such as [^a] always matches newline characters, independent
|
|
of the setting of this option.
|
|
|
|
PCRE2_DUPNAMES
|
|
|
|
If this bit is set, names used to identify capturing subpatterns need
|
|
not be unique. This can be helpful for certain types of pattern when it
|
|
is known that only one instance of the named subpattern can ever be
|
|
matched. There are more details of named subpatterns below; see also
|
|
the pcre2pattern documentation.
|
|
|
|
PCRE2_EXTENDED
|
|
|
|
If this bit is set, most white space characters in the pattern are
|
|
totally ignored except when escaped or inside a character class. How-
|
|
ever, white space is not allowed within sequences such as (?> that
|
|
introduce various parenthesized subpatterns, nor within numerical quan-
|
|
tifiers such as {1,3}. Ignorable white space is permitted between an
|
|
item and a following quantifier and between a quantifier and a follow-
|
|
ing + that indicates possessiveness.
|
|
|
|
PCRE2_EXTENDED also causes characters between an unescaped # outside a
|
|
character class and the next newline, inclusive, to be ignored, which
|
|
makes it possible to include comments inside complicated patterns. Note
|
|
that the end of this type of comment is a literal newline sequence in
|
|
the pattern; escape sequences that happen to represent a newline do not
|
|
count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
|
|
changed within a pattern by a (?x) option setting.
|
|
|
|
Which characters are interpreted as newlines can be specified by a set-
|
|
ting in the compile context that is passed to pcre2_compile() or by a
|
|
special sequence at the start of the pattern, as described in the sec-
|
|
tion entitled "Newline conventions" in the pcre2pattern documentation.
|
|
A default is defined when PCRE2 is built.
|
|
|
|
PCRE2_FIRSTLINE
|
|
|
|
If this option is set, an unanchored pattern is required to match
|
|
before or at the first newline in the subject string, though the
|
|
matched text may continue over the newline.
|
|
|
|
PCRE2_MATCH_UNSET_BACKREF
|
|
|
|
If this option is set, a back reference to an unset subpattern group
|
|
matches an empty string (by default this causes the current matching
|
|
alternative to fail). A pattern such as (\1)(a) succeeds when this
|
|
option is set (assuming it can find an "a" in the subject), whereas it
|
|
fails by default, for Perl compatibility. Setting this option makes
|
|
PCRE2 behave more like ECMAScript (aka JavaScript).
|
|
|
|
PCRE2_MULTILINE
|
|
|
|
By default, for the purposes of matching "start of line" and "end of
|
|
line", PCRE2 treats the subject string as consisting of a single line
|
|
of characters, even if it actually contains newlines. The "start of
|
|
line" metacharacter (^) matches only at the start of the string, and
|
|
the "end of line" metacharacter ($) matches only at the end of the
|
|
string, or before a terminating newline (except when PCRE2_DOL-
|
|
LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
|
|
the "any character" metacharacter (.) does not match at a newline. This
|
|
behaviour (for ^, $, and dot) is the same as in Perl.
|
|
|
|
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
|
|
constructs match immediately following or immediately before internal
|
|
newlines in the subject string, respectively, as well as at the very
|
|
start and end. This is equivalent to Perl's /m option, and it can be
|
|
changed within a pattern by a (?m) option setting. Note that the "start
|
|
of line" metacharacter does not match after a newline at the end of the
|
|
subject, for compatibility with Perl. However, you can change this by
|
|
setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
|
|
subject string, or no occurrences of ^ or $ in a pattern, setting
|
|
PCRE2_MULTILINE has no effect.
|
|
|
|
PCRE2_NEVER_BACKSLASH_C
|
|
|
|
This option locks out the use of \C in the pattern that is being com-
|
|
piled. This escape can cause unpredictable behaviour in UTF-8 or
|
|
UTF-16 modes, because it may leave the current matching point in the
|
|
middle of a multi-code-unit character. This option may be useful in
|
|
applications that process patterns from external sources.
|
|
|
|
PCRE2_NEVER_UCP
|
|
|
|
This option locks out the use of Unicode properties for handling \B,
|
|
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
|
|
described for the PCRE2_UCP option below. In particular, it prevents
|
|
the creator of the pattern from enabling this facility by starting the
|
|
pattern with (*UCP). This option may be useful in applications that
|
|
process patterns from external sources. The option combination PCRE2_UCP
and PCRE2_NEVER_UCP causes an error.
|
|
|
|
PCRE2_NEVER_UTF
|
|
|
|
This option locks out interpretation of the pattern as UTF-8, UTF-16,
|
|
or UTF-32, depending on which library is in use. In particular, it pre-
|
|
vents the creator of the pattern from switching to UTF interpretation
|
|
by starting the pattern with (*UTF). This option may be useful in
|
|
applications that process patterns from external sources. The combina-
|
|
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
|
|
|
PCRE2_NO_AUTO_CAPTURE
|
|
|
|
If this option is set, it disables the use of numbered capturing paren-
|
|
theses in the pattern. Any opening parenthesis that is not followed by
|
|
? behaves as if it were followed by ?: but named parentheses can still
|
|
be used for capturing (and they acquire numbers in the usual way).
|
|
There is no equivalent of this option in Perl.
|
|
|
|
PCRE2_NO_AUTO_POSSESS
|
|
|
|
If this option is set, it disables "auto-possessification", which is an
|
|
optimization that, for example, turns a+b into a++b in order to avoid
|
|
backtracks into a+ that can never be successful. However, if callouts
|
|
are in use, auto-possessification means that some callouts are never
|
|
taken. You can set this option if you want the matching functions to do
|
|
a full unoptimized search and run all the callouts, but it is mainly
|
|
provided for testing purposes.
|
|
|
|
PCRE2_NO_DOTSTAR_ANCHOR
|
|
|
|
If this option is set, it disables an optimization that is applied when
|
|
.* is the first significant item in a top-level branch of a pattern,
|
|
and all the other branches also start with .* or with \A or \G or ^.
|
|
The optimization is automatically disabled for .* if it is inside an
|
|
atomic group or a capturing group that is the subject of a back refer-
|
|
ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
|
|
mization is not disabled, such a pattern is automatically anchored if
|
|
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
|
|
for any ^ items. Otherwise, the fact that any match must start either
|
|
at the start of the subject or following a newline is remembered. Like
|
|
other optimizations, this can cause callouts to be skipped.
|
|
|
|
PCRE2_NO_START_OPTIMIZE
|
|
|
|
This is an option whose main effect is at matching time. It does not
|
|
change what pcre2_compile() generates, but it does affect the output of
|
|
the JIT compiler.
|
|
|
|
There are a number of optimizations that may occur at the start of a
|
|
match, in order to speed up the process. For example, if it is known
|
|
that an unanchored match must start with a specific character, the
|
|
matching code searches the subject for that character, and fails imme-
|
|
diately if it cannot find it, without actually running the main match-
|
|
ing function. This means that a special item such as (*COMMIT) at the
|
|
start of a pattern is not considered until after a suitable starting
|
|
point for the match has been found. Also, when callouts or (*MARK)
|
|
items are in use, these "start-up" optimizations can cause them to be
|
|
skipped if the pattern is never actually used. The start-up optimiza-
|
|
tions are in effect a pre-scan of the subject that takes place before
|
|
the pattern is run.
|
|
|
|
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
|
|
possibly causing performance to suffer, but ensuring that in cases
|
|
where the result is "no match", the callouts do occur, and that items
|
|
such as (*COMMIT) and (*MARK) are considered at every possible starting
|
|
position in the subject string.
|
|
|
|
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
|
|
operation. Consider the pattern
|
|
|
|
(*COMMIT)ABC
|
|
|
|
When this is compiled, PCRE2 records the fact that a match must start
|
|
with the character "A". Suppose the subject string is "DEFABC". The
|
|
start-up optimization scans along the subject, finds "A" and runs the
|
|
first match attempt from there. The (*COMMIT) item means that the pat-
|
|
tern must match the current starting position, which in this case, it
|
|
does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
|
|
set, the initial scan along the subject string does not happen. The
|
|
first match attempt is run starting from "D" and when this fails,
|
|
(*COMMIT) prevents any further matches being tried, so the overall
|
|
result is "no match". There are also other start-up optimizations. For
|
|
example, a minimum length for the subject may be recorded. Consider the
|
|
pattern
|
|
|
|
(*MARK:A)(X|Y)
|
|
|
|
The minimum length for a match is one character. If the subject is
|
|
"ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
|
|
to match an empty string at the end of the subject does not take place,
|
|
because PCRE2 knows that the subject is now too short, and so the
|
|
(*MARK) is never encountered. In this case, the optimization does not
|
|
affect the overall match result, which is still "no match", but it does
|
|
affect the auxiliary information that is returned.
|
|
|
|
PCRE2_NO_UTF_CHECK
|
|
|
|
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
|
|
automatically checked. There are discussions about the validity of
|
|
UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
|
|
document. If an invalid UTF sequence is found, pcre2_compile() returns
|
|
a negative error code.
|
|
|
|
If you know that your pattern is valid, and you want to skip this check
|
|
for performance reasons, you can set the PCRE2_NO_UTF_CHECK option.
|
|
When it is set, the effect of passing an invalid UTF string as a pat-
|
|
tern is undefined. It may cause your program to crash or loop. Note
|
|
that this option can also be passed to pcre2_match() and
|
|
pcre2_dfa_match(), to suppress validity checking of the subject string.
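An application that sets PCRE2_NO_UTF_CHECK takes on the job of making
sure its strings are valid. The following is a minimal sketch of a UTF-8
well-formedness test of the kind a caller might run once on its own data.
The function utf8_is_valid() is a hypothetical helper, not a PCRE2
function, and it is not claimed to mirror PCRE2's internal checker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the first len bytes of s are well-formed UTF-8, else 0.
   Rejects truncated sequences, overlong forms, surrogates, and code
   points above 0x10FFFF. */
int utf8_is_valid(const uint8_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t extra;        /* number of continuation bytes expected */
        uint32_t cp, min;    /* decoded value and smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation or invalid lead byte */

        if (len - i <= extra) return 0;          /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong form */
        if (cp > 0x10FFFF) return 0;                 /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        i += extra + 1;
    }
    return 1;
}
```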
|
|
|
|
PCRE2_UCP
|
|
|
|
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
|
|
\w, and some of the POSIX character classes. By default, only ASCII
|
|
characters are recognized, but if PCRE2_UCP is set, Unicode properties
|
|
are used instead to classify characters. More details are given in the
|
|
section on generic character types in the pcre2pattern page. If you set
|
|
PCRE2_UCP, matching one of the items it affects takes much longer. The
|
|
option is available only if PCRE2 has been compiled with Unicode sup-
|
|
port.
|
|
|
|
PCRE2_UNGREEDY
|
|
|
|
This option inverts the "greediness" of the quantifiers so that they
|
|
are not greedy by default, but become greedy if followed by "?". It is
|
|
not compatible with Perl. It can also be set by a (?U) option setting
|
|
within the pattern.
|
|
|
|
PCRE2_UTF
|
|
|
|
This option causes PCRE2 to regard both the pattern and the subject
|
|
strings that are subsequently processed as strings of UTF characters
|
|
instead of single-code-unit strings. It is available when PCRE2 is
|
|
built to include Unicode support (which is the default). If Unicode
|
|
support is not available, the use of this option provokes an error.
|
|
Details of how this option changes the behaviour of PCRE2 are given in
|
|
the pcre2unicode page.
|
|
|
|
|
|
COMPILATION ERROR CODES
|
|
|
|
There are over 80 positive error codes that pcre2_compile() may return
|
|
if it finds an error in the pattern. There are also some negative error
|
|
codes that are used for invalid UTF strings. These are the same as
|
|
given by pcre2_match() and pcre2_dfa_match(), and are described in the
|
|
pcre2unicode page. The pcre2_get_error_message() function can be called
|
|
to obtain a textual error message from any error code.
|
|
|
|
|
|
JUST-IN-TIME (JIT) COMPILATION
|
|
|
|
int pcre2_jit_compile(pcre2_code *code, uint32_t options);
|
|
|
|
int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
|
|
|
pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
|
|
PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
|
|
pcre2_jit_callback callback_function, void *callback_data);
|
|
|
|
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
|
|
|
|
These functions provide support for JIT compilation, which, if the
|
|
just-in-time compiler is available, further processes a compiled pat-
|
|
tern into machine code that executes much faster than the pcre2_match()
|
|
interpretive matching function. Full details are given in the pcre2jit
|
|
documentation.
|
|
|
|
JIT compilation is a heavyweight optimization. It can take some time
|
|
for patterns to be analyzed, and for one-off matches and simple pat-
|
|
terns the benefit of faster execution might be offset by a much slower
|
|
compilation time. Most, but not all patterns can be optimized by the
|
|
JIT compiler.
|
|
|
|
|
|
LOCALE SUPPORT
|
|
|
|
PCRE2 handles caseless matching, and determines whether characters are
|
|
letters, digits, or whatever, by reference to a set of tables, indexed
|
|
by character code point. This applies only to characters whose code
|
|
points are less than 256. By default, higher-valued code points never
|
|
match escapes such as \w or \d. However, if PCRE2 is built with UTF
|
|
support, all characters can be tested with \p and \P, or, alterna-
|
|
tively, the PCRE2_UCP option can be set when a pattern is compiled;
|
|
this causes \w and friends to use Unicode property support instead of
|
|
the built-in tables.
|
|
|
|
The use of locales with Unicode is discouraged. If you are handling
|
|
characters with code points greater than 127, you should either use
|
|
Unicode support, or use locales, but not try to mix the two.
|
|
|
|
PCRE2 contains an internal set of character tables that are used by
|
|
default. These are sufficient for many applications. Normally, the
|
|
internal tables recognize only ASCII characters. However, when PCRE2 is
|
|
built, it is possible to cause the internal tables to be rebuilt in the
|
|
default "C" locale of the local system, which may cause them to be dif-
|
|
ferent.
|
|
|
|
The internal tables can be overridden by tables supplied by the appli-
|
|
cation that calls PCRE2. These may be created in a different locale
|
|
from the default. As more and more applications change to using Uni-
|
|
code, the need for this locale support is expected to die away.
|
|
|
|
External tables are built by calling the pcre2_maketables() function,
|
|
in the relevant locale. The result can be passed to pcre2_compile() as
|
|
often as necessary, by creating a compile context and calling
|
|
pcre2_set_character_tables() to set the tables pointer therein. For
|
|
example, to build and use tables that are appropriate for the French
|
|
locale (where accented characters with values greater than 127 are
|
|
treated as letters), the following code could be used:
|
|
|
|
setlocale(LC_CTYPE, "fr_FR");
|
|
tables = pcre2_maketables(NULL);
|
|
ccontext = pcre2_compile_context_create(NULL);
|
|
pcre2_set_character_tables(ccontext, tables);
|
|
re = pcre2_compile(..., ccontext);
|
|
|
|
The locale name "fr_FR" is used on Linux and other Unix-like systems;
|
|
if you are using Windows, the name for the French locale is "french".
|
|
It is the caller's responsibility to ensure that the memory containing
|
|
the tables remains available for as long as it is needed.
|
|
|
|
The pointer that is passed (via the compile context) to pcre2_compile()
|
|
is saved with the compiled pattern, and the same tables are used by
|
|
pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
compilation and matching both happen in the same locale, but different
|
|
patterns can be processed in different locales.
|
|
|
|
|
|
INFORMATION ABOUT A COMPILED PATTERN
|
|
|
|
int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
|
|
|
|
The pcre2_pattern_info() function returns general information about a
|
|
compiled pattern. For information about callouts, see the next section.
|
|
The first argument for pcre2_pattern_info() is a pointer to the com-
|
|
piled pattern. The second argument specifies which piece of information
|
|
is required, and the third argument is a pointer to a variable to
|
|
receive the data. If the third argument is NULL, the first argument is
|
|
ignored, and the function returns the size in bytes of the variable
|
|
that is required for the information requested. Otherwise, the yield of
|
|
the function is zero for success, or one of the following negative num-
|
|
bers:
|
|
|
|
PCRE2_ERROR_NULL the argument code was NULL
|
|
PCRE2_ERROR_BADMAGIC the "magic number" was not found
|
|
PCRE2_ERROR_BADOPTION the value of what was invalid
|
|
PCRE2_ERROR_UNSET the requested field is not set
|
|
|
|
The "magic number" is placed at the start of each compiled pattern as
|
|
a simple check against passing an arbitrary memory pointer. Here is a
|
|
typical call of pcre2_pattern_info(), to obtain the length of the com-
|
|
piled pattern:
|
|
|
|
int rc;
|
|
size_t length;
|
|
rc = pcre2_pattern_info(
|
|
re, /* result of pcre2_compile() */
|
|
PCRE2_INFO_SIZE, /* what is required */
|
|
&length); /* where to put the data */
|
|
|
|
The possible values for the second argument are defined in pcre2.h, and
|
|
are as follows:
|
|
|
|
PCRE2_INFO_ALLOPTIONS
|
|
PCRE2_INFO_ARGOPTIONS
|
|
|
|
Return a copy of the pattern's options. The third argument should point
|
|
to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
|
|
options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
|
|
TIONS returns the compile options as modified by any top-level option
|
|
settings at the start of the pattern itself. In other words, they are
|
|
the options that will be in force when matching starts. For example, if
|
|
the pattern /(?im)abc(?-i)d/ is compiled with the PCRE2_EXTENDED
|
|
option, the result is PCRE2_CASELESS, PCRE2_MULTILINE, and
|
|
PCRE2_EXTENDED.
|
|
|
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
|
PCRE2 if the first significant item in every top-level branch is one of
|
|
the following:
|
|
|
|
^ unless PCRE2_MULTILINE is set
|
|
\A always
|
|
\G always
|
|
.* sometimes - see below
|
|
|
|
When .* is the first significant item, anchoring is possible only when
|
|
all the following are true:
|
|
|
|
.* is not in an atomic group
|
|
.* is not in a capturing group that is the subject
|
|
of a back reference
|
|
PCRE2_DOTALL is in force for .*
|
|
Neither (*PRUNE) nor (*SKIP) appears in the pattern.
|
|
PCRE2_NO_DOTSTAR_ANCHOR is not set.
|
|
|
|
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
|
|
the options returned for PCRE2_INFO_ALLOPTIONS.
|
|
|
|
PCRE2_INFO_BACKREFMAX
|
|
|
|
Return the number of the highest back reference in the pattern. The
|
|
third argument should point to a uint32_t variable. Named subpatterns
|
|
acquire numbers as well as names, and these count towards the highest
|
|
back reference. Back references such as \4 or \g{12} match the cap-
|
|
tured characters of the given group, but in addition, the check that a
|
|
capturing group is set in a conditional subpattern such as (?(3)a|b) is
|
|
also a back reference. Zero is returned if there are no back refer-
|
|
ences.
|
|
|
|
PCRE2_INFO_BSR
|
|
|
|
The output is a uint32_t whose value indicates what character sequences
|
|
the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
|
|
\R matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY-
|
|
CRLF means that \R matches only CR, LF, or CRLF.
|
|
|
|
PCRE2_INFO_CAPTURECOUNT
|
|
|
|
Return the number of capturing subpatterns in the pattern. The third
|
|
argument should point to a uint32_t variable.
|
|
|
|
PCRE2_INFO_FIRSTCODETYPE
|
|
|
|
Return information about the first code unit of any matched string, for
|
|
a non-anchored pattern. The third argument should point to a uint32_t
|
|
variable.
|
|
|
|
If there is a fixed first value, for example, the letter "c" from a
|
|
pattern such as (cat|cow|coyote), 1 is returned, and the character
|
|
value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no
|
|
fixed first value, but it is known that a match can occur only at the
|
|
start of the subject or following a newline in the subject, 2 is
|
|
returned. Otherwise, and for anchored patterns, 0 is returned.
|
|
|
|
PCRE2_INFO_FIRSTCODEUNIT
|
|
|
|
Return the value of the first code unit of any matched string in the
|
|
situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
|
|
The third argument should point to a uint32_t variable. In the 8-bit
|
|
library, the value is always less than 256. In the 16-bit library the
|
|
value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
|
|
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
|
|
mode.
|
|
|
|
PCRE2_INFO_FIRSTBITMAP
|
|
|
|
In the absence of a single first code unit for a non-anchored pattern,
|
|
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
|
of values for the first code unit in any match. For example, a pattern
|
|
that starts with [abc] results in a table with three bits set. When
|
|
code unit values greater than 255 are supported, the flag bit for 255
|
|
means "any code unit of value 255 or above". If such a table was con-
|
|
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
|
third argument should point to a const uint8_t * variable.
|
|
|
|
PCRE2_INFO_HASCRORLF
|
|
|
|
Return 1 if the pattern contains any explicit matches for CR or LF
|
|
characters, otherwise 0. The third argument should point to a uint32_t
|
|
variable. An explicit match is either a literal CR or LF character, or
|
|
\r or \n.
|
|
|
|
PCRE2_INFO_JCHANGED
|
|
|
|
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
|
|
otherwise 0. The third argument should point to a uint32_t variable.
|
|
(?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
|
|
tively.
|
|
|
|
PCRE2_INFO_JITSIZE
|
|
|
|
If the compiled pattern was successfully processed by pcre2_jit_com-
|
|
pile(), return the size of the JIT compiled code, otherwise return
|
|
zero. The third argument should point to a size_t variable.
|
|
|
|
PCRE2_INFO_LASTCODETYPE
|
|
|
|
Return 1 if there is a rightmost literal code unit that must exist in
|
|
any matched string, other than at its start. The third argument should
|
|
point to a uint32_t variable. If there is no such value, 0 is
|
|
returned. When 1 is returned, the code unit value itself can be
|
|
retrieved using PCRE2_INFO_LASTCODEUNIT.
|
|
|
|
For anchored patterns, a last literal value is recorded only if it fol-
|
|
lows something of variable length. For example, for the pattern
|
|
/^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
|
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
|
|
|
PCRE2_INFO_LASTCODEUNIT
|
|
|
|
Return the value of the rightmost literal code unit that must exist in
|
|
any matched string, other than at its start, if such a value has been
|
|
recorded. The third argument should point to a uint32_t variable. If
|
|
there is no such value, 0 is returned.
|
|
|
|
PCRE2_INFO_MATCHEMPTY
|
|
|
|
Return 1 if the pattern can match an empty string, otherwise 0. The
|
|
third argument should point to a uint32_t variable.
|
|
|
|
PCRE2_INFO_MATCHLIMIT
|
|
|
|
If the pattern set a match limit by including an item of the form
|
|
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
|
|
argument should point to an unsigned 32-bit integer. If no such value
|
|
has been set, the call to pcre2_pattern_info() returns the error
|
|
PCRE2_ERROR_UNSET.
|
|
|
|
PCRE2_INFO_MAXLOOKBEHIND
|
|
|
|
Return the number of characters (not code units) in the longest lookbe-
|
|
hind assertion in the pattern. The third argument should point to an
|
|
unsigned 32-bit integer. This information is useful when doing multi-
|
|
segment matching using the partial matching facilities. Note that the
|
|
simple assertions \b and \B require a one-character lookbehind. \A also
|
|
registers a one-character lookbehind, though it does not actually
|
|
inspect the previous character. This is to ensure that at least one
|
|
character from the old segment is retained when a new segment is pro-
|
|
cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
|
|
match incorrectly at the start of a new segment.
|
|
|
|
PCRE2_INFO_MINLENGTH
|
|
|
|
If a minimum length for matching subject strings was computed, its
|
|
value is returned. Otherwise the returned value is 0. The value is a
|
|
number of characters, which in UTF mode may be different from the
number of code units. The third argument should point to a uint32_t
|
|
variable. The value is a lower bound to the length of any matching
|
|
string. There may not be any strings of that length that do actually
|
|
match, but every string that does match is at least that long.
|
|
|
|
PCRE2_INFO_NAMECOUNT
|
|
PCRE2_INFO_NAMEENTRYSIZE
|
|
PCRE2_INFO_NAMETABLE
|
|
|
|
PCRE2 supports the use of named as well as numbered capturing parenthe-
|
|
ses. The names are just an additional way of identifying the parenthe-
|
|
ses, which still acquire numbers. Several convenience functions such as
|
|
pcre2_substring_get_byname() are provided for extracting captured sub-
|
|
strings by name. It is also possible to extract the data directly, by
|
|
first converting the name to a number in order to access the correct
|
|
pointers in the output vector (described with pcre2_match() below). To
|
|
do the conversion, you need to use the name-to-number map, which is
|
|
described by these three values.
|
|
|
|
The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
|
|
COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
|
|
the size of each entry in code units; both of these return a uint32_t
|
|
value. The entry size depends on the length of the longest name.
|
|
|
|
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
|
|
This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
|
|
library, the first two bytes of each entry are the number of the cap-
|
|
turing parenthesis, most significant byte first. In the 16-bit library,
|
|
the pointer points to 16-bit code units, the first of which contains
|
|
the parenthesis number. In the 32-bit library, the pointer points to
|
|
32-bit code units, the first of which contains the parenthesis number.
|
|
The rest of the entry is the corresponding name, zero terminated.
|
|
|
|
The names are in alphabetical order. If (?| is used to create multiple
|
|
groups with the same number, as described in the section on duplicate
|
|
subpattern numbers in the pcre2pattern page, the groups may be given
|
|
the same name, but there is only one entry in the table. Different
|
|
names for groups of the same number are not permitted.
|
|
|
|
Duplicate names for subpatterns with different numbers are permitted,
|
|
but only if PCRE2_DUPNAMES is set. They appear in the table in the
|
|
order in which they were found in the pattern. In the absence of (?|
|
|
this is the order of increasing number; when (?| is used this is not
|
|
necessarily the case because later subpatterns may have lower numbers.
|
|
|
|
As a simple example of the name/number table, consider the following
|
|
pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
|
|
is set, so white space - including newlines - is ignored):
|
|
|
|
(?<date> (?<year>(\d\d)?\d\d) -
|
|
(?<month>\d\d) - (?<day>\d\d) )
|
|
|
|
There are four named subpatterns, so the table has four entries, and
|
|
each entry in the table is eight bytes long. The table is as follows,
|
|
with non-printing bytes shown in hexadecimal, and undefined bytes shown
|
|
as ??:
|
|
|
|
00 01 d a t e 00 ??
|
|
00 05 d a y 00 ?? ??
|
|
00 04 m o n t h 00
|
|
00 02 y e a r 00 ??
|
|
|
|
When writing code to extract data from named subpatterns using the
|
|
name-to-number map, remember that the length of the entries is likely
|
|
to be different for each compiled pattern.
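A lookup over the 8-bit form of the table might be sketched as follows.
The function group_number_for_name() is a hypothetical helper written for
illustration. In a real program the three arguments that describe the
table would come from pcre2_pattern_info(). The sketch handles only the
8-bit layout described above, and for duplicate names it returns the
first entry it finds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the group number recorded for name, or 0 if name is not in the
   table. Each entry is entry_size bytes: a two-byte group number, most
   significant byte first, followed by the zero-terminated name. */
uint32_t group_number_for_name(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return ((uint32_t)entry[0] << 8) | entry[1];
    }
    return 0;   /* group numbers start at 1, so 0 means "not found" */
}
```

Because the entry size is passed in rather than fixed, the same function
works for any compiled pattern, whatever the length of its longest name.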
|
|
|
|
PCRE2_INFO_NEWLINE
|
|
|
|
The output is a uint32_t with one of the following values:
|
|
|
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
|
|
|
This specifies the default character sequence that will be recognized
|
|
as meaning "newline" while matching.
|
|
|
|
PCRE2_INFO_RECURSIONLIMIT
|
|
|
|
If the pattern set a recursion limit by including an item of the form
|
|
(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
|
|
argument should point to an unsigned 32-bit integer. If no such value
|
|
has been set, the call to pcre2_pattern_info() returns the error
|
|
PCRE2_ERROR_UNSET.
|
|
|
|
PCRE2_INFO_SIZE
|
|
|
|
Return the size of the compiled pattern in bytes (for all three
|
|
libraries). The third argument should point to a size_t variable. This
|
|
value includes the size of the general data block that precedes the
|
|
code units of the compiled pattern itself. The value that is used when
|
|
pcre2_compile() is getting memory in which to place the compiled pat-
|
|
tern may be slightly larger than the value returned by this option,
|
|
because there are cases where the code that calculates the size has to
|
|
over-estimate. Processing a pattern with the JIT compiler does not
|
|
alter the value returned by this option.
|
|
|
|
|
|
INFORMATION ABOUT A PATTERN'S CALLOUTS
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
A script language that supports the use of string arguments in callouts
|
|
might like to scan all the callouts in a pattern before running the
|
|
match. This can be done by calling pcre2_callout_enumerate(). The first
|
|
argument is a pointer to a compiled pattern, the second points to a
|
|
callback function, and the third is arbitrary user data. The callback
|
|
function is called for every callout in the pattern in the order in
|
|
which they appear. Its first argument is a pointer to a callout enumer-
|
|
ation block, and its second argument is the user_data value that was
|
|
passed to pcre2_callout_enumerate(). The contents of the callout enu-
|
|
meration block are described in the pcre2callout documentation, which
|
|
also gives further details about callouts.
|
|
|
|
|
|
SERIALIZATION AND PRECOMPILING
|
|
|
|
It is possible to save compiled patterns on disc or elsewhere, and
|
|
reload them later, subject to a number of restrictions. The functions
|
|
whose names begin with pcre2_serialize_ are used for this purpose. They
|
|
are described in the pcre2serialize documentation.
|
|
|
|
|
|
THE MATCH DATA BLOCK
|
|
|
|
pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_data *pcre2_match_data_create_from_pattern(
|
|
const pcre2_code *code, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_match_data_free(pcre2_match_data *match_data);
|
|
|
|
Information about a successful or unsuccessful match is placed in a
|
|
match data block, which is an opaque structure that is accessed by
|
|
function calls. In particular, the match data block contains a vector
|
|
of offsets into the subject string that define the matched part of the
|
|
subject and any substrings that were captured. This is known as the
|
|
ovector.
|
|
|
|
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
|
|
you must create a match data block by calling one of the creation func-
|
|
tions above. For pcre2_match_data_create(), the first argument is the
|
|
number of pairs of offsets in the ovector. One pair of offsets is
|
|
required to identify the string that matched the whole pattern, with
|
|
another pair for each captured substring. For example, a value of 4
|
|
creates enough space to record the matched portion of the subject plus
|
|
three captured substrings. A minimum of 1 pair is imposed by
|
|
pcre2_match_data_create(), so it is always possible to return the over-
|
|
all matched string.
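The pair layout can be illustrated with a plain array standing in for the
ovector. This is a sketch under stated assumptions: size_t stands in for
PCRE2_SIZE, OFFSET_UNSET stands in for the PCRE2_UNSET value that marks a
group that did not take part in the match, and group_span() is a
hypothetical helper. In a real program the vector would be obtained from
the match data block, for example with pcre2_get_ovector_pointer().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET (an assumption for this sketch). */
#define OFFSET_UNSET (~(size_t)0)

/* Look up capture group `group` in an ovector that holds `pair_count`
   pairs. Pair 0 describes the whole match, pair n describes group n.
   On success store the start offset and length and return 1. Return 0
   if the group lies outside the ovector, is unset, or has an end offset
   before its start offset. */
int group_span(const size_t *ovector, size_t pair_count, size_t group,
               size_t *start, size_t *length)
{
    assert(ovector != NULL && start != NULL && length != NULL);
    if (group >= pair_count)
        return 0;
    size_t s = ovector[2 * group];
    size_t e = ovector[2 * group + 1];
    if (s == OFFSET_UNSET || e == OFFSET_UNSET || e < s)
        return 0;
    *start = s;
    *length = e - s;
    return 1;
}
```

With four pairs, groups 0 to 3 can be reported. A request for group 4
fails here because the vector has no room for it.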
|
|
|
|
The second argument of pcre2_match_data_create() is a pointer to a gen-
|
|
eral context, which can specify custom memory management for obtaining
|
|
the memory for the match data block. If you are not using custom memory
|
|
management, pass NULL, which causes malloc() to be used.
|
|
|
|
For pcre2_match_data_create_from_pattern(), the first argument is a
|
|
pointer to a compiled pattern. The ovector is created to be exactly the
|
|
right size to hold all the substrings a pattern might capture. The sec-
|
|
ond argument is again a pointer to a general context, but in this case
|
|
if NULL is passed, the memory is obtained using the same allocator that
|
|
was used for the compiled pattern (custom or default).
|
|
|
|
A match data block can be used many times, with the same or different
|
|
compiled patterns. You can extract information from a match data block
|
|
after a match operation has finished, using functions that are
|
|
described in the sections on matched strings and other match data
|
|
below.
|
|
|
|
When a call of pcre2_match() fails, valid data is available in the
|
|
match block only when the error is PCRE2_ERROR_NOMATCH,
|
|
PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
|
|
string. Exactly what is available depends on the error, and is detailed
|
|
below.
|
|
|
|
When one of the matching functions is called, pointers to the compiled
|
|
pattern and the subject string are set in the match data block so that
|
|
they can be referenced by the extraction functions. After running a
|
|
match, you must not free a compiled pattern or a subject string until
|
|
after all operations on the match data block (for that match) have
|
|
taken place.
|
|
|
|
When a match data block itself is no longer needed, it should be freed
|
|
by calling pcre2_match_data_free().
|
|
|
|
|
|
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
|
|
|
int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
The function pcre2_match() is called to match a subject string against
|
|
a compiled pattern, which is passed in the code argument. You can call
|
|
pcre2_match() with the same code argument as many times as you like, in
|
|
order to find multiple matches in the subject string or to match dif-
|
|
ferent subject strings with the same pattern.
|
|
|
|
This function is the main matching facility of the library, and it
|
|
operates in a Perl-like manner. For specialist use there is also an
|
|
alternative matching function, which is described below in the section
|
|
about the pcre2_dfa_match() function.
|
|
|
|
Here is an example of a simple call to pcre2_match():
|
|
|
|
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
|
|
int rc = pcre2_match(
|
|
re, /* result of pcre2_compile() */
|
|
"some string", /* the subject string */
|
|
11, /* the length of the subject string */
|
|
0, /* start at offset 0 in the subject */
|
|
0, /* default options */
|
|
md,             /* the match data block */
|
|
NULL); /* a match context; NULL means use defaults */
|
|
|
|
If the subject string is zero-terminated, the length can be given as
|
|
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
|
|
common matching parameters are to be changed. For details, see the sec-
|
|
tion on the match context above.
|
|
|
|
The string to be matched by pcre2_match()
|
|
|
|
The subject string is passed to pcre2_match() as a pointer in subject,
|
|
a length in length, and a starting offset in startoffset. The length
|
|
and offset are in code units, not characters. That is, they are in
|
|
bytes for the 8-bit library, 16-bit code units for the 16-bit library,
|
|
and 32-bit code units for the 32-bit library, whether or not UTF pro-
|
|
cessing is enabled.
|
|
|
|
If startoffset is greater than the length of the subject, pcre2_match()
|
|
returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
|
|
search for a match starts at the beginning of the subject, and this is
|
|
by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
|
|
set must point to the start of a character, or to the end of the sub-
|
|
ject (in UTF-32 mode, one code unit equals one character, so all off-
|
|
sets are valid). Like the pattern string, the subject may contain
|
|
binary zeroes.
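For the 8-bit library in UTF-8 mode, the rule for a valid starting offset
can be sketched as below. The function utf8_offset_is_valid() is a
hypothetical helper, not a PCRE2 function. It assumes the subject itself
is valid UTF-8, so that a byte is the start of a character exactly when
it is not a continuation byte (10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if offset is acceptable as a starting offset for a UTF-8
   subject of the given length (in bytes), else 0. */
int utf8_offset_is_valid(const uint8_t *subject, size_t length,
                         size_t offset)
{
    assert(subject != NULL);
    if (offset > length)
        return 0;            /* beyond the subject: a bad offset */
    if (offset == length)
        return 1;            /* the end of the subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```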
|
|
|
|
A non-zero starting offset is useful when searching for another match
|
|
in the same subject by calling pcre2_match() again after a previous
|
|
success. Setting startoffset differs from passing over a shortened
|
|
string and setting PCRE2_NOTBOL in the case of a pattern that begins
|
|
with any kind of lookbehind. For example, consider the pattern
|
|
|
|
\Biss\B
|
|
|
|
which finds occurrences of "iss" in the middle of words. (\B matches
|
|
only if the current position in the subject is not a word boundary.)
|
|
When applied to the string "Mississippi" the first call to pcre2_match()
finds the first occurrence. If pcre2_match() is called again with just
the remainder of the subject, namely "issippi", it does not match,
|
|
because \B is always false at the start of the subject, which is deemed
|
|
to be a word boundary. However, if pcre2_match() is passed the entire
|
|
string again, but with startoffset set to 4, it finds the second occur-
|
|
rence of "iss" because it is able to look behind the starting point to
|
|
discover that it is preceded by a letter.
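The difference comes down to whether the character before the starting
point is visible. A small model of the word boundary test, restricted to
ASCII and using the hypothetical function at_word_boundary(), makes this
concrete. It is not PCRE2 code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only notion of a word character: letter, digit, or underscore. */
static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Return 1 if position pos in s (of length len) is a word boundary, that
   is, if exactly one of the characters on either side of pos is a word
   character. Positions outside the string count as non-word. \b succeeds
   where this returns 1, and \B succeeds where it returns 0. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    int before = pos > 0 && is_word((unsigned char)s[pos - 1]);
    int after = pos < len && is_word((unsigned char)s[pos]);
    return before != after;
}
```

At position 0 of a shortened subject there is nothing before the first
"i", so the position is a boundary and \B fails. At offset 4 of the full
string the preceding "s" is visible, the position is not a boundary, and
\B succeeds.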
|
|
|
|
Finding all the matches in a subject is tricky when the pattern can
|
|
match an empty string. It is possible to emulate Perl's /g behaviour by
|
|
first trying the match again at the same offset, with the
|
|
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
|
|
fails, advancing the starting offset and trying an ordinary match
|
|
again. There is some code that demonstrates how to do this in the
|
|
pcre2demo sample program. In the most general case, you have to check
|
|
to see if the newline convention recognizes CRLF as a newline, and if
|
|
so, and the current character is CR followed by LF, advance the start-
|
|
ing offset by two characters instead of one.
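The "advance the starting offset" step can be sketched for the 8-bit
library as follows. The function advance_after_empty_match() is a
hypothetical helper. The caller is assumed to have worked out already
whether the newline convention in force recognizes CRLF and whether the
pattern was compiled in UTF mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given that an anchored, non-empty retry at offset has failed, return
   the offset at which the next ordinary match should start. */
size_t advance_after_empty_match(const uint8_t *subject, size_t length,
                                 size_t offset, int crlf_is_newline,
                                 int utf8)
{
    assert(subject != NULL && offset < length);

    /* Step over CR LF as a single unit when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    /* Otherwise advance by one character. In UTF-8 mode this means
       skipping any continuation bytes (10xxxxxx) that follow. */
    offset++;
    if (utf8)
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    return offset;
}
```

When the returned offset equals the subject length, one more match
attempt at the very end of the subject may still be needed before the
loop stops.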
|
|
|
|
If a non-zero starting offset is passed when the pattern is anchored,
|
|
one attempt to match at the given offset is made. This can only succeed
|
|
if the pattern does not require the match to be at the start of the
|
|
subject.
|
|
|
|
Option bits for pcre2_match()
|
|
|
|
The unused bits of the options argument for pcre2_match() must be zero.
|
|
The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
|
|
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
|
|
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
|
|
action is described below.
|
|
|
|
Setting PCRE2_ANCHORED at match time is not supported by the just-in-
|
|
time (JIT) compiler. If it is set, JIT matching is disabled and the
|
|
normal interpretive code in pcre2_match() is run. The remaining options
|
|
are supported for JIT matching.
|
|
|
|
PCRE2_ANCHORED
|
|
|
|
The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
|
|
matching position. If a pattern was compiled with PCRE2_ANCHORED, or
|
|
turned out to be anchored by virtue of its contents, it cannot be made
|
|
unanchored at matching time. Note that setting the option at match time
|
|
disables JIT matching.
|
|
|
|
PCRE2_NOTBOL
|
|
|
|
This option specifies that the first character of the subject string is not
|
|
the beginning of a line, so the circumflex metacharacter should not
|
|
match before it. Setting this without having set PCRE2_MULTILINE at
|
|
compile time causes circumflex never to match. This option affects only
|
|
the behaviour of the circumflex metacharacter. It does not affect \A.
|
|
|
|
PCRE2_NOTEOL
|
|
|
|
This option specifies that the end of the subject string is not the end
|
|
of a line, so the dollar metacharacter should not match it nor (except
|
|
in multiline mode) a newline immediately before it. Setting this with-
|
|
out having set PCRE2_MULTILINE at compile time causes dollar never to
|
|
match. This option affects only the behaviour of the dollar metacharac-
|
|
ter. It does not affect \Z or \z.
|
|
|
|
PCRE2_NOTEMPTY
|
|
|
|
An empty string is not considered to be a valid match if this option is
|
|
set. If there are alternatives in the pattern, they are tried. If all
|
|
the alternatives match the empty string, the entire match fails. For
|
|
example, if the pattern
|
|
|
|
a?b?
|
|
|
|
is applied to a string not beginning with "a" or "b", it matches an
|
|
empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
|
|
match is not valid, so pcre2_match() searches further into the string
|
|
for occurrences of "a" or "b".
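The effect on this particular pattern can be shown with a hand-written
matcher. This is only a model: find_a_opt_b_opt() is a hypothetical
function that hard-codes the pattern a?b? and a flag that plays the role
of PCRE2_NOTEMPTY. It does not use the PCRE2 library.

```c
#include <assert.h>
#include <stddef.h>

/* Search s (of length len) for the fixed pattern a?b?. If notempty is
   non-zero, empty matches are rejected and the search moves on to the
   next starting position. Return 1 and store the matched span in
   *start and *end on success; return 0 if there is no match. */
int find_a_opt_b_opt(const char *s, size_t len, int notempty,
                     size_t *start, size_t *end)
{
    assert(s != NULL && start != NULL && end != NULL);
    for (size_t i = 0; i <= len; i++) {
        size_t j = i;
        if (j < len && s[j] == 'a') j++;     /* a? */
        if (j < len && s[j] == 'b') j++;     /* b? */
        if (j == i && notempty)
            continue;                        /* empty match not valid */
        *start = i;
        *end = j;
        return 1;
    }
    return 0;
}
```

For the subject "xyab" the model reports an empty match at offset 0 by
default, and the span from offset 2 to offset 4 when empty matches are
disallowed.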
|
|
|
|
PCRE2_NOTEMPTY_ATSTART
|
|
|
|
This is like PCRE2_NOTEMPTY, except that it locks out an empty string
|
|
match only at the first matching position, that is, at the start of the
|
|
subject plus the starting offset. An empty string match later in the
|
|
subject is permitted. If the pattern is anchored, such a match can
|
|
occur only if the pattern contains \K.
|
|
|
|
PCRE2_NO_UTF_CHECK
|
|
|
|
When PCRE2_UTF is set at compile time, the validity of the subject as a
|
|
UTF string is checked by default when pcre2_match() is subsequently
|
|
called. If a non-zero starting offset is given, the check is applied
|
|
only to that part of the subject that could be inspected during match-
|
|
ing, and there is a check that the starting offset points to the first
|
|
code unit of a character or to the end of the subject. If there are no
|
|
lookbehind assertions in the pattern, the check starts at the starting
|
|
offset. Otherwise, it starts at the length of the longest lookbehind
|
|
before the starting offset, or at the start of the subject if there are
|
|
not that many characters before the starting offset. Note that the
|
|
sequences \b and \B are one-character lookbehinds.
|
|
|
|
The check is carried out before any other processing takes place, and a
|
|
negative error code is returned if the check fails. There are several
|
|
UTF error codes for each code unit width, corresponding to different
|
|
problems with the code unit sequence. There are discussions about the
|
|
validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
|
|
pcre2unicode page.
|
|
|
|
If you know that your subject is valid, and you want to skip these
|
|
checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK
|
|
option when calling pcre2_match(). You might want to do this for the
|
|
second and subsequent calls to pcre2_match() if you are making repeated
|
|
calls to find all the matches in a single subject string.
|
|
|
|
NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
|
|
string as a subject, or an invalid value of startoffset, is undefined.
|
|
Your program may crash or loop indefinitely.
|
|
|
|
PCRE2_PARTIAL_HARD
|
|
PCRE2_PARTIAL_SOFT
|
|
|
|
These options turn on the partial matching feature. A partial match
|
|
occurs if the end of the subject string is reached successfully, but
|
|
there are not enough subject characters to complete the match. If this
|
|
happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
|
|
matching continues by testing any remaining alternatives. Only if no
|
|
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
|
|
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
|
|
the caller is prepared to handle a partial match, but only if no com-
|
|
plete match can be found.
|
|
|
|
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
|
|
case, if a partial match is found, pcre2_match() immediately returns
|
|
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
|
|
other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
|
|
ered to be more important than an alternative complete match.
|
|
|
|
There is a more detailed discussion of partial and multi-segment match-
|
|
ing, with examples, in the pcre2partial documentation.
|
|
|
|
|
|
NEWLINE HANDLING WHEN MATCHING
|
|
|
|
When PCRE2 is built, a default newline convention is set; this is usu-
|
|
ally the standard convention for the operating system. The default can
|
|
be overridden in a compile context. During matching, the newline
|
|
choice affects the behaviour of the dot, circumflex, and dollar
|
|
metacharacters. It may also alter the way the match starting position
|
|
is advanced after a match failure for an unanchored pattern.
|
|
|
|
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
|
set as the newline convention, and a match attempt for an unanchored
|
|
pattern fails when the current starting position is at a CRLF sequence,
|
|
and the pattern contains no explicit matches for CR or LF characters,
|
|
the match position is advanced by two characters instead of one, in
|
|
other words, to after the CRLF.
|
|
|
|
The above rule is a compromise that makes the most common cases work as
|
|
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
|
|
option is not set), it does not match the string "\r\nA" because, after
|
|
failing at the start, it skips both the CR and the LF before retrying.
|
|
However, the pattern [\r\n]A does match that string, because it con-
|
|
tains an explicit CR or LF reference, and so advances only by one char-
|
|
acter after the first failure.
|
|
|
|
An explicit match for CR or LF is either a literal appearance of one of
|
|
those characters in the pattern, or one of the \r or \n escape
|
|
sequences. Implicit matches such as [^X] do not count, nor does \s,
|
|
even though it includes CR and LF in the characters that it matches.
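The advance rule described above can be modelled in a few lines. This is only an illustrative sketch, not PCRE2's internal code: the function next_start() and its two boolean parameters are hypothetical, and it assumes 8-bit, non-UTF subjects where one character is one code unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (not part of PCRE2) of where the next match attempt
   starts after a failure at position pos in an unanchored search.
   crlf_is_newline:           the convention is CRLF, ANYCRLF, or ANY
   pattern_has_explicit_crlf: the pattern contains a literal CR or LF,
                              or a \r or \n escape */
size_t next_start(const char *subject, size_t length, size_t pos,
                  bool crlf_is_newline, bool pattern_has_explicit_crlf)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && !pattern_has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip the whole CRLF sequence */
    return pos + 1;       /* ordinary advance by one character */
}
```

For the subject "\r\nA", a pattern such as .+A has no explicit CR or LF, so the retry position after a failure at offset 0 is 2. For [\r\n]A the retry position is 1, which is why that pattern can match starting at the LF.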
|
|
|
|
Notwithstanding the above, anomalous effects may still occur when CRLF
|
|
is a valid newline sequence and explicit \r or \n escapes appear in the
|
|
pattern.
|
|
|
|
|
|
HOW pcre2_match() RETURNS A STRING AND CAPTURED SUBSTRINGS
|
|
|
|
uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
|
|
|
|
In general, a pattern matches a certain portion of the subject, and in
|
|
addition, further substrings from the subject may be picked out by
|
|
parenthesized parts of the pattern. Following the usage in Jeffrey
|
|
Friedl's book, this is called "capturing" in what follows, and the
|
|
phrase "capturing subpattern" or "capturing group" is used for a frag-
|
|
ment of a pattern that picks out a substring. PCRE2 supports several
|
|
other kinds of parenthesized subpattern that do not cause substrings to
|
|
be captured. The pcre2_pattern_info() function can be used to find out
|
|
how many capturing subpatterns there are in a compiled pattern.
|
|
|
|
A successful match returns the overall matched string and any captured
|
|
substrings to the caller via a vector of PCRE2_SIZE values. This is
|
|
called the ovector, and is contained within the match data block. You
|
|
can obtain direct access to the ovector by calling pcre2_get_ovec-
|
|
tor_pointer() to find its address, and pcre2_get_ovector_count() to
|
|
find the number of pairs of values it contains. Alternatively, you can
|
|
use the auxiliary functions for accessing captured substrings by number
|
|
or by name (see below).
|
|
|
|
Within the ovector, the first in each pair of values is set to the off-
|
|
set of the first code unit of a substring, and the second is set to the
|
|
offset of the first code unit after the end of a substring. These val-
|
|
ues are always code unit offsets, not character offsets. That is, they
|
|
are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
|
|
library, and 32-bit offsets in the 32-bit library.
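Because the ovector holds code unit offsets, an application that needs a character position in UTF mode has to convert. The following sketch shows one way to do this for the 8-bit library. The function name is hypothetical, and it assumes that the subject is valid UTF-8 and that the offset lies on a character boundary, as ovector values do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): returns the number of UTF-8
   characters that precede byte_offset in subject. A byte starts a character
   unless it is a continuation byte of the form 10xxxxxx. */
size_t utf8_chars_before(const char *subject, size_t byte_offset)
{
    size_t chars = 0;
    assert(subject != NULL);
    for (size_t i = 0; i < byte_offset; i++) {
        if (((unsigned char)subject[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For example, in the UTF-8 string "h\xC3\xA9llo" the first "l" is at byte offset 3 but at character offset 2, because the second character occupies two bytes.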
|
|
|
|
After a partial match (error return PCRE2_ERROR_PARTIAL), only the
|
|
first pair of offsets (that is, ovector[0] and ovector[1]) are set.
|
|
They identify the part of the subject that was partially matched. See
|
|
the pcre2partial documentation for details of partial matching.
|
|
|
|
After a successful match, the first pair of offsets identifies the por-
|
|
tion of the subject string that was matched by the entire pattern. The
|
|
next pair is used for the first capturing subpattern, and so on. The
|
|
value returned by pcre2_match() is one more than the highest numbered
|
|
pair that has been set. For example, if two substrings have been cap-
|
|
tured, the returned value is 3. If there are no capturing subpatterns,
|
|
the return value from a successful match is 1, indicating that just the
|
|
first pair of offsets has been set.
|
|
|
|
If a pattern uses the \K escape sequence within a positive assertion,
|
|
the reported start of a successful match can be greater than the end of
|
|
the match. For example, if the pattern (?=ab\K) is matched against
|
|
"ab", the start and end offset values for the match are 2 and 0.
|
|
|
|
If a capturing subpattern is matched repeatedly within a single
|
|
match operation, it is the last portion of the subject that it matched
|
|
that is returned.
|
|
|
|
If the ovector is too small to hold all the captured substring offsets,
|
|
as much as possible is filled in, and the function returns a value of
|
|
zero. If captured substrings are not of interest, pcre2_match() may be
|
|
called with a match data block whose ovector is of minimum length (that
|
|
is, one pair). However, if the pattern contains back references and the
|
|
ovector is not big enough to remember the related substrings, PCRE2 has
|
|
to get additional memory for use during matching. Thus it is usually
|
|
advisable to set up a match data block containing an ovector of reason-
|
|
able size.
|
|
|
|
It is possible for capturing subpattern number n+1 to match some part
|
|
of the subject when subpattern n has not been used at all. For example,
|
|
if the string "abc" is matched against the pattern (a|(z))(bc) the
|
|
return from the function is 4, and subpatterns 1 and 3 are matched, but
|
|
2 is not. When this happens, both values in the offset pairs corre-
|
|
sponding to unused subpatterns are set to PCRE2_UNSET.
|
|
|
|
Offset values that correspond to unused subpatterns at the end of the
|
|
expression are also set to PCRE2_UNSET. For example, if the string
|
|
"abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
|
|
are not matched. The return from the function is 2, because the high-
|
|
est used capturing subpattern number is 1. The offsets for the sec-
|
|
ond and third capturing subpatterns (assuming the vector is large
|
|
enough, of course) are set to PCRE2_UNSET.
|
|
|
|
Elements in the ovector that do not correspond to capturing parentheses
|
|
in the pattern are never changed. That is, if a pattern contains n cap-
|
|
turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
|
|
pcre2_match(). The other elements retain whatever values they previ-
|
|
ously had.
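The rules above for reading the ovector can be sketched without calling the library. In this illustration the function name is hypothetical, and UNSET is a stand-in for PCRE2_UNSET, assuming (as in the pcre2.h header) that PCRE2_SIZE is size_t and PCRE2_UNSET is the all-ones value. The arguments ovector and rc are meant to be the values obtained from pcre2_get_ovector_pointer() and a successful pcre2_match() call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET so that this sketch compiles on its own. */
#define UNSET (~(size_t)0)

/* Hypothetical helper (not part of PCRE2): copies group n into buf as a
   zero-terminated string. Returns 1 if copied, 0 if the group is unset or
   not covered by rc, and -1 if buf is too small. */
int ovector_copy_group(const char *subject, const size_t *ovector, int rc,
                       int n, char *buf, size_t buflen)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= rc) return 0;    /* pair n was not set by the match */
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == UNSET) return 0;      /* group did not participate */
    len = (start > end) ? 0 : end - start;   /* \K case: treat as empty */
    if (len + 1 > buflen) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return 1;
}
```

For "abc" matched against (a|(z))(bc), the ovector holds the pairs (0,3), (0,1), (UNSET,UNSET), (1,3) and the return value is 4, so groups 0, 1, and 3 can be copied while group 2 is reported as unset.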
|
|
|
|
|
|
OTHER INFORMATION ABOUT A MATCH
|
|
|
|
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
|
|
|
|
As well as the offsets in the ovector, other information about a match
|
|
is retained in the match data block and can be retrieved by the above
|
|
functions in appropriate circumstances. If they are called at other
|
|
times, the result is undefined.
|
|
|
|
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
|
|
failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
|
|
able, and pcre2_get_mark() can be called. It returns a pointer to the
|
|
zero-terminated name, which is within the compiled pattern. Otherwise
|
|
NULL is returned. After a successful match, the (*MARK) name that is
|
|
returned is the last one encountered on the matching path through the
|
|
pattern. After a "no match" or a partial match, the last encountered
|
|
(*MARK) name is returned. For example, consider this pattern:
|
|
|
|
^(*MARK:A)((*MARK:B)a|b)c
|
|
|
|
When it matches "bc", the returned mark is A. The B mark is "seen" in
|
|
the first branch of the group, but it is not on the matching path. On
|
|
the other hand, when this pattern fails to match "bx", the returned
|
|
mark is B.
|
|
|
|
After a successful match, a partial match, or one of the invalid UTF
|
|
errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
|
|
be called. After a successful or partial match it returns the code unit
|
|
offset of the character at which the match started. For a non-partial
|
|
match, this can be different to the value of ovector[0] if the pattern
|
|
contains the \K escape sequence. After a partial match, however, this
|
|
value is always the same as ovector[0] because \K does not affect the
|
|
result of a partial match.
|
|
|
|
After a UTF check failure, pcre2_get_startchar() can be used to obtain
|
|
the code unit offset of the invalid UTF character. Details are given in
|
|
the pcre2unicode page.
|
|
|
|
|
|
ERROR RETURNS FROM pcre2_match()
|
|
|
|
If pcre2_match() fails, it returns a negative number. This can be con-
|
|
verted to a text string by calling pcre2_get_error_message(). Negative
|
|
error codes are also returned by other functions, and are documented
|
|
with them. The codes are given names in the header file. If UTF check-
|
|
ing is in force and an invalid UTF subject string is detected, one of a
|
|
number of UTF-specific negative error codes is returned. Details are
|
|
given in the pcre2unicode page. The following are the other errors that
|
|
may be returned by pcre2_match():
|
|
|
|
PCRE2_ERROR_NOMATCH
|
|
|
|
The subject string did not match the pattern.
|
|
|
|
PCRE2_ERROR_PARTIAL
|
|
|
|
The subject string did not match, but it did match partially. See the
|
|
pcre2partial documentation for details of partial matching.
|
|
|
|
PCRE2_ERROR_BADMAGIC
|
|
|
|
PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
|
|
to catch the case when it is passed a junk pointer. This is the error
|
|
that is returned when the magic number is not present.
|
|
|
|
PCRE2_ERROR_BADMODE
|
|
|
|
This error is given when a pattern that was compiled by the 8-bit
|
|
library is passed to a 16-bit or 32-bit library function, or vice
|
|
versa.
|
|
|
|
PCRE2_ERROR_BADOFFSET
|
|
|
|
The value of startoffset was greater than the length of the subject.
|
|
|
|
PCRE2_ERROR_BADOPTION
|
|
|
|
An unrecognized bit was set in the options argument.
|
|
|
|
PCRE2_ERROR_BADUTFOFFSET
|
|
|
|
The UTF code unit sequence that was passed as a subject was checked and
|
|
found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
|
|
value of startoffset did not point to the beginning of a UTF character
|
|
or the end of the subject.
|
|
|
|
PCRE2_ERROR_CALLOUT
|
|
|
|
This error is never generated by pcre2_match() itself. It is provided
|
|
for use by callout functions that want to cause pcre2_match() or
|
|
pcre2_callout_enumerate() to return a distinctive error code. See the
|
|
pcre2callout documentation for details.
|
|
|
|
PCRE2_ERROR_INTERNAL
|
|
|
|
An unexpected internal error has occurred. This error could be caused
|
|
by a bug in PCRE2 or by overwriting of the compiled pattern.
|
|
|
|
PCRE2_ERROR_JIT_BADOPTION
|
|
|
|
This error is returned when a pattern that was successfully compiled
|
|
using JIT is being matched, but the matching mode (partial or complete
|
|
match) does not correspond to any JIT compilation mode. When the JIT
|
|
fast path function is used, this error may be also given for invalid
|
|
options. See the pcre2jit documentation for more details.
|
|
|
|
PCRE2_ERROR_JIT_STACKLIMIT
|
|
|
|
This error is returned when a pattern that was successfully compiled
|
|
using JIT is being matched, but the memory available for the just-in-
|
|
time processing stack is not large enough. See the pcre2jit documenta-
|
|
tion for more details.
|
|
|
|
PCRE2_ERROR_MATCHLIMIT
|
|
|
|
The backtracking limit was reached.
|
|
|
|
PCRE2_ERROR_NOMEMORY
|
|
|
|
If a pattern contains back references, but the ovector is not big
|
|
enough to remember the referenced substrings, PCRE2 gets a block of
|
|
memory at the start of matching to use for this purpose. There are some
|
|
other special cases where extra memory is needed during matching. This
|
|
error is given when memory cannot be obtained.
|
|
|
|
PCRE2_ERROR_NULL
|
|
|
|
Either the code, subject, or match_data argument was passed as NULL.
|
|
|
|
PCRE2_ERROR_RECURSELOOP
|
|
|
|
This error is returned when pcre2_match() detects a recursion loop
|
|
within the pattern. Specifically, it means that either the whole pat-
|
|
tern or a subpattern has been called recursively for the second time at
|
|
the same position in the subject string. Some simple patterns that
|
|
might do this are detected and faulted at compile time, but more com-
|
|
plicated cases, in particular mutual recursions between two different
|
|
subpatterns, cannot be detected until matching is attempted.
|
|
|
|
PCRE2_ERROR_RECURSIONLIMIT
|
|
|
|
The internal recursion limit was reached.
|
|
|
|
|
|
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
|
|
|
|
int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR *buffer,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR **bufferptr,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
void pcre2_substring_free(PCRE2_UCHAR *buffer);
|
|
|
|
Captured substrings can be accessed directly by using the ovector as
|
|
described above. For convenience, auxiliary functions are provided for
|
|
extracting captured substrings as new, separate, zero-terminated
|
|
strings. A substring that contains a binary zero is correctly extracted
|
|
and has a further zero added on the end, but the result is not, of
|
|
course, a C string.
|
|
|
|
The functions in this section identify substrings by number. The number
|
|
zero refers to the entire matched substring, with higher numbers refer-
|
|
ring to substrings captured by parenthesized groups. After a partial
|
|
match, only substring zero is available. An attempt to extract any
|
|
other substring gives the error PCRE2_ERROR_PARTIAL. The next section
|
|
describes similar functions for extracting captured substrings by name.
|
|
|
|
If a pattern uses the \K escape sequence within a positive assertion,
|
|
the reported start of a successful match can be greater than the end of
|
|
the match. For example, if the pattern (?=ab\K) is matched against
|
|
"ab", the start and end offset values for the match are 2 and 0. In
|
|
this situation, calling these functions with a zero substring number
|
|
extracts a zero-length empty string.
|
|
|
|
You can find the length in code units of a captured substring without
|
|
extracting it by calling pcre2_substring_length_bynumber(). The first
|
|
argument is a pointer to the match data block, the second is the group
|
|
number, and the third is a pointer to a variable into which the length
|
|
is placed. If you just want to know whether or not the substring has
|
|
been captured, you can pass the third argument as NULL.
|
|
|
|
The pcre2_substring_copy_bynumber() function copies a captured sub-
|
|
string into a supplied buffer, whereas pcre2_substring_get_bynumber()
|
|
copies it into new memory, obtained using the same memory allocation
|
|
function that was used for the match data block. The first two argu-
|
|
ments of these functions are a pointer to the match data block and a
|
|
capturing group number.
|
|
|
|
The final arguments of pcre2_substring_copy_bynumber() are a pointer to
|
|
the buffer and a pointer to a variable that contains its length in code
|
|
units. This is updated to contain the actual number of code units used
|
|
for the extracted substring, excluding the terminating zero.
|
|
|
|
For pcre2_substring_get_bynumber() the third and fourth arguments point
|
|
to variables that are updated with a pointer to the new memory and the
|
|
number of code units that comprise the substring, again excluding the
|
|
terminating zero. When the substring is no longer needed, the memory
|
|
should be freed by calling pcre2_substring_free().
|
|
|
|
The return value from all these functions is zero for success, or a
|
|
negative error code. If the pattern match failed, the match failure
|
|
code is returned. If a substring number greater than zero is used
|
|
after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
|
|
error codes are:
|
|
|
|
PCRE2_ERROR_NOMEMORY
|
|
|
|
The buffer was too small for pcre2_substring_copy_bynumber(), or the
|
|
attempt to get memory failed for pcre2_substring_get_bynumber().
|
|
|
|
PCRE2_ERROR_NOSUBSTRING
|
|
|
|
There is no substring with that number in the pattern, that is, the
|
|
number is greater than the number of capturing parentheses.
|
|
|
|
PCRE2_ERROR_UNAVAILABLE
|
|
|
|
The substring number, though not greater than the number of captures in
|
|
the pattern, is greater than the number of slots in the ovector, so the
|
|
substring could not be captured.
|
|
|
|
PCRE2_ERROR_UNSET
|
|
|
|
The substring did not participate in the match. For example, if the
|
|
pattern is (abc)|(def) and the subject is "def", and the ovector con-
|
|
tains at least two capturing slots, substring number 1 is unset.
|
|
|
|
|
|
EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
|
|
|
|
int pcre2_substring_list_get(pcre2_match_data *match_data,
|
|
PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
|
|
|
|
void pcre2_substring_list_free(PCRE2_SPTR *list);
|
|
|
|
The pcre2_substring_list_get() function extracts all available sub-
|
|
strings and builds a list of pointers to them. It also (optionally)
|
|
builds a second list that contains their lengths (in code units),
|
|
excluding a terminating zero that is added to each of them. All this is
|
|
done in a single block of memory that is obtained using the same memory
|
|
allocation function that was used to get the match data block.
|
|
|
|
This function must be called only after a successful match. If called
|
|
after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
|
|
|
|
The address of the memory block is returned via listptr, which is also
|
|
the start of the list of string pointers. The end of the list is marked
|
|
by a NULL pointer. The address of the list of lengths is returned via
|
|
lengthsptr. If your strings do not contain binary zeros and you do not
|
|
therefore need the lengths, you may supply NULL as the lengthsptr argu-
|
|
ment to disable the creation of a list of lengths. The yield of the
|
|
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
|
|
ory block could not be obtained. When the list is no longer needed, it
|
|
should be freed by calling pcre2_substring_list_free().
|
|
|
|
If this function encounters a substring that is unset, which can happen
|
|
when capturing subpattern number n+1 matches some part of the subject,
|
|
but subpattern n has not been used at all, it returns an empty string.
|
|
This can be distinguished from a genuine zero-length substring by
|
|
inspecting the appropriate offset in the ovector, which contains
|
|
PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
|
|
string_length_bynumber().
|
|
|
|
|
|
EXTRACTING CAPTURED SUBSTRINGS BY NAME
|
|
|
|
int pcre2_substring_number_from_name(const pcre2_code *code,
|
|
PCRE2_SPTR name);
|
|
|
|
int pcre2_substring_length_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_copy_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_get_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
|
|
|
|
void pcre2_substring_free(PCRE2_UCHAR *buffer);
|
|
|
|
To extract a substring by name, you first have to find the associated
|
|
number. For example, for this pattern:
|
|
|
|
(a+)b(?<xxx>\d+)...
|
|
|
|
the number of the subpattern called "xxx" is 2. If the name is known to
|
|
be unique (PCRE2_DUPNAMES was not set), you can find the number from
|
|
the name by calling pcre2_substring_number_from_name(). The first argu-
|
|
ment is the compiled pattern, and the second is the name. The yield of
|
|
the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
|
|
is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
|
|
there is more than one subpattern of that name. Given the number, you
|
|
can extract the substring directly, or use one of the functions
|
|
described above.
|
|
|
|
For convenience, there are also "byname" functions that correspond to
|
|
the "bynumber" functions, the only difference being that the second
|
|
argument is a name instead of a number. If PCRE2_DUPNAMES is set and
|
|
there are duplicate names, these functions scan all the groups with the
|
|
given name, and return the first named string that is set.
|
|
|
|
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
|
|
returned. If all groups with the name have numbers that are greater
|
|
than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
|
|
returned. If there is at least one group with a slot in the ovector,
|
|
but no group is found to be set, PCRE2_ERROR_UNSET is returned.
|
|
|
|
Warning: If the pattern uses the (?| feature to set up multiple subpat-
|
|
terns with the same number, as described in the section on duplicate
|
|
subpattern numbers in the pcre2pattern page, you cannot use names to
|
|
distinguish the different subpatterns, because names are not included
|
|
in the compiled code. The matching process uses only numbers. For this
|
|
reason, the use of different names for subpatterns of the same number
|
|
causes an error at compile time.
|
|
|
|
|
|
CREATING A NEW STRING WITH SUBSTITUTIONS
|
|
|
|
int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext, PCRE2_SPTR replacement,
|
|
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
|
|
PCRE2_SIZE *outlengthptr);
|
|
This function calls pcre2_match() and then makes a copy of the subject
|
|
string in outputbuffer, replacing the part that was matched with the
|
|
replacement string, whose length is supplied in rlength. This can be
|
|
given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
|
|
|
|
In the replacement string, which is interpreted as a UTF string in UTF
|
|
mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
|
|
option is set, a dollar character is an escape character that can spec-
|
|
ify the insertion of characters from capturing groups or (*MARK) items
|
|
in the pattern. The following forms are recognized:
|
|
|
|
$$ insert a dollar character
|
|
$<n> or ${<n>} insert the contents of group <n>
|
|
$*MARK or ${*MARK} insert the name of the last (*MARK) encountered
|
|
|
|
Either a group number or a group name can be given for <n>. Curly
|
|
brackets are required only if the following character would be inter-
|
|
preted as part of the number or name. The number may be zero to include
|
|
the entire matched string. For example, if the pattern a(b)c is
|
|
matched with "=abc=" and the replacement string "+$1$0$1+", the result
|
|
is "=+babcb+=". Group insertion is done by calling
|
|
pcre2_substring_copy_byname() or pcre2_substring_copy_bynumber() as
|
|
appropriate.
|
|
|
|
The facility for inserting a (*MARK) name can be used to perform simple
|
|
simultaneous substitutions, as this pcre2test example shows:
|
|
|
|
/(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
|
|
apple lemon
|
|
2: pear orange
|
|
|
|
The first seven arguments of pcre2_substitute() are the same as for
|
|
pcre2_match(), except that the partial matching options are not permit-
|
|
ted, and match_data may be passed as NULL, in which case a match data
|
|
block is obtained and freed within this function, using memory manage-
|
|
ment functions from the match context, if provided, or else those that
|
|
were used to allocate memory for the compiled code.
|
|
|
|
There is one additional option, PCRE2_SUBSTITUTE_GLOBAL, which causes
|
|
the function to iterate over the subject string, replacing every match-
|
|
ing substring. If this is not set, only the first matching substring is
|
|
replaced.
|
|
|
|
The outlengthptr argument must point to a variable that contains the
|
|
length, in code units, of the output buffer. It is updated to contain
|
|
the length of the new string, excluding the trailing zero that is auto-
|
|
matically added.
|
|
|
|
The function returns the number of replacements that were made. This
|
|
may be zero if no matches were found, and is never greater than 1
|
|
unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg-
|
|
ative error code is returned. Except for PCRE2_ERROR_NOMATCH (which is
|
|
never returned), any errors from pcre2_match() or the substring copying
|
|
functions are passed straight back. PCRE2_ERROR_BADREPLACEMENT is
|
|
returned for an invalid replacement string (unrecognized sequence fol-
|
|
lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out-
|
|
put buffer is not big enough.
|
|
|
|
|
|
DUPLICATE SUBPATTERN NAMES
|
|
|
|
int pcre2_substring_nametable_scan(const pcre2_code *code,
|
|
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
|
|
|
|
When a pattern is compiled with the PCRE2_DUPNAMES option, names for
|
|
subpatterns are not required to be unique. Duplicate names are always
|
|
allowed for subpatterns with the same number, created by using the (?|
|
|
feature. Indeed, if such subpatterns are named, they are required to
|
|
use the same names.
|
|
|
|
Normally, patterns with duplicate names are such that in any one match,
|
|
only one of the named subpatterns participates. An example is shown in
|
|
the pcre2pattern documentation.
|
|
|
|
When duplicates are present, pcre2_substring_copy_byname() and
|
|
pcre2_substring_get_byname() return the first substring corresponding
|
|
to the given name that is set. Only if none are set is
|
|
PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
|
|
function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
|
|
duplicate names.
|
|
|
|
If you want to get full details of all captured substrings for a given
|
|
name, you must use the pcre2_substring_nametable_scan() function. The
|
|
first argument is the compiled pattern, and the second is the name. If
|
|
the third and fourth arguments are NULL, the function returns a group
|
|
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
|
|
|
|
When the third and fourth arguments are not NULL, they must be pointers
|
|
to variables that are updated by the function. After it has run, they
|
|
point to the first and last entries in the name-to-number table for the
|
|
given name, and the function returns the length of each entry in code
|
|
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
|
|
no entries for the given name.
|
|
|
|
The format of the name table is described in the section entitled
|
|
Information about a pattern above. Given all the relevant entries for
|
|
the name, you can extract each of their numbers, and hence the captured
|
|
data.
|
|
|
|
|
|
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
|
|
|
|
The traditional matching function uses a similar algorithm to Perl,
|
|
which stops when it finds the first match at a given point in the sub-
|
|
ject. If you want to find all possible matches, or the longest possible
|
|
match at a given position, consider using the alternative matching
|
|
function (see below) instead. If you cannot use the alternative func-
|
|
tion, you can kludge it up by making use of the callout facility, which
|
|
is described in the pcre2callout documentation.
|
|
|
|
What you have to do is to insert a callout right at the end of the pat-
|
|
tern. When your callout function is called, extract and save the cur-
|
|
rent matched substring. Then return 1, which forces pcre2_match() to
|
|
backtrack and try other alternatives. Ultimately, when it runs out of
|
|
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
|
|
|
|
|
|
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|
|
|
int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext,
|
|
int *workspace, PCRE2_SIZE wscount);
|
|
|
|
The function pcre2_dfa_match() is called to match a subject string
|
|
against a compiled pattern, using a matching algorithm that scans the
|
|
subject string just once, and does not backtrack. This has different
|
|
characteristics to the normal algorithm, and is not compatible with
|
|
Perl. Some of the features of PCRE2 patterns are not supported. Never-
|
|
theless, there are times when this kind of matching can be useful. For
|
|
a discussion of the two matching algorithms, and a list of features
|
|
that pcre2_dfa_match() does not support, see the pcre2matching documen-
|
|
tation.
|
|
|
|
The arguments for the pcre2_dfa_match() function are the same as for
|
|
pcre2_match(), plus two extras. The ovector within the match data block
|
|
is used in a different way, and this is described below. The other com-
|
|
mon arguments are used in the same way as for pcre2_match(), so their
|
|
description is not repeated here.
|
|
|
|
The two additional arguments provide workspace for the function. The
|
|
workspace vector should contain at least 20 elements. It is used for
|
|
keeping track of multiple paths through the pattern tree. More
|
|
workspace is needed for patterns and subjects where there are a lot of
|
|
potential matches.
|
|
|
|
Here is an example of a simple call to pcre2_dfa_match():
|
|
|
|
int wspace[20];
|
|
pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
|
|
int rc = pcre2_dfa_match(
|
|
re, /* result of pcre2_compile() */
|
|
"some string", /* the subject string */
|
|
11, /* the length of the subject string */
|
|
0, /* start at offset 0 in the subject */
|
|
0, /* default options */
|
|
match_data, /* the match data block */
|
|
NULL, /* a match context; NULL means use defaults */
|
|
wspace, /* working space vector */
|
|
20); /* number of elements (NOT size in bytes) */
|
|
|
|
Option bits for pcre2_dfa_match()
|
|
|
|
The unused bits of the options argument for pcre2_dfa_match() must be
|
|
zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
|
|
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
|
|
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
|
|
PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
|
|
these are exactly the same as for pcre2_match(), so their description
|
|
is not repeated here.
|
|
|
|
PCRE2_PARTIAL_HARD
|
|
PCRE2_PARTIAL_SOFT
|
|
|
|
These have the same general effect as they do for pcre2_match(), but
|
|
the details are slightly different. When PCRE2_PARTIAL_HARD is set for
|
|
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
|
|
subject is reached and there is still at least one matching possibility
|
|
that requires additional characters. This happens even if some complete
|
|
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
|
|
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
|
|
if the end of the subject is reached, there have been no complete
|
|
matches, but there is still at least one matching possibility. The por-
|
|
tion of the string that was inspected when the longest partial match
|
|
was found is set as the first matching string in both cases. There is a
|
|
more detailed discussion of partial and multi-segment matching, with
|
|
examples, in the pcre2partial documentation.
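
The two rules above can be condensed into a small decision function. The
following is a standalone sketch that models the description only; it does
not call PCRE2, and the function name, the MODEL_ option flags and the
MODEL_ result codes are invented for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real option bits and return codes. */
enum { MODEL_PARTIAL_SOFT = 1, MODEL_PARTIAL_HARD = 2 };
enum { MODEL_NOMATCH = -1, MODEL_PARTIAL = -2 };

/*
 * complete: number of complete matches found (0 or more).
 * pending:  nonzero if, at the end of the subject, at least one matching
 *           possibility still requires additional characters.
 * Returns the modelled yield: a match count, MODEL_PARTIAL or MODEL_NOMATCH.
 */
int model_dfa_yield(unsigned options, int complete, int pending)
{
    assert(complete >= 0);

    /* HARD: a partial match wins even if complete matches were found. */
    if ((options & MODEL_PARTIAL_HARD) && pending)
        return MODEL_PARTIAL;

    if (complete > 0)
        return complete;

    /* SOFT: "no match" becomes "partial" only when nothing completed. */
    if ((options & MODEL_PARTIAL_SOFT) && pending)
        return MODEL_PARTIAL;

    return MODEL_NOMATCH;
}
```

With two complete matches and a pending possibility, the model yields
MODEL_PARTIAL for the hard option but 2 for the soft option.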
|
|
|
|
PCRE2_DFA_SHORTEST
|
|
|
|
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
|
|
stop as soon as it has found one match. Because of the way the alterna-
|
|
tive algorithm works, this is necessarily the shortest possible match
|
|
at the first possible matching point in the subject string.
|
|
|
|
PCRE2_DFA_RESTART
|
|
|
|
When pcre2_dfa_match() returns a partial match, it is possible to call
|
|
it again, with additional subject characters, and have it continue with
|
|
the same match. The PCRE2_DFA_RESTART option requests this action; when
|
|
it is set, the workspace and wscount arguments must reference the same
|
|
vector as before because data about the match so far is left in them
|
|
after a partial match. There is more discussion of this facility in the
|
|
pcre2partial documentation.
|
|
|
|
Successful returns from pcre2_dfa_match()
|
|
|
|
When pcre2_dfa_match() succeeds, it may have matched more than one sub-
|
|
string in the subject. Note, however, that all the matches from one run
|
|
of the function start at the same point in the subject. The shorter
|
|
matches are all initial substrings of the longer matches. For example,
|
|
if the pattern
|
|
|
|
<.*>
|
|
|
|
is matched against the string
|
|
|
|
This is <something> <something else> <something further> no more
|
|
|
|
the three matched strings are
|
|
|
|
<something> <something else> <something further>
|
|
<something> <something else>
|
|
<something>
|
|
|
|
On success, the yield of the function is a number greater than zero,
|
|
which is the number of matched substrings. The offsets of the sub-
|
|
strings are returned in the ovector, and can be extracted by number in
|
|
the same way as for pcre2_match(), but the numbers bear no relation to
|
|
any capturing groups that may exist in the pattern, because DFA match-
|
|
ing does not support group capture.
|
|
|
|
Calls to the convenience functions that extract substrings by name
|
|
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
|
|
after a DFA match. The convenience functions that extract substrings by
|
|
number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
|
|
other errors are slightly different:
|
|
|
|
PCRE2_ERROR_UNAVAILABLE
|
|
|
|
The ovector is not big enough to include a slot for the given substring
|
|
number.
|
|
|
|
PCRE2_ERROR_UNSET
|
|
|
|
There is a slot in the ovector for this substring, but there were
|
|
insufficient matches to fill it.
|
|
|
|
The matched strings are stored in the ovector in reverse order of
|
|
length; that is, the longest matching string is first. If there were
|
|
too many matches to fit into the ovector, the yield of the function is
|
|
zero, and the vector is filled with the longest matches.
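
The layout of the results can be illustrated with a standalone model of
the <.*> example. The sketch below does not call PCRE2: it hard-codes that
one pattern, and the function name and its return convention of -1 for
"no match" are assumptions made for illustration. It stores offset pairs
longest match first and returns zero when the vector is too small, as
described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the result layout for the pattern <.*> only. ovector receives
 * (start, end) offset pairs, longest match first; ovecpairs is its capacity
 * in pairs. Returns the number of matches, 0 if there were more matches
 * than pairs (the longest ones are kept), or -1 if nothing matched.
 */
int model_angle_matches(const char *subject, size_t *ovector, size_t ovecpairs)
{
    const char *open;

    assert(subject != NULL && ovector != NULL && ovecpairs > 0);

    /* Try each '<' in turn; all matches reported share one start point. */
    for (open = strchr(subject, '<'); open != NULL;
         open = strchr(open + 1, '<')) {
        size_t start = (size_t)(open - subject);
        /* By default the dot does not match a newline, so stop there. */
        size_t end = start + strcspn(open, "\n");
        size_t found = 0;

        /* Walk backwards so that longer matches are stored first. */
        for (; end > start + 1; end--) {
            if (subject[end - 1] != '>')
                continue;
            if (found < ovecpairs) {
                ovector[2 * found] = start;
                ovector[2 * found + 1] = end;
            }
            found++;
        }
        if (found > 0)
            return (found > ovecpairs) ? 0 : (int)found;
    }
    return -1;
}
```

Applied to the example subject with room for four pairs, the model reports
three matches; with room for only two pairs it returns zero and keeps the
two longest.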
|
|
|
|
NOTE: PCRE2's "auto-possessification" optimization usually applies to
|
|
character repeats at the end of a pattern (as well as internally). For
|
|
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
|
|
matching, this means that only one possible match is found. If you
|
|
really do want multiple matches in such cases, either use an ungreedy
|
|
repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
|
|
compiling.
|
|
|
|
Error returns from pcre2_dfa_match()
|
|
|
|
The pcre2_dfa_match() function returns a negative number when it fails.
|
|
Many of the errors are the same as for pcre2_match(), as described
|
|
above. There are in addition the following errors that are specific to
|
|
pcre2_dfa_match():
|
|
|
|
PCRE2_ERROR_DFA_UITEM
|
|
|
|
This return is given if pcre2_dfa_match() encounters an item in the
|
|
pattern that it does not support, for instance, the use of \C or a back
|
|
reference.
|
|
|
|
PCRE2_ERROR_DFA_UCOND
|
|
|
|
This return is given if pcre2_dfa_match() encounters a condition item
|
|
that uses a back reference for the condition, or a test for recursion
|
|
in a specific group. These are not supported.
|
|
|
|
PCRE2_ERROR_DFA_WSSIZE
|
|
|
|
This return is given if pcre2_dfa_match() runs out of space in the
|
|
workspace vector.
|
|
|
|
PCRE2_ERROR_DFA_RECURSE
|
|
|
|
When a recursive subpattern is processed, the matching function calls
|
|
itself recursively, using private memory for the ovector and workspace.
|
|
This error is given if the internal ovector is not large enough. This
|
|
should be extremely rare, as a vector of size 1000 is used.
|
|
|
|
PCRE2_ERROR_DFA_BADRESTART
|
|
|
|
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
|
|
some plausibility checks are made on the contents of the workspace,
|
|
which should contain data about the previous partial match. If any of
|
|
these checks fail, this error is given.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
|
|
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3),
|
|
pcre2unicode(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 30 August 2015
|
|
Copyright (c) 1997-2015 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
BUILDING PCRE2
|
|
|
|
PCRE2 is distributed with a configure script that can be used to build
|
|
the library in Unix-like environments using the applications known as
|
|
Autotools. Also in the distribution are files to support building using
|
|
CMake instead of configure. The text file README contains general
|
|
information about building with Autotools (some of which is repeated
|
|
below), and also has some comments about building on various operating
|
|
systems. There is a lot more information about building PCRE2 without
|
|
using Autotools (including information about using CMake and building
|
|
"by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
|
|
consult this file as well as the README file if you are building in a
|
|
non-Unix-like environment.
|
|
|
|
|
|
PCRE2 BUILD-TIME OPTIONS
|
|
|
|
The rest of this document describes the optional features of PCRE2 that
|
|
can be selected when the library is compiled. It assumes use of the
|
|
configure script, where the optional features are selected or dese-
|
|
lected by providing options to configure before running the make com-
|
|
mand. However, the same options can be selected in both Unix-like and
|
|
non-Unix-like environments if you are using CMake instead of configure
|
|
to build PCRE2.
|
|
|
|
If you are not using Autotools or CMake, option selection can be done
|
|
by editing the config.h file, or by passing parameter settings to the
|
|
compiler, as described in NON-AUTOTOOLS-BUILD.
|
|
|
|
The complete list of options for configure (which includes the standard
|
|
ones such as the selection of the installation directory) can be
|
|
obtained by running
|
|
|
|
./configure --help
|
|
|
|
The following sections include descriptions of options whose names
|
|
begin with --enable or --disable. These settings specify changes to the
|
|
defaults for the configure command. Because of the way that configure
|
|
works, --enable and --disable always come in pairs, so the complemen-
|
|
tary option always exists as well, but as it specifies the default, it
|
|
is not described.
|
|
|
|
|
|
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
|
|
|
By default, a library called libpcre2-8 is built, containing functions
|
|
that take string arguments contained in vectors of bytes, interpreted
|
|
either as single-byte characters, or UTF-8 strings. You can also build
|
|
two other libraries, called libpcre2-16 and libpcre2-32, which process
|
|
strings that are contained in vectors of 16-bit and 32-bit code units,
|
|
respectively. These can be interpreted either as single-unit characters
|
|
or UTF-16/UTF-32 strings. To build these additional libraries, add one
|
|
or both of the following to the configure command:
|
|
|
|
--enable-pcre2-16
|
|
--enable-pcre2-32
|
|
|
|
If you do not want the 8-bit library, add
|
|
|
|
--disable-pcre2-8
|
|
|
|
as well. At least one of the three libraries must be built. Note that
|
|
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
|
|
an 8-bit program. Neither of these is built if you select only the
|
|
16-bit or 32-bit libraries.
|
|
|
|
|
|
BUILDING SHARED AND STATIC LIBRARIES
|
|
|
|
The Autotools PCRE2 building process uses libtool to build both shared
|
|
and static libraries by default. You can suppress an unwanted library
|
|
by adding one of
|
|
|
|
--disable-shared
|
|
--disable-static
|
|
|
|
to the configure command.
|
|
|
|
|
|
UNICODE AND UTF SUPPORT
|
|
|
|
By default, PCRE2 is built with support for Unicode and UTF character
|
|
strings. To build it without Unicode support, add
|
|
|
|
--disable-unicode
|
|
|
|
to the configure command. This setting applies to all three libraries.
|
|
It is not possible to build one library with Unicode support, and
|
|
another without, in the same configuration.
|
|
|
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
|
UTF-16 or UTF-32. To do that, applications that use the library can set
|
|
the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
|
|
tern. Alternatively, patterns may be started with (*UTF) unless the
|
|
application has locked this out by setting PCRE2_NEVER_UTF.
|
|
|
|
UTF support allows the libraries to process character code points up to
|
|
0x10ffff in the strings that they handle. It also provides support for
|
|
accessing the Unicode properties of such characters, using pattern
|
|
escapes such as \P, \p, and \X. Only the general category properties
|
|
such as Lu and Nd are supported. Details are given in the pcre2pattern
|
|
documentation.
|
|
|
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
|
properties. The application can request that they do by setting the
|
|
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
|
|
pattern may also request this by starting with (*UCP).
|
|
|
|
The \C escape sequence, which matches a single code unit, even in a UTF
|
|
mode, can cause unpredictable behaviour because it may leave the cur-
|
|
rent matching point in the middle of a multi-code-unit character. It
|
|
can be locked out by setting the PCRE2_NEVER_BACKSLASH_C option.
|
|
|
|
|
|
JUST-IN-TIME COMPILER SUPPORT
|
|
|
|
Just-in-time compiler support is included in the build by specifying
|
|
|
|
--enable-jit
|
|
|
|
This support is available only for certain hardware architectures. If
|
|
this option is set for an unsupported architecture, a building error
|
|
occurs. See the pcre2jit documentation for a discussion of JIT usage.
|
|
When JIT support is enabled, pcre2grep automatically makes use of it,
|
|
unless you add
|
|
|
|
--disable-pcre2grep-jit
|
|
|
|
to the "configure" command.
|
|
|
|
|
|
NEWLINE RECOGNITION
|
|
|
|
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
|
the end of a line. This is the normal newline character on Unix-like
|
|
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
|
adding
|
|
|
|
--enable-newline-is-cr
|
|
|
|
to the configure command. There is also an --enable-newline-is-lf
|
|
option, which explicitly specifies linefeed as the newline character.
|
|
|
|
Alternatively, you can specify that line endings are to be indicated by
|
|
the two-character sequence CRLF (CR immediately followed by LF). If you
|
|
want this, add
|
|
|
|
--enable-newline-is-crlf
|
|
|
|
to the configure command. There is a fourth option, specified by
|
|
|
|
--enable-newline-is-anycrlf
|
|
|
|
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
|
CRLF as indicating a line ending. Finally, a fifth option, specified by
|
|
|
|
--enable-newline-is-any
|
|
|
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
|
newline sequences are the three just mentioned, plus the single charac-
|
|
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
|
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
|
U+2029).
|
|
|
|
Whatever default line ending convention is selected when PCRE2 is built
|
|
can be overridden by applications that use the library. At build time
|
|
it is conventional to use the standard for your operating system.
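
To make the first four conventions concrete, here is a standalone sketch
of how a newline at a given position might be recognized under each of
them. It is a model of the description above, not PCRE2 code; the enum and
function names are invented, and the "any Unicode newline" convention is
left out because it depends on the character encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for four of the conventions described above. */
enum model_newline {
    MODEL_NL_CR, MODEL_NL_LF, MODEL_NL_CRLF, MODEL_NL_ANYCRLF
};

/* Returns the length in bytes of the newline at s[pos], or 0 if none. */
size_t model_newline_length(const char *s, size_t len, size_t pos,
                            enum model_newline nl)
{
    int cr, lf, crlf;

    assert(s != NULL && pos <= len);
    if (pos == len)
        return 0;

    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (nl) {
    case MODEL_NL_CR:      return cr ? 1 : 0;
    case MODEL_NL_LF:      return lf ? 1 : 0;
    case MODEL_NL_CRLF:    return crlf ? 2 : 0;
    case MODEL_NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
    return 0;
}
```

In the subject "a\r\nb", position 1 is a one-byte newline under the CR
convention, a two-byte newline under CRLF and ANYCRLF, and not a newline
at all under LF.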
|
|
|
|
|
|
WHAT \R MATCHES
|
|
|
|
By default, the sequence \R in a pattern matches any Unicode newline
|
|
sequence, independently of what has been selected as the line ending
|
|
sequence. If you specify
|
|
|
|
--enable-bsr-anycrlf
|
|
|
|
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
|
ever is selected when PCRE2 is built can be overridden by applications
|
|
that use the library.
|
|
|
|
|
|
HANDLING VERY LARGE PATTERNS
|
|
|
|
Within a compiled pattern, offset values are used to point from one
|
|
part to another (for example, from an opening parenthesis to an alter-
|
|
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
|
two-byte values are used for these offsets, leading to a maximum size
|
|
for a compiled pattern of around 64K code units. This is sufficient to
|
|
handle all but the most gigantic patterns. Nevertheless, some people do
|
|
want to process truly enormous patterns, so it is possible to compile
|
|
PCRE2 to use three-byte or four-byte offsets by adding a setting such
|
|
as
|
|
|
|
--with-link-size=3
|
|
|
|
to the configure command. The value given must be 2, 3, or 4. For the
|
|
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
|
using longer offsets slows down the operation of PCRE2 because it has
|
|
to load additional data when handling them. For the 32-bit library the
|
|
value is always 4 and cannot be overridden; the value of --with-link-
|
|
size is ignored.
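
The rules for the link size that is actually used can be summarized as a
small function. This is a standalone sketch of the rules stated above, not
code from the PCRE2 build system; the function name is invented for
illustration.

```c
#include <assert.h>

/*
 * Link size actually used, given the --with-link-size value (2, 3 or 4)
 * and the code unit width of the library being built (8, 16 or 32).
 */
int model_effective_link_size(int requested, int code_unit_width)
{
    assert(requested >= 2 && requested <= 4);
    assert(code_unit_width == 8 || code_unit_width == 16 ||
           code_unit_width == 32);

    if (code_unit_width == 32)
        return 4;               /* the setting is ignored */
    if (code_unit_width == 16 && requested == 3)
        return 4;               /* 3 is rounded up to 4 */
    return requested;
}
```

For example, a requested value of 3 stays 3 for the 8-bit library but
becomes 4 for the 16-bit library.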
|
|
|
|
|
|
AVOIDING EXCESSIVE STACK USAGE
|
|
|
|
When matching with the pcre2_match() function, PCRE2 implements back-
|
|
tracking by making recursive calls to an internal function called
|
|
match(). In environments where the size of the stack is limited, this
|
|
can severely limit PCRE2's operation. (The Unix environment does not
|
|
usually suffer from this problem, but it may sometimes be necessary to
|
|
increase the maximum stack size. There is a discussion in the
|
|
pcre2stack documentation.) An alternative approach to recursion that
|
|
uses memory from the heap to remember data, instead of using recursive
|
|
function calls, has been implemented to work round the problem of lim-
|
|
ited stack size. If you want to build a version of PCRE2 that works
|
|
this way, add
|
|
|
|
--disable-stack-for-recursion
|
|
|
|
to the configure command. By default, the system functions malloc() and
|
|
free() are called to manage the heap memory that is required, but cus-
|
|
tom memory management functions can be called instead. PCRE2 runs
|
|
noticeably more slowly when built in this way. This option affects only
|
|
the pcre2_match() function; it is not relevant for pcre2_dfa_match().
|
|
|
|
|
|
LIMITING PCRE2 RESOURCE USAGE
|
|
|
|
Internally, PCRE2 has a function called match(), which it calls repeat-
|
|
edly (sometimes recursively) when matching a pattern with the
|
|
pcre2_match() function. By controlling the maximum number of times this
|
|
function may be called during a single matching operation, a limit can
|
|
be placed on the resources used by a single call to pcre2_match(). The
|
|
limit can be changed at run time, as described in the pcre2api documen-
|
|
tation. The default is 10 million, but this can be changed by adding a
|
|
setting such as
|
|
|
|
--with-match-limit=500000
|
|
|
|
to the configure command. This setting has no effect on the
|
|
pcre2_dfa_match() matching function.
|
|
|
|
In some environments it is desirable to limit the depth of recursive
|
|
calls of match() more strictly than the total number of calls, in order
|
|
to restrict the maximum amount of stack (or heap, if --disable-stack-
|
|
for-recursion is specified) that is used. A second limit controls this;
|
|
it defaults to the value that is set for --with-match-limit, which
|
|
imposes no additional constraints. However, you can set a lower limit
|
|
by adding, for example,
|
|
|
|
--with-match-limit-recursion=10000
|
|
|
|
to the configure command. This value can also be overridden at run
|
|
time.
|
|
|
|
|
|
CREATING CHARACTER TABLES AT BUILD TIME
|
|
|
|
PCRE2 uses fixed tables for processing characters whose code points are
|
|
less than 256. By default, PCRE2 is built with a set of tables that are
|
|
distributed in the file src/pcre2_chartables.c.dist. These tables are
|
|
for ASCII codes only. If you add
|
|
|
|
--enable-rebuild-chartables
|
|
|
|
to the configure command, the distributed tables are no longer used.
|
|
Instead, a program called dftables is compiled and run. This outputs
|
|
the source for a new set of tables, created in the default locale of your
|
|
C run-time system. (This method of replacing the tables does not work
|
|
if you are cross compiling, because dftables is run on the local host.
|
|
If you need to create alternative tables when cross compiling, you will
|
|
have to do so "by hand".)
|
|
|
|
|
|
USING EBCDIC CODE
|
|
|
|
PCRE2 assumes by default that it will run in an environment where the
|
|
character code is ASCII or Unicode, which is a superset of ASCII. This
|
|
is the case for most computer operating systems. PCRE2 can, however, be
|
|
compiled to run in an 8-bit EBCDIC environment by adding
|
|
|
|
--enable-ebcdic --disable-unicode
|
|
|
|
to the configure command. This setting implies --enable-rebuild-charta-
|
|
bles. You should only use it if you know that you are in an EBCDIC
|
|
environment (for example, an IBM mainframe operating system).
|
|
|
|
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
|
version of the library. Consequently, --enable-unicode and --enable-
|
|
ebcdic are mutually exclusive.
|
|
|
|
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
|
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
|
is used. In such an environment you should use
|
|
|
|
--enable-ebcdic-nl25
|
|
|
|
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
|
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
|
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
|
acter (which, in Unicode, is 0x85).
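
The assignment of the three code points can be written down as a small
table-building function. This is a standalone sketch of the paragraph
above; the structure and function names are invented, and the nl25
parameter stands for whether --enable-ebcdic-nl25 was given.

```c
#include <assert.h>

/* EBCDIC code points used for the newline-related characters. */
struct model_ebcdic_nl {
    unsigned char cr;   /* same value as in ASCII */
    unsigned char lf;   /* 0x15 by default, 0x25 with nl25 */
    unsigned char nel;  /* whichever of 0x15 and 0x25 is not LF */
};

struct model_ebcdic_nl model_ebcdic_nl_values(int nl25)
{
    struct model_ebcdic_nl v;

    assert(nl25 == 0 || nl25 == 1);
    v.cr = 0x0d;
    v.lf = nl25 ? 0x25 : 0x15;
    v.nel = nl25 ? 0x15 : 0x25;
    return v;
}
```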
|
|
|
|
The options that select newline behaviour, such as --enable-newline-is-
|
|
cr, and equivalent run-time options, refer to these character values in
|
|
an EBCDIC environment.
|
|
|
|
|
|
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
|
|
|
By default, pcre2grep reads all files as plain text. You can build it
|
|
so that it recognizes files whose names end in .gz or .bz2, and reads
|
|
them with libz or libbz2, respectively, by adding one or both of
|
|
|
|
--enable-pcre2grep-libz
|
|
--enable-pcre2grep-libbz2
|
|
|
|
to the configure command. These options naturally require that the rel-
|
|
evant libraries are installed on your system. Configuration will fail
|
|
if they are not.
|
|
|
|
|
|
PCRE2GREP BUFFER SIZE
|
|
|
|
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
|
scanning, in order to be able to output "before" and "after" lines when
|
|
it finds a match. The size of the buffer is controlled by a parameter
|
|
whose default value is 20K. The buffer itself is three times this size,
|
|
but because of the way it is used for holding "before" lines, the long-
|
|
est line that is guaranteed to be processable is the parameter size.
|
|
You can change the default parameter value by adding, for example,
|
|
|
|
--with-pcre2grep-bufsize=50K
|
|
|
|
to the configure command. The caller of pcre2grep can override this
|
|
value by using --buffer-size on the command line.
|
|
|
|
|
|
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|
|
|
If you add one of
|
|
|
|
--enable-pcre2test-libreadline
|
|
--enable-pcre2test-libedit
|
|
|
|
to the configure command, pcre2test is linked with the libreadline
|
|
or libedit library, respectively, and when its input is from a terminal,
|
|
it reads it using the readline() function. This provides line-editing
|
|
and history facilities. Note that libreadline is GPL-licensed, so if
|
|
you distribute a binary of pcre2test linked in this way, there may be
|
|
licensing issues. These can be avoided by linking instead with libedit,
|
|
which has a BSD licence.
|
|
|
|
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
|
be added to the pcre2test build. In many operating environments with a
|
|
system-installed readline library this is sufficient. However, in some
|
|
environments (e.g. if an unmodified distribution version of readline is
|
|
in use), some extra configuration may be necessary. The INSTALL file
|
|
for libreadline says this:
|
|
|
|
"Readline uses the termcap functions, but does not link with
|
|
the termcap or curses library itself, allowing applications
|
|
which link with readline the to choose an appropriate library."
|
|
|
|
If your environment has not been set up so that an appropriate library
|
|
is automatically included, you may need to add something like
|
|
|
|
LIBS="-lncurses"
|
|
|
|
immediately before the configure command.
|
|
|
|
|
|
INCLUDING DEBUGGING CODE
|
|
|
|
If you add
|
|
|
|
--enable-debug
|
|
|
|
to the configure command, additional debugging code is included in the
|
|
build. This feature is intended for use by the PCRE2 maintainers.
|
|
|
|
|
|
DEBUGGING WITH VALGRIND SUPPORT
|
|
|
|
If you add
|
|
|
|
--enable-valgrind
|
|
|
|
to the configure command, PCRE2 will use valgrind annotations to mark
|
|
certain memory regions as unaddressable. This allows it to detect
|
|
invalid memory accesses, and is mostly useful for debugging PCRE2
|
|
itself.
|
|
|
|
|
|
CODE COVERAGE REPORTING
|
|
|
|
If your C compiler is gcc, you can build a version of PCRE2 that can
|
|
generate a code coverage report for its test suite. To enable this, you
|
|
must install lcov version 1.6 or above. Then specify
|
|
|
|
--enable-coverage
|
|
|
|
to the configure command and build PCRE2 in the usual way.
|
|
|
|
Note that using ccache (a caching C compiler) is incompatible with code
|
|
coverage reporting. If you have configured ccache to run automatically
|
|
on your system, you must set the environment variable
|
|
|
|
CCACHE_DISABLE=1
|
|
|
|
before running make to build PCRE2, so that ccache is not used.
|
|
|
|
When --enable-coverage is used, the following additional targets are
|
|
added to the Makefile:
|
|
|
|
make coverage
|
|
|
|
This creates a fresh coverage report for the PCRE2 test suite. It is
|
|
equivalent to running "make coverage-reset", "make coverage-baseline",
|
|
"make check", and then "make coverage-report".
|
|
|
|
make coverage-reset
|
|
|
|
This zeroes the coverage counters, but does nothing else.
|
|
|
|
make coverage-baseline
|
|
|
|
This captures baseline coverage information.
|
|
|
|
make coverage-report
|
|
|
|
This creates the coverage report.
|
|
|
|
make coverage-clean-report
|
|
|
|
This removes the generated coverage report without cleaning the cover-
|
|
age data itself.
|
|
|
|
make coverage-clean-data
|
|
|
|
This removes the captured coverage data without removing the coverage
|
|
files created at compile time (*.gcno).
|
|
|
|
make coverage-clean
|
|
|
|
This cleans all coverage data including the generated coverage report.
|
|
For more information about code coverage, see the gcov and lcov docu-
|
|
mentation.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2api(3), pcre2-config(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 24 April 2015
|
|
Copyright (c) 1997-2015 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SYNOPSIS
|
|
|
|
#include <pcre2.h>
|
|
|
|
int (*pcre2_callout)(pcre2_callout_block *, void *);
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
|
|
DESCRIPTION
|
|
|
|
PCRE2 provides a feature called "callout", which is a means of tempo-
|
|
rarily passing control to the caller of PCRE2 in the middle of pattern
|
|
matching. The caller of PCRE2 provides an external function by putting
|
|
its entry point in a match context (see pcre2_set_callout() in the
|
|
pcre2api documentation).
|
|
|
|
Within a regular expression, (?C<arg>) indicates a point at which the
|
|
external function is to be called. Different callout points can be
|
|
identified by putting a number less than 256 after the letter C. The
|
|
default value is zero. Alternatively, the argument may be a delimited
|
|
string. The starting delimiter must be one of ` ' " ^ % # $ { and the
|
|
ending delimiter is the same as the start, except for {, where the end-
|
|
ing delimiter is }. If the ending delimiter is needed within the
|
|
string, it must be doubled. For example, this pattern has two callout
|
|
points:
|
|
|
|
(?C1)abc(?C"some ""arbitrary"" text")def
|
|
|
|
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
|
|
PCRE2 automatically inserts callouts, all with number 255, before each
|
|
item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with
|
|
the pattern
|
|
|
|
A(\d{2}|--)
|
|
|
|
it is processed as if it were
|
|
|
|
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
|
|
|
Notice that there is a callout before and after each parenthesis and
|
|
alternation bar. If the pattern contains a conditional group whose con-
|
|
dition is an assertion, an automatic callout is inserted immediately
|
|
before the condition. Such a callout may also be inserted explicitly,
|
|
for example:
|
|
|
|
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
|
|
|
|
This applies only to assertion conditions (because they are themselves
|
|
independent groups).
|
|
|
|
Callouts can be useful for tracking the progress of pattern matching.
|
|
The pcre2test program has a pattern qualifier (/auto_callout) that sets
|
|
automatic callouts. When any callouts are present, the output from
|
|
pcre2test indicates how the pattern is being matched. This is useful
|
|
information when you are trying to optimize the performance of a par-
|
|
ticular pattern.
|
|
|
|
|
|
MISSING CALLOUTS
|
|
|
|
You should be aware that, because of optimizations in the way PCRE2
|
|
compiles and matches patterns, callouts sometimes do not happen exactly
|
|
as you might expect.
|
|
|
|
Auto-possessification
|
|
|
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
|
that what follows cannot be part of the repeat. For example, a+[bc] is
|
|
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
|
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
|
to the string "aaaa" is:
|
|
|
|
--->aaaa
|
|
+0 ^ a+
|
|
+2 ^ ^ [bc]
|
|
No match
|
|
|
|
This indicates that when matching [bc] fails, there is no backtracking
|
|
into a+ and therefore the callouts that would be taken for the back-
|
|
tracks do not occur. You can disable the auto-possessify feature by
|
|
passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
|
|
tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
|
|
|
--->aaaa
|
|
+0 ^ a+
|
|
+2 ^ ^ [bc]
|
|
+2 ^ ^ [bc]
|
|
+2 ^ ^ [bc]
|
|
+2 ^^ [bc]
|
|
No match
|
|
|
|
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
|
tries again, repeatedly, until a+ itself fails.
|
|
|
|
Automatic .* anchoring
|
|
|
|
By default, an optimization is applied when .* is the first significant
|
|
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
|
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
|
is not set, a match can start only after an internal newline or at the
|
|
beginning of the subject, and pcre2_compile() remembers this. This
|
|
optimization is disabled, however, if .* is in an atomic group or if
|
|
there is a back reference to the capturing group in which it appears.
|
|
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
|
|
ever, the presence of callouts does not affect it.
|
|
|
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
|
and applied to the string "aa", the pcre2test output is:
|
|
|
|
--->aa
|
|
+0 ^ .*
|
|
+2 ^ ^ \d
|
|
+2 ^^ \d
|
|
+2 ^ \d
|
|
No match
|
|
|
|
This shows that all match attempts start at the beginning of the sub-
|
|
ject. In other words, the pattern is anchored. You can disable this
|
|
optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
|
|
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
|
|
put changes to:
|
|
|
|
  --->aa
   +0 ^      .*
   +2 ^ ^    \d
   +2 ^^     \d
   +2 ^      \d
   +0  ^     .*
   +2  ^^    \d
   +2  ^     \d
  No match
|
|
|
|
This shows more match attempts, starting at the second subject charac-
|
|
ter. Another optimization, described in the next section, means that
|
|
there is no subsequent attempt to match with an empty subject.
|
|
|
|
If a pattern has more than one top-level branch, automatic anchoring
|
|
occurs if all branches are anchorable.
|
|
|
|
Other optimizations
|
|
|
|
Other optimizations that provide fast "no match" results also affect
|
|
callouts. For example, if the pattern is
|
|
|
|
ab(?C4)cd
|
|
|
|
PCRE2 knows that any matching string must contain the letter "d". If
|
|
the subject string is "abyz", the lack of "d" means that matching
|
|
doesn't ever start, and the callout is never reached. However, with
|
|
"abyd", though the result is still no match, the callout is obeyed.
|
|
|
|
PCRE2 also knows the minimum length of a matching string, and will
|
|
immediately give a "no match" return without actually running a match
|
|
if the subject is not long enough, or, for unanchored patterns, if it
|
|
has been scanned far enough.
|
|
|
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
|
MIZE option to pcre2_compile(), or by starting the pattern with
|
|
(*NO_START_OPT). This slows down the matching process, but does ensure
|
|
that callouts such as the example above are obeyed.
|
|
|
|
|
|
THE CALLOUT INTERFACE
|
|
|
|
During matching, when PCRE2 reaches a callout point, if an external
|
|
function is set in the match context, it is called. This applies to
|
|
both normal and DFA matching. The first argument to the callout func-
|
|
tion is a pointer to a pcre2_callout block. The second argument is the
|
|
void * callout data that was supplied when the callout was set up by
|
|
calling pcre2_set_callout() (see the pcre2api documentation). The call-
|
|
out block structure contains the following fields:
|
|
|
|
  uint32_t      version;
  uint32_t      callout_number;
  uint32_t      capture_top;
  uint32_t      capture_last;
  PCRE2_SIZE   *offset_vector;
  PCRE2_SPTR    mark;
  PCRE2_SPTR    subject;
  PCRE2_SIZE    subject_length;
  PCRE2_SIZE    start_match;
  PCRE2_SIZE    current_position;
  PCRE2_SIZE    pattern_position;
  PCRE2_SIZE    next_item_length;
  PCRE2_SIZE    callout_string_offset;
  PCRE2_SIZE    callout_string_length;
  PCRE2_SPTR    callout_string;
|
|
|
|
The version field contains the version number of the block format. The
|
|
current version is 1; the three callout string fields were added for
|
|
this version. If you are writing an application that might use an ear-
|
|
lier release of PCRE2, you should check the version number before
|
|
accessing any of these fields. The version number will increase in
|
|
future if more fields are added, but the intention is never to remove
|
|
any of the existing fields.
|
|
|
|
Fields for numerical callouts
|
|
|
|
For a numerical callout, callout_string is NULL, and callout_number
|
|
contains the number of the callout, in the range 0-255. This is the
|
|
number that follows (?C for manual callouts; it is 255 for automati-
|
|
cally generated callouts.
|
|
|
|
Fields for string callouts
|
|
|
|
For callouts with string arguments, callout_number is always zero, and
|
|
callout_string points to the string that is contained within the com-
|
|
piled pattern. Its length is given by callout_string_length. Duplicated
|
|
ending delimiters that were present in the original pattern string have
|
|
been turned into single characters, but there is no other processing of
|
|
the callout string argument. An additional code unit containing binary
|
|
zero is present after the string, but is not included in the length.
|
|
The delimiter that was used to start the string is also stored within
|
|
the pattern, immediately before the string itself. You can access this
|
|
delimiter as callout_string[-1] if you need it.
|
|
|
|
The callout_string_offset field is the code unit offset to the start of
|
|
the callout argument string within the original pattern string. This is
|
|
provided for the benefit of applications such as script languages that
|
|
might need to report errors in the callout string within the pattern.
|
|
|
|
Fields for all callouts
|
|
|
|
The remaining fields in the callout block are the same for both kinds
|
|
of callout.
|
|
|
|
The offset_vector field is a pointer to the vector of capturing offsets
|
|
(the "ovector") that was passed to the matching function in the match
|
|
data block. When pcre2_match() is used, the contents can be inspected
|
|
in order to extract substrings that have been matched so far, in the
|
|
same way as for extracting substrings after a match has completed. For
|
|
the DFA matching function, this field is not useful.
|
|
|
|
The subject and subject_length fields contain copies of the values that
|
|
were passed to the matching function.
|
|
|
|
The start_match field normally contains the offset within the subject
|
|
at which the current match attempt started. However, if the escape
|
|
sequence \K has been encountered, this value is changed to reflect the
|
|
modified starting point. If the pattern is not anchored, the callout
|
|
function may be called several times from the same point in the pattern
|
|
for different starting points in the subject.
|
|
|
|
The current_position field contains the offset within the subject of
|
|
the current match pointer.
|
|
|
|
When pcre2_match() is used, the capture_top field contains one more
|
|
than the number of the highest numbered captured substring so far. If
|
|
no substrings have been captured, the value of capture_top is one. This
|
|
is always the case when the DFA functions are used, because they do not
|
|
support captured substrings.
|
|
|
|
The capture_last field contains the number of the most recently cap-
|
|
tured substring. However, when a recursion exits, the value reverts to
|
|
what it was outside the recursion, as do the values of all captured
|
|
substrings. If no substrings have been captured, the value of cap-
|
|
ture_last is 0. This is always the case for the DFA matching functions.
|
|
|
|
The pattern_position field contains the offset in the pattern string to
|
|
the next item to be matched.
|
|
|
|
The next_item_length field contains the length of the next item to be
|
|
matched in the pattern string. When the callout immediately precedes an
|
|
alternation bar, a closing parenthesis, or the end of the pattern, the
|
|
length is zero. When the callout precedes an opening parenthesis, the
|
|
length is that of the entire subpattern.
|
|
|
|
The pattern_position and next_item_length fields are intended to help
|
|
in distinguishing between different automatic callouts, which all have
|
|
the same callout number. However, they are set for all callouts, and
|
|
are used by pcre2test to show the next item to be matched when display-
|
|
ing callout information.
|
|
|
|
In callouts from pcre2_match() the mark field contains a pointer to the
|
|
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
|
(*THEN) item in the match, or NULL if no such items have been passed.
|
|
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
|
previous (*MARK). In callouts from the DFA matching function this field
|
|
always contains NULL.
|
|
|
|
|
|
RETURN VALUES FROM CALLOUTS
|
|
|
|
The external callout function returns an integer to PCRE2. If the value
|
|
is zero, matching proceeds as normal. If the value is greater than
|
|
zero, matching fails at the current point, but the testing of other
|
|
matching possibilities goes ahead, just as if a lookahead assertion had
|
|
failed. If the value is less than zero, the match is abandoned, and the
|
|
matching function returns the negative value.
|
|
|
|
Negative values should normally be chosen from the set of
|
|
PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
|
|
standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
|
|
reserved for use by callout functions; it will never be used by PCRE2
|
|
itself.
|
|
|
|
|
|
CALLOUT ENUMERATION
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
A script language that supports the use of string arguments in callouts
|
|
might like to scan all the callouts in a pattern before running the
|
|
match. This can be done by calling pcre2_callout_enumerate(). The first
|
|
argument is a pointer to a compiled pattern, the second points to a
|
|
callback function, and the third is arbitrary user data. The callback
|
|
function is called for every callout in the pattern in the order in
|
|
which they appear. Its first argument is a pointer to a callout enumer-
|
|
ation block, and its second argument is the user_data value that was
|
|
passed to pcre2_callout_enumerate(). The data block contains the fol-
|
|
lowing fields:
|
|
|
|
  version                Block version number
  pattern_position       Offset to next item in pattern
  next_item_length       Length of next item in pattern
  callout_number         Number for numbered callouts
  callout_string_offset  Offset to string within pattern
  callout_string_length  Length of callout string
  callout_string         Points to callout string or is NULL
|
|
|
|
The version number is currently 0. It will increase if new fields are
|
|
ever added to the block. The remaining fields are the same as their
|
|
namesakes in the pcre2_callout block that is used for callouts during
|
|
matching, as described above.
|
|
|
|
Note that the value of pattern_position is unique for each callout.
|
|
However, if a callout occurs inside a group that is quantified with a
|
|
non-zero minimum or a fixed maximum, the group is replicated inside the
|
|
compiled pattern. For example, a pattern such as /(a){2}/ is compiled
|
|
as if it were /(a)(a)/. This means that the callout will be enumerated
|
|
more than once, but with the same value for pattern_position in each
|
|
case.
|
|
|
|
The callback function should normally return zero. If it returns a non-
|
|
zero value, scanning the pattern stops, and that value is returned from
|
|
pcre2_callout_enumerate().
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 23 March 2015
|
|
Copyright (c) 1997-2015 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
DIFFERENCES BETWEEN PCRE2 AND PERL
|
|
|
|
This document describes the differences in the ways that PCRE2 and Perl
|
|
handle regular expressions. The differences described here are with
|
|
respect to Perl versions 5.10 and above.
|
|
|
|
1. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
|
it does have are given in the pcre2unicode page.
|
|
|
|
2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
|
|
but they do not mean what you might think. For example, (?!a){3} does
|
|
not assert that the next three characters are not "a". It just asserts
|
|
that the next character is not "a" three times (in principle: PCRE2
|
|
optimizes this to run the assertion just once). Perl allows repeat
|
|
quantifiers on other assertions such as \b, but these do not seem to
|
|
have any use.
|
|
|
|
3. Capturing subpatterns that occur inside negative lookahead asser-
|
|
tions are counted, but their entries in the offsets vector are never
|
|
set. Perl sometimes (but not always) sets its numerical variables from
|
|
inside negative assertions.
|
|
|
|
4. The following Perl escape sequences are not supported: \l, \u, \L,
|
|
\U, and \N when followed by a character name or Unicode value. (\N on
|
|
its own, matching a non-newline character, is supported.) In fact these
|
|
are implemented by Perl's general string-handling and are not part of
|
|
its pattern matching engine. If any of these are encountered by PCRE2,
|
|
an error is generated by default. However, if the PCRE2_ALT_BSUX option
|
|
is set, \U and \u are interpreted as ECMAScript interprets them.
|
|
|
|
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
|
is built with Unicode support. The properties that can be tested with
|
|
\p and \P are limited to the general category properties such as Lu and
|
|
Nd, script names such as Greek or Han, and the derived properties Any
|
|
and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
|
|
not; the Perl documentation says "Because Perl hides the need for the
|
|
user to understand the internal representation of Unicode characters,
|
|
there is no need to implement the somewhat messy concept of surro-
|
|
gates."
|
|
|
|
6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
|
|
acters in between are treated as literals. This is slightly different
|
|
from Perl in that $ and @ are also handled as literals inside the
|
|
quotes. In Perl, they cause variable interpolation (but of course PCRE2
|
|
does not have variables). Note the following examples:
|
|
|
|
  Pattern            PCRE2 matches     Perl matches

  \Qabc$xyz\E        abc$xyz           abc followed by the
                                         contents of $xyz
  \Qabc\$xyz\E       abc\$xyz          abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
|
|
|
|
The \Q...\E sequence is recognized both inside and outside character
|
|
classes.
|
|
|
|
7. Fairly obviously, PCRE2 does not support the (?{code}) and
|
|
(??{code}) constructions. However, there is support for recursive pat-
|
|
terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
|
|
the PCRE2 "callout" feature allows an external function to be called
|
|
during pattern matching. See the pcre2callout documentation for
|
|
details.
|
|
|
|
8. Subroutine calls (whether recursive or not) are treated as atomic
|
|
groups. Atomic recursion is like Python, but unlike Perl. Captured
|
|
values that are set outside a subroutine call can be referenced from
|
|
inside in PCRE2, but not in Perl. There is a discussion that explains
|
|
these differences in more detail in the section on recursion differ-
|
|
ences from Perl in the pcre2pattern page.
|
|
|
|
9. If any of the backtracking control verbs are used in a subpattern
|
|
that is called as a subroutine (whether or not recursively), their
|
|
effect is confined to that subpattern; it does not extend to the sur-
|
|
rounding pattern. This is not always the case in Perl. In particular,
|
|
if (*THEN) is present in a group that is called as a subroutine, its
|
|
action is limited to that group, even if the group does not contain any
|
|
| characters. Note that such subpatterns are processed as anchored at
|
|
the point where they are tested.
|
|
|
|
10. If a pattern contains more than one backtracking control verb, the
|
|
first one that is backtracked onto acts. For example, in the pattern
|
|
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
|
|
in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
|
|
it is the same as PCRE2, but there are examples where it differs.
|
|
|
|
11. Most backtracking verbs in assertions have their normal actions.
|
|
They are not confined to the assertion.
|
|
|
|
12. There are some differences that are concerned with the settings of
|
|
captured strings when part of a pattern is repeated. For example,
|
|
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
|
|
unset, but in PCRE2 it is set to "b".
|
|
|
|
13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
|
|
pattern names is not as general as Perl's. This is a consequence of the
|
|
fact that PCRE2 works internally just with numbers, using an external
|
|
table to translate between numbers and names. In particular, a pattern
|
|
such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
|
|
the same number but different names, is not supported, and causes an
|
|
error at compile time. If it were allowed, it would not be possible to
|
|
distinguish which parentheses matched, because both names map to cap-
|
|
turing subpattern number 1. To avoid this confusing situation, an error
|
|
is given at compile time.
|
|
|
|
14. Perl recognizes comments in some places that PCRE2 does not, for
|
|
example, between the ( and ? at the start of a subpattern. If the /x
|
|
modifier is set, Perl allows white space between ( and ? (though cur-
|
|
rent Perls warn that this is deprecated) but PCRE2 never does, even if
|
|
the PCRE2_EXTENDED option is set.
|
|
|
|
15. Perl, when in warning mode, gives warnings for character classes
|
|
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
|
|
als. PCRE2 has no warning features, so it gives an error in these cases
|
|
because they are almost certainly user mistakes.
|
|
|
|
16. In PCRE2, the upper/lower case character properties Lu and Ll are
|
|
not affected when case-independent matching is specified. For example,
|
|
\p{Lu} always matches an upper case letter. I think Perl has changed in
|
|
this respect; in the release at the time of writing (5.16), \p{Lu} and
|
|
\p{Ll} match all letters, regardless of case, when case independence is
|
|
specified.
|
|
|
|
17. PCRE2 provides some extensions to the Perl regular expression
|
|
facilities. Perl 5.10 includes new features that are not in earlier
|
|
versions of Perl, some of which (such as named parentheses) have been
|
|
in PCRE2 for some time. This list is with respect to Perl 5.10:
|
|
|
|
(a) Although lookbehind assertions in PCRE2 must match fixed length
|
|
strings, each alternative branch of a lookbehind assertion can match a
|
|
different length of string. Perl requires them all to have the same
|
|
length.
|
|
|
|
(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
|
|
$ meta-character matches only at the very end of the string.
|
|
|
|
(c) A backslash followed by a letter with no special meaning is
|
|
faulted. (Perl can be made to issue a warning.)
|
|
|
|
(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
|
|
fiers is inverted, that is, by default they are not greedy, but if fol-
|
|
lowed by a question mark they are.
|
|
|
|
(e) PCRE2_ANCHORED can be used at matching time to force a pattern to
|
|
be tried only at the first matching position in the subject string.
|
|
|
|
(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
|
PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl
|
|
equivalents.
|
|
|
|
(g) The \R escape sequence can be restricted to match only CR, LF, or
|
|
CRLF by the PCRE2_BSR_ANYCRLF option.
|
|
|
|
(h) The callout facility is PCRE2-specific.
|
|
|
|
(i) The partial matching facility is PCRE2-specific.
|
|
|
|
(j) The alternative matching function (pcre2_dfa_match()) matches in a
|
|
different way and is not Perl-compatible.
|
|
|
|
(k) PCRE2 recognizes some special sequences such as (*CR) at the start
|
|
of a pattern that set overall options that cannot be changed within the
|
|
pattern.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 15 March 2015
|
|
Copyright (c) 1997-2015 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 JUST-IN-TIME COMPILER SUPPORT
|
|
|
|
Just-in-time compiling is a heavyweight optimization that can greatly
|
|
speed up pattern matching. However, it comes at the cost of extra pro-
|
|
cessing before the match is performed, so it is of most benefit when
|
|
the same pattern is going to be matched many times. This does not nec-
|
|
essarily mean many calls of a matching function; if the pattern is not
|
|
anchored, matching attempts may take place many times at various posi-
|
|
tions in the subject, even for a single call. Therefore, if the subject
|
|
string is very long, it may still pay to use JIT even for one-off
|
|
matches. JIT support is available for all of the 8-bit, 16-bit and
|
|
32-bit PCRE2 libraries.
|
|
|
|
JIT support applies only to the traditional Perl-compatible matching
|
|
function. It does not apply when the DFA matching function is being
|
|
used. The code for this support was written by Zoltan Herczeg.
|
|
|
|
|
|
AVAILABILITY OF JIT SUPPORT
|
|
|
|
JIT support is an optional feature of PCRE2. The "configure" option
|
|
--enable-jit (or equivalent CMake option) must be set when PCRE2 is
|
|
built if you want to use JIT. The support is limited to the following
|
|
hardware platforms:
|
|
|
|
ARM 32-bit (v5, v7, and Thumb2)
|
|
ARM 64-bit
|
|
Intel x86 32-bit and 64-bit
|
|
MIPS 32-bit and 64-bit
|
|
Power PC 32-bit and 64-bit
|
|
SPARC 32-bit
|
|
|
|
If --enable-jit is set on an unsupported platform, compilation fails.
|
|
|
|
A program can tell if JIT support is available by calling pcre2_con-
|
|
fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is
|
|
available, and 0 otherwise. However, a simple program does not need to
|
|
check this in order to use JIT. The API is implemented in a way that
|
|
falls back to the interpretive code if JIT is not available. For pro-
|
|
grams that need the best possible performance, there is also a "fast
|
|
path" API that is JIT-specific.
|
|
|
|
|
|
SIMPLE USE OF JIT
|
|
|
|
To make use of the JIT support in the simplest way, all you have to do
|
|
is to call pcre2_jit_compile() after successfully compiling a pattern
|
|
with pcre2_compile(). This function has two arguments: the first is the
|
|
compiled pattern pointer that was returned by pcre2_compile(), and the
|
|
second is zero or more of the following option bits: PCRE2_JIT_COM-
|
|
PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
|
|
|
If JIT support is not available, a call to pcre2_jit_compile() does
|
|
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
|
|
pattern is passed to the JIT compiler, which turns it into machine code
|
|
that executes much faster than the normal interpretive code, but yields
|
|
exactly the same results. The returned value from pcre2_jit_compile()
|
|
is zero on success, or a negative error code.
|
|
|
|
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
|
|
plete matches. If you want to run partial matches using the PCRE2_PAR-
|
|
TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
|
|
set one or both of the other options as well as, or instead of
|
|
PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
|
|
for each of the three modes (normal, soft partial, hard partial). When
|
|
pcre2_match() is called, the appropriate code is run if it is avail-
|
|
able. Otherwise, the pattern is matched using interpretive code.
|
|
|
|
You can call pcre2_jit_compile() multiple times for the same compiled
|
|
pattern. It does nothing if it has previously compiled code for any of
|
|
the option bits. For example, you can call it once with PCRE2_JIT_COM-
|
|
PLETE and (perhaps later, when you find you need partial matching)
|
|
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
|
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
|
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
|
diately returns zero. This is an alternative way of testing whether JIT
|
|
is available.
|
|
|
|
At present, it is not possible to free JIT compiled code except when
|
|
the entire compiled pattern is freed by calling pcre2_code_free().
|
|
|
|
In some circumstances you may need to call additional functions. These
|
|
are described in the section entitled "Controlling the JIT stack"
|
|
below.
|
|
|
|
There are some pcre2_match() options that are not supported by JIT, and
|
|
there are also some pattern items that JIT cannot handle. Details are
|
|
given below. In both cases, matching automatically falls back to the
|
|
interpretive code. If you want to know whether JIT was actually used
|
|
for a particular match, you should arrange for a JIT callback function
|
|
to be set up as described in the section entitled "Controlling the JIT
|
|
stack" below, even if you do not need to supply a non-default JIT
|
|
stack. Such a callback function is called whenever JIT code is about to
|
|
be obeyed. If the match-time options are not right for JIT execution,
|
|
the callback function is not obeyed.
|
|
|
|
If the JIT compiler finds an unsupported item, no JIT data is gener-
|
|
ated. You can find out if JIT matching is available after compiling a
|
|
pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE
|
|
option. A non-zero result means that JIT compilation was successful. A
|
|
result of 0 means that JIT support is not available, or the pattern was
|
|
not processed by pcre2_jit_compile(), or the JIT compiler was not able
|
|
to handle the pattern.
|
|
|
|
|
|
UNSUPPORTED OPTIONS AND PATTERN ITEMS
|
|
|
|
The pcre2_match() options that are supported for JIT matching are
|
|
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
|
|
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
|
|
PCRE2_ANCHORED option is not supported at match time.
|
|
|
|
The only unsupported pattern items are \C (match a single data unit)
|
|
when running in a UTF mode, and a callout immediately before an asser-
|
|
tion condition in a conditional group.
|
|
|
|
|
|
RETURN VALUES FROM JIT MATCHING
|
|
|
|
When a pattern is matched using JIT matching, the return values are the
|
|
same as those given by the interpretive pcre2_match() code, with the
|
|
addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
|
|
that the memory used for the JIT stack was insufficient. See "Control-
|
|
ling the JIT stack" below for a discussion of JIT stack usage.
|
|
|
|
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
|
|
searching a very large pattern tree goes on for too long, as it is in
|
|
the same circumstance when JIT is not used, but the details of exactly
|
|
what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
|
|
code is never returned when JIT matching is used.
|
|
|
|
|
|
CONTROLLING THE JIT STACK
|
|
|
|
When the compiled JIT code runs, it needs a block of memory to use as a
|
|
stack. By default, it uses 32K on the machine stack. However, some
|
|
large or complicated patterns need more than this. The error
|
|
PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
|
|
Three functions are provided for managing blocks of memory for use as
|
|
JIT stacks. There is further discussion about the use of JIT stacks in
|
|
the section entitled "JIT stack FAQ" below.
|
|
|
|
The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
|
|
ments are a starting size, a maximum size, and a general context (for
|
|
memory allocation functions, or NULL for standard memory allocation).
|
|
It returns a pointer to an opaque structure of type pcre2_jit_stack, or
|
|
NULL if there is an error. The pcre2_jit_stack_free() function is used
|
|
to free a stack that is no longer needed. (For the technically minded:
|
|
the address space is allocated by mmap or VirtualAlloc.)
|
|
|
|
JIT uses far less memory for recursion than the interpretive code, and
|
|
a maximum stack size of 512K to 1M should be more than enough for any
|
|
pattern.
|
|
|
|
The pcre2_jit_stack_assign() function specifies which stack JIT code
|
|
should use. Its arguments are as follows:
|
|
|
|
pcre2_match_context *mcontext
|
|
pcre2_jit_callback callback
|
|
void *data
|
|
|
|
The first argument is a pointer to a match context. When this is subse-
|
|
quently passed to a matching function, its information determines which
|
|
JIT stack is used. There are three cases for the values of the other
|
|
two options:
|
|
|
|
(1) If callback is NULL and data is NULL, an internal 32K block
|
|
on the machine stack is used. This is the default when a match
|
|
context is created.
|
|
|
|
(2) If callback is NULL and data is not NULL, data must be
|
|
a pointer to a valid JIT stack, the result of calling
|
|
pcre2_jit_stack_create().
|
|
|
|
(3) If callback is not NULL, it must point to a function that is
|
|
called with data as an argument at the start of matching, in
|
|
order to set up a JIT stack. If the return from the callback
|
|
function is NULL, the internal 32K stack is used; otherwise the
|
|
return value must be a valid JIT stack, the result of calling
|
|
pcre2_jit_stack_create().
|
|
|
|
A callback function is obeyed whenever JIT code is about to be run; it
|
|
is not obeyed when pcre2_match() is called with options that are incom-
|
|
patible for JIT matching. A callback function can therefore be used to
|
|
determine whether a match operation was executed by JIT or by the
|
|
interpreter.
|
|
|
|
You may safely use the same JIT stack for more than one pattern (either
|
|
by assigning directly or by callback), as long as the patterns are
|
|
matched sequentially in the same thread. Currently, the only way to set
|
|
up non-sequential matches in one thread is to use callouts: if a call-
|
|
out function starts another match, that match must use a different JIT
|
|
stack to the one used for currently suspended match(es).
|
|
|
|
In a multithread application, if you do not specify a JIT stack, or if
|
|
you assign or pass back NULL from a callback, that is thread-safe,
|
|
because each thread has its own machine stack. However, if you assign
|
|
or pass back a non-NULL JIT stack, this must be a different stack for
|
|
each thread so that the application is thread-safe.
|
|
|
|
Strictly speaking, even more is allowed. You can assign the same non-
|
|
NULL stack to a match context that is used by any number of patterns,
|
|
as long as they are not used for matching by multiple threads at the
|
|
same time. For example, you could use the same stack in all compiled
|
|
patterns, with a global mutex in the callback to wait until the stack
|
|
is available for use. However, this is an inefficient solution, and not
|
|
recommended.
|
|
|
|
This is a suggestion for how a multithreaded program that needs to set
|
|
up non-default JIT stacks might operate:
|
|
|
|
During thread initialization
|
|
thread_local_var = pcre2_jit_stack_create(...)
|
|
|
|
During thread exit
|
|
pcre2_jit_stack_free(thread_local_var)
|
|
|
|
Use a one-line callback function
|
|
return thread_local_var
|
|
|
|
All the functions described in this section do nothing if JIT is not
|
|
available.
|
|
|
|
|
|
JIT STACK FAQ
|
|
|
|
(1) Why do we need JIT stacks?
|
|
|
|
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack is difficult on some
platforms. For example, on PowerPC the stack chain must be updated
every time the stack is extended. Although this is possible, the time
spent on updating decreases performance. So we do the recursion in
memory.
|
|
|
|
(2) Why don't we simply allocate blocks of memory with malloc()?
|
|
|
|
Modern operating systems have a nice feature: they can reserve an
|
|
address space instead of allocating memory. We can safely allocate mem-
|
|
ory pages inside this address space, so the stack could grow without
|
|
moving memory data (this is important because of pointers). Thus we can
|
|
allocate 1M address space, and use only a single memory page (usually
|
|
4K) if that is enough. However, we can still grow up to 1M anytime if
|
|
needed.
|
|
|
|
(3) Who "owns" a JIT stack?
|
|
|
|
The owner of the stack is the user program, not the JIT studied pattern
|
|
or anything else. The user program must ensure that if a stack is being
|
|
used by pcre2_match() (that is, it is assigned to a match context that
|
|
is passed to the pattern currently running), that stack must not be
|
|
used by any other threads (to avoid overwriting the same memory area).
|
|
The best practice for multithreaded programs is to allocate a stack for
|
|
each thread, and return this stack through the JIT callback function.
|
|
|
|
(4) When should a JIT stack be freed?
|
|
|
|
You can free a JIT stack at any time, as long as it will not be used by
|
|
pcre2_match() again. When you assign the stack to a match context, only
|
|
a pointer is set. There is no reference counting or any other magic.
|
|
You can free compiled patterns, contexts, and stacks in any order, any-
|
|
time. Just do not call pcre2_match() with a match context pointing to
|
|
an already freed stack, as that will cause a segmentation fault. (Also, do
not free a stack currently used by pcre2_match() in another thread.) You can
|
|
also replace the stack in a context at any time when it is not in use.
|
|
You should free the previous stack before assigning a replacement.
|
|
|
|
(5) Should I allocate/free a stack every time before/after calling
|
|
pcre2_match()?
|
|
|
|
No, because this is too costly in terms of resources. However, you
|
|
could implement some clever scheme that releases the stack if it has not
been used for, let's say, two minutes. The JIT callback can help to achieve
|
|
this without keeping a list of patterns.
|
|
|
|
(6) OK, the stack is for long term memory allocation. But what happens
|
|
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
|
|
until the stack is freed?
|
|
|
|
Especially on embedded systems, it might be a good idea to release memory
sometimes without freeing the stack. There is no API for this at the
moment. A function that returns the amount of memory currently allocated
for a stack, and another that releases memory (shrinking the stack),
would probably be a good idea if someone needs this.
|
|
|
|
(7) This is too much of a headache. Isn't there any better solution for
|
|
JIT stack handling?
|
|
|
|
No, thanks to Windows. If POSIX threads were used everywhere, we could
|
|
throw out this complicated API.
|
|
|
|
|
|
FREEING JIT SPECULATIVE MEMORY
|
|
|
|
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
|
|
|
The JIT executable allocator does not free all memory as soon as it can.
It expects new allocations, and keeps some free memory around to
|
|
improve allocation speed. However, in low memory conditions, it might
|
|
be better to free all possible memory. You can cause this to happen by
|
|
calling pcre2_jit_free_unused_memory(). Its argument is a general con-
|
|
text, for custom memory management, or NULL for standard memory manage-
|
|
ment.
|
|
|
|
|
|
EXAMPLE CODE
|
|
|
|
This is a single-threaded example that specifies a JIT stack without
|
|
using a callback. A real program should include error checking after
|
|
all the function calls.
|
|
|
|
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

int rc;
int errornumber;
PCRE2_SIZE erroffset;
pcre2_code *re;
pcre2_match_data *match_data;
pcre2_match_context *mcontext;
pcre2_jit_stack *jit_stack;

/* pattern, subject, and length are assumed to be set up elsewhere */
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
  &errornumber, &erroffset, NULL);
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
mcontext = pcre2_match_context_create(NULL);
/* JIT stack that starts at 32K and can grow to 512K */
jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
match_data = pcre2_match_data_create(re, 10);
rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
/* Process result */

pcre2_code_free(re);
pcre2_match_data_free(match_data);
pcre2_match_context_free(mcontext);
pcre2_jit_stack_free(jit_stack);
|
|
|
|
|
|
JIT FAST PATH API
|
|
|
|
Because the API described above falls back to interpreted matching when
|
|
JIT is not available, it is convenient for programs that are written
|
|
for general use in many environments. However, calling JIT via
|
|
pcre2_match() does have a performance impact. Programs that are written
|
|
for use where JIT is known to be available, and which need the best
|
|
possible performance, can instead use a "fast path" API to call JIT
|
|
matching directly instead of calling pcre2_match() (obviously only for
|
|
patterns that have been successfully processed by pcre2_jit_compile()).
|
|
|
|
The fast path function is called pcre2_jit_match(), and it takes
|
|
exactly the same arguments as pcre2_match(). The return values are also
|
|
the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
|
|
complete) is requested that was not compiled. Unsupported option bits
|
|
(for example, PCRE2_ANCHORED) are ignored.
|
|
|
|
When you call pcre2_match(), as well as testing for invalid options, a
|
|
number of other sanity checks are performed on the arguments. For exam-
|
|
ple, if the subject pointer is NULL, an immediate error is given. Also,
|
|
unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
|
|
validity. In the interests of speed, these checks do not happen on the
|
|
JIT fast path, and if invalid data is passed, the result is undefined.
|
|
|
|
Bypassing the sanity checks and the pcre2_match() wrapping can give
|
|
speedups of more than 10%.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2api(3)
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel (FAQ by Zoltan Herczeg)
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 28 July 2015
|
|
Copyright (c) 1997-2015 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SIZE AND OTHER LIMITATIONS
|
|
|
|
There are some size limitations in PCRE2 but it is hoped that they will
|
|
never in practice be relevant.
|
|
|
|
The maximum size of a compiled pattern is approximately 64K code units
|
|
for the 8-bit and 16-bit libraries if PCRE2 is compiled with the
|
|
default internal linkage size, which is 2 bytes for these libraries. If
|
|
you want to process regular expressions that are truly enormous, you
|
|
can compile PCRE2 with an internal linkage size of 3 or 4 (when build-
|
|
ing the 16-bit library, 3 is rounded up to 4). See the README file in
|
|
the source distribution and the pcre2build documentation for details.
|
|
In these cases the limit is substantially larger. However, the speed
|
|
of execution is slower. In the 32-bit library, the internal linkage
|
|
size is always 4.
|
|
|
|
The maximum length (in code units) of a subject string is one less than
|
|
the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an
|
|
unsigned integer type, usually defined as size_t. Its maximum value
|
|
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
|
|
terminated strings and unset offsets.
|
|
|
|
Note that when using the traditional matching function, PCRE2 uses
|
|
recursion to handle subpatterns and indefinite repetition. This means
|
|
that the available stack space may limit the size of a subject string
|
|
that can be processed by certain patterns. For a discussion of stack
|
|
issues, see the pcre2stack documentation.
|
|
|
|
All values in repeating quantifiers must be less than 65536.
|
|
|
|
There is no limit to the number of parenthesized subpatterns, but there
|
|
can be no more than 65535 capturing subpatterns. There is, however, a
|
|
limit to the depth of nesting of parenthesized subpatterns of all
|
|
kinds. This is imposed in order to limit the amount of system stack
|
|
used at compile time. The limit can be specified when PCRE2 is built;
|
|
the default is 250.
|
|
|
|
There is a limit to the number of forward references to subsequent sub-
|
|
patterns of around 200,000. Repeated forward references with fixed
|
|
upper limits, for example, (?2){0,100} when subpattern number 2 is to
|
|
the right, are included in the count. There is no limit to the number
|
|
of backward references.
|
|
|
|
The maximum length of a name for a named subpattern is 32 code units, and
|
|
the maximum number of named subpatterns is 10000.
|
|
|
|
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
|
|
(*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and
|
|
32-bit libraries.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 25 November 2014
|
|
Copyright (c) 1997-2014 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 MATCHING ALGORITHMS
|
|
|
|
This document describes the two different algorithms that are available
|
|
in PCRE2 for matching a compiled regular expression against a given
|
|
subject string. The "standard" algorithm is the one provided by the
|
|
pcre2_match() function. This works in the same way as Perl's matching
function, and provides a Perl-compatible matching operation. The
just-in-time (JIT) optimization that is described in the pcre2jit
documentation is compatible with this function.
|
|
|
|
An alternative algorithm is provided by the pcre2_dfa_match() function;
|
|
it operates in a different way, and is not Perl-compatible. This alter-
|
|
native has advantages and disadvantages compared with the standard
|
|
algorithm, and these are described below.
|
|
|
|
When there is only one possible way in which a given subject string can
|
|
match a pattern, the two algorithms give the same answer. A difference
|
|
arises, however, when there are multiple possibilities. For example, if
|
|
the pattern
|
|
|
|
^<.*>
|
|
|
|
is matched against the string
|
|
|
|
<something> <something else> <something further>
|
|
|
|
there are three possible answers. The standard algorithm finds only one
|
|
of them, whereas the alternative algorithm finds all three.
|
|
|
|
|
|
REGULAR EXPRESSIONS AS TREES
|
|
|
|
The set of strings that are matched by a regular expression can be rep-
|
|
resented as a tree structure. An unlimited repetition in the pattern
|
|
makes the tree of infinite size, but it is still a tree. Matching the
|
|
pattern to a given subject string (from a given starting point) can be
|
|
thought of as a search of the tree. There are two ways to search a
|
|
tree: depth-first and breadth-first, and these correspond to the two
|
|
matching algorithms provided by PCRE2.
|
|
|
|
|
|
THE STANDARD MATCHING ALGORITHM
|
|
|
|
In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
|
|
sions", the standard algorithm is an "NFA algorithm". It conducts a
|
|
depth-first search of the pattern tree. That is, it proceeds along a
|
|
single path through the tree, checking that the subject matches what is
|
|
required. When there is a mismatch, the algorithm tries any alterna-
|
|
tives at the current point, and if they all fail, it backs up to the
|
|
previous branch point in the tree, and tries the next alternative
|
|
branch at that level. This often involves backing up (moving to the
|
|
left) in the subject string as well. The order in which repetition
|
|
branches are tried is controlled by the greedy or ungreedy nature of
|
|
the quantifier.
|
|
|
|
If a leaf node is reached, a matching string has been found, and at
|
|
that point the algorithm stops. Thus, if there is more than one possi-
|
|
ble match, this algorithm returns the first one that it finds. Whether
|
|
this is the shortest, the longest, or some intermediate length depends
|
|
on the way the greedy and ungreedy repetition quantifiers are specified
|
|
in the pattern.
|
|
|
|
Because it ends up with a single path through the tree, it is rela-
|
|
tively straightforward for this algorithm to keep track of the sub-
|
|
strings that are matched by portions of the pattern in parentheses.
|
|
This provides support for capturing parentheses and back references.
|
|
|
|
|
|
THE ALTERNATIVE MATCHING ALGORITHM
|
|
|
|
This algorithm conducts a breadth-first search of the tree. Starting
|
|
from the first matching point in the subject, it scans the subject
|
|
string from left to right, once, character by character, and as it does
|
|
this, it remembers all the paths through the tree that represent valid
|
|
matches. In Friedl's terminology, this is a kind of "DFA algorithm",
|
|
though it is not implemented as a traditional finite state machine (it
|
|
keeps multiple states active simultaneously).
|
|
|
|
Although the general principle of this matching algorithm is that it
|
|
scans the subject string only once, without backtracking, there is one
|
|
exception: when a lookaround assertion is encountered, the characters
|
|
following or preceding the current point have to be independently
|
|
inspected.
|
|
|
|
The scan continues until either the end of the subject is reached, or
|
|
there are no more unterminated paths. At this point, terminated paths
|
|
represent the different matching possibilities (if there are none, the
|
|
match has failed). Thus, if there is more than one possible match,
|
|
this algorithm finds all of them, and in particular, it finds the long-
|
|
est. The matches are returned in decreasing order of length. There is
|
|
an option to stop the algorithm after the first match (which is neces-
|
|
sarily the shortest) is found.
|
|
|
|
Note that all the matches that are found start at the same point in the
|
|
subject. If the pattern
|
|
|
|
cat(er(pillar)?)?
|
|
|
|
is matched against the string "the caterpillar catchment", the result
|
|
is the three strings "caterpillar", "cater", and "cat" that start at
|
|
the fifth character of the subject. The algorithm does not automati-
|
|
cally move on to find matches that start at later positions.
|
|
|
|
PCRE2's "auto-possessification" optimization usually applies to charac-
|
|
ter repeats at the end of a pattern (as well as internally). For exam-
|
|
ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
|
|
is no point even considering the possibility of backtracking into the
|
|
repeated digits. For DFA matching, this means that only one possible
|
|
match is found. If you really do want multiple matches in such cases,
|
|
either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
|
|
SESS option when compiling.
|
|
|
|
There are a number of features of PCRE2 regular expressions that are
|
|
not supported by the alternative matching algorithm. They are as fol-
|
|
lows:
|
|
|
|
1. Because the algorithm finds all possible matches, the greedy or
|
|
ungreedy nature of repetition quantifiers is not relevant (though it
|
|
may affect auto-possessification, as just described). During matching,
|
|
greedy and ungreedy quantifiers are treated in exactly the same way.
|
|
However, possessive quantifiers can make a difference when what follows
|
|
could also match what is quantified, for example in a pattern like
|
|
this:
|
|
|
|
^a++\w!
|
|
|
|
This pattern matches "aaab!" but not "aaa!", which would be matched by
|
|
a non-possessive quantifier. Similarly, if an atomic group is present,
|
|
it is matched as if it were a standalone pattern at the current point,
|
|
and the longest match is then "locked in" for the rest of the overall
|
|
pattern.
|
|
|
|
2. When dealing with multiple paths through the tree simultaneously, it
|
|
is not straightforward to keep track of captured substrings for the
|
|
different matching possibilities, and PCRE2's implementation of this
|
|
algorithm does not attempt to do this. This means that no captured sub-
|
|
strings are available.
|
|
|
|
3. Because no substrings are captured, back references within the pat-
|
|
tern are not supported, and cause errors if encountered.
|
|
|
|
4. For the same reason, conditional expressions that use a backrefer-
|
|
ence as the condition or test for a specific group recursion are not
|
|
supported.
|
|
|
|
5. Because many paths through the tree may be active, the \K escape
|
|
sequence, which resets the start of the match when encountered (but may
|
|
be on some paths and not on others), is not supported. It causes an
|
|
error if encountered.
|
|
|
|
6. Callouts are supported, but the value of the capture_top field is
|
|
always 1, and the value of the capture_last field is always 0.
|
|
|
|
7. The \C escape sequence, which (in the standard algorithm) always
|
|
matches a single code unit, even in a UTF mode, is not supported in
|
|
these modes, because the alternative algorithm moves through the sub-
|
|
ject string one character (not code unit) at a time, for all active
|
|
paths through the tree.
|
|
|
|
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
|
|
are not supported. (*FAIL) is supported, and behaves like a failing
|
|
negative assertion.
|
|
|
|
|
|
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|
|
|
Using the alternative matching algorithm provides the following advan-
|
|
tages:
|
|
|
|
1. All possible matches (at a single point in the subject) are automat-
|
|
ically found, and in particular, the longest match is found. To find
|
|
more than one match using the standard algorithm, you have to do kludgy
|
|
things with callouts.
|
|
|
|
2. Because the alternative algorithm scans the subject string just
|
|
once, and never needs to backtrack (except for lookbehinds), it is pos-
|
|
sible to pass very long subject strings to the matching function in
|
|
several pieces, checking for partial matching each time. Although it is
|
|
also possible to do multi-segment matching using the standard algo-
|
|
rithm, by retaining partially matched substrings, it is more compli-
|
|
cated. The pcre2partial documentation gives details of partial matching
|
|
and discusses multi-segment matching.
|
|
|
|
|
|
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|
|
|
The alternative algorithm suffers from a number of disadvantages:
|
|
|
|
1. It is substantially slower than the standard algorithm. This is
|
|
partly because it has to search for all possible matches, but is also
|
|
because it is less susceptible to optimization.
|
|
|
|
2. Capturing parentheses and back references are not supported.
|
|
|
|
3. Although atomic groups are supported, their use does not provide the
|
|
performance advantage that it does for the standard algorithm.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 29 September 2014
|
|
Copyright (c) 1997-2014 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PARTIAL MATCHING IN PCRE2
|
|
|
|
In normal use of PCRE2, if the subject string that is passed to a
|
|
matching function matches as far as it goes, but is too short to match
|
|
the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
|
|
stances where it might be helpful to distinguish this case from other
|
|
cases in which there is no match.
|
|
|
|
Consider, for example, an application where a human is required to type
|
|
in data for a field with specific formatting requirements. An example
|
|
might be a date in the form ddmmmyy, defined by this pattern:
|
|
|
|
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
|
|
|
|
If the application sees the user's keystrokes one by one, and can check
|
|
that what has been typed so far is potentially valid, it is able to
|
|
raise an error as soon as a mistake is made, by beeping and not
|
|
reflecting the character that has been typed, for example. This immedi-
|
|
ate feedback is likely to be a better user interface than a check that
|
|
is delayed until the entire string has been entered. Partial matching
|
|
can also be useful when the subject string is very long and is not all
|
|
available at once.
|
|
|
|
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
|
|
PCRE2_PARTIAL_HARD options, which can be set when calling a matching
|
|
function. The difference between the two options is whether or not a
|
|
partial match is preferred to an alternative complete match, though the
|
|
details differ between the two types of matching function. If both
|
|
options are set, PCRE2_PARTIAL_HARD takes precedence.
|
|
|
|
If you want to use partial matching with just-in-time optimized code,
|
|
you must call pcre2_jit_compile() with one or both of these options:
|
|
|
|
PCRE2_JIT_PARTIAL_SOFT
|
|
PCRE2_JIT_PARTIAL_HARD
|
|
|
|
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
|
tial matches on the same pattern. If the appropriate JIT mode has not
|
|
been compiled, interpretive matching code is used.
|
|
|
|
Setting a partial matching option disables two of PCRE2's standard
|
|
optimizations. PCRE2 remembers the last literal code unit in a pattern,
|
|
and abandons matching immediately if it is not present in the subject
|
|
string. This optimization cannot be used for a subject string that
|
|
might match only partially. PCRE2 also knows the minimum length of a
|
|
matching string, and does not bother to run the matching function on
|
|
shorter strings. This optimization is also disabled for partial match-
|
|
ing.
|
|
|
|
|
|
PARTIAL MATCHING USING pcre2_match()
|
|
|
|
A partial match occurs during a call to pcre2_match() when the end of
|
|
the subject string is reached successfully, but matching cannot con-
|
|
tinue because more characters are needed. However, at least one charac-
|
|
ter in the subject must have been inspected. This character need not
|
|
form part of the final matched string; lookbehind assertions and the \K
|
|
escape sequence provide ways of inspecting characters before the start
|
|
of a matched string. The requirement for inspecting at least one char-
|
|
acter exists because an empty string can always be matched; without
|
|
such a restriction there would always be a partial match of an empty
|
|
string at the end of the subject.
|
|
|
|
When a partial match is returned, the first two elements in the ovector
|
|
point to the portion of the subject that was matched, but the values in
|
|
the rest of the ovector are undefined. The appearance of \K in the pat-
|
|
tern has no effect for a partial match. Consider this pattern:
|
|
|
|
/abc\K123/
|
|
|
|
If it is matched against "456abc123xyz" the result is a complete match,
|
|
and the ovector defines the matched string as "123", because \K resets
|
|
the "start of match" point. However, if a partial match is requested
|
|
and the subject string is "456abc12", a partial match is found for the
|
|
string "abc12", because all these characters are needed for a subse-
|
|
quent re-match with additional characters.
|
|
|
|
What happens when a partial match is identified depends on which of the
|
|
two partial matching options is set.
|
|
|
|
PCRE2_PARTIAL_SOFT WITH pcre2_match()
|
|
|
|
If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
|
|
match, the partial match is remembered, but matching continues as nor-
|
|
mal, and other alternatives in the pattern are tried. If no complete
|
|
match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
|
PCRE2_ERROR_NOMATCH.
|
|
|
|
This option is "soft" because it prefers a complete match over a par-
|
|
tial match. All the various matching items in a pattern behave as if
|
|
the subject string is potentially complete. For example, \z, \Z, and $
|
|
match at the end of the subject, as normal, and for \b and \B the end
|
|
of the subject is treated as a non-alphanumeric.
|
|
|
|
If there is more than one partial match, the first one that was found
|
|
provides the data that is returned. Consider this pattern:
|
|
|
|
/123\w+X|dogY/
|
|
|
|
If this is matched against the subject string "abc123dog", both alter-
|
|
natives fail to match, but the end of the subject is reached during
|
|
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
|
|
and 9, identifying "123dog" as the first partial match that was found.
|
|
(In this example, there are two partial matches, because "dog" on its
|
|
own partially matches the second alternative.)
|
|
|
|
PCRE2_PARTIAL_HARD WITH pcre2_match()
|
|
|
|
If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
|
|
returned as soon as a partial match is found, without continuing to
|
|
search for possible complete matches. This option is "hard" because it
|
|
prefers an earlier partial match over a later complete match. For this
|
|
reason, the assumption is made that the end of the supplied subject
|
|
string may not be the true end of the available data, and so, if \z,
|
|
\Z, \b, \B, or $ are encountered at the end of the subject, the result
|
|
is PCRE2_ERROR_PARTIAL, provided that at least one character in the
|
|
subject has been inspected.
|
|
|
|
Comparing hard and soft partial matching
|
|
|
|
The difference between the two partial matching options can be illus-
|
|
trated by a pattern such as:
|
|
|
|
/dog(sbody)?/
|
|
|
|
This matches either "dog" or "dogsbody", greedily (that is, it prefers
|
|
the longer string if possible). If it is matched against the string
|
|
"dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
|
|
However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
|
|
TIAL. On the other hand, if the pattern is made ungreedy the result is
|
|
different:
|
|
|
|
/dog(sbody)??/
|
|
|
|
In this case the result is always a complete match because that is
|
|
found first, and matching never continues after finding a complete
|
|
match. It might be easier to follow this explanation by thinking of the
|
|
two patterns like this:
|
|
|
|
/dog(sbody)?/ is the same as /dogsbody|dog/
|
|
/dog(sbody)??/ is the same as /dog|dogsbody/
|
|
|
|
The second pattern will never match "dogsbody", because it will always
|
|
find the shorter match first.
|
|
|
|
|
|
PARTIAL MATCHING USING pcre2_dfa_match()
|
|
|
|
The DFA functions move along the subject string character by character,
|
|
without backtracking, searching for all possible matches simultane-
|
|
ously. If the end of the subject is reached before the end of the pat-
|
|
tern, there is the possibility of a partial match, again provided that
|
|
at least one character has been inspected.
|
|
|
|
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
|
|
there have been no complete matches. Otherwise, the complete matches
|
|
are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
|
|
takes precedence over any complete matches. The portion of the string
|
|
that was matched when the longest partial match was found is set as the
|
|
first matching string.
|
|
|
|
Because the DFA functions always search for all possible matches, and
|
|
there is no difference between greedy and ungreedy repetition, their
|
|
behaviour is different from the standard functions when PCRE2_PAR-
|
|
TIAL_HARD is set. Consider the string "dog" matched against the
|
|
ungreedy pattern shown above:
|
|
|
|
/dog(sbody)??/
|
|
|
|
Whereas the standard function stops as soon as it finds the complete
|
|
match for "dog", the DFA function also finds the partial match for
|
|
"dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
|
|
|
|
|
|
PARTIAL MATCHING AND WORD BOUNDARIES
|
|
|
|
If a pattern ends with one of the sequences \b or \B, which test for word
|
|
boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
|
|
intuitive results. Consider this pattern:
|
|
|
|
/\bcat\b/
|
|
|
|
This matches "cat", provided there is a word boundary at either end. If
|
|
the subject string is "the cat", the comparison of the final "t" with a
|
|
following character cannot take place, so a partial match is found.
|
|
However, normal matching carries on, and \b matches at the end of the
|
|
subject when the last character is a letter, so a complete match is
|
|
found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
|
|
PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
|
|
then the partial match takes precedence.
|
|
|
|
|
|
EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
|
|
|
|
If the partial_soft (or ps) modifier is present on a pcre2test data
|
|
line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
|
|
run of pcre2test that uses the date example quoted above:
|
|
|
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
data> 25jun04\=ps
|
|
0: 25jun04
|
|
1: jun
|
|
data> 23dec3\=ps
|
|
Partial match: 23dec3
|
|
data> 3ju\=ps
|
|
Partial match: 3ju
|
|
data> 3juj\=ps
|
|
No match
|
|
data> j\=ps
|
|
No match
|
|
|
|
The first data string is matched completely, so pcre2test shows the
|
|
matched substrings. The remaining four strings do not match the com-
|
|
plete pattern, but the first two are partial matches. Similar output is
|
|
obtained if DFA matching is used.
|
|
|
|
If the partial_hard (or ph) modifier is present on a pcre2test data
|
|
line, the PCRE2_PARTIAL_HARD option is set for the match.
|
|
|
|
|
|
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
|
|
|
When a partial match has been found using a DFA matching function, it
|
|
is possible to continue the match by providing additional subject data
|
|
and calling the function again with the same compiled regular expres-
|
|
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
|
|
same working space as before, because this is where details of the pre-
|
|
vious partial match are stored. Here is an example using pcre2test:
|
|
|
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
data> 23ja\=dfa,ps
|
|
Partial match: 23ja
|
|
data> n05\=dfa,dfa_restart
|
|
0: n05
|
|
|
|
The first call has "23ja" as the subject, and requests partial match-
|
|
ing; the second call has "n05" as the subject for the continued
|
|
(restarted) match. Notice that when the match is complete, only the
|
|
last part is shown; PCRE2 does not retain the previously partially-
|
|
matched string. It is up to the calling program to do that if it needs
|
|
to.
|
|
|
|
That means that, for an unanchored pattern, if a continued match fails,
|
|
it is not possible to try again at a new starting point. All this
|
|
facility is capable of doing is continuing with the previous match
|
|
attempt. In the previous example, if the second set of data is "ug23"
|
|
the result is no match, even though there would be a match for "aug23"
|
|
if the entire string were given at once. Depending on the application,
|
|
this may or may not be what you want. The only way to allow for start-
|
|
ing again at the next character is to retain the matched part of the
|
|
subject and try a new complete match.
|
|
|
|
You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
|
|
PCRE2_DFA_RESTART to continue partial matching over multiple segments.
|
|
This facility can be used to pass very long subject strings to the DFA
|
|
matching functions.
|
|
|
|
|
|
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|
|
|
Unlike with the DFA function, it is not possible to restart the previous
|
|
match with a new segment of data when using pcre2_match(). Instead, new
|
|
data must be added to the previous subject string, and the entire match
|
|
re-run, starting from the point where the partial match occurred. Ear-
|
|
lier data can be discarded.
|
|
|
|
It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
|
|
not treat the end of a segment as the end of the subject when matching
|
|
\z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
|
|
dates:
|
|
|
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
|
data> The date is 23ja\=ph
|
|
Partial match: 23ja
|
|
|
|
At this stage, an application could discard the text preceding "23ja",
|
|
add on text from the next segment, and call the matching function
|
|
again. Unlike the DFA matching function, the entire matching string
|
|
must always be available, and the complete matching process occurs for
|
|
each call, so more memory and more processing time is needed.
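
The buffer handling just described can be sketched in C as follows. This
is a minimal illustration, not PCRE2 code: carry_over() and its
parameters are hypothetical names, the subject is assumed to live in a
caller-owned buffer of known capacity, and partial_start stands for the
offset at which the partial match began (ovector[0] after a partial
match). The matching call itself is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. Discards the text that
   precedes a partial match and appends the next segment.

   buf            subject buffer, currently holding buf_len bytes
   cap            total capacity of buf
   partial_start  offset at which the partial match began
   seg, seg_len   next segment (must not overlap buf)

   Returns the new subject length, or (size_t)-1 if the arguments are
   inconsistent or the result would not fit. */
size_t carry_over(char *buf, size_t buf_len, size_t cap,
                  size_t partial_start, const char *seg, size_t seg_len)
{
  size_t kept;

  if (buf_len > cap || partial_start > buf_len) return (size_t)-1;
  kept = buf_len - partial_start;
  if (seg_len > cap - kept) return (size_t)-1;

  memmove(buf, buf + partial_start, kept);  /* keep the partial match */
  memcpy(buf + kept, seg, seg_len);         /* add the new segment */
  return kept + seg_len;
}
```

With the example above, the 16-byte subject "The date is 23ja" and a
partial match at offset 12 leave "23ja" at the start of the buffer,
followed by the new segment. The whole buffer is then passed to the
matching function again.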


ISSUES WITH MULTI-SEGMENT MATCHING

Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.

1. If the pattern contains a test for the beginning of a line, you need
to pass the PCRE2_NOTBOL option when the subject string for any call
does not start at the beginning of a line. There is also a PCRE2_NOTEOL
option, but in practice when doing multi-segment matching you should be
using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

2. If a pattern contains a lookbehind assertion, characters that precede
the start of the partial match may have been inspected during the
matching process. When using pcre2_match(), sufficient characters must
be retained for the next match attempt. You can ensure that enough
characters are retained by doing the following:

Before doing any matching, find the length of the longest lookbehind in
the pattern by calling pcre2_pattern_info() with the
PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
characters, not code units. After a partial match, moving back from the
ovector[0] offset in the subject by the number of characters given for
the maximum lookbehind gets you to the earliest character that must be
retained. In a non-UTF or a 32-bit situation, moving back is just a
subtraction, but in UTF-8 or UTF-16 you have to count characters while
moving back through the code units.

Characters before the point you have now reached can be discarded, and
after the next segment has been added to what is retained, you should
run the next match with the startoffset argument set so that the match
begins at the same point as before.

For example, if the pattern "(?<=123)abc" is partially matched against
the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The
maximum lookbehind count is 3, so all characters before offset 2 can be
discarded. The value of startoffset for the next match should be 3.
When pcre2test displays a partial match, it indicates the lookbehind
characters with '<' characters:

    re> "(?<=123)abc"
  data> xx123ab\=ph
  Partial match: 123ab
                 <<<

3. Because a partial match must always contain at least one character,
what might be considered a partial match of an empty string actually
gives a "no match" result. For example:

    re> /c(?<=abc)x/
  data> ab\=ps
  No match

If the next segment begins "cx", a match should be found, but this will
only happen if characters from the previous segment are retained. For
this reason, a "no match" result should be interpreted as "partial
match of an empty string" when the pattern contains lookbehinds.

4. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE2_PARTIAL_SOFT is used. The section
"Partial Matching and Word Boundaries" above describes an issue that
arises if the pattern ends with \b or \B. Another kind of difference
may occur when there are multiple matching possibilities, because (for
PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer
possible. Consider this pcre2test example:

    re> /dog(sbody)?/
  data> dogsb\=ps
   0: dog
  data> do\=ps,dfa
  Partial match: do
  data> gsb\=ps,dfa,dfa_restart
   0: g
  data> dogsbody\=dfa
   0: dogsbody
   1: dog

The first data line passes the string "dogsb" to a standard matching
function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
because the shorter string "dog" is a complete match. Similarly, when
the subject is presented to a DFA matching function in several parts
("do" and "gsb" being the first two) the match stops when "dog" has
been found, and it is not possible to continue. On the other hand, if
"dogsbody" is presented as a single string, a DFA matching function
finds both matches.

Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
matching multi-segment data. The example above then behaves
differently:

    re> /dog(sbody)?/
  data> dogsb\=ph
  Partial match: dogsb
  data> do\=ps,dfa
  Partial match: do
  data> gsb\=ph,dfa,dfa_restart
  Partial match: gsb

5. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE2_DFA_RESTART is used. For example, consider this pattern:

  1234|3789

If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
match at one point in the subject are remembered. The problem arises
because the start of the second alternative matches within the first
alternative. There is no problem with anchored patterns or patterns
such as:

  1234|ABCD

where no string can be a partial match for both alternatives. This is
not a problem if a standard matching function is used, because the
entire match has to be rerun each time:

    re> /1234|3789/
  data> ABC123\=ph
  Partial match: 123
  data> 1237890
   0: 3789

Of course, instead of using PCRE2_DFA_RESTART, the same technique of
re-running the entire match can also be used with the DFA matching
function. Another possibility is to work with two buffers. If a partial
match at offset n in the first buffer is followed by "no match" when
PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
match starting at offset n+1 in the first buffer.
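
The retention rule in item 2 above can be sketched as follows for the
8-bit library in UTF mode. The function name retain_from() is
hypothetical; max_lookbehind is assumed to be the value obtained from
pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND, match_start is
assumed to be the ovector[0] offset of the partial match, and the
subject is assumed to be valid UTF-8. In a non-UTF situation the loop
reduces to a subtraction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Moves back max_lookbehind
   characters from match_start in a valid UTF-8 subject and returns the
   offset of the earliest code unit that must be retained. The result
   is 0 if there are fewer characters than that before match_start. */
size_t retain_from(const unsigned char *subject, size_t match_start,
                   uint32_t max_lookbehind)
{
  size_t pos = match_start;

  while (max_lookbehind > 0 && pos > 0)
    {
    pos--;                                   /* back one code unit */
    while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
      pos--;                                 /* skip continuation bytes */
    max_lookbehind--;                        /* one character counted */
    }
  return pos;
}
```

For the "(?<=123)abc" example, the subject "xx123ab" with a match start
of 5 and a maximum lookbehind of 3 gives 2. Everything before offset 2
can be discarded, and the new startoffset is the old match start minus
that value, that is, 3.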


AUTHOR

Philip Hazel
University Computing Service
Cambridge, England.


REVISION

Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)



NAME
PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

When PCRE2 is built with Unicode support (which is the default), it has
knowledge of Unicode character properties and can process text strings
in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
However, by default, PCRE2 assumes that one code unit is one character.
To process a pattern as a UTF string, where a character may require
more than one code unit, you must call pcre2_compile() with the
PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any
subject strings that are matched against it are treated as UTF strings
instead of strings of individual one-code-unit characters.

If you do not need Unicode support you can build PCRE2 without it, in
which case the library will be smaller.


UNICODE PROPERTY SUPPORT

When PCRE2 is built with Unicode support, the escape sequences \p{..},
\P{..}, and \X can be used. The Unicode properties that can be tested
are limited to the general category properties such as Lu for an upper
case letter or Nd for a decimal number, the Unicode script names such
as Arabic or Han, and the derived properties Any and L&. Full lists are
given in the pcre2pattern and pcre2syntax documentation. Only the short
names for properties are supported. For example, \p{L} matches a
letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

Code points less than 256 can be specified in patterns by either braced
or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
Larger values have to use braced sequences. Unbraced octal code points
up to \777 are also recognized; larger ones can be coded using \o{...}.

In UTF modes, repeat quantifiers apply to complete UTF characters, not
to individual code units.

In UTF modes, the dot metacharacter matches one UTF character instead
of a single code unit.
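
The difference between characters and code units can be made concrete
with a short C sketch. It is an illustration only: utf8_char_count() is
a hypothetical helper, not a PCRE2 function, and it assumes that the
input is already valid UTF-8. It counts every code unit that is not a
continuation byte, which is the number of characters that a dot or a
quantified item would step over in UTF mode.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Counts the characters in a
   valid UTF-8 string of len code units (bytes). Continuation bytes
   have the form 0b10xxxxxx and do not start a character. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
  size_t i, count = 0;

  for (i = 0; i < len; i++)
    if ((s[i] & 0xc0) != 0x80) count++;
  return count;
}
```

For instance, the word "cafe" written with an e-acute (U+00E9) occupies
five code units in UTF-8 but is four characters long.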

The escape sequence \C can be used to match a single code unit, in a
UTF mode, but its use can lead to some strange effects because it
breaks up multi-unit characters (see the description of \C in the
pcre2pattern documentation). The use of \C is not supported in the
alternative matching function pcre2_dfa_match(), nor is it supported in
UTF mode by the JIT optimization. If JIT optimization is requested for
a UTF pattern that contains \C, it will not succeed, and so the
matching will be carried out by the normal interpretive function.

The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
characters of any code value, but, by default, the characters that
PCRE2 recognizes as digits, spaces, or word characters remain the same
set as in non-UTF mode, all with code points less than 256. This
remains true even when PCRE2 is built to include Unicode support,
because to do otherwise would slow down matching in many common cases.
Note that this also applies to \b and \B, because they are defined in
terms of \w and \W. If you want to test for a wider sense of, say,
"digit", you can use explicit Unicode property tests such as \p{Nd}.
Alternatively, if you set the PCRE2_UCP option, the way that the
character escapes work is changed so that Unicode properties are used to
determine which characters match. There are more details in the section
on generic character types in the pcre2pattern documentation.

Similarly, characters that match the POSIX named character classes are
all low-valued characters, unless the PCRE2_UCP option is set.

However, the special horizontal and vertical white space matching
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
characters, whether or not PCRE2_UCP is set.

Case-insensitive matching in UTF mode makes use of Unicode properties.
A few Unicode characters such as Greek sigma have more than two code
points that are case-equivalent, and these are treated as such.


VALIDITY OF UTF STRINGS

When the PCRE2_UTF option is set, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. If an invalid UTF string is passed, a negative error code
is returned. The code unit offset to the offending character can be
extracted from the match data block by calling pcre2_get_startchar(),
which is used for this purpose after a UTF error.

UTF-16 and UTF-32 strings can indicate their endianness by a special
code known as a byte-order mark (BOM). The PCRE2 functions do not handle
this, expecting strings to be in host byte order.
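
Because the BOM is left to the caller, an application that reads UTF-16
data of unknown endianness has to normalise it before matching. The
following C sketch shows one way of doing that. It is not PCRE2 code:
utf16_to_host_order() is a hypothetical name, and the sketch assumes
that a first code unit of 0xfffe can only be a byte-swapped BOM and
that data without a BOM is already in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Converts a UTF-16 buffer to
   host byte order in place, using a leading byte-order mark if there
   is one. Returns the number of leading code units (0 or 1) that the
   caller should skip so that the BOM is not part of the subject. */
size_t utf16_to_host_order(uint16_t *s, size_t len)
{
  size_t i;

  if (len == 0) return 0;
  if (s[0] == 0xfeff) return 1;      /* BOM reads correctly: host order */
  if (s[0] != 0xfffe) return 0;      /* no BOM: assume host order */

  for (i = 0; i < len; i++)          /* BOM reads swapped: swap all */
    s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
  return 1;
}
```

The subject passed to the matching function would then start at the
returned offset, with the length reduced by the same amount.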

A UTF string is checked before any other processing takes place. In the
case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
starting offset, the check is applied only to that part of the subject
that could be inspected during matching, and there is a check that the
starting offset points to the first code unit of a character or to the
end of the subject. If there are no lookbehind assertions in the
pattern, the check starts at the starting offset. Otherwise, it starts
at the length of the longest lookbehind before the starting offset, or
at the start of the subject if there are not that many characters before
the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.

In addition to checking the format of the string, there is a check to
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
the surrogate area. The so-called "non-character" code points are not
excluded because Unicode corrigendum #9 makes it clear that they should
not be.

Characters in the "Surrogate Area" of Unicode are reserved for use by
UTF-16, where they are used in pairs to encode code points with values
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 and UTF-32 encodings. (In
other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8 and UTF-32.)
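
The pairing arithmetic is defined by the UTF-16 encoding itself and can
be written out in a few lines of C. The function names below are
hypothetical and are not part of the PCRE2 API; the sketch assumes that
its arguments have already been checked to be a high surrogate followed
by a low surrogate, or a code point in the range 0x10000 to 0x10ffff.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of PCRE2. */

int is_high_surrogate(uint32_t u) { return u >= 0xd800 && u <= 0xdbff; }
int is_low_surrogate(uint32_t u)  { return u >= 0xdc00 && u <= 0xdfff; }

/* Combine a high and a low surrogate into one code point. Each half
   carries 10 bits of the value that is left after subtracting
   0x10000. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
  return 0x10000u + (((uint32_t)high - 0xd800u) << 10)
                  + ((uint32_t)low - 0xdc00u);
}

/* Split a code point in the range 0x10000 to 0x10ffff into a pair. */
void split_code_point(uint32_t cp, uint16_t *high, uint16_t *low)
{
  cp -= 0x10000u;
  *high = (uint16_t)(0xd800u + (cp >> 10));
  *low  = (uint16_t)(0xdc00u + (cp & 0x3ffu));
}
```

For example, the pair 0xd83d, 0xde00 encodes U+1F600, and the largest
code point, U+10FFFF, is encoded as 0xdbff, 0xdfff.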

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve
performance, for example in the case of a long subject string that is
being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at
compile time or at match time, PCRE2 assumes that the pattern or subject
it is given (respectively) contains only valid UTF code unit sequences.

Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to
pcre2_match() or pcre2_dfa_match().

If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely.

Errors in UTF-8 strings

The following negative error codes are given for invalid UTF-8 strings:

  PCRE2_ERROR_UTF8_ERR1
  PCRE2_ERROR_UTF8_ERR2
  PCRE2_ERROR_UTF8_ERR3
  PCRE2_ERROR_UTF8_ERR4
  PCRE2_ERROR_UTF8_ERR5

The string ends with a truncated UTF-8 character; the code specifies
how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
characters to be no longer than 4 bytes, the encoding scheme
(originally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes.

  PCRE2_ERROR_UTF8_ERR6
  PCRE2_ERROR_UTF8_ERR7
  PCRE2_ERROR_UTF8_ERR8
  PCRE2_ERROR_UTF8_ERR9
  PCRE2_ERROR_UTF8_ERR10

The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1).

  PCRE2_ERROR_UTF8_ERR11
  PCRE2_ERROR_UTF8_ERR12

A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629.

  PCRE2_ERROR_UTF8_ERR13

A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629.

  PCRE2_ERROR_UTF8_ERR14

A 3-byte character has a value in the range 0xd800 to 0xdfff; this
range of code points is reserved by RFC 3629 for use with UTF-16, and
so is excluded from UTF-8.

  PCRE2_ERROR_UTF8_ERR15
  PCRE2_ERROR_UTF8_ERR16
  PCRE2_ERROR_UTF8_ERR17
  PCRE2_ERROR_UTF8_ERR18
  PCRE2_ERROR_UTF8_ERR19

A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
for a value that can be represented by fewer bytes, which is invalid.
For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
correct coding uses just one byte.

  PCRE2_ERROR_UTF8_ERR20

The two most significant bits of the first byte of a character have the
binary value 0b10 (that is, the most significant bit is 1 and the
second is 0). Such a byte can only validly occur as the second or
subsequent byte of a multi-byte character.

  PCRE2_ERROR_UTF8_ERR21

The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
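
The conditions listed above can be illustrated by a small stand-alone
checker. This is a sketch written for this description, not the code
that PCRE2 uses: utf8_check() is a hypothetical name, and it returns a
positive number n for the condition described under
PCRE2_ERROR_UTF8_ERRn rather than the library's negative error codes.
Apart from testing for truncation first, the order of the tests is an
assumption, so a string with several faults might be reported
differently by PCRE2 itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker, not part of PCRE2. Returns 0 for a valid
   string, or n (1 to 21) for the condition described under
   PCRE2_ERROR_UTF8_ERRn. On a non-zero return, *erroroffset is the
   offset of the first byte of the offending character. */
int utf8_check(const unsigned char *s, size_t len, size_t *erroroffset)
{
  /* Smallest value that needs 1 + extra bytes. */
  static const uint32_t min_cp[6] =
    { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
  size_t i = 0;

  while (i < len)
    {
    uint32_t c = s[i], cp;
    size_t extra, k, avail = len - i - 1;

    *erroroffset = i;
    if (c < 0x80) { i++; continue; }      /* one-byte character */
    if (c < 0xc0) return 20;              /* 0b10xxxxxx as first byte */
    if (c >= 0xfe) return 21;             /* 0xfe or 0xff */

    /* Number of bytes that must follow the first one. */
    extra = (c < 0xe0)? 1 : (c < 0xf0)? 2 : (c < 0xf8)? 3 :
            (c < 0xfc)? 4 : 5;
    if (avail < extra) return (int)(extra - avail);   /* 1 to 5 missing */

    cp = c & (0x7fu >> (extra + 1));      /* data bits of first byte */
    for (k = 1; k <= extra; k++)
      {
      if ((s[i + k] & 0xc0) != 0x80) return (int)(5 + k);  /* 6 to 10 */
      cp = (cp << 6) | (s[i + k] & 0x3fu);
      }

    if (cp < min_cp[extra]) return (int)(14 + extra); /* overlong */
    if (extra >= 4) return (int)(7 + extra);          /* 5 or 6 bytes */
    if (cp > 0x10ffff) return 13;                     /* too large */
    if (cp >= 0xd800 && cp <= 0xdfff) return 14;      /* surrogate */
    i += extra + 1;
    }
  return 0;
}
```

Applied to the two bytes 0xc0, 0xae from the example above, the sketch
decodes the value 0x2e, finds that it is below the two-byte minimum of
0x80, and returns 15.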

Errors in UTF-16 strings

The following negative error codes are given for invalid UTF-16
strings:

  PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
  PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
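
The three conditions can be tested with a single pass over the code
units, as in the following C sketch. It is an illustration, not the
library's own checker: utf16_check() is a hypothetical name, the input
is assumed to be in host byte order, and the function returns 1, 2, or
3 for the three conditions in the order listed above instead of a
negative PCRE2 error code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker, not part of PCRE2. Returns 0 for a valid
   UTF-16 string, or 1, 2, or 3 for the conditions listed above. On a
   non-zero return, *erroroffset is the offset of the offending code
   unit. */
int utf16_check(const uint16_t *s, size_t len, size_t *erroroffset)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
    uint16_t c = s[i];

    if ((c & 0xf800) != 0xd800) continue;        /* not a surrogate */
    *erroroffset = i;
    if (c >= 0xdc00) return 3;                   /* isolated low */
    if (i + 1 >= len) return 1;                  /* high at the end */
    if ((s[i + 1] & 0xfc00) != 0xdc00) return 2; /* no low follows */
    i++;                                         /* skip the low half */
    }
  return 0;
}
```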


Errors in UTF-32 strings

The following negative error codes are given for invalid UTF-32
strings:

  PCRE2_ERROR_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
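
In UTF-32 every code unit is a complete character, so the check is a
pair of range tests. The sketch below uses the hypothetical name
utf32_check(), assumes host byte order, and returns 1 or 2 for the two
conditions in the order listed above rather than a negative PCRE2 error
code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker, not part of PCRE2. Returns 0 for a valid
   UTF-32 string, 1 for a surrogate, or 2 for a value above 0x10ffff.
   On a non-zero return, *erroroffset is the offset of the offending
   code unit. */
int utf32_check(const uint32_t *s, size_t len, size_t *erroroffset)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
    uint32_t c = s[i];

    if (c < 0xd800 || (c > 0xdfff && c <= 0x10ffff)) continue;
    *erroroffset = i;
    return (c <= 0xdfff)? 1 : 2;
    }
  return 0;
}
```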


AUTHOR

Philip Hazel
University Computing Service
Cambridge, England.


REVISION

Last updated: 18 August 2015
Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------
|
|
|
|
|