11597 lines
568 KiB
Plaintext
11597 lines
568 KiB
Plaintext
-----------------------------------------------------------------------------
|
|
This file contains a concatenation of the PCRE2 man pages, converted to plain
|
|
text format for ease of searching with a text editor, or for use on systems
|
|
that do not have a man page processor. The small individual files that give
|
|
synopses of each function in the library have not been included. Neither has
|
|
the pcre2demo program. There are separate text files for the pcre2grep and
|
|
pcre2test commands.
|
|
-----------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2(3) Library Functions Manual PCRE2(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
INTRODUCTION
|
|
|
|
PCRE2 is the name used for a revised API for the PCRE library, which is
|
|
a set of functions, written in C, that implement regular expression
|
|
pattern matching using the same syntax and semantics as Perl, with just
|
|
a few differences. After nearly two decades, the limitations of the
|
|
original API were making development increasingly difficult. The new
|
|
API is more extensible, and it was simplified by abolishing the sepa-
|
|
rate "study" optimizing function; in PCRE2, patterns are automatically
|
|
optimized where possible. Since forking from PCRE1, the code has been
|
|
extensively refactored and new features introduced. The old library is
|
|
now obsolete and is no longer maintained.
|
|
|
|
As well as Perl-style regular expression patterns, some features that
|
|
appeared in Python and the original PCRE before they appeared in Perl
|
|
are available using the Python syntax. There is also some support for
|
|
one or two .NET and Oniguruma syntax items, and there are options for
|
|
requesting some minor changes that give better ECMAScript (aka Java-
|
|
Script) compatibility.
|
|
|
|
The source code for PCRE2 can be compiled to support strings of 8-bit,
|
|
16-bit, or 32-bit code units, which means that up to three separate li-
|
|
braries may be installed, one for each code unit size. The size of code
|
|
unit is not related to the bit size of the underlying hardware. In a
|
|
64-bit environment that also supports 32-bit applications, versions of
|
|
PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
|
|
|
|
The original work to extend PCRE to 16-bit and 32-bit code units was
|
|
done by Zoltan Herczeg and Christian Persch, respectively. In all three
|
|
cases, strings can be interpreted either as one character per code
|
|
unit, or as UTF-encoded Unicode, with support for Unicode general cate-
|
|
gory properties. Unicode support is optional at build time (but is the
|
|
default). However, processing strings as UTF code units must be enabled
|
|
explicitly at run time. The version of Unicode in use can be discovered
|
|
by running
|
|
|
|
pcre2test -C
|
|
|
|
The three libraries contain identical sets of functions, with names
|
|
ending in _8, _16, or _32, respectively (for example, pcre2_com-
|
|
pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
|
|
32, a program that uses just one code unit width can be written using
|
|
generic names such as pcre2_compile(), and the documentation is written
|
|
assuming that this is the case.
|
|
|
|
In addition to the Perl-compatible matching function, PCRE2 contains an
|
|
alternative function that matches the same compiled patterns in a dif-
|
|
ferent way. In certain circumstances, the alternative function has some
|
|
advantages. For a discussion of the two matching algorithms, see the
|
|
pcre2matching page.
|
|
|
|
Details of exactly which Perl regular expression features are and are
|
|
not supported by PCRE2 are given in separate documents. See the
|
|
pcre2pattern and pcre2compat pages. There is a syntax summary in the
|
|
pcre2syntax page.
|
|
|
|
Some features of PCRE2 can be included, excluded, or changed when the
|
|
library is built. The pcre2_config() function makes it possible for a
|
|
client to discover which features are available. The features them-
|
|
selves are described in the pcre2build page. Documentation about build-
|
|
ing PCRE2 for various operating systems can be found in the README and
|
|
NON-AUTOTOOLS_BUILD files in the source distribution.
|
|
|
|
The libraries contains a number of undocumented internal functions and
|
|
data tables that are used by more than one of the exported external
|
|
functions, but which are not intended for use by external callers.
|
|
Their names all begin with "_pcre2", which hopefully will not provoke
|
|
any name clashes. In some environments, it is possible to control which
|
|
external symbols are exported when a shared library is built, and in
|
|
these cases the undocumented symbols are not exported.
|
|
|
|
|
|
SECURITY CONSIDERATIONS
|
|
|
|
If you are using PCRE2 in a non-UTF application that permits users to
|
|
supply arbitrary patterns for compilation, you should be aware of a
|
|
feature that allows users to turn on UTF support from within a pattern.
|
|
For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
|
|
mode, which interprets patterns and subjects as strings of UTF-8 code
|
|
units instead of individual 8-bit characters. This causes both the pat-
|
|
tern and any data against which it is matched to be checked for UTF-8
|
|
validity. If the data string is very long, such a check might use suf-
|
|
ficiently many resources as to cause your application to lose perfor-
|
|
mance.
|
|
|
|
One way of guarding against this possibility is to use the pcre2_pat-
|
|
tern_info() function to check the compiled pattern's options for
|
|
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
|
calling pcre2_compile(). This causes a compile time error if the pat-
|
|
tern contains a UTF-setting sequence.
|
|
|
|
The use of Unicode properties for character types such as \d can also
|
|
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
|
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
|
|
|
|
If your application is one that supports UTF, be aware that validity
|
|
checking can take time. If the same data string is to be matched many
|
|
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
|
|
subsequent matches to avoid running redundant checks.
|
|
|
|
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
|
|
to problems, because it may leave the current matching point in the
|
|
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
|
|
tion can be used by an application to lock out the use of \C, causing a
|
|
compile-time error if it is encountered. It is also possible to build
|
|
PCRE2 with the use of \C permanently disabled.
|
|
|
|
Another way that performance can be hit is by running a pattern that
|
|
has a very large search tree against a string that will never match.
|
|
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
|
vides some protection against this: see the pcre2_set_match_limit()
|
|
function in the pcre2api page. There is a similar function called
|
|
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
|
|
ory that is used.
|
|
|
|
|
|
USER DOCUMENTATION
|
|
|
|
The user documentation for PCRE2 comprises a number of different sec-
|
|
tions. In the "man" format, each of these is a separate "man page". In
|
|
the HTML format, each is a separate page, linked from the index page.
|
|
In the plain text format, the descriptions of the pcre2grep and
|
|
pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
|
|
respectively. The remaining sections, except for the pcre2demo section
|
|
(which is a program listing), and the short pages for individual func-
|
|
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
|
|
tions are as follows:
|
|
|
|
pcre2 this document
|
|
pcre2-config show PCRE2 installation configuration information
|
|
pcre2api details of PCRE2's native C API
|
|
pcre2build building PCRE2
|
|
pcre2callout details of the pattern callout feature
|
|
pcre2compat discussion of Perl compatibility
|
|
pcre2convert details of pattern conversion functions
|
|
pcre2demo a demonstration C program that uses PCRE2
|
|
pcre2grep description of the pcre2grep command (8-bit only)
|
|
pcre2jit discussion of just-in-time optimization support
|
|
pcre2limits details of size and other limits
|
|
pcre2matching discussion of the two matching algorithms
|
|
pcre2partial details of the partial matching facility
|
|
pcre2pattern syntax and semantics of supported regular
|
|
expression patterns
|
|
pcre2perform discussion of performance issues
|
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
|
pcre2sample discussion of the pcre2demo program
|
|
pcre2serialize details of pattern serialization
|
|
pcre2syntax quick syntax reference
|
|
pcre2test description of the pcre2test command
|
|
pcre2unicode discussion of Unicode and UTF support
|
|
|
|
In the "man" and HTML formats, there is also a short page for each C
|
|
library function, listing its arguments and results.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
Putting an actual email address here is a spam magnet. If you want to
|
|
email me, use my two names separated by a dot at gmail.com.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 27 August 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
#include <pcre2.h>
|
|
|
|
PCRE2 is a new API for PCRE, starting at release 10.0. This document
|
|
contains a description of all its native functions. See the pcre2 docu-
|
|
ment for an overview of all the PCRE2 documentation.
|
|
|
|
|
|
PCRE2 NATIVE API BASIC FUNCTIONS
|
|
|
|
pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
|
|
uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
|
|
pcre2_compile_context *ccontext);
|
|
|
|
void pcre2_code_free(pcre2_code *code);
|
|
|
|
pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_data *pcre2_match_data_create_from_pattern(
|
|
const pcre2_code *code, pcre2_general_context *gcontext);
|
|
|
|
int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext,
|
|
int *workspace, PCRE2_SIZE wscount);
|
|
|
|
void pcre2_match_data_free(pcre2_match_data *match_data);
|
|
|
|
|
|
PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
|
|
|
|
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
|
|
|
|
uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
|
|
|
|
|
|
PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
|
|
|
|
pcre2_general_context *pcre2_general_context_create(
|
|
void *(*private_malloc)(PCRE2_SIZE, void *),
|
|
void (*private_free)(void *, void *), void *memory_data);
|
|
|
|
pcre2_general_context *pcre2_general_context_copy(
|
|
pcre2_general_context *gcontext);
|
|
|
|
void pcre2_general_context_free(pcre2_general_context *gcontext);
|
|
|
|
|
|
PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
|
|
|
|
pcre2_compile_context *pcre2_compile_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_compile_context *pcre2_compile_context_copy(
|
|
pcre2_compile_context *ccontext);
|
|
|
|
void pcre2_compile_context_free(pcre2_compile_context *ccontext);
|
|
|
|
int pcre2_set_bsr(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
|
|
const uint8_t *tables);
|
|
|
|
int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
|
|
uint32_t extra_options);
|
|
|
|
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
|
|
PCRE2_SIZE value);
|
|
|
|
int pcre2_set_newline(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
|
|
int (*guard_function)(uint32_t, void *), void *user_data);
|
|
|
|
|
|
PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
|
|
|
|
pcre2_match_context *pcre2_match_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_context *pcre2_match_context_copy(
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_match_context_free(pcre2_match_context *mcontext);
|
|
|
|
int pcre2_set_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_substitute_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
int pcre2_set_offset_limit(pcre2_match_context *mcontext,
|
|
PCRE2_SIZE value);
|
|
|
|
int pcre2_set_heap_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_depth_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
|
|
PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
|
|
|
|
int pcre2_substring_copy_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR *buffer,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
void pcre2_substring_free(PCRE2_UCHAR *buffer);
|
|
|
|
int pcre2_substring_get_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR **bufferptr,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_length_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_nametable_scan(const pcre2_code *code,
|
|
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
|
|
|
|
int pcre2_substring_number_from_name(const pcre2_code *code,
|
|
PCRE2_SPTR name);
|
|
|
|
void pcre2_substring_list_free(PCRE2_SPTR *list);
|
|
|
|
int pcre2_substring_list_get(pcre2_match_data *match_data,
|
|
PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
|
|
|
|
|
|
PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
|
|
|
|
int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
|
|
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
|
|
PCRE2_SIZE *outlengthptr);
|
|
|
|
|
|
PCRE2 NATIVE API JIT FUNCTIONS
|
|
|
|
int pcre2_jit_compile(pcre2_code *code, uint32_t options);
|
|
|
|
int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
|
|
|
pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
|
|
PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
|
|
pcre2_jit_callback callback_function, void *callback_data);
|
|
|
|
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
|
|
|
|
|
|
PCRE2 NATIVE API SERIALIZATION FUNCTIONS
|
|
|
|
int32_t pcre2_serialize_decode(pcre2_code **codes,
|
|
int32_t number_of_codes, const uint8_t *bytes,
|
|
pcre2_general_context *gcontext);
|
|
|
|
int32_t pcre2_serialize_encode(const pcre2_code **codes,
|
|
int32_t number_of_codes, uint8_t **serialized_bytes,
|
|
PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_serialize_free(uint8_t *bytes);
|
|
|
|
int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
|
|
|
|
|
|
PCRE2 NATIVE API AUXILIARY FUNCTIONS
|
|
|
|
pcre2_code *pcre2_code_copy(const pcre2_code *code);
|
|
|
|
pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
|
|
|
|
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
|
|
PCRE2_SIZE bufflen);
|
|
|
|
const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
|
|
|
|
void pcre2_maketables_free(pcre2_general_context *gcontext,
|
|
const uint8_t *tables);
|
|
|
|
int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
|
|
void *where);
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
int pcre2_config(uint32_t what, void *where);
|
|
|
|
|
|
PCRE2 NATIVE API OBSOLETE FUNCTIONS
|
|
|
|
int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
int pcre2_set_recursion_memory_management(
|
|
pcre2_match_context *mcontext,
|
|
void *(*private_malloc)(PCRE2_SIZE, void *),
|
|
void (*private_free)(void *, void *), void *memory_data);
|
|
|
|
These functions became obsolete at release 10.30 and are retained only
|
|
for backward compatibility. They should not be used in new code. The
|
|
first is replaced by pcre2_set_depth_limit(); the second is no longer
|
|
needed and has no effect (it always returns zero).
|
|
|
|
|
|
PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
|
|
|
|
pcre2_convert_context *pcre2_convert_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_convert_context *pcre2_convert_context_copy(
|
|
pcre2_convert_context *cvcontext);
|
|
|
|
void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
|
|
|
|
int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
|
|
uint32_t escape_char);
|
|
|
|
int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
|
|
uint32_t separator_char);
|
|
|
|
int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
|
|
uint32_t options, PCRE2_UCHAR **buffer,
|
|
PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
|
|
|
|
void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
|
|
|
|
These functions provide a way of converting non-PCRE2 patterns into
|
|
patterns that can be processed by pcre2_compile(). This facility is ex-
|
|
perimental and may be changed in future releases. At present, "globs"
|
|
and POSIX basic and extended patterns can be converted. Details are
|
|
given in the pcre2convert documentation.
|
|
|
|
|
|
PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
|
|
|
|
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
|
|
code units, respectively. However, there is just one header file,
|
|
pcre2.h. This contains the function prototypes and other definitions
|
|
for all three libraries. One, two, or all three can be installed simul-
|
|
taneously. On Unix-like systems the libraries are called libpcre2-8,
|
|
libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
|
|
inal PCRE libraries.
|
|
|
|
Character strings are passed to and from a PCRE2 library as a sequence
|
|
of unsigned integers in code units of the appropriate width. Every
|
|
PCRE2 function comes in three different forms, one for each library,
|
|
for example:
|
|
|
|
pcre2_compile_8()
|
|
pcre2_compile_16()
|
|
pcre2_compile_32()
|
|
|
|
There are also three different sets of data types:
|
|
|
|
PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
|
|
PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32
|
|
|
|
The UCHAR types define unsigned code units of the appropriate widths.
|
|
For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
|
|
types are constant pointers to the equivalent UCHAR types, that is,
|
|
they are pointers to vectors of unsigned code units.
|
|
|
|
Many applications use only one code unit width. For their convenience,
|
|
macros are defined whose names are the generic forms such as pcre2_com-
|
|
pile() and PCRE2_SPTR. These macros use the value of the macro
|
|
PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
|
|
tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
|
|
An application must define it to be 8, 16, or 32 before including
|
|
pcre2.h in order to make use of the generic names.
|
|
|
|
Applications that use more than one code unit width can be linked with
|
|
more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
|
|
be 0 before including pcre2.h, and then use the real function names.
|
|
Any code that is to be included in an environment where the value of
|
|
PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
|
|
names. (Unfortunately, it is not possible in C code to save and restore
|
|
the value of a macro.)
|
|
|
|
If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
|
|
compiler error occurs.
|
|
|
|
When using multiple libraries in an application, you must take care
|
|
when processing any particular pattern to use only functions from a
|
|
single library. For example, if you want to run a match using a pat-
|
|
tern that was compiled with pcre2_compile_16(), you must do so with
|
|
pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
|
|
|
|
In the function summaries above, and in the rest of this document and
|
|
other PCRE2 documents, functions and data types are described using
|
|
their generic names, without the _8, _16, or _32 suffix.
|
|
|
|
|
|
PCRE2 API OVERVIEW
|
|
|
|
PCRE2 has its own native API, which is described in this document.
|
|
There are also some wrapper functions for the 8-bit library that corre-
|
|
spond to the POSIX regular expression API, but they do not give access
|
|
to all the functionality of PCRE2. They are described in the pcre2posix
|
|
documentation. Both these APIs define a set of C function calls.
|
|
|
|
The native API C data types, function prototypes, option values, and
|
|
error codes are defined in the header file pcre2.h, which also contains
|
|
definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
|
|
numbers for the library. Applications can use these to include support
|
|
for different releases of PCRE2.
|
|
|
|
In a Windows environment, if you want to statically link an application
|
|
program against a non-dll PCRE2 library, you must define PCRE2_STATIC
|
|
before including pcre2.h.
|
|
|
|
The functions pcre2_compile() and pcre2_match() are used for compiling
|
|
and matching regular expressions in a Perl-compatible manner. A sample
|
|
program that demonstrates the simplest way of using them is provided in
|
|
the file called pcre2demo.c in the PCRE2 source distribution. A listing
|
|
of this program is given in the pcre2demo documentation, and the
|
|
pcre2sample documentation describes how to compile and run it.
|
|
|
|
The compiling and matching functions recognize various options that are
|
|
passed as bits in an options argument. There are also some more compli-
|
|
cated parameters such as custom memory management functions and re-
|
|
source limits that are passed in "contexts" (which are just memory
|
|
blocks, described below). Simple applications do not need to make use
|
|
of contexts.
|
|
|
|
Just-in-time (JIT) compiler support is an optional feature of PCRE2
|
|
that can be built in appropriate hardware environments. It greatly
|
|
speeds up the matching performance of many patterns. Programs can re-
|
|
quest that it be used if available by calling pcre2_jit_compile() after
|
|
a pattern has been successfully compiled by pcre2_compile(). This does
|
|
nothing if JIT support is not available.
|
|
|
|
More complicated programs might need to make use of the specialist
|
|
functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
|
|
pcre2_jit_stack_assign() in order to control the JIT code's memory us-
|
|
age.
|
|
|
|
JIT matching is automatically used by pcre2_match() if it is available,
|
|
unless the PCRE2_NO_JIT option is set. There is also a direct interface
|
|
for JIT matching, which gives improved performance at the expense of
|
|
less sanity checking. The JIT-specific functions are discussed in the
|
|
pcre2jit documentation.
|
|
|
|
A second matching function, pcre2_dfa_match(), which is not Perl-com-
|
|
patible, is also provided. This uses a different algorithm for the
|
|
matching. The alternative algorithm finds all possible matches (at a
|
|
given point in the subject), and scans the subject just once (unless
|
|
there are lookaround assertions). However, this algorithm does not re-
|
|
turn captured substrings. A description of the two matching algorithms
|
|
and their advantages and disadvantages is given in the pcre2matching
|
|
documentation. There is no JIT support for pcre2_dfa_match().
|
|
|
|
In addition to the main compiling and matching functions, there are
|
|
convenience functions for extracting captured substrings from a subject
|
|
string that has been matched by pcre2_match(). They are:
|
|
|
|
pcre2_substring_copy_byname()
|
|
pcre2_substring_copy_bynumber()
|
|
pcre2_substring_get_byname()
|
|
pcre2_substring_get_bynumber()
|
|
pcre2_substring_list_get()
|
|
pcre2_substring_length_byname()
|
|
pcre2_substring_length_bynumber()
|
|
pcre2_substring_nametable_scan()
|
|
pcre2_substring_number_from_name()
|
|
|
|
pcre2_substring_free() and pcre2_substring_list_free() are also pro-
|
|
vided, to free memory used for extracted strings. If either of these
|
|
functions is called with a NULL argument, the function returns immedi-
|
|
ately without doing anything.
|
|
|
|
The function pcre2_substitute() can be called to match a pattern and
|
|
return a copy of the subject string with substitutions for parts that
|
|
were matched.
|
|
|
|
Functions whose names begin with pcre2_serialize_ are used for saving
|
|
compiled patterns on disc or elsewhere, and reloading them later.
|
|
|
|
Finally, there are functions for finding out information about a com-
|
|
piled pattern (pcre2_pattern_info()) and about the configuration with
|
|
which PCRE2 was built (pcre2_config()).
|
|
|
|
Functions with names ending with _free() are used for freeing memory
|
|
blocks of various sorts. In all cases, if one of these functions is
|
|
called with a NULL argument, it does nothing.
|
|
|
|
|
|
STRING LENGTHS AND OFFSETS
|
|
|
|
The PCRE2 API uses string lengths and offsets into strings of code
|
|
units in several places. These values are always of type PCRE2_SIZE,
|
|
which is an unsigned integer type, currently always defined as size_t.
|
|
The largest value that can be stored in such a type (that is
|
|
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
|
|
strings and unset offsets. Therefore, the longest string that can be
|
|
handled is one less than this maximum.
|
|
|
|
|
|
NEWLINES
|
|
|
|
PCRE2 supports five different conventions for indicating line breaks in
|
|
strings: a single CR (carriage return) character, a single LF (line-
|
|
feed) character, the two-character sequence CRLF, any of the three pre-
|
|
ceding, or any Unicode newline sequence. The Unicode newline sequences
|
|
are the three just mentioned, plus the single characters VT (vertical
|
|
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
|
separator, U+2028), and PS (paragraph separator, U+2029).
|
|
|
|
Each of the first three conventions is used by at least one operating
|
|
system as its standard newline sequence. When PCRE2 is built, a default
|
|
can be specified. If it is not, the default is set to LF, which is the
|
|
Unix standard. However, the newline convention can be changed by an ap-
|
|
plication when calling pcre2_compile(), or it can be specified by spe-
|
|
cial text at the start of the pattern itself; this overrides any other
|
|
settings. See the pcre2pattern page for details of the special charac-
|
|
ter sequences.
|
|
|
|
In the PCRE2 documentation the word "newline" is used to mean "the
|
|
character or pair of characters that indicate a line break". The choice
|
|
of newline convention affects the handling of the dot, circumflex, and
|
|
dollar metacharacters, the handling of #-comments in /x mode, and, when
|
|
CRLF is a recognized line ending sequence, the match position advance-
|
|
ment for a non-anchored pattern. There is more detail about this in the
|
|
section on pcre2_match() options below.
|
|
|
|
The choice of newline convention does not affect the interpretation of
|
|
the \n or \r escape sequences, nor does it affect what \R matches; this
|
|
has its own separate convention.
|
|
|
|
|
|
MULTITHREADING
|
|
|
|
In a multithreaded application it is important to keep thread-specific
|
|
data separate from data that can be shared between threads. The PCRE2
|
|
library code itself is thread-safe: it contains no static or global
|
|
variables. The API is designed to be fairly simple for non-threaded ap-
|
|
plications while at the same time ensuring that multithreaded applica-
|
|
tions can use it.
|
|
|
|
There are several different blocks of data that are used to pass infor-
|
|
mation between the application and the PCRE2 libraries.
|
|
|
|
The compiled pattern
|
|
|
|
A pointer to the compiled form of a pattern is returned to the user
|
|
when pcre2_compile() is successful. The data in the compiled pattern is
|
|
fixed, and does not change when the pattern is matched. Therefore, it
|
|
is thread-safe, that is, the same compiled pattern can be used by more
|
|
than one thread simultaneously. For example, an application can compile
|
|
all its patterns at the start, before forking off multiple threads that
|
|
use them. However, if the just-in-time (JIT) optimization feature is
|
|
being used, it needs separate memory stack areas for each thread. See
|
|
the pcre2jit documentation for more details.
|
|
|
|
In a more complicated situation, where patterns are compiled only when
|
|
they are first needed, but are still shared between threads, pointers
|
|
to compiled patterns must be protected from simultaneous writing by
|
|
multiple threads. This is somewhat tricky to do correctly. If you know
|
|
that writing to a pointer is atomic in your environment, you can use
|
|
logic like this:
|
|
|
|
Get a read-only (shared) lock (mutex) for pointer
|
|
if (pointer == NULL)
|
|
{
|
|
Get a write (unique) lock for pointer
|
|
if (pointer == NULL) pointer = pcre2_compile(...
|
|
}
|
|
Release the lock
|
|
Use pointer in pcre2_match()
|
|
|
|
Of course, testing for compilation errors should also be included in
|
|
the code.
|
|
|
|
The reason for checking the pointer a second time is as follows: Sev-
|
|
eral threads may have acquired the shared lock and tested the pointer
|
|
for being NULL, but only one of them will be given the write lock, with
|
|
the rest kept waiting. The winning thread will compile the pattern and
|
|
store the result. After this thread releases the write lock, another
|
|
thread will get it, and if it does not retest pointer for being NULL,
|
|
will recompile the pattern and overwrite the pointer, creating a memory
|
|
leak and possibly causing other issues.
|
|
|
|
In an environment where writing to a pointer may not be atomic, the
|
|
above logic is not sufficient. The thread that is doing the compiling
|
|
may be descheduled after writing only part of the pointer, which could
|
|
cause other threads to use an invalid value. Instead of checking the
|
|
pointer itself, a separate "pointer is valid" flag (that can be updated
|
|
atomically) must be used:
|
|
|
|
Get a read-only (shared) lock (mutex) for pointer
|
|
if (!pointer_is_valid)
|
|
{
|
|
Get a write (unique) lock for pointer
|
|
if (!pointer_is_valid)
|
|
{
|
|
pointer = pcre2_compile(...
|
|
pointer_is_valid = TRUE
|
|
}
|
|
}
|
|
Release the lock
|
|
Use pointer in pcre2_match()
|
|
|
|
If JIT is being used, but the JIT compilation is not being done immedi-
|
|
ately (perhaps waiting to see if the pattern is used often enough),
|
|
similar logic is required. JIT compilation updates a value within the
|
|
compiled code block, so a thread must gain unique write access to the
|
|
pointer before calling pcre2_jit_compile(). Alternatively,
|
|
pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob-
|
|
tain a private copy of the compiled code before calling the JIT com-
|
|
piler.
|
|
|
|
Context blocks
|
|
|
|
The next main section below introduces the idea of "contexts" in which
|
|
PCRE2 functions are called. A context is nothing more than a collection
|
|
of parameters that control the way PCRE2 operates. Grouping a number of
|
|
parameters together in a context is a convenient way of passing them to
|
|
a PCRE2 function without using lots of arguments. The parameters that
|
|
are stored in contexts are in some sense "advanced features" of the
|
|
API. Many straightforward applications will not need to use contexts.
|
|
|
|
In a multithreaded application, if the parameters in a context are val-
|
|
ues that are never changed, the same context can be used by all the
|
|
threads. However, if any thread needs to change any value in a context,
|
|
it must make its own thread-specific copy.
|
|
|
|
Match blocks
|
|
|
|
The matching functions need a block of memory for storing the results
|
|
of a match. This includes details of what was matched, as well as addi-
|
|
tional information such as the name of a (*MARK) setting. Each thread
|
|
must provide its own copy of this memory.
|
|
|
|
|
|
PCRE2 CONTEXTS
|
|
|
|
Some PCRE2 functions have a lot of parameters, many of which are used
|
|
only by specialist applications, for example, those that use custom
|
|
memory management or non-standard character tables. To keep function
|
|
argument lists at a reasonable size, and at the same time to keep the
|
|
API extensible, "uncommon" parameters are passed to certain functions
|
|
in a context instead of directly. A context is just a block of memory
|
|
that holds the parameter values. Applications that do not need to ad-
|
|
just any of the context parameters can pass NULL when a context pointer
|
|
is required.
|
|
|
|
There are three different types of context: a general context that is
|
|
relevant for several PCRE2 operations, a compile-time context, and a
|
|
match-time context.
|
|
|
|
The general context
|
|
|
|
At present, this context just contains pointers to (and data for) ex-
|
|
ternal memory management functions that are called from several places
|
|
in the PCRE2 library. The context is named `general' rather than
|
|
specifically `memory' because in future other fields may be added. If
|
|
you do not want to supply your own custom memory management functions,
|
|
you do not need to bother with a general context. A general context is
|
|
created by:
|
|
|
|
pcre2_general_context *pcre2_general_context_create(
|
|
void *(*private_malloc)(PCRE2_SIZE, void *),
|
|
void (*private_free)(void *, void *), void *memory_data);
|
|
|
|
The two function pointers specify custom memory management functions,
|
|
whose prototypes are:
|
|
|
|
void *private_malloc(PCRE2_SIZE, void *);
|
|
void private_free(void *, void *);
|
|
|
|
Whenever code in PCRE2 calls these functions, the final argument is the
|
|
value of memory_data. Either of the first two arguments of the creation
|
|
function may be NULL, in which case the system memory management func-
|
|
tions malloc() and free() are used. (This is not currently useful, as
|
|
there are no other fields in a general context, but in future there
|
|
might be.) The private_malloc() function is used (if supplied) to ob-
|
|
tain memory for storing the context, and all three values are saved as
|
|
part of the context.
|
|
|
|
Whenever PCRE2 creates a data block of any kind, the block contains a
|
|
pointer to the free() function that matches the malloc() function that
|
|
was used. When the time comes to free the block, this function is
|
|
called.
|
|
|
|
A general context can be copied by calling:
|
|
|
|
pcre2_general_context *pcre2_general_context_copy(
|
|
pcre2_general_context *gcontext);
|
|
|
|
The memory used for a general context should be freed by calling:
|
|
|
|
void pcre2_general_context_free(pcre2_general_context *gcontext);
|
|
|
|
If this function is passed a NULL argument, it returns immediately
|
|
without doing anything.
|
|
|
|
The compile context
|
|
|
|
A compile context is required if you want to provide an external func-
|
|
tion for stack checking during compilation or to change the default
|
|
values of any of the following compile-time parameters:
|
|
|
|
What \R matches (Unicode newlines or CR, LF, CRLF only)
|
|
PCRE2's character tables
|
|
The newline character sequence
|
|
The compile time nested parentheses limit
|
|
The maximum length of the pattern string
|
|
The extra options bits (none set by default)
|
|
|
|
A compile context is also required if you are using custom memory man-
|
|
agement. If none of these apply, just pass NULL as the context argu-
|
|
ment of pcre2_compile().
|
|
|
|
A compile context is created, copied, and freed by the following func-
|
|
tions:
|
|
|
|
pcre2_compile_context *pcre2_compile_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_compile_context *pcre2_compile_context_copy(
|
|
pcre2_compile_context *ccontext);
|
|
|
|
void pcre2_compile_context_free(pcre2_compile_context *ccontext);
|
|
|
|
A compile context is created with default values for its parameters.
|
|
These can be changed by calling the following functions, which return 0
|
|
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
|
|
|
|
int pcre2_set_bsr(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
|
|
CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
|
|
Unicode line ending sequence. The value is used by the JIT compiler and
|
|
by the two interpreted matching functions, pcre2_match() and
|
|
pcre2_dfa_match().
|
|
|
|
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
|
|
const uint8_t *tables);
|
|
|
|
The value must be the result of a call to pcre2_maketables(), whose
|
|
only argument is a general context. This function builds a set of char-
|
|
acter tables in the current locale.
|
|
|
|
int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
|
|
uint32_t extra_options);
|
|
|
|
As PCRE2 has developed, almost all the 32 option bits that are avail-
|
|
able in the options argument of pcre2_compile() have been used up. To
|
|
avoid running out, the compile context contains a set of extra option
|
|
bits which are used for some newer, assumed rarer, options. This func-
|
|
tion sets those bits. It always sets all the bits (either on or off).
|
|
It does not modify any existing setting. The available options are de-
|
|
fined in the section entitled "Extra compile options" below.
|
|
|
|
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
|
|
PCRE2_SIZE value);
|
|
|
|
This sets a maximum length, in code units, for any pattern string that
|
|
is compiled with this context. If the pattern is longer, an error is
|
|
generated. This facility is provided so that applications that accept
|
|
patterns from external sources can limit their size. The default is the
|
|
largest number that a PCRE2_SIZE variable can hold, which is effec-
|
|
tively unlimited.
|
|
|
|
int pcre2_set_newline(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
This specifies which characters or character sequences are to be recog-
|
|
nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
|
|
return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
|
|
two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
|
|
of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
|
|
PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
|
|
|
|
A pattern can override the value set in the compile context by starting
|
|
with a sequence such as (*CRLF). See the pcre2pattern page for details.
|
|
|
|
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
|
|
TENDED_MORE option, the newline convention affects the recognition of
|
|
the end of internal comments starting with #. The value is saved with
|
|
the compiled pattern for subsequent use by the JIT compiler and by the
|
|
two interpreted matching functions, pcre2_match() and
|
|
pcre2_dfa_match().
|
|
|
|
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
|
|
uint32_t value);
|
|
|
|
This parameter adjusts the limit, set when PCRE2 is built (default
|
|
250), on the depth of parenthesis nesting in a pattern. This limit
|
|
stops rogue patterns using up too much system stack when being com-
|
|
piled. The limit applies to parentheses of all kinds, not just captur-
|
|
ing parentheses.
|
|
|
|
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
|
|
int (*guard_function)(uint32_t, void *), void *user_data);
|
|
|
|
There is at least one application that runs PCRE2 in threads with very
|
|
limited system stack, where running out of stack is to be avoided at
|
|
all costs. The parenthesis limit above cannot take account of how much
|
|
stack is actually available during compilation. For a finer control,
|
|
you can supply a function that is called whenever pcre2_compile()
|
|
starts to compile a parenthesized part of a pattern. This function can
|
|
check the actual stack size (or anything else that it wants to, of
|
|
course).
|
|
|
|
The first argument to the callout function gives the current depth of
|
|
nesting, and the second is user data that is set up by the last argu-
|
|
ment of pcre2_set_compile_recursion_guard(). The callout function
|
|
should return zero if all is well, or non-zero to force an error.
|
|
|
|
The match context
|
|
|
|
A match context is required if you want to:
|
|
|
|
Set up a callout function
|
|
Set an offset limit for matching an unanchored pattern
|
|
Change the limit on the amount of heap used when matching
|
|
Change the backtracking match limit
|
|
Change the backtracking depth limit
|
|
Set custom memory management specifically for the match
|
|
|
|
If none of these apply, just pass NULL as the context argument of
|
|
pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
|
|
|
|
A match context is created, copied, and freed by the following func-
|
|
tions:
|
|
|
|
pcre2_match_context *pcre2_match_context_create(
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_context *pcre2_match_context_copy(
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_match_context_free(pcre2_match_context *mcontext);
|
|
|
|
A match context is created with default values for its parameters.
|
|
These can be changed by calling the following functions, which return 0
|
|
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
|
|
|
|
int pcre2_set_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
This sets up a callout function for PCRE2 to call at specified points
|
|
during a matching operation. Details are given in the pcre2callout doc-
|
|
umentation.
|
|
|
|
int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_substitute_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
This sets up a callout function for PCRE2 to call after each substitu-
|
|
tion made by pcre2_substitute(). Details are given in the section enti-
|
|
tled "Creating a new string with substitutions" below.
|
|
|
|
int pcre2_set_offset_limit(pcre2_match_context *mcontext,
|
|
PCRE2_SIZE value);
|
|
|
|
The offset_limit parameter limits how far an unanchored search can ad-
|
|
vance in the subject string. The default value is PCRE2_UNSET. The
|
|
pcre2_match() and pcre2_dfa_match() functions return PCRE2_ERROR_NO-
|
|
MATCH if a match with a starting point before or at the given offset is
|
|
not found. The pcre2_substitute() function makes no more substitutions.
|
|
|
|
For example, if the pattern /abc/ is matched against "123abc" with an
|
|
offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
|
|
can never be found if the startoffset argument of pcre2_match(),
|
|
pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
|
|
limit set in the match context.
|
|
|
|
When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
|
|
tion when calling pcre2_compile() so that when JIT is in use, different
|
|
code can be compiled. If a match is started with a non-default match
|
|
limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
|
|
|
|
The offset limit facility can be used to track progress when searching
|
|
large subject strings or to limit the extent of global substitutions.
|
|
See also the PCRE2_FIRSTLINE option, which requires a match to start
|
|
before or at the first newline that follows the start of matching in
|
|
the subject. If this is set with an offset limit, a match must occur in
|
|
the first line and also within the offset limit. In other words, which-
|
|
ever limit comes first is used.
|
|
|
|
int pcre2_set_heap_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
|
|
the maximum amount of heap memory that pcre2_match() may use to hold
|
|
backtracking information when running an interpretive match. This limit
|
|
also applies to pcre2_dfa_match(), which may use the heap when process-
|
|
ing patterns with a lot of nested pattern recursion or lookarounds or
|
|
atomic groups. This limit does not apply to matching with the JIT opti-
|
|
mization, which has its own memory control arrangements (see the
|
|
pcre2jit documentation for more details). If the limit is reached, the
|
|
negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
|
|
limit can be set when PCRE2 is built; if it is not, the default is set
|
|
very large and is essentially "unlimited".
|
|
|
|
A value for the heap limit may also be supplied by an item at the start
|
|
of a pattern of the form
|
|
|
|
(*LIMIT_HEAP=ddd)
|
|
|
|
where ddd is a decimal number. However, such a setting is ignored un-
|
|
less ddd is less than the limit set by the caller of pcre2_match() or,
|
|
if no such limit is set, less than the default.
|
|
|
|
The pcre2_match() function starts out using a 20KiB vector on the sys-
|
|
tem stack for recording backtracking points. The more nested backtrack-
|
|
ing points there are (that is, the deeper the search tree), the more
|
|
memory is needed. Heap memory is used only if the initial vector is
|
|
too small. If the heap limit is set to a value less than 21 (in partic-
|
|
ular, zero) no heap memory will be used. In this case, only patterns
|
|
that do not have a lot of nested backtracking can be successfully pro-
|
|
cessed.
|
|
|
|
Similarly, for pcre2_dfa_match(), a vector on the system stack is used
|
|
when processing pattern recursions, lookarounds, or atomic groups, and
|
|
only if this is not big enough is heap memory used. In this case, too,
|
|
setting a value of zero disables the use of the heap.
|
|
|
|
int pcre2_set_match_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
The match_limit parameter provides a means of preventing PCRE2 from us-
|
|
ing up too many computing resources when processing patterns that are
|
|
not going to match, but which have a very large number of possibilities
|
|
in their search trees. The classic example is a pattern that uses
|
|
nested unlimited repeats.
|
|
|
|
There is an internal counter in pcre2_match() that is incremented each
|
|
time round its main matching loop. If this value reaches the match
|
|
limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
|
|
This has the effect of limiting the amount of backtracking that can
|
|
take place. For patterns that are not anchored, the count restarts from
|
|
zero for each position in the subject string. This limit also applies
|
|
to pcre2_dfa_match(), though the counting is done in a different way.
|
|
|
|
When pcre2_match() is called with a pattern that was successfully pro-
|
|
cessed by pcre2_jit_compile(), the way in which matching is executed is
|
|
entirely different. However, there is still the possibility of runaway
|
|
matching that goes on for a very long time, and so the match_limit
|
|
value is also used in this case (but in a different way) to limit how
|
|
long the matching can continue.
|
|
|
|
The default value for the limit can be set when PCRE2 is built; the de-
|
|
fault default is 10 million, which handles all but the most extreme
|
|
cases. A value for the match limit may also be supplied by an item at
|
|
the start of a pattern of the form
|
|
|
|
(*LIMIT_MATCH=ddd)
|
|
|
|
where ddd is a decimal number. However, such a setting is ignored un-
|
|
less ddd is less than the limit set by the caller of pcre2_match() or
|
|
pcre2_dfa_match() or, if no such limit is set, less than the default.
|
|
|
|
int pcre2_set_depth_limit(pcre2_match_context *mcontext,
|
|
uint32_t value);
|
|
|
|
This parameter limits the depth of nested backtracking in
|
|
pcre2_match(). Each time a nested backtracking point is passed, a new
|
|
memory "frame" is used to remember the state of matching at that point.
|
|
Thus, this parameter indirectly limits the amount of memory that is
|
|
used in a match. However, because the size of each memory "frame" de-
|
|
pends on the number of capturing parentheses, the actual memory limit
|
|
varies from pattern to pattern. This limit was more useful in versions
|
|
before 10.30, where function recursion was used for backtracking.
|
|
|
|
The depth limit is not relevant, and is ignored, when matching is done
|
|
using JIT compiled code. However, it is supported by pcre2_dfa_match(),
|
|
which uses it to limit the depth of nested internal recursive function
|
|
calls that implement atomic groups, lookaround assertions, and pattern
|
|
recursions. This limits, indirectly, the amount of system stack that is
|
|
used. It was more useful in versions before 10.32, when stack memory
|
|
was used for local workspace vectors for recursive function calls. From
|
|
version 10.32, only local variables are allocated on the stack and as
|
|
each call uses only a few hundred bytes, even a small stack can support
|
|
quite a lot of recursion.
|
|
|
|
If the depth of internal recursive function calls is great enough, lo-
|
|
cal workspace vectors are allocated on the heap from version 10.32 on-
|
|
wards, so the depth limit also indirectly limits the amount of heap
|
|
memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
|
|
matched to a very long string using pcre2_dfa_match(), can use a great
|
|
deal of memory. However, it is probably better to limit heap usage di-
|
|
rectly by calling pcre2_set_heap_limit().
|
|
|
|
The default value for the depth limit can be set when PCRE2 is built;
|
|
if it is not, the default is set to the same value as the default for
|
|
the match limit. If the limit is exceeded, pcre2_match() or
|
|
pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
|
|
limit may also be supplied by an item at the start of a pattern of the
|
|
form
|
|
|
|
(*LIMIT_DEPTH=ddd)
|
|
|
|
where ddd is a decimal number. However, such a setting is ignored un-
|
|
less ddd is less than the limit set by the caller of pcre2_match() or
|
|
pcre2_dfa_match() or, if no such limit is set, less than the default.
|
|
|
|
|
|
CHECKING BUILD-TIME OPTIONS
|
|
|
|
int pcre2_config(uint32_t what, void *where);
|
|
|
|
The function pcre2_config() makes it possible for a PCRE2 client to
|
|
find the value of certain configuration parameters and to discover
|
|
which optional features have been compiled into the PCRE2 library. The
|
|
pcre2build documentation has more details about these features.
|
|
|
|
The first argument for pcre2_config() specifies which information is
|
|
required. The second argument is a pointer to memory into which the in-
|
|
formation is placed. If NULL is passed, the function returns the amount
|
|
of memory that is needed for the requested information. For calls that
|
|
return numerical values, the value is in bytes; when requesting these
|
|
values, where should point to appropriately aligned memory. For calls
|
|
that return strings, the required length is given in code units, not
|
|
counting the terminating zero.
|
|
|
|
When requesting information, the returned value from pcre2_config() is
|
|
non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
|
|
TION if the value in the first argument is not recognized. The follow-
|
|
ing information is available:
|
|
|
|
PCRE2_CONFIG_BSR
|
|
|
|
The output is a uint32_t integer whose value indicates what character
|
|
sequences the \R escape sequence matches by default. A value of
|
|
PCRE2_BSR_UNICODE means that \R matches any Unicode line ending se-
|
|
quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
|
|
or CRLF. The default can be overridden when a pattern is compiled.
|
|
|
|
PCRE2_CONFIG_COMPILED_WIDTHS
|
|
|
|
The output is a uint32_t integer whose lower bits indicate which code
|
|
unit widths were selected when PCRE2 was built. The 1-bit indicates
|
|
8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
|
|
port, respectively.
|
|
|
|
PCRE2_CONFIG_DEPTHLIMIT
|
|
|
|
The output is a uint32_t integer that gives the default limit for the
|
|
depth of nested backtracking in pcre2_match() or the depth of nested
|
|
recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
|
|
ther details are given with pcre2_set_depth_limit() above.
|
|
|
|
PCRE2_CONFIG_HEAPLIMIT
|
|
|
|
The output is a uint32_t integer that gives, in kibibytes, the default
|
|
limit for the amount of heap memory used by pcre2_match() or
|
|
pcre2_dfa_match(). Further details are given with
|
|
pcre2_set_heap_limit() above.
|
|
|
|
PCRE2_CONFIG_JIT
|
|
|
|
The output is a uint32_t integer that is set to one if support for
|
|
just-in-time compiling is available; otherwise it is set to zero.
|
|
|
|
PCRE2_CONFIG_JITTARGET
|
|
|
|
The where argument should point to a buffer that is at least 48 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) The buffer is filled with a
|
|
string that contains the name of the architecture for which the JIT
|
|
compiler is configured, for example "x86 32bit (little endian + un-
|
|
aligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
|
|
returned, otherwise the number of code units used is returned. This is
|
|
the length of the string, plus one unit for the terminating zero.
|
|
|
|
PCRE2_CONFIG_LINKSIZE
|
|
|
|
The output is a uint32_t integer that contains the number of bytes used
|
|
for internal linkage in compiled regular expressions. When PCRE2 is
|
|
configured, the value can be set to 2, 3, or 4, with the default being
|
|
2. This is the value that is returned by pcre2_config(). However, when
|
|
the 16-bit library is compiled, a value of 3 is rounded up to 4, and
|
|
when the 32-bit library is compiled, internal linkages always use 4
|
|
bytes, so the configured value is not relevant.
|
|
|
|
The default value of 2 for the 8-bit and 16-bit libraries is sufficient
|
|
for all but the most massive patterns, since it allows the size of the
|
|
compiled pattern to be up to 65535 code units. Larger values allow
|
|
larger regular expressions to be compiled by those two libraries, but
|
|
at the expense of slower matching.
|
|
|
|
PCRE2_CONFIG_MATCHLIMIT
|
|
|
|
The output is a uint32_t integer that gives the default match limit for
|
|
pcre2_match(). Further details are given with pcre2_set_match_limit()
|
|
above.
|
|
|
|
PCRE2_CONFIG_NEWLINE
|
|
|
|
The output is a uint32_t integer whose value specifies the default
|
|
character sequence that is recognized as meaning "newline". The values
|
|
are:
|
|
|
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
|
PCRE2_NEWLINE_NUL The NUL character (binary zero)
|
|
|
|
The default should normally correspond to the standard sequence for
|
|
your operating system.
|
|
|
|
PCRE2_CONFIG_NEVER_BACKSLASH_C
|
|
|
|
The output is a uint32_t integer that is set to one if the use of \C
|
|
was permanently disabled when PCRE2 was built; otherwise it is set to
|
|
zero.
|
|
|
|
PCRE2_CONFIG_PARENSLIMIT
|
|
|
|
The output is a uint32_t integer that gives the maximum depth of nest-
|
|
ing of parentheses (of any kind) in a pattern. This limit is imposed to
|
|
cap the amount of system stack used when a pattern is compiled. It is
|
|
specified when PCRE2 is built; the default is 250. This limit does not
|
|
take into account the stack that may already be used by the calling ap-
|
|
plication. For finer control over compilation stack usage, see
|
|
pcre2_set_compile_recursion_guard().
|
|
|
|
PCRE2_CONFIG_STACKRECURSE
|
|
|
|
This parameter is obsolete and should not be used in new code. The out-
|
|
put is a uint32_t integer that is always set to zero.
|
|
|
|
PCRE2_CONFIG_TABLES_LENGTH
|
|
|
|
The output is a uint32_t integer that gives the length of PCRE2's char-
|
|
acter processing tables in bytes. For details of these tables see the
|
|
section on locale support below.
|
|
|
|
PCRE2_CONFIG_UNICODE_VERSION
|
|
|
|
The where argument should point to a buffer that is at least 24 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) If PCRE2 has been compiled
|
|
without Unicode support, the buffer is filled with the text "Unicode
|
|
not supported". Otherwise, the Unicode version string (for example,
|
|
"8.0.0") is inserted. The number of code units used is returned. This
|
|
is the length of the string plus one unit for the terminating zero.
|
|
|
|
PCRE2_CONFIG_UNICODE
|
|
|
|
The output is a uint32_t integer that is set to one if Unicode support
|
|
is available; otherwise it is set to zero. Unicode support implies UTF
|
|
support.
|
|
|
|
PCRE2_CONFIG_VERSION
|
|
|
|
The where argument should point to a buffer that is at least 24 code
|
|
units long. (The exact length required can be found by calling
|
|
pcre2_config() with where set to NULL.) The buffer is filled with the
|
|
PCRE2 version string, zero-terminated. The number of code units used is
|
|
returned. This is the length of the string plus one unit for the termi-
|
|
nating zero.
|
|
|
|
|
|
COMPILING A PATTERN
|
|
|
|
pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
|
|
uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
|
|
pcre2_compile_context *ccontext);
|
|
|
|
void pcre2_code_free(pcre2_code *code);
|
|
|
|
pcre2_code *pcre2_code_copy(const pcre2_code *code);
|
|
|
|
pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
|
|
|
|
The pcre2_compile() function compiles a pattern into an internal form.
|
|
The pattern is defined by a pointer to a string of code units and a
|
|
length (in code units). If the pattern is zero-terminated, the length
|
|
can be specified as PCRE2_ZERO_TERMINATED. The function returns a
|
|
pointer to a block of memory that contains the compiled pattern and re-
|
|
lated data, or NULL if an error occurred.
|
|
|
|
If the compile context argument ccontext is NULL, memory for the com-
|
|
piled pattern is obtained by calling malloc(). Otherwise, it is ob-
|
|
tained from the same memory function that was used for the compile con-
|
|
text. The caller must free the memory by calling pcre2_code_free() when
|
|
it is no longer needed. If pcre2_code_free() is called with a NULL ar-
|
|
gument, it returns immediately, without doing anything.
|
|
|
|
The function pcre2_code_copy() makes a copy of the compiled code in new
|
|
memory, using the same memory allocator as was used for the original.
|
|
However, if the code has been processed by the JIT compiler (see be-
|
|
low), the JIT information cannot be copied (because it is position-de-
|
|
pendent). The new copy can initially be used only for non-JIT match-
|
|
ing, though it can be passed to pcre2_jit_compile() if required. If
|
|
pcre2_code_copy() is called with a NULL argument, it returns NULL.
|
|
|
|
The pcre2_code_copy() function provides a way for individual threads in
|
|
a multithreaded application to acquire a private copy of shared com-
|
|
piled code. However, it does not make a copy of the character tables
|
|
used by the compiled pattern; the new pattern code points to the same
|
|
tables as the original code. (See "Locale Support" below for details
|
|
of these character tables.) In many applications the same tables are
|
|
used throughout, so this behaviour is appropriate. Nevertheless, there
|
|
are occasions when a copy of a compiled pattern and the relevant tables
|
|
are needed. The pcre2_code_copy_with_tables() provides this facility.
|
|
Copies of both the code and the tables are made, with the new code
|
|
pointing to the new tables. The memory for the new tables is automati-
|
|
cally freed when pcre2_code_free() is called for the new copy of the
|
|
compiled code. If pcre2_code_copy_with_tables() is called with a NULL
|
|
argument, it returns NULL.
|
|
|
|
NOTE: When one of the matching functions is called, pointers to the
|
|
compiled pattern and the subject string are set in the match data block
|
|
so that they can be referenced by the substring extraction functions
|
|
after a successful match. After running a match, you must not free a
|
|
compiled pattern or a subject string until after all operations on the
|
|
match data block have taken place, unless, in the case of the subject
|
|
string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
|
|
described in the section entitled "Option bits for pcre2_match()" be-
|
|
low.
|
|
|
|
The options argument for pcre2_compile() contains various bit settings
|
|
that affect the compilation. It should be zero if none of them are re-
|
|
quired. The available options are described below. Some of them (in
|
|
particular, those that are compatible with Perl, but some others as
|
|
well) can also be set and unset from within the pattern (see the de-
|
|
tailed description in the pcre2pattern documentation).
|
|
|
|
For those options that can be different in different parts of the pat-
|
|
tern, the contents of the options argument specifies their settings at
|
|
the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
|
|
PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
|
|
as at compile time.
|
|
|
|
Some additional options and less frequently required compile-time pa-
|
|
rameters (for example, the newline setting) can be provided in a com-
|
|
pile context (as described above).
|
|
|
|
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
|
|
diately. Otherwise, the variables to which these point are set to an
|
|
error code and an offset (number of code units) within the pattern, re-
|
|
spectively, when pcre2_compile() returns NULL because a compilation er-
|
|
ror has occurred. The values are not defined when compilation is suc-
|
|
cessful and pcre2_compile() returns a non-NULL value.
|
|
|
|
There are nearly 100 positive error codes that pcre2_compile() may re-
|
|
turn if it finds an error in the pattern. There are also some negative
|
|
error codes that are used for invalid UTF strings when validity check-
|
|
ing is in force. These are the same as given by pcre2_match() and
|
|
pcre2_dfa_match(), and are described in the pcre2unicode documentation.
|
|
There is no separate documentation for the positive error codes, be-
|
|
cause the textual error messages that are obtained by calling the
|
|
pcre2_get_error_message() function (see "Obtaining a textual error mes-
|
|
sage" below) should be self-explanatory. Macro names starting with
|
|
PCRE2_ERROR_ are defined for both positive and negative error codes in
|
|
pcre2.h.
|
|
|
|
The value returned in erroroffset is an indication of where in the pat-
|
|
tern the error occurred. It is not necessarily the furthest point in
|
|
the pattern that was read. For example, after the error "lookbehind as-
|
|
sertion is not fixed length", the error offset points to the start of
|
|
the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
|
|
set is that of the first code unit of the failing character.
|
|
|
|
Some errors are not detected until the whole pattern has been scanned;
|
|
in these cases, the offset passed back is the length of the pattern.
|
|
Note that the offset is in code units, not characters, even in a UTF
|
|
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
|
|
acter.
|
|
|
|
This code fragment shows a typical straightforward call to pcre2_com-
|
|
pile():
|
|
|
|
pcre2_code *re;
|
|
PCRE2_SIZE erroffset;
|
|
int errorcode;
|
|
re = pcre2_compile(
|
|
"^A.*Z", /* the pattern */
|
|
PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
|
|
0, /* default options */
|
|
&errorcode, /* for error code */
|
|
&erroffset, /* for error offset */
|
|
NULL); /* no compile context */
|
|
|
|
|
|
Main compile options
|
|
|
|
The following names for option bits are defined in the pcre2.h header
|
|
file:
|
|
|
|
PCRE2_ANCHORED
|
|
|
|
If this bit is set, the pattern is forced to be "anchored", that is, it
|
|
is constrained to match only at the first matching point in the string
|
|
that is being searched (the "subject string"). This effect can also be
|
|
achieved by appropriate constructs in the pattern itself, which is the
|
|
only way to do it in Perl.
|
|
|
|
PCRE2_ALLOW_EMPTY_CLASS
|
|
|
|
By default, for compatibility with Perl, a closing square bracket that
|
|
immediately follows an opening one is treated as a data character for
|
|
the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
|
|
class, which therefore contains no characters and so can never match.
|
|
|
|
PCRE2_ALT_BSUX
|
|
|
|
This option request alternative handling of three escape sequences,
|
|
which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
|
|
When it is set:
|
|
|
|
(1) \U matches an upper case "U" character; by default \U causes a com-
|
|
pile time error (Perl uses \U to upper case subsequent characters).
|
|
|
|
(2) \u matches a lower case "u" character unless it is followed by four
|
|
hexadecimal digits, in which case the hexadecimal number defines the
|
|
code point to match. By default, \u causes a compile time error (Perl
|
|
uses it to upper case the following character).
|
|
|
|
(3) \x matches a lower case "x" character unless it is followed by two
|
|
hexadecimal digits, in which case the hexadecimal number defines the
|
|
code point to match. By default, as in Perl, a hexadecimal number is
|
|
always expected after \x, but it may have zero, one, or two digits (so,
|
|
for example, \xz matches a binary zero character followed by z).
|
|
|
|
ECMAscript 6 added additional functionality to \u. This can be accessed
|
|
using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile op-
|
|
tions" below). Note that this alternative escape handling applies only
|
|
to patterns. Neither of these options affects the processing of re-
|
|
placement strings passed to pcre2_substitute().
|
|
|
|
PCRE2_ALT_CIRCUMFLEX
|
|
|
|
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
|
|
metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
|
|
is set), and also after any internal newline. However, it does not
|
|
match after a newline at the end of the subject, for compatibility with
|
|
Perl. If you want a multiline circumflex also to match after a termi-
|
|
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
|
|
|
|
PCRE2_ALT_VERBNAMES
|
|
|
|
By default, for compatibility with Perl, the name in any verb sequence
|
|
such as (*MARK:NAME) is any sequence of characters that does not in-
|
|
clude a closing parenthesis. The name is not processed in any way, and
|
|
it is not possible to include a closing parenthesis in the name. How-
|
|
ever, if the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
|
|
cessing is applied to verb names and only an unescaped closing paren-
|
|
thesis terminates the name. A closing parenthesis can be included in a
|
|
name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
|
|
PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
|
|
whitespace in verb names is skipped and #-comments are recognized, ex-
|
|
actly as in the rest of the pattern.
|
|
|
|
PCRE2_AUTO_CALLOUT
|
|
|
|
If this bit is set, pcre2_compile() automatically inserts callout
|
|
items, all with number 255, before each pattern item, except immedi-
|
|
ately before or after an explicit callout in the pattern. For discus-
|
|
sion of the callout facility, see the pcre2callout documentation.
|
|
|
|
PCRE2_CASELESS
|
|
|
|
If this bit is set, letters in the pattern match both upper and lower
|
|
case letters in the subject. It is equivalent to Perl's /i option, and
|
|
it can be changed within a pattern by a (?i) option setting. If either
|
|
PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
|
|
characters with more than one other case, and for all characters whose
|
|
code points are greater than U+007F. Note that there are two ASCII
|
|
characters, K and S, that, in addition to their lower case ASCII equiv-
|
|
alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long
|
|
S) respectively. For lower valued characters with only one other case,
|
|
a lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP
|
|
is set, a lookup table is used for all code points less than 256, and
|
|
higher code points (available only in 16-bit or 32-bit mode) are
|
|
treated as not having another case.
|
|
|
|
PCRE2_DOLLAR_ENDONLY
|
|
|
|
If this bit is set, a dollar metacharacter in the pattern matches only
|
|
at the end of the subject string. Without this option, a dollar also
|
|
matches immediately before a newline at the end of the string (but not
|
|
before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
|
|
if PCRE2_MULTILINE is set. There is no equivalent to this option in
|
|
Perl, and no way to set it within a pattern.
|
|
|
|
PCRE2_DOTALL
|
|
|
|
If this bit is set, a dot metacharacter in the pattern matches any
|
|
character, including one that indicates a newline. However, it only
|
|
ever matches one character, even if newlines are coded as CRLF. Without
|
|
this option, a dot does not match when the current position in the sub-
|
|
ject is at a newline. This option is equivalent to Perl's /s option,
|
|
and it can be changed within a pattern by a (?s) option setting. A neg-
|
|
ative class such as [^a] always matches newline characters, and the \N
|
|
escape sequence always matches a non-newline character, independent of
|
|
the setting of PCRE2_DOTALL.
|
|
|
|
PCRE2_DUPNAMES
|
|
|
|
If this bit is set, names used to identify capture groups need not be
|
|
unique. This can be helpful for certain types of pattern when it is
|
|
known that only one instance of the named group can ever be matched.
|
|
There are more details of named capture groups below; see also the
|
|
pcre2pattern documentation.
|
|
|
|
PCRE2_ENDANCHORED
|
|
|
|
If this bit is set, the end of any pattern match must be right at the
|
|
end of the string being searched (the "subject string"). If the pattern
|
|
match succeeds by reaching (*ACCEPT), but does not reach the end of the
|
|
subject, the match fails at the current starting point. For unanchored
|
|
patterns, a new match is then tried at the next starting point. How-
|
|
ever, if the match succeeds by reaching the end of the pattern, but not
|
|
the end of the subject, backtracking occurs and an alternative match
|
|
may be found. Consider these two patterns:
|
|
|
|
.(*ACCEPT)|..
|
|
.|..
|
|
|
|
If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
|
|
"c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
|
|
can also be achieved by appropriate constructs in the pattern itself,
|
|
which is the only way to do it in Perl.
|
|
|
|
For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
|
|
to the first (that is, the longest) matched string. Other parallel
|
|
matches, which are necessarily substrings of the first one, must obvi-
|
|
ously end before the end of the subject.
|
|
|
|
PCRE2_EXTENDED
|
|
|
|
If this bit is set, most white space characters in the pattern are to-
|
|
tally ignored except when escaped or inside a character class. However,
|
|
white space is not allowed within sequences such as (?> that introduce
|
|
various parenthesized groups, nor within numerical quantifiers such as
|
|
{1,3}. Ignorable white space is permitted between an item and a follow-
|
|
ing quantifier and between a quantifier and a following + that indi-
|
|
cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
|
|
and it can be changed within a pattern by a (?x) option setting.
|
|
|
|
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
|
|
nizes as white space only those characters with code points less than
|
|
256 that are flagged as white space in its low-character table. The ta-
|
|
ble is normally created by pcre2_maketables(), which uses the isspace()
|
|
function to identify space characters. In most ASCII environments, the
|
|
relevant characters are those with code points 0x0009 (tab), 0x000A
|
|
(linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
|
|
return), and 0x0020 (space).
|
|
|
|
When PCRE2 is compiled with Unicode support, in addition to these char-
|
|
acters, five more Unicode "Pattern White Space" characters are recog-
|
|
nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
|
|
right mark), U+200F (right-to-left mark), U+2028 (line separator), and
|
|
U+2029 (paragraph separator). This set of characters is the same as
|
|
recognized by Perl's /x option. Note that the horizontal and vertical
|
|
space characters that are matched by the \h and \v escapes in patterns
|
|
are a much bigger set.
|
|
|
|
As well as ignoring most white space, PCRE2_EXTENDED also causes char-
|
|
acters between an unescaped # outside a character class and the next
|
|
newline, inclusive, to be ignored, which makes it possible to include
|
|
comments inside complicated patterns. Note that the end of this type of
|
|
comment is a literal newline sequence in the pattern; escape sequences
|
|
that happen to represent a newline do not count.
|
|
|
|
Which characters are interpreted as newlines can be specified by a set-
|
|
ting in the compile context that is passed to pcre2_compile() or by a
|
|
special sequence at the start of the pattern, as described in the sec-
|
|
tion entitled "Newline conventions" in the pcre2pattern documentation.
|
|
A default is defined when PCRE2 is built.
|
|
|
|
PCRE2_EXTENDED_MORE
|
|
|
|
This option has the effect of PCRE2_EXTENDED, but, in addition, un-
|
|
escaped space and horizontal tab characters are ignored inside a char-
|
|
acter class. Note: only these two characters are ignored, not the full
|
|
set of pattern white space characters that are ignored outside a char-
|
|
acter class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option,
|
|
and it can be changed within a pattern by a (?xx) option setting.
|
|
|
|
PCRE2_FIRSTLINE
|
|
|
|
If this option is set, the start of an unanchored pattern match must be
|
|
before or at the first newline in the subject string following the
|
|
start of matching, though the matched text may continue over the new-
|
|
line. If startoffset is non-zero, the limiting newline is not necessar-
|
|
ily the first newline in the subject. For example, if the subject
|
|
string is "abc\nxyz" (where \n represents a single-character newline) a
|
|
pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
|
|
greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
|
|
general limiting facility. If PCRE2_FIRSTLINE is set with an offset
|
|
limit, a match must occur in the first line and also within the offset
|
|
limit. In other words, whichever limit comes first is used.
|
|
|
|
PCRE2_LITERAL
|
|
|
|
If this option is set, all meta-characters in the pattern are disabled,
|
|
and it is treated as a literal string. Matching literal strings with a
|
|
regular expression engine is not the most efficient way of doing it. If
|
|
you are doing a lot of literal matching and are worried about effi-
|
|
ciency, you should consider using other approaches. The only other main
|
|
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
|
|
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
|
|
PCRE2_MATCH_INVALID_UTF, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
|
|
PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EX-
|
|
TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
|
|
options cause an error.
|
|
|
|
PCRE2_MATCH_INVALID_UTF
|
|
|
|
This option forces PCRE2_UTF (see below) and also enables support for
|
|
matching by pcre2_match() in subject strings that contain invalid UTF
|
|
sequences. This facility is not supported for DFA matching. For de-
|
|
tails, see the pcre2unicode documentation.
|
|
|
|
PCRE2_MATCH_UNSET_BACKREF
|
|
|
|
If this option is set, a backreference to an unset capture group
|
|
matches an empty string (by default this causes the current matching
|
|
alternative to fail). A pattern such as (\1)(a) succeeds when this op-
|
|
tion is set (assuming it can find an "a" in the subject), whereas it
|
|
fails by default, for Perl compatibility. Setting this option makes
|
|
PCRE2 behave more like ECMAscript (aka JavaScript).
|
|
|
|
PCRE2_MULTILINE
|
|
|
|
By default, for the purposes of matching "start of line" and "end of
|
|
line", PCRE2 treats the subject string as consisting of a single line
|
|
of characters, even if it actually contains newlines. The "start of
|
|
line" metacharacter (^) matches only at the start of the string, and
|
|
the "end of line" metacharacter ($) matches only at the end of the
|
|
string, or before a terminating newline (except when PCRE2_DOLLAR_EN-
|
|
DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
|
|
character" metacharacter (.) does not match at a newline. This behav-
|
|
iour (for ^, $, and dot) is the same as Perl.
|
|
|
|
When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
|
|
constructs match immediately following or immediately before internal
|
|
newlines in the subject string, respectively, as well as at the very
|
|
start and end. This is equivalent to Perl's /m option, and it can be
|
|
changed within a pattern by a (?m) option setting. Note that the "start
|
|
of line" metacharacter does not match after a newline at the end of the
|
|
subject, for compatibility with Perl. However, you can change this by
|
|
setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
|
|
subject string, or no occurrences of ^ or $ in a pattern, setting
|
|
PCRE2_MULTILINE has no effect.
|
|
|
|
PCRE2_NEVER_BACKSLASH_C
|
|
|
|
This option locks out the use of \C in the pattern that is being com-
|
|
piled. This escape can cause unpredictable behaviour in UTF-8 or
|
|
UTF-16 modes, because it may leave the current matching point in the
|
|
middle of a multi-code-unit character. This option may be useful in ap-
|
|
plications that process patterns from external sources. Note that there
|
|
is also a build-time option that permanently locks out the use of \C.
|
|
|
|
PCRE2_NEVER_UCP
|
|
|
|
This option locks out the use of Unicode properties for handling \B,
|
|
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
|
|
described for the PCRE2_UCP option below. In particular, it prevents
|
|
the creator of the pattern from enabling this facility by starting the
|
|
pattern with (*UCP). This option may be useful in applications that
|
|
process patterns from external sources. The option combination PCRE_UCP
|
|
and PCRE_NEVER_UCP causes an error.
|
|
|
|
PCRE2_NEVER_UTF
|
|
|
|
This option locks out interpretation of the pattern as UTF-8, UTF-16,
|
|
or UTF-32, depending on which library is in use. In particular, it pre-
|
|
vents the creator of the pattern from switching to UTF interpretation
|
|
by starting the pattern with (*UTF). This option may be useful in ap-
|
|
plications that process patterns from external sources. The combination
|
|
of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
|
|
|
|
PCRE2_NO_AUTO_CAPTURE
|
|
|
|
If this option is set, it disables the use of numbered capturing paren-
|
|
theses in the pattern. Any opening parenthesis that is not followed by
|
|
? behaves as if it were followed by ?: but named parentheses can still
|
|
be used for capturing (and they acquire numbers in the usual way). This
|
|
is the same as Perl's /n option. Note that, when this option is set,
|
|
references to capture groups (backreferences or recursion/subroutine
|
|
calls) may only refer to named groups, though the reference can be by
|
|
name or by number.
|
|
|
|
PCRE2_NO_AUTO_POSSESS
|
|
|
|
If this option is set, it disables "auto-possessification", which is an
|
|
optimization that, for example, turns a+b into a++b in order to avoid
|
|
backtracks into a+ that can never be successful. However, if callouts
|
|
are in use, auto-possessification means that some callouts are never
|
|
taken. You can set this option if you want the matching functions to do
|
|
a full unoptimized search and run all the callouts, but it is mainly
|
|
provided for testing purposes.
|
|
|
|
PCRE2_NO_DOTSTAR_ANCHOR
|
|
|
|
If this option is set, it disables an optimization that is applied when
|
|
.* is the first significant item in a top-level branch of a pattern,
|
|
and all the other branches also start with .* or with \A or \G or ^.
|
|
The optimization is automatically disabled for .* if it is inside an
|
|
atomic group or a capture group that is the subject of a backreference,
|
|
or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
|
|
is not disabled, such a pattern is automatically anchored if
|
|
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
|
|
for any ^ items. Otherwise, the fact that any match must start either
|
|
at the start of the subject or following a newline is remembered. Like
|
|
other optimizations, this can cause callouts to be skipped.
|
|
|
|
PCRE2_NO_START_OPTIMIZE
|
|
|
|
This is an option whose main effect is at matching time. It does not
|
|
change what pcre2_compile() generates, but it does affect the output of
|
|
the JIT compiler.
|
|
|
|
There are a number of optimizations that may occur at the start of a
|
|
match, in order to speed up the process. For example, if it is known
|
|
that an unanchored match must start with a specific code unit value,
|
|
the matching code searches the subject for that value, and fails imme-
|
|
diately if it cannot find it, without actually running the main match-
|
|
ing function. This means that a special item such as (*COMMIT) at the
|
|
start of a pattern is not considered until after a suitable starting
|
|
point for the match has been found. Also, when callouts or (*MARK)
|
|
items are in use, these "start-up" optimizations can cause them to be
|
|
skipped if the pattern is never actually used. The start-up optimiza-
|
|
tions are in effect a pre-scan of the subject that takes place before
|
|
the pattern is run.
|
|
|
|
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
|
|
possibly causing performance to suffer, but ensuring that in cases
|
|
where the result is "no match", the callouts do occur, and that items
|
|
such as (*COMMIT) and (*MARK) are considered at every possible starting
|
|
position in the subject string.
|
|
|
|
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
|
|
operation. Consider the pattern
|
|
|
|
(*COMMIT)ABC
|
|
|
|
When this is compiled, PCRE2 records the fact that a match must start
|
|
with the character "A". Suppose the subject string is "DEFABC". The
|
|
start-up optimization scans along the subject, finds "A" and runs the
|
|
first match attempt from there. The (*COMMIT) item means that the pat-
|
|
tern must match the current starting position, which in this case, it
|
|
does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
|
|
set, the initial scan along the subject string does not happen. The
|
|
first match attempt is run starting from "D" and when this fails,
|
|
(*COMMIT) prevents any further matches being tried, so the overall re-
|
|
sult is "no match".
|
|
|
|
As another start-up optimization makes use of a minimum length for a
|
|
matching subject, which is recorded when possible. Consider the pattern
|
|
|
|
(*MARK:1)B(*MARK:2)(X|Y)
|
|
|
|
The minimum length for a match is two characters. If the subject is
|
|
"XXBB", the "starting character" optimization skips "XX", then tries to
|
|
match "BB", which is long enough. In the process, (*MARK:2) is encoun-
|
|
tered and remembered. When the match attempt fails, the next "B" is
|
|
found, but there is only one character left, so there are no more at-
|
|
tempts, and "no match" is returned with the "last mark seen" set to
|
|
"2". If NO_START_OPTIMIZE is set, however, matches are tried at every
|
|
possible starting position, including at the end of the subject, where
|
|
(*MARK:1) is encountered, but there is no "B", so the "last mark seen"
|
|
that is returned is "1". In this case, the optimizations do not affect
|
|
the overall match result, which is still "no match", but they do affect
|
|
the auxiliary information that is returned.
|
|
|
|
PCRE2_NO_UTF_CHECK
|
|
|
|
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
|
|
automatically checked. There are discussions about the validity of
|
|
UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
|
|
document. If an invalid UTF sequence is found, pcre2_compile() returns
|
|
a negative error code.
|
|
|
|
If you know that your pattern is a valid UTF string, and you want to
|
|
skip this check for performance reasons, you can set the
|
|
PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
|
|
valid UTF string as a pattern is undefined. It may cause your program
|
|
to crash or loop.
|
|
|
|
Note that this option can also be passed to pcre2_match() and
|
|
pcre2_dfa_match(), to suppress UTF validity checking of the subject
|
|
string.
|
|
|
|
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
|
|
able the error that is given if an escape sequence for an invalid Uni-
|
|
code code point is encountered in the pattern. In particular, the so-
|
|
called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
|
|
want to allow escape sequences such as \x{d800} you can set the
|
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
|
|
section entitled "Extra compile options" below. However, this is pos-
|
|
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
|
|
resentable in UTF-16.
|
|
|
|
PCRE2_UCP
|
|
|
|
This option has two effects. Firstly, it change the way PCRE2 processes
|
|
\B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character
|
|
classes. By default, only ASCII characters are recognized, but if
|
|
PCRE2_UCP is set, Unicode properties are used instead to classify char-
|
|
acters. More details are given in the section on generic character
|
|
types in the pcre2pattern page. If you set PCRE2_UCP, matching one of
|
|
the items it affects takes much longer.
|
|
|
|
The second effect of PCRE2_UCP is to force the use of Unicode proper-
|
|
ties for upper/lower casing operations on characters with code points
|
|
greater than 127, even when PCRE2_UTF is not set. This makes it possi-
|
|
ble, for example, to process strings in the 16-bit UCS-2 code. This op-
|
|
tion is available only if PCRE2 has been compiled with Unicode support
|
|
(which is the default).
|
|
|
|
PCRE2_UNGREEDY
|
|
|
|
This option inverts the "greediness" of the quantifiers so that they
|
|
are not greedy by default, but become greedy if followed by "?". It is
|
|
not compatible with Perl. It can also be set by a (?U) option setting
|
|
within the pattern.
|
|
|
|
PCRE2_USE_OFFSET_LIMIT
|
|
|
|
This option must be set for pcre2_compile() if pcre2_set_offset_limit()
|
|
is going to be used to set a non-default offset limit in a match con-
|
|
text for matches that use this pattern. An error is generated if an
|
|
offset limit is set without this option. For more details, see the de-
|
|
scription of pcre2_set_offset_limit() in the section that describes
|
|
match contexts. See also the PCRE2_FIRSTLINE option above.
|
|
|
|
PCRE2_UTF
|
|
|
|
This option causes PCRE2 to regard both the pattern and the subject
|
|
strings that are subsequently processed as strings of UTF characters
|
|
instead of single-code-unit strings. It is available when PCRE2 is
|
|
built to include Unicode support (which is the default). If Unicode
|
|
support is not available, the use of this option provokes an error. De-
|
|
tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in the
|
|
pcre2unicode page. In particular, note that it changes the way
|
|
PCRE2_CASELESS handles characters with code points greater than 127.
|
|
|
|
Extra compile options
|
|
|
|
The option bits that can be set in a compile context by calling the
|
|
pcre2_set_compile_extra_options() function are as follows:
|
|
|
|
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
|
|
|
|
Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
|
|
assertions, following Perl's lead. This option is provided to re-enable
|
|
the previous behaviour (act in positive lookarounds, ignore in negative
|
|
ones) in case anybody is relying on it.
|
|
|
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
|
|
|
|
This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
|
|
It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
|
|
"surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
|
|
in UTF-16 to encode code points with values in the range 0x10000 to
|
|
0x10ffff. The surrogates cannot therefore be represented in UTF-16.
|
|
They can be represented in UTF-8 and UTF-32, but are defined as invalid
|
|
code points, and cause errors if encountered in a UTF-8 or UTF-32
|
|
string that is being checked for validity by PCRE2.
|
|
|
|
These values also cause errors if encountered in escape sequences such
|
|
as \x{d912} within a pattern. However, it seems that some applications,
|
|
when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
|
|
plicitly test for the surrogates using escape sequences. The
|
|
PCRE2_NO_UTF_CHECK option does not disable the error that occurs, be-
|
|
cause it applies only to the testing of input strings for UTF validity.
|
|
|
|
If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
|
|
gate code point values in UTF-8 and UTF-32 patterns no longer provoke
|
|
errors and are incorporated in the compiled pattern. However, they can
|
|
only match subject characters if the matching function is called with
|
|
PCRE2_NO_UTF_CHECK set.
|
|
|
|
PCRE2_EXTRA_ALT_BSUX
|
|
|
|
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
|
|
\x in the way that ECMAscript (aka JavaScript) does. Additional func-
|
|
tionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has
|
|
the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
|
|
as a hexadecimal character code, where hhh.. is any number of hexadeci-
|
|
mal digits.
|
|
|
|
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
|
|
|
|
This is a dangerous option. Use with care. By default, an unrecognized
|
|
escape such as \j or a malformed one such as \x{2z} causes a compile-
|
|
time error when detected by pcre2_compile(). Perl is somewhat inconsis-
|
|
tent in handling such items: for example, \j is treated as a literal
|
|
"j", and non-hexadecimal digits in \x{} are just ignored, though warn-
|
|
ings are given in both cases if Perl's warning switch is enabled. How-
|
|
ever, a malformed octal number after \o{ always causes an error in
|
|
Perl.
|
|
|
|
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
|
|
pcre2_compile(), all unrecognized or malformed escape sequences are
|
|
treated as single-character escapes. For example, \j is a literal "j"
|
|
and \x{2z} is treated as the literal string "x{2z}". Setting this op-
|
|
tion means that typos in patterns may go undetected and have unexpected
|
|
results. Also note that a sequence such as [\N{] is interpreted as a
|
|
malformed attempt at [\N{...}] and so is treated as [N{] whereas [\N]
|
|
gives an error because an unqualified \N is a valid escape sequence but
|
|
is not supported in a character class. To reiterate: this is a danger-
|
|
ous option. Use with great care.
|
|
|
|
PCRE2_EXTRA_ESCAPED_CR_IS_LF
|
|
|
|
There are some legacy applications where the escape sequence \r in a
|
|
pattern is expected to match a newline. If this option is set, \r in a
|
|
pattern is converted to \n so that it matches a LF (linefeed) instead
|
|
of a CR (carriage return) character. The option does not affect a lit-
|
|
eral CR in the pattern, nor does it affect CR specified as an explicit
|
|
code point such as \x{0D}.
|
|
|
|
PCRE2_EXTRA_MATCH_LINE
|
|
|
|
This option is provided for use by the -x option of pcre2grep. It
|
|
causes the pattern only to match complete lines. This is achieved by
|
|
automatically inserting the code for "^(?:" at the start of the com-
|
|
piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
|
|
the matched line may be in the middle of the subject string. This op-
|
|
tion can be used with PCRE2_LITERAL.
|
|
|
|
PCRE2_EXTRA_MATCH_WORD
|
|
|
|
This option is provided for use by the -w option of pcre2grep. It
|
|
causes the pattern only to match strings that have a word boundary at
|
|
the start and the end. This is achieved by automatically inserting the
|
|
code for "\b(?:" at the start of the compiled pattern and ")\b" at the
|
|
end. The option may be used with PCRE2_LITERAL. However, it is ignored
|
|
if PCRE2_EXTRA_MATCH_LINE is also set.
|
|
|
|
|
|
JUST-IN-TIME (JIT) COMPILATION
|
|
|
|
int pcre2_jit_compile(pcre2_code *code, uint32_t options);
|
|
|
|
int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
|
|
|
pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
|
|
PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
|
|
pcre2_jit_callback callback_function, void *callback_data);
|
|
|
|
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
|
|
|
|
These functions provide support for JIT compilation, which, if the
|
|
just-in-time compiler is available, further processes a compiled pat-
|
|
tern into machine code that executes much faster than the pcre2_match()
|
|
interpretive matching function. Full details are given in the pcre2jit
|
|
documentation.
|
|
|
|
JIT compilation is a heavyweight optimization. It can take some time
|
|
for patterns to be analyzed, and for one-off matches and simple pat-
|
|
terns the benefit of faster execution might be offset by a much slower
|
|
compilation time. Most (but not all) patterns can be optimized by the
|
|
JIT compiler.
|
|
|
|
|
|
LOCALE SUPPORT
|
|
|
|
const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
|
|
|
|
void pcre2_maketables_free(pcre2_general_context *gcontext,
|
|
const uint8_t *tables);
|
|
|
|
PCRE2 handles caseless matching, and determines whether characters are
|
|
letters, digits, or whatever, by reference to a set of tables, indexed
|
|
by character code point. However, this applies only to characters whose
|
|
code points are less than 256. By default, higher-valued code points
|
|
never match escapes such as \w or \d.
|
|
|
|
When PCRE2 is built with Unicode support (the default), certain Unicode
|
|
character properties can be tested with \p and \P, or, alternatively,
|
|
the PCRE2_UCP option can be set when a pattern is compiled; this causes
|
|
\w and friends to use Unicode property support instead of the built-in
|
|
tables. PCRE2_UCP also causes upper/lower casing operations on charac-
|
|
ters with code points greater than 127 to use Unicode properties. These
|
|
effects apply even when PCRE2_UTF is not set.
|
|
|
|
The use of locales with Unicode is discouraged. If you are handling
|
|
characters with code points greater than 127, you should either use
|
|
Unicode support, or use locales, but not try to mix the two.
|
|
|
|
PCRE2 contains a built-in set of character tables that are used by de-
|
|
fault. These are sufficient for many applications. Normally, the in-
|
|
ternal tables recognize only ASCII characters. However, when PCRE2 is
|
|
built, it is possible to cause the internal tables to be rebuilt in the
|
|
default "C" locale of the local system, which may cause them to be dif-
|
|
ferent.
|
|
|
|
The built-in tables can be overridden by tables supplied by the appli-
|
|
cation that calls PCRE2. These may be created in a different locale
|
|
from the default. As more and more applications change to using Uni-
|
|
code, the need for this locale support is expected to die away.
|
|
|
|
External tables are built by calling the pcre2_maketables() function,
|
|
in the relevant locale. The only argument to this function is a general
|
|
context, which can be used to pass a custom memory allocator. If the
|
|
argument is NULL, the system malloc() is used. The result can be passed
|
|
to pcre2_compile() as often as necessary, by creating a compile context
|
|
and calling pcre2_set_character_tables() to set the tables pointer
|
|
therein.
|
|
|
|
For example, to build and use tables that are appropriate for the
|
|
French locale (where accented characters with values greater than 127
|
|
are treated as letters), the following code could be used:
|
|
|
|
setlocale(LC_CTYPE, "fr_FR");
|
|
tables = pcre2_maketables(NULL);
|
|
ccontext = pcre2_compile_context_create(NULL);
|
|
pcre2_set_character_tables(ccontext, tables);
|
|
re = pcre2_compile(..., ccontext);
|
|
|
|
The locale name "fr_FR" is used on Linux and other Unix-like systems;
|
|
if you are using Windows, the name for the French locale is "french".
|
|
|
|
The pointer that is passed (via the compile context) to pcre2_compile()
|
|
is saved with the compiled pattern, and the same tables are used by the
|
|
matching functions. Thus, for any single pattern, compilation and
|
|
matching both happen in the same locale, but different patterns can be
|
|
processed in different locales.
|
|
|
|
It is the caller's responsibility to ensure that the memory containing
|
|
the tables remains available while they are still in use. When they are
|
|
no longer needed, you can discard them using pcre2_maketables_free(),
|
|
which should pass as its first parameter the same global context that
|
|
was used to create the tables.
|
|
|
|
Saving locale tables
|
|
|
|
The tables described above are just a sequence of binary bytes, which
|
|
makes them independent of hardware characteristics such as endianness
|
|
or whether the processor is 32-bit or 64-bit. A copy of the result of
|
|
pcre2_maketables() can therefore be saved in a file or elsewhere and
|
|
re-used later, even in a different program or on another computer. The
|
|
size of the tables (number of bytes) must be obtained by calling
|
|
pcre2_config() with the PCRE2_CONFIG_TABLES_LENGTH option because
|
|
pcre2_maketables() does not return this value. Note that the
|
|
pcre2_dftables program, which is part of the PCRE2 build system, can be
|
|
used stand-alone to create a file that contains a set of binary tables.
|
|
See the pcre2build documentation for details.
|
|
|
|
|
|
INFORMATION ABOUT A COMPILED PATTERN
|
|
|
|
int pcre2_pattern_info(const pcre2 *code, uint32_t what, void *where);
|
|
|
|
The pcre2_pattern_info() function returns general information about a
|
|
compiled pattern. For information about callouts, see the next section.
|
|
The first argument for pcre2_pattern_info() is a pointer to the com-
|
|
piled pattern. The second argument specifies which piece of information
|
|
is required, and the third argument is a pointer to a variable to re-
|
|
ceive the data. If the third argument is NULL, the first argument is
|
|
ignored, and the function returns the size in bytes of the variable
|
|
that is required for the information requested. Otherwise, the yield of
|
|
the function is zero for success, or one of the following negative num-
|
|
bers:
|
|
|
|
PCRE2_ERROR_NULL the argument code was NULL
|
|
PCRE2_ERROR_BADMAGIC the "magic number" was not found
|
|
PCRE2_ERROR_BADOPTION the value of what was invalid
|
|
PCRE2_ERROR_UNSET the requested field is not set
|
|
|
|
The "magic number" is placed at the start of each compiled pattern as a
|
|
simple check against passing an arbitrary memory pointer. Here is a
|
|
typical call of pcre2_pattern_info(), to obtain the length of the com-
|
|
piled pattern:
|
|
|
|
int rc;
|
|
size_t length;
|
|
rc = pcre2_pattern_info(
|
|
re, /* result of pcre2_compile() */
|
|
PCRE2_INFO_SIZE, /* what is required */
|
|
&length); /* where to put the data */
|
|
|
|
The possible values for the second argument are defined in pcre2.h, and
|
|
are as follows:
|
|
|
|
PCRE2_INFO_ALLOPTIONS
|
|
PCRE2_INFO_ARGOPTIONS
|
|
PCRE2_INFO_EXTRAOPTIONS
|
|
|
|
Return copies of the pattern's options. The third argument should point
|
|
to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
|
|
tions that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
|
|
TIONS returns the compile options as modified by any top-level (*XXX)
|
|
option settings such as (*UTF) at the start of the pattern itself.
|
|
PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
|
|
compile context by calling the pcre2_set_compile_extra_options() func-
|
|
tion.
|
|
|
|
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
|
|
TENDED option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED
|
|
and PCRE2_UTF. Option settings such as (?i) that can change within a
|
|
pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
|
|
appear right at the start of the pattern. (This was different in some
|
|
earlier releases.)
|
|
|
|
A pattern compiled without PCRE2_ANCHORED is automatically anchored by
|
|
PCRE2 if the first significant item in every top-level branch is one of
|
|
the following:
|
|
|
|
^ unless PCRE2_MULTILINE is set
|
|
\A always
|
|
\G always
|
|
.* sometimes - see below
|
|
|
|
When .* is the first significant item, anchoring is possible only when
|
|
all the following are true:
|
|
|
|
.* is not in an atomic group
|
|
.* is not in a capture group that is the subject
|
|
of a backreference
|
|
PCRE2_DOTALL is in force for .*
|
|
Neither (*PRUNE) nor (*SKIP) appears in the pattern
|
|
PCRE2_NO_DOTSTAR_ANCHOR is not set
|
|
|
|
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
|
|
the options returned for PCRE2_INFO_ALLOPTIONS.
|
|
|
|
PCRE2_INFO_BACKREFMAX
|
|
|
|
Return the number of the highest backreference in the pattern. The
|
|
third argument should point to a uint32_t variable. Named capture
|
|
groups acquire numbers as well as names, and these count towards the
|
|
highest backreference. Backreferences such as \4 or \g{12} match the
|
|
captured characters of the given group, but in addition, the check that
|
|
a capture group is set in a conditional group such as (?(3)a|b) is also
|
|
a backreference. Zero is returned if there are no backreferences.
|
|
|
|
PCRE2_INFO_BSR
|
|
|
|
The output is a uint32_t integer whose value indicates what character
|
|
sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
|
|
means that \R matches any Unicode line ending sequence; a value of
|
|
PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
|
|
|
|
PCRE2_INFO_CAPTURECOUNT
|
|
|
|
Return the highest capture group number in the pattern. In patterns
|
|
where (?| is not used, this is also the total number of capture groups.
|
|
The third argument should point to a uint32_t variable.
|
|
|
|
PCRE2_INFO_DEPTHLIMIT
|
|
|
|
If the pattern set a backtracking depth limit by including an item of
|
|
the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
|
|
third argument should point to a uint32_t integer. If no such value has
|
|
been set, the call to pcre2_pattern_info() returns the error PCRE2_ER-
|
|
ROR_UNSET. Note that this limit will only be used during matching if it
|
|
is less than the limit set or defaulted by the caller of the match
|
|
function.
|
|
|
|
PCRE2_INFO_FIRSTBITMAP
|
|
|
|
In the absence of a single first code unit for a non-anchored pattern,
|
|
pcre2_compile() may construct a 256-bit table that defines a fixed set
|
|
of values for the first code unit in any match. For example, a pattern
|
|
that starts with [abc] results in a table with three bits set. When
|
|
code unit values greater than 255 are supported, the flag bit for 255
|
|
means "any code unit of value 255 or above". If such a table was con-
|
|
structed, a pointer to it is returned. Otherwise NULL is returned. The
|
|
third argument should point to a const uint8_t * variable.
|
|
|
|
PCRE2_INFO_FIRSTCODETYPE
|
|
|
|
Return information about the first code unit of any matched string, for
|
|
a non-anchored pattern. The third argument should point to a uint32_t
|
|
variable. If there is a fixed first value, for example, the letter "c"
|
|
from a pattern such as (cat|cow|coyote), 1 is returned, and the value
|
|
can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
|
|
first value, but it is known that a match can occur only at the start
|
|
of the subject or following a newline in the subject, 2 is returned.
|
|
Otherwise, and for anchored patterns, 0 is returned.
|
|
|
|
PCRE2_INFO_FIRSTCODEUNIT
|
|
|
|
Return the value of the first code unit of any matched string for a
|
|
pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
|
|
The third argument should point to a uint32_t variable. In the 8-bit
|
|
library, the value is always less than 256. In the 16-bit library the
|
|
value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
|
|
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
|
|
mode.
|
|
|
|
PCRE2_INFO_FRAMESIZE
|
|
|
|
Return the size (in bytes) of the data frames that are used to remember
|
|
backtracking positions when the pattern is processed by pcre2_match()
|
|
without the use of JIT. The third argument should point to a size_t
|
|
variable. The frame size depends on the number of capturing parentheses
|
|
in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
|
|
ables.
|
|
|
|
PCRE2_INFO_HASBACKSLASHC
|
|
|
|
Return 1 if the pattern contains any instances of \C, otherwise 0. The
|
|
third argument should point to a uint32_t variable.
|
|
|
|
PCRE2_INFO_HASCRORLF
|
|
|
|
Return 1 if the pattern contains any explicit matches for CR or LF
|
|
characters, otherwise 0. The third argument should point to a uint32_t
|
|
variable. An explicit match is either a literal CR or LF character, or
|
|
\r or \n or one of the equivalent hexadecimal or octal escape se-
|
|
quences.
|
|
|
|
PCRE2_INFO_HEAPLIMIT
|
|
|
|
If the pattern set a heap memory limit by including an item of the form
|
|
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
|
|
ment should point to a uint32_t integer. If no such value has been set,
|
|
the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
|
|
Note that this limit will only be used during matching if it is less
|
|
than the limit set or defaulted by the caller of the match function.
|
|
|
|
PCRE2_INFO_JCHANGED
|
|
|
|
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
|
|
otherwise 0. The third argument should point to a uint32_t variable.
|
|
(?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
|
|
tively.
|
|
|
|
PCRE2_INFO_JITSIZE
|
|
|
|
If the compiled pattern was successfully processed by pcre2_jit_com-
|
|
pile(), return the size of the JIT compiled code, otherwise return
|
|
zero. The third argument should point to a size_t variable.
|
|
|
|
PCRE2_INFO_LASTCODETYPE
|
|
|
|
Returns 1 if there is a rightmost literal code unit that must exist in
|
|
any matched string, other than at its start. The third argument should
|
|
point to a uint32_t variable. If there is no such value, 0 is returned.
|
|
When 1 is returned, the code unit value itself can be retrieved using
|
|
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
|
recorded only if it follows something of variable length. For example,
|
|
for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
|
|
from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
|
|
0.
|
|
|
|
PCRE2_INFO_LASTCODEUNIT
|
|
|
|
Return the value of the rightmost literal code unit that must exist in
|
|
any matched string, other than at its start, for a pattern where
|
|
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
|
|
ment should point to a uint32_t variable.
|
|
|
|
PCRE2_INFO_MATCHEMPTY
|
|
|
|
Return 1 if the pattern might match an empty string, otherwise 0. The
|
|
third argument should point to a uint32_t variable. When a pattern con-
|
|
tains recursive subroutine calls it is not always possible to determine
|
|
whether or not it can match an empty string. PCRE2 takes a cautious ap-
|
|
proach and returns 1 in such cases.
|
|
|
|
PCRE2_INFO_MATCHLIMIT
|
|
|
|
If the pattern set a match limit by including an item of the form
|
|
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third ar-
|
|
gument should point to a uint32_t integer. If no such value has been
|
|
set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
|
|
SET. Note that this limit will only be used during matching if it is
|
|
less than the limit set or defaulted by the caller of the match func-
|
|
tion.
|
|
|
|
PCRE2_INFO_MAXLOOKBEHIND
|
|
|
|
A lookbehind assertion moves back a certain number of characters (not
|
|
code units) when it starts to process each of its branches. This re-
|
|
quest returns the largest of these backward moves. The third argument
|
|
should point to a uint32_t integer. The simple assertions \b and \B re-
|
|
quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
|
|
return 1 in the absence of anything longer. \A also registers a one-
|
|
character lookbehind, though it does not actually inspect the previous
|
|
character.
|
|
|
|
Note that this information is useful for multi-segment matching only if
|
|
the pattern contains no nested lookbehinds. For example, the pattern
|
|
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is pro-
|
|
cessed, the first lookbehind moves back by two characters, matches one
|
|
character, then the nested lookbehind also moves back by two charac-
|
|
ters. This puts the matching point three characters earlier than it was
|
|
at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
|
|
bugging tool. See the pcre2partial documentation for a discussion of
|
|
multi-segment matching.
|
|
|
|
PCRE2_INFO_MINLENGTH
|
|
|
|
If a minimum length for matching subject strings was computed, its
|
|
value is returned. Otherwise the returned value is 0. This value is not
|
|
computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
|
|
characters, which in UTF mode may be different from the number of code
|
|
units. The third argument should point to a uint32_t variable. The
|
|
value is a lower bound to the length of any matching string. There may
|
|
not be any strings of that length that do actually match, but every
|
|
string that does match is at least that long.
|
|
|
|
PCRE2_INFO_NAMECOUNT
|
|
PCRE2_INFO_NAMEENTRYSIZE
|
|
PCRE2_INFO_NAMETABLE
|
|
|
|
PCRE2 supports the use of named as well as numbered capturing parenthe-
|
|
ses. The names are just an additional way of identifying the parenthe-
|
|
ses, which still acquire numbers. Several convenience functions such as
|
|
pcre2_substring_get_byname() are provided for extracting captured sub-
|
|
strings by name. It is also possible to extract the data directly, by
|
|
first converting the name to a number in order to access the correct
|
|
pointers in the output vector (described with pcre2_match() below). To
|
|
do the conversion, you need to use the name-to-number map, which is de-
|
|
scribed by these three values.
|
|
|
|
The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
|
|
COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
|
|
the size of each entry in code units; both of these return a uint32_t
|
|
value. The entry size depends on the length of the longest name.
|
|
|
|
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
|
|
This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
|
|
brary, the first two bytes of each entry are the number of the captur-
|
|
ing parenthesis, most significant byte first. In the 16-bit library,
|
|
the pointer points to 16-bit code units, the first of which contains
|
|
the parenthesis number. In the 32-bit library, the pointer points to
|
|
32-bit code units, the first of which contains the parenthesis number.
|
|
The rest of the entry is the corresponding name, zero terminated.
|
|
|
|
The names are in alphabetical order. If (?| is used to create multiple
|
|
capture groups with the same number, as described in the section on du-
|
|
plicate group numbers in the pcre2pattern page, the groups may be given
|
|
the same name, but there is only one entry in the table. Different
|
|
names for groups of the same number are not permitted.
|
|
|
|
Duplicate names for capture groups with different numbers are permit-
|
|
ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
|
|
order in which they were found in the pattern. In the absence of (?|
|
|
this is the order of increasing number; when (?| is used this is not
|
|
necessarily the case because later capture groups may have lower num-
|
|
bers.
|
|
|
|
As a simple example of the name/number table, consider the following
|
|
pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
|
|
is set, so white space - including newlines - is ignored):
|
|
|
|
(?<date> (?<year>(\d\d)?\d\d) -
|
|
(?<month>\d\d) - (?<day>\d\d) )
|
|
|
|
There are four named capture groups, so the table has four entries, and
|
|
each entry in the table is eight bytes long. The table is as follows,
|
|
with non-printing bytes shows in hexadecimal, and undefined bytes shown
|
|
as ??:
|
|
|
|
00 01 d a t e 00 ??
|
|
00 05 d a y 00 ?? ??
|
|
00 04 m o n t h 00
|
|
00 02 y e a r 00 ??
|
|
|
|
When writing code to extract data from named capture groups using the
|
|
name-to-number map, remember that the length of the entries is likely
|
|
to be different for each compiled pattern.
|
|
|
|
PCRE2_INFO_NEWLINE
|
|
|
|
The output is one of the following uint32_t values:
|
|
|
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
|
PCRE2_NEWLINE_NUL The NUL character (binary zero)
|
|
|
|
This identifies the character sequence that will be recognized as mean-
|
|
ing "newline" while matching.
|
|
|
|
PCRE2_INFO_SIZE
|
|
|
|
Return the size of the compiled pattern in bytes (for all three li-
|
|
braries). The third argument should point to a size_t variable. This
|
|
value includes the size of the general data block that precedes the
|
|
code units of the compiled pattern itself. The value that is used when
|
|
pcre2_compile() is getting memory in which to place the compiled pat-
|
|
tern may be slightly larger than the value returned by this option, be-
|
|
cause there are cases where the code that calculates the size has to
|
|
over-estimate. Processing a pattern with the JIT compiler does not al-
|
|
ter the value returned by this option.
|
|
|
|
|
|
INFORMATION ABOUT A PATTERN'S CALLOUTS
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
A script language that supports the use of string arguments in callouts
|
|
might like to scan all the callouts in a pattern before running the
|
|
match. This can be done by calling pcre2_callout_enumerate(). The first
|
|
argument is a pointer to a compiled pattern, the second points to a
|
|
callback function, and the third is arbitrary user data. The callback
|
|
function is called for every callout in the pattern in the order in
|
|
which they appear. Its first argument is a pointer to a callout enumer-
|
|
ation block, and its second argument is the user_data value that was
|
|
passed to pcre2_callout_enumerate(). The contents of the callout enu-
|
|
meration block are described in the pcre2callout documentation, which
|
|
also gives further details about callouts.
|
|
|
|
|
|
SERIALIZATION AND PRECOMPILING
|
|
|
|
It is possible to save compiled patterns on disc or elsewhere, and
|
|
reload them later, subject to a number of restrictions. The host on
|
|
which the patterns are reloaded must be running the same version of
|
|
PCRE2, with the same code unit width, and must also have the same endi-
|
|
anness, pointer width, and PCRE2_SIZE type. Before compiled patterns
|
|
can be saved, they must be converted to a "serialized" form, which in
|
|
the case of PCRE2 is really just a bytecode dump. The functions whose
|
|
names begin with pcre2_serialize_ are used for converting to and from
|
|
the serialized form. They are described in the pcre2serialize documen-
|
|
tation. Note that PCRE2 serialization does not convert compiled pat-
|
|
terns to an abstract format like Java or .NET serialization.
|
|
|
|
|
|
THE MATCH DATA BLOCK
|
|
|
|
pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
|
|
pcre2_general_context *gcontext);
|
|
|
|
pcre2_match_data *pcre2_match_data_create_from_pattern(
|
|
const pcre2_code *code, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_match_data_free(pcre2_match_data *match_data);
|
|
|
|
Information about a successful or unsuccessful match is placed in a
|
|
match data block, which is an opaque structure that is accessed by
|
|
function calls. In particular, the match data block contains a vector
|
|
of offsets into the subject string that define the matched parts of the
|
|
subject. This is known as the ovector.
|
|
|
|
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
|
|
you must create a match data block by calling one of the creation func-
|
|
tions above. For pcre2_match_data_create(), the first argument is the
|
|
number of pairs of offsets in the ovector.
|
|
|
|
When using pcre2_match(), one pair of offsets is required to identify
|
|
the string that matched the whole pattern, with an additional pair for
|
|
each captured substring. For example, a value of 4 creates enough space
|
|
to record the matched portion of the subject plus three captured sub-
|
|
strings.
|
|
|
|
When using pcre2_dfa_match() there may be multiple matched substrings
|
|
of different lengths at the same point in the subject. The ovector
|
|
should be made large enough to hold as many as are expected.
|
|
|
|
A minimum of at least 1 pair is imposed by pcre2_match_data_create(),
|
|
so it is always possible to return the overall matched string in the
|
|
case of pcre2_match() or the longest match in the case of
|
|
pcre2_dfa_match().
|
|
|
|
The second argument of pcre2_match_data_create() is a pointer to a gen-
|
|
eral context, which can specify custom memory management for obtaining
|
|
the memory for the match data block. If you are not using custom memory
|
|
management, pass NULL, which causes malloc() to be used.
|
|
|
|
For pcre2_match_data_create_from_pattern(), the first argument is a
|
|
pointer to a compiled pattern. The ovector is created to be exactly the
|
|
right size to hold all the substrings a pattern might capture when
|
|
matched using pcre2_match(). You should not use this call when matching
|
|
with pcre2_dfa_match(). The second argument is again a pointer to a
|
|
general context, but in this case if NULL is passed, the memory is ob-
|
|
tained using the same allocator that was used for the compiled pattern
|
|
(custom or default).
|
|
|
|
A match data block can be used many times, with the same or different
|
|
compiled patterns. You can extract information from a match data block
|
|
after a match operation has finished, using functions that are de-
|
|
scribed in the sections on matched strings and other match data below.
|
|
|
|
When a call of pcre2_match() fails, valid data is available in the
|
|
match block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ER-
|
|
ROR_PARTIAL, or one of the error codes for an invalid UTF string. Ex-
|
|
actly what is available depends on the error, and is detailed below.
|
|
|
|
When one of the matching functions is called, pointers to the compiled
|
|
pattern and the subject string are set in the match data block so that
|
|
they can be referenced by the extraction functions after a successful
|
|
match. After running a match, you must not free a compiled pattern or a
|
|
subject string until after all operations on the match data block (for
|
|
that match) have taken place, unless, in the case of the subject
|
|
string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
|
|
described in the section entitled "Option bits for pcre2_match()" be-
|
|
low.
|
|
|
|
When a match data block itself is no longer needed, it should be freed
|
|
by calling pcre2_match_data_free(). If this function is called with a
|
|
NULL argument, it returns immediately, without doing anything.
|
|
|
|
|
|
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
|
|
|
int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext);
|
|
|
|
The function pcre2_match() is called to match a subject string against
|
|
a compiled pattern, which is passed in the code argument. You can call
|
|
pcre2_match() with the same code argument as many times as you like, in
|
|
order to find multiple matches in the subject string or to match dif-
|
|
ferent subject strings with the same pattern.
|
|
|
|
This function is the main matching facility of the library, and it op-
|
|
erates in a Perl-like manner. For specialist use there is also an al-
|
|
ternative matching function, which is described below in the section
|
|
about the pcre2_dfa_match() function.
|
|
|
|
Here is an example of a simple call to pcre2_match():
|
|
|
|
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
|
|
int rc = pcre2_match(
|
|
re, /* result of pcre2_compile() */
|
|
"some string", /* the subject string */
|
|
11, /* the length of the subject string */
|
|
0, /* start at offset 0 in the subject */
|
|
0, /* default options */
|
|
md, /* the match data block */
|
|
NULL); /* a match context; NULL means use defaults */
|
|
|
|
If the subject string is zero-terminated, the length can be given as
|
|
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
|
|
common matching parameters are to be changed. For details, see the sec-
|
|
tion on the match context above.
|
|
|
|
The string to be matched by pcre2_match()
|
|
|
|
The subject string is passed to pcre2_match() as a pointer in subject,
|
|
a length in length, and a starting offset in startoffset. The length
|
|
and offset are in code units, not characters. That is, they are in
|
|
bytes for the 8-bit library, 16-bit code units for the 16-bit library,
|
|
and 32-bit code units for the 32-bit library, whether or not UTF pro-
|
|
cessing is enabled. As a special case, if subject is NULL and length is
|
|
zero, the subject is assumed to be an empty string. If length is non-
|
|
zero, an error occurs if subject is NULL.
|
|
|
|
If startoffset is greater than the length of the subject, pcre2_match()
|
|
returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
|
|
search for a match starts at the beginning of the subject, and this is
|
|
by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
|
|
set must point to the start of a character, or to the end of the sub-
|
|
ject (in UTF-32 mode, one code unit equals one character, so all off-
|
|
sets are valid). Like the pattern string, the subject may contain bi-
|
|
nary zeros.
|
|
|
|
A non-zero starting offset is useful when searching for another match
|
|
in the same subject by calling pcre2_match() again after a previous
|
|
success. Setting startoffset differs from passing over a shortened
|
|
string and setting PCRE2_NOTBOL in the case of a pattern that begins
|
|
with any kind of lookbehind. For example, consider the pattern
|
|
|
|
\Biss\B
|
|
|
|
which finds occurrences of "iss" in the middle of words. (\B matches
|
|
only if the current position in the subject is not a word boundary.)
|
|
When applied to the string "Mississippi" the first call to
|
|
pcre2_match() finds the first occurrence. If pcre2_match() is called
|
|
again with just the remainder of the subject, namely "issippi", it does
|
|
not match, because \B is always false at the start of the subject,
|
|
which is deemed to be a word boundary. However, if pcre2_match() is
|
|
passed the entire string again, but with startoffset set to 4, it finds
|
|
the second occurrence of "iss" because it is able to look behind the
|
|
starting point to discover that it is preceded by a letter.
|
|
|
|
Finding all the matches in a subject is tricky when the pattern can
|
|
match an empty string. It is possible to emulate Perl's /g behaviour by
|
|
first trying the match again at the same offset, with the
|
|
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
|
|
fails, advancing the starting offset and trying an ordinary match
|
|
again. There is some code that demonstrates how to do this in the
|
|
pcre2demo sample program. In the most general case, you have to check
|
|
to see if the newline convention recognizes CRLF as a newline, and if
|
|
so, and the current character is CR followed by LF, advance the start-
|
|
ing offset by two characters instead of one.
|
|
|
|
If a non-zero starting offset is passed when the pattern is anchored, a
|
|
single attempt to match at the given offset is made. This can only suc-
|
|
ceed if the pattern does not require the match to be at the start of
|
|
the subject. In other words, the anchoring must be the result of set-
|
|
ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
|
|
by starting the pattern with ^ or \A.
|
|
|
|
Option bits for pcre2_match()
|
|
|
|
The unused bits of the options argument for pcre2_match() must be zero.
|
|
The only bits that may be set are PCRE2_ANCHORED,
|
|
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
|
|
TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
|
|
PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
|
|
action is described below.
|
|
|
|
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
|
|
ported by the just-in-time (JIT) compiler. If it is set, JIT matching
|
|
is disabled and the interpretive code in pcre2_match() is run. Apart
|
|
from PCRE2_NO_JIT (obviously), the remaining options are supported for
|
|
JIT matching.
|
|
|
|
PCRE2_ANCHORED
|
|
|
|
The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
|
|
matching position. If a pattern was compiled with PCRE2_ANCHORED, or
|
|
turned out to be anchored by virtue of its contents, it cannot be made
|
|
unachored at matching time. Note that setting the option at match time
|
|
disables JIT matching.
|
|
|
|
PCRE2_COPY_MATCHED_SUBJECT
|
|
|
|
By default, a pointer to the subject is remembered in the match data
|
|
block so that, after a successful match, it can be referenced by the
|
|
substring extraction functions. This means that the subject's memory
|
|
must not be freed until all such operations are complete. For some ap-
|
|
plications where the lifetime of the subject string is not guaranteed,
|
|
it may be necessary to make a copy of the subject string, but it is
|
|
wasteful to do this unless the match is successful. After a successful
|
|
match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and
|
|
the new pointer is remembered in the match data block instead of the
|
|
original subject pointer. The memory allocator that was used for the
|
|
match block itself is used. The copy is automatically freed when
|
|
pcre2_match_data_free() is called to free the match data block. It is
|
|
also automatically freed if the match data block is re-used for another
|
|
match operation.
|
|
|
|
PCRE2_ENDANCHORED
|
|
|
|
If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
|
|
matches must be right at the end of the subject string. Note that set-
|
|
ting the option at match time disables JIT matching.
|
|
|
|
PCRE2_NOTBOL
|
|
|
|
This option specifies that first character of the subject string is not
|
|
the beginning of a line, so the circumflex metacharacter should not
|
|
match before it. Setting this without having set PCRE2_MULTILINE at
|
|
compile time causes circumflex never to match. This option affects only
|
|
the behaviour of the circumflex metacharacter. It does not affect \A.
|
|
|
|
PCRE2_NOTEOL
|
|
|
|
This option specifies that the end of the subject string is not the end
|
|
of a line, so the dollar metacharacter should not match it nor (except
|
|
in multiline mode) a newline immediately before it. Setting this with-
|
|
out having set PCRE2_MULTILINE at compile time causes dollar never to
|
|
match. This option affects only the behaviour of the dollar metacharac-
|
|
ter. It does not affect \Z or \z.
|
|
|
|
PCRE2_NOTEMPTY
|
|
|
|
An empty string is not considered to be a valid match if this option is
|
|
set. If there are alternatives in the pattern, they are tried. If all
|
|
the alternatives match the empty string, the entire match fails. For
|
|
example, if the pattern
|
|
|
|
a?b?
|
|
|
|
is applied to a string not beginning with "a" or "b", it matches an
|
|
empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
|
|
match is not valid, so pcre2_match() searches further into the string
|
|
for occurrences of "a" or "b".
|
|
|
|
PCRE2_NOTEMPTY_ATSTART
|
|
|
|
This is like PCRE2_NOTEMPTY, except that it locks out an empty string
|
|
match only at the first matching position, that is, at the start of the
|
|
subject plus the starting offset. An empty string match later in the
|
|
subject is permitted. If the pattern is anchored, such a match can oc-
|
|
cur only if the pattern contains \K.
|
|
|
|
PCRE2_NO_JIT
|
|
|
|
By default, if a pattern has been successfully processed by
|
|
pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
|
|
called with options that JIT supports. Setting PCRE2_NO_JIT disables
|
|
the use of JIT; it forces matching to be done by the interpreter.
|
|
|
|
PCRE2_NO_UTF_CHECK
|
|
|
|
When PCRE2_UTF is set at compile time, the validity of the subject as a
|
|
UTF string is checked unless PCRE2_NO_UTF_CHECK is passed to
|
|
pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
|
|
The latter special case is discussed in detail in the pcre2unicode doc-
|
|
umentation.
|
|
|
|
In the default case, if a non-zero starting offset is given, the check
|
|
is applied only to that part of the subject that could be inspected
|
|
during matching, and there is a check that the starting offset points
|
|
to the first code unit of a character or to the end of the subject. If
|
|
there are no lookbehind assertions in the pattern, the check starts at
|
|
the starting offset. Otherwise, it starts at the length of the longest
|
|
lookbehind before the starting offset, or at the start of the subject
|
|
if there are not that many characters before the starting offset. Note
|
|
that the sequences \b and \B are one-character lookbehinds.
|
|
|
|
The check is carried out before any other processing takes place, and a
|
|
negative error code is returned if the check fails. There are several
|
|
UTF error codes for each code unit width, corresponding to different
|
|
problems with the code unit sequence. There are discussions about the
|
|
validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
|
|
pcre2unicode documentation.
|
|
|
|
If you know that your subject is valid, and you want to skip this check
|
|
for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
|
|
calling pcre2_match(). You might want to do this for the second and
|
|
subsequent calls to pcre2_match() if you are making repeated calls to
|
|
find multiple matches in the same subject string.
|
|
|
|
Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
|
|
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an in-
|
|
valid string as a subject, or an invalid value of startoffset, is unde-
|
|
fined. Your program may crash or loop indefinitely or give wrong re-
|
|
sults.
|
|
|
|
PCRE2_PARTIAL_HARD
|
|
PCRE2_PARTIAL_SOFT
|
|
|
|
These options turn on the partial matching feature. A partial match oc-
|
|
curs if the end of the subject string is reached successfully, but
|
|
there are not enough subject characters to complete the match. In addi-
|
|
tion, either at least one character must have been inspected or the
|
|
pattern must contain a lookbehind, or the pattern must be one that
|
|
could match an empty string.
|
|
|
|
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
|
|
TIAL_HARD) is set, matching continues by testing any remaining alterna-
|
|
tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
|
|
returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR-
|
|
TIAL_SOFT specifies that the caller is prepared to handle a partial
|
|
match, but only if no complete match can be found.
|
|
|
|
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
|
|
case, if a partial match is found, pcre2_match() immediately returns
|
|
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
|
|
other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
|
|
ered to be more important that an alternative complete match.
|
|
|
|
There is a more detailed discussion of partial and multi-segment match-
|
|
ing, with examples, in the pcre2partial documentation.
|
|
|
|
|
|
NEWLINE HANDLING WHEN MATCHING
|
|
|
|
When PCRE2 is built, a default newline convention is set; this is usu-
|
|
ally the standard convention for the operating system. The default can
|
|
be overridden in a compile context by calling pcre2_set_newline(). It
|
|
can also be overridden by starting a pattern string with, for example,
|
|
(*CRLF), as described in the section on newline conventions in the
|
|
pcre2pattern page. During matching, the newline choice affects the be-
|
|
haviour of the dot, circumflex, and dollar metacharacters. It may also
|
|
alter the way the match starting position is advanced after a match
|
|
failure for an unanchored pattern.
|
|
|
|
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
|
|
set as the newline convention, and a match attempt for an unanchored
|
|
pattern fails when the current starting position is at a CRLF sequence,
|
|
and the pattern contains no explicit matches for CR or LF characters,
|
|
the match position is advanced by two characters instead of one, in
|
|
other words, to after the CRLF.
|
|
|
|
The above rule is a compromise that makes the most common cases work as
|
|
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL op-
|
|
tion is not set), it does not match the string "\r\nA" because, after
|
|
failing at the start, it skips both the CR and the LF before retrying.
|
|
However, the pattern [\r\n]A does match that string, because it con-
|
|
tains an explicit CR or LF reference, and so advances only by one char-
|
|
acter after the first failure.
|
|
|
|
An explicit match for CR of LF is either a literal appearance of one of
|
|
those characters in the pattern, or one of the \r or \n or equivalent
|
|
octal or hexadecimal escape sequences. Implicit matches such as [^X] do
|
|
not count, nor does \s, even though it includes CR and LF in the char-
|
|
acters that it matches.
|
|
|
|
Notwithstanding the above, anomalous effects may still occur when CRLF
|
|
is a valid newline sequence and explicit \r or \n escapes appear in the
|
|
pattern.
|
|
|
|
|
|
HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
|
|
|
|
uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
|
|
|
|
In general, a pattern matches a certain portion of the subject, and in
|
|
addition, further substrings from the subject may be picked out by
|
|
parenthesized parts of the pattern. Following the usage in Jeffrey
|
|
Friedl's book, this is called "capturing" in what follows, and the
|
|
phrase "capture group" (Perl terminology) is used for a fragment of a
|
|
pattern that picks out a substring. PCRE2 supports several other kinds
|
|
of parenthesized group that do not cause substrings to be captured. The
|
|
pcre2_pattern_info() function can be used to find out how many capture
|
|
groups there are in a compiled pattern.
|
|
|
|
You can use auxiliary functions for accessing captured substrings by
|
|
number or by name, as described in sections below.
|
|
|
|
Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
|
|
ues, called the ovector, which contains the offsets of captured
|
|
strings. It is part of the match data block. The function
|
|
pcre2_get_ovector_pointer() returns the address of the ovector, and
|
|
pcre2_get_ovector_count() returns the number of pairs of values it con-
|
|
tains.
|
|
|
|
Within the ovector, the first in each pair of values is set to the off-
|
|
set of the first code unit of a substring, and the second is set to the
|
|
offset of the first code unit after the end of a substring. These val-
|
|
ues are always code unit offsets, not character offsets. That is, they
|
|
are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
|
|
brary, and 32-bit offsets in the 32-bit library.
|
|
|
|
After a partial match (error return PCRE2_ERROR_PARTIAL), only the
|
|
first pair of offsets (that is, ovector[0] and ovector[1]) are set.
|
|
They identify the part of the subject that was partially matched. See
|
|
the pcre2partial documentation for details of partial matching.
|
|
|
|
After a fully successful match, the first pair of offsets identifies
|
|
the portion of the subject string that was matched by the entire pat-
|
|
tern. The next pair is used for the first captured substring, and so
|
|
on. The value returned by pcre2_match() is one more than the highest
|
|
numbered pair that has been set. For example, if two substrings have
|
|
been captured, the returned value is 3. If there are no captured sub-
|
|
strings, the return value from a successful match is 1, indicating that
|
|
just the first pair of offsets has been set.
|
|
|
|
If a pattern uses the \K escape sequence within a positive assertion,
|
|
the reported start of a successful match can be greater than the end of
|
|
the match. For example, if the pattern (?=ab\K) is matched against
|
|
"ab", the start and end offset values for the match are 2 and 0.
|
|
|
|
If a capture group is matched repeatedly within a single match opera-
|
|
tion, it is the last portion of the subject that it matched that is re-
|
|
turned.
|
|
|
|
If the ovector is too small to hold all the captured substring offsets,
|
|
as much as possible is filled in, and the function returns a value of
|
|
zero. If captured substrings are not of interest, pcre2_match() may be
|
|
called with a match data block whose ovector is of minimum length (that
|
|
is, one pair).
|
|
|
|
It is possible for capture group number n+1 to match some part of the
|
|
subject when group n has not been used at all. For example, if the
|
|
string "abc" is matched against the pattern (a|(z))(bc) the return from
|
|
the function is 4, and groups 1 and 3 are matched, but 2 is not. When
|
|
this happens, both values in the offset pairs corresponding to unused
|
|
groups are set to PCRE2_UNSET.
|
|
|
|
Offset values that correspond to unused groups at the end of the ex-
|
|
pression are also set to PCRE2_UNSET. For example, if the string "abc"
|
|
is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are not
|
|
matched. The return from the function is 2, because the highest used
|
|
capture group number is 1. The offsets for for the second and third
|
|
capture groupss (assuming the vector is large enough, of course) are
|
|
set to PCRE2_UNSET.
|
|
|
|
Elements in the ovector that do not correspond to capturing parentheses
|
|
in the pattern are never changed. That is, if a pattern contains n cap-
|
|
turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
|
|
pcre2_match(). The other elements retain whatever values they previ-
|
|
ously had. After a failed match attempt, the contents of the ovector
|
|
are unchanged.
|
|
|
|
|
|
OTHER INFORMATION ABOUT A MATCH
|
|
|
|
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
|
|
|
|
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
|
|
|
|
As well as the offsets in the ovector, other information about a match
|
|
is retained in the match data block and can be retrieved by the above
|
|
functions in appropriate circumstances. If they are called at other
|
|
times, the result is undefined.
|
|
|
|
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
|
|
failure to match (PCRE2_ERROR_NOMATCH), a mark name may be available.
|
|
The function pcre2_get_mark() can be called to access this name, which
|
|
can be specified in the pattern by any of the backtracking control
|
|
verbs, not just (*MARK). The same function applies to all the verbs. It
|
|
returns a pointer to the zero-terminated name, which is within the com-
|
|
piled pattern. If no name is available, NULL is returned. The length of
|
|
the name (excluding the terminating zero) is stored in the code unit
|
|
that precedes the name. You should use this length instead of relying
|
|
on the terminating zero if the name might contain a binary zero.
|
|
|
|
After a successful match, the name that is returned is the last mark
|
|
name encountered on the matching path through the pattern. Instances of
|
|
backtracking verbs without names do not count. Thus, for example, if
|
|
the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
|
|
After a "no match" or a partial match, the last encountered name is re-
|
|
turned. For example, consider this pattern:
|
|
|
|
^(*MARK:A)((*MARK:B)a|b)c
|
|
|
|
When it matches "bc", the returned name is A. The B mark is "seen" in
|
|
the first branch of the group, but it is not on the matching path. On
|
|
the other hand, when this pattern fails to match "bx", the returned
|
|
name is B.
|
|
|
|
Warning: By default, certain start-of-match optimizations are used to
|
|
give a fast "no match" result in some situations. For example, if the
|
|
anchoring is removed from the pattern above, there is an initial check
|
|
for the presence of "c" in the subject before running the matching en-
|
|
gine. This check fails for "bx", causing a match failure without seeing
|
|
any marks. You can disable the start-of-match optimizations by setting
|
|
the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or by starting
|
|
the pattern with (*NO_START_OPT).
|
|
|
|
After a successful match, a partial match, or one of the invalid UTF
|
|
errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
|
|
be called. After a successful or partial match it returns the code unit
|
|
offset of the character at which the match started. For a non-partial
|
|
match, this can be different to the value of ovector[0] if the pattern
|
|
contains the \K escape sequence. After a partial match, however, this
|
|
value is always the same as ovector[0] because \K does not affect the
|
|
result of a partial match.
|
|
|
|
After a UTF check failure, pcre2_get_startchar() can be used to obtain
|
|
the code unit offset of the invalid UTF character. Details are given in
|
|
the pcre2unicode page.
|
|
|
|
|
|
ERROR RETURNS FROM pcre2_match()
|
|
|
|
If pcre2_match() fails, it returns a negative number. This can be con-
|
|
verted to a text string by calling the pcre2_get_error_message() func-
|
|
tion (see "Obtaining a textual error message" below). Negative error
|
|
codes are also returned by other functions, and are documented with
|
|
them. The codes are given names in the header file. If UTF checking is
|
|
in force and an invalid UTF subject string is detected, one of a number
|
|
of UTF-specific negative error codes is returned. Details are given in
|
|
the pcre2unicode page. The following are the other errors that may be
|
|
returned by pcre2_match():
|
|
|
|
PCRE2_ERROR_NOMATCH
|
|
|
|
The subject string did not match the pattern.
|
|
|
|
PCRE2_ERROR_PARTIAL
|
|
|
|
The subject string did not match, but it did match partially. See the
|
|
pcre2partial documentation for details of partial matching.
|
|
|
|
PCRE2_ERROR_BADMAGIC
|
|
|
|
PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
|
|
to catch the case when it is passed a junk pointer. This is the error
|
|
that is returned when the magic number is not present.
|
|
|
|
PCRE2_ERROR_BADMODE
|
|
|
|
This error is given when a compiled pattern is passed to a function in
|
|
a library of a different code unit width, for example, a pattern com-
|
|
piled by the 8-bit library is passed to a 16-bit or 32-bit library
|
|
function.
|
|
|
|
PCRE2_ERROR_BADOFFSET
|
|
|
|
The value of startoffset was greater than the length of the subject.
|
|
|
|
PCRE2_ERROR_BADOPTION
|
|
|
|
An unrecognized bit was set in the options argument.
|
|
|
|
PCRE2_ERROR_BADUTFOFFSET
|
|
|
|
The UTF code unit sequence that was passed as a subject was checked and
|
|
found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
|
|
value of startoffset did not point to the beginning of a UTF character
|
|
or the end of the subject.
|
|
|
|
PCRE2_ERROR_CALLOUT
|
|
|
|
This error is never generated by pcre2_match() itself. It is provided
|
|
for use by callout functions that want to cause pcre2_match() or
|
|
pcre2_callout_enumerate() to return a distinctive error code. See the
|
|
pcre2callout documentation for details.
|
|
|
|
PCRE2_ERROR_DEPTHLIMIT
|
|
|
|
The nested backtracking depth limit was reached.
|
|
|
|
PCRE2_ERROR_HEAPLIMIT
|
|
|
|
The heap limit was reached.
|
|
|
|
PCRE2_ERROR_INTERNAL
|
|
|
|
An unexpected internal error has occurred. This error could be caused
|
|
by a bug in PCRE2 or by overwriting of the compiled pattern.
|
|
|
|
PCRE2_ERROR_JIT_STACKLIMIT
|
|
|
|
This error is returned when a pattern that was successfully studied us-
|
|
ing JIT is being matched, but the memory available for the just-in-time
|
|
processing stack is not large enough. See the pcre2jit documentation
|
|
for more details.
|
|
|
|
PCRE2_ERROR_MATCHLIMIT
|
|
|
|
The backtracking match limit was reached.
|
|
|
|
PCRE2_ERROR_NOMEMORY
|
|
|
|
If a pattern contains many nested backtracking points, heap memory is
|
|
used to remember them. This error is given when the memory allocation
|
|
function (default or custom) fails. Note that a different error,
|
|
PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
|
|
the heap limit. PCRE2_ERROR_NOMEMORY is also returned if
|
|
PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
|
|
|
|
PCRE2_ERROR_NULL
|
|
|
|
Either the code, subject, or match_data argument was passed as NULL.
|
|
|
|
PCRE2_ERROR_RECURSELOOP
|
|
|
|
This error is returned when pcre2_match() detects a recursion loop
|
|
within the pattern. Specifically, it means that either the whole pat-
|
|
tern or a capture group has been called recursively for the second time
|
|
at the same position in the subject string. Some simple patterns that
|
|
might do this are detected and faulted at compile time, but more com-
|
|
plicated cases, in particular mutual recursions between two different
|
|
groups, cannot be detected until matching is attempted.
|
|
|
|
|
|
OBTAINING A TEXTUAL ERROR MESSAGE
|
|
|
|
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
|
|
PCRE2_SIZE bufflen);
|
|
|
|
A text message for an error code from any PCRE2 function (compile,
|
|
match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
|
|
sage(). The code is passed as the first argument, with the remaining
|
|
two arguments specifying a code unit buffer and its length in code
|
|
units, into which the text message is placed. The message is returned
|
|
in code units of the appropriate width for the library that is being
|
|
used.
|
|
|
|
The returned message is terminated with a trailing zero, and the func-
|
|
tion returns the number of code units used, excluding the trailing
|
|
zero. If the error number is unknown, the negative error code PCRE2_ER-
|
|
ROR_BADDATA is returned. If the buffer is too small, the message is
|
|
truncated (but still with a trailing zero), and the negative error code
|
|
PCRE2_ERROR_NOMEMORY is returned. None of the messages are very long;
|
|
a buffer size of 120 code units is ample.
|
|
|
|
|
|
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
|
|
|
|
int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR *buffer,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
|
|
uint32_t number, PCRE2_UCHAR **bufferptr,
|
|
PCRE2_SIZE *bufflen);
|
|
|
|
void pcre2_substring_free(PCRE2_UCHAR *buffer);
|
|
|
|
Captured substrings can be accessed directly by using the ovector as
|
|
described above. For convenience, auxiliary functions are provided for
|
|
extracting captured substrings as new, separate, zero-terminated
|
|
strings. A substring that contains a binary zero is correctly extracted
|
|
and has a further zero added on the end, but the result is not, of
|
|
course, a C string.
|
|
|
|
The functions in this section identify substrings by number. The number
|
|
zero refers to the entire matched substring, with higher numbers refer-
|
|
ring to substrings captured by parenthesized groups. After a partial
|
|
match, only substring zero is available. An attempt to extract any
|
|
other substring gives the error PCRE2_ERROR_PARTIAL. The next section
|
|
describes similar functions for extracting captured substrings by name.
|
|
|
|
If a pattern uses the \K escape sequence within a positive assertion,
|
|
the reported start of a successful match can be greater than the end of
|
|
the match. For example, if the pattern (?=ab\K) is matched against
|
|
"ab", the start and end offset values for the match are 2 and 0. In
|
|
this situation, calling these functions with a zero substring number
|
|
extracts a zero-length empty string.
|
|
|
|
You can find the length in code units of a captured substring without
|
|
extracting it by calling pcre2_substring_length_bynumber(). The first
|
|
argument is a pointer to the match data block, the second is the group
|
|
number, and the third is a pointer to a variable into which the length
|
|
is placed. If you just want to know whether or not the substring has
|
|
been captured, you can pass the third argument as NULL.
|
|
|
|
The pcre2_substring_copy_bynumber() function copies a captured sub-
|
|
string into a supplied buffer, whereas pcre2_substring_get_bynumber()
|
|
copies it into new memory, obtained using the same memory allocation
|
|
function that was used for the match data block. The first two argu-
|
|
ments of these functions are a pointer to the match data block and a
|
|
capture group number.
|
|
|
|
The final arguments of pcre2_substring_copy_bynumber() are a pointer to
|
|
the buffer and a pointer to a variable that contains its length in code
|
|
units. This is updated to contain the actual number of code units used
|
|
for the extracted substring, excluding the terminating zero.
|
|
|
|
For pcre2_substring_get_bynumber() the third and fourth arguments point
|
|
to variables that are updated with a pointer to the new memory and the
|
|
number of code units that comprise the substring, again excluding the
|
|
terminating zero. When the substring is no longer needed, the memory
|
|
should be freed by calling pcre2_substring_free().
|
|
|
|
The return value from all these functions is zero for success, or a
|
|
negative error code. If the pattern match failed, the match failure
|
|
code is returned. If a substring number greater than zero is used af-
|
|
ter a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
|
|
error codes are:
|
|
|
|
PCRE2_ERROR_NOMEMORY
|
|
|
|
The buffer was too small for pcre2_substring_copy_bynumber(), or the
|
|
attempt to get memory failed for pcre2_substring_get_bynumber().
|
|
|
|
PCRE2_ERROR_NOSUBSTRING
|
|
|
|
There is no substring with that number in the pattern, that is, the
|
|
number is greater than the number of capturing parentheses.
|
|
|
|
PCRE2_ERROR_UNAVAILABLE
|
|
|
|
The substring number, though not greater than the number of captures in
|
|
the pattern, is greater than the number of slots in the ovector, so the
|
|
substring could not be captured.
|
|
|
|
PCRE2_ERROR_UNSET
|
|
|
|
The substring did not participate in the match. For example, if the
|
|
pattern is (abc)|(def) and the subject is "def", and the ovector con-
|
|
tains at least two capturing slots, substring number 1 is unset.
|
|
|
|
|
|
EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
|
|
|
|
int pcre2_substring_list_get(pcre2_match_data *match_data,
|
|
PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
|
|
|
|
void pcre2_substring_list_free(PCRE2_SPTR *list);
|
|
|
|
The pcre2_substring_list_get() function extracts all available sub-
|
|
strings and builds a list of pointers to them. It also (optionally)
|
|
builds a second list that contains their lengths (in code units), ex-
|
|
cluding a terminating zero that is added to each of them. All this is
|
|
done in a single block of memory that is obtained using the same memory
|
|
allocation function that was used to get the match data block.
|
|
|
|
This function must be called only after a successful match. If called
|
|
after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
|
|
|
|
The address of the memory block is returned via listptr, which is also
|
|
the start of the list of string pointers. The end of the list is marked
|
|
by a NULL pointer. The address of the list of lengths is returned via
|
|
lengthsptr. If your strings do not contain binary zeros and you do not
|
|
therefore need the lengths, you may supply NULL as the lengthsptr argu-
|
|
ment to disable the creation of a list of lengths. The yield of the
|
|
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
|
|
ory block could not be obtained. When the list is no longer needed, it
|
|
should be freed by calling pcre2_substring_list_free().
|
|
|
|
If this function encounters a substring that is unset, which can happen
|
|
when capture group number n+1 matches some part of the subject, but
|
|
group n has not been used at all, it returns an empty string. This can
|
|
be distinguished from a genuine zero-length substring by inspecting the
|
|
appropriate offset in the ovector, which contain PCRE2_UNSET for unset
|
|
substrings, or by calling pcre2_substring_length_bynumber().
|
|
|
|
|
|
EXTRACTING CAPTURED SUBSTRINGS BY NAME
|
|
|
|
int pcre2_substring_number_from_name(const pcre2_code *code,
|
|
PCRE2_SPTR name);
|
|
|
|
int pcre2_substring_length_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_SIZE *length);
|
|
|
|
int pcre2_substring_copy_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
|
|
|
|
int pcre2_substring_get_byname(pcre2_match_data *match_data,
|
|
PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
|
|
|
|
void pcre2_substring_free(PCRE2_UCHAR *buffer);
|
|
|
|
To extract a substring by name, you first have to find associated num-
|
|
ber. For example, for this pattern:
|
|
|
|
(a+)b(?<xxx>\d+)...
|
|
|
|
the number of the capture group called "xxx" is 2. If the name is known
|
|
to be unique (PCRE2_DUPNAMES was not set), you can find the number from
|
|
the name by calling pcre2_substring_number_from_name(). The first argu-
|
|
ment is the compiled pattern, and the second is the name. The yield of
|
|
the function is the group number, PCRE2_ERROR_NOSUBSTRING if there is
|
|
no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is
|
|
more than one group with that name. Given the number, you can extract
|
|
the substring directly from the ovector, or use one of the "bynumber"
|
|
functions described above.
|
|
|
|
For convenience, there are also "byname" functions that correspond to
|
|
the "bynumber" functions, the only difference being that the second ar-
|
|
gument is a name instead of a number. If PCRE2_DUPNAMES is set and
|
|
there are duplicate names, these functions scan all the groups with the
|
|
given name, and return the captured substring from the first named
|
|
group that is set.
|
|
|
|
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
|
|
returned. If all groups with the name have numbers that are greater
|
|
than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
|
|
turned. If there is at least one group with a slot in the ovector, but
|
|
no group is found to be set, PCRE2_ERROR_UNSET is returned.
|
|
|
|
Warning: If the pattern uses the (?| feature to set up multiple capture
|
|
groups with the same number, as described in the section on duplicate
|
|
group numbers in the pcre2pattern page, you cannot use names to distin-
|
|
guish the different capture groups, because names are not included in
|
|
the compiled code. The matching process uses only numbers. For this
|
|
reason, the use of different names for groups with the same number
|
|
causes an error at compile time.
|
|
|
|
|
|
CREATING A NEW STRING WITH SUBSTITUTIONS
|
|
|
|
int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext, PCRE2_SPTR replacement,
|
|
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
|
|
PCRE2_SIZE *outlengthptr);
|
|
|
|
This function optionally calls pcre2_match() and then makes a copy of
|
|
the subject string in outputbuffer, replacing parts that were matched
|
|
with the replacement string, whose length is supplied in rlength, which
|
|
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
|
|
a special case, if replacement is NULL and rlength is zero, the re-
|
|
placement is assumed to be an empty string. If rlength is non-zero, an
|
|
error occurs if replacement is NULL.
|
|
|
|
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
|
|
turn just the replacement string(s). The default action is to perform
|
|
just one replacement if the pattern matches, but there is an option
|
|
that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL be-
|
|
low).
|
|
|
|
If successful, pcre2_substitute() returns the number of substitutions
|
|
that were carried out. This may be zero if no match was found, and is
|
|
never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
|
|
tive value is returned if an error is detected.
|
|
|
|
Matches in which a \K item in a lookahead in the pattern causes the
|
|
match to end before it starts are not supported, and give rise to an
|
|
error return. For global replacements, matches in which \K in a lookbe-
|
|
hind causes the match to start earlier than the point that was reached
|
|
in the previous iteration are also not supported.
|
|
|
|
The first seven arguments of pcre2_substitute() are the same as for
|
|
pcre2_match(), except that the partial matching options are not permit-
|
|
ted, and match_data may be passed as NULL, in which case a match data
|
|
block is obtained and freed within this function, using memory manage-
|
|
ment functions from the match context, if provided, or else those that
|
|
were used to allocate memory for the compiled code.
|
|
|
|
If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
|
|
provided block is used for all calls to pcre2_match(), and its contents
|
|
afterwards are the result of the final call. For global changes, this
|
|
will always be a no-match error. The contents of the ovector within the
|
|
match data block may or may not have been changed.
|
|
|
|
As well as the usual options for pcre2_match(), a number of additional
|
|
options can be set in the options argument of pcre2_substitute(). One
|
|
such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
|
|
match_data block must be provided, and it must have already been used
|
|
for an external call to pcre2_match() with the same pattern and subject
|
|
arguments. The data in the match_data block (return code, offset vec-
|
|
tor) is then used for the first substitution instead of calling
|
|
pcre2_match() from within pcre2_substitute(). This allows an applica-
|
|
tion to check for a match before choosing to substitute, without having
|
|
to repeat the match.
|
|
|
|
The contents of the externally supplied match data block are not
|
|
changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI-
|
|
TUTE_GLOBAL is also set, pcre2_match() is called after the first sub-
|
|
stitution to check for further matches, but this is done using an in-
|
|
ternally obtained match data block, thus always leaving the external
|
|
block unchanged.
|
|
|
|
The code argument is not used for matching before the first substitu-
|
|
tion when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided,
|
|
even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
|
|
formation such as the UTF setting and the number of capturing parenthe-
|
|
ses in the pattern.
|
|
|
|
The default action of pcre2_substitute() is to return a copy of the
|
|
subject string with matched substrings replaced. However, if PCRE2_SUB-
|
|
STITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
|
|
returned. In the global case, multiple replacements are concatenated in
|
|
the output buffer. Substitution callouts (see below) can be used to
|
|
separate them if necessary.
|
|
|
|
The outlengthptr argument of pcre2_substitute() must point to a vari-
|
|
able that contains the length, in code units, of the output buffer. If
|
|
the function is successful, the value is updated to contain the length
|
|
in code units of the new string, excluding the trailing zero that is
|
|
automatically added.
|
|
|
|
If the function is not successful, the value set via outlengthptr de-
|
|
pends on the type of error. For syntax errors in the replacement
|
|
string, the value is the offset in the replacement string where the er-
|
|
ror was detected. For other errors, the value is PCRE2_UNSET by de-
|
|
fault. This includes the case of the output buffer being too small, un-
|
|
less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
|
|
|
|
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
|
|
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
|
|
ORY immediately. If this option is set, however, pcre2_substitute()
|
|
continues to go through the motions of matching and substituting (with-
|
|
out, of course, writing anything) in order to compute the size of buf-
|
|
fer that is needed. This value is passed back via the outlengthptr
|
|
variable, with the result of the function still being PCRE2_ER-
|
|
ROR_NOMEMORY.
|
|
|
|
Passing a buffer size of zero is a permitted way of finding out how
|
|
much memory is needed for given substitution. However, this does mean
|
|
that the entire operation is carried out twice. Depending on the appli-
|
|
cation, it may be more efficient to allocate a large buffer and free
|
|
the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
|
|
FLOW_LENGTH.
|
|
|
|
The replacement string, which is interpreted as a UTF string in UTF
|
|
mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An
|
|
invalid UTF replacement string causes an immediate return with the rel-
|
|
evant UTF error code.
|
|
|
|
If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not in-
|
|
terpreted in any way. By default, however, a dollar character is an es-
|
|
cape character that can specify the insertion of characters from cap-
|
|
ture groups and names from (*MARK) or other control verbs in the pat-
|
|
tern. The following forms are always recognized:
|
|
|
|
$$ insert a dollar character
|
|
$<n> or ${<n>} insert the contents of group <n>
|
|
$*MARK or ${*MARK} insert a control verb name
|
|
|
|
Either a group number or a group name can be given for <n>. Curly
|
|
brackets are required only if the following character would be inter-
|
|
preted as part of the number or name. The number may be zero to include
|
|
the entire matched string. For example, if the pattern a(b)c is
|
|
matched with "=abc=" and the replacement string "+$1$0$1+", the result
|
|
is "=+babcb+=".
|
|
|
|
$*MARK inserts the name from the last encountered backtracking control
|
|
verb on the matching path that has a name. (*MARK) must always include
|
|
a name, but the other verbs need not. For example, in the case of
|
|
(*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
|
|
the relevant name is "B". This facility can be used to perform simple
|
|
simultaneous substitutions, as this pcre2test example shows:
|
|
|
|
/(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
|
|
apple lemon
|
|
2: pear orange
|
|
|
|
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
|
|
string, replacing every matching substring. If this option is not set,
|
|
only the first matching substring is replaced. The search for matches
|
|
takes place in the original subject string (that is, previous replace-
|
|
ments do not affect it). Iteration is implemented by advancing the
|
|
startoffset value for each search, which is always passed the entire
|
|
subject string. If an offset limit is set in the match context, search-
|
|
ing stops when that limit is reached.
|
|
|
|
You can restrict the effect of a global substitution to a portion of
|
|
the subject string by setting either or both of startoffset and an off-
|
|
set limit. Here is a pcre2test example:
|
|
|
|
/B/g,replace=!,use_offset_limit
|
|
ABC ABC ABC ABC\=offset=3,offset_limit=12
|
|
2: ABC A!C A!C ABC
|
|
|
|
When continuing with global substitutions after matching a substring
|
|
with zero length, an attempt to find a non-empty match at the same off-
|
|
set is performed. If this is not successful, the offset is advanced by
|
|
one character except when CRLF is a valid newline sequence and the next
|
|
two characters are CR, LF. In this case, the offset is advanced by two
|
|
characters.
|
|
|
|
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
|
|
do not appear in the pattern to be treated as unset groups. This option
|
|
should be used with care, because it means that a typo in a group name
|
|
or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
|
|
|
|
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
|
|
known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
|
|
as empty strings when inserted as described above. If this option is
|
|
not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
|
|
SET error. This option does not influence the extended substitution
|
|
syntax described below.
|
|
|
|
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
|
replacement string. Without this option, only the dollar character is
|
|
special, and only the group insertion forms listed above are valid.
|
|
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
|
|
|
|
Firstly, backslash in a replacement string is interpreted as an escape
|
|
character. The usual forms such as \n or \x{ddd} can be used to specify
|
|
particular character codes, and backslash followed by any non-alphanu-
|
|
meric character quotes that character. Extended quoting can be coded
|
|
using \Q...\E, exactly as in pattern strings.
|
|
|
|
There are also four escape sequences for forcing the case of inserted
|
|
letters. The insertion mechanism has three states: no case forcing,
|
|
force upper case, and force lower case. The escape sequences change the
|
|
current state: \U and \L change to upper or lower case forcing, respec-
|
|
tively, and \E (when not terminating a \Q quoted sequence) reverts to
|
|
no case forcing. The sequences \u and \l force the next character (if
|
|
it is a letter) to upper or lower case, respectively, and then the
|
|
state automatically reverts to no case forcing. Case forcing applies to
|
|
all inserted characters, including those from capture groups and let-
|
|
ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP
|
|
was set when the pattern was compiled, Unicode properties are used for
|
|
case forcing characters whose code points are greater than 127.
|
|
|
|
Note that case forcing sequences such as \U...\E do not nest. For exam-
|
|
ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
|
|
\E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX-
|
|
TRA_ALT_BSUX options do not apply to replacement strings.
|
|
|
|
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
|
|
flexibility to capture group substitution. The syntax is similar to
|
|
that used by Bash:
|
|
|
|
${<n>:-<string>}
|
|
${<n>:+<string1>:<string2>}
|
|
|
|
As before, <n> may be a group number or a name. The first form speci-
|
|
fies a default value. If group <n> is set, its value is inserted; if
|
|
not, <string> is expanded and the result inserted. The second form
|
|
specifies strings that are expanded and inserted when group <n> is set
|
|
or unset, respectively. The first form is just a convenient shorthand
|
|
for
|
|
|
|
${<n>:+${<n>}:<string>}
|
|
|
|
Backslash can be used to escape colons and closing curly brackets in
|
|
the replacement strings. A change of the case forcing state within a
|
|
replacement string remains in force afterwards, as shown in this
|
|
pcre2test example:
|
|
|
|
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
|
|
body
|
|
1: hello
|
|
somebody
|
|
1: HELLO
|
|
|
|
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
|
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
|
|
known groups in the extended syntax forms to be treated as unset.
|
|
|
|
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
|
|
PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
|
|
vant and are ignored.
|
|
|
|
Substitution errors
|
|
|
|
In the event of an error, pcre2_substitute() returns a negative error
|
|
code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors
|
|
from pcre2_match() are passed straight back.
|
|
|
|
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
|
|
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
|
|
|
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
|
|
ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
|
|
when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
|
|
SET_EMPTY is not set.
|
|
|
|
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
|
|
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
|
|
of buffer that is needed is returned via outlengthptr. Note that this
|
|
does not happen by default.
|
|
|
|
PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
|
|
match_data argument is NULL or if the subject or replacement arguments
|
|
are NULL. For backward compatibility reasons an exception is made for
|
|
the replacement argument if the rlength argument is also 0.
|
|
|
|
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
|
|
the replacement string, with more particular errors being PCRE2_ER-
|
|
ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
|
|
(closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
|
|
error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
|
|
(the pattern match ended before it started or the match started earlier
|
|
than the current position in the subject, which can happen if \K is
|
|
used in an assertion).
|
|
|
|
As for all PCRE2 errors, a text message that describes the error can be
|
|
obtained by calling the pcre2_get_error_message() function (see "Ob-
|
|
taining a textual error message" above).
|
|
|
|
Substitution callouts
|
|
|
|
int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
|
|
int (*callout_function)(pcre2_substitute_callout_block *, void *),
|
|
void *callout_data);
|
|
|
|
The pcre2_set_substitution_callout() function can be used to specify a
|
|
callout function for pcre2_substitute(). This information is passed in
|
|
a match context. The callout function is called after each substitution
|
|
has been processed, but it can cause the replacement not to happen. The
|
|
callout function is not called for simulated substitutions that happen
|
|
as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
|
|
|
|
The first argument of the callout function is a pointer to a substitute
|
|
callout block structure, which contains the following fields, not nec-
|
|
essarily in this order:
|
|
|
|
uint32_t version;
|
|
uint32_t subscount;
|
|
PCRE2_SPTR input;
|
|
PCRE2_SPTR output;
|
|
PCRE2_SIZE *ovector;
|
|
uint32_t oveccount;
|
|
PCRE2_SIZE output_offsets[2];
|
|
|
|
The version field contains the version number of the block format. The
|
|
current version is 0. The version number will increase in future if
|
|
more fields are added, but the intention is never to remove any of the
|
|
existing fields.
|
|
|
|
The subscount field is the number of the current match. It is 1 for the
|
|
first callout, 2 for the second, and so on. The input and output point-
|
|
ers are copies of the values passed to pcre2_substitute().
|
|
|
|
The ovector field points to the ovector, which contains the result of
|
|
the most recent match. The oveccount field contains the number of pairs
|
|
that are set in the ovector, and is always greater than zero.
|
|
|
|
The output_offsets vector contains the offsets of the replacement in
|
|
the output string. This has already been processed for dollar and (if
|
|
requested) backslash substitutions as described above.
|
|
|
|
The second argument of the callout function is the value passed as
|
|
callout_data when the function was registered. The value returned by
|
|
the callout function is interpreted as follows:
|
|
|
|
If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
|
|
STITUTE_GLOBAL is set, processing continues with a search for the next
|
|
match. If the value is not zero, the current replacement is not ac-
|
|
cepted. If the value is greater than zero, processing continues when
|
|
PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
|
|
or PCRE2_SUBSTITUTE_GLOBAL is not set), the the rest of the input is
|
|
copied to the output and the call to pcre2_substitute() exits, return-
|
|
ing the number of matches so far.
|
|
|
|
|
|
DUPLICATE CAPTURE GROUP NAMES
|
|
|
|
int pcre2_substring_nametable_scan(const pcre2_code *code,
|
|
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
|
|
|
|
When a pattern is compiled with the PCRE2_DUPNAMES option, names for
|
|
capture groups are not required to be unique. Duplicate names are al-
|
|
ways allowed for groups with the same number, created by using the (?|
|
|
feature. Indeed, if such groups are named, they are required to use the
|
|
same names.
|
|
|
|
Normally, patterns that use duplicate names are such that in any one
|
|
match, only one of each set of identically-named groups participates.
|
|
An example is shown in the pcre2pattern documentation.
|
|
|
|
When duplicates are present, pcre2_substring_copy_byname() and
|
|
pcre2_substring_get_byname() return the first substring corresponding
|
|
to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
|
|
SET is returned. The pcre2_substring_number_from_name() function re-
|
|
turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
|
|
names.
|
|
|
|
If you want to get full details of all captured substrings for a given
|
|
name, you must use the pcre2_substring_nametable_scan() function. The
|
|
first argument is the compiled pattern, and the second is the name. If
|
|
the third and fourth arguments are NULL, the function returns a group
|
|
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
|
|
|
|
When the third and fourth arguments are not NULL, they must be pointers
|
|
to variables that are updated by the function. After it has run, they
|
|
point to the first and last entries in the name-to-number table for the
|
|
given name, and the function returns the length of each entry in code
|
|
units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
|
|
no entries for the given name.
|
|
|
|
The format of the name table is described above in the section entitled
|
|
Information about a pattern. Given all the relevant entries for the
|
|
name, you can extract each of their numbers, and hence the captured
|
|
data.
|
|
|
|
|
|
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
|
|
|
|
The traditional matching function uses a similar algorithm to Perl,
|
|
which stops when it finds the first match at a given point in the sub-
|
|
ject. If you want to find all possible matches, or the longest possible
|
|
match at a given position, consider using the alternative matching
|
|
function (see below) instead. If you cannot use the alternative func-
|
|
tion, you can kludge it up by making use of the callout facility, which
|
|
is described in the pcre2callout documentation.
|
|
|
|
What you have to do is to insert a callout right at the end of the pat-
|
|
tern. When your callout function is called, extract and save the cur-
|
|
rent matched substring. Then return 1, which forces pcre2_match() to
|
|
backtrack and try other alternatives. Ultimately, when it runs out of
|
|
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
|
|
|
|
|
|
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
|
|
|
int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
|
|
PCRE2_SIZE length, PCRE2_SIZE startoffset,
|
|
uint32_t options, pcre2_match_data *match_data,
|
|
pcre2_match_context *mcontext,
|
|
int *workspace, PCRE2_SIZE wscount);
|
|
|
|
The function pcre2_dfa_match() is called to match a subject string
|
|
against a compiled pattern, using a matching algorithm that scans the
|
|
subject string just once (not counting lookaround assertions), and does
|
|
not backtrack (except when processing lookaround assertions). This has
|
|
different characteristics to the normal algorithm, and is not compati-
|
|
ble with Perl. Some of the features of PCRE2 patterns are not sup-
|
|
ported. Nevertheless, there are times when this kind of matching can be
|
|
useful. For a discussion of the two matching algorithms, and a list of
|
|
features that pcre2_dfa_match() does not support, see the pcre2matching
|
|
documentation.
|
|
|
|
The arguments for the pcre2_dfa_match() function are the same as for
|
|
pcre2_match(), plus two extras. The ovector within the match data block
|
|
is used in a different way, and this is described below. The other com-
|
|
mon arguments are used in the same way as for pcre2_match(), so their
|
|
description is not repeated here.
|
|
|
|
The two additional arguments provide workspace for the function. The
|
|
workspace vector should contain at least 20 elements. It is used for
|
|
keeping track of multiple paths through the pattern tree. More
|
|
workspace is needed for patterns and subjects where there are a lot of
|
|
potential matches.
|
|
|
|
Here is an example of a simple call to pcre2_dfa_match():
|
|
|
|
int wspace[20];
|
|
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
|
|
int rc = pcre2_dfa_match(
|
|
re, /* result of pcre2_compile() */
|
|
"some string", /* the subject string */
|
|
11, /* the length of the subject string */
|
|
0, /* start at offset 0 in the subject */
|
|
0, /* default options */
|
|
md, /* the match data block */
|
|
NULL, /* a match context; NULL means use defaults */
|
|
wspace, /* working space vector */
|
|
20); /* number of elements (NOT size in bytes) */
|
|
|
|
Option bits for pcre2_dfa_match()
|
|
|
|
The unused bits of the options argument for pcre2_dfa_match() must be
|
|
zero. The only bits that may be set are PCRE2_ANCHORED,
|
|
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
|
|
TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
|
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
|
|
PCRE2_DFA_RESTART. All but the last four of these are exactly the same
|
|
as for pcre2_match(), so their description is not repeated here.
|
|
|
|
PCRE2_PARTIAL_HARD
|
|
PCRE2_PARTIAL_SOFT
|
|
|
|
These have the same general effect as they do for pcre2_match(), but
|
|
the details are slightly different. When PCRE2_PARTIAL_HARD is set for
|
|
pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
|
|
subject is reached and there is still at least one matching possibility
|
|
that requires additional characters. This happens even if some complete
|
|
matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
|
|
return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
|
|
if the end of the subject is reached, there have been no complete
|
|
matches, but there is still at least one matching possibility. The por-
|
|
tion of the string that was inspected when the longest partial match
|
|
was found is set as the first matching string in both cases. There is a
|
|
more detailed discussion of partial and multi-segment matching, with
|
|
examples, in the pcre2partial documentation.
|
|
|
|
PCRE2_DFA_SHORTEST
|
|
|
|
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
|
|
stop as soon as it has found one match. Because of the way the alterna-
|
|
tive algorithm works, this is necessarily the shortest possible match
|
|
at the first possible matching point in the subject string.
|
|
|
|
PCRE2_DFA_RESTART
|
|
|
|
When pcre2_dfa_match() returns a partial match, it is possible to call
|
|
it again, with additional subject characters, and have it continue with
|
|
the same match. The PCRE2_DFA_RESTART option requests this action; when
|
|
it is set, the workspace and wscount options must reference the same
|
|
vector as before because data about the match so far is left in them
|
|
after a partial match. There is more discussion of this facility in the
|
|
pcre2partial documentation.
|
|
|
|
Successful returns from pcre2_dfa_match()
|
|
|
|
When pcre2_dfa_match() succeeds, it may have matched more than one sub-
|
|
string in the subject. Note, however, that all the matches from one run
|
|
of the function start at the same point in the subject. The shorter
|
|
matches are all initial substrings of the longer matches. For example,
|
|
if the pattern
|
|
|
|
<.*>
|
|
|
|
is matched against the string
|
|
|
|
This is <something> <something else> <something further> no more
|
|
|
|
the three matched strings are
|
|
|
|
<something> <something else> <something further>
|
|
<something> <something else>
|
|
<something>
|
|
|
|
On success, the yield of the function is a number greater than zero,
|
|
which is the number of matched substrings. The offsets of the sub-
|
|
strings are returned in the ovector, and can be extracted by number in
|
|
the same way as for pcre2_match(), but the numbers bear no relation to
|
|
any capture groups that may exist in the pattern, because DFA matching
|
|
does not support capturing.
|
|
|
|
Calls to the convenience functions that extract substrings by name re-
|
|
turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
|
|
ter a DFA match. The convenience functions that extract substrings by
|
|
number never return PCRE2_ERROR_NOSUBSTRING.
|
|
|
|
The matched strings are stored in the ovector in reverse order of
|
|
length; that is, the longest matching string is first. If there were
|
|
too many matches to fit into the ovector, the yield of the function is
|
|
zero, and the vector is filled with the longest matches.
|
|
|
|
NOTE: PCRE2's "auto-possessification" optimization usually applies to
|
|
character repeats at the end of a pattern (as well as internally). For
|
|
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
|
|
matching, this means that only one possible match is found. If you re-
|
|
ally do want multiple matches in such cases, either use an ungreedy re-
|
|
peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
|
|
piling.
|
|
|
|
Error returns from pcre2_dfa_match()
|
|
|
|
The pcre2_dfa_match() function returns a negative number when it fails.
|
|
Many of the errors are the same as for pcre2_match(), as described
|
|
above. There are in addition the following errors that are specific to
|
|
pcre2_dfa_match():
|
|
|
|
PCRE2_ERROR_DFA_UITEM
|
|
|
|
This return is given if pcre2_dfa_match() encounters an item in the
|
|
pattern that it does not support, for instance, the use of \C in a UTF
|
|
mode or a backreference.
|
|
|
|
PCRE2_ERROR_DFA_UCOND
|
|
|
|
This return is given if pcre2_dfa_match() encounters a condition item
|
|
that uses a backreference for the condition, or a test for recursion in
|
|
a specific capture group. These are not supported.
|
|
|
|
PCRE2_ERROR_DFA_UINVALID_UTF
|
|
|
|
This return is given if pcre2_dfa_match() is called for a pattern that
|
|
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
|
|
DFA matching.
|
|
|
|
PCRE2_ERROR_DFA_WSSIZE
|
|
|
|
This return is given if pcre2_dfa_match() runs out of space in the
|
|
workspace vector.
|
|
|
|
PCRE2_ERROR_DFA_RECURSE
|
|
|
|
When a recursion or subroutine call is processed, the matching function
|
|
calls itself recursively, using private memory for the ovector and
|
|
workspace. This error is given if the internal ovector is not large
|
|
enough. This should be extremely rare, as a vector of size 1000 is
|
|
used.
|
|
|
|
PCRE2_ERROR_DFA_BADRESTART
|
|
|
|
When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
|
|
some plausibility checks are made on the contents of the workspace,
|
|
which should contain data about the previous partial match. If any of
|
|
these checks fail, this error is given.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
|
|
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 14 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
BUILDING PCRE2
|
|
|
|
PCRE2 is distributed with a configure script that can be used to build
|
|
the library in Unix-like environments using the applications known as
|
|
Autotools. Also in the distribution are files to support building using
|
|
CMake instead of configure. The text file README contains general in-
|
|
formation about building with Autotools (some of which is repeated be-
|
|
low), and also has some comments about building on various operating
|
|
systems. There is a lot more information about building PCRE2 without
|
|
using Autotools (including information about using CMake and building
|
|
"by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
|
|
consult this file as well as the README file if you are building in a
|
|
non-Unix-like environment.
|
|
|
|
|
|
PCRE2 BUILD-TIME OPTIONS
|
|
|
|
The rest of this document describes the optional features of PCRE2 that
|
|
can be selected when the library is compiled. It assumes use of the
|
|
configure script, where the optional features are selected or dese-
|
|
lected by providing options to configure before running the make com-
|
|
mand. However, the same options can be selected in both Unix-like and
|
|
non-Unix-like environments if you are using CMake instead of configure
|
|
to build PCRE2.
|
|
|
|
If you are not using Autotools or CMake, option selection can be done
|
|
by editing the config.h file, or by passing parameter settings to the
|
|
compiler, as described in NON-AUTOTOOLS-BUILD.
|
|
|
|
The complete list of options for configure (which includes the standard
|
|
ones such as the selection of the installation directory) can be ob-
|
|
tained by running
|
|
|
|
./configure --help
|
|
|
|
The following sections include descriptions of "on/off" options whose
|
|
names begin with --enable or --disable. Because of the way that config-
|
|
ure works, --enable and --disable always come in pairs, so the comple-
|
|
mentary option always exists as well, but as it specifies the default,
|
|
it is not described. Options that specify values have names that start
|
|
with --with. At the end of a configure run, a summary of the configura-
|
|
tion is output.
|
|
|
|
|
|
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
|
|
|
|
By default, a library called libpcre2-8 is built, containing functions
|
|
that take string arguments contained in arrays of bytes, interpreted
|
|
either as single-byte characters, or UTF-8 strings. You can also build
|
|
two other libraries, called libpcre2-16 and libpcre2-32, which process
|
|
strings that are contained in arrays of 16-bit and 32-bit code units,
|
|
respectively. These can be interpreted either as single-unit characters
|
|
or UTF-16/UTF-32 strings. To build these additional libraries, add one
|
|
or both of the following to the configure command:
|
|
|
|
--enable-pcre2-16
|
|
--enable-pcre2-32
|
|
|
|
If you do not want the 8-bit library, add
|
|
|
|
--disable-pcre2-8
|
|
|
|
as well. At least one of the three libraries must be built. Note that
|
|
the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
|
|
an 8-bit program. Neither of these are built if you select only the
|
|
16-bit or 32-bit libraries.
|
|
|
|
|
|
BUILDING SHARED AND STATIC LIBRARIES
|
|
|
|
The Autotools PCRE2 building process uses libtool to build both shared
|
|
and static libraries by default. You can suppress an unwanted library
|
|
by adding one of
|
|
|
|
--disable-shared
|
|
--disable-static
|
|
|
|
to the configure command.
|
|
|
|
|
|
UNICODE AND UTF SUPPORT
|
|
|
|
By default, PCRE2 is built with support for Unicode and UTF character
|
|
strings. To build it without Unicode support, add
|
|
|
|
--disable-unicode
|
|
|
|
to the configure command. This setting applies to all three libraries.
|
|
It is not possible to build one library with Unicode support and an-
|
|
other without in the same configuration.
|
|
|
|
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
|
|
UTF-16 or UTF-32. To do that, applications that use the library can set
|
|
the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
|
|
tern. Alternatively, patterns may be started with (*UTF) unless the
|
|
application has locked this out by setting PCRE2_NEVER_UTF.
|
|
|
|
UTF support allows the libraries to process character code points up to
|
|
0x10ffff in the strings that they handle. Unicode support also gives
|
|
access to the Unicode properties of characters, using pattern escapes
|
|
such as \P, \p, and \X. Only the general category properties such as Lu
|
|
and Nd, script names, and some bi-directional properties are supported.
|
|
Details are given in the pcre2pattern documentation.
|
|
|
|
Pattern escapes such as \d and \w do not by default make use of Unicode
|
|
properties. The application can request that they do by setting the
|
|
PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
|
|
pattern may also request this by starting with (*UCP).
|
|
|
|
|
|
DISABLING THE USE OF \C
|
|
|
|
The \C escape sequence, which matches a single code unit, even in a UTF
|
|
mode, can cause unpredictable behaviour because it may leave the cur-
|
|
rent matching point in the middle of a multi-code-unit character. The
|
|
application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
|
|
tion when calling pcre2_compile(). There is also a build-time option
|
|
|
|
--enable-never-backslash-C
|
|
|
|
(note the upper case C) which locks out the use of \C entirely.
|
|
|
|
|
|
JUST-IN-TIME COMPILER SUPPORT
|
|
|
|
Just-in-time (JIT) compiler support is included in the build by speci-
|
|
fying
|
|
|
|
--enable-jit
|
|
|
|
This support is available only for certain hardware architectures. If
|
|
this option is set for an unsupported architecture, a building error
|
|
occurs. If in doubt, use
|
|
|
|
--enable-jit=auto
|
|
|
|
which enables JIT only if the current hardware is supported. You can
|
|
check if JIT is enabled in the configuration summary that is output at
|
|
the end of a configure run. If you are enabling JIT under SELinux you
|
|
may also want to add
|
|
|
|
--enable-jit-sealloc
|
|
|
|
which enables the use of an execmem allocator in JIT that is compatible
|
|
with SELinux. This has no effect if JIT is not enabled. See the
|
|
pcre2jit documentation for a discussion of JIT usage. When JIT support
|
|
is enabled, pcre2grep automatically makes use of it, unless you add
|
|
|
|
--disable-pcre2grep-jit
|
|
|
|
to the configure command.
|
|
|
|
|
|
NEWLINE RECOGNITION
|
|
|
|
By default, PCRE2 interprets the linefeed (LF) character as indicating
|
|
the end of a line. This is the normal newline character on Unix-like
|
|
systems. You can compile PCRE2 to use carriage return (CR) instead, by
|
|
adding
|
|
|
|
--enable-newline-is-cr
|
|
|
|
to the configure command. There is also an --enable-newline-is-lf op-
|
|
tion, which explicitly specifies linefeed as the newline character.
|
|
|
|
Alternatively, you can specify that line endings are to be indicated by
|
|
the two-character sequence CRLF (CR immediately followed by LF). If you
|
|
want this, add
|
|
|
|
--enable-newline-is-crlf
|
|
|
|
to the configure command. There is a fourth option, specified by
|
|
|
|
--enable-newline-is-anycrlf
|
|
|
|
which causes PCRE2 to recognize any of the three sequences CR, LF, or
|
|
CRLF as indicating a line ending. A fifth option, specified by
|
|
|
|
--enable-newline-is-any
|
|
|
|
causes PCRE2 to recognize any Unicode newline sequence. The Unicode
|
|
newline sequences are the three just mentioned, plus the single charac-
|
|
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
|
|
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
|
|
U+2029). The final option is
|
|
|
|
--enable-newline-is-nul
|
|
|
|
which causes NUL (binary zero) to be set as the default line-ending
|
|
character.
|
|
|
|
Whatever default line ending convention is selected when PCRE2 is built
|
|
can be overridden by applications that use the library. At build time
|
|
it is recommended to use the standard for your operating system.
|
|
|
|
|
|
WHAT \R MATCHES
|
|
|
|
By default, the sequence \R in a pattern matches any Unicode newline
|
|
sequence, independently of what has been selected as the line ending
|
|
sequence. If you specify
|
|
|
|
--enable-bsr-anycrlf
|
|
|
|
the default is changed so that \R matches only CR, LF, or CRLF. What-
|
|
ever is selected when PCRE2 is built can be overridden by applications
|
|
that use the library.
|
|
|
|
|
|
HANDLING VERY LARGE PATTERNS
|
|
|
|
Within a compiled pattern, offset values are used to point from one
|
|
part to another (for example, from an opening parenthesis to an alter-
|
|
nation metacharacter). By default, in the 8-bit and 16-bit libraries,
|
|
two-byte values are used for these offsets, leading to a maximum size
|
|
for a compiled pattern of around 64 thousand code units. This is suffi-
|
|
cient to handle all but the most gigantic patterns. Nevertheless, some
|
|
people do want to process truly enormous patterns, so it is possible to
|
|
compile PCRE2 to use three-byte or four-byte offsets by adding a set-
|
|
ting such as
|
|
|
|
--with-link-size=3
|
|
|
|
to the configure command. The value given must be 2, 3, or 4. For the
|
|
16-bit library, a value of 3 is rounded up to 4. In these libraries,
|
|
using longer offsets slows down the operation of PCRE2 because it has
|
|
to load additional data when handling them. For the 32-bit library the
|
|
value is always 4 and cannot be overridden; the value of --with-link-
|
|
size is ignored.
|
|
|
|
|
|
LIMITING PCRE2 RESOURCE USAGE
|
|
|
|
The pcre2_match() function increments a counter each time it goes round
|
|
its main loop. Putting a limit on this counter controls the amount of
|
|
computing resource used by a single call to pcre2_match(). The limit
|
|
can be changed at run time, as described in the pcre2api documentation.
|
|
The default is 10 million, but this can be changed by adding a setting
|
|
such as
|
|
|
|
--with-match-limit=500000
|
|
|
|
to the configure command. This setting also applies to the
|
|
pcre2_dfa_match() matching function, and to JIT matching (though the
|
|
counting is done differently).
|
|
|
|
The pcre2_match() function starts out using a 20KiB vector on the sys-
|
|
tem stack to record backtracking points. The more nested backtracking
|
|
points there are (that is, the deeper the search tree), the more memory
|
|
is needed. If the initial vector is not large enough, heap memory is
|
|
used, up to a certain limit, which is specified in kibibytes (units of
|
|
1024 bytes). The limit can be changed at run time, as described in the
|
|
pcre2api documentation. The default limit (in effect unlimited) is 20
|
|
million. You can change this by a setting such as
|
|
|
|
--with-heap-limit=500
|
|
|
|
which limits the amount of heap to 500 KiB. This limit applies only to
|
|
interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
|
|
also use the heap for internal workspace when processing complicated
|
|
patterns. This limit does not apply when JIT (which has its own memory
|
|
arrangements) is used.
|
|
|
|
You can also explicitly limit the depth of nested backtracking in the
|
|
pcre2_match() interpreter. This limit defaults to the value that is set
|
|
for --with-match-limit. You can set a lower default limit by adding,
|
|
for example,
|
|
|
|
--with-match-limit_depth=10000
|
|
|
|
to the configure command. This value can be overridden at run time.
|
|
This depth limit indirectly limits the amount of heap memory that is
|
|
used, but because the size of each backtracking "frame" depends on the
|
|
number of capturing parentheses in a pattern, the amount of heap that
|
|
is used before the limit is reached varies from pattern to pattern.
|
|
This limit was more useful in versions before 10.30, where function re-
|
|
cursion was used for backtracking.
|
|
|
|
As well as applying to pcre2_match(), the depth limit also controls the
|
|
depth of recursive function calls in pcre2_dfa_match(). These are used
|
|
for lookaround assertions, atomic groups, and recursion within pat-
|
|
terns. The limit does not apply to JIT matching.
|
|
|
|
|
|
CREATING CHARACTER TABLES AT BUILD TIME
|
|
|
|
PCRE2 uses fixed tables for processing characters whose code points are
|
|
less than 256. By default, PCRE2 is built with a set of tables that are
|
|
distributed in the file src/pcre2_chartables.c.dist. These tables are
|
|
for ASCII codes only. If you add
|
|
|
|
--enable-rebuild-chartables
|
|
|
|
to the configure command, the distributed tables are no longer used.
|
|
Instead, a program called pcre2_dftables is compiled and run. This out-
|
|
puts the source for new set of tables, created in the default locale of
|
|
your C run-time system. This method of replacing the tables does not
|
|
work if you are cross compiling, because pcre2_dftables needs to be run
|
|
on the local host and therefore not compiled with the cross compiler.
|
|
|
|
If you need to create alternative tables when cross compiling, you will
|
|
have to do so "by hand". There may also be other reasons for creating
|
|
tables manually. To cause pcre2_dftables to be built on the local
|
|
host, run a normal compiling command, and then run the program with the
|
|
output file as its argument, for example:
|
|
|
|
cc src/pcre2_dftables.c -o pcre2_dftables
|
|
./pcre2_dftables src/pcre2_chartables.c
|
|
|
|
This builds the tables in the default locale of the local host. If you
|
|
want to specify a locale, you must use the -L option:
|
|
|
|
LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
|
|
|
|
You can also specify -b (with or without -L). This causes the tables to
|
|
be written in binary instead of as source code. A set of binary tables
|
|
can be loaded into memory by an application and passed to pcre2_com-
|
|
pile() in the same way as tables created by calling pcre2_maketables().
|
|
The tables are just a string of bytes, independent of hardware charac-
|
|
teristics such as endianness. This means they can be bundled with an
|
|
application that runs in different environments, to ensure consistent
|
|
behaviour.
|
|
|
|
|
|
USING EBCDIC CODE
|
|
|
|
PCRE2 assumes by default that it will run in an environment where the
|
|
character code is ASCII or Unicode, which is a superset of ASCII. This
|
|
is the case for most computer operating systems. PCRE2 can, however, be
|
|
compiled to run in an 8-bit EBCDIC environment by adding
|
|
|
|
--enable-ebcdic --disable-unicode
|
|
|
|
to the configure command. This setting implies --enable-rebuild-charta-
|
|
bles. You should only use it if you know that you are in an EBCDIC en-
|
|
vironment (for example, an IBM mainframe operating system).
|
|
|
|
It is not possible to support both EBCDIC and UTF-8 codes in the same
|
|
version of the library. Consequently, --enable-unicode and --enable-
|
|
ebcdic are mutually exclusive.
|
|
|
|
The EBCDIC character that corresponds to an ASCII LF is assumed to have
|
|
the value 0x15 by default. However, in some EBCDIC environments, 0x25
|
|
is used. In such an environment you should use
|
|
|
|
--enable-ebcdic-nl25
|
|
|
|
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
|
|
has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
|
|
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
|
|
acter (which, in Unicode, is 0x85).
|
|
|
|
The options that select newline behaviour, such as --enable-newline-is-
|
|
cr, and equivalent run-time options, refer to these character values in
|
|
an EBCDIC environment.
|
|
|
|
|
|
PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
|
|
|
|
By default pcre2grep supports the use of callouts with string arguments
|
|
within the patterns it is matching. There are two kinds: one that gen-
|
|
erates output using local code, and another that calls an external pro-
|
|
gram or script. If --disable-pcre2grep-callout-fork is added to the
|
|
configure command, only the first kind of callout is supported; if
|
|
--disable-pcre2grep-callout is used, all callouts are completely ig-
|
|
nored. For more details of pcre2grep callouts, see the pcre2grep docu-
|
|
mentation.
|
|
|
|
|
|
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
|
|
|
|
By default, pcre2grep reads all files as plain text. You can build it
|
|
so that it recognizes files whose names end in .gz or .bz2, and reads
|
|
them with libz or libbz2, respectively, by adding one or both of
|
|
|
|
--enable-pcre2grep-libz
|
|
--enable-pcre2grep-libbz2
|
|
|
|
to the configure command. These options naturally require that the rel-
|
|
evant libraries are installed on your system. Configuration will fail
|
|
if they are not.
|
|
|
|
|
|
PCRE2GREP BUFFER SIZE
|
|
|
|
pcre2grep uses an internal buffer to hold a "window" on the file it is
|
|
scanning, in order to be able to output "before" and "after" lines when
|
|
it finds a match. The default starting size of the buffer is 20KiB. The
|
|
buffer itself is three times this size, but because of the way it is
|
|
used for holding "before" lines, the longest line that is guaranteed to
|
|
be processable is the notional buffer size. If a longer line is encoun-
|
|
tered, pcre2grep automatically expands the buffer, up to a specified
|
|
maximum size, whose default is 1MiB or the starting size, whichever is
|
|
the larger. You can change the default parameter values by adding, for
|
|
example,
|
|
|
|
--with-pcre2grep-bufsize=51200
|
|
--with-pcre2grep-max-bufsize=2097152
|
|
|
|
to the configure command. The caller of pcre2grep can override these
|
|
values by using --buffer-size and --max-buffer-size on the command
|
|
line.
|
|
|
|
|
|
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
|
|
|
|
If you add one of
|
|
|
|
--enable-pcre2test-libreadline
|
|
--enable-pcre2test-libedit
|
|
|
|
to the configure command, pcre2test is linked with the libreadline or-
|
|
libedit library, respectively, and when its input is from a terminal,
|
|
it reads it using the readline() function. This provides line-editing
|
|
and history facilities. Note that libreadline is GPL-licensed, so if
|
|
you distribute a binary of pcre2test linked in this way, there may be
|
|
licensing issues. These can be avoided by linking instead with libedit,
|
|
which has a BSD licence.
|
|
|
|
Setting --enable-pcre2test-libreadline causes the -lreadline option to
|
|
be added to the pcre2test build. In many operating environments with a
|
|
sytem-installed readline library this is sufficient. However, in some
|
|
environments (e.g. if an unmodified distribution version of readline is
|
|
in use), some extra configuration may be necessary. The INSTALL file
|
|
for libreadline says this:
|
|
|
|
"Readline uses the termcap functions, but does not link with
|
|
the termcap or curses library itself, allowing applications
|
|
which link with readline the to choose an appropriate library."
|
|
|
|
If your environment has not been set up so that an appropriate library
|
|
is automatically included, you may need to add something like
|
|
|
|
LIBS="-ncurses"
|
|
|
|
immediately before the configure command.
|
|
|
|
|
|
INCLUDING DEBUGGING CODE
|
|
|
|
If you add
|
|
|
|
--enable-debug
|
|
|
|
to the configure command, additional debugging code is included in the
|
|
build. This feature is intended for use by the PCRE2 maintainers.
|
|
|
|
|
|
DEBUGGING WITH VALGRIND SUPPORT
|
|
|
|
If you add
|
|
|
|
--enable-valgrind
|
|
|
|
to the configure command, PCRE2 will use valgrind annotations to mark
|
|
certain memory regions as unaddressable. This allows it to detect in-
|
|
valid memory accesses, and is mostly useful for debugging PCRE2 itself.
|
|
|
|
|
|
CODE COVERAGE REPORTING
|
|
|
|
If your C compiler is gcc, you can build a version of PCRE2 that can
|
|
generate a code coverage report for its test suite. To enable this, you
|
|
must install lcov version 1.6 or above. Then specify
|
|
|
|
--enable-coverage
|
|
|
|
to the configure command and build PCRE2 in the usual way.
|
|
|
|
Note that using ccache (a caching C compiler) is incompatible with code
|
|
coverage reporting. If you have configured ccache to run automatically
|
|
on your system, you must set the environment variable
|
|
|
|
CCACHE_DISABLE=1
|
|
|
|
before running make to build PCRE2, so that ccache is not used.
|
|
|
|
When --enable-coverage is used, the following addition targets are
|
|
added to the Makefile:
|
|
|
|
make coverage
|
|
|
|
This creates a fresh coverage report for the PCRE2 test suite. It is
|
|
equivalent to running "make coverage-reset", "make coverage-baseline",
|
|
"make check", and then "make coverage-report".
|
|
|
|
make coverage-reset
|
|
|
|
This zeroes the coverage counters, but does nothing else.
|
|
|
|
make coverage-baseline
|
|
|
|
This captures baseline coverage information.
|
|
|
|
make coverage-report
|
|
|
|
This creates the coverage report.
|
|
|
|
make coverage-clean-report
|
|
|
|
This removes the generated coverage report without cleaning the cover-
|
|
age data itself.
|
|
|
|
make coverage-clean-data
|
|
|
|
This removes the captured coverage data without removing the coverage
|
|
files created at compile time (*.gcno).
|
|
|
|
make coverage-clean
|
|
|
|
This cleans all coverage data including the generated coverage report.
|
|
For more information about code coverage, see the gcov and lcov docu-
|
|
mentation.
|
|
|
|
|
|
DISABLING THE Z AND T FORMATTING MODIFIERS
|
|
|
|
The C99 standard defines formatting modifiers z and t for size_t and
|
|
ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers
|
|
in environments other than old versions of Microsoft Visual Studio when
|
|
__STDC_VERSION__ is defined and has a value greater than or equal to
|
|
199901L (indicating support for C99). However, there is at least one
|
|
environment that claims to be C99 but does not support these modifiers.
|
|
If
|
|
|
|
--disable-percent-zt
|
|
|
|
is specified, no use is made of the z or t modifiers. Instead of %td or
|
|
%zu, a suitable format is used depending in the size of long for the
|
|
platform.
|
|
|
|
|
|
SUPPORT FOR FUZZERS
|
|
|
|
There is a special option for use by people who want to run fuzzing
|
|
tests on PCRE2:
|
|
|
|
--enable-fuzz-support
|
|
|
|
At present this applies only to the 8-bit library. If set, it causes an
|
|
extra library called libpcre2-fuzzsupport.a to be built, but not in-
|
|
stalled. This contains a single function called LLVMFuzzerTestOneIn-
|
|
put() whose arguments are a pointer to a string and the length of the
|
|
string. When called, this function tries to compile the string as a
|
|
pattern, and if that succeeds, to match it. This is done both with no
|
|
options and with some random options bits that are generated from the
|
|
string.
|
|
|
|
Setting --enable-fuzz-support also causes a binary called pcre2fuz-
|
|
zcheck to be created. This is normally run under valgrind or used when
|
|
PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
|
|
function and outputs information about what it is doing. The input
|
|
strings are specified by arguments: if an argument starts with "=" the
|
|
rest of it is a literal input string. Otherwise, it is assumed to be a
|
|
file name, and the contents of the file are the test string.
|
|
|
|
|
|
OBSOLETE OPTION
|
|
|
|
In versions of PCRE2 prior to 10.30, there were two ways of handling
|
|
backtracking in the pcre2_match() function. The default was to use the
|
|
system stack, but if
|
|
|
|
--disable-stack-for-recursion
|
|
|
|
was set, memory on the heap was used. From release 10.30 onwards this
|
|
has changed (the stack is no longer used) and this option now does
|
|
nothing except give a warning.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2api(3), pcre2-config(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 08 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SYNOPSIS
|
|
|
|
#include <pcre2.h>
|
|
|
|
int (*pcre2_callout)(pcre2_callout_block *, void *);
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
|
|
DESCRIPTION
|
|
|
|
PCRE2 provides a feature called "callout", which is a means of tempo-
|
|
rarily passing control to the caller of PCRE2 in the middle of pattern
|
|
matching. The caller of PCRE2 provides an external function by putting
|
|
its entry point in a match context (see pcre2_set_callout() in the
|
|
pcre2api documentation).
|
|
|
|
When using the pcre2_substitute() function, an additional callout fea-
|
|
ture is available. This does a callout after each change to the subject
|
|
string and is described in the pcre2api documentation; the rest of this
|
|
document is concerned with callouts during pattern matching.
|
|
|
|
Within a regular expression, (?C<arg>) indicates a point at which the
|
|
external function is to be called. Different callout points can be
|
|
identified by putting a number less than 256 after the letter C. The
|
|
default value is zero. Alternatively, the argument may be a delimited
|
|
string. The starting delimiter must be one of ` ' " ^ % # $ { and the
|
|
ending delimiter is the same as the start, except for {, where the end-
|
|
ing delimiter is }. If the ending delimiter is needed within the
|
|
string, it must be doubled. For example, this pattern has two callout
|
|
points:
|
|
|
|
(?C1)abc(?C"some ""arbitrary"" text")def
|
|
|
|
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
|
|
PCRE2 automatically inserts callouts, all with number 255, before each
|
|
item in the pattern except for immediately before or after an explicit
|
|
callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
|
|
|
A(?C3)B
|
|
|
|
it is processed as if it were
|
|
|
|
(?C255)A(?C3)B(?C255)
|
|
|
|
Here is a more complicated example:
|
|
|
|
A(\d{2}|--)
|
|
|
|
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
|
|
|
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
|
|
|
Notice that there is a callout before and after each parenthesis and
|
|
alternation bar. If the pattern contains a conditional group whose con-
|
|
dition is an assertion, an automatic callout is inserted immediately
|
|
before the condition. Such a callout may also be inserted explicitly,
|
|
for example:
|
|
|
|
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
|
|
|
|
This applies only to assertion conditions (because they are themselves
|
|
independent groups).
|
|
|
|
Callouts can be useful for tracking the progress of pattern matching.
|
|
The pcre2test program has a pattern qualifier (/auto_callout) that sets
|
|
automatic callouts. When any callouts are present, the output from
|
|
pcre2test indicates how the pattern is being matched. This is useful
|
|
information when you are trying to optimize the performance of a par-
|
|
ticular pattern.
|
|
|
|
|
|
MISSING CALLOUTS
|
|
|
|
You should be aware that, because of optimizations in the way PCRE2
|
|
compiles and matches patterns, callouts sometimes do not happen exactly
|
|
as you might expect.
|
|
|
|
Auto-possessification
|
|
|
|
At compile time, PCRE2 "auto-possessifies" repeated items when it knows
|
|
that what follows cannot be part of the repeat. For example, a+[bc] is
|
|
compiled as if it were a++[bc]. The pcre2test output when this pattern
|
|
is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
|
|
to the string "aaaa" is:
|
|
|
|
--->aaaa
|
|
+0 ^ a+
|
|
+2 ^ ^ [bc]
|
|
No match
|
|
|
|
This indicates that when matching [bc] fails, there is no backtracking
|
|
into a+ (because it is being treated as a++) and therefore the callouts
|
|
that would be taken for the backtracks do not occur. You can disable
|
|
the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
|
|
pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
|
|
this case, the output changes to this:
|
|
|
|
--->aaaa
|
|
+0 ^ a+
|
|
+2 ^ ^ [bc]
|
|
+2 ^ ^ [bc]
|
|
+2 ^ ^ [bc]
|
|
+2 ^^ [bc]
|
|
No match
|
|
|
|
This time, when matching [bc] fails, the matcher backtracks into a+ and
|
|
tries again, repeatedly, until a+ itself fails.
|
|
|
|
Automatic .* anchoring
|
|
|
|
By default, an optimization is applied when .* is the first significant
|
|
item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
|
|
any character, the pattern is automatically anchored. If PCRE2_DOTALL
|
|
is not set, a match can start only after an internal newline or at the
|
|
beginning of the subject, and pcre2_compile() remembers this. If a pat-
|
|
tern has more than one top-level branch, automatic anchoring occurs if
|
|
all branches are anchorable.
|
|
|
|
This optimization is disabled, however, if .* is in an atomic group or
|
|
if there is a backreference to the capture group in which it appears.
|
|
It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
|
|
ever, the presence of callouts does not affect it.
|
|
|
|
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
|
|
and applied to the string "aa", the pcre2test output is:
|
|
|
|
--->aa
|
|
+0 ^ .*
|
|
+2 ^ ^ \d
|
|
+2 ^^ \d
|
|
+2 ^ \d
|
|
No match
|
|
|
|
This shows that all match attempts start at the beginning of the sub-
|
|
ject. In other words, the pattern is anchored. You can disable this op-
|
|
timization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
|
|
starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
|
|
put changes to:
|
|
|
|
--->aa
|
|
+0 ^ .*
|
|
+2 ^ ^ \d
|
|
+2 ^^ \d
|
|
+2 ^ \d
|
|
+0 ^ .*
|
|
+2 ^^ \d
|
|
+2 ^ \d
|
|
No match
|
|
|
|
This shows more match attempts, starting at the second subject charac-
|
|
ter. Another optimization, described in the next section, means that
|
|
there is no subsequent attempt to match with an empty subject.
|
|
|
|
Other optimizations
|
|
|
|
Other optimizations that provide fast "no match" results also affect
|
|
callouts. For example, if the pattern is
|
|
|
|
ab(?C4)cd
|
|
|
|
PCRE2 knows that any matching string must contain the letter "d". If
|
|
the subject string is "abyz", the lack of "d" means that matching
|
|
doesn't ever start, and the callout is never reached. However, with
|
|
"abyd", though the result is still no match, the callout is obeyed.
|
|
|
|
For most patterns PCRE2 also knows the minimum length of a matching
|
|
string, and will immediately give a "no match" return without actually
|
|
running a match if the subject is not long enough, or, for unanchored
|
|
patterns, if it has been scanned far enough.
|
|
|
|
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
|
|
MIZE option to pcre2_compile(), or by starting the pattern with
|
|
(*NO_START_OPT). This slows down the matching process, but does ensure
|
|
that callouts such as the example above are obeyed.
|
|
|
|
|
|
THE CALLOUT INTERFACE
|
|
|
|
During matching, when PCRE2 reaches a callout point, if an external
|
|
function is provided in the match context, it is called. This applies
|
|
to both normal, DFA, and JIT matching. The first argument to the call-
|
|
out function is a pointer to a pcre2_callout block. The second argument
|
|
is the void * callout data that was supplied when the callout was set
|
|
up by calling pcre2_set_callout() (see the pcre2api documentation). The
|
|
callout block structure contains the following fields, not necessarily
|
|
in this order:
|
|
|
|
uint32_t version;
|
|
uint32_t callout_number;
|
|
uint32_t capture_top;
|
|
uint32_t capture_last;
|
|
uint32_t callout_flags;
|
|
PCRE2_SIZE *offset_vector;
|
|
PCRE2_SPTR mark;
|
|
PCRE2_SPTR subject;
|
|
PCRE2_SIZE subject_length;
|
|
PCRE2_SIZE start_match;
|
|
PCRE2_SIZE current_position;
|
|
PCRE2_SIZE pattern_position;
|
|
PCRE2_SIZE next_item_length;
|
|
PCRE2_SIZE callout_string_offset;
|
|
PCRE2_SIZE callout_string_length;
|
|
PCRE2_SPTR callout_string;
|
|
|
|
The version field contains the version number of the block format. The
|
|
current version is 2; the three callout string fields were added for
|
|
version 1, and the callout_flags field for version 2. If you are writ-
|
|
ing an application that might use an earlier release of PCRE2, you
|
|
should check the version number before accessing any of these fields.
|
|
The version number will increase in future if more fields are added,
|
|
but the intention is never to remove any of the existing fields.
|
|
|
|
Fields for numerical callouts
|
|
|
|
For a numerical callout, callout_string is NULL, and callout_number
|
|
contains the number of the callout, in the range 0-255. This is the
|
|
number that follows (?C for callouts that part of the pattern; it is
|
|
255 for automatically generated callouts.
|
|
|
|
Fields for string callouts
|
|
|
|
For callouts with string arguments, callout_number is always zero, and
|
|
callout_string points to the string that is contained within the com-
|
|
piled pattern. Its length is given by callout_string_length. Duplicated
|
|
ending delimiters that were present in the original pattern string have
|
|
been turned into single characters, but there is no other processing of
|
|
the callout string argument. An additional code unit containing binary
|
|
zero is present after the string, but is not included in the length.
|
|
The delimiter that was used to start the string is also stored within
|
|
the pattern, immediately before the string itself. You can access this
|
|
delimiter as callout_string[-1] if you need it.
|
|
|
|
The callout_string_offset field is the code unit offset to the start of
|
|
the callout argument string within the original pattern string. This is
|
|
provided for the benefit of applications such as script languages that
|
|
might need to report errors in the callout string within the pattern.
|
|
|
|
Fields for all callouts
|
|
|
|
The remaining fields in the callout block are the same for both kinds
|
|
of callout.
|
|
|
|
The offset_vector field is a pointer to a vector of capturing offsets
|
|
(the "ovector"). You may read the elements in this vector, but you must
|
|
not change any of them.
|
|
|
|
For calls to pcre2_match(), the offset_vector field is not (since re-
|
|
lease 10.30) a pointer to the actual ovector that was passed to the
|
|
matching function in the match data block. Instead it points to an in-
|
|
ternal ovector of a size large enough to hold all possible captured
|
|
substrings in the pattern. Note that whenever a recursion or subroutine
|
|
call within a pattern completes, the capturing state is reset to what
|
|
it was before.
|
|
|
|
The capture_last field contains the number of the most recently cap-
|
|
tured substring, and the capture_top field contains one more than the
|
|
number of the highest numbered captured substring so far. If no sub-
|
|
strings have yet been captured, the value of capture_last is 0 and the
|
|
value of capture_top is 1. The values of these fields do not always
|
|
differ by one; for example, when the callout in the pattern
|
|
((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
|
|
|
|
The contents of ovector[2] to ovector[<capture_top>*2-1] can be in-
|
|
spected in order to extract substrings that have been matched so far,
|
|
in the same way as extracting substrings after a match has completed.
|
|
The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
|
|
the match is by definition not complete. Substrings that have not been
|
|
captured but whose numbers are less than capture_top also have both of
|
|
their ovector slots set to PCRE2_UNSET.
|
|
|
|
For DFA matching, the offset_vector field points to the ovector that
|
|
was passed to the matching function in the match data block for call-
|
|
outs at the top level, but to an internal ovector during the processing
|
|
of pattern recursions, lookarounds, and atomic groups. However, these
|
|
ovectors hold no useful information because pcre2_dfa_match() does not
|
|
support substring capturing. The value of capture_top is always 1 and
|
|
the value of capture_last is always 0 for DFA matching.
|
|
|
|
The subject and subject_length fields contain copies of the values that
|
|
were passed to the matching function.
|
|
|
|
The start_match field normally contains the offset within the subject
|
|
at which the current match attempt started. However, if the escape se-
|
|
quence \K has been encountered, this value is changed to reflect the
|
|
modified starting point. If the pattern is not anchored, the callout
|
|
function may be called several times from the same point in the pattern
|
|
for different starting points in the subject.
|
|
|
|
The current_position field contains the offset within the subject of
|
|
the current match pointer.
|
|
|
|
The pattern_position field contains the offset in the pattern string to
|
|
the next item to be matched.
|
|
|
|
The next_item_length field contains the length of the next item to be
|
|
processed in the pattern string. When the callout is at the end of the
|
|
pattern, the length is zero. When the callout precedes an opening
|
|
parenthesis, the length includes meta characters that follow the paren-
|
|
thesis. For example, in a callout before an assertion such as (?=ab)
|
|
the length is 3. For an an alternation bar or a closing parenthesis,
|
|
the length is one, unless a closing parenthesis is followed by a quan-
|
|
tifier, in which case its length is included. (This changed in release
|
|
10.23. In earlier releases, before an opening parenthesis the length
|
|
was that of the entire group, and before an alternation bar or a clos-
|
|
ing parenthesis the length was zero.)
|
|
|
|
The pattern_position and next_item_length fields are intended to help
|
|
in distinguishing between different automatic callouts, which all have
|
|
the same callout number. However, they are set for all callouts, and
|
|
are used by pcre2test to show the next item to be matched when display-
|
|
ing callout information.
|
|
|
|
In callouts from pcre2_match() the mark field contains a pointer to the
|
|
zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
|
(*THEN) item in the match, or NULL if no such items have been passed.
|
|
Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
|
|
previous (*MARK). In callouts from the DFA matching function this field
|
|
always contains NULL.
|
|
|
|
The callout_flags field is always zero in callouts from
|
|
pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
|
|
JIT is used, the following bits may be set:
|
|
|
|
PCRE2_CALLOUT_STARTMATCH
|
|
|
|
This is set for the first callout after the start of matching for each
|
|
new starting position in the subject.
|
|
|
|
PCRE2_CALLOUT_BACKTRACK
|
|
|
|
This is set if there has been a matching backtrack since the previous
|
|
callout, or since the start of matching if this is the first callout
|
|
from a pcre2_match() run.
|
|
|
|
Both bits are set when a backtrack has caused a "bumpalong" to a new
|
|
starting position in the subject. Output from pcre2test does not indi-
|
|
cate the presence of these bits unless the callout_extra modifier is
|
|
set.
|
|
|
|
The information in the callout_flags field is provided so that applica-
|
|
tions can track and tell their users how matching with backtracking is
|
|
done. This can be useful when trying to optimize patterns, or just to
|
|
understand how PCRE2 works. There is no support in pcre2_dfa_match()
|
|
because there is no backtracking in DFA matching, and there is no sup-
|
|
port in JIT because JIT is all about maximimizing matching performance.
|
|
In both these cases the callout_flags field is always zero.
|
|
|
|
|
|
RETURN VALUES FROM CALLOUTS
|
|
|
|
The external callout function returns an integer to PCRE2. If the value
|
|
is zero, matching proceeds as normal. If the value is greater than
|
|
zero, matching fails at the current point, but the testing of other
|
|
matching possibilities goes ahead, just as if a lookahead assertion had
|
|
failed. If the value is less than zero, the match is abandoned, and the
|
|
matching function returns the negative value.
|
|
|
|
Negative values should normally be chosen from the set of PCRE2_ER-
|
|
ROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a standard
|
|
"no match" failure. The error number PCRE2_ERROR_CALLOUT is reserved
|
|
for use by callout functions; it will never be used by PCRE2 itself.
|
|
|
|
|
|
CALLOUT ENUMERATION
|
|
|
|
int pcre2_callout_enumerate(const pcre2_code *code,
|
|
int (*callback)(pcre2_callout_enumerate_block *, void *),
|
|
void *user_data);
|
|
|
|
A script language that supports the use of string arguments in callouts
|
|
might like to scan all the callouts in a pattern before running the
|
|
match. This can be done by calling pcre2_callout_enumerate(). The first
|
|
argument is a pointer to a compiled pattern, the second points to a
|
|
callback function, and the third is arbitrary user data. The callback
|
|
function is called for every callout in the pattern in the order in
|
|
which they appear. Its first argument is a pointer to a callout enumer-
|
|
ation block, and its second argument is the user_data value that was
|
|
passed to pcre2_callout_enumerate(). The data block contains the fol-
|
|
lowing fields:
|
|
|
|
version Block version number
|
|
pattern_position Offset to next item in pattern
|
|
next_item_length Length of next item in pattern
|
|
callout_number Number for numbered callouts
|
|
callout_string_offset Offset to string within pattern
|
|
callout_string_length Length of callout string
|
|
callout_string Points to callout string or is NULL
|
|
|
|
The version number is currently 0. It will increase if new fields are
|
|
ever added to the block. The remaining fields are the same as their
|
|
namesakes in the pcre2_callout block that is used for callouts during
|
|
matching, as described above.
|
|
|
|
Note that the value of pattern_position is unique for each callout.
|
|
However, if a callout occurs inside a group that is quantified with a
|
|
non-zero minimum or a fixed maximum, the group is replicated inside the
|
|
compiled pattern. For example, a pattern such as /(a){2}/ is compiled
|
|
as if it were /(a)(a)/. This means that the callout will be enumerated
|
|
more than once, but with the same value for pattern_position in each
|
|
case.
|
|
|
|
The callback function should normally return zero. If it returns a non-
|
|
zero value, scanning the pattern stops, and that value is returned from
|
|
pcre2_callout_enumerate().
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 03 February 2019
|
|
Copyright (c) 1997-2019 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
DIFFERENCES BETWEEN PCRE2 AND PERL
|
|
|
|
This document describes some of the differences in the ways that PCRE2
|
|
and Perl handle regular expressions. The differences described here are
|
|
with respect to Perl version 5.34.0, but as both Perl and PCRE2 are
|
|
continually changing, the information may at times be out of date.
|
|
|
|
1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set,
|
|
the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
|
|
matches the next character unless it is the start of a newline se-
|
|
quence. This means that, if the newline setting is CR, CRLF, or NUL,
|
|
'.' will match the code point LF (0x0A) in ASCII/Unicode environments,
|
|
and NL (either 0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
|
|
never to match LF, even when 0x0A is not a newline indicator.
|
|
|
|
2. PCRE2 has only a subset of Perl's Unicode support. Details of what
|
|
it does have are given in the pcre2unicode page.
|
|
|
|
3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
|
|
tions, but they do not mean what you might think. For example, (?!a){3}
|
|
does not assert that the next three characters are not "a". It just as-
|
|
serts that the next character is not "a" three times (in principle;
|
|
PCRE2 optimizes this to run the assertion just once). Perl allows some
|
|
repeat quantifiers on other assertions, for example, \b* , but these do
|
|
not seem to have any use. PCRE2 does not allow any kind of quantifier
|
|
on non-lookaround assertions.
|
|
|
|
4. Capture groups that occur inside negative lookaround assertions are
|
|
counted, but their entries in the offsets vector are set only when a
|
|
negative assertion is a condition that has a matching branch (that is,
|
|
the condition is false). Perl may set such capture groups in other
|
|
circumstances.
|
|
|
|
5. The following Perl escape sequences are not supported: \F, \l, \L,
|
|
\u, \U, and \N when followed by a character name. \N on its own, match-
|
|
ing a non-newline character, and \N{U+dd..}, matching a Unicode code
|
|
point, are supported. The escapes that modify the case of following
|
|
letters are implemented by Perl's general string-handling and are not
|
|
part of its pattern matching engine. If any of these are encountered by
|
|
PCRE2, an error is generated by default. However, if either of the
|
|
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are
|
|
interpreted as ECMAScript interprets them.
|
|
|
|
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
|
|
is built with Unicode support (the default). The properties that can be
|
|
tested with \p and \P are limited to the general category properties
|
|
such as Lu and Nd, script names such as Greek or Han, Bidi_Class,
|
|
Bidi_Control, and the derived properties Any and LC (synonym L&). Both
|
|
PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its
|
|
use is limited. See the pcre2pattern documentation for details. The
|
|
long synonyms for property names that Perl supports (such as \p{Let-
|
|
ter}) are not supported by PCRE2, nor is it permitted to prefix any of
|
|
these properties with "Is".
|
|
|
|
7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
|
|
in between are treated as literals. However, this is slightly different
|
|
from Perl in that $ and @ are also handled as literals inside the
|
|
quotes. In Perl, they cause variable interpolation (PCRE2 does not have
|
|
variables). Also, Perl does "double-quotish backslash interpolation" on
|
|
any backslashes between \Q and \E which, its documentation says, "may
|
|
lead to confusing results". PCRE2 treats a backslash between \Q and \E
|
|
just like any other character. Note the following examples:
|
|
|
|
Pattern PCRE2 matches Perl matches
|
|
|
|
\Qabc$xyz\E abc$xyz abc followed by the
|
|
contents of $xyz
|
|
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
|
\QA\B\E A\B A\B
|
|
\Q\\E \ \\E
|
|
|
|
The \Q...\E sequence is recognized both inside and outside character
|
|
classes by both PCRE2 and Perl.
|
|
|
|
8. Fairly obviously, PCRE2 does not support the (?{code}) and
|
|
(??{code}) constructions. However, PCRE2 does have a "callout" feature,
|
|
which allows an external function to be called during pattern matching.
|
|
See the pcre2callout documentation for details.
|
|
|
|
9. Subroutine calls (whether recursive or not) were treated as atomic
|
|
groups up to PCRE2 release 10.23, but from release 10.30 this changed,
|
|
and backtracking into subroutine calls is now supported, as in Perl.
|
|
|
|
10. In PCRE2, if any of the backtracking control verbs are used in a
|
|
group that is called as a subroutine (whether or not recursively),
|
|
their effect is confined to that group; it does not extend to the sur-
|
|
rounding pattern. This is not always the case in Perl. In particular,
|
|
if (*THEN) is present in a group that is called as a subroutine, its
|
|
action is limited to that group, even if the group does not contain any
|
|
| characters. Note that such groups are processed as anchored at the
|
|
point where they are tested.
|
|
|
|
11. If a pattern contains more than one backtracking control verb, the
|
|
first one that is backtracked onto acts. For example, in the pattern
|
|
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
|
|
in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
|
|
it is the same as PCRE2, but there are cases where it differs.
|
|
|
|
12. There are some differences that are concerned with the settings of
|
|
captured strings when part of a pattern is repeated. For example,
|
|
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
|
|
set, but in PCRE2 it is set to "b".
|
|
|
|
13. PCRE2's handling of duplicate capture group numbers and names is
|
|
not as general as Perl's. This is a consequence of the fact the PCRE2
|
|
works internally just with numbers, using an external table to trans-
|
|
late between numbers and names. In particular, a pattern such as
|
|
(?|(?<a>A)|(?<b>B)), where the two capture groups have the same number
|
|
but different names, is not supported, and causes an error at compile
|
|
time. If it were allowed, it would not be possible to distinguish which
|
|
group matched, because both names map to capture group number 1. To
|
|
avoid this confusing situation, an error is given at compile time.
|
|
|
|
14. Perl used to recognize comments in some places that PCRE2 does not,
|
|
for example, between the ( and ? at the start of a group. If the /x
|
|
modifier is set, Perl allowed white space between ( and ? though the
|
|
latest Perls give an error (for a while it was just deprecated). There
|
|
may still be some cases where Perl behaves differently.
|
|
|
|
15. Perl, when in warning mode, gives warnings for character classes
|
|
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
|
|
als. PCRE2 has no warning features, so it gives an error in these cases
|
|
because they are almost certainly user mistakes.
|
|
|
|
16. In PCRE2, the upper/lower case character properties Lu and Ll are
|
|
not affected when case-independent matching is specified. For example,
|
|
\p{Lu} always matches an upper case letter. I think Perl has changed in
|
|
this respect; in the release at the time of writing (5.34), \p{Lu} and
|
|
\p{Ll} match all letters, regardless of case, when case independence is
|
|
specified.
|
|
|
|
17. From release 5.32.0, Perl locks out the use of \K in lookaround as-
|
|
sertions. From release 10.38 PCRE2 does the same by default. However,
|
|
there is an option for re-enabling the previous behaviour. When this
|
|
option is set, \K is acted on when it occurs in positive assertions,
|
|
but is ignored in negative assertions.
|
|
|
|
18. PCRE2 provides some extensions to the Perl regular expression fa-
|
|
cilities. Perl 5.10 included new features that were not in earlier
|
|
versions of Perl, some of which (such as named parentheses) were in
|
|
PCRE2 for some time before. This list is with respect to Perl 5.34:
|
|
|
|
(a) Although lookbehind assertions in PCRE2 must match fixed length
|
|
strings, each alternative toplevel branch of a lookbehind assertion can
|
|
match a different length of string. Perl used to require them all to
|
|
have the same length, but the latest version has some variable length
|
|
support.
|
|
|
|
(b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
|
|
ported in lookbehinds, provided that there is no possibility of refer-
|
|
encing a non-unique number or name. Perl does not support backrefer-
|
|
ences in lookbehinds.
|
|
|
|
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
|
|
$ meta-character matches only at the very end of the string.
|
|
|
|
(d) A backslash followed by a letter with no special meaning is
|
|
faulted. (Perl can be made to issue a warning.)
|
|
|
|
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
|
|
fiers is inverted, that is, by default they are not greedy, but if fol-
|
|
lowed by a question mark they are.
|
|
|
|
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to
|
|
be tried only at the first matching position in the subject string.
|
|
|
|
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
|
|
PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
|
|
|
|
(h) The \R escape sequence can be restricted to match only CR, LF, or
|
|
CRLF by the PCRE2_BSR_ANYCRLF option.
|
|
|
|
(i) The callout facility is PCRE2-specific. Perl supports codeblocks
|
|
and variable interpolation, but not general hooks on every match.
|
|
|
|
(j) The partial matching facility is PCRE2-specific.
|
|
|
|
(k) The alternative matching function (pcre2_dfa_match() matches in a
|
|
different way and is not Perl-compatible.
|
|
|
|
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
|
|
at the start of a pattern. These set overall options that cannot be
|
|
changed within the pattern.
|
|
|
|
(m) PCRE2 supports non-atomic positive lookaround assertions. This is
|
|
an extension to the lookaround facilities. The default, Perl-compatible
|
|
lookarounds are atomic.
|
|
|
|
19. The Perl /a modifier restricts /d numbers to pure ascii, and the
|
|
/aa modifier restricts /i case-insensitive matching to pure ascii, ig-
|
|
noring Unicode rules. This separation cannot be represented with
|
|
PCRE2_UCP.
|
|
|
|
20. Perl has different limits than PCRE2. See the pcre2limit documenta-
|
|
tion for details. Perl went with 5.10 from recursion to iteration keep-
|
|
ing the intermediate matches on the heap, which is ~10% slower but does
|
|
not fall into any stack-overflow limit. PCRE2 made a similar change at
|
|
release 10.30, and also has many build-time and run-time customizable
|
|
limits.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 08 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 JUST-IN-TIME COMPILER SUPPORT
|
|
|
|
Just-in-time compiling is a heavyweight optimization that can greatly
|
|
speed up pattern matching. However, it comes at the cost of extra pro-
|
|
cessing before the match is performed, so it is of most benefit when
|
|
the same pattern is going to be matched many times. This does not nec-
|
|
essarily mean many calls of a matching function; if the pattern is not
|
|
anchored, matching attempts may take place many times at various posi-
|
|
tions in the subject, even for a single call. Therefore, if the subject
|
|
string is very long, it may still pay to use JIT even for one-off
|
|
matches. JIT support is available for all of the 8-bit, 16-bit and
|
|
32-bit PCRE2 libraries.
|
|
|
|
JIT support applies only to the traditional Perl-compatible matching
|
|
function. It does not apply when the DFA matching function is being
|
|
used. The code for this support was written by Zoltan Herczeg.
|
|
|
|
|
|
AVAILABILITY OF JIT SUPPORT
|
|
|
|
JIT support is an optional feature of PCRE2. The "configure" option
|
|
--enable-jit (or equivalent CMake option) must be set when PCRE2 is
|
|
built if you want to use JIT. The support is limited to the following
|
|
hardware platforms:
|
|
|
|
ARM 32-bit (v5, v7, and Thumb2)
|
|
ARM 64-bit
|
|
IBM s390x 64 bit
|
|
Intel x86 32-bit and 64-bit
|
|
MIPS 32-bit and 64-bit
|
|
Power PC 32-bit and 64-bit
|
|
SPARC 32-bit
|
|
|
|
If --enable-jit is set on an unsupported platform, compilation fails.
|
|
|
|
A program can tell if JIT support is available by calling pcre2_con-
|
|
fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is
|
|
available, and 0 otherwise. However, a simple program does not need to
|
|
check this in order to use JIT. The API is implemented in a way that
|
|
falls back to the interpretive code if JIT is not available. For pro-
|
|
grams that need the best possible performance, there is also a "fast
|
|
path" API that is JIT-specific.
|
|
|
|
|
|
SIMPLE USE OF JIT
|
|
|
|
To make use of the JIT support in the simplest way, all you have to do
|
|
is to call pcre2_jit_compile() after successfully compiling a pattern
|
|
with pcre2_compile(). This function has two arguments: the first is the
|
|
compiled pattern pointer that was returned by pcre2_compile(), and the
|
|
second is zero or more of the following option bits: PCRE2_JIT_COM-
|
|
PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
|
|
|
If JIT support is not available, a call to pcre2_jit_compile() does
|
|
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
|
|
pattern is passed to the JIT compiler, which turns it into machine code
|
|
that executes much faster than the normal interpretive code, but yields
|
|
exactly the same results. The returned value from pcre2_jit_compile()
|
|
is zero on success, or a negative error code.
|
|
|
|
There is a limit to the size of pattern that JIT supports, imposed by
|
|
the size of machine stack that it uses. The exact rules are not docu-
|
|
mented because they may change at any time, in particular, when new op-
|
|
timizations are introduced. If a pattern is too big, a call to
|
|
pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
|
|
|
|
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
|
|
plete matches. If you want to run partial matches using the PCRE2_PAR-
|
|
TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
|
|
set one or both of the other options as well as, or instead of
|
|
PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
|
|
for each of the three modes (normal, soft partial, hard partial). When
|
|
pcre2_match() is called, the appropriate code is run if it is avail-
|
|
able. Otherwise, the pattern is matched using interpretive code.
|
|
|
|
You can call pcre2_jit_compile() multiple times for the same compiled
|
|
pattern. It does nothing if it has previously compiled code for any of
|
|
the option bits. For example, you can call it once with PCRE2_JIT_COM-
|
|
PLETE and (perhaps later, when you find you need partial matching)
|
|
again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
|
|
will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
|
|
ing. If pcre2_jit_compile() is called with no option bits set, it imme-
|
|
diately returns zero. This is an alternative way of testing whether JIT
|
|
is available.
|
|
|
|
At present, it is not possible to free JIT compiled code except when
|
|
the entire compiled pattern is freed by calling pcre2_code_free().
|
|
|
|
In some circumstances you may need to call additional functions. These
|
|
are described in the section entitled "Controlling the JIT stack" be-
|
|
low.
|
|
|
|
There are some pcre2_match() options that are not supported by JIT, and
|
|
there are also some pattern items that JIT cannot handle. Details are
|
|
given below. In both cases, matching automatically falls back to the
|
|
interpretive code. If you want to know whether JIT was actually used
|
|
for a particular match, you should arrange for a JIT callback function
|
|
to be set up as described in the section entitled "Controlling the JIT
|
|
stack" below, even if you do not need to supply a non-default JIT
|
|
stack. Such a callback function is called whenever JIT code is about to
|
|
be obeyed. If the match-time options are not right for JIT execution,
|
|
the callback function is not obeyed.
|
|
|
|
If the JIT compiler finds an unsupported item, no JIT data is gener-
|
|
ated. You can find out if JIT matching is available after compiling a
|
|
pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
|
|
tion. A non-zero result means that JIT compilation was successful. A
|
|
result of 0 means that JIT support is not available, or the pattern was
|
|
not processed by pcre2_jit_compile(), or the JIT compiler was not able
|
|
to handle the pattern.
|
|
|
|
|
|
MATCHING SUBJECTS CONTAINING INVALID UTF
|
|
|
|
When a pattern is compiled with the PCRE2_UTF option, subject strings
|
|
are normally expected to be a valid sequence of UTF code units. By de-
|
|
fault, this is checked at the start of matching and an error is gener-
|
|
ated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
|
|
passed to pcre2_match() to skip the check (for improved performance) if
|
|
you are sure that a subject string is valid. If this option is used
|
|
with an invalid string, the result is undefined.
|
|
|
|
However, a way of running matches on strings that may contain invalid
|
|
UTF sequences is available. Calling pcre2_compile() with the
|
|
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the inter-
|
|
preter in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
|
|
pile() is called, the compiled JIT code also supports invalid UTF. De-
|
|
tails of how this support works, in both the JIT and the interpretive
|
|
cases, is given in the pcre2unicode documentation.
|
|
|
|
There is also an obsolete option for pcre2_jit_compile() called
|
|
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
|
|
ibility. It is superseded by the pcre2_compile() option
|
|
PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
|
|
in future.
|
|
|
|
|
|
UNSUPPORTED OPTIONS AND PATTERN ITEMS
|
|
|
|
The pcre2_match() options that are supported for JIT matching are
|
|
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
|
PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
|
|
PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options
|
|
are not supported at match time.
|
|
|
|
If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
|
|
use of JIT, forcing matching by the interpreter code.
|
|
|
|
The only unsupported pattern items are \C (match a single data unit)
|
|
when running in a UTF mode, and a callout immediately before an asser-
|
|
tion condition in a conditional group.
|
|
|
|
|
|
RETURN VALUES FROM JIT MATCHING
|
|
|
|
When a pattern is matched using JIT matching, the return values are the
|
|
same as those given by the interpretive pcre2_match() code, with the
|
|
addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
|
|
that the memory used for the JIT stack was insufficient. See "Control-
|
|
ling the JIT stack" below for a discussion of JIT stack usage.
|
|
|
|
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
|
|
searching a very large pattern tree goes on for too long, as it is in
|
|
the same circumstance when JIT is not used, but the details of exactly
|
|
what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
|
|
is never returned when JIT matching is used.
|
|
|
|
|
|
CONTROLLING THE JIT STACK
|
|
|
|
When the compiled JIT code runs, it needs a block of memory to use as a
|
|
stack. By default, it uses 32KiB on the machine stack. However, some
|
|
large or complicated patterns need more than this. The error PCRE2_ER-
|
|
ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
|
|
tions are provided for managing blocks of memory for use as JIT stacks.
|
|
There is further discussion about the use of JIT stacks in the section
|
|
entitled "JIT stack FAQ" below.
|
|
|
|
The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
|
|
ments are a starting size, a maximum size, and a general context (for
|
|
memory allocation functions, or NULL for standard memory allocation).
|
|
It returns a pointer to an opaque structure of type pcre2_jit_stack, or
|
|
NULL if there is an error. The pcre2_jit_stack_free() function is used
|
|
to free a stack that is no longer needed. If its argument is NULL, this
|
|
function returns immediately, without doing anything. (For the techni-
|
|
cally minded: the address space is allocated by mmap or VirtualAlloc.)
|
|
A maximum stack size of 512KiB to 1MiB should be more than enough for
|
|
any pattern.
|
|
|
|
The pcre2_jit_stack_assign() function specifies which stack JIT code
|
|
should use. Its arguments are as follows:
|
|
|
|
pcre2_match_context *mcontext
|
|
pcre2_jit_callback callback
|
|
void *data
|
|
|
|
The first argument is a pointer to a match context. When this is subse-
|
|
quently passed to a matching function, its information determines which
|
|
JIT stack is used. If this argument is NULL, the function returns imme-
|
|
diately, without doing anything. There are three cases for the values
|
|
of the other two options:
|
|
|
|
(1) If callback is NULL and data is NULL, an internal 32KiB block
|
|
on the machine stack is used. This is the default when a match
|
|
context is created.
|
|
|
|
(2) If callback is NULL and data is not NULL, data must be
|
|
a pointer to a valid JIT stack, the result of calling
|
|
pcre2_jit_stack_create().
|
|
|
|
(3) If callback is not NULL, it must point to a function that is
|
|
called with data as an argument at the start of matching, in
|
|
order to set up a JIT stack. If the return from the callback
|
|
function is NULL, the internal 32KiB stack is used; otherwise the
|
|
return value must be a valid JIT stack, the result of calling
|
|
pcre2_jit_stack_create().
|
|
|
|
A callback function is obeyed whenever JIT code is about to be run; it
|
|
is not obeyed when pcre2_match() is called with options that are incom-
|
|
patible for JIT matching. A callback function can therefore be used to
|
|
determine whether a match operation was executed by JIT or by the in-
|
|
terpreter.
|
|
|
|
You may safely use the same JIT stack for more than one pattern (either
|
|
by assigning directly or by callback), as long as the patterns are
|
|
matched sequentially in the same thread. Currently, the only way to set
|
|
up non-sequential matches in one thread is to use callouts: if a call-
|
|
out function starts another match, that match must use a different JIT
|
|
stack to the one used for currently suspended match(es).
|
|
|
|
In a multithread application, if you do not specify a JIT stack, or if
|
|
you assign or pass back NULL from a callback, that is thread-safe, be-
|
|
cause each thread has its own machine stack. However, if you assign or
|
|
pass back a non-NULL JIT stack, this must be a different stack for each
|
|
thread so that the application is thread-safe.
|
|
|
|
Strictly speaking, even more is allowed. You can assign the same non-
|
|
NULL stack to a match context that is used by any number of patterns,
|
|
as long as they are not used for matching by multiple threads at the
|
|
same time. For example, you could use the same stack in all compiled
|
|
patterns, with a global mutex in the callback to wait until the stack
|
|
is available for use. However, this is an inefficient solution, and not
|
|
recommended.
|
|
|
|
This is a suggestion for how a multithreaded program that needs to set
|
|
up non-default JIT stacks might operate:
|
|
|
|
During thread initialization
|
|
thread_local_var = pcre2_jit_stack_create(...)
|
|
|
|
During thread exit
|
|
pcre2_jit_stack_free(thread_local_var)
|
|
|
|
Use a one-line callback function
|
|
return thread_local_var
|
|
|
|
All the functions described in this section do nothing if JIT is not
|
|
available.
|
|
|
|
|
|
JIT STACK FAQ
|
|
|
|
(1) Why do we need JIT stacks?
|
|
|
|
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
|
|
where the local data of the current node is pushed before checking its
|
|
child nodes. Allocating real machine stack on some platforms is diffi-
|
|
cult. For example, the stack chain needs to be updated every time if we
|
|
extend the stack on PowerPC. Although it is possible, its updating
|
|
time overhead decreases performance. So we do the recursion in memory.
|
|
|
|
(2) Why don't we simply allocate blocks of memory with malloc()?
|
|
|
|
Modern operating systems have a nice feature: they can reserve an ad-
|
|
dress space instead of allocating memory. We can safely allocate memory
|
|
pages inside this address space, so the stack could grow without moving
|
|
memory data (this is important because of pointers). Thus we can allo-
|
|
cate 1MiB address space, and use only a single memory page (usually
|
|
4KiB) if that is enough. However, we can still grow up to 1MiB anytime
|
|
if needed.
|
|
|
|
(3) Who "owns" a JIT stack?
|
|
|
|
The owner of the stack is the user program, not the JIT studied pattern
|
|
or anything else. The user program must ensure that if a stack is being
|
|
used by pcre2_match(), (that is, it is assigned to a match context that
|
|
is passed to the pattern currently running), that stack must not be
|
|
used by any other threads (to avoid overwriting the same memory area).
|
|
The best practice for multithreaded programs is to allocate a stack for
|
|
each thread, and return this stack through the JIT callback function.
|
|
|
|
(4) When should a JIT stack be freed?
|
|
|
|
You can free a JIT stack at any time, as long as it will not be used by
|
|
pcre2_match() again. When you assign the stack to a match context, only
|
|
a pointer is set. There is no reference counting or any other magic.
|
|
You can free compiled patterns, contexts, and stacks in any order, any-
|
|
time. Just do not call pcre2_match() with a match context pointing to
|
|
an already freed stack, as that will cause SEGFAULT. (Also, do not free
|
|
a stack currently used by pcre2_match() in another thread). You can
|
|
also replace the stack in a context at any time when it is not in use.
|
|
You should free the previous stack before assigning a replacement.
|
|
|
|
(5) Should I allocate/free a stack every time before/after calling
|
|
pcre2_match()?
|
|
|
|
No, because this is too costly in terms of resources. However, you
|
|
could implement some clever idea which release the stack if it is not
|
|
used in let's say two minutes. The JIT callback can help to achieve
|
|
this without keeping a list of patterns.
|
|
|
|
(6) OK, the stack is for long term memory allocation. But what happens
|
|
if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
|
|
kept until the stack is freed?
|
|
|
|
Especially on embedded sytems, it might be a good idea to release mem-
|
|
ory sometimes without freeing the stack. There is no API for this at
|
|
the moment. Probably a function call which returns with the currently
|
|
allocated memory for any stack and another which allows releasing mem-
|
|
ory (shrinking the stack) would be a good idea if someone needs this.
|
|
|
|
(7) This is too much of a headache. Isn't there any better solution for
|
|
JIT stack handling?
|
|
|
|
No, thanks to Windows. If POSIX threads were used everywhere, we could
|
|
throw out this complicated API.
|
|
|
|
|
|
FREEING JIT SPECULATIVE MEMORY
|
|
|
|
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
|
|
|
|
The JIT executable allocator does not free all memory when it is possi-
|
|
ble. It expects new allocations, and keeps some free memory around to
|
|
improve allocation speed. However, in low memory conditions, it might
|
|
be better to free all possible memory. You can cause this to happen by
|
|
calling pcre2_jit_free_unused_memory(). Its argument is a general con-
|
|
text, for custom memory management, or NULL for standard memory manage-
|
|
ment.
|
|
|
|
|
|
EXAMPLE CODE
|
|
|
|
This is a single-threaded example that specifies a JIT stack without
|
|
using a callback. A real program should include error checking after
|
|
all the function calls.
|
|
|
|
int rc;
|
|
pcre2_code *re;
|
|
pcre2_match_data *match_data;
|
|
pcre2_match_context *mcontext;
|
|
pcre2_jit_stack *jit_stack;
|
|
|
|
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
|
|
&errornumber, &erroffset, NULL);
|
|
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
|
|
mcontext = pcre2_match_context_create(NULL);
|
|
jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
|
|
pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
|
|
match_data = pcre2_match_data_create(re, 10);
|
|
rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
|
|
/* Process result */
|
|
|
|
pcre2_code_free(re);
|
|
pcre2_match_data_free(match_data);
|
|
pcre2_match_context_free(mcontext);
|
|
pcre2_jit_stack_free(jit_stack);
|
|
|
|
|
|
JIT FAST PATH API
|
|
|
|
Because the API described above falls back to interpreted matching when
|
|
JIT is not available, it is convenient for programs that are written
|
|
for general use in many environments. However, calling JIT via
|
|
pcre2_match() does have a performance impact. Programs that are written
|
|
for use where JIT is known to be available, and which need the best
|
|
possible performance, can instead use a "fast path" API to call JIT
|
|
matching directly instead of calling pcre2_match() (obviously only for
|
|
patterns that have been successfully processed by pcre2_jit_compile()).
|
|
|
|
The fast path function is called pcre2_jit_match(), and it takes ex-
|
|
actly the same arguments as pcre2_match(). However, the subject string
|
|
must be specified with a length; PCRE2_ZERO_TERMINATED is not sup-
|
|
ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
|
|
DANCHORED and PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the
|
|
PCRE2_NO_JIT option. The return values are also the same as for
|
|
pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (par-
|
|
tial or complete) is requested that was not compiled.
|
|
|
|
When you call pcre2_match(), as well as testing for invalid options, a
|
|
number of other sanity checks are performed on the arguments. For exam-
|
|
ple, if the subject pointer is NULL but the length is non-zero, an im-
|
|
mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
|
|
subject string is tested for validity. In the interests of speed, these
|
|
checks do not happen on the JIT fast path, and if invalid data is
|
|
passed, the result is undefined.
|
|
|
|
Bypassing the sanity checks and the pcre2_match() wrapping can give
|
|
speedups of more than 10%.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2api(3)
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel (FAQ by Zoltan Herczeg)
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 30 November 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SIZE AND OTHER LIMITATIONS
|
|
|
|
There are some size limitations in PCRE2 but it is hoped that they will
|
|
never in practice be relevant.
|
|
|
|
The maximum size of a compiled pattern is approximately 64 thousand
|
|
code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
|
|
the default internal linkage size, which is 2 bytes for these li-
|
|
braries. If you want to process regular expressions that are truly
|
|
enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
|
|
(when building the 16-bit library, 3 is rounded up to 4). See the
|
|
README file in the source distribution and the pcre2build documentation
|
|
for details. In these cases the limit is substantially larger. How-
|
|
ever, the speed of execution is slower. In the 32-bit library, the in-
|
|
ternal linkage size is always 4.
|
|
|
|
The maximum length of a source pattern string is essentially unlimited;
|
|
it is the largest number a PCRE2_SIZE variable can hold. However, the
|
|
program that calls pcre2_compile() can specify a smaller limit.
|
|
|
|
The maximum length (in code units) of a subject string is one less than
|
|
the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
|
|
signed integer type, usually defined as size_t. Its maximum value (that
|
|
is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi-
|
|
nated strings and unset offsets.
|
|
|
|
All values in repeating quantifiers must be less than 65536.
|
|
|
|
The maximum length of a lookbehind assertion is 65535 characters.
|
|
|
|
There is no limit to the number of parenthesized groups, but there can
|
|
be no more than 65535 capture groups, and there is a limit to the depth
|
|
of nesting of parenthesized subpatterns of all kinds. This is imposed
|
|
in order to limit the amount of system stack used at compile time. The
|
|
default limit can be specified when PCRE2 is built; if not, the default
|
|
is set to 250. An application can change this limit by calling
|
|
pcre2_set_parens_nest_limit() to set the limit in a compile context.
|
|
|
|
The maximum length of name for a named capture group is 32 code units,
|
|
and the maximum number of such groups is 10000.
|
|
|
|
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
|
|
(*THEN) verb is 255 code units for the 8-bit library and 65535 code
|
|
units for the 16-bit and 32-bit libraries.
|
|
|
|
The maximum length of a string argument to a callout is the largest
|
|
number a 32-bit unsigned integer can hold.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 02 February 2019
|
|
Copyright (c) 1997-2019 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 MATCHING ALGORITHMS
|
|
|
|
This document describes the two different algorithms that are available
|
|
in PCRE2 for matching a compiled regular expression against a given
|
|
subject string. The "standard" algorithm is the one provided by the
|
|
pcre2_match() function. This works in the same as as Perl's matching
|
|
function, and provide a Perl-compatible matching operation. The just-
|
|
in-time (JIT) optimization that is described in the pcre2jit documenta-
|
|
tion is compatible with this function.
|
|
|
|
An alternative algorithm is provided by the pcre2_dfa_match() function;
|
|
it operates in a different way, and is not Perl-compatible. This alter-
|
|
native has advantages and disadvantages compared with the standard al-
|
|
gorithm, and these are described below.
|
|
|
|
When there is only one possible way in which a given subject string can
|
|
match a pattern, the two algorithms give the same answer. A difference
|
|
arises, however, when there are multiple possibilities. For example, if
|
|
the pattern
|
|
|
|
^<.*>
|
|
|
|
is matched against the string
|
|
|
|
<something> <something else> <something further>
|
|
|
|
there are three possible answers. The standard algorithm finds only one
|
|
of them, whereas the alternative algorithm finds all three.
|
|
|
|
|
|
REGULAR EXPRESSIONS AS TREES
|
|
|
|
The set of strings that are matched by a regular expression can be rep-
|
|
resented as a tree structure. An unlimited repetition in the pattern
|
|
makes the tree of infinite size, but it is still a tree. Matching the
|
|
pattern to a given subject string (from a given starting point) can be
|
|
thought of as a search of the tree. There are two ways to search a
|
|
tree: depth-first and breadth-first, and these correspond to the two
|
|
matching algorithms provided by PCRE2.
|
|
|
|
|
|
THE STANDARD MATCHING ALGORITHM
|
|
|
|
In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
|
|
sions", the standard algorithm is an "NFA algorithm". It conducts a
|
|
depth-first search of the pattern tree. That is, it proceeds along a
|
|
single path through the tree, checking that the subject matches what is
|
|
required. When there is a mismatch, the algorithm tries any alterna-
|
|
tives at the current point, and if they all fail, it backs up to the
|
|
previous branch point in the tree, and tries the next alternative
|
|
branch at that level. This often involves backing up (moving to the
|
|
left) in the subject string as well. The order in which repetition
|
|
branches are tried is controlled by the greedy or ungreedy nature of
|
|
the quantifier.
|
|
|
|
If a leaf node is reached, a matching string has been found, and at
|
|
that point the algorithm stops. Thus, if there is more than one possi-
|
|
ble match, this algorithm returns the first one that it finds. Whether
|
|
this is the shortest, the longest, or some intermediate length depends
|
|
on the way the alternations and the greedy or ungreedy repetition quan-
|
|
tifiers are specified in the pattern.
|
|
|
|
Because it ends up with a single path through the tree, it is rela-
|
|
tively straightforward for this algorithm to keep track of the sub-
|
|
strings that are matched by portions of the pattern in parentheses.
|
|
This provides support for capturing parentheses and backreferences.
|
|
|
|
|
|
THE ALTERNATIVE MATCHING ALGORITHM
|
|
|
|
This algorithm conducts a breadth-first search of the tree. Starting
|
|
from the first matching point in the subject, it scans the subject
|
|
string from left to right, once, character by character, and as it does
|
|
this, it remembers all the paths through the tree that represent valid
|
|
matches. In Friedl's terminology, this is a kind of "DFA algorithm",
|
|
though it is not implemented as a traditional finite state machine (it
|
|
keeps multiple states active simultaneously).
|
|
|
|
Although the general principle of this matching algorithm is that it
|
|
scans the subject string only once, without backtracking, there is one
|
|
exception: when a lookaround assertion is encountered, the characters
|
|
following or preceding the current point have to be independently in-
|
|
spected.
|
|
|
|
The scan continues until either the end of the subject is reached, or
|
|
there are no more unterminated paths. At this point, terminated paths
|
|
represent the different matching possibilities (if there are none, the
|
|
match has failed). Thus, if there is more than one possible match,
|
|
this algorithm finds all of them, and in particular, it finds the long-
|
|
est. The matches are returned in the output vector in decreasing order
|
|
of length. There is an option to stop the algorithm after the first
|
|
match (which is necessarily the shortest) is found.
|
|
|
|
Note that the size of vector needed to contain all the results depends
|
|
on the number of simultaneous matches, not on the number of parentheses
|
|
in the pattern. Using pcre2_match_data_create_from_pattern() to create
|
|
the match data block is therefore not advisable when doing DFA match-
|
|
ing.
|
|
|
|
Note also that all the matches that are found start at the same point
|
|
in the subject. If the pattern
|
|
|
|
cat(er(pillar)?)?
|
|
|
|
is matched against the string "the caterpillar catchment", the result
|
|
is the three strings "caterpillar", "cater", and "cat" that start at
|
|
the fifth character of the subject. The algorithm does not automati-
|
|
cally move on to find matches that start at later positions.
|
|
|
|
PCRE2's "auto-possessification" optimization usually applies to charac-
|
|
ter repeats at the end of a pattern (as well as internally). For exam-
|
|
ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
|
|
is no point even considering the possibility of backtracking into the
|
|
repeated digits. For DFA matching, this means that only one possible
|
|
match is found. If you really do want multiple matches in such cases,
|
|
either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
|
|
SESS option when compiling.
|
|
|
|
There are a number of features of PCRE2 regular expressions that are
|
|
not supported or behave differently in the alternative matching func-
|
|
tion. Those that are not supported cause an error if encountered.
|
|
|
|
1. Because the algorithm finds all possible matches, the greedy or un-
|
|
greedy nature of repetition quantifiers is not relevant (though it may
|
|
affect auto-possessification, as just described). During matching,
|
|
greedy and ungreedy quantifiers are treated in exactly the same way.
|
|
However, possessive quantifiers can make a difference when what follows
|
|
could also match what is quantified, for example in a pattern like
|
|
this:
|
|
|
|
^a++\w!
|
|
|
|
This pattern matches "aaab!" but not "aaa!", which would be matched by
|
|
a non-possessive quantifier. Similarly, if an atomic group is present,
|
|
it is matched as if it were a standalone pattern at the current point,
|
|
and the longest match is then "locked in" for the rest of the overall
|
|
pattern.
|
|
|
|
2. When dealing with multiple paths through the tree simultaneously, it
|
|
is not straightforward to keep track of captured substrings for the
|
|
different matching possibilities, and PCRE2's implementation of this
|
|
algorithm does not attempt to do this. This means that no captured sub-
|
|
strings are available.
|
|
|
|
3. Because no substrings are captured, backreferences within the pat-
|
|
tern are not supported.
|
|
|
|
4. For the same reason, conditional expressions that use a backrefer-
|
|
ence as the condition or test for a specific group recursion are not
|
|
supported.
|
|
|
|
5. Again for the same reason, script runs are not supported.
|
|
|
|
6. Because many paths through the tree may be active, the \K escape se-
|
|
quence, which resets the start of the match when encountered (but may
|
|
be on some paths and not on others), is not supported.
|
|
|
|
7. Callouts are supported, but the value of the capture_top field is
|
|
always 1, and the value of the capture_last field is always 0.
|
|
|
|
8. The \C escape sequence, which (in the standard algorithm) always
|
|
matches a single code unit, even in a UTF mode, is not supported in
|
|
these modes, because the alternative algorithm moves through the sub-
|
|
ject string one character (not code unit) at a time, for all active
|
|
paths through the tree.
|
|
|
|
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
|
|
are not supported. (*FAIL) is supported, and behaves like a failing
|
|
negative assertion.
|
|
|
|
10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
|
|
ported by pcre2_dfa_match().
|
|
|
|
|
|
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|
|
|
The main advantage of the alternative algorithm is that all possible
|
|
matches (at a single point in the subject) are automatically found, and
|
|
in particular, the longest match is found. To find more than one match
|
|
at the same point using the standard algorithm, you have to do kludgy
|
|
things with callouts.
|
|
|
|
Partial matching is possible with this algorithm, though it has some
|
|
limitations. The pcre2partial documentation gives details of partial
|
|
matching and discusses multi-segment matching.
|
|
|
|
|
|
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|
|
|
The alternative algorithm suffers from a number of disadvantages:
|
|
|
|
1. It is substantially slower than the standard algorithm. This is
|
|
partly because it has to search for all possible matches, but is also
|
|
because it is less susceptible to optimization.
|
|
|
|
2. Capturing parentheses, backreferences, script runs, and matching
|
|
within invalid UTF string are not supported.
|
|
|
|
3. Although atomic groups are supported, their use does not provide the
|
|
performance advantage that it does for the standard algorithm.
|
|
|
|
4. JIT optimization is not supported.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 28 August 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions
|
|
|
|
PARTIAL MATCHING IN PCRE2
|
|
|
|
In normal use of PCRE2, if there is a match up to the end of a subject
|
|
string, but more characters are needed to match the entire pattern,
|
|
PCRE2_ERROR_NOMATCH is returned, just like any other failing match.
|
|
There are circumstances where it might be helpful to distinguish this
|
|
"partial match" case.
|
|
|
|
One example is an application where the subject string is very long,
|
|
and not all available at once. The requirement here is to be able to do
|
|
the matching segment by segment, but special action is needed when a
|
|
matched substring spans the boundary between two segments.
|
|
|
|
Another example is checking a user input string as it is typed, to en-
|
|
sure that it conforms to a required format. Invalid characters can be
|
|
immediately diagnosed and rejected, giving instant feedback.
|
|
|
|
Partial matching is a PCRE2-specific feature; it is not Perl-compati-
|
|
ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or
|
|
PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
|
|
ference between the two options is whether or not a partial match is
|
|
preferred to an alternative complete match, though the details differ
|
|
between the two types of matching function. If both options are set,
|
|
PCRE2_PARTIAL_HARD takes precedence.
|
|
|
|
If you want to use partial matching with just-in-time optimized code,
|
|
as well as setting a partial match option for the matching function,
|
|
you must also call pcre2_jit_compile() with one or both of these op-
|
|
tions:
|
|
|
|
PCRE2_JIT_PARTIAL_HARD
|
|
PCRE2_JIT_PARTIAL_SOFT
|
|
|
|
PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
|
|
tial matches on the same pattern. Separate code is compiled for each
|
|
mode. If the appropriate JIT mode has not been compiled, interpretive
|
|
matching code is used.
|
|
|
|
Setting a partial matching option disables two of PCRE2's standard op-
|
|
timization hints. PCRE2 remembers the last literal code unit in a pat-
|
|
tern, and abandons matching immediately if it is not present in the
|
|
subject string. This optimization cannot be used for a subject string
|
|
that might match only partially. PCRE2 also remembers a minimum length
|
|
of a matching string, and does not bother to run the matching function
|
|
on shorter strings. This optimization is also disabled for partial
|
|
matching.
|
|
|
|
|
|
REQUIREMENTS FOR A PARTIAL MATCH
|
|
|
|
A possible partial match occurs during matching when the end of the
|
|
subject string is reached successfully, but either more characters are
|
|
needed to complete the match, or the addition of more characters might
|
|
change what is matched.
|
|
|
|
Example 1: if the pattern is /abc/ and the subject is "ab", more char-
|
|
acters are definitely needed to complete a match. In this case both
|
|
hard and soft matching options yield a partial match.
|
|
|
|
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
|
|
match can be found, but the addition of more characters might change
|
|
what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
|
|
tial match; PCRE2_PARTIAL_SOFT returns the complete match.
|
|
|
|
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
|
|
the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
|
|
match. Otherwise, for both options, the next pattern item must be one
|
|
that inspects a character, and at least one of the following must be
|
|
true:
|
|
|
|
(1) At least one character has already been inspected. An inspected
|
|
character need not form part of the final matched string; lookbehind
|
|
assertions and the \K escape sequence provide ways of inspecting char-
|
|
acters before the start of a matched string.
|
|
|
|
(2) The pattern contains one or more lookbehind assertions. This condi-
|
|
tion exists in case there is a lookbehind that inspects characters be-
|
|
fore the start of the match.
|
|
|
|
(3) There is a special case when the whole pattern can match an empty
|
|
string. When the starting point is at the end of the subject, the
|
|
empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set
|
|
and neither of the above conditions is true, it is returned. However,
|
|
because adding more characters might result in a non-empty match,
|
|
PCRE2_PARTIAL_HARD returns a partial match, which in this case means
|
|
"there is going to be a match at this point, but until some more char-
|
|
acters are added, we do not know if it will be an empty string or some-
|
|
thing longer".
|
|
|
|
|
|
PARTIAL MATCHING USING pcre2_match()
|
|
|
|
When a partial matching option is set, the result of calling
|
|
pcre2_match() can be one of the following:
|
|
|
|
A successful match
|
|
A complete match has been found, starting and ending within this sub-
|
|
ject.
|
|
|
|
PCRE2_ERROR_NOMATCH
|
|
No match can start anywhere in this subject.
|
|
|
|
PCRE2_ERROR_PARTIAL
|
|
Adding more characters may result in a complete match that uses one
|
|
or more characters from the end of this subject.
|
|
|
|
When a partial match is returned, the first two elements in the ovector
|
|
point to the portion of the subject that was matched, but the values in
|
|
the rest of the ovector are undefined. The appearance of \K in the pat-
|
|
tern has no effect for a partial match. Consider this pattern:
|
|
|
|
/abc\K123/
|
|
|
|
If it is matched against "456abc123xyz" the result is a complete match,
|
|
and the ovector defines the matched string as "123", because \K resets
|
|
the "start of match" point. However, if a partial match is requested
|
|
and the subject string is "456abc12", a partial match is found for the
|
|
string "abc12", because all these characters are needed for a subse-
|
|
quent re-match with additional characters.
|
|
|
|
If there is more than one partial match, the first one that was found
|
|
provides the data that is returned. Consider this pattern:
|
|
|
|
/123\w+X|dogY/
|
|
|
|
If this is matched against the subject string "abc123dog", both alter-
|
|
natives fail to match, but the end of the subject is reached during
|
|
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
|
|
and 9, identifying "123dog" as the first partial match. (In this exam-
|
|
ple, there are two partial matches, because "dog" on its own partially
|
|
matches the second alternative.)
|
|
|
|
How a partial match is processed by pcre2_match()
|
|
|
|
What happens when a partial match is identified depends on which of the
|
|
two partial matching options is set.
|
|
|
|
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon
|
|
as a partial match is found, without continuing to search for possible
|
|
complete matches. This option is "hard" because it prefers an earlier
|
|
partial match over a later complete match. For this reason, the assump-
|
|
tion is made that the end of the supplied subject string is not the
|
|
true end of the available data, which is why \z, \Z, \b, \B, and $ al-
|
|
ways give a partial match.
|
|
|
|
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
|
|
matching continues as normal, and other alternatives in the pattern are
|
|
tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
|
|
turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
|
|
prefers a complete match over a partial match. All the various matching
|
|
items in a pattern behave as if the subject string is potentially com-
|
|
plete; \z, \Z, and $ match at the end of the subject, as normal, and
|
|
for \b and \B the end of the subject is treated as a non-alphanumeric.
|
|
|
|
The difference between the two partial matching options can be illus-
|
|
trated by a pattern such as:
|
|
|
|
/dog(sbody)?/
|
|
|
|
This matches either "dog" or "dogsbody", greedily (that is, it prefers
|
|
the longer string if possible). If it is matched against the string
|
|
"dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
|
|
However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
|
|
TIAL. On the other hand, if the pattern is made ungreedy the result is
|
|
different:
|
|
|
|
/dog(sbody)??/
|
|
|
|
In this case the result is always a complete match because that is
|
|
found first, and matching never continues after finding a complete
|
|
match. It might be easier to follow this explanation by thinking of the
|
|
two patterns like this:
|
|
|
|
/dog(sbody)?/ is the same as /dogsbody|dog/
|
|
/dog(sbody)??/ is the same as /dog|dogsbody/
|
|
|
|
The second pattern will never match "dogsbody", because it will always
|
|
find the shorter match first.
|
|
|
|
Example of partial matching using pcre2test
|
|
|
|
The pcre2test data modifiers partial_hard (or ph) and partial_soft (or
|
|
ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
|
|
calling pcre2_match(). Here is a run of pcre2test using a pattern that
|
|
matches the whole subject in the form of a date:
|
|
|
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
data> 25dec3\=ph
|
|
Partial match: 23dec3
|
|
data> 3ju\=ph
|
|
Partial match: 3ju
|
|
data> 3juj\=ph
|
|
No match
|
|
|
|
This example gives the same results for both hard and soft partial
|
|
matching options. Here is an example where there is a difference:
|
|
|
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
data> 25jun04\=ps
|
|
0: 25jun04
|
|
1: jun
|
|
data> 25jun04\=ph
|
|
Partial match: 25jun04
|
|
|
|
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
|
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
|
|
so there is only a partial match.
|
|
|
|
|
|
MULTI-SEGMENT MATCHING WITH pcre2_match()
|
|
|
|
PCRE was not originally designed with multi-segment matching in mind.
|
|
However, over time, features (including partial matching) that make
|
|
multi-segment matching possible have been added. A very long string can
|
|
be searched segment by segment by calling pcre2_match() repeatedly,
|
|
with the aim of achieving the same results that would happen if the en-
|
|
tire string was available for searching all the time. Normally, the
|
|
strings that are being sought are much shorter than each individual
|
|
segment, and are in the middle of very long strings, so the pattern is
|
|
normally not anchored.
|
|
|
|
Special logic must be implemented to handle a matched substring that
|
|
spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
|
|
returns a partial match at the end of a segment whenever there is the
|
|
possibility of changing the match by adding more characters. The
|
|
PCRE2_NOTBOL option should also be set for all but the first segment.
|
|
|
|
When a partial match occurs, the next segment must be added to the cur-
|
|
rent subject and the match re-run, using the startoffset argument of
|
|
pcre2_match() to begin at the point where the partial match started.
|
|
For example:
|
|
|
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
|
data> ...the date is 23ja\=ph
|
|
Partial match: 23ja
|
|
data> ...the date is 23jan19 and on that day...\=offset=15
|
|
0: 23jan19
|
|
1: jan
|
|
|
|
Note the use of the offset modifier to start the new match where the
|
|
partial match was found. In this example, the next segment was added to
|
|
the one in which the partial match was found. This is the most
|
|
straightforward approach, typically using a memory buffer that is twice
|
|
the size of each segment. After a partial match, the first half of the
|
|
buffer is discarded, the second half is moved to the start of the buf-
|
|
fer, and a new segment is added before repeating the match as in the
|
|
example above. After a no match, the entire buffer can be discarded.
|
|
|
|
If there are memory constraints, you may want to discard text that pre-
|
|
cedes a partial match before adding the next segment. Unfortunately,
|
|
this is not at present straightforward. In cases such as the above,
|
|
where the pattern does not contain any lookbehinds, it is sufficient to
|
|
retain only the partially matched substring. However, if the pattern
|
|
contains a lookbehind assertion, characters that precede the start of
|
|
the partial match may have been inspected during the matching process.
|
|
When pcre2test displays a partial match, it indicates these characters
|
|
with '<' if the allusedtext modifier is set:
|
|
|
|
re> "(?<=123)abc"
|
|
data> xx123ab\=ph,allusedtext
|
|
Partial match: 123ab
|
|
<<<
|
|
|
|
However, the allusedtext modifier is not available for JIT matching,
|
|
because JIT matching does not record the first (or last) consulted
|
|
characters. For this reason, this information is not available via the
|
|
API. It is therefore not possible in general to obtain the exact number
|
|
of characters that must be retained in order to get the right match re-
|
|
sult. If you cannot retain the entire segment, you must find some
|
|
heuristic way of choosing.
|
|
|
|
If you know the approximate length of the matching substrings, you can
|
|
use that to decide how much text to retain. The only lookbehind infor-
|
|
mation that is currently available via the API is the length of the
|
|
longest individual lookbehind in a pattern, but this can be misleading
|
|
if there are nested lookbehinds. The value returned by calling
|
|
pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the
|
|
maximum number of characters (not code units) that any individual look-
|
|
behind moves back when it is processed. A pattern such as
|
|
"(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects two
|
|
characters before its starting point.
|
|
|
|
In a non-UTF or a 32-bit case, moving back is just a subtraction, but
|
|
in UTF-8 or UTF-16 you have to count characters while moving back
|
|
through the code units.
|
|
|
|
|
|
PARTIAL MATCHING USING pcre2_dfa_match()
|
|
|
|
The DFA function moves along the subject string character by character,
|
|
without backtracking, searching for all possible matches simultane-
|
|
ously. If the end of the subject is reached before the end of the pat-
|
|
tern, there is the possibility of a partial match.
|
|
|
|
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
|
|
there have been no complete matches. Otherwise, the complete matches
|
|
are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
|
|
precedence over any complete matches. The portion of the string that
|
|
was matched when the longest partial match was found is set as the
|
|
first matching string.
|
|
|
|
Because the DFA function always searches for all possible matches, and
|
|
there is no difference between greedy and ungreedy repetition, its be-
|
|
haviour is different from the pcre2_match(). Consider the string "dog"
|
|
matched against this ungreedy pattern:
|
|
|
|
/dog(sbody)??/
|
|
|
|
Whereas the standard function stops as soon as it finds the complete
|
|
match for "dog", the DFA function also finds the partial match for
|
|
"dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
|
|
|
|
|
|
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
|
|
|
|
When a partial match has been found using the DFA matching function, it
|
|
is possible to continue the match by providing additional subject data
|
|
and calling the function again with the same compiled regular expres-
|
|
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
|
|
same working space as before, because this is where details of the pre-
|
|
vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or
|
|
PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial
|
|
matching over multiple segments. Here is an example using pcre2test:
|
|
|
|
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
|
data> 23ja\=dfa,ps
|
|
Partial match: 23ja
|
|
data> n05\=dfa,dfa_restart
|
|
0: n05
|
|
|
|
The first call has "23ja" as the subject, and requests partial match-
|
|
ing; the second call has "n05" as the subject for the continued
|
|
(restarted) match. Notice that when the match is complete, only the
|
|
last part is shown; PCRE2 does not retain the previously partially-
|
|
matched string. It is up to the calling program to do that if it needs
|
|
to. This means that, for an unanchored pattern, if a continued match
|
|
fails, it is not possible to try again at a new starting point. All
|
|
this facility is capable of doing is continuing with the previous match
|
|
attempt. For example, consider this pattern:
|
|
|
|
1234|3789
|
|
|
|
If the first part of the subject is "ABC123", a partial match of the
|
|
first alternative is found at offset 3. There is no partial match for
|
|
the second alternative, because such a match does not start at the same
|
|
point in the subject string. Attempting to continue with the string
|
|
"7890" does not yield a match because only those alternatives that
|
|
match at one point in the subject are remembered. Depending on the ap-
|
|
plication, this may or may not be what you want.
|
|
|
|
If you do want to allow for starting again at the next character, one
|
|
way of doing it is to retain some or all of the segment and try a new
|
|
complete match, as described for pcre2_match() above. Another possibil-
|
|
ity is to work with two buffers. If a partial match at offset n in the
|
|
first buffer is followed by "no match" when PCRE2_DFA_RESTART is used
|
|
on the second buffer, you can then try a new match starting at offset
|
|
n+1 in the first buffer.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 04 September 2019
|
|
Copyright (c) 1997-2019 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 REGULAR EXPRESSION DETAILS
|
|
|
|
The syntax and semantics of the regular expressions that are supported
|
|
by PCRE2 are described in detail below. There is a quick-reference syn-
|
|
tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
|
|
and semantics as closely as it can. PCRE2 also supports some alterna-
|
|
tive regular expression syntax (which does not conflict with the Perl
|
|
syntax) in order to provide some compatibility with regular expressions
|
|
in Python, .NET, and Oniguruma.
|
|
|
|
Perl's regular expressions are described in its own documentation, and
|
|
regular expressions in general are covered in a number of books, some
|
|
of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
|
|
pressions", published by O'Reilly, covers regular expressions in great
|
|
detail. This description of PCRE2's regular expressions is intended as
|
|
reference material.
|
|
|
|
This document discusses the regular expression patterns that are sup-
|
|
ported by PCRE2 when its main matching function, pcre2_match(), is
|
|
used. PCRE2 also has an alternative matching function,
|
|
pcre2_dfa_match(), which matches using a different algorithm that is
|
|
not Perl-compatible. Some of the features discussed below are not
|
|
available when DFA matching is used. The advantages and disadvantages
|
|
of the alternative function, and how it differs from the normal func-
|
|
tion, are discussed in the pcre2matching page.
|
|
|
|
|
|
SPECIAL START-OF-PATTERN ITEMS
|
|
|
|
A number of options that can be passed to pcre2_compile() can also be
|
|
set by special items at the start of a pattern. These are not Perl-com-
|
|
patible, but are provided to make these options accessible to pattern
|
|
writers who are not able to change the program that processes the pat-
|
|
tern. Any number of these items may appear, but they must all be to-
|
|
gether right at the start of the pattern string, and the letters must
|
|
be in upper case.
|
|
|
|
UTF support
|
|
|
|
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
|
|
as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
|
|
can be specified for the 32-bit library, in which case it constrains
|
|
the character values to valid Unicode code points. To process UTF
|
|
strings, PCRE2 must be built to include Unicode support (which is the
|
|
default). When using UTF strings you must either call the compiling
|
|
function with one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
|
|
options, or the pattern must start with the special sequence (*UTF),
|
|
which is equivalent to setting the relevant PCRE2_UTF. How setting a
|
|
UTF mode affects pattern matching is mentioned in several places below.
|
|
There is also a summary of features in the pcre2unicode page.
|
|
|
|
Some applications that allow their users to supply patterns may wish to
|
|
restrict them to non-UTF data for security reasons. If the
|
|
PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al-
|
|
lowed, and its appearance in a pattern causes an error.
|
|
|
|
Unicode property support
|
|
|
|
Another special sequence that may appear at the start of a pattern is
|
|
(*UCP). This has the same effect as setting the PCRE2_UCP option: it
|
|
causes sequences such as \d and \w to use Unicode properties to deter-
|
|
mine character types, instead of recognizing only characters with codes
|
|
less than 256 via a lookup table. If also causes upper/lower casing op-
|
|
erations to use Unicode properties for characters with code points
|
|
greater than 127, even when UTF is not set.
|
|
|
|
Some applications that allow their users to supply patterns may wish to
|
|
restrict them for security reasons. If the PCRE2_NEVER_UCP option is
|
|
passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
|
|
a pattern causes an error.
|
|
|
|
Locking out empty string matching
|
|
|
|
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
|
|
effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
|
|
to whichever matching function is subsequently called to match the pat-
|
|
tern. These options lock out the matching of empty strings, either en-
|
|
tirely, or only at the start of the subject.
|
|
|
|
Disabling auto-possessification
|
|
|
|
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
|
|
setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
|
|
quantifiers possessive when what follows cannot match the repeated
|
|
item. For example, by default a+b is treated as a++b. For more details,
|
|
see the pcre2api documentation.
|
|
|
|
Disabling start-up optimizations
|
|
|
|
If a pattern starts with (*NO_START_OPT), it has the same effect as
|
|
setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
|
|
mizations for quickly reaching "no match" results. For more details,
|
|
see the pcre2api documentation.
|
|
|
|
Disabling automatic anchoring
|
|
|
|
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
|
|
as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
|
|
tions that apply to patterns whose top-level branches all start with .*
|
|
(match any number of arbitrary characters). For more details, see the
|
|
pcre2api documentation.
|
|
|
|
Disabling JIT compilation
|
|
|
|
If a pattern that starts with (*NO_JIT) is successfully compiled, an
|
|
attempt by the application to apply the JIT optimization by calling
|
|
pcre2_jit_compile() is ignored.
|
|
|
|
Setting match resource limits
|
|
|
|
The pcre2_match() function contains a counter that is incremented every
|
|
time it goes round its main loop. The caller of pcre2_match() can set a
|
|
limit on this counter, which therefore limits the amount of computing
|
|
resource used for a match. The maximum depth of nested backtracking can
|
|
also be limited; this indirectly restricts the amount of heap memory
|
|
that is used, but there is also an explicit memory limit that can be
|
|
set.
|
|
|
|
These facilities are provided to catch runaway matches that are pro-
|
|
voked by patterns with huge matching trees. A common example is a pat-
|
|
tern with nested unlimited repeats applied to a long string that does
|
|
not match. When one of these limits is reached, pcre2_match() gives an
|
|
error return. The limits can also be set by items at the start of the
|
|
pattern of the form
|
|
|
|
(*LIMIT_HEAP=d)
|
|
(*LIMIT_MATCH=d)
|
|
(*LIMIT_DEPTH=d)
|
|
|
|
where d is any number of decimal digits. However, the value of the set-
|
|
ting must be less than the value set (or defaulted) by the caller of
|
|
pcre2_match() for it to have any effect. In other words, the pattern
|
|
writer can lower the limits set by the programmer, but not raise them.
|
|
If there is more than one setting of one of these limits, the lower
|
|
value is used. The heap limit is specified in kibibytes (units of 1024
|
|
bytes).
|
|
|
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
|
|
name is still recognized for backwards compatibility.
|
|
|
|
The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
|
|
interpreters are used for matching. It does not apply to JIT. The match
|
|
limit is used (but in a different way) when JIT is being used, or when
|
|
pcre2_dfa_match() is called, to limit computing resource usage by those
|
|
matching functions. The depth limit is ignored by JIT but is relevant
|
|
for DFA matching, which uses function recursion for recursions within
|
|
the pattern and for lookaround assertions and atomic groups. In this
|
|
case, the depth limit controls the depth of such recursion.
|
|
|
|
Newline conventions
|
|
|
|
PCRE2 supports six different conventions for indicating line breaks in
|
|
strings: a single CR (carriage return) character, a single LF (line-
|
|
feed) character, the two-character sequence CRLF, any of the three pre-
|
|
ceding, any Unicode newline sequence, or the NUL character (binary
|
|
zero). The pcre2api page has further discussion about newlines, and
|
|
shows how to set the newline convention when calling pcre2_compile().
|
|
|
|
It is also possible to specify a newline convention by starting a pat-
|
|
tern string with one of the following sequences:
|
|
|
|
(*CR) carriage return
|
|
(*LF) linefeed
|
|
(*CRLF) carriage return, followed by linefeed
|
|
(*ANYCRLF) any of the three above
|
|
(*ANY) all Unicode newline sequences
|
|
(*NUL) the NUL character (binary zero)
|
|
|
|
These override the default and the options given to the compiling func-
|
|
tion. For example, on a Unix system where LF is the default newline se-
|
|
quence, the pattern
|
|
|
|
(*CR)a.b
|
|
|
|
changes the convention to CR. That pattern matches "a\nb" because LF is
|
|
no longer a newline. If more than one of these settings is present, the
|
|
last one is used.
|
|
|
|
The newline convention affects where the circumflex and dollar asser-
|
|
tions are true. It also affects the interpretation of the dot metachar-
|
|
acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
|
|
followed by an opening brace. However, it does not affect what the \R
|
|
escape sequence matches. By default, this is any Unicode newline se-
|
|
quence, for Perl compatibility. However, this can be changed; see the
|
|
next section and the description of \R in the section entitled "Newline
|
|
sequences" below. A change of \R setting can be combined with a change
|
|
of newline convention.
|
|
|
|
Specifying what \R matches
|
|
|
|
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
|
|
the complete set of Unicode line endings) by setting the option
|
|
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
|
|
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
|
|
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
|
|
|
|
|
|
EBCDIC CHARACTER CODES
|
|
|
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
|
character code instead of ASCII or Unicode (typically a mainframe sys-
|
|
tem). In the sections below, character code values are ASCII or Uni-
|
|
code; in an EBCDIC environment these characters may have different code
|
|
values, and there are no code points greater than 255.
|
|
|
|
|
|
CHARACTERS AND METACHARACTERS
|
|
|
|
A regular expression is a pattern that is matched against a subject
|
|
string from left to right. Most characters stand for themselves in a
|
|
pattern, and match the corresponding characters in the subject. As a
|
|
trivial example, the pattern
|
|
|
|
The quick brown fox
|
|
|
|
matches a portion of a subject string that is identical to itself. When
|
|
caseless matching is specified (the PCRE2_CASELESS option or (?i)
|
|
within the pattern), letters are matched independently of case. Note
|
|
that there are two ASCII characters, K and S, that, in addition to
|
|
their lower case ASCII equivalents, are case-equivalent with Unicode
|
|
U+212A (Kelvin sign) and U+017F (long S) respectively when either
|
|
PCRE2_UTF or PCRE2_UCP is set.
|
|
|
|
The power of regular expressions comes from the ability to include wild
|
|
cards, character classes, alternatives, and repetitions in the pattern.
|
|
These are encoded in the pattern by the use of metacharacters, which do
|
|
not stand for themselves but instead are interpreted in some special
|
|
way.
|
|
|
|
There are two different sets of metacharacters: those that are recog-
|
|
nized anywhere in the pattern except within square brackets, and those
|
|
that are recognized within square brackets. Outside square brackets,
|
|
the metacharacters are as follows:
|
|
|
|
\ general escape character with several uses
|
|
^ assert start of string (or line, in multiline mode)
|
|
$ assert end of string (or line, in multiline mode)
|
|
. match any character except newline (by default)
|
|
[ start character class definition
|
|
| start of alternative branch
|
|
( start group or control verb
|
|
) end group or control verb
|
|
* 0 or more quantifier
|
|
+ 1 or more quantifier; also "possessive quantifier"
|
|
? 0 or 1 quantifier; also quantifier minimizer
|
|
{ start min/max quantifier
|
|
|
|
Part of a pattern that is in square brackets is called a "character
|
|
class". In a character class the only metacharacters are:
|
|
|
|
\ general escape character
|
|
^ negate the class, but only if the first character
|
|
- indicates character range
|
|
[ POSIX character class (if followed by POSIX syntax)
|
|
] terminates the character class
|
|
|
|
If a pattern is compiled with the PCRE2_EXTENDED option, most white
|
|
space in the pattern, other than in a character class, and characters
|
|
between a # outside a character class and the next newline, inclusive,
|
|
are ignored. An escaping backslash can be used to include a white space
|
|
or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
|
|
tion is set, the same applies, but in addition unescaped space and hor-
|
|
izontal tab characters are ignored inside a character class. Note: only
|
|
these two characters are ignored, not the full set of pattern white
|
|
space characters that are ignored outside a character class. Option
|
|
settings can be changed within a pattern; see the section entitled "In-
|
|
ternal Option Setting" below.
|
|
|
|
The following sections describe the use of each of the metacharacters.
|
|
|
|
|
|
BACKSLASH
|
|
|
|
The backslash character has several uses. Firstly, if it is followed by
|
|
a character that is not a digit or a letter, it takes away any special
|
|
meaning that character may have. This use of backslash as an escape
|
|
character applies both inside and outside character classes.
|
|
|
|
For example, if you want to match a * character, you must write \* in
|
|
the pattern. This escaping action applies whether or not the following
|
|
character would otherwise be interpreted as a metacharacter, so it is
|
|
always safe to precede a non-alphanumeric with backslash to specify
|
|
that it stands for itself. In particular, if you want to match a back-
|
|
slash, you write \\.
|
|
|
|
Only ASCII digits and letters have any special meaning after a back-
|
|
slash. All other characters (in particular, those whose code points are
|
|
greater than 127) are treated as literals.
|
|
|
|
If you want to treat all characters in a sequence as literals, you can
|
|
do so by putting them between \Q and \E. This is different from Perl in
|
|
that $ and @ are handled as literals in \Q...\E sequences in PCRE2,
|
|
whereas in Perl, $ and @ cause variable interpolation. Also, Perl does
|
|
"double-quotish backslash interpolation" on any backslashes between \Q
|
|
and \E which, its documentation says, "may lead to confusing results".
|
|
PCRE2 treats a backslash between \Q and \E just like any other charac-
|
|
ter. Note the following examples:
|
|
|
|
Pattern PCRE2 matches Perl matches
|
|
|
|
\Qabc$xyz\E abc$xyz abc followed by the
|
|
contents of $xyz
|
|
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
|
\QA\B\E A\B A\B
|
|
\Q\\E \ \\E
|
|
|
|
The \Q...\E sequence is recognized both inside and outside character
|
|
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
|
|
is not followed by \E later in the pattern, the literal interpretation
|
|
continues to the end of the pattern (that is, \E is assumed at the
|
|
end). If the isolated \Q is inside a character class, this causes an
|
|
error, because the character class is not terminated by a closing
|
|
square bracket.
|
|
|
|
Non-printing characters
|
|
|
|
A second use of backslash provides a way of encoding non-printing char-
|
|
acters in patterns in a visible manner. There is no restriction on the
|
|
appearance of non-printing characters in a pattern, but when a pattern
|
|
is being prepared by text editing, it is often easier to use one of the
|
|
following escape sequences instead of the binary character it repre-
|
|
sents. In an ASCII or Unicode environment, these escapes are as fol-
|
|
lows:
|
|
|
|
\a alarm, that is, the BEL character (hex 07)
|
|
\cx "control-x", where x is any printable ASCII character
|
|
\e escape (hex 1B)
|
|
\f form feed (hex 0C)
|
|
\n linefeed (hex 0A)
|
|
\r carriage return (hex 0D) (but see below)
|
|
\t tab (hex 09)
|
|
\0dd character with octal code 0dd
|
|
\ddd character with octal code ddd, or backreference
|
|
\o{ddd..} character with octal code ddd..
|
|
\xhh character with hex code hh
|
|
\x{hhh..} character with hex code hhh..
|
|
\N{U+hhh..} character with Unicode hex code point hhh..
|
|
|
|
By default, after \x that is not followed by {, from zero to two hexa-
|
|
decimal digits are read (letters can be in upper or lower case). Any
|
|
number of hexadecimal digits may appear between \x{ and }. If a charac-
|
|
ter other than a hexadecimal digit appears between \x{ and }, or if
|
|
there is no terminating }, an error occurs.
|
|
|
|
Characters whose code points are less than 256 can be defined by either
|
|
of the two syntaxes for \x or by an octal sequence. There is no differ-
|
|
ence in the way they are handled. For example, \xdc is exactly the same
|
|
as \x{dc} or \334. However, using the braced versions does make such
|
|
sequences easier to read.
|
|
|
|
Support is available for some ECMAScript (aka JavaScript) escape se-
|
|
quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
|
|
quence \x followed by { is not recognized. Only if \x is followed by
|
|
two hexadecimal digits is it recognized as a character escape. Other-
|
|
wise it is interpreted as a literal "x" character. In this mode, sup-
|
|
port for code points greater than 256 is provided by \u, which must be
|
|
followed by four hexadecimal digits; otherwise it is interpreted as a
|
|
literal "u" character.
|
|
|
|
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in ad-
|
|
dition, \u{hhh..} is recognized as the character specified by hexadeci-
|
|
mal code point. There may be any number of hexadecimal digits. This
|
|
syntax is from ECMAScript 6.
|
|
|
|
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
|
|
ating in UTF mode. Perl also uses \N{name} to specify characters by
|
|
Unicode name; PCRE2 does not support this. Note that when \N is not
|
|
followed by an opening brace (curly bracket) it has an entirely differ-
|
|
ent meaning, matching any character that is not a newline.
|
|
|
|
There are some legacy applications where the escape sequence \r is ex-
|
|
pected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
|
|
is set, \r in a pattern is converted to \n so that it matches a LF
|
|
(linefeed) instead of a CR (carriage return) character.
|
|
|
|
The precise effect of \cx on ASCII characters is as follows: if x is a
|
|
lower case letter, it is converted to upper case. Then bit 6 of the
|
|
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
|
|
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
|
|
hex 7B (; is 3B). If the code unit following \c has a value less than
|
|
32 or greater than 126, a compile-time error occurs.
|
|
|
|
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
|
|
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
|
|
The \c escape is processed as specified for Perl in the perlebcdic doc-
|
|
ument. The only characters that are allowed after \c are A-Z, a-z, or
|
|
one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
|
|
time error. The sequence \c@ encodes character code 0; after \c the
|
|
letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
|
|
\, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
|
|
comes either 255 (hex FF) or 95 (hex 5F).
|
|
|
|
Thus, apart from \c?, these escapes generate the same character code
|
|
values as they do in an ASCII environment, though the meanings of the
|
|
values mostly differ. For example, \cG always generates code value 7,
|
|
which is BEL in ASCII but DEL in EBCDIC.
|
|
|
|
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
|
|
but because 127 is not a control character in EBCDIC, Perl makes it
|
|
generate the APC character. Unfortunately, there are several variants
|
|
of EBCDIC. In most of them the APC character has the value 255 (hex
|
|
FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
|
|
certain other characters have POSIX-BC values, PCRE2 makes \c? generate
|
|
95; otherwise it generates 255.
|
|
|
|
After \0 up to two further octal digits are read. If there are fewer
|
|
than two digits, just those that are present are used. Thus the se-
|
|
quence \0\x\015 specifies two binary zeros followed by a CR character
|
|
(code value 13). Make sure you supply two digits after the initial zero
|
|
if the pattern character that follows is itself an octal digit.
|
|
|
|
The escape \o must be followed by a sequence of octal digits, enclosed
|
|
in braces. An error occurs if this is not the case. This escape is a
|
|
recent addition to Perl; it provides way of specifying character code
|
|
points as octal numbers greater than 0777, and it also allows octal
|
|
numbers and backreferences to be unambiguously specified.
|
|
|
|
For greater clarity and unambiguity, it is best to avoid following \ by
|
|
a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
|
|
cal character code points, and \g{} to specify backreferences. The fol-
|
|
lowing paragraphs describe the old, ambiguous syntax.
|
|
|
|
The handling of a backslash followed by a digit other than 0 is compli-
|
|
cated, and Perl has changed over time, causing PCRE2 also to change.
|
|
|
|
Outside a character class, PCRE2 reads the digit and any following dig-
|
|
its as a decimal number. If the number is less than 10, begins with the
|
|
digit 8 or 9, or if there are at least that many previous capture
|
|
groups in the expression, the entire sequence is taken as a backrefer-
|
|
ence. A description of how this works is given later, following the
|
|
discussion of parenthesized groups. Otherwise, up to three octal dig-
|
|
its are read to form a character code.
|
|
|
|
Inside a character class, PCRE2 handles \8 and \9 as the literal char-
|
|
acters "8" and "9", and otherwise reads up to three octal digits fol-
|
|
lowing the backslash, using them to generate a data character. Any sub-
|
|
sequent digits stand for themselves. For example, outside a character
|
|
class:
|
|
|
|
\040 is another way of writing an ASCII space
|
|
\40 is the same, provided there are fewer than 40
|
|
previous capture groups
|
|
\7 is always a backreference
|
|
\11 might be a backreference, or another way of
|
|
writing a tab
|
|
\011 is always a tab
|
|
\0113 is a tab followed by the character "3"
|
|
\113 might be a backreference, otherwise the
|
|
character with octal code 113
|
|
\377 might be a backreference, otherwise
|
|
the value 255 (decimal)
|
|
\81 is always a backreference
|
|
|
|
Note that octal values of 100 or greater that are specified using this
|
|
syntax must not be introduced by a leading zero, because no more than
|
|
three octal digits are ever read.
|
|
|
|
Constraints on character values
|
|
|
|
Characters that are specified using octal or hexadecimal numbers are
|
|
limited to certain values, as follows:
|
|
|
|
8-bit non-UTF mode no greater than 0xff
|
|
16-bit non-UTF mode no greater than 0xffff
|
|
32-bit non-UTF mode no greater than 0xffffffff
|
|
All UTF modes no greater than 0x10ffff and a valid code point
|
|
|
|
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
|
|
(the so-called "surrogate" code points). The check for these can be
|
|
disabled by the caller of pcre2_compile() by setting the option
|
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
|
|
UTF-8 and UTF-32 modes, because these values are not representable in
|
|
UTF-16.
|
|
|
|
Escape sequences in character classes
|
|
|
|
All the sequences that define a single character value can be used both
|
|
inside and outside character classes. In addition, inside a character
|
|
class, \b is interpreted as the backspace character (hex 08).
|
|
|
|
When not followed by an opening brace, \N is not allowed in a character
|
|
class. \B, \R, and \X are not special inside a character class. Like
|
|
other unrecognized alphabetic escape sequences, they cause an error.
|
|
Outside a character class, these sequences have different meanings.
|
|
|
|
Unsupported escape sequences
|
|
|
|
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
|
|
string handler and used to modify the case of following characters. By
|
|
default, PCRE2 does not support these escape sequences in patterns.
|
|
However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op-
|
|
tions is set, \U matches a "U" character, and \u can be used to define
|
|
a character by code point, as described above.
|
|
|
|
Absolute and relative backreferences
|
|
|
|
The sequence \g followed by a signed or unsigned number, optionally en-
|
|
closed in braces, is an absolute or relative backreference. A named
|
|
backreference can be coded as \g{name}. Backreferences are discussed
|
|
later, following the discussion of parenthesized groups.
|
|
|
|
Absolute and relative subroutine calls
|
|
|
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
|
|
name or a number enclosed either in angle brackets or single quotes, is
|
|
an alternative syntax for referencing a capture group as a subroutine.
|
|
Details are discussed later. Note that \g{...} (Perl syntax) and
|
|
\g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
|
|
erence; the latter is a subroutine call.
|
|
|
|
Generic character types
|
|
|
|
Another use of backslash is for specifying generic character types:
|
|
|
|
\d any decimal digit
|
|
\D any character that is not a decimal digit
|
|
\h any horizontal white space character
|
|
\H any character that is not a horizontal white space character
|
|
\N any character that is not a newline
|
|
\s any white space character
|
|
\S any character that is not a white space character
|
|
\v any vertical white space character
|
|
\V any character that is not a vertical white space character
|
|
\w any "word" character
|
|
\W any "non-word" character
|
|
|
|
The \N escape sequence has the same meaning as the "." metacharacter
|
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
|
|
the meaning of \N. Note that when \N is followed by an opening brace it
|
|
has a different meaning. See the section entitled "Non-printing charac-
|
|
ters" above for details. Perl also uses \N{name} to specify characters
|
|
by Unicode name; PCRE2 does not support this.
|
|
|
|
Each pair of lower and upper case escape sequences partitions the com-
|
|
plete set of characters into two disjoint sets. Any given character
|
|
matches one, and only one, of each pair. The sequences can appear both
|
|
inside and outside character classes. They each match one character of
|
|
the appropriate type. If the current matching point is at the end of
|
|
the subject string, all of them fail, because there is no character to
|
|
match.
|
|
|
|
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
|
|
(13), and space (32), which are defined as white space in the "C" lo-
|
|
cale. This list may vary if locale-specific matching is taking place.
|
|
For example, in some locales the "non-breaking space" character (\xA0)
|
|
is recognized as white space, and in others the VT character is not.
|
|
|
|
A "word" character is an underscore or any character that is a letter
|
|
or digit. By default, the definition of letters and digits is con-
|
|
trolled by PCRE2's low-valued character tables, and may vary if locale-
|
|
specific matching is taking place (see "Locale support" in the pcre2api
|
|
page). For example, in a French locale such as "fr_FR" in Unix-like
|
|
systems, or "french" in Windows, some character codes greater than 127
|
|
are used for accented letters, and these are then matched by \w. The
|
|
use of locales with Unicode is discouraged.
|
|
|
|
By default, characters whose code points are greater than 127 never
|
|
match \d, \s, or \w, and always match \D, \S, and \W, although this may
|
|
be different for characters in the range 128-255 when locale-specific
|
|
matching is happening. These escape sequences retain their original
|
|
meanings from before Unicode support was available, mainly for effi-
|
|
ciency reasons. If the PCRE2_UCP option is set, the behaviour is
|
|
changed so that Unicode properties are used to determine character
|
|
types, as follows:
|
|
|
|
\d any character that matches \p{Nd} (decimal digit)
|
|
\s any character that matches \p{Z} or \h or \v
|
|
\w any character that matches \p{L} or \p{N}, plus underscore
|
|
|
|
The upper case escapes match the inverse sets of characters. Note that
|
|
\d matches only decimal digits, whereas \w matches any Unicode digit,
|
|
as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
|
|
affects \b, and \B because they are defined in terms of \w and \W.
|
|
Matching these sequences is noticeably slower when PCRE2_UCP is set.
|
|
|
|
The sequences \h, \H, \v, and \V, in contrast to the other sequences,
|
|
which match only ASCII characters by default, always match a specific
|
|
list of code points, whether or not PCRE2_UCP is set. The horizontal
|
|
space characters are:
|
|
|
|
U+0009 Horizontal tab (HT)
|
|
U+0020 Space
|
|
U+00A0 Non-break space
|
|
U+1680 Ogham space mark
|
|
U+180E Mongolian vowel separator
|
|
U+2000 En quad
|
|
U+2001 Em quad
|
|
U+2002 En space
|
|
U+2003 Em space
|
|
U+2004 Three-per-em space
|
|
U+2005 Four-per-em space
|
|
U+2006 Six-per-em space
|
|
U+2007 Figure space
|
|
U+2008 Punctuation space
|
|
U+2009 Thin space
|
|
U+200A Hair space
|
|
U+202F Narrow no-break space
|
|
U+205F Medium mathematical space
|
|
U+3000 Ideographic space
|
|
|
|
The vertical space characters are:
|
|
|
|
U+000A Linefeed (LF)
|
|
U+000B Vertical tab (VT)
|
|
U+000C Form feed (FF)
|
|
U+000D Carriage return (CR)
|
|
U+0085 Next line (NEL)
|
|
U+2028 Line separator
|
|
U+2029 Paragraph separator
|
|
|
|
In 8-bit, non-UTF-8 mode, only the characters with code points less
|
|
than 256 are relevant.
|
|
|
|
Newline sequences
|
|
|
|
Outside a character class, by default, the escape sequence \R matches
|
|
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
|
|
to the following:
|
|
|
|
(?>\r\n|\n|\x0b|\f|\r|\x85)
|
|
|
|
This is an example of an "atomic group", details of which are given be-
|
|
low. This particular group matches either the two-character sequence
|
|
CR followed by LF, or one of the single characters LF (linefeed,
|
|
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
|
|
riage return, U+000D), or NEL (next line, U+0085). Because this is an
|
|
atomic group, the two-character sequence is treated as a single unit
|
|
that cannot be split.
|
|
|
|
In other modes, two additional characters whose code points are greater
|
|
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
|
|
rator, U+2029). Unicode support is not needed for these characters to
|
|
be recognized.
|
|
|
|
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
|
|
the complete set of Unicode line endings) by setting the option
|
|
PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
|
|
slash R".) This can be made the default when PCRE2 is built; if this is
|
|
the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
|
|
CODE option. It is also possible to specify these settings by starting
|
|
a pattern string with one of the following sequences:
|
|
|
|
(*BSR_ANYCRLF) CR, LF, or CRLF only
|
|
(*BSR_UNICODE) any Unicode newline sequence
|
|
|
|
These override the default and the options given to the compiling func-
|
|
tion. Note that these special settings, which are not Perl-compatible,
|
|
are recognized only at the very start of a pattern, and that they must
|
|
be in upper case. If more than one of them is present, the last one is
|
|
used. They can be combined with a change of newline convention; for ex-
|
|
ample, a pattern can start with:
|
|
|
|
(*ANY)(*BSR_ANYCRLF)
|
|
|
|
They can also be combined with the (*UTF) or (*UCP) special sequences.
|
|
Inside a character class, \R is treated as an unrecognized escape se-
|
|
quence, and causes an error.
|
|
|
|
Unicode character properties
|
|
|
|
When PCRE2 is built with Unicode support (the default), three addi-
|
|
tional escape sequences that match characters with specific properties
|
|
are available. They can be used in any mode, though in 8-bit and 16-bit
|
|
non-UTF modes these sequences are of course limited to testing charac-
|
|
ters whose code points are less than U+0100 and U+10000, respectively.
|
|
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
|
|
limit) may be encountered. These are all treated as being in the Un-
|
|
known script and with an unassigned type. The extra escape sequences
|
|
are:
|
|
|
|
\p{xx} a character with the xx property
|
|
\P{xx} a character without the xx property
|
|
\X a Unicode extended grapheme cluster
|
|
|
|
The property names represented by xx above are not case-sensitive, and
|
|
in accordance with Unicode's "loose matching" rules, spaces, hyphens,
|
|
and underscores are ignored. There is support for Unicode script names,
|
|
Unicode general category properties, "Any", which matches any character
|
|
(including newline), Bidi_Control, Bidi_Class, and some special PCRE2
|
|
properties (described below). Other Perl properties such as "InMusi-
|
|
calSymbols" are not supported by PCRE2. Note that \P{Any} does not
|
|
match any characters, so always causes a match failure.
|
|
|
|
There are three different syntax forms for matching a script. Each Uni-
|
|
code character has a basic script and, optionally, a list of other
|
|
scripts ("Script Extentions") with which it is commonly used. Using the
|
|
Adlam script as an example, \p{sc:Adlam} matches characters whose basic
|
|
script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
|
|
that have Adlam in their extensions list. The full names "script" and
|
|
"script extensions" for the property types are recognized, and a equals
|
|
sign is an alternative to the colon. If a script name is given without
|
|
a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
|
|
lam}. Perl changed to this interpretation at release 5.26 and PCRE2
|
|
changed at release 10.40.
|
|
|
|
Unassigned characters (and in non-UTF 32-bit mode, characters with code
|
|
points greater than 0x10FFFF) are assigned the "Unknown" script. Others
|
|
that are not part of an identified script are lumped together as "Com-
|
|
mon". The current list of script names and their 4-letter abbreviations
|
|
is:
|
|
|
|
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab),
|
|
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
|
|
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
|
|
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid
|
|
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
|
|
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
|
|
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
|
|
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
|
|
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
|
|
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
|
|
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
|
|
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
|
|
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
|
|
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
|
|
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
|
|
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
|
|
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
|
|
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
|
|
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
|
|
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
|
|
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
|
|
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
|
|
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
|
|
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
|
|
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
|
|
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
|
|
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
|
|
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
|
|
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
|
|
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
|
|
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
|
|
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
|
|
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
|
|
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
|
|
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
|
|
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
|
|
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
|
|
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
|
|
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
|
|
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
|
|
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
|
|
(Zanb).
|
|
|
|
Each character has exactly one Unicode general category property, spec-
|
|
ified by a two-letter abbreviation. For compatibility with Perl, nega-
|
|
tion can be specified by including a circumflex between the opening
|
|
brace and the property name. For example, \p{^Lu} is the same as
|
|
\P{Lu}.
|
|
|
|
If only one letter is specified with \p or \P, it includes all the gen-
|
|
eral category properties that start with that letter. In this case, in
|
|
the absence of negation, the curly brackets in the escape sequence are
|
|
optional; these two examples have the same effect:
|
|
|
|
\p{L}
|
|
\pL
|
|
|
|
The following general category property codes are supported:
|
|
|
|
C Other
|
|
Cc Control
|
|
Cf Format
|
|
Cn Unassigned
|
|
Co Private use
|
|
Cs Surrogate
|
|
|
|
L Letter
|
|
Ll Lower case letter
|
|
Lm Modifier letter
|
|
Lo Other letter
|
|
Lt Title case letter
|
|
Lu Upper case letter
|
|
|
|
M Mark
|
|
Mc Spacing mark
|
|
Me Enclosing mark
|
|
Mn Non-spacing mark
|
|
|
|
N Number
|
|
Nd Decimal number
|
|
Nl Letter number
|
|
No Other number
|
|
|
|
P Punctuation
|
|
Pc Connector punctuation
|
|
Pd Dash punctuation
|
|
Pe Close punctuation
|
|
Pf Final punctuation
|
|
Pi Initial punctuation
|
|
Po Other punctuation
|
|
Ps Open punctuation
|
|
|
|
S Symbol
|
|
Sc Currency symbol
|
|
Sk Modifier symbol
|
|
Sm Mathematical symbol
|
|
So Other symbol
|
|
|
|
Z Separator
|
|
Zl Line separator
|
|
Zp Paragraph separator
|
|
Zs Space separator
|
|
|
|
The special property LC, which has the synonym L&, is also supported:
|
|
it matches a character that has the Lu, Ll, or Lt property, in other
|
|
words, a letter that is not classified as a modifier or "other".
|
|
|
|
The Cs (Surrogate) property applies only to characters whose code
|
|
points are in the range U+D800 to U+DFFF. These characters are no dif-
|
|
ferent to any other character when PCRE2 is not in UTF mode (using the
|
|
16-bit or 32-bit library). However, they are not valid in Unicode
|
|
strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
|
|
ity checking has been turned off (see the discussion of
|
|
PCRE2_NO_UTF_CHECK in the pcre2api page).
|
|
|
|
The long synonyms for property names that Perl supports (such as
|
|
\p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
|
|
any of these properties with "Is".
|
|
|
|
No character that is in the Unicode table has the Cn (unassigned) prop-
|
|
erty. Instead, this property is assumed for any code point that is not
|
|
in the Unicode table.
|
|
|
|
Specifying caseless matching does not affect these escape sequences.
|
|
For example, \p{Lu} always matches only upper case letters. This is
|
|
different from the behaviour of current versions of Perl.
|
|
|
|
Matching characters by Unicode property is not fast, because PCRE2 has
|
|
to do a multistage table lookup in order to find a character's prop-
|
|
erty. That is why the traditional escape sequences such as \d and \w do
|
|
not use Unicode properties in PCRE2 by default, though you can make
|
|
them do so by setting the PCRE2_UCP option or by starting the pattern
|
|
with (*UCP).
|
|
|
|
Bi-directional properties for \p and \P
|
|
|
|
Two properties relating to bi-directional text (each with a shorter
|
|
synonym) are supported:
|
|
|
|
\p{Bidi_Control} matches a Bidi control character
|
|
\p{Bidi_C} matches a Bidi control character
|
|
\p{Bidi_Class:<class>} matches a character with the given class
|
|
\p{BC:<class>} matches a character with the given class
|
|
|
|
The recognized classes are:
|
|
|
|
AL Arabic letter
|
|
AN Arabic number
|
|
B paragraph separator
|
|
BN boundary neutral
|
|
CS common separator
|
|
EN European number
|
|
ES European separator
|
|
ET European terminator
|
|
FSI first strong isolate
|
|
L left-to-right
|
|
LRE left-to-right embedding
|
|
LRI left-to-right isolate
|
|
LRO left-to-right override
|
|
NSM non-spacing mark
|
|
ON other neutral
|
|
PDF pop directional format
|
|
PDI pop directional isolate
|
|
R right-to-left
|
|
RLE right-to-left embedding
|
|
RLI right-to-left isolate
|
|
RLO right-to-left override
|
|
S segment separator
|
|
WS which space
|
|
|
|
For Bidi_Class, an equals sign may be used instead of a colon. The
|
|
class names are case-insensitive; only the short names listed above are
|
|
recognized.
|
|
|
|
Extended grapheme clusters
|
|
|
|
The \X escape matches any number of Unicode characters that form an
|
|
"extended grapheme cluster", and treats the sequence as an atomic group
|
|
(see below). Unicode supports various kinds of composite character by
|
|
giving each character a grapheme breaking property, and having rules
|
|
that use these properties to define the boundaries of extended grapheme
|
|
clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
|
|
Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
|
|
properties that had been used for emojis. Instead it introduced vari-
|
|
ous emoji-specific properties. PCRE2 uses only the Extended Picto-
|
|
graphic property.
|
|
|
|
\X always matches at least one character. Then it decides whether to
|
|
add additional characters according to the following rules for ending a
|
|
cluster:
|
|
|
|
1. End at the end of the subject string.
|
|
|
|
2. Do not end between CR and LF; otherwise end after any control char-
|
|
acter.
|
|
|
|
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
|
|
characters are of five types: L, V, T, LV, and LVT. An L character may
|
|
be followed by an L, V, LV, or LVT character; an LV or V character may
|
|
be followed by a V or T character; an LVT or T character may be fol-
|
|
lowed only by a T character.
|
|
|
|
4. Do not end before extending characters or spacing marks or the
|
|
"zero-width joiner" character. Characters with the "mark" property al-
|
|
ways have the "extend" grapheme breaking property.
|
|
|
|
5. Do not end after prepend characters.
|
|
|
|
6. Do not break within emoji modifier sequences or emoji zwj sequences.
|
|
That is, do not break between characters with the Extended_Pictographic
|
|
property. Extend and ZWJ characters are allowed between the charac-
|
|
ters.
|
|
|
|
7. Do not break within emoji flag sequences. That is, do not break be-
|
|
tween regional indicator (RI) characters if there are an odd number of
|
|
RI characters before the break point.
|
|
|
|
8. Otherwise, end the cluster.
|
|
|
|
PCRE2's additional properties
|
|
|
|
As well as the standard Unicode properties described above, PCRE2 sup-
|
|
ports four more that make it possible to convert traditional escape se-
|
|
quences such as \w and \s to use Unicode properties. PCRE2 uses these
|
|
non-standard, non-Perl properties internally when PCRE2_UCP is set.
|
|
However, they may also be used explicitly. These properties are:
|
|
|
|
Xan Any alphanumeric character
|
|
Xps Any POSIX space character
|
|
Xsp Any Perl space character
|
|
Xwd Any Perl "word" character
|
|
|
|
Xan matches characters that have either the L (letter) or the N (num-
|
|
ber) property. Xps matches the characters tab, linefeed, vertical tab,
|
|
form feed, or carriage return, and any other character that has the Z
|
|
(separator) property. Xsp is the same as Xps; in PCRE1 it used to ex-
|
|
clude vertical tab, for Perl compatibility, but Perl changed. Xwd
|
|
matches the same characters as Xan, plus underscore.
|
|
|
|
There is another non-standard property, Xuc, which matches any charac-
|
|
ter that can be represented by a Universal Character Name in C++ and
|
|
other programming languages. These are the characters $, @, ` (grave
|
|
accent), and all characters with Unicode code points greater than or
|
|
equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
|
|
most base (ASCII) characters are excluded. (Universal Character Names
|
|
are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
|
|
Note that the Xuc property does not match these sequences but the char-
|
|
acters that they represent.)
|
|
|
|
Resetting the match start
|
|
|
|
In normal use, the escape sequence \K causes any previously matched
|
|
characters not to be included in the final matched sequence that is re-
|
|
turned. For example, the pattern:
|
|
|
|
foo\Kbar
|
|
|
|
matches "foobar", but reports that it has matched "bar". \K does not
|
|
interact with anchoring in any way. The pattern:
|
|
|
|
^foo\Kbar
|
|
|
|
matches only when the subject begins with "foobar" (in single line
|
|
mode), though it again reports the matched string as "bar". This fea-
|
|
ture is similar to a lookbehind assertion (described below). However,
|
|
in this case, the part of the subject before the real match does not
|
|
have to be of fixed length, as lookbehind assertions do. The use of \K
|
|
does not interfere with the setting of captured substrings. For exam-
|
|
ple, when the pattern
|
|
|
|
(foo)\Kbar
|
|
|
|
matches "foobar", the first substring is still set to "foo".
|
|
|
|
From version 5.32.0 Perl forbids the use of \K in lookaround asser-
|
|
tions. From release 10.38 PCRE2 also forbids this by default. However,
|
|
the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
|
|
pcre2_compile() to re-enable the previous behaviour. When this option
|
|
is set, \K is acted upon when it occurs inside positive assertions, but
|
|
is ignored in negative assertions. Note that when a pattern such as
|
|
(?=ab\K) matches, the reported start of the match can be greater than
|
|
the end of the match. Using \K in a lookbehind assertion at the start
|
|
of a pattern can also lead to odd effects. For example, consider this
|
|
pattern:
|
|
|
|
(?<=\Kfoo)bar
|
|
|
|
If the subject is "foobar", a call to pcre2_match() with a starting
|
|
offset of 3 succeeds and reports the matching string as "foobar", that
|
|
is, the start of the reported match is earlier than where the match
|
|
started.
|
|
|
|
Simple assertions
|
|
|
|
The final use of backslash is for certain simple assertions. An asser-
|
|
tion specifies a condition that has to be met at a particular point in
|
|
a match, without consuming any characters from the subject string. The
|
|
use of groups for more complicated assertions is described below. The
|
|
backslashed assertions are:
|
|
|
|
\b matches at a word boundary
|
|
\B matches when not at a word boundary
|
|
\A matches at the start of the subject
|
|
\Z matches at the end of the subject
|
|
also matches before a newline at the end of the subject
|
|
\z matches only at the end of the subject
|
|
\G matches at the first matching position in the subject
|
|
|
|
Inside a character class, \b has a different meaning; it matches the
|
|
backspace character. If any other of these assertions appears in a
|
|
character class, an "invalid escape sequence" error is generated.
|
|
|
|
A word boundary is a position in the subject string where the current
|
|
character and the previous character do not both match \w or \W (i.e.
|
|
one matches \w and the other matches \W), or the start or end of the
|
|
string if the first or last character matches \w, respectively. When
|
|
PCRE2 is built with Unicode support, the meanings of \w and \W can be
|
|
changed by setting the PCRE2_UCP option. When this is done, it also af-
|
|
fects \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
|
|
or "end of word" metasequence. However, whatever follows \b normally
|
|
determines which it is. For example, the fragment \ba matches "a" at
|
|
the start of a word.
|
|
|
|
The \A, \Z, and \z assertions differ from the traditional circumflex
|
|
and dollar (described in the next section) in that they only ever match
|
|
at the very start and end of the subject string, whatever options are
|
|
set. Thus, they are independent of multiline mode. These three asser-
|
|
tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
|
|
which affect only the behaviour of the circumflex and dollar metachar-
|
|
acters. However, if the startoffset argument of pcre2_match() is non-
|
|
zero, indicating that matching is to start at a point other than the
|
|
beginning of the subject, \A can never match. The difference between
|
|
\Z and \z is that \Z matches before a newline at the end of the string
|
|
as well as at the very end, whereas \z matches only at the end.
|
|
|
|
The \G assertion is true only when the current matching position is at
|
|
the start point of the matching process, as specified by the startoff-
|
|
set argument of pcre2_match(). It differs from \A when the value of
|
|
startoffset is non-zero. By calling pcre2_match() multiple times with
|
|
appropriate arguments, you can mimic Perl's /g option, and it is in
|
|
this kind of implementation where \G can be useful.
|
|
|
|
Note, however, that PCRE2's implementation of \G, being true at the
|
|
starting character of the matching process, is subtly different from
|
|
Perl's, which defines it as true at the end of the previous match. In
|
|
Perl, these can be different when the previously matched string was
|
|
empty. Because PCRE2 does just one match at a time, it cannot reproduce
|
|
this behaviour.
|
|
|
|
If all the alternatives of a pattern begin with \G, the expression is
|
|
anchored to the starting match position, and the "anchored" flag is set
|
|
in the compiled regular expression.
|
|
|
|
|
|
CIRCUMFLEX AND DOLLAR
|
|
|
|
The circumflex and dollar metacharacters are zero-width assertions.
|
|
That is, they test for a particular condition being true without con-
|
|
suming any characters from the subject string. These two metacharacters
|
|
are concerned with matching the starts and ends of lines. If the new-
|
|
line convention is set so that only the two-character sequence CRLF is
|
|
recognized as a newline, isolated CR and LF characters are treated as
|
|
ordinary data characters, and are not recognized as newlines.
|
|
|
|
Outside a character class, in the default matching mode, the circumflex
|
|
character is an assertion that is true only if the current matching
|
|
point is at the start of the subject string. If the startoffset argu-
|
|
ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
|
|
flex can never match if the PCRE2_MULTILINE option is unset. Inside a
|
|
character class, circumflex has an entirely different meaning (see be-
|
|
low).
|
|
|
|
Circumflex need not be the first character of the pattern if a number
|
|
of alternatives are involved, but it should be the first thing in each
|
|
alternative in which it appears if the pattern is ever to match that
|
|
branch. If all possible alternatives start with a circumflex, that is,
|
|
if the pattern is constrained to match only at the start of the sub-
|
|
ject, it is said to be an "anchored" pattern. (There are also other
|
|
constructs that can cause a pattern to be anchored.)
|
|
|
|
The dollar character is an assertion that is true only if the current
|
|
matching point is at the end of the subject string, or immediately be-
|
|
fore a newline at the end of the string (by default), unless PCRE2_NO-
|
|
TEOL is set. Note, however, that it does not actually match the new-
|
|
line. Dollar need not be the last character of the pattern if a number
|
|
of alternatives are involved, but it should be the last item in any
|
|
branch in which it appears. Dollar has no special meaning in a charac-
|
|
ter class.
|
|
|
|
The meaning of dollar can be changed so that it matches only at the
|
|
very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
|
|
compile time. This does not affect the \Z assertion.
|
|
|
|
The meanings of the circumflex and dollar metacharacters are changed if
|
|
the PCRE2_MULTILINE option is set. When this is the case, a dollar
|
|
character matches before any newlines in the string, as well as at the
|
|
very end, and a circumflex matches immediately after internal newlines
|
|
as well as at the start of the subject string. It does not match after
|
|
a newline that ends the string, for compatibility with Perl. However,
|
|
this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
|
|
|
|
For example, the pattern /^abc$/ matches the subject string "def\nabc"
|
|
(where \n represents a newline) in multiline mode, but not otherwise.
|
|
Consequently, patterns that are anchored in single line mode because
|
|
all branches start with ^ are not anchored in multiline mode, and a
|
|
match for circumflex is possible when the startoffset argument of
|
|
pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
|
|
if PCRE2_MULTILINE is set.
|
|
|
|
When the newline convention (see "Newline conventions" below) recog-
|
|
nizes the two-character sequence CRLF as a newline, this is preferred,
|
|
even if the single characters CR and LF are also recognized as new-
|
|
lines. For example, if the newline convention is "any", a multiline
|
|
mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
|
|
than after CR, even though CR on its own is a valid newline. (It also
|
|
matches at the very start of the string, of course.)
|
|
|
|
Note that the sequences \A, \Z, and \z can be used to match the start
|
|
and end of the subject in both modes, and if all branches of a pattern
|
|
start with \A it is always anchored, whether or not PCRE2_MULTILINE is
|
|
set.
|
|
|
|
|
|
FULL STOP (PERIOD, DOT) AND \N
|
|
|
|
Outside a character class, a dot in the pattern matches any one charac-
|
|
ter in the subject string except (by default) a character that signi-
|
|
fies the end of a line. One or more characters may be specified as line
|
|
terminators (see "Newline conventions" above).
|
|
|
|
Dot never matches a single line-ending character. When the two-charac-
|
|
ter sequence CRLF is the only line ending, dot does not match CR if it
|
|
is immediately followed by LF, but otherwise it matches all characters
|
|
(including isolated CRs and LFs). When ANYCRLF is selected for line
|
|
endings, no occurences of CR of LF match dot. When all Unicode line
|
|
endings are being recognized, dot does not match CR or LF or any of the
|
|
other line ending characters.
|
|
|
|
The behaviour of dot with regard to newlines can be changed. If the
|
|
PCRE2_DOTALL option is set, a dot matches any one character, without
|
|
exception. If the two-character sequence CRLF is present in the sub-
|
|
ject string, it takes two dots to match it.
|
|
|
|
The handling of dot is entirely independent of the handling of circum-
|
|
flex and dollar, the only relationship being that they both involve
|
|
newlines. Dot has no special meaning in a character class.
|
|
|
|
The escape sequence \N when not followed by an opening brace behaves
|
|
like a dot, except that it is not affected by the PCRE2_DOTALL option.
|
|
In other words, it matches any character except one that signifies the
|
|
end of a line.
|
|
|
|
When \N is followed by an opening brace it has a different meaning. See
|
|
the section entitled "Non-printing characters" above for details. Perl
|
|
also uses \N{name} to specify characters by Unicode name; PCRE2 does
|
|
not support this.
|
|
|
|
|
|
MATCHING A SINGLE CODE UNIT
|
|
|
|
Outside a character class, the escape sequence \C matches any one code
|
|
unit, whether or not a UTF mode is set. In the 8-bit library, one code
|
|
unit is one byte; in the 16-bit library it is a 16-bit unit; in the
|
|
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
|
|
line-ending characters. The feature is provided in Perl in order to
|
|
match individual bytes in UTF-8 mode, but it is unclear how it can use-
|
|
fully be used.
|
|
|
|
Because \C breaks up characters into individual code units, matching
|
|
one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
|
|
string may start with a malformed UTF character. This has undefined re-
|
|
sults, because PCRE2 assumes that it is matching character by character
|
|
in a valid UTF string (by default it checks the subject string's valid-
|
|
ity at the start of processing unless the PCRE2_NO_UTF_CHECK or
|
|
PCRE2_MATCH_INVALID_UTF option is used).
|
|
|
|
An application can lock out the use of \C by setting the
|
|
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
|
|
possible to build PCRE2 with the use of \C permanently disabled.
|
|
|
|
PCRE2 does not allow \C to appear in lookbehind assertions (described
|
|
below) in UTF-8 or UTF-16 modes, because this would make it impossible
|
|
to calculate the length of the lookbehind. Neither the alternative
|
|
matching function pcre2_dfa_match() nor the JIT optimizer support \C in
|
|
these UTF modes. The former gives a match-time error; the latter fails
|
|
to optimize and so the match is always run using the interpreter.
|
|
|
|
In the 32-bit library, however, \C is always supported (when not ex-
|
|
plicitly locked out) because it always matches a single code unit,
|
|
whether or not UTF-32 is specified.
|
|
|
|
In general, the \C escape sequence is best avoided. However, one way of
|
|
using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
|
|
ters is to use a lookahead to check the length of the next character,
|
|
as in this pattern, which could be used with a UTF-8 string (ignore
|
|
white space and line breaks):
|
|
|
|
(?| (?=[\x00-\x7f])(\C) |
|
|
(?=[\x80-\x{7ff}])(\C)(\C) |
|
|
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
|
|
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
|
|
|
|
In this example, a group that starts with (?| resets the capturing
|
|
parentheses numbers in each alternative (see "Duplicate Group Numbers"
|
|
below). The assertions at the start of each branch check the next UTF-8
|
|
character for values whose encoding uses 1, 2, 3, or 4 bytes, respec-
|
|
tively. The character's individual bytes are then captured by the ap-
|
|
propriate number of \C groups.
|
|
|
|
|
|
SQUARE BRACKETS AND CHARACTER CLASSES
|
|
|
|
An opening square bracket introduces a character class, terminated by a
|
|
closing square bracket. A closing square bracket on its own is not spe-
|
|
cial by default. If a closing square bracket is required as a member
|
|
of the class, it should be the first data character in the class (after
|
|
an initial circumflex, if present) or escaped with a backslash. This
|
|
means that, by default, an empty class cannot be defined. However, if
|
|
the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
|
|
the start does end the (empty) class.
|
|
|
|
A character class matches a single character in the subject. A matched
|
|
character must be in the set of characters defined by the class, unless
|
|
the first character in the class definition is a circumflex, in which
|
|
case the subject character must not be in the set defined by the class.
|
|
If a circumflex is actually required as a member of the class, ensure
|
|
it is not the first character, or escape it with a backslash.
|
|
|
|
For example, the character class [aeiou] matches any lower case vowel,
|
|
while [^aeiou] matches any character that is not a lower case vowel.
|
|
Note that a circumflex is just a convenient notation for specifying the
|
|
characters that are in the class by enumerating those that are not. A
|
|
class that starts with a circumflex is not an assertion; it still con-
|
|
sumes a character from the subject string, and therefore it fails if
|
|
the current pointer is at the end of the string.
|
|
|
|
Characters in a class may be specified by their code points using \o,
|
|
\x, or \N{U+hh..} in the usual way. When caseless matching is set, any
|
|
letters in a class represent both their upper case and lower case ver-
|
|
sions, so for example, a caseless [aeiou] matches "A" as well as "a",
|
|
and a caseless [^aeiou] does not match "A", whereas a caseful version
|
|
would. Note that there are two ASCII characters, K and S, that, in ad-
|
|
dition to their lower case ASCII equivalents, are case-equivalent with
|
|
Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
|
|
ther PCRE2_UTF or PCRE2_UCP is set.
|
|
|
|
Characters that might indicate line breaks are never treated in any
|
|
special way when matching character classes, whatever line-ending se-
|
|
quence is in use, and whatever setting of the PCRE2_DOTALL and
|
|
PCRE2_MULTILINE options is used. A class such as [^a] always matches
|
|
one of these characters.
|
|
|
|
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
|
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
|
characters that they match to the class. For example, [\dABCDEF]
|
|
matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
|
|
fects the meanings of \d, \s, \w and their upper case partners, just as
|
|
it does when they appear outside a character class, as described in the
|
|
section entitled "Generic character types" above. The escape sequence
|
|
\b has a different meaning inside a character class; it matches the
|
|
backspace character. The sequences \B, \R, and \X are not special in-
|
|
side a character class. Like any other unrecognized escape sequences,
|
|
they cause an error. The same is true for \N when not followed by an
|
|
opening brace.
|
|
|
|
The minus (hyphen) character can be used to specify a range of charac-
|
|
ters in a character class. For example, [d-m] matches any letter be-
|
|
tween d and m, inclusive. If a minus character is required in a class,
|
|
it must be escaped with a backslash or appear in a position where it
|
|
cannot be interpreted as indicating a range, typically as the first or
|
|
last character in the class, or immediately after a range. For example,
|
|
[b-d-z] matches letters in the range b to d, a hyphen character, or z.
|
|
|
|
Perl treats a hyphen as a literal if it appears before or after a POSIX
|
|
class (see below) or before or after a character type escape such as as
|
|
\d or \H. However, unless the hyphen is the last character in the
|
|
class, Perl outputs a warning in its warning mode, as this is most
|
|
likely a user error. As PCRE2 has no facility for warning, an error is
|
|
given in these cases.
|
|
|
|
It is not possible to have the literal character "]" as the end charac-
|
|
ter of a range. A pattern such as [W-]46] is interpreted as a class of
|
|
two characters ("W" and "-") followed by a literal string "46]", so it
|
|
would match "W46]" or "-46]". However, if the "]" is escaped with a
|
|
backslash it is interpreted as the end of range, so [W-\]46] is inter-
|
|
preted as a class containing a range followed by two other characters.
|
|
The octal or hexadecimal representation of "]" can also be used to end
|
|
a range.
|
|
|
|
Ranges normally include all code points between the start and end char-
|
|
acters, inclusive. They can also be used for code points specified nu-
|
|
merically, for example [\000-\037]. Ranges can include any characters
|
|
that are valid for the current mode. In any UTF mode, the so-called
|
|
"surrogate" characters (those whose code points lie between 0xd800 and
|
|
0xdfff inclusive) may not be specified explicitly by default (the
|
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
|
|
ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
|
|
are always permitted.
|
|
|
|
There is a special case in EBCDIC environments for ranges whose end
|
|
points are both specified as literal letters in the same case. For com-
|
|
patibility with Perl, EBCDIC code points within the range that are not
|
|
letters are omitted. For example, [h-k] matches only four characters,
|
|
even though the codes for h and k are 0x88 and 0x92, a range of 11 code
|
|
points. However, if the range is specified numerically, for example,
|
|
[\x88-\x92] or [h-\x92], all code points are included.
|
|
|
|
If a range that includes letters is used when caseless matching is set,
|
|
it matches the letters in either case. For example, [W-c] is equivalent
|
|
to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
|
|
character tables for a French locale are in use, [\xc8-\xcb] matches
|
|
accented E characters in both cases.
|
|
|
|
A circumflex can conveniently be used with the upper case character
|
|
types to specify a more restricted set of characters than the matching
|
|
lower case type. For example, the class [^\W_] matches any letter or
|
|
digit, but not underscore, whereas [\w] includes underscore. A positive
|
|
character class should be read as "something OR something OR ..." and a
|
|
negative class as "NOT something AND NOT something AND NOT ...".
|
|
|
|
The only metacharacters that are recognized in character classes are
|
|
backslash, hyphen (only where it can be interpreted as specifying a
|
|
range), circumflex (only at the start), opening square bracket (only
|
|
when it can be interpreted as introducing a POSIX class name, or for a
|
|
special compatibility feature - see the next two sections), and the
|
|
terminating closing square bracket. However, escaping other non-al-
|
|
phanumeric characters does no harm.
|
|
|
|
|
|
POSIX CHARACTER CLASSES
|
|
|
|
Perl supports the POSIX notation for character classes. This uses names
|
|
enclosed by [: and :] within the enclosing square brackets. PCRE2 also
|
|
supports this notation. For example,
|
|
|
|
[01[:alpha:]%]
|
|
|
|
matches "0", "1", any alphabetic character, or "%". The supported class
|
|
names are:
|
|
|
|
alnum letters and digits
|
|
alpha letters
|
|
ascii character codes 0 - 127
|
|
blank space or tab only
|
|
cntrl control characters
|
|
digit decimal digits (same as \d)
|
|
graph printing characters, excluding space
|
|
lower lower case letters
|
|
print printing characters, including space
|
|
punct printing characters, excluding letters and digits and space
|
|
space white space (the same as \s from PCRE2 8.34)
|
|
upper upper case letters
|
|
word "word" characters (same as \w)
|
|
xdigit hexadecimal digits
|
|
|
|
The default "space" characters are HT (9), LF (10), VT (11), FF (12),
|
|
CR (13), and space (32). If locale-specific matching is taking place,
|
|
the list of space characters may be different; there may be fewer or
|
|
more of them. "Space" and \s match the same set of characters.
|
|
|
|
The name "word" is a Perl extension, and "blank" is a GNU extension
|
|
from Perl 5.8. Another Perl extension is negation, which is indicated
|
|
by a ^ character after the colon. For example,
|
|
|
|
[12[:^digit:]]
|
|
|
|
matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
|
|
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
|
|
these are not supported, and an error is given if they are encountered.
|
|
|
|
By default, characters with values greater than 127 do not match any of
|
|
the POSIX character classes, although this may be different for charac-
|
|
ters in the range 128-255 when locale-specific matching is happening.
|
|
However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
|
|
the classes are changed so that Unicode character properties are used.
|
|
This is achieved by replacing certain POSIX classes with other se-
|
|
quences, as follows:
|
|
|
|
[:alnum:] becomes \p{Xan}
|
|
[:alpha:] becomes \p{L}
|
|
[:blank:] becomes \h
|
|
[:cntrl:] becomes \p{Cc}
|
|
[:digit:] becomes \p{Nd}
|
|
[:lower:] becomes \p{Ll}
|
|
[:space:] becomes \p{Xps}
|
|
[:upper:] becomes \p{Lu}
|
|
[:word:] becomes \p{Xwd}
|
|
|
|
Negated versions, such as [:^alpha:] use \P instead of \p. Three other
|
|
POSIX classes are handled specially in UCP mode:
|
|
|
|
[:graph:] This matches characters that have glyphs that mark the page
|
|
when printed. In Unicode property terms, it matches all char-
|
|
acters with the L, M, N, P, S, or Cf properties, except for:
|
|
|
|
U+061C Arabic Letter Mark
|
|
U+180E Mongolian Vowel Separator
|
|
U+2066 - U+2069 Various "isolate"s
|
|
|
|
|
|
[:print:] This matches the same characters as [:graph:] plus space
|
|
characters that are not controls, that is, characters with
|
|
the Zs property.
|
|
|
|
[:punct:] This matches all characters that have the Unicode P (punctua-
|
|
tion) property, plus those characters with code points less
|
|
than 256 that have the S (Symbol) property.
|
|
|
|
The other POSIX classes are unchanged, and match only characters with
|
|
code points less than 256.
|
|
|
|
|
|
COMPATIBILITY FEATURE FOR WORD BOUNDARIES
|
|
|
|
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
|
|
ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
|
|
and "end of word". PCRE2 treats these items as follows:
|
|
|
|
[[:<:]] is converted to \b(?=\w)
|
|
[[:>:]] is converted to \b(?<=\w)
|
|
|
|
Only these exact character sequences are recognized. A sequence such as
|
|
[a[:<:]b] provokes error for an unrecognized POSIX class name. This
|
|
support is not compatible with Perl. It is provided to help migrations
|
|
from other environments, and is best not used in any new patterns. Note
|
|
that \b matches at the start and the end of a word (see "Simple asser-
|
|
tions" above), and in a Perl-style pattern the preceding or following
|
|
character normally shows which is wanted, without the need for the as-
|
|
sertions that are used above in order to give exactly the POSIX behav-
|
|
iour.
|
|
|
|
|
|
VERTICAL BAR
|
|
|
|
Vertical bar characters are used to separate alternative patterns. For
|
|
example, the pattern
|
|
|
|
gilbert|sullivan
|
|
|
|
matches either "gilbert" or "sullivan". Any number of alternatives may
|
|
appear, and an empty alternative is permitted (matching the empty
|
|
string). The matching process tries each alternative in turn, from left
|
|
to right, and the first one that succeeds is used. If the alternatives
|
|
are within a group (defined below), "succeeds" means matching the rest
|
|
of the main pattern as well as the alternative in the group.
|
|
|
|
|
|
INTERNAL OPTION SETTING
|
|
|
|
The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
|
|
PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
|
|
can be changed from within the pattern by a sequence of letters en-
|
|
closed between "(?" and ")". These options are Perl-compatible, and
|
|
are described in detail in the pcre2api documentation. The option let-
|
|
ters are:
|
|
|
|
i for PCRE2_CASELESS
|
|
m for PCRE2_MULTILINE
|
|
n for PCRE2_NO_AUTO_CAPTURE
|
|
s for PCRE2_DOTALL
|
|
x for PCRE2_EXTENDED
|
|
xx for PCRE2_EXTENDED_MORE
|
|
|
|
For example, (?im) sets caseless, multiline matching. It is also possi-
|
|
ble to unset these options by preceding the relevant letters with a hy-
|
|
phen, for example (?-im). The two "extended" options are not indepen-
|
|
dent; unsetting either one cancels the effects of both of them.
|
|
|
|
A combined setting and unsetting such as (?im-sx), which sets
|
|
PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
|
|
PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
|
|
options string. If a letter appears both before and after the hyphen,
|
|
the option is unset. An empty options setting "(?)" is allowed. Need-
|
|
less to say, it has no effect.
|
|
|
|
If the first character following (? is a circumflex, it causes all of
|
|
the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
|
|
Letters may follow the circumflex to cause some options to be re-in-
|
|
stated, but a hyphen may not appear.
|
|
|
|
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
|
|
changed in the same way as the Perl-compatible options by using the
|
|
characters J and U respectively. However, these are not unset by (?^).
|
|
|
|
When one of these option changes occurs at top level (that is, not in-
|
|
side group parentheses), the change applies to the remainder of the
|
|
pattern that follows. An option change within a group (see below for a
|
|
description of groups) affects only that part of the group that follows
|
|
it, so
|
|
|
|
(a(?i)b)c
|
|
|
|
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
|
|
not used). By this means, options can be made to have different set-
|
|
tings in different parts of the pattern. Any changes made in one alter-
|
|
native do carry on into subsequent branches within the same group. For
|
|
example,
|
|
|
|
(a(?i)b|c)
|
|
|
|
matches "ab", "aB", "c", and "C", even though when matching "C" the
|
|
first branch is abandoned before the option setting. This is because
|
|
the effects of option settings happen at compile time. There would be
|
|
some very weird behaviour otherwise.
|
|
|
|
As a convenient shorthand, if any option settings are required at the
|
|
start of a non-capturing group (see the next section), the option let-
|
|
ters may appear between the "?" and the ":". Thus the two patterns
|
|
|
|
(?i:saturday|sunday)
|
|
(?:(?i)saturday|sunday)
|
|
|
|
match exactly the same set of strings.
|
|
|
|
Note: There are other PCRE2-specific options, applying to the whole
|
|
pattern, which can be set by the application when the compiling func-
|
|
tion is called. In addition, the pattern can contain special leading
|
|
sequences such as (*CRLF) to override what the application has set or
|
|
what has been defaulted. Details are given in the section entitled
|
|
"Newline sequences" above. There are also the (*UTF) and (*UCP) leading
|
|
sequences that can be used to set UTF and Unicode property modes; they
|
|
are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
|
|
tively. However, the application can set the PCRE2_NEVER_UTF and
|
|
PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and
|
|
(*UCP) sequences.
|
|
|
|
|
|
GROUPS
|
|
|
|
Groups are delimited by parentheses (round brackets), which can be
|
|
nested. Turning part of a pattern into a group does two things:
|
|
|
|
1. It localizes a set of alternatives. For example, the pattern
|
|
|
|
cat(aract|erpillar|)
|
|
|
|
matches "cataract", "caterpillar", or "cat". Without the parentheses,
|
|
it would match "cataract", "erpillar" or an empty string.
|
|
|
|
2. It creates a "capture group". This means that, when the whole pat-
|
|
tern matches, the portion of the subject string that matched the group
|
|
is passed back to the caller, separately from the portion that matched
|
|
the whole pattern. (This applies only to the traditional matching
|
|
function; the DFA matching function does not support capturing.)
|
|
|
|
Opening parentheses are counted from left to right (starting from 1) to
|
|
obtain numbers for capture groups. For example, if the string "the red
|
|
king" is matched against the pattern
|
|
|
|
the ((red|white) (king|queen))
|
|
|
|
the captured substrings are "red king", "red", and "king", and are num-
|
|
bered 1, 2, and 3, respectively.
|
|
|
|
The fact that plain parentheses fulfil two functions is not always
|
|
helpful. There are often times when grouping is required without cap-
|
|
turing. If an opening parenthesis is followed by a question mark and a
|
|
colon, the group does not do any capturing, and is not counted when
|
|
computing the number of any subsequent capture groups. For example, if
|
|
the string "the white queen" is matched against the pattern
|
|
|
|
the ((?:red|white) (king|queen))
|
|
|
|
the captured substrings are "white queen" and "queen", and are numbered
|
|
1 and 2. The maximum number of capture groups is 65535.
|
|
|
|
As a convenient shorthand, if any option settings are required at the
|
|
start of a non-capturing group, the option letters may appear between
|
|
the "?" and the ":". Thus the two patterns
|
|
|
|
(?i:saturday|sunday)
|
|
(?:(?i)saturday|sunday)
|
|
|
|
match exactly the same set of strings. Because alternative branches are
|
|
tried from left to right, and options are not reset until the end of
|
|
the group is reached, an option setting in one branch does affect sub-
|
|
sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
|
|
urday".
|
|
|
|
|
|
DUPLICATE GROUP NUMBERS
|
|
|
|
Perl 5.10 introduced a feature whereby each alternative in a group uses
|
|
the same numbers for its capturing parentheses. Such a group starts
|
|
with (?| and is itself a non-capturing group. For example, consider
|
|
this pattern:
|
|
|
|
(?|(Sat)ur|(Sun))day
|
|
|
|
Because the two alternatives are inside a (?| group, both sets of cap-
|
|
turing parentheses are numbered one. Thus, when the pattern matches,
|
|
you can look at captured substring number one, whichever alternative
|
|
matched. This construct is useful when you want to capture part, but
|
|
not all, of one of a number of alternatives. Inside a (?| group, paren-
|
|
theses are numbered as usual, but the number is reset at the start of
|
|
each branch. The numbers of any capturing parentheses that follow the
|
|
whole group start after the highest number used in any branch. The fol-
|
|
lowing example is taken from the Perl documentation. The numbers under-
|
|
neath show in which buffer the captured content will be stored.
|
|
|
|
# before ---------------branch-reset----------- after
|
|
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
|
|
# 1 2 2 3 2 3 4
|
|
|
|
A backreference to a capture group uses the most recent value that is
|
|
set for the group. The following pattern matches "abcabc" or "defdef":
|
|
|
|
/(?|(abc)|(def))\1/
|
|
|
|
In contrast, a subroutine call to a capture group always refers to the
|
|
first one in the pattern with the given number. The following pattern
|
|
matches "abcabc" or "defabc":
|
|
|
|
/(?|(abc)|(def))(?1)/
|
|
|
|
A relative reference such as (?-1) is no different: it is just a conve-
|
|
nient way of computing an absolute group number.
|
|
|
|
If a condition test for a group's having matched refers to a non-unique
|
|
number, the test is true if any group with that number has matched.
|
|
|
|
An alternative approach to using this "branch reset" feature is to use
|
|
duplicate named groups, as described in the next section.
|
|
|
|
|
|
NAMED CAPTURE GROUPS
|
|
|
|
Identifying capture groups by number is simple, but it can be very hard
|
|
to keep track of the numbers in complicated patterns. Furthermore, if
|
|
an expression is modified, the numbers may change. To help with this
|
|
difficulty, PCRE2 supports the naming of capture groups. This feature
|
|
was not added to Perl until release 5.10. Python had the feature ear-
|
|
lier, and PCRE1 introduced it at release 4.0, using the Python syntax.
|
|
PCRE2 supports both the Perl and the Python syntax.
|
|
|
|
In PCRE2, a capture group can be named in one of three ways:
|
|
(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
|
|
Names may be up to 32 code units long. When PCRE2_UTF is not set, they
|
|
may contain only ASCII alphanumeric characters and underscores, but
|
|
must start with a non-digit. When PCRE2_UTF is set, the syntax of group
|
|
names is extended to allow any Unicode letter or Unicode decimal digit.
|
|
In other words, group names must match one of these patterns:
|
|
|
|
^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
|
|
^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
|
|
|
|
References to capture groups from other parts of the pattern, such as
|
|
backreferences, recursion, and conditions, can all be made by name as
|
|
well as by number.
|
|
|
|
Named capture groups are allocated numbers as well as names, exactly as
|
|
if the names were not present. In both PCRE2 and Perl, capture groups
|
|
are primarily identified by numbers; any names are just aliases for
|
|
these numbers. The PCRE2 API provides function calls for extracting the
|
|
complete name-to-number translation table from a compiled pattern, as
|
|
well as convenience functions for extracting captured substrings by
|
|
name.
|
|
|
|
Warning: When more than one capture group has the same number, as de-
|
|
scribed in the previous section, a name given to one of them applies to
|
|
all of them. Perl allows identically numbered groups to have different
|
|
names. Consider this pattern, where there are two capture groups, both
|
|
numbered 1:
|
|
|
|
(?|(?<AA>aa)|(?<BB>bb))
|
|
|
|
Perl allows this, with both names AA and BB as aliases of group 1.
|
|
Thus, after a successful match, both names yield the same value (either
|
|
"aa" or "bb").
|
|
|
|
In an attempt to reduce confusion, PCRE2 does not allow the same group
|
|
number to be associated with more than one name. The example above pro-
|
|
vokes a compile-time error. However, there is still scope for confu-
|
|
sion. Consider this pattern:
|
|
|
|
(?|(?<AA>aa)|(bb))
|
|
|
|
Although the second group number 1 is not explicitly named, the name AA
|
|
is still an alias for any group 1. Whether the pattern matches "aa" or
|
|
"bb", a reference by name to group AA yields the matched string.
|
|
|
|
By default, a name must be unique within a pattern, except that dupli-
|
|
cate names are permitted for groups with the same number, for example:
|
|
|
|
(?|(?<AA>aa)|(?<AA>bb))
|
|
|
|
The duplicate name constraint can be disabled by setting the PCRE2_DUP-
|
|
NAMES option at compile time, or by the use of (?J) within the pattern,
|
|
as described in the section entitled "Internal Option Setting" above.
|
|
|
|
Duplicate names can be useful for patterns where only one instance of
|
|
the named capture group can match. Suppose you want to match the name
|
|
of a weekday, either as a 3-letter abbreviation or as the full name,
|
|
and in both cases you want to extract the abbreviation. This pattern
|
|
(ignoring the line breaks) does the job:
|
|
|
|
(?J)
|
|
(?<DN>Mon|Fri|Sun)(?:day)?|
|
|
(?<DN>Tue)(?:sday)?|
|
|
(?<DN>Wed)(?:nesday)?|
|
|
(?<DN>Thu)(?:rsday)?|
|
|
(?<DN>Sat)(?:urday)?
|
|
|
|
There are five capture groups, but only one is ever set after a match.
|
|
The convenience functions for extracting the data by name returns the
|
|
substring for the first (and in this example, the only) group of that
|
|
name that matched. This saves searching to find which numbered group it
|
|
was. (An alternative way of solving this problem is to use a "branch
|
|
reset" group, as described in the previous section.)
|
|
|
|
If you make a backreference to a non-unique named group from elsewhere
|
|
in the pattern, the groups to which the name refers are checked in the
|
|
order in which they appear in the overall pattern. The first one that
|
|
is set is used for the reference. For example, this pattern matches
|
|
both "foofoo" and "barbar" but not "foobar" or "barfoo":
|
|
|
|
(?J)(?:(?<n>foo)|(?<n>bar))\k<n>
|
|
|
|
|
|
If you make a subroutine call to a non-unique named group, the one that
|
|
corresponds to the first occurrence of the name is used. In the absence
|
|
of duplicate numbers this is the one with the lowest number.
|
|
|
|
If you use a named reference in a condition test (see the section about
|
|
conditions below), either to check whether a capture group has matched,
|
|
or to check for recursion, all groups with the same name are tested. If
|
|
the condition is true for any one of them, the overall condition is
|
|
true. This is the same behaviour as testing by number. For further de-
|
|
tails of the interfaces for handling named capture groups, see the
|
|
pcre2api documentation.
|
|
|
|
|
|
REPETITION
|
|
|
|
Repetition is specified by quantifiers, which can follow any of the
|
|
following items:
|
|
|
|
a literal data character
|
|
the dot metacharacter
|
|
the \C escape sequence
|
|
the \R escape sequence
|
|
the \X escape sequence
|
|
an escape such as \d or \pL that matches a single character
|
|
a character class
|
|
a backreference
|
|
a parenthesized group (including lookaround assertions)
|
|
a subroutine call (recursive or otherwise)
|
|
|
|
The general repetition quantifier specifies a minimum and maximum num-
|
|
ber of permitted matches, by giving the two numbers in curly brackets
|
|
(braces), separated by a comma. The numbers must be less than 65536,
|
|
and the first must be less than or equal to the second. For example,
|
|
|
|
z{2,4}
|
|
|
|
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
|
|
special character. If the second number is omitted, but the comma is
|
|
present, there is no upper limit; if the second number and the comma
|
|
are both omitted, the quantifier specifies an exact number of required
|
|
matches. Thus
|
|
|
|
[aeiou]{3,}
|
|
|
|
matches at least 3 successive vowels, but may match many more, whereas
|
|
|
|
\d{8}
|
|
|
|
matches exactly 8 digits. An opening curly bracket that appears in a
|
|
position where a quantifier is not allowed, or one that does not match
|
|
the syntax of a quantifier, is taken as a literal character. For exam-
|
|
ple, {,6} is not a quantifier, but a literal string of four characters.
|
|
|
|
In UTF modes, quantifiers apply to characters rather than to individual
|
|
code units. Thus, for example, \x{100}{2} matches two characters, each
|
|
of which is represented by a two-byte sequence in a UTF-8 string. Simi-
|
|
larly, \X{3} matches three Unicode extended grapheme clusters, each of
|
|
which may be several code units long (and they may be of different
|
|
lengths).
|
|
|
|
The quantifier {0} is permitted, causing the expression to behave as if
|
|
the previous item and the quantifier were not present. This may be use-
|
|
ful for capture groups that are referenced as subroutines from else-
|
|
where in the pattern (but see also the section entitled "Defining cap-
|
|
ture groups for use by reference only" below). Except for parenthesized
|
|
groups, items that have a {0} quantifier are omitted from the compiled
|
|
pattern.
|
|
|
|
For convenience, the three most common quantifiers have single-charac-
|
|
ter abbreviations:
|
|
|
|
* is equivalent to {0,}
|
|
+ is equivalent to {1,}
|
|
? is equivalent to {0,1}
|
|
|
|
It is possible to construct infinite loops by following a group that
|
|
can match no characters with a quantifier that has no upper limit, for
|
|
example:
|
|
|
|
(a?)*
|
|
|
|
Earlier versions of Perl and PCRE1 used to give an error at compile
|
|
time for such patterns. However, because there are cases where this can
|
|
be useful, such patterns are now accepted, but whenever an iteration of
|
|
such a group matches no characters, matching moves on to the next item
|
|
in the pattern instead of repeatedly matching an empty string. This
|
|
does not prevent backtracking into any of the iterations if a subse-
|
|
quent item fails to match.
|
|
|
|
By default, quantifiers are "greedy", that is, they match as much as
|
|
possible (up to the maximum number of permitted times), without causing
|
|
the rest of the pattern to fail. The classic example of where this
|
|
gives problems is in trying to match comments in C programs. These ap-
|
|
pear between /* and */ and within the comment, individual * and / char-
|
|
acters may appear. An attempt to match C comments by applying the pat-
|
|
tern
|
|
|
|
/\*.*\*/
|
|
|
|
to the string
|
|
|
|
/* first comment */ not comment /* second comment */
|
|
|
|
fails, because it matches the entire string owing to the greediness of
|
|
the .* item. However, if a quantifier is followed by a question mark,
|
|
it ceases to be greedy, and instead matches the minimum number of times
|
|
possible, so the pattern
|
|
|
|
/\*.*?\*/
|
|
|
|
does the right thing with the C comments. The meaning of the various
|
|
quantifiers is not otherwise changed, just the preferred number of
|
|
matches. Do not confuse this use of question mark with its use as a
|
|
quantifier in its own right. Because it has two uses, it can sometimes
|
|
appear doubled, as in
|
|
|
|
\d??\d
|
|
|
|
which matches one digit by preference, but can match two if that is the
|
|
only way the rest of the pattern matches.
|
|
|
|
If the PCRE2_UNGREEDY option is set (an option that is not available in
|
|
Perl), the quantifiers are not greedy by default, but individual ones
|
|
can be made greedy by following them with a question mark. In other
|
|
words, it inverts the default behaviour.
|
|
|
|
When a parenthesized group is quantified with a minimum repeat count
|
|
that is greater than 1 or with a limited maximum, more memory is re-
|
|
quired for the compiled pattern, in proportion to the size of the mini-
|
|
mum or maximum.
|
|
|
|
If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
|
|
(equivalent to Perl's /s) is set, thus allowing the dot to match new-
|
|
lines, the pattern is implicitly anchored, because whatever follows
|
|
will be tried against every character position in the subject string,
|
|
so there is no point in retrying the overall match at any position af-
|
|
ter the first. PCRE2 normally treats such a pattern as though it were
|
|
preceded by \A.
|
|
|
|
In cases where it is known that the subject string contains no new-
|
|
lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
|
|
mization, or alternatively, using ^ to indicate anchoring explicitly.
|
|
|
|
However, there are some cases where the optimization cannot be used.
|
|
When .* is inside capturing parentheses that are the subject of a
|
|
backreference elsewhere in the pattern, a match at the start may fail
|
|
where a later one succeeds. Consider, for example:
|
|
|
|
(.*)abc\1
|
|
|
|
If the subject is "xyz123abc123" the match point is the fourth charac-
|
|
ter. For this reason, such a pattern is not implicitly anchored.
|
|
|
|
Another case where implicit anchoring is not applied is when the lead-
|
|
ing .* is inside an atomic group. Once again, a match at the start may
|
|
fail where a later one succeeds. Consider this pattern:
|
|
|
|
(?>.*?a)b
|
|
|
|
It matches "ab" in the subject "aab". The use of the backtracking con-
|
|
trol verbs (*PRUNE) and (*SKIP) also disable this optimization, and
|
|
there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
|
|
|
|
When a capture group is repeated, the value captured is the substring
|
|
that matched the final iteration. For example, after
|
|
|
|
(tweedle[dume]{3}\s*)+
|
|
|
|
has matched "tweedledum tweedledee" the value of the captured substring
|
|
is "tweedledee". However, if there are nested capture groups, the cor-
|
|
responding captured values may have been set in previous iterations.
|
|
For example, after
|
|
|
|
(a|(b))+
|
|
|
|
matches "aba" the value of the second captured substring is "b".
|
|
|
|
|
|
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
|
|
|
|
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
|
|
repetition, failure of what follows normally causes the repeated item
|
|
to be re-evaluated to see if a different number of repeats allows the
|
|
rest of the pattern to match. Sometimes it is useful to prevent this,
|
|
either to change the nature of the match, or to cause it fail earlier
|
|
than it otherwise might, when the author of the pattern knows there is
|
|
no point in carrying on.
|
|
|
|
Consider, for example, the pattern \d+foo when applied to the subject
|
|
line
|
|
|
|
123456bar
|
|
|
|
After matching all 6 digits and then failing to match "foo", the normal
|
|
action of the matcher is to try again with only 5 digits matching the
|
|
\d+ item, and then with 4, and so on, before ultimately failing.
|
|
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
|
|
the means for specifying that once a group has matched, it is not to be
|
|
re-evaluated in this way.
|
|
|
|
If we use atomic grouping for the previous example, the matcher gives
|
|
up immediately on failing to match "foo" the first time. The notation
|
|
is a kind of special parenthesis, starting with (?> as in this example:
|
|
|
|
(?>\d+)foo
|
|
|
|
Perl 5.28 introduced an experimental alphabetic form starting with (*
|
|
which may be easier to remember:
|
|
|
|
(*atomic:\d+)foo
|
|
|
|
This kind of parenthesized group "locks up" the part of the pattern it
|
|
contains once it has matched, and a failure further into the pattern is
|
|
prevented from backtracking into it. Backtracking past it to previous
|
|
items, however, works as normal.
|
|
|
|
An alternative description is that a group of this type matches exactly
|
|
the string of characters that an identical standalone pattern would
|
|
match, if anchored at the current point in the subject string.
|
|
|
|
Atomic groups are not capture groups. Simple cases such as the above
|
|
example can be thought of as a maximizing repeat that must swallow ev-
|
|
erything it can. So, while both \d+ and \d+? are prepared to adjust
|
|
the number of digits they match in order to make the rest of the pat-
|
|
tern match, (?>\d+) can only match an entire sequence of digits.
|
|
|
|
Atomic groups in general can of course contain arbitrarily complicated
|
|
expressions, and can be nested. However, when the contents of an atomic
|
|
group is just a single repeated item, as in the example above, a sim-
|
|
pler notation, called a "possessive quantifier" can be used. This con-
|
|
sists of an additional + character following a quantifier. Using this
|
|
notation, the previous example can be rewritten as
|
|
|
|
\d++foo
|
|
|
|
Note that a possessive quantifier can be used with an entire group, for
|
|
example:
|
|
|
|
(abc|xyz){2,3}+
|
|
|
|
Possessive quantifiers are always greedy; the setting of the PCRE2_UN-
|
|
GREEDY option is ignored. They are a convenient notation for the sim-
|
|
pler forms of atomic group. However, there is no difference in the
|
|
meaning of a possessive quantifier and the equivalent atomic group,
|
|
though there may be a performance difference; possessive quantifiers
|
|
should be slightly faster.
|
|
|
|
The possessive quantifier syntax is an extension to the Perl 5.8 syn-
|
|
tax. Jeffrey Friedl originated the idea (and the name) in the first
|
|
edition of his book. Mike McCloskey liked it, so implemented it when he
|
|
built Sun's Java package, and PCRE1 copied it from there. It found its
|
|
way into Perl at release 5.10.
|
|
|
|
PCRE2 has an optimization that automatically "possessifies" certain
|
|
simple pattern constructs. For example, the sequence A+B is treated as
|
|
A++B because there is no point in backtracking into a sequence of A's
|
|
when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
|
|
POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
|
|
|
|
When a pattern contains an unlimited repeat inside a group that can it-
|
|
self be repeated an unlimited number of times, the use of an atomic
|
|
group is the only way to avoid some failing matches taking a very long
|
|
time indeed. The pattern
|
|
|
|
(\D+|<\d+>)*[!?]
|
|
|
|
matches an unlimited number of substrings that either consist of non-
|
|
digits, or digits enclosed in <>, followed by either ! or ?. When it
|
|
matches, it runs quickly. However, if it is applied to
|
|
|
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
|
|
|
|
it takes a long time before reporting failure. This is because the
|
|
string can be divided between the internal \D+ repeat and the external
|
|
* repeat in a large number of ways, and all have to be tried. (The ex-
|
|
ample uses [!?] rather than a single character at the end, because both
|
|
PCRE2 and Perl have an optimization that allows for fast failure when a
|
|
single character is used. They remember the last single character that
|
|
is required for a match, and fail early if it is not present in the
|
|
string.) If the pattern is changed so that it uses an atomic group,
|
|
like this:
|
|
|
|
((?>\D+)|<\d+>)*[!?]
|
|
|
|
sequences of non-digits cannot be broken, and failure happens quickly.
|
|
|
|
|
|
BACKREFERENCES
|
|
|
|
Outside a character class, a backslash followed by a digit greater than
|
|
0 (and possibly further digits) is a backreference to a capture group
|
|
earlier (that is, to its left) in the pattern, provided there have been
|
|
that many previous capture groups.
|
|
|
|
However, if the decimal number following the backslash is less than 8,
|
|
it is always taken as a backreference, and causes an error only if
|
|
there are not that many capture groups in the entire pattern. In other
|
|
words, the group that is referenced need not be to the left of the ref-
|
|
erence for numbers less than 8. A "forward backreference" of this type
|
|
can make sense when a repetition is involved and the group to the right
|
|
has participated in an earlier iteration.
|
|
|
|
It is not possible to have a numerical "forward backreference" to a
|
|
group whose number is 8 or more using this syntax because a sequence
|
|
such as \50 is interpreted as a character defined in octal. See the
|
|
subsection entitled "Non-printing characters" above for further details
|
|
of the handling of digits following a backslash. Other forms of back-
|
|
referencing do not suffer from this restriction. In particular, there
|
|
is no problem when named capture groups are used (see below).
|
|
|
|
Another way of avoiding the ambiguity inherent in the use of digits
|
|
following a backslash is to use the \g escape sequence. This escape
|
|
must be followed by a signed or unsigned number, optionally enclosed in
|
|
braces. These examples are all identical:
|
|
|
|
(ring), \1
|
|
(ring), \g1
|
|
(ring), \g{1}
|
|
|
|
An unsigned number specifies an absolute reference without the ambigu-
|
|
ity that is present in the older syntax. It is also useful when literal
|
|
digits follow the reference. A signed number is a relative reference.
|
|
Consider this example:
|
|
|
|
(abc(def)ghi)\g{-1}
|
|
|
|
The sequence \g{-1} is a reference to the most recently started capture
|
|
group before \g, that is, is it equivalent to \2 in this example. Simi-
|
|
larly, \g{-2} would be equivalent to \1. The use of relative references
|
|
can be helpful in long patterns, and also in patterns that are created
|
|
by joining together fragments that contain references within them-
|
|
selves.
|
|
|
|
The sequence \g{+1} is a reference to the next capture group. This kind
|
|
of forward reference can be useful in patterns that repeat. Perl does
|
|
not support the use of + in this way.
|
|
|
|
A backreference matches whatever actually most recently matched the
|
|
capture group in the current subject string, rather than anything at
|
|
all that matches the group (see "Groups as subroutines" below for a way
|
|
of doing that). So the pattern
|
|
|
|
(sens|respons)e and \1ibility
|
|
|
|
matches "sense and sensibility" and "response and responsibility", but
|
|
not "sense and responsibility". If caseful matching is in force at the
|
|
time of the backreference, the case of letters is relevant. For exam-
|
|
ple,
|
|
|
|
((?i)rah)\s+\1
|
|
|
|
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
|
|
original capture group is matched caselessly.
|
|
|
|
There are several different ways of writing backreferences to named
|
|
capture groups. The .NET syntax \k{name} and the Perl syntax \k<name>
|
|
or \k'name' are supported, as is the Python syntax (?P=name). Perl
|
|
5.10's unified backreference syntax, in which \g can be used for both
|
|
numeric and named references, is also supported. We could rewrite the
|
|
above example in any of the following ways:
|
|
|
|
(?<p1>(?i)rah)\s+\k<p1>
|
|
(?'p1'(?i)rah)\s+\k{p1}
|
|
(?P<p1>(?i)rah)\s+(?P=p1)
|
|
(?<p1>(?i)rah)\s+\g{p1}
|
|
|
|
A capture group that is referenced by name may appear in the pattern
|
|
before or after the reference.
|
|
|
|
There may be more than one backreference to the same group. If a group
|
|
has not actually been used in a particular match, backreferences to it
|
|
always fail by default. For example, the pattern
|
|
|
|
(a|(bc))\2
|
|
|
|
always fails if it starts to match "a" rather than "bc". However, if
|
|
the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
|
|
erence to an unset value matches an empty string.
|
|
|
|
Because there may be many capture groups in a pattern, all digits fol-
|
|
lowing a backslash are taken as part of a potential backreference num-
|
|
ber. If the pattern continues with a digit character, some delimiter
|
|
must be used to terminate the backreference. If the PCRE2_EXTENDED or
|
|
PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
|
|
the \g{} syntax or an empty comment (see "Comments" below) can be used.
|
|
|
|
Recursive backreferences
|
|
|
|
A backreference that occurs inside the group to which it refers fails
|
|
when the group is first used, so, for example, (a\1) never matches.
|
|
However, such references can be useful inside repeated groups. For ex-
|
|
ample, the pattern
|
|
|
|
(a|b\1)+
|
|
|
|
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
|
|
ation of the group, the backreference matches the character string cor-
|
|
responding to the previous iteration. In order for this to work, the
|
|
pattern must be such that the first iteration does not need to match
|
|
the backreference. This can be done using alternation, as in the exam-
|
|
ple above, or by a quantifier with a minimum of zero.
|
|
|
|
For versions of PCRE2 less than 10.25, backreferences of this type used
|
|
to cause the group that they reference to be treated as an atomic
|
|
group. This restriction no longer applies, and backtracking into such
|
|
groups can occur as normal.
|
|
|
|
|
|
ASSERTIONS
|
|
|
|
An assertion is a test on the characters following or preceding the
|
|
current matching point that does not consume any characters. The simple
|
|
assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
|
|
above.
|
|
|
|
More complicated assertions are coded as parenthesized groups. There
|
|
are two kinds: those that look ahead of the current position in the
|
|
subject string, and those that look behind it, and in each case an as-
|
|
sertion may be positive (must match for the assertion to be true) or
|
|
negative (must not match for the assertion to be true). An assertion
|
|
group is matched in the normal way, and if it is true, matching contin-
|
|
ues after it, but with the matching position in the subject string re-
|
|
set to what it was before the assertion was processed.
|
|
|
|
The Perl-compatible lookaround assertions are atomic. If an assertion
|
|
is true, but there is a subsequent matching failure, there is no back-
|
|
tracking into the assertion. However, there are some cases where non-
|
|
atomic assertions can be useful. PCRE2 has some support for these, de-
|
|
scribed in the section entitled "Non-atomic assertions" below, but they
|
|
are not Perl-compatible.
|
|
|
|
A lookaround assertion may appear as the condition in a conditional
|
|
group (see below). In this case, the result of matching the assertion
|
|
determines which branch of the condition is followed.
|
|
|
|
Assertion groups are not capture groups. If an assertion contains cap-
|
|
ture groups within it, these are counted for the purposes of numbering
|
|
the capture groups in the whole pattern. Within each branch of an as-
|
|
sertion, locally captured substrings may be referenced in the usual
|
|
way. For example, a sequence such as (.)\g{-1} can be used to check
|
|
that two adjacent characters are the same.
|
|
|
|
When a branch within an assertion fails to match, any substrings that
|
|
were captured are discarded (as happens with any pattern branch that
|
|
fails to match). A negative assertion is true only when all its
|
|
branches fail to match; this means that no captured substrings are ever
|
|
retained after a successful negative assertion. When an assertion con-
|
|
tains a matching branch, what happens depends on the type of assertion.
|
|
|
|
For a positive assertion, internally captured substrings in the suc-
|
|
cessful branch are retained, and matching continues with the next pat-
|
|
tern item after the assertion. For a negative assertion, a matching
|
|
branch means that the assertion is not true. If such an assertion is
|
|
being used as a condition in a conditional group (see below), captured
|
|
substrings are retained, because matching continues with the "no"
|
|
branch of the condition. For other failing negative assertions, control
|
|
passes to the previous backtracking point, thus discarding any captured
|
|
strings within the assertion.
|
|
|
|
Most assertion groups may be repeated; though it makes no sense to as-
|
|
sert the same thing several times, the side effect of capturing in pos-
|
|
itive assertions may occasionally be useful. However, an assertion that
|
|
forms the condition for a conditional group may not be quantified.
|
|
PCRE2 used to restrict the repetition of assertions, but from release
|
|
10.35 the only restriction is that an unlimited maximum repetition is
|
|
changed to be one more than the minimum. For example, {3,} is treated
|
|
as {3,4}.
|
|
|
|
Alphabetic assertion names
|
|
|
|
Traditionally, symbolic sequences such as (?= and (?<= have been used
|
|
to specify lookaround assertions. Perl 5.28 introduced some experimen-
|
|
tal alphabetic alternatives which might be easier to remember. They all
|
|
start with (* instead of (? and must be written using lower case let-
|
|
ters. PCRE2 supports the following synonyms:
|
|
|
|
(*positive_lookahead: or (*pla: is the same as (?=
|
|
(*negative_lookahead: or (*nla: is the same as (?!
|
|
(*positive_lookbehind: or (*plb: is the same as (?<=
|
|
(*negative_lookbehind: or (*nlb: is the same as (?<!
|
|
|
|
For example, (*pla:foo) is the same assertion as (?=foo). In the fol-
|
|
lowing sections, the various assertions are described using the origi-
|
|
nal symbolic forms.
|
|
|
|
Lookahead assertions
|
|
|
|
Lookahead assertions start with (?= for positive assertions and (?! for
|
|
negative assertions. For example,
|
|
|
|
\w+(?=;)
|
|
|
|
matches a word followed by a semicolon, but does not include the semi-
|
|
colon in the match, and
|
|
|
|
foo(?!bar)
|
|
|
|
matches any occurrence of "foo" that is not followed by "bar". Note
|
|
that the apparently similar pattern
|
|
|
|
(?!foo)bar
|
|
|
|
does not find an occurrence of "bar" that is preceded by something
|
|
other than "foo"; it finds any occurrence of "bar" whatsoever, because
|
|
the assertion (?!foo) is always true when the next three characters are
|
|
"bar". A lookbehind assertion is needed to achieve the other effect.
|
|
|
|
If you want to force a matching failure at some point in a pattern, the
|
|
most convenient way to do it is with (?!) because an empty string al-
|
|
ways matches, so an assertion that requires there not to be an empty
|
|
string must always fail. The backtracking control verb (*FAIL) or (*F)
|
|
is a synonym for (?!).
|
|
|
|
Lookbehind assertions
|
|
|
|
Lookbehind assertions start with (?<= for positive assertions and (?<!
|
|
for negative assertions. For example,
|
|
|
|
(?<!foo)bar
|
|
|
|
does find an occurrence of "bar" that is not preceded by "foo". The
|
|
contents of a lookbehind assertion are restricted such that all the
|
|
strings it matches must have a fixed length. However, if there are sev-
|
|
eral top-level alternatives, they do not all have to have the same
|
|
fixed length. Thus
|
|
|
|
(?<=bullock|donkey)
|
|
|
|
is permitted, but
|
|
|
|
(?<!dogs?|cats?)
|
|
|
|
causes an error at compile time. Branches that match different length
|
|
strings are permitted only at the top level of a lookbehind assertion.
|
|
This is an extension compared with Perl, which requires all branches to
|
|
match the same length of string. An assertion such as
|
|
|
|
(?<=ab(c|de))
|
|
|
|
is not permitted, because its single top-level branch can match two
|
|
different lengths, but it is acceptable to PCRE2 if rewritten to use
|
|
two top-level branches:
|
|
|
|
(?<=abc|abde)
|
|
|
|
In some cases, the escape sequence \K (see above) can be used instead
|
|
of a lookbehind assertion to get round the fixed-length restriction.
|
|
|
|
The implementation of lookbehind assertions is, for each alternative,
|
|
to temporarily move the current position back by the fixed length and
|
|
then try to match. If there are insufficient characters before the cur-
|
|
rent position, the assertion fails.
|
|
|
|
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
|
|
matches a single code unit even in a UTF mode) to appear in lookbehind
|
|
assertions, because it makes it impossible to calculate the length of
|
|
the lookbehind. The \X and \R escapes, which can match different num-
|
|
bers of code units, are never permitted in lookbehinds.
|
|
|
|
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
|
|
lookbehinds, as long as the called capture group matches a fixed-length
|
|
string. However, recursion, that is, a "subroutine" call into a group
|
|
that is already active, is not supported.
|
|
|
|
Perl does not support backreferences in lookbehinds. PCRE2 does support
|
|
them, but only if certain conditions are met. The PCRE2_MATCH_UN-
|
|
SET_BACKREF option must not be set, there must be no use of (?| in the
|
|
pattern (it creates duplicate group numbers), and if the backreference
|
|
is by name, the name must be unique. Of course, the referenced group
|
|
must itself match a fixed length substring. The following pattern
|
|
matches words containing at least two characters that begin and end
|
|
with the same character:
|
|
|
|
\b(\w)\w++(?<=\1)
|
|
|
|
Possessive quantifiers can be used in conjunction with lookbehind as-
|
|
sertions to specify efficient matching of fixed-length strings at the
|
|
end of subject strings. Consider a simple pattern such as
|
|
|
|
abcd$
|
|
|
|
when applied to a long string that does not match. Because matching
|
|
proceeds from left to right, PCRE2 will look for each "a" in the sub-
|
|
ject and then see if what follows matches the rest of the pattern. If
|
|
the pattern is specified as
|
|
|
|
^.*abcd$
|
|
|
|
the initial .* matches the entire string at first, but when this fails
|
|
(because there is no following "a"), it backtracks to match all but the
|
|
last character, then all but the last two characters, and so on. Once
|
|
again the search for "a" covers the entire string, from right to left,
|
|
so we are no better off. However, if the pattern is written as
|
|
|
|
^.*+(?<=abcd)
|
|
|
|
there can be no backtracking for the .*+ item because of the possessive
|
|
quantifier; it can match only the entire string. The subsequent lookbe-
|
|
hind assertion does a single test on the last four characters. If it
|
|
fails, the match fails immediately. For long strings, this approach
|
|
makes a significant difference to the processing time.
|
|
|
|
Using multiple assertions
|
|
|
|
Several assertions (of any sort) may occur in succession. For example,
|
|
|
|
(?<=\d{3})(?<!999)foo
|
|
|
|
matches "foo" preceded by three digits that are not "999". Notice that
|
|
each of the assertions is applied independently at the same point in
|
|
the subject string. First there is a check that the previous three
|
|
characters are all digits, and then there is a check that the same
|
|
three characters are not "999". This pattern does not match "foo" pre-
|
|
ceded by six characters, the first of which are digits and the last
|
|
three of which are not "999". For example, it doesn't match "123abc-
|
|
foo". A pattern to do that is
|
|
|
|
(?<=\d{3}...)(?<!999)foo
|
|
|
|
This time the first assertion looks at the preceding six characters,
|
|
checking that the first three are digits, and then the second assertion
|
|
checks that the preceding three characters are not "999".
|
|
|
|
Assertions can be nested in any combination. For example,
|
|
|
|
(?<=(?<!foo)bar)baz
|
|
|
|
matches an occurrence of "baz" that is preceded by "bar" which in turn
|
|
is not preceded by "foo", while
|
|
|
|
(?<=\d{3}(?!999)...)foo
|
|
|
|
is another pattern that matches "foo" preceded by three digits and any
|
|
three characters that are not "999".
|
|
|
|
|
|
NON-ATOMIC ASSERTIONS
|
|
|
|
The traditional Perl-compatible lookaround assertions are atomic. That
|
|
is, if an assertion is true, but there is a subsequent matching fail-
|
|
ure, there is no backtracking into the assertion. However, there are
|
|
some cases where non-atomic positive assertions can be useful. PCRE2
|
|
provides these using the following syntax:
|
|
|
|
(*non_atomic_positive_lookahead: or (*napla: or (?*
|
|
(*non_atomic_positive_lookbehind: or (*naplb: or (?<*
|
|
|
|
Consider the problem of finding the right-most word in a string that
|
|
also appears earlier in the string, that is, it must appear at least
|
|
twice in total. This pattern returns the required result as captured
|
|
substring 1:
|
|
|
|
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
|
|
|
For a subject such as "word1 word2 word3 word2 word3 word4" the result
|
|
is "word3". How does it work? At the start, ^(?x) anchors the pattern
|
|
and sets the "x" option, which causes white space (introduced for read-
|
|
ability) to be ignored. Inside the assertion, the greedy .* at first
|
|
consumes the entire string, but then has to backtrack until the rest of
|
|
the assertion can match a word, which is captured by group 1. In other
|
|
words, when the assertion first succeeds, it captures the right-most
|
|
word in the string.
|
|
|
|
The current matching point is then reset to the start of the subject,
|
|
and the rest of the pattern match checks for two occurrences of the
|
|
captured word, using an ungreedy .*? to scan from the left. If this
|
|
succeeds, we are done, but if the last word in the string does not oc-
|
|
cur twice, this part of the pattern fails. If a traditional atomic
|
|
lookhead (?= or (*pla: had been used, the assertion could not be re-en-
|
|
tered, and the whole match would fail. The pattern would succeed only
|
|
if the very last word in the subject was found twice.
|
|
|
|
Using a non-atomic lookahead, however, means that when the last word
|
|
does not occur twice in the string, the lookahead can backtrack and
|
|
find the second-last word, and so on, until either the match succeeds,
|
|
or all words have been tested.
|
|
|
|
Two conditions must be met for a non-atomic assertion to be useful: the
|
|
contents of one or more capturing groups must change after a backtrack
|
|
into the assertion, and there must be a backreference to a changed
|
|
group later in the pattern. If this is not the case, the rest of the
|
|
pattern match fails exactly as before because nothing has changed, so
|
|
using a non-atomic assertion just wastes resources.
|
|
|
|
There is one exception to backtracking into a non-atomic assertion. If
|
|
an (*ACCEPT) control verb is triggered, the assertion succeeds atomi-
|
|
cally. That is, a subsequent match failure cannot backtrack into the
|
|
assertion.
|
|
|
|
Non-atomic assertions are not supported by the alternative matching
|
|
function pcre2_dfa_match(). They are supported by JIT, but only if they
|
|
do not contain any control verbs such as (*ACCEPT). (This may change in
|
|
future). Note that assertions that appear as conditions for conditional
|
|
groups (see below) must be atomic.
|
|
|
|
|
|
SCRIPT RUNS
|
|
|
|
In concept, a script run is a sequence of characters that are all from
|
|
the same Unicode script such as Latin or Greek. However, because some
|
|
scripts are commonly used together, and because some diacritical and
|
|
other marks are used with multiple scripts, it is not that simple.
|
|
There is a full description of the rules that PCRE2 uses in the section
|
|
entitled "Script Runs" in the pcre2unicode documentation.
|
|
|
|
If part of a pattern is enclosed between (*script_run: or (*sr: and a
|
|
closing parenthesis, it fails if the sequence of characters that it
|
|
matches are not a script run. After a failure, normal backtracking oc-
|
|
curs. Script runs can be used to detect spoofing attacks using charac-
|
|
ters that look the same, but are from different scripts. The string
|
|
"paypal.com" is an infamous example, where the letters could be a mix-
|
|
ture of Latin and Cyrillic. This pattern ensures that the matched char-
|
|
acters in a sequence of non-spaces that follow white space are a script
|
|
run:
|
|
|
|
\s+(*sr:\S+)
|
|
|
|
To be sure that they are all from the Latin script (for example), a
|
|
lookahead can be used:
|
|
|
|
\s+(?=\p{Latin})(*sr:\S+)
|
|
|
|
This works as long as the first character is expected to be a character
|
|
in that script, and not (for example) punctuation, which is allowed
|
|
with any script. If this is not the case, a more creative lookahead is
|
|
needed. For example, if digits, underscore, and dots are permitted at
|
|
the start:
|
|
|
|
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
|
|
|
|
|
|
In many cases, backtracking into a script run pattern fragment is not
|
|
desirable. The script run can employ an atomic group to prevent this.
|
|
Because this is a common requirement, a shorthand notation is provided
|
|
by (*atomic_script_run: or (*asr:
|
|
|
|
(*asr:...) is the same as (*sr:(?>...))
|
|
|
|
Note that the atomic group is inside the script run. Putting it outside
|
|
would not prevent backtracking into the script run pattern.
|
|
|
|
Support for script runs is not available if PCRE2 is compiled without
|
|
Unicode support. A compile-time error is given if any of the above con-
|
|
structs is encountered. Script runs are not supported by the alternate
|
|
matching function, pcre2_dfa_match() because they use the same mecha-
|
|
nism as capturing parentheses.
|
|
|
|
Warning: The (*ACCEPT) control verb (see below) should not be used
|
|
within a script run group, because it causes an immediate exit from the
|
|
group, bypassing the script run checking.
|
|
|
|
|
|
CONDITIONAL GROUPS
|
|
|
|
It is possible to cause the matching process to obey a pattern fragment
|
|
conditionally or to choose between two alternative fragments, depending
|
|
on the result of an assertion, or whether a specific capture group has
|
|
already been matched. The two possible forms of conditional group are:
|
|
|
|
(?(condition)yes-pattern)
|
|
(?(condition)yes-pattern|no-pattern)
|
|
|
|
If the condition is satisfied, the yes-pattern is used; otherwise the
|
|
no-pattern (if present) is used. An absent no-pattern is equivalent to
|
|
an empty string (it always matches). If there are more than two alter-
|
|
natives in the group, a compile-time error occurs. Each of the two al-
|
|
ternatives may itself contain nested groups of any form, including con-
|
|
ditional groups; the restriction to two alternatives applies only at
|
|
the level of the condition itself. This pattern fragment is an example
|
|
where the alternatives are complex:
|
|
|
|
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
|
|
|
|
|
|
There are five kinds of condition: references to capture groups, refer-
|
|
ences to recursion, two pseudo-conditions called DEFINE and VERSION,
|
|
and assertions.
|
|
|
|
Checking for a used capture group by number
|
|
|
|
If the text between the parentheses consists of a sequence of digits,
|
|
the condition is true if a capture group of that number has previously
|
|
matched. If there is more than one capture group with the same number
|
|
(see the earlier section about duplicate group numbers), the condition
|
|
is true if any of them have matched. An alternative notation is to pre-
|
|
cede the digits with a plus or minus sign. In this case, the group num-
|
|
ber is relative rather than absolute. The most recently opened capture
|
|
group can be referenced by (?(-1), the next most recent by (?(-2), and
|
|
so on. Inside loops it can also make sense to refer to subsequent
|
|
groups. The next capture group can be referenced as (?(+1), and so on.
|
|
(The value zero in any of these forms is not used; it provokes a com-
|
|
pile-time error.)
|
|
|
|
Consider the following pattern, which contains non-significant white
|
|
space to make it more readable (assume the PCRE2_EXTENDED option) and
|
|
to divide it into three parts for ease of discussion:
|
|
|
|
( \( )? [^()]+ (?(1) \) )
|
|
|
|
The first part matches an optional opening parenthesis, and if that
|
|
character is present, sets it as the first captured substring. The sec-
|
|
ond part matches one or more characters that are not parentheses. The
|
|
third part is a conditional group that tests whether or not the first
|
|
capture group matched. If it did, that is, if subject started with an
|
|
opening parenthesis, the condition is true, and so the yes-pattern is
|
|
executed and a closing parenthesis is required. Otherwise, since no-
|
|
pattern is not present, the conditional group matches nothing. In other
|
|
words, this pattern matches a sequence of non-parentheses, optionally
|
|
enclosed in parentheses.
|
|
|
|
If you were embedding this pattern in a larger one, you could use a
|
|
relative reference:
|
|
|
|
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
|
|
|
|
This makes the fragment independent of the parentheses in the larger
|
|
pattern.
|
|
|
|
Checking for a used capture group by name
|
|
|
|
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
|
|
used capture group by name. For compatibility with earlier versions of
|
|
PCRE1, which had this facility before Perl, the syntax (?(name)...) is
|
|
also recognized. Note, however, that undelimited names consisting of
|
|
the letter R followed by digits are ambiguous (see the following sec-
|
|
tion). Rewriting the above example to use a named group gives this:
|
|
|
|
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
|
|
|
|
If the name used in a condition of this kind is a duplicate, the test
|
|
is applied to all groups of the same name, and is true if any one of
|
|
them has matched.
|
|
|
|
Checking for pattern recursion
|
|
|
|
"Recursion" in this sense refers to any subroutine-like call from one
|
|
part of the pattern to another, whether or not it is actually recur-
|
|
sive. See the sections entitled "Recursive patterns" and "Groups as
|
|
subroutines" below for details of recursion and subroutine calls.
|
|
|
|
If a condition is the string (R), and there is no capture group with
|
|
the name R, the condition is true if matching is currently in a recur-
|
|
sion or subroutine call to the whole pattern or any capture group. If
|
|
digits follow the letter R, and there is no group with that name, the
|
|
condition is true if the most recent call is into a group with the
|
|
given number, which must exist somewhere in the overall pattern. This
|
|
is a contrived example that is equivalent to a+b:
|
|
|
|
((?(R1)a+|(?1)b))
|
|
|
|
However, in both cases, if there is a capture group with a matching
|
|
name, the condition tests for its being set, as described in the sec-
|
|
tion above, instead of testing for recursion. For example, creating a
|
|
group with the name R1 by adding (?<R1>) to the above pattern com-
|
|
pletely changes its meaning.
|
|
|
|
If a name preceded by ampersand follows the letter R, for example:
|
|
|
|
(?(R&name)...)
|
|
|
|
the condition is true if the most recent recursion is into a group of
|
|
that name (which must exist within the pattern).
|
|
|
|
This condition does not check the entire recursion stack. It tests only
|
|
the current level. If the name used in a condition of this kind is a
|
|
duplicate, the test is applied to all groups of the same name, and is
|
|
true if any one of them is the most recent recursion.
|
|
|
|
At "top level", all these recursion test conditions are false.
|
|
|
|
Defining capture groups for use by reference only
|
|
|
|
If the condition is the string (DEFINE), the condition is always false,
|
|
even if there is a group with the name DEFINE. In this case, there may
|
|
be only one alternative in the rest of the conditional group. It is al-
|
|
ways skipped if control reaches this point in the pattern; the idea of
|
|
DEFINE is that it can be used to define subroutines that can be refer-
|
|
enced from elsewhere. (The use of subroutines is described below.) For
|
|
example, a pattern to match an IPv4 address such as "192.168.23.245"
|
|
could be written like this (ignore white space and line breaks):
|
|
|
|
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
|
|
\b (?&byte) (\.(?&byte)){3} \b
|
|
|
|
The first part of the pattern is a DEFINE group inside which another
|
|
group named "byte" is defined. This matches an individual component of
|
|
an IPv4 address (a number less than 256). When matching takes place,
|
|
this part of the pattern is skipped because DEFINE acts like a false
|
|
condition. The rest of the pattern uses references to the named group
|
|
to match the four dot-separated components of an IPv4 address, insist-
|
|
ing on a word boundary at each end.
|
|
|
|
Checking the PCRE2 version
|
|
|
|
Programs that link with a PCRE2 library can check the version by call-
|
|
ing pcre2_config() with appropriate arguments. Users of applications
|
|
that do not have access to the underlying code cannot do this. A spe-
|
|
cial "condition" called VERSION exists to allow such users to discover
|
|
which version of PCRE2 they are dealing with by using this condition to
|
|
match a string such as "yesno". VERSION must be followed either by "="
|
|
or ">=" and a version number. For example:
|
|
|
|
(?(VERSION>=10.4)yes|no)
|
|
|
|
This pattern matches "yes" if the PCRE2 version is greater or equal to
|
|
10.4, or "no" otherwise. The fractional part of the version number may
|
|
not contain more than two digits.
|
|
|
|
Assertion conditions
|
|
|
|
If the condition is not in any of the above formats, it must be a
|
|
parenthesized assertion. This may be a positive or negative lookahead
|
|
or lookbehind assertion. However, it must be a traditional atomic as-
|
|
sertion, not one of the PCRE2-specific non-atomic assertions.
|
|
|
|
Consider this pattern, again containing non-significant white space,
|
|
and with the two alternatives on the second line:
|
|
|
|
(?(?=[^a-z]*[a-z])
|
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
|
|
|
The condition is a positive lookahead assertion that matches an op-
|
|
tional sequence of non-letters followed by a letter. In other words, it
|
|
tests for the presence of at least one letter in the subject. If a let-
|
|
ter is found, the subject is matched against the first alternative;
|
|
otherwise it is matched against the second. This pattern matches
|
|
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
|
letters and dd are digits.
|
|
|
|
When an assertion that is a condition contains capture groups, any cap-
|
|
turing that occurs in a matching branch is retained afterwards, for
|
|
both positive and negative assertions, because matching always contin-
|
|
ues after the assertion, whether it succeeds or fails. (Compare non-
|
|
conditional assertions, for which captures are retained only for posi-
|
|
tive assertions that succeed.)
|
|
|
|
|
|
COMMENTS
|
|
|
|
There are two ways of including comments in patterns that are processed
|
|
by PCRE2. In both cases, the start of the comment must not be in a
|
|
character class, nor in the middle of any other sequence of related
|
|
characters such as (?: or a group name or number. The characters that
|
|
make up a comment play no part in the pattern matching.
|
|
|
|
The sequence (?# marks the start of a comment that continues up to the
|
|
next closing parenthesis. Nested parentheses are not permitted. If the
|
|
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
|
|
character also introduces a comment, which in this case continues to
|
|
immediately after the next newline character or character sequence in
|
|
the pattern. Which characters are interpreted as newlines is controlled
|
|
by an option passed to the compiling function or by a special sequence
|
|
at the start of the pattern, as described in the section entitled "New-
|
|
line conventions" above. Note that the end of this type of comment is a
|
|
literal newline sequence in the pattern; escape sequences that happen
|
|
to represent a newline do not count. For example, consider this pattern
|
|
when PCRE2_EXTENDED is set, and the default newline convention (a sin-
|
|
gle linefeed character) is in force:
|
|
|
|
abc #comment \n still comment
|
|
|
|
On encountering the # character, pcre2_compile() skips along, looking
|
|
for a newline in the pattern. The sequence \n is still literal at this
|
|
stage, so it does not terminate the comment. Only an actual character
|
|
with the code value 0x0a (the default newline) does so.
|
|
|
|
|
|
RECURSIVE PATTERNS
|
|
|
|
Consider the problem of matching a string in parentheses, allowing for
|
|
unlimited nested parentheses. Without the use of recursion, the best
|
|
that can be done is to use a pattern that matches up to some fixed
|
|
depth of nesting. It is not possible to handle an arbitrary nesting
|
|
depth.
|
|
|
|
For some time, Perl has provided a facility that allows regular expres-
|
|
sions to recurse (amongst other things). It does this by interpolating
|
|
Perl code in the expression at run time, and the code can refer to the
|
|
expression itself. A Perl pattern using code interpolation to solve the
|
|
parentheses problem can be created like this:
|
|
|
|
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
|
|
|
|
The (?p{...}) item interpolates Perl code at run time, and in this case
|
|
refers recursively to the pattern in which it appears.
|
|
|
|
Obviously, PCRE2 cannot support the interpolation of Perl code. In-
|
|
stead, it supports special syntax for recursion of the entire pattern,
|
|
and also for individual capture group recursion. After its introduction
|
|
in PCRE1 and Python, this kind of recursion was subsequently introduced
|
|
into Perl at release 5.10.
|
|
|
|
A special item that consists of (? followed by a number greater than
|
|
zero and a closing parenthesis is a recursive subroutine call of the
|
|
capture group of the given number, provided that it occurs inside that
|
|
group. (If not, it is a non-recursive subroutine call, which is de-
|
|
scribed in the next section.) The special item (?R) or (?0) is a recur-
|
|
sive call of the entire regular expression.
|
|
|
|
This PCRE2 pattern solves the nested parentheses problem (assume the
|
|
PCRE2_EXTENDED option is set so that white space is ignored):
|
|
|
|
\( ( [^()]++ | (?R) )* \)
|
|
|
|
First it matches an opening parenthesis. Then it matches any number of
|
|
substrings which can either be a sequence of non-parentheses, or a re-
|
|
cursive match of the pattern itself (that is, a correctly parenthesized
|
|
substring). Finally there is a closing parenthesis. Note the use of a
|
|
possessive quantifier to avoid backtracking into sequences of non-
|
|
parentheses.
|
|
|
|
If this were part of a larger pattern, you would not want to recurse
|
|
the entire pattern, so instead you could use this:
|
|
|
|
( \( ( [^()]++ | (?1) )* \) )
|
|
|
|
We have put the pattern into parentheses, and caused the recursion to
|
|
refer to them instead of the whole pattern.
|
|
|
|
In a larger pattern, keeping track of parenthesis numbers can be
|
|
tricky. This is made easier by the use of relative references. Instead
|
|
of (?1) in the pattern above you can write (?-2) to refer to the second
|
|
most recently opened parentheses preceding the recursion. In other
|
|
words, a negative number counts capturing parentheses leftwards from
|
|
the point at which it is encountered.
|
|
|
|
Be aware however, that if duplicate capture group numbers are in use,
|
|
relative references refer to the earliest group with the appropriate
|
|
number. Consider, for example:
|
|
|
|
(?|(a)|(b)) (c) (?-2)
|
|
|
|
The first two capture groups (a) and (b) are both numbered 1, and group
|
|
(c) is number 2. When the reference (?-2) is encountered, the second
|
|
most recently opened parentheses has the number 1, but it is the first
|
|
such group (the (a) group) to which the recursion refers. This would be
|
|
the same if an absolute reference (?1) was used. In other words, rela-
|
|
tive references are just a shorthand for computing a group number.
|
|
|
|
It is also possible to refer to subsequent capture groups, by writing
|
|
references such as (?+2). However, these cannot be recursive because
|
|
the reference is not inside the parentheses that are referenced. They
|
|
are always non-recursive subroutine calls, as described in the next
|
|
section.
|
|
|
|
An alternative approach is to use named parentheses. The Perl syntax
|
|
for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
|
|
ported. We could rewrite the above example as follows:
|
|
|
|
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
|
|
|
|
If there is more than one group with the same name, the earliest one is
|
|
used.
|
|
|
|
The example pattern that we have been looking at contains nested unlim-
|
|
ited repeats, and so the use of a possessive quantifier for matching
|
|
strings of non-parentheses is important when applying the pattern to
|
|
strings that do not match. For example, when this pattern is applied to
|
|
|
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
|
|
|
|
it yields "no match" quickly. However, if a possessive quantifier is
|
|
not used, the match runs for a very long time indeed because there are
|
|
so many different ways the + and * repeats can carve up the subject,
|
|
and all have to be tested before failure can be reported.
|
|
|
|
At the end of a match, the values of capturing parentheses are those
|
|
from the outermost level. If you want to obtain intermediate values, a
|
|
callout function can be used (see below and the pcre2callout documenta-
|
|
tion). If the pattern above is matched against
|
|
|
|
(ab(cd)ef)
|
|
|
|
the value for the inner capturing parentheses (numbered 2) is "ef",
|
|
which is the last value taken on at the top level. If a capture group
|
|
is not matched at the top level, its final captured value is unset,
|
|
even if it was (temporarily) set at a deeper level during the matching
|
|
process.
|
|
|
|
Do not confuse the (?R) item with the condition (R), which tests for
|
|
recursion. Consider this pattern, which matches text in angle brack-
|
|
ets, allowing for arbitrary nesting. Only digits are allowed in nested
|
|
brackets (that is, when recursing), whereas any characters are permit-
|
|
ted at the outer level.
|
|
|
|
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
|
|
|
|
In this pattern, (?(R) is the start of a conditional group, with two
|
|
different alternatives for the recursive and non-recursive cases. The
|
|
(?R) item is the actual recursive call.
|
|
|
|
Differences in recursion processing between PCRE2 and Perl
|
|
|
|
Some former differences between PCRE2 and Perl no longer exist.
|
|
|
|
Before release 10.30, recursion processing in PCRE2 differed from Perl
|
|
in that a recursive subroutine call was always treated as an atomic
|
|
group. That is, once it had matched some of the subject string, it was
|
|
never re-entered, even if it contained untried alternatives and there
|
|
was a subsequent matching failure. (Historical note: PCRE implemented
|
|
recursion before Perl did.)
|
|
|
|
Starting with release 10.30, recursive subroutine calls are no longer
|
|
treated as atomic. That is, they can be re-entered to try unused alter-
|
|
natives if there is a matching failure later in the pattern. This is
|
|
now compatible with the way Perl works. If you want a subroutine call
|
|
to be atomic, you must explicitly enclose it in an atomic group.
|
|
|
|
Supporting backtracking into recursions simplifies certain types of re-
|
|
cursive pattern. For example, this pattern matches palindromic strings:
|
|
|
|
^((.)(?1)\2|.?)$
|
|
|
|
The second branch in the group matches a single central character in
|
|
the palindrome when there are an odd number of characters, or nothing
|
|
when there are an even number of characters, but in order to work it
|
|
has to be able to try the second case when the rest of the pattern
|
|
match fails. If you want to match typical palindromic phrases, the pat-
|
|
tern has to ignore all non-word characters, which can be done like
|
|
this:
|
|
|
|
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
|
|
|
If run with the PCRE2_CASELESS option, this pattern matches phrases
|
|
such as "A man, a plan, a canal: Panama!". Note the use of the posses-
|
|
sive quantifier *+ to avoid backtracking into sequences of non-word
|
|
characters. Without this, PCRE2 takes a great deal longer (ten times or
|
|
more) to match typical phrases, and Perl takes so long that you think
|
|
it has gone into a loop.
|
|
|
|
Another way in which PCRE2 and Perl used to differ in their recursion
|
|
processing is in the handling of captured values. Formerly in Perl,
|
|
when a group was called recursively or as a subroutine (see the next
|
|
section), it had no access to any values that were captured outside the
|
|
recursion, whereas in PCRE2 these values can be referenced. Consider
|
|
this pattern:
|
|
|
|
^(.)(\1|a(?2))
|
|
|
|
This pattern matches "bab". The first capturing parentheses match "b",
|
|
then in the second group, when the backreference \1 fails to match "b",
|
|
the second alternative matches "a" and then recurses. In the recursion,
|
|
\1 does now match "b" and so the whole match succeeds. This match used
|
|
to fail in Perl, but in later versions (I tried 5.024) it now works.
|
|
|
|
|
|
GROUPS AS SUBROUTINES
|
|
|
|
If the syntax for a recursive group call (either by number or by name)
|
|
is used outside the parentheses to which it refers, it operates a bit
|
|
like a subroutine in a programming language. More accurately, PCRE2
|
|
treats the referenced group as an independent subpattern which it tries
|
|
to match at the current matching position. The called group may be de-
|
|
fined before or after the reference. A numbered reference can be abso-
|
|
lute or relative, as in these examples:
|
|
|
|
(...(absolute)...)...(?2)...
|
|
(...(relative)...)...(?-1)...
|
|
(...(?+1)...(relative)...
|
|
|
|
An earlier example pointed out that the pattern
|
|
|
|
(sens|respons)e and \1ibility
|
|
|
|
matches "sense and sensibility" and "response and responsibility", but
|
|
not "sense and responsibility". If instead the pattern
|
|
|
|
(sens|respons)e and (?1)ibility
|
|
|
|
is used, it does match "sense and responsibility" as well as the other
|
|
two strings. Another example is given in the discussion of DEFINE
|
|
above.
|
|
|
|
Like recursions, subroutine calls used to be treated as atomic, but
|
|
this changed at PCRE2 release 10.30, so backtracking into subroutine
|
|
calls can now occur. However, any capturing parentheses that are set
|
|
during the subroutine call revert to their previous values afterwards.
|
|
|
|
Processing options such as case-independence are fixed when a group is
|
|
defined, so if it is used as a subroutine, such options cannot be
|
|
changed for different calls. For example, consider this pattern:
|
|
|
|
(abc)(?i:(?-1))
|
|
|
|
It matches "abcabc". It does not match "abcABC" because the change of
|
|
processing option does not affect the called group.
|
|
|
|
The behaviour of backtracking control verbs in groups when called as
|
|
subroutines is described in the section entitled "Backtracking verbs in
|
|
subroutines" below.
|
|
|
|
|
|
ONIGURUMA SUBROUTINE SYNTAX
|
|
|
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
|
|
name or a number enclosed either in angle brackets or single quotes, is
|
|
an alternative syntax for calling a group as a subroutine, possibly re-
|
|
cursively. Here are two of the examples used above, rewritten using
|
|
this syntax:
|
|
|
|
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
|
|
(sens|respons)e and \g'1'ibility
|
|
|
|
PCRE2 supports an extension to Oniguruma: if a number is preceded by a
|
|
plus or a minus sign it is taken as a relative reference. For example:
|
|
|
|
(abc)(?i:\g<-1>)
|
|
|
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
|
|
synonymous. The former is a backreference; the latter is a subroutine
|
|
call.
|
|
|
|
|
|
CALLOUTS
|
|
|
|
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
|
|
Perl code to be obeyed in the middle of matching a regular expression.
|
|
This makes it possible, amongst other things, to extract different sub-
|
|
strings that match the same pair of parentheses when there is a repeti-
|
|
tion.
|
|
|
|
PCRE2 provides a similar feature, but of course it cannot obey arbi-
|
|
trary Perl code. The feature is called "callout". The caller of PCRE2
|
|
provides an external function by putting its entry point in a match
|
|
context using the function pcre2_set_callout(), and then passing that
|
|
context to pcre2_match() or pcre2_dfa_match(). If no match context is
|
|
passed, or if the callout entry point is set to NULL, callouts are dis-
|
|
abled.
|
|
|
|
Within a regular expression, (?C<arg>) indicates a point at which the
|
|
external function is to be called. There are two kinds of callout:
|
|
those with a numerical argument and those with a string argument. (?C)
|
|
on its own with no argument is treated as (?C0). A numerical argument
|
|
allows the application to distinguish between different callouts.
|
|
String arguments were added for release 10.20 to make it possible for
|
|
script languages that use PCRE2 to embed short scripts within patterns
|
|
in a similar way to Perl.
|
|
|
|
During matching, when PCRE2 reaches a callout point, the external func-
|
|
tion is called. It is provided with the number or string argument of
|
|
the callout, the position in the pattern, and one item of data that is
|
|
also set in the match block. The callout function may cause matching to
|
|
proceed, to backtrack, or to fail.
|
|
|
|
By default, PCRE2 implements a number of optimizations at matching
|
|
time, and one side-effect is that sometimes callouts are skipped. If
|
|
you need all possible callouts to happen, you need to set options that
|
|
disable the relevant optimizations. More details, including a complete
|
|
description of the programming interface to the callout function, are
|
|
given in the pcre2callout documentation.
|
|
|
|
Callouts with numerical arguments
|
|
|
|
If you just want to have a means of identifying different callout
|
|
points, put a number less than 256 after the letter C. For example,
|
|
this pattern has two callout points:
|
|
|
|
(?C1)abc(?C2)def
|
|
|
|
If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
|
|
callouts are automatically installed before each item in the pattern.
|
|
They are all numbered 255. If there is a conditional group in the pat-
|
|
tern whose condition is an assertion, an additional callout is inserted
|
|
just before the condition. An explicit callout may also be set at this
|
|
position, as in this example:
|
|
|
|
(?(?C9)(?=a)abc|def)
|
|
|
|
Note that this applies only to assertion conditions, not to other types
|
|
of condition.
|
|
|
|
Callouts with string arguments
|
|
|
|
A delimited string may be used instead of a number as a callout argu-
|
|
ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
|
|
ending delimiter is the same as the start, except for {, where the end-
|
|
ing delimiter is }. If the ending delimiter is needed within the
|
|
string, it must be doubled. For example:
|
|
|
|
(?C'ab ''c'' d')xyz(?C{any text})pqr
|
|
|
|
The doubling is removed before the string is passed to the callout
|
|
function.
|
|
|
|
|
|
BACKTRACKING CONTROL
|
|
|
|
There are a number of special "Backtracking Control Verbs" (to use
|
|
Perl's terminology) that modify the behaviour of backtracking during
|
|
matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
|
|
verbs take either form, and may behave differently depending on whether
|
|
or not a name argument is present. The names are not required to be
|
|
unique within the pattern.
|
|
|
|
By default, for compatibility with Perl, a name is any sequence of
|
|
characters that does not include a closing parenthesis. The name is not
|
|
processed in any way, and it is not possible to include a closing
|
|
parenthesis in the name. This can be changed by setting the
|
|
PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
|
|
ble.
|
|
|
|
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
|
|
verb names and only an unescaped closing parenthesis terminates the
|
|
name. However, the only backslash items that are permitted are \Q, \E,
|
|
and sequences such as \x{100} that define character code points. Char-
|
|
acter type escapes such as \d are faulted.
|
|
|
|
A closing parenthesis can be included in a name either as \) or between
|
|
\Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
|
|
or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
|
|
names is skipped, and #-comments are recognized, exactly as in the rest
|
|
of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
|
|
verb names unless PCRE2_ALT_VERBNAMES is also set.
|
|
|
|
The maximum length of a name is 255 in the 8-bit library and 65535 in
|
|
the 16-bit and 32-bit libraries. If the name is empty, that is, if the
|
|
closing parenthesis immediately follows the colon, the effect is as if
|
|
the colon were not there. Any number of these verbs may occur in a pat-
|
|
tern. Except for (*ACCEPT), they may not be quantified.
|
|
|
|
Since these verbs are specifically related to backtracking, most of
|
|
them can be used only when the pattern is to be matched using the tra-
|
|
ditional matching function, because that uses a backtracking algorithm.
|
|
With the exception of (*FAIL), which behaves like a failing negative
|
|
assertion, the backtracking control verbs cause an error if encountered
|
|
by the DFA matching function.
|
|
|
|
The behaviour of these verbs in repeated groups, assertions, and in
|
|
capture groups called as subroutines (whether or not recursively) is
|
|
documented below.
|
|
|
|
Optimizations that affect backtracking verbs
|
|
|
|
PCRE2 contains some optimizations that are used to speed up matching by
|
|
running some checks at the start of each match attempt. For example, it
|
|
may know the minimum length of matching subject, or that a particular
|
|
character must be present. When one of these optimizations bypasses the
|
|
running of a match, any included backtracking verbs will not, of
|
|
course, be processed. You can suppress the start-of-match optimizations
|
|
by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
|
|
pile(), or by starting the pattern with (*NO_START_OPT). There is more
|
|
discussion of this option in the section entitled "Compiling a pattern"
|
|
in the pcre2api documentation.
|
|
|
|
Experiments with Perl suggest that it too has similar optimizations,
|
|
and like PCRE2, turning them off can change the result of a match.
|
|
|
|
Verbs that act immediately
|
|
|
|
The following verbs act as soon as they are encountered.
|
|
|
|
(*ACCEPT) or (*ACCEPT:NAME)
|
|
|
|
This verb causes the match to end successfully, skipping the remainder
|
|
of the pattern. However, when it is inside a capture group that is
|
|
called as a subroutine, only that group is ended successfully. Matching
|
|
then continues at the outer level. If (*ACCEPT) in triggered in a posi-
|
|
tive assertion, the assertion succeeds; in a negative assertion, the
|
|
assertion fails.
|
|
|
|
If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
|
|
tured. For example:
|
|
|
|
A((?:A|B(*ACCEPT)|C)D)
|
|
|
|
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
|
|
tured by the outer parentheses.
|
|
|
|
(*ACCEPT) is the only backtracking verb that is allowed to be quanti-
|
|
fied because an ungreedy quantification with a minimum of zero acts
|
|
only when a backtrack happens. Consider, for example,
|
|
|
|
(A(*ACCEPT)??B)C
|
|
|
|
where A, B, and C may be complex expressions. After matching "A", the
|
|
matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
|
|
is triggered and the match succeeds. In both cases, all but C is cap-
|
|
tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re-
|
|
peated (*ACCEPT) of this type means "succeed on backtrack".
|
|
|
|
Warning: (*ACCEPT) should not be used within a script run group, be-
|
|
cause it causes an immediate exit from the group, bypassing the script
|
|
run checking.
|
|
|
|
(*FAIL) or (*FAIL:NAME)
|
|
|
|
This verb causes a matching failure, forcing backtracking to occur. It
|
|
may be abbreviated to (*F). It is equivalent to (?!) but easier to
|
|
read. The Perl documentation notes that it is probably useful only when
|
|
combined with (?{}) or (??{}). Those are, of course, Perl features that
|
|
are not present in PCRE2. The nearest equivalent is the callout fea-
|
|
ture, as for example in this pattern:
|
|
|
|
a+(?C)(*FAIL)
|
|
|
|
A match with the string "aaaa" always fails, but the callout is taken
|
|
before each backtrack happens (in this example, 10 times).
|
|
|
|
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC-
|
|
CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is
|
|
recorded just before the verb acts.
|
|
|
|
Recording which path was taken
|
|
|
|
There is one verb whose main purpose is to track how a match was ar-
|
|
rived at, though it also has a secondary use in conjunction with ad-
|
|
vancing the match starting point (see (*SKIP) below).
|
|
|
|
(*MARK:NAME) or (*:NAME)
|
|
|
|
A name is always required with this verb. For all the other backtrack-
|
|
ing control verbs, a NAME argument is optional.
|
|
|
|
When a match succeeds, the name of the last-encountered mark name on
|
|
the matching path is passed back to the caller as described in the sec-
|
|
tion entitled "Other information about the match" in the pcre2api docu-
|
|
mentation. This applies to all instances of (*MARK) and other verbs,
|
|
including those inside assertions and atomic groups. However, there are
|
|
differences in those cases when (*MARK) is used in conjunction with
|
|
(*SKIP) as described below.
|
|
|
|
The mark name that was last encountered on the matching path is passed
|
|
back. A verb without a NAME argument is ignored for this purpose. Here
|
|
is an example of pcre2test output, where the "mark" modifier requests
|
|
the retrieval and outputting of (*MARK) data:
|
|
|
|
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
|
|
data> XY
|
|
0: XY
|
|
MK: A
|
|
XZ
|
|
0: XZ
|
|
MK: B
|
|
|
|
The (*MARK) name is tagged with "MK:" in this output, and in this exam-
|
|
ple it indicates which of the two alternatives matched. This is a more
|
|
efficient way of obtaining this information than putting each alterna-
|
|
tive in its own capturing parentheses.
|
|
|
|
If a verb with a name is encountered in a positive assertion that is
|
|
true, the name is recorded and passed back if it is the last-encoun-
|
|
tered. This does not happen for negative assertions or failing positive
|
|
assertions.
|
|
|
|
After a partial match or a failed match, the last encountered name in
|
|
the entire match process is returned. For example:
|
|
|
|
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
|
|
data> XP
|
|
No match, mark = B
|
|
|
|
Note that in this unanchored example the mark is retained from the
|
|
match attempt that started at the letter "X" in the subject. Subsequent
|
|
match attempts starting at "P" and then with an empty string do not get
|
|
as far as the (*MARK) item, but nevertheless do not reset it.
|
|
|
|
If you are interested in (*MARK) values after failed matches, you
|
|
should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
|
|
ensure that the match is always attempted.
|
|
|
|
Verbs that act after backtracking
|
|
|
|
The following verbs do nothing when they are encountered. Matching con-
|
|
tinues with what follows, but if there is a subsequent match failure,
|
|
causing a backtrack to the verb, a failure is forced. That is, back-
|
|
tracking cannot pass to the left of the verb. However, when one of
|
|
these verbs appears inside an atomic group or in a lookaround assertion
|
|
that is true, its effect is confined to that group, because once the
|
|
group has been matched, there is never any backtracking into it. Back-
|
|
tracking from beyond an assertion or an atomic group ignores the entire
|
|
group, and seeks a preceding backtracking point.
|
|
|
|
These verbs differ in exactly what kind of failure occurs when back-
|
|
tracking reaches them. The behaviour described below is what happens
|
|
when the verb is not in a subroutine or an assertion. Subsequent sec-
|
|
tions cover these special cases.
|
|
|
|
(*COMMIT) or (*COMMIT:NAME)
|
|
|
|
This verb causes the whole match to fail outright if there is a later
|
|
matching failure that causes backtracking to reach it. Even if the pat-
|
|
tern is unanchored, no further attempts to find a match by advancing
|
|
the starting point take place. If (*COMMIT) is the only backtracking
|
|
verb that is encountered, once it has been passed pcre2_match() is com-
|
|
mitted to finding a match at the current starting point, or not at all.
|
|
For example:
|
|
|
|
a+(*COMMIT)b
|
|
|
|
This matches "xxaab" but not "aacaab". It can be thought of as a kind
|
|
of dynamic anchor, or "I've started, so I must finish."
|
|
|
|
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
|
|
MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
|
|
ing back to the caller. However, (*SKIP:NAME) searches only for names
|
|
that are set with (*MARK), ignoring those set by any of the other back-
|
|
tracking verbs.
|
|
|
|
If there is more than one backtracking verb in a pattern, a different
|
|
one that follows (*COMMIT) may be triggered first, so merely passing
|
|
(*COMMIT) during a match does not always guarantee that a match must be
|
|
at this starting point.
|
|
|
|
Note that (*COMMIT) at the start of a pattern is not the same as an an-
|
|
chor, unless PCRE2's start-of-match optimizations are turned off, as
|
|
shown in this output from pcre2test:
|
|
|
|
re> /(*COMMIT)abc/
|
|
data> xyzabc
|
|
0: abc
|
|
data>
|
|
re> /(*COMMIT)abc/no_start_optimize
|
|
data> xyzabc
|
|
No match
|
|
|
|
For the first pattern, PCRE2 knows that any match must start with "a",
|
|
so the optimization skips along the subject to "a" before applying the
|
|
pattern to the first set of data. The match attempt then succeeds. The
|
|
second pattern disables the optimization that skips along to the first
|
|
character. The pattern is now applied starting at "x", and so the
|
|
(*COMMIT) causes the match to fail without trying any other starting
|
|
points.
|
|
|
|
(*PRUNE) or (*PRUNE:NAME)
|
|
|
|
This verb causes the match to fail at the current starting position in
|
|
the subject if there is a later matching failure that causes backtrack-
|
|
ing to reach it. If the pattern is unanchored, the normal "bumpalong"
|
|
advance to the next starting character then happens. Backtracking can
|
|
occur as usual to the left of (*PRUNE), before it is reached, or when
|
|
matching to the right of (*PRUNE), but if there is no match to the
|
|
right, backtracking cannot cross (*PRUNE). In simple cases, the use of
|
|
(*PRUNE) is just an alternative to an atomic group or possessive quan-
|
|
tifier, but there are some uses of (*PRUNE) that cannot be expressed in
|
|
any other way. In an anchored pattern (*PRUNE) has the same effect as
|
|
(*COMMIT).
|
|
|
|
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
|
|
It is like (*MARK:NAME) in that the name is remembered for passing back
|
|
to the caller. However, (*SKIP:NAME) searches only for names set with
|
|
(*MARK), ignoring those set by other backtracking verbs.
|
|
|
|
(*SKIP)
|
|
|
|
This verb, when given without a name, is like (*PRUNE), except that if
|
|
the pattern is unanchored, the "bumpalong" advance is not to the next
|
|
character, but to the position in the subject where (*SKIP) was encoun-
|
|
tered. (*SKIP) signifies that whatever text was matched leading up to
|
|
it cannot be part of a successful match if there is a later mismatch.
|
|
Consider:
|
|
|
|
a+(*SKIP)b
|
|
|
|
If the subject is "aaaac...", after the first match attempt fails
|
|
(starting at the first character in the string), the starting point
|
|
skips on to start the next attempt at "c". Note that a possessive quan-
|
|
tifier does not have the same effect as this example; although it would
|
|
suppress backtracking during the first match attempt, the second at-
|
|
tempt would start at the second character instead of skipping on to
|
|
"c".
|
|
|
|
If (*SKIP) is used to specify a new starting position that is the same
|
|
as the starting position of the current match, or (by being inside a
|
|
lookbehind) earlier, the position specified by (*SKIP) is ignored, and
|
|
instead the normal "bumpalong" occurs.
|
|
|
|
(*SKIP:NAME)
|
|
|
|
When (*SKIP) has an associated name, its behaviour is modified. When
|
|
such a (*SKIP) is triggered, the previous path through the pattern is
|
|
searched for the most recent (*MARK) that has the same name. If one is
|
|
found, the "bumpalong" advance is to the subject position that corre-
|
|
sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
|
|
no (*MARK) with a matching name is found, the (*SKIP) is ignored.
|
|
|
|
The search for a (*MARK) name uses the normal backtracking mechanism,
|
|
which means that it does not see (*MARK) settings that are inside
|
|
atomic groups or assertions, because they are never re-entered by back-
|
|
tracking. Compare the following pcre2test examples:
|
|
|
|
re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
|
|
data: abc
|
|
0: a
|
|
1: a
|
|
data:
|
|
re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
|
|
data: abc
|
|
0: b
|
|
1: b
|
|
|
|
In the first example, the (*MARK) setting is in an atomic group, so it
|
|
is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
|
|
This allows the second branch of the pattern to be tried at the first
|
|
character position. In the second example, the (*MARK) setting is not
|
|
in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
|
|
backtracks, and this causes a new matching attempt to start at the sec-
|
|
ond character. This time, the (*MARK) is never seen because "a" does
|
|
not match "b", so the matcher immediately jumps to the second branch of
|
|
the pattern.
|
|
|
|
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
|
|
ignores names that are set by other backtracking verbs.
|
|
|
|
(*THEN) or (*THEN:NAME)
|
|
|
|
This verb causes a skip to the next innermost alternative when back-
|
|
tracking reaches it. That is, it cancels any further backtracking
|
|
within the current alternative. Its name comes from the observation
|
|
that it can be used for a pattern-based if-then-else block:
|
|
|
|
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
|
|
|
|
If the COND1 pattern matches, FOO is tried (and possibly further items
|
|
after the end of the group if FOO succeeds); on failure, the matcher
|
|
skips to the second alternative and tries COND2, without backtracking
|
|
into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
|
|
quently BAZ fails, there are no more alternatives, so there is a back-
|
|
track to whatever came before the entire group. If (*THEN) is not in-
|
|
side an alternation, it acts like (*PRUNE).
|
|
|
|
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
|
|
It is like (*MARK:NAME) in that the name is remembered for passing back
|
|
to the caller. However, (*SKIP:NAME) searches only for names set with
|
|
(*MARK), ignoring those set by other backtracking verbs.
|
|
|
|
A group that does not contain a | character is just a part of the en-
|
|
closing alternative; it is not a nested alternation with only one al-
|
|
ternative. The effect of (*THEN) extends beyond such a group to the en-
|
|
closing alternative. Consider this pattern, where A, B, etc. are com-
|
|
plex pattern fragments that do not contain any | characters at this
|
|
level:
|
|
|
|
A (B(*THEN)C) | D
|
|
|
|
If A and B are matched, but there is a failure in C, matching does not
|
|
backtrack into A; instead it moves to the next alternative, that is, D.
|
|
However, if the group containing (*THEN) is given an alternative, it
|
|
behaves differently:
|
|
|
|
A (B(*THEN)C | (*FAIL)) | D
|
|
|
|
The effect of (*THEN) is now confined to the inner group. After a fail-
|
|
ure in C, matching moves to (*FAIL), which causes the whole group to
|
|
fail because there are no more alternatives to try. In this case,
|
|
matching does backtrack into A.
|
|
|
|
Note that a conditional group is not considered as having two alterna-
|
|
tives, because only one is ever used. In other words, the | character
|
|
in a conditional group has a different meaning. Ignoring white space,
|
|
consider:
|
|
|
|
^.*? (?(?=a) a | b(*THEN)c )
|
|
|
|
If the subject is "ba", this pattern does not match. Because .*? is un-
|
|
greedy, it initially matches zero characters. The condition (?=a) then
|
|
fails, the character "b" is matched, but "c" is not. At this point,
|
|
matching does not backtrack to .*? as might perhaps be expected from
|
|
the presence of the | character. The conditional group is part of the
|
|
single alternative that comprises the whole pattern, and so the match
|
|
fails. (If there was a backtrack into .*?, allowing it to match "b",
|
|
the match would succeed.)
|
|
|
|
The verbs just described provide four different "strengths" of control
|
|
when subsequent matching fails. (*THEN) is the weakest, carrying on the
|
|
match at the next alternative. (*PRUNE) comes next, failing the match
|
|
at the current starting position, but allowing an advance to the next
|
|
character (for an unanchored pattern). (*SKIP) is similar, except that
|
|
the advance may be more than one character. (*COMMIT) is the strongest,
|
|
causing the entire match to fail.
|
|
|
|
More than one backtracking verb
|
|
|
|
If more than one backtracking verb is present in a pattern, the one
|
|
that is backtracked onto first acts. For example, consider this pat-
|
|
tern, where A, B, etc. are complex pattern fragments:
|
|
|
|
(A(*COMMIT)B(*THEN)C|ABD)
|
|
|
|
If A matches but B fails, the backtrack to (*COMMIT) causes the entire
|
|
match to fail. However, if A and B match, but C fails, the backtrack to
|
|
(*THEN) causes the next alternative (ABD) to be tried. This behaviour
|
|
is consistent, but is not always the same as Perl's. It means that if
|
|
two or more backtracking verbs appear in succession, all the the last
|
|
of them has no effect. Consider this example:
|
|
|
|
...(*COMMIT)(*PRUNE)...
|
|
|
|
If there is a matching failure to the right, backtracking onto (*PRUNE)
|
|
causes it to be triggered, and its action is taken. There can never be
|
|
a backtrack onto (*COMMIT).
|
|
|
|
Backtracking verbs in repeated groups
|
|
|
|
PCRE2 sometimes differs from Perl in its handling of backtracking verbs
|
|
in repeated groups. For example, consider:
|
|
|
|
/(a(*COMMIT)b)+ac/
|
|
|
|
If the subject is "abac", Perl matches unless its optimizations are
|
|
disabled, but PCRE2 always fails because the (*COMMIT) in the second
|
|
repeat of the group acts.
|
|
|
|
Backtracking verbs in assertions
|
|
|
|
(*FAIL) in any assertion has its normal effect: it forces an immediate
|
|
backtrack. The behaviour of the other backtracking verbs depends on
|
|
whether or not the assertion is standalone or acting as the condition
|
|
in a conditional group.
|
|
|
|
(*ACCEPT) in a standalone positive assertion causes the assertion to
|
|
succeed without any further processing; captured strings and a mark
|
|
name (if set) are retained. In a standalone negative assertion, (*AC-
|
|
CEPT) causes the assertion to fail without any further processing; cap-
|
|
tured substrings and any mark name are discarded.
|
|
|
|
If the assertion is a condition, (*ACCEPT) causes the condition to be
|
|
true for a positive assertion and false for a negative one; captured
|
|
substrings are retained in both cases.
|
|
|
|
The remaining verbs act only when a later failure causes a backtrack to
|
|
reach them. This means that, for the Perl-compatible assertions, their
|
|
effect is confined to the assertion, because Perl lookaround assertions
|
|
are atomic. A backtrack that occurs after such an assertion is complete
|
|
does not jump back into the assertion. Note in particular that a
|
|
(*MARK) name that is set in an assertion is not "seen" by an instance
|
|
of (*SKIP:NAME) later in the pattern.
|
|
|
|
PCRE2 now supports non-atomic positive assertions, as described in the
|
|
section entitled "Non-atomic assertions" above. These assertions must
|
|
be standalone (not used as conditions). They are not Perl-compatible.
|
|
For these assertions, a later backtrack does jump back into the asser-
|
|
tion, and therefore verbs such as (*COMMIT) can be triggered by back-
|
|
tracks from later in the pattern.
|
|
|
|
The effect of (*THEN) is not allowed to escape beyond an assertion. If
|
|
there are no more branches to try, (*THEN) causes a positive assertion
|
|
to be false, and a negative assertion to be true.
|
|
|
|
The other backtracking verbs are not treated specially if they appear
|
|
in a standalone positive assertion. In a conditional positive asser-
|
|
tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
|
|
or (*PRUNE) causes the condition to be false. However, for both stand-
|
|
alone and conditional negative assertions, backtracking into (*COMMIT),
|
|
(*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
|
|
ing any further alternative branches.
|
|
|
|
Backtracking verbs in subroutines
|
|
|
|
These behaviours occur whether or not the group is called recursively.
|
|
|
|
(*ACCEPT) in a group called as a subroutine causes the subroutine match
|
|
to succeed without any further processing. Matching then continues af-
|
|
ter the subroutine call. Perl documents this behaviour. Perl's treat-
|
|
ment of the other verbs in subroutines is different in some cases.
|
|
|
|
(*FAIL) in a group called as a subroutine has its normal effect: it
|
|
forces an immediate backtrack.
|
|
|
|
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
|
|
when triggered by being backtracked to in a group called as a subrou-
|
|
tine. There is then a backtrack at the outer level.
|
|
|
|
(*THEN), when triggered, skips to the next alternative in the innermost
|
|
enclosing group that has alternatives (its normal behaviour). However,
|
|
if there is no such group within the subroutine's group, the subroutine
|
|
match fails and there is a backtrack at the outer level.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
|
|
pcre2(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 28 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 PERFORMANCE
|
|
|
|
Two aspects of performance are discussed below: memory usage and pro-
|
|
cessing time. The way you express your pattern as a regular expression
|
|
can affect both of them.
|
|
|
|
|
|
COMPILED PATTERN MEMORY USAGE
|
|
|
|
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
|
|
code, so that most simple patterns do not use much memory for storing
|
|
the compiled version. However, there is one case where the memory usage
|
|
of a compiled pattern can be unexpectedly large. If a parenthesized
|
|
group has a quantifier with a minimum greater than 1 and/or a limited
|
|
maximum, the whole group is repeated in the compiled code. For example,
|
|
the pattern
|
|
|
|
(abc|def){2,4}
|
|
|
|
is compiled as if it were
|
|
|
|
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
|
|
|
(Technical aside: It is done this way so that backtrack points within
|
|
each of the repetitions can be independently maintained.)
|
|
|
|
For regular expressions whose quantifiers use only small numbers, this
|
|
is not usually a problem. However, if the numbers are large, and par-
|
|
ticularly if such repetitions are nested, the memory usage can become
|
|
an embarrassment. For example, the very simple pattern
|
|
|
|
((ab){1,1000}c){1,3}
|
|
|
|
uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
|
|
compiled with its default internal pointer size of two bytes, the size
|
|
limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
|
|
libraries, and this is reached with the above pattern if the outer rep-
|
|
etition is increased from 3 to 4. PCRE2 can be compiled to use larger
|
|
internal pointers and thus handle larger compiled patterns, but it is
|
|
better to try to rewrite your pattern to use less memory if you can.
|
|
|
|
One way of reducing the memory usage for such patterns is to make use
|
|
of PCRE2's "subroutine" facility. Re-writing the above pattern as
|
|
|
|
((ab)(?2){0,999}c)(?1){0,2}
|
|
|
|
reduces the memory requirements to around 16KiB, and indeed it remains
|
|
under 20KiB even with the outer repetition increased to 100. However,
|
|
this kind of pattern is not always exactly equivalent, because any cap-
|
|
tures within subroutine calls are lost when the subroutine completes.
|
|
If this is not a problem, this kind of rewriting will allow you to
|
|
process patterns that PCRE2 cannot otherwise handle. The matching per-
|
|
formance of the two different versions of the pattern are roughly the
|
|
same. (This applies from release 10.30 - things were different in ear-
|
|
lier releases.)
|
|
|
|
|
|
STACK AND HEAP USAGE AT RUN TIME
|
|
|
|
From release 10.30, the interpretive (non-JIT) version of pcre2_match()
|
|
uses very little system stack at run time. In earlier releases recur-
|
|
sive function calls could use a great deal of stack, and this could
|
|
cause problems, but this usage has been eliminated. Backtracking posi-
|
|
tions are now explicitly remembered in memory frames controlled by the
|
|
code. An initial 20KiB vector of frames is allocated on the system
|
|
stack (enough for about 100 frames for small patterns), but if this is
|
|
insufficient, heap memory is used. The amount of heap memory can be
|
|
limited; if the limit is set to zero, only the initial stack vector is
|
|
used. Rewriting patterns to be time-efficient, as described below, may
|
|
also reduce the memory requirements.
|
|
|
|
In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
|
|
function calls, but only for processing atomic groups, lookaround as-
|
|
sertions, and recursion within the pattern. The original version of the
|
|
code used to allocate quite large internal workspace vectors on the
|
|
stack, which caused some problems for some patterns in environments
|
|
with small stacks. From release 10.32 the code for pcre2_dfa_match()
|
|
has been re-factored to use heap memory when necessary for internal
|
|
workspace when recursing, though recursive function calls are still
|
|
used.
|
|
|
|
The "match depth" parameter can be used to limit the depth of function
|
|
recursion, and the "match heap" parameter to limit heap memory in
|
|
pcre2_dfa_match().
|
|
|
|
|
|
PROCESSING TIME
|
|
|
|
Certain items in regular expression patterns are processed more effi-
|
|
ciently than others. It is more efficient to use a character class like
|
|
[aeiou] than a set of single-character alternatives such as
|
|
(a|e|i|o|u). In general, the simplest construction that provides the
|
|
required behaviour is usually the most efficient. Jeffrey Friedl's book
|
|
contains a lot of useful general discussion about optimizing regular
|
|
expressions for efficient performance. This document contains a few ob-
|
|
servations about PCRE2.
|
|
|
|
Using Unicode character properties (the \p, \P, and \X escapes) is
|
|
slow, because PCRE2 has to use a multi-stage table lookup whenever it
|
|
needs a character's property. If you can find an alternative pattern
|
|
that does not use character properties, it will probably be faster.
|
|
|
|
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
|
character classes such as [:alpha:] do not use Unicode properties,
|
|
partly for backwards compatibility, and partly for performance reasons.
|
|
However, you can set the PCRE2_UCP option or start the pattern with
|
|
(*UCP) if you want Unicode character properties to be used. This can
|
|
double the matching time for items such as \d, when matched with
|
|
pcre2_match(); the performance loss is less with a DFA matching func-
|
|
tion, and in both cases there is not much difference for \b.
|
|
|
|
When a pattern begins with .* not in atomic parentheses, nor in paren-
|
|
theses that are the subject of a backreference, and the PCRE2_DOTALL
|
|
option is set, the pattern is implicitly anchored by PCRE2, since it
|
|
can match only at the start of a subject string. If the pattern has
|
|
multiple top-level branches, they must all be anchorable. The optimiza-
|
|
tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
|
|
tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
|
|
|
|
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
|
|
cause the dot metacharacter does not then match a newline, and if the
|
|
subject string contains newlines, the pattern may match from the char-
|
|
acter immediately following one of them instead of from the very start.
|
|
For example, the pattern
|
|
|
|
.*second
|
|
|
|
matches the subject "first\nand second" (where \n stands for a newline
|
|
character), with the match starting at the seventh character. In order
|
|
to do this, PCRE2 has to retry the match starting after every newline
|
|
in the subject.
|
|
|
|
If you are using such a pattern with subject strings that do not con-
|
|
tain newlines, the best performance is obtained by setting
|
|
PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex-
|
|
plicit anchoring. That saves PCRE2 from having to scan along the sub-
|
|
ject looking for a newline to restart at.
|
|
|
|
Beware of patterns that contain nested indefinite repeats. These can
|
|
take a long time to run when applied to a string that does not match.
|
|
Consider the pattern fragment
|
|
|
|
^(a+)*
|
|
|
|
This can match "aaaa" in 16 different ways, and this number increases
|
|
very rapidly as the string gets longer. (The * repeat can match 0, 1,
|
|
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
|
|
repeats can match different numbers of times.) When the remainder of
|
|
the pattern is such that the entire match is going to fail, PCRE2 has
|
|
in principle to try every possible variation, and this can take an ex-
|
|
tremely long time, even for relatively short strings.
|
|
|
|
An optimization catches some of the more simple cases such as
|
|
|
|
(a+)*b
|
|
|
|
where a literal character follows. Before embarking on the standard
|
|
matching procedure, PCRE2 checks that there is a "b" later in the sub-
|
|
ject string, and if there is not, it fails the match immediately. How-
|
|
ever, when there is no following literal this optimization cannot be
|
|
used. You can see the difference by comparing the behaviour of
|
|
|
|
(a+)*\d
|
|
|
|
with the pattern above. The former gives a failure almost instantly
|
|
when applied to a whole line of "a" characters, whereas the latter
|
|
takes an appreciable time with strings longer than about 20 characters.
|
|
|
|
In many cases, the solution to this kind of performance issue is to use
|
|
an atomic group or a possessive quantifier. This can often reduce mem-
|
|
ory requirements as well. As another example, consider this pattern:
|
|
|
|
([^<]|<(?!inet))+
|
|
|
|
It matches from wherever it starts until it encounters "<inet" or the
|
|
end of the data, and is the kind of pattern that might be used when
|
|
processing an XML file. Each iteration of the outer parentheses matches
|
|
either one character that is not "<" or a "<" that is not followed by
|
|
"inet". However, each time a parenthesis is processed, a backtracking
|
|
position is passed, so this formulation uses a memory frame for each
|
|
matched character. For a long string, a lot of memory is required. Con-
|
|
sider now this rewritten pattern, which matches exactly the same
|
|
strings:
|
|
|
|
([^<]++|<(?!inet))+
|
|
|
|
This runs much faster, because sequences of characters that do not con-
|
|
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
|
|
sessive quantifier is used to stop any backtracking into the runs of
|
|
non-"<" characters. This version also uses a lot less memory because
|
|
entry to a new set of parentheses happens only when a "<" character
|
|
that is not followed by "inet" is encountered (and we assume this is
|
|
relatively rare).
|
|
|
|
This example shows that one way of optimizing performance when matching
|
|
long subject strings is to write repeated parenthesized subpatterns to
|
|
match more than one character whenever possible.
|
|
|
|
SETTING RESOURCE LIMITS
|
|
|
|
You can set limits on the amount of processing that takes place when
|
|
matching, and on the amount of heap memory that is used. The default
|
|
values of the limits are very large, and unlikely ever to operate. They
|
|
can be changed when PCRE2 is built, and they can also be set when
|
|
pcre2_match() or pcre2_dfa_match() is called. For details of these in-
|
|
terfaces, see the pcre2build documentation and the section entitled
|
|
"The match context" in the pcre2api documentation.
|
|
|
|
The pcre2test test program has a modifier called "find_limits" which,
|
|
if applied to a subject line, causes it to find the smallest limits
|
|
that allow a pattern to match. This is done by repeatedly matching with
|
|
different limits.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 03 February 2019
|
|
Copyright (c) 1997-2019 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SYNOPSIS
|
|
|
|
#include <pcre2posix.h>
|
|
|
|
int pcre2_regcomp(regex_t *preg, const char *pattern,
|
|
int cflags);
|
|
|
|
int pcre2_regexec(const regex_t *preg, const char *string,
|
|
size_t nmatch, regmatch_t pmatch[], int eflags);
|
|
|
|
size_t pcre2_regerror(int errcode, const regex_t *preg,
|
|
char *errbuf, size_t errbuf_size);
|
|
|
|
void pcre2_regfree(regex_t *preg);
|
|
|
|
|
|
DESCRIPTION
|
|
|
|
This set of functions provides a POSIX-style API for the PCRE2 regular
|
|
expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
|
|
16-bit and 32-bit libraries. See the pcre2api documentation for a de-
|
|
scription of PCRE2's native API, which contains much additional func-
|
|
tionality.
|
|
|
|
The functions described here are wrapper functions that ultimately call
|
|
the PCRE2 native API. Their prototypes are defined in the pcre2posix.h
|
|
header file, and they all have unique names starting with pcre2_. How-
|
|
ever, the pcre2posix.h header also contains macro definitions that con-
|
|
vert the standard POSIX names such regcomp() into pcre2_regcomp() etc.
|
|
This means that a program can use the usual POSIX names without running
|
|
the risk of accidentally linking with POSIX functions from a different
|
|
library.
|
|
|
|
On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
|
|
so can be accessed by adding -lpcre2-posix to the command for linking
|
|
an application. Because the POSIX functions call the native ones, it is
|
|
also necessary to add -lpcre2-8.
|
|
|
|
Although they were not defined as protypes in pcre2posix.h, releases
|
|
10.33 to 10.36 of the library contained functions with the POSIX names
|
|
regcomp() etc. These simply passed their arguments to the PCRE2 func-
|
|
tions. These functions were provided for backwards compatibility with
|
|
earlier versions of PCRE2, which had only POSIX names. However, this
|
|
has proved troublesome in situations where a program links with several
|
|
libraries, some of which use PCRE2's POSIX interface while others use
|
|
the real POSIX functions. For this reason, the POSIX names have been
|
|
removed since release 10.37.
|
|
|
|
Calling the header file pcre2posix.h avoids any conflict with other
|
|
POSIX libraries. It can, of course, be renamed or aliased as regex.h,
|
|
which is the "correct" name, if there is no clash. It provides two
|
|
structure types, regex_t for compiled internal forms, and regmatch_t
|
|
for returning captured substrings. It also defines some constants whose
|
|
names start with "REG_"; these are used for setting options and identi-
|
|
fying error codes.
|
|
|
|
|
|
USING THE POSIX FUNCTIONS
|
|
|
|
Those POSIX option bits that can reasonably be mapped to PCRE2 native
|
|
options have been implemented. In addition, the option REG_EXTENDED is
|
|
defined with the value zero. This has no effect, but since programs
|
|
that are written to the POSIX interface often use it, this makes it
|
|
easier to slot in PCRE2 as a replacement library. Other POSIX options
|
|
are not even defined.
|
|
|
|
There are also some options that are not defined by POSIX. These have
|
|
been added at the request of users who want to make use of certain
|
|
PCRE2-specific features via the POSIX calling interface or to add BSD
|
|
or GNU functionality.
|
|
|
|
When PCRE2 is called via these functions, it is only the API that is
|
|
POSIX-like in style. The syntax and semantics of the regular expres-
|
|
sions themselves are still those of Perl, subject to the setting of
|
|
various PCRE2 options, as described below. "POSIX-like in style" means
|
|
that the API approximates to the POSIX definition; it is not fully
|
|
POSIX-compatible, and in multi-unit encoding domains it is probably
|
|
even less compatible.
|
|
|
|
The descriptions below use the actual names of the functions, but, as
|
|
described above, the standard POSIX names (without the pcre2_ prefix)
|
|
may also be used.
|
|
|
|
|
|
COMPILING A PATTERN
|
|
|
|
The function pcre2_regcomp() is called to compile a pattern into an in-
|
|
ternal form. By default, the pattern is a C string terminated by a bi-
|
|
nary zero (but see REG_PEND below). The preg argument is a pointer to a
|
|
regex_t structure that is used as a base for storing information about
|
|
the compiled regular expression. (It is also used for input when
|
|
REG_PEND is set.)
|
|
|
|
The argument cflags is either zero, or contains one or more of the bits
|
|
defined by the following macros:
|
|
|
|
REG_DOTALL
|
|
|
|
The PCRE2_DOTALL option is set when the regular expression is passed
|
|
for compilation to the native function. Note that REG_DOTALL is not
|
|
part of the POSIX standard.
|
|
|
|
REG_ICASE
|
|
|
|
The PCRE2_CASELESS option is set when the regular expression is passed
|
|
for compilation to the native function.
|
|
|
|
REG_NEWLINE
|
|
|
|
The PCRE2_MULTILINE option is set when the regular expression is passed
|
|
for compilation to the native function. Note that this does not mimic
|
|
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
|
|
tion).
|
|
|
|
REG_NOSPEC
|
|
|
|
The PCRE2_LITERAL option is set when the regular expression is passed
|
|
for compilation to the native function. This disables all meta charac-
|
|
ters in the pattern, causing it to be treated as a literal string. The
|
|
only other options that are allowed with REG_NOSPEC are REG_ICASE,
|
|
REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
|
|
the POSIX standard.
|
|
|
|
REG_NOSUB
|
|
|
|
When a pattern that is compiled with this flag is passed to
|
|
pcre2_regexec() for matching, the nmatch and pmatch arguments are ig-
|
|
nored, and no captured strings are returned. Versions of the PCRE li-
|
|
brary prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
|
|
tion, but this no longer happens because it disables the use of back-
|
|
references.
|
|
|
|
REG_PEND
|
|
|
|
If this option is set, the reg_endp field in the preg structure (which
|
|
has the type const char *) must be set to point to the character beyond
|
|
the end of the pattern before calling pcre2_regcomp(). The pattern it-
|
|
self may now contain binary zeros, which are treated as data charac-
|
|
ters. Without REG_PEND, a binary zero terminates the pattern and the
|
|
re_endp field is ignored. This is a GNU extension to the POSIX standard
|
|
and should be used with caution in software intended to be portable to
|
|
other systems.
|
|
|
|
REG_UCP
|
|
|
|
The PCRE2_UCP option is set when the regular expression is passed for
|
|
compilation to the native function. This causes PCRE2 to use Unicode
|
|
properties when matchine \d, \w, etc., instead of just recognizing
|
|
ASCII values. Note that REG_UCP is not part of the POSIX standard.
|
|
|
|
REG_UNGREEDY
|
|
|
|
The PCRE2_UNGREEDY option is set when the regular expression is passed
|
|
for compilation to the native function. Note that REG_UNGREEDY is not
|
|
part of the POSIX standard.
|
|
|
|
REG_UTF
|
|
|
|
The PCRE2_UTF option is set when the regular expression is passed for
|
|
compilation to the native function. This causes the pattern itself and
|
|
all data strings used for matching it to be treated as UTF-8 strings.
|
|
Note that REG_UTF is not part of the POSIX standard.
|
|
|
|
In the absence of these flags, no options are passed to the native
|
|
function. This means the the regex is compiled with PCRE2 default se-
|
|
mantics. In particular, the way it handles newline characters in the
|
|
subject string is the Perl way, not the POSIX way. Note that setting
|
|
PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
|
|
It does not affect the way newlines are matched by the dot metacharac-
|
|
ter (they are not) or by a negative class such as [^a] (they are).
|
|
|
|
The yield of pcre2_regcomp() is zero on success, and non-zero other-
|
|
wise. The preg structure is filled in on success, and one other member
|
|
of the structure (as well as re_endp) is public: re_nsub contains the
|
|
number of capturing subpatterns in the regular expression. Various er-
|
|
ror codes are defined in the header file.
|
|
|
|
NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
|
|
to use the contents of the preg structure. If, for example, you pass it
|
|
to pcre2_regexec(), the result is undefined and your program is likely
|
|
to crash.
|
|
|
|
|
|
MATCHING NEWLINE CHARACTERS
|
|
|
|
This area is not simple, because POSIX and Perl take different views of
|
|
things. It is not possible to get PCRE2 to obey POSIX semantics, but
|
|
then PCRE2 was never intended to be a POSIX engine. The following table
|
|
lists the different possibilities for matching newline characters in
|
|
Perl and PCRE2:
|
|
|
|
Default Change with
|
|
|
|
. matches newline no PCRE2_DOTALL
|
|
newline matches [^a] yes not changeable
|
|
$ matches \n at end yes PCRE2_DOLLAR_ENDONLY
|
|
$ matches \n in middle no PCRE2_MULTILINE
|
|
^ matches \n in middle no PCRE2_MULTILINE
|
|
|
|
This is the equivalent table for a POSIX-compatible pattern matcher:
|
|
|
|
Default Change with
|
|
|
|
. matches newline yes REG_NEWLINE
|
|
newline matches [^a] yes REG_NEWLINE
|
|
$ matches \n at end no REG_NEWLINE
|
|
$ matches \n in middle no REG_NEWLINE
|
|
^ matches \n in middle no REG_NEWLINE
|
|
|
|
This behaviour is not what happens when PCRE2 is called via its POSIX
|
|
API. By default, PCRE2's behaviour is the same as Perl's, except that
|
|
there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
|
|
and Perl, there is no way to stop newline from matching [^a].
|
|
|
|
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
|
|
and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
|
|
there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
|
|
tion. When using the POSIX API, passing REG_NEWLINE to PCRE2's
|
|
pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to
|
|
pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
|
|
pass PCRE2_DOLLAR_ENDONLY.
|
|
|
|
|
|
MATCHING A PATTERN
|
|
|
|
The function pcre2_regexec() is called to match a compiled pattern preg
|
|
against a given string, which is by default terminated by a zero byte
|
|
(but see REG_STARTEND below), subject to the options in eflags. These
|
|
can be:
|
|
|
|
REG_NOTBOL
|
|
|
|
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
|
|
ing function.
|
|
|
|
REG_NOTEMPTY
|
|
|
|
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
|
|
matching function. Note that REG_NOTEMPTY is not part of the POSIX
|
|
standard. However, setting this option can give more POSIX-like behav-
|
|
iour in some situations.
|
|
|
|
REG_NOTEOL
|
|
|
|
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
|
|
ing function.
|
|
|
|
REG_STARTEND
|
|
|
|
When this option is set, the subject string starts at string +
|
|
pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
|
|
point to the first character beyond the string. There may be binary ze-
|
|
ros within the subject string, and indeed, using REG_STARTEND is the
|
|
only way to pass a subject string that contains a binary zero.
|
|
|
|
Whatever the value of pmatch[0].rm_so, the offsets of the matched
|
|
string and any captured substrings are still given relative to the
|
|
start of string itself. (Before PCRE2 release 10.30 these were given
|
|
relative to string + pmatch[0].rm_so, but this differs from other im-
|
|
plementations.)
|
|
|
|
This is a BSD extension, compatible with but not specified by IEEE
|
|
Standard 1003.2 (POSIX.2), and should be used with caution in software
|
|
intended to be portable to other systems. Note that a non-zero rm_so
|
|
does not imply REG_NOTBOL; REG_STARTEND affects only the location and
|
|
length of the string, not how it is matched. Setting REG_STARTEND and
|
|
passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
|
|
returned.
|
|
|
|
If the pattern was compiled with the REG_NOSUB flag, no data about any
|
|
matched strings is returned. The nmatch and pmatch arguments of
|
|
pcre2_regexec() are ignored (except possibly as input for REG_STAR-
|
|
TEND).
|
|
|
|
The value of nmatch may be zero, and the value pmatch may be NULL (un-
|
|
less REG_STARTEND is set); in both these cases no data about any
|
|
matched strings is returned.
|
|
|
|
Otherwise, the portion of the string that was matched, and also any
|
|
captured substrings, are returned via the pmatch argument, which points
|
|
to an array of nmatch structures of type regmatch_t, containing the
|
|
members rm_so and rm_eo. These contain the byte offset to the first
|
|
character of each substring and the offset to the first character after
|
|
the end of each substring, respectively. The 0th element of the vector
|
|
relates to the entire portion of string that was matched; subsequent
|
|
elements relate to the capturing subpatterns of the regular expression.
|
|
Unused entries in the array have both structure members set to -1.
|
|
|
|
A successful match yields a zero return; various error codes are de-
|
|
fined in the header file, of which REG_NOMATCH is the "expected" fail-
|
|
ure code.
|
|
|
|
|
|
ERROR MESSAGES
|
|
|
|
The pcre2_regerror() function maps a non-zero errorcode from either
|
|
pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
|
|
not NULL, the error should have arisen from the use of that structure.
|
|
A message terminated by a binary zero is placed in errbuf. If the buf-
|
|
fer is too short, only the first errbuf_size - 1 characters of the er-
|
|
ror message are used. The yield of the function is the size of buffer
|
|
needed to hold the whole message, including the terminating zero. This
|
|
value is greater than errbuf_size if the message was truncated.
|
|
|
|
|
|
MEMORY USAGE
|
|
|
|
Compiling a regular expression causes memory to be allocated and asso-
|
|
ciated with the preg structure. The function pcre2_regfree() frees all
|
|
such memory, after which preg may no longer be used as a compiled ex-
|
|
pression.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 26 April 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 SAMPLE PROGRAM
|
|
|
|
A simple, complete demonstration program to get you started with using
|
|
PCRE2 is supplied in the file pcre2demo.c in the src directory in the
|
|
PCRE2 distribution. A listing of this program is given in the pcre2demo
|
|
documentation. If you do not have a copy of the PCRE2 distribution, you
|
|
can save this listing to re-create the contents of pcre2demo.c.
|
|
|
|
The demonstration program compiles the regular expression that is its
|
|
first argument, and matches it against the subject string in its second
|
|
argument. No PCRE2 options are set, and default character tables are
|
|
used. If matching succeeds, the program outputs the portion of the sub-
|
|
ject that matched, together with the contents of any captured sub-
|
|
strings.
|
|
|
|
If the -g option is given on the command line, the program then goes on
|
|
to check for further matches of the same regular expression in the same
|
|
subject string. The logic is a little bit tricky because of the possi-
|
|
bility of matching an empty string. Comments in the code explain what
|
|
is going on.
|
|
|
|
The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
|
|
library. It handles strings and characters that are stored in 8-bit
|
|
code units. By default, one character corresponds to one code unit,
|
|
but if the pattern starts with "(*UTF)", both it and the subject are
|
|
treated as UTF-8 strings, where characters may occupy multiple code
|
|
units.
|
|
|
|
If PCRE2 is installed in the standard include and library directories
|
|
for your operating system, you should be able to compile the demonstra-
|
|
tion program using a command like this:
|
|
|
|
cc -o pcre2demo pcre2demo.c -lpcre2-8
|
|
|
|
If PCRE2 is installed elsewhere, you may need to add additional options
|
|
to the command line. For example, on a Unix-like system that has PCRE2
|
|
installed in /usr/local, you can compile the demonstration program us-
|
|
ing a command like this:
|
|
|
|
cc -o pcre2demo -I/usr/local/include pcre2demo.c \
|
|
-L/usr/local/lib -lpcre2-8
|
|
|
|
Once you have built the demonstration program, you can run simple tests
|
|
like this:
|
|
|
|
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
|
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
|
|
|
Note that there is a much more comprehensive test program, called
|
|
pcre2test, which supports many more facilities for testing regular ex-
|
|
pressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
|
|
though not all three need be installed). The pcre2demo program is pro-
|
|
vided as a relatively simple coding example.
|
|
|
|
If you try to run pcre2demo when PCRE2 is not installed in the standard
|
|
library directory, you may get an error like this on some operating
|
|
systems (e.g. Solaris):
|
|
|
|
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
|
|
or directory
|
|
|
|
This is caused by the way shared library support works on those sys-
|
|
tems. You need to add
|
|
|
|
-R/usr/local/lib
|
|
|
|
(for example) to the compile command to get round this problem.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 02 February 2016
|
|
Copyright (c) 1997-2016 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
PCRE2SERIALIZE(3) Library Functions Manual PCRE2SERIALIZE(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
|
|
|
|
int32_t pcre2_serialize_decode(pcre2_code **codes,
|
|
int32_t number_of_codes, const uint8_t *bytes,
|
|
pcre2_general_context *gcontext);
|
|
|
|
int32_t pcre2_serialize_encode(const pcre2_code **codes,
|
|
int32_t number_of_codes, uint8_t **serialized_bytes,
|
|
PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
|
|
|
|
void pcre2_serialize_free(uint8_t *bytes);
|
|
|
|
int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
|
|
|
|
If you are running an application that uses a large number of regular
|
|
expression patterns, it may be useful to store them in a precompiled
|
|
form instead of having to compile them every time the application is
|
|
run. However, if you are using the just-in-time optimization feature,
|
|
it is not possible to save and reload the JIT data, because it is posi-
|
|
tion-dependent. The host on which the patterns are reloaded must be
|
|
running the same version of PCRE2, with the same code unit width, and
|
|
must also have the same endianness, pointer width and PCRE2_SIZE type.
|
|
For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
|
|
library cannot be reloaded on a 64-bit system, nor can they be reloaded
|
|
using the 8-bit library.
|
|
|
|
Note that "serialization" in PCRE2 does not convert compiled patterns
|
|
to an abstract format like Java or .NET serialization. The serialized
|
|
output is really just a bytecode dump, which is why it can only be
|
|
reloaded in the same environment as the one that created it. Hence the
|
|
restrictions mentioned above. Applications that are not statically
|
|
linked with a fixed version of PCRE2 must be prepared to recompile pat-
|
|
terns from their sources, in order to be immune to PCRE2 upgrades.
|
|
|
|
|
|
SECURITY CONCERNS
|
|
|
|
The facility for saving and restoring compiled patterns is intended for
|
|
use within individual applications. As such, the data supplied to
|
|
pcre2_serialize_decode() is expected to be trusted data, not data from
|
|
arbitrary external sources. There is only some simple consistency
|
|
checking, not complete validation of what is being re-loaded. Corrupted
|
|
data may cause undefined results. For example, if the length field of a
|
|
pattern in the serialized data is corrupted, the deserializing code may
|
|
read beyond the end of the byte stream that is passed to it.
|
|
|
|
|
|
SAVING COMPILED PATTERNS
|
|
|
|
Before compiled patterns can be saved they must be serialized, which in
|
|
PCRE2 means converting the pattern to a stream of bytes. A single byte
|
|
stream may contain any number of compiled patterns, but they must all
|
|
use the same character tables. A single copy of the tables is included
|
|
in the byte stream (its size is 1088 bytes). For more details of char-
|
|
acter tables, see the section on locale support in the pcre2api docu-
|
|
mentation.
|
|
|
|
The function pcre2_serialize_encode() creates a serialized byte stream
|
|
from a list of compiled patterns. Its first two arguments specify the
|
|
list, being a pointer to a vector of pointers to compiled patterns, and
|
|
the length of the vector. The third and fourth arguments point to vari-
|
|
ables which are set to point to the created byte stream and its length,
|
|
respectively. The final argument is a pointer to a general context,
|
|
which can be used to specify custom memory mangagement functions. If
|
|
this argument is NULL, malloc() is used to obtain memory for the byte
|
|
stream. The yield of the function is the number of serialized patterns,
|
|
or one of the following negative error codes:
|
|
|
|
PCRE2_ERROR_BADDATA the number of patterns is zero or less
|
|
PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns
|
|
PCRE2_ERROR_MEMORY memory allocation failed
|
|
PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
|
|
PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL
|
|
|
|
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
|
|
rupted, or that a slot in the vector does not point to a compiled pat-
|
|
tern.
|
|
|
|
Once a set of patterns has been serialized you can save the data in any
|
|
appropriate manner. Here is sample code that compiles two patterns and
|
|
writes them to a file. It assumes that the variable fd refers to a file
|
|
that is open for output. The error checking that should be present in a
|
|
real application has been omitted for simplicity.
|
|
|
|
int errorcode;
|
|
uint8_t *bytes;
|
|
PCRE2_SIZE erroroffset;
|
|
PCRE2_SIZE bytescount;
|
|
pcre2_code *list_of_codes[2];
|
|
list_of_codes[0] = pcre2_compile("first pattern",
|
|
PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
|
|
list_of_codes[1] = pcre2_compile("second pattern",
|
|
PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
|
|
errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
|
|
&bytescount, NULL);
|
|
errorcode = fwrite(bytes, 1, bytescount, fd);
|
|
|
|
Note that the serialized data is binary data that may contain any of
|
|
the 256 possible byte values. On systems that make a distinction be-
|
|
tween binary and non-binary data, be sure that the file is opened for
|
|
binary output.
|
|
|
|
Serializing a set of patterns leaves the original data untouched, so
|
|
they can still be used for matching. Their memory must eventually be
|
|
freed in the usual way by calling pcre2_code_free(). When you have fin-
|
|
ished with the byte stream, it too must be freed by calling pcre2_seri-
|
|
alize_free(). If this function is called with a NULL argument, it re-
|
|
turns immediately without doing anything.
|
|
|
|
|
|
RE-USING PRECOMPILED PATTERNS
|
|
|
|
In order to re-use a set of saved patterns you must first make the se-
|
|
rialized byte stream available in main memory (for example, by reading
|
|
from a file). The management of this memory block is up to the applica-
|
|
tion. You can use the pcre2_serialize_get_number_of_codes() function to
|
|
find out how many compiled patterns are in the serialized data without
|
|
actually decoding the patterns:
|
|
|
|
uint8_t *bytes = <serialized data>;
|
|
int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
|
|
|
|
The pcre2_serialize_decode() function reads a byte stream and recreates
|
|
the compiled patterns in new memory blocks, setting pointers to them in
|
|
a vector. The first two arguments are a pointer to a suitable vector
|
|
and its length, and the third argument points to a byte stream. The fi-
|
|
nal argument is a pointer to a general context, which can be used to
|
|
specify custom memory mangagement functions for the decoded patterns.
|
|
If this argument is NULL, malloc() and free() are used. After deserial-
|
|
ization, the byte stream is no longer needed and can be discarded.
|
|
|
|
pcre2_code *list_of_codes[2];
|
|
uint8_t *bytes = <serialized data>;
|
|
int32_t number_of_codes =
|
|
pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
|
|
|
|
If the vector is not large enough for all the patterns in the byte
|
|
stream, it is filled with those that fit, and the remainder are ig-
|
|
nored. The yield of the function is the number of decoded patterns, or
|
|
one of the following negative error codes:
|
|
|
|
PCRE2_ERROR_BADDATA second argument is zero or less
|
|
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
|
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
|
|
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
|
|
PCRE2_ERROR_MEMORY memory allocation failed
|
|
PCRE2_ERROR_NULL first or third argument is NULL
|
|
|
|
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
|
|
compiled on a system with different endianness.
|
|
|
|
Decoded patterns can be used for matching in the usual way, and must be
|
|
freed by calling pcre2_code_free(). However, be aware that there is a
|
|
potential race issue if you are using multiple patterns that were de-
|
|
coded from a single byte stream in a multithreaded application. A sin-
|
|
gle copy of the character tables is used by all the decoded patterns
|
|
and a reference count is used to arrange for its memory to be automati-
|
|
cally freed when the last pattern is freed, but there is no locking on
|
|
this reference count. Therefore, if you want to call pcre2_code_free()
|
|
for these patterns in different threads, you must arrange your own
|
|
locking, and ensure that pcre2_code_free() cannot be called by two
|
|
threads at the same time.
|
|
|
|
If a pattern was processed by pcre2_jit_compile() before being serial-
|
|
ized, the JIT data is discarded and so is no longer available after a
|
|
save/restore cycle. You can, however, process a restored pattern with
|
|
pcre2_jit_compile() if you wish.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 27 June 2018
|
|
Copyright (c) 1997-2018 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
|
|
|
PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
|
|
|
|
The full syntax and semantics of the regular expressions that are sup-
|
|
ported by PCRE2 are described in the pcre2pattern documentation. This
|
|
document contains a quick-reference summary of the syntax.
|
|
|
|
|
|
QUOTING
|
|
|
|
\x where x is non-alphanumeric is a literal x
|
|
\Q...\E treat enclosed characters as literal
|
|
|
|
|
|
ESCAPED CHARACTERS
|
|
|
|
This table applies to ASCII and Unicode environments. An unrecognized
|
|
escape sequence causes an error.
|
|
|
|
\a alarm, that is, the BEL character (hex 07)
|
|
\cx "control-x", where x is any ASCII printing character
|
|
\e escape (hex 1B)
|
|
\f form feed (hex 0C)
|
|
\n newline (hex 0A)
|
|
\r carriage return (hex 0D)
|
|
\t tab (hex 09)
|
|
\0dd character with octal code 0dd
|
|
\ddd character with octal code ddd, or backreference
|
|
\o{ddd..} character with octal code ddd..
|
|
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
|
|
\xhh character with hex code hh
|
|
\x{hh..} character with hex code hh..
|
|
|
|
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
|
|
following are also recognized:
|
|
|
|
\U the character "U"
|
|
\uhhhh character with hex code hhhh
|
|
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
|
|
|
|
When \x is not followed by {, from zero to two hexadecimal digits are
|
|
read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
|
|
its to be recognized as a hexadecimal escape; otherwise it matches a
|
|
literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
|
|
four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
|
|
digits in curly brackets, it matches a literal "u".
|
|
|
|
Note that \0dd is always an octal code. The treatment of backslash fol-
|
|
lowed by a non-zero digit is complicated; for details see the section
|
|
"Non-printing characters" in the pcre2pattern documentation, where de-
|
|
tails of escape processing in EBCDIC environments are also given.
|
|
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
|
|
EBCDIC environments. Note that \N not followed by an opening curly
|
|
bracket has a different meaning (see below).
|
|
|
|
|
|
CHARACTER TYPES
|
|
|
|
. any character except newline;
|
|
in dotall mode, any character whatsoever
|
|
\C one code unit, even in UTF mode (best avoided)
|
|
\d a decimal digit
|
|
\D a character that is not a decimal digit
|
|
\h a horizontal white space character
|
|
\H a character that is not a horizontal white space character
|
|
\N a character that is not a newline
|
|
\p{xx} a character with the xx property
|
|
\P{xx} a character without the xx property
|
|
\R a newline sequence
|
|
\s a white space character
|
|
\S a character that is not a white space character
|
|
\v a vertical white space character
|
|
\V a character that is not a vertical white space character
|
|
\w a "word" character
|
|
\W a "non-word" character
|
|
\X a Unicode extended grapheme cluster
|
|
|
|
\C is dangerous because it may leave the current matching point in the
|
|
middle of a UTF-8 or UTF-16 character. The application can lock out the
|
|
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
|
|
possible to build PCRE2 with the use of \C permanently disabled.
|
|
|
|
By default, \d, \s, and \w match only ASCII characters, even in UTF-8
|
|
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
|
|
matching is happening, \s and \w may also match characters with code
|
|
points in the range 128-255. If the PCRE2_UCP option is set, the behav-
|
|
iour of these escape sequences is changed to use Unicode properties and
|
|
they match many more characters.
|
|
|
|
Property descriptions in \p and \P are matched caselessly; hyphens, un-
|
|
derscores, and white space are ignored, in accordance with Unicode's
|
|
"loose matching" rules.
|
|
|
|
|
|
GENERAL CATEGORY PROPERTIES FOR \p and \P
|
|
|
|
C Other
|
|
Cc Control
|
|
Cf Format
|
|
Cn Unassigned
|
|
Co Private use
|
|
Cs Surrogate
|
|
|
|
L Letter
|
|
Ll Lower case letter
|
|
Lm Modifier letter
|
|
Lo Other letter
|
|
Lt Title case letter
|
|
Lu Upper case letter
|
|
Lc Ll, Lu, or Lt
|
|
L& Ll, Lu, or Lt
|
|
|
|
M Mark
|
|
Mc Spacing mark
|
|
Me Enclosing mark
|
|
Mn Non-spacing mark
|
|
|
|
N Number
|
|
Nd Decimal number
|
|
Nl Letter number
|
|
No Other number
|
|
|
|
P Punctuation
|
|
Pc Connector punctuation
|
|
Pd Dash punctuation
|
|
Pe Close punctuation
|
|
Pf Final punctuation
|
|
Pi Initial punctuation
|
|
Po Other punctuation
|
|
Ps Open punctuation
|
|
|
|
S Symbol
|
|
Sc Currency symbol
|
|
Sk Modifier symbol
|
|
Sm Mathematical symbol
|
|
So Other symbol
|
|
|
|
Z Separator
|
|
Zl Line separator
|
|
Zp Paragraph separator
|
|
Zs Space separator
|
|
|
|
|
|
PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
|
|
|
Xan Alphanumeric: union of properties L and N
|
|
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
|
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
|
Xuc Univerally-named character: one that can be
|
|
represented by a Universal Character Name
|
|
Xwd Perl word: property Xan or underscore
|
|
|
|
Perl and POSIX space are now the same. Perl added VT to its space char-
|
|
acter set at release 5.18.
|
|
|
|
|
|
SCRIPT MATCHING WITH \p AND \P
|
|
|
|
The following script names and their 4-letter abbreviations are recog-
|
|
nized in \p{sc:...} or \p{scx:...} items, or on their own with \p (and
|
|
also \P of course):
|
|
|
|
Adlam (Adlm), Ahom (Ahom), Anatolian_Hieroglyphs (Hluw), Arabic (Arab),
|
|
Armenian (Armn), Avestan (Avst), Balinese (Bali), Bamum (Bamu),
|
|
Bassa_Vah (Bass), Batak (Batk), Bengali (Beng), Bhaiksuki (Bhks), Bopo-
|
|
mofo (Bopo), Brahmi (Brah), Braille (Brai), Buginese (Bugi), Buhid
|
|
(Buhd), Canadian_Aboriginal (Cans), Carian (Cari), Caucasian_Albanian
|
|
(Aghb), Chakma (Cakm), Cham (Cham), Cherokee (Cher), Chorasmian (Chrs),
|
|
Common (Zyyy), Coptic (Copt), Cuneiform (Xsux), Cypriot (Cprt),
|
|
Cypro_Minoan (Cpmn), Cyrillic (Cyrl), Deseret (Dsrt), Devanagari
|
|
(Deva), Dives_Akuru (Diak), Dogra (Dogr), Duployan (Dupl), Egyptian_Hi-
|
|
eroglyphs (Egyp), Elbasan (Elba), Elymaic (Elym), Ethiopic (Ethi),
|
|
Georgian (Geor), Glagolitic (Glag), Gothic (Goth), Grantha (Gran),
|
|
Greek (Grek), Gujarati (Gujr), Gunjala_Gondi (Gong), Gurmukhi (Guru),
|
|
Han (Hani), Hangul (Hang), Hanifi_Rohingya (Rohg), Hanunoo (Hano), Ha-
|
|
tran (Hatr), Hebrew (Hebr), Hiragana (Hira), Imperial_Aramaic (Armi),
|
|
Inherited (Zinh), Inscriptional_Pahlavi (Phli), Inscriptional_Parthian
|
|
(Prti), Javanese (Java), Kaithi (Kthi), Kannada (Knda), Katakana
|
|
(Kana), Kayah_Li (Kali), Kharoshthi (Khar), Khitan_Small_Script (Kits),
|
|
Khmer (Khmr), Khojki (Khoj), Khudawadi (Sind), Lao (Laoo), Latin
|
|
(Latn), Lepcha (Lepc), Limbu (Limb), Linear_A (Lina), Linear_B (Linb),
|
|
Lisu (Lisu), Lycian (Lyci), Lydian (Lydi), Mahajani (Majh), Makasar
|
|
(Maka), Malayalam (Mlym), Mandaic (Mand), Manichaean (Mani), Marchen
|
|
(Marc), Masaram_Gondi (Gonm), Medefaidrin (Medf), Meetei_Mayek (Mtei),
|
|
Mende_Kikakui (Mend), Meroitic_Cursive (Merc), Meroitic_Hieroglyphs
|
|
(Mero), Miao (Miao), Modi (Modi), Mongolian (Mong), Mro (Mroo), Multani
|
|
(Mult), Myanmar (Mymr), Nabataean (Nbar), Nandinagari (Nand),
|
|
New_Tai_Lue (Talu), Newa (Newa), Nko (Nkoo), Nushu (Nshu), Nyiak-
|
|
eng_Puachue_Hmong (Hmnp), Ogham (Ogam), Ol_Chiki (Olck), Old_Hungarian
|
|
(Hung), Old_Italic (Olck), Old_North_Arabian (Narb), Old_Permic (Perm),
|
|
Old_Persian (Orkh), Old_Sogdian (Sogo), Old_South_Arabian (Sarb),
|
|
Old_Turkic (Orkh), Old_Uyghur (Ougr), Oriya (Orya), Osage (Osge), Os-
|
|
manya (Osma), Pahawh_Hmong (Hmng), Palmyrene (Palm), Pau_Cin_Hau
|
|
(Pauc), Phags_Pa (Phag), Phoenician (Phnx), Psalter_Pahlavi (Phli), Re-
|
|
jang (Rjng), Runic (Runr), Samaritan (Samr), Saurashtra (Saur), Sharada
|
|
(Shrd), Shavian (Shaw), Siddham (Sidd), SignWriting (Sgnw), Sinhala
|
|
(Sinh), Sogdian (Sogd), Sora_Sompeng (Sora), Soyombo (Soyo), Sundanese
|
|
(Sund), Syloti_Nagri (Sylo), Syriac (Syrc), Tagalog (Tglg), Tagbanwa
|
|
(Tagb), Tai_Le (Tale), Tai_Tham (Lana), Tai_Viet (Tavt), Takri (Takr),
|
|
Tamil (Taml), Tangsa (Tngs), Tangut (Tang), Telugu (Telu), Thaana
|
|
(Thaa), Thai (Thai), Tibetan (Tibt), Tifinagh (Tfng), Tirhuta (Tirh),
|
|
Toto (Toto), Ugaritic (Ugar), Vai (Vaii), Vithkuqi (Vith), Wancho
|
|
(Wcho), Warang_Citi (Wara), Yezidi (Yezi), Yi (Yiii), Zanabazar_Square
|
|
(Zanb).
|
|
|
|
|
|
BIDI_PROPERTIES FOR \p AND \P
|
|
|
|
\p{Bidi_Control} matches a Bidi control character
|
|
\p{Bidi_C} matches a Bidi control character
|
|
\p{Bidi_Class:<class>} matches a character with the given class
|
|
\p{BC:<class>} matches a character with the given class
|
|
|
|
The recognized classes are:
|
|
|
|
AL Arabic letter
|
|
AN Arabic number
|
|
B paragraph separator
|
|
BN boundary neutral
|
|
CS common separator
|
|
EN European number
|
|
ES European separator
|
|
ET European terminator
|
|
FSI first strong isolate
|
|
L left-to-right
|
|
LRE left-to-right embedding
|
|
LRI left-to-right isolate
|
|
LRO left-to-right override
|
|
NSM non-spacing mark
|
|
ON other neutral
|
|
PDF pop directional format
|
|
PDI pop directional isolate
|
|
R right-to-left
|
|
RLE right-to-left embedding
|
|
RLI right-to-left isolate
|
|
RLO right-to-left override
|
|
S segment separator
|
|
WS which space
|
|
|
|
|
|
CHARACTER CLASSES
|
|
|
|
[...] positive character class
|
|
[^...] negative character class
|
|
[x-y] range (can be used for hex characters)
|
|
[[:xxx:]] positive POSIX named set
|
|
[[:^xxx:]] negative POSIX named set
|
|
|
|
alnum alphanumeric
|
|
alpha alphabetic
|
|
ascii 0-127
|
|
blank space or tab
|
|
cntrl control character
|
|
digit decimal digit
|
|
graph printing, excluding space
|
|
lower lower case letter
|
|
print printing, including space
|
|
punct printing, excluding alphanumeric
|
|
space white space
|
|
upper upper case letter
|
|
word same as \w
|
|
xdigit hexadecimal digit
|
|
|
|
In PCRE2, POSIX character set names recognize only ASCII characters by
|
|
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
|
You can use \Q...\E inside a character class.
|
|
|
|
|
|
QUANTIFIERS
|
|
|
|
? 0 or 1, greedy
|
|
?+ 0 or 1, possessive
|
|
?? 0 or 1, lazy
|
|
* 0 or more, greedy
|
|
*+ 0 or more, possessive
|
|
*? 0 or more, lazy
|
|
+ 1 or more, greedy
|
|
++ 1 or more, possessive
|
|
+? 1 or more, lazy
|
|
{n} exactly n
|
|
{n,m} at least n, no more than m, greedy
|
|
{n,m}+ at least n, no more than m, possessive
|
|
{n,m}? at least n, no more than m, lazy
|
|
{n,} n or more, greedy
|
|
{n,}+ n or more, possessive
|
|
{n,}? n or more, lazy
|
|
|
|
|
|
ANCHORS AND SIMPLE ASSERTIONS
|
|
|
|
\b word boundary
|
|
\B not a word boundary
|
|
^ start of subject
|
|
also after an internal newline in multiline mode
|
|
(after any newline if PCRE2_ALT_CIRCUMFLEX is set)
|
|
\A start of subject
|
|
$ end of subject
|
|
also before newline at end of subject
|
|
also before internal newline in multiline mode
|
|
\Z end of subject
|
|
also before newline at end of subject
|
|
\z end of subject
|
|
\G first matching position in subject
|
|
|
|
|
|
REPORTED MATCH POINT SETTING
|
|
|
|
\K set reported start of match
|
|
|
|
From release 10.38 \K is not permitted by default in lookaround asser-
|
|
tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
|
|
LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
|
|
When this option is set, \K is honoured in positive assertions, but ig-
|
|
nored in negative ones.
|
|
|
|
|
|
ALTERNATION
|
|
|
|
expr|expr|expr...
|
|
|
|
|
|
CAPTURING
|
|
|
|
(...) capture group
|
|
(?<name>...) named capture group (Perl)
|
|
(?'name'...) named capture group (Perl)
|
|
(?P<name>...) named capture group (Python)
|
|
(?:...) non-capture group
|
|
(?|...) non-capture group; reset group numbers for
|
|
capture groups in each alternative
|
|
|
|
In non-UTF modes, names may contain underscores and ASCII letters and
|
|
digits; in UTF modes, any Unicode letters and Unicode decimal digits
|
|
are permitted. In both cases, a name must not start with a digit.
|
|
|
|
|
|
ATOMIC GROUPS
|
|
|
|
(?>...) atomic non-capture group
|
|
(*atomic:...) atomic non-capture group
|
|
|
|
|
|
COMMENT
|
|
|
|
(?#....) comment (not nestable)
|
|
|
|
|
|
OPTION SETTING
|
|
Changes of these options within a group are automatically cancelled at
|
|
the end of the group.
|
|
|
|
(?i) caseless
|
|
(?J) allow duplicate named groups
|
|
(?m) multiline
|
|
(?n) no auto capture
|
|
(?s) single line (dotall)
|
|
(?U) default ungreedy (lazy)
|
|
(?x) extended: ignore white space except in classes
|
|
(?xx) as (?x) but also ignore space and tab in classes
|
|
(?-...) unset option(s)
|
|
(?^) unset imnsx options
|
|
|
|
Unsetting x or xx unsets both. Several options may be set at once, and
|
|
a mixture of setting and unsetting such as (?i-x) is allowed, but there
|
|
may be only one hyphen. Setting (but no unsetting) is allowed after (?^
|
|
for example (?^in). An option setting may appear at the start of a non-
|
|
capture group, for example (?i:...).
|
|
|
|
The following are recognized only at the very start of a pattern or af-
|
|
ter one of the newline or \R options with similar syntax. More than one
|
|
of them may appear. For the first three, d is a decimal number.
|
|
|
|
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
|
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
|
|
(*LIMIT_MATCH=d) set the match limit to d
|
|
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
|
|
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
|
|
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
|
|
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
|
|
(*NO_JIT) disable JIT optimization
|
|
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
|
|
(*UTF) set appropriate UTF mode for the library in use
|
|
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
|
|
|
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
|
value of the limits set by the caller of pcre2_match() or
|
|
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
|
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
|
|
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
|
respectively, at compile time.
|
|
|
|
|
|
NEWLINE CONVENTION
|
|
|
|
These are recognized only at the very start of the pattern or after op-
|
|
tion settings with a similar syntax.
|
|
|
|
(*CR) carriage return only
|
|
(*LF) linefeed only
|
|
(*CRLF) carriage return followed by linefeed
|
|
(*ANYCRLF) all three of the above
|
|
(*ANY) any Unicode newline sequence
|
|
(*NUL) the NUL character (binary zero)
|
|
|
|
|
|
WHAT \R MATCHES
|
|
|
|
These are recognized only at the very start of the pattern or after op-
|
|
tion setting with a similar syntax.
|
|
|
|
(*BSR_ANYCRLF) CR, LF, or CRLF
|
|
(*BSR_UNICODE) any Unicode newline sequence
|
|
|
|
|
|
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
|
|
|
|
(?=...) )
|
|
(*pla:...) ) positive lookahead
|
|
(*positive_lookahead:...) )
|
|
|
|
(?!...) )
|
|
(*nla:...) ) negative lookahead
|
|
(*negative_lookahead:...) )
|
|
|
|
(?<=...) )
|
|
(*plb:...) ) positive lookbehind
|
|
(*positive_lookbehind:...) )
|
|
|
|
(?<!...) )
|
|
(*nlb:...) ) negative lookbehind
|
|
(*negative_lookbehind:...) )
|
|
|
|
Each top-level branch of a lookbehind must be of a fixed length.
|
|
|
|
|
|
NON-ATOMIC LOOKAROUND ASSERTIONS
|
|
|
|
These assertions are specific to PCRE2 and are not Perl-compatible.
|
|
|
|
(?*...) )
|
|
(*napla:...) ) synonyms
|
|
(*non_atomic_positive_lookahead:...) )
|
|
|
|
(?<*...) )
|
|
(*naplb:...) ) synonyms
|
|
(*non_atomic_positive_lookbehind:...) )
|
|
|
|
|
|
SCRIPT RUNS
|
|
|
|
(*script_run:...) ) script run, can be backtracked into
|
|
(*sr:...) )
|
|
|
|
(*atomic_script_run:...) ) atomic script run
|
|
(*asr:...) )
|
|
|
|
|
|
BACKREFERENCES
|
|
|
|
\n reference by number (can be ambiguous)
|
|
\gn reference by number
|
|
\g{n} reference by number
|
|
\g+n relative reference by number (PCRE2 extension)
|
|
\g-n relative reference by number
|
|
\g{+n} relative reference by number (PCRE2 extension)
|
|
\g{-n} relative reference by number
|
|
\k<name> reference by name (Perl)
|
|
\k'name' reference by name (Perl)
|
|
\g{name} reference by name (Perl)
|
|
\k{name} reference by name (.NET)
|
|
(?P=name) reference by name (Python)
|
|
|
|
|
|
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
|
|
|
|
(?R) recurse whole pattern
|
|
(?n) call subroutine by absolute number
|
|
(?+n) call subroutine by relative number
|
|
(?-n) call subroutine by relative number
|
|
(?&name) call subroutine by name (Perl)
|
|
(?P>name) call subroutine by name (Python)
|
|
\g<name> call subroutine by name (Oniguruma)
|
|
\g'name' call subroutine by name (Oniguruma)
|
|
\g<n> call subroutine by absolute number (Oniguruma)
|
|
\g'n' call subroutine by absolute number (Oniguruma)
|
|
\g<+n> call subroutine by relative number (PCRE2 extension)
|
|
\g'+n' call subroutine by relative number (PCRE2 extension)
|
|
\g<-n> call subroutine by relative number (PCRE2 extension)
|
|
\g'-n' call subroutine by relative number (PCRE2 extension)
|
|
|
|
|
|
CONDITIONAL PATTERNS
|
|
|
|
(?(condition)yes-pattern)
|
|
(?(condition)yes-pattern|no-pattern)
|
|
|
|
(?(n) absolute reference condition
|
|
(?(+n) relative reference condition
|
|
(?(-n) relative reference condition
|
|
(?(<name>) named reference condition (Perl)
|
|
(?('name') named reference condition (Perl)
|
|
(?(name) named reference condition (PCRE2, deprecated)
|
|
(?(R) overall recursion condition
|
|
(?(Rn) specific numbered group recursion condition
|
|
(?(R&name) specific named group recursion condition
|
|
(?(DEFINE) define groups for reference
|
|
(?(VERSION[>]=n.m) test PCRE2 version
|
|
(?(assert) assertion condition
|
|
|
|
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
|
conditions or recursion tests. Such a condition is interpreted as a
|
|
reference condition if the relevant named group exists.
|
|
|
|
|
|
BACKTRACKING CONTROL
|
|
|
|
All backtracking control verbs may be in the form (*VERB:NAME). For
|
|
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
|
changes its behaviour if :NAME is present. The others just set a name
|
|
for passing back to the caller, but this is not a name that (*SKIP) can
|
|
see. The following act immediately they are reached:
|
|
|
|
(*ACCEPT) force successful match
|
|
(*FAIL) force backtrack; synonym (*F)
|
|
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
|
|
|
The following act only when a subsequent match failure causes a back-
|
|
track to reach them. They all force a match failure, but they differ in
|
|
what happens afterwards. Those that advance the start-of-match point do
|
|
so only if the pattern is not anchored.
|
|
|
|
(*COMMIT) overall failure, no advance of starting point
|
|
(*PRUNE) advance to next starting character
|
|
(*SKIP) advance to current matching position
|
|
(*SKIP:NAME) advance to position corresponding to an earlier
|
|
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
|
(*THEN) local failure, backtrack to next alternation
|
|
|
|
The effect of one of these verbs in a group called as a subroutine is
|
|
confined to the subroutine call.
|
|
|
|
|
|
CALLOUTS
|
|
|
|
(?C) callout (assumed number 0)
|
|
(?Cn) callout with numerical data n
|
|
(?C"text") callout with string data
|
|
|
|
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
|
|
the start and the end), and the starting delimiter { matched with the
|
|
ending delimiter }. To encode the ending delimiter within the string,
|
|
double it.
|
|
|
|
|
|
SEE ALSO
|
|
|
|
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
|
pcre2(3).
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 28 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|
|
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
|
|
|
|
|
|
|
NAME
|
|
PCRE - Perl-compatible regular expressions (revised API)
|
|
|
|
UNICODE AND UTF SUPPORT
|
|
|
|
PCRE2 is normally built with Unicode support, though if you do not need
|
|
it, you can build it without, in which case the library will be
|
|
smaller. With Unicode support, PCRE2 has knowledge of Unicode character
|
|
properties and can process strings of text in UTF-8, UTF-16, and UTF-32
|
|
format (depending on the code unit width), but this is not the default.
|
|
Unless specifically requested, PCRE2 treats each code unit in a string
|
|
as one character.
|
|
|
|
There are two ways of telling PCRE2 to switch to UTF mode, where char-
|
|
acters may consist of more than one code unit and the range of values
|
|
is constrained. The program can call pcre2_compile() with the PCRE2_UTF
|
|
option, or the pattern may start with the sequence (*UTF). However,
|
|
the latter facility can be locked out by the PCRE2_NEVER_UTF option.
|
|
That is, the programmer can prevent the supplier of the pattern from
|
|
switching to UTF mode.
|
|
|
|
Note that the PCRE2_MATCH_INVALID_UTF option (see below) forces
|
|
PCRE2_UTF to be set.
|
|
|
|
In UTF mode, both the pattern and any subject strings that are matched
|
|
against it are treated as UTF strings instead of strings of individual
|
|
one-code-unit characters. There are also some other changes to the way
|
|
characters are handled, as documented below.
|
|
|
|
|
|
UNICODE PROPERTY SUPPORT
|
|
|
|
When PCRE2 is built with Unicode support, the escape sequences \p{..},
|
|
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
|
|
ting. The Unicode properties that can be tested are a subset of those
|
|
that Perl supports. Currently they are limited to the general category
|
|
properties such as Lu for an upper case letter or Nd for a decimal num-
|
|
ber, the Unicode script names such as Arabic or Han, Bidi_Class,
|
|
Bidi_Control, and the derived properties Any and LC (synonym L&). Full
|
|
lists are given in the pcre2pattern and pcre2syntax documentation. In
|
|
general, only the short names for properties are supported. For exam-
|
|
ple, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
|
|
supported. Furthermore, in Perl, many properties may optionally be pre-
|
|
fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support
|
|
this.
|
|
|
|
|
|
WIDE CHARACTERS AND UTF MODES
|
|
|
|
Code points less than 256 can be specified in patterns by either braced
|
|
or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
|
|
Larger values have to use braced sequences. Unbraced octal code points
|
|
up to \777 are also recognized; larger ones can be coded using \o{...}.
|
|
|
|
The escape sequence \N{U+<hex digits>} is recognized as another way of
|
|
specifying a Unicode character by code point in a UTF mode. It is not
|
|
allowed in non-UTF mode.
|
|
|
|
In UTF mode, repeat quantifiers apply to complete UTF characters, not
|
|
to individual code units.
|
|
|
|
In UTF mode, the dot metacharacter matches one UTF character instead of
|
|
a single code unit.
|
|
|
|
In UTF mode, capture group names are not restricted to ASCII, and may
|
|
contain any Unicode letters and decimal digits, as well as underscore.
|
|
|
|
The escape sequence \C can be used to match a single code unit in UTF
|
|
mode, but its use can lead to some strange effects because it breaks up
|
|
multi-unit characters (see the description of \C in the pcre2pattern
|
|
documentation). For this reason, there is a build-time option that dis-
|
|
ables support for \C completely. There is also a less draconian com-
|
|
pile-time option for locking out the use of \C when a pattern is com-
|
|
piled.
|
|
|
|
The use of \C is not supported by the alternative matching function
|
|
pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
|
|
ter may consist of more than one code unit. The use of \C in these
|
|
modes provokes a match-time error. Also, the JIT optimization does not
|
|
support \C in these modes. If JIT optimization is requested for a UTF-8
|
|
or UTF-16 pattern that contains \C, it will not succeed, and so when
|
|
pcre2_match() is called, the matching will be carried out by the inter-
|
|
pretive function.
|
|
|
|
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
|
|
characters of any code value, but, by default, the characters that
|
|
PCRE2 recognizes as digits, spaces, or word characters remain the same
|
|
set as in non-UTF mode, all with code points less than 256. This re-
|
|
mains true even when PCRE2 is built to include Unicode support, because
|
|
to do otherwise would slow down matching in many common cases. Note
|
|
that this also applies to \b and \B, because they are defined in terms
|
|
of \w and \W. If you want to test for a wider sense of, say, "digit",
|
|
you can use explicit Unicode property tests such as \p{Nd}. Alterna-
|
|
tively, if you set the PCRE2_UCP option, the way that the character es-
|
|
capes work is changed so that Unicode properties are used to determine
|
|
which characters match. There are more details in the section on
|
|
generic character types in the pcre2pattern documentation.
|
|
|
|
Similarly, characters that match the POSIX named character classes are
|
|
all low-valued characters, unless the PCRE2_UCP option is set.
|
|
|
|
However, the special horizontal and vertical white space matching es-
|
|
capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
|
|
ters, whether or not PCRE2_UCP is set.
|
|
|
|
|
|
UNICODE CASE-EQUIVALENCE
|
|
|
|
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
|
|
makes use of Unicode properties except for characters whose code points
|
|
are less than 128 and that have at most two case-equivalent values. For
|
|
these, a direct table lookup is used for speed. A few Unicode charac-
|
|
ters such as Greek sigma have more than two code points that are case-
|
|
equivalent, and these are treated specially. Setting PCRE2_UCP without
|
|
PCRE2_UTF allows Unicode-style case processing for non-UTF character
|
|
encodings such as UCS-2.
|
|
|
|
|
|
SCRIPT RUNS
|
|
|
|
The pattern constructs (*script_run:...) and (*atomic_script_run:...),
|
|
with synonyms (*sr:...) and (*asr:...), verify that the string matched
|
|
within the parentheses is a script run. In concept, a script run is a
|
|
sequence of characters that are all from the same Unicode script. How-
|
|
ever, because some scripts are commonly used together, and because some
|
|
diacritical and other marks are used with multiple scripts, it is not
|
|
that simple.
|
|
|
|
Every Unicode character has a Script property, mostly with a value cor-
|
|
responding to the name of a script, such as Latin, Greek, or Cyrillic.
|
|
There are also three special values:
|
|
|
|
"Unknown" is used for code points that have not been assigned, and also
|
|
for the surrogate code points. In the PCRE2 32-bit library, characters
|
|
whose code points are greater than the Unicode maximum (U+10FFFF),
|
|
which are accessible only in non-UTF mode, are assigned the Unknown
|
|
script.
|
|
|
|
"Common" is used for characters that are used with many scripts. These
|
|
include punctuation, emoji, mathematical, musical, and currency sym-
|
|
bols, and the ASCII digits 0 to 9.
|
|
|
|
"Inherited" is used for characters such as diacritical marks that mod-
|
|
ify a previous character. These are considered to take on the script of
|
|
the character that they modify.
|
|
|
|
Some Inherited characters are used with many scripts, but many of them
|
|
are only normally used with a small number of scripts. For example,
|
|
U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
|
|
tic. In order to make it possible to check this, a Unicode property
|
|
called Script Extension exists. Its value is a list of scripts that ap-
|
|
ply to the character. For the majority of characters, the list contains
|
|
just one script, the same one as the Script property. However, for
|
|
characters such as U+102E0 more than one Script is listed. There are
|
|
also some Common characters that have a single, non-Common script in
|
|
their Script Extension list.
|
|
|
|
The next section describes the basic rules for deciding whether a given
|
|
string of characters is a script run. Note, however, that there are
|
|
some special cases involving the Chinese Han script, and an additional
|
|
constraint for decimal digits. These are covered in subsequent sec-
|
|
tions.
|
|
|
|
Basic script run rules
|
|
|
|
A string that is less than two characters long is a script run. This is
|
|
the only case in which an Unknown character can be part of a script
|
|
run. Longer strings are checked using only the Script Extensions prop-
|
|
erty, not the basic Script property.
|
|
|
|
If a character's Script Extension property is the single value "Inher-
|
|
ited", it is always accepted as part of a script run. This is also true
|
|
for the property "Common", subject to the checking of decimal digits
|
|
described below. All the remaining characters in a script run must have
|
|
at least one script in common in their Script Extension lists. In set-
|
|
theoretic terminology, the intersection of all the sets of scripts must
|
|
not be empty.
|
|
|
|
A simple example is an Internet name such as "google.com". The letters
|
|
are all in the Latin script, and the dot is Common, so this string is a
|
|
script run. However, the Cyrillic letter "o" looks exactly the same as
|
|
the Latin "o"; a string that looks the same, but with Cyrillic "o"s is
|
|
not a script run.
|
|
|
|
More interesting examples involve characters with more than one script
|
|
in their Script Extension. Consider the following characters:
|
|
|
|
U+060C Arabic comma
|
|
U+06D4 Arabic full stop
|
|
|
|
The first has the Script Extension list Arabic, Hanifi Rohingya, Syr-
|
|
iac, and Thaana; the second has just Arabic and Hanifi Rohingya. Both
|
|
of them could appear in script runs of either Arabic or Hanifi Ro-
|
|
hingya. The first could also appear in Syriac or Thaana script runs,
|
|
but the second could not.
|
|
|
|
The Chinese Han script
|
|
|
|
The Chinese Han script is commonly used in conjunction with other
|
|
scripts for writing certain languages. Japanese uses the Hiragana and
|
|
Katakana scripts together with Han; Korean uses Hangul and Han; Tai-
|
|
wanese Mandarin uses Bopomofo and Han. These three combinations are
|
|
treated as special cases when checking script runs and are, in effect,
|
|
"virtual scripts". Thus, a script run may contain a mixture of Hira-
|
|
gana, Katakana, and Han, or a mixture of Hangul and Han, or a mixture
|
|
of Bopomofo and Han, but not, for example, a mixture of Hangul and
|
|
Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
|
|
dard 39 ("Unicode Security Mechanisms", http://unicode.org/re-
|
|
ports/tr39/) in allowing such mixtures.
|
|
|
|
Decimal digits
|
|
|
|
Unicode contains many sets of 10 decimal digits in different scripts,
|
|
and some scripts (including the Common script) contain more than one
|
|
set. Some of these decimal digits them are visually indistinguishable
|
|
from the common ASCII digits. In addition to the script checking de-
|
|
scribed above, if a script run contains any decimal digits, they must
|
|
all come from the same set of 10 adjacent characters.
|
|
|
|
|
|
VALIDITY OF UTF STRINGS
|
|
|
|
When the PCRE2_UTF option is set, the strings passed as patterns and
|
|
subjects are (by default) checked for validity on entry to the relevant
|
|
functions. If an invalid UTF string is passed, a negative error code is
|
|
returned. The code unit offset to the offending character can be ex-
|
|
tracted from the match data block by calling pcre2_get_startchar(),
|
|
which is used for this purpose after a UTF error.
|
|
|
|
In some situations, you may already know that your strings are valid,
|
|
and therefore want to skip these checks in order to improve perfor-
|
|
mance, for example in the case of a long subject string that is being
|
|
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
|
pile time or at match time, PCRE2 assumes that the pattern or subject
|
|
it is given (respectively) contains only valid UTF code unit sequences.
|
|
|
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
|
result is undefined and your program may crash or loop indefinitely or
|
|
give incorrect results. There is, however, one mode of matching that
|
|
can handle invalid UTF subject strings. This is enabled by passing
|
|
PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed below in
|
|
the next section. The rest of this section covers the case when
|
|
PCRE2_MATCH_INVALID_UTF is not set.
|
|
|
|
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the UTF
|
|
check for the pattern; it does not also apply to subject strings. If
|
|
you want to disable the check for a subject string you must pass this
|
|
same option to pcre2_match() or pcre2_dfa_match().
|
|
|
|
UTF-16 and UTF-32 strings can indicate their endianness by special code
|
|
knows as a byte-order mark (BOM). The PCRE2 functions do not handle
|
|
this, expecting strings to be in host byte order.
|
|
|
|
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
|
|
other processing takes place. In the case of pcre2_match() and
|
|
pcre2_dfa_match() calls with a non-zero starting offset, the check is
|
|
applied only to that part of the subject that could be inspected during
|
|
matching, and there is a check that the starting offset points to the
|
|
first code unit of a character or to the end of the subject. If there
|
|
are no lookbehind assertions in the pattern, the check starts at the
|
|
starting offset. Otherwise, it starts at the length of the longest
|
|
lookbehind before the starting offset, or at the start of the subject
|
|
if there are not that many characters before the starting offset. Note
|
|
that the sequences \b and \B are one-character lookbehinds.
|
|
|
|
In addition to checking the format of the string, there is a check to
|
|
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
|
|
the surrogate area. The so-called "non-character" code points are not
|
|
excluded because Unicode corrigendum #9 makes it clear that they should
|
|
not be.
|
|
|
|
Characters in the "Surrogate Area" of Unicode are reserved for use by
|
|
UTF-16, where they are used in pairs to encode code points with values
|
|
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
|
|
are available independently in the UTF-8 and UTF-32 encodings. (In
|
|
other words, the whole surrogate thing is a fudge for UTF-16 which un-
|
|
fortunately messes up UTF-8 and UTF-32.)
|
|
|
|
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
|
that is given if an escape sequence for an invalid Unicode code point
|
|
is encountered in the pattern. If you want to allow escape sequences
|
|
such as \x{d800} (a surrogate code point) you can set the PCRE2_EX-
|
|
TRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
|
|
only in UTF-8 and UTF-32 modes, because these values are not repre-
|
|
sentable in UTF-16.
|
|
|
|
Errors in UTF-8 strings
|
|
|
|
The following negative error codes are given for invalid UTF-8 strings:
|
|
|
|
PCRE2_ERROR_UTF8_ERR1
|
|
PCRE2_ERROR_UTF8_ERR2
|
|
PCRE2_ERROR_UTF8_ERR3
|
|
PCRE2_ERROR_UTF8_ERR4
|
|
PCRE2_ERROR_UTF8_ERR5
|
|
|
|
The string ends with a truncated UTF-8 character; the code specifies
|
|
how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
|
|
characters to be no longer than 4 bytes, the encoding scheme (origi-
|
|
nally defined by RFC 2279) allows for up to 6 bytes, and this is
|
|
checked first; hence the possibility of 4 or 5 missing bytes.
|
|
|
|
PCRE2_ERROR_UTF8_ERR6
|
|
PCRE2_ERROR_UTF8_ERR7
|
|
PCRE2_ERROR_UTF8_ERR8
|
|
PCRE2_ERROR_UTF8_ERR9
|
|
PCRE2_ERROR_UTF8_ERR10
|
|
|
|
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
|
|
the character do not have the binary value 0b10 (that is, either the
|
|
most significant bit is 0, or the next bit is 1).
|
|
|
|
PCRE2_ERROR_UTF8_ERR11
|
|
PCRE2_ERROR_UTF8_ERR12
|
|
|
|
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
|
|
long; these code points are excluded by RFC 3629.
|
|
|
|
PCRE2_ERROR_UTF8_ERR13
|
|
|
|
A 4-byte character has a value greater than 0x10ffff; these code points
|
|
are excluded by RFC 3629.
|
|
|
|
PCRE2_ERROR_UTF8_ERR14
|
|
|
|
A 3-byte character has a value in the range 0xd800 to 0xdfff; this
|
|
range of code points are reserved by RFC 3629 for use with UTF-16, and
|
|
so are excluded from UTF-8.
|
|
|
|
PCRE2_ERROR_UTF8_ERR15
|
|
PCRE2_ERROR_UTF8_ERR16
|
|
PCRE2_ERROR_UTF8_ERR17
|
|
PCRE2_ERROR_UTF8_ERR18
|
|
PCRE2_ERROR_UTF8_ERR19
|
|
|
|
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
|
|
for a value that can be represented by fewer bytes, which is invalid.
|
|
For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
|
|
rect coding uses just one byte.
|
|
|
|
PCRE2_ERROR_UTF8_ERR20
|
|
|
|
The two most significant bits of the first byte of a character have the
|
|
binary value 0b10 (that is, the most significant bit is 1 and the sec-
|
|
ond is 0). Such a byte can only validly occur as the second or subse-
|
|
quent byte of a multi-byte character.
|
|
|
|
PCRE2_ERROR_UTF8_ERR21
|
|
|
|
The first byte of a character has the value 0xfe or 0xff. These values
|
|
can never occur in a valid UTF-8 string.
|
|
|
|
Errors in UTF-16 strings
|
|
|
|
The following negative error codes are given for invalid UTF-16
|
|
strings:
|
|
|
|
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
|
|
PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
|
|
PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
|
|
|
|
|
|
Errors in UTF-32 strings
|
|
|
|
The following negative error codes are given for invalid UTF-32
|
|
strings:
|
|
|
|
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
|
|
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
|
|
|
|
|
|
MATCHING IN INVALID UTF STRINGS
|
|
|
|
You can run pattern matches on subject strings that may contain invalid
|
|
UTF sequences if you call pcre2_compile() with the PCRE2_MATCH_IN-
|
|
VALID_UTF option. This is supported by pcre2_match(), including JIT
|
|
matching, but not by pcre2_dfa_match(). When PCRE2_MATCH_INVALID_UTF is
|
|
set, it forces PCRE2_UTF to be set as well. Note, however, that the
|
|
pattern itself must be a valid UTF string.
|
|
|
|
Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
|
|
generates, but if pcre2_jit_compile() is subsequently called, it does
|
|
generate different code. If JIT is not used, the option affects the be-
|
|
haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
|
|
VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
|
|
match time.
|
|
|
|
In this mode, an invalid code unit sequence in the subject never
|
|
matches any pattern item. It does not match dot, it does not match
|
|
\p{Any}, it does not even match negative items such as [^X]. A lookbe-
|
|
hind assertion fails if it encounters an invalid sequence while moving
|
|
the current point backwards. In other words, an invalid UTF code unit
|
|
sequence acts as a barrier which no match can cross.
|
|
|
|
You can also think of this as the subject being split up into fragments
|
|
of valid UTF, delimited internally by invalid code unit sequences. The
|
|
pattern is matched fragment by fragment. The result of a successful
|
|
match, however, is given as code unit offsets in the entire subject
|
|
string in the usual way. There are a few points to consider:
|
|
|
|
The internal boundaries are not interpreted as the beginnings or ends
|
|
of lines and so do not match circumflex or dollar characters in the
|
|
pattern.
|
|
|
|
If pcre2_match() is called with an offset that points to an invalid
|
|
UTF-sequence, that sequence is skipped, and the match starts at the
|
|
next valid UTF character, or the end of the subject.
|
|
|
|
At internal fragment boundaries, \b and \B behave in the same way as at
|
|
the beginning and end of the subject. For example, a sequence such as
|
|
\bWORD\b would match an instance of WORD that is surrounded by invalid
|
|
UTF code units.
|
|
|
|
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
|
|
trary data, knowing that any matched strings that are returned are
|
|
valid UTF. This can be useful when searching for UTF text in executable
|
|
or other binary files.
|
|
|
|
|
|
AUTHOR
|
|
|
|
Philip Hazel
|
|
Retired from University Computing Service
|
|
Cambridge, England.
|
|
|
|
|
|
REVISION
|
|
|
|
Last updated: 22 December 2021
|
|
Copyright (c) 1997-2021 University of Cambridge.
|
|
------------------------------------------------------------------------------
|
|
|
|
|