-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C,  that  implement  regular  expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. Some features that appeared in Python and the origi-
       nal  PCRE  before  they  appeared  in Perl are also available using the
       Python syntax. There is also some support for one or two .NET and Onig-
       uruma  syntax  items,  and  there are options for requesting some minor
       changes that give better ECMAScript (aka JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
       32-bit  code units, which means that up to three separate libraries may
       be installed.  The original work to extend PCRE to  16-bit  and  32-bit
       code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
       tively. In all three cases, strings can be interpreted  either  as  one
       character  per  code  unit, or as UTF-encoded Unicode, with support for
       Unicode general category properties. Unicode  support  is  optional  at
       build  time  (but  is  the default). However, processing strings as UTF
       code units must be enabled explicitly at run time. The version of  Uni-
       code in use can be discovered by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are  and  are
       not  supported  by  PCRE2  are  given  in  separate  documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
       pcre2syntax page.

       Some  features  of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The  libraries contain a number of undocumented internal functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to
       supply  arbitrary  patterns  for  compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked  for  UTF-8
       validity. If the data string is very long, such a check might take long
       enough to degrade your application's performance noticeably.

       One  way  of guarding against this possibility is to use the pcre2_pat-
       tern_info() function to check the compiled pattern's options  for  UTF.
       Alternatively,  you can set the PCRE2_NEVER_UTF option at compile time.
       This causes a compile-time error if a pattern  contains  a  UTF-setting
       sequence.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       Another way that performance can be hit is by running  a  pattern  that
       has  a  very  large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
       vides  some  protection  against  this: see the pcre2_set_match_limit()
       function in the pcre2api page.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain  text  format,  the  descriptions  of  the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively.  The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual  func-
       tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2stack         discussion of stack usage
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page  for  each  C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting  an  actual email address here is a spam magnet. If you want to
       email me, use my two initials, followed by the two digits  10,  at  the
       domain cam.ac.uk.


REVISION

       Last updated: 18 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE. This document contains a description of
       all its functions. See the pcre2 document for an overview  of  all  the
       PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_config(uint32_t what, void *where);


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There  are  three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However,  there  is  just  one  header  file,
       pcre2.h.   This  contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.

       Character  strings are passed to and from a PCRE2 library as a sequence
       of unsigned integers in code units  of  the  appropriate  width.  Every
       PCRE2  function  comes  in three different forms, one for each library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the  appropriate  widths.
       For  example,  PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are constant pointers to the equivalent  UCHAR  types,  that  is,
       they are pointers to vectors of unsigned code units.

       Many  applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile()  and  PCRE2_SPTR.  These  macros  use  the  value  of  the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific  func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be  8,  16,  or  32  before  including
       pcre2.h in order to make use of the generic names.

       Applications  that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
       be  0  before  including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where  the  value  of
       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If  PCRE2_CODE_UNIT_WIDTH  is  not  defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application,  you  must  take  care
       when  processing  any  particular  pattern to use only functions from a
       single library.  For example, if you want to run a match using  a  pat-
       tern  that  was  compiled  with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8().

       In the function summaries above, and in the rest of this  document  and
       other  PCRE2  documents,  functions  and data types are described using
       their generic names, without the 8, 16, or 32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which  is  described  in  this  document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give  access
       to all the functionality. They are described in the pcre2posix documen-
       tation. Both these APIs define a set of C function calls.

       The native API C data types, function prototypes,  option  values,  and
       error codes are defined in the header file pcre2.h, which contains def-
       initions of PCRE2_MAJOR and PCRE2_MINOR, the major  and  minor  release
       numbers  for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program  against  a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for  compiling
       and  matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is  given  in  the  pcre2demo  documentation,  and  the
       pcre2sample documentation describes how to compile and run it.

       Just-in-time  compiler support is an optional feature of PCRE2 that can
       be built in appropriate hardware environments. It greatly speeds up the
       matching  performance of many patterns. Programs can request that it be
       used if available, by calling pcre2_jit_compile() after a  pattern  has
       been successfully compiled by pcre2_compile(). This does nothing if JIT
       support is not available.

       More complicated programs might need to  make  use  of  the  specialist
       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
       pcre2_jit_stack_assign() in order to  control  the  JIT  code's  memory
       usage.

       JIT matching is automatically used by pcre2_match() if it is available.
       There is also a direct interface for JIT matching, which gives improved
       performance.  The  JIT-specific functions are discussed in the pcre2jit
       documentation.

       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
       patible,  is  also  provided.  This  uses a different algorithm for the
       matching. The alternative algorithm finds all possible  matches  (at  a
       given  point  in  the subject), and scans the subject just once (unless
       there are lookbehind assertions).  However,  this  algorithm  does  not
       return  captured  substrings.  A  description of the two matching algo-
       rithms  and  their  advantages  and  disadvantages  is  given  in   the
       pcre2matching    documentation.   There   is   no   JIT   support   for
       pcre2_dfa_match().

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided, to free the memory used for extracted strings.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Finally,  there  are functions for finding out information about a com-
       piled pattern (pcre2_pattern_info()) and about the  configuration  with
       which PCRE2 was built (pcre2_config()).


STRING LENGTHS AND OFFSETS

       The  PCRE2  API  uses  string  lengths and offsets into strings of code
       units in several places. These values are always  of  type  PCRE2_SIZE,
       which  is an unsigned integer type, currently always defined as size_t.
       The largest  value  that  can  be  stored  in  such  a  type  (that  is
       ~(PCRE2_SIZE)0)  is reserved as a special indicator for zero-terminated
       strings and unset offsets.  Therefore, the longest string that  can  be
       handled is one less than this maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a  single  LF  (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline  sequences
       are  the  three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each  of  the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If none is specified, the default is LF, which is the
       Unix standard. However, the newline convention can  be  changed  by  an
       application  when  calling  pcre2_compile(),  or it can be specified by
       special text at the start of the pattern  itself;  this  overrides  any
       other  settings.  See  the pcre2pattern page for details of the special
       character sequences.

       In  the  PCRE2  documentation  the  word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of  newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF  is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The  choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In  a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads.  The  PCRE2
       library  code  itself  is  thread-safe: it contains no static or global
       variables. The API is designed to be  fairly  simple  for  non-threaded
       applications  while at the same time ensuring that multithreaded appli-
       cations can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

       (1) A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed,  and  does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by  more
       than one thread simultaneously. An application can compile all its pat-
       terns at the start, before forking off multiple threads that use  them.
       However,  if  the  just-in-time  optimization feature is being used, it
       needs separate memory stack areas for each  thread.  See  the  pcre2jit
       documentation for more details.

       (2)  The  next section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a  PCRE2  function without using lots of arguments. The parameters that
       are stored in contexts are in some sense  "advanced  features"  of  the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues that are never changed, the same context can be  used  by  all  the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

       (3) The matching functions need a block of memory for working space and
       for  storing  the results of a match. This includes details of what was
       matched, as well as additional  information  such  as  the  name  of  a
       (*MARK)  setting. Each thread must provide its own version of this mem-
       ory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which  are  used
       only  by  specialist  applications,  for example, those that use custom
       memory management or non-standard character tables.  To  keep  function
       argument  lists  at a reasonable size, and at the same time to keep the
       API extensible, "uncommon" parameters are passed to  certain  functions
       in  a  context instead of directly. A context is just a block of memory
       that holds the parameter values.  Applications  that  do  not  need  to
       adjust  any  of  the  context  parameters  can pass NULL when a context
       pointer is required.

       There are three different types of context: a general context  that  is
       relevant  for  several  PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just  contains  pointers  to  (and  data  for)
       external  memory  management  functions  that  are  called from several
       places in the PCRE2 library. The context is named `general' rather than
       specifically  `memory'  because in future other fields may be added. If
       you do not want to supply your own custom memory management  functions,
       you  do not need to bother with a general context. A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory  management  functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function  may be NULL, in which case the system memory management func-
       tions malloc() and free() are used. (This is not currently  useful,  as
       there  are  no  other  fields in a general context, but in future there
       might be.)  The private_malloc() function  is  used  (if  supplied)  to
       obtain  memory  for storing the context, and all three values are saved
       as part of the context.

       Whenever PCRE2 creates a data block of any kind, the block  contains  a
       pointer  to the free() function that matches the malloc() function that
       was used. When the time comes to  free  the  block,  this  function  is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);


   The compile context

       A  compile context is required if you want to change the default values
       of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         An external function for stack checking

       A compile context is also required if you are using custom memory  man-
       agement.   If  none of these apply, just pass NULL as the context argu-
       ment of pcre2_compile().

       A compile context is created, copied, and freed by the following  func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A  compile  context  is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by  the  two  interpreted   matching   functions,   pcre2_match()   and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       The  value  must  be  the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of char-
       acter tables in the current locale.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be recog-
       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character sequence CR followed by LF),  PCRE2_NEWLINE_ANYCRLF  (any
       of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).

       When a pattern is compiled with the PCRE2_EXTENDED option, the value of
       this parameter affects the recognition of white space and  the  end  of
       internal comments starting with #. The value is saved with the compiled
       pattern for subsequent use by the JIT compiler and by  the  two  inter-
       preted matching functions, pcre2_match() and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default 250),
       on the depth of parenthesis nesting in  a  pattern.  This  limit  stops
       rogue patterns using up too much system stack when being compiled.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There  is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to  be  avoided  at
       all  costs. The parenthesis limit above cannot take account of how much
       stack is actually available. For a finer  control,  you  can  supply  a
       function  that  is  called whenever pcre2_compile() starts to compile a
       parenthesized part of a pattern. This function  can  check  the  actual
       stack size (or anything else that it wants to, of course).

       The  first  argument  to  the guard function gives the current depth of
       nesting, and the second is user data that is set up by the  last  argu-
       ment   of   pcre2_set_compile_recursion_guard().   The  guard  function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to change the default values of
       any of the following match-time parameters:

         A callout function
         The limit for calling match()
         The limit for calling match() recursively

       A match context is also required if you are using custom memory manage-
       ment.  If none of these apply, just pass NULL as the  context  argument
       of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A  match  context  is created, copied, and freed by the following func-
       tions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with  default  values  for  its  parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a "callout" function, which PCRE2 will call  at  specified
       points during a matching operation. Details are given in the pcre2call-
       out documentation.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means  of  preventing  PCRE2  from
       using up too many resources when processing patterns that are not going
       to match, but which have a very large number of possibilities in  their
       search  trees. The classic example is a pattern that uses nested unlim-
       ited repeats.

       Internally, pcre2_match() uses a  function  called  match(),  which  it
       calls  repeatedly (sometimes recursively). The limit set by match_limit
       is imposed on the number of times this  function  is  called  during  a
       match, which has the effect of limiting the amount of backtracking that
       can take place. For patterns that are not anchored, the count  restarts
       from  zero  for  each position in the subject string. This limit is not
       relevant to pcre2_dfa_match(), which ignores it.

       When pcre2_match() is called with a pattern that was successfully  pro-
       cessed by pcre2_jit_compile(), the way in which matching is executed is
       entirely different. However, there is still the possibility of  runaway
       matching  that  goes  on  for  a very long time, and so the match_limit
       value is also used in this case (but in a different way) to  limit  how
       long the matching can continue.

       The  default  value  for  the limit can be set when PCRE2 is built; the
       built-in default is 10 million, which handles all but the most  extreme
       cases.    If    the    limit   is   exceeded,   pcre2_match()   returns
       PCRE2_ERROR_MATCHLIMIT. A value for the match limit may  also  be  sup-
       plied by an item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where  ddd  is  a  decimal  number.  However, such a setting is ignored
       unless ddd is less than the limit set by the  caller  of  pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The recursion_limit parameter is similar to match_limit, but instead of
       limiting the total number of times that match() is  called,  it  limits
       the  depth  of  recursion. The recursion depth is a smaller number than
       the total number of calls, because not all calls to match() are  recur-
       sive.  This limit is of use only if it is set smaller than match_limit.

       Limiting the recursion depth limits the amount of system stack that can
       be used, or, when PCRE2 has been compiled to use  memory  on  the  heap
       instead  of the stack, the amount of heap memory that can be used. This
       limit is not relevant, and is ignored, when matching is done using  JIT
       compiled code or by the pcre2_dfa_match() function.

       The  default  value for recursion_limit can be set when PCRE2 is built;
       the built-in default is the same value as the default for  match_limit.
       If  the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION-
       LIMIT. A value for the recursion limit may also be supplied by an  item
       at the start of a pattern of the form

         (*LIMIT_RECURSION=ddd)

       where  ddd  is  a  decimal  number.  However, such a setting is ignored
       unless ddd is less than the limit set by the  caller  of  pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       This function sets up two additional custom memory management functions
       for use by pcre2_match() when PCRE2 is compiled to  use  the  heap  for
       remembering backtracking data, instead of recursive function calls that
       use the system stack. There is a discussion about PCRE2's  stack  usage
       in  the  pcre2stack documentation. See the pcre2build documentation for
       details of how to build PCRE2.

       Using the heap for recursion is a non-standard way of  building  PCRE2,
       for  use  in  environments  that  have  limited  stacks. Because of the
       greater use of memory management, pcre2_match() runs more slowly. Func-
       tions  that  are  different  to the general custom memory functions are
       provided so that special-purpose external code can  be  used  for  this
       case,  because  the memory blocks are all the same size. The blocks are
       retained by pcre2_match() until it is about to exit so that they can be
       re-used  when  possible during the match. In the absence of these func-
       tions, the normal custom memory management functions are used, if  sup-
       plied, otherwise the system functions.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The  function  pcre2_config()  makes  it possible for a PCRE2 client to
       discover which optional features have  been  compiled  into  the  PCRE2
       library.  The  pcre2build  documentation  has  more details about these
       optional features.

       The first argument for pcre2_config() specifies  which  information  is
       required.  The  second  argument  is a pointer to memory into which the
       information is placed. If NULL is  passed,  the  function  returns  the
       amount  of  memory  that  is  needed for the requested information. For
       calls that return  numerical  values,  the  value  is  in  bytes;  when
       requesting  these  values,  where should point to appropriately aligned
       memory. For calls that return strings, the required length is given  in
       code units, not counting the terminating zero.

       When  requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code  PCRE2_ERROR_BADOP-
       TION  if the value in the first argument is not recognized. The follow-
       ing information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates  what  character
       sequences  the  \R  escape  sequence  matches  by  default.  A value of
       PCRE2_BSR_UNICODE  means  that  \R  matches  any  Unicode  line  ending
       sequence;  a  value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
       LF, or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set  to  one  if  support  for
       just-in-time compiling is available; otherwise it is set to zero.

         PCRE2_CONFIG_JITTARGET

       The  where  argument  should point to a buffer that is at least 48 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
       string that contains the name of the architecture  for  which  the  JIT
       compiler  is  configured,  for  example  "x86  32bit  (little  endian +
       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION  is
       returned,  otherwise the number of code units used is returned. This is
       the length of the string, plus one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default  being
       2.  This is the value that is returned by pcre2_config(). However, when
       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
       when  the  32-bit  library  is compiled, internal linkages always use 4
       bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for  all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 64K code units. Larger values allow larger
       regular  expressions  to be compiled by those two libraries, but at the
       expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default limit  for  the
       number  of  internal  matching function calls in a pcre2_match() execu-
       tion. Further details are given with pcre2_match() below.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer  whose  value  specifies  the  default
       character  sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       The default should normally correspond to  the  standard  sequence  for
       your operating system.

         PCRE2_CONFIG_PARENSLIMIT

       The  output is a uint32_t integer that gives the maximum depth of nest-
       ing of parentheses (of any kind) in a pattern. This limit is imposed to
       cap  the  amount of system stack used when a pattern is compiled. It is
       specified when PCRE2 is built; the default is 250. This limit does  not
       take  into  account  the  stack that may already be used by the calling
       application. For  finer  control  over  compilation  stack  usage,  see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_RECURSIONLIMIT

       The  output  is a uint32_t integer that gives the default limit for the
       depth of recursion when calling the internal  matching  function  in  a
       pcre2_match()  execution.  Further details are given with pcre2_match()
       below.

         PCRE2_CONFIG_STACKRECURSE

       The output is a uint32_t integer that is set to one if internal  recur-
       sion  when  running  pcre2_match() is implemented by recursive function
       calls that use the system stack to remember their state.  This  is  the
       usual  way that PCRE2 is compiled. The output is zero if PCRE2 was com-
       piled to use blocks of data on the heap instead of  recursive  function
       calls.

         PCRE2_CONFIG_UNICODE_VERSION

       The  where  argument  should point to a buffer that is at least 24 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where  set  to  NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with  the  text  "Unicode
       not  supported".  Otherwise,  the  Unicode version string (for example,
       "7.0.0") is inserted. The number of code units used is  returned.  This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The  output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies  UTF
       support.

         PCRE2_CONFIG_VERSION

       The  where  argument  should point to a buffer that is at least 12 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the termi-
       nating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal  form.
       The  pattern  is  defined  by a pointer to a string of code units and a
       length. If the pattern is zero-terminated, the length can be  specified
       as  PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
       memory that contains the compiled pattern and related data. The  caller
       must  free the memory by calling pcre2_code_free() when it is no longer
       needed.

       NOTE: When one of the matching functions is  called,  pointers  to  the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the extraction functions. After  run-
       ning  a  match,  you  must  not  free  a compiled pattern (or a subject
       string) until after all operations on the match data block  have  taken
       place.

       If  the  compile context argument ccontext is NULL, memory for the com-
       piled pattern  is  obtained  by  calling  malloc().  Otherwise,  it  is
       obtained  from  the  same memory function that was used for the compile
       context.

       The options argument contains various bit settings that affect the com-
       pilation.  It  should be zero if no options are required. The available
       options are described below. Some of them (in  particular,  those  that
       are  compatible with Perl, but some others as well) can also be set and
       unset from within the pattern (see  the  detailed  description  in  the
       pcre2pattern documentation).

       For  those options that can be different in different parts of the pat-
       tern, the contents of the options argument specifies their settings  at
       the  start  of  compilation.  The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
       options can be set at the time of matching as well as at compile time.

       Other, less frequently required compile-time parameters  (for  example,
       the newline setting) can be provided in a compile context (as described
       above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
       diately.  Otherwise, if compilation of a pattern fails, pcre2_compile()
       returns NULL, having set these variables to an error code and an offset
       (number   of   code   units)  within  the  pattern,  respectively.  The
       pcre2_get_error_message() function provides a textual message for  each
       error code. Compilation errors are positive numbers, but UTF formatting
       errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the
       offset is that of the first code unit of the failing character.

       Some  errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length  of  the  pattern.
       Note  that  the  offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
       acter.

       This  code  fragment shows a typical straightforward call to pcre2_com-
       pile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           "^A.*Z",                /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */

       The following names for option bits are defined in the  pcre2.h  header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the  string
       that  is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is  the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By  default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data  character  for
       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of  three  escape  sequences,
       which  makes  PCRE2's  behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a com-
       pile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal  number  defines  the
       code  point  to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by  two
       hexadecimal  digits,  in  which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a  hexadecimal  number  is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

         PCRE2_AUTO_CALLOUT

       If this bit  is  set,  pcre2_compile()  automatically  inserts  callout
       items, all with number 255, before each pattern item. For discussion of
       the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper  and  lower
       case  letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches  only
       at  the  end  of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but  not
       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
       character, including one that indicates a  newline.  However,  it  only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the sub-
       ject  is  at  a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A neg-
       ative class such as [^a] always matches newline characters, independent
       of the setting of this option.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capturing  subpatterns  need
       not be unique. This can be helpful for certain types of pattern when it
       is known that only one instance of the named  subpattern  can  ever  be
       matched.  There  are  more details of named subpatterns below; see also
       the pcre2pattern documentation.

         PCRE2_EXTENDED

       If this bit is set, most white space  characters  in  the  pattern  are
       totally  ignored  except when escaped or inside a character class. How-
       ever, white space is not allowed within  sequences  such  as  (?>  that
       introduce various parenthesized subpatterns, nor within numerical quan-
       tifiers such as {1,3}.  Ignorable white space is permitted  between  an
       item  and a following quantifier and between a quantifier and a follow-
       ing + that indicates possessiveness.

       PCRE2_EXTENDED also causes characters between an unescaped # outside  a
       character  class  and the next newline, inclusive, to be ignored, which
       makes it possible to include comments inside complicated patterns. Note
       that  the  end of this type of comment is a literal newline sequence in
       the pattern; escape sequences that happen to represent a newline do not
       count.  PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
       changed within a pattern by a (?x) option setting.

       Which characters are interpreted as newlines can be specified by a set-
       ting  in  the compile context that is passed to pcre2_compile() or by a
       special sequence at the start of the pattern, as described in the  sec-
       tion  entitled "Newline conventions" in the pcre2pattern documentation.
       A default is defined when PCRE2 is built.

         PCRE2_FIRSTLINE

       If this option is set, an  unanchored  pattern  is  required  to  match
       before  or  at  the  first  newline  in  the subject string, though the
       matched text may continue over the newline.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a back reference to an  unset  subpattern  group
       matches  an  empty  string (by default this causes the current matching
       alternative to fail).  A pattern such as  (\1)(a)  succeeds  when  this
       option  is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility.  Setting  this  option  makes
       PCRE2 behave more like ECMAScript (aka JavaScript).

         PCRE2_MULTILINE

       By  default,  for  the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a  single  line
       of  characters,  even  if  it actually contains newlines. The "start of
       line" metacharacter (^) matches only at the start of  the  string,  and
       the  "end  of  line"  metacharacter  ($) matches only at the end of the
       string,  or  before  a  terminating  newline  (except  when  PCRE2_DOL-
       LAR_ENDONLY  is  set).  Note, however, that unless PCRE2_DOTALL is set,
       the "any character" metacharacter (.) does not match at a newline. This
       behaviour (for ^, $, and dot) is the same as Perl.

       When  PCRE2_MULTILINE  is  set,  the  "start of line" and "end of line"
       constructs match immediately following or immediately  before  internal
       newlines  in  the  subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and  it  can  be
       changed within a pattern by a (?m) option setting. If there are no new-
       lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
       setting PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_UCP

       This  option  locks  out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described  for  the  PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting  the
       pattern  with  (*UCP).  This may be useful in applications that process
       patterns from external sources. The option  combination  PCRE2_UCP  and
       PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it pre-
       vents  the  creator of the pattern from switching to UTF interpretation
       by starting the pattern with (*UTF). This may be useful in applications
       that  process  patterns  from  external  sources.  The  combination  of
       PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses  in the pattern. Any opening parenthesis that is not followed by
       ? behaves as if it were followed by ?: but named parentheses can  still
       be  used  for  capturing  (and  they acquire numbers in the usual way).
       There is no equivalent of this option in Perl.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization  that,  for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However,  if  callouts
       are  in  use,  auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a  full  unoptimized  search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .*  is  the  first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or  \G  or  ^.
       The  optimization  is  automatically disabled for .* if it is inside an
       atomic group or a capturing group that is the subject of a back  refer-
       ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
       mization is not disabled, such a pattern is automatically  anchored  if
       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
       for any ^ items. Otherwise, the fact that any match must  start  either
       at  the start of the subject or following a newline is remembered. Like
       other optimizations, this can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time.  It  does  not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the  start  of  a
       match,  in  order  to speed up the process. For example, if it is known
       that an unanchored match must start  with  a  specific  character,  the
       matching  code searches the subject for that character, and fails imme-
       diately if it cannot find it, without actually running the main  match-
       ing  function.  This means that a special item such as (*COMMIT) at the
       start of a pattern is not considered until after  a  suitable  starting
       point  for  the  match  has  been found. Also, when callouts or (*MARK)
       items are in use, these "start-up" optimizations can cause them  to  be
       skipped  if  the pattern is never actually used. The start-up optimiza-
       tions are in effect a pre-scan of the subject that takes  place  before
       the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer,  but  ensuring  that  in  cases
       where  the  result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting  PCRE2_NO_START_OPTIMIZE  may  change the outcome of a matching
       operation.  Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match  must  start
       with  the  character  "A".  Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and  runs  the
       first  match attempt from there. The (*COMMIT) item means that the pat-
       tern must match at the current starting position, which in this case it
       does.  However,  if  the same match is run with PCRE2_NO_START_OPTIMIZE
       set, the initial scan along the subject string  does  not  happen.  The
       first  match  attempt  is  run  starting  from "D" and when this fails,
       (*COMMIT) prevents any further matches being tried, so the overall
       result is "no match".

       There are also other start-up optimizations. For example, a minimum
       length for the subject may be recorded. Consider the pattern

         (*MARK:A)(X|Y)

       The  minimum  length  for  a  match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
       to match an empty string at the end of the subject does not take place,
       because PCRE2 knows that the subject is  now  too  short,  and  so  the
       (*MARK)  is  never encountered. In this case, the optimization does not
       affect the overall match result, which is still "no match", but it does
       affect the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When  PCRE2_UTF  is set, the validity of the pattern as a UTF string is
       automatically checked. There are  discussions  about  the  validity  of
       UTF-8  strings,  UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document.  If an invalid UTF sequence is found, pcre2_compile() returns
       a negative error code.

       If you know that your pattern is valid, and you want to skip this check
       for performance reasons, you can  set  the  PCRE2_NO_UTF_CHECK  option.
       When  it  is set, the effect of passing an invalid UTF string as a pat-
       tern is undefined. It may cause your program to  crash  or  loop.  Note
       that   this   option   can   also   be   passed  to  pcre2_match()  and
       pcre2_dfa_match(), to suppress validity checking of the subject string.

         PCRE2_UCP

       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
       \w,  and  some  of  the POSIX character classes. By default, only ASCII
       characters are recognized, but if PCRE2_UCP is set, Unicode  properties
       are  used instead to classify characters. More details are given in the
       section on generic character types in the pcre2pattern page. If you set
       PCRE2_UCP,  matching one of the items it affects takes much longer. The
       option is available only if PCRE2 has been compiled with  Unicode  sup-
       port.

         PCRE2_UNGREEDY

       This  option  inverts  the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It  is
       not  compatible  with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_UTF

       This option causes PCRE2 to regard both the  pattern  and  the  subject
       strings  that  are  subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It  is  available  when  PCRE2  is
       built  to  include  Unicode  support (which is the default). If Unicode
       support is not available, the use of this  option  provokes  an  error.
       Details  of how this option changes the behaviour of PCRE2 are given in
       the pcre2unicode page.


COMPILATION ERROR CODES

       There are over 80 positive error codes that pcre2_compile() may  return
       if it finds an error in the pattern. There are also some negative error
       codes that are used for invalid UTF strings.  These  are  the  same  as
       given  by pcre2_match() and pcre2_dfa_match(), and are described in the
       pcre2unicode page. The pcre2_get_error_message() function can be called
       to obtain a textual error message from any error code.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These  functions  provide  support  for  JIT compilation, which, if the
       just-in-time compiler is available, further processes a  compiled  pat-
       tern into machine code that executes much faster than the pcre2_match()
       interpretive matching function. Full details are given in the  pcre2jit
       documentation.

       JIT  compilation  is  a heavyweight optimization. It can take some time
       for patterns to be analyzed, and for one-off matches  and  simple  pat-
       terns  the benefit of faster execution might be offset by a much slower
       compilation time.  Most, but not all patterns can be optimized  by  the
       JIT compiler.


LOCALE SUPPORT

       PCRE2  handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables,  indexed
       by  character  code  point.  This applies only to characters whose code
       points are less than 256. By default, higher-valued code  points  never
       match  escapes  such  as \w or \d.  However, if PCRE2 is built with UTF
       support, all characters can be tested with  \p  and  \P,  or,  alterna-
       tively,  the  PCRE2_UCP  option  can be set when a pattern is compiled;
       this causes \w and friends to use Unicode property support  instead  of
       the built-in tables.

       The  use  of  locales  with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2  contains  an  internal  set of character tables that are used by
       default.  These are sufficient for  many  applications.  Normally,  the
       internal tables recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be dif-
       ferent.

       The internal tables can be overridden by tables supplied by the  appli-
       cation  that  calls  PCRE2.  These may be created in a different locale
       from the default.  As more and more applications change to  using  Uni-
       code, the need for this locale support is expected to die away.

       External  tables  are built by calling the pcre2_maketables() function,
       in the relevant locale. The result can be passed to pcre2_compile()  as
       often   as  necessary,  by  creating  a  compile  context  and  calling
       pcre2_set_character_tables() to set the  tables  pointer  therein.  For
       example,  to  build  and use tables that are appropriate for the French
       locale (where accented characters with values greater than 127 are
       treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale  is  "french".
       It  is the caller's responsibility to ensure that the memory containing
       the tables remains available for as long as it is needed.

       The pointer that is passed (via the compile context) to pcre2_compile()
       is saved with the compiled pattern, and the same tables are used by
       pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
       compilation and matching both happen in the same locale, but different
       patterns can be processed in different locales.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns information about a  compiled
       pattern.  The  first argument is a pointer to the compiled pattern. The
       second argument specifies which piece of information is  required,  and
       the  third  argument is a pointer to a variable to receive the data. If
       the third argument is NULL, the first  argument  is  ignored,  and  the
       function returns the size in bytes of the variable that is required for
       the information requested. Otherwise, the yield of the function is
       zero for success, or one of the following negative numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The  "magic  number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. Here is a
       typical  call of pcre2_pattern_info(), to obtain the length of the com-
       piled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS

       Return a copy of the pattern's options. The third argument should point
       to a  uint32_t  variable.  PCRE2_INFO_ARGOPTIONS  returns  exactly  the
       options  that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
       TIONS returns the compile options as modified by any  top-level  option
       settings  at  the start of the pattern itself. In other words, they are
       the options that will be in force when matching starts. For example, if
       the  pattern  /(?im)abc(?-i)d/  is  compiled  with  the  PCRE2_EXTENDED
       option,   the   result   is   PCRE2_CASELESS,   PCRE2_MULTILINE,    and
       PCRE2_EXTENDED.

       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When  .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capturing group that is the subject
              of a back reference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern.
         PCRE2_NO_DOTSTAR_ANCHOR is not set.

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is  set  in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the pattern. The
       third argument should point to a uint32_t variable. Named subpatterns
       acquire  numbers  as well as names, and these count towards the highest
       back reference.  Back references such as \4 or \g{12}  match  the  cap-
       tured  characters of the given group, but in addition, the check that a
       capturing group is set in a conditional subpattern such as (?(3)a|b) is
       also  a  back  reference.  Zero is returned if there are no back refer-
       ences.

         PCRE2_INFO_BSR

       The output is a uint32_t whose value indicates what character sequences
       the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
       \R matches any Unicode line ending sequence; a value of  PCRE2_BSR_ANY-
       CRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return  the  number  of capturing subpatterns in the pattern. The third
       argument should point to a uint32_t variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to a uint32_t
       variable.

       If there is a fixed first value, for example, the  letter  "c"  from  a
       pattern  such  as  (cat|cow|coyote),  1  is returned, and the character
       value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there  is  no
       fixed  first  value, but it is known that a match can occur only at the
       start of the subject or following  a  newline  in  the  subject,  2  is
       returned. Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return  the  value  of the first code unit of any matched string in the
       situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit  library  the
       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_FIRSTBITMAP

       In  the absence of a single first code unit for a non-anchored pattern,
       pcre2_compile() may construct a 256-bit table that defines a fixed  set
       of  values for the first code unit in any match. For example, a pattern
       that starts with [abc] results in a table with  three  bits  set.  When
       code  unit  values greater than 255 are supported, the flag bit for 255
       means "any code unit of value 255 or above". If such a table  was  con-
       structed,  a pointer to it is returned. Otherwise NULL is returned. The
       third argument should point to a const uint8_t * variable.

         PCRE2_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
       characters, otherwise 0. The third argument should point to a uint32_t
       variable. An explicit match is either a literal CR or LF character,  or
       \r or \n.

         PCRE2_INFO_JCHANGED

       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
       (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
       tively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully  processed  by  pcre2_jit_com-
       pile(),  return  the  size  of  the JIT compiled code, otherwise return
       zero. The third argument should point to a size_t variable.

         PCRE2_INFO_LASTCODETYPE

       Return 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is
       returned.  When  1  is  returned,  the  code  unit  value itself can be
       retrieved using PCRE2_INFO_LASTCODEUNIT.

       For anchored patterns, a last literal value is recorded only if it fol-
       lows  something  of  variable  length.  For  example,  for  the pattern
       /^a\d+z\d+/  the  returned  value  is  1  (with   "z"   returned   from
       PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The third argument should point to a uint32_t variable. If
       there is no such value, 0 is returned.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern can match an empty  string,  otherwise  0.  The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_MATCHLIMIT

       If  the  pattern  set  a  match  limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the  value  is  returned.  The  third
       argument  should  point to an unsigned 32-bit integer. If no such value
       has been set,  the  call  to  pcre2_pattern_info()  returns  the  error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_MAXLOOKBEHIND

       Return the number of characters (not code units) in the longest lookbe-
       hind assertion in the pattern. The third argument should  point  to  an
       unsigned  32-bit  integer. This information is useful when doing multi-
       segment matching using the partial matching facilities. Note  that  the
       simple assertions \b and \B require a one-character lookbehind. \A also
       registers a one-character  lookbehind,  though  it  does  not  actually
       inspect  the  previous  character.  This is to ensure that at least one
       character from the old segment is retained when a new segment  is  pro-
       cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
       match incorrectly at the start of a new segment.

         PCRE2_INFO_MINLENGTH

       If a minimum length for matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. The value is a
       number of characters, which in UTF mode may be different from the
       number of code units. The third argument should point to a uint32_t
       variable. The value is a lower bound to the length of any matching
       string. There may not be any strings of that length that do actually
       match, but every string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing parenthe-
       ses.  The names are just an additional way of identifying the parenthe-
       ses, which still acquire numbers. Several convenience functions such as
       pcre2_substring_get_byname()  are provided for extracting captured sub-
       strings by name. It is also possible to extract the data  directly,  by
       first  converting  the  name to a number in order to access the correct
       pointers in the output vector (described with pcre2_match() below).  To
       do  the  conversion,  you  need to use the name-to-number map, which is
       described by these three values.

       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
       the size of each entry in code units; both of these return  a  uint32_t
       value. The entry size depends on the length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This is a PCRE2_SPTR pointer to a block of code  units.  In  the  8-bit
       library,  the  first two bytes of each entry are the number of the cap-
       turing parenthesis, most significant byte first. In the 16-bit library,
       the  pointer  points  to 16-bit code units, the first of which contains
       the parenthesis number. In the 32-bit library, the  pointer  points  to
       32-bit  code units, the first of which contains the parenthesis number.
       The rest of the entry is the corresponding name, zero terminated.

       The names are in alphabetical order. If (?| is used to create  multiple
       groups  with  the same number, as described in the section on duplicate
       subpattern numbers in the pcre2pattern page, the groups  may  be  given
       the  same  name,  but  there  is only one entry in the table. Different
       names for groups of the same number are not permitted.

       Duplicate names for subpatterns with different numbers  are  permitted,
       but  only  if  PCRE2_DUPNAMES  is  set. They appear in the table in the
       order in which they were found in the pattern. In the  absence  of  (?|
       this  is  the  order of increasing number; when (?| is used this is not
       necessarily the case because later subpatterns may have lower numbers.

       As a simple example of the name/number table,  consider  the  following
       pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named subpatterns, so the table has  four  entries,  and
       each  entry  in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When  writing  code  to  extract  data from named subpatterns using the
       name-to-number map, remember that the length of the entries  is  likely
       to be different for each compiled pattern.
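
       The layout of the 8-bit table can be decoded with plain C, as in the
       sketch below. It needs no PCRE2 calls: the table bytes are copied from
       the example above, with 0xEE standing in for the undefined bytes. The
       names group_number_8bit and example_table are invented for the
       illustration; in a real program the table pointer, entry count, and
       entry size would come from the three pcre2_pattern_info() requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit name table. Each entry is entry_size
   bytes: a two-byte group number, most significant byte first, followed
   by the zero-terminated name. Returns the group number, or 0 if the name
   is absent (group numbers start at 1). With duplicate names, the first
   entry found is returned. */
static uint32_t group_number_8bit(const unsigned char *table,
  uint32_t name_count, uint32_t entry_size, const char *name)
{
  assert(table != NULL && name != NULL);
  for (uint32_t i = 0; i < name_count; i++) {
    const unsigned char *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return ((uint32_t)entry[0] << 8) | entry[1];
  }
  return 0;
}

/* Four entries of eight bytes each, in alphabetical order of name. */
static const unsigned char example_table[] = {
  0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0xEE,
  0x00, 0x05, 'd', 'a', 'y', 0x00, 0xEE, 0xEE,
  0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
  0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0xEE
};
```

       The entry size is passed in rather than fixed at eight, since it
       varies from pattern to pattern.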

         PCRE2_INFO_NEWLINE

       The output is a uint32_t with one of the following values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       This  specifies  the default character sequence that will be recognized
       as meaning "newline" while matching.

         PCRE2_INFO_RECURSIONLIMIT

       If the pattern set a recursion limit by including an item of  the  form
       (*LIMIT_RECURSION=nnnn)  at the start, the value is returned. The third
       argument should point to an unsigned 32-bit integer. If no  such  value
       has  been  set,  the  call  to  pcre2_pattern_info()  returns the error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_SIZE

       Return the size of  the  compiled  pattern  in  bytes  (for  all  three
       libraries).  The third argument should point to a size_t variable. This
       value includes the size of the general data  block  that  precedes  the
       code  units of the compiled pattern itself. The value that is used when
       pcre2_compile() is getting memory in which to place the  compiled  pat-
       tern  may  be  slightly  larger than the value returned by this option,
       because there are cases where the code that calculates the size has  to
       over-estimate.  Processing  a  pattern  with  the JIT compiler does not
       alter the value returned by this option.


SERIALIZATION AND PRECOMPILING

       It is possible to save compiled patterns  on  disc  or  elsewhere,  and
       reload  them  later, subject to a number of restrictions. The functions
       whose names begin with pcre2_serialize_ are used for this purpose. They
       are described in the pcre2serialize documentation.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information  about  a  successful  or unsuccessful match is placed in a
       match data block, which is an opaque  structure  that  is  accessed  by
       function  calls.  In particular, the match data block contains a vector
       of offsets into the subject string that define the matched part of  the
       subject and any substrings that were captured. This is known as the
       ovector.

       Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match()
       you must create a match data block by calling one of the creation func-
       tions above. For pcre2_match_data_create(), the first argument  is  the
       number  of  pairs  of  offsets  in  the ovector. One pair of offsets is
       required to identify the string that matched the  whole  pattern,  with
       another  pair  for  each  captured substring. For example, a value of 4
       creates enough space to record the matched portion of the subject  plus
       three captured substrings. A minimum of one pair is imposed by
       pcre2_match_data_create(), so it is always possible to return the over-
       all matched string.

       The second argument of pcre2_match_data_create() is a pointer to a gen-
       eral context, which can specify custom memory management for  obtaining
       the memory for the match data block. If you are not using custom memory
       management, pass NULL, which causes malloc() to be used.

       For pcre2_match_data_create_from_pattern(), the  first  argument  is  a
       pointer to a compiled pattern. The ovector is created to be exactly the
       right size to hold all the substrings a pattern might capture. The sec-
       ond  argument is again a pointer to a general context, but in this case
       if NULL is passed, the memory is obtained using the same allocator that
       was used for the compiled pattern (custom or default).

       A  match  data block can be used many times, with the same or different
       compiled patterns. You can extract information from a match data  block
       after  a  match  operation  has  finished,  using  functions  that  are
       described in the sections on  matched  strings  and  other  match  data
       below.

       When  a  call  of  pcre2_match()  fails, valid data is available in the
       match   block   only   when   the   error    is    PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL,  or  one  of  the  error  codes for an invalid UTF
       string. Exactly what is available depends on the error, and is detailed
       below.

       When  one of the matching functions is called, pointers to the compiled
       pattern and the subject string are set in the match data block so  that
       they  can  be  referenced  by the extraction functions. After running a
       match, you must not free a compiled pattern or a subject  string  until
       after  all  operations  on  the  match data block (for that match) have
       taken place.

       When a match data block itself is no longer needed, it should be  freed
       by calling pcre2_match_data_free().


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The  function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument. You can  call
       pcre2_match() with the same code argument as many times as you like, in
       order to find multiple matches in the subject string or to  match  dif-
       ferent subject strings with the same pattern.

       This  function  is  the  main  matching facility of the library, and it
       operates in a Perl-like manner. For specialist use  there  is  also  an
       alternative  matching function, which is described below in the section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL);          /* a match context; NULL means use defaults */

       If the subject string is zero-terminated, the length can  be  given  as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the sec-
       tion on the match context above.

   The string to be matched by pcre2_match()

       The  subject string is passed to pcre2_match() as a pointer in subject,
       a length in length, and a starting offset in  startoffset.  The  length
       and  offset  are  in  code units, not characters.  That is, they are in
       bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
       and  32-bit  code units for the 32-bit library, whether or not UTF pro-
       cessing is enabled.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns  PCRE2_ERROR_BADOFFSET.  When  the starting offset is zero, the
       search for a match starts at the beginning of the subject, and this  is
       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
       set must point to the start of a character, or to the end of  the  sub-
       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
       sets are valid). Like the  pattern  string,  the  subject  may  contain
       binary zeroes.

       A  non-zero  starting offset is useful when searching for another match
       in the same subject by calling pcre2_match()  again  after  a  previous
       success.   Setting  startoffset  differs  from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which  finds  occurrences  of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not  a  word  boundary.)
       When applied to the string "Mississippi" the first call to pcre2_match()
       finds the first occurrence. If pcre2_match() is called again with just
       the remainder of the subject, namely "issippi", it does not match,
       because \B is always false at the start of the subject, which is deemed
       to  be  a word boundary. However, if pcre2_match() is passed the entire
       string again, but with startoffset set to 4, it finds the second occur-
       rence  of "iss" because it is able to look behind the starting point to
       discover that it is preceded by a letter.

       Finding all the matches in a subject is tricky  when  the  pattern  can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first  trying  the  match  again  at  the   same   offset,   with   the
       PCRE2_NOTEMPTY_ATSTART  and  PCRE2_ANCHORED  options,  and then if that
       fails, advancing the starting  offset  and  trying  an  ordinary  match
       again.  There  is  some  code  that  demonstrates how to do this in the
       pcre2demo sample program. In the most general case, you have  to  check
       to  see  if the newline convention recognizes CRLF as a newline, and if
       so, and the current character is CR followed by LF, advance the  start-
       ing offset by two characters instead of one.

       If  a  non-zero starting offset is passed when the pattern is anchored,
       one attempt to match at the given offset is made. This can only succeed
       if  the  pattern  does  not require the match to be at the start of the
       subject.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be zero.
       The  only  bits  that  may  be  set  are  PCRE2_ANCHORED, PCRE2_NOTBOL,
       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and PCRE2_PARTIAL_SOFT. Their
       action is described below.

       Setting PCRE2_ANCHORED at match time is not supported by  the  just-in-
       time  (JIT)  compiler.  If  it is set, JIT matching is disabled and the
       normal interpretive code in pcre2_match() is run. The remaining options
       are supported for JIT matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
       turned  out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time. Note that setting the option at match time
       disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
       match before it. Setting this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.

         PCRE2_NOTEOL

       This option specifies that the end of the subject string is not the end
       of a line, so the dollar metacharacter should not match it nor  (except
       in  multiline mode) a newline immediately before it. Setting this with-
       out having set PCRE2_MULTILINE at compile time causes dollar  never  to
       match. This option affects only the behaviour of the dollar metacharac-
       ter. It does not affect \Z or \z.

         PCRE2_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set.  If  there are alternatives in the pattern, they are tried. If all
       the alternatives match the empty string, the entire  match  fails.  For
       example, if the pattern

         a?b?

       is  applied  to  a  string not beginning with "a" or "b", it matches an
       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
       match  is  not valid, so pcre2_match() searches further into the string
       for occurrences of "a" or "b".

         PCRE2_NOTEMPTY_ATSTART

       This is like PCRE2_NOTEMPTY, except that it locks out an  empty  string
       match only at the first matching position, that is, at the start of the
       subject plus the starting offset. An empty string match  later  in  the
       subject  is  permitted.   If  the pattern is anchored, such a match can
       occur only if the pattern contains \K.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set at compile time, the validity of the subject as a
       UTF  string  is  checked  by default when pcre2_match() is subsequently
       called.  The entire string is checked before any other processing takes
       place,  and a negative error code is returned if the check fails. There
       are several UTF error codes for each code unit width, corresponding  to
       different  problems with the code unit sequence. The value of startoff-
       set is also checked, to ensure that it points to the start of a charac-
       ter  or  to  the  end  of  the subject. There are discussions about the
       validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
       pcre2unicode page.

       If  you  know  that  your  subject is valid, and you want to skip these
       checks for performance reasons,  you  can  set  the  PCRE2_NO_UTF_CHECK
       option  when  calling  pcre2_match(). You might want to do this for the
       second and subsequent calls to pcre2_match() if you are making repeated
       calls to find all the matches in a single subject string.

       NOTE:  When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
       string as a subject, or an invalid value of startoffset, is  undefined.
       Your program may crash or loop indefinitely.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These  options  turn  on  the partial matching feature. A partial match
       occurs if the end of the subject string is  reached  successfully,  but
       there  are not enough subject characters to complete the match. If this
       happens when PCRE2_PARTIAL_SOFT (but not  PCRE2_PARTIAL_HARD)  is  set,
       matching  continues  by  testing any remaining alternatives. Only if no
       complete match can be found is PCRE2_ERROR_PARTIAL returned instead  of
       PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PARTIAL_SOFT specifies that
       the caller is prepared to handle a partial match, but only if  no  com-
       plete match can be found.

       If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
       case, if a partial match is found,  pcre2_match()  immediately  returns
       PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
       considered to be more important than an alternative complete match.

       There is a more detailed discussion of partial and multi-segment match-
       ing, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When PCRE2 is built, a default newline convention is set; this is  usu-
       ally  the standard convention for the operating system. The default can
       be overridden in a  compile  context.   During  matching,  the  newline
       choice  affects  the  behaviour  of  the  dot,  circumflex,  and dollar
       metacharacters. It may also alter the way the match  starting  position
       is advanced after a match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
       set as the newline convention, and a match attempt  for  an  unanchored
       pattern fails when the current starting position is at a CRLF sequence,
       and the pattern contains no explicit matches for CR or  LF  characters,
       the  match  position  is  advanced by two characters instead of one, in
       other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected.  For  example,  if  the  pattern is .+A (and the PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing  at the start, it skips both the CR and the LF before retrying.
       However, the pattern [\r\n]A does match that string,  because  it  con-
       tains an explicit CR or LF reference, and so advances only by one char-
       acter after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those  characters  in  the  pattern,  or  one  of  the  \r or \n escape
       sequences. Implicit matches such as [^X] do not  count,  nor  does  \s,
       even though it includes CR and LF in the characters that it matches.

       Notwithstanding  the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.
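
       The advance rule described in this section can be illustrated with a
       small simulation of the starting positions that are tried when every
       attempt fails. This is a sketch that does not call PCRE2:
       start_positions() and its two flag parameters are hypothetical, and it
       ignores any start-of-match optimizations that the library may apply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulation, not part of PCRE2. Records the starting
   positions tried for an unanchored pattern when every attempt fails.
   `crlf_is_newline` stands for a CRLF, ANYCRLF, or ANY convention;
   `explicit_cr_or_lf` stands for "the pattern contains an explicit
   match for CR or LF". Returns the number of positions stored. */
size_t start_positions(const char *subject, size_t length,
                       int crlf_is_newline, int explicit_cr_or_lf,
                       size_t *positions, size_t max_positions)
{
    size_t count = 0;
    size_t pos = 0;
    assert(subject != NULL && positions != NULL);
    while (pos <= length && count < max_positions) {
        positions[count++] = pos;
        if (crlf_is_newline && !explicit_cr_or_lf && pos + 1 < length &&
            subject[pos] == '\r' && subject[pos + 1] == '\n')
            pos += 2;           /* skip to after the CRLF */
        else
            pos += 1;           /* ordinary one-character advance */
    }
    return count;
}
```

       For the subject "\r\nA", the positions are 0, 2, 3 when the pattern has
       no explicit CR or LF (as with .+A), and 0, 1, 2, 3 when it has one (as
       with [\r\n]A).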


HOW pcre2_match() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In  general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject  may  be  picked  out  by
       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
       Friedl's book, this is called "capturing"  in  what  follows,  and  the
       phrase  "capturing subpattern" or "capturing group" is used for a frag-
       ment of a pattern that picks out a substring.  PCRE2  supports  several
       other kinds of parenthesized subpattern that do not cause substrings to
       be captured. The pcre2_pattern_info() function can be used to find  out
       how many capturing subpatterns there are in a compiled pattern.

       A  successful match returns the overall matched string and any captured
       substrings to the caller via a vector of  PCRE2_SIZE  values.  This  is
       called  the ovector, and is contained within the match data block.  You
       can obtain direct access to  the  ovector  by  calling  pcre2_get_ovec-
       tor_pointer()  to  find  its  address, and pcre2_get_ovector_count() to
       find the number of pairs of values it contains. Alternatively, you  can
       use the auxiliary functions for accessing captured substrings by number
       or by name (see below).

       Within the ovector, the first in each pair of values is set to the off-
       set of the first code unit of a substring, and the second is set to the
       offset of the first code unit after the end of a substring. These  val-
       ues  are always code unit offsets, not character offsets. That is, they
       are byte offsets in the 8-bit library, 16-bit  offsets  in  the  16-bit
       library, and 32-bit offsets in the 32-bit library.

       After  a  partial  match  (error  return PCRE2_ERROR_PARTIAL), only the
       first pair of offsets (that is, ovector[0]  and  ovector[1])  are  set.
       They  identify  the part of the subject that was partially matched. See
       the pcre2partial documentation for details of partial matching.

       After a successful match, the first pair of offsets identifies the por-
       tion  of the subject string that was matched by the entire pattern. The
       next pair is used for the first capturing subpattern, and  so  on.  The
       value  returned  by pcre2_match() is one more than the highest numbered
       pair that has been set. For example, if two substrings have  been  cap-
       tured,  the returned value is 3. If there are no capturing subpatterns,
       the return value from a successful match is 1, indicating that just the
       first pair of offsets has been set.

       If  a  pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the  match.   For  example,  if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0.

       If a capturing subpattern is matched repeatedly within a single match
       operation, it is the last portion of the subject that it matched that
       is returned.

       If the ovector is too small to hold all the captured substring offsets,
       as  much  as possible is filled in, and the function returns a value of
       zero. If captured substrings are not of interest, pcre2_match() may  be
       called with a match data block whose ovector is of minimum length (that
       is, one pair). However, if the pattern contains back references and the
       ovector is not big enough to remember the related substrings, PCRE2 has
       to get additional memory for use during matching. Thus  it  is  usually
       advisable to set up a match data block containing an ovector of reason-
       able size.

       It is possible for capturing subpattern number n+1 to match  some  part
       of the subject when subpattern n has not been used at all. For example,
       if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
       return from the function is 4, and subpatterns 1 and 3 are matched, but
       2 is not. When this happens, both values in  the  offset  pairs  corre-
       sponding to unused subpatterns are set to PCRE2_UNSET.

       Offset  values  that correspond to unused subpatterns at the end of the
       expression are also set to PCRE2_UNSET.  For  example,  if  the  string
       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
       are not matched. The return from the function is 2, because the highest
       used capturing subpattern number is 1. The offsets for the second and
       third capturing subpatterns (assuming the vector is large enough, of
       course) are set to PCRE2_UNSET.

       Elements in the ovector that do not correspond to capturing parentheses
       in the pattern are never changed. That is, if a pattern contains n cap-
       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
       pcre2_match(). The other elements retain whatever  values  they  previ-
       ously had.
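
       The way the offset pairs and the return value fit together can be
       sketched as follows. The code works on a plain array of size_t values
       and does not call PCRE2; group_length() is a hypothetical helper, and
       UNSET_OFFSET is a local stand-in for PCRE2_UNSET, which pcre2.h defines
       as ~(PCRE2_SIZE)0.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for PCRE2_UNSET (see the lead-in text). */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical helper, not part of PCRE2. `rc` is a positive return
   value from a successful match and `ovector` holds the offset pairs.
   Returns the length in code units of group `n`, or -1 if the group
   lies beyond the highest pair that was set or did not participate. */
long group_length(const size_t *ovector, int rc, int n)
{
    assert(ovector != NULL && rc > 0 && n >= 0);
    if (n >= rc)
        return -1;              /* unused group at the end of the pattern */
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == UNSET_OFFSET || end == UNSET_OFFSET)
        return -1;              /* unused group in the middle */
    if (start > end)
        return 0;               /* \K inside an assertion: treat as empty */
    return (long)(end - start);
}
```

       With the (a|(z))(bc) example above, the return value is 4 and the pairs
       are (0,3), (0,1), (unset,unset), (1,3), so the helper reports lengths
       3, 1, -1, and 2 for groups 0 to 3.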


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As  well as the offsets in the ovector, other information about a match
       is retained in the match data block and can be retrieved by  the  above
       functions  in  appropriate  circumstances.  If they are called at other
       times, the result is undefined.

       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
       failure  to  match  (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
       able, and pcre2_get_mark() can be called. It returns a pointer  to  the
       zero-terminated  name,  which is within the compiled pattern. Otherwise
       NULL is returned. After a successful match, the (*MARK)  name  that  is
       returned  is  the last one encountered on the matching path through the
       pattern. After a "no match" or a partial match,  the  last  encountered
       (*MARK) name is returned. For example, consider this pattern:

         ^(*MARK:A)((*MARK:B)a|b)c

       When  it  matches "bc", the returned mark is A. The B mark is "seen" in
       the first branch of the group, but it is not on the matching  path.  On
       the  other  hand,  when  this pattern fails to match "bx", the returned
       mark is B.

       After a successful match, a partial match, or one of  the  invalid  UTF
       errors  (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
       be called. After a successful or partial match it returns the code unit
       offset  of  the character at which the match started. For a non-partial
       match, this can be different to the value of ovector[0] if the  pattern
       contains  the  \K escape sequence. After a partial match, however, this
       value is always the same as ovector[0] because \K does not  affect  the
       result of a partial match.

       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
       the code unit offset of the invalid UTF character. Details are given in
       the pcre2unicode page.


ERROR RETURNS FROM pcre2_match()

       If  pcre2_match() fails, it returns a negative number. This can be con-
       verted to a text string by calling pcre2_get_error_message().  Negative
       error  codes  are  also returned by other functions, and are documented
       with them.  The codes are given names in the header file. If UTF check-
       ing is in force and an invalid UTF subject string is detected, one of a
       number of UTF-specific negative error codes is  returned.  Details  are
       given in the pcre2unicode page. The following are the other errors that
       may be returned by pcre2_match():

         PCRE2_ERROR_NOMATCH

       The subject string did not match the pattern.

         PCRE2_ERROR_PARTIAL

       The subject string did not match, but it did match partially.  See  the
       pcre2partial documentation for details of partial matching.

         PCRE2_ERROR_BADMAGIC

       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer. This is  the  error
       that is returned when the magic number is not present.

         PCRE2_ERROR_BADMODE

       This  error  is  given  when  a  pattern that was compiled by the 8-bit
       library is passed to a 16-bit  or  32-bit  library  function,  or  vice
       versa.

         PCRE2_ERROR_BADOFFSET

       The value of startoffset was greater than the length of the subject.

         PCRE2_ERROR_BADOPTION

       An unrecognized bit was set in the options argument.

         PCRE2_ERROR_BADUTFOFFSET

       The UTF code unit sequence that was passed as a subject was checked and
       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
       value  of startoffset did not point to the beginning of a UTF character
       or the end of the subject.

         PCRE2_ERROR_CALLOUT

       This error is never generated by pcre2_match() itself. It  is  provided
       for use by callout functions that want to cause pcre2_match() to return
       a distinctive  error  code.  See  the  pcre2callout  documentation  for
       details.

         PCRE2_ERROR_INTERNAL

       An  unexpected  internal error has occurred. This error could be caused
       by a bug in PCRE2 or by overwriting of the compiled pattern.

         PCRE2_ERROR_JIT_BADOPTION

       This error is returned when a pattern  that  was  successfully  studied
       using  JIT is being matched, but the matching mode (partial or complete
       match) does not correspond to any JIT compilation mode.  When  the  JIT
       fast  path  function  is used, this error may be also given for invalid
       options. See the pcre2jit documentation for more details.

         PCRE2_ERROR_JIT_STACKLIMIT

       This error is returned when a pattern  that  was  successfully  studied
       using  JIT  is being matched, but the memory available for the just-in-
       time processing stack is not large enough. See the pcre2jit  documenta-
       tion for more details.

         PCRE2_ERROR_MATCHLIMIT

       The backtracking limit was reached.

         PCRE2_ERROR_NOMEMORY

       If  a  pattern  contains  back  references,  but the ovector is not big
       enough to remember the referenced substrings, PCRE2  gets  a  block  of
       memory at the start of matching to use for this purpose. There are some
       other special cases where extra memory is needed during matching.  This
       error is given when memory cannot be obtained.

         PCRE2_ERROR_NULL

       Either the code, subject, or match_data argument was passed as NULL.

         PCRE2_ERROR_RECURSELOOP

       This  error  is  returned  when  pcre2_match() detects a recursion loop
       within the pattern. Specifically, it means that either the  whole  pat-
       tern or a subpattern has been called recursively for the second time at
       the same position in the subject  string.  Some  simple  patterns  that
       might  do  this are detected and faulted at compile time, but more com-
       plicated cases, in particular mutual recursions between  two  different
       subpatterns, cannot be detected until matching is attempted.

         PCRE2_ERROR_RECURSIONLIMIT

       The internal recursion limit was reached.
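
       A caller typically sorts the return value of pcre2_match() into a few
       broad cases before looking at individual error codes. The sketch below
       shows one way of doing that. It does not include pcre2.h, so the two
       constants are local stand-ins for PCRE2_ERROR_NOMATCH and
       PCRE2_ERROR_PARTIAL; the outcome type and classify_match_result() are
       hypothetical names.

```c
#include <assert.h>

/* Local stand-ins for values that pcre2.h defines. A real program
   would include pcre2.h and use PCRE2_ERROR_NOMATCH and
   PCRE2_ERROR_PARTIAL instead. */
enum { ERR_NOMATCH = -1, ERR_PARTIAL = -2 };

/* Hypothetical classification of a pcre2_match() return value. */
typedef enum {
    MATCHED,                /* rc > 0: rc - 1 is the highest pair set */
    MATCHED_OVECTOR_FULL,   /* rc == 0: matched, ovector was too small */
    NO_MATCH,               /* the subject did not match */
    PARTIAL,                /* a partial match was found */
    FAILED                  /* any other negative code: a real error */
} outcome;

outcome classify_match_result(int rc)
{
    if (rc > 0) return MATCHED;
    if (rc == 0) return MATCHED_OVECTOR_FULL;
    if (rc == ERR_NOMATCH) return NO_MATCH;
    if (rc == ERR_PARTIAL) return PARTIAL;
    return FAILED;
}
```

       For the FAILED case, the text of the error can be obtained by passing
       the code to pcre2_get_error_message().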


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured  substrings  can  be accessed directly by using the ovector as
       described above.  For convenience, auxiliary functions are provided for
       extracting   captured  substrings  as  new,  separate,  zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and  has  a  further  zero  added on the end, but the result is not, of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers refer-
       ring to substrings captured by parenthesized groups.  After  a  partial
       match,  only  substring  zero  is  available. An attempt to extract any
       other substring gives the error PCRE2_ERROR_PARTIAL. The  next  section
       describes similar functions for extracting captured substrings by name.

       If  a  pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the  match.   For  example,  if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are  2  and  0.  In
       this  situation,  calling  these functions with a zero substring number
       extracts a zero-length empty string.

       You can find the length in code units of a captured  substring  without
       extracting  it  by calling pcre2_substring_length_bynumber(). The first
       argument is a pointer to the match data block, the second is the  group
       number,  and the third is a pointer to a variable into which the length
       is placed. If you just want to know whether or not  the  substring  has
       been captured, you can pass the third argument as NULL.

       The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
       string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
       copies  it  into  new memory, obtained using the same memory allocation
       function that was used for the match data block. The  first  two  argu-
       ments  of  these  functions are a pointer to the match data block and a
       capturing group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units.  This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to variables that are updated with a pointer to the new memory and  the
       number  of  code units that comprise the substring, again excluding the
       terminating zero. When the substring is no longer  needed,  the  memory
       should be freed by calling pcre2_substring_free().

       The  return  value  from  all these functions is zero for success, or a
       negative error code. If the pattern match  failed,  the  match  failure
       code  is  returned.   If  a  substring number greater than zero is used
       after a partial match, PCRE2_ERROR_PARTIAL is returned. Other  possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There is no substring with that number in the  pattern,  that  is,  the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The  substring  did  not  participate in the match. For example, if the
       pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
       tains at least two capturing slots, substring number 1 is unset.
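
       The checks described in this section can be pictured with a small
       emulation that works directly on an ovector. This is a sketch of the
       documented behaviour, not the library's implementation: copy_group(),
       its parameters, and the COPY_xxx codes are hypothetical, and the order
       in which the real function makes its checks may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for PCRE2_UNSET, defined in pcre2.h as ~(PCRE2_SIZE)0. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical result codes for this sketch only. The real functions
   return the PCRE2_ERROR_xxx values named in the comments. */
enum {
    COPY_OK = 0,
    COPY_NOMEMORY = -1,     /* cf. PCRE2_ERROR_NOMEMORY */
    COPY_NOSUBSTRING = -2,  /* cf. PCRE2_ERROR_NOSUBSTRING */
    COPY_UNAVAILABLE = -3,  /* cf. PCRE2_ERROR_UNAVAILABLE */
    COPY_UNSET = -4         /* cf. PCRE2_ERROR_UNSET */
};

/* group_count: capturing groups in the pattern.
   pair_count:  pairs of offsets that the ovector can hold.
   *bufflen:    buffer size on entry, substring length on success. */
int copy_group(const char *subject, const size_t *ovector,
               unsigned group_count, unsigned pair_count, unsigned number,
               char *buffer, size_t *bufflen)
{
    assert(subject != NULL && ovector != NULL && bufflen != NULL);
    if (number > group_count) return COPY_NOSUBSTRING;
    if (number >= pair_count) return COPY_UNAVAILABLE;
    size_t start = ovector[2 * number];
    size_t end = ovector[2 * number + 1];
    if (start == UNSET_OFFSET) return COPY_UNSET;
    size_t len = (start > end) ? 0 : end - start;   /* \K case: empty */
    if (len + 1 > *bufflen) return COPY_NOMEMORY;   /* room for the zero */
    memcpy(buffer, subject + start, len);
    buffer[len] = '\0';
    *bufflen = len;         /* length excludes the terminating zero */
    return COPY_OK;
}
```

       With the (abc)|(def) example and the subject "def", group 2 is copied
       as "def" with a length of 3, while group 1 gives the "unset" result.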


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       The  pcre2_substring_list_get()  function  extracts  all available sub-
       strings and builds a list of pointers to  them.  It  also  (optionally)
       builds  a  second  list  that  contains  their lengths (in code units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This function must be called only after a successful match.  If  called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The  address of the memory block is returned via listptr, which is also
       the start of the list of string pointers. The end of the list is marked
       by  a  NULL pointer. The address of the list of lengths is returned via
       lengthsptr. If your strings do not contain binary zeros and you do  not
       therefore need the lengths, you may supply NULL as the lengthsptr argu-
       ment to disable the creation of a list of lengths.  The  yield  of  the
       function  is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
       ory block could not be obtained. When the list is no longer needed,  it
       should be freed by calling pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when capturing subpattern number n+1 matches some part of the  subject,
       but  subpattern n has not been used at all, it returns an empty string.
       This can be distinguished  from  a  genuine  zero-length  substring  by
       inspecting the appropriate offset in the ovector, which contains
       PCRE2_UNSET for unset substrings, or by calling
       pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be  unique  (PCRE2_DUPNAMES  was not set), you can find the number from
       the name by calling pcre2_substring_number_from_name(). The first argu-
       ment  is the compiled pattern, and the second is the name. The yield of
       the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
       is  no  subpattern  of  that  name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
       there is more than one subpattern of that name. Given the  number,  you
       can  extract  the  substring  directly,  or  use  one  of the functions
       described above.

       For convenience, there are also "byname" functions that  correspond  to
       the  "bynumber"  functions,  the  only difference being that the second
       argument is a name instead of a number. If PCRE2_DUPNAMES  is  set  and
       there are duplicate names, these functions scan all the groups with the
       given name, and return the first named string that is set.

       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
       returned.  If  all  groups  with the name have numbers that are greater
       than the number of slots in  the  ovector,  PCRE2_ERROR_UNAVAILABLE  is
       returned.  If  there  is at least one group with a slot in the ovector,
       but no group is found to be set, PCRE2_ERROR_UNSET is returned.
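
       The selection rule for duplicate names, together with the three error
       cases just listed, can be sketched as follows. The code does not call
       PCRE2: first_set_group() and the NAME_xxx codes are hypothetical, and
       the caller is assumed to have already collected the numbers of all the
       groups that carry the name.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for PCRE2_UNSET, defined in pcre2.h as ~(PCRE2_SIZE)0. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical result codes for this sketch only. */
enum {
    NAME_NOSUBSTRING = -1,  /* cf. PCRE2_ERROR_NOSUBSTRING */
    NAME_UNAVAILABLE = -2,  /* cf. PCRE2_ERROR_UNAVAILABLE */
    NAME_UNSET = -3         /* cf. PCRE2_ERROR_UNSET */
};

/* groups: numbers of all groups with the given name, in pattern order.
   Returns the number of the first such group that is set, or one of
   the negative codes above. */
int first_set_group(const unsigned *groups, size_t ngroups,
                    const size_t *ovector, unsigned pair_count)
{
    int any_in_range = 0;
    assert(ovector != NULL);
    if (ngroups == 0) return NAME_NOSUBSTRING;
    for (size_t i = 0; i < ngroups; i++) {
        unsigned n = groups[i];
        if (n >= pair_count) continue;      /* no slot in the ovector */
        any_in_range = 1;
        if (ovector[2 * n] != UNSET_OFFSET) return (int)n;
    }
    return any_in_range ? NAME_UNSET : NAME_UNAVAILABLE;
}
```

       If groups 1 and 2 share a name and only group 2 took part in the match,
       the helper returns 2.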

       Warning: If the pattern uses the (?| feature to set up multiple subpat-
       terns  with  the  same number, as described in the section on duplicate
       subpattern numbers in the pcre2pattern page, you cannot  use  names  to
       distinguish  the  different subpatterns, because names are not included
       in the compiled code. The matching process uses only numbers. For  this
       reason,  the  use of different names for subpatterns of the same number
       causes an error at compile time.


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This function calls pcre2_match() and then makes a copy of the  subject
       string  in  outputbuffer,  replacing the part that was matched with the
       replacement string, whose length is supplied in rlength.  This  can  be
       given as PCRE2_ZERO_TERMINATED for a zero-terminated string.

       In  the replacement string, which is interpreted as a UTF string in UTF
       mode, and is checked for UTF  validity  unless  the  PCRE2_NO_UTF_CHECK
       option is set, a dollar character is an escape character that can spec-
       ify the insertion of characters from capturing groups in  the  pattern.
       The following forms are recognized:

         $$      insert a dollar character
         $<n>    insert the contents of group <n>
         ${<n>}  insert the contents of group <n>

       Either  a  group  number  or  a  group name can be given for <n>. Curly
       brackets are required only if the following character would  be  inter-
       preted as part of the number or name. The number may be zero to include
       the entire matched string. For example, if the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the result
       is "=+babcb+=". Group insertion is done by calling
       pcre2_substring_copy_byname() or pcre2_substring_copy_bynumber() as
       appropriate.

       The  first  seven  arguments  of pcre2_substitute() are the same as for
       pcre2_match(), except that the partial matching options are not permit-
       ted,  and  match_data may be passed as NULL, in which case a match data
       block is obtained and freed within this function, using memory  manage-
       ment  functions from the match context, if provided, or else those that
       were used to allocate memory for the compiled code.

       There is one additional option, PCRE2_SUBSTITUTE_GLOBAL,  which  causes
       the function to iterate over the subject string, replacing every match-
       ing substring. If this is not set, only the first matching substring is
       replaced.

       The  outlengthptr  argument  must point to a variable that contains the
       length, in code units, of the output buffer. It is updated  to  contain
       the length of the new string, excluding the trailing zero that is auto-
       matically added.

       The function returns the number of replacements that  were  made.  This
       may  be  zero  if  no  matches  were found, and is never greater than 1
       unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg-
       ative  error code is returned. Except for PCRE2_ERROR_NOMATCH (which is
       never returned), any errors from pcre2_match() or the substring copying
       functions  are  passed  straight  back.  PCRE2_ERROR_BADREPLACEMENT  is
       returned for an invalid replacement string (unrecognized sequence  fol-
       lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out-
       put buffer is not big enough.
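
       How the dollar forms in a replacement string are expanded can be
       sketched with a small stand-alone routine. This is an illustration
       under stated assumptions, not the code of pcre2_substitute(): it
       handles numbered groups only (no names), it works on a plain array of
       offsets, and expand_replacement() and the EXPAND_xxx codes are
       hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for PCRE2_UNSET, defined in pcre2.h as ~(PCRE2_SIZE)0. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical result codes for this sketch only. */
enum {
    EXPAND_OK = 0,
    EXPAND_BADREPLACEMENT = -1, /* cf. PCRE2_ERROR_BADREPLACEMENT */
    EXPAND_NOMEMORY = -2,       /* cf. PCRE2_ERROR_NOMEMORY */
    EXPAND_NOSUBSTRING = -3,    /* cf. PCRE2_ERROR_NOSUBSTRING */
    EXPAND_UNSET = -4           /* cf. PCRE2_ERROR_UNSET */
};

/* Expands `repl` into `out`, whose capacity `outcap` includes the
   trailing zero. `ovector` holds `pair_count` offset pairs that refer
   to `subject`. Recognizes $$, $<n> and ${<n>} for numeric <n>. */
int expand_replacement(const char *subject, const size_t *ovector,
                       unsigned pair_count, const char *repl,
                       char *out, size_t outcap)
{
    size_t used = 0;
    const char *p = repl;
    assert(subject != NULL && ovector != NULL && out != NULL);
    while (*p != '\0') {
        const char *src = p;    /* default: copy one literal character */
        size_t len = 1;
        if (*p != '$') {
            p++;
        } else if (p[1] == '$') {
            src = p + 1;        /* $$ inserts a single dollar */
            p += 2;
        } else {
            unsigned n = 0;
            int braced = (p[1] == '{');
            p += braced ? 2 : 1;
            if (!isdigit((unsigned char)*p)) return EXPAND_BADREPLACEMENT;
            while (isdigit((unsigned char)*p))
                n = n * 10 + (unsigned)(*p++ - '0');
            if (braced && *p++ != '}') return EXPAND_BADREPLACEMENT;
            if (n >= pair_count) return EXPAND_NOSUBSTRING;
            if (ovector[2 * n] == UNSET_OFFSET) return EXPAND_UNSET;
            src = subject + ovector[2 * n];
            len = ovector[2 * n + 1] - ovector[2 * n];
        }
        if (used + len + 1 > outcap) return EXPAND_NOMEMORY;
        memcpy(out + used, src, len);
        used += len;
    }
    if (used + 1 > outcap) return EXPAND_NOMEMORY;
    out[used] = '\0';
    return EXPAND_OK;
}
```

       For the a(b)c example, the offset pairs are (1,4) and (2,3), and the
       replacement "+$1$0$1+" expands to "+babcb+". Copying the unmatched
       text before and after the match around it gives "=+babcb+=".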


DUPLICATE SUBPATTERN NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES  option,  names  for
       subpatterns  are  not required to be unique. Duplicate names are always
       allowed for subpatterns with the same number, created by using the  (?|
       feature.  Indeed,  if  such subpatterns are named, they are required to
       use the same names.

       Normally, patterns with duplicate names are such that in any one match,
       only  one of the named subpatterns participates. An example is shown in
       the pcre2pattern documentation.

       When  duplicates   are   present,   pcre2_substring_copy_byname()   and
       pcre2_substring_get_byname()  return  the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If  you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan()  function.  The
       first  argument is the compiled pattern, and the second is the name. If
       the third and fourth arguments are NULL, the function returns  a  group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to variables that are updated by the function. After it has  run,  they
       point to the first and last entries in the name-to-number table for the
       given name, and the function returns the length of each entry  in  code
       units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       Information about a pattern. Given all the relevant entries for
       the name, you can extract each of their numbers, and hence the captured
       data.


FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The  traditional  matching  function  uses a similar algorithm to Perl,
       which stops when it finds the first match at a given point in the  sub-
       ject. If you want to find all possible matches, or the longest possible
       match at a given position,  consider  using  the  alternative  matching
       function  (see  below) instead. If you cannot use the alternative func-
       tion, you can kludge it up by making use of the callout facility, which
       is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.  When your callout function is called, extract and save the  cur-
       rent  matched  substring.  Then return 1, which forces pcre2_match() to
       backtrack and try other alternatives. Ultimately, when it runs  out  of
       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The  function  pcre2_dfa_match()  is  called  to match a subject string
       against a compiled pattern, using a matching algorithm that  scans  the
       subject  string  just  once, and does not backtrack. This has different
       characteristics to the normal algorithm, and  is  not  compatible  with
       Perl.  Some of the features of PCRE2 patterns are not supported. Never-
       theless, there are times when this kind of matching can be useful.  For
       a  discussion  of  the  two matching algorithms, and a list of features
       that pcre2_dfa_match() does not support, see the pcre2matching documen-
       tation.

       The  arguments  for  the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other com-
       mon arguments are used in the same way as for pcre2_match(),  so  their
       description is not repeated here.

       The  two  additional  arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It  is  used  for
       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
       workspace is needed for patterns and subjects where there are a lot  of
       potential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The  unused  bits of the options argument for pcre2_dfa_match() must be
       zero. The only bits that may be set are  PCRE2_ANCHORED,  PCRE2_NOTBOL,
       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK,       PCRE2_PARTIAL_HARD,       PCRE2_PARTIAL_SOFT,
       PCRE2_DFA_SHORTEST,  and  PCRE2_DFA_RESTART.  All  but the last four of
       these are exactly the same as for pcre2_match(), so  their  description
       is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These  have  the  same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is  reached,  there  have  been  no  complete
       matches, but there is still at least one matching possibility. The por-
       tion of the string that was inspected when the  longest  partial  match
       was found is set as the first matching string in both cases. There is a
       more detailed discussion of partial and  multi-segment  matching,  with
       examples, in the pcre2partial documentation.

         PCRE2_DFA_SHORTEST

       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the alterna-
       tive  algorithm  works, this is necessarily the shortest possible match
       at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When pcre2_dfa_match() returns a partial match, it is possible to  call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the same
       vector as before because data about the match so far is  left  in  them
       after a partial match. There is more discussion of this facility in the
       pcre2partial documentation.

   Successful returns from pcre2_dfa_match()

       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of the function start at the same point in  the  subject.  The  shorter
       matches  are all initial substrings of the longer matches. For example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something> <something else> <something further>
         <something> <something else>
         <something>

       On success, the yield of the function is a number  greater  than  zero,
       which  is  the  number  of  matched substrings. The offsets of the sub-
       strings are returned in the ovector, and can be extracted by number  in
       the  same way as for pcre2_match(), but the numbers bear no relation to
       any capturing groups that may exist in the pattern, because DFA  match-
       ing does not support group capture.

       Calls  to  the  convenience  functions  that extract substrings by name
       return the error PCRE2_ERROR_DFA_UFUNC (unsupported function)  if  used
       after a DFA match. The convenience functions that extract substrings by
       number never return PCRE2_ERROR_NOSUBSTRING, and the meanings  of  some
       other errors are slightly different:

         PCRE2_ERROR_UNAVAILABLE

       The ovector is not big enough to include a slot for the given substring
       number.

         PCRE2_ERROR_UNSET

       There is a slot in the ovector  for  this  substring,  but  there  were
       insufficient matches to fill it.

       The  matched  strings  are  stored  in  the ovector in reverse order of
       length; that is, the longest matching string is first.  If  there  were
       too  many matches to fit into the ovector, the yield of the function is
       zero, and the vector is filled with the longest matches.

       NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
       character  repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
       matching,  this  means  that  only  one possible match is found. If you
       really do want multiple matches in such cases, either use  an  ungreedy
       repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
       compiling.

   Error returns from pcre2_dfa_match()

       The pcre2_dfa_match() function returns a negative number when it fails.
       Many  of  the  errors  are  the same as for pcre2_match(), as described
       above.  There are in addition the following errors that are specific to
       pcre2_dfa_match():

         PCRE2_ERROR_DFA_UITEM

       This  return  is  given  if pcre2_dfa_match() encounters an item in the
       pattern that it does not support, for instance, the use of \C or a back
       reference.

         PCRE2_ERROR_DFA_UCOND

       This  return  is given if pcre2_dfa_match() encounters a condition item
       that uses a back reference for the condition, or a test  for  recursion
       in a specific group. These are not supported.

         PCRE2_ERROR_DFA_WSSIZE

       This  return  is  given  if  pcre2_dfa_match() runs out of space in the
       workspace vector.

         PCRE2_ERROR_DFA_RECURSE

       When a recursive subpattern is processed, the matching  function  calls
       itself recursively, using private memory for the ovector and workspace.
       This error is given if the internal ovector is not large  enough.  This
       should be extremely rare, as a vector of size 1000 is used.

         PCRE2_ERROR_DFA_BADRESTART

       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
       some plausibility checks are made on the  contents  of  the  workspace,
       which  should  contain data about the previous partial match. If any of
       these checks fail, this error is given.


SEE ALSO

       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
       pcre2partial(3),    pcre2posix(3),    pcre2sample(3),    pcre2stack(3),
       pcre2unicode(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 January 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

BUILDING PCRE2

       PCRE2  is distributed with a configure script that can be used to build
       the library in Unix-like environments using the applications  known  as
       Autotools. Also in the distribution are files to support building using
       CMake instead of configure.  The  text  file  README  contains  general
       information  about  building  with Autotools (some of which is repeated
       below), and also has some comments about building on various  operating
       systems.  There  is a lot more information about building PCRE2 without
       using Autotools (including information about using CMake  and  building
       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
       consult this file as well as the README file if you are building  in  a
       non-Unix-like environment.


PCRE2 BUILD-TIME OPTIONS

       The rest of this document describes the optional features of PCRE2 that
       can be selected when the library is compiled. It  assumes  use  of  the
       configure  script,  where  the  optional features are selected or dese-
       lected by providing options to configure before running the  make  com-
       mand.  However,  the same options can be selected in both Unix-like and
       non-Unix-like environments if you are using CMake instead of  configure
       to build PCRE2.

       If  you  are not using Autotools or CMake, option selection can be done
       by editing the config.h file, or by passing parameter settings  to  the
       compiler, as described in NON-AUTOTOOLS-BUILD.

       The complete list of options for configure (which includes the standard
       ones such as the  selection  of  the  installation  directory)  can  be
       obtained by running

         ./configure --help

       The  following  sections  include  descriptions  of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults  for  the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so  the  complemen-
       tary  option always exists as well, but as it specifies the default, it
       is not described.


BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       By default, a library called libpcre2-8 is built, containing  functions
       that  take  string arguments contained in vectors of bytes, interpreted
       either as single-byte characters, or UTF-8 strings. You can also  build
       two  other libraries, called libpcre2-16 and libpcre2-32, which process
       strings that are contained in vectors of 16-bit and 32-bit code  units,
       respectively. These can be interpreted either as single-unit characters
       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
       or both of the following to the configure command:

         --enable-pcre2-16
         --enable-pcre2-32

       If you do not want the 8-bit library, add

         --disable-pcre2-8

       as  well.  At least one of the three libraries must be built. Note that
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
       an 8-bit program. Neither of these is built if you select only the
       16-bit or 32-bit libraries.


BUILDING SHARED AND STATIC LIBRARIES

       The Autotools PCRE2 building process uses libtool to build both  shared
       and  static  libraries by default. You can suppress an unwanted library
       by adding one of

         --disable-shared
         --disable-static

       to the configure command.


UNICODE AND UTF SUPPORT

       By default, PCRE2 is built with support for Unicode and  UTF  character
       strings.  To build it without Unicode support, add

         --disable-unicode

       to  the configure command. This setting applies to all three libraries.
       It is not possible to build  one  library  with  Unicode  support,  and
       another without, in the same configuration.

       Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
       UTF-16 or UTF-32. To do that, applications that use the library have to
       set  the  PCRE2_UTF  option when they call pcre2_compile() to compile a
       pattern.

       UTF support allows the libraries to process character code points up to
       0x10ffff  in the strings that they handle. It also provides support for
       accessing the Unicode properties  of  such  characters,  using  pattern
       escapes  such  as  \P, \p, and \X. Only the general category properties
       such as Lu and Nd are supported. Details are given in the  pcre2pattern
       documentation.


JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiler support is included in the build by specifying

         --enable-jit

       This  support  is available only for certain hardware architectures. If
       this option is set for an unsupported architecture,  a  building  error
       occurs.   See the pcre2jit documentation for a discussion of JIT usage.
       When JIT support is enabled, pcre2grep automatically makes use  of  it,
       unless you add

         --disable-pcre2grep-jit

       to the "configure" command.


NEWLINE RECOGNITION

       By  default, PCRE2 interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline  character  on  Unix-like
       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure  command.  There  is  also  an  --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two-character sequence CRLF (CR immediately followed by LF). If you
       want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
       newline sequences are the three just mentioned, plus the single charac-
       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
       U+2029).

       Whatever default line ending convention is selected when PCRE2 is built
       can  be  overridden by applications that use the library. At build time
       it is conventional to use the standard for your operating system.


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches  any  Unicode  newline
       sequence,  independently  of  what has been selected as the line ending
       sequence. If you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
       ever  is selected when PCRE2 is built can be overridden by applications
       that use the library.


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used  to  point  from  one
       part  to another (for example, from an opening parenthesis to an alter-
       nation metacharacter). By default, in the 8-bit and  16-bit  libraries,
       two-byte  values  are used for these offsets, leading to a maximum size
       for a compiled pattern of around 64K code units. This is sufficient  to
       handle all but the most gigantic patterns. Nevertheless, some people do
       want to process truly enormous patterns, so it is possible  to  compile
       PCRE2  to  use three-byte or four-byte offsets by adding a setting such
       as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4.  For  the
       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
       using longer offsets slows down the operation of PCRE2 because  it  has
       to  load additional data when handling them. For the 32-bit library the
       value is always 4 and cannot be overridden; the value  of  --with-link-
       size is ignored.


AVOIDING EXCESSIVE STACK USAGE

       When  matching  with the pcre2_match() function, PCRE2 implements back-
       tracking by making recursive  calls  to  an  internal  function  called
       match().  In  environments where the size of the stack is limited, this
       can severely limit PCRE2's operation. (The Unix  environment  does  not
       usually  suffer from this problem, but it may sometimes be necessary to
       increase  the  maximum  stack  size.  There  is  a  discussion  in  the
       pcre2stack  documentation.)  An  alternative approach to recursion that
       uses memory from the heap to remember data, instead of using  recursive
       function  calls, has been implemented to work round the problem of lim-
       ited stack size. If you want to build a version  of  PCRE2  that  works
       this way, add

         --disable-stack-for-recursion

       to the configure command. By default, the system functions malloc() and
       free() are called to manage the heap memory that is required, but  cus-
       tom  memory  management  functions  can  be  called instead. PCRE2 runs
       noticeably more slowly when built in this way. This option affects only
       the pcre2_match() function; it is not relevant for pcre2_dfa_match().


LIMITING PCRE2 RESOURCE USAGE

       Internally, PCRE2 has a function called match(), which it calls repeat-
       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
       pcre2_match() function. By controlling the maximum number of times this
       function may be called during a single matching operation, a limit  can
       be  placed on the resources used by a single call to pcre2_match(). The
       limit can be changed at run time, as described in the pcre2api documen-
       tation.  The default is 10 million, but this can be changed by adding a
       setting such as

         --with-match-limit=500000

       to  the  configure  command.  This  setting  has  no  effect   on   the
       pcre2_dfa_match() matching function.

       In  some  environments  it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
       for-recursion is specified) that is used. A second limit controls this;
       it  defaults  to  the  value  that is set for --with-match-limit, which
       imposes no additional constraints. However, you can set a  lower  limit
       by adding, for example,

         --with-match-limit-recursion=10000

       to  the  configure  command.  This  value can also be overridden at run
       time.


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to  the  configure  command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and  run.  This  outputs
       the source for a new set of tables, created in the default locale of your
       C run-time system. (This method of replacing the tables does  not  work
       if  you are cross compiling, because dftables is run on the local host.
       If you need to create alternative tables when cross compiling, you will
       have to do so "by hand".)


USING EBCDIC CODE

       PCRE2  assumes  by default that it will run in an environment where the
       character code is ASCII or Unicode, which is a superset of ASCII.  This
       is the case for most computer operating systems. PCRE2 can, however, be
       compiled to run in an 8-bit EBCDIC environment by adding

         --enable-ebcdic --disable-unicode

       to the configure command. This setting implies --enable-rebuild-charta-
       bles.  You  should  only  use  it if you know that you are in an EBCDIC
       environment (for example, an IBM mainframe operating system).

       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
       version  of  the  library. Consequently, --enable-unicode and --enable-
       ebcdic are mutually exclusive.

       The EBCDIC character that corresponds to an ASCII LF is assumed to have
       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
       is used. In such an environment you should use

         --enable-ebcdic-nl25

       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
       acter (which, in Unicode, is 0x85).

       The options that select newline behaviour, such as --enable-newline-is-
       cr, and equivalent run-time options, refer to these character values in
       an EBCDIC environment.


PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By  default,  pcre2grep reads all files as plain text. You can build it
       so that it recognizes files whose names end in .gz or .bz2,  and  reads
       them with libz or libbz2, respectively, by adding one or both of

         --enable-pcre2grep-libz
         --enable-pcre2grep-libbz2

       to the configure command. These options naturally require that the rel-
       evant libraries are installed on your system. Configuration  will  fail
       if they are not.


PCRE2GREP BUFFER SIZE

       pcre2grep  uses an internal buffer to hold a "window" on the file it is
       scanning, in order to be able to output "before" and "after" lines when
       it  finds  a match. The size of the buffer is controlled by a parameter
       whose default value is 20K. The buffer itself is three times this size,
       but because of the way it is used for holding "before" lines, the long-
       est line that is guaranteed to be processable is  the  parameter  size.
       You can change the default parameter value by adding, for example,

         --with-pcre2grep-bufsize=50K

       to  the  configure  command.  The caller of pcre2grep can override this
       value by using --buffer-size on the command line.


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test  is  linked  with  the  libreadline
       or libedit library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This  provides  line-editing
       and  history  facilities.  Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there  may  be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
       be  added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary.  The  INSTALL  file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If  your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations  to  mark
       certain  memory  regions  as  unaddressable.  This  allows it to detect
       invalid memory accesses, and  is  mostly  useful  for  debugging  PCRE2
       itself.


CODE COVERAGE REPORTING

       If  your  C  compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run  automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test  suite.  It  is
       equivalent  to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the  cover-
       age data itself.

         make coverage-clean-data

       This  removes  the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage  report.
       For  more  information about code coverage, see the gcov and lcov docu-
       mentation.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);


DESCRIPTION

       PCRE2  provides  a feature called "callout", which is a means of tempo-
       rarily passing control to the caller of PCRE2 in the middle of  pattern
       matching.  The caller of PCRE2 provides an external function by putting
       its entry point in a match context (see pcre2_set_callout() in the
       pcre2api documentation).

       Within  a  regular  expression,  (?C) indicates the points at which the
       external function is to be called.  Different  callout  points  can  be
       identified  by  putting  a number less than 256 after the letter C. The
       default value is zero.  For  example,  this  pattern  has  two  callout
       points:

         (?C1)abc(?C2)def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before  each
       item  in  the  pattern. For example, if PCRE2_AUTO_CALLOUT is used with
       the pattern

         A(\d{2}|--)

       it is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after  each  parenthesis  and
       alternation bar. If the pattern contains a conditional group whose con-
       dition is an assertion, an automatic callout  is  inserted  immediately
       before  the  condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)

       This applies only to assertion conditions (because they are  themselves
       independent groups).

       Automatic  callouts  can  be  used for tracking the progress of pattern
       matching.  The pcre2test program has a pattern  qualifier  (/auto_call-
       out)  that  sets  automatic callouts; when it is used, the output indi-
       cates how the pattern is being matched. This is useful information when
       you are trying to optimize the performance of a particular pattern.


MISSING CALLOUTS

       You  should  be  aware  that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc]  is
       compiled  as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This  indicates that when matching [bc] fails, there is no backtracking
       into a+ and therefore the callouts that would be taken  for  the  back-
       tracks  do  not  occur.  You can disable the auto-possessify feature by
       passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the  pat-
       tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.
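
       The effect on the number of callouts can be reproduced with a toy model
       of the anchored pattern a+[bc]. This is only a sketch: it is not PCRE2
       code, the function name count_class_attempts() is hypothetical, and it
       counts nothing except the callouts taken before [bc] (the "+2" lines in
       the traces above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the anchored pattern a+[bc]; this is NOT PCRE2 code.
   Returns how many times the [bc] item is attempted, which corresponds
   to the number of "+2" callout lines in a pcre2test trace. */
int count_class_attempts(const char *subject, int possessive)
{
    assert(subject != NULL);

    size_t run = strspn(subject, "a");   /* greedy a+ takes every leading 'a' */
    if (run == 0) return 0;              /* a+ fails, so [bc] is never reached */

    int attempts = 0;
    for (size_t taken = run; taken >= 1; taken--) {
        attempts++;                      /* one callout before each try of [bc] */
        char next = subject[taken];
        if (next == 'b' || next == 'c') break;   /* [bc] matched */
        if (possessive) break;           /* a++ never gives characters back */
    }
    return attempts;
}
```

       For the subject "aaaa" the model gives one attempt when the repeat is
       treated as possessive and four attempts when it is not, which agrees
       with the two traces shown above.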

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item  in  a  pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored.  If  PCRE2_DOTALL
       is  not set, a match can start only after an internal newline or at the
       beginning of the subject,  and  pcre2_compile()  remembers  this.  This
       optimization  is  disabled,  however, if .* is in an atomic group or if
       there is a back reference to the capturing group in which  it  appears.
       It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
       ever, the presence of callouts does not affect it.

       For example, if the pattern .*\d is  compiled  with  PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This  shows  that all match attempts start at the beginning of the sub-
       ject. In other words, the pattern is anchored.  You  can  disable  this
       optimization  by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
       put changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This  shows more match attempts, starting at the second subject charac-
       ter.  Another optimization, described in the next section,  means  that
       there is no subsequent attempt to match with an empty subject.

       If  a  pattern  has more than one top-level branch, automatic anchoring
       occurs if all branches are anchorable.

   Other optimizations

       Other optimizations that provide fast "no match"  results  also  affect
       callouts.  For example, if the pattern is

         ab(?C4)cd

       PCRE2  knows  that  any matching string must contain the letter "d". If
       the subject string is "abyz", the  lack  of  "d"  means  that  matching
       doesn't  ever  start,  and  the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       PCRE2 also knows the minimum length of  a  matching  string,  and  will
       immediately  give  a "no match" return without actually running a match
       if the subject is not long enough, or, for unanchored patterns,  if  it
       has been scanned far enough.

       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
       MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with
       (*NO_START_OPT).  This slows down the matching process, but does ensure
       that callouts such as the example above are obeyed.
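
       The reasoning behind these two checks can be sketched as a simple pre-
       filter. This is a simplified model, not the code that PCRE2 uses: the
       function name start_checks_reject() and its parameters are hypotheti-
       cal, and the minimum length (4) and required code unit ("d") for the
       pattern ab(?C4)cd are supplied by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of two "fast no match" checks; NOT PCRE2's own code.
   Returns 1 when matching would never start (so no callout can run),
   and 0 when the matcher would actually be entered. */
int start_checks_reject(const char *subject, size_t length,
                        size_t min_length, char required_unit)
{
    assert(subject != NULL);

    if (length < min_length)
        return 1;                        /* subject is too short to match */
    if (memchr(subject, required_unit, length) == NULL)
        return 1;                        /* a required code unit is absent */
    return 0;                            /* matching starts; callouts may run */
}
```

       With these assumptions, "abyz" is rejected before matching starts,
       whereas "abyd" passes both checks, so the matcher runs and the callout
       is reached even though the overall result is still no match.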


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout  point,  if  an  external
       function  is  set  in  the match context, it is called. This applies to
       both normal and DFA matching. The first argument to the callout
       function is a pointer to a pcre2_callout_block structure. The
       second argument is the
       void * callout data that was supplied when the callout was  set  up  by
       calling pcre2_set_callout() (see the pcre2api documentation). The call-
       out block structure contains the following fields:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;

       The version field contains the version number of the block format.  The
       current version is 0. The version number will change in future if addi-
       tional fields are added, but the intention is never to  remove  any  of
       the existing fields.

       The  callout_number  field  contains the number of the callout, as com-
       piled into the pattern (that is, the number after ?C for  manual  call-
       outs, and 255 for automatically generated callouts).

       The offset_vector field is a pointer to the vector of capturing offsets
       (the "ovector") that was passed to the matching function in  the  match
       data  block.  When pcre2_match() is used, the contents can be inspected
       in order to extract substrings that have been matched so  far,  in  the
       same  way as for extracting substrings after a match has completed. For
       the DFA matching function, this field is not useful.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The  start_match  field normally contains the offset within the subject
       at which the current match attempt  started.  However,  if  the  escape
       sequence  \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not  anchored,  the  callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within  the  subject  of
       the current match pointer.

       When pcre2_match() is used, the capture_top field contains one more
       than the number of the highest numbered captured substring so far. If
       no substrings have been captured, the value of capture_top is one. This
       is always the case when the DFA functions are used, because they do not
       support captured substrings.
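
       A callout function might use offset_vector and capture_top along the
       following lines. This is a sketch under stated assumptions: it uses
       size_t and char in place of PCRE2_SIZE and PCRE2_SPTR (that is, it
       assumes the 8-bit library), SKETCH_UNSET stands in for the library's
       "unset" offset value, and copy_capture() is a hypothetical helper, not
       a PCRE2 function. Group 0 is skipped because the overall match is not
       complete while a callout is running.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the "unset" offset value; an assumption of this sketch. */
#define SKETCH_UNSET (~(size_t)0)

/* Copy capture group n (as matched so far) into buf, NUL-terminated and
   truncated to fit. Returns the copied length, or -1 if group n has not
   been captured. The ovector holds (start, end) offset pairs. */
long copy_capture(const char *subject, const size_t *ovector,
                  uint32_t capture_top, uint32_t n,
                  char *buf, size_t bufsize)
{
    assert(subject != NULL && ovector != NULL && buf != NULL && bufsize > 0);

    if (n == 0 || n >= capture_top)
        return -1;                       /* beyond the highest captured group */

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || end < start)
        return -1;                       /* group did not participate */

    size_t len = end - start;
    if (len > bufsize - 1)
        len = bufsize - 1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

       In a real callout the arguments would come from the subject, off-
       set_vector, and capture_top fields of the callout block.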

       The  capture_last  field  contains the number of the most recently cap-
       tured substring. However, when a recursion exits, the value reverts  to
       what  it  was  outside  the recursion, as do the values of all captured
       substrings. If no substrings have been  captured,  the  value  of  cap-
       ture_last is 0. This is always the case for the DFA matching functions.

       The  pattern_position  field contains the offset to the next item to be
       matched in the pattern string.

       The next_item_length field contains the length of the next item  to  be
       matched in the pattern string. When the callout immediately precedes an
       alternation bar, a closing parenthesis, or the end of the pattern,  the
       length  is  zero. When the callout precedes an opening parenthesis, the
       length is that of the entire subpattern.

       The pattern_position and next_item_length fields are intended  to  help
       in  distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts.
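
       One way of using these two fields is to recover the text of the next
       pattern item, for example when tracing automatic callouts. The sketch
       below assumes that the application has kept its own zero-terminated,
       8-bit copy of the pattern string; next_item_text() is a hypothetical
       helper, not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the pattern text of the next item into buf (always NUL-terminated,
   truncated to fit). Returns the number of characters copied; 0 means
   there is no next item, as before an alternation bar, a closing
   parenthesis, or the end of the pattern. */
size_t next_item_text(const char *pattern, size_t pattern_position,
                      size_t next_item_length, char *buf, size_t bufsize)
{
    assert(pattern != NULL && buf != NULL && bufsize > 0);

    size_t pattern_length = strlen(pattern);
    size_t len = 0;

    if (pattern_position < pattern_length) {
        len = next_item_length;
        if (len > pattern_length - pattern_position)
            len = pattern_length - pattern_position;   /* stay inside pattern */
        if (len > bufsize - 1)
            len = bufsize - 1;                         /* stay inside buf */
        memcpy(buf, pattern + pattern_position, len);
    }
    buf[len] = '\0';
    return len;
}
```

       For the pattern A(\d{2}|--), a position of 1 and a length of 10 select
       the whole parenthesized group, which is what the description above
       leads one to expect for a callout that precedes an opening parenthesis.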

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have  been  passed.
       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.


RETURN VALUES

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If  the  value  is  greater  than
       zero,  matching  fails  at  the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative  values  should  normally  be   chosen   from   the   set   of
       PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
       reserved  for  use by callout functions; it will never be used by PCRE2
       itself.
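
       The three kinds of return value can be illustrated by a callout func-
       tion that enforces a budget on the number of callouts. This is a sketch
       only: callout_block_sketch is a cut-down stand-in for pcre2_call-
       out_block, SKETCH_ERROR_CALLOUT is a hypothetical stand-in for
       PCRE2_ERROR_CALLOUT, and the rule applied to callout number 2 is
       invented purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for this sketch only; real code would include <pcre2.h> and
   use pcre2_callout_block and PCRE2_ERROR_CALLOUT instead. */
#define SKETCH_ERROR_CALLOUT (-1000)     /* hypothetical negative code */

typedef struct callout_block_sketch {
    uint32_t callout_number;
    size_t   current_position;
} callout_block_sketch;

typedef struct callout_budget {
    unsigned calls_left;                 /* callouts allowed before aborting */
} callout_budget;

int budget_callout(callout_block_sketch *block, void *data)
{
    assert(block != NULL && data != NULL);
    callout_budget *budget = data;

    if (budget->calls_left == 0)
        return SKETCH_ERROR_CALLOUT;     /* < 0: abandon the whole match */
    budget->calls_left--;

    if (block->callout_number == 2 && block->current_position % 2 != 0)
        return 1;                        /* > 0: fail here, try alternatives */

    return 0;                            /* 0: matching proceeds as normal */
}
```

       The second argument plays the part of the callout data pointer that is
       supplied to pcre2_set_callout(), so the budget is private to each match
       context.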


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 January 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes the differences in the ways that PCRE2 and Perl
       handle regular expressions. The differences  described  here  are  with
       respect to Perl versions 5.10 and above.

       1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
       but  they  do not mean what you might think. For example, (?!a){3} does
       not assert that the next three characters are not "a". It just  asserts
       that  the  next  character  is not "a" three times (in principle: PCRE2
       optimizes this to run the assertion  just  once).  Perl  allows  repeat
       quantifiers  on  other  assertions such as \b, but these do not seem to
       have any use.

       3. Capturing subpatterns that occur inside  negative  lookahead  asser-
       tions  are  counted,  but their entries in the offsets vector are never
       set. Perl sometimes (but not always) sets its numerical variables  from
       inside negative assertions.

       4.  The  following Perl escape sequences are not supported: \l, \u, \L,
       \U, and \N when followed by a character name or Unicode value.  (\N  on
       its own, matching a non-newline character, is supported.) In fact these
       are implemented by Perl's general string-handling and are not  part  of
       its  pattern matching engine. If any of these are encountered by PCRE2,
       an error is generated by default. However, if the PCRE2_ALT_BSUX option
       is set, \U and \u are interpreted as ECMAScript interprets them.

       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support. The properties that can be  tested  with
       \p and \P are limited to the general category properties such as Lu and
       Nd, script names such as Greek or Han, and the derived  properties  Any
       and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
       not; the Perl documentation says "Because Perl hides the need  for  the
       user  to  understand the internal representation of Unicode characters,
       there is no need to implement the  somewhat  messy  concept  of  surro-
       gates."

       6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
       acters in between are treated as literals. This is  slightly  different
       from  Perl  in  that  $  and  @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (but of course PCRE2
       does not have variables).  Note the following examples:

           Pattern            PCRE2 matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.

       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
       (??{code})  constructions. However, there is support for recursive pat-
       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
       the  PCRE2  "callout"  feature allows an external function to be called
       during  pattern  matching.  See  the  pcre2callout  documentation   for
       details.

       8.  Subpatterns  that  are called as subroutines (whether or not recur-
       sively) are always treated as atomic groups  in  PCRE2.  This  is  like
       Python, but unlike Perl.  Captured values that are set outside a sub-
       routine call can be referenced from inside in PCRE2, but not in Perl.
       There is a discussion that explains these differences in more detail in
       the section on recursion differences  from  Perl  in  the  pcre2pattern
       page.

       9.  If  any  of the backtracking control verbs are used in a subpattern
       that is called as a subroutine  (whether  or  not  recursively),  their
       effect  is  confined to that subpattern; it does not extend to the sur-
       rounding pattern. This is not always the case in Perl.  In  particular,
       if  (*THEN)  is  present in a group that is called as a subroutine, its
       action is limited to that group, even if the group does not contain any
       |  characters.  Note that such subpatterns are processed as anchored at
       the point where they are tested.

       10. If a pattern contains more than one backtracking control verb,  the
       first  one  that  is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but  a  failure
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
       it is the same as PCRE2, but there are examples where it differs.

       11. Most backtracking verbs in assertions have  their  normal  actions.
       They are not confined to the assertion.

       12.  There are some differences that are concerned with the settings of
       captured strings when part of  a  pattern  is  repeated.  For  example,
       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
       unset, but in PCRE2 it is set to "b".

       13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
       pattern names is not as general as Perl's. This is a consequence of the
       fact that PCRE2 works internally just with numbers, using an external
       table to translate between numbers and names. In particular, a pattern
       such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
       the  same  number  but different names, is not supported, and causes an
       error at compile time. If it were allowed, it would not be possible  to
       distinguish  which  parentheses matched, because both names map to cap-
       turing subpattern number 1. To avoid this confusing situation, an error
       is given at compile time.

       14.  Perl  recognizes  comments in some places that PCRE2 does not, for
       example, between the ( and ? at the start of a subpattern.  If  the  /x
       modifier  is  set, Perl allows white space between ( and ? (though cur-
       rent Perls warn that this is deprecated) but PCRE2 never does, even  if
       the PCRE2_EXTENDED option is set.

       15.  Perl,  when  in warning mode, gives warnings for character classes
       such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter-
       als. PCRE2 has no warning features, so it gives an error in these cases
       because they are almost certainly user mistakes.

       16. In PCRE2, the upper/lower case character properties Lu and  Ll  are
       not  affected when case-independent matching is specified. For example,
       \p{Lu} always matches an upper case letter. I think Perl has changed in
       this  respect; in the release at the time of writing (5.16), \p{Lu} and
       \p{Ll} match all letters, regardless of case, when case independence is
       specified.

       17.  PCRE2  provides  some  extensions  to  the Perl regular expression
       facilities.  Perl 5.10 includes new features that are  not  in  earlier
       versions  of  Perl, some of which (such as named parentheses) have been
       in PCRE2 for some time. This list is with respect to Perl 5.10:

       (a) Although lookbehind assertions in PCRE2  must  match  fixed  length
       strings,  each alternative branch of a lookbehind assertion can match a
       different length of string. Perl requires them all  to  have  the  same
       length.

       (b)  If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
       $ meta-character matches only at the very end of the string.

       (c) A backslash followed  by  a  letter  with  no  special  meaning  is
       faulted. (Perl can be made to issue a warning.)

       (d)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
       fiers is inverted, that is, by default they are not greedy, but if fol-
       lowed by a question mark they are.

       (e)  PCRE2_ANCHORED  can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (f)      The      PCRE2_NOTBOL,      PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART,  and PCRE2_NO_AUTO_CAPTURE options have no Perl
       equivalents.

       (g) The \R escape sequence can be restricted to match only CR,  LF,  or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (h) The callout facility is PCRE2-specific.

       (i) The partial matching facility is PCRE2-specific.

       (j) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (k) PCRE2 recognizes some special sequences such as (*CR) at the  start
       of a pattern that set overall options that cannot be changed within the
       pattern.
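
       Item 6 above notes that PCRE2 treats every character between \Q and \E
       as a literal, including $ and @. An application can rely on this to
       match an arbitrary string literally. The only sequence that needs care
       is a \E inside the string itself; the sketch below closes the quote,
       emits an escaped backslash and a literal E, and reopens the quote.
       quote_literal() is a hypothetical helper, not part of the PCRE2 API,
       and it assumes a zero-terminated 8-bit string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append n characters of s to out, keeping room for the terminating NUL.
   Returns 1 on success, 0 if out is too small. */
static int append(char *out, size_t outsize, size_t *used,
                  const char *s, size_t n)
{
    if (*used + n >= outsize) return 0;
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 1;
}

/* Wrap literal in \Q...\E so that every character is matched literally.
   Returns the length written (excluding the NUL), or 0 if out is too
   small, in which case the contents of out should not be used. */
size_t quote_literal(const char *literal, char *out, size_t outsize)
{
    assert(literal != NULL && out != NULL);
    size_t used = 0;

    if (!append(out, outsize, &used, "\\Q", 2)) return 0;
    for (const char *p = literal; *p != '\0'; ) {
        if (p[0] == '\\' && p[1] == 'E') {
            /* Close the quote, match "\E" literally, reopen the quote. */
            if (!append(out, outsize, &used, "\\E\\\\E\\Q", 7)) return 0;
            p += 2;
        } else {
            if (!append(out, outsize, &used, p, 1)) return 0;
            p += 1;
        }
    }
    if (!append(out, outsize, &used, "\\E", 2)) return 0;
    return used;
}
```

       For example, the string abc$xyz becomes \Qabc$xyz\E, which corresponds
       to the first row of the table in item 6.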


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 28 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time  compiling  is a heavyweight optimization that can greatly
       speed up pattern matching. However, it comes at the cost of extra  pro-
       cessing  before  the  match is performed, so it is of most benefit when
       the same pattern is going to be matched many times. This does not  nec-
       essarily  mean many calls of a matching function; if the pattern is not
       anchored, matching attempts may take place many times at various  posi-
       tions in the subject, even for a single call. Therefore, if the subject
       string is very long, it may still pay  to  use  JIT  even  for  one-off
       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
       32-bit PCRE2 libraries.

       JIT support applies only to the  traditional  Perl-compatible  matching
       function.   It  does  not apply when the DFA matching function is being
       used. The code for this support was written by Zoltan Herczeg.


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of  PCRE2.  The  "configure"  option
       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
       built if you want to use JIT. The support is limited to  the  following
       hardware platforms:

         ARM 32-bit (v5, v7, and Thumb2)
         ARM 64-bit
         Intel x86 32-bit and 64-bit
         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
         SPARC 32-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

       A  program  can  tell if JIT support is available by calling pcre2_con-
       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
       available,  and 0 otherwise. However, a simple program does not need to
       check this in order to use JIT. The API is implemented in  a  way  that
       falls  back  to the interpretive code if JIT is not available. For pro-
       grams that need the best possible performance, there is  also  a  "fast
       path" API that is JIT-specific.


SIMPLE USE OF JIT

       To  make use of the JIT support in the simplest way, all you have to do
       is to call pcre2_jit_compile() after successfully compiling  a  pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and  the
       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
       exactly the same results. The returned value  from  pcre2_jit_compile()
       is zero on success, or a negative error code.

       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
       plete matches. If you want to run partial matches using the  PCRE2_PAR-
       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
       set one or both of the other options as well as, or instead of,
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial).  When
       pcre2_match()  is  called,  the appropriate code is run if it is avail-
       able. Otherwise, the pattern is matched using interpretive code.

       You can call pcre2_jit_compile() multiple times for the  same  compiled
       pattern.  It does nothing if it has previously compiled code for any of
       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
       PLETE  and  (perhaps  later,  when  you find you need partial matching)
       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
       diately returns zero. This is an alternative way of testing whether JIT
       is available.

       At present, it is not possible to free JIT compiled  code  except  when
       the entire compiled pattern is freed by calling pcre2_code_free().

       In  some circumstances you may need to call additional functions. These
       are described in the  section  entitled  "Controlling  the  JIT  stack"
       below.

       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle.  Details  are
       given  below.  In  both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT  was  actually  used
       for  a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the  JIT
       stack"  below,  even  if  you  do  not need to supply a non-default JIT
       stack. Such a callback function is called whenever JIT code is about to
       be  obeyed.  If the match-time options are not right for JIT execution,
       the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
       ated.  You  can find out if JIT matching is available after compiling a
       pattern by calling  pcre2_pattern_info()  with  the  PCRE2_INFO_JITSIZE
       option.  A non-zero result means that JIT compilation was successful. A
       result of 0 means that JIT support is not available, or the pattern was
       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
       to handle the pattern.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The pcre2_match() options that  are  supported  for  JIT  matching  are
       PCRE2_NOTBOL,   PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PARTIAL_SOFT.  The
       PCRE2_ANCHORED option is not supported at match time.

       The  only  unsupported  pattern items are \C (match a single data unit)
       when running in a UTF mode, and a callout immediately before an  asser-
       tion condition in a conditional group.
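
       The decision between JIT code and the interpreter, as far as match-time
       options are concerned, can be modelled as follows. This is a sketch and
       not PCRE2 code: the SK_ constants are stand-ins with arbitrary bit val-
       ues (real programs use the PCRE2_xxx constants from pcre2.h),
       jit_code_would_run() is a hypothetical name, and giving hard partial
       matching precedence when both partial options are set is an assumption.
       Unsupported pattern items are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in match options; the bit values are arbitrary for this sketch. */
#define SK_NOTBOL            0x01u
#define SK_NOTEOL            0x02u
#define SK_NOTEMPTY          0x04u
#define SK_NOTEMPTY_ATSTART  0x08u
#define SK_NO_UTF_CHECK      0x10u
#define SK_PARTIAL_SOFT      0x20u
#define SK_PARTIAL_HARD      0x40u
#define SK_ANCHORED          0x80u       /* not supported by JIT at match time */

/* Stand-in JIT compile modes. */
#define SK_JIT_COMPLETE      0x1u
#define SK_JIT_PARTIAL_SOFT  0x2u
#define SK_JIT_PARTIAL_HARD  0x4u
#define SK_JIT_ALL_MODES \
    (SK_JIT_COMPLETE | SK_JIT_PARTIAL_SOFT | SK_JIT_PARTIAL_HARD)

/* The match options listed as supported for JIT matching. */
#define SK_JIT_MATCH_OPTIONS \
    (SK_NOTBOL | SK_NOTEOL | SK_NOTEMPTY | SK_NOTEMPTY_ATSTART | \
     SK_NO_UTF_CHECK | SK_PARTIAL_SOFT | SK_PARTIAL_HARD)

/* Returns 1 if JIT code would be run, 0 if the interpreter is used. */
int jit_code_would_run(uint32_t match_options, uint32_t jit_modes_compiled)
{
    assert((jit_modes_compiled & ~SK_JIT_ALL_MODES) == 0);

    if ((match_options & ~SK_JIT_MATCH_OPTIONS) != 0)
        return 0;                        /* unsupported option: fall back */

    uint32_t needed = SK_JIT_COMPLETE;
    if ((match_options & SK_PARTIAL_HARD) != 0)
        needed = SK_JIT_PARTIAL_HARD;    /* assumption: hard takes precedence */
    else if ((match_options & SK_PARTIAL_SOFT) != 0)
        needed = SK_JIT_PARTIAL_SOFT;

    return (jit_modes_compiled & needed) != 0;
}
```

       In this model, a pattern that was JIT-compiled only for complete
       matches falls back to the interpreter both when a partial match is
       requested and when the anchored option is given at match time.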


RETURN VALUES FROM JIT MATCHING

       When a pattern is matched using JIT matching, the return values are the
       same as those given by the interpretive pcre2_match()  code,  with  the
       addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
       that the memory used for the JIT stack was insufficient. See  "Control-
       ling the JIT stack" below for a discussion of JIT stack usage.

       The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it  is  in
       the  same circumstance when JIT is not used, but the details of exactly
       what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT  error
       code is never returned when JIT matching is used.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use as a
       stack.  By default, it uses 32K on the  machine  stack.  However,  some
       large   or   complicated  patterns  need  more  than  this.  The  error
       PCRE2_ERROR_JIT_STACKLIMIT is given when there  is  not  enough  stack.
       Three  functions  are provided for managing blocks of memory for use as
       JIT stacks. There is further discussion about the use of JIT stacks  in
       the section entitled "JIT stack FAQ" below.

       The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
       ments are a starting size, a maximum size, and a general  context  (for
       memory  allocation  functions, or NULL for standard memory allocation).
       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
       NULL  if there is an error. The pcre2_jit_stack_free() function is used
       to free a stack that is no longer needed. (For the technically  minded:
       the address space is allocated by mmap or VirtualAlloc.)

       JIT  uses far less memory for recursion than the interpretive code, and
       a maximum stack size of 512K to 1M should be more than enough  for  any
       pattern.

       The  pcre2_jit_stack_assign()  function  specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre2_match_context  *mcontext
         pcre2_jit_callback    callback
         void                 *data

       The first argument is a pointer to a match context. When this is subse-
       quently passed to a matching function, its information determines which
       JIT stack is used. There are three cases for the values of the other
       two arguments:

         (1) If callback is NULL and data is NULL, an internal 32K block
             on the machine stack is used. This is the default when a match
             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack. If the return from the callback
             function is NULL, the internal 32K stack is used; otherwise the
             return value must be a valid JIT stack, the result of calling
             pcre2_jit_stack_create().
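
       The three cases amount to the following selection rule. The sketch is a
       model, not PCRE2 code: jit_stack_sketch stands in for the opaque
       pcre2_jit_stack type, jit_callback_sketch for pcre2_jit_callback, and
       both function names are hypothetical. A NULL result stands for the
       internal 32K block on the machine stack.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the opaque JIT stack type and the callback type. */
typedef struct jit_stack_sketch { size_t max_size; } jit_stack_sketch;
typedef jit_stack_sketch *(*jit_callback_sketch)(void *);

/* Returns the stack that JIT code would use for the given callback and
   data values; NULL means the internal block on the machine stack. */
jit_stack_sketch *select_jit_stack(jit_callback_sketch callback, void *data)
{
    if (callback == NULL)
        return (jit_stack_sketch *)data; /* cases (1) and (2) */
    return callback(data);               /* case (3) */
}

/* Example callback: data points at a slot that holds the stack to use,
   for instance a per-thread variable. */
jit_stack_sketch *stack_from_slot(void *data)
{
    assert(data != NULL);
    return *(jit_stack_sketch **)data;
}
```

       If the slot holds NULL, the callback returns NULL and the internal
       stack is used, as described in case (3).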

       A  callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are incom-
       patible  for JIT matching. A callback function can therefore be used to
       determine whether a match operation was  executed  by  JIT  or  by  the
       interpreter.

       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are  all
       matched  sequentially in the same thread. In a multithread application,
       if you do not specify a JIT stack, or if you assign or pass  back  NULL
       from  a  callback, that is thread-safe, because each thread has its own
       machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
       stack,  this  must  be  a  different  stack for each thread so that the
       application is thread-safe.

       Strictly speaking, even more is allowed. You can assign the  same  non-
       NULL  stack  to a match context that is used by any number of patterns,
       as long as they are not used for matching by multiple  threads  at  the
       same  time.  For  example, you could use the same stack in all compiled
       patterns, with a global mutex in the callback to wait until  the  stack
       is available for use. However, this is an inefficient solution, and not
       recommended.

       This is a suggestion for how a multithreaded program that needs to  set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All  the  functions  described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain has to be updated
       every time the stack is extended. Although this is possible, the
       overhead of the updates reduces performance. So we do the recursion in
       memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern  operating  systems  have  a  nice  feature: they can reserve an
       address space instead of allocating memory. We can safely allocate mem-
       ory  pages  inside  this address space, so the stack could grow without
       moving memory data (this is important because of pointers). Thus we can
       allocate  1M  address space, and use only a single memory page (usually
       4K) if that is enough. However, we can still grow up to 1M  anytime  if
       needed.

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that while a stack is
       being used by pcre2_match() (that is, while it is assigned to a match
       context that is passed to the pattern currently running), it is not
       used by any other thread (to avoid overwriting the same memory area).
       The best practice for multithreaded programs is to allocate a stack for
       each thread, and return this stack through the JIT callback function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a  pointer  is  set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that is likely to cause a segmentation
       fault. (Also, do not free a stack currently used by pcre2_match() in
       another thread.) You can also replace the stack in a context at any
       time when it is not in use. You should free the previous stack before
       assigning a replacement.

       (5)  Should  I  allocate/free  a  stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.

       (6)  OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1M? Is that 1M  kept
       until the stack is freed?

       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the amount of memory currently
       allocated for a stack, and another that releases memory (shrinking the
       stack), would probably be a good idea if someone needs this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows. If POSIX threads were used everywhere, we  could
       throw out this complicated API.
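
       The address-space reservation described in answer (2) can be
       illustrated with the POSIX mmap() and mprotect() calls. This is only a
       sketch of the general technique on a POSIX system; it is not the
       allocator that PCRE2 or its JIT compiler actually uses, and the region
       type and the region_reserve(), region_grow(), and region_release()
       functions are hypothetical names.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* A growable region: `reserved` bytes of address space, of which the
   first `committed` bytes are usable memory. */
typedef struct {
    unsigned char *base;
    size_t reserved;
    size_t committed;
} region;

/* Reserve address space only; no page is readable or writable yet. */
static int region_reserve(region *r, size_t max_size)
{
    void *p = mmap(NULL, max_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    r->base = p;
    r->reserved = max_size;
    r->committed = 0;
    return 0;
}

/* Make the first new_size bytes usable. The base address never changes,
   so pointers into the region stay valid as it grows. */
static int region_grow(region *r, size_t new_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (new_size + page - 1) / page * page;
    if (rounded > r->reserved) return -1;     /* beyond the reservation */
    if (rounded <= r->committed) return 0;    /* already large enough */
    if (mprotect(r->base, rounded, PROT_READ | PROT_WRITE) != 0) return -1;
    r->committed = rounded;
    return 0;
}

/* Give the whole reservation back to the system. */
static void region_release(region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->reserved = r->committed = 0;
}
```

       Reserving 1M and then growing in page-sized steps mirrors the
       behaviour described in the answer: only the pages actually needed are
       made usable, and nothing is moved when the region grows.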


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all the memory that it
       could. It expects new allocations, and keeps some free memory around to
       improve allocation speed. However, in low memory conditions, it might
       be better to free all possible memory. You can cause this to happen by
       calling pcre2_jit_free_unused_memory(). Its argument is a general
       context, for custom memory management, or NULL for standard memory
       management.


EXAMPLE CODE

       This  is  a  single-threaded example that specifies a JIT stack without
       using a callback. A real program should include  error  checking  after
       all the function calls.

         /* pattern, subject, and length are assumed to be set up elsewhere */
         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs  that  are  written
       for  general  use  in  many  environments.  However,  calling  JIT  via
       pcre2_match() does have a performance impact. Programs that are written
       for  use  where  JIT  is known to be available, and which need the best
       possible performance, can instead use a "fast path"  API  to  call  JIT
       matching  directly instead of calling pcre2_match() (obviously only for
       patterns that have been successfully processed by pcre2_jit_compile()).

       The fast path  function  is  called  pcre2_jit_match(),  and  it  takes
       exactly the same arguments as pcre2_match(). The return values are also
       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
       complete)  is  requested that was not compiled. Unsupported option bits
       (for example, PCRE2_ANCHORED) are ignored.

       When you call pcre2_match(), as well as testing for invalid options,  a
       number of other sanity checks are performed on the arguments. For exam-
       ple, if the subject pointer is NULL, an immediate error is given. Also,
       unless  PCRE2_NO_UTF_CHECK  is  set, a UTF subject string is tested for
       validity. In the interests of speed, these checks do not happen on  the
       JIT fast path, and if invalid data is passed, the result is undefined.

       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
       speedups of more than 10%.


SEE ALSO

       pcre2api(3)


AUTHOR

       Philip Hazel (FAQ by Zoltan Herczeg)
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SIZE AND OTHER LIMITATIONS

       There are some size limitations in PCRE2 but it is hoped that they will
       never in practice be relevant.

       The maximum size of a compiled pattern is approximately 64K code  units
       for  the  8-bit  and  16-bit  libraries  if  PCRE2 is compiled with the
       default internal linkage size, which is 2 bytes for these libraries. If
       you  want  to  process regular expressions that are truly enormous, you
       can compile PCRE2 with an internal linkage size of 3 or 4 (when  build-
       ing  the  16-bit library, 3 is rounded up to 4). See the README file in
       the source distribution and the pcre2build documentation  for  details.
       In  these  cases the limit is substantially larger.  However, the speed
       of execution is slower. In the 32-bit  library,  the  internal  linkage
       size is always 4.

       The maximum length (in code units) of a subject string is one less than
       the largest number a PCRE2_SIZE variable can  hold.  PCRE2_SIZE  is  an
       unsigned  integer  type,  usually  defined as size_t. Its maximum value
       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
       terminated strings and unset offsets.

       Note  that  when  using  the  traditional matching function, PCRE2 uses
       recursion to handle subpatterns and indefinite repetition.  This  means
       that  the  available stack space may limit the size of a subject string
       that can be processed by certain patterns. For a  discussion  of  stack
       issues, see the pcre2stack documentation.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns. There is,  however,  a
       limit  to  the  depth  of  nesting  of parenthesized subpatterns of all
       kinds. This is imposed in order to limit the  amount  of  system  stack
       used  at  compile time. The limit can be specified when PCRE2 is built;
       the default is 250.

       There is a limit to the number of forward references to subsequent sub-
       patterns  of  around  200,000.  Repeated  forward references with fixed
       upper limits, for example, (?2){0,100} when subpattern number 2  is  to
       the  right,  are included in the count. There is no limit to the number
       of backward references.

       The maximum length of a name for a named subpattern is 32 code units,
       and the maximum number of named subpatterns is 10000.

       The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
       (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
       32-bit libraries.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 25 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular  expression  against  a  given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match() function;
       it operates in a different way, and is not Perl-compatible. This alter-
       native  has  advantages  and  disadvantages  compared with the standard
       algorithm, and these are described below.

       When there is only one possible way in which a given subject string can
       match  a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be rep-
       resented  as  a  tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree.  Matching  the
       pattern  to a given subject string (from a given starting point) can be
       thought of as a search of the tree.  There are two  ways  to  search  a
       tree:  depth-first  and  breadth-first, and these correspond to the two
       matching algorithms provided by PCRE2.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
       depth-first search of the pattern tree. That is, it  proceeds  along  a
       single path through the tree, checking that the subject matches what is
       required. When there is a mismatch, the algorithm  tries  any  alterna-
       tives  at  the  current point, and if they all fail, it backs up to the
       previous branch point in the  tree,  and  tries  the  next  alternative
       branch  at  that  level.  This often involves backing up (moving to the
       left) in the subject string as well.  The  order  in  which  repetition
       branches  are  tried  is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has  been  found,  and  at
       that  point the algorithm stops. Thus, if there is more than one possi-
       ble match, this algorithm returns the first one that it finds.  Whether
       this  is the shortest, the longest, or some intermediate length depends
       on the way the greedy and ungreedy repetition quantifiers are specified
       in the pattern.

       Because  it  ends  up  with a single path through the tree, it is rela-
       tively straightforward for this algorithm to keep  track  of  the  sub-
       strings  that  are  matched  by portions of the pattern in parentheses.
       This provides support for capturing parentheses and back references.


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of  the  tree.  Starting
       from  the  first  matching  point  in the subject, it scans the subject
       string from left to right, once, character by character, and as it does
       this,  it remembers all the paths through the tree that represent valid
       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
       though  it is not implemented as a traditional finite state machine (it
       keeps multiple states active simultaneously).

       Although the general principle of this matching algorithm  is  that  it
       scans  the subject string only once, without backtracking, there is one
       exception: when a lookaround assertion is encountered,  the  characters
       following  or  preceding  the  current  point  have to be independently
       inspected.

       The scan continues until either the end of the subject is  reached,  or
       there  are  no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none,  the
       match  has  failed).   Thus,  if there is more than one possible match,
       this algorithm finds all of them, and in particular, it finds the long-
       est.  The  matches are returned in decreasing order of length. There is
       an option to stop the algorithm after the first match (which is  neces-
       sarily the shortest) is found.

       Note that all the matches that are found start at the same point in the
       subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment",  the  result
       is  the  three  strings "caterpillar", "cater", and "cat" that start at
       the fifth character of the subject. The algorithm  does  not  automati-
       cally move on to find matches that start at later positions.

       PCRE2's "auto-possessification" optimization usually applies to charac-
       ter repeats at the end of a pattern (as well as internally). For  exam-
       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
       is no point even considering the possibility of backtracking  into  the
       repeated  digits.  For  DFA matching, this means that only one possible
       match is found. If you really do want multiple matches in  such  cases,
       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
       SESS option when compiling.

       There are a number of features of PCRE2 regular  expressions  that  are
       not  supported  by the alternative matching algorithm. They are as fol-
       lows:

       1. Because the algorithm finds all  possible  matches,  the  greedy  or
       ungreedy  nature  of  repetition quantifiers is not relevant (though it
       may affect auto-possessification, as just described). During  matching,
       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
       However, possessive quantifiers can make a difference when what follows
       could  also  match  what  is  quantified, for example in a pattern like
       this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched  by
       a  non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current  point,
       and  the  longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is  not  straightforward  to  keep track of captured substrings for the
       different matching possibilities, and PCRE2's  implementation  of  this
       algorithm does not attempt to do this. This means that no captured sub-
       strings are available.

       3. Because no substrings are captured, back references within the  pat-
       tern are not supported, and cause errors if encountered.

       4.  For  the same reason, conditional expressions that use a backrefer-
       ence as the condition or test for a specific group  recursion  are  not
       supported.

       5.  Because  many  paths  through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but may
       be  on  some  paths  and not on others), is not supported. It causes an
       error if encountered.

       6. Callouts are supported, but the value of the  capture_top  field  is
       always 1, and the value of the capture_last field is always 0.

       7.  The  \C  escape  sequence, which (in the standard algorithm) always
       matches a single code unit, even in a UTF mode,  is  not  supported  in
       these  modes,  because the alternative algorithm moves through the sub-
       ject string one character (not code unit) at a  time,  for  all  active
       paths through the tree.

       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
       negative assertion.


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using  the alternative matching algorithm provides the following advan-
       tages:

       1. All possible matches (at a single point in the subject) are automat-
       ically  found,  and  in particular, the longest match is found. To find
       more than one match using the standard algorithm, you have to do kludgy
       things with callouts.

       2.  Because  the  alternative  algorithm  scans the subject string just
       once, and never needs to backtrack (except for lookbehinds), it is pos-
       sible  to  pass  very  long subject strings to the matching function in
       several pieces, checking for partial matching each time. Although it is
       also  possible  to  do  multi-segment matching using the standard algo-
       rithm, by retaining partially matched substrings, it  is  more  compli-
       cated. The pcre2partial documentation gives details of partial matching
       and discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than  the  standard  algorithm.  This  is
       partly  because  it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 29 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PARTIAL MATCHING IN PCRE2

       In  normal  use  of  PCRE2,  if  the subject string that is passed to a
       matching function matches as far as it goes, but is too short to  match
       the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
       stances where it might be helpful to distinguish this case  from  other
       cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi-
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also be useful when the subject string is very long and is not all
       available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
       PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
       function.  The difference between the two options is whether or  not  a
       partial match is preferred to an alternative complete match, though the
       details differ between the two types  of  matching  function.  If  both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If  you  want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
       tial  matches  on the same pattern. If the appropriate JIT mode has not
       been compiled, interpretive matching code is used.

       Setting a partial matching option  disables  two  of  PCRE2's  standard
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
       and abandons matching immediately if it is not present in  the  subject
       string.  This  optimization  cannot  be  used for a subject string that
       might match only partially. PCRE2 also knows the minimum  length  of  a
       matching  string,  and  does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial  match-
       ing.


PARTIAL MATCHING USING pcre2_match()

       A  partial  match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but  matching  cannot  con-
       tinue because more characters are needed. However, at least one charac-
       ter in the subject must have been inspected. This  character  need  not
       form part of the final matched string; lookbehind assertions and the \K
       escape sequence provide ways of inspecting characters before the  start
       of  a matched string. The requirement for inspecting at least one char-
       acter exists because an empty string can  always  be  matched;  without
       such  a  restriction  there would always be a partial match of an empty
       string at the end of the subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the pat-
       tern has no effect for a partial match. Consider this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K  resets
       the  "start  of  match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for  the
       string  "abc12",  because  all these characters are needed for a subse-
       quent re-match with additional characters.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as  nor-
       mal,  and  other  alternatives in the pattern are tried. If no complete
       match  can  be  found,  PCRE2_ERROR_PARTIAL  is  returned  instead   of
       PCRE2_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a par-
       tial match.  All the various matching items in a pattern behave  as  if
       the  subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B  the  end
       of the subject is treated as a non-alphanumeric.

       If  there  is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both  alter-
       natives  fail  to  match,  but the end of the subject is reached during
       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
       and  9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog"  on  its
       own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
       returned as soon as a partial match is  found,  without  continuing  to
       search  for possible complete matches. This option is "hard" because it
       prefers an earlier partial match over a later complete match. For  this
       reason,  the  assumption  is  made that the end of the supplied subject
       string may not be the true end of the available data, and  so,  if  \z,
       \Z,  \b, \B, or $ are encountered at the end of the subject, the result
       is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
       subject has been inspected.

   Comparing hard and soft partial matching

       The  difference  between the two partial matching options can be illus-
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if  possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
       TIAL. On the other hand, if the pattern is made ungreedy the result  is
       different:

         /dog(sbody)??/

       In  this  case  the  result  is always a complete match because that is
       found first, and matching never  continues  after  finding  a  complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will  always
       find the shorter match first.


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for  all  possible  matches  simultane-
       ously.  If the end of the subject is reached before the end of the pat-
       tern, there is the possibility of a partial match, again provided  that
       at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,  the  complete  matches
       are  returned.   However, if PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the  string
       that was matched when the longest partial match was found is set as the
       first matching string.

       Because the DFA functions always search for all possible  matches,  and
       there  is  no  difference between greedy and ungreedy repetition, their
       behaviour is different from  the  standard  functions  when  PCRE2_PAR-
       TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it  finds  the  complete
       match  for  "dog",  the  DFA  function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
       counterintuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a  partial  match  is  found.
       However,  normal  matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so  a  complete  match  is
       found.   The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.  Using
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
       then the partial match takes precedence.


EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If  the  partial_soft  (or  ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here  is  a
       run of pcre2test that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25dec3\=ps
         Partial match: 25dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The  first  data  string  is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not  match  the  com-
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present  on  a  pcre2test  data
       line, the PCRE2_PARTIAL_HARD option is set for the match.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When  a  partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject  data
       and  calling  the function again with the same compiled regular expres-
       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. Here is an example using pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests  partial  match-
       ing;  the  second  call  has  "n05"  as  the  subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it  needs
       to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at  a  new  starting  point.  All  this
       facility  is  capable  of  doing  is continuing with the previous match
       attempt. In the previous example, if the second set of data  is  "ug23"
       the  result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the  application,
       this may or may not be what you want.  The only way to allow for start-
       ing again at the next character is to retain the matched  part  of  the
       subject and try a new complete match.

       You  can  set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple  segments.
       This  facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike pcre2_dfa_match(), pcre2_match() cannot restart the previous
       match with a new segment of data. Instead, new data must be added to
       the previous subject string, and the entire match re-run, starting from
       the point where the partial match occurred. Earlier data can be
       discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not  treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next  segment,  and  call  the  matching  function
       again.  Unlike  the  DFA  matching function, the entire matching string
       must always be available, and the complete matching process occurs  for
       each call, so more memory and more processing time are needed.
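
       The buffer handling described above can be sketched in C. This is only
       an illustration and not part of the PCRE2 API: carry_over() is a
       hypothetical helper, and it assumes the caller has already taken the
       start of the partial match (ovector[0]) from a pcre2_match() call that
       returned PCRE2_ERROR_PARTIAL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Drops the bytes of buf that
   precede offset keep_from (for example ovector[0] after a partial match),
   then appends the next segment. Returns the new length of buf, or
   (size_t)-1 if the arguments are out of range or the result does not fit
   in cap bytes. */
size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                  const char *segment, size_t seg_len)
{
    if (len > cap || keep_from > len) return (size_t)-1;
    size_t kept = len - keep_from;
    if (seg_len > cap - kept) return (size_t)-1;
    memmove(buf, buf + keep_from, kept);    /* source and target may overlap */
    memcpy(buf + kept, segment, seg_len);   /* add the new segment */
    return kept + seg_len;
}
```

       With the date example, calling carry_over() on "The date is 23ja" with
       keep_from set to 12 and a next segment of "n05 and" leaves "23jan05 and"
       in the buffer, ready for a fresh call to the matching function.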


ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to  pass  the  PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters  that  pre-
       cede  the start of the partial match may have been inspected during the
       matching process.  When using pcre2_match(), sufficient characters must
       be  retained  for  the  next  match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the     pattern    by    calling    pcre2_pattern_info()    with    the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting  count  is  in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given  for
       the  maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
       subtraction,  but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded,  and
       after  the  next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the  match
       begins at the same point as before.

       For  example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
       mum  lookbehind  count  is  3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next  match  should  be  3.
       When  pcre2test  displays  a partial match, it indicates the lookbehind
       characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<

       3. Because a partial match must always contain at least one  character,
       what  might  be  considered a partial match of an empty string actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this will
       only  happen  if characters from the previous segment are retained. For
       this reason, a "no match" result  should  be  interpreted  as  "partial
       match of an empty string" when the pattern contains lookbehinds.

       4.  Matching  a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one  single
       long  string,  especially  when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes  an  issue  that
       arises  if  the  pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because  (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been  found,  continuation to a new subject segment is no longer possi-
       ble. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to  a  standard  matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not  PCRE2_ERROR_PARTIAL,
       because  the  shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function  in  several  parts
       ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
       been found, and it is not possible to continue.  On the other hand,  if
       "dogsbody"  is  presented  as  a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to  use  PCRE2_PARTIAL_HARD  when
       matching  multi-segment  data.  The  example above then behaves differ-
       ently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ph,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start  with  the  same  pattern  item  may  not  work  as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial  match  of  the
       first  alternative  is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point  in  the  subject  string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at  one  point in the subject are remembered. The problem arises
       because the start of the second alternative matches  within  the  first
       alternative.  There  is  no  problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives.  This  is
       not  a  problem  if  a  standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the  same  technique  of
       re-running  the  entire  match  can  also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match  at  offset  n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a  new
       match starting at offset n+1 in the first buffer.
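
       For issue 2 above, the step of moving back over the lookbehind
       characters can be sketched as follows. retain_offset_utf8() is a
       hypothetical helper, not a PCRE2 function. It assumes a valid UTF-8
       subject, that match_start is ovector[0] from the partial match, and
       that max_lookbehind was obtained beforehand from pcre2_pattern_info()
       with PCRE2_INFO_MAXLOOKBEHIND. In a non-UTF or 32-bit situation a plain
       subtraction does the same job.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Starting at byte offset
   match_start in a valid UTF-8 subject, move back over at most
   max_lookbehind characters and return the offset of the earliest byte
   that must be retained. */
size_t retain_offset_utf8(const unsigned char *subject, size_t match_start,
                          size_t max_lookbehind)
{
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (0b10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80) pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       Everything before the returned offset can be discarded, and the
       startoffset for the next match is match_start minus the returned
       offset. For "xx123ab" with a partial match at offset 5 and a maximum
       lookbehind of 3, the function returns 2, giving a startoffset of 3.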


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

       When PCRE2 is built with Unicode support (which is the default), it has
       knowledge of Unicode character properties and can process text  strings
       in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
       However, by default, PCRE2 assumes that one code unit is one character.
       To  process  a  pattern  as a UTF string, where a character may require
       more than one  code  unit,  you  must  call  pcre2_compile()  with  the
       PCRE2_UTF  option  flag,  or  the  pattern must start with the sequence
       (*UTF). When either of these is the case, both the pattern and any sub-
       ject  strings  that  are  matched against it are treated as UTF strings
       instead of strings of individual one-code-unit characters.

       If you do not need Unicode support you can build PCRE2 without  it,  in
       which case the library will be smaller.


UNICODE PROPERTY SUPPORT

       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can  be  tested
       are  limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script  names  such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation. Only the short
       names  for  properties are supported. For example, \p{L} matches a let-
       ter. Its Perl synonym, \p{Letter}, is not supported.   Furthermore,  in
       Perl, many properties may optionally be prefixed by "Is", for
       compatibility with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

       Codepoints less than 256 can be specified in patterns by either  braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code  points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       In  UTF modes, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF modes, the dot metacharacter matches one UTF  character  instead
       of a single code unit.

       The  escape  sequence  \C can be used to match a single code unit, in a
       UTF mode, but its use can lead  to  some  strange  effects  because  it
       breaks  up  multi-unit  characters  (see  the  description of \C in the
       pcre2pattern documentation). The use of \C  is  not  supported  in  the
       alternative matching function pcre2_dfa_match(), nor is it supported in
       UTF mode by the JIT optimization. If JIT optimization is requested  for
       a  UTF pattern that contains \C, it will not succeed, and so the match-
       ing will be carried out by the normal interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters  of  any  code  value,  but, by default, the characters that
       PCRE2 recognizes as digits, spaces, or word characters remain the  same
       set  as  in  non-UTF  mode,  all  with  code points less than 256. This
       remains true even when PCRE2  is  built  to  include  Unicode  support,
       because  to do otherwise would slow down matching in many common cases.
       Note that this also applies to \b and \B, because they are  defined  in
       terms  of  \w  and  \W.  If you want to test for a wider sense of, say,
       "digit", you can use explicit Unicode property tests  such  as  \p{Nd}.
       Alternatively,  if you set the PCRE2_UCP option, the way that the char-
       acter escapes work is changed so that Unicode properties  are  used  to
       determine which characters match. There are more details in the section
       on generic character types in the pcre2pattern documentation.

       Similarly, characters that match the POSIX named character classes  are
       all low-valued characters, unless the PCRE2_UCP option is set.

       However,  the  special  horizontal  and  vertical  white space matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
       acters, whether or not PCRE2_UCP is set.

       Case-insensitive  matching in UTF mode makes use of Unicode properties.
       A few Unicode characters such as Greek sigma have more than  two  code-
       points that are case-equivalent, and these are treated as such.


VALIDITY OF UTF STRINGS

       When  the  PCRE2_UTF  option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code
       is returned. The code unit offset to the  offending  character  can  be
       extracted  from  the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not handle
       this, expecting strings to be in host byte order.
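
       Because the library leaves BOM handling to the caller, an application
       that reads UTF-16 text of unknown byte order might normalize it before
       matching. The following is a sketch under that assumption;
       u16_consume_bom() is a hypothetical helper and not a PCRE2 function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. If the buffer starts with a
   byte-order mark, convert the whole buffer to host byte order when needed
   and return the number of leading code units to skip (0 or 1). */
size_t u16_consume_bom(uint16_t *s, size_t len)
{
    if (len == 0) return 0;
    if (s[0] == 0xfeff) return 1;          /* BOM already in host order */
    if (s[0] == 0xfffe) {                  /* BOM read with bytes swapped */
        for (size_t i = 0; i < len; i++)
            s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
        return 1;
    }
    return 0;                              /* no BOM: assume host order */
}
```

       The subject passed to the matching function would then start at the
       returned offset, so that the BOM itself is not treated as text.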

       The entire string is checked before any other processing  takes  place.
       In  addition  to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the  surrogate area.  The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with  values
       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)
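
       As a worked illustration of how a pair of surrogates encodes one code
       point above 0xFFFF, here is the standard UTF-16 arithmetic in C. The
       function name u16_encode_pair() is hypothetical and is not part of
       PCRE2.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in the range 0x10000 to 0x10ffff
   into a UTF-16 surrogate pair. Returns 0 on success, or -1 if cp lies
   outside that range and so is not coded as a pair. */
int u16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10ffff) return -1;
    cp -= 0x10000;                               /* 20 bits remain */
    *high = (uint16_t)(0xd800 + (cp >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xdc00 + (cp & 0x3ff));   /* bottom 10 bits */
    return 0;
}
```

       For example, U+1F600 becomes the pair 0xd83d, 0xde00. The two values
       0xd83d and 0xde00 are not themselves valid code points, which is why
       they are rejected when they appear in UTF-8 or UTF-32 strings.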

       In some situations, you may already know that your strings  are  valid,
       and  therefore  want  to  skip these checks in order to improve perfor-
       mance, for example in the case of a long subject string that  is  being
       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
       pile time or at match time, PCRE2 assumes that the pattern  or  subject
       it is given (respectively) contains only valid UTF code unit sequences.

       Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
       to  disable the check for a subject string you must pass this option to
       pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
       result is undefined and your program may crash or loop indefinitely.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The  string  ends  with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that  is,  either  the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which  is  invalid.
       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
       rect coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary  value 0b10 (that is, the most significant bit is 1 and the sec-
       ond is 0). Such a byte can only validly occur as the second  or  subse-
       quent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The  first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.
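
       The kinds of check listed above can be illustrated with a simplified
       validator. This is a sketch, not PCRE2's own code: the function and
       the status names are hypothetical, the order of the checks is an
       assumption, and the conditions are grouped into broad classes rather
       than one code per byte position. In particular, every 5- or 6-byte
       character is reported as too long without testing whether it is also
       overlong.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for this sketch; not PCRE2 error codes. */
typedef enum {
    U8_OK = 0,
    U8_TRUNCATED,           /* string ends inside a character */
    U8_BAD_CONTINUATION,    /* a trailing byte is not 0b10xxxxxx */
    U8_TOO_LONG,            /* 5- or 6-byte form, RFC 2279 only */
    U8_OUT_OF_RANGE,        /* value greater than 0x10ffff */
    U8_SURROGATE,           /* value in 0xd800 to 0xdfff */
    U8_OVERLONG,            /* value fits in fewer bytes */
    U8_STRAY_CONTINUATION,  /* 0b10xxxxxx where a first byte is expected */
    U8_INVALID_LEAD         /* first byte is 0xfe or 0xff */
} u8_status;

/* Validate s[0..len). On failure, *bad_offset is the offset of the first
   byte of the offending character. */
u8_status u8_validate(const unsigned char *s, size_t len, size_t *bad_offset)
{
    /* Smallest value that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;       /* total bytes in this character */
        uint32_t v;     /* decoded value */
        *bad_offset = i;
        if (c < 0x80) { i++; continue; }
        if ((c & 0xc0) == 0x80) return U8_STRAY_CONTINUATION;
        if (c >= 0xfe) return U8_INVALID_LEAD;
        if ((c & 0xe0) == 0xc0)      { n = 2; v = c & 0x1f; }
        else if ((c & 0xf0) == 0xe0) { n = 3; v = c & 0x0f; }
        else if ((c & 0xf8) == 0xf0) { n = 4; v = c & 0x07; }
        else if ((c & 0xfc) == 0xf8) { n = 5; v = c & 0x03; }
        else                         { n = 6; v = c & 0x01; }
        if (len - i < n) return U8_TRUNCATED;
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return U8_BAD_CONTINUATION;
            v = (v << 6) | (uint32_t)(s[i + k] & 0x3f);
        }
        if (n > 4) return U8_TOO_LONG;
        if (v > 0x10ffff) return U8_OUT_OF_RANGE;
        if (v >= 0xd800 && v <= 0xdfff) return U8_SURROGATE;
        if (v < min_value[n]) return U8_OVERLONG;
        i += n;
    }
    return U8_OK;
}
```

       With this sketch, the two bytes 0xc0, 0xae from the overlong example
       decode to 0x2e, which is below the 2-byte minimum of 0x80, so the
       result is U8_OVERLONG.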

   Errors in UTF-16 strings

       The following  negative  error  codes  are  given  for  invalid  UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
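
       The three UTF-16 conditions listed above can be illustrated with a
       short check. This is a sketch only: u16_validate() and its status names
       are hypothetical and are not PCRE2 identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for this sketch; not PCRE2 error codes. */
typedef enum {
    U16_OK = 0,
    U16_MISSING_LOW_AT_END,   /* high surrogate is the last code unit */
    U16_INVALID_LOW,          /* high surrogate not followed by a low one */
    U16_ISOLATED_LOW          /* low surrogate with no preceding high one */
} u16_status;

/* Validate s[0..len). On failure, *bad_offset is the code unit offset of
   the offending character. */
u16_status u16_validate(const uint16_t *s, size_t len, size_t *bad_offset)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        *bad_offset = i;
        if ((c & 0xf800) != 0xd800) continue;            /* not a surrogate */
        if ((c & 0x0400) != 0) return U16_ISOLATED_LOW;  /* 0xdc00 to 0xdfff */
        if (i + 1 >= len) return U16_MISSING_LOW_AT_END;
        if ((s[i + 1] & 0xfc00) != 0xdc00) return U16_INVALID_LOW;
        i++;                                             /* skip the low half */
    }
    return U16_OK;
}
```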


   Errors in UTF-32 strings

       The  following  negative  error  codes  are  given  for  invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
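
       The two UTF-32 conditions amount to a range test on each code unit, as
       the following sketch shows. u32_validate() and its status names are
       hypothetical and are not PCRE2 identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for this sketch; not PCRE2 error codes. */
typedef enum { U32_OK = 0, U32_SURROGATE, U32_OUT_OF_RANGE } u32_status;

/* Validate s[0..len). On failure, *bad_offset is the code unit offset of
   the offending character. */
u32_status u32_validate(const uint32_t *s, size_t len, size_t *bad_offset)
{
    for (size_t i = 0; i < len; i++) {
        *bad_offset = i;
        if (s[i] >= 0xd800 && s[i] <= 0xdfff) return U32_SURROGATE;
        if (s[i] > 0x10ffff) return U32_OUT_OF_RANGE;
    }
    return U32_OK;
}
```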


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------