Documentation update
parent 85fc061dcf
commit a5d81d06f4
|
@ -40,7 +40,11 @@ GENERIC INSTRUCTIONS FOR THE PCRE2 C LIBRARY
|
|||
|
||||
The following are generic instructions for building the PCRE2 C library "by
|
||||
hand". If you are going to use CMake, this section does not apply to you; you
|
||||
can skip ahead to the CMake section.
|
||||
can skip ahead to the CMake section. Note that the settings concerned with
|
||||
8-bit, 16-bit, and 32-bit code units relate to the type of data string that
|
||||
PCRE2 processes. They are NOT referring to the underlying operating system bit
|
||||
width. You do not have to do anything special to compile in a 64-bit
|
||||
environment, for example.
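
As a rough illustration of this distinction, the following C sketch computes how much storage a string occupies for each code unit width and, separately, reports the pointer width of the build environment. The typedef and function names are hypothetical and do not come from the PCRE2 sources; the only assumption is that the exact-width types from <stdint.h> are available.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative code unit types for the three widths.
   These names are hypothetical; they are not taken from pcre2.h. */
typedef uint8_t  demo_unit8;
typedef uint16_t demo_unit16;
typedef uint32_t demo_unit32;

/* Bytes occupied by a string of n_units code units of the given width
   (8, 16, or 32). Returns 0 for any other width. */
size_t demo_string_bytes(int code_unit_width, size_t n_units)
{
  switch (code_unit_width)
    {
    case 8:  return n_units * sizeof(demo_unit8);
    case 16: return n_units * sizeof(demo_unit16);
    case 32: return n_units * sizeof(demo_unit32);
    default: return 0;
    }
}

/* Pointer width of the build environment in bits. This depends on the
   compiler and operating system, not on the code unit width. */
size_t demo_pointer_bits(void)
{
  return sizeof(void *) * CHAR_BIT;
}
```

The result of demo_string_bytes() is the same whether the program is built for a 32-bit or a 64-bit target; only demo_pointer_bits() changes.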
|
||||
|
||||
(1) Copy or rename the file src/config.h.generic as src/config.h, and edit the
|
||||
macro settings that it contains to whatever is appropriate for your
|
||||
|
@ -86,11 +90,11 @@ can skip ahead to the CMake section.
|
|||
The tables in src/pcre2_chartables.c are defaults. The caller of PCRE2 can
|
||||
specify alternative tables at run time.
|
||||
|
||||
(4) For an 8-bit library, compile the following source files from the src
|
||||
directory, setting -DPCRE2_CODE_UNIT_WIDTH=8 as a compiler option. Also
|
||||
set -DHAVE_CONFIG_H if you have set up src/config.h with your
|
||||
configuration, or else use other -D settings to change the configuration
|
||||
as required.
|
||||
(4) For a library that supports 8-bit code units in the character strings that
|
||||
it processes, compile the following source files from the src directory,
|
||||
setting -DPCRE2_CODE_UNIT_WIDTH=8 as a compiler option. Also set
|
||||
-DHAVE_CONFIG_H if you have set up src/config.h with your configuration,
|
||||
or else use other -D settings to change the configuration as required.
|
||||
|
||||
pcre2_auto_possess.c
|
||||
pcre2_chartables.c
|
||||
|
@ -142,9 +146,9 @@ can skip ahead to the CMake section.
|
|||
If your system has static and shared libraries, you may have to do this
|
||||
once for each type.
|
||||
|
||||
(6) If you want to build a 16-bit library or 32-bit library (as well as, or
|
||||
instead of the 8-bit library) just supply 16 or 32 as the value of
|
||||
-DPCRE2_CODE_UNIT_WIDTH when you are compiling.
|
||||
(6) If you want to build a library that supports 16-bit or 32-bit code units
|
||||
(as well as, or instead of the 8-bit library) just supply 16 or 32 as the
|
||||
value of -DPCRE2_CODE_UNIT_WIDTH when you are compiling.
|
||||
|
||||
(7) If you want to build the POSIX wrapper functions (which apply only to the
|
||||
8-bit library), ensure that you have the src/pcre2posix.h file and then
|
||||
|
@ -401,6 +405,6 @@ Everything in that location, source and executable, is in EBCDIC and native
|
|||
z/OS file formats. The port provides an API for LE languages such as COBOL and
|
||||
for the z/OS and z/VM versions of the Rexx languages.
|
||||
|
||||
==============================
|
||||
Last Updated: 14 November 2018
|
||||
==============================
|
||||
===========================
|
||||
Last Updated: 28 April 2021
|
||||
===========================
|
||||
|
|
|
@ -38,8 +38,14 @@ Oniguruma syntax items, and there are options for requesting some minor changes
|
|||
that give better ECMAScript (aka JavaScript) compatibility.
|
||||
</P>
|
||||
<P>
|
||||
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
|
||||
code units, which means that up to three separate libraries may be installed.
|
||||
The source code for PCRE2 can be compiled to support strings of 8-bit, 16-bit,
|
||||
or 32-bit code units, which means that up to three separate libraries may be
|
||||
installed, one for each code unit size. The size of code unit is not related to
|
||||
the bit size of the underlying hardware. In a 64-bit environment that also
|
||||
supports 32-bit applications, versions of PCRE2 that are compiled in both
|
||||
64-bit and 32-bit modes may be needed.
|
||||
</P>
|
||||
<P>
|
||||
The original work to extend PCRE to 16-bit and 32-bit code units was done by
|
||||
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
||||
can be interpreted either as one character per code unit, or as UTF-encoded
|
||||
|
@ -198,9 +204,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
|||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 17 September 2018
|
||||
Last updated: 28 April 2021
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -1213,7 +1213,7 @@ Setting match controls
|
|||
The following modifiers affect the matching process or request additional
|
||||
information. Some of them may also be specified on a pattern line (see above),
|
||||
in which case they apply to every subject line that is matched against that
|
||||
pattern.
|
||||
pattern, but can be overridden by modifiers on the subject.
|
||||
<pre>
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
|
@ -1421,6 +1421,11 @@ replacement strings cannot contain commas, because a comma signifies the end of
|
|||
a modifier. This is not thought to be an issue in a test program.
|
||||
</P>
|
||||
<P>
|
||||
Specifying a completely empty replacement string disables this modifier.
|
||||
However, it is possible to specify an empty replacement by providing a buffer
|
||||
length, as described below, for an otherwise empty replacement.
|
||||
</P>
|
||||
<P>
|
||||
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to see if it
|
||||
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
|
||||
|
@ -2119,9 +2124,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 14 September 2020
|
||||
Last updated: 28 April 2021
|
||||
<br>
|
||||
Copyright © 1997-2020 University of Cambridge.
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
133
doc/pcre2.txt
|
@ -34,107 +34,112 @@ INTRODUCTION
|
|||
requesting some minor changes that give better ECMAScript (aka Java-
|
||||
Script) compatibility.
|
||||
|
||||
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
|
||||
32-bit code units, which means that up to three separate libraries may
|
||||
be installed. The original work to extend PCRE to 16-bit and 32-bit
|
||||
code units was done by Zoltan Herczeg and Christian Persch, respec-
|
||||
tively. In all three cases, strings can be interpreted either as one
|
||||
character per code unit, or as UTF-encoded Unicode, with support for
|
||||
Unicode general category properties. Unicode support is optional at
|
||||
build time (but is the default). However, processing strings as UTF
|
||||
code units must be enabled explicitly at run time. The version of Uni-
|
||||
code in use can be discovered by running
|
||||
The source code for PCRE2 can be compiled to support strings of 8-bit,
|
||||
16-bit, or 32-bit code units, which means that up to three separate li-
|
||||
braries may be installed, one for each code unit size. The size of code
|
||||
unit is not related to the bit size of the underlying hardware. In a
|
||||
64-bit environment that also supports 32-bit applications, versions of
|
||||
PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
|
||||
|
||||
The original work to extend PCRE to 16-bit and 32-bit code units was
|
||||
done by Zoltan Herczeg and Christian Persch, respectively. In all three
|
||||
cases, strings can be interpreted either as one character per code
|
||||
unit, or as UTF-encoded Unicode, with support for Unicode general cate-
|
||||
gory properties. Unicode support is optional at build time (but is the
|
||||
default). However, processing strings as UTF code units must be enabled
|
||||
explicitly at run time. The version of Unicode in use can be discovered
|
||||
by running
|
||||
|
||||
pcre2test -C
|
||||
|
||||
The three libraries contain identical sets of functions, with names
|
||||
ending in _8, _16, or _32, respectively (for example, pcre2_com-
|
||||
pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
|
||||
32, a program that uses just one code unit width can be written using
|
||||
The three libraries contain identical sets of functions, with names
|
||||
ending in _8, _16, or _32, respectively (for example, pcre2_com-
|
||||
pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
|
||||
32, a program that uses just one code unit width can be written using
|
||||
generic names such as pcre2_compile(), and the documentation is written
|
||||
assuming that this is the case.
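
The mapping from a generic name to a width-specific one can be sketched with the C preprocessor's token-pasting operator. The example below is only a sketch: every name beginning with demo_ or DEMO_ is hypothetical, and it is not a copy of what the PCRE2 header does.

```c
#include <stddef.h>

/* Hypothetical stand-ins for width-specific library functions; each returns
   the number of bytes in one code unit of "its" library. */
size_t demo_unit_size_8(void)  { return 1; }
size_t demo_unit_size_16(void) { return 2; }
size_t demo_unit_size_32(void) { return 4; }

/* The application chooses one width before using the generic name. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two levels of macro are needed so that DEMO_CODE_UNIT_WIDTH is replaced
   by its value before it is pasted onto the function name. */
#define DEMO_JOIN(a, b) a ## b
#define DEMO_GLUE(a, b) DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* Generic name: with the setting above this expands to demo_unit_size_16. */
#define demo_unit_size DEMO_SUFFIX(demo_unit_size_)
```

With this arrangement a call to demo_unit_size() resolves at compile time to demo_unit_size_16(), while the suffixed names remain directly callable for a program that needs more than one width.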
|
||||
|
||||
In addition to the Perl-compatible matching function, PCRE2 contains an
|
||||
alternative function that matches the same compiled patterns in a dif-
|
||||
alternative function that matches the same compiled patterns in a dif-
|
||||
ferent way. In certain circumstances, the alternative function has some
|
||||
advantages. For a discussion of the two matching algorithms, see the
|
||||
advantages. For a discussion of the two matching algorithms, see the
|
||||
pcre2matching page.
|
||||
|
||||
Details of exactly which Perl regular expression features are and are
|
||||
not supported by PCRE2 are given in separate documents. See the
|
||||
pcre2pattern and pcre2compat pages. There is a syntax summary in the
|
||||
Details of exactly which Perl regular expression features are and are
|
||||
not supported by PCRE2 are given in separate documents. See the
|
||||
pcre2pattern and pcre2compat pages. There is a syntax summary in the
|
||||
pcre2syntax page.
|
||||
|
||||
Some features of PCRE2 can be included, excluded, or changed when the
|
||||
library is built. The pcre2_config() function makes it possible for a
|
||||
client to discover which features are available. The features them-
|
||||
Some features of PCRE2 can be included, excluded, or changed when the
|
||||
library is built. The pcre2_config() function makes it possible for a
|
||||
client to discover which features are available. The features them-
|
||||
selves are described in the pcre2build page. Documentation about build-
|
||||
ing PCRE2 for various operating systems can be found in the README and
|
||||
ing PCRE2 for various operating systems can be found in the README and
|
||||
NON-AUTOTOOLS_BUILD files in the source distribution.
|
||||
|
||||
The libraries contain a number of undocumented internal functions and
|
||||
data tables that are used by more than one of the exported external
|
||||
functions, but which are not intended for use by external callers.
|
||||
Their names all begin with "_pcre2", which hopefully will not provoke
|
||||
The libraries contain a number of undocumented internal functions and
|
||||
data tables that are used by more than one of the exported external
|
||||
functions, but which are not intended for use by external callers.
|
||||
Their names all begin with "_pcre2", which hopefully will not provoke
|
||||
any name clashes. In some environments, it is possible to control which
|
||||
external symbols are exported when a shared library is built, and in
|
||||
external symbols are exported when a shared library is built, and in
|
||||
these cases the undocumented symbols are not exported.
|
||||
|
||||
|
||||
SECURITY CONSIDERATIONS
|
||||
|
||||
If you are using PCRE2 in a non-UTF application that permits users to
|
||||
supply arbitrary patterns for compilation, you should be aware of a
|
||||
If you are using PCRE2 in a non-UTF application that permits users to
|
||||
supply arbitrary patterns for compilation, you should be aware of a
|
||||
feature that allows users to turn on UTF support from within a pattern.
|
||||
For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
|
||||
mode, which interprets patterns and subjects as strings of UTF-8 code
|
||||
For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
|
||||
mode, which interprets patterns and subjects as strings of UTF-8 code
|
||||
units instead of individual 8-bit characters. This causes both the pat-
|
||||
tern and any data against which it is matched to be checked for UTF-8
|
||||
validity. If the data string is very long, such a check might use suf-
|
||||
ficiently many resources as to cause your application to lose perfor-
|
||||
tern and any data against which it is matched to be checked for UTF-8
|
||||
validity. If the data string is very long, such a check might use suf-
|
||||
ficiently many resources as to cause your application to lose perfor-
|
||||
mance.
|
||||
|
||||
One way of guarding against this possibility is to use the pcre2_pat-
|
||||
tern_info() function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
||||
calling pcre2_compile(). This causes a compile time error if the pat-
|
||||
One way of guarding against this possibility is to use the pcre2_pat-
|
||||
tern_info() function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
|
||||
calling pcre2_compile(). This causes a compile time error if the pat-
|
||||
tern contains a UTF-setting sequence.
|
||||
|
||||
The use of Unicode properties for character types such as \d can also
|
||||
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
||||
The use of Unicode properties for character types such as \d can also
|
||||
be enabled from within the pattern, by specifying "(*UCP)". This fea-
|
||||
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
|
||||
If your application is one that supports UTF, be aware that validity
|
||||
checking can take time. If the same data string is to be matched many
|
||||
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
|
||||
If your application is one that supports UTF, be aware that validity
|
||||
checking can take time. If the same data string is to be matched many
|
||||
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
|
||||
subsequent matches to avoid running redundant checks.
|
||||
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
|
||||
to problems, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
|
||||
to problems, because it may leave the current matching point in the
|
||||
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
|
||||
tion can be used by an application to lock out the use of \C, causing a
|
||||
compile-time error if it is encountered. It is also possible to build
|
||||
compile-time error if it is encountered. It is also possible to build
|
||||
PCRE2 with the use of \C permanently disabled.
|
||||
|
||||
Another way that performance can be hit is by running a pattern that
|
||||
has a very large search tree against a string that will never match.
|
||||
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
||||
vides some protection against this: see the pcre2_set_match_limit()
|
||||
function in the pcre2api page. There is a similar function called
|
||||
Another way that performance can be hit is by running a pattern that
|
||||
has a very large search tree against a string that will never match.
|
||||
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
|
||||
vides some protection against this: see the pcre2_set_match_limit()
|
||||
function in the pcre2api page. There is a similar function called
|
||||
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
|
||||
ory that is used.
|
||||
|
||||
|
||||
USER DOCUMENTATION
|
||||
|
||||
The user documentation for PCRE2 comprises a number of different sec-
|
||||
tions. In the "man" format, each of these is a separate "man page". In
|
||||
the HTML format, each is a separate page, linked from the index page.
|
||||
In the plain text format, the descriptions of the pcre2grep and
|
||||
The user documentation for PCRE2 comprises a number of different sec-
|
||||
tions. In the "man" format, each of these is a separate "man page". In
|
||||
the HTML format, each is a separate page, linked from the index page.
|
||||
In the plain text format, the descriptions of the pcre2grep and
|
||||
pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
|
||||
respectively. The remaining sections, except for the pcre2demo section
|
||||
(which is a program listing), and the short pages for individual func-
|
||||
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
|
||||
respectively. The remaining sections, except for the pcre2demo section
|
||||
(which is a program listing), and the short pages for individual func-
|
||||
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
|
||||
tions are as follows:
|
||||
|
||||
pcre2 this document
|
||||
|
@ -160,7 +165,7 @@ USER DOCUMENTATION
|
|||
pcre2test description of the pcre2test command
|
||||
pcre2unicode discussion of Unicode and UTF support
|
||||
|
||||
In the "man" and HTML formats, there is also a short page for each C
|
||||
In the "man" and HTML formats, there is also a short page for each C
|
||||
library function, listing its arguments and results.
|
||||
|
||||
|
||||
|
@ -170,15 +175,15 @@ AUTHOR
|
|||
University Computing Service
|
||||
Cambridge, England.
|
||||
|
||||
Putting an actual email address here is a spam magnet. If you want to
|
||||
email me, use my two initials, followed by the two digits 10, at the
|
||||
Putting an actual email address here is a spam magnet. If you want to
|
||||
email me, use my two initials, followed by the two digits 10, at the
|
||||
domain cam.ac.uk.
|
||||
|
||||
|
||||
REVISION
|
||||
|
||||
Last updated: 17 September 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
Last updated: 28 April 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
|
|
@ -1084,7 +1084,8 @@ SUBJECT MODIFIERS
|
|||
The following modifiers affect the matching process or request addi-
|
||||
tional information. Some of them may also be specified on a pattern
|
||||
line (see above), in which case they apply to every subject line that
|
||||
is matched against that pattern.
|
||||
is matched against that pattern, but can be overridden by modifiers on
|
||||
the subject.
|
||||
|
||||
aftertext show text after match
|
||||
allaftertext show text after captures
|
||||
|
@ -1132,29 +1133,29 @@ SUBJECT MODIFIERS
|
|||
zero_terminate pass the subject as zero-terminated
|
||||
|
||||
The effects of these modifiers are described in the following sections.
|
||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||
and ovector subject modifiers work as described below. All other modi-
|
||||
When matching via the POSIX wrapper API, the aftertext, allaftertext,
|
||||
and ovector subject modifiers work as described below. All other modi-
|
||||
fiers are either ignored, with a warning message, or cause an error.
|
||||
|
||||
Showing more text
|
||||
|
||||
The aftertext modifier requests that as well as outputting the part of
|
||||
The aftertext modifier requests that as well as outputting the part of
|
||||
the subject string that matched the entire pattern, pcre2test should in
|
||||
addition output the remainder of the subject string. This is useful for
|
||||
tests where the subject contains multiple copies of the same substring.
|
||||
The allaftertext modifier requests the same action for captured sub-
|
||||
The allaftertext modifier requests the same action for captured sub-
|
||||
strings as well as the main matched substring. In each case the remain-
|
||||
der is output on the following line with a plus character following the
|
||||
capture number.
|
||||
|
||||
The allusedtext modifier requests that all the text that was consulted
|
||||
during a successful pattern match by the interpreter should be shown,
|
||||
for both full and partial matches. This feature is not supported for
|
||||
JIT matching, and if requested with JIT it is ignored (with a warning
|
||||
message). Setting this modifier affects the output if there is a look-
|
||||
behind at the start of a match, or, for a complete match, a lookahead
|
||||
The allusedtext modifier requests that all the text that was consulted
|
||||
during a successful pattern match by the interpreter should be shown,
|
||||
for both full and partial matches. This feature is not supported for
|
||||
JIT matching, and if requested with JIT it is ignored (with a warning
|
||||
message). Setting this modifier affects the output if there is a look-
|
||||
behind at the start of a match, or, for a complete match, a lookahead
|
||||
at the end, or if \K is used in the pattern. Characters that precede or
|
||||
follow the start and end of the actual match are indicated in the out-
|
||||
follow the start and end of the actual match are indicated in the out-
|
||||
put by '<' or '>' characters underneath them. Here is an example:
|
||||
|
||||
re> /(?<=pqr)abc(?=xyz)/
|
||||
|
@ -1165,16 +1166,16 @@ SUBJECT MODIFIERS
|
|||
Partial match: pqrabcxy
|
||||
<<<
|
||||
|
||||
The first, complete match shows that the matched string is "abc", with
|
||||
the preceding and following strings "pqr" and "xyz" having been con-
|
||||
sulted during the match (when processing the assertions). The partial
|
||||
The first, complete match shows that the matched string is "abc", with
|
||||
the preceding and following strings "pqr" and "xyz" having been con-
|
||||
sulted during the match (when processing the assertions). The partial
|
||||
match can indicate only the preceding string.
|
||||
|
||||
The startchar modifier requests that the starting character for the
|
||||
match be indicated, if it is different to the start of the matched
|
||||
The startchar modifier requests that the starting character for the
|
||||
match be indicated, if it is different to the start of the matched
|
||||
string. The only time when this occurs is when \K has been processed as
|
||||
part of the match. In this situation, the output for the matched string
|
||||
is displayed from the starting character instead of from the match
|
||||
is displayed from the starting character instead of from the match
|
||||
point, with circumflex characters under the earlier characters. For ex-
|
||||
ample:
|
||||
|
||||
|
@ -1183,7 +1184,7 @@ SUBJECT MODIFIERS
|
|||
0: abcxyz
|
||||
^^^
|
||||
|
||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||
Unlike allusedtext, the startchar modifier can be used with JIT. How-
|
||||
ever, these two modifiers are mutually exclusive.
|
||||
|
||||
Showing the value of all capture groups
|
||||
|
@ -1191,91 +1192,96 @@ SUBJECT MODIFIERS
|
|||
The allcaptures modifier requests that the values of all potential cap-
|
||||
tured parentheses be output after a match. By default, only those up to
|
||||
the highest one actually used in the match are output (corresponding to
|
||||
the return code from pcre2_match()). Groups that did not take part in
|
||||
the match are output as "<unset>". This modifier is not relevant for
|
||||
DFA matching (which does no capturing) and does not apply when replace
|
||||
the return code from pcre2_match()). Groups that did not take part in
|
||||
the match are output as "<unset>". This modifier is not relevant for
|
||||
DFA matching (which does no capturing) and does not apply when replace
|
||||
is specified; it is ignored, with a warning message, if present.
|
||||
|
||||
Showing the entire ovector, for all outcomes
|
||||
|
||||
The allvector modifier requests that the entire ovector be shown, what-
|
||||
ever the outcome of the match. Compare allcaptures, which shows only up
|
||||
to the maximum number of capture groups for the pattern, and then only
|
||||
for a successful complete non-DFA match. This modifier, which acts af-
|
||||
ter any match result, and also for DFA matching, provides a means of
|
||||
checking that there are no unexpected modifications to ovector fields.
|
||||
Before each match attempt, the ovector is filled with a special value,
|
||||
and if this is found in both elements of a capturing pair, "<un-
|
||||
changed>" is output. After a successful match, this applies to all
|
||||
groups after the maximum capture group for the pattern. In other cases
|
||||
it applies to the entire ovector. After a partial match, the first two
|
||||
elements are the only ones that should be set. After a DFA match, the
|
||||
amount of ovector that is used depends on the number of matches that
|
||||
to the maximum number of capture groups for the pattern, and then only
|
||||
for a successful complete non-DFA match. This modifier, which acts af-
|
||||
ter any match result, and also for DFA matching, provides a means of
|
||||
checking that there are no unexpected modifications to ovector fields.
|
||||
Before each match attempt, the ovector is filled with a special value,
|
||||
and if this is found in both elements of a capturing pair, "<un-
|
||||
changed>" is output. After a successful match, this applies to all
|
||||
groups after the maximum capture group for the pattern. In other cases
|
||||
it applies to the entire ovector. After a partial match, the first two
|
||||
elements are the only ones that should be set. After a DFA match, the
|
||||
amount of ovector that is used depends on the number of matches that
|
||||
were found.
|
||||
|
||||
Testing pattern callouts
|
||||
|
||||
A callout function is supplied when pcre2test calls the library match-
|
||||
ing functions, unless callout_none is specified. Its behaviour can be
|
||||
controlled by various modifiers listed above whose names begin with
|
||||
callout_. Details are given in the section entitled "Callouts" below.
|
||||
Testing callouts from pcre2_substitute() is described separately in
|
||||
A callout function is supplied when pcre2test calls the library match-
|
||||
ing functions, unless callout_none is specified. Its behaviour can be
|
||||
controlled by various modifiers listed above whose names begin with
|
||||
callout_. Details are given in the section entitled "Callouts" below.
|
||||
Testing callouts from pcre2_substitute() is described separately in
|
||||
"Testing the substitution function" below.
|
||||
|
||||
Finding all matches in a string
|
||||
|
||||
Searching for all possible matches within a subject can be requested by
|
||||
the global or altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
the global or altglobal modifier. After finding a match, the matching
|
||||
function is called again to search the remainder of the subject. The
|
||||
difference between global and altglobal is that the former uses the
|
||||
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
|
||||
searching at a new point within the entire string (which is what Perl
|
||||
does), whereas the latter passes over a shortened subject. This makes a
|
||||
difference to the matching process if the pattern begins with a lookbe-
|
||||
hind assertion (including \b or \B).
|
||||
|
||||
If an empty string is matched, the next match is done with the
|
||||
If an empty string is matched, the next match is done with the
|
||||
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
|
||||
for another, non-empty, match at the same point in the subject. If this
|
||||
match fails, the start offset is advanced, and the normal match is re-
|
||||
tried. This imitates the way Perl handles such cases when using the /g
|
||||
modifier or the split() function. Normally, the start offset is ad-
|
||||
vanced by one character, but if the newline convention recognizes CRLF
|
||||
as a newline, and the current character is CR followed by LF, an ad-
|
||||
match fails, the start offset is advanced, and the normal match is re-
|
||||
tried. This imitates the way Perl handles such cases when using the /g
|
||||
modifier or the split() function. Normally, the start offset is ad-
|
||||
vanced by one character, but if the newline convention recognizes CRLF
|
||||
as a newline, and the current character is CR followed by LF, an ad-
|
||||
vance of two characters occurs.
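
The advance rule in the last two sentences can be sketched in C as follows. This is a minimal illustration, not the code used by pcre2test: the function name and parameters are hypothetical, and it assumes a non-UTF 8-bit subject, so that one character is one code unit.

```c
#include <stddef.h>

/* Returns the start offset to use for the retried normal match after the
   anchored, non-empty attempt at `offset` has failed.
   `crlf_is_newline` is non-zero when the newline convention in force
   recognizes CRLF as a newline. */
size_t demo_next_start_offset(const char *subject, size_t length,
                              size_t offset, int crlf_is_newline)
{
  if (offset >= length)
    return length;                       /* nothing left to advance over */

  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;                   /* step over CR LF as one unit */

  return offset + 1;                     /* ordinary one-character advance */
}
```

In UTF mode the one-character advance would have to step over every code unit of the current character, which this sketch does not attempt.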
|
||||
|
||||
Testing substring extraction functions
|
||||
|
||||
The copy and get modifiers can be used to test the pcre2_sub-
|
||||
The copy and get modifiers can be used to test the pcre2_sub-
|
||||
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
|
||||
given more than once, and each can specify a capture group name or num-
|
||||
ber, for example:
|
||||
|
||||
abcd\=copy=1,copy=3,get=G1
|
||||
|
||||
If the #subject command is used to set default copy and/or get lists,
|
||||
these can be unset by specifying a negative number to cancel all num-
|
||||
If the #subject command is used to set default copy and/or get lists,
|
||||
these can be unset by specifying a negative number to cancel all num-
|
||||
bered groups and an empty name to cancel all named groups.
|
||||
|
||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||
The getall modifier tests pcre2_substring_list_get(), which extracts
|
||||
all captured substrings.
|
||||
|
||||
If the subject line is successfully matched, the substrings extracted
|
||||
by the convenience functions are output with C, G, or L after the
|
||||
string number instead of a colon. This is in addition to the normal
|
||||
full list. The string length (that is, the return from the extraction
|
||||
If the subject line is successfully matched, the substrings extracted
|
||||
by the convenience functions are output with C, G, or L after the
|
||||
string number instead of a colon. This is in addition to the normal
|
||||
full list. The string length (that is, the return from the extraction
|
||||
function) is given in parentheses after each substring, followed by the
|
||||
name when the extraction was by name.
|
||||
|
||||
Testing the substitution function
|
||||
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions (or after one call of
|
||||
pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re-
|
||||
placement strings cannot contain commas, because a comma signifies the
|
||||
end of a modifier. This is not thought to be an issue in a test pro-
|
||||
If the replace modifier is set, the pcre2_substitute() function is
|
||||
called instead of one of the matching functions (or after one call of
|
||||
pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re-
|
||||
placement strings cannot contain commas, because a comma signifies the
|
||||
end of a modifier. This is not thought to be an issue in a test pro-
|
||||
gram.
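
To see why a comma cannot appear inside a replacement string, consider a simple splitter that cuts a modifier list at every comma. The sketch below is an assumption about the general approach, with hypothetical names; it is not taken from the pcre2test source.

```c
#include <stddef.h>
#include <string.h>

/* Splits a modifier list in place at each comma. Stores up to `max`
   pointers to the individual modifiers in `out` and returns how many were
   stored. The buffer is modified: each comma found is overwritten with a
   terminating zero. */
size_t demo_split_modifiers(char *list, char *out[], size_t max)
{
  size_t count = 0;

  while (list != NULL && count < max)
    {
    char *comma = strchr(list, ',');
    if (comma != NULL) *comma = '\0';    /* end the current modifier here */
    out[count++] = list;
    list = (comma != NULL) ? comma + 1 : NULL;
    }

  return count;
}
```

Given the list "replace=x,y,global", such a splitter yields three items, "replace=x", "y", and "global", so the text after the first comma is never seen as part of the replacement.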
|
||||
|
||||
Specifying a completely empty replacement string disables this modi-
|
||||
fier. However, it is possible to specify an empty replacement by pro-
|
||||
viding a buffer length, as described below, for an otherwise empty re-
|
||||
placement.
|
||||
|
||||
Unlike subject strings, pcre2test does not process replacement strings
|
||||
for escape sequences. In UTF mode, a replacement string is checked to
|
||||
see if it is a valid UTF-8 string. If so, it is correctly converted to
|
||||
|
@ -1929,5 +1935,5 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 14 September 2020
|
||||
Copyright (c) 1997-2020 University of Cambridge.
|
||||
Last updated: 28 April 2021
|
||||
Copyright (c) 1997-2021 University of Cambridge.
|
||||
|
|