Tests and documentation updates.
parent 819e175659
commit f024446c93
21
doc/pcre2.3
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2 3 "03 November 2014" "PCRE2 10.00"
|
.TH PCRE2 3 "18 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH INTRODUCTION
|
.SH INTRODUCTION
|
||||||
|
@ -8,9 +8,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
|
||||||
of functions, written in C, that implement regular expression pattern matching
|
of functions, written in C, that implement regular expression pattern matching
|
||||||
using the same syntax and semantics as Perl, with just a few differences. Some
|
using the same syntax and semantics as Perl, with just a few differences. Some
|
||||||
features that appeared in Python and the original PCRE before they appeared in
|
features that appeared in Python and the original PCRE before they appeared in
|
||||||
Perl are also available using the Python syntax, there is some support for one
|
Perl are also available using the Python syntax. There is also some support for
|
||||||
or two .NET and Oniguruma syntax items, and there are options for requesting
|
one or two .NET and Oniguruma syntax items, and there are options for
|
||||||
some minor changes that give better ECMAScript (aka JavaScript) compatibility.
|
requesting some minor changes that give better ECMAScript (aka JavaScript)
|
||||||
|
compatibility.
|
||||||
.P
|
.P
|
||||||
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
|
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
|
||||||
code units, which means that up to three separate libraries may be installed.
|
code units, which means that up to three separate libraries may be installed.
|
||||||
|
@ -18,7 +19,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
|
||||||
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
||||||
can be interpreted either as one character per code unit, or as UTF-encoded
|
can be interpreted either as one character per code unit, or as UTF-encoded
|
||||||
Unicode, with support for Unicode general category properties. Unicode support
|
Unicode, with support for Unicode general category properties. Unicode support
|
||||||
is optional at build time (but is the default); however, processing strings as
|
is optional at build time (but is the default). However, processing strings as
|
||||||
UTF code units must be enabled explicitly at run time. The version of Unicode
|
UTF code units must be enabled explicitly at run time. The version of Unicode
|
||||||
in use can be discovered by running
|
in use can be discovered by running
|
||||||
.sp
|
.sp
|
||||||
|
@ -140,19 +141,19 @@ listing), and the short pages for individual functions, are concatenated in
|
||||||
pcre2compat discussion of Perl compatibility
|
pcre2compat discussion of Perl compatibility
|
||||||
pcre2demo a demonstration C program that uses PCRE2
|
pcre2demo a demonstration C program that uses PCRE2
|
||||||
pcre2grep description of the \fBpcre2grep\fP command (8-bit only)
|
pcre2grep description of the \fBpcre2grep\fP command (8-bit only)
|
||||||
pcre2jit discussion of the just-in-time optimization support
|
pcre2jit discussion of just-in-time optimization support
|
||||||
pcre2limits details of size and other limits
|
pcre2limits details of size and other limits
|
||||||
pcre2matching discussion of the two matching algorithms
|
pcre2matching discussion of the two matching algorithms
|
||||||
pcre2partial details of the partial matching facility
|
pcre2partial details of the partial matching facility
|
||||||
.\" JOIN
|
.\" JOIN
|
||||||
pcre2pattern syntax and semantics of supported
|
pcre2pattern syntax and semantics of supported regular
|
||||||
regular expressions
|
expression patterns
|
||||||
pcre2perform discussion of performance issues
|
pcre2perform discussion of performance issues
|
||||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||||
pcre2sample discussion of the pcre2demo program
|
pcre2sample discussion of the pcre2demo program
|
||||||
pcre2stack discussion of stack usage
|
pcre2stack discussion of stack usage
|
||||||
pcre2syntax quick syntax reference
|
pcre2syntax quick syntax reference
|
||||||
pcre2test description of the \fBpcre2test\fP testing command
|
pcre2test description of the \fBpcre2test\fP command
|
||||||
pcre2unicode discussion of Unicode and UTF support
|
pcre2unicode discussion of Unicode and UTF support
|
||||||
.sp
|
.sp
|
||||||
In the "man" and HTML formats, there is also a short page for each C library
|
In the "man" and HTML formats, there is also a short page for each C library
|
||||||
|
@ -176,6 +177,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 03 November 2014
|
Last updated: 18 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
189
doc/pcre2api.3
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "11 November 2014" "PCRE2 10.00"
|
.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -384,12 +384,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
|
||||||
.P
|
.P
|
||||||
Each of the first three conventions is used by at least one operating system as
|
Each of the first three conventions is used by at least one operating system as
|
||||||
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
||||||
The default default is LF, which is the Unix standard. When PCRE2 is run, the
|
The default default is LF, which is the Unix standard. However, the newline
|
||||||
default can be overridden, either when a pattern is compiled, or when it is
|
convention can be changed by an application when calling \fBpcre2_compile()\fP,
|
||||||
matched.
|
or it can be specified by special text at the start of the pattern itself; this
|
||||||
.P
|
|
||||||
The newline convention can be changed when calling \fBpcre2_compile()\fP, or it
|
|
||||||
can be specified by special text at the start of the pattern itself; this
|
|
||||||
overrides any other settings. See the
|
overrides any other settings. See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
|
@ -409,8 +406,8 @@ section on \fBpcre2_match()\fP options
|
||||||
below.
|
below.
|
||||||
.P
|
.P
|
||||||
The choice of newline convention does not affect the interpretation of
|
The choice of newline convention does not affect the interpretation of
|
||||||
the \en or \er escape sequences, nor does it affect what \eR matches, which has
|
the \en or \er escape sequences, nor does it affect what \eR matches; this has
|
||||||
its own separate control.
|
its own separate convention.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SH MULTITHREADING
|
.SH MULTITHREADING
|
||||||
|
@ -423,7 +420,7 @@ designed to be fairly simple for non-threaded applications while at the same
|
||||||
time ensuring that multithreaded applications can use it.
|
time ensuring that multithreaded applications can use it.
|
||||||
.P
|
.P
|
||||||
There are several different blocks of data that are used to pass information
|
There are several different blocks of data that are used to pass information
|
||||||
between the application and the PCRE libraries.
|
between the application and the PCRE2 libraries.
|
||||||
.P
|
.P
|
||||||
(1) A pointer to the compiled form of a pattern is returned to the user when
|
(1) A pointer to the compiled form of a pattern is returned to the user when
|
||||||
\fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
|
\fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
|
||||||
|
@ -529,11 +526,11 @@ The memory used for a general context should be freed by calling:
|
||||||
A compile context is required if you want to change the default values of any
|
A compile context is required if you want to change the default values of any
|
||||||
of the following compile-time parameters:
|
of the following compile-time parameters:
|
||||||
.sp
|
.sp
|
||||||
What \eR matches (Unicode newlines or CR, LF, CRLF only);
|
What \eR matches (Unicode newlines or CR, LF, CRLF only)
|
||||||
PCRE2's character tables;
|
PCRE2's character tables
|
||||||
The newline character sequence;
|
The newline character sequence
|
||||||
The compile time nested parentheses limit;
|
The compile time nested parentheses limit
|
||||||
An external function for stack checking.
|
An external function for stack checking
|
||||||
.sp
|
.sp
|
||||||
A compile context is also required if you are using custom memory management.
|
A compile context is also required if you are using custom memory management.
|
||||||
If none of these apply, just pass NULL as the context argument of
|
If none of these apply, just pass NULL as the context argument of
|
||||||
|
@ -562,9 +559,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
|
||||||
.sp
|
.sp
|
||||||
The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF,
|
The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF,
|
||||||
or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line
|
or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line
|
||||||
ending sequence. The value of this parameter does not affect what is compiled;
|
ending sequence. The value is used by the JIT compiler and by the two
|
||||||
it is just saved with the compiled pattern. The value is used by the JIT
|
interpreted matching functions, \fIpcre2_match()\fP and
|
||||||
compiler and by the two interpreted matching functions, \fIpcre2_match()\fP and
|
|
||||||
\fIpcre2_dfa_match()\fP.
|
\fIpcre2_dfa_match()\fP.
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
|
@ -678,12 +674,12 @@ patterns that are not anchored, the count restarts from zero for each position
|
||||||
in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
|
in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
|
||||||
which ignores it.
|
which ignores it.
|
||||||
.P
|
.P
|
||||||
When \fBpcre2_match()\fP is called with a pattern that was successfully studied
|
When \fBpcre2_match()\fP is called with a pattern that was successfully
|
||||||
with \fBpcre2_jit_compile()\fP, the way that the matching is executed is
|
processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
|
||||||
entirely different. However, there is still the possibility of runaway matching
|
is entirely different. However, there is still the possibility of runaway
|
||||||
that goes on for a very long time, and so the \fImatch_limit\fP value is also
|
matching that goes on for a very long time, and so the \fImatch_limit\fP value
|
||||||
used in this case (but in a different way) to limit how long the matching can
|
is also used in this case (but in a different way) to limit how long the
|
||||||
continue.
|
matching can continue.
|
||||||
.P
|
.P
|
||||||
The default value for the limit can be set when PCRE2 is built; the default
|
The default value for the limit can be set when PCRE2 is built; the default
|
||||||
default is 10 million, which handles all but the most extreme cases. If the
|
default is 10 million, which handles all but the most extreme cases. If the
|
||||||
|
@ -744,15 +740,16 @@ documentation. See the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2build\fP
|
\fBpcre2build\fP
|
||||||
.\"
|
.\"
|
||||||
documentation for details of how to build PCRE2. Using the heap for recursion
|
documentation for details of how to build PCRE2.
|
||||||
is a non-standard way of building PCRE2, for use in environments that have
|
.P
|
||||||
limited stacks. Because of the greater use of memory management,
|
Using the heap for recursion is a non-standard way of building PCRE2, for use
|
||||||
\fBpcre2_match()\fP runs more slowly. Functions that are different to the
|
in environments that have limited stacks. Because of the greater use of memory
|
||||||
general custom memory functions are provided so that special-purpose external
|
management, \fBpcre2_match()\fP runs more slowly. Functions that are different
|
||||||
code can be used for this case, because the memory blocks are all the same
|
to the general custom memory functions are provided so that special-purpose
|
||||||
size. The blocks are retained by \fBpcre2_match()\fP until it is about to exit
|
external code can be used for this case, because the memory blocks are all the
|
||||||
so that they can be re-used when possible during the match. In the absence of
|
same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
|
||||||
these functions, the normal custom memory management functions are used, if
|
exit so that they can be re-used when possible during the match. In the absence
|
||||||
|
of these functions, the normal custom memory management functions are used, if
|
||||||
supplied, otherwise the system functions.
|
supplied, otherwise the system functions.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -784,9 +781,10 @@ available:
|
||||||
PCRE2_CONFIG_BSR
|
PCRE2_CONFIG_BSR
|
||||||
.sp
|
.sp
|
||||||
The output is an integer whose value indicates what character sequences the \eR
|
The output is an integer whose value indicates what character sequences the \eR
|
||||||
escape sequence matches by default. A value of 0 means that \eR matches any
|
escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \eR
|
||||||
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF,
|
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
|
||||||
or CRLF. The default can be overridden when a pattern is compiled or matched.
|
that \eR matches only CR, LF, or CRLF. The default can be overridden when a
|
||||||
|
pattern is compiled.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_CONFIG_JIT
|
PCRE2_CONFIG_JIT
|
||||||
.sp
|
.sp
|
||||||
|
@ -796,7 +794,7 @@ compiling is available; otherwise it is set to zero.
|
||||||
PCRE2_CONFIG_JITTARGET
|
PCRE2_CONFIG_JITTARGET
|
||||||
.sp
|
.sp
|
||||||
The \fIwhere\fP argument should point to a buffer that is at least 48 code
|
The \fIwhere\fP argument should point to a buffer that is at least 48 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
|
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
|
||||||
string that contains the name of the architecture for which the JIT compiler is
|
string that contains the name of the architecture for which the JIT compiler is
|
||||||
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
|
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
|
||||||
|
@ -829,11 +827,11 @@ Further details are given with \fBpcre2_match()\fP below.
|
||||||
The output is an integer whose value specifies the default character sequence
|
The output is an integer whose value specifies the default character sequence
|
||||||
that is recognized as meaning "newline". The values are:
|
that is recognized as meaning "newline". The values are:
|
||||||
.sp
|
.sp
|
||||||
1 Carriage return (CR)
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||||
2 Linefeed (LF)
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||||
3 Carriage return, linefeed (CRLF)
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||||
4 Any Unicode line ending
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||||
5 Any of CR, LF, or CRLF
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||||
.sp
|
.sp
|
||||||
The default should normally correspond to the standard sequence for your
|
The default should normally correspond to the standard sequence for your
|
||||||
operating system.
|
operating system.
|
||||||
|
@ -865,7 +863,7 @@ heap instead of recursive function calls.
|
||||||
PCRE2_CONFIG_UNICODE_VERSION
|
PCRE2_CONFIG_UNICODE_VERSION
|
||||||
.sp
|
.sp
|
||||||
The \fIwhere\fP argument should point to a buffer that is at least 24 code
|
The \fIwhere\fP argument should point to a buffer that is at least 24 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled
|
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled
|
||||||
without Unicode support, the buffer is filled with the text "Unicode not
|
without Unicode support, the buffer is filled with the text "Unicode not
|
||||||
supported". Otherwise, the Unicode version string (for example, "7.0.0") is
|
supported". Otherwise, the Unicode version string (for example, "7.0.0") is
|
||||||
|
@ -880,7 +878,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
|
||||||
PCRE2_CONFIG_VERSION
|
PCRE2_CONFIG_VERSION
|
||||||
.sp
|
.sp
|
||||||
The \fIwhere\fP argument should point to a buffer that is at least 12 code
|
The \fIwhere\fP argument should point to a buffer that is at least 12 code
|
||||||
units long. (The exact length needed can be found by calling
|
units long. (The exact length required can be found by calling
|
||||||
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
|
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
|
||||||
the PCRE2 version string, zero-terminated. The number of code units used is
|
the PCRE2 version string, zero-terminated. The number of code units used is
|
||||||
returned. This is the length of the string plus one unit for the terminating
|
returned. This is the length of the string plus one unit for the terminating
|
||||||
|
@ -899,16 +897,16 @@ zero.
|
||||||
.B pcre2_code_free(pcre2_code *\fIcode\fP);
|
.B pcre2_code_free(pcre2_code *\fIcode\fP);
|
||||||
.fi
|
.fi
|
||||||
.P
|
.P
|
||||||
This function compiles a pattern, defined by a pointer to a string of code
|
The \fBpcre2_compile()\fP function compiles a pattern into an internal form.
|
||||||
units and a length, into an internal form. If the pattern is zero-terminated,
|
The pattern is defined by a pointer to a string of code units and a length. If
|
||||||
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a
|
the pattern is zero-terminated, the length can be specified as
|
||||||
pointer to a block of memory that contains the compiled pattern and related
|
PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
|
||||||
data. The caller must free the memory by calling \fBpcre2_code_free()\fP when
|
contains the compiled pattern and related data. The caller must free the memory
|
||||||
it is no longer needed.
|
by calling \fBpcre2_code_free()\fP when it is no longer needed.
|
||||||
.P
|
.P
|
||||||
If the compile context argument \fIccontext\fP is NULL, the memory is obtained
|
If the compile context argument \fIccontext\fP is NULL, memory for the compiled
|
||||||
by calling \fBmalloc()\fP. Otherwise, it is obtained from the same memory
|
pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
|
||||||
function that was used for the compile context.
|
the same memory function that was used for the compile context.
|
||||||
.P
|
.P
|
||||||
The \fIoptions\fP argument contains various bit settings that affect the
|
The \fIoptions\fP argument contains various bit settings that affect the
|
||||||
compilation. It should be zero if no options are required. The available
|
compilation. It should be zero if no options are required. The available
|
||||||
|
@ -1235,7 +1233,7 @@ in the
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
page. If you set PCRE2_UCP, matching one of the items it affects takes much
|
||||||
longer. The option is available only if PCRE2 has been compiled with UTF
|
longer. The option is available only if PCRE2 has been compiled with Unicode
|
||||||
support.
|
support.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_UNGREEDY
|
PCRE2_UNGREEDY
|
||||||
|
@ -1248,9 +1246,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
|
||||||
.sp
|
.sp
|
||||||
This option causes PCRE2 to regard both the pattern and the subject strings
|
This option causes PCRE2 to regard both the pattern and the subject strings
|
||||||
that are subsequently processed as strings of UTF characters instead of
|
that are subsequently processed as strings of UTF characters instead of
|
||||||
single-code-unit strings. However, it is available only when PCRE2 is built to
|
single-code-unit strings. It is available when PCRE2 is built to include
|
||||||
include UTF support. If not, the use of this option provokes an error. Details
|
Unicode support (which is the default). If Unicode support is not available,
|
||||||
of how this option changes the behaviour of PCRE2 are given in the
|
the use of this option provokes an error. Details of how this option changes
|
||||||
|
the behaviour of PCRE2 are given in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2unicode\fP
|
\fBpcre2unicode\fP
|
||||||
.\"
|
.\"
|
||||||
|
@ -1314,13 +1313,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
|
||||||
.sp
|
.sp
|
||||||
PCRE2 handles caseless matching, and determines whether characters are letters,
|
PCRE2 handles caseless matching, and determines whether characters are letters,
|
||||||
digits, or whatever, by reference to a set of tables, indexed by character code
|
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||||
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries,
|
point. This applies only to characters whose code points are less than 256. By
|
||||||
this applies only to characters with code points less than 256. By default,
|
default, higher-valued code points never match escapes such as \ew or \ed.
|
||||||
higher-valued code points never match escapes such as \ew or \ed. However, if
|
However, if PCRE2 is built with UTF support, all characters can be tested with
|
||||||
PCRE2 is built with UTF support, all characters can be tested with \ep and \eP,
|
\ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern
|
||||||
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled;
|
is compiled; this causes \ew and friends to use Unicode property support
|
||||||
this causes \ew and friends to use Unicode property support instead of the
|
instead of the built-in tables.
|
||||||
built-in tables.
|
|
||||||
.P
|
.P
|
||||||
The use of locales with Unicode is discouraged. If you are handling characters
|
The use of locales with Unicode is discouraged. If you are handling characters
|
||||||
with code points greater than 128, you should either use Unicode support, or
|
with code points greater than 128, you should either use Unicode support, or
|
||||||
|
@ -1433,9 +1431,9 @@ are no back references.
|
||||||
PCRE2_INFO_BSR
|
PCRE2_INFO_BSR
|
||||||
.sp
|
.sp
|
||||||
The output is a uint32_t whose value indicates what character sequences the \eR
|
The output is a uint32_t whose value indicates what character sequences the \eR
|
||||||
escape sequence matches by default. A value of 0 means that \eR matches any
|
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR matches
|
||||||
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF,
|
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR
|
||||||
or CRLF. The default can be overridden when a pattern is matched.
|
matches only CR, LF, or CRLF.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_CAPTURECOUNT
|
PCRE2_INFO_CAPTURECOUNT
|
||||||
.sp
|
.sp
|
||||||
|
@ -1623,17 +1621,16 @@ different for each compiled pattern.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_NEWLINE
|
PCRE2_INFO_NEWLINE
|
||||||
.sp
|
.sp
|
||||||
The output is a \fBuint32_t\fP whose value specifies the default character
|
The output is a \fBuint32_t\fP with one of the following values:
|
||||||
sequence that will be recognized as meaning "newline" while matching. The
|
|
||||||
values are:
|
|
||||||
.sp
|
.sp
|
||||||
1 Carriage return (CR)
|
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||||
2 Linefeed (LF)
|
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||||
3 Carriage return, linefeed (CRLF)
|
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||||
4 Any Unicode line ending
|
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||||
5 Any of CR, LF, or CRLF
|
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||||
.sp
|
.sp
|
||||||
The default can be overridden when a pattern is matched.
|
This specifies the default character sequence that will be recognized as
|
||||||
|
meaning "newline" while matching.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_INFO_RECURSIONLIMIT
|
PCRE2_INFO_RECURSIONLIMIT
|
||||||
.sp
|
.sp
|
||||||
|
@ -1671,30 +1668,32 @@ Information about successful and unsuccessful matches is placed in a match
|
||||||
data block, which is an opaque structure that is accessed by function calls. In
|
data block, which is an opaque structure that is accessed by function calls. In
|
||||||
particular, the match data block contains a vector of offsets into the subject
|
particular, the match data block contains a vector of offsets into the subject
|
||||||
string that define the matched part of the subject and any substrings that were
|
string that define the matched part of the subject and any substrings that were
|
||||||
capured. This is know as the \fIovector\fP.
|
captured. This is known as the \fIovector\fP.
|
||||||
.P
|
.P
|
||||||
Before calling \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP you must create a
|
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
|
||||||
match data block by calling one of the creation functions above. For
|
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
|
||||||
\fBpcre2_match_data_create()\fP, the first argument is the number of pairs of
|
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
|
||||||
offsets in the \fIovector\fP. One pair of offsets is required to identify the
|
argument is the number of pairs of offsets in the \fIovector\fP. One pair of
|
||||||
string that matched the whole pattern, with another pair for each captured
|
offsets is required to identify the string that matched the whole pattern, with
|
||||||
substring. For example, a value of 4 creates enough space to record the matched
|
another pair for each captured substring. For example, a value of 4 creates
|
||||||
portion of the subject plus three captured substrings. A minimum of at least 1
|
enough space to record the matched portion of the subject plus three captured
|
||||||
pair is imposed by \fBpcre2_match_data_create()\fP, so it is always possible to
|
substrings. A minimum of at least 1 pair is imposed by
|
||||||
return the overall matched string.
|
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall
|
||||||
|
matched string.
|
||||||
.P
|
.P
|
||||||
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
|
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
|
||||||
pointer to a compiled pattern. In this case the ovector is created to be
|
pointer to a compiled pattern. In this case the ovector is created to be
|
||||||
exactly the right size to hold all the substrings a pattern might capture.
|
exactly the right size to hold all the substrings a pattern might capture.
|
||||||
.P
|
.P
|
||||||
The second argument of both these functions ia a pointer to a general context,
|
The second argument of both these functions is a pointer to a general context,
|
||||||
which can specify custom memory management for obtaining the memory for the
|
which can specify custom memory management for obtaining the memory for the
|
||||||
match data block. If you are not using custom memory management, pass NULL.
|
match data block. If you are not using custom memory management, pass NULL.
|
||||||
.P
|
.P
|
||||||
A match data block can be used many times, with the same or different compiled
|
A match data block can be used many times, with the same or different compiled
|
||||||
patterns. When it is no longer needed, it should be freed by calling
|
patterns. When it is no longer needed, it should be freed by calling
|
||||||
\fBpcre2_match_data_free()\fP. How to extract information from a match data
|
\fBpcre2_match_data_free()\fP. You can extract information from a match data
|
||||||
block after a match operation is described in the sections on
|
block after a match operation has finished, using functions that are described
|
||||||
|
in the sections on
|
||||||
.\" HTML <a href="#matchedstrings">
|
.\" HTML <a href="#matchedstrings">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
matched strings
|
matched strings
|
||||||
|
@ -1819,12 +1818,10 @@ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
|
||||||
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
||||||
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
|
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
|
||||||
.P
|
.P
|
||||||
If the pattern was successfully processed by the just-in-time (JIT) compiler,
|
Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
|
||||||
the only supported options for matching using the JIT code are PCRE2_NOTBOL,
|
compiler. If it is set, JIT matching is disabled and the normal interpretive
|
||||||
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
|
code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT
|
||||||
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used,
|
matching.
|
||||||
JIT matching is disabled and the normal interpretive code in
|
|
||||||
\fBpcre2_match()\fP is run.
|
|
||||||
.sp
|
.sp
|
||||||
PCRE2_ANCHORED
|
PCRE2_ANCHORED
|
||||||
.sp
|
.sp
|
||||||
|
@ -2704,6 +2701,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 11 November 2014
|
Last updated: 18 November 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -58,8 +58,8 @@ ISGCC=0
|
||||||
|
|
||||||
# If the compiler is gcc, add a lot of warning switches.
|
# If the compiler is gcc, add a lot of warning switches.
|
||||||
|
|
||||||
cc --version >zzz 2>/dev/null
|
cc --version >/tmp/pcre2ccversion 2>/dev/null
|
||||||
if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then
|
if [ $? -eq 0 ] && grep GCC /tmp/pcre2ccversion >/dev/null; then
|
||||||
ISGCC=1
|
ISGCC=1
|
||||||
CFLAGS="$CFLAGS -Wall"
|
CFLAGS="$CFLAGS -Wall"
|
||||||
CFLAGS="$CFLAGS -Wno-overlength-strings"
|
CFLAGS="$CFLAGS -Wno-overlength-strings"
|
||||||
|
@ -77,7 +77,7 @@ if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then
|
||||||
CFLAGS="$CFLAGS -Wmissing-prototypes"
|
CFLAGS="$CFLAGS -Wmissing-prototypes"
|
||||||
CFLAGS="$CFLAGS -Wstrict-prototypes"
|
CFLAGS="$CFLAGS -Wstrict-prototypes"
|
||||||
fi
|
fi
|
||||||
|
rm -f /tmp/pcre2ccversion
|
||||||
|
|
||||||
# This function runs a single test with the set of configuration options that
|
# This function runs a single test with the set of configuration options that
|
||||||
# are in $opts. The source directory must be set in srcdir. The function must
|
# are in $opts. The source directory must be set in srcdir. The function must
|
||||||
|
@ -129,8 +129,6 @@ runtest()
|
||||||
|
|
||||||
./pcre2test -C jit >/dev/null
|
./pcre2test -C jit >/dev/null
|
||||||
jit=$?
|
jit=$?
|
||||||
./pcre2test -C unicode >/dev/null
|
|
||||||
utf=$?
|
|
||||||
./pcre2test -C pcre2-8 >/dev/null
|
./pcre2test -C pcre2-8 >/dev/null
|
||||||
pcre2_8=$?
|
pcre2_8=$?
|
||||||
|
|
||||||
|
@ -164,7 +162,7 @@ runtest()
|
||||||
echo "Skipping pcre2grep tests: newline is $nl"
|
echo "Skipping pcre2grep tests: newline is $nl"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
if [ "$jit" -gt 0 -a $utf -gt 0 ]; then
|
if [ "$jit" -gt 0 ]; then
|
||||||
echo "Running JIT regression tests $withvalgrind"
|
echo "Running JIT regression tests $withvalgrind"
|
||||||
$cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr
|
$cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr
|
||||||
if [ $? -ne 0 -o -s teststderr ]; then
|
if [ $? -ne 0 -o -s teststderr ]; then
|
||||||
|
@ -175,7 +173,7 @@ runtest()
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
else
|
else
|
||||||
echo "Skipping JIT regression tests: JIT or UTF not enabled"
|
echo "Skipping JIT regression tests: JIT is not enabled"
|
||||||
fi
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
51
maint/README
|
@ -65,7 +65,7 @@ Updating to a new Unicode release
|
||||||
|
|
||||||
When there is a new release of Unicode, the files in Unicode.tables must be
|
When there is a new release of Unicode, the files in Unicode.tables must be
|
||||||
refreshed from the web site. If the new version of Unicode adds new character
|
refreshed from the web site. If the new version of Unicode adds new character
|
||||||
scripts, the source file pacr2_ucp.h and both the MultiStage2.py and the
|
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
|
||||||
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
|
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
|
||||||
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
|
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
|
||||||
run to generate the tricky tables for inclusion in pcre2_tables.c.
|
run to generate the tricky tables for inclusion in pcre2_tables.c.
|
||||||
|
@ -73,7 +73,7 @@ run to generate the tricky tables for inclusion in pcre2_tables.c.
|
||||||
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
|
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
|
||||||
the cause is usually a missing (or misspelt) name in the list of scripts. I
|
the cause is usually a missing (or misspelt) name in the list of scripts. I
|
||||||
couldn't find a straightforward list of scripts on the Unicode site, but
|
couldn't find a straightforward list of scripts on the Unicode site, but
|
||||||
there's a useful Wikipedia page that list them, and notes the Unicode version
|
there's a useful Wikipedia page that lists them, and notes the Unicode version
|
||||||
in which they were introduced:
|
in which they were introduced:
|
||||||
|
|
||||||
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
|
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
|
||||||
|
@ -130,7 +130,7 @@ distribution for a new release.
|
||||||
systems, using different compilers as well. For example, on Solaris it is
|
systems, using different compilers as well. For example, on Solaris it is
|
||||||
helpful to test using Sun's cc compiler as a change from gcc. Adding
|
helpful to test using Sun's cc compiler as a change from gcc. Adding
|
||||||
-xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
|
-xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
|
||||||
pcretest to increase the stack size for test 2. Since I retired I can no
|
pcre2test to increase the stack size for test 2. Since I retired I can no
|
||||||
longer do this, but instead I rely on putting out release candidates for
|
longer do this, but instead I rely on putting out release candidates for
|
||||||
folks on the pcre-dev list to test.
|
folks on the pcre-dev list to test.
|
||||||
|
|
||||||
|
@ -194,7 +194,7 @@ and the zipball. Double-check with "svn status", then create an SVN tagged
|
||||||
copy:
|
copy:
|
||||||
|
|
||||||
svn copy svn://vcs.exim.org/pcre2/code/trunk \
|
svn copy svn://vcs.exim.org/pcre2/code/trunk \
|
||||||
svn://vcs.exim.org/pcre2/code/tags/pcre-8.xx
|
svn://vcs.exim.org/pcre2/code/tags/pcre-10.xx
|
||||||
|
|
||||||
When the new release is out, don't forget to tell webmaster@pcre.org and the
|
When the new release is out, don't forget to tell webmaster@pcre.org and the
|
||||||
mailing list. Also, update the list of version numbers in Bugzilla (edit
|
mailing list. Also, update the list of version numbers in Bugzilla (edit
|
||||||
|
@ -206,8 +206,7 @@ Future ideas (wish list)
|
||||||
|
|
||||||
This section records a list of ideas so that they do not get forgotten. They
|
This section records a list of ideas so that they do not get forgotten. They
|
||||||
vary enormously in their usefulness and potential for implementation. Some are
|
vary enormously in their usefulness and potential for implementation. Some are
|
||||||
very sensible; some are rather wacky. Some have been on this list for years;
|
very sensible; some are rather wacky. Some have been on this list for years.
|
||||||
others are relatively new.
|
|
||||||
|
|
||||||
. Optimization
|
. Optimization
|
||||||
|
|
||||||
|
@ -226,42 +225,38 @@ others are relatively new.
|
||||||
over the existing "required code unit" feature that just remembers one code
|
over the existing "required code unit" feature that just remembers one code
|
||||||
unit.
|
unit.
|
||||||
|
|
||||||
* Remember an initial string rather than just 1 code unit?
|
* Remember an initial string rather than just 1 code unit.
|
||||||
|
|
||||||
* A required code unit from alternatives - not just the last unit, but an
|
* A required code unit from alternatives - not just the last unit, but an
|
||||||
earlier one if common to all alternatives.
|
earlier one if common to all alternatives.
|
||||||
|
|
||||||
o Friedl contains other ideas.
|
* Friedl contains other ideas.
|
||||||
|
|
||||||
* The code does not set initial code unit flags for Unicode property types
|
* The code does not set initial code unit flags for Unicode property types
|
||||||
such as \p; I don't know how much benefit there would be for, for example,
|
such as \p; I don't know how much benefit there would be for, for example,
|
||||||
setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
|
setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
|
||||||
pattern starts with \p{N}.
|
pattern starts with \p{N}.
|
||||||
|
|
||||||
* There is scope for more "auto-possessifying" in connection with \p and \P.
|
|
||||||
|
|
||||||
. If Perl gets to a consistent state over the settings of capturing sub-
|
. If Perl gets to a consistent state over the settings of capturing sub-
|
||||||
patterns inside repeats, see if we can match it. One example of the
|
patterns inside repeats, see if we can match it. One example of the
|
||||||
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
|
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
|
||||||
leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
|
leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
|
||||||
because I think it needs much more state to be remembered.
|
because I think it needs much more state to be remembered.
|
||||||
|
|
||||||
. Perl 6 will be a revolution. Is it a revolution too far for PCRE?
|
. Perl 6 will be a revolution. Is it a revolution too far for PCRE?
|
||||||
|
|
||||||
. Line endings:
|
. An option to use NUL as a line terminator in subject strings. This could be
|
||||||
|
done relatively easily. If it is done, a suitable option for pcre2grep is
|
||||||
* Option to use NUL as a line terminator in subject strings. This could now
|
also required.
|
||||||
be done relatively easily since the extension to support LF, CR, and CRLF.
|
|
||||||
If it is done, a suitable option for pcre2grep is also required.
|
|
||||||
|
|
||||||
. Catch SIGSEGV for stack overflows?
|
. Catch SIGSEGV for stack overflows?
|
||||||
|
|
||||||
. A feature to suspend a match via a callout was once requested.
|
. A feature to suspend a match via a callout was once requested.
|
||||||
|
|
||||||
. Option to convert results into character offsets and character lengths.
|
. An option to convert results into character offsets and character lengths.
|
||||||
|
|
||||||
. Option for pcre2grep to scan only the start of a file. I am not keen - this
|
. An option for pcre2grep to scan only the start of a file. I am not keen -
|
||||||
is the job of "head".
|
this is the job of "head".
|
||||||
|
|
||||||
. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
|
. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
|
||||||
preceded by a blank line, instead of adding it to every matched line, and (b)
|
preceded by a blank line, instead of adding it to every matched line, and (b)
|
||||||
|
@ -274,11 +269,6 @@ others are relatively new.
|
||||||
to switch this dynamically. It would have to be specified when PCRE2 was
|
to switch this dynamically. It would have to be specified when PCRE2 was
|
||||||
compiled. PCRE2 would then call a function every time it wanted a character.
|
compiled. PCRE2 would then call a function every time it wanted a character.
|
||||||
|
|
||||||
. Wild thought: the ability to compile from PCRE2's internal code to a real
|
|
||||||
FSM and a very fast (third) matcher to process the result. There would be
|
|
||||||
even more restrictions than for pcre2_dfa_exec(), however. This is not easy.
|
|
||||||
This is probably obsolete now that we have the JIT support.
|
|
||||||
|
|
||||||
. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
|
. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
|
||||||
them will of course slow it down.
|
them will of course slow it down.
|
||||||
|
|
||||||
|
@ -296,10 +286,10 @@ others are relatively new.
|
||||||
pattern.
|
pattern.
|
||||||
|
|
||||||
. Pcre2grep: an option to specify the output line separator, either as a string
|
. Pcre2grep: an option to specify the output line separator, either as a string
|
||||||
or select from a fixed list. This is not dead easy, because at the moment it
|
or select from a fixed list. This is not straightforward, because at the
|
||||||
outputs whatever is in the input file.
|
moment it outputs whatever is in the input file.
|
||||||
|
|
||||||
. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
|
. Improve the code for duplicate checking in pcre_dfa_match(). An incomplete,
|
||||||
non-thread-safe patch showed that this can help performance for patterns
|
non-thread-safe patch showed that this can help performance for patterns
|
||||||
where there are many alternatives. However, a simple thread-safe
|
where there are many alternatives. However, a simple thread-safe
|
||||||
implementation that I tried made things worse in many simple cases, so this
|
implementation that I tried made things worse in many simple cases, so this
|
||||||
|
@ -308,8 +298,7 @@ others are relatively new.
|
||||||
. PCRE2 cannot at present distinguish between subpatterns with different names,
|
. PCRE2 cannot at present distinguish between subpatterns with different names,
|
||||||
but the same number (created by the use of ?|). In order to do so, a way of
|
but the same number (created by the use of ?|). In order to do so, a way of
|
||||||
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
|
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
|
||||||
Now that (*MARK) has been implemented, it can perhaps be used as a way round
|
(*MARK) can perhaps be used as a way round this problem.
|
||||||
this problem.
|
|
||||||
|
|
||||||
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
|
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
|
||||||
"something" and the the #ifdef appears only in one place, in "something".
|
"something" and the the #ifdef appears only in one place, in "something".
|
||||||
|
@ -317,4 +306,4 @@ others are relatively new.
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
Email local part: ph10
|
Email local part: ph10
|
||||||
Email domain: cam.ac.uk
|
Email domain: cam.ac.uk
|
||||||
Last updated: 25 October 2014
|
Last updated: 18 November 2014
|
||||||
|
|