Tests and documentation updates.

Philip.Hazel 2014-11-18 18:32:12 +00:00
parent 819e175659
commit f024446c93
4 changed files with 130 additions and 145 deletions

@ -1,4 +1,4 @@
.TH PCRE2 3 "03 November 2014" "PCRE2 10.00" .TH PCRE2 3 "18 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH INTRODUCTION .SH INTRODUCTION
@ -8,9 +8,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. Some using the same syntax and semantics as Perl, with just a few differences. Some
features that appeared in Python and the original PCRE before they appeared in features that appeared in Python and the original PCRE before they appeared in
Perl are also available using the Python syntax, there is some support for one Perl are also available using the Python syntax. There is also some support for
or two .NET and Oniguruma syntax items, and there are options for requesting one or two .NET and Oniguruma syntax items, and there are options for
some minor changes that give better ECMAScript (aka JavaScript) compatibility. requesting some minor changes that give better ECMAScript (aka JavaScript)
compatibility.
.P .P
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
code units, which means that up to three separate libraries may be installed. code units, which means that up to three separate libraries may be installed.
@ -18,7 +19,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default); however, processing strings as is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running in use can be discovered by running
.sp .sp
@ -140,19 +141,19 @@ listing), and the short pages for individual functions, are concatenated in
pcre2compat discussion of Perl compatibility pcre2compat discussion of Perl compatibility
pcre2demo a demonstration C program that uses PCRE2 pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the \fBpcre2grep\fP command (8-bit only) pcre2grep description of the \fBpcre2grep\fP command (8-bit only)
pcre2jit discussion of the just-in-time optimization support pcre2jit discussion of just-in-time optimization support
pcre2limits details of size and other limits pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility pcre2partial details of the partial matching facility
.\" JOIN .\" JOIN
pcre2pattern syntax and semantics of supported pcre2pattern syntax and semantics of supported regular
regular expressions expression patterns
pcre2perform discussion of performance issues pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage pcre2stack discussion of stack usage
pcre2syntax quick syntax reference pcre2syntax quick syntax reference
pcre2test description of the \fBpcre2test\fP testing command pcre2test description of the \fBpcre2test\fP command
pcre2unicode discussion of Unicode and UTF support pcre2unicode discussion of Unicode and UTF support
.sp .sp
In the "man" and HTML formats, there is also a short page for each C library In the "man" and HTML formats, there is also a short page for each C library
@ -176,6 +177,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
.rs .rs
.sp .sp
.nf .nf
Last updated: 03 November 2014 Last updated: 18 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2API 3 "11 November 2014" "PCRE2 10.00" .TH PCRE2API 3 "18 November 2014" "PCRE2 10.00"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -384,12 +384,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
.P .P
Each of the first three conventions is used by at least one operating system as Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified. its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. When PCRE2 is run, the The default default is LF, which is the Unix standard. However, the newline
default can be overridden, either when a pattern is compiled, or when it is convention can be changed by an application when calling \fBpcre2_compile()\fP,
matched. or it can be specified by special text at the start of the pattern itself; this
.P
The newline convention can be changed when calling \fBpcre2_compile()\fP, or it
can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the overrides any other settings. See the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
@ -409,8 +406,8 @@ section on \fBpcre2_match()\fP options
below. below.
.P .P
The choice of newline convention does not affect the interpretation of The choice of newline convention does not affect the interpretation of
the \en or \er escape sequences, nor does it affect what \eR matches, which has the \en or \er escape sequences, nor does it affect what \eR matches; this has
its own separate control. its own separate convention.
. .
. .
.SH MULTITHREADING .SH MULTITHREADING
@ -423,7 +420,7 @@ designed to be fairly simple for non-threaded applications while at the same
time ensuring that multithreaded applications can use it. time ensuring that multithreaded applications can use it.
.P .P
There are several different blocks of data that are used to pass information There are several different blocks of data that are used to pass information
between the application and the PCRE libraries. between the application and the PCRE2 libraries.
.P .P
(1) A pointer to the compiled form of a pattern is returned to the user when (1) A pointer to the compiled form of a pattern is returned to the user when
\fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed, \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
@ -529,11 +526,11 @@ The memory used for a general context should be freed by calling:
A compile context is required if you want to change the default values of any A compile context is required if you want to change the default values of any
of the following compile-time parameters: of the following compile-time parameters:
.sp .sp
What \eR matches (Unicode newlines or CR, LF, CRLF only); What \eR matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables; PCRE2's character tables
The newline character sequence; The newline character sequence
The compile time nested parentheses limit; The compile time nested parentheses limit
An external function for stack checking. An external function for stack checking
.sp .sp
A compile context is also required if you are using custom memory management. A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of If none of these apply, just pass NULL as the context argument of
@ -562,9 +559,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
.sp .sp
The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF, The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line
ending sequence. The value of this parameter does not affect what is compiled; ending sequence. The value is used by the JIT compiler and by the two
it is just saved with the compiled pattern. The value is used by the JIT interpreted matching functions, \fIpcre2_match()\fP and
compiler and by the two interpreted matching functions, \fIpcre2_match()\fP and
\fIpcre2_dfa_match()\fP. \fIpcre2_dfa_match()\fP.
.sp .sp
.nf .nf
@ -678,12 +674,12 @@ patterns that are not anchored, the count restarts from zero for each position
in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP, in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
which ignores it. which ignores it.
.P .P
When \fBpcre2_match()\fP is called with a pattern that was successfully studied When \fBpcre2_match()\fP is called with a pattern that was successfully
with \fBpcre2_jit_compile()\fP, the way that the matching is executed is processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
entirely different. However, there is still the possibility of runaway matching is entirely different. However, there is still the possibility of runaway
that goes on for a very long time, and so the \fImatch_limit\fP value is also matching that goes on for a very long time, and so the \fImatch_limit\fP value
used in this case (but in a different way) to limit how long the matching can is also used in this case (but in a different way) to limit how long the
continue. matching can continue.
.P .P
The default value for the limit can be set when PCRE2 is built; the default The default value for the limit can be set when PCRE2 is built; the default
default is 10 million, which handles all but the most extreme cases. If the default is 10 million, which handles all but the most extreme cases. If the
@ -744,15 +740,16 @@ documentation. See the
.\" HREF .\" HREF
\fBpcre2build\fP \fBpcre2build\fP
.\" .\"
documentation for details of how to build PCRE2. Using the heap for recursion documentation for details of how to build PCRE2.
is a non-standard way of building PCRE2, for use in environments that have .P
limited stacks. Because of the greater use of memory management, Using the heap for recursion is a non-standard way of building PCRE2, for use
\fBpcre2_match()\fP runs more slowly. Functions that are different to the in environments that have limited stacks. Because of the greater use of memory
general custom memory functions are provided so that special-purpose external management, \fBpcre2_match()\fP runs more slowly. Functions that are different
code can be used for this case, because the memory blocks are all the same to the general custom memory functions are provided so that special-purpose
size. The blocks are retained by \fBpcre2_match()\fP until it is about to exit external code can be used for this case, because the memory blocks are all the
so that they can be re-used when possible during the match. In the absence of same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
these functions, the normal custom memory management functions are used, if exit so that they can be re-used when possible during the match. In the absence
of these functions, the normal custom memory management functions are used, if
supplied, otherwise the system functions. supplied, otherwise the system functions.
. .
. .
@ -784,9 +781,10 @@ available:
PCRE2_CONFIG_BSR PCRE2_CONFIG_BSR
.sp .sp
The output is an integer whose value indicates what character sequences the \eR The output is an integer whose value indicates what character sequences the \eR
escape sequence matches by default. A value of 0 means that \eR matches any escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \eR
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF, matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
or CRLF. The default can be overridden when a pattern is compiled or matched. that \eR matches only CR, LF, or CRLF. The default can be overridden when a
pattern is compiled.
.sp .sp
PCRE2_CONFIG_JIT PCRE2_CONFIG_JIT
.sp .sp
@ -796,7 +794,7 @@ compiling is available; otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET PCRE2_CONFIG_JITTARGET
.sp .sp
The \fIwhere\fP argument should point to a buffer that is at least 48 code The \fIwhere\fP argument should point to a buffer that is at least 48 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is string that contains the name of the architecture for which the JIT compiler is
configured, for example "x86 32bit (little endian + unaligned)". If JIT support configured, for example "x86 32bit (little endian + unaligned)". If JIT support
@ -829,11 +827,11 @@ Further details are given with \fBpcre2_match()\fP below.
The output is an integer whose value specifies the default character sequence The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The values are: that is recognized as meaning "newline". The values are:
.sp .sp
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
.sp .sp
The default should normally correspond to the standard sequence for your The default should normally correspond to the standard sequence for your
operating system. operating system.
@ -865,7 +863,7 @@ heap instead of recursive function calls.
PCRE2_CONFIG_UNICODE_VERSION PCRE2_CONFIG_UNICODE_VERSION
.sp .sp
The \fIwhere\fP argument should point to a buffer that is at least 24 code The \fIwhere\fP argument should point to a buffer that is at least 24 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not without Unicode support, the buffer is filled with the text "Unicode not
supported". Otherwise, the Unicode version string (for example, "7.0.0") is supported". Otherwise, the Unicode version string (for example, "7.0.0") is
@ -880,7 +878,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
PCRE2_CONFIG_VERSION PCRE2_CONFIG_VERSION
.sp .sp
The \fIwhere\fP argument should point to a buffer that is at least 12 code The \fIwhere\fP argument should point to a buffer that is at least 12 code
units long. (The exact length needed can be found by calling units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating returned. This is the length of the string plus one unit for the terminating
@ -899,16 +897,16 @@ zero.
.B pcre2_code_free(pcre2_code *\fIcode\fP); .B pcre2_code_free(pcre2_code *\fIcode\fP);
.fi .fi
.P .P
This function compiles a pattern, defined by a pointer to a string of code The \fBpcre2_compile()\fP function compiles a pattern into an internal form.
units and a length, into an internal form. If the pattern is zero-terminated, The pattern is defined by a pointer to a string of code units and a length. If
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a the pattern is zero-terminated, the length can be specified as
pointer to a block of memory that contains the compiled pattern and related PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
data. The caller must free the memory by calling \fBpcre2_code_free()\fP when contains the compiled pattern and related data. The caller must free the memory
it is no longer needed. by calling \fBpcre2_code_free()\fP when it is no longer needed.
.P .P
If the compile context argument \fIccontext\fP is NULL, the memory is obtained If the compile context argument \fIccontext\fP is NULL, memory for the compiled
by calling \fBmalloc()\fP. Otherwise, it is obtained from the same memory pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
function that was used for the compile context. the same memory function that was used for the compile context.
.P .P
The \fIoptions\fP argument contains various bit settings that affect the The \fIoptions\fP argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available compilation. It should be zero if no options are required. The available
@ -1235,7 +1233,7 @@ in the
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
page. If you set PCRE2_UCP, matching one of the items it affects takes much page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with UTF longer. The option is available only if PCRE2 has been compiled with Unicode
support. support.
.sp .sp
PCRE2_UNGREEDY PCRE2_UNGREEDY
@ -1248,9 +1246,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
.sp .sp
This option causes PCRE2 to regard both the pattern and the subject strings This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. However, it is available only when PCRE2 is built to single-code-unit strings. It is available when PCRE2 is built to include
include UTF support. If not, the use of this option provokes an error. Details Unicode support (which is the default). If Unicode support is not available,
of how this option changes the behaviour of PCRE2 are given in the the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
.\" HREF .\" HREF
\fBpcre2unicode\fP \fBpcre2unicode\fP
.\" .\"
@ -1314,13 +1313,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
.sp .sp
PCRE2 handles caseless matching, and determines whether characters are letters, PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code digits, or whatever, by reference to a set of tables, indexed by character code
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries, point. This applies only to characters whose code points are less than 256. By
this applies only to characters with code points less than 256. By default, default, higher-valued code points never match escapes such as \ew or \ed.
higher-valued code points never match escapes such as \ew or \ed. However, if However, if PCRE2 is built with UTF support, all characters can be tested with
PCRE2 is built with UTF support, all characters can be tested with \ep and \eP, \ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled; is compiled; this causes \ew and friends to use Unicode property support
this causes \ew and friends to use Unicode property support instead of the instead of the built-in tables.
built-in tables.
.P .P
The use of locales with Unicode is discouraged. If you are handling characters The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or with code points greater than 128, you should either use Unicode support, or
@ -1433,9 +1431,9 @@ are no back references.
PCRE2_INFO_BSR PCRE2_INFO_BSR
.sp .sp
The output is a uint32_t whose value indicates what character sequences the \eR The output is a uint32_t whose value indicates what character sequences the \eR
escape sequence matches by default. A value of 0 means that \eR matches any escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR matches
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF, any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR
or CRLF. The default can be overridden when a pattern is matched. matches only CR, LF, or CRLF.
.sp .sp
PCRE2_INFO_CAPTURECOUNT PCRE2_INFO_CAPTURECOUNT
.sp .sp
@ -1623,17 +1621,16 @@ different for each compiled pattern.
.sp .sp
PCRE2_INFO_NEWLINE PCRE2_INFO_NEWLINE
.sp .sp
The output is a \fBuint32_t\fP whose value specifies the default character The output is a \fBuint32_t\fP with one of the following values:
sequence that will be recognized as meaning "newline" while matching. The
values are:
.sp .sp
1 Carriage return (CR) PCRE2_NEWLINE_CR Carriage return (CR)
2 Linefeed (LF) PCRE2_NEWLINE_LF Linefeed (LF)
3 Carriage return, linefeed (CRLF) PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
4 Any Unicode line ending PCRE2_NEWLINE_ANY Any Unicode line ending
5 Any of CR, LF, or CRLF PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
.sp .sp
The default can be overridden when a pattern is matched. This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
.sp .sp
PCRE2_INFO_RECURSIONLIMIT PCRE2_INFO_RECURSIONLIMIT
.sp .sp
@ -1671,30 +1668,32 @@ Information about successful and unsuccessful matches is placed in a match
data block, which is an opaque structure that is accessed by function calls. In data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were string that define the matched part of the subject and any substrings that were
capured. This is know as the \fIovector\fP. captured. This is known as the \fIovector\fP.
.P .P
Before calling \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP you must create a Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
match data block by calling one of the creation functions above. For \fBpcre2_jit_match()\fP you must create a match data block by calling one of
\fBpcre2_match_data_create()\fP, the first argument is the number of pairs of the creation functions above. For \fBpcre2_match_data_create()\fP, the first
offsets in the \fIovector\fP. One pair of offsets is required to identify the argument is the number of pairs of offsets in the \fIovector\fP. One pair of
string that matched the whole pattern, with another pair for each captured offsets is required to identify the string that matched the whole pattern, with
substring. For example, a value of 4 creates enough space to record the matched another pair for each captured substring. For example, a value of 4 creates
portion of the subject plus three captured substrings. A minimum of at least 1 enough space to record the matched portion of the subject plus three captured
pair is imposed by \fBpcre2_match_data_create()\fP, so it is always possible to substrings. A minimum of at least 1 pair is imposed by
return the overall matched string. \fBpcre2_match_data_create()\fP, so it is always possible to return the overall
matched string.
.P .P
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
pointer to a compiled pattern. In this case the ovector is created to be pointer to a compiled pattern. In this case the ovector is created to be
exactly the right size to hold all the substrings a pattern might capture. exactly the right size to hold all the substrings a pattern might capture.
.P .P
The second argument of both these functions ia a pointer to a general context, The second argument of both these functions is a pointer to a general context,
which can specify custom memory management for obtaining the memory for the which can specify custom memory management for obtaining the memory for the
match data block. If you are not using custom memory management, pass NULL. match data block. If you are not using custom memory management, pass NULL.
.P .P
A match data block can be used many times, with the same or different compiled A match data block can be used many times, with the same or different compiled
patterns. When it is no longer needed, it should be freed by calling patterns. When it is no longer needed, it should be freed by calling
\fBpcre2_match_data_free()\fP. How to extract information from a match data \fBpcre2_match_data_free()\fP. You can extract information from a match data
block after a match operation is described in the sections on block after a match operation has finished, using functions that are described
in the sections on
.\" HTML <a href="#matchedstrings"> .\" HTML <a href="#matchedstrings">
.\" </a> .\" </a>
matched strings matched strings
@ -1819,12 +1818,10 @@ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
.P .P
If the pattern was successfully processed by the just-in-time (JIT) compiler, Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
the only supported options for matching using the JIT code are PCRE2_NOTBOL, compiler. If it is set, JIT matching is disabled and the normal interpretive
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used, matching.
JIT matching is disabled and the normal interpretive code in
\fBpcre2_match()\fP is run.
.sp .sp
PCRE2_ANCHORED PCRE2_ANCHORED
.sp .sp
@ -2704,6 +2701,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 11 November 2014 Last updated: 18 November 2014
Copyright (c) 1997-2014 University of Cambridge. Copyright (c) 1997-2014 University of Cambridge.
.fi .fi

@ -58,8 +58,8 @@ ISGCC=0
# If the compiler is gcc, add a lot of warning switches. # If the compiler is gcc, add a lot of warning switches.
cc --version >zzz 2>/dev/null cc --version >/tmp/pcre2ccversion 2>/dev/null
if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then if [ $? -eq 0 ] && grep GCC /tmp/pcre2ccversion >/dev/null; then
ISGCC=1 ISGCC=1
CFLAGS="$CFLAGS -Wall" CFLAGS="$CFLAGS -Wall"
CFLAGS="$CFLAGS -Wno-overlength-strings" CFLAGS="$CFLAGS -Wno-overlength-strings"
@ -77,7 +77,7 @@ if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then
CFLAGS="$CFLAGS -Wmissing-prototypes" CFLAGS="$CFLAGS -Wmissing-prototypes"
CFLAGS="$CFLAGS -Wstrict-prototypes" CFLAGS="$CFLAGS -Wstrict-prototypes"
fi fi
rm -f /tmp/pcre2ccversion
# This function runs a single test with the set of configuration options that # This function runs a single test with the set of configuration options that
# are in $opts. The source directory must be set in srcdir. The function must # are in $opts. The source directory must be set in srcdir. The function must
@ -129,8 +129,6 @@ runtest()
./pcre2test -C jit >/dev/null ./pcre2test -C jit >/dev/null
jit=$? jit=$?
./pcre2test -C unicode >/dev/null
utf=$?
./pcre2test -C pcre2-8 >/dev/null ./pcre2test -C pcre2-8 >/dev/null
pcre2_8=$? pcre2_8=$?
@ -164,7 +162,7 @@ runtest()
echo "Skipping pcre2grep tests: newline is $nl" echo "Skipping pcre2grep tests: newline is $nl"
fi fi
if [ "$jit" -gt 0 -a $utf -gt 0 ]; then if [ "$jit" -gt 0 ]; then
echo "Running JIT regression tests $withvalgrind" echo "Running JIT regression tests $withvalgrind"
$cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr $cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr
if [ $? -ne 0 -o -s teststderr ]; then if [ $? -ne 0 -o -s teststderr ]; then
@ -175,7 +173,7 @@ runtest()
exit 1 exit 1
fi fi
else else
echo "Skipping JIT regression tests: JIT or UTF not enabled" echo "Skipping JIT regression tests: JIT is not enabled"
fi fi
} }
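
The compiler check in this script now writes to a fixed file name under /tmp and removes it afterwards. As an illustration only, and not part of this commit, the same check could be wrapped in a helper that uses a uniquely named temporary file, so that two runs on one machine cannot overwrite each other's file. The function name detect_gcc and the dg_ variable names are hypothetical, and the sketch assumes that the mktemp command is available, which POSIX does not guarantee.

```shell
#!/bin/sh
# Sketch only: detect_gcc is a hypothetical helper, not part of the script
# changed by this commit. It returns 0 if the given compiler command reports
# itself as GCC, 1 if it does not (or cannot be run), and 2 if no temporary
# file could be created.
detect_gcc() {
  dg_compiler=${1:-cc}
  # Assumption: mktemp exists. It creates a uniquely named empty file.
  dg_tmpfile=$(mktemp) || return 2
  if "$dg_compiler" --version >"$dg_tmpfile" 2>/dev/null &&
     grep GCC "$dg_tmpfile" >/dev/null; then
    dg_status=0
  else
    dg_status=1
  fi
  # Remove the temporary file on both paths before returning.
  rm -f "$dg_tmpfile"
  return $dg_status
}

# Usage that mirrors the ISGCC logic in the script.
ISGCC=0
if detect_gcc cc; then
  ISGCC=1
fi
echo "ISGCC=$ISGCC"
```

The value printed depends on which compiler, if any, is installed as cc. Keeping the removal inside the helper means the file is cleaned up whether or not the compiler turns out to be gcc, which is the same effect that the added "rm -f" line has in the script.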

@ -65,7 +65,7 @@ Updating to a new Unicode release
When there is a new release of Unicode, the files in Unicode.tables must be When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file pacr2_ucp.h and both the MultiStage2.py and the scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
run to generate the tricky tables for inclusion in pcre2_tables.c. run to generate the tricky tables for inclusion in pcre2_tables.c.
@ -73,7 +73,7 @@ run to generate the tricky tables for inclusion in pcre2_tables.c.
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list", If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
the cause is usually a missing (or misspelt) name in the list of scripts. I the cause is usually a missing (or misspelt) name in the list of scripts. I
couldn't find a straightforward list of scripts on the Unicode site, but couldn't find a straightforward list of scripts on the Unicode site, but
there's a useful Wikipedia page that list them, and notes the Unicode version there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced: in which they were introduced:
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
@ -130,7 +130,7 @@ distribution for a new release.
 systems, using different compilers as well. For example, on Solaris it is
 helpful to test using Sun's cc compiler as a change from gcc. Adding
 -xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
-pcretest to increase the stack size for test 2. Since I retired I can no
+pcre2test to increase the stack size for test 2. Since I retired I can no
 longer do this, but instead I rely on putting out release candidates for
 folks on the pcre-dev list to test.
@ -194,7 +194,7 @@ and the zipball. Double-check with "svn status", then create an SVN tagged
 copy:
 svn copy svn://vcs.exim.org/pcre2/code/trunk \
-svn://vcs.exim.org/pcre2/code/tags/pcre-8.xx
+svn://vcs.exim.org/pcre2/code/tags/pcre-10.xx
 When the new release is out, don't forget to tell webmaster@pcre.org and the
 mailing list. Also, update the list of version numbers in Bugzilla (edit
@ -206,8 +206,7 @@ Future ideas (wish list)
 This section records a list of ideas so that they do not get forgotten. They
 vary enormously in their usefulness and potential for implementation. Some are
-very sensible; some are rather wacky. Some have been on this list for years;
-others are relatively new.
+very sensible; some are rather wacky. Some have been on this list for years.
 . Optimization
@ -226,42 +225,38 @@ others are relatively new.
 over the existing "required code unit" feature that just remembers one code
 unit.
-* Remember an initial string rather than just 1 code unit?
+* Remember an initial string rather than just 1 code unit.
 * A required code unit from alternatives - not just the last unit, but an
 earlier one if common to all alternatives.
-o Friedl contains other ideas.
+* Friedl contains other ideas.
 * The code does not set initial code unit flags for Unicode property types
 such as \p; I don't know how much benefit there would be for, for example,
 setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
 pattern starts with \p{N}.
-* There is scope for more "auto-possessifying" in connection with \p and \P.
 . If Perl gets to a consistent state over the settings of capturing sub-
 patterns inside repeats, see if we can match it. One example of the
-difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
-leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
+difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
+leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
 because I think it needs much more state to be remembered.
 . Perl 6 will be a revolution. Is it a revolution too far for PCRE?
-. Line endings:
-* Option to use NUL as a line terminator in subject strings. This could now
-be done relatively easily since the extension to support LF, CR, and CRLF.
-If it is done, a suitable option for pcre2grep is also required.
+. An option to use NUL as a line terminator in subject strings. This could be
+done relatively easily. If it is done, a suitable option for pcre2grep is
+also required.
 . Catch SIGSEGV for stack overflows?
 . A feature to suspend a match via a callout was once requested.
-. Option to convert results into character offsets and character lengths.
+. An option to convert results into character offsets and character lengths.
-. Option for pcre2grep to scan only the start of a file. I am not keen - this
-is the job of "head".
+. An option for pcre2grep to scan only the start of a file. I am not keen -
+this is the job of "head".
 . A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
 preceded by a blank line, instead of adding it to every matched line, and (b)
@ -274,11 +269,6 @@ others are relatively new.
 to switch this dynamically. It would have to be specified when PCRE2 was
 compiled. PCRE2 would then call a function every time it wanted a character.
-. Wild thought: the ability to compile from PCRE2's internal code to a real
-FSM and a very fast (third) matcher to process the result. There would be
-even more restrictions than for pcre2_dfa_exec(), however. This is not easy.
-This is probably obsolete now that we have the JIT support.
 . pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
 them will of course slow it down.
@ -296,10 +286,10 @@ others are relatively new.
 pattern.
 . Pcre2grep: an option to specify the output line separator, either as a string
-or select from a fixed list. This is not dead easy, because at the moment it
-outputs whatever is in the input file.
+or select from a fixed list. This is not straightforward, because at the
+moment it outputs whatever is in the input file.
-. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
+. Improve the code for duplicate checking in pcre_dfa_match(). An incomplete,
 non-thread-safe patch showed that this can help performance for patterns
 where there are many alternatives. However, a simple thread-safe
 implementation that I tried made things worse in many simple cases, so this
@ -308,8 +298,7 @@ others are relatively new.
 . PCRE2 cannot at present distinguish between subpatterns with different names,
 but the same number (created by the use of ?|). In order to do so, a way of
 remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
-Now that (*MARK) has been implemented, it can perhaps be used as a way round
-this problem.
+(*MARK) can perhaps be used as a way round this problem.
 . Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
 "something" and the the #ifdef appears only in one place, in "something".
@ -317,4 +306,4 @@ others are relatively new.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 25 October 2014
+Last updated: 18 November 2014