Tests and documentation updates.

Philip.Hazel 2014-11-18 18:32:12 +00:00
parent 819e175659
commit f024446c93
4 changed files with 130 additions and 145 deletions


@ -1,4 +1,4 @@
.TH PCRE2 3 "03 November 2014" "PCRE2 10.00"
.TH PCRE2 3 "18 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH INTRODUCTION
@ -8,9 +8,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. Some
features that appeared in Python and the original PCRE before they appeared in
Perl are also available using the Python syntax, there is some support for one
or two .NET and Oniguruma syntax items, and there are options for requesting
some minor changes that give better ECMAScript (aka JavaScript) compatibility.
Perl are also available using the Python syntax. There is also some support for
one or two .NET and Oniguruma syntax items, and there are options for
requesting some minor changes that give better ECMAScript (aka JavaScript)
compatibility.
.P
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit
code units, which means that up to three separate libraries may be installed.
@ -18,7 +19,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default); however, processing strings as
is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running
.sp
@ -140,19 +141,19 @@ listing), and the short pages for individual functions, are concatenated in
pcre2compat discussion of Perl compatibility
pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the \fBpcre2grep\fP command (8-bit only)
pcre2jit discussion of the just-in-time optimization support
pcre2jit discussion of just-in-time optimization support
pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility
.\" JOIN
pcre2pattern syntax and semantics of supported
regular expressions
pcre2pattern syntax and semantics of supported regular
expression patterns
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2stack discussion of stack usage
pcre2syntax quick syntax reference
pcre2test description of the \fBpcre2test\fP testing command
pcre2test description of the \fBpcre2test\fP command
pcre2unicode discussion of Unicode and UTF support
.sp
In the "man" and HTML formats, there is also a short page for each C library
@ -176,6 +177,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
Last updated: 03 November 2014
Last updated: 18 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2API 3 "11 November 2014" "PCRE2 10.00"
.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
@ -384,12 +384,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
.P
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
The default default is LF, which is the Unix standard. When PCRE2 is run, the
default can be overridden, either when a pattern is compiled, or when it is
matched.
.P
The newline convention can be changed when calling \fBpcre2_compile()\fP, or it
can be specified by special text at the start of the pattern itself; this
The default default is LF, which is the Unix standard. However, the newline
convention can be changed by an application when calling \fBpcre2_compile()\fP,
or it can be specified by special text at the start of the pattern itself; this
overrides any other settings. See the
.\" HREF
\fBpcre2pattern\fP
@ -409,8 +406,8 @@ section on \fBpcre2_match()\fP options
below.
.P
The choice of newline convention does not affect the interpretation of
the \en or \er escape sequences, nor does it affect what \eR matches, which has
its own separate control.
the \en or \er escape sequences, nor does it affect what \eR matches; this has
its own separate convention.
.
.
.SH MULTITHREADING
@ -423,7 +420,7 @@ designed to be fairly simple for non-threaded applications while at the same
time ensuring that multithreaded applications can use it.
.P
There are several different blocks of data that are used to pass information
between the application and the PCRE libraries.
between the application and the PCRE2 libraries.
.P
(1) A pointer to the compiled form of a pattern is returned to the user when
\fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
@ -529,11 +526,11 @@ The memory used for a general context should be freed by calling:
A compile context is required if you want to change the default values of any
of the following compile-time parameters:
.sp
What \eR matches (Unicode newlines or CR, LF, CRLF only);
PCRE2's character tables;
The newline character sequence;
The compile time nested parentheses limit;
An external function for stack checking.
What \eR matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
An external function for stack checking
.sp
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@ -562,9 +559,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
.sp
The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line
ending sequence. The value of this parameter does not affect what is compiled;
it is just saved with the compiled pattern. The value is used by the JIT
compiler and by the two interpreted matching functions, \fIpcre2_match()\fP and
ending sequence. The value is used by the JIT compiler and by the two
interpreted matching functions, \fIpcre2_match()\fP and
\fIpcre2_dfa_match()\fP.
.sp
.nf
@ -678,12 +674,12 @@ patterns that are not anchored, the count restarts from zero for each position
in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
which ignores it.
.P
When \fBpcre2_match()\fP is called with a pattern that was successfully studied
with \fBpcre2_jit_compile()\fP, the way that the matching is executed is
entirely different. However, there is still the possibility of runaway matching
that goes on for a very long time, and so the \fImatch_limit\fP value is also
used in this case (but in a different way) to limit how long the matching can
continue.
When \fBpcre2_match()\fP is called with a pattern that was successfully
processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
is entirely different. However, there is still the possibility of runaway
matching that goes on for a very long time, and so the \fImatch_limit\fP value
is also used in this case (but in a different way) to limit how long the
matching can continue.
.P
The default value for the limit can be set when PCRE2 is built; the default
default is 10 million, which handles all but the most extreme cases. If the
@ -744,15 +740,16 @@ documentation. See the
.\" HREF
\fBpcre2build\fP
.\"
documentation for details of how to build PCRE2. Using the heap for recursion
is a non-standard way of building PCRE2, for use in environments that have
limited stacks. Because of the greater use of memory management,
\fBpcre2_match()\fP runs more slowly. Functions that are different to the
general custom memory functions are provided so that special-purpose external
code can be used for this case, because the memory blocks are all the same
size. The blocks are retained by \fBpcre2_match()\fP until it is about to exit
so that they can be re-used when possible during the match. In the absence of
these functions, the normal custom memory management functions are used, if
documentation for details of how to build PCRE2.
.P
Using the heap for recursion is a non-standard way of building PCRE2, for use
in environments that have limited stacks. Because of the greater use of memory
management, \fBpcre2_match()\fP runs more slowly. Functions that are different
to the general custom memory functions are provided so that special-purpose
external code can be used for this case, because the memory blocks are all the
same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
exit so that they can be re-used when possible during the match. In the absence
of these functions, the normal custom memory management functions are used, if
supplied, otherwise the system functions.
.
.
@ -784,9 +781,10 @@ available:
PCRE2_CONFIG_BSR
.sp
The output is an integer whose value indicates what character sequences the \eR
escape sequence matches by default. A value of 0 means that \eR matches any
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF,
or CRLF. The default can be overridden when a pattern is compiled or matched.
escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \eR
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
that \eR matches only CR, LF, or CRLF. The default can be overridden when a
pattern is compiled.
.sp
PCRE2_CONFIG_JIT
.sp
@ -796,7 +794,7 @@ compiling is available; otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET
.sp
The \fIwhere\fP argument should point to a buffer that is at least 48 code
units long. (The exact length needed can be found by calling
units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
@ -829,11 +827,11 @@ Further details are given with \fBpcre2_match()\fP below.
The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The values are:
.sp
1 Carriage return (CR)
2 Linefeed (LF)
3 Carriage return, linefeed (CRLF)
4 Any Unicode line ending
5 Any of CR, LF, or CRLF
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
.sp
The default should normally correspond to the standard sequence for your
operating system.
@ -865,7 +863,7 @@ heap instead of recursive function calls.
PCRE2_CONFIG_UNICODE_VERSION
.sp
The \fIwhere\fP argument should point to a buffer that is at least 24 code
units long. (The exact length needed can be found by calling
units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not
supported". Otherwise, the Unicode version string (for example, "7.0.0") is
@ -880,7 +878,7 @@ otherwise it is set to zero. Unicode support implies UTF support.
PCRE2_CONFIG_VERSION
.sp
The \fIwhere\fP argument should point to a buffer that is at least 12 code
units long. (The exact length needed can be found by calling
units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating
@ -899,16 +897,16 @@ zero.
.B pcre2_code_free(pcre2_code *\fIcode\fP);
.fi
.P
This function compiles a pattern, defined by a pointer to a string of code
units and a length, into an internal form. If the pattern is zero-terminated,
the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a
pointer to a block of memory that contains the compiled pattern and related
data. The caller must free the memory by calling \fBpcre2_code_free()\fP when
it is no longer needed.
The \fBpcre2_compile()\fP function compiles a pattern into an internal form.
The pattern is defined by a pointer to a string of code units and a length. If
the pattern is zero-terminated, the length can be specified as
PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
contains the compiled pattern and related data. The caller must free the memory
by calling \fBpcre2_code_free()\fP when it is no longer needed.
.P
If the compile context argument \fIccontext\fP is NULL, the memory is obtained
by calling \fBmalloc()\fP. Otherwise, it is obtained from the same memory
function that was used for the compile context.
If the compile context argument \fIccontext\fP is NULL, memory for the compiled
pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
the same memory function that was used for the compile context.
.P
The \fIoptions\fP argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
@ -1235,7 +1233,7 @@ in the
\fBpcre2pattern\fP
.\"
page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with UTF
longer. The option is available only if PCRE2 has been compiled with Unicode
support.
.sp
PCRE2_UNGREEDY
@ -1248,9 +1246,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern.
.sp
This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. However, it is available only when PCRE2 is built to
include UTF support. If not, the use of this option provokes an error. Details
of how this option changes the behaviour of PCRE2 are given in the
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
@ -1314,13 +1313,12 @@ Most, but not all patterns can be optimized by the JIT compiler.
.sp
PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries,
this applies only to characters with code points less than 256. By default,
higher-valued code points never match escapes such as \ew or \ed. However, if
PCRE2 is built with UTF support, all characters can be tested with \ep and \eP,
or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled;
this causes \ew and friends to use Unicode property support instead of the
built-in tables.
point. This applies only to characters whose code points are less than 256. By
default, higher-valued code points never match escapes such as \ew or \ed.
However, if PCRE2 is built with UTF support, all characters can be tested with
\ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern
is compiled; this causes \ew and friends to use Unicode property support
instead of the built-in tables.
.P
The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or
@ -1433,9 +1431,9 @@ are no back references.
PCRE2_INFO_BSR
.sp
The output is a uint32_t whose value indicates what character sequences the \eR
escape sequence matches by default. A value of 0 means that \eR matches any
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF,
or CRLF. The default can be overridden when a pattern is matched.
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \eR matches
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR
matches only CR, LF, or CRLF.
.sp
PCRE2_INFO_CAPTURECOUNT
.sp
@ -1623,17 +1621,16 @@ different for each compiled pattern.
.sp
PCRE2_INFO_NEWLINE
.sp
The output is a \fBuint32_t\fP whose value specifies the default character
sequence that will be recognized as meaning "newline" while matching. The
values are:
The output is a \fBuint32_t\fP with one of the following values:
.sp
1 Carriage return (CR)
2 Linefeed (LF)
3 Carriage return, linefeed (CRLF)
4 Any Unicode line ending
5 Any of CR, LF, or CRLF
.sp
The default can be overridden when a pattern is matched.
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
.sp
This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
.sp
PCRE2_INFO_RECURSIONLIMIT
.sp
@ -1671,30 +1668,32 @@ Information about successful and unsuccessful matches is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
capured. This is know as the \fIovector\fP.
captured. This is known as the \fIovector\fP.
.P
Before calling \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP you must create a
match data block by calling one of the creation functions above. For
\fBpcre2_match_data_create()\fP, the first argument is the number of pairs of
offsets in the \fIovector\fP. One pair of offsets is required to identify the
string that matched the whole pattern, with another pair for each captured
substring. For example, a value of 4 creates enough space to record the matched
portion of the subject plus three captured substrings. A minimum of at least 1
pair is imposed by \fBpcre2_match_data_create()\fP, so it is always possible to
return the overall matched string.
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
argument is the number of pairs of offsets in the \fIovector\fP. One pair of
offsets is required to identify the string that matched the whole pattern, with
another pair for each captured substring. For example, a value of 4 creates
enough space to record the matched portion of the subject plus three captured
substrings. A minimum of at least 1 pair is imposed by
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall
matched string.
.P
For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
pointer to a compiled pattern. In this case the ovector is created to be
exactly the right size to hold all the substrings a pattern might capture.
.P
The second argument of both these functions ia a pointer to a general context,
The second argument of both these functions is a pointer to a general context,
which can specify custom memory management for obtaining the memory for the
match data block. If you are not using custom memory management, pass NULL.
.P
A match data block can be used many times, with the same or different compiled
patterns. When it is no longer needed, it should be freed by calling
\fBpcre2_match_data_free()\fP. How to extract information from a match data
block after a match operation is described in the sections on
\fBpcre2_match_data_free()\fP. You can extract information from a match data
block after a match operation has finished, using functions that are described
in the sections on
.\" HTML <a href="#matchedstrings">
.\" </a>
matched strings
@ -1819,12 +1818,10 @@ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
.P
If the pattern was successfully processed by the just-in-time (JIT) compiler,
the only supported options for matching using the JIT code are PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used,
JIT matching is disabled and the normal interpretive code in
\fBpcre2_match()\fP is run.
Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
compiler. If it is set, JIT matching is disabled and the normal interpretive
code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT
matching.
.sp
PCRE2_ANCHORED
.sp
@ -2704,6 +2701,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 11 November 2014
Last updated: 18 November 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi


@ -58,8 +58,8 @@ ISGCC=0
# If the compiler is gcc, add a lot of warning switches.
cc --version >zzz 2>/dev/null
if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then
cc --version >/tmp/pcre2ccversion 2>/dev/null
if [ $? -eq 0 ] && grep GCC /tmp/pcre2ccversion >/dev/null; then
ISGCC=1
CFLAGS="$CFLAGS -Wall"
CFLAGS="$CFLAGS -Wno-overlength-strings"
@ -77,7 +77,7 @@ if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then
CFLAGS="$CFLAGS -Wmissing-prototypes"
CFLAGS="$CFLAGS -Wstrict-prototypes"
fi
rm -f /tmp/pcre2ccversion
# This function runs a single test with the set of configuration options that
# are in $opts. The source directory must be set in srcdir. The function must
@ -129,8 +129,6 @@ runtest()
./pcre2test -C jit >/dev/null
jit=$?
./pcre2test -C unicode >/dev/null
utf=$?
./pcre2test -C pcre2-8 >/dev/null
pcre2_8=$?
@ -164,7 +162,7 @@ runtest()
echo "Skipping pcre2grep tests: newline is $nl"
fi
if [ "$jit" -gt 0 -a $utf -gt 0 ]; then
if [ "$jit" -gt 0 ]; then
echo "Running JIT regression tests $withvalgrind"
$cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr
if [ $? -ne 0 -o -s teststderr ]; then
@ -175,7 +173,7 @@ runtest()
exit 1
fi
else
echo "Skipping JIT regression tests: JIT or UTF not enabled"
echo "Skipping JIT regression tests: JIT is not enabled"
fi
}


@ -65,7 +65,7 @@ Updating to a new Unicode release
When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file pacr2_ucp.h and both the MultiStage2.py and the
scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
run to generate the tricky tables for inclusion in pcre2_tables.c.
@ -73,7 +73,7 @@ run to generate the tricky tables for inclusion in pcre2_tables.c.
If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
the cause is usually a missing (or misspelt) name in the list of scripts. I
couldn't find a straightforward list of scripts on the Unicode site, but
there's a useful Wikipedia page that list them, and notes the Unicode version
there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced:
http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
@ -130,7 +130,7 @@ distribution for a new release.
systems, using different compilers as well. For example, on Solaris it is
helpful to test using Sun's cc compiler as a change from gcc. Adding
-xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
pcretest to increase the stack size for test 2. Since I retired I can no
pcre2test to increase the stack size for test 2. Since I retired I can no
longer do this, but instead I rely on putting out release candidates for
folks on the pcre-dev list to test.
@ -194,7 +194,7 @@ and the zipball. Double-check with "svn status", then create an SVN tagged
copy:
svn copy svn://vcs.exim.org/pcre2/code/trunk \
svn://vcs.exim.org/pcre2/code/tags/pcre-8.xx
svn://vcs.exim.org/pcre2/code/tags/pcre-10.xx
When the new release is out, don't forget to tell webmaster@pcre.org and the
mailing list. Also, update the list of version numbers in Bugzilla (edit
@ -206,8 +206,7 @@ Future ideas (wish list)
This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.
very sensible; some are rather wacky. Some have been on this list for years.
. Optimization
@ -226,42 +225,38 @@ others are relatively new.
over the existing "required code unit" feature that just remembers one code
unit.
* Remember an initial string rather than just 1 code unit?
* Remember an initial string rather than just 1 code unit.
* A required code unit from alternatives - not just the last unit, but an
earlier one if common to all alternatives.
o Friedl contains other ideas.
* Friedl contains other ideas.
* The code does not set initial code unit flags for Unicode property types
such as \p; I don't know how much benefit there would be for, for example,
setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
pattern starts with \p{N}.
* There is scope for more "auto-possessifying" in connection with \p and \P.
. If Perl gets to a consistent state over the settings of capturing sub-
patterns inside repeats, see if we can match it. One example of the
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
because I think it needs much more state to be remembered.
. Perl 6 will be a revolution. Is it a revolution too far for PCRE?
. Line endings:
* Option to use NUL as a line terminator in subject strings. This could now
be done relatively easily since the extension to support LF, CR, and CRLF.
If it is done, a suitable option for pcre2grep is also required.
. An option to use NUL as a line terminator in subject strings. This could be
done relatively easily. If it is done, a suitable option for pcre2grep is
also required.
. Catch SIGSEGV for stack overflows?
. A feature to suspend a match via a callout was once requested.
. Option to convert results into character offsets and character lengths.
. An option to convert results into character offsets and character lengths.
. Option for pcre2grep to scan only the start of a file. I am not keen - this
is the job of "head".
. An option for pcre2grep to scan only the start of a file. I am not keen -
this is the job of "head".
. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
preceded by a blank line, instead of adding it to every matched line, and (b)
@ -274,11 +269,6 @@ others are relatively new.
to switch this dynamically. It would have to be specified when PCRE2 was
compiled. PCRE2 would then call a function every time it wanted a character.
. Wild thought: the ability to compile from PCRE2's internal code to a real
FSM and a very fast (third) matcher to process the result. There would be
even more restrictions than for pcre2_dfa_exec(), however. This is not easy.
This is probably obsolete now that we have the JIT support.
. pcre2grep: add -rs for a sorted recurse? Having to store file names and sort
them will of course slow it down.
@ -296,10 +286,10 @@ others are relatively new.
pattern.
. Pcre2grep: an option to specify the output line separator, either as a string
or select from a fixed list. This is not dead easy, because at the moment it
outputs whatever is in the input file.
or select from a fixed list. This is not straightforward, because at the
moment it outputs whatever is in the input file.
. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
. Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete,
non-thread-safe patch showed that this can help performance for patterns
where there are many alternatives. However, a simple thread-safe
implementation that I tried made things worse in many simple cases, so this
@ -308,8 +298,7 @@ others are relatively new.
. PCRE2 cannot at present distinguish between subpatterns with different names,
but the same number (created by the use of ?|). In order to do so, a way of
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
Now that (*MARK) has been implemented, it can perhaps be used as a way round
this problem.
(*MARK) can perhaps be used as a way round this problem.
. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
"something" and then the #ifdef appears only in one place, in "something".
@ -317,4 +306,4 @@ others are relatively new.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 25 October 2014
Last updated: 18 November 2014