From f024446c9353e3bc15aae5f818012a9d7eeeb7b9 Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Tue, 18 Nov 2014 18:32:12 +0000 Subject: [PATCH] Tests and documentation updates. --- doc/pcre2.3 | 21 ++--- doc/pcre2api.3 | 191 +++++++++++++++++++++--------------------- maint/ManyConfigTests | 12 ++- maint/README | 51 +++++------ 4 files changed, 130 insertions(+), 145 deletions(-) diff --git a/doc/pcre2.3 b/doc/pcre2.3 index 247e616..5415800 100644 --- a/doc/pcre2.3 +++ b/doc/pcre2.3 @@ -1,4 +1,4 @@ -.TH PCRE2 3 "03 November 2014" "PCRE2 10.00" +.TH PCRE2 3 "18 November 2014" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH INTRODUCTION @@ -8,9 +8,10 @@ PCRE2 is the name used for a revised API for the PCRE library, which is a set of functions, written in C, that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. Some features that appeared in Python and the original PCRE before they appeared in -Perl are also available using the Python syntax, there is some support for one -or two .NET and Oniguruma syntax items, and there are options for requesting -some minor changes that give better ECMAScript (aka JavaScript) compatibility. +Perl are also available using the Python syntax. There is also some support for +one or two .NET and Oniguruma syntax items, and there are options for +requesting some minor changes that give better ECMAScript (aka JavaScript) +compatibility. .P The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit code units, which means that up to three separate libraries may be installed. @@ -18,7 +19,7 @@ The original work to extend PCRE to 16-bit and 32-bit code units was done by Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings can be interpreted either as one character per code unit, or as UTF-encoded Unicode, with support for Unicode general category properties. 
Unicode support -is optional at build time (but is the default); however, processing strings as +is optional at build time (but is the default). However, processing strings as UTF code units must be enabled explicitly at run time. The version of Unicode in use can be discovered by running .sp @@ -140,19 +141,19 @@ listing), and the short pages for individual functions, are concatenated in pcre2compat discussion of Perl compatibility pcre2demo a demonstration C program that uses PCRE2 pcre2grep description of the \fBpcre2grep\fP command (8-bit only) - pcre2jit discussion of the just-in-time optimization support + pcre2jit discussion of just-in-time optimization support pcre2limits details of size and other limits pcre2matching discussion of the two matching algorithms pcre2partial details of the partial matching facility .\" JOIN - pcre2pattern syntax and semantics of supported - regular expressions + pcre2pattern syntax and semantics of supported regular + expression patterns pcre2perform discussion of performance issues pcre2posix the POSIX-compatible C API for the 8-bit library pcre2sample discussion of the pcre2demo program pcre2stack discussion of stack usage pcre2syntax quick syntax reference - pcre2test description of the \fBpcre2test\fP testing command + pcre2test description of the \fBpcre2test\fP command pcre2unicode discussion of Unicode and UTF support .sp In the "man" and HTML formats, there is also a short page for each C library @@ -176,6 +177,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk. .rs .sp .nf -Last updated: 03 November 2014 +Last updated: 18 November 2014 Copyright (c) 1997-2014 University of Cambridge. 
.fi diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index 9d9d91f..08fd9cc 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "11 November 2014" "PCRE2 10.00" +.TH PCRE2API 3 "18 November 2014" "PCRE2 10.00" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -384,12 +384,9 @@ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS .P Each of the first three conventions is used by at least one operating system as its standard newline sequence. When PCRE2 is built, a default can be specified. -The default default is LF, which is the Unix standard. When PCRE2 is run, the -default can be overridden, either when a pattern is compiled, or when it is -matched. -.P -The newline convention can be changed when calling \fBpcre2_compile()\fP, or it -can be specified by special text at the start of the pattern itself; this +The default default is LF, which is the Unix standard. However, the newline +convention can be changed by an application when calling \fBpcre2_compile()\fP, +or it can be specified by special text at the start of the pattern itself; this overrides any other settings. See the .\" HREF \fBpcre2pattern\fP @@ -409,8 +406,8 @@ section on \fBpcre2_match()\fP options below. .P The choice of newline convention does not affect the interpretation of -the \en or \er escape sequences, nor does it affect what \eR matches, which has -its own separate control. +the \en or \er escape sequences, nor does it affect what \eR matches; this has +its own separate convention. . . .SH MULTITHREADING @@ -423,7 +420,7 @@ designed to be fairly simple for non-threaded applications while at the same time ensuring that multithreaded applications can use it. .P There are several different blocks of data that are used to pass information -between the application and the PCRE libraries. +between the application and the PCRE2 libraries. 
.P (1) A pointer to the compiled form of a pattern is returned to the user when \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed, @@ -529,11 +526,11 @@ The memory used for a general context should be freed by calling: A compile context is required if you want to change the default values of any of the following compile-time parameters: .sp - What \eR matches (Unicode newlines or CR, LF, CRLF only); - PCRE2's character tables; - The newline character sequence; - The compile time nested parentheses limit; - An external function for stack checking. + What \eR matches (Unicode newlines or CR, LF, CRLF only) + PCRE2's character tables + The newline character sequence + The compile time nested parentheses limit + An external function for stack checking .sp A compile context is also required if you are using custom memory management. If none of these apply, just pass NULL as the context argument of @@ -562,9 +559,8 @@ PCRE2_ERROR_BADDATA if invalid data is detected. .sp The value must be PCRE2_BSR_ANYCRLF, to specify that \eR matches only CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \eR matches any Unicode line -ending sequence. The value of this parameter does not affect what is compiled; -it is just saved with the compiled pattern. The value is used by the JIT -compiler and by the two interpreted matching functions, \fIpcre2_match()\fP and +ending sequence. The value is used by the JIT compiler and by the two +interpreted matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP. .sp .nf @@ -678,12 +674,12 @@ patterns that are not anchored, the count restarts from zero for each position in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP, which ignores it. .P -When \fBpcre2_match()\fP is called with a pattern that was successfully studied -with \fBpcre2_jit_compile()\fP, the way that the matching is executed is -entirely different. 
However, there is still the possibility of runaway matching -that goes on for a very long time, and so the \fImatch_limit\fP value is also -used in this case (but in a different way) to limit how long the matching can -continue. +When \fBpcre2_match()\fP is called with a pattern that was successfully +processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed +is entirely different. However, there is still the possibility of runaway +matching that goes on for a very long time, and so the \fImatch_limit\fP value +is also used in this case (but in a different way) to limit how long the +matching can continue. .P The default value for the limit can be set when PCRE2 is built; the default default is 10 million, which handles all but the most extreme cases. If the @@ -744,15 +740,16 @@ documentation. See the .\" HREF \fBpcre2build\fP .\" -documentation for details of how to build PCRE2. Using the heap for recursion -is a non-standard way of building PCRE2, for use in environments that have -limited stacks. Because of the greater use of memory management, -\fBpcre2_match()\fP runs more slowly. Functions that are different to the -general custom memory functions are provided so that special-purpose external -code can be used for this case, because the memory blocks are all the same -size. The blocks are retained by \fBpcre2_match()\fP until it is about to exit -so that they can be re-used when possible during the match. In the absence of -these functions, the normal custom memory management functions are used, if +documentation for details of how to build PCRE2. +.P +Using the heap for recursion is a non-standard way of building PCRE2, for use +in environments that have limited stacks. Because of the greater use of memory +management, \fBpcre2_match()\fP runs more slowly. 
Functions that are different +to the general custom memory functions are provided so that special-purpose +external code can be used for this case, because the memory blocks are all the +same size. The blocks are retained by \fBpcre2_match()\fP until it is about to +exit so that they can be re-used when possible during the match. In the absence +of these functions, the normal custom memory management functions are used, if supplied, otherwise the system functions. . . @@ -784,9 +781,10 @@ available: PCRE2_CONFIG_BSR .sp The output is an integer whose value indicates what character sequences the \eR -escape sequence matches by default. A value of 0 means that \eR matches any -Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF, -or CRLF. The default can be overridden when a pattern is compiled or matched. +escape sequence matches by default. A value of PCRE2_BSR_UNICODE means that \eR +matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means +that \eR matches only CR, LF, or CRLF. The default can be overridden when a +pattern is compiled. .sp PCRE2_CONFIG_JIT .sp @@ -796,7 +794,7 @@ compiling is available; otherwise it is set to zero. PCRE2_CONFIG_JITTARGET .sp The \fIwhere\fP argument should point to a buffer that is at least 48 code -units long. (The exact length needed can be found by calling +units long. (The exact length required can be found by calling \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a string that contains the name of the architecture for which the JIT compiler is configured, for example "x86 32bit (little endian + unaligned)". If JIT support @@ -829,11 +827,11 @@ Further details are given with \fBpcre2_match()\fP below. The output is an integer whose value specifies the default character sequence that is recognized as meaning "newline". 
The values are: .sp - 1 Carriage return (CR) - 2 Linefeed (LF) - 3 Carriage return, linefeed (CRLF) - 4 Any Unicode line ending - 5 Any of CR, LF, or CRLF + PCRE2_NEWLINE_CR Carriage return (CR) + PCRE2_NEWLINE_LF Linefeed (LF) + PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) + PCRE2_NEWLINE_ANY Any Unicode line ending + PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF .sp The default should normally correspond to the standard sequence for your operating system. @@ -865,7 +863,7 @@ heap instead of recursive function calls. PCRE2_CONFIG_UNICODE_VERSION .sp The \fIwhere\fP argument should point to a buffer that is at least 24 code -units long. (The exact length needed can be found by calling +units long. (The exact length required can be found by calling \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) If PCRE2 has been compiled without Unicode support, the buffer is filled with the text "Unicode not supported". Otherwise, the Unicode version string (for example, "7.0.0") is @@ -880,7 +878,7 @@ otherwise it is set to zero. Unicode support implies UTF support. PCRE2_CONFIG_VERSION .sp The \fIwhere\fP argument should point to a buffer that is at least 12 code -units long. (The exact length needed can be found by calling +units long. (The exact length required can be found by calling \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with the PCRE2 version string, zero-terminated. The number of code units used is returned. This is the length of the string plus one unit for the terminating @@ -899,16 +897,16 @@ zero. .B pcre2_code_free(pcre2_code *\fIcode\fP); .fi .P -This function compiles a pattern, defined by a pointer to a string of code -units and a length, into an internal form. If the pattern is zero-terminated, -the length should be specified as PCRE2_ZERO_TERMINATED. The function returns a -pointer to a block of memory that contains the compiled pattern and related -data. 
The caller must free the memory by calling \fBpcre2_code_free()\fP when -it is no longer needed. +The \fBpcre2_compile()\fP function compiles a pattern into an internal form. +The pattern is defined by a pointer to a string of code units and a length. If +the pattern is zero-terminated, the length can be specified as +PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that +contains the compiled pattern and related data. The caller must free the memory +by calling \fBpcre2_code_free()\fP when it is no longer needed. .P -If the compile context argument \fIccontext\fP is NULL, the memory is obtained -by calling \fBmalloc()\fP. Otherwise, it is obtained from the same memory -function that was used for the compile context. +If the compile context argument \fIccontext\fP is NULL, memory for the compiled +pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from +the same memory function that was used for the compile context. .P The \fIoptions\fP argument contains various bit settings that affect the compilation. It should be zero if no options are required. The available @@ -1235,7 +1233,7 @@ in the \fBpcre2pattern\fP .\" page. If you set PCRE2_UCP, matching one of the items it affects takes much -longer. The option is available only if PCRE2 has been compiled with UTF +longer. The option is available only if PCRE2 has been compiled with Unicode support. .sp PCRE2_UNGREEDY @@ -1248,9 +1246,10 @@ with Perl. It can also be set by a (?U) option setting within the pattern. .sp This option causes PCRE2 to regard both the pattern and the subject strings that are subsequently processed as strings of UTF characters instead of -single-code-unit strings. However, it is available only when PCRE2 is built to -include UTF support. If not, the use of this option provokes an error. Details -of how this option changes the behaviour of PCRE2 are given in the +single-code-unit strings. 
It is available when PCRE2 is built to include +Unicode support (which is the default). If Unicode support is not available, +the use of this option provokes an error. Details of how this option changes +the behaviour of PCRE2 are given in the .\" HREF \fBpcre2unicode\fP .\" @@ -1314,13 +1313,12 @@ Most, but not all patterns can be optimized by the JIT compiler. .sp PCRE2 handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code -point. When running in UTF-8 mode, or using the 16-bit or 32-bit libraries, -this applies only to characters with code points less than 256. By default, -higher-valued code points never match escapes such as \ew or \ed. However, if -PCRE2 is built with UTF support, all characters can be tested with \ep and \eP, -or, alternatively, the PCRE2_UCP option can be set when a pattern is compiled; -this causes \ew and friends to use Unicode property support instead of the -built-in tables. +point. This applies only to characters whose code points are less than 256. By +default, higher-valued code points never match escapes such as \ew or \ed. +However, if PCRE2 is built with UTF support, all characters can be tested with +\ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern +is compiled; this causes \ew and friends to use Unicode property support +instead of the built-in tables. .P The use of locales with Unicode is discouraged. If you are handling characters with code points greater than 128, you should either use Unicode support, or @@ -1433,9 +1431,9 @@ are no back references. PCRE2_INFO_BSR .sp The output is a uint32_t whose value indicates what character sequences the \eR -escape sequence matches by default. A value of 0 means that \eR matches any -Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF, -or CRLF. The default can be overridden when a pattern is matched. +escape sequence matches. 
A value of PCRE2_BSR_UNICODE means that \eR matches +any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \eR +matches only CR, LF, or CRLF. .sp PCRE2_INFO_CAPTURECOUNT .sp @@ -1623,17 +1621,16 @@ different for each compiled pattern. .sp PCRE2_INFO_NEWLINE .sp -The output is a \fBuint32_t\fP whose value specifies the default character -sequence that will be recognized as meaning "newline" while matching. The -values are: +The output is a \fBuint32_t\fP with one of the following values: .sp - 1 Carriage return (CR) - 2 Linefeed (LF) - 3 Carriage return, linefeed (CRLF) - 4 Any Unicode line ending - 5 Any of CR, LF, or CRLF -.sp -The default can be overridden when a pattern is matched. + PCRE2_NEWLINE_CR Carriage return (CR) + PCRE2_NEWLINE_LF Linefeed (LF) + PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) + PCRE2_NEWLINE_ANY Any Unicode line ending + PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF +.sp +This specifies the default character sequence that will be recognized as +meaning "newline" while matching. .sp PCRE2_INFO_RECURSIONLIMIT .sp @@ -1671,30 +1668,32 @@ Information about successful and unsuccessful matches is placed in a match data block, which is an opaque structure that is accessed by function calls. In particular, the match data block contains a vector of offsets into the subject string that define the matched part of the subject and any substrings that were -capured. This is know as the \fIovector\fP. +captured. This is known as the \fIovector\fP. .P -Before calling \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP you must create a -match data block by calling one of the creation functions above. For -\fBpcre2_match_data_create()\fP, the first argument is the number of pairs of -offsets in the \fIovector\fP. One pair of offsets is required to identify the -string that matched the whole pattern, with another pair for each captured -substring. 
For example, a value of 4 creates enough space to record the matched -portion of the subject plus three captured substrings. A minimum of at least 1 -pair is imposed by \fBpcre2_match_data_create()\fP, so it is always possible to -return the overall matched string. +Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or +\fBpcre2_jit_match()\fP you must create a match data block by calling one of +the creation functions above. For \fBpcre2_match_data_create()\fP, the first +argument is the number of pairs of offsets in the \fIovector\fP. One pair of +offsets is required to identify the string that matched the whole pattern, with +another pair for each captured substring. For example, a value of 4 creates +enough space to record the matched portion of the subject plus three captured +substrings. A minimum of at least 1 pair is imposed by +\fBpcre2_match_data_create()\fP, so it is always possible to return the overall +matched string. .P For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a pointer to a compiled pattern. In this case the ovector is created to be exactly the right size to hold all the substrings a pattern might capture. .P -The second argument of both these functions ia a pointer to a general context, +The second argument of both these functions is a pointer to a general context, which can specify custom memory management for obtaining the memory for the match data block. If you are not using custom memory management, pass NULL. .P A match data block can be used many times, with the same or different compiled patterns. When it is no longer needed, it should be freed by calling -\fBpcre2_match_data_free()\fP. How to extract information from a match data -block after a match operation is described in the sections on +\fBpcre2_match_data_free()\fP. 
You can extract information from a match data +block after a match operation has finished, using functions that are described +in the sections on .\" HTML .\" matched strings @@ -1819,12 +1818,10 @@ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. .P -If the pattern was successfully processed by the just-in-time (JIT) compiler, -the only supported options for matching using the JIT code are PCRE2_NOTBOL, -PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, -PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. If an unsupported option is used, -JIT matching is disabled and the normal interpretive code in -\fBpcre2_match()\fP is run. +Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT) +compiler. If it is set, JIT matching is disabled and the normal interpretive +code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT +matching. .sp PCRE2_ANCHORED .sp @@ -2704,6 +2701,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 11 November 2014 +Last updated: 18 November 2014 Copyright (c) 1997-2014 University of Cambridge. .fi diff --git a/maint/ManyConfigTests b/maint/ManyConfigTests index 7055555..2fd08bc 100755 --- a/maint/ManyConfigTests +++ b/maint/ManyConfigTests @@ -58,8 +58,8 @@ ISGCC=0 # If the compiler is gcc, add a lot of warning switches. -cc --version >zzz 2>/dev/null -if [ $? -eq 0 ] && grep GCC zzz >/dev/null; then +cc --version >/tmp/pcre2ccversion 2>/dev/null +if [ $? -eq 0 ] && grep GCC /tmp/pcre2ccversion >/dev/null; then ISGCC=1 CFLAGS="$CFLAGS -Wall" CFLAGS="$CFLAGS -Wno-overlength-strings" @@ -77,7 +77,7 @@ if [ $? 
-eq 0 ] && grep GCC zzz >/dev/null; then CFLAGS="$CFLAGS -Wmissing-prototypes" CFLAGS="$CFLAGS -Wstrict-prototypes" fi - +rm -f /tmp/pcre2ccversion # This function runs a single test with the set of configuration options that # are in $opts. The source directory must be set in srcdir. The function must @@ -129,8 +129,6 @@ runtest() ./pcre2test -C jit >/dev/null jit=$? - ./pcre2test -C unicode >/dev/null - utf=$? ./pcre2test -C pcre2-8 >/dev/null pcre2_8=$? @@ -164,7 +162,7 @@ runtest() echo "Skipping pcre2grep tests: newline is $nl" fi - if [ "$jit" -gt 0 -a $utf -gt 0 ]; then + if [ "$jit" -gt 0 ]; then echo "Running JIT regression tests $withvalgrind" $cvalgrind $srcdir/pcre2_jit_test >teststdout 2>teststderr if [ $? -ne 0 -o -s teststderr ]; then @@ -175,7 +173,7 @@ runtest() exit 1 fi else - echo "Skipping JIT regression tests: JIT or UTF not enabled" + echo "Skipping JIT regression tests: JIT is not enabled" fi } diff --git a/maint/README b/maint/README index 1de9a94..cd47da2 100644 --- a/maint/README +++ b/maint/README @@ -65,7 +65,7 @@ Updating to a new Unicode release When there is a new release of Unicode, the files in Unicode.tables must be refreshed from the web site. If the new version of Unicode adds new character -scripts, the source file pacr2_ucp.h and both the MultiStage2.py and the +scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to generate the tricky tables for inclusion in pcre2_tables.c. @@ -73,7 +73,7 @@ run to generate the tricky tables for inclusion in pcre2_tables.c. If MultiStage2.py gives the error "ValueError: list.index(x): x not in list", the cause is usually a missing (or misspelt) name in the list of scripts. 
I couldn't find a straightforward list of scripts on the Unicode site, but -there's a useful Wikipedia page that list them, and notes the Unicode version +there's a useful Wikipedia page that lists them, and notes the Unicode version in which they were introduced: http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts @@ -130,7 +130,7 @@ distribution for a new release. systems, using different compilers as well. For example, on Solaris it is helpful to test using Sun's cc compiler as a change from gcc. Adding -xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for - pcretest to increase the stack size for test 2. Since I retired I can no + pcre2test to increase the stack size for test 2. Since I retired I can no longer do this, but instead I rely on putting out release candidates for folks on the pcre-dev list to test. @@ -194,7 +194,7 @@ and the zipball. Double-check with "svn status", then create an SVN tagged copy: svn copy svn://vcs.exim.org/pcre2/code/trunk \ - svn://vcs.exim.org/pcre2/code/tags/pcre-8.xx + svn://vcs.exim.org/pcre2/code/tags/pcre-10.xx When the new release is out, don't forget to tell webmaster@pcre.org and the mailing list. Also, update the list of version numbers in Bugzilla (edit @@ -206,8 +206,7 @@ Future ideas (wish list) This section records a list of ideas so that they do not get forgotten. They vary enormously in their usefulness and potential for implementation. Some are -very sensible; some are rather wacky. Some have been on this list for years; -others are relatively new. +very sensible; some are rather wacky. Some have been on this list for years. . Optimization @@ -226,42 +225,38 @@ others are relatively new. over the existing "required code unit" feature that just remembers one code unit. - * Remember an initial string rather than just 1 code unit? + * Remember an initial string rather than just 1 code unit. 
* A required code unit from alternatives - not just the last unit, but an earlier one if common to all alternatives. - o Friedl contains other ideas. + * Friedl contains other ideas. * The code does not set initial code unit flags for Unicode property types such as \p; I don't know how much benefit there would be for, for example, setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a pattern starts with \p{N}. - * There is scope for more "auto-possessifying" in connection with \p and \P. - . If Perl gets to a consistent state over the settings of capturing sub- patterns inside repeats, see if we can match it. One example of the - difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE - leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard + difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2 + leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard because I think it needs much more state to be remembered. . Perl 6 will be a revolution. Is it a revolution too far for PCRE? -. Line endings: - - * Option to use NUL as a line terminator in subject strings. This could now - be done relatively easily since the extension to support LF, CR, and CRLF. - If it is done, a suitable option for pcre2grep is also required. +. An option to use NUL as a line terminator in subject strings. This could be + done relatively easily. If it is done, a suitable option for pcre2grep is + also required. . Catch SIGSEGV for stack overflows? . A feature to suspend a match via a callout was once requested. -. Option to convert results into character offsets and character lengths. +. An option to convert results into character offsets and character lengths. -. Option for pcre2grep to scan only the start of a file. I am not keen - this - is the job of "head". +. An option for pcre2grep to scan only the start of a file. I am not keen - + this is the job of "head". . 
A (non-Unix) user wanted pcregrep options to (a) list a file name just once, preceded by a blank line, instead of adding it to every matched line, and (b) @@ -274,11 +269,6 @@ others are relatively new. to switch this dynamically. It would have to be specified when PCRE2 was compiled. PCRE2 would then call a function every time it wanted a character. -. Wild thought: the ability to compile from PCRE2's internal code to a real - FSM and a very fast (third) matcher to process the result. There would be - even more restrictions than for pcre2_dfa_exec(), however. This is not easy. - This is probably obsolete now that we have the JIT support. - . pcre2grep: add -rs for a sorted recurse? Having to store file names and sort them will of course slow it down. @@ -296,10 +286,10 @@ others are relatively new. pattern. . Pcre2grep: an option to specify the output line separator, either as a string - or select from a fixed list. This is not dead easy, because at the moment it - outputs whatever is in the input file. + or select from a fixed list. This is not straightforward, because at the + moment it outputs whatever is in the input file. -. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete, +. Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete, non-thread-safe patch showed that this can help performance for patterns where there are many alternatives. However, a simple thread-safe implementation that I tried made things worse in many simple cases, so this @@ -308,8 +298,7 @@ others are relatively new. . PCRE2 cannot at present distinguish between subpatterns with different names, but the same number (created by the use of ?|). In order to do so, a way of remembering *which* subpattern numbered n matched is needed. Bugzilla #760. - Now that (*MARK) has been implemented, it can perhaps be used as a way round - this problem. + (*MARK) can perhaps be used as a way round this problem. . 
Instead of having #ifdef HAVE_CONFIG_H in each module, put #include "something" and the the #ifdef appears only in one place, in "something". @@ -317,4 +306,4 @@ others are relatively new. Philip Hazel Email local part: ph10 Email domain: cam.ac.uk -Last updated: 25 October 2014 +Last updated: 18 November 2014