Further substitution tests (code and data), and more documentation.

2014-11-14 18:41:20 +00:00 · 2014-11-14 18:41:20 +00:00 · 07f8372202
parent adc7be2d3a
commit 07f8372202
21 changed files with 1014 additions and 693 deletions
--- a/2
+++ b/2
@@ -51,4 +51,6 @@ the currrent group as "unset". Thus, the ovector for those groups contained
 whatever was previously there. An example is the pattern /(x)|((*ACCEPT))/ when
 matched against "abcd".

+8. The pcre2_substitute() function has been implemented.
+
 ****
--- a/doc/html/pcre2.html
+++ b/doc/html/pcre2.html
@@ -135,7 +135,7 @@ remaining sections, except for the <b>pcre2demo</b> section (which is a program
 listing), and the short pages for individual functions, are concatenated in
 <b>pcre2.txt</b>, for ease of searching. The sections are as follows:
 <pre>
-  pcre2              this document FIXME CHECK THIS LIST
+  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1089,7 +1089,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
 Which characters are interpreted as newlines can be specified by a setting in
 the compile context that is passed to <b>pcre2_compile()</b> or by a special
 sequence at the start of the pattern, as described in the section entitled
-<a href="pcrepattern.html#newlines">"Newline conventions"</a>
+<a href="pcre2pattern.html#newlines">"Newline conventions"</a>
 in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
 built.
 <pre>
@@ -1243,7 +1243,7 @@ This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
 \w, and some of the POSIX character classes. By default, only ASCII characters
 are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
 classify characters. More details are given in the section on
-<a href="pcre2.html#genericchartypes">generic character types</a>
+<a href="pcre2pattern.html#genericchartypes">generic character types</a>
 in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 page. If you set PCRE2_UCP, matching one of the items it affects takes much
@@ -1924,11 +1924,8 @@ documentation.
 <P>
 When PCRE2 is built, a default newline convention is set; this is usually the
 standard convention for the operating system. The default can be overridden in
-either a
-<a href="#compilecontext">compile context</a>
-or a
-<a href="#matchcontext">match context.</a>
-However, changing the newline convention at match time disables JIT matching.
+a
+<a href="#compilecontext">compile context.</a>
 During matching, the newline choice affects the behaviour of the dot,
 circumflex, and dollar metacharacters. It may also alter the way the match
 position is advanced after a match failure for an unanchored pattern.
@@ -2290,7 +2287,7 @@ subpattern <i>n</i> has not been used at all, it returns an empty string. This
 can be distinguished from a genuine zero-length substring by inspecting the
 appropriate offset in the ovector, which contains PCRE2_UNSET for unset
 substrings.
-<a name="extractbynname"></a></P>
+<a name="extractbyname"></a></P>
 <br><a name="SEC27" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
 <P>
 <b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
@@ -2358,7 +2355,8 @@ string in <i>outputbuffer</i>, replacing the part that was matched with the
 be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
 </P>
 <P>
-In the replacement string, which is interpreted as a UTF string in UTF mode, a
+In the replacement string, which is interpreted as a UTF string in UTF mode,
+and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
 dollar character is an escape character that can specify the insertion of
 characters from capturing groups in the pattern. The following forms are
 recognized:
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@@ -51,11 +51,12 @@ JIT support is an optional feature of PCRE2. The "configure" option
 you want to use JIT. The support is limited to the following hardware
 platforms:
 <pre>
-  ARM v5, v7, and Thumb2
+  ARM 32-bit (v5, v7, and Thumb2)
+  ARM 64-bit
  Intel x86 32-bit and 64-bit
-  MIPS 32-bit
+  MIPS 32-bit and 64-bit
  Power PC 32-bit and 64-bit
-  SPARC 32-bit (experimental)
+  SPARC 32-bit
 </pre>
 If --enable-jit is set on an unsupported platform, compilation fails.
 </P>
@@ -73,11 +74,11 @@ To make use of the JIT support in the simplest way, all you have to do is to
 call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
 <b>pcre2_compile()</b>. This function has two arguments: the first is the
 compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
-second is a set of option bits, which must include at least one of
-PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
+second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
+PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
 </P>
 <P>
-If JIT support is not available, a call to <b>pcre2_jit_comple()</b> does
+If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
 nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
 is passed to the JIT compiler, which turns it into machine code that executes
 much faster than the normal interpretive code, but yields exactly the same
@@ -95,6 +96,20 @@ appropriate code is run if it is available. Otherwise, the pattern is matched
 using interpretive code.
 </P>
 <P>
+You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
+pattern. It does nothing if it has previously compiled code for any of the
+option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
+(perhaps later, when you find you need partial matching) again with
+PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
+PCRE2_JIT_COMPLETE and just compile code for partial matching. If
+<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
+returns zero. This is an alternative way of testing if JIT is available.
+</P>
+<P>
+At present, it is not possible to free JIT compiled code except when the entire
+compiled pattern is freed by calling <b>pcre2_free_code()</b>.
+</P>
+<P>
 In some circumstances you may need to call additional functions. These are
 described in the section entitled
 <a href="#stackcontrol">"Controlling the JIT stack"</a>
@@ -167,7 +182,7 @@ memory allocation), a starting size and a maximum size, and it returns a
 pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
 is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
 that is no longer needed. (For the technically minded: the address space is
-allocated by mmap or VirtualAlloc.)  FIXME Is this right?
+allocated by mmap or VirtualAlloc.)
 </P>
 <P>
 JIT uses far less memory for recursion than the interpretive code,
@@ -187,7 +202,8 @@ passed to a matching function, its information determines which JIT stack is
 used. There are three cases for the values of the other two options:
 <pre>
  (1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32K block
-      on the machine stack is used.
+      on the machine stack is used. This is the default when a match
+      context is created.

  (2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
      a pointer to a valid JIT stack, the result of calling
@@ -402,7 +418,7 @@ Cambridge CB2 3QH, England.
 </P>
 <br><a name="SEC13" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 08 November 2014
+Last updated: 12 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
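The pcre2jit.html hunks above describe how repeated calls to pcre2_jit_compile() treat option bits that already have compiled code. As a rough illustration only, that bookkeeping can be modelled in a few lines of Python. This is a toy model of the documented behaviour, not PCRE2 code: the class `JitState` and its `jit_compile` method are hypothetical, and unlike the real function (which returns zero or an error code) the method returns the bits it would newly compile. The three constants are assumed to have the values used in released versions of pcre2.h.

```python
# Option-bit values as assumed from released pcre2.h headers.
PCRE2_JIT_COMPLETE = 0x1
PCRE2_JIT_PARTIAL_SOFT = 0x2
PCRE2_JIT_PARTIAL_HARD = 0x4


class JitState:
    """Hypothetical stand-in for the JIT data attached to one compiled pattern."""

    def __init__(self):
        self.compiled = 0  # bitmask of modes that already have JIT code

    def jit_compile(self, options):
        """Record a JIT request and return only the bits that are new.

        Bits that already have compiled code are ignored, and a call with
        no option bits set changes nothing.
        """
        new_bits = options & ~self.compiled
        self.compiled |= new_bits
        return new_bits
```

In this model, a first call with PCRE2_JIT_COMPLETE followed by a call with PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD leaves only the partial-hard bit to compile on the second call, which matches the example given in the added paragraph.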
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -100,8 +100,8 @@ page.
 <P>
 Some applications that allow their users to supply patterns may wish to
 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
-option is set at compile time, (*UTF) is not allowed, and its appearance causes
-an error.
+option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
+appearance in a pattern causes an error.
 </P>
 <br><b>
 Unicode property support
@@ -113,6 +113,22 @@ such as \d and \w to use Unicode properties to determine character types,
 instead of recognizing only characters with codes less than 128 via a lookup
 table.
 </P>
+<P>
+Some applications that allow their users to supply patterns may wish to
+restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
+<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
+causes an error.
+</P>
+<br><b>
+Locking out empty string matching
+</b><br>
+<P>
+Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
+as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
+matching function is subsequently called to match the pattern. These options
+lock out the matching of empty strings, either entirely, or only at the start
+of the subject.
+</P>
 <br><b>
 Disabling auto-possessification
 </b><br>
@@ -133,6 +149,28 @@ PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
 reaching "no match" results. For more details, see the
 <a href="pcre2api.html"><b>pcre2api</b></a>
 documentation.
+</P>
+<br><b>
+Setting match and recursion limits
+</b><br>
+<P>
+The caller of <b>pcre2_match()</b> can set a limit on the number of times the
+internal <b>match()</b> function is called and on the maximum depth of
+recursive calls. These facilities are provided to catch runaway matches that
+are provoked by patterns with huge matching trees (a typical example is a
+pattern with nested unlimited repeats) and to avoid running out of system stack
+by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
+gives an error return. The limits can also be set by items at the start of the
+pattern of the form
+<pre>
+  (*LIMIT_MATCH=d)
+  (*LIMIT_RECURSION=d)
+</pre>
+where d is any number of decimal digits. However, the value of the setting must
+be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
+for it to have any effect. In other words, the pattern writer can lower the
+limits set by the programmer, but not raise them. If there is more than one
+setting of one of these limits, the lower value is used.
 <a name="newlines"></a></P>
 <br><b>
 Newline conventions
@@ -179,26 +217,14 @@ below. A change of \R setting can be combined with a change of newline
 convention.
 </P>
 <br><b>
-Setting match and recursion limits
+Specifying what \R matches
 </b><br>
 <P>
-The caller of <b>pcre2_match()</b> can set a limit on the number of times the
-internal <b>match()</b> function is called and on the maximum depth of
-recursive calls. These facilities are provided to catch runaway matches that
-are provoked by patterns with huge matching trees (a typical example is a
-pattern with nested unlimited repeats) and to avoid running out of system stack
-by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
-gives an error return. The limits can also be set by items at the start of the
-pattern of the form
-<pre>
-  (*LIMIT_MATCH=d)
-  (*LIMIT_RECURSION=d)
-</pre>
-where d is any number of decimal digits. However, the value of the setting must
-be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
-for it to have any effect. In other words, the pattern writer can lower the
-limits set by the programmer, but not raise them. If there is more than one
-setting of one of these limits, the lower value is used.
+It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
+complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
+at compile time. This effect can also be achieved by starting a pattern with
+(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
+corresponding to PCRE2_BSR_UNICODE.
 </P>
 <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
 <P>
@@ -2280,8 +2306,8 @@ complex:
 </PRE>
 </P>
 <P>
-There are four kinds of condition: references to subpatterns, references to
-recursion, a pseudo-condition called DEFINE, and assertions.
+There are five kinds of condition: references to subpatterns, references to
+recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
 </P>
 <br><b>
 Checking for a used subpattern by number
@@ -2389,6 +2415,23 @@ pattern uses references to the named group to match the four dot-separated
 components of an IPv4 address, insisting on a word boundary at each end.
 </P>
 <br><b>
+Checking the PCRE2 version
+</b><br>
+<P>
+Programs that link with a PCRE2 library can check the version by calling
+<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
+not have access to the underlying code cannot do this. A special "condition"
+called VERSION exists to allow such users to discover which version of PCRE2
+they are dealing with by using this condition to match a string such as
+"yesno". VERSION must be followed either by "=" or "&#62;=" and a version number.
+For example:
+<pre>
+  (?(VERSION&#62;=10.4)yes|no)
+</pre>
+This pattern matches "yes" if the PCRE2 version is greater or equal to 10.4, or
+"no" otherwise.
+</P>
+<br><b>
 Assertion conditions
 </b><br>
 <P>
@@ -3180,7 +3223,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
 <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
 <P>
 <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
-<b>pcre2syntax</b>(3), <b>pcre2</b>(3), <b>pcre216(3)</b>, <b>pcre232(3)</b>.
+<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
 </P>
 <br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
 <P>
@@ -3193,7 +3236,7 @@ Cambridge CB2 3QH, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 03 November 2014
+Last updated: 14 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
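The relocated "Setting match and recursion limits" text in the pcre2pattern.html diff above states that a pattern can lower the caller's limits but never raise them, and that the lowest of several settings wins. A minimal Python sketch of that rule follows. It is a model of the documented behaviour rather than PCRE2's implementation: the function name `effective_limits` and its parameters are hypothetical, and it only looks at (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) items that appear back to back at the very start of the pattern.

```python
import re

# Matches one leading limit item such as (*LIMIT_MATCH=1000).
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match, caller_recursion):
    """Return (match_limit, recursion_limit) after applying leading pattern items.

    A pattern item can only lower the value supplied by the caller; when an
    item is repeated, the lowest value seen is the one that applies.
    """
    limits = {"MATCH": caller_match, "RECURSION": caller_recursion}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        kind, value = item.group(1), int(item.group(2))
        limits[kind] = min(limits[kind], value)
        pos = item.end()
    return limits["MATCH"], limits["RECURSION"]
```

With caller limits of 1000 for both settings, a pattern starting with (*LIMIT_MATCH=500)(*LIMIT_MATCH=200) yields a match limit of 200, while a pattern starting with (*LIMIT_MATCH=5000) leaves the limit at 1000.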
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -493,17 +493,18 @@ Each top-level branch of a look behind must be of a fixed length.
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

-  (?(n)...        absolute reference condition
-  (?(+n)...       relative reference condition
-  (?(-n)...       relative reference condition
-  (?(&#60;name&#62;)...   named reference condition (Perl)
-  (?('name')...   named reference condition (Perl)
-  (?(name)...     named reference condition (PCRE2)
-  (?(R)...        overall recursion condition
-  (?(Rn)...       specific group recursion condition
-  (?(R&name)...   specific recursion condition
-  (?(DEFINE)...   define subpattern for reference
-  (?(assert)...   assertion condition
+  (?(n)               absolute reference condition
+  (?(+n)              relative reference condition
+  (?(-n)              relative reference condition
+  (?(&#60;name&#62;)          named reference condition (Perl)
+  (?('name')          named reference condition (Perl)
+  (?(name)            named reference condition (PCRE2)
+  (?(R)               overall recursion condition
+  (?(Rn)              specific group recursion condition
+  (?(R&name)          specific recursion condition
+  (?(DEFINE)          define subpattern for reference
+  (?(VERSION[&#62;]=n.m)  test PCRE2 version
+  (?(assert)          assertion condition
 </PRE>
 </P>
 <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
@@ -552,7 +553,7 @@ Cambridge CB2 3QH, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 October 2014
+Last updated: 14 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -201,10 +201,11 @@ Behave as if each subject line contains the given modifiers.
 <P>
 <b>-t</b>
 Run each compile and match many times with a timer, and output the resulting
-times per compile or match. You can control the number of iterations that are
-used for timing by following <b>-t</b> with a number (as a separate item on the
-command line). For example, "-t 1000" iterates 1000 times. The default is to
-iterate 500,000 times.
+times per compile or match. When JIT is used, separate times are given for the
+initial compile and the JIT compile. You can control the number of iterations
+that are used for timing by following <b>-t</b> with a number (as a separate
+item on the command line). For example, "-t 1000" iterates 1000 times. The
+default is to iterate 500,000 times.
 </P>
 <P>
 <b>-tm</b>
@@ -490,7 +491,6 @@ about the pattern:
      tables=[0|1|2]            select internal tables
 </pre>
 The effects of these modifiers are described in the following sections.
-FIXME: Give more examples.
 </P>
 <br><b>
 Newline and \R handling
@@ -528,7 +528,31 @@ one-off tests.
 <P>
 The <b>info</b> modifier requests information about the compiled pattern
 (whether it is anchored, has a fixed first character, and so on). The
-information is obtained from the <b>pcre2_pattern_info()</b> function.
+information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
+some typical examples:
+<pre>
+    re&#62; /(?i)(^a|^b)/m,info
+  Capturing subpattern count = 1
+  Compile options: multiline
+  Overall options: caseless multiline
+  First code unit at start or follows newline
+  Subject length lower bound = 1
+
+    re&#62; /(?i)abc/info
+  Capturing subpattern count = 0
+  Compile options: &#60;none&#62;
+  Overall options: caseless
+  First code unit = 'a' (caseless)
+  Last code unit = 'c' (caseless)
+  Subject length lower bound = 3
+</pre>
+"Compile options" are those specified to the compile function; "overall
+options" have added options that are taken or deduced from the pattern. If both
+sets of options are the same, just a single "options" line is output. "First
+code unit" is where any match must start; if there is more than one they are
+listed as "starting code units". "Last code unit" is the last literal code unit
+that must be present in any match. This is not necessarily the last character.
+These lines are omitted if no starting or ending code units are recorded.
 </P>
 <br><b>
 Specifying a pattern in hex
@@ -543,8 +567,8 @@ pairs. For example:
 This feature is provided as a way of creating patterns that contain binary zero
 characters. By default, <b>pcre2test</b> passes patterns as zero-terminated
 strings to <b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED.
-However, for patterns specified in hexadecimal, the length of the pattern is
-passed.
+However, for patterns specified in hexadecimal, the actual length of the
+pattern is passed.
 </P>
 <br><b>
 JIT compilation
@@ -571,7 +595,7 @@ setting the size of the JIT stack.
 </P>
 <P>
 If the <b>jitfast</b> modifier is specified, matching is done using the JIT
-"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity
+"fast path" interface, \fBpcre2_jit_match(), which skips some of the sanity
 checks that are done by <b>pcre2_match()</b>, and of course does not work when
 JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
 assumed.
@@ -604,11 +628,17 @@ character tables are mutually exclusive.
 Showing pattern memory
 </b><br>
 <P>
-The <b>/memory</b> modifier causes the size in bytes of the memory block used to
-hold the compiled pattern to be output. This does not include the size of the
+The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
+the compiled pattern to be output. This does not include the size of the
 <b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
 subsequently passed to the JIT compiler, the size of the JIT compiled code is
-also output.
+also output. Here is an example:
+<pre>
+    re&#62; /a(b)c/jit,memory
+  Memory allocation (code space): 21
+  Memory allocation (JIT code): 1910
+
+</PRE>
 </P>
 <br><b>
 Limiting nested parentheses
@@ -650,8 +680,8 @@ enable stack availability to be checked during compilation (see the
 <a href="pcre2api.html"><b>pcre2api</b></a>
 documentation for details). If the number specified by the modifier is greater
 than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up
-callback from <b>pcre2_compile()</b> to a local function. The argument it is
-passed is the current nesting parenthesis depth; if this is greater than the
+callback from <b>pcre2_compile()</b> to a local function. The argument it
+receives is the current nesting parenthesis depth; if this is greater than the
 value given by the modifier, non-zero is returned, causing the compilation to
 be aborted.
 </P>
@@ -688,6 +718,7 @@ not affect the compilation process.
      allusedtext         show all consulted text
  /g  global              global matching
      mark                show mark values
+      replace=&#60;string&#62;    specify a replacement string
      startchar           show starting character when relevant
 </pre>
 These modifiers may not appear in a <b>#pattern</b> command. If you want them as
@@ -759,11 +790,11 @@ pattern.
      offset=&#60;n&#62;                set starting offset
      ovector=&#60;n&#62;               set size of output vector
      recursion_limit=&#60;n&#62;       set a recursion limit
+      replace=&#60;string&#62;          specify a replacement string
      startchar                 show startchar when relevant
      zero_terminate            pass the subject as zero-terminated
 </pre>
 The effects of these modifiers are described in the following sections.
-FIXME: Give more examples.
 </P>
 <br><b>
 Showing more text
@@ -841,6 +872,30 @@ Any value other than zero is used as a return from <b>pcre2test</b>'s callout
 function.
 </P>
 <br><b>
+Finding all matches in a string
+</b><br>
+<P>
+Searching for all possible matches within a subject can be requested by the
+<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
+function is called again to search the remainder of the subject. The difference
+between <b>global</b> and <b>altglobal</b> is that the former uses the
+<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
+to start searching at a new point within the entire string (which is what Perl
+does), whereas the latter passes over a shortened substring. This makes a
+difference to the matching process if the pattern begins with a lookbehind
+assertion (including \b or \B).
+</P>
+<P>
+If an empty string is matched, the next match is done with the
+PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
+another, non-empty, match at the same point in the subject. If this match
+fails, the start offset is advanced, and the normal match is retried. This
+imitates the way Perl handles such cases when using the <b>/g</b> modifier or
+the <b>split()</b> function. Normally, the start offset is advanced by one
+character, but if the newline convention recognizes CRLF as a newline, and the
+current character is CR followed by LF, an advance of two is used.
+</P>
+<br><b>
 Testing substring extraction functions
 </b><br>
 <P>
@@ -867,28 +922,46 @@ length (that is, the return from the extraction function) is given in
 parentheses after each substring.
 </P>
 <br><b>
-Finding all matches in a string
+Testing the substitution function
 </b><br>
 <P>
-Searching for all possible matches within a subject can be requested by the
-<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
-function is called again to search the remainder of the subject. The difference
-between <b>global</b> and <b>altglobal</b> is that the former uses the
-<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
-to start searching at a new point within the entire string (which is what Perl
-does), whereas the latter passes over a shortened substring. This makes a
-difference to the matching process if the pattern begins with a lookbehind
-assertion (including \b or \B).
+If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
+called instead of one of the matching functions. Unlike subject strings,
+<b>pcre2test</b> does not process replacement strings for escape sequences. In
+UTF mode, a replacement string is checked to see if it is a valid UTF-8 string.
+If so, it is correctly converted to a UTF string of the appropriate code unit
+width. If it is not a valid UTF-8 string, the individual code units are copied
+directly. This provides a means of passing an invalid UTF-8 string for testing
+purposes.
 </P>
 <P>
-If an empty string is matched, the next match is done with the
-PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
-another, non-empty, match at the same point in the subject. If this match
-fails, the start offset is advanced, and the normal match is retried. This
-imitates the way Perl handles such cases when using the <b>/g</b> modifier or
-the <b>split()</b> function. Normally, the start offset is advanced by one
-character, but if the newline convention recognizes CRLF as a newline, and the
-current character is CR followed by LF, an advance of two is used.
+If the <b>global</b> modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to
+<b>pcre2_substitute()</b>. After a successful substitution, the modified string
+is output, preceded by the number of replacements. This may be zero if there
+were no matches. Here is a simple example of a substitution test:
+<pre>
+  /abc/replace=xxx
+      =abc=abc=
+   1: =xxx=abc=
+      =abc=abc=\=global
+   2: =xxx=xxx=
+</pre>
+Subject and replacement strings should be kept relatively short for
+substitution tests, as fixed-size buffers are used. To make it easy to test for
+buffer overflow, if the replacement string starts with a number in square
+brackets, that number is passed to <b>pcre2_substitute()</b> as the size of the
+output buffer, with the replacement string starting at the next character. Here
+is an example that tests the edge case:
+<pre>
+  /abc/
+      123abc123\=replace=[10]XYZ
+   1: 123XYZ123
+      123abc123\=replace=[9]XYZ
+  Failed: error -47: no more memory
+</pre>
+A replacement string is ignored with POSIX and DFA matching. Specifying partial
+matching provokes an error return ("bad option value") from
+<b>pcre2_substitute()</b>.
 </P>
 <br><b>
 Setting the JIT stack size
@@ -969,10 +1042,10 @@ available for storing matching information. The default is 15.
 A value of zero is useful when testing the POSIX API because it causes
 <b>regexec()</b> to be called with a NULL capture vector. When not testing the
 POSIX API, a value of zero is used to cause
-<b>pcre2_match_data_create_from_pattern</b> to be called, in order to create a
+<b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
 match block of exactly the right size for the pattern. (It is not possible to
-create a match block with a zero-length ovector; there is always one pair of
-offsets.)
+create a match block with a zero-length ovector; there is always at least one
+pair of offsets.)
 </P>
 <br><b>
 Passing the subject as zero-terminated
@@ -985,7 +1058,7 @@ be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
 this modifier has no effect, as there is no facility for passing a length.)
 </P>
 <P>
-When testing <b>pcre2_substitute</b>, this modifier also has the effect of
+When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
 passing the replacement string as zero-terminated.
 </P>
 <br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
@@ -1233,7 +1306,7 @@ Cambridge CB2 3QH, England.
 </P>
 <br><a name="SEC20" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 09 November 2014
+Last updated: 14 November 2014
 <br>
 Copyright &copy; 1997-2014 University of Cambridge.
 <br>
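The new "Testing the substitution function" section in the pcre2test.html diff above shows a replacement prefixed with a bracketed number being used as the output buffer size, with [10] succeeding and [9] failing for a nine-character result. The Python sketch below models that behaviour so the edge case can be checked by hand. It is not how pcre2test or pcre2_substitute() is implemented: the function `substitute`, its parameters and the `default_size` value are hypothetical, the replacement is inserted literally (no dollar escapes), Python's own `re` module stands in for PCRE2, and the model assumes the output needs room for one terminating zero code unit, which would explain why a buffer of 9 is too small.

```python
import re


def substitute(pattern, subject, replace_modifier, global_=False, default_size=256):
    """Model of a pcre2test substitution; returns (replacement_count, new_string).

    A leading "[n]" in replace_modifier is taken as the output buffer size.
    MemoryError stands in for the "no more memory" error return.
    """
    size_prefix = re.match(r"\[(\d+)\]", replace_modifier)
    if size_prefix:
        size = int(size_prefix.group(1))
        replacement = replace_modifier[size_prefix.end():]
    else:
        size = default_size
        replacement = replace_modifier

    # A callable replacement keeps Python from interpreting backslashes.
    result, count = re.subn(
        pattern, lambda _m: replacement, subject, count=0 if global_ else 1
    )

    # Assumed rule: the buffer must hold the result plus a trailing zero.
    if len(result) + 1 > size:
        raise MemoryError("no more memory")
    return count, result
```

Under these assumptions the model reproduces both examples from the diff: replacing "abc" in "=abc=abc=" with "xxx" gives one replacement, or two with the global flag, and "[10]XYZ" against "123abc123" succeeds while "[9]XYZ" raises the error.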
--- a/doc/pcre2.3
+++ b/doc/pcre2.3
@@ -132,7 +132,7 @@ remaining sections, except for the \fBpcre2demo\fP section (which is a program
 listing), and the short pages for individual functions, are concatenated in
 \fBpcre2.txt\fP, for ease of searching. The sections are as follows:
 .sp
-  pcre2              this document FIXME CHECK THIS LIST
+  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -116,7 +116,7 @@ USER DOCUMENTATION
       tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

-         pcre2              this document FIXME CHECK THIS LIST
+         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
@@ -1928,12 +1928,10 @@ NEWLINE HANDLING WHEN MATCHING

       When  PCRE2 is built, a default newline convention is set; this is usu-
       ally the standard convention for the operating system. The default  can
-       be overridden in either a compile context or a match context.  However,
-       changing the newline convention at match time  disables  JIT  matching.
-       During  matching,  the newline choice affects the behaviour of the dot,
-       circumflex, and dollar metacharacters. It may also alter  the  way  the
-       match position is advanced after a match failure for an unanchored pat-
-       tern.
+       be  overridden  in  a  compile  context.   During matching, the newline
+       choice affects  the  behaviour  of  the  dot,  circumflex,  and  dollar
+       metacharacters.  It  may  also  alter  the  way  the  match position is
+       advanced after a match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
       set,  and a match attempt for an unanchored pattern fails when the cur-
@@ -2320,9 +2318,10 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
       given as PCRE2_ZERO_TERMINATED for a zero-terminated string.

       In  the replacement string, which is interpreted as a UTF string in UTF
-       mode, a dollar character is an escape character that  can  specify  the
-       insertion  of characters from capturing groups in the pattern. The fol-
-       lowing forms are recognized:
+       mode, and is checked for UTF  validity  unless  the  PCRE2_NO_UTF_CHECK
+       option is set, a dollar character is an escape character that can spec-
+       ify the insertion of characters from capturing groups in  the  pattern.
+       The following forms are recognized:

         $$      insert a dollar character
         $<n>    insert the contents of group <n>
@@ -3508,11 +3507,12 @@ AVAILABILITY OF JIT SUPPORT
       built if you want to use JIT. The support is limited to  the  following
       hardware platforms:

-         ARM v5, v7, and Thumb2
+         ARM 32-bit (v5, v7, and Thumb2)
+         ARM 64-bit
         Intel x86 32-bit and 64-bit
-         MIPS 32-bit
+         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
-         SPARC 32-bit (experimental)
+         SPARC 32-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

@@ -3531,10 +3531,10 @@ SIMPLE USE OF JIT
       is to call pcre2_jit_compile() after successfully compiling  a  pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and  the
-       second  is  a  set  of  option bits, which must include at least one of
-       PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
+       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
+       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

-       If JIT support is not available,  a  call  to  pcre2_jit_comple()  does
+       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
@ -3550,6 +3550,19 @@ SIMPLE USE OF JIT
       pcre2_match()  is  called,  the appropriate code is run if it is avail-
       able. Otherwise, the pattern is matched using interpretive code.

+       You can call pcre2_jit_compile() multiple times for the  same  compiled
+       pattern.  Any option bit for which code has already been compiled is
+       ignored. For example, you can call it once with PCRE2_JIT_COMPLETE and
+       (perhaps later, when you find you need partial matching)
+       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
+       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
+       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
+       diately  returns  zero. This is an alternative way of testing if JIT is
+       available.
+
+       At present, it is not possible to free JIT compiled  code  except  when
+       the entire compiled pattern is freed by calling pcre2_free_code().
+
       In  some circumstances you may need to call additional functions. These
       are described in the  section  entitled  "Controlling  the  JIT  stack"
       below.
@ -3618,7 +3631,7 @@ CONTROLLING THE JIT STACK
       pcre2_jit_stack,    or    NULL    if    there    is   an   error.   The
       pcre2_jit_stack_free() function is used to free  a  stack  that  is  no
       longer  needed. (For the technically minded: the address space is allo-
-       cated by mmap or VirtualAlloc.)  FIXME Is this right?
+       cated by mmap or VirtualAlloc.)

       JIT uses far less memory for recursion than the interpretive code,  and
       a  maximum  stack size of 512K to 1M should be more than enough for any
@ -3637,7 +3650,8 @@ CONTROLLING THE JIT STACK
       two options:

         (1) If callback is NULL and data is NULL, an internal 32K block
-             on the machine stack is used.
+             on the machine stack is used. This is the default when a match
+             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
@ -3840,7 +3854,7 @@ AUTHOR

 REVISION

-       Last updated: 08 November 2014
+       Last updated: 12 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
 ------------------------------------------------------------------------------
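Review note on the replacement-string escapes documented above for pcre2_substitute(). The following is a minimal sketch of how the two forms visible in this hunk ($$ and $<n>, with <n> taken to be a decimal group number) could be expanded. The helper name expand_replacement, its signature, and its error convention are hypothetical and are not part of PCRE2; any other forms, named groups, and UTF validity checking are deliberately left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: expand a replacement string into
   out (capacity out_size, including the terminating zero). groups[i] holds
   the text captured by group i; ngroups is the number of entries. Returns the
   length written, or -1 on a bad escape, an unknown group, or lack of room. */
static int expand_replacement(const char *repl, const char *const *groups,
                              int ngroups, char *out, size_t out_size)
{
  size_t used = 0;
  if (out_size == 0) return -1;
  while (*repl != '\0')
    {
    const char *frag = repl;   /* default: copy one literal character */
    size_t fraglen = 1;
    if (*repl == '$')
      {
      repl++;
      if (*repl == '$')
        {
        frag = repl;           /* "$$" inserts a single dollar */
        repl++;
        }
      else if (*repl >= '0' && *repl <= '9')
        {
        int n = 0;
        while (*repl >= '0' && *repl <= '9')
          {
          n = n * 10 + (*repl++ - '0');
          if (n > 10000) return -1;   /* arbitrary guard against overflow */
          }
        if (n >= ngroups) return -1;
        frag = groups[n];      /* "$<n>" inserts the contents of group n */
        fraglen = strlen(frag);
        }
      else return -1;          /* any other form is rejected in this sketch */
      }
    else repl++;
    if (used + fraglen >= out_size) return -1;  /* keep room for the zero */
    memcpy(out + used, frag, fraglen);
    used += fraglen;
    }
  out[used] = '\0';
  return (int)used;
}
```

With groups 0 and 1 holding "abc" and "b", the replacement "<$1$$$0>" expands to "<b$abc>". Returning an error for an unknown group is a choice made for the sketch only; the text above does not say how PCRE2 treats that case.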

--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@ -1063,7 +1063,7 @@ equivalent to Perl's /x option, and it can be changed within a pattern by a
 Which characters are interpreted as newlines can be specified by a setting in
 the compile context that is passed to \fBpcre2_compile()\fP or by a special
 sequence at the start of the pattern, as described in the section entitled
-.\" HTML <a href="pcrepattern.html#newlines">
+.\" HTML <a href="pcre2pattern.html#newlines">
 .\" </a>
 "Newline conventions"
 .\"
@ -1226,7 +1226,7 @@ This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
 \ew, and some of the POSIX character classes. By default, only ASCII characters
 are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
 classify characters. More details are given in the section on
-.\" HTML <a href="pcre2.html#genericchartypes">
+.\" HTML <a href="pcre2pattern.html#genericchartypes">
 .\" </a>
 generic character types
 .\"
@ -1939,17 +1939,11 @@ documentation.
 .sp
 When PCRE2 is built, a default newline convention is set; this is usually the
 standard convention for the operating system. The default can be overridden in
-either a
+a
 .\" HTML <a href="#compilecontext">
 .\" </a>
-compile context
+compile context.
 .\"
-or a
-.\" HTML <a href="#matchcontext">
-.\" </a>
-match context.
-.\"
-However, changing the newline convention at match time disables JIT matching.
 During matching, the newline choice affects the behaviour of the dot,
 circumflex, and dollar metacharacters. It may also alter the way the match
 position is advanced after a match failure for an unanchored pattern.
@ -2322,7 +2316,7 @@ appropriate offset in the ovector, which contains PCRE2_UNSET for unset
 substrings.
 .
 .
-.\" HTML <a name="extractbynname"></a>
+.\" HTML <a name="extractbyname"></a>
 .SH "EXTRACTING CAPTURED SUBSTRINGS BY NAME"
 .rs
 .sp
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "14 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -63,8 +63,8 @@ page.
 .P
 Some applications that allow their users to supply patterns may wish to
 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
-option is set at compile time, (*UTF) is not allowed, and its appearance causes
-an error.
+option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
+appearance in a pattern causes an error.
 .
 .
 .SS "Unicode property support"
@ -75,6 +75,21 @@ This has the same effect as setting the PCRE2_UCP option: it causes sequences
 such as \ed and \ew to use Unicode properties to determine character types,
 instead of recognizing only characters with codes less than 128 via a lookup
 table.
+.P
+Some applications that allow their users to supply patterns may wish to
+restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
+\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
+causes an error.
+.
+.
+.SS "Locking out empty string matching"
+.rs
+.sp
+Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
+as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
+matching function is subsequently called to match the pattern. These options
+lock out the matching of empty strings, either entirely, or only at the start
+of the subject.
 .
 .
 .SS "Disabling auto-possessification"
@ -102,6 +117,28 @@ reaching "no match" results. For more details, see the
 documentation.
 .
 .
+.SS "Setting match and recursion limits"
+.rs
+.sp
+The caller of \fBpcre2_match()\fP can set a limit on the number of times the
+internal \fBmatch()\fP function is called and on the maximum depth of
+recursive calls. These facilities are provided to catch runaway matches that
+are provoked by patterns with huge matching trees (a typical example is a
+pattern with nested unlimited repeats) and to avoid running out of system stack
+by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
+gives an error return. The limits can also be set by items at the start of the
+pattern of the form
+.sp
+  (*LIMIT_MATCH=d)
+  (*LIMIT_RECURSION=d)
+.sp
+where d is any number of decimal digits. However, the value of the setting must
+be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
+for it to have any effect. In other words, the pattern writer can lower the
+limits set by the programmer, but not raise them. If there is more than one
+setting of one of these limits, the lower value is used.
+.
+.
 .\" HTML <a name="newlines"></a>
 .SS "Newline conventions"
 .rs
@ -153,26 +190,14 @@ below. A change of \eR setting can be combined with a change of newline
 convention.
 .
 .
-.SS "Setting match and recursion limits"
+.SS "Specifying what \eR matches"
 .rs
 .sp
-The caller of \fBpcre2_match()\fP can set a limit on the number of times the
-internal \fBmatch()\fP function is called and on the maximum depth of
-recursive calls. These facilities are provided to catch runaway matches that
-are provoked by patterns with huge matching trees (a typical example is a
-pattern with nested unlimited repeats) and to avoid running out of system stack
-by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
-gives an error return. The limits can also be set by items at the start of the
-pattern of the form
-.sp
-  (*LIMIT_MATCH=d)
-  (*LIMIT_RECURSION=d)
-.sp
-where d is any number of decimal digits. However, the value of the setting must
-be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
-for it to have any effect. In other words, the pattern writer can lower the
-limits set by the programmer, but not raise them. If there is more than one
-setting of one of these limits, the lower value is used.
+It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
+complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
+at compile time. This effect can also be achieved by starting a pattern with
+(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
+corresponding to PCRE2_BSR_UNICODE.
 .
 .
 .SH "EBCDIC CHARACTER CODES"
@ -2302,8 +2327,8 @@ complex:
  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
 .sp
 .P
-There are four kinds of condition: references to subpatterns, references to
-recursion, a pseudo-condition called DEFINE, and assertions.
+There are five kinds of condition: references to subpatterns, references to
+recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
 .
 .
 .SS "Checking for a used subpattern by number"
@ -2418,6 +2443,23 @@ pattern uses references to the named group to match the four dot-separated
 components of an IPv4 address, insisting on a word boundary at each end.
 .
 .
+.SS "Checking the PCRE2 version"
+.rs
+.sp
+Programs that link with a PCRE2 library can check the version by calling
+\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
+not have access to the underlying code cannot do this. A special "condition"
+called VERSION exists to allow such users to discover which version of PCRE2
+they are dealing with by using this condition to match a string such as
+"yesno". VERSION must be followed either by "=" or ">=" and a version number.
+For example:
+.sp
+  (?(VERSION>=10.4)yes|no)
+.sp
+This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
+"no" otherwise.
+.
+.
 .SS "Assertion conditions"
 .rs
 .sp
@ -3219,7 +3261,7 @@ subpattern, (*THEN) causes the subroutine match to fail.
 .rs
 .sp
 \fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
-\fBpcre2syntax\fP(3), \fBpcre2\fP(3), \fBpcre216(3)\fP, \fBpcre232(3)\fP.
+\fBpcre2syntax\fP(3), \fBpcre2\fP(3).
 .
 .
 .SH AUTHOR
@ -3236,6 +3278,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 03 November 2014
+Last updated: 14 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
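Review note on the "Setting match and recursion limits" text moved within pcre2pattern.3 above. The rule that a pattern can lower a limit but never raise it, and that the lowest of several settings wins, amounts to taking a minimum. The sketch below illustrates that reading only; effective_limit is a hypothetical name and this is not how PCRE2 stores or applies the values internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not PCRE2 code: combine the caller's limit with any
   (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) values found at the start of a
   pattern. A pattern value takes effect only when it is lower than the limit
   currently in force, so the result is the minimum of all the values. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_values, size_t count)
{
  uint32_t limit = caller_limit;
  size_t i;
  for (i = 0; i < count; i++)
    if (pattern_values[i] < limit) limit = pattern_values[i];
  return limit;
}
```

For a caller limit of 10000000, pattern settings of 5000 and 3000 give 3000, while a lone pattern setting of 20000000 leaves the caller's 10000000 in force.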
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "20 October 2014" "PCRE2 10.00"
+.TH PCRE2SYNTAX 3 "14 November 2014" "PCRE2 10.00"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -470,17 +470,18 @@ Each top-level branch of a look behind must be of a fixed length.
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
 .sp
-  (?(n)...        absolute reference condition
-  (?(+n)...       relative reference condition
-  (?(-n)...       relative reference condition
-  (?(<name>)...   named reference condition (Perl)
-  (?('name')...   named reference condition (Perl)
-  (?(name)...     named reference condition (PCRE2)
-  (?(R)...        overall recursion condition
-  (?(Rn)...       specific group recursion condition
-  (?(R&name)...   specific recursion condition
-  (?(DEFINE)...   define subpattern for reference
-  (?(assert)...   assertion condition
+  (?(n)               absolute reference condition
+  (?(+n)              relative reference condition
+  (?(-n)              relative reference condition
+  (?(<name>)          named reference condition (Perl)
+  (?('name')          named reference condition (Perl)
+  (?(name)            named reference condition (PCRE2)
+  (?(R)               overall recursion condition
+  (?(Rn)              specific group recursion condition
+  (?(R&name)          specific recursion condition
+  (?(DEFINE)          define subpattern for reference
+  (?(VERSION[>]=n.m)  test PCRE2 version
+  (?(assert)          assertion condition
 .
 .
 .SH "BACKTRACKING CONTROL"
@ -535,6 +536,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 20 October 2014
+Last updated: 14 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "12 November 2014" "PCRE 10.00"
+.TH PCRE2TEST 1 "14 November 2014" "PCRE 10.00"
 .SH NAME
 pcre2test - a program for testing Perl-compatible regular expressions.
 .SH SYNOPSIS
@ -450,7 +450,6 @@ about the pattern:
      tables=[0|1|2]            select internal tables
 .sp
 The effects of these modifiers are described in the following sections.
-FIXME: Give more examples.
 .
 .
 .SS "Newline and \eR handling"
@ -484,7 +483,31 @@ one-off tests.
 .P
 The \fBinfo\fP modifier requests information about the compiled pattern
 (whether it is anchored, has a fixed first character, and so on). The
-information is obtained from the \fBpcre2_pattern_info()\fP function.
+information is obtained from the \fBpcre2_pattern_info()\fP function. Here are
+some typical examples:
+.sp
+    re> /(?i)(^a|^b)/m,info
+  Capturing subpattern count = 1
+  Compile options: multiline
+  Overall options: caseless multiline
+  First code unit at start or follows newline
+  Subject length lower bound = 1
+.sp
+    re> /(?i)abc/info
+  Capturing subpattern count = 0
+  Compile options: <none>
+  Overall options: caseless
+  First code unit = 'a' (caseless)
+  Last code unit = 'c' (caseless)
+  Subject length lower bound = 3
+.sp
+"Compile options" are those specified to the compile function; "overall
+options" have added options that are taken or deduced from the pattern. If both
+sets of options are the same, just a single "options" line is output. "First
+code unit" is where any match must start; if there is more than one they are
+listed as "starting code units". "Last code unit" is the last literal code unit
+that must be present in any match. This is not necessarily the last character.
+These lines are omitted if no starting or ending code units are recorded.
 .
 .
 .SS "Specifying a pattern in hex"
@ -499,8 +522,8 @@ pairs. For example:
 This feature is provided as a way of creating patterns that contain binary zero
 characters. By default, \fBpcre2test\fP passes patterns as zero-terminated
 strings to \fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED.
-However, for patterns specified in hexadecimal, the length of the pattern is
-passed.
+However, for patterns specified in hexadecimal, the actual length of the
+pattern is passed.
 .
 .
 .SS "JIT compilation"
@ -528,7 +551,7 @@ documentation. See also the \fBjitstack\fP modifier below for a way of
 setting the size of the JIT stack.
 .P
 If the \fBjitfast\fP modifier is specified, matching is done using the JIT
-"fast path" interface (\fBpcre2_jit_match()), which skips some of the sanity
+"fast path" interface, \fBpcre2_jit_match()\fP, which skips some of the sanity
 checks that are done by \fBpcre2_match()\fP, and of course does not work when
 JIT is not supported. If \fBjitfast\fP is specified without \fBjit\fP, jit=7 is
 assumed.
@ -560,11 +583,16 @@ character tables are mutually exclusive.
 .SS "Showing pattern memory"
 .rs
 .sp
-The \fB/memory\fP modifier causes the size in bytes of the memory block used to
-hold the compiled pattern to be output. This does not include the size of the
+The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
+the compiled pattern to be output. This does not include the size of the
 \fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
 subsequently passed to the JIT compiler, the size of the JIT compiled code is
-also output.
+also output. Here is an example:
+.sp
+    re> /a(b)c/jit,memory
+  Memory allocation (code space): 21
+  Memory allocation (JIT code): 1910
+.sp
 .
 .
 .SS "Limiting nested parentheses"
@ -608,8 +636,8 @@ enable stack availability to be checked during compilation (see the
 .\"
 documentation for details). If the number specified by the modifier is greater
 than zero, \fBpcre2_set_compile_recursion_guard()\fP is called to set up
-callback from \fBpcre2_compile()\fP to a local function. The argument it is
-passed is the current nesting parenthesis depth; if this is greater than the
+callback from \fBpcre2_compile()\fP to a local function. The argument it
+receives is the current nesting parenthesis depth; if this is greater than the
 value given by the modifier, non-zero is returned, causing the compilation to
 be aborted.
 .
@ -726,7 +754,6 @@ pattern.
      zero_terminate            pass the subject as zero-terminated
 .sp
 The effects of these modifiers are described in the following sections.
-FIXME: Give more examples.
 .
 .
 .SS "Showing more text"
@ -867,15 +894,22 @@ were no matches. Here is a simple example of a substitution test:
  /abc/replace=xxx
      =abc=abc=
   1: =xxx=abc=
-      =abc=abc=\=global
+      =abc=abc=\e=global
   2: =xxx=xxx=
 .sp
 Subject and replacement strings should be kept relatively short for
 substitution tests, as fixed-size buffers are used. To make it easy to test for
 buffer overflow, if the replacement string starts with a number in square
 brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the
-output buffer, with the replacement string starting at the next character.
-.P
+output buffer, with the replacement string starting at the next character. Here
+is an example that tests the edge case:
+.sp
+  /abc/
+      123abc123\e=replace=[10]XYZ
+   1: 123XYZ123
+      123abc123\e=replace=[9]XYZ
+  Failed: error -47: no more memory
+.sp
 A replacement string is ignored with POSIX and DFA matching. Specifying partial
 matching provokes an error return ("bad option value") from
 \fBpcre2_substitute()\fP.
@ -957,10 +991,10 @@ available for storing matching information. The default is 15.
 A value of zero is useful when testing the POSIX API because it causes
 \fBregexec()\fP to be called with a NULL capture vector. When not testing the
 POSIX API, a value of zero is used to cause
-\fBpcre2_match_data_create_from_pattern\fP to be called, in order to create a
+\fBpcre2_match_data_create_from_pattern()\fP to be called, in order to create a
 match block of exactly the right size for the pattern. (It is not possible to
-create a match block with a zero-length ovector; there is always one pair of
-offsets.)
+create a match block with a zero-length ovector; there is always at least one
+pair of offsets.)
 .
 .
 .SS "Passing the subject as zero-terminated"
@ -972,7 +1006,7 @@ string, the \fBzero_terminate\fP modifier is provided. It causes the length to
 be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
 this modifier has no effect, as there is no facility for passing a length.)
 .P
-When testing \fBpcre2_substitute\fP, this modifier also has the effect of
+When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
 passing the replacement string as zero-terminated.
 .
 .
@ -1237,6 +1271,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 12 November 2014
+Last updated: 14 November 2014
 Copyright (c) 1997-2014 University of Cambridge.
 .fi
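Review note on the buffer-size edge case added to the substitution section of pcre2test.1 above. The result "123XYZ123" is nine code units long, yet a size of [9] fails and [10] succeeds. That is consistent with the output buffer needing one extra unit for a terminating zero, which is an inference from the example and not something the text states. The sketch below models that arithmetic for a single replacement; substitute_once and its parameters are hypothetical and this is not the PCRE2 implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not the PCRE2 implementation: replace the text at
   subject[match_start..match_end) with repl, writing the result to out.
   out_size counts the code units available in out, including one for the
   terminating zero (an assumption inferred from the documented example).
   Returns the length of the result, or -1 when out is too small. */
static long substitute_once(const char *subject, size_t match_start,
                            size_t match_end, const char *repl,
                            char *out, size_t out_size)
{
  size_t subject_len = strlen(subject);
  size_t repl_len = strlen(repl);
  size_t result_len;

  if (match_start > match_end || match_end > subject_len) return -1;
  result_len = subject_len - (match_end - match_start) + repl_len;
  if (result_len + 1 > out_size) return -1;   /* no room for the zero */

  memcpy(out, subject, match_start);
  memcpy(out + match_start, repl, repl_len);
  memcpy(out + match_start + repl_len, subject + match_end,
         subject_len - match_end);
  out[result_len] = '\0';
  return (long)result_len;
}
```

Replacing offsets 3 to 6 of "123abc123" with "XYZ" succeeds with out_size 10 and fails with out_size 9, mirroring the two pcre2test runs shown in the example.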
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@ -150,11 +150,12 @@ COMMAND LINE OPTIONS
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and  out-
-                 put the resulting times per compile or match. You can control
-                 the number of iterations that are used for timing by  follow-
-                 ing  -t  with  a  number  (as  a separate item on the command
-                 line). For  example,  "-t  1000"  iterates  1000  times.  The
-                 default is to iterate 500,000 times.
+                 put  the  resulting  times  per compile or match. When JIT is
+                 used, separate times are given for the  initial  compile  and
+                 the  JIT  compile.  You  can control the number of iterations
+                 that are used for timing by following -t with a number (as  a
+                 separate  item  on  the command line). For example, "-t 1000"
+                 iterates 1000 times. The default is to iterate 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.
@ -437,7 +438,6 @@ PATTERN MODIFIERS
             tables=[0|1|2]            select internal tables

       The effects of these modifiers are described in the following sections.
-       FIXME: Give more examples.

   Newline and \R handling

@ -468,7 +468,32 @@ PATTERN MODIFIERS

       The  info  modifier  requests  information  about  the compiled pattern
       (whether it is anchored, has a fixed first character, and so  on).  The
-       information is obtained from the pcre2_pattern_info() function.
+       information  is  obtained  from the pcre2_pattern_info() function. Here
+       are some typical examples:
+
+           re> /(?i)(^a|^b)/m,info
+         Capturing subpattern count = 1
+         Compile options: multiline
+         Overall options: caseless multiline
+         First code unit at start or follows newline
+         Subject length lower bound = 1
+
+           re> /(?i)abc/info
+         Capturing subpattern count = 0
+         Compile options: <none>
+         Overall options: caseless
+         First code unit = 'a' (caseless)
+         Last code unit = 'c' (caseless)
+         Subject length lower bound = 3
+
+       "Compile options" are those specified to the compile function; "overall
+       options" have added options that are taken or deduced from the pattern.
+       If both sets of options are the same, just a single "options"  line  is
+       output.  "First  code  unit" is where any match must start; if there is
+       more than one they are listed as  "starting  code  units".  "Last  code
+       unit"  is the last literal code unit that must be present in any match.
+       This is not necessarily the last character.  These lines are omitted if
+       no starting or ending code units are recorded.

   Specifying a pattern in hex

@ -482,7 +507,7 @@ PATTERN MODIFIERS
       binary zero characters. By default, pcre2test passes patterns as  zero-
       terminated   strings   to   pcre2_compile(),   giving   the  length  as
       PCRE2_ZERO_TERMINATED.  However, for patterns specified in hexadecimal,
-       the length of the pattern is passed.
+       the actual length of the pattern is passed.

   JIT compilation

@ -505,7 +530,7 @@ PATTERN MODIFIERS
       size of the JIT stack.

       If  the  jitfast  modifier is specified, matching is done using the JIT
-       "fast path" interface (pcre2_jit_match()), which skips some of the san-
+       "fast path" interface, pcre2_jit_match(), which skips some of the  san-
       ity  checks that are done by pcre2_match(), and of course does not work
       when JIT is not supported. If jitfast is specified without  jit,  jit=7
       is assumed.
@ -533,11 +558,16 @@ PATTERN MODIFIERS

   Showing pattern memory

-       The /memory modifier causes the size in bytes of the memory block  used
-       to  hold  the  compiled pattern to be output. This does not include the
-       size of the pcre2_code block; it is just the actual compiled  data.  If
-       the pattern is subsequently passed to the JIT compiler, the size of the
-       JIT compiled code is also output.
+       The /memory modifier causes the size in bytes of  the  memory  used  to
+       hold  the compiled pattern to be output. This does not include the size
+       of the pcre2_code block; it is just the actual compiled  data.  If  the
+       pattern is subsequently passed to the JIT compiler, the size of the JIT
+       compiled code is also output. Here is an example:
+
+           re> /a(b)c/jit,memory
+         Memory allocation (code space): 21
+         Memory allocation (JIT code): 1910
+

   Limiting nested parentheses

@ -573,7 +603,7 @@ PATTERN MODIFIERS
       mentation  for  details).  If  the  number specified by the modifier is
       greater than zero, pcre2_set_compile_recursion_guard() is called to set
       up  callback  from pcre2_compile() to a local function. The argument it
-       is passed is the current nesting parenthesis depth; if this is  greater
+       receives is the current nesting parenthesis depth; if this  is  greater
       than the value given by the modifier, non-zero is returned, causing the
       compilation to be aborted.

@ -606,6 +636,7 @@ PATTERN MODIFIERS
             allusedtext         show all consulted text
         /g  global              global matching
             mark                show mark values
+             replace=<string>    specify a replacement string
             startchar           show starting character when relevant

       These modifiers may not appear in a #pattern command. If you want  them
@ -671,11 +702,11 @@ SUBJECT MODIFIERS
             offset=<n>                set starting offset
             ovector=<n>               set size of output vector
             recursion_limit=<n>       set a recursion limit
+             replace=<string>          specify a replacement string
             startchar                 show startchar when relevant
             zero_terminate            pass the subject as zero-terminated

       The effects of these modifiers are described in the following sections.
-       FIXME: Give more examples.

   Showing more text

@ -745,6 +776,28 @@ SUBJECT MODIFIERS
       ber.  Any value other than zero is used as a  return  from  pcre2test's
       callout function.

+   Finding all matches in a string
+
+       Searching for all possible matches within a subject can be requested by
+       the global or /altglobal modifier. After finding a match, the  matching
+       function  is  called  again to search the remainder of the subject. The
+       difference between global and altglobal is that  the  former  uses  the
+       start_offset  argument  to  pcre2_match() or pcre2_dfa_match() to start
+       searching at a new point within the entire string (which is  what  Perl
+       does), whereas the latter passes over a shortened substring. This makes
+       a difference to the matching process if the pattern begins with a look-
+       behind assertion (including \b or \B).
+
+       If  an  empty  string  is  matched,  the  next  match  is done with the
+       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
+       for another, non-empty, match at the same point in the subject. If this
+       match fails, the start offset is advanced,  and  the  normal  match  is
+       retried.  This  imitates the way Perl handles such cases when using the
+       /g modifier or the split() function.  Normally,  the  start  offset  is
+       advanced  by  one  character,  but if the newline convention recognizes
+       CRLF as a newline, and the current character is CR followed by  LF,  an
+       advance of two is used.
+
   Testing substring extraction functions

       The  copy  and  get  modifiers  can  be  used  to  test  the pcre2_sub-
@ -767,27 +820,45 @@ SUBJECT MODIFIERS
       full list. The string length (that is, the return from  the  extraction
       function) is given in parentheses after each substring.

-   Finding all matches in a string
+   Testing the substitution function

-       Searching for all possible matches within a subject can be requested by
-       the  global or /altglobal modifier. After finding a match, the matching
-       function is called again to search the remainder of  the  subject.  The
-       difference  between  global  and  altglobal is that the former uses the
-       start_offset argument to pcre2_match() or  pcre2_dfa_match()  to  start
-       searching  at  a new point within the entire string (which is what Perl
-       does), whereas the latter passes over a shortened substring. This makes
-       a difference to the matching process if the pattern begins with a look-
-       behind assertion (including \b or \B).
+       If  the  replace  modifier  is  set, the pcre2_substitute() function is
+       called instead  of  one  of  the  matching  functions.  Unlike  subject
+       strings,  pcre2test  does  not  process  replacement strings for escape
+       sequences. In UTF mode, a replacement string is checked to see if it is
+       a valid UTF-8 string.  If so, it is correctly converted to a UTF string
+       of the appropriate code unit width. If it is not a valid UTF-8  string,
+       the individual code units are copied directly. This provides a means of
+       passing an invalid UTF-8 string for testing purposes.

-       If an empty string  is  matched,  the  next  match  is  done  with  the
-       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
-       for another, non-empty, match at the same point in the subject. If this
-       match  fails,  the  start  offset  is advanced, and the normal match is
-       retried. This imitates the way Perl handles such cases when  using  the
-       /g  modifier  or  the  split()  function. Normally, the start offset is
-       advanced by one character, but if  the  newline  convention  recognizes
-       CRLF  as  a newline, and the current character is CR followed by LF, an
-       advance of two is used.
+       If the global modifier is set,  PCRE2_SUBSTITUTE_GLOBAL  is  passed  to
+       pcre2_substitute().  After  a  successful  substitution,  the  modified
+       string is output, preceded by the number of replacements. This  may  be
+       zero  if there were no matches. Here is a simple example of a substitu-
+       tion test:
+
+         /abc/replace=xxx
+             =abc=abc=
+          1: =xxx=abc=
+             =abc=abc=\=global
+          2: =xxx=xxx=
+
+       Subject and replacement strings should be  kept  relatively  short  for
+       substitution  tests, as fixed-size buffers are used. To make it easy to
+       test for buffer overflow, if the replacement string starts with a  num-
+       ber  in square brackets, that number is passed to pcre2_substitute() as
+       the size of the output buffer, with the replacement string starting  at
+       the next character. Here is an example that tests the edge case:
+
+         /abc/
+             123abc123\=replace=[10]XYZ
+          1: 123XYZ123
+             123abc123\=replace=[9]XYZ
+         Failed: error -47: no more memory
+
+       A replacement string is ignored with POSIX and DFA matching. Specifying
+       partial matching provokes an error return  ("bad  option  value")  from
+       pcre2_substitute().

   Setting the JIT stack size

@ -853,10 +924,10 @@ SUBJECT MODIFIERS
       A  value of zero is useful when testing the POSIX API because it causes
       regexec() to be called with a NULL capture vector. When not testing the
       POSIX  API,  a  value  of  zero  is used to cause pcre2_match_data_cre-
-       ate_from_pattern  to  be  called,  in  order to create a match block of
+       ate_from_pattern() to be called, in order to create a  match  block  of
       exactly the right size for the pattern. (It is not possible to create a
-       match  block  with  a  zero-length ovector; there is always one pair of
-       offsets.)
+       match block with a zero-length ovector; there is always  at  least  one
+       pair of offsets.)

   Passing the subject as zero-terminated

@ -867,7 +938,7 @@ SUBJECT MODIFIERS
       via  the  POSIX  interface, this modifier has no effect, as there is no
       facility for passing a length.)

-       When  testing  pcre2_substitute,  this  modifier also has the effect of
+       When testing pcre2_substitute(), this modifier also has the  effect  of
       passing the replacement string as zero-terminated.


@ -1112,5 +1183,5 @@ AUTHOR

 REVISION

-       Last updated: 09 November 2014
+       Last updated: 14 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
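
Note on the buffer-size edge case documented in the pcre2test text above: the result "123XYZ123" is 9 code units, and the output buffer must also hold the terminating zero, so [10] succeeds and [9] fails. The sketch below reproduces only that arithmetic. It is not PCRE2 code: toy_substitute() and TOY_NOROOM are hypothetical names, and matching is done with a literal strstr() search rather than a regular expression.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NOROOM (-1)   /* Hypothetical error code, not a PCRE2 value */

/* Replace occurrences of the literal string pat in subject with rep, writing
a zero-terminated result into out, which holds outsize bytes. Returns the
number of replacements, or TOY_NOROOM if the result plus its terminating zero
does not fit. */

static int toy_substitute(const char *subject, const char *pat,
  const char *rep, int global, char *out, size_t outsize)
{
size_t plen = strlen(pat);
size_t rlen = strlen(rep);
size_t used = 0;
size_t frag;
const char *p = subject;
int subs = 0;

assert(plen > 0);   /* An empty literal pattern would never advance */

for (;;)
  {
  const char *m = strstr(p, pat);
  if (m == NULL) break;

  /* Copy the text before the match, then the replacement. The >= test keeps
  at least one byte free for the terminating zero. */

  frag = (size_t)(m - p);
  if (frag + rlen >= outsize - used) return TOY_NOROOM;
  memcpy(out + used, p, frag);
  used += frag;
  memcpy(out + used, rep, rlen);
  used += rlen;

  p = m + plen;
  subs++;
  if (!global) break;
  }

/* Copy the rest of the subject and add the terminating zero. */

frag = strlen(p);
if (frag + 1 > outsize - used) return TOY_NOROOM;
memcpy(out + used, p, frag);
out[used + frag] = 0;
return subs;
}
```

With the subject "123abc123", pattern "abc" and replacement "XYZ", this sketch succeeds for an output size of 10 and reports no room for a size of 9, mirroring the documented test.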
--- a/src/pcre2_substitute.c
+++ b/src/pcre2_substitute.c
@ -84,7 +84,7 @@ uint32_t ovector_count;
 uint32_t goptions = 0;
 BOOL match_data_created = FALSE;
 BOOL global = FALSE;
-PCRE2_SIZE buff_offset, lengthleft, endlength;
+PCRE2_SIZE buff_offset, lengthleft, fraglength;
 PCRE2_SIZE *ovector;

 /* Partial matching is not valid. */
@ -154,14 +154,17 @@ do

  /* Any error other than no match returns the error code. No match when not
  doing the special after-empty-match global rematch, or when at the end of the
-  subject, breaks the global loop. Otherwise, advance the starting point and 
-  try again. */ 
+  subject, breaks the global loop. Otherwise, advance the starting point by one
+  character, copying it to the output, and try again. */

  if (rc < 0)
    {
+    PCRE2_SIZE save_start;
+
    if (rc != PCRE2_ERROR_NOMATCH) goto EXIT;
    if (goptions == 0 || start_offset >= length) break;
-    start_offset++;
+
+    save_start = start_offset++;
    if ((code->overall_options & PCRE2_UTF) != 0)
      {
 #if PCRE2_CODE_UNIT_WIDTH == 8
@ -173,6 +176,14 @@ do
        start_offset++;
 #endif
      }
+
+    fraglength = start_offset - save_start;
+    if (lengthleft < fraglength) goto NOROOM;
+    memcpy(buffer + buff_offset, subject + save_start,
+      fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+    buff_offset += fraglength;
+    lengthleft -= fraglength;
+
    goptions = 0;
    continue;
    }
@ -181,12 +192,12 @@ do

  subs++;
  if (rc == 0) rc = ovector_count;
-  endlength = ovector[0] - start_offset;
-  if (endlength >= lengthleft) goto NOROOM;
+  fraglength = ovector[0] - start_offset;
+  if (fraglength >= lengthleft) goto NOROOM;
  memcpy(buffer + buff_offset, subject + start_offset,
-    endlength*(PCRE2_CODE_UNIT_WIDTH/8));
-  buff_offset += endlength;
-  lengthleft -= endlength;
+    fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+  buff_offset += fraglength;
+  lengthleft -= fraglength;

  for (i = 0; i < rlength; i++)
    {
@ -279,11 +290,11 @@ do
 /* Copy the rest of the subject and return the number of substitutions. */

 rc = subs;
-endlength = length - start_offset;
-if (endlength + 1 > lengthleft) goto NOROOM;
+fraglength = length - start_offset;
+if (fraglength + 1 > lengthleft) goto NOROOM;
 memcpy(buffer + buff_offset, subject + start_offset,
-  endlength*(PCRE2_CODE_UNIT_WIDTH/8));
-buff_offset += endlength;
+  fraglength*(PCRE2_CODE_UNIT_WIDTH/8));
+buff_offset += fraglength;
 buffer[buff_offset] = 0;
 *blength = buff_offset;
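
Note on the global-loop hunk above: when the rematch after an empty match fails, the starting point moves on by one character, and that character now has to reach the output as well. The following is a minimal standalone sketch of that step for 8-bit code units, assuming UTF-8 continuation bytes are recognised by the 10xxxxxx bit pattern. The function name toy_skip_and_copy() and its parameter list are hypothetical and are not part of pcre2_substitute().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Advance *start past one character of subject (length bytes), copying the
skipped bytes to out at offset *used. In UTF mode, any following continuation
bytes (10xxxxxx) belong to the same character and are skipped too. Returns 0
on success, or -1 if out (outsize bytes) has no room for the fragment. */

static int toy_skip_and_copy(const unsigned char *subject, size_t length,
  size_t *start, int utf, unsigned char *out, size_t outsize, size_t *used)
{
size_t save_start, fraglength;

assert(*start < length && *used <= outsize);

save_start = (*start)++;
if (utf)
  {
  while (*start < length && (subject[*start] & 0xc0) == 0x80) (*start)++;
  }

/* The fragment is the single character that was stepped over. */

fraglength = *start - save_start;
if (outsize - *used < fraglength) return -1;
memcpy(out + *used, subject + save_start, fraglength);
*used += fraglength;
return 0;
}
```

For the two-byte sequence 0xc3 0xa1 (the character U+00E1), the sketch advances by two bytes in UTF mode and by one byte otherwise, copying exactly the bytes it steps over.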

--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@ -164,6 +164,7 @@ void vms_setsymbol( char *, char *, int );
 #define DFA_WS_DIMENSION 1000   /* Size of DFA workspace */
 #define DEFAULT_OVECCOUNT 15    /* Default ovector count */
 #define JUNK_OFFSET 0xdeadbeef  /* For initializing ovector */
+#define LOCALESIZE 32           /* Size of locale name */
 #define LOOPREPEAT 500000       /* Default loop count for timing */
 #define REPLACE_MODSIZE 96      /* Field for reading 8-bit replacement */
 #define VERSION_SIZE 64         /* Size of buffer for the version strings */
@ -401,7 +402,7 @@ typedef struct patctl {    /* Structure for pattern modifiers. */
  uint32_t  jit;
  uint32_t  stackguard_test;
  uint32_t  tables_id;
-  uint8_t   locale[32];
+  uint8_t   locale[LOCALESIZE];
 } patctl;

 #define MAXCPYGET 10
@ -486,7 +487,7 @@ static modstruct modlist[] = {
  { "jitfast",             MOD_PAT,  MOD_CTL, CTL_JITFAST,               PO(control) },
  { "jitstack",            MOD_DAT,  MOD_INT, 0,                         DO(jitstack) },
  { "jitverify",           MOD_PAT,  MOD_CTL, CTL_JITVERIFY,             PO(control) },
-  { "locale",              MOD_PAT,  MOD_STR, 0,                         PO(locale) },
+  { "locale",              MOD_PAT,  MOD_STR, LOCALESIZE,                PO(locale) },
  { "mark",                MOD_PNDP, MOD_CTL, CTL_MARK,                  PO(control) },
  { "match_limit",         MOD_CTM,  MOD_INT, 0,                         MO(match_limit) },
  { "match_unset_backref", MOD_PAT,  MOD_OPT, PCRE2_MATCH_UNSET_BACKREF, PO(options) },
@ -512,7 +513,7 @@ static modstruct modlist[] = {
  { "posix",               MOD_PAT,  MOD_CTL, CTL_POSIX,                 PO(control) },
  { "ps",                  MOD_DAT,  MOD_OPT, PCRE2_PARTIAL_SOFT,        DO(options) },
  { "recursion_limit",     MOD_CTM,  MOD_INT, 0,                         MO(recursion_limit) },
-  { "replace",             MOD_PND,  MOD_STR, 0,                         PO(replacement) },
+  { "replace",             MOD_PND,  MOD_STR, REPLACE_MODSIZE,           PO(replacement) },
  { "stackguard",          MOD_PAT,  MOD_INT, 0,                         PO(stackguard_test) },
  { "startchar",           MOD_PND,  MOD_CTL, CTL_STARTCHAR,             PO(control) },
  { "tables",              MOD_PAT,  MOD_INT, 0,                         PO(tables_id) },
@ -3141,6 +3142,12 @@ for (;;)
    break;

    case MOD_STR:
+    if (len + 1 > m->value)
+      {
+      fprintf(outfile, "** Overlong value for '%s' (max %d code units)\n",
+        m->name, m->value - 1);
+      return FALSE;
+      }
    memcpy(field, pp, len);
    ((uint8_t *)field)[len] = 0;
    pp = ep;
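
Note on the MOD_STR hunk above: the modifier table now carries the size of each string field, so an overlong value can be rejected before it is copied. A minimal standalone sketch of that check follows. The function toy_store_string() and its parameters are hypothetical stand-ins for the modifier-table machinery in pcre2test, and the field size is assumed to include room for the terminating zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Store len bytes of value, plus a terminating zero, in field, which holds
fieldsize bytes. Returns 1 on success. If the value is too long, a diagnostic
is written to outfile, field is left untouched, and 0 is returned. */

static int toy_store_string(FILE *outfile, const char *name, uint8_t *field,
  uint32_t fieldsize, const char *value, size_t len)
{
assert(fieldsize > 0);   /* A zero-sized field cannot hold even the zero */

if (len + 1 > fieldsize)
  {
  fprintf(outfile, "** Overlong value for '%s' (max %u code units)\n",
    name, (unsigned int)(fieldsize - 1));
  return 0;
  }

memcpy(field, value, len);
field[len] = 0;
return 1;
}
```

A 4-byte field accepts a 3-byte value and rejects a 4-byte one, since the last byte is reserved for the terminating zero.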
--- a/testdata/testinput2
+++ b/testdata/testinput2
@ -4073,6 +4073,9 @@ a random value. /Ix
    123abc456abc789
    123abc456abc789\=g

+/(?<=abc)(|def)/g,replace=<$0>
+    123abcxyzabcdef789abcpqr
+
 # End of substitute tests 

 # End of testinput2 
--- a/testdata/testinput5
+++ b/testdata/testinput5
@ -1633,4 +1633,7 @@
 /ábc/utf,replace=XሴZ
    123ábc123

+/(?<=abc)(|def)/g,utf,replace=<$0>
+      123abcáyzabcdef789abcሴqr
+
 # End of testinput5 
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@ -13699,6 +13699,10 @@ Failed: error -34: bad option value
    123abc456abc789\=g
 2: 123xyz456xyz789

+/(?<=abc)(|def)/g,replace=<$0>
+    123abcxyzabcdef789abcpqr
+ 4: 123abc<>xyzabc<><def>789abc<>pqr
+
 # End of substitute tests 

 # End of testinput2 
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4002,4 +4002,8 @@ Subject length lower bound = 1
    123ábc123
 1: 123X\x{1234}Z123

+/(?<=abc)(|def)/g,utf,replace=<$0>
+      123abcáyzabcdef789abcሴqr
+ 4: 123abc<>\x{e1}yzabc<><def>789abc<>\x{1234}qr
+
 # End of testinput5