Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
2019-09-04 18:14:54 +00:00 · 2019-09-04 18:14:54 +00:00 · 963b570fd0
parent 87bc092222
commit 963b570fd0
9 changed files with 868 additions and 915 deletions
--- a/37
+++ b/37
@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.
-13. Nested lookbehinds are now taken into account when computing the maximum
-lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
-lookbehind of 2, because that is the largest individual lookbehind. Now it sets
-it to 3, because matching looks back 3 characters.
-14. For partial matches, pcre2test was always showing the maximum lookbehind
+13. For partial matches, pcre2test was always showing the maximum lookbehind
 characters, flagged with "<", which is misleading when the lookbehind didn't
 actually look behind the start (because it was later in the pattern). Showing
 all consulted preceding characters for partial matches is now controlled by the
@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
 available only for non-JIT matching, because JIT does not maintain the first
 and last consulted characters.
-15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
+14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
 if the end of the subject was encountered in a lookahead (conditional or
 otherwise), an atomic group, or a recursion.
-16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
+15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
-17. Check for integer overflow when computing lookbehind lengths. Fixes 
+16. Check for integer overflow when computing lookbehind lengths. Fixes 
 Clusterfuzz issue 15636.
-18. Implemented non-atomic positive lookaround assertions.
+17. Implemented non-atomic positive lookaround assertions.
-19. If a lookbehind contained a lookahead that contained another lookbehind
+18. If a lookbehind contained a lookahead that contained another lookbehind
 within it, the nested lookbehind was not correctly processed. For example, if 
 /(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching 
 "b".
-20. Implemented pcre2_get_match_data_size().
+19. Implemented pcre2_get_match_data_size().
-21. Two alterations to partial matching (not yet done by JIT):
+20. Two alterations to partial matching (not yet done by JIT):
    (a) The definition of a partial match is slightly changed: if a pattern
    contains any lookbehinds, an empty partial match may be given, because this
@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
    (c) An empty string partial hard match can be returned for \z and \Z as it
    is documented that they shouldn't match. 
-22. A branch that started with (*ACCEPT) was not being recognized as one that
+21. A branch that started with (*ACCEPT) was not being recognized as one that
 could match an empty string. 
-23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
+22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
 char * instead of const uint8_t *, as generated by pcre2_maketables().
-24. Upgraded to Unicode 12.1.0.
+23. Upgraded to Unicode 12.1.0.
-25. Add -jitfast command line option to pcre2test (to make all the jit options 
+24. Add -jitfast command line option to pcre2test (to make all the jit options 
 available directly).
-26. Make pcre2test -C show if libreadline or libedit is supported.
+25. Make pcre2test -C show if libreadline or libedit is supported.
-28. If the length of one branch of a group exceeded 65535 (the maximum value
+26. If the length of one branch of a group exceeded 65535 (the maximum value
 that is remembered as a minimum length), the whole group's length was 
 incorrectly recorded as 65535, leading to incorrect "no match" when start-up 
 optimizations were in force.
-29. The "rightmost consulted character" value was not always correct; in 
+27. The "rightmost consulted character" value was not always correct; in 
 particular, if a pattern ended with a negative lookahead, characters that were 
 inspected in that lookahead were not included.
-30. Add the pcre2_maketables_free() function.
+28. Add the pcre2_maketables_free() function.
 Version 10.33 16-April-2019
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
 <pre>
  PCRE2_INFO_MAXLOOKBEHIND
 </pre>
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \b and \B require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \A also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 </P>
 <P>
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \b and \B require a one-character lookbehind.
-\A also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \A might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern 
+(?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the 
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 <a href="pcre2partial.html"><b>pcre2partial</b></a>
-documentation.
+documentation for a discussion of multi-segment matching.
 <pre>
  PCRE2_INFO_MINLENGTH
 </pre>
--- a/doc/html/pcre2partial.html
+++ b/doc/html/pcre2partial.html
@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 </P>
 <P>
-If you want to use partial matching with just-in-time optimized code, as well 
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 <b>pcre2_jit_compile()</b> with one or both of these options:
 <pre>
@ -101,7 +101,7 @@ matched string.
 </P>
 <P>
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start 
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 </P>
 <P>
@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is 
+the supplied subject string is not the true end of the available data, which is
 why \z, \Z, \b, \B, and $ always give a partial match.
 </P>
 <P>
@ -226,7 +226,7 @@ date:
  data&#62; 3juj\=ph
  No match
 </pre>
-This example gives the same results for both hard and soft partial matching 
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 <pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
   0: 25jun04
   1: jun
  data&#62; 25jun04\=ph
-  Partial match: 25jun04 
+  Partial match: 25jun04
 </pre>
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@ -244,9 +244,12 @@ there is only a partial match.
 <P>
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same 
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 </P>
 <P>
 Special logic must be implemented to handle a matched substring that spans a
@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 </P>
 <P>
-When a partial match occurs, the next segment must be added to the current 
+When a partial match occurs, the next segment must be added to the current
-subject and the match re-run, using the <i>startoffset</i> argument of 
+subject and the match re-run, using the <i>startoffset</i> argument of
 <b>pcre2_match()</b> to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 <pre>
    re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data&#62; ...the date is 23ja\=ph
@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
   0: 23jan19
   1: jan
 </pre>
-Note the use of the <b>offset</b> modifier to start the new match where the 
-partial match was found.
+Note the use of the <b>offset</b> modifier to start the new match where the
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the 
+second half is moved to the start of the buffer, and a new segment is added 
+before repeating the match as in the example above. After a no match, the 
+entire buffer can be discarded.
 </P>
 <P>
-In this simple example, the next segment was just added to the one in which the 
-partial match was found. However, if there are memory constraints, it may be 
-necessary to discard text that precedes the partial match before adding the 
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-</P>
-<P>
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-</P>
-<P>
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-</P>
-<P>
-For example, if the pattern "(?&#60;=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
-displays a partial match, it indicates the lookbehind characters with '&#60;'
-characters if the <b>allusedtext</b> modifier is set:
+matching process. When <b>pcre2test</b> displays a partial match, it indicates
+these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
 <pre>
    re&#62; "(?&#60;=123)abc"
  data&#62; xx123ab\=ph,allusedtext
  Partial match: 123ab
                 &#60;&#60;&#60;
 </pre>
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+</P>
+<P>
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+</P>
+<P>
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 </P>
 <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
 <P>
@ -379,11 +382,11 @@ want.
 </P>
 <P>
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
+doing it is to retain some or all of the segment and try a new complete match,
-match, as described for <b>pcre2_match()</b> above. Another possibility is to
+as described for <b>pcre2_match()</b> above. Another possibility is to work with
-work with two buffers. If a partial match at offset <i>n</i> in the first buffer
+two buffers. If a partial match at offset <i>n</i> in the first buffer is
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
-you can then try a new match starting at offset <i>n+1</i> in the first buffer.
+can then try a new match starting at offset <i>n+1</i> in the first buffer.
 </P>
 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
 <P>
@ -396,7 +399,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
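The revised pcre2partial text above says that in UTF-8 "you have to count characters while moving back through the code units". A minimal sketch of that step in plain C is given here for illustration. The name utf8_move_back and its parameters are hypothetical and are not part of the PCRE2 API, and the sketch assumes the subject is valid UTF-8. As the documentation in this commit warns, a count taken from PCRE2_INFO_MAXLOOKBEHIND is only a heuristic when the pattern contains nested lookbehinds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. Starting at code-unit offset
   `pos` in a valid UTF-8 subject, step back by up to `nchars` characters and
   return the resulting code-unit offset. The result never goes below 0. */
static size_t utf8_move_back(const uint8_t *subject, size_t pos, uint32_t nchars)
{
  assert(subject != NULL);
  while (nchars > 0 && pos > 0)
    {
    pos--;  /* Step onto the previous code unit. */
    /* Continuation bytes have the form 10xxxxxx. Skip them to reach the
       first byte of the character. */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
    nchars--;
    }
  return pos;
}
```

In a non-UTF or 32-bit build the same step would reduce to a bounded subtraction, as the documentation notes.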
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
 .sp
  PCRE2_INFO_MAXLOOKBEHIND
 .sp
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \eb and \eB require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \eA also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 .P
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \eb and \eB require a one-character lookbehind.
-\eA also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \eA might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern 
+(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the 
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 .\" HREF
 \fBpcre2partial\fP
 .\"
-documentation.
+documentation for a discussion of multi-segment matching.
 .sp
  PCRE2_INFO_MINLENGTH
 .sp
--- a/doc/pcre2partial.3
+++ b/doc/pcre2partial.3
@ -1,4 +1,4 @@
-.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE2"
@ -25,7 +25,7 @@ options is whether or not a partial match is preferred to an alternative
 complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 .P
-If you want to use partial matching with just-in-time optimized code, as well 
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 \fBpcre2_jit_compile()\fP with one or both of these options:
 .sp
@ -73,7 +73,7 @@ need not form part of the final matched string; lookbehind assertions and the
 matched string.
 .P
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start 
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 .P
 (3) There is a special case when the whole pattern can match an empty string.
@ -139,7 +139,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is 
+the supplied subject string is not the true end of the available data, which is
 why \ez, \eZ, \eb, \eB, and $ always give a partial match.
 .P
 If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
@ -192,7 +192,7 @@ date:
  data> 3juj\e=ph
  No match
 .sp
-This example gives the same results for both hard and soft partial matching 
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 .sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
@ -200,8 +200,8 @@ options. Here is an example where there is a difference:
   0: 25jun04
   1: jun
  data> 25jun04\e=ph
-  Partial match: 25jun04 
+  Partial match: 25jun04
-.sp    
+.sp
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
 there is only a partial match.
@ -213,9 +213,12 @@ there is only a partial match.
 .sp
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same 
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 .P
 Special logic must be implemented to handle a matched substring that spans a
 segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@ -223,11 +226,10 @@ partial match at the end of a segment whenever there is the possibility of
 changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 .P
-When a partial match occurs, the next segment must be added to the current 
+When a partial match occurs, the next segment must be added to the current
-subject and the match re-run, using the \fIstartoffset\fP argument of 
+subject and the match re-run, using the \fIstartoffset\fP argument of
 \fBpcre2_match()\fP to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 .sp
    re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
  data> ...the date is 23ja\e=ph
@ -236,48 +238,49 @@ of very long sequences, so the patterns are normally not anchored. For example:
   0: 23jan19
   1: jan
 .sp
-Note the use of the \fBoffset\fP modifier to start the new match where the 
-partial match was found.
+Note the use of the \fBoffset\fP modifier to start the new match where the
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the 
+second half is moved to the start of the buffer, and a new segment is added 
+before repeating the match as in the example above. After a no match, the 
+entire buffer can be discarded.
 .P
-In this simple example, the next segment was just added to the one in which the 
-partial match was found. However, if there are memory constraints, it may be 
-necessary to discard text that precedes the partial match before adding the 
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-.P
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-.P
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-.P
-For example, if the pattern "(?<=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
-displays a partial match, it indicates the lookbehind characters with '<'
-characters if the \fBallusedtext\fP modifier is set:
+matching process. When \fBpcre2test\fP displays a partial match, it indicates
+these characters with '<' if the \fBallusedtext\fP modifier is set:
 .sp
    re> "(?<=123)abc"
  data> xx123ab\e=ph,allusedtext
  Partial match: 123ab
                 <<<
-.sp                  
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
-.
+.sp
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+.P
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+.P
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 .
 .
 .SH "PARTIAL MATCHING USING pcre2_dfa_match()"
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
 want.
 .P
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
+doing it is to retain some or all of the segment and try a new complete match,
-match, as described for \fBpcre2_match()\fP above. Another possibility is to
+as described for \fBpcre2_match()\fP above. Another possibility is to work with
-work with two buffers. If a partial match at offset \fIn\fP in the first buffer
+two buffers. If a partial match at offset \fIn\fP in the first buffer is
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
-you can then try a new match starting at offset \fIn+1\fP in the first buffer.
+can then try a new match starting at offset \fIn+1\fP in the first buffer.
 .
 .
 .SH AUTHOR
@ -365,6 +368,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
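The new pcre2partial text describes the most straightforward multi-segment approach: a buffer twice the size of a segment, where the first half is discarded, the second half is moved to the start, and a new segment is appended. The sketch below illustrates only that buffer handling, under stated assumptions. The names seg_buffer, slide_and_append and SEG_SIZE are hypothetical, no PCRE2 calls are shown, and it assumes that any partial match starts within the retained second half, which the documentation suggests is the normal case when sought strings are much shorter than a segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8  /* Illustrative segment size; real segments are far larger. */

typedef struct {
  char   data[2 * SEG_SIZE];  /* Room for two segments. */
  size_t used;                /* Number of bytes currently held. */
} seg_buffer;

/* Hypothetical helper, not a PCRE2 function. If more than one segment is
   held, keep only the most recent SEG_SIZE bytes by moving them to the start
   of the buffer, then append the new segment. Returns the number of bytes
   discarded, so the caller can adjust saved offsets. */
static size_t slide_and_append(seg_buffer *b, const char *seg, size_t seglen)
{
  size_t dropped = 0;
  assert(b != NULL && seg != NULL && seglen <= SEG_SIZE);
  if (b->used > SEG_SIZE)
    {
    dropped = b->used - SEG_SIZE;
    memmove(b->data, b->data + dropped, SEG_SIZE);
    b->used = SEG_SIZE;
    }
  memcpy(b->data + b->used, seg, seglen);
  b->used += seglen;
  return dropped;
}
```

After a partial match, the returned count would be subtracted from the saved start of the partial match before that value is passed as the startoffset argument of the next match attempt.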
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -128,12 +128,12 @@ static int
    compile_block *, PCRE2_SIZE *);
 static int
-  get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
+  get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
    compile_block *);
 static BOOL
-  set_lookbehind_lengths(uint32_t **, int *, int *, int *,
+  set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *, 
-    parsed_recurse_check *, compile_block *);
+    compile_block *);
 static int
  check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
 #define GI_SET_FIXED_LENGTH    0x80000000u
 #define GI_NOT_FIXED_LENGTH    0x40000000u
 #define GI_FIXED_LENGTH_MASK   0x0000ffffu
-#define GI_EXTRA_MASK          0x0fff0000u
-#define GI_EXTRA_MAX                 0xfff  /* NB not unsigned */
-#define GI_EXTRA_SHIFT                  16
 /* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
 and is fast (a good compiler can turn it into a subtraction and unsigned
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
  isinline    FALSE if a reference or recursion; TRUE for inline group
-  extraptr    pointer to where to return extra lookbehind length
  errcodeptr  pointer to the errorcode
  lcptr       pointer to the loop counter
  group       number of captured group or -1 for a non-capturing group
@ -8908,13 +8904,11 @@ Returns:      the group length or a negative number
 */
 static int
-get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
-  int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
-  compile_block *cb)
+get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
+   int group, parsed_recurse_check *recurses, compile_block *cb)
 {
 int branchlength;
 int grouplength = -1;
-int extra = 0;
 /* The cache can be used only if there is no possibility of there being two
 groups with the same number. We do not need to set the end pointer for a group
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
  if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
    {
    if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
-    *extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
    return groupinfo & GI_FIXED_LENGTH_MASK;
    }
  }
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
 for(;;)
  {
-  int branchextra;
-  branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
-    recurses, cb);
+  branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
  if (branchlength < 0) goto ISNOTFIXED;
-  if (grouplength == -1)
-    {
-    grouplength = branchlength;
-    extra = branchextra;
-    }
-  else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
+  if (grouplength == -1) grouplength = branchlength;
+    else if (grouplength != branchlength) goto ISNOTFIXED;
  if (**pptrptr == META_KET) break;
  *pptrptr += 1;   /* Skip META_ALT */
  }
-/* There are only 12 bits for caching the extra value, but a pattern that
-needs more than that is weird indeed. */
-if (group > 0 && extra <= GI_EXTRA_MAX)
-  cb->groupinfo[group] |= (uint32_t)
-    (GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
-*extraptr = extra;
+if (group > 0)
+  cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
 return grouplength;
 ISNOTFIXED:
@ -8973,17 +8954,11 @@ return -1;
 *************************************************/
 /* Return a fixed length for a branch in a lookbehind, giving an error if the
-length is not fixed. We also take note of any extra value that is generated
-from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
-lookbehind has length 2, but the max_lookbehind setting must be 3 because
-matching inspects 3 characters before the match starting point.
-On entry, *pptrptr points to the first element inside the branch. On exit it is
-set to point to the ALT or KET.
+length is not fixed. On entry, *pptrptr points to the first element inside the
+branch. On exit it is set to point to the ALT or KET.
 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
-  extraptr    pointer to where to return extra lookbehind length
  errcodeptr  pointer to error code
  lcptr       pointer to loop counter
  recurses    chain of recurse_check to catch mutual recursion
@ -8993,14 +8968,11 @@ Returns:      the length, or a negative value on error
 */
 static int
-get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
+get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
  parsed_recurse_check *recurses, compile_block *cb)
 {
 int branchlength = 0;
 int grouplength;
-int groupextra;
-int max;
-int extra = 0;   /* Additional lookbehind from nesting */
 uint32_t lastitemlength = 0;
 uint32_t *pptr = *pptrptr;
 PCRE2_SIZE offset;
@ -9149,17 +9121,13 @@ for (;; pptr++)
    break;
    /* A nested lookbehind does not contribute any length to this lookbehind,
-    but must itself be checked and have its lengths set. If the maximum
-    lookbehind for the nested lookbehind is greater than the length so far
-    computed for this branch, we must compute an extra value and keep the
-    largest encountered for use when setting the maximum overall lookbehind. */
+    but must itself be checked and have its lengths set. */
    case META_LOOKBEHIND:
    case META_LOOKBEHINDNOT:
    case META_LOOKBEHIND_NA:
-    if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
+    if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
      return -1;
-    if (max - branchlength > extra) extra = max - branchlength;
    break;
    /* Back references and recursions are handled by very similar code. At this
@ -9267,14 +9235,15 @@ for (;; pptr++)
    in the cache. */
    gptr++;
-    grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
-      group, &this_recurse, cb);
+    grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group,
+      &this_recurse, cb);
    if (grouplength < 0)
      {
      if (*errcodeptr == 0) goto ISNOTFIXED;
      return -1;  /* Error already set */
      }
-    goto OK_GROUP;
+    itemlength = grouplength;
+    break;
    /* Check nested groups - advance past the initial data for each type and
    then seek a fixed length with get_grouplength(). */
@ -9304,16 +9273,10 @@ for (;; pptr++)
    case META_SCRIPT_RUN:
    pptr++;
    CHECK_GROUP:
-    grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
-      group, recurses, cb);
+    grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group,
+      recurses, cb);
    if (grouplength < 0) return -1;
-    /* A nested lookbehind within the group may require looking back further
-    than the length of the group. */
-    OK_GROUP:
    itemlength = grouplength;
-    if (groupextra - branchlength > extra) extra = groupextra - branchlength;
    break;
    /* Exact repetition is OK; variable repetition is not. A repetition of zero
@ -9374,7 +9337,6 @@ for (;; pptr++)
 EXIT:
 *pptrptr = pptr;
-*extraptr = extra;
 return branchlength;
 PARSED_SKIP_FAILED:
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.
 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
-  maxptr      where to return maximum lookbehind for the whole group
  errcodeptr  pointer to error code
  lcptr       pointer to loop counter
  recurses    chain of recurse_check to catch mutual recursion
@ -9411,13 +9372,11 @@ Returns:      TRUE if all is well
 */
 static BOOL
-set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
-  int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
+set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
+  parsed_recurse_check *recurses, compile_block *cb)
 {
 PCRE2_SIZE offset;
 int branchlength;
-int branchextra;
-int max = 0;
 uint32_t *bptr = *pptrptr;
 READPLUSOFFSET(offset, bptr);  /* Offset for error messages */
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr);  /* Offset for error messages */
 do
  {
  *pptrptr += 1;
-  branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
-    recurses, cb);
+  branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
  if (branchlength < 0)
    {
    /* The errorcode and offset may already be set from a nested lookbehind. */
@ -9435,14 +9393,12 @@ do
    if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
    return FALSE;
    }
-  if (branchlength + branchextra > max) max = branchlength + branchextra;
+  if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
  *bptr |= branchlength;  /* branchlength never more than 65535 */
  bptr = *pptrptr;
  }
 while (*bptr == META_ALT);
-if (max > cb->max_lookbehind) cb->max_lookbehind = max;
-*maxptr = max;
 return TRUE;
 }
@ -9475,7 +9431,6 @@ static int
 check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
  parsed_recurse_check *recurses, compile_block *cb)
 {
-int max;
 int errorcode = 0;
 int loopcount = 0;
 int nestlevel = 0;
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
    case META_LOOKBEHIND:
    case META_LOOKBEHINDNOT:
    case META_LOOKBEHIND_NA:
-    if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
-         recurses, cb))
+    if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
      return errorcode;
    break;
    }
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@ -304,7 +304,7 @@ Partial match, mark=xx: 123a
 /(?<=(?<=a)b)c.*/I
 Capture group count = 0
-Max lookbehind = 2
+Max lookbehind = 1
 First code unit = 'c'
 Subject length lower bound = 1
    abc\=ph
@ -337,7 +337,7 @@ Partial match: abcd
 /(?<=(?<=(?<=a)b)c)./I
 Capture group count = 0
-Max lookbehind = 3
+Max lookbehind = 1
 Subject length lower bound = 1
    123abcXYZ
 0: abcX
@ -354,7 +354,7 @@ Subject length lower bound = 1
 /(?<=ab((?<=...)cd))./I
 Capture group count = 1
-Max lookbehind = 5
+Max lookbehind = 4
 Subject length lower bound = 1
    ZabcdX
 0: ZabcdX
@ -363,7 +363,7 @@ Subject length lower bound = 1
 /(?<=((?<=(?<=ab).))(?1)(?1))./I
 Capture group count = 1
-Max lookbehind = 3
+Max lookbehind = 2
 Subject length lower bound = 1
    abxZ
 0: abxZ
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@ -17036,7 +17036,7 @@ Subject length lower bound = 1
 /(?<=(?<=a)b)c.*/I
 Capture group count = 0
-Max lookbehind = 2
+Max lookbehind = 1
 First code unit = 'c'
 Subject length lower bound = 1
    abc\=ph
@ -17064,7 +17064,7 @@ Subject length lower bound = 0
 /(?<=a(?<=a|ba)c)/I
 Capture group count = 0
-Max lookbehind = 3
+Max lookbehind = 2
 May match empty string
 Subject length lower bound = 0
@ -17076,7 +17076,7 @@ Subject length lower bound = 0
 /(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
 Capture group count = 0
-Max lookbehind = 5
+Max lookbehind = 4
 May match empty string
 Subject length lower bound = 0