Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
2019-09-04 18:14:54 +00:00 · 2019-09-04 18:14:54 +00:00 · 963b570fd0
parent 87bc092222
commit 963b570fd0
9 changed files with 868 additions and 915 deletions
--- a/37
+++ b/37
@@ -66,12 +66,7 @@ is made possessive and applied to an item in parentheses, because a
 parenthesized item may contain multiple branches or other backtracking points,
 for example /(a|ab){1}+c/ or /(a+){1}+a/.

-13. Nested lookbehinds are now taken into account when computing the maximum
-lookbehind value. For example /(?<=a(?<=ba)c)/ previously set a maximum
-lookbehind of 2, because that is the largest individual lookbehind. Now it sets
-it to 3, because matching looks back 3 characters.
-
-14. For partial matches, pcre2test was always showing the maximum lookbehind
+13. For partial matches, pcre2test was always showing the maximum lookbehind
 characters, flagged with "<", which is misleading when the lookbehind didn't
 actually look behind the start (because it was later in the pattern). Showing
 all consulted preceding characters for partial matches is now controlled by the
@@ -79,25 +74,25 @@ existing "allusedtext" modifier and, as for complete matches, this facility is
 available only for non-JIT matching, because JIT does not maintain the first
 and last consulted characters.

-15. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
+14. DFA matching (using pcre2_dfa_match()) was not recognising a partial match
 if the end of the subject was encountered in a lookahead (conditional or
 otherwise), an atomic group, or a recursion.

-16. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.
+15. Give error if pcre2test -t, -T, -tm or -TM is given an argument of zero.

-17. Check for integer overflow when computing lookbehind lengths. Fixes 
+16. Check for integer overflow when computing lookbehind lengths. Fixes 
 Clusterfuzz issue 15636.

-18. Implemented non-atomic positive lookaround assertions.
+17. Implemented non-atomic positive lookaround assertions.

-19. If a lookbehind contained a lookahead that contained another lookbehind
+18. If a lookbehind contained a lookahead that contained another lookbehind
 within it, the nested lookbehind was not correctly processed. For example, if 
 /(?<=(?=(?<=a)))b/ was matched to "ab" it gave no match instead of matching 
 "b".

-20. Implemented pcre2_get_match_data_size().
+19. Implemented pcre2_get_match_data_size().

-21. Two alterations to partial matching (not yet done by JIT):
+20. Two alterations to partial matching (not yet done by JIT):

    (a) The definition of a partial match is slightly changed: if a pattern
    contains any lookbehinds, an empty partial match may be given, because this
@@ -111,29 +106,29 @@ within it, the nested lookbehind was not correctly processed. For example, if
    (c) An empty string partial hard match can be returned for \z and \Z as it
    is documented that they shouldn't match. 
    
-22. A branch that started with (*ACCEPT) was not being recognized as one that
+21. A branch that started with (*ACCEPT) was not being recognized as one that
 could match an empty string. 

-23. Corrected pcre2_set_character_tables() tables data type: was const unsigned
+22. Corrected pcre2_set_character_tables() tables data type: was const unsigned
 char * instead of const uint8_t *, as generated by pcre2_maketables().

-24. Upgraded to Unicode 12.1.0.
+23. Upgraded to Unicode 12.1.0.

-25. Add -jitfast command line option to pcre2test (to make all the jit options 
+24. Add -jitfast command line option to pcre2test (to make all the jit options 
 available directly).

-26. Make pcre2test -C show if libreadline or libedit is supported.
+25. Make pcre2test -C show if libreadline or libedit is supported.

-28. If the length of one branch of a group exceeded 65535 (the maximum value
+26. If the length of one branch of a group exceeded 65535 (the maximum value
 that is remembered as a minimum length), the whole group's length was 
 incorrectly recorded as 65535, leading to incorrect "no match" when start-up 
 optimizations were in force.

-29. The "rightmost consulted character" value was not always correct; in 
+27. The "rightmost consulted character" value was not always correct; in 
 particular, if a pattern ended with a negative lookahead, characters that were 
 inspected in that lookahead were not included.

-30. Add the pcre2_maketables_free() function.
+28. Add the pcre2_maketables_free() function.


 Version 10.33 16-April-2019
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -2272,26 +2272,24 @@ defaulted by the caller of the match function.
 <pre>
  PCRE2_INFO_MAXLOOKBEHIND
 </pre>
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?&#60;=a(?&#60;=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \b and \B require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \A also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 </P>
 <P>
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \b and \B require a one-character lookbehind.
-\A also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \A might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern 
+(?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the 
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 <a href="pcre2partial.html"><b>pcre2partial</b></a>
-documentation.
+documentation for a discussion of multi-segment matching.
 <pre>
  PCRE2_INFO_MINLENGTH
 </pre>
--- a/doc/html/pcre2partial.html
+++ b/doc/html/pcre2partial.html
@@ -244,9 +244,12 @@ there is only a partial match.
 <P>
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same 
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 </P>
 <P>
 Special logic must be implemented to handle a matched substring that spans a
@@ -259,8 +262,7 @@ also be set for all but the first segment.
 When a partial match occurs, the next segment must be added to the current
 subject and the match re-run, using the <i>startoffset</i> argument of
 <b>pcre2_match()</b> to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 <pre>
    re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data&#62; ...the date is 23ja\=ph
@@ -270,50 +272,51 @@ of very long sequences, so the patterns are normally not anchored. For example:
   1: jan
 </pre>
 Note the use of the <b>offset</b> modifier to start the new match where the
-partial match was found.
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the 
+second half is moved to the start of the buffer, and a new segment is added 
+before repeating the match as in the example above. After a no match, the 
+entire buffer can be discarded.
 </P>
 <P>
-In this simple example, the next segment was just added to the one in which the 
-partial match was found. However, if there are memory constraints, it may be 
-necessary to discard text that precedes the partial match before adding the 
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-</P>
-<P>
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-</P>
-<P>
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-</P>
-<P>
-For example, if the pattern "(?&#60;=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
-displays a partial match, it indicates the lookbehind characters with '&#60;'
-characters if the <b>allusedtext</b> modifier is set:
+matching process. When <b>pcre2test</b> displays a partial match, it indicates
+these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
 <pre>
    re&#62; "(?&#60;=123)abc"
  data&#62; xx123ab\=ph,allusedtext
  Partial match: 123ab
                 &#60;&#60;&#60;
 </pre>
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+</P>
+<P>
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+</P>
+<P>
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 </P>
 <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
 <P>
@@ -379,11 +382,11 @@ want.
 </P>
 <P>
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
-match, as described for <b>pcre2_match()</b> above. Another possibility is to
-work with two buffers. If a partial match at offset <i>n</i> in the first buffer
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
-you can then try a new match starting at offset <i>n+1</i> in the first buffer.
+doing it is to retain some or all of the segment and try a new complete match,
+as described for <b>pcre2_match()</b> above. Another possibility is to work with
+two buffers. If a partial match at offset <i>n</i> in the first buffer is
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
+can then try a new match starting at offset <i>n+1</i> in the first buffer.
 </P>
 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
 <P>
@@ -396,7 +399,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
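The two-segment buffer scheme described in the pcre2partial changes above can be sketched as follows. This is only an illustration under stated assumptions: `SEG_SIZE`, `seg_window`, `window_append()` and `window_slide()` are hypothetical names that do not exist in PCRE2, the segment size is tiny for readability, and the sketch assumes that a partial match always starts in the second half of the buffer (that is, sought strings are much shorter than a segment).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8  /* illustrative only; real segments are far larger */

/* Hypothetical window holding up to two segments (not a PCRE2 type). */
typedef struct {
  char   data[2 * SEG_SIZE];
  size_t used;              /* number of valid code units in data */
} seg_window;

/* Append one segment of at most SEG_SIZE code units to the window. */
static void window_append(seg_window *w, const char *seg, size_t len)
{
  assert(len <= SEG_SIZE && w->used + len <= sizeof(w->data));
  memcpy(w->data + w->used, seg, len);
  w->used += len;
}

/* After a partial match: discard the first half of the buffer and move the
   second half to the start. Returns the number of code units discarded, so
   that the caller can subtract it from the offset at which the partial match
   started before re-running the match. If the window holds no more than one
   segment, nothing is discarded. */
static size_t window_slide(seg_window *w)
{
  size_t drop = (w->used > SEG_SIZE) ? SEG_SIZE : 0;
  memmove(w->data, w->data + drop, w->used - drop);
  w->used -= drop;
  return drop;
}
```

A caller would typically call `window_slide()` after a partial match, reduce the saved start offset by the returned value, call `window_append()` with the next segment, and then repeat the match from the adjusted offset. After a "no match" the window can simply be reset by setting `used` to zero.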
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -2234,25 +2234,24 @@ INFORMATION ABOUT A COMPILED PATTERN

         PCRE2_INFO_MAXLOOKBEHIND

-       Return  the  largest  number  of characters (not code units) before the
-       current matching point that could be inspected while processing a look-
-       behind assertion in the pattern. Before release 10.34 this request used
-       to give the largest value for any individual assertion.  Now  it  takes
-       into  account nested lookbehinds, which can mean that the overall value
-       is greater. For example, the pattern (?<=a(?<=ba)c) previously returned
-       2, because that is the length of the largest individual lookbehind. Now
-       it returns 3, because matching actually looks back 3 characters.
+       A  lookbehind  assertion moves back a certain number of characters (not
+       code units) when it starts to process each of its  branches.  This  re-
+       quest  returns  the largest of these backward moves. The third argument
+       should point to a uint32_t integer. The simple assertions \b and \B re-
+       quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
+       return 1 in the absence of anything longer. \A also  registers  a  one-
+       character  lookbehind, though it does not actually inspect the previous
+       character.

-       The third argument should point to a uint32_t integer. This information
-       is  useful when doing multi-segment matching using the partial matching
-       facilities.  Note that the simple assertions \b and \B require  a  one-
-       character  lookbehind.   \A  also registers a one-character lookbehind,
-       though it does not actually inspect the previous character. This is  to
-       ensure  that  at  least  one character from the old segment is retained
-       when a new segment is processed. Otherwise, if there are no lookbehinds
-       in  the pattern, \A might match incorrectly at the start of a second or
-       subsequent segment. There are more details in the pcre2partial documen-
-       tation.
+       Note that this information is useful for multi-segment matching only if
+       the  pattern  contains  no nested lookbehinds. For example, the pattern
+       (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it  is  pro-
+       cessed,  the first lookbehind moves back by two characters, matches one
+       character, then the nested lookbehind also moves back  by  two  charac-
+       ters. This puts the matching point three characters earlier than it was
+       at the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a  de-
+       bugging  tool.  See  the pcre2partial documentation for a discussion of
+       multi-segment matching.

         PCRE2_INFO_MINLENGTH

@@ -5877,10 +5876,13 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()

       PCRE  was  not originally designed with multi-segment matching in mind.
       However, over time, features (including  partial  matching)  that  make
-       multi-segment matching possible have been added. The string is searched
-       segment by segment by calling pcre2_match() repeatedly, with the aim of
-       achieving  the  same results that would happen if the entire string was
-       available for searching.
+       multi-segment matching possible have been added. A very long string can
+       be searched segment by segment  by  calling  pcre2_match()  repeatedly,
+       with the aim of achieving the same results that would happen if the en-
+       tire string was available for searching all  the  time.  Normally,  the
+       strings  that  are  being  sought are much shorter than each individual
+       segment, and are in the middle of very long strings, so the pattern  is
+       normally not anchored.

       Special  logic  must  be implemented to handle a matched substring that
       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
@@ -5891,9 +5893,7 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
       When a partial match occurs, the next segment must be added to the cur-
       rent subject and the match re-run, using the  startoffset  argument  of
       pcre2_match()  to  begin  at the point where the partial match started.
-       Multi-segment  matching is usually used to search for substrings in the
-       middle of very long sequences, so the patterns  are  normally  not  an-
-       chored. For example:
+       For example:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> ...the date is 23ja\=ph
@@ -5903,49 +5903,51 @@ MULTI-SEGMENT MATCHING WITH pcre2_match()
          1: jan

       Note the use of the offset modifier to start the new  match  where  the
-       partial match was found.
+       partial match was found. In this example, the next segment was added to
+       the one in which  the  partial  match  was  found.  This  is  the  most
+       straightforward approach, typically using a memory buffer that is twice
+       the size of each segment. After a partial match, the first half of  the
+       buffer  is discarded, the second half is moved to the start of the buf-
+       fer, and a new segment is added before repeating the match  as  in  the
+       example above. After a no match, the entire buffer can be discarded.

-       In this simple example, the next segment was just added to the  one  in
-       which  the  partial  match was found. However, if there are memory con-
-       straints, it may be necessary to discard text that precedes the partial
-       match before adding the next segment. In cases such as the above, where
-       the pattern does not contain any lookbehinds, it is sufficient  to  re-
-       tain  only  the partially matched substring. However, if a pattern con-
-       tains a lookbehind assertion, characters that precede the start of  the
-       partial match may have been inspected during the matching process.
-
-       The  only lookbehind information that is available is the length of the
-       longest lookbehind in a pattern. This may not, of  course,  be  at  the
-       start  of  the  pattern,  but retaining that many characters before the
-       partial match is sufficient, if not always strictly necessary. The  way
-       to do this is as follows:
-
-       Before doing any matching, find the length of the longest lookbehind in
-       the    pattern    by    calling    pcre2_pattern_info()    with     the
-       PCRE2_INFO_MAXLOOKBEHIND  option.  Note  that the resulting count is in
-       characters, not code units. After a partial match, moving back from the
-       ovector[0]  offset in the subject by the number of characters given for
-       the maximum lookbehind gets you to the earliest character that must  be
-       retained.  In  a  non-UTF  or a 32-bit situation, moving back is just a
-       subtraction, but in UTF-8 or UTF-16 you have to count characters  while
-       moving  back  through  the  code units. Characters before the point you
-       have now reached can be discarded.
-
-       For example, if the pattern "(?<=123)abc" is partially matched  against
-       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
-       mum lookbehind count is 3, so all characters before  offset  2  can  be
-       discarded.  The  value  of  startoffset for the next match should be 3.
-       When pcre2test displays a partial match, it  indicates  the  lookbehind
-       characters with '<' characters if the allusedtext modifier is set:
+       If there are memory constraints, you may want to discard text that pre-
+       cedes a partial match before adding the  next  segment.  Unfortunately,
+       this  is  not  at  present straightforward. In cases such as the above,
+       where the pattern does not contain any lookbehinds, it is sufficient to
+       retain  only  the  partially matched substring. However, if the pattern
+       contains a lookbehind assertion, characters that precede the  start  of
+       the  partial match may have been inspected during the matching process.
+       When pcre2test displays a partial match, it indicates these  characters
+       with '<' if the allusedtext modifier is set:

           re> "(?<=123)abc"
         data> xx123ab\=ph,allusedtext
         Partial match: 123ab
                        <<<

-       Note  that  the allusedtext modifier is not available for JIT matching,
-       because JIT matching does not maintain the  first  and  last  consulted
-       characters.
+       However,  the  allusedtext  modifier is not available for JIT matching,
+       because JIT matching does not record  the  first  (or  last)  consulted
+       characters.  For this reason, this information is not available via the
+       API. It is therefore not possible in general to obtain the exact number
+       of characters that must be retained in order to get the right match re-
+       sult. If you cannot retain the  entire  segment,  you  must  find  some
+       heuristic way of choosing.
+
+       If  you know the approximate length of the matching substrings, you can
+       use that to decide how much text to retain. The only lookbehind  infor-
+       mation  that  is  currently  available via the API is the length of the
+       longest individual lookbehind in a pattern, but this can be  misleading
+       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
+       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
+       maximum number of characters (not code units) that any individual look-
+       behind  moves  back  when  it  is  processed.   A   pattern   such   as
+       "(?<=(?<!b)a)"  has a maximum lookbehind value of one, but inspects two
+       characters before its starting point.
+
+       In a non-UTF or a 32-bit case, moving back is just a  subtraction,  but
+       in  UTF-8  or  UTF-16  you  have  to count characters while moving back
+       through the code units.


 PARTIAL MATCHING USING pcre2_dfa_match()
@@ -6012,12 +6014,12 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
       plication, this may or may not be what you want.

       If you do want to allow for starting again at the next  character,  one
-       way of doing it is to retain the matched part of the segment and try  a
-       new  complete match, as described for pcre2_match() above. Another pos-
-       sibility is to work with two buffers. If a partial match at offset n in
-       the  first  buffer  is followed by "no match" when PCRE2_DFA_RESTART is
-       used on the second buffer, you can then try a  new  match  starting  at
-       offset n+1 in the first buffer.
+       way  of  doing it is to retain some or all of the segment and try a new
+       complete match, as described for pcre2_match() above. Another possibil-
+       ity  is to work with two buffers. If a partial match at offset n in the
+       first buffer is followed by "no match" when PCRE2_DFA_RESTART  is  used
+       on  the  second buffer, you can then try a new match starting at offset
+       n+1 in the first buffer.


 AUTHOR
@@ -6029,7 +6031,7 @@ AUTHOR

 REVISION

-       Last updated: 07 August 2019
+       Last updated: 04 September 2019
       Copyright (c) 1997-2019 University of Cambridge.
 ------------------------------------------------------------------------------
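The closing remark of the multi-segment section, that in UTF-8 you have to count characters while moving back through the code units, can be illustrated with a small helper. This is a minimal sketch, not PCRE2 code: `utf8_move_back()` is a hypothetical name, and the function assumes that the subject is valid UTF-8 and that `pos` lies on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PCRE2 API). Starting at code unit
   offset pos in a valid UTF-8 subject, move back by up to nchars characters
   and return the resulting code unit offset. The walk stops early if the
   start of the subject is reached. */
static size_t utf8_move_back(const uint8_t *subject, size_t pos,
  uint32_t nchars)
{
  assert(subject != NULL);
  while (nchars > 0 && pos > 0)
    {
    pos--;                      /* step onto the previous code unit */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
      pos--;                    /* skip continuation bytes (10xxxxxx) */
    nchars--;
    }
  return pos;
}
```

In a non-UTF or 32-bit case the same operation is just a subtraction clamped at zero. As the revised documentation points out, a character count taken from PCRE2_INFO_MAXLOOKBEHIND may underestimate what must be retained when lookbehinds are nested, so any count passed to a helper like this one is a heuristic.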
 
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -2232,27 +2232,25 @@ defaulted by the caller of the match function.
 .sp
  PCRE2_INFO_MAXLOOKBEHIND
 .sp
-Return the largest number of characters (not code units) before the current
-matching point that could be inspected while processing a lookbehind assertion
-in the pattern. Before release 10.34 this request used to give the largest
-value for any individual assertion. Now it takes into account nested
-lookbehinds, which can mean that the overall value is greater. For example, the
-pattern (?<=a(?<=ba)c) previously returned 2, because that is the length of the
-largest individual lookbehind. Now it returns 3, because matching actually
-looks back 3 characters.
+A lookbehind assertion moves back a certain number of characters (not code
+units) when it starts to process each of its branches. This request returns the
+largest of these backward moves. The third argument should point to a uint32_t
+integer. The simple assertions \eb and \eB require a one-character lookbehind
+and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
+longer. \eA also registers a one-character lookbehind, though it does not
+actually inspect the previous character.
 .P
-The third argument should point to a uint32_t integer. This information is
-useful when doing multi-segment matching using the partial matching facilities.
-Note that the simple assertions \eb and \eB require a one-character lookbehind.
-\eA also registers a one-character lookbehind, though it does not actually
-inspect the previous character. This is to ensure that at least one character
-from the old segment is retained when a new segment is processed. Otherwise, if
-there are no lookbehinds in the pattern, \eA might match incorrectly at the
-start of a second or subsequent segment. There are more details in the
+Note that this information is useful for multi-segment matching only
+if the pattern contains no nested lookbehinds. For example, the pattern 
+(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the 
+first lookbehind moves back by two characters, matches one character, then the
+nested lookbehind also moves back by two characters. This puts the matching
+point three characters earlier than it was at the start.
+PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
 .\" HREF
 \fBpcre2partial\fP
 .\"
-documentation.
+documentation for a discussion of multi-segment matching.
 .sp
  PCRE2_INFO_MINLENGTH
 .sp
--- a/doc/pcre2partial.3
+++ b/doc/pcre2partial.3
@@ -1,4 +1,4 @@
-.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "04 September 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE2"
@@ -213,9 +213,12 @@ there is only a partial match.
 .sp
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same 
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling \fBpcre2_match()\fP repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 .P
 Special logic must be implemented to handle a matched substring that spans a
 segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
@@ -226,8 +229,7 @@ also be set for all but the first segment.
 When a partial match occurs, the next segment must be added to the current
 subject and the match re-run, using the \fIstartoffset\fP argument of
 \fBpcre2_match()\fP to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 .sp
    re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
  data> ...the date is 23ja\e=ph
@@ -237,47 +239,48 @@ of very long sequences, so the patterns are normally not anchored. For example:
   1: jan
 .sp
 Note the use of the \fBoffset\fP modifier to start the new match where the
-partial match was found.
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the 
+second half is moved to the start of the buffer, and a new segment is added 
+before repeating the match as in the example above. After a no match, the 
+entire buffer can be discarded.
 .P
-In this simple example, the next segment was just added to the one in which the 
-partial match was found. However, if there are memory constraints, it may be 
-necessary to discard text that precedes the partial match before adding the 
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-.P
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-.P
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-.P
-For example, if the pattern "(?<=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
-displays a partial match, it indicates the lookbehind characters with '<'
-characters if the \fBallusedtext\fP modifier is set:
+matching process. When \fBpcre2test\fP displays a partial match, it indicates
+these characters with '<' if the \fBallusedtext\fP modifier is set:
 .sp
    re> "(?<=123)abc"
  data> xx123ab\e=ph,allusedtext
  Partial match: 123ab
                 <<<
 .sp
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
-.
+However, the \fBallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing how much to keep.
+.P
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling \fBpcre2_pattern_info()\fP with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+.P
+In a non-UTF or a 32-bit case, moving back a given number of characters from
+the start of a partial match is just a subtraction, but in UTF-8 or UTF-16 you
+have to count characters while moving back through the code units.
 .
 .
 .SH "PARTIAL MATCHING USING pcre2_dfa_match()"
@ -344,11 +347,11 @@ are remembered. Depending on the application, this may or may not be what you
 want.
 .P
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
-match, as described for \fBpcre2_match()\fP above. Another possibility is to
-work with two buffers. If a partial match at offset \fIn\fP in the first buffer
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
-you can then try a new match starting at offset \fIn+1\fP in the first buffer.
+doing it is to retain some or all of the segment and try a new complete match,
+as described for \fBpcre2_match()\fP above. Another possibility is to work with
+two buffers. If a partial match at offset \fIn\fP in the first buffer is
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
+can then try a new match starting at offset \fIn+1\fP in the first buffer.
 .
 .
 .SH AUTHOR
@ -365,6 +368,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
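The revised documentation above notes that in UTF-8 you have to count characters while moving back through the code units. A minimal sketch of that step follows, assuming a valid UTF-8 subject. The function name utf8_move_back and its parameters are hypothetical and not part of the PCRE2 API. How many characters to move back remains a heuristic choice, because the PCRE2_INFO_MAXLOOKBEHIND value can underestimate what nested lookbehinds inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Starting at code unit offset
"offset" in a valid UTF-8 subject, move back by up to "nchars" characters and
return the resulting code unit offset. The result never goes below zero. */

size_t
utf8_move_back(const uint8_t *subject, size_t offset, size_t nchars)
{
assert(subject != NULL);
while (nchars > 0 && offset > 0)
  {
  offset--;
  /* Continuation bytes have the form 10xxxxxx; skip them to reach the
  first byte of the character. */
  while (offset > 0 && (subject[offset] & 0xc0) == 0x80) offset--;
  nchars--;
  }
return offset;
}
```

For a partial match, the starting offset would typically be ovector[0], and text before the returned offset is the candidate for discarding. In a non-UTF or 32-bit case the same step reduces to a bounded subtraction.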
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -128,12 +128,12 @@ static int
    compile_block *, PCRE2_SIZE *);

 static int
-  get_branchlength(uint32_t **, int *, int *, int *, parsed_recurse_check *,
+  get_branchlength(uint32_t **, int *, int *, parsed_recurse_check *,
    compile_block *);

 static BOOL
-  set_lookbehind_lengths(uint32_t **, int *, int *, int *,
-    parsed_recurse_check *, compile_block *);
+  set_lookbehind_lengths(uint32_t **, int *, int *, parsed_recurse_check *, 
+    compile_block *);

 static int
  check_lookbehinds(uint32_t *, uint32_t **, parsed_recurse_check *,
@ -398,9 +398,6 @@ compiler is clever with identical subexpressions. */
 #define GI_SET_FIXED_LENGTH    0x80000000u
 #define GI_NOT_FIXED_LENGTH    0x40000000u
 #define GI_FIXED_LENGTH_MASK   0x0000ffffu
-#define GI_EXTRA_MASK          0x0fff0000u
-#define GI_EXTRA_MAX                 0xfff  /* NB not unsigned */
-#define GI_EXTRA_SHIFT                  16

 /* This simple test for a decimal digit works for both ASCII/Unicode and EBCDIC
 and is fast (a good compiler can turn it into a subtraction and unsigned
@ -8897,7 +8894,6 @@ improve processing speed when the same capturing group occurs many times.
 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
  isinline    FALSE if a reference or recursion; TRUE for inline group
-  extraptr    pointer to where to return extra lookbehind length
  errcodeptr  pointer to the errorcode
  lcptr       pointer to the loop counter
  group       number of captured group or -1 for a non-capturing group
@ -8908,13 +8904,11 @@ Returns:      the group length or a negative number
 */

 static int
-get_grouplength(uint32_t **pptrptr, BOOL isinline, int *extraptr,
-  int *errcodeptr, int *lcptr, int group, parsed_recurse_check *recurses,
-  compile_block *cb)
+get_grouplength(uint32_t **pptrptr, BOOL isinline, int *errcodeptr, int *lcptr,
+   int group, parsed_recurse_check *recurses, compile_block *cb)
 {
 int branchlength;
 int grouplength = -1;
-int extra = 0;

 /* The cache can be used only if there is no possibility of there being two
 groups with the same number. We do not need to set the end pointer for a group
@ -8928,7 +8922,6 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)
  if ((groupinfo & GI_SET_FIXED_LENGTH) != 0)
    {
    if (isinline) *pptrptr = parsed_skip(*pptrptr, PSKIP_KET);
-    *extraptr = (groupinfo & GI_EXTRA_MASK) >> GI_EXTRA_SHIFT;
    return groupinfo & GI_FIXED_LENGTH_MASK;
    }
  }
@ -8937,28 +8930,16 @@ if (group > 0 && (cb->external_flags & PCRE2_DUPCAPUSED) == 0)

 for(;;)
  {
-  int branchextra;
-  branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
-    recurses, cb);
+  branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
  if (branchlength < 0) goto ISNOTFIXED;
-  if (grouplength == -1)
-    {
-    grouplength = branchlength;
-    extra = branchextra;
-    }
-  else if (grouplength != branchlength || extra != branchextra) goto ISNOTFIXED;
+  if (grouplength == -1) grouplength = branchlength;
+    else if (grouplength != branchlength) goto ISNOTFIXED;
  if (**pptrptr == META_KET) break;
  *pptrptr += 1;   /* Skip META_ALT */
  }

-/* There are only 12 bits for caching the extra value, but a pattern that
-needs more than that is weird indeed. */
-
-if (group > 0 && extra <= GI_EXTRA_MAX)
-  cb->groupinfo[group] |= (uint32_t)
-    (GI_SET_FIXED_LENGTH | (extra << GI_EXTRA_SHIFT) | grouplength);
-
-*extraptr = extra;
+if (group > 0) 
+  cb->groupinfo[group] |= (uint32_t)(GI_SET_FIXED_LENGTH | grouplength);
 return grouplength;

 ISNOTFIXED:
@ -8973,17 +8954,11 @@ return -1;
 *************************************************/

 /* Return a fixed length for a branch in a lookbehind, giving an error if the
-length is not fixed. We also take note of any extra value that is generated
-from a nested lookbehind. For example, for /(?<=a(?<=ba)c)/ each individual
-lookbehind has length 2, but the max_lookbehind setting must be 3 because
-matching inspects 3 characters before the match starting point.
-
-On entry, *pptrptr points to the first element inside the branch. On exit it is
-set to point to the ALT or KET.
+length is not fixed. On entry, *pptrptr points to the first element inside the
+branch. On exit it is set to point to the ALT or KET.

 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
-  extraptr    pointer to where to return extra lookbehind length
  errcodeptr  pointer to error code
  lcptr       pointer to loop counter
  recurses    chain of recurse_check to catch mutual recursion
@ -8993,14 +8968,11 @@ Returns:      the length, or a negative value on error
 */

 static int
-get_branchlength(uint32_t **pptrptr, int *extraptr, int *errcodeptr, int *lcptr,
+get_branchlength(uint32_t **pptrptr, int *errcodeptr, int *lcptr,
  parsed_recurse_check *recurses, compile_block *cb)
 {
 int branchlength = 0;
 int grouplength;
-int groupextra;
-int max;
-int extra = 0;   /* Additional lookbehind from nesting */
 uint32_t lastitemlength = 0;
 uint32_t *pptr = *pptrptr;
 PCRE2_SIZE offset;
@ -9149,17 +9121,13 @@ for (;; pptr++)
    break;

    /* A nested lookbehind does not contribute any length to this lookbehind,
-    but must itself be checked and have its lengths set. If the maximum
-    lookbehind for the nested lookbehind is greater than the length so far
-    computed for this branch, we must compute an extra value and keep the
-    largest encountered for use when setting the maximum overall lookbehind. */
+    but must itself be checked and have its lengths set. */

    case META_LOOKBEHIND:
    case META_LOOKBEHINDNOT:
    case META_LOOKBEHIND_NA:
-    if (!set_lookbehind_lengths(&pptr, &max, errcodeptr, lcptr, recurses, cb))
+    if (!set_lookbehind_lengths(&pptr, errcodeptr, lcptr, recurses, cb))
      return -1;
-    if (max - branchlength > extra) extra = max - branchlength;
    break;

    /* Back references and recursions are handled by very similar code. At this
@ -9267,14 +9235,15 @@ for (;; pptr++)
    in the cache. */

    gptr++;
-    grouplength = get_grouplength(&gptr, FALSE, &groupextra, errcodeptr, lcptr,
-      group, &this_recurse, cb);
+    grouplength = get_grouplength(&gptr, FALSE, errcodeptr, lcptr, group, 
+      &this_recurse, cb);
    if (grouplength < 0)
      {
      if (*errcodeptr == 0) goto ISNOTFIXED;
      return -1;  /* Error already set */
      }
-    goto OK_GROUP;
+    itemlength = grouplength;
+    break;

    /* Check nested groups - advance past the initial data for each type and
    then seek a fixed length with get_grouplength(). */
@ -9304,16 +9273,10 @@ for (;; pptr++)
    case META_SCRIPT_RUN:
    pptr++;
    CHECK_GROUP:
-    grouplength = get_grouplength(&pptr, TRUE, &groupextra, errcodeptr, lcptr,
-      group, recurses, cb);
+    grouplength = get_grouplength(&pptr, TRUE, errcodeptr, lcptr, group, 
+      recurses, cb);
    if (grouplength < 0) return -1;
-
-    /* A nested lookbehind within the group may require looking back further
-    than the length of the group. */
-
-    OK_GROUP:
    itemlength = grouplength;
-    if (groupextra - branchlength > extra) extra = groupextra - branchlength;
    break;

    /* Exact repetition is OK; variable repetition is not. A repetition of zero
@ -9374,7 +9337,6 @@ for (;; pptr++)

 EXIT:
 *pptrptr = pptr;
-*extraptr = extra;
 return branchlength;

 PARSED_SKIP_FAILED:
@ -9400,7 +9362,6 @@ get_branchlength() as an "extra" value.

 Arguments:
  pptrptr     pointer to pointer in the parsed pattern
-  maxptr      where to return maximum lookbehind for the whole group
  errcodeptr  pointer to error code
  lcptr       pointer to loop counter
  recurses    chain of recurse_check to catch mutual recursion
@ -9411,13 +9372,11 @@ Returns:      TRUE if all is well
 */

 static BOOL
-set_lookbehind_lengths(uint32_t **pptrptr, int *maxptr, int *errcodeptr,
-  int *lcptr, parsed_recurse_check *recurses, compile_block *cb)
+set_lookbehind_lengths(uint32_t **pptrptr, int *errcodeptr, int *lcptr, 
+  parsed_recurse_check *recurses, compile_block *cb)
 {
 PCRE2_SIZE offset;
 int branchlength;
-int branchextra;
-int max = 0;
 uint32_t *bptr = *pptrptr;

 READPLUSOFFSET(offset, bptr);  /* Offset for error messages */
@ -9426,8 +9385,7 @@ READPLUSOFFSET(offset, bptr);  /* Offset for error messages */
 do
  {
  *pptrptr += 1;
-  branchlength = get_branchlength(pptrptr, &branchextra, errcodeptr, lcptr,
-    recurses, cb);
+  branchlength = get_branchlength(pptrptr, errcodeptr, lcptr, recurses, cb);
  if (branchlength < 0)
    {
    /* The errorcode and offset may already be set from a nested lookbehind. */
@ -9435,14 +9393,12 @@ do
    if (cb->erroroffset == PCRE2_UNSET) cb->erroroffset = offset;
    return FALSE;
    }
-  if (branchlength + branchextra > max) max = branchlength + branchextra;
+  if (branchlength > cb->max_lookbehind) cb->max_lookbehind = branchlength;
  *bptr |= branchlength;  /* branchlength never more than 65535 */
  bptr = *pptrptr;
  }
 while (*bptr == META_ALT);

-if (max > cb->max_lookbehind) cb->max_lookbehind = max;
-*maxptr = max;
 return TRUE;
 }

@ -9475,7 +9431,6 @@ static int
 check_lookbehinds(uint32_t *pptr, uint32_t **retptr,
  parsed_recurse_check *recurses, compile_block *cb)
 {
-int max;
 int errorcode = 0;
 int loopcount = 0;
 int nestlevel = 0;
@ -9599,8 +9554,7 @@ for (; *pptr != META_END; pptr++)
    case META_LOOKBEHIND:
    case META_LOOKBEHINDNOT:
    case META_LOOKBEHIND_NA:
-    if (!set_lookbehind_lengths(&pptr, &max, &errorcode, &loopcount,
-         recurses, cb))
+    if (!set_lookbehind_lengths(&pptr, &errorcode, &loopcount, recurses, cb))
      return errorcode;
    break;
    }
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@ -304,7 +304,7 @@ Partial match, mark=xx: 123a

 /(?<=(?<=a)b)c.*/I
 Capture group count = 0
-Max lookbehind = 2
+Max lookbehind = 1
 First code unit = 'c'
 Subject length lower bound = 1
    abc\=ph
@ -337,7 +337,7 @@ Partial match: abcd

 /(?<=(?<=(?<=a)b)c)./I
 Capture group count = 0
-Max lookbehind = 3
+Max lookbehind = 1
 Subject length lower bound = 1
    123abcXYZ
 0: abcX
@ -354,7 +354,7 @@ Subject length lower bound = 1

 /(?<=ab((?<=...)cd))./I
 Capture group count = 1
-Max lookbehind = 5
+Max lookbehind = 4
 Subject length lower bound = 1
    ZabcdX
 0: ZabcdX
@ -363,7 +363,7 @@ Subject length lower bound = 1

 /(?<=((?<=(?<=ab).))(?1)(?1))./I
 Capture group count = 1
-Max lookbehind = 3
+Max lookbehind = 2
 Subject length lower bound = 1
    abxZ
 0: abxZ
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@ -17036,7 +17036,7 @@ Subject length lower bound = 1

 /(?<=(?<=a)b)c.*/I
 Capture group count = 0
-Max lookbehind = 2
+Max lookbehind = 1
 First code unit = 'c'
 Subject length lower bound = 1
    abc\=ph
@ -17064,7 +17064,7 @@ Subject length lower bound = 0

 /(?<=a(?<=a|ba)c)/I
 Capture group count = 0
-Max lookbehind = 3
+Max lookbehind = 2
 May match empty string
 Subject length lower bound = 0

@ -17076,7 +17076,7 @@ Subject length lower bound = 0

 /(?<=(?<=a)b)(?<!abcd)(?<=(?<=a)bcde)/I
 Capture group count = 0
-Max lookbehind = 5
+Max lookbehind = 4
 May match empty string
 Subject length lower bound = 0