diff --git a/HACKING b/HACKING index 53a0080..c51da7b 100644 --- a/HACKING +++ b/HACKING @@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original library is referred to as PCRE1 below. For information about testing PCRE2, see the pcre2test documentation and the comment at the head of the RunTest file. -PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix +PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid confusion with PCRE1. @@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several dozen lines of messy code were eliminated, though the new pre-pass was not short. In particular, parsing and skipping over [] classes is complicated. -While working on 10.22 I realized that I could simplify yet again by moving +While working on 10.22 I realized that I could simplify yet again by moving more of the parsing into the pre-pass, thus avoiding doing it in two places, so after 10.22 was released, the code underwent yet another big refactoring. This is how it is from 10.23 onwards: -The function called parse_regex() scans the pattern characters, parsing them -into literal data and meta characters. It converts escapes such as \x{123} -into literals, handles \Q...\E, and skips over comments and non-significant -white space. The result of the scanning is put into a vector of 32-bit unsigned -integers. Values less than 0x80000000 are literal data. Higher values represent +The function called parse_regex() scans the pattern characters, parsing them +into literal data and meta characters. It converts escapes such as \x{123} +into literals, handles \Q...\E, and skips over comments and non-significant +white space. The result of the scanning is put into a vector of 32-bit unsigned +integers. Values less than 0x80000000 are literal data. Higher values represent meta-characters. 
The top 16-bits of such values identify the meta-character, and these are given names such as META_CAPTURE. The lower 16-bits are available -for data, for example, the capturing group number. The only situation in which -literal data values greater than 0x7fffffff can appear is when the 32-bit -library is running in non-UTF mode. This is handled by having a special +for data, for example, the capturing group number. The only situation in which +literal data values greater than 0x7fffffff can appear is when the 32-bit +library is running in non-UTF mode. This is handled by having a special meta-character that is followed by the 32-bit data value. The size of the parsed pattern vector, when auto-callouts are not enabled, is -bounded by the length of the pattern (with one exception). The code is written -so that each item in the pattern uses no more vector elements than the number +bounded by the length of the pattern (with one exception). The code is written +so that each item in the pattern uses no more vector elements than the number of code units in the item itself. The exception is the aforementioned large 32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in advance to check for such values. When auto-callouts are enabled, the generous assumption is made that there will be a callout for each pattern code unit -(which of course is only actually true if all code units are literals) plus one +(which of course is only actually true if all code units are literals) plus one at the end. There is a default parsed pattern vector on the stack, but if this is not big enough, heap memory is used. -As before, the actual compiling function is run twice, the first time to -determine the amount of memory needed for the final compiled pattern. It +As before, the actual compiling function is run twice, the first time to +determine the amount of memory needed for the final compiled pattern. 
It now processes the parsed pattern vector, not the pattern itself, although some of the parsed items refer to strings in the pattern - for example, group -names. As escapes and comments have already been processed, the code is a bit +names. As escapes and comments have already been processed, the code is a bit simpler than before. Most errors can be diagnosed during the parsing scan. For those that cannot @@ -168,64 +168,67 @@ identify where errors occur. The elements of the parsed pattern vector ----------------------------------------- -The word "offset" below means a code unit offset into the pattern. When +The word "offset" below means a code unit offset into the pattern. When PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is stored in a single parsed pattern element. Otherwise (typically on 64-bit systems) it occupies two elements. The following meta items occupy just one element, with no data: META_ACCEPT (*ACCEPT) -META_ALT | alternation -META_ASTERISK * -META_ASTERISK_PLUS *+ -META_ASTERISK_QUERY *? -META_ATOMIC (?> start of atomic group -META_CIRCUMFLEX ^ metacharacter -META_CLASS [ start of non-empty class -META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS +META_ASTERISK * +META_ASTERISK_PLUS *+ +META_ASTERISK_QUERY *? +META_ATOMIC (?> start of atomic group +META_CIRCUMFLEX ^ metacharacter +META_CLASS [ start of non-empty class +META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS META_CLASS_EMPTY_NOT [^] negative empty class - ditto -META_CLASS_END ] end of non-empty class -META_CLASS_NOT [^ start non-empty negative class +META_CLASS_END ] end of non-empty class +META_CLASS_NOT [^ start non-empty negative class META_COMMIT (*COMMIT) -META_DOLLAR $ metacharacter -META_DOT . metacharacter +META_DOLLAR $ metacharacter +META_DOT . 
metacharacter META_END End of pattern (this value is 0x80000000) META_FAIL (*FAIL) -META_KET ) closing parenthesis +META_KET ) closing parenthesis META_LOOKAHEAD (?= start of lookahead META_LOOKAHEADNOT (?! start of negative lookahead -META_NOCAPTURE (?: no capture parens -META_PLUS + -META_PLUS_PLUS ++ -META_PLUS_QUERY +? +META_NOCAPTURE (?: no capture parens +META_PLUS + +META_PLUS_PLUS ++ +META_PLUS_QUERY +? META_PRUNE (*PRUNE) - no argument -META_QUERY ? -META_QUERY_PLUS ?+ -META_QUERY_QUERY ?? -META_RANGE_ESCAPED hyphen in class range with at least one escape -META_RANGE_LITERAL hyphen in class range defined literally +META_QUERY ? +META_QUERY_PLUS ?+ +META_QUERY_QUERY ?? +META_RANGE_ESCAPED hyphen in class range with at least one escape +META_RANGE_LITERAL hyphen in class range defined literally META_SKIP (*SKIP) - no argument META_THEN (*THEN) - no argument -The two RANGE values occur only in character classes. They are positioned -between two literals that define the start and end of the range. In an EBCDIC -evironment it is necessary to know whether either of the range values was -specified as an escape. In an ASCII/Unicode environment the distinction is not +The two RANGE values occur only in character classes. They are positioned +between two literals that define the start and end of the range. In an EBCDIC +evironment it is necessary to know whether either of the range values was +specified as an escape. In an ASCII/Unicode environment the distinction is not relevant. -The following have data in the lower 16 bits, and may be followed by other data +The following have data in the lower 16 bits, and may be followed by other data elements: +META_ALT | alternation META_BACKREF META_CAPTURE META_ESCAPE META_RECURSE +If the data for META_ALT is non-zero, it is inside a lookbehind, and the data +is the length of its branch, for which OP_REVERSE must be generated. 
+ META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as their data in the lower 16 bits of the element. META_BACKREF is followed by an offset if the back reference group number is 10 -or more. The offsets of the first ocurrences of references to groups whose +or more. The offsets of the first ocurrences of references to groups whose numbers are less than 10 are put in cb->small_ref_offset[] (only the first occurrence is useful). On 64-bit systems this avoids using more than two parsed pattern elements for items such as \3. The offset is used when an error is @@ -241,7 +244,7 @@ and an offset into the pattern to specify the name. The following have one data item that follows in the next vector element: -META_BIGVALUE Next is a literal >= META_END +META_BIGVALUE Next is a literal >= META_END META_OPTIONS (?i) and friends (data is new option bits) META_POSIX POSIX class item (data identifies the class) META_POSIX_NEG negative POSIX class item (ditto) @@ -249,19 +252,19 @@ META_POSIX_NEG negative POSIX class item (ditto) The following are followed by a length element, then a number of character code values (which should match with the length): -META_MARK (*MARK:xxxx) +META_MARK (*MARK:xxxx) META_PRUNE_ARG (*PRUNE:xxx) META_SKIP_ARG (*SKIP:xxxx) META_THEN_ARG (*THEN:xxxx) -The following are followed by a length element, then an offset in the pattern +The following are followed by a length element, then an offset in the pattern that identifies the name: -META_COND_NAME (?() or (?('name') or (?(name) -META_COND_RNAME (?(R&name) +META_COND_NAME (?() or (?('name') or (?(name) +META_COND_RNAME (?(R&name) META_COND_RNUMBER (?(Rdigits) -META_RECURSE_BYNAME (?&name) -META_BACKREF_BYNAME \k'name' +META_RECURSE_BYNAME (?&name) +META_BACKREF_BYNAME \k'name' META_COND_RNUMBER is used for names that start with R and continue with digits, because this is an ambiguous case. 
It could be a back reference to a group with @@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group. This one is followed by an offset, for use in error messages, then a number: -META_COND_NUMBER (?([+-]digits) +META_COND_NUMBER (?([+-]digits) The following are followed just by an offset, for use in error messages: META_COND_ASSERT (?(?assertion) META_COND_DEFINE (?(DEFINE) -META_LOOKBEHIND (?<= -META_LOOKBEHINDNOT (?' and 1 for '>='; the next two are the major and minor numbers: @@ -297,11 +305,11 @@ META_COND_VERSION (?(VERSIONx.y) Callouts are converted into one of two items: -META_CALLOUT_NUMBER (?C with numerical argument -META_CALLOUT_STRING (?C with string argument +META_CALLOUT_NUMBER (?C with numerical argument +META_CALLOUT_STRING (?C with string argument -In both cases, the next two elements contain the offset and length of the next -item in the pattern. Then there is either one callout number, or a length and +In both cases, the next two elements contain the offset and length of the next +item in the pattern. Then there is either one callout number, or a length and an offset for the string argument. The length includes both delimiters. @@ -410,11 +418,11 @@ These items are all just one unit long OP_THEN ) OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion. -This ends the assertion, not the entire pattern match. The assertion (?!) is +This ends the assertion, not the entire pattern match. The assertion (?!) is always optimized to OP_FAIL. OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in -non-UTF modes and in UTF-32 mode (since one code unit still equals one +non-UTF modes and in UTF-32 mode (since one code unit still equals one character). Another use is for [^] when empty classes are permitted (PCRE2_ALLOW_EMPTY_CLASS is set). @@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual callout at this point. 
Only assertion conditions may have callouts preceding the condition. -A condition that is the negative assertion (?!) is optimized to OP_FAIL in all -parts of the pattern, so this is another opcode that may appear as a condition. +A condition that is the negative assertion (?!) is optimized to OP_FAIL in all +parts of the pattern, so this is another opcode that may appear as a condition. It is treated the same as OP_FALSE.
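Reviewer note on the parsed pattern vector encoding touched by this patch: the text says that values below 0x80000000 are literal data, that the top 16 bits of higher values identify the meta-character, and that the lower 16 bits carry data such as a capturing group number. The following is a minimal sketch of that layout, not the code in pcre2_compile.c. All names prefixed with ex_ or EX_ are hypothetical, and the value given to EX_META_CAPTURE is made up for illustration. Only 0x80000000 for the end-of-pattern item comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the text: META_END has the value 0x80000000, and anything below
   that value is literal data. */
#define EX_META_END      0x80000000u

/* Assumption: an arbitrary illustrative code. The real numbering of the
   META_ values is defined in the PCRE2 source and may differ. */
#define EX_META_CAPTURE  0x80080000u

/* True if a parsed pattern element is a meta item rather than a literal. */
static inline bool ex_is_meta(uint32_t element)
{
  return element >= EX_META_END;
}

/* The top 16 bits identify which meta-character this is. */
static inline uint32_t ex_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits hold the data, for example a group number. */
static inline uint32_t ex_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}

/* Combine a meta code with 16 bits of data into one element. */
static inline uint32_t ex_make_meta(uint32_t code, uint32_t data)
{
  return code | (data & 0x0000ffffu);
}
```

Under these assumptions, ex_make_meta(EX_META_CAPTURE, 3) yields an element for which ex_is_meta() is true, ex_meta_code() returns EX_META_CAPTURE, and ex_meta_data() returns 3. The same split is what lets this patch store a lookbehind branch length in the data bits of META_ALT. A literal of 0x80000000 or above cannot be told apart from a meta item by this test, which is why the text describes a separate META_BIGVALUE item followed by the 32-bit value.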