Fix error in documentation.
parent 8eae402315
commit e036cda7ea
146
HACKING
@@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original
 library is referred to as PCRE1 below. For information about testing PCRE2, see
 the pcre2test documentation and the comment at the head of the RunTest file.
 
-PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
 releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
 confusion with PCRE1.
 
@@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
 short. In particular, parsing and skipping over [] classes is complicated.
 
-While working on 10.22 I realized that I could simplify yet again by moving
+While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
 after 10.22 was released, the code underwent yet another big refactoring. This
 is how it is from 10.23 onwards:
 
-The function called parse_regex() scans the pattern characters, parsing them
-into literal data and meta characters. It converts escapes such as \x{123}
-into literals, handles \Q...\E, and skips over comments and non-significant
-white space. The result of the scanning is put into a vector of 32-bit unsigned
-integers. Values less than 0x80000000 are literal data. Higher values represent
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
 meta-characters. The top 16-bits of such values identify the meta-character,
 and these are given names such as META_CAPTURE. The lower 16-bits are available
-for data, for example, the capturing group number. The only situation in which
-literal data values greater than 0x7fffffff can appear is when the 32-bit
-library is running in non-UTF mode. This is handled by having a special
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
 meta-character that is followed by the 32-bit data value.
 
 The size of the parsed pattern vector, when auto-callouts are not enabled, is
-bounded by the length of the pattern (with one exception). The code is written
-so that each item in the pattern uses no more vector elements than the number
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
 of code units in the item itself. The exception is the aforementioned large
 32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
-(which of course is only actually true if all code units are literals) plus one
+(which of course is only actually true if all code units are literals) plus one
 at the end. There is a default parsed pattern vector on the stack, but if this
 is not big enough, heap memory is used.
 
-As before, the actual compiling function is run twice, the first time to
-determine the amount of memory needed for the final compiled pattern. It
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
 now processes the parsed pattern vector, not the pattern itself, although some
 of the parsed items refer to strings in the pattern - for example, group
-names. As escapes and comments have already been processed, the code is a bit
+names. As escapes and comments have already been processed, the code is a bit
 simpler than before.
 
 Most errors can be diagnosed during the parsing scan. For those that cannot
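As a reading aid for the hunk above, here is a minimal C sketch of the element encoding it describes: values below 0x80000000 are literals, the top 16 bits of a meta item identify it, and the lower 16 bits carry data. The helper names and the numeric value given to META_CAPTURE are assumptions made for illustration; only the value of META_END (0x80000000) is stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define META_END      0x80000000u   /* value stated in the text */
#define META_CAPTURE  0x80080000u   /* assumed value, for illustration only */

/* Values below 0x80000000 are literal data; anything else is a meta item. */
static int is_meta(uint32_t element) { return element >= META_END; }

/* The top 16 bits identify the meta-character. */
static uint32_t meta_code(uint32_t element) { return element & 0xffff0000u; }

/* The lower 16 bits carry data such as a capturing group number. */
static uint32_t meta_data(uint32_t element) { return element & 0x0000ffffu; }

/* Combine a meta code with up to 16 bits of data (hypothetical helper). */
static uint32_t make_meta(uint32_t code, uint32_t data)
{
  assert(code >= META_END && (code & 0x0000ffffu) == 0);
  assert(data <= 0xffffu);
  return code | data;
}
```

With these helpers, make_meta(META_CAPTURE, 3) would stand for the start of capturing group 3, and a consumer can split it again with meta_code() and meta_data().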
@@ -168,64 +168,67 @@ identify where errors occur.
 The elements of the parsed pattern vector
 -----------------------------------------
 
-The word "offset" below means a code unit offset into the pattern. When
+The word "offset" below means a code unit offset into the pattern. When
 PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
 stored in a single parsed pattern element. Otherwise (typically on 64-bit
 systems) it occupies two elements. The following meta items occupy just one
 element, with no data:
 
 META_ACCEPT           (*ACCEPT)
-META_ALT              | alternation
-META_ASTERISK         *
-META_ASTERISK_PLUS    *+
-META_ASTERISK_QUERY   *?
-META_ATOMIC           (?> start of atomic group
-META_CIRCUMFLEX       ^ metacharacter
-META_CLASS            [ start of non-empty class
-META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
+META_ASTERISK         *
+META_ASTERISK_PLUS    *+
+META_ASTERISK_QUERY   *?
+META_ATOMIC           (?> start of atomic group
+META_CIRCUMFLEX       ^ metacharacter
+META_CLASS            [ start of non-empty class
+META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
 META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
-META_CLASS_END        ] end of non-empty class
-META_CLASS_NOT        [^ start non-empty negative class
+META_CLASS_END        ] end of non-empty class
+META_CLASS_NOT        [^ start non-empty negative class
 META_COMMIT           (*COMMIT)
-META_DOLLAR           $ metacharacter
-META_DOT              . metacharacter
+META_DOLLAR           $ metacharacter
+META_DOT              . metacharacter
 META_END              End of pattern (this value is 0x80000000)
 META_FAIL             (*FAIL)
-META_KET              ) closing parenthesis
+META_KET              ) closing parenthesis
 META_LOOKAHEAD        (?= start of lookahead
 META_LOOKAHEADNOT     (?! start of negative lookahead
-META_NOCAPTURE        (?: no capture parens
-META_PLUS             +
-META_PLUS_PLUS        ++
-META_PLUS_QUERY       +?
+META_NOCAPTURE        (?: no capture parens
+META_PLUS             +
+META_PLUS_PLUS        ++
+META_PLUS_QUERY       +?
 META_PRUNE            (*PRUNE) - no argument
-META_QUERY            ?
-META_QUERY_PLUS       ?+
-META_QUERY_QUERY      ??
-META_RANGE_ESCAPED    hyphen in class range with at least one escape
-META_RANGE_LITERAL    hyphen in class range defined literally
+META_QUERY            ?
+META_QUERY_PLUS       ?+
+META_QUERY_QUERY      ??
+META_RANGE_ESCAPED    hyphen in class range with at least one escape
+META_RANGE_LITERAL    hyphen in class range defined literally
 META_SKIP             (*SKIP) - no argument
 META_THEN             (*THEN) - no argument
 
-The two RANGE values occur only in character classes. They are positioned
-between two literals that define the start and end of the range. In an EBCDIC
-environment it is necessary to know whether either of the range values was
-specified as an escape. In an ASCII/Unicode environment the distinction is not
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+environment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
 relevant.
 
-The following have data in the lower 16 bits, and may be followed by other data
+The following have data in the lower 16 bits, and may be followed by other data
 elements:
 
+META_ALT              | alternation
 META_BACKREF
 META_CAPTURE
 META_ESCAPE
 META_RECURSE
 
+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
 META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
 their data in the lower 16 bits of the element.
 
 META_BACKREF is followed by an offset if the back reference group number is 10
-or more. The offsets of the first occurrences of references to groups whose
+or more. The offsets of the first occurrences of references to groups whose
 numbers are less than 10 are put in cb->small_ref_offset[] (only the first
 occurrence is useful). On 64-bit systems this avoids using more than two parsed
 pattern elements for items such as \3. The offset is used when an error is
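The offset rule stated at the top of this hunk (one element when the size type fits in 32 bits, two otherwise) can be sketched in C as follows. This assumes PCRE2_SIZE is size_t. The names put_offset, get_offset, and OFFSET_ELEMENTS, and the choice to store the high half first, are hypothetical and not taken from the PCRE2 source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit vector elements that one pattern offset occupies. */
#if SIZE_MAX <= UINT32_MAX
#define OFFSET_ELEMENTS 1
#else
#define OFFSET_ELEMENTS 2
#endif

/* Store an offset at p and return the position just after it. */
static uint32_t *put_offset(uint32_t *p, size_t offset)
{
  assert(p != NULL);
#if OFFSET_ELEMENTS == 1
  *p++ = (uint32_t)offset;
#else
  *p++ = (uint32_t)(offset >> 32);          /* high half first (assumed order) */
  *p++ = (uint32_t)(offset & 0xffffffffu);  /* then the low half */
#endif
  return p;
}

/* Read an offset back from p and return the position just after it. */
static const uint32_t *get_offset(const uint32_t *p, size_t *offset)
{
  assert(p != NULL && offset != NULL);
#if OFFSET_ELEMENTS == 1
  *offset = *p++;
#else
  *offset = (size_t)*p++ << 32;
  *offset |= *p++;
#endif
  return p;
}
```

Under this layout an item such as a back reference to group 10 or higher would take one meta element plus OFFSET_ELEMENTS further elements, which is why the text keeps the offsets for groups below 10 in a side table instead.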
@@ -241,7 +244,7 @@ and an offset into the pattern to specify the name.
 
 The following have one data item that follows in the next vector element:
 
-META_BIGVALUE         Next is a literal >= META_END
+META_BIGVALUE         Next is a literal >= META_END
 META_OPTIONS          (?i) and friends (data is new option bits)
 META_POSIX            POSIX class item (data identifies the class)
 META_POSIX_NEG        negative POSIX class item (ditto)
@@ -249,19 +252,19 @@ META_POSIX_NEG negative POSIX class item (ditto)
 The following are followed by a length element, then a number of character code
 values (which should match with the length):
 
-META_MARK             (*MARK:xxxx)
+META_MARK             (*MARK:xxxx)
 META_PRUNE_ARG        (*PRUNE:xxx)
 META_SKIP_ARG         (*SKIP:xxxx)
 META_THEN_ARG         (*THEN:xxxx)
 
-The following are followed by a length element, then an offset in the pattern
+The following are followed by a length element, then an offset in the pattern
 that identifies the name:
 
-META_COND_NAME        (?(<name>) or (?('name') or (?(name)
-META_COND_RNAME       (?(R&name)
+META_COND_NAME        (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME       (?(R&name)
 META_COND_RNUMBER     (?(Rdigits)
-META_RECURSE_BYNAME   (?&name)
-META_BACKREF_BYNAME   \k'name'
+META_RECURSE_BYNAME   (?&name)
+META_BACKREF_BYNAME   \k'name'
 
 META_COND_RNUMBER is used for names that start with R and continue with digits,
 because this is an ambiguous case. It could be a back reference to a group with
@@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group.
 
 This one is followed by an offset, for use in error messages, then a number:
 
-META_COND_NUMBER      (?([+-]digits)
+META_COND_NUMBER      (?([+-]digits)
 
 The following are followed just by an offset, for use in error messages:
 
 META_COND_ASSERT      (?(?assertion)
 META_COND_DEFINE      (?(DEFINE)
-META_LOOKBEHIND       (?<=
-META_LOOKBEHINDNOT    (?<!
 
-In fact, META_COND_ASSERT is used for any group starting (?( that does not
-match any of the other META_COND cases. The check that this group is an
-assertion (optionally preceded by a callout) happens at compile time.
+In fact, META_COND_ASSERT is used for any group starting (?( that does not
+match any of the other META_COND cases. The check that this group is an
+assertion (optionally preceded by a callout) happens at compile time.
 
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND       (?<=
+META_LOOKBEHINDNOT    (?<!
+
 The following are followed by two values, the minimum and maximum. Repeat
 values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
 represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
 
-META_MINMAX           {n,m} repeat
-META_MINMAX_PLUS      {n,m}+ repeat
-META_MINMAX_QUERY     {n,m}? repeat
+META_MINMAX           {n,m} repeat
+META_MINMAX_PLUS      {n,m}+ repeat
+META_MINMAX_QUERY     {n,m}? repeat
 
 This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
 the next two are the major and minor numbers:
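The three-element layout of a {n,m} repeat described in this hunk might look like the C sketch below. MAX_REPEAT is the limit given in the text. The value used for UNLIMITED_REPEAT (the text says only that it is bigger than MAX_REPEAT), the numeric code for META_MINMAX, and the helper emit_minmax are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REPEAT        65535u             /* limit stated in the text */
#define UNLIMITED_REPEAT  (MAX_REPEAT + 1u)  /* assumed; only "bigger than MAX_REPEAT" is stated */
#define META_MINMAX       0x803d0000u        /* hypothetical meta code */

/* Append a {min,max} item: the meta element, then the minimum, then the
   maximum. Returns the position after the item, or NULL when a bound is
   outside the permitted range. */
static uint32_t *emit_minmax(uint32_t *p, uint32_t min, uint32_t max)
{
  assert(p != NULL);
  if (min > MAX_REPEAT) return NULL;
  if (max > MAX_REPEAT && max != UNLIMITED_REPEAT) return NULL;
  *p++ = META_MINMAX;
  *p++ = min;
  *p++ = max;
  return p;
}
```

In this sketch a pattern item such as {2,} would be emitted as emit_minmax(p, 2, UNLIMITED_REPEAT), so the consumer can tell an open-ended maximum apart from any legal count.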
@@ -297,11 +305,11 @@ META_COND_VERSION (?(VERSION<op>x.y)
 
 Callouts are converted into one of two items:
 
-META_CALLOUT_NUMBER   (?C with numerical argument
-META_CALLOUT_STRING   (?C with string argument
+META_CALLOUT_NUMBER   (?C with numerical argument
+META_CALLOUT_STRING   (?C with string argument
 
-In both cases, the next two elements contain the offset and length of the next
-item in the pattern. Then there is either one callout number, or a length and
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
 an offset for the string argument. The length includes both delimiters.
 
 
@@ -410,11 +418,11 @@ These items are all just one unit long
 OP_THEN )
 
 OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is
+This ends the assertion, not the entire pattern match. The assertion (?!) is
 always optimized to OP_FAIL.
 
 OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
 character). Another use is for [^] when empty classes are permitted
 (PCRE2_ALLOW_EMPTY_CLASS is set).
 
@@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual
 callout at this point. Only assertion conditions may have callouts preceding
 the condition.
 
-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
-parts of the pattern, so this is another opcode that may appear as a condition.
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
 It is treated the same as OP_FALSE.
 
 