Fix error in documentation.
parent
8eae402315
commit
e036cda7ea
146
HACKING
|
@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original
|
||||||
library is referred to as PCRE1 below. For information about testing PCRE2, see
|
library is referred to as PCRE1 below. For information about testing PCRE2, see
|
||||||
the pcre2test documentation and the comment at the head of the RunTest file.
|
the pcre2test documentation and the comment at the head of the RunTest file.
|
||||||
|
|
||||||
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
|
PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
|
||||||
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
|
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
|
||||||
confusion with PCRE1.
|
confusion with PCRE1.
|
||||||
|
|
||||||
|
@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several
|
||||||
dozen lines of messy code were eliminated, though the new pre-pass was not
|
dozen lines of messy code were eliminated, though the new pre-pass was not
|
||||||
short. In particular, parsing and skipping over [] classes is complicated.
|
short. In particular, parsing and skipping over [] classes is complicated.
|
||||||
|
|
||||||
While working on 10.22 I realized that I could simplify yet again by moving
|
While working on 10.22 I realized that I could simplify yet again by moving
|
||||||
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
||||||
after 10.22 was released, the code underwent yet another big refactoring. This
|
after 10.22 was released, the code underwent yet another big refactoring. This
|
||||||
is how it is from 10.23 onwards:
|
is how it is from 10.23 onwards:
|
||||||
|
|
||||||
The function called parse_regex() scans the pattern characters, parsing them
|
The function called parse_regex() scans the pattern characters, parsing them
|
||||||
into literal data and meta characters. It converts escapes such as \x{123}
|
into literal data and meta characters. It converts escapes such as \x{123}
|
||||||
into literals, handles \Q...\E, and skips over comments and non-significant
|
into literals, handles \Q...\E, and skips over comments and non-significant
|
||||||
white space. The result of the scanning is put into a vector of 32-bit unsigned
|
white space. The result of the scanning is put into a vector of 32-bit unsigned
|
||||||
integers. Values less than 0x80000000 are literal data. Higher values represent
|
integers. Values less than 0x80000000 are literal data. Higher values represent
|
||||||
meta-characters. The top 16 bits of such values identify the meta-character,
|
meta-characters. The top 16 bits of such values identify the meta-character,
|
||||||
and these are given names such as META_CAPTURE. The lower 16 bits are available
|
and these are given names such as META_CAPTURE. The lower 16 bits are available
|
||||||
for data, for example, the capturing group number. The only situation in which
|
for data, for example, the capturing group number. The only situation in which
|
||||||
literal data values greater than 0x7fffffff can appear is when the 32-bit
|
literal data values greater than 0x7fffffff can appear is when the 32-bit
|
||||||
library is running in non-UTF mode. This is handled by having a special
|
library is running in non-UTF mode. This is handled by having a special
|
||||||
meta-character that is followed by the 32-bit data value.
|
meta-character that is followed by the 32-bit data value.
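
The element layout described above can be sketched in C as follows. This is
only an illustration, not the library's code: the value of META_END comes from
the text, but the helper names and the value given to EX_META_CAPTURE are
assumptions made for the example.

```c
#include <stdint.h>

/* META_END is stated in the text to be 0x80000000. */
#define EX_META_END      0x80000000u
/* Hypothetical value for a capture meta code; only the top 16 bits matter. */
#define EX_META_CAPTURE  0x80080000u

/* Values below 0x80000000 are literal data; anything else is a meta item. */
int ex_is_meta(uint32_t element)
{
  return element >= EX_META_END;
}

/* The top 16 bits identify the meta-character. */
uint32_t ex_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits carry data such as a capturing group number. */
uint32_t ex_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}
```

With these helpers, an element built as EX_META_CAPTURE | 3 splits back into
the capture code and the group number 3, while a plain 'a' is a literal.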
|
||||||
|
|
||||||
The size of the parsed pattern vector, when auto-callouts are not enabled, is
|
The size of the parsed pattern vector, when auto-callouts are not enabled, is
|
||||||
bounded by the length of the pattern (with one exception). The code is written
|
bounded by the length of the pattern (with one exception). The code is written
|
||||||
so that each item in the pattern uses no more vector elements than the number
|
so that each item in the pattern uses no more vector elements than the number
|
||||||
of code units in the item itself. The exception is the aforementioned large
|
of code units in the item itself. The exception is the aforementioned large
|
||||||
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
|
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
|
||||||
advance to check for such values. When auto-callouts are enabled, the generous
|
advance to check for such values. When auto-callouts are enabled, the generous
|
||||||
assumption is made that there will be a callout for each pattern code unit
|
assumption is made that there will be a callout for each pattern code unit
|
||||||
(which of course is only actually true if all code units are literals) plus one
|
(which of course is only actually true if all code units are literals) plus one
|
||||||
at the end. There is a default parsed pattern vector on the stack, but if this
|
at the end. There is a default parsed pattern vector on the stack, but if this
|
||||||
is not big enough, heap memory is used.
|
is not big enough, heap memory is used.
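
A rough sketch of the sizing rule and the stack-or-heap choice described above
is given below. Every name here is hypothetical, and so are the number of
elements per callout item (passed in as a parameter) and the extra element
reserved for a terminator. The real code may compute the bound differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Upper bound on parsed vector elements.
   patlen        pattern length in code units
   big32count    number of large values found by the advance scan
                 (32-bit non-UTF mode only, otherwise 0)
   auto_callout  non-zero when auto-callouts are enabled
   callout_elems elements used by one callout item (an assumption) */
size_t ex_parsed_size_needed(size_t patlen, size_t big32count,
                             int auto_callout, size_t callout_elems)
{
  /* One element per code unit, plus one extra for each large value. */
  size_t needed = patlen + big32count;

  /* Generous: a callout for every code unit, plus one at the end. */
  if (auto_callout) needed += (patlen + 1) * callout_elems;

  return needed + 1;  /* assumed room for the end-of-pattern element */
}

/* Use the caller's stack vector when it is big enough, otherwise take
   heap memory. Sets *on_heap so the caller knows whether to free(). */
uint32_t *ex_get_parsed_vector(uint32_t *stack_vec, size_t stack_elems,
                               size_t needed, int *on_heap)
{
  if (needed <= stack_elems)
    {
    *on_heap = 0;
    return stack_vec;
    }
  *on_heap = 1;
  return malloc(needed * sizeof(uint32_t));  /* may be NULL on failure */
}
```

The caller is expected to check for NULL and to free the vector only when
on_heap was set.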
|
||||||
|
|
||||||
As before, the actual compiling function is run twice, the first time to
|
As before, the actual compiling function is run twice, the first time to
|
||||||
determine the amount of memory needed for the final compiled pattern. It
|
determine the amount of memory needed for the final compiled pattern. It
|
||||||
now processes the parsed pattern vector, not the pattern itself, although some
|
now processes the parsed pattern vector, not the pattern itself, although some
|
||||||
of the parsed items refer to strings in the pattern - for example, group
|
of the parsed items refer to strings in the pattern - for example, group
|
||||||
names. As escapes and comments have already been processed, the code is a bit
|
names. As escapes and comments have already been processed, the code is a bit
|
||||||
simpler than before.
|
simpler than before.
|
||||||
|
|
||||||
Most errors can be diagnosed during the parsing scan. For those that cannot
|
Most errors can be diagnosed during the parsing scan. For those that cannot
|
||||||
|
@ -168,64 +168,67 @@ identify where errors occur.
|
||||||
The elements of the parsed pattern vector
|
The elements of the parsed pattern vector
|
||||||
-----------------------------------------
|
-----------------------------------------
|
||||||
|
|
||||||
The word "offset" below means a code unit offset into the pattern. When
|
The word "offset" below means a code unit offset into the pattern. When
|
||||||
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
|
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
|
||||||
stored in a single parsed pattern element. Otherwise (typically on 64-bit
|
stored in a single parsed pattern element. Otherwise (typically on 64-bit
|
||||||
systems) it occupies two elements. The following meta items occupy just one
|
systems) it occupies two elements. The following meta items occupy just one
|
||||||
element, with no data:
|
element, with no data:
|
||||||
|
|
||||||
META_ACCEPT (*ACCEPT)
|
META_ACCEPT (*ACCEPT)
|
||||||
META_ALT | alternation
|
META_ASTERISK *
|
||||||
META_ASTERISK *
|
META_ASTERISK_PLUS *+
|
||||||
META_ASTERISK_PLUS *+
|
META_ASTERISK_QUERY *?
|
||||||
META_ASTERISK_QUERY *?
|
META_ATOMIC (?> start of atomic group
|
||||||
META_ATOMIC (?> start of atomic group
|
META_CIRCUMFLEX ^ metacharacter
|
||||||
META_CIRCUMFLEX ^ metacharacter
|
META_CLASS [ start of non-empty class
|
||||||
META_CLASS [ start of non-empty class
|
META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
|
||||||
META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
|
|
||||||
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
|
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
|
||||||
META_CLASS_END ] end of non-empty class
|
META_CLASS_END ] end of non-empty class
|
||||||
META_CLASS_NOT [^ start non-empty negative class
|
META_CLASS_NOT [^ start non-empty negative class
|
||||||
META_COMMIT (*COMMIT)
|
META_COMMIT (*COMMIT)
|
||||||
META_DOLLAR $ metacharacter
|
META_DOLLAR $ metacharacter
|
||||||
META_DOT . metacharacter
|
META_DOT . metacharacter
|
||||||
META_END End of pattern (this value is 0x80000000)
|
META_END End of pattern (this value is 0x80000000)
|
||||||
META_FAIL (*FAIL)
|
META_FAIL (*FAIL)
|
||||||
META_KET ) closing parenthesis
|
META_KET ) closing parenthesis
|
||||||
META_LOOKAHEAD (?= start of lookahead
|
META_LOOKAHEAD (?= start of lookahead
|
||||||
META_LOOKAHEADNOT (?! start of negative lookahead
|
META_LOOKAHEADNOT (?! start of negative lookahead
|
||||||
META_NOCAPTURE (?: no capture parens
|
META_NOCAPTURE (?: no capture parens
|
||||||
META_PLUS +
|
META_PLUS +
|
||||||
META_PLUS_PLUS ++
|
META_PLUS_PLUS ++
|
||||||
META_PLUS_QUERY +?
|
META_PLUS_QUERY +?
|
||||||
META_PRUNE (*PRUNE) - no argument
|
META_PRUNE (*PRUNE) - no argument
|
||||||
META_QUERY ?
|
META_QUERY ?
|
||||||
META_QUERY_PLUS ?+
|
META_QUERY_PLUS ?+
|
||||||
META_QUERY_QUERY ??
|
META_QUERY_QUERY ??
|
||||||
META_RANGE_ESCAPED hyphen in class range with at least one escape
|
META_RANGE_ESCAPED hyphen in class range with at least one escape
|
||||||
META_RANGE_LITERAL hyphen in class range defined literally
|
META_RANGE_LITERAL hyphen in class range defined literally
|
||||||
META_SKIP (*SKIP) - no argument
|
META_SKIP (*SKIP) - no argument
|
||||||
META_THEN (*THEN) - no argument
|
META_THEN (*THEN) - no argument
|
||||||
|
|
||||||
The two RANGE values occur only in character classes. They are positioned
|
The two RANGE values occur only in character classes. They are positioned
|
||||||
between two literals that define the start and end of the range. In an EBCDIC
|
between two literals that define the start and end of the range. In an EBCDIC
|
||||||
environment it is necessary to know whether either of the range values was
|
environment it is necessary to know whether either of the range values was
|
||||||
specified as an escape. In an ASCII/Unicode environment the distinction is not
|
specified as an escape. In an ASCII/Unicode environment the distinction is not
|
||||||
relevant.
|
relevant.
|
||||||
|
|
||||||
The following have data in the lower 16 bits, and may be followed by other data
|
The following have data in the lower 16 bits, and may be followed by other data
|
||||||
elements:
|
elements:
|
||||||
|
|
||||||
|
META_ALT | alternation
|
||||||
META_BACKREF
|
META_BACKREF
|
||||||
META_CAPTURE
|
META_CAPTURE
|
||||||
META_ESCAPE
|
META_ESCAPE
|
||||||
META_RECURSE
|
META_RECURSE
|
||||||
|
|
||||||
|
If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
|
||||||
|
is the length of its branch, for which OP_REVERSE must be generated.
|
||||||
|
|
||||||
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
|
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
|
||||||
their data in the lower 16 bits of the element.
|
their data in the lower 16 bits of the element.
|
||||||
|
|
||||||
META_BACKREF is followed by an offset if the back reference group number is 10
|
META_BACKREF is followed by an offset if the back reference group number is 10
|
||||||
or more. The offsets of the first occurrences of references to groups whose
|
or more. The offsets of the first occurrences of references to groups whose
|
||||||
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
|
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
|
||||||
occurrence is useful). On 64-bit systems this avoids using more than two parsed
|
occurrence is useful). On 64-bit systems this avoids using more than two parsed
|
||||||
pattern elements for items such as \3. The offset is used when an error is
|
pattern elements for items such as \3. The offset is used when an error is
|
||||||
|
@ -241,7 +244,7 @@ and an offset into the pattern to specify the name.
|
||||||
|
|
||||||
The following have one data item that follows in the next vector element:
|
The following have one data item that follows in the next vector element:
|
||||||
|
|
||||||
META_BIGVALUE Next is a literal >= META_END
|
META_BIGVALUE Next is a literal >= META_END
|
||||||
META_OPTIONS (?i) and friends (data is new option bits)
|
META_OPTIONS (?i) and friends (data is new option bits)
|
||||||
META_POSIX POSIX class item (data identifies the class)
|
META_POSIX POSIX class item (data identifies the class)
|
||||||
META_POSIX_NEG negative POSIX class item (ditto)
|
META_POSIX_NEG negative POSIX class item (ditto)
|
||||||
|
@ -249,19 +252,19 @@ META_POSIX_NEG negative POSIX class item (ditto)
|
||||||
The following are followed by a length element, then a number of character code
|
The following are followed by a length element, then a number of character code
|
||||||
values (which should match with the length):
|
values (which should match with the length):
|
||||||
|
|
||||||
META_MARK (*MARK:xxxx)
|
META_MARK (*MARK:xxxx)
|
||||||
META_PRUNE_ARG (*PRUNE:xxx)
|
META_PRUNE_ARG (*PRUNE:xxx)
|
||||||
META_SKIP_ARG (*SKIP:xxxx)
|
META_SKIP_ARG (*SKIP:xxxx)
|
||||||
META_THEN_ARG (*THEN:xxxx)
|
META_THEN_ARG (*THEN:xxxx)
|
||||||
|
|
||||||
The following are followed by a length element, then an offset in the pattern
|
The following are followed by a length element, then an offset in the pattern
|
||||||
that identifies the name:
|
that identifies the name:
|
||||||
|
|
||||||
META_COND_NAME (?(<name>) or (?('name') or (?(name)
|
META_COND_NAME (?(<name>) or (?('name') or (?(name)
|
||||||
META_COND_RNAME (?(R&name)
|
META_COND_RNAME (?(R&name)
|
||||||
META_COND_RNUMBER (?(Rdigits)
|
META_COND_RNUMBER (?(Rdigits)
|
||||||
META_RECURSE_BYNAME (?&name)
|
META_RECURSE_BYNAME (?&name)
|
||||||
META_BACKREF_BYNAME \k'name'
|
META_BACKREF_BYNAME \k'name'
|
||||||
|
|
||||||
META_COND_RNUMBER is used for names that start with R and continue with digits,
|
META_COND_RNUMBER is used for names that start with R and continue with digits,
|
||||||
because this is an ambiguous case. It could be a back reference to a group with
|
because this is an ambiguous case. It could be a back reference to a group with
|
||||||
|
@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group.
|
||||||
|
|
||||||
This one is followed by an offset, for use in error messages, then a number:
|
This one is followed by an offset, for use in error messages, then a number:
|
||||||
|
|
||||||
META_COND_NUMBER (?([+-]digits)
|
META_COND_NUMBER (?([+-]digits)
|
||||||
|
|
||||||
The following are followed just by an offset, for use in error messages:
|
The following are followed just by an offset, for use in error messages:
|
||||||
|
|
||||||
META_COND_ASSERT (?(?assertion)
|
META_COND_ASSERT (?(?assertion)
|
||||||
META_COND_DEFINE (?(DEFINE)
|
META_COND_DEFINE (?(DEFINE)
|
||||||
META_LOOKBEHIND (?<=
|
|
||||||
META_LOOKBEHINDNOT (?<!
|
|
||||||
|
|
||||||
In fact, META_COND_ASSERT is used for any group starting (?( that does not
|
In fact, META_COND_ASSERT is used for any group starting (?( that does not
|
||||||
match any of the other META_COND cases. The check that this group is an
|
match any of the other META_COND cases. The check that this group is an
|
||||||
assertion (optionally preceded by a callout) happens at compile time.
|
assertion (optionally preceded by a callout) happens at compile time.
|
||||||
|
|
||||||
|
The following are also followed just by an offset, but also the lower 16 bits
|
||||||
|
of the main word contain the length of the first branch of the lookbehind
|
||||||
|
group; this is used when generating OP_REVERSE for that branch.
|
||||||
|
|
||||||
|
META_LOOKBEHIND (?<=
|
||||||
|
META_LOOKBEHINDNOT (?<!
|
||||||
|
|
||||||
The following are followed by two values, the minimum and maximum. Repeat
|
The following are followed by two values, the minimum and maximum. Repeat
|
||||||
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
|
||||||
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
|
||||||
|
|
||||||
META_MINMAX {n,m} repeat
|
META_MINMAX {n,m} repeat
|
||||||
META_MINMAX_PLUS {n,m}+ repeat
|
META_MINMAX_PLUS {n,m}+ repeat
|
||||||
META_MINMAX_QUERY {n,m}? repeat
|
META_MINMAX_QUERY {n,m}? repeat
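
As an illustration of the layout just listed, the sketch below reads the two
elements that follow a META_MINMAX item. The limit of 65535 is taken from the
text. The value used for the unlimited marker is an assumption (the text says
only that it is bigger than MAX_REPEAT), and the function and type names are
hypothetical.

```c
#include <stdint.h>

#define EX_MAX_REPEAT        65535u  /* limit stated in the text */
#define EX_UNLIMITED_REPEAT  65536u  /* assumed: some value > MAX_REPEAT */

typedef struct {
  uint32_t min;
  uint32_t max;       /* meaningful only when unlimited is 0 */
  int      unlimited;
} ex_repeat;

/* p points at the META_MINMAX, META_MINMAX_PLUS or META_MINMAX_QUERY
   element; p[1] is the minimum and p[2] the maximum.
   Returns 0 on success, -1 if the values are out of range. */
int ex_read_minmax(const uint32_t *p, ex_repeat *out)
{
  uint32_t min = p[1];
  uint32_t max = p[2];

  if (min > EX_MAX_REPEAT) return -1;

  if (max == EX_UNLIMITED_REPEAT)
    out->unlimited = 1;
  else
    {
    if (max > EX_MAX_REPEAT || max < min) return -1;
    out->unlimited = 0;
    }

  out->min = min;
  out->max = max;
  return 0;
}
```

Under these assumptions, a quantifier such as {2,5} would be read back as
min 2, max 5, and {3,} as min 3 with the unlimited flag set.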
|
||||||
|
|
||||||
This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
|
This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
|
||||||
the next two are the major and minor numbers:
|
the next two are the major and minor numbers:
|
||||||
|
@ -297,11 +305,11 @@ META_COND_VERSION (?(VERSION<op>x.y)
|
||||||
|
|
||||||
Callouts are converted into one of two items:
|
Callouts are converted into one of two items:
|
||||||
|
|
||||||
META_CALLOUT_NUMBER (?C with numerical argument
|
META_CALLOUT_NUMBER (?C with numerical argument
|
||||||
META_CALLOUT_STRING (?C with string argument
|
META_CALLOUT_STRING (?C with string argument
|
||||||
|
|
||||||
In both cases, the next two elements contain the offset and length of the next
|
In both cases, the next two elements contain the offset and length of the next
|
||||||
item in the pattern. Then there is either one callout number, or a length and
|
item in the pattern. Then there is either one callout number, or a length and
|
||||||
an offset for the string argument. The length includes both delimiters.
|
an offset for the string argument. The length includes both delimiters.
|
||||||
|
|
||||||
|
|
||||||
|
@ -410,11 +418,11 @@ These items are all just one unit long
|
||||||
OP_THEN )
|
OP_THEN )
|
||||||
|
|
||||||
OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
|
OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
|
||||||
This ends the assertion, not the entire pattern match. The assertion (?!) is
|
This ends the assertion, not the entire pattern match. The assertion (?!) is
|
||||||
always optimized to OP_FAIL.
|
always optimized to OP_FAIL.
|
||||||
|
|
||||||
OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
|
OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
|
||||||
non-UTF modes and in UTF-32 mode (since one code unit still equals one
|
non-UTF modes and in UTF-32 mode (since one code unit still equals one
|
||||||
character). Another use is for [^] when empty classes are permitted
|
character). Another use is for [^] when empty classes are permitted
|
||||||
(PCRE2_ALLOW_EMPTY_CLASS is set).
|
(PCRE2_ALLOW_EMPTY_CLASS is set).
|
||||||
|
|
||||||
|
@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual
|
||||||
callout at this point. Only assertion conditions may have callouts preceding
|
callout at this point. Only assertion conditions may have callouts preceding
|
||||||
the condition.
|
the condition.
|
||||||
|
|
||||||
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
||||||
parts of the pattern, so this is another opcode that may appear as a condition.
|
parts of the pattern, so this is another opcode that may appear as a condition.
|
||||||
It is treated the same as OP_FALSE.
|
It is treated the same as OP_FALSE.
|
||||||
|
|
||||||
|
|
||||||
|
|