Documentation correction.
parent 52ba34a73c
commit 8fe95cf804
39
HACKING
|
@@ -263,10 +263,13 @@ of repeat make use of these opcodes:
|
||||||
OP_POSUPTO OP_POSUPTOI
|
OP_POSUPTO OP_POSUPTOI
|
||||||
OP_EXACT OP_EXACTI
|
OP_EXACT OP_EXACTI
|
||||||
|
|
||||||
Each of these is followed by a count and then the repeated character. OP_UPTO
|
Each of these is followed by a count and then the repeated character. The count
|
||||||
matches from 0 to the given number. A repeat with a non-zero minimum and a
|
is two bytes long in 8-bit mode (most significant byte first), or one code unit
|
||||||
fixed maximum is coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or
|
in 16-bit and 32-bit modes.
|
||||||
OP_POSUPTO).
|
|
||||||
|
OP_UPTO matches from 0 to the given number. A repeat with a non-zero minimum
|
||||||
|
and a fixed maximum is coded as an OP_EXACT followed by an OP_UPTO (or
|
||||||
|
OP_MINUPTO or OP_POSUPTO).
|
||||||
|
|
||||||
Another set of matching repeating opcodes (called OP_NOTSTAR, OP_NOTSTARI,
|
Another set of matching repeating opcodes (called OP_NOTSTAR, OP_NOTSTARI,
|
||||||
etc.) are used for repeated, negated, single-character classes such as [^a]*.
|
etc.) are used for repeated, negated, single-character classes such as [^a]*.
|
||||||
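As a rough illustration of the count layout and the OP_EXACT/OP_UPTO split described in this hunk, here is a minimal Python sketch. The function names and the use of opcode names as strings are assumptions made for illustration; they are not part of PCRE, whose compiled code uses numeric opcodes. The sketch also assumes that the OP_UPTO count is the maximum minus the minimum, which is what the description implies, and that the repeated character fits in one code unit.

```python
def encode_count(count, code_unit_bits=8):
    """Return the code units that hold a repeat count (illustrative only)."""
    if code_unit_bits == 8:
        # 8-bit mode: two bytes, most significant byte first.
        return [(count >> 8) & 0xFF, count & 0xFF]
    # 16-bit and 32-bit modes: the count occupies a single code unit.
    return [count]


def encode_bounded_repeat(char, minimum, maximum, code_unit_bits=8):
    """Sketch of char{minimum,maximum} with a non-zero minimum and fixed maximum."""
    code = ["OP_EXACT"] + encode_count(minimum, code_unit_bits) + [char]
    if maximum > minimum:
        # Assumption: OP_UPTO carries the remaining optional repetitions.
        code += ["OP_UPTO"] + encode_count(maximum - minimum, code_unit_bits) + [char]
    return code
```

For example, under these assumptions `encode_bounded_repeat("a", 2, 5)` yields an OP_EXACT item with count 2 followed by an OP_UPTO item with count 3.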
|
@@ -330,19 +333,21 @@ negative one. In either case, the opcode is followed by a 32-byte (16-short,
|
||||||
bits are counted from the least significant end of each unit. In caseless mode,
|
bits are counted from the least significant end of each unit. In caseless mode,
|
||||||
bits for both cases are set.
|
bits for both cases are set.
|
||||||
|
|
||||||
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32
|
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 and
|
||||||
mode, subject characters with values greater than 255 can be handled correctly.
|
16-bit and 32-bit modes, subject characters with values greater than 255 can be
|
||||||
For OP_CLASS they do not match, whereas for OP_NCLASS they do.
|
handled correctly. For OP_CLASS they do not match, whereas for OP_NCLASS they
|
||||||
|
do.
|
||||||
|
|
||||||
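The difference between OP_CLASS and OP_NCLASS can be sketched as follows. This is a hedged Python model of the 8-bit library layout only (a 32-byte map, bits counted from the least significant end of each byte); `make_bitmap` and `class_matches` are hypothetical helper names, not PCRE functions.

```python
def make_bitmap(code_points):
    """Build a 32-byte bit map for code points below 256 (illustrative helper)."""
    bitmap = bytearray(32)
    for c in code_points:
        if not 0 <= c < 256:
            raise ValueError("the bit map only covers code points 0..255")
        # Bits are counted from the least significant end of each byte.
        bitmap[c // 8] |= 1 << (c % 8)
    return bytes(bitmap)


def class_matches(bitmap, c, is_nclass):
    """Model of an OP_CLASS / OP_NCLASS test for one subject character."""
    if c > 255:
        # Outside the map: OP_CLASS never matches, OP_NCLASS always does.
        return is_nclass
    return bool(bitmap[c // 8] & (1 << (c % 8)))
```

In this model a negative class such as [^ab] would be stored with every bit set except those for "a" and "b", and `is_nclass=True` makes characters above 255 match as well.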
For classes containing characters with values greater than 255 or that contain
|
For classes containing characters with values greater than 255 or that contain
|
||||||
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any code points
|
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
|
||||||
are less than 256, followed by a list of pairs (for a range) and single
|
code points are less than 256, followed by a list of pairs (for a range) and
|
||||||
characters. In caseless mode, both cases are explicitly listed.
|
single characters. In caseless mode, both cases are explicitly listed.
|
||||||
|
|
||||||
OP_XCLASS is followed by a code unit containing flag bits: XCL_NOT indicates
|
OP_XCLASS is followed by a LINK_SIZE item containing the total length of the
|
||||||
that this is a negative class, and XCL_MAP indicates that a bit map is present.
|
opcode and its data. This is followed by a code unit containing flag bits:
|
||||||
There follows the bit map, if XCL_MAP is set, and then a sequence of items
|
XCL_NOT indicates that this is a negative class, and XCL_MAP indicates that a
|
||||||
coded as follows:
|
bit map is present. There follows the bit map, if XCL_MAP is set, and then a
|
||||||
|
sequence of items coded as follows:
|
||||||
|
|
||||||
XCL_END marks the end of the list
|
XCL_END marks the end of the list
|
||||||
XCL_SINGLE one character follows
|
XCL_SINGLE one character follows
|
||||||
|
@ -354,6 +359,10 @@ If a range starts with a code point less than 256 and ends with one greater
|
||||||
than 256, it is split into two ranges, with characters less than 256 being
|
than 256, it is split into two ranges, with characters less than 256 being
|
||||||
indicated in the bit map, and the rest with XCL_RANGE.
|
indicated in the bit map, and the rest with XCL_RANGE.
|
||||||
|
|
||||||
|
When XCL_NOT is set, the bit map, if present, contains bits for characters that
|
||||||
|
are allowed (exactly as for OP_NCLASS), but the list of items that follow it
|
||||||
|
specifies characters and properties that are not allowed.
|
||||||
|
|
||||||
|
|
||||||
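The range-splitting rule for OP_XCLASS described in this hunk can be illustrated with a small Python sketch. `split_range` is a hypothetical helper, not a PCRE function; it only shows which part of an inclusive range would be recorded in the bit map and which part would be emitted as an XCL_RANGE item.

```python
def split_range(start, end):
    """Split an inclusive code point range at the 255/256 boundary.

    Returns (bitmap_part, xcl_range_part); either part is None when the
    range does not extend into that region.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    # Code points below 256 are indicated in the bit map.
    bitmap_part = (start, min(end, 255)) if start < 256 else None
    # The remainder is listed with XCL_RANGE.
    xcl_range_part = (max(start, 256), end) if end > 255 else None
    return bitmap_part, xcl_range_part
```

For instance, a range from 0x80 to 0x17F would be recorded as bits 0x80 to 0xFF in the map plus one XCL_RANGE item covering 0x100 to 0x17F.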
Back references
|
Back references
|
||||||
---------------
|
---------------
|
||||||
|
@@ -545,4 +554,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
|
||||||
correct length, in order to catch updating errors.
|
correct length, in order to catch updating errors.
|
||||||
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
August 2014
|
February 2015
|
||||||
|
|