Documentation update.
parent b3a6fd38b8
commit 504a073a89
110
HACKING
@@ -48,18 +48,20 @@ Friedl's terminology.
 OK, here's the real stuff
 -------------------------

-For the set of functions that formed the original PCRE1 library (which are
-unrelated to those mentioned above), I tried at first to invent an algorithm
-that used an amount of store bounded by a multiple of the number of characters
-in the pattern, to save on compiling time. However, because of the greater
-complexity in Perl regular expressions, I couldn't do this. In any case, a
-first pass through the pattern is helpful for other reasons.
+For the set of functions that formed the original PCRE1 library in 1997 (which
+are unrelated to those mentioned above), I tried at first to invent an
+algorithm that used an amount of store bounded by a multiple of the number of
+characters in the pattern, to save on compiling time. However, because of the
+greater complexity in Perl regular expressions, I couldn't do this, even though
+the then current Perl 5.004 patterns were much simpler than those supported
+nowadays. In any case, a first pass through the pattern is helpful for other
+reasons.


 Support for 16-bit and 32-bit data strings
 -------------------------------------------

-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
 modes, creating up to three different libraries. In the description that
 follows, the word "short" is used for a 16-bit data quantity, and the phrase
 "code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
 that the actual compile (both the memory-computing dummy run and the real
 compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes was complicated.
+short. In particular, parsing and skipping over [] classes is complicated.

 While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
 (which of course is only actually true if all code units are literals) plus one
-at the end. There is a default parsed pattern vector on the stack, but if this
-is not big enough, heap memory is used.
+at the end. There is a default parsed pattern vector on the system stack, but
+if this is not big enough, heap memory is used.
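
The stack-first, heap-as-fallback strategy for the parsed pattern vector can be
sketched as follows. This is only an illustration of the general technique, not
the PCRE2 source: the names STACK_VECTOR_SIZE, get_parsed_vector,
release_parsed_vector and parse_uses_heap are hypothetical, and the default
size of 1024 elements is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_VECTOR_SIZE 1024   /* hypothetical default size, in elements */

/* Returns a vector of at least `needed` elements: the caller's stack vector
   when it is big enough, otherwise heap memory (NULL if allocation fails). */
uint32_t *get_parsed_vector(uint32_t *stack_vector, size_t stack_size,
                            size_t needed)
{
  assert(stack_vector != NULL);
  if (needed <= stack_size) return stack_vector;
  if (needed > SIZE_MAX / sizeof(uint32_t)) return NULL;  /* size overflow */
  return malloc(needed * sizeof(uint32_t));
}

/* Frees the vector only if it did not come from the stack. */
void release_parsed_vector(uint32_t *vector, uint32_t *stack_vector)
{
  if (vector != stack_vector) free(vector);
}

/* Demonstration: returns 1 if heap memory was needed, 0 if the stack vector
   sufficed, and -1 if the allocation failed. */
int parse_uses_heap(size_t needed)
{
  uint32_t stack_vector[STACK_VECTOR_SIZE];
  uint32_t *vector = get_parsed_vector(stack_vector, STACK_VECTOR_SIZE, needed);
  int used_heap;

  if (vector == NULL) return -1;
  used_heap = (vector != stack_vector);
  if (needed > 0) vector[needed - 1] = 0;   /* touch the last element */
  release_parsed_vector(vector, stack_vector);
  return used_heap;
}
```

Keeping the release step symmetrical with the acquisition step means the caller
never has to remember which kind of memory it was given.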

 As before, the actual compiling function is run twice, the first time to
 determine the amount of memory needed for the final compiled pattern. It
@@ -343,9 +345,14 @@ Changeable options
 ------------------

 The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
-entirely at compile time by generating different opcodes for the different
-settings. The runtime functions do not need to keep track of an options state.
+others) may be changed in the middle of patterns by items such as (?i). Their
+processing is handled entirely at compile time by generating different opcodes
+for the different settings. The runtime functions do not need to keep track of
+an options state.
+
+PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
+are tracked and processed during the parsing pre-pass. The others are handled
+from META_OPTIONS items during the main compile phase.
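
To illustrate how an option change can be absorbed entirely at compile time,
here is a minimal sketch for the 8-bit, non-UTF case. The opcode values, the
OPT_CASELESS bit, and the functions emit_literal and compile_example are all
hypothetical; only the idea of choosing between OP_CHAR and OP_CHARI according
to the option state comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode and option values, chosen only for this sketch. */
enum { OP_CHAR = 1, OP_CHARI = 2 };
#define OPT_CASELESS 0x01u

/* Emits a one-character item, picking the opcode from the options that are
   in force at this point in the pattern. Returns the code units written. */
size_t emit_literal(uint8_t *code, uint8_t ch, uint32_t options)
{
  assert(code != NULL);
  code[0] = (options & OPT_CASELESS) ? OP_CHARI : OP_CHAR;
  code[1] = ch;
  return 2;
}

/* Hand-compiles the pattern a(?i)b. The (?i) item produces no code of its
   own; it only changes which opcode is emitted for what follows. */
size_t compile_example(uint8_t *code)
{
  uint32_t options = 0;
  size_t n = 0;

  n += emit_literal(code + n, 'a', options);
  options |= OPT_CASELESS;                    /* the effect of (?i) */
  n += emit_literal(code + n, 'b', options);
  return n;
}
```

Because the setting is baked into each opcode, a matcher reading this code
needs no record of where (?i) appeared.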


 Format of compiled patterns
@@ -437,14 +444,22 @@ Matching literal characters
 ---------------------------

 The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one code unit long. In UTF-32 mode, characters
-are always exactly one code unit long.
+casefully. For caseless matching of characters that have at most two
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
+character may be more than one code unit long. In UTF-32 mode, characters are
+always exactly one code unit long.

 If there is only one character in a character class, OP_CHAR or OP_CHARI is
 used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
 for something like [^a]).

+Caseless matching (positive or negative) of characters that have more than two
+case-equivalent code points (which is possible only in UTF mode) is handled by
+compiling a Unicode property item (see below), with the pseudo-property
+PT_CLIST. The value of this property is an offset in a vector called
+"ucd_caseless_sets" which identifies the start of a short list of equivalent
+characters, terminated by the value NOTACHAR (0xffffffff).
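
A lookup in a NOTACHAR-terminated list of this kind might look like the sketch
below. The table caseless_sets is a small hypothetical stand-in for
"ucd_caseless_sets" (its contents and layout are assumptions for illustration),
and in_caseless_set is not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu

/* Hypothetical stand-in for the real vector. Each set is a run of code
   points ended by NOTACHAR; a PT_CLIST value would be the index at which
   one of the sets starts. */
static const uint32_t caseless_sets[] = {
  NOTACHAR,                             /* offset 0: empty set */
  0x004b, 0x006b, 0x212a, NOTACHAR,     /* offset 1: K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, NOTACHAR      /* offset 5: S, s, LONG S */
};

/* Returns 1 if c is in the set that starts at `offset`, otherwise 0. */
int in_caseless_set(size_t offset, uint32_t c)
{
  const uint32_t *p;

  assert(offset < sizeof(caseless_sets) / sizeof(caseless_sets[0]));
  for (p = caseless_sets + offset; *p != NOTACHAR; p++) {
    if (*p == c) return 1;
  }
  return 0;
}
```

For example, in_caseless_set(1, 0x212a) reports a match because KELVIN SIGN is
listed in the set that starts at offset 1, while a negative item would simply
invert the result.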
+

 Repeating single characters
 ---------------------------
@@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
 and a value. The types are a set of #defines of the form PT_xxx, and the values
 are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
 The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
+identify a list of case-equivalent characters when there are three or more.

 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
 three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -532,7 +548,10 @@ Character classes

 If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
+than two case-equivalent code points (which can happen only in UTF mode). In
+this case a Unicode property item is used, as described above in "Matching
+literal characters".

 A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
 negated, single-character classes. The normal single-character opcodes
@@ -553,8 +572,8 @@ do.
 For classes containing characters with values greater than 255 or that contain
 \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
 code points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
-explicitly listed.
+single characters and/or properties. In caseless mode, all equivalent
+characters are explicitly listed.

 OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
 opcode and its data. This is followed by a code unit containing flag bits:
@@ -611,8 +630,8 @@ opcode to see if it is one of these:
 OP_CRMINRANGE
 OP_CRPOSRANGE

-All but the last three are single-code-unit items, with no data. The others are
-followed by the minimum and maximum repeat counts.
+All but the last three are single-code-unit items, with no data. The range
+opcodes are followed by the minimum and maximum repeat counts.


 Brackets and alternation
@@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than

 Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
 bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number is a count that immediately follows the offset.
+next alternative OP_ALT or, if there aren't any branches, to the terminating
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
+next one, or to the final opcode. For capturing brackets, the bracket number is
+a count that immediately follows the offset.

-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-respectively (see below for possessive repetitions). All three are followed by
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
-bracket opcode.
+There are several opcodes that mark the end of a subpattern group. OP_KET is
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+respectively, and OP_KETRPOS for possessive repetitions (see below for more
+details). All four are followed by a LINK_SIZE value giving (as a positive
+number) the offset back to the matching bracket opcode.
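
The forward and backward links described above can be followed as in this
sketch, which hand-assembles code for (?:a|b) in a byte vector. Everything
specific here is an assumption for illustration: the opcode values, a LINK_SIZE
of 2 stored most significant code unit first, offsets measured from the opcode
that carries them, and the helper names get_link, put_link, count_alternatives,
group_start and build_example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values and link size, for this sketch only. */
enum { OP_CHAR = 1, OP_ALT = 2, OP_KET = 3, OP_BRA = 4 };
#define LINK_SIZE 2

/* Assumption: a link is stored most significant code unit first. */
static unsigned get_link(const uint8_t *p)
{
  return ((unsigned)p[0] << 8) | p[1];
}

static void put_link(uint8_t *p, unsigned value)
{
  p[0] = (uint8_t)(value >> 8);
  p[1] = (uint8_t)(value & 0xff);
}

/* Counts a group's alternatives by following the forward links from the
   bracket opcode until something other than OP_ALT is reached. */
int count_alternatives(const uint8_t *bra)
{
  const uint8_t *p = bra;
  int count = 0;

  assert(*bra == OP_BRA);
  do {
    count++;
    p += get_link(p + 1);
  } while (*p == OP_ALT);
  return count;
}

/* Follows the back link of a terminating opcode to its bracket opcode. */
const uint8_t *group_start(const uint8_t *ket)
{
  return ket - get_link(ket + 1);
}

/* Builds (?:a|b); returns the number of code units used (13). */
size_t build_example(uint8_t *code)
{
  size_t n = 0;

  code[n] = OP_BRA; put_link(code + n + 1, 5);  n += 1 + LINK_SIZE; /* at 0 */
  code[n++] = OP_CHAR; code[n++] = 'a';
  code[n] = OP_ALT; put_link(code + n + 1, 5);  n += 1 + LINK_SIZE; /* at 5 */
  code[n++] = OP_CHAR; code[n++] = 'b';
  code[n] = OP_KET; put_link(code + n + 1, 10); n += 1 + LINK_SIZE; /* at 10 */
  return n;
}
```

With this layout a scan can hop from alternative to alternative without
decoding the items inside each branch, and the stored back offset lets the end
of a group find its start just as cheaply.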

 If a subpattern is quantified such that it is permitted to match zero times, it
 is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
 or OP_FALSE.

 If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
-immediately before the assertion. It is also possible to insert a manual
-callout at this point. Only assertion conditions may have callouts preceding
-the condition.
+must start with a parenthesized assertion, whose opcode normally immediately
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
+callout is inserted immediately before the assertion. It is also possible to
+insert a manual callout at this point. Only assertion conditions may have
+callouts preceding the condition.

 A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
 parts of the pattern, so this is another opcode that may appear as a condition.
@@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
 Callouts
 --------

-A callout can nowadays have either a numerical argument or a string argument.
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
-followed by two LINK_SIZE values giving the offset in the pattern string to the
-start of the following item, and another count giving the length of this item.
-These values make it possible for pcre2test to output useful tracing
-information using callouts.
+A callout may have either a numerical argument or a string argument. These use
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
+two LINK_SIZE values giving the offset in the pattern string to the start of
+the following item, and another count giving the length of this item. These
+values make it possible for pcre2test to output useful tracing information
+using callouts.

 In the case of a numeric callout, after these two values there is a single code
 unit containing the callout number, in the range 0-255, with 255 being used for
@@ -790,8 +810,8 @@ Opcode table checking
 ---------------------

 The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check that tables indexed by opcode are the
-correct length, in order to catch updating errors.
+not a real opcode, but is used to check at compile time that tables indexed by
+opcode are the correct length, in order to catch updating errors.
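
One common way to make such a check at compile time is sketched below. The
three-entry opcode list, the op_lengths table, and the typedef trick are
assumptions chosen for illustration; the actual mechanism used in the PCRE2
sources may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced opcode list. The final enumerator is a count
   of the opcodes before it, not an opcode in its own right. */
enum {
  OP_END,
  OP_CHAR,
  OP_CHARI,
  OP_TABLE_LENGTH    /* must remain last */
};

/* A table indexed by opcode. The array size is deliberately left implicit
   so that a missing or surplus entry changes sizeof(op_lengths). */
static const unsigned char op_lengths[] = {
  1,    /* OP_END   */
  2,    /* OP_CHAR  */
  2     /* OP_CHARI */
};

/* If the table and the opcode list get out of step, the array size below
   becomes -1 and the compilation fails. */
typedef char op_lengths_size_check[
  (sizeof(op_lengths) / sizeof(op_lengths[0]) == OP_TABLE_LENGTH) ? 1 : -1];

/* Returns the length of the item that starts with opcode `op`. */
unsigned op_length_of(int op)
{
  assert(op >= 0 && op < OP_TABLE_LENGTH);
  return op_lengths[op];
}
```

Adding an opcode to the enumeration without extending op_lengths (or the
reverse) then stops the build instead of causing an out-of-range read when the
table is indexed.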

 Philip Hazel
-17 March 2017
+21 April 2017