Documentation update.

Philip.Hazel 2017-04-21 16:30:18 +00:00
parent b3a6fd38b8
commit 504a073a89
1 changed files with 65 additions and 45 deletions

HACKING

@@ -48,18 +48,20 @@ Friedl's terminology.
 OK, here's the real stuff
 -------------------------

-For the set of functions that formed the original PCRE1 library (which are
-unrelated to those mentioned above), I tried at first to invent an algorithm
-that used an amount of store bounded by a multiple of the number of characters
-in the pattern, to save on compiling time. However, because of the greater
-complexity in Perl regular expressions, I couldn't do this. In any case, a
-first pass through the pattern is helpful for other reasons.
+For the set of functions that formed the original PCRE1 library in 1997 (which
+are unrelated to those mentioned above), I tried at first to invent an
+algorithm that used an amount of store bounded by a multiple of the number of
+characters in the pattern, to save on compiling time. However, because of the
+greater complexity in Perl regular expressions, I couldn't do this, even though
+the then current Perl 5.004 patterns were much simpler than those supported
+nowadays. In any case, a first pass through the pattern is helpful for other
+reasons.

 Support for 16-bit and 32-bit data strings
 -------------------------------------------

-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
 modes, creating up to three different libraries. In the description that
 follows, the word "short" is used for a 16-bit data quantity, and the phrase
 "code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
 that the actual compile (both the memory-computing dummy run and the real
 compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
-short. In particular, parsing and skipping over [] classes was complicated.
+short. In particular, parsing and skipping over [] classes is complicated.

 While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
@@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
 (which of course is only actually true if all code units are literals) plus one
-at the end. There is a default parsed pattern vector on the stack, but if this
-is not big enough, heap memory is used.
+at the end. There is a default parsed pattern vector on the system stack, but
+if this is not big enough, heap memory is used.
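
The stack-first, heap-fallback arrangement described here might look roughly
like the following. This is a minimal sketch, not the code in the PCRE2
sources: the names get_parsed_vector, release_parsed_vector, and
STACK_VECTOR_SIZE, the element type, and the size are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_VECTOR_SIZE 1024  /* hypothetical default size, in elements */

/* Return the vector to use: the caller's stack vector if it is big enough,
   otherwise a block from the heap (NULL if the allocation fails). */
static uint32_t *get_parsed_vector(uint32_t *stack_vector, size_t needed)
{
assert(stack_vector != NULL);
if (needed <= STACK_VECTOR_SIZE) return stack_vector;
return malloc(needed * sizeof(uint32_t));
}

/* Free the vector only if it came from the heap. */
static void release_parsed_vector(uint32_t *vector, uint32_t *stack_vector)
{
if (vector != stack_vector) free(vector);
}
```

The caller declares a uint32_t array of STACK_VECTOR_SIZE elements as a local
variable, passes it to get_parsed_vector() together with the worst-case length,
and calls release_parsed_vector() when the parsed pattern is no longer needed.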
 As before, the actual compiling function is run twice, the first time to
 determine the amount of memory needed for the final compiled pattern. It
@@ -343,9 +345,14 @@ Changeable options
 ------------------

 The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
-entirely at compile time by generating different opcodes for the different
-settings. The runtime functions do not need to keep track of an options state.
+others) may be changed in the middle of patterns by items such as (?i). Their
+processing is handled entirely at compile time by generating different opcodes
+for the different settings. The runtime functions do not need to keep track of
+an options state.
+
+PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
+are tracked and processed during the parsing pre-pass. The others are handled
+from META_OPTIONS items during the main compile phase.
 Format of compiled patterns
@@ -437,14 +444,22 @@ Matching literal characters
 ---------------------------

 The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one code unit long. In UTF-32 mode, characters
-are always exactly one code unit long.
+casefully. For caseless matching of characters that have at most two
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
+character may be more than one code unit long. In UTF-32 mode, characters are
+always exactly one code unit long.

 If there is only one character in a character class, OP_CHAR or OP_CHARI is
 used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
 for something like [^a]).

+Caseless matching (positive or negative) of characters that have more than two
+case-equivalent code points (which is possible only in UTF mode) is handled by
+compiling a Unicode property item (see below), with the pseudo-property
+PT_CLIST. The value of this property is an offset in a vector called
+"ucd_caseless_sets" which identifies the start of a short list of equivalent
+characters, terminated by the value NOTACHAR (0xffffffff).
+
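
A lookup in such a NOTACHAR-terminated list could be sketched as below. The
table here is a toy stand-in for "ucd_caseless_sets" (its layout and the
offsets 1 and 5 are invented for the example), and in_caseless_set() is a
hypothetical helper, not a PCRE2 function. The two sets shown are real Unicode
case equivalences: K, k, and KELVIN SIGN; S, s, and LATIN SMALL LETTER LONG S.

```c
#include <assert.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu  /* list terminator, as described above */

/* Toy stand-in for ucd_caseless_sets; the offsets are illustrative only. */
static const uint32_t caseless_sets[] = {
  NOTACHAR,                          /* offset 0: an empty list */
  0x004b, 0x006b, 0x212a, NOTACHAR,  /* offset 1: K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, NOTACHAR   /* offset 5: S, s, LONG S */
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
static int in_caseless_set(uint32_t offset, uint32_t c)
{
const uint32_t *p = caseless_sets + offset;
assert(offset < sizeof(caseless_sets) / sizeof(caseless_sets[0]));
while (*p != NOTACHAR)
  {
  if (*p == c) return 1;
  p++;
  }
return 0;
}
```

A positive match succeeds when in_caseless_set() returns 1 for the subject
character, and a negative match succeeds when it returns 0.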
 Repeating single characters
 ---------------------------
@@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
 and a value. The types are a set of #defines of the form PT_xxx, and the values
 are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
 The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
+identify a list of case-equivalent characters when there are three or more.

 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
 three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -532,7 +548,10 @@ Character classes
 If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
+than two case-equivalent code points (which can happen only in UTF mode). In
+this case a Unicode property item is used, as described above in "Matching
+literal characters".

 A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
 negated, single-character classes. The normal single-character opcodes
@@ -553,8 +572,8 @@ do.
 For classes containing characters with values greater than 255 or that contain
 \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
 code points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
-explicitly listed.
+single characters and/or properties. In caseless mode, all equivalent
+characters are explicitly listed.

 OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
 opcode and its data. This is followed by a code unit containing flag bits:
@@ -611,8 +630,8 @@ opcode to see if it is one of these:
 OP_CRMINRANGE
 OP_CRPOSRANGE

-All but the last three are single-code-unit items, with no data. The others are
-followed by the minimum and maximum repeat counts.
+All but the last three are single-code-unit items, with no data. The range
+opcodes are followed by the minimum and maximum repeat counts.

 Brackets and alternation
@@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than
 Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
 bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number is a count that immediately follows the offset.
+next alternative OP_ALT or, if there aren't any branches, to the terminating
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
+next one, or to the final opcode. For capturing brackets, the bracket number is
+a count that immediately follows the offset.

-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-respectively (see below for possessive repetitions). All three are followed by
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
-bracket opcode.
+There are several opcodes that mark the end of a subpattern group. OP_KET is
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+respectively, and OP_KETRPOS for possessive repetitions (see below for more
+details). All four are followed by a LINK_SIZE value giving (as a positive
+number) the offset back to the matching bracket opcode.
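
The chain of offsets described above can be followed with a few lines of code.
The sketch below is only an illustration under stated assumptions: the opcode
values are invented, LINK_SIZE is taken to be two bytes stored most significant
byte first, and get_link(), find_ket(), and the example_code array (a
hand-assembled form of (?:a|bc)) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values; the real ones are in pcre2_internal.h. */
enum { OP_CHAR = 1, OP_ALT, OP_KET, OP_BRA };

/* Read a two-byte link value, most significant byte first (an assumption). */
static size_t get_link(const uint8_t *p)
{
return ((size_t)p[0] << 8) | p[1];
}

/* Hand-assembled toy code for (?:a|bc). */
static const uint8_t example_code[] = {
  OP_BRA, 0, 5,    /* offset 0: forward link to OP_ALT at offset 5 */
  OP_CHAR, 'a',
  OP_ALT, 0, 7,    /* offset 5: forward link to OP_KET at offset 12 */
  OP_CHAR, 'b',
  OP_CHAR, 'c',
  OP_KET, 0, 12    /* offset 12: positive offset back to OP_BRA */
};

/* Follow the links from a bracket opcode to its terminating opcode, counting
   the alternatives on the way. */
static const uint8_t *find_ket(const uint8_t *bra, unsigned *branches)
{
const uint8_t *p;
assert(*bra == OP_BRA);
p = bra + get_link(bra + 1);
*branches = 1;
while (*p == OP_ALT)
  {
  (*branches)++;
  p += get_link(p + 1);
  }
return p;
}
```

Subtracting the link value that follows the terminating opcode from its own
address gives the address of the bracket opcode again, which is how the code
can move from the end of a group back to its start.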
 If a subpattern is quantified such that it is permitted to match zero times, it
 is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
 or OP_FALSE.

 If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
-immediately before the assertion. It is also possible to insert a manual
-callout at this point. Only assertion conditions may have callouts preceding
-the condition.
+must start with a parenthesized assertion, whose opcode normally immediately
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
+callout is inserted immediately before the assertion. It is also possible to
+insert a manual callout at this point. Only assertion conditions may have
+callouts preceding the condition.

 A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
 parts of the pattern, so this is another opcode that may appear as a condition.
@@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
 Callouts
 --------

-A callout can nowadays have either a numerical argument or a string argument.
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
-followed by two LINK_SIZE values giving the offset in the pattern string to the
-start of the following item, and another count giving the length of this item.
-These values make it possible for pcre2test to output useful tracing
-information using callouts.
+A callout may have either a numerical argument or a string argument. These use
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
+two LINK_SIZE values giving the offset in the pattern string to the start of
+the following item, and another count giving the length of this item. These
+values make it possible for pcre2test to output useful tracing information
+using callouts.

 In the case of a numeric callout, after these two values there is a single code
 unit containing the callout number, in the range 0-255, with 255 being used for
@@ -790,8 +810,8 @@ Opcode table checking
 ---------------------

 The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check that tables indexed by opcode are the
-correct length, in order to catch updating errors.
+not a real opcode, but is used to check at compile time that tables indexed by
+opcode are the correct length, in order to catch updating errors.
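
One way to get a compile-time check of this kind is a C11 static assertion, as
in the sketch below. This shows the general idea only and is not necessarily
the mechanism the PCRE2 sources use; the three-opcode list and the op_lengths
table are invented for the example.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Toy opcode list; OP_TABLE_LENGTH must always come last. */
enum { OP_END, OP_CHAR, OP_CHARI, OP_TABLE_LENGTH };

/* A table indexed by opcode, with one entry for each real opcode. */
static const uint8_t op_lengths[] = { 1, 2, 2 };

/* Compilation fails here if an opcode is added without updating the table. */
static_assert(sizeof(op_lengths) / sizeof(op_lengths[0]) == OP_TABLE_LENGTH,
  "op_lengths is out of step with the opcode list");
```

Adding a new opcode before OP_TABLE_LENGTH without extending op_lengths changes
the value of OP_TABLE_LENGTH, so the mismatch is reported by the compiler
rather than showing up later as a read beyond the end of the table.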
 Philip Hazel
-17 March 2017
+21 April 2017