Documentation update.
parent
b3a6fd38b8
commit
504a073a89
110
HACKING
|
@ -48,18 +48,20 @@ Friedl's terminology.
|
||||||
OK, here's the real stuff
|
OK, here's the real stuff
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
For the set of functions that formed the original PCRE1 library (which are
|
For the set of functions that formed the original PCRE1 library in 1997 (which
|
||||||
unrelated to those mentioned above), I tried at first to invent an algorithm
|
are unrelated to those mentioned above), I tried at first to invent an
|
||||||
that used an amount of store bounded by a multiple of the number of characters
|
algorithm that used an amount of store bounded by a multiple of the number of
|
||||||
in the pattern, to save on compiling time. However, because of the greater
|
characters in the pattern, to save on compiling time. However, because of the
|
||||||
complexity in Perl regular expressions, I couldn't do this. In any case, a
|
greater complexity in Perl regular expressions, I couldn't do this, even though
|
||||||
first pass through the pattern is helpful for other reasons.
|
the then current Perl 5.004 patterns were much simpler than those supported
|
||||||
|
nowadays. In any case, a first pass through the pattern is helpful for other
|
||||||
|
reasons.
|
||||||
|
|
||||||
|
|
||||||
Support for 16-bit and 32-bit data strings
|
Support for 16-bit and 32-bit data strings
|
||||||
-------------------------------------------
|
-------------------------------------------
|
||||||
|
|
||||||
The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
|
The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
|
||||||
modes, creating up to three different libraries. In the description that
|
modes, creating up to three different libraries. In the description that
|
||||||
follows, the word "short" is used for a 16-bit data quantity, and the phrase
|
follows, the word "short" is used for a 16-bit data quantity, and the phrase
|
||||||
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
|
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
|
||||||
|
@ -122,7 +124,7 @@ all the named subpatterns and their corresponding group numbers. This means
|
||||||
that the actual compile (both the memory-computing dummy run and the real
|
that the actual compile (both the memory-computing dummy run and the real
|
||||||
compile) has full knowledge of group names and numbers throughout. Several
|
compile) has full knowledge of group names and numbers throughout. Several
|
||||||
dozen lines of messy code were eliminated, though the new pre-pass was not
|
dozen lines of messy code were eliminated, though the new pre-pass was not
|
||||||
short. In particular, parsing and skipping over [] classes was complicated.
|
short. In particular, parsing and skipping over [] classes is complicated.
|
||||||
|
|
||||||
While working on 10.22 I realized that I could simplify yet again by moving
|
While working on 10.22 I realized that I could simplify yet again by moving
|
||||||
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
|
||||||
|
@ -149,8 +151,8 @@ of code units in the item itself. The exception is the aforementioned large
|
||||||
advance to check for such values. When auto-callouts are enabled, the generous
|
advance to check for such values. When auto-callouts are enabled, the generous
|
||||||
assumption is made that there will be a callout for each pattern code unit
|
assumption is made that there will be a callout for each pattern code unit
|
||||||
(which of course is only actually true if all code units are literals) plus one
|
(which of course is only actually true if all code units are literals) plus one
|
||||||
at the end. There is a default parsed pattern vector on the stack, but if this
|
at the end. There is a default parsed pattern vector on the system stack, but
|
||||||
is not big enough, heap memory is used.
|
if this is not big enough, heap memory is used.
|
||||||
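The stack-first, heap-fallback arrangement for the parsed pattern vector can be sketched as follows. This is only an illustration: the name toy_parse, the vector size of 64 items, and the use of malloc() are assumptions made for the sketch, and the library's own memory management is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_STACK_ITEMS 64   /* hypothetical default vector size */

/* Pretend to parse a pattern that needs needed_items vector entries.
   Sets *used_heap to 1 if the default stack vector was too small.
   Returns 0 on success, -1 if heap memory could not be obtained. */
int toy_parse(size_t needed_items, int *used_heap)
{
    uint32_t stack_vector[TOY_STACK_ITEMS];
    uint32_t *vector = stack_vector;

    assert(used_heap != NULL);
    *used_heap = 0;

    if (needed_items > TOY_STACK_ITEMS) {
        vector = malloc(needed_items * sizeof(uint32_t));
        if (vector == NULL) return -1;
        *used_heap = 1;
    }

    /* A real pre-pass would write parsed items here. */
    for (size_t i = 0; i < needed_items; i++) vector[i] = 0;

    if (vector != stack_vector) free(vector);
    return 0;
}
```

Small patterns never touch the heap in this scheme; only an unusually long pattern pays for an allocation.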
|
|
||||||
As before, the actual compiling function is run twice, the first time to
|
As before, the actual compiling function is run twice, the first time to
|
||||||
determine the amount of memory needed for the final compiled pattern. It
|
determine the amount of memory needed for the final compiled pattern. It
|
||||||
|
@ -343,9 +345,14 @@ Changeable options
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
|
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
|
||||||
some others) may change in the middle of patterns. Their processing is handled
|
others) may be changed in the middle of patterns by items such as (?i). Their
|
||||||
entirely at compile time by generating different opcodes for the different
|
processing is handled entirely at compile time by generating different opcodes
|
||||||
settings. The runtime functions do not need to keep track of an options state.
|
for the different settings. The runtime functions do not need to keep track of
|
||||||
|
an options state.
|
||||||
|
|
||||||
|
PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
|
||||||
|
are tracked and processed during the parsing pre-pass. The others are handled
|
||||||
|
from META_OPTIONS items during the main compile phase.
|
||||||
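The idea that a case setting is baked into the opcodes at compile time can be pictured with a toy compiler. It is a sketch only: the opcode values, the function name toy_compile, and the restriction to ASCII literals plus the items (?i) and (?-i) are all assumptions for illustration, not the library's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical opcode values for this sketch only. */
enum { TOY_OP_END = 0, TOY_OP_CHAR = 1, TOY_OP_CHARI = 2 };

/* Compile a pattern made only of ASCII literals and the items (?i) and
   (?-i). Each literal becomes an (opcode, character) pair, and the opcode
   records the case setting in force at that point, so a matcher reading
   the output needs no options state. Returns the number of bytes written
   to out, or 0 if out is too small. */
size_t toy_compile(const char *pattern, unsigned char *out, size_t outsize)
{
    int caseless = 0;
    size_t n = 0;
    const char *p = pattern;

    while (*p != '\0') {
        if (strncmp(p, "(?i)", 4) == 0)  { caseless = 1; p += 4; continue; }
        if (strncmp(p, "(?-i)", 5) == 0) { caseless = 0; p += 5; continue; }
        if (outsize - n < 3) return 0;   /* room for this pair plus the end marker */
        out[n++] = caseless ? TOY_OP_CHARI : TOY_OP_CHAR;
        out[n++] = (unsigned char)*p++;
    }
    if (outsize - n < 1) return 0;
    out[n++] = TOY_OP_END;
    return n;
}
```

Compiling "a(?i)b(?-i)c" with this sketch gives a caseful opcode for a, a caseless one for b, and a caseful one again for c.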
|
|
||||||
|
|
||||||
Format of compiled patterns
|
Format of compiled patterns
|
||||||
|
@ -437,14 +444,22 @@ Matching literal characters
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
The OP_CHAR opcode is followed by a single character that is to be matched
|
The OP_CHAR opcode is followed by a single character that is to be matched
|
||||||
casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
|
casefully. For caseless matching of characters that have at most two
|
||||||
the character may be more than one code unit long. In UTF-32 mode, characters
|
case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
|
||||||
are always exactly one code unit long.
|
character may be more than one code unit long. In UTF-32 mode, characters are
|
||||||
|
always exactly one code unit long.
|
||||||
|
|
||||||
If there is only one character in a character class, OP_CHAR or OP_CHARI is
|
If there is only one character in a character class, OP_CHAR or OP_CHARI is
|
||||||
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
|
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
|
||||||
for something like [^a]).
|
for something like [^a]).
|
||||||
|
|
||||||
|
Caseless matching (positive or negative) of characters that have more than two
|
||||||
|
case-equivalent code points (which is possible only in UTF mode) is handled by
|
||||||
|
compiling a Unicode property item (see below), with the pseudo-property
|
||||||
|
PT_CLIST. The value of this property is an offset in a vector called
|
||||||
|
"ucd_caseless_sets" which identifies the start of a short list of equivalent
|
||||||
|
characters, terminated by the value NOTACHAR (0xffffffff).
|
||||||
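A lookup in a vector of NOTACHAR-terminated lists, as used for PT_CLIST, might look like the sketch below. The table here is a toy: it holds two genuine Unicode case-equivalence sets, but its name, contents, and offsets are assumptions for illustration and are not those of the generated ucd_caseless_sets vector.

```c
#include <stddef.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu

/* Toy table in the style described above. Each list ends with NOTACHAR. */
static const uint32_t toy_caseless_sets[] = {
    NOTACHAR,                          /* offset 0: unused in this sketch */
    0x004b, 0x006b, 0x212a, NOTACHAR,  /* offset 1: K, k, KELVIN SIGN */
    0x0053, 0x0073, 0x017f, NOTACHAR   /* offset 5: S, s, LATIN SMALL LETTER LONG S */
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
int toy_clist_match(size_t offset, uint32_t c)
{
    const uint32_t *p = toy_caseless_sets + offset;
    while (*p != NOTACHAR) {
        if (*p == c) return 1;
        p++;
    }
    return 0;
}
```

With this table, toy_clist_match(1, 0x212a) succeeds, because the Kelvin sign is case-equivalent to both K and k.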
|
|
||||||
|
|
||||||
Repeating single characters
|
Repeating single characters
|
||||||
---------------------------
|
---------------------------
|
||||||
|
@ -520,7 +535,8 @@ Each is followed by two code units that encode the desired property as a type
|
||||||
and a value. The types are a set of #defines of the form PT_xxx, and the values
|
and a value. The types are a set of #defines of the form PT_xxx, and the values
|
||||||
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
|
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
|
||||||
The value is relevant only for PT_GC (General Category), PT_PC (Particular
|
The value is relevant only for PT_GC (General Category), PT_PC (Particular
|
||||||
Category), and PT_SC (Script).
|
Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
|
||||||
|
identify a list of case-equivalent characters when there are three or more.
|
||||||
|
|
||||||
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
|
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
|
||||||
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
|
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
|
||||||
|
@ -532,7 +548,10 @@ Character classes
|
||||||
|
|
||||||
If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
|
If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
|
||||||
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
|
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
|
||||||
something like [^a]).
|
something like [^a]), except when caselessly matching a character that has more
|
||||||
|
than two case-equivalent code points (which can happen only in UTF mode). In
|
||||||
|
this case a Unicode property item is used, as described above in "Matching
|
||||||
|
literal characters".
|
||||||
|
|
||||||
A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
|
A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
|
||||||
negated, single-character classes. The normal single-character opcodes
|
negated, single-character classes. The normal single-character opcodes
|
||||||
|
@ -553,8 +572,8 @@ do.
|
||||||
For classes containing characters with values greater than 255 or that contain
|
For classes containing characters with values greater than 255 or that contain
|
||||||
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
|
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
|
||||||
code points are less than 256, followed by a list of pairs (for a range) and/or
|
code points are less than 256, followed by a list of pairs (for a range) and/or
|
||||||
single characters and/or properties. In caseless mode, both cases are
|
single characters and/or properties. In caseless mode, all equivalent
|
||||||
explicitly listed.
|
characters are explicitly listed.
|
||||||
|
|
||||||
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
|
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
|
||||||
opcode and its data. This is followed by a code unit containing flag bits:
|
opcode and its data. This is followed by a code unit containing flag bits:
|
||||||
|
@ -611,8 +630,8 @@ opcode to see if it is one of these:
|
||||||
OP_CRMINRANGE
|
OP_CRMINRANGE
|
||||||
OP_CRPOSRANGE
|
OP_CRPOSRANGE
|
||||||
|
|
||||||
All but the last three are single-code-unit items, with no data. The others are
|
All but the last three are single-code-unit items, with no data. The range
|
||||||
followed by the minimum and maximum repeat counts.
|
opcodes are followed by the minimum and maximum repeat counts.
|
||||||
|
|
||||||
|
|
||||||
Brackets and alternation
|
Brackets and alternation
|
||||||
|
@ -627,16 +646,17 @@ myself, can be round, square, curly, or pointy. Hence this usage rather than
|
||||||
|
|
||||||
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
|
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
|
||||||
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
|
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
|
||||||
next alternative OP_ALT or, if there aren't any branches, to the matching
|
next alternative OP_ALT or, if there aren't any branches, to the terminating
|
||||||
OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
|
opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
|
||||||
to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
|
next one, or to the final opcode. For capturing brackets, the bracket number is
|
||||||
number is a count that immediately follows the offset.
|
a count that immediately follows the offset.
|
||||||
|
|
||||||
OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
|
There are several opcodes that mark the end of a subpattern group. OP_KET is
|
||||||
and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
|
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
|
||||||
respectively (see below for possessive repetitions). All three are followed by
|
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
|
||||||
a LINK_SIZE value giving (as a positive number) the offset back to the matching
|
respectively, and OP_KETRPOS for possessive repetitions (see below for more
|
||||||
bracket opcode.
|
details). All four are followed by a LINK_SIZE value giving (as a positive
|
||||||
|
number) the offset back to the matching bracket opcode.
|
||||||
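The chain of offsets through a group can be followed with a few lines of code. The sketch below assumes an 8-bit build with LINK_SIZE set to 2 and offsets stored most significant byte first; the opcode values and the hand-built encoding of (?:a|b|c) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define LINK_SIZE 2   /* assumed for this sketch */

/* Hypothetical opcode values for this sketch only. */
enum { T_END, T_CHAR, T_BRA, T_ALT, T_KET };

/* Read the LINK_SIZE value that follows the opcode at code[0]. */
static unsigned get_link(const uint8_t *code)
{
    return ((unsigned)code[1] << 8) | code[2];
}

/* Hand-built toy encoding of (?:a|b|c). */
static const uint8_t toy_code[] = {
    T_BRA, 0, 5,   T_CHAR, 'a',   /* offset 0: link to the first T_ALT  */
    T_ALT, 0, 5,   T_CHAR, 'b',   /* offset 5: link to the second T_ALT */
    T_ALT, 0, 5,   T_CHAR, 'c',   /* offset 10: link to T_KET           */
    T_KET, 0, 15,                 /* offset 15: link back to T_BRA      */
    T_END
};

/* Follow the forward links from a bracket opcode. Returns the number of
   branches and stores a pointer to the terminating opcode in *ket. */
int toy_walk_group(const uint8_t *bra, const uint8_t **ket)
{
    const uint8_t *p = bra;
    int branches = 0;
    do {
        branches++;
        p += get_link(p);
    } while (*p == T_ALT);
    *ket = p;
    return branches;
}
```

Walking toy_code finds three branches, and subtracting the value stored after the terminating opcode leads back to the opening bracket.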
|
|
||||||
If a subpattern is quantified such that it is permitted to match zero times, it
|
If a subpattern is quantified such that it is permitted to match zero times, it
|
||||||
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
|
||||||
|
@ -725,11 +745,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
|
||||||
or OP_FALSE.
|
or OP_FALSE.
|
||||||
|
|
||||||
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
|
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
|
||||||
must start with an assertion, whose opcode normally immediately follows OP_COND
|
must start with a parenthesized assertion, whose opcode normally immediately
|
||||||
or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
|
follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
|
||||||
immediately before the assertion. It is also possible to insert a manual
|
callout is inserted immediately before the assertion. It is also possible to
|
||||||
callout at this point. Only assertion conditions may have callouts preceding
|
insert a manual callout at this point. Only assertion conditions may have
|
||||||
the condition.
|
callouts preceding the condition.
|
||||||
|
|
||||||
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
|
||||||
parts of the pattern, so this is another opcode that may appear as a condition.
|
parts of the pattern, so this is another opcode that may appear as a condition.
|
||||||
|
@ -758,12 +778,12 @@ treated as (?1)(?1)(?:(?1)){0,2}.
|
||||||
Callouts
|
Callouts
|
||||||
--------
|
--------
|
||||||
|
|
||||||
A callout can nowadays have either a numerical argument or a string argument.
|
A callout may have either a numerical argument or a string argument. These use
|
||||||
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
|
OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
|
||||||
followed by two LINK_SIZE values giving the offset in the pattern string to the
|
two LINK_SIZE values giving the offset in the pattern string to the start of
|
||||||
start of the following item, and another count giving the length of this item.
|
the following item, and another count giving the length of this item. These
|
||||||
These values make it possible for pcre2test to output useful tracing
|
values make it possible for pcre2test to output useful tracing information
|
||||||
information using callouts.
|
using callouts.
|
||||||
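Reading the two LINK_SIZE values that follow a callout opcode could be sketched like this, again assuming an 8-bit build with LINK_SIZE set to 2 and most-significant-byte-first storage. The struct and function names are hypothetical.

```c
#include <stdint.h>

#define LINK_SIZE 2   /* assumed for this sketch */

typedef struct {
    unsigned pattern_offset;  /* offset in the pattern string of the following item */
    unsigned item_length;     /* length of that item */
} toy_callout_info;

/* Read one LINK_SIZE value stored most significant byte first. */
static unsigned get2(const uint8_t *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}

/* code points at the callout opcode itself; the two values follow it. */
toy_callout_info toy_read_callout(const uint8_t *code)
{
    toy_callout_info info;
    info.pattern_offset = get2(code + 1);
    info.item_length = get2(code + 1 + LINK_SIZE);
    return info;
}
```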
|
|
||||||
In the case of a numeric callout, after these two values there is a single code
|
In the case of a numeric callout, after these two values there is a single code
|
||||||
unit containing the callout number, in the range 0-255, with 255 being used for
|
unit containing the callout number, in the range 0-255, with 255 being used for
|
||||||
|
@ -790,8 +810,8 @@ Opcode table checking
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
|
The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
|
||||||
not a real opcode, but is used to check that tables indexed by opcode are the
|
not a real opcode, but is used to check at compile time that tables indexed by
|
||||||
correct length, in order to catch updating errors.
|
opcode are the correct length, in order to catch updating errors.
|
||||||
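One way to get such a compile-time length check is a C11 static assertion, sketched below with a three-opcode toy list. The names are invented, and PCRE2 itself may well perform the check by other means; this only illustrates the idea of a final enumerator that equals the number of opcodes.

```c
#include <stdint.h>

enum {
    TOY_OP_END,
    TOY_OP_CHAR,
    TOY_OP_CHARI,
    TOY_OP_TABLE_LENGTH   /* not a real opcode: always last, so it equals the count */
};

/* Length in code units of each toy opcode, indexed by opcode. */
static const uint8_t toy_op_lengths[] = { 1, 2, 2 };

/* Compilation fails here if an opcode is added without extending the table. */
_Static_assert(sizeof(toy_op_lengths) / sizeof(toy_op_lengths[0]) == TOY_OP_TABLE_LENGTH,
               "toy_op_lengths is out of step with the opcode list");

unsigned toy_op_length(int op)
{
    return toy_op_lengths[op];
}
```

Adding a fourth opcode before TOY_OP_TABLE_LENGTH without adding a table entry makes the build fail immediately, which is exactly the kind of updating error the check is meant to catch.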
|
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
17 March 2017
|
21 April 2017
|
||||||
|
|