Documentation for callouts with string arguments.

Philip.Hazel 2015-03-15 17:49:03 +00:00
parent 15e034c9c2
commit 2ec7cbf9b5
6 changed files with 240 additions and 107 deletions

76
HACKING

@ -8,7 +8,7 @@ library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will
continue for bugfixes if necessary. PCRE2 releases start at 10.0 to avoid
continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
@ -39,7 +39,7 @@ subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version probably doesn't do this any more; I'm talking about
the original library.) The execution function operates by backtracking and
maximizing (or, optionally, minimizing in Perl) the amount of the subject that
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
@ -63,7 +63,7 @@ modes, creating up to three different libraries. In the description that
follows, the word "short" is used for a 16-bit data quantity, and the phrase
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
16-bit mode and a 32-bit word in 32-bit mode. The names of PCRE2 functions are
given in generic form, without a _8, _16, or _32 suffix.
given in generic form, without the _8, _16, or _32 suffix.
Computing the memory requirement: how it was
@ -100,8 +100,9 @@ issue, and in the event, nobody has commented on it.
At release 8.34, a limit on the nesting depth of parentheses was re-introduced
(default 250, settable at build time) so as to put a limit on the amount of
system stack used by the compile function. This is a safety feature for
environments with small stacks where the patterns are provided by users.
system stack used by the compile function, which uses recursive function calls
for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users.
Traditional matching function
@ -158,8 +159,9 @@ default value for LINK_SIZE is 2, except for the 32-bit library, where it can
only be 4. The 8-bit library can be compiled to use 3-byte or 4-byte values,
and the 16-bit library can be compiled to use 4-byte values, though this
impairs performance. Specifying a LINK_SIZE larger than 2 for these libraries is
necessary only when patterns whose compiled length is greater than 64K are
going to be processed.
necessary only when patterns whose compiled length is greater than 64K code
units are going to be processed. When a LINK_SIZE value uses more than one code
unit, the most significant unit is first.
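
As a rough illustration of the ordering just described (this is not code from PCRE2), the following Python sketch packs and unpacks a link value as a sequence of code units with the most significant unit first. The helper names put_link and get_link, and the units and unit_bits parameters, are hypothetical and exist only for this example.

```python
def put_link(value, units=2, unit_bits=8):
    """Split value into `units` code units, most significant unit first."""
    if value < 0 or value >= 1 << (units * unit_bits):
        raise ValueError("value does not fit in the link field")
    mask = (1 << unit_bits) - 1
    # Walk from the highest-order unit down to the lowest-order one.
    return [(value >> (unit_bits * i)) & mask for i in reversed(range(units))]


def get_link(code, units=2, unit_bits=8):
    """Rebuild the value from the first `units` code units of `code`."""
    value = 0
    for unit in code[:units]:
        value = (value << unit_bits) | unit
    return value
```

With the default two 8-bit units, put_link(14) gives [0, 14], which is the form in which link values appear in 8-bit compiled code when LINK_SIZE is 2.
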
In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
@ -343,7 +345,7 @@ For classes containing characters with values greater than 255 or that contain
code points are less than 256, followed by a list of pairs (for a range) and
single characters. In caseless mode, both cases are explicitly listed.
OP_XCLASS is followed by a LINK_SIZE item containing the total length of the
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits:
XCL_NOT indicates that this is a negative class, and XCL_MAP indicates that a
bit map is present. There follows the bit map, if XCL_MAP is set, and then a
@ -356,7 +358,7 @@ sequence of items coded as follows:
XCL_NOTPROP a Unicode property (type, value) follows
If a range starts with a code point less than 256 and ends with one greater
than 256, it is split into two ranges, with characters less than 256 being
than 255, it is split into two ranges, with characters less than 256 being
indicated in the bit map, and the rest with XCL_RANGE.
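
The splitting rule can be sketched in Python as follows. This is only a model of the rule stated above, not the compile function's actual logic; the function name split_class_range and the tuple representation of ranges are assumptions made for the example.

```python
def split_class_range(start, end):
    """Split an inclusive code point range at the 255/256 boundary.

    Returns (bitmap_part, xcl_range_part). Each part is an inclusive
    (low, high) tuple, or None if the range has nothing on that side.
    """
    if start > end:
        raise ValueError("range out of order")
    # Code points below 256 are recorded in the bit map.
    bitmap_part = (start, min(end, 255)) if start < 256 else None
    # Code points of 256 and above are recorded as an XCL_RANGE item.
    xcl_range_part = (max(start, 256), end) if end > 255 else None
    return bitmap_part, xcl_range_part
```

For example, a range from 0x41 to 0x17F is split into (0x41, 255) for the bit map and (256, 0x17F) for XCL_RANGE.
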
When XCL_NOT is set, the bit map, if present, contains bits for characters that
@ -412,17 +414,17 @@ compile time, so alternation always happens in the context of brackets.
myself, can be round, square, curly, or pointy. Hence this usage rather than
"parentheses".]
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA.
A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
next alternative OP_ALT or, if there aren't any branches, to the matching
OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
the next one, or to the OP_KET opcode. For capturing brackets, the bracket
OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
number is a count that immediately follows the offset.
OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively (see below for possessive repetitions). All three are followed by
LINK_SIZE bytes giving (as a positive number) the offset back to the matching
a LINK_SIZE value giving (as a positive number) the offset back to the matching
bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it
@ -520,8 +522,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE.
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with an assertion, whose opcode immediately follows OP_COND or
OP_SCOND.
must start with an assertion, whose opcode normally immediately follows OP_COND
or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
immediately before the assertion. It is also possible to insert a manual
callout at this point. Only assertion conditions may have callouts preceding
the condition.
Recursion
@ -529,22 +534,43 @@ Recursion
Recursion either matches the current pattern, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is automatically
wrapped inside OP_ONCE brackets, because otherwise some patterns broke it.
OP_RECURSE is also used for "subroutine" calls, even though they are not
strictly a recursion.
bracket from the start of the whole pattern. OP_RECURSE is also used for
"subroutine" calls, even though they are not strictly a recursion. Repeated
recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
brackets, but it is nevertheless still treated as an atomic group.
Callout
-------
OP_CALLOUT is followed by one code unit of data that holds a callout number in
the range 0 to 254 for manual callouts, or 255 for an automatic callout. In
both cases there follows a count giving the offset in the pattern string to the
A callout can nowadays have either a numerical argument or a string argument.
These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
followed by two LINK_SIZE values giving the offset in the pattern string to the
start of the following item, and another count giving the length of this item.
These values make it possible for pcre2test to output useful tracing
information using automatic callouts.
information using callouts.
In the case of a numeric callout, after these two values there is a single code
unit containing the callout number, in the range 0-255, with 255 being used for
callouts that are automatically inserted as a result of the PCRE2_AUTO_CALLOUT
option. Thus, this opcode item is of fixed length:
[OP_CALLOUT] [PATTERN_OFFSET] [PATTERN_LENGTH] [NUMBER]
For callouts with string arguments, OP_CALLOUT_STR has three more data items:
a LINK_SIZE value giving the complete length of the entire opcode item, a
LINK_SIZE item containing the offset within the pattern string to the start of
the string argument, and the string itself, preceded by its starting delimiter
and followed by a binary zero. When a callout function is called, a pointer to
the actual string is passed, but the delimiter can be accessed as string[-1] if
the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is
compiled as the following bytes (decimal numbers represent binary values):
  [OP_CALLOUT_STR] [0] [10] [0] [1] [0] [14] [0] [5] ['] [a] [b] [c] [0]
                   -------- ------- -------- -------
                      |        |       |        |
                      ----- LINK_SIZE items -----
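
The layout of a string callout item can be checked with a small decoder. The Python sketch below assumes the 8-bit library with LINK_SIZE set to 2 and follows the field order described above. The numeric value given to OP_CALLOUT_STR is a placeholder invented for the example (the real opcode value is not stated here), and the function and key names are hypothetical.

```python
OP_CALLOUT_STR = 0x77  # placeholder value, assumed for this sketch only


def get_link(code, pos, link_size=2):
    """Read a LINK_SIZE value at code[pos], most significant byte first."""
    value = 0
    for unit in code[pos:pos + link_size]:
        value = (value << 8) | unit
    return value


def decode_callout_str(code, link_size=2):
    """Decode one string callout item from a list of byte values."""
    if code[0] != OP_CALLOUT_STR:
        raise ValueError("not a string callout item")
    pos = 1
    fields = []
    for _ in range(4):  # four LINK_SIZE values follow the opcode
        fields.append(get_link(code, pos, link_size))
        pos += link_size
    pattern_offset, next_item_length, item_length, string_offset = fields
    delimiter = chr(code[pos])        # starting delimiter precedes the string
    string_start = pos + 1
    string_end = item_length - 1      # index of the terminating binary zero
    if code[string_end] != 0:
        raise ValueError("missing terminating zero")
    text = bytes(code[string_start:string_end]).decode("latin-1")
    return {
        "pattern_offset": pattern_offset,
        "next_item_length": next_item_length,
        "item_length": item_length,
        "string_offset": string_offset,
        "delimiter": delimiter,
        "string": text,
    }


# The callout from /X(?C'abc')Y/ as laid out in the diagram above.
example = [OP_CALLOUT_STR, 0, 10, 0, 1, 0, 14, 0, 5,
           ord("'"), ord("a"), ord("b"), ord("c"), 0]
decoded = decode_callout_str(example)
```

Here the pattern offset 10 and length 1 identify the following item Y, the item length 14 covers the opcode through the terminating zero, and the string offset 5 is the position of "a" in the pattern string.
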
Opcode table checking
---------------------
@ -554,4 +580,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the
correct length, in order to catch updating errors.
Philip Hazel
February 2015
March 2015


@ -1,4 +1,4 @@
.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00"
.TH PCRE2CALLOUT 3 "15 March 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -15,18 +15,22 @@ PCRE2 - Perl-compatible regular expressions (revised API)
PCRE2 provides a feature called "callout", which is a means of temporarily
passing control to the caller of PCRE2 in the middle of pattern matching. The
caller of PCRE2 provides an external function by putting its entry point in
a match context (see \fBpcre2_set_callout()\fP) in the
a match context (see \fBpcre2_set_callout()\fP in the
.\" HREF
\fBpcre2api\fP
.\"
documentation).
.P
Within a regular expression, (?C) indicates the points at which the external
Within a regular expression, (?C<arg>) indicates a point at which the external
function is to be called. Different callout points can be identified by putting
a number less than 256 after the letter C. The default value is zero.
For example, this pattern has two callout points:
Alternatively, the argument may be a delimited string. The starting delimiter
must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the
start, except for {, where the ending delimiter is }. If the ending delimiter
is needed within the string, it must be doubled. For example, this pattern has
two callout points:
.sp
(?C1)abc(?C2)def
(?C1)abc(?C"some ""arbitrary"" text")def
.sp
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
@ -43,20 +47,19 @@ alternation bar. If the pattern contains a conditional group whose condition is
an assertion, an automatic callout is inserted immediately before the
condition. Such a callout may also be inserted explicitly, for example:
.sp
(?(?C9)(?=a)ab|de)
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
.sp
This applies only to assertion conditions (because they are themselves
independent groups).
.P
Automatic callouts can be used for tracking the progress of pattern matching.
The
Callouts can be useful for tracking the progress of pattern matching. The
.\" HREF
\fBpcre2test\fP
.\"
program has a pattern qualifier (/auto_callout) that sets automatic callouts;
when it is used, the output indicates how the pattern is being matched. This is
useful information when you are trying to optimize the performance of a
particular pattern.
program has a pattern qualifier (/auto_callout) that sets automatic callouts.
When any callouts are present, the output from \fBpcre2test\fP indicates how
the pattern is being matched. This is useful information when you are trying to
optimize the performance of a particular pattern.
.
.
.SH "MISSING CALLOUTS"
@ -193,15 +196,52 @@ documentation). The callout block structure contains the following fields:
PCRE2_SIZE \fIcurrent_position\fP;
PCRE2_SIZE \fIpattern_position\fP;
PCRE2_SIZE \fInext_item_length\fP;
PCRE2_SIZE \fIcallout_string_offset\fP;
PCRE2_SPTR \fIcallout_string\fP;
uint32_t \fIcallout_string_length\fP;
.sp
The \fIversion\fP field contains the version number of the block format. The
current version is 0. The version number will change in future if additional
fields are added, but the intention is never to remove any of the existing
fields.
current version is 1; the three callout string fields were added for this
version. If you are writing an application that might use an earlier release of
PCRE2, you should check the version number before accessing any of these
fields. The version number will increase in future if more fields are added,
but the intention is never to remove any of the existing fields.
.
.
.SS "Fields for numerical callouts"
.rs
.sp
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
contains the number of the callout, in the range 0-255. This is the number
that follows (?C for manual callouts; it is 255 for automatically generated
callouts.
.
.
.SS "Fields for string callouts"
.rs
.sp
For callouts with string arguments, \fIcallout_number\fP is always zero, and
\fIcallout_string\fP points to the string that is contained within the compiled
pattern. Its length is given by \fIcallout_string_length\fP. Duplicated ending
delimiters that were present in the original pattern string have been turned
into single characters. An additional code unit containing binary zero is
present after the string, but is not included in the length. The delimiter that
was used to start the string is also stored within the pattern, immediately
before the string itself. You can therefore access this delimiter as
\fIcallout_string\fP[-1] if you need it.
.P
The \fIcallout_number\fP field contains the number of the callout, as compiled
into the pattern (that is, the number after ?C for manual callouts, and 255 for
automatically generated callouts).
The \fIcallout_string_offset\fP field is the code unit offset to the start of
the callout argument string within the original pattern string. This is
provided for the benefit of applications such as script languages that might
need to report errors in the callout string within the pattern.
.
.
.SS "Fields for all callouts"
.rs
.sp
The remaining fields in the callout block are the same for both kinds of
callout.
.P
The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
(the "ovector") that was passed to the matching function in the match data
@ -246,7 +286,9 @@ of the entire subpattern.
.P
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
help in distinguishing between different automatic callouts, which all have the
same callout number. However, they are set for all callouts.
same callout number. However, they are set for all callouts, and are used by
\fBpcre2test\fP to show the next item to be matched when displaying callout
information.
.P
In callouts from \fBpcre2_match()\fP the \fImark\fP field contains a pointer to
the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
@ -285,6 +327,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 January 2015
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2COMPAT 3 "28 September 2014" "PCRE2 10.0"
.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@ -69,11 +69,11 @@ the
.\"
documentation for details.
.P
8. Subpatterns that are called as subroutines (whether or not recursively) are
always treated as atomic groups in PCRE2. This is like Python, but unlike Perl.
Captured values that are set outside a subroutine call can be reference from
inside in PCRE2, but not in Perl. There is a discussion that explains these
differences in more detail in the
8. Subroutine calls (whether recursive or not) are treated as atomic groups.
Atomic recursion is like Python, but unlike Perl. Captured values that are set
outside a subroutine call can be referenced from inside in PCRE2, but not in
Perl. There is a discussion that explains these differences in more detail in
the
.\" HTML <a href="pcre2pattern.html#recursiondifference">
.\" </a>
section on recursion differences from Perl
@ -185,6 +185,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 September 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "28 January 2015" "PCRE2 10.00"
.TH PCRE2PATTERN 3 "15 March 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -2821,42 +2821,69 @@ same pair of parentheses when there is a repetition.
PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
code. The feature is called "callout". The caller of PCRE2 provides an external
function by putting its entry point in a match context using the function
\fBpcre2_set_callout()\fP and passing the context to \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout entry
point is set to NULL, callouts are disabled.
\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP
or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
entry point is set to NULL, callouts are disabled.
.P
Within a regular expression, (?C) indicates the points at which the external
function is to be called. If you want to identify different callout points, you
can put a number less than 256 after the letter C. The default value is zero.
For example, this pattern has two callout points:
Within a regular expression, (?C<arg>) indicates a point at which the external
function is to be called. There are two kinds of callout: those with a
numerical argument and those with a string argument. (?C) on its own with no
argument is treated as (?C0). A numerical argument allows the application to
distinguish between different callouts. String arguments were added for release
10.20 to make it possible for script languages that use PCRE2 to embed short
scripts within patterns in a similar way to Perl.
.P
During matching, when PCRE2 reaches a callout point, the external function is
called. It is provided with the number or string argument of the callout, the
position in the pattern, and one item of data that is also set in the match
block. The callout function may cause matching to proceed, to backtrack, or to
fail.
.P
By default, PCRE2 implements a number of optimizations at matching time, and
one side-effect is that sometimes callouts are skipped. If you need all
possible callouts to happen, you need to set options that disable the relevant
optimizations. More details, including a complete description of the
programming interface to the callout function, are given in the
.\" HREF
\fBpcre2callout\fP
.\"
documentation.
.
.
.SS "Callouts with numerical arguments"
.rs
.sp
If you just want to have a means of identifying different callout points, put a
number less than 256 after the letter C. For example, this pattern has two
callout points:
.sp
(?C1)abc(?C2)def
.sp
If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, callouts are
automatically installed before each item in the pattern. They are all numbered
255. If there is a conditional group in the pattern whose condition is an
assertion, an additional callout is inserted just before the condition. An
explicit callout may also be set at this position, as in this example:
If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
callouts are automatically installed before each item in the pattern. They are
all numbered 255. If there is a conditional group in the pattern whose
condition is an assertion, an additional callout is inserted just before the
condition. An explicit callout may also be set at this position, as in this
example:
.sp
(?(?C9)(?=a)abc|def)
.sp
Note that this applies only to assertion conditions, not to other types of
condition.
.P
During matching, when PCRE2 reaches a callout point, the external function is
called. It is provided with the number of the callout, the position in the
pattern, and one item of data that is also set in the match block. The callout
function may cause matching to proceed, to backtrack, or to fail.
.P
By default, PCRE2 implements a number of optimizations at matching time, and
one side-effect is that sometimes callouts are skipped. If you need all
possible callouts to happen, you need to set options that disable the relevant
optimizations. More details, and a complete description of the interface to the
callout function, are given in the
.\" HREF
\fBpcre2callout\fP
.\"
documentation.
.
.
.SS "Callouts with string arguments"
.rs
.sp
A delimited string may be used instead of a number as a callout argument. The
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
the same as the start, except for {, where the ending delimiter is }. If the
ending delimiter is needed within the string, it must be doubled. For
example:
.sp
(?C'ab ''c'' d')xyz(?C{any text})pqr
.sp
The doubling is removed before the string is passed to the callout function.
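
The delimiter rules can be modelled with a short Python sketch. It is an illustration of the rules stated above rather than PCRE2's own parser. The names CLOSING and parse_callout_string are hypothetical, and the requirement that the ending delimiter be followed immediately by a closing parenthesis is an assumption based on the (?C"text") syntax.

```python
# Starting delimiters and their matching ending delimiters.
CLOSING = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(pattern, pos):
    """Parse a string callout argument.

    `pos` is the index of the starting delimiter, just after "(?C".
    Returns (string with doubling removed, index of the closing ")").
    """
    opener = pattern[pos]
    if opener not in CLOSING:
        raise ValueError("unrecognized callout string delimiter")
    closer = CLOSING[opener]
    out = []
    i = pos + 1
    while i < len(pattern):
        ch = pattern[i]
        if ch == closer:
            # A doubled ending delimiter stands for one literal delimiter.
            if i + 1 < len(pattern) and pattern[i + 1] == closer:
                out.append(closer)
                i += 2
                continue
            # A single ending delimiter terminates the string.
            if i + 1 < len(pattern) and pattern[i + 1] == ")":
                return "".join(out), i + 1
            raise ValueError("missing ) after callout string")
        out.append(ch)
        i += 1
    raise ValueError("unterminated callout string")


pattern = "(?C'ab ''c'' d')xyz(?C{any text})pqr"
first = parse_callout_string(pattern, 3)
second = parse_callout_string(pattern, 22)
```

For the example pattern above, the first callout yields the string ab 'c' d and the second yields any text.
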
.
.
.\" HTML <a name="backtrackcontrol"></a>
@ -3302,6 +3329,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 28 January 2015
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "26 January 2015" "PCRE2 10.00"
.TH PCRE2SYNTAX 3 "15 March 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -513,8 +513,13 @@ pattern is not anchored.
.SH "CALLOUTS"
.rs
.sp
(?C) callout
(?Cn) callout with data n
(?C) callout (assumed number 0)
(?Cn) callout with numerical data n
(?C"text") callout with string data
.sp
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
.
.
.SH "SEE ALSO"
@ -538,6 +543,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 26 January 2015
Last updated: 15 March 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2TEST 1 "23 January 2015" "PCRE 10.10"
.TH PCRE2TEST 1 "14 March 2015" "PCRE 10.20"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@ -875,11 +875,14 @@ set, the current captured groups are output when a callout occurs.
The \fBcallout_fail\fP modifier can be given one or two numbers. If there is
only one number, 1 is returned instead of 0 when a callout of that number is
reached. If two numbers are given, 1 is returned when callout <n> is reached
for the <m>th time.
for the <m>th time. Note that callouts with string arguments are always given
the number zero. See "Callouts" below for a description of the output when a
callout is taken.
.P
The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
Any value other than zero is used as a return from \fBpcre2test\fP's callout
function.
This is set as the "user data" that is passed to the matching function, and
passed back when the callout function is invoked. Any value other than zero is
used as a return from \fBpcre2test\fP's callout function.
.
.
.SS "Finding all matches in a string"
@ -1231,10 +1234,31 @@ documentation.
.rs
.sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout
function is called during matching. This works with both matching functions. By
default, the called function displays the callout number, the start and current
positions in the text at the callout time, and the next pattern item to be
tested. For example:
function is called during matching unless \fBcallout_none\fP is specified.
This works with both matching functions.
.P
The callout function in \fBpcre2test\fP returns zero (carry on matching) by
default, but you can use a \fBcallout_fail\fP modifier in a subject line (as
described above) to change this and other parameters of the callout.
.P
Inserting callouts can be helpful when using \fBpcre2test\fP to check
complicated regular expressions. For further information about callouts, see
the
.\" HREF
\fBpcre2callout\fP
.\"
documentation.
.P
The output for callouts with numerical arguments and those with string
arguments is slightly different.
.
.
.SS "Callouts with numerical arguments"
.rs
.sp
By default, the callout function displays the callout number, the start and
current positions in the subject text at the callout time, and the next pattern
item to be tested. For example:
.sp
--->pqrabcdef
0 ^ ^ \ed
@ -1275,18 +1299,27 @@ a change of latest mark is passed to the callout function. For example:
The mark changes between matching "a" and "b", but stays the same for the rest
of the match, so nothing more is output. If, as a result of backtracking, the
mark reverts to being unset, the text "<unset>" is output.
.P
The callout function in \fBpcre2test\fP returns zero (carry on matching) by
default, but you can use a \fBcallout_fail\fP modifier in a subject line (as
described above) to change this and other parameters of the callout.
.P
Inserting callouts can be helpful when using \fBpcre2test\fP to check
complicated regular expressions. For further information about callouts, see
the
.\" HREF
\fBpcre2callout\fP
.\"
documentation.
.
.
.SS "Callouts with string arguments"
.rs
.sp
The output for a callout with a string argument is similar, except that instead
of outputting a callout number before the position indicators, the callout
string and its offset in the pattern string are output before the reflection of
the subject string, and the subject string is reflected for each callout. For
example:
.sp
re> /^ab(?C'first')cd(?C"second")ef/
data> abcdefg
Callout (7): 'first'
--->abcdefg
^ ^ c
Callout (20): "second"
--->abcdefg
^ ^ e
0: abcdef
.sp
.
.
.
@ -1398,6 +1431,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 January 2015
Last updated: 14 March 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi