diff --git a/HACKING b/HACKING index 87a7579..1e26d01 100644 --- a/HACKING +++ b/HACKING @@ -8,7 +8,7 @@ library is referred to as PCRE1 below. For information about testing PCRE2, see the pcre2test documentation and the comment at the head of the RunTest file. PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will -continue for bugfixes if necessary. PCRE2 releases start at 10.0 to avoid +continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid confusion with PCRE1. @@ -39,7 +39,7 @@ subsequently heavily modified for Perl) compiles the expression twice: once in a dummy mode in order to find out how much store will be needed, and then for real. (The Perl version probably doesn't do this any more; I'm talking about the original library.) The execution function operates by backtracking and -maximizing (or, optionally, minimizing in Perl) the amount of the subject that +maximizing (or, optionally, minimizing, in Perl) the amount of the subject that matches individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's terminology. @@ -63,7 +63,7 @@ modes, creating up to three different libraries. In the description that follows, the word "short" is used for a 16-bit data quantity, and the phrase "code unit" is used for a quantity that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit word in 32-bit mode. The names of PCRE2 functions are -given in generic form, without a _8, _16, or _32 suffix. +given in generic form, without the _8, _16, or _32 suffix. Computing the memory requirement: how it was @@ -100,8 +100,9 @@ issue, and in the event, nobody has commented on it. At release 8.34, a limit on the nesting depth of parentheses was re-introduced (default 250, settable at build time) so as to put a limit on the amount of -system stack used by the compile function. This is a safety feature for -environments with small stacks where the patterns are provided by users. 
+system stack used by the compile function, which uses recursive function calls +for nested parenthesized groups. This is a safety feature for environments with +small stacks where the patterns are provided by users. Traditional matching function @@ -158,8 +159,9 @@ default value for LINK_SIZE is 2, except for the 32-bit library, where it can only be 4. The 8-bit library can be compiled to used 3-byte or 4-byte values, and the 16-bit library can be compiled to use 4-byte values, though this impairs performance. Specifing a LINK_SIZE larger than 2 for these libraries is -necessary only when patterns whose compiled length is greater than 64K are -going to be processed. +necessary only when patterns whose compiled length is greater than 64K code +units are going to be processed. When a LINK_SIZE value uses more than one code +unit, the most significant unit is first. In this description, we assume the "normal" compilation options. Data values that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode @@ -343,7 +345,7 @@ For classes containing characters with values greater than 255 or that contain code points are less than 256, followed by a list of pairs (for a range) and single characters. In caseless mode, both cases are explicitly listed. -OP_XCLASS is followed by a LINK_SIZE item containing the total length of the +OP_XCLASS is followed by a LINK_SIZE value containing the total length of the opcode and its data. This is followed by a code unit containing flag bits: XCL_NOT indicates that this is a negative class, and XCL_MAP indicates that a bit map is present. 
There follows the bit map, if XCL_MAP is set, and then a @@ -356,7 +358,7 @@ sequence of items coded as follows: XCL_NOTPROP a Unicode property (type, value) follows If a range starts with a code point less than 256 and ends with one greater -than 256, it is split into two ranges, with characters less than 256 being +than 255, it is split into two ranges, with characters less than 256 being indicated in the bit map, and the rest with XCL_RANGE. When XCL_NOT is set, the bit map, if present, contains bits for characters that @@ -412,17 +414,17 @@ compile time, so alternation always happens in the context of brackets. myself, can be round, square, curly, or pointy. Hence this usage rather than "parentheses".] -Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. -A bracket opcode is followed by LINK_SIZE bytes which give the offset to the +Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A +bracket opcode is followed by a LINK_SIZE value which gives the offset to the next alternative OP_ALT or, if there aren't any branches, to the matching -OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to -the next one, or to the OP_KET opcode. For capturing brackets, the bracket +OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset +to the next one, or to the OP_KET opcode. For capturing brackets, the bracket number is a count that immediately follows the offset. OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or maximally respectively (see below for possessive repetitions). All three are followed by -LINK_SIZE bytes giving (as a positive number) the offset back to the matching +a LINK_SIZE value giving (as a positive number) the offset back to the matching bracket opcode. 
If a subpattern is quantified such that it is permitted to match zero times, it @@ -520,8 +522,11 @@ tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE or OP_FALSE. If a condition is not a back reference, recursion test, DEFINE, or VERSION, it -must start with an assertion, whose opcode immediately follows OP_COND or -OP_SCOND. +must start with an assertion, whose opcode normally immediately follows OP_COND +or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted +immediately before the assertion. It is also possible to insert a manual +callout at this point. Only assertion conditions may have callouts preceding +the condition. Recursion @@ -529,22 +534,43 @@ Recursion Recursion either matches the current pattern, or some subexpression. The opcode OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting -bracket from the start of the whole pattern. OP_RECURSE is automatically -wrapped inside OP_ONCE brackets, because otherwise some patterns broke it. -OP_RECURSE is also used for "subroutine" calls, even though they are not -strictly a recursion. +bracket from the start of the whole pattern. OP_RECURSE is also used for +"subroutine" calls, even though they are not strictly a recursion. Repeated +recursions are automatically wrapped inside OP_ONCE brackets, because otherwise +some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE +brackets, but it is nevertheless still treated as an atomic group. Callout ------- -OP_CALLOUT is followed by one code unit of data that holds a callout number in -the range 0 to 254 for manual callouts, or 255 for an automatic callout. In -both cases there follows a count giving the offset in the pattern string to the +A callout can nowadays have either a numerical argument or a string argument. +These use OP_CALLOUT or OP_CALLOUT_STR, respectively. 
In each case these are +followed by two LINK_SIZE values giving the offset in the pattern string to the start of the following item, and another count giving the length of this item. These values make it possible for pcre2test to output useful tracing -information using automatic callouts. +information using callouts. +In the case of a numeric callout, after these two values there is a single code +unit containing the callout number, in the range 0-255, with 255 being used for +callouts that are automatically inserted as a result of the PCRE2_AUTO_CALLOUT +option. Thus, this opcode item is of fixed length: + + [OP_CALLOUT] [PATTERN_OFFSET] [PATTERN_LENGTH] [NUMBER] + +For callouts with string arguments, OP_CALLOUT_STR has three more data items: +a LINK_SIZE value giving the complete length of the entire opcode item, a +LINK_SIZE item containing the offset within the pattern string to the start of +the string argument, and the string itself, preceded by its starting delimiter +and followed by a binary zero. When a callout function is called, a pointer to +the actual string is passed, but the delimiter can be accessed as string[-1] if +the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is +compiled as the following bytes (decimal numbers represent binary values): + + [OP_CALLOUT] [0] [10] [0] [1] [0] [14] [0] [5] ['] [a] [b] [c] [0] + -------- ------- -------- ------- + | | | | + ------- LINK_SIZE items ------ Opcode table checking --------------------- @@ -554,4 +580,4 @@ not a real opcode, but is used to check that tables indexed by opcode are the correct length, in order to catch updating errors. 
Philip Hazel -February 2015 +March 2015 diff --git a/doc/pcre2callout.3 b/doc/pcre2callout.3 index eeac0d5..a485e1d 100644 --- a/doc/pcre2callout.3 +++ b/doc/pcre2callout.3 @@ -1,4 +1,4 @@ -.TH PCRE2CALLOUT 3 "02 January 2015" "PCRE2 10.00" +.TH PCRE2CALLOUT 3 "15 March 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -15,18 +15,22 @@ PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 provides a feature called "callout", which is a means of temporarily passing control to the caller of PCRE2 in the middle of pattern matching. The caller of PCRE2 provides an external function by putting its entry point in -a match context (see \fBpcre2_set_callout()\fP) in the +a match context (see \fBpcre2_set_callout()\fP in the .\" HREF \fBpcre2api\fP .\" documentation). .P -Within a regular expression, (?C) indicates the points at which the external +Within a regular expression, (?C) indicates a point at which the external function is to be called. Different callout points can be identified by putting a number less than 256 after the letter C. The default value is zero. -For example, this pattern has two callout points: +Alternatively, the argument may be a delimited string. The starting delimiter +must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the +start, except for {, where the ending delimiter is }. If the ending delimiter +is needed within the string, it must be doubled. For example, this pattern has +two callout points: .sp - (?C1)abc(?C2)def + (?C1)abc(?C"some ""arbitrary"" text")def .sp If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2 automatically inserts callouts, all with number 255, before each item in the @@ -43,20 +47,19 @@ alternation bar. If the pattern contains a conditional group whose condition is an assertion, an automatic callout is inserted immediately before the condition. 
Such a callout may also be inserted explicitly, for example: .sp - (?(?C9)(?=a)ab|de) + (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de) .sp This applies only to assertion conditions (because they are themselves independent groups). .P -Automatic callouts can be used for tracking the progress of pattern matching. -The +Callouts can be useful for tracking the progress of pattern matching. The .\" HREF \fBpcre2test\fP .\" -program has a pattern qualifier (/auto_callout) that sets automatic callouts; -when it is used, the output indicates how the pattern is being matched. This is -useful information when you are trying to optimize the performance of a -particular pattern. +program has a pattern qualifier (/auto_callout) that sets automatic callouts. +When any callouts are present, the output from \fBpcre2test\fP indicates how +the pattern is being matched. This is useful information when you are trying to +optimize the performance of a particular pattern. . . .SH "MISSING CALLOUTS" @@ -193,15 +196,52 @@ documentation). The callout block structure contains the following fields: PCRE2_SIZE \fIcurrent_position\fP; PCRE2_SIZE \fIpattern_position\fP; PCRE2_SIZE \fInext_item_length\fP; + PCRE2_SIZE \fIcallout_string_offset\fP; + PCRE2_SPTR \fIcallout_string\fP; + uint32_t \fIcallout_string_length\fP; + .sp The \fIversion\fP field contains the version number of the block format. The -current version is 0. The version number will change in future if additional -fields are added, but the intention is never to remove any of the existing -fields. +current version is 1; the three callout string fields were added for this +version. If you are writing an application that might use an earlier release of +PCRE2, you should check the version number before accessing any of these +fields. The version number will increase in future if more fields are added, +but the intention is never to remove any of the existing fields. +. +. 
+.SS "Fields for numerical callouts" +.rs +.sp +For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP +contains the number of the callout, in the range 0-255. This is the number +that follows (?C for manual callouts; it is 255 for automatically generated +callouts. +. +. +.SS "Fields for string callouts" +.rs +.sp +For callouts with string arguments, \fIcallout_number\fP is always zero, and +\fIcallout_string\fP points to the string that is contained within the compiled +pattern. Its length is given by \fIcallout_string_length\fP. Duplicated ending +delimiters that were present in the original pattern string have been turned +into single characters. An additional code unit containing binary zero is +present after the string, but is not included in the length. The delimiter that +was used to start the string is also stored within the pattern, immediately +before the string itself. You can therefore access this delimiter as +\fIcallout_string\fP[-1] if you need it. .P -The \fIcallout_number\fP field contains the number of the callout, as compiled -into the pattern (that is, the number after ?C for manual callouts, and 255 for -automatically generated callouts). +The \fIcallout_string_offset\fP field is the code unit offset to the start of +the callout argument string within the original pattern string. This is +provided for the benefit of applications such as script languages that might +need to report errors in the callout string within the pattern. +. +. +.SS "Fields for all callouts" +.rs +.sp +The remaining fields in the callout block are the same for both kinds of +callout. .P The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets (the "ovector") that was passed to the matching function in the match data @@ -246,7 +286,9 @@ of the entire subpattern. 
.P The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to help in distinguishing between different automatic callouts, which all have the -same callout number. However, they are set for all callouts. +same callout number. However, they are set for all callouts, and are used by +\fBpcre2test\fP to show the next item to be matched when displaying callout +information. .P In callouts from \fBpcre2_match()\fP the \fImark\fP field contains a pointer to the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or @@ -285,6 +327,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 02 January 2015 +Last updated: 15 March 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2compat.3 b/doc/pcre2compat.3 index a7c0c6c..a3306d7 100644 --- a/doc/pcre2compat.3 +++ b/doc/pcre2compat.3 @@ -1,4 +1,4 @@ -.TH PCRE2COMPAT 3 "28 September 2014" "PCRE2 10.0" +.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "DIFFERENCES BETWEEN PCRE2 AND PERL" @@ -69,11 +69,11 @@ the .\" documentation for details. .P -8. Subpatterns that are called as subroutines (whether or not recursively) are -always treated as atomic groups in PCRE2. This is like Python, but unlike Perl. -Captured values that are set outside a subroutine call can be reference from -inside in PCRE2, but not in Perl. There is a discussion that explains these -differences in more detail in the +8. Subroutine calls (whether recursive or not) are treated as atomic groups. +Atomic recursion is like Python, but unlike Perl. Captured values that are set +outside a subroutine call can be referenced from inside in PCRE2, but not in +Perl. There is a discussion that explains these differences in more detail in +the .\" HTML .\" section on recursion differences from Perl @@ -185,6 +185,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 28 September 2014 -Copyright (c) 1997-2014 University of Cambridge. 
+Last updated: 15 March 2015 +Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index e0d9b49..7c237fa 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "28 January 2015" "PCRE2 10.00" +.TH PCRE2PATTERN 3 "15 March 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -2821,42 +2821,69 @@ same pair of parentheses when there is a repetition. PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl code. The feature is called "callout". The caller of PCRE2 provides an external function by putting its entry point in a match context using the function -\fBpcre2_set_callout()\fP and passing the context to \fBpcre2_match()\fP or -\fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout entry -point is set to NULL, callouts are disabled. +\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP +or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout +entry point is set to NULL, callouts are disabled. .P -Within a regular expression, (?C) indicates the points at which the external -function is to be called. If you want to identify different callout points, you -can put a number less than 256 after the letter C. The default value is zero. -For example, this pattern has two callout points: +Within a regular expression, (?C) indicates a point at which the external +function is to be called. There are two kinds of callout: those with a +numerical argument and those with a string argument. (?C) on its own with no +argument is treated as (?C0). A numerical argument allows the application to +distinguish between different callouts. String arguments were added for release +10.20 to make it possible for script languages that use PCRE2 to embed short +scripts within patterns in a similar way to Perl. 
+.P +During matching, when PCRE2 reaches a callout point, the external function is +called. It is provided with the number or string argument of the callout, the +position in the pattern, and one item of data that is also set in the match +block. The callout function may cause matching to proceed, to backtrack, or to +fail. +.P +By default, PCRE2 implements a number of optimizations at matching time, and +one side-effect is that sometimes callouts are skipped. If you need all +possible callouts to happen, you need to set options that disable the relevant +optimizations. More details, including a complete description of the +programming interface to the callout function, are given in the +.\" HREF +\fBpcre2callout\fP +.\" +documentation. +. +. +.SS "Callouts with numerical arguments" +.rs +.sp +If you just want to have a means of identifying different callout points, put a +number less than 256 after the letter C. For example, this pattern has two +callout points: .sp (?C1)abc(?C2)def .sp -If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, callouts are -automatically installed before each item in the pattern. They are all numbered -255. If there is a conditional group in the pattern whose condition is an -assertion, an additional callout is inserted just before the condition. An -explicit callout may also be set at this position, as in this example: +If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical +callouts are automatically installed before each item in the pattern. They are +all numbered 255. If there is a conditional group in the pattern whose +condition is an assertion, an additional callout is inserted just before the +condition. An explicit callout may also be set at this position, as in this +example: .sp (?(?C9)(?=a)abc|def) .sp Note that this applies only to assertion conditions, not to other types of condition. -.P -During matching, when PCRE2 reaches a callout point, the external function is -called. 
It is provided with the number of the callout, the position in the -pattern, and one item of data that is also set in the match block. The callout -function may cause matching to proceed, to backtrack, or to fail. -.P -By default, PCRE2 implements a number of optimizations at matching time, and -one side-effect is that sometimes callouts are skipped. If you need all -possible callouts to happen, you need to set options that disable the relevant -optimizations. More details, and a complete description of the interface to the -callout function, are given in the -.\" HREF -\fBpcre2callout\fP -.\" -documentation. +. +. +.SS "Callouts with string arguments" +.rs +.sp +A delimited string may be used instead of a number as a callout argument. The +starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is +the same as the start, except for {, where the ending delimiter is }. If the +ending delimiter is needed within the string, it must be doubled. For +example: +.sp + (?C'ab ''c'' d')xyz(?C{any text})pqr +.sp +The doubling is removed before the string is passed to the callout function. . . .\" HTML @@ -3302,6 +3329,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 28 January 2015 +Last updated: 15 March 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index f7e231c..cfc6d0f 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "26 January 2015" "PCRE2 10.00" +.TH PCRE2SYNTAX 3 "15 March 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -513,8 +513,13 @@ pattern is not anchored. 
.SH "CALLOUTS" .rs .sp - (?C) callout - (?Cn) callout with data n + (?C) callout (assumed number 0) + (?Cn) callout with numerical data n + (?C"text") callout with string data +.sp +The allowed string delimiters are ` ' " ^ % # $ (which are the same for the +start and the end), and the starting delimiter { matched with the ending +delimiter }. To encode the ending delimiter within the string, double it. . . .SH "SEE ALSO" @@ -538,6 +543,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 26 January 2015 +Last updated: 15 March 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 9c45b7c..a1fccb0 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -1,4 +1,4 @@ -.TH PCRE2TEST 1 "23 January 2015" "PCRE 10.10" +.TH PCRE2TEST 1 "14 March 2015" "PCRE 10.20" .SH NAME pcre2test - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS @@ -875,11 +875,14 @@ set, the current captured groups are output when a callout occurs. The \fBcallout_fail\fP modifier can be given one or two numbers. If there is only one number, 1 is returned instead of 0 when a callout of that number is reached. If two numbers are given, 1 is returned when callout <n> is reached -for the <m>th time. +for the <m>th time. Note that callouts with string arguments are always given +the number zero. See "Callouts" below for a description of the output when a +callout is taken. .P The \fBcallout_data\fP modifier can be given an unsigned or a negative number. -Any value other than zero is used as a return from \fBpcre2test\fP's callout -function. +This is set as the "user data" that is passed to the matching function, and +passed back when the callout function is invoked. Any value other than zero is +used as a return from \fBpcre2test\fP's callout function. . . .SS "Finding all matches in a string" @@ -1231,10 +1234,31 @@ documentation. 
.rs .sp If the pattern contains any callout requests, \fBpcre2test\fP's callout -function is called during matching. This works with both matching functions. By -default, the called function displays the callout number, the start and current -positions in the text at the callout time, and the next pattern item to be -tested. For example: +function is called during matching unless \fBcallout_none\fP is specified. +This works with both matching functions. +.P +The callout function in \fBpcre2test\fP returns zero (carry on matching) by +default, but you can use a \fBcallout_fail\fP modifier in a subject line (as +described above) to change this and other parameters of the callout. +.P +Inserting callouts can be helpful when using \fBpcre2test\fP to check +complicated regular expressions. For further information about callouts, see +the +.\" HREF +\fBpcre2callout\fP +.\" +documentation. +.P +The output for callouts with numerical arguments and those with string +arguments is slightly different. +. +. +.SS "Callouts with numerical arguments" +.rs +.sp +By default, the callout function displays the callout number, the start and +current positions in the subject text at the callout time, and the next pattern +item to be tested. For example: .sp --->pqrabcdef 0 ^ ^ \ed @@ -1275,18 +1299,27 @@ a change of latest mark is passed to the callout function. For example: The mark changes between matching "a" and "b", but stays the same for the rest of the match, so nothing more is output. If, as a result of backtracking, the mark reverts to being unset, the text "<unset>" is output. -.P -The callout function in \fBpcre2test\fP returns zero (carry on matching) by -default, but you can use a \fBcallout_fail\fP modifier in a subject line (as -described above) to change this and other parameters of the callout. -.P -Inserting callouts can be helpful when using \fBpcre2test\fP to check -complicated regular expressions. 
For further information about callouts, see -the -.\" HREF -\fBpcre2callout\fP -.\" -documentation. +. +. +.SS "Callouts with string arguments" +.rs +.sp +The output for a callout with a string argument is similar, except that instead +of outputting a callout number before the position indicators, the callout +string and its offset in the pattern string are output before the reflection of +the subject string, and the subject string is reflected for each callout. For +example: +.sp + re> /^ab(?C'first')cd(?C"second")ef/ + data> abcdefg + Callout (7): 'first' + --->abcdefg + ^ ^ c + Callout (20): "second" + --->abcdefg + ^ ^ e + 0: abcdef +.sp . . . @@ -1398,6 +1431,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 23 January 2015 +Last updated: 14 March 2015 Copyright (c) 1997-2015 University of Cambridge. .fi
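The delimiter rules for string callouts that this patch documents in pcre2pattern.3 and pcre2syntax.3 (starting delimiter one of ` ' " ^ % # $ {, with } closing {, and a doubled ending delimiter standing for one literal character) can be sketched in C as follows. This is only an illustration of the documented rules: decode_callout_string() is a hypothetical helper, not a PCRE2 function, and it does not reflect how pcre2_compile() actually parses the argument.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of the PCRE2 API. It decodes the delimited
   string argument of a callout. `arg` points at the starting delimiter, that
   is, the character immediately after "(?C". The decoded text is written to
   `out` as a zero-terminated string with doubled ending delimiters reduced
   to single characters. Returns the decoded length, or -1 if the starting
   delimiter is not allowed, the string is unterminated, or `out` is too
   small. */
int decode_callout_string(const char *arg, char *out, size_t outsize)
{
  static const char starters[] = "`'\"^%#${";
  char end;
  size_t n = 0;
  const char *p;

  if (*arg == '\0' || strchr(starters, *arg) == NULL) return -1;
  end = (*arg == '{') ? '}' : *arg;   /* only { has a different closer */

  for (p = arg + 1; *p != '\0'; p++) {
    if (*p == end) {
      if (p[1] != end) {              /* single ending delimiter: finished */
        if (n >= outsize) return -1;
        out[n] = '\0';
        return (int)n;
      }
      p++;                            /* doubled delimiter: keep one copy */
    }
    if (n + 1 >= outsize) return -1;  /* leave room for the terminator */
    out[n++] = *p;
  }
  return -1;                          /* no closing delimiter was found */
}
```

Applied to the examples in the patch, (?C'ab ''c'' d') yields the text ab 'c' d and (?C{any text}) yields any text, matching the statement that the doubling is removed before the string is passed to the callout function.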