diff --git a/ChangeLog b/ChangeLog
index c4d472b..18dfe76 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -37,6 +37,10 @@ src/pcre2_chartables.c.dist are updated.
 ranges such as a-z in EBCDIC environments. The original code probably never
 worked, though there were no bug reports.
 
+10. Implement PCRE2_COPY_MATCHED_SUBJECT for pcre2_match() (including JIT via
+pcre2_match()) and pcre2_dfa_match(), but *not* the pcre2_jit_match() fast
+path.
+
 
 Version 10.32 10-September-2018
 -------------------------------
diff --git a/doc/html/pcre2_dfa_match.html b/doc/html/pcre2_dfa_match.html
index 8702cca..ad9a28f 100644
--- a/doc/html/pcre2_dfa_match.html
+++ b/doc/html/pcre2_dfa_match.html
@@ -51,6 +51,8 @@ depth limits. The length and startoffset values are code units,
 not characters. The options are:
   PCRE2_ANCHORED          Match only at the first position
+  PCRE2_COPY_MATCHED_SUBJECT
+                          On success, make a private subject copy  
   PCRE2_ENDANCHORED       Pattern can match only at end of subject
   PCRE2_NOTBOL            Subject is not the beginning of a line
   PCRE2_NOTEOL            Subject is not the end of a line
diff --git a/doc/html/pcre2_match.html b/doc/html/pcre2_match.html
index ced70bb..82c9491 100644
--- a/doc/html/pcre2_match.html
+++ b/doc/html/pcre2_match.html
@@ -55,11 +55,13 @@ A match context is needed only if you want to:
   Change the backtracking depth limit
   Set custom memory management specifically for the match
 
-The length and startoffset values are code
-units, not characters. The length may be given as PCRE2_ZERO_TERMINATE for a
-subject that is terminated by a binary zero code unit. The options are:
+The length and startoffset values are code units, not characters.
+The length may be given as PCRE2_ZERO_TERMINATED for a subject that is
+terminated by a binary zero code unit. The options are:
   PCRE2_ANCHORED          Match only at the first position
+  PCRE2_COPY_MATCHED_SUBJECT
+                          On success, make a private subject copy   
   PCRE2_ENDANCHORED       Pattern can match only at end of subject
   PCRE2_NOTBOL            Subject string is not the beginning of a line
   PCRE2_NOTEOL            Subject string is not the end of a line
diff --git a/doc/html/pcre2_match_data_free.html b/doc/html/pcre2_match_data_free.html
index 68a4461..746c3c1 100644
--- a/doc/html/pcre2_match_data_free.html
+++ b/doc/html/pcre2_match_data_free.html
@@ -31,6 +31,11 @@ using the memory freeing function from the general context or compiled pattern
 with which it was created, or free() if that was not set.
 

+If the PCRE2_COPY_MATCHED_SUBJECT was used for a successful match using this
+match data block, the copy of the subject that was remembered with the block is
+also freed.
+

+

 There is a complete description of the PCRE2 native API in the
 pcre2api
 page and a description of the POSIX API in the
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index f843e97..5e7adbf 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1305,10 +1305,13 @@ NULL.
 NOTE: When one of the matching functions is called, pointers to the compiled
 pattern and the subject string are set in the match data block so that they can
 be referenced by the substring extraction functions. After running a match, you
-must not free a compiled pattern (or a subject string) until after all
+must not free a compiled pattern or a subject string until after all
 operations on the
 match data block
-have taken place.
+have taken place, unless, in the case of the subject string, you have used the
+PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
+"Option bits for pcre2_match()"
+below.

 The options argument for pcre2_compile() contains various bit
@@ -2419,7 +2422,10 @@ When one of the matching functions is called, pointers to the compiled pattern
 and the subject string are set in the match data block so that they can be
 referenced by the extraction functions. After running a match, you must not
 free a compiled pattern or a subject string until after all operations on the
-match data block (for that match) have taken place.
+match data block (for that match) have taken place, unless, in the case of the
+subject string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
+described in the section entitled "Option bits for pcre2_match()"
+below.

 When a match data block itself is no longer needed, it should be freed by
@@ -2531,10 +2537,10 @@ Option bits for pcre2_match()

 The unused bits of the options argument for pcre2_match() must be
-zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
-Their action is described below.
+zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
+PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.

 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
@@ -2549,6 +2555,22 @@ matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out
 to be anchored by virtue of its contents, it cannot be made unachored at
 matching time. Note that setting the option at match time disables JIT
 matching.
+

+  PCRE2_COPY_MATCHED_SUBJECT
+
+By default, a pointer to the subject is remembered in the match data block so
+that, after a successful match, it can be referenced by the substring
+extraction functions. This means that the subject's memory must not be freed
+until all such operations are complete. For some applications where the
+lifetime of the subject string is not guaranteed, it may be necessary to make a
+copy of the subject string, but it is wasteful to do this unless the match is
+successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the
+subject is copied and the new pointer is remembered in the match data block
+instead of the original subject pointer. The memory allocator that was used for
+the match block itself is used. The copy is automatically freed when
+pcre2_match_data_free() is called to free the match data block. It is also
+automatically freed if the match data block is re-used for another match
+operation.
   PCRE2_ENDANCHORED
 
@@ -2954,7 +2976,8 @@ The backtracking match limit was reached.
 If a pattern contains many nested backtracking points, heap memory is used to
 remember them. This error is given when the memory allocation function (default
 or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
-if the amount of memory needed exceeds the heap limit.
+if the amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
+also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
   PCRE2_ERROR_NULL
 
@@ -3584,11 +3607,12 @@ Option bits for pcre_dfa_match()

 The unused bits of the options argument for pcre2_dfa_match() must
-be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST,
-and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as
-for pcre2_match(), so their description is not repeated here.
+be zero. The only bits that may be set are PCRE2_ANCHORED,
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
+PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
+PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last
+four of these are exactly the same as for pcre2_match(), so their
+description is not repeated here.

   PCRE2_PARTIAL_HARD
   PCRE2_PARTIAL_SOFT
@@ -3732,7 +3756,7 @@ Cambridge, England.
 


REVISION

-Last updated: 21 September 2018
+Last updated: 16 October 2018
Copyright © 1997-2018 University of Cambridge.
diff --git a/doc/html/pcre2jit.html b/doc/html/pcre2jit.html
index fa007e0..78fda6e 100644
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@@ -147,9 +147,10 @@ pattern.
UNSUPPORTED OPTIONS AND PATTERN ITEMS

 The pcre2_match() options that are supported for JIT matching are
-PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
-PCRE2_ANCHORED option is not supported at match time.
+PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
+PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not
+supported at match time.

 If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
@@ -402,10 +403,13 @@ processed by pcre2_jit_compile()).

 The fast path function is called pcre2_jit_match(), and it takes exactly
-the same arguments as pcre2_match(). The return values are also the same,
-plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
-requested that was not compiled. Unsupported option bits (for example,
-PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option.
+the same arguments as pcre2_match(). However, the subject string must be
+specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported
+option bits (for example, PCRE2_ANCHORED, PCRE2_ENDANCHORED and
+PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the PCRE2_NO_JIT option. The
+return values are also the same as for pcre2_match(), plus
+PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
+that was not compiled.

 When you call pcre2_match(), as well as testing for invalid options, a
@@ -434,7 +438,7 @@ Cambridge, England.


REVISION

-Last updated: 28 June 2018
+Last updated: 16 October 2018
Copyright © 1997-2018 University of Cambridge.
diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 6ad6922..ca28e47 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -1302,9 +1302,11 @@ COMPILING A PATTERN NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block so that they can be referenced by the substring extraction functions. - After running a match, you must not free a compiled pattern (or a sub- - ject string) until after all operations on the match data block have - taken place. + After running a match, you must not free a compiled pattern or a sub- + ject string until after all operations on the match data block have + taken place, unless, in the case of the subject string, you have used + the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the sec- + tion entitled "Option bits for pcre2_match()" below. The options argument for pcre2_compile() contains various bit settings that affect the compilation. It should be zero if no options are @@ -2388,7 +2390,9 @@ THE MATCH DATA BLOCK they can be referenced by the extraction functions. After running a match, you must not free a compiled pattern or a subject string until after all operations on the match data block (for that match) have - taken place. + taken place, unless, in the case of the subject string, you have used + the PCRE2_COPY_MATCHED_SUBJECT option, which is described in the sec- + tion entitled "Option bits for pcre2_match()" below. When a match data block itself is no longer needed, it should be freed by calling pcre2_match_data_free(). If this function is called with a @@ -2488,25 +2492,43 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION Option bits for pcre2_match() The unused bits of the options argument for pcre2_match() must be zero. 
- The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED, - PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, - PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR- - TIAL_SOFT. Their action is described below. + The only bits that may be set are PCRE2_ANCHORED, + PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, + PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, + PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their + action is described below. - Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup- - ported by the just-in-time (JIT) compiler. If it is set, JIT matching - is disabled and the interpretive code in pcre2_match() is run. Apart - from PCRE2_NO_JIT (obviously), the remaining options are supported for + Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup- + ported by the just-in-time (JIT) compiler. If it is set, JIT matching + is disabled and the interpretive code in pcre2_match() is run. Apart + from PCRE2_NO_JIT (obviously), the remaining options are supported for JIT matching. PCRE2_ANCHORED The PCRE2_ANCHORED option limits pcre2_match() to matching at the first - matching position. If a pattern was compiled with PCRE2_ANCHORED, or - turned out to be anchored by virtue of its contents, it cannot be made - unachored at matching time. Note that setting the option at match time + matching position. If a pattern was compiled with PCRE2_ANCHORED, or + turned out to be anchored by virtue of its contents, it cannot be made + unachored at matching time. Note that setting the option at match time disables JIT matching. + PCRE2_COPY_MATCHED_SUBJECT + + By default, a pointer to the subject is remembered in the match data + block so that, after a successful match, it can be referenced by the + substring extraction functions. This means that the subject's memory + must not be freed until all such operations are complete. 
For some + applications where the lifetime of the subject string is not guaran- + teed, it may be necessary to make a copy of the subject string, but it + is wasteful to do this unless the match is successful. After a success- + ful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied + and the new pointer is remembered in the match data block instead of + the original subject pointer. The memory allocator that was used for + the match block itself is used. The copy is automatically freed when + pcre2_match_data_free() is called to free the match data block. It is + also automatically freed if the match data block is re-used for another + match operation. + PCRE2_ENDANCHORED If the PCRE2_ENDANCHORED option is set, any string that pcre2_match() @@ -2881,7 +2903,8 @@ ERROR RETURNS FROM pcre2_match() used to remember them. This error is given when the memory allocation function (default or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds - the heap limit. + the heap limit. PCRE2_ERROR_NOMEMORY is also returned if + PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails. PCRE2_ERROR_NULL @@ -2889,12 +2912,12 @@ ERROR RETURNS FROM pcre2_match() PCRE2_ERROR_RECURSELOOP - This error is returned when pcre2_match() detects a recursion loop - within the pattern. Specifically, it means that either the whole pat- + This error is returned when pcre2_match() detects a recursion loop + within the pattern. Specifically, it means that either the whole pat- tern or a subpattern has been called recursively for the second time at - the same position in the subject string. Some simple patterns that - might do this are detected and faulted at compile time, but more com- - plicated cases, in particular mutual recursions between two different + the same position in the subject string. 
Some simple patterns that + might do this are detected and faulted at compile time, but more com- + plicated cases, in particular mutual recursions between two different subpatterns, cannot be detected until matching is attempted. @@ -2903,20 +2926,20 @@ OBTAINING A TEXTUAL ERROR MESSAGE int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, PCRE2_SIZE bufflen); - A text message for an error code from any PCRE2 function (compile, - match, or auxiliary) can be obtained by calling pcre2_get_error_mes- - sage(). The code is passed as the first argument, with the remaining - two arguments specifying a code unit buffer and its length in code - units, into which the text message is placed. The message is returned - in code units of the appropriate width for the library that is being + A text message for an error code from any PCRE2 function (compile, + match, or auxiliary) can be obtained by calling pcre2_get_error_mes- + sage(). The code is passed as the first argument, with the remaining + two arguments specifying a code unit buffer and its length in code + units, into which the text message is placed. The message is returned + in code units of the appropriate width for the library that is being used. - The returned message is terminated with a trailing zero, and the func- - tion returns the number of code units used, excluding the trailing + The returned message is terminated with a trailing zero, and the func- + tion returns the number of code units used, excluding the trailing zero. If the error number is unknown, the negative error code - PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes- - sage is truncated (but still with a trailing zero), and the negative - error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are + PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes- + sage is truncated (but still with a trailing zero), and the negative + error code PCRE2_ERROR_NOMEMORY is returned. 
None of the messages are very long; a buffer size of 120 code units is ample. @@ -2935,39 +2958,39 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER void pcre2_substring_free(PCRE2_UCHAR *buffer); - Captured substrings can be accessed directly by using the ovector as + Captured substrings can be accessed directly by using the ovector as described above. For convenience, auxiliary functions are provided for - extracting captured substrings as new, separate, zero-terminated + extracting captured substrings as new, separate, zero-terminated strings. A substring that contains a binary zero is correctly extracted - and has a further zero added on the end, but the result is not, of + and has a further zero added on the end, but the result is not, of course, a C string. The functions in this section identify substrings by number. The number zero refers to the entire matched substring, with higher numbers refer- - ring to substrings captured by parenthesized groups. After a partial - match, only substring zero is available. An attempt to extract any - other substring gives the error PCRE2_ERROR_PARTIAL. The next section + ring to substrings captured by parenthesized groups. After a partial + match, only substring zero is available. An attempt to extract any + other substring gives the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for extracting captured substrings by name. - If a pattern uses the \K escape sequence within a positive assertion, + If a pattern uses the \K escape sequence within a positive assertion, the reported start of a successful match can be greater than the end of - the match. For example, if the pattern (?=ab\K) is matched against - "ab", the start and end offset values for the match are 2 and 0. In - this situation, calling these functions with a zero substring number + the match. For example, if the pattern (?=ab\K) is matched against + "ab", the start and end offset values for the match are 2 and 0. 
In + this situation, calling these functions with a zero substring number extracts a zero-length empty string. - You can find the length in code units of a captured substring without - extracting it by calling pcre2_substring_length_bynumber(). The first - argument is a pointer to the match data block, the second is the group - number, and the third is a pointer to a variable into which the length - is placed. If you just want to know whether or not the substring has + You can find the length in code units of a captured substring without + extracting it by calling pcre2_substring_length_bynumber(). The first + argument is a pointer to the match data block, the second is the group + number, and the third is a pointer to a variable into which the length + is placed. If you just want to know whether or not the substring has been captured, you can pass the third argument as NULL. - The pcre2_substring_copy_bynumber() function copies a captured sub- - string into a supplied buffer, whereas pcre2_substring_get_bynumber() - copies it into new memory, obtained using the same memory allocation - function that was used for the match data block. The first two argu- - ments of these functions are a pointer to the match data block and a + The pcre2_substring_copy_bynumber() function copies a captured sub- + string into a supplied buffer, whereas pcre2_substring_get_bynumber() + copies it into new memory, obtained using the same memory allocation + function that was used for the match data block. The first two argu- + ments of these functions are a pointer to the match data block and a capturing group number. The final arguments of pcre2_substring_copy_bynumber() are a pointer to @@ -2976,25 +2999,25 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER for the extracted substring, excluding the terminating zero. 
For pcre2_substring_get_bynumber() the third and fourth arguments point - to variables that are updated with a pointer to the new memory and the - number of code units that comprise the substring, again excluding the - terminating zero. When the substring is no longer needed, the memory + to variables that are updated with a pointer to the new memory and the + number of code units that comprise the substring, again excluding the + terminating zero. When the substring is no longer needed, the memory should be freed by calling pcre2_substring_free(). - The return value from all these functions is zero for success, or a - negative error code. If the pattern match failed, the match failure - code is returned. If a substring number greater than zero is used - after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible + The return value from all these functions is zero for success, or a + negative error code. If the pattern match failed, the match failure + code is returned. If a substring number greater than zero is used + after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible error codes are: PCRE2_ERROR_NOMEMORY - The buffer was too small for pcre2_substring_copy_bynumber(), or the + The buffer was too small for pcre2_substring_copy_bynumber(), or the attempt to get memory failed for pcre2_substring_get_bynumber(). PCRE2_ERROR_NOSUBSTRING - There is no substring with that number in the pattern, that is, the + There is no substring with that number in the pattern, that is, the number is greater than the number of capturing parentheses. PCRE2_ERROR_UNAVAILABLE @@ -3005,8 +3028,8 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER PCRE2_ERROR_UNSET - The substring did not participate in the match. For example, if the - pattern is (abc)|(def) and the subject is "def", and the ovector con- + The substring did not participate in the match. 
For example, if the + pattern is (abc)|(def) and the subject is "def", and the ovector con- tains at least two capturing slots, substring number 1 is unset. @@ -3017,32 +3040,32 @@ EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS void pcre2_substring_list_free(PCRE2_SPTR *list); - The pcre2_substring_list_get() function extracts all available sub- - strings and builds a list of pointers to them. It also (optionally) - builds a second list that contains their lengths (in code units), + The pcre2_substring_list_get() function extracts all available sub- + strings and builds a list of pointers to them. It also (optionally) + builds a second list that contains their lengths (in code units), excluding a terminating zero that is added to each of them. All this is done in a single block of memory that is obtained using the same memory allocation function that was used to get the match data block. - This function must be called only after a successful match. If called + This function must be called only after a successful match. If called after a partial match, the error code PCRE2_ERROR_PARTIAL is returned. - The address of the memory block is returned via listptr, which is also + The address of the memory block is returned via listptr, which is also the start of the list of string pointers. The end of the list is marked - by a NULL pointer. The address of the list of lengths is returned via - lengthsptr. If your strings do not contain binary zeros and you do not + by a NULL pointer. The address of the list of lengths is returned via + lengthsptr. If your strings do not contain binary zeros and you do not therefore need the lengths, you may supply NULL as the lengthsptr argu- - ment to disable the creation of a list of lengths. The yield of the - function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- - ory block could not be obtained. When the list is no longer needed, it + ment to disable the creation of a list of lengths. 
The yield of the + function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- + ory block could not be obtained. When the list is no longer needed, it should be freed by calling pcre2_substring_list_free(). If this function encounters a substring that is unset, which can happen - when capturing subpattern number n+1 matches some part of the subject, - but subpattern n has not been used at all, it returns an empty string. - This can be distinguished from a genuine zero-length substring by + when capturing subpattern number n+1 matches some part of the subject, + but subpattern n has not been used at all, it returns an empty string. + This can be distinguished from a genuine zero-length substring by inspecting the appropriate offset in the ovector, which contain - PCRE2_UNSET for unset substrings, or by calling pcre2_sub- + PCRE2_UNSET for unset substrings, or by calling pcre2_sub- string_length_bynumber(). @@ -3062,39 +3085,39 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME void pcre2_substring_free(PCRE2_UCHAR *buffer); - To extract a substring by name, you first have to find associated num- + To extract a substring by name, you first have to find associated num- ber. For example, for this pattern: (a+)b(?<xxx>\d+)... the number of the subpattern called "xxx" is 2. If the name is known to - be unique (PCRE2_DUPNAMES was not set), you can find the number from + be unique (PCRE2_DUPNAMES was not set), you can find the number from the name by calling pcre2_substring_number_from_name(). The first argu- - ment is the compiled pattern, and the second is the name. The yield of + ment is the compiled pattern, and the second is the name. The yield of the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there - is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if - there is more than one subpattern of that name. 
Given the number, you - can extract the substring directly from the ovector, or use one of the + is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if + there is more than one subpattern of that name. Given the number, you + can extract the substring directly from the ovector, or use one of the "bynumber" functions described above. - For convenience, there are also "byname" functions that correspond to - the "bynumber" functions, the only difference being that the second - argument is a name instead of a number. If PCRE2_DUPNAMES is set and + For convenience, there are also "byname" functions that correspond to + the "bynumber" functions, the only difference being that the second + argument is a name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate names, these functions scan all the groups with the given name, and return the first named string that is set. - If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is - returned. If all groups with the name have numbers that are greater - than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is - returned. If there is at least one group with a slot in the ovector, + If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is + returned. If all groups with the name have numbers that are greater + than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is + returned. If there is at least one group with a slot in the ovector, but no group is found to be set, PCRE2_ERROR_UNSET is returned. Warning: If the pattern uses the (?| feature to set up multiple subpat- - terns with the same number, as described in the section on duplicate - subpattern numbers in the pcre2pattern page, you cannot use names to - distinguish the different subpatterns, because names are not included - in the compiled code. The matching process uses only numbers. 
For this - reason, the use of different names for subpatterns of the same number + terns with the same number, as described in the section on duplicate + subpattern numbers in the pcre2pattern page, you cannot use names to + distinguish the different subpatterns, because names are not included + in the compiled code. The matching process uses only numbers. For this + reason, the use of different names for subpatterns of the same number causes an error at compile time. @@ -3107,92 +3130,92 @@ CREATING A NEW STRING WITH SUBSTITUTIONS PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, PCRE2_SIZE *outlengthptr); - This function calls pcre2_match() and then makes a copy of the subject - string in outputbuffer, replacing one or more parts that were matched + This function calls pcre2_match() and then makes a copy of the subject + string in outputbuffer, replacing one or more parts that were matched with the replacement string, whose length is supplied in rlength. This - can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. - The default is to perform just one replacement, but there is an option - that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below + can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. + The default is to perform just one replacement, but there is an option + that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below for details). - Matches in which a \K item in a lookahead in the pattern causes the - match to end before it starts are not supported, and give rise to an + Matches in which a \K item in a lookahead in the pattern causes the + match to end before it starts are not supported, and give rise to an error return. For global replacements, matches in which \K in a lookbe- - hind causes the match to start earlier than the point that was reached + hind causes the match to start earlier than the point that was reached in the previous iteration are also not supported. 
- The first seven arguments of pcre2_substitute() are the same as for + The first seven arguments of pcre2_substitute() are the same as for pcre2_match(), except that the partial matching options are not permit- - ted, and match_data may be passed as NULL, in which case a match data - block is obtained and freed within this function, using memory manage- - ment functions from the match context, if provided, or else those that + ted, and match_data may be passed as NULL, in which case a match data + block is obtained and freed within this function, using memory manage- + ment functions from the match context, if provided, or else those that were used to allocate memory for the compiled code. - If an external match_data block is provided, its contents afterwards - are those set by the final call to pcre2_match(). For global changes, - this will have ended in a matching error. The contents of the ovector + If an external match_data block is provided, its contents afterwards + are those set by the final call to pcre2_match(). For global changes, + this will have ended in a matching error. The contents of the ovector within the match data block may or may not have been changed. - The outlengthptr argument must point to a variable that contains the - length, in code units, of the output buffer. If the function is suc- - cessful, the value is updated to contain the length of the new string, + The outlengthptr argument must point to a variable that contains the + length, in code units, of the output buffer. If the function is suc- + cessful, the value is updated to contain the length of the new string, excluding the trailing zero that is automatically added. - If the function is not successful, the value set via outlengthptr - depends on the type of error. For syntax errors in the replacement - string, the value is the offset in the replacement string where the - error was detected. For other errors, the value is PCRE2_UNSET by - default. 
This includes the case of the output buffer being too small, - unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which - case the value is the minimum length needed, including space for the - trailing zero. Note that in order to compute the required length, - pcre2_substitute() has to simulate all the matching and copying, + If the function is not successful, the value set via outlengthptr + depends on the type of error. For syntax errors in the replacement + string, the value is the offset in the replacement string where the + error was detected. For other errors, the value is PCRE2_UNSET by + default. This includes the case of the output buffer being too small, + unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which + case the value is the minimum length needed, including space for the + trailing zero. Note that in order to compute the required length, + pcre2_substitute() has to simulate all the matching and copying, instead of giving an error return as soon as the buffer overflows. Note also that the length is in code units, not bytes. - In the replacement string, which is interpreted as a UTF string in UTF - mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK + In the replacement string, which is interpreted as a UTF string in UTF + mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a dollar character is an escape character that can spec- - ify the insertion of characters from capturing groups or names from - (*MARK) or other control verbs in the pattern. The following forms are + ify the insertion of characters from capturing groups or names from + (*MARK) or other control verbs in the pattern. The following forms are always recognized: $$ insert a dollar character $<n> or ${<n>} insert the contents of group <n> $*MARK or ${*MARK} insert a control verb name - Either a group number or a group name can be given for <n>. 
Curly - brackets are required only if the following character would be inter- + Either a group number or a group name can be given for <n>. Curly + brackets are required only if the following character would be inter- preted as part of the number or name. The number may be zero to include - the entire matched string. For example, if the pattern a(b)c is - matched with "=abc=" and the replacement string "+$1$0$1+", the result + the entire matched string. For example, if the pattern a(b)c is + matched with "=abc=" and the replacement string "+$1$0$1+", the result is "=+babcb+=". $*MARK inserts the name from the last encountered (*ACCEPT), (*COMMIT), - (*MARK), (*PRUNE), or (*THEN) on the matching path that has a name. - (*MARK) must always include a name, but the other verbs need not. For + (*MARK), (*PRUNE), or (*THEN) on the matching path that has a name. + (*MARK) must always include a name, but the other verbs need not. For example, in the case of (*MARK:A)(*PRUNE) the name inserted is "A", but - for (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be - used to perform simple simultaneous substitutions, as this pcre2test + for (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be + used to perform simple simultaneous substitutions, as this pcre2test example shows: /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} apple lemon 2: pear orange - As well as the usual options for pcre2_match(), a number of additional + As well as the usual options for pcre2_match(), a number of additional options can be set in the options argument of pcre2_substitute(). PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject - string, replacing every matching substring. If this option is not set, - only the first matching substring is replaced. The search for matches - takes place in the original subject string (that is, previous replace- - ments do not affect it). 
Iteration is implemented by advancing the - startoffset value for each search, which is always passed the entire + string, replacing every matching substring. If this option is not set, + only the first matching substring is replaced. The search for matches + takes place in the original subject string (that is, previous replace- + ments do not affect it). Iteration is implemented by advancing the + startoffset value for each search, which is always passed the entire subject string. If an offset limit is set in the match context, search- ing stops when that limit is reached. - You can restrict the effect of a global substitution to a portion of + You can restrict the effect of a global substitution to a portion of the subject string by setting either or both of startoffset and an off- set limit. Here is a pcre2test example: @@ -3200,87 +3223,87 @@ CREATING A NEW STRING WITH SUBSTITUTIONS ABC ABC ABC ABC\=offset=3,offset_limit=12 2: ABC A!C A!C ABC - When continuing with global substitutions after matching a substring + When continuing with global substitutions after matching a substring with zero length, an attempt to find a non-empty match at the same off- set is performed. If this is not successful, the offset is advanced by one character except when CRLF is a valid newline sequence and the next - two characters are CR, LF. In this case, the offset is advanced by two + two characters are CR, LF. In this case, the offset is advanced by two characters. - PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output + PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- - ORY immediately. If this option is set, however, pcre2_substitute() + ORY immediately. If this option is set, however, pcre2_substitute() continues to go through the motions of matching and substituting (with- - out, of course, writing anything) in order to compute the size of buf- - fer that is needed. 
This value is passed back via the outlengthptr - variable, with the result of the function still being + out, of course, writing anything) in order to compute the size of buf- + fer that is needed. This value is passed back via the outlengthptr + variable, with the result of the function still being PCRE2_ERROR_NOMEMORY. - Passing a buffer size of zero is a permitted way of finding out how - much memory is needed for a given substitution. However, this does mean + Passing a buffer size of zero is a permitted way of finding out how + much memory is needed for a given substitution. However, this does mean that the entire operation is carried out twice. Depending on the appli- - cation, it may be more efficient to allocate a large buffer and free - the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- + cation, it may be more efficient to allocate a large buffer and free + the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- FLOW_LENGTH. - PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups - that do not appear in the pattern to be treated as unset groups. This - option should be used with care, because it means that a typo in a - group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING + PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups + that do not appear in the pattern to be treated as unset groups. This + option should be used with care, because it means that a typo in a + group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING error. - PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including + PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be - treated as empty strings when inserted as described above. If this - option is not set, an attempt to insert an unset group causes the - PCRE2_ERROR_UNSET error. 
This option does not influence the extended + treated as empty strings when inserted as described above. If this + option is not set, an attempt to insert an unset group causes the + PCRE2_ERROR_UNSET error. This option does not influence the extended substitution syntax described below. - PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the - replacement string. Without this option, only the dollar character is - special, and only the group insertion forms listed above are valid. + PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the + replacement string. Without this option, only the dollar character is + special, and only the group insertion forms listed above are valid. When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: - Firstly, backslash in a replacement string is interpreted as an escape + Firstly, backslash in a replacement string is interpreted as an escape character. The usual forms such as \n or \x{ddd} can be used to specify - particular character codes, and backslash followed by any non-alphanu- - meric character quotes that character. Extended quoting can be coded + particular character codes, and backslash followed by any non-alphanu- + meric character quotes that character. Extended quoting can be coded using \Q...\E, exactly as in pattern strings. - There are also four escape sequences for forcing the case of inserted - letters. The insertion mechanism has three states: no case forcing, + There are also four escape sequences for forcing the case of inserted + letters. The insertion mechanism has three states: no case forcing, force upper case, and force lower case. The escape sequences change the current state: \U and \L change to upper or lower case forcing, respec- - tively, and \E (when not terminating a \Q quoted sequence) reverts to - no case forcing. 
The sequences \u and \l force the next character (if - it is a letter) to upper or lower case, respectively, and then the + tively, and \E (when not terminating a \Q quoted sequence) reverts to + no case forcing. The sequences \u and \l force the next character (if + it is a letter) to upper or lower case, respectively, and then the state automatically reverts to no case forcing. Case forcing applies to all inserted characters, including those from captured groups and let- ters within \Q...\E quoted sequences. Note that case forcing sequences such as \U...\E do not nest. For exam- - ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final + ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no effect. - The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more - flexibility to group substitution. The syntax is similar to that used + The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more + flexibility to group substitution. The syntax is similar to that used by Bash: ${<n>:-<string>} ${<n>:+<string1>:<string2>} - As before, <n> may be a group number or a name. The first form speci- - fies a default value. If group <n> is set, its value is inserted; if - not, <string> is expanded and the result inserted. The second form - specifies strings that are expanded and inserted when group <n> is set - or unset, respectively. The first form is just a convenient shorthand + As before, <n> may be a group number or a name. The first form speci- + fies a default value. If group <n> is set, its value is inserted; if + not, <string> is expanded and the result inserted. The second form + specifies strings that are expanded and inserted when group <n> is set + or unset, respectively. The first form is just a convenient shorthand for ${<n>:+${<n>}:<string>} - Backslash can be used to escape colons and closing curly brackets in - the replacement strings. 
A change of the case forcing state within a - replacement string remains in force afterwards, as shown in this + Backslash can be used to escape colons and closing curly brackets in + the replacement strings. A change of the case forcing state within a + replacement string remains in force afterwards, as shown in this pcre2test example: /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo @@ -3289,42 +3312,42 @@ CREATING A NEW STRING WITH SUBSTITUTIONS somebody 1: HELLO - The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended - substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause + The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended + substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown groups in the extended syntax forms to be treated as unset. - If successful, pcre2_substitute() returns the number of replacements + If successful, pcre2_substitute() returns the number of replacements that were made. This may be zero if no matches were found, and is never greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code is returned. Except for - PCRE2_ERROR_NOMATCH (which is never returned), errors from + PCRE2_ERROR_NOMATCH (which is never returned), errors from pcre2_match() are passed straight back. PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- - ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) + ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple (non-extended) syntax is used and PCRE2_SUBSTI- TUTE_UNSET_EMPTY is not set. - PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big + PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. 
If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size - of buffer that is needed is returned via outlengthptr. Note that this + of buffer that is needed is returned via outlengthptr. Note that this does not happen by default. - PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in + PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the replacement string, with more particular errors being - PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP- - MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI- + PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP- + MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI- TUTION (syntax error in extended group substitution), and - PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started - or the match started earlier than the current position in the subject, + PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started + or the match started earlier than the current position in the subject, which can happen if \K is used in an assertion). As for all PCRE2 errors, a text message that describes the error can be - obtained by calling the pcre2_get_error_message() function (see + obtained by calling the pcre2_get_error_message() function (see "Obtaining a textual error message" above). Substitution callouts @@ -3333,31 +3356,31 @@ CREATING A NEW STRING WITH SUBSTITUTIONS void (*callout_function)(pcre2_substitute_callout_block *, void *), void *callout_data); - The pcre2_set_substitution_callout() function can be used to specify a - callout function for pcre2_substitute(). This information is passed in - a match context. The callout function is called after each substitu- - tion. It is not called for simulated substitutions that happen as a - result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. 
A callout func- + The pcre2_set_substitution_callout() function can be used to specify a + callout function for pcre2_substitute(). This information is passed in + a match context. The callout function is called after each substitu- + tion. It is not called for simulated substitutions that happen as a + result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. A callout func- tion should not return any value. The first argument of the callout function is a pointer to a substitute - callout block structure, which contains the following fields, not nec- + callout block structure, which contains the following fields, not nec- essarily in this order: uint32_t version; PCRE2_SIZE input_offsets[2]; PCRE2_SIZE output_offsets[2]; - The version field contains the version number of the block format. The - current version is 0. The version number will increase in future if - more fields are added, but the intention is never to remove any of the + The version field contains the version number of the block format. The + current version is 0. The version number will increase in future if + more fields are added, but the intention is never to remove any of the existing fields. - The input_offsets vector contains the code unit offsets in the input + The input_offsets vector contains the code unit offsets in the input string of the matched substring, and the output_offsets vector contains the offsets of the replacement in the output string. - The second argument of the callout function is the value passed as + The second argument of the callout function is the value passed as callout_data when the function was registered. @@ -3366,56 +3389,56 @@ DUPLICATE SUBPATTERN NAMES int pcre2_substring_nametable_scan(const pcre2_code *code, PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); - When a pattern is compiled with the PCRE2_DUPNAMES option, names for - subpatterns are not required to be unique. 
Duplicate names are always - allowed for subpatterns with the same number, created by using the (?| - feature. Indeed, if such subpatterns are named, they are required to + When a pattern is compiled with the PCRE2_DUPNAMES option, names for + subpatterns are not required to be unique. Duplicate names are always + allowed for subpatterns with the same number, created by using the (?| + feature. Indeed, if such subpatterns are named, they are required to use the same names. Normally, patterns with duplicate names are such that in any one match, - only one of the named subpatterns participates. An example is shown in + only one of the named subpatterns participates. An example is shown in the pcre2pattern documentation. - When duplicates are present, pcre2_substring_copy_byname() and - pcre2_substring_get_byname() return the first substring corresponding - to the given name that is set. Only if none are set is - PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() + When duplicates are present, pcre2_substring_copy_byname() and + pcre2_substring_get_byname() return the first substring corresponding + to the given name that is set. Only if none are set is + PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names. - If you want to get full details of all captured substrings for a given - name, you must use the pcre2_substring_nametable_scan() function. The - first argument is the compiled pattern, and the second is the name. If - the third and fourth arguments are NULL, the function returns a group + If you want to get full details of all captured substrings for a given + name, you must use the pcre2_substring_nametable_scan() function. The + first argument is the compiled pattern, and the second is the name. If + the third and fourth arguments are NULL, the function returns a group number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. 
When the third and fourth arguments are not NULL, they must be pointers - to variables that are updated by the function. After it has run, they + to variables that are updated by the function. After it has run, they point to the first and last entries in the name-to-number table for the - given name, and the function returns the length of each entry in code - units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are + given name, and the function returns the length of each entry in code + units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name. The format of the name table is described above in the section entitled - Information about a pattern. Given all the relevant entries for the - name, you can extract each of their numbers, and hence the captured + Information about a pattern. Given all the relevant entries for the + name, you can extract each of their numbers, and hence the captured data. FINDING ALL POSSIBLE MATCHES AT ONE POSITION - The traditional matching function uses a similar algorithm to Perl, - which stops when it finds the first match at a given point in the sub- + The traditional matching function uses a similar algorithm to Perl, + which stops when it finds the first match at a given point in the sub- ject. If you want to find all possible matches, or the longest possible - match at a given position, consider using the alternative matching - function (see below) instead. If you cannot use the alternative func- + match at a given position, consider using the alternative matching + function (see below) instead. If you cannot use the alternative func- tion, you can kludge it up by making use of the callout facility, which is described in the pcre2callout documentation. What you have to do is to insert a callout right at the end of the pat- - tern. When your callout function is called, extract and save the cur- - rent matched substring. 
Then return 1, which forces pcre2_match() to - backtrack and try other alternatives. Ultimately, when it runs out of + tern. When your callout function is called, extract and save the cur- + rent matched substring. Then return 1, which forces pcre2_match() to + backtrack and try other alternatives. Ultimately, when it runs out of matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. @@ -3427,26 +3450,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount); - The function pcre2_dfa_match() is called to match a subject string - against a compiled pattern, using a matching algorithm that scans the + The function pcre2_dfa_match() is called to match a subject string + against a compiled pattern, using a matching algorithm that scans the subject string just once (not counting lookaround assertions), and does - not backtrack. This has different characteristics to the normal algo- - rithm, and is not compatible with Perl. Some of the features of PCRE2 - patterns are not supported. Nevertheless, there are times when this - kind of matching can be useful. For a discussion of the two matching + not backtrack. This has different characteristics to the normal algo- + rithm, and is not compatible with Perl. Some of the features of PCRE2 + patterns are not supported. Nevertheless, there are times when this + kind of matching can be useful. For a discussion of the two matching algorithms, and a list of features that pcre2_dfa_match() does not sup- port, see the pcre2matching documentation. - The arguments for the pcre2_dfa_match() function are the same as for + The arguments for the pcre2_dfa_match() function are the same as for pcre2_match(), plus two extras. The ovector within the match data block is used in a different way, and this is described below. 
The other com- - mon arguments are used in the same way as for pcre2_match(), so their + mon arguments are used in the same way as for pcre2_match(), so their description is not repeated here. - The two additional arguments provide workspace for the function. The - workspace vector should contain at least 20 elements. It is used for + The two additional arguments provide workspace for the function. The + workspace vector should contain at least 20 elements. It is used for keeping track of multiple paths through the pattern tree. More - workspace is needed for patterns and subjects where there are a lot of + workspace is needed for patterns and subjects where there are a lot of potential matches. Here is an example of a simple call to pcre2_dfa_match(): @@ -3466,13 +3489,14 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION Option bits for pcre2_dfa_match() - The unused bits of the options argument for pcre2_dfa_match() must be - zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN- - CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, - PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, - PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but - the last four of these are exactly the same as for pcre2_match(), so - their description is not repeated here. + The unused bits of the options argument for pcre2_dfa_match() must be + zero. The only bits that may be set are PCRE2_ANCHORED, + PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, + PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, + PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, + PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of + these are exactly the same as for pcre2_match(), so their description + is not repeated here. PCRE2_PARTIAL_HARD PCRE2_PARTIAL_SOFT @@ -3607,7 +3631,7 @@ AUTHOR REVISION - Last updated: 21 September 2018 + Last updated: 16 October 2018 Copyright (c) 1997-2018 University of Cambridge. 
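The case-forcing rules described earlier for PCRE2_SUBSTITUTE_EXTENDED (three states, with \U, \L and \E switching state and \u and \l acting on one character only) can be modelled as a small state machine. The following is a minimal sketch of that model in plain C, not PCRE2's own code: the names force_state and apply_case_forcing are hypothetical, only the five case-forcing escapes are recognized, and group insertion, \Q...\E quoting and all other escapes are left out. It also assumes that a \u or \l is used up by the next character even when that character is not a letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: this models the documented behaviour only. */
enum force_state { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Expand \U, \L, \E, \u and \l in the zero-terminated string `in`, writing
   the result to `out`, whose capacity `outsize` includes the trailing zero.
   Returns the output length, or -1 if the buffer is too small. Any other
   backslash sequence is copied through as literal text. */
long apply_case_forcing(const char *in, char *out, size_t outsize)
{
    enum force_state current = FORCE_NONE; /* applies to the next character */
    enum force_state after = FORCE_NONE;   /* state once that character is done */
    size_t n = 0;

    assert(in != NULL && out != NULL && outsize > 0);

    for (const char *p = in; *p != '\0'; p++) {
        if (p[0] == '\\') {
            int recognized = 1;
            switch (p[1]) {
            case 'U': current = after = FORCE_UPPER; break;
            case 'L': current = after = FORCE_LOWER; break;
            case 'E': current = after = FORCE_NONE; break;
            /* One-shot forms: force one character, then no case forcing. */
            case 'u': current = FORCE_UPPER; after = FORCE_NONE; break;
            case 'l': current = FORCE_LOWER; after = FORCE_NONE; break;
            default: recognized = 0; break;
            }
            if (recognized) {
                p++;            /* skip the escape letter as well */
                continue;
            }
        }

        unsigned char c = (unsigned char)*p;
        if (current == FORCE_UPPER)
            c = (unsigned char)toupper(c);
        else if (current == FORCE_LOWER)
            c = (unsigned char)tolower(c);
        current = after;        /* a one-shot state ends here (assumption) */

        if (n + 1 >= outsize)
            return -1;          /* no room for this character plus the zero */
        out[n++] = (char)c;
    }

    out[n] = '\0';
    return (long)n;
}
```

Because \U, \L and \E simply overwrite a single state rather than pushing onto a stack, the sequences do not nest: calling apply_case_forcing("\\Uaa\\LBB\\Ecc\\E", buf, sizeof buf) leaves "AAbbcc" in buf, and the final \E changes nothing.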
------------------------------------------------------------------------------ @@ -4924,29 +4948,30 @@ SIMPLE USE OF JIT UNSUPPORTED OPTIONS AND PATTERN ITEMS The pcre2_match() options that are supported for JIT matching are - PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, - PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The - PCRE2_ANCHORED option is not supported at match time. + PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, + PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and + PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options + are not supported at match time. - If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the + If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the use of JIT, forcing matching by the interpreter code. - The only unsupported pattern items are \C (match a single data unit) - when running in a UTF mode, and a callout immediately before an asser- + The only unsupported pattern items are \C (match a single data unit) + when running in a UTF mode, and a callout immediately before an asser- tion condition in a conditional group. RETURN VALUES FROM JIT MATCHING When a pattern is matched using JIT matching, the return values are the - same as those given by the interpretive pcre2_match() code, with the - addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means - that the memory used for the JIT stack was insufficient. See "Control- + same as those given by the interpretive pcre2_match() code, with the + addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means + that the memory used for the JIT stack was insufficient. See "Control- ling the JIT stack" below for a discussion of JIT stack usage. 
- The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if - searching a very large pattern tree goes on for too long, as it is in - the same circumstance when JIT is not used, but the details of exactly + The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if + searching a very large pattern tree goes on for too long, as it is in + the same circumstance when JIT is not used, but the details of exactly what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned when JIT matching is used. @@ -4954,25 +4979,25 @@ RETURN VALUES FROM JIT MATCHING CONTROLLING THE JIT STACK When the compiled JIT code runs, it needs a block of memory to use as a - stack. By default, it uses 32KiB on the machine stack. However, some - large or complicated patterns need more than this. The error - PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack. - Three functions are provided for managing blocks of memory for use as - JIT stacks. There is further discussion about the use of JIT stacks in + stack. By default, it uses 32KiB on the machine stack. However, some + large or complicated patterns need more than this. The error + PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack. + Three functions are provided for managing blocks of memory for use as + JIT stacks. There is further discussion about the use of JIT stacks in the section entitled "JIT stack FAQ" below. - The pcre2_jit_stack_create() function creates a JIT stack. Its argu- - ments are a starting size, a maximum size, and a general context (for - memory allocation functions, or NULL for standard memory allocation). + The pcre2_jit_stack_create() function creates a JIT stack. Its argu- + ments are a starting size, a maximum size, and a general context (for + memory allocation functions, or NULL for standard memory allocation). It returns a pointer to an opaque structure of type pcre2_jit_stack, or - NULL if there is an error. 
The pcre2_jit_stack_free() function is used + NULL if there is an error. The pcre2_jit_stack_free() function is used to free a stack that is no longer needed. If its argument is NULL, this - function returns immediately, without doing anything. (For the techni- - cally minded: the address space is allocated by mmap or VirtualAlloc.) - A maximum stack size of 512KiB to 1MiB should be more than enough for + function returns immediately, without doing anything. (For the techni- + cally minded: the address space is allocated by mmap or VirtualAlloc.) + A maximum stack size of 512KiB to 1MiB should be more than enough for any pattern. - The pcre2_jit_stack_assign() function specifies which stack JIT code + The pcre2_jit_stack_assign() function specifies which stack JIT code should use. Its arguments are as follows: pcre2_match_context *mcontext @@ -4982,7 +5007,7 @@ CONTROLLING THE JIT STACK The first argument is a pointer to a match context. When this is subse- quently passed to a matching function, its information determines which JIT stack is used. If this argument is NULL, the function returns imme- - diately, without doing anything. There are three cases for the values + diately, without doing anything. There are three cases for the values of the other two options: (1) If callback is NULL and data is NULL, an internal 32KiB block @@ -5000,34 +5025,34 @@ CONTROLLING THE JIT STACK return value must be a valid JIT stack, the result of calling pcre2_jit_stack_create(). - A callback function is obeyed whenever JIT code is about to be run; it + A callback function is obeyed whenever JIT code is about to be run; it is not obeyed when pcre2_match() is called with options that are incom- - patible for JIT matching. A callback function can therefore be used to - determine whether a match operation was executed by JIT or by the + patible for JIT matching. 
A callback function can therefore be used to + determine whether a match operation was executed by JIT or by the interpreter. You may safely use the same JIT stack for more than one pattern (either - by assigning directly or by callback), as long as the patterns are + by assigning directly or by callback), as long as the patterns are matched sequentially in the same thread. Currently, the only way to set - up non-sequential matches in one thread is to use callouts: if a call- - out function starts another match, that match must use a different JIT + up non-sequential matches in one thread is to use callouts: if a call- + out function starts another match, that match must use a different JIT stack to the one used for currently suspended match(es). - In a multithread application, if you do not specify a JIT stack, or if - you assign or pass back NULL from a callback, that is thread-safe, - because each thread has its own machine stack. However, if you assign - or pass back a non-NULL JIT stack, this must be a different stack for + In a multithread application, if you do not specify a JIT stack, or if + you assign or pass back NULL from a callback, that is thread-safe, + because each thread has its own machine stack. However, if you assign + or pass back a non-NULL JIT stack, this must be a different stack for each thread so that the application is thread-safe. - Strictly speaking, even more is allowed. You can assign the same non- - NULL stack to a match context that is used by any number of patterns, - as long as they are not used for matching by multiple threads at the - same time. For example, you could use the same stack in all compiled - patterns, with a global mutex in the callback to wait until the stack + Strictly speaking, even more is allowed. You can assign the same non- + NULL stack to a match context that is used by any number of patterns, + as long as they are not used for matching by multiple threads at the + same time. 
For example, you could use the same stack in all compiled + patterns, with a global mutex in the callback to wait until the stack is available for use. However, this is an inefficient solution, and not recommended. - This is a suggestion for how a multithreaded program that needs to set + This is a suggestion for how a multithreaded program that needs to set up non-default JIT stacks might operate: During thread initalization @@ -5039,7 +5064,7 @@ CONTROLLING THE JIT STACK Use a one-line callback function return thread_local_var - All the functions described in this section do nothing if JIT is not + All the functions described in this section do nothing if JIT is not available. @@ -5048,20 +5073,20 @@ JIT STACK FAQ (1) Why do we need JIT stacks? PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack - where the local data of the current node is pushed before checking its + where the local data of the current node is pushed before checking its child nodes. Allocating real machine stack on some platforms is diffi- cult. For example, the stack chain needs to be updated every time if we - extend the stack on PowerPC. Although it is possible, its updating + extend the stack on PowerPC. Although it is possible, its updating time overhead decreases performance. So we do the recursion in memory. (2) Why don't we simply allocate blocks of memory with malloc()? - Modern operating systems have a nice feature: they can reserve an + Modern operating systems have a nice feature: they can reserve an address space instead of allocating memory. We can safely allocate mem- - ory pages inside this address space, so the stack could grow without + ory pages inside this address space, so the stack could grow without moving memory data (this is important because of pointers). Thus we can allocate 1MiB address space, and use only a single memory page (usually - 4KiB) if that is enough. However, we can still grow up to 1MiB anytime + 4KiB) if that is enough. 
However, we can still grow up to 1MiB anytime if needed. (3) Who "owns" a JIT stack? @@ -5069,8 +5094,8 @@ JIT STACK FAQ The owner of the stack is the user program, not the JIT studied pattern or anything else. The user program must ensure that if a stack is being used by pcre2_match(), (that is, it is assigned to a match context that - is passed to the pattern currently running), that stack must not be - used by any other threads (to avoid overwriting the same memory area). + is passed to the pattern currently running), that stack must not be + used by any other threads (to avoid overwriting the same memory area). The best practice for multithreaded programs is to allocate a stack for each thread, and return this stack through the JIT callback function. @@ -5078,36 +5103,36 @@ JIT STACK FAQ You can free a JIT stack at any time, as long as it will not be used by pcre2_match() again. When you assign the stack to a match context, only - a pointer is set. There is no reference counting or any other magic. + a pointer is set. There is no reference counting or any other magic. You can free compiled patterns, contexts, and stacks in any order, any- - time. Just do not call pcre2_match() with a match context pointing to + time. Just do not call pcre2_match() with a match context pointing to an already freed stack, as that will cause SEGFAULT. (Also, do not free - a stack currently used by pcre2_match() in another thread). You can - also replace the stack in a context at any time when it is not in use. + a stack currently used by pcre2_match() in another thread). You can + also replace the stack in a context at any time when it is not in use. You should free the previous stack before assigning a replacement. - (5) Should I allocate/free a stack every time before/after calling + (5) Should I allocate/free a stack every time before/after calling pcre2_match()? - No, because this is too costly in terms of resources. 
However, you - could implement some clever idea which release the stack if it is not - used in let's say two minutes. The JIT callback can help to achieve + No, because this is too costly in terms of resources. However, you + could implement some clever idea which release the stack if it is not + used in let's say two minutes. The JIT callback can help to achieve this without keeping a list of patterns. - (6) OK, the stack is for long term memory allocation. But what happens - if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB + (6) OK, the stack is for long term memory allocation. But what happens + if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB kept until the stack is freed? - Especially on embedded sytems, it might be a good idea to release mem- - ory sometimes without freeing the stack. There is no API for this at - the moment. Probably a function call which returns with the currently - allocated memory for any stack and another which allows releasing mem- + Especially on embedded sytems, it might be a good idea to release mem- + ory sometimes without freeing the stack. There is no API for this at + the moment. Probably a function call which returns with the currently + allocated memory for any stack and another which allows releasing mem- ory (shrinking the stack) would be a good idea if someone needs this. (7) This is too much of a headache. Isn't there any better solution for JIT stack handling? - No, thanks to Windows. If POSIX threads were used everywhere, we could + No, thanks to Windows. If POSIX threads were used everywhere, we could throw out this complicated API. @@ -5116,18 +5141,18 @@ FREEING JIT SPECULATIVE MEMORY void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); The JIT executable allocator does not free all memory when it is possi- - ble. It expects new allocations, and keeps some free memory around to - improve allocation speed. 
However, in low memory conditions, it might - be better to free all possible memory. You can cause this to happen by - calling pcre2_jit_free_unused_memory(). Its argument is a general con- + ble. It expects new allocations, and keeps some free memory around to + improve allocation speed. However, in low memory conditions, it might + be better to free all possible memory. You can cause this to happen by + calling pcre2_jit_free_unused_memory(). Its argument is a general con- text, for custom memory management, or NULL for standard memory manage- ment. EXAMPLE CODE - This is a single-threaded example that specifies a JIT stack without - using a callback. A real program should include error checking after + This is a single-threaded example that specifies a JIT stack without + using a callback. A real program should include error checking after all the function calls. int rc; @@ -5155,29 +5180,31 @@ EXAMPLE CODE JIT FAST PATH API Because the API described above falls back to interpreted matching when - JIT is not available, it is convenient for programs that are written + JIT is not available, it is convenient for programs that are written for general use in many environments. However, calling JIT via pcre2_match() does have a performance impact. Programs that are written - for use where JIT is known to be available, and which need the best - possible performance, can instead use a "fast path" API to call JIT - matching directly instead of calling pcre2_match() (obviously only for + for use where JIT is known to be available, and which need the best + possible performance, can instead use a "fast path" API to call JIT + matching directly instead of calling pcre2_match() (obviously only for patterns that have been successfully processed by pcre2_jit_compile()). - The fast path function is called pcre2_jit_match(), and it takes - exactly the same arguments as pcre2_match(). 
The return values are also - the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or - complete) is requested that was not compiled. Unsupported option bits - (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT - option. + The fast path function is called pcre2_jit_match(), and it takes + exactly the same arguments as pcre2_match(). However, the subject + string must be specified with a length; PCRE2_ZERO_TERMINATED is not + supported. Unsupported option bits (for example, PCRE2_ANCHORED, + PCRE2_ENDANCHORED and PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is + the PCRE2_NO_JIT option. The return values are also the same as for + pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (par- + tial or complete) is requested that was not compiled. - When you call pcre2_match(), as well as testing for invalid options, a + When you call pcre2_match(), as well as testing for invalid options, a number of other sanity checks are performed on the arguments. For exam- ple, if the subject pointer is NULL, an immediate error is given. Also, - unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for - validity. In the interests of speed, these checks do not happen on the + unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for + validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined. - Bypassing the sanity checks and the pcre2_match() wrapping can give + Bypassing the sanity checks and the pcre2_match() wrapping can give speedups of more than 10%. @@ -5195,7 +5222,7 @@ AUTHOR REVISION - Last updated: 28 June 2018 + Last updated: 16 October 2018 Copyright (c) 1997-2018 University of Cambridge. 
------------------------------------------------------------------------------ diff --git a/doc/pcre2_dfa_match.3 b/doc/pcre2_dfa_match.3 index dfc3ae6..834158c 100644 --- a/doc/pcre2_dfa_match.3 +++ b/doc/pcre2_dfa_match.3 @@ -1,4 +1,4 @@ -.TH PCRE2_DFA_MATCH 3 "26 April 2018" "PCRE2 10.32" +.TH PCRE2_DFA_MATCH 3 "16 October 2018" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -39,6 +39,8 @@ depth limits. The \fIlength\fP and \fIstartoffset\fP values are code units, not characters. The options are: .sp PCRE2_ANCHORED Match only at the first position + PCRE2_COPY_MATCHED_SUBJECT + On success, make a private subject copy PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_NOTBOL Subject is not the beginning of a line PCRE2_NOTEOL Subject is not the end of a line diff --git a/doc/pcre2_match.3 b/doc/pcre2_match.3 index 9d15ec9..10a1a0f 100644 --- a/doc/pcre2_match.3 +++ b/doc/pcre2_match.3 @@ -1,4 +1,4 @@ -.TH PCRE2_MATCH 3 "14 November 2017" "PCRE2 10.31" +.TH PCRE2_MATCH 3 "16 October 2018" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -43,11 +43,13 @@ A match context is needed only if you want to: Change the backtracking depth limit Set custom memory management specifically for the match .sp -The \fIlength\fP and \fIstartoffset\fP values are code -units, not characters. The length may be given as PCRE2_ZERO_TERMINATE for a -subject that is terminated by a binary zero code unit. The options are: +The \fIlength\fP and \fIstartoffset\fP values are code units, not characters. +The length may be given as PCRE2_ZERO_TERMINATED for a subject that is +terminated by a binary zero code unit. 
The options are: .sp PCRE2_ANCHORED Match only at the first position + PCRE2_COPY_MATCHED_SUBJECT + On success, make a private subject copy PCRE2_ENDANCHORED Pattern can match only at end of subject PCRE2_NOTBOL Subject string is not the beginning of a line PCRE2_NOTEOL Subject string is not the end of a line diff --git a/doc/pcre2_match_data_free.3 b/doc/pcre2_match_data_free.3 index 56ed08b..5b920e4 100644 --- a/doc/pcre2_match_data_free.3 +++ b/doc/pcre2_match_data_free.3 @@ -1,4 +1,4 @@ -.TH PCRE2_MATCH_DATA_FREE 3 "28 June 2018" "PCRE2 10.32" +.TH PCRE2_MATCH_DATA_FREE 3 "16 October 2018" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH SYNOPSIS @@ -18,6 +18,10 @@ If \fImatch_data\fP is NULL, this function does nothing. Otherwise, using the memory freeing function from the general context or compiled pattern with which it was created, or \fBfree()\fP if that was not set. .P +If the PCRE2_COPY_MATCHED_SUBJECT was used for a successful match using this +match data block, the copy of the subject that was remembered with the block is +also freed. +.P There is a complete description of the PCRE2 native API in the .\" HREF \fBpcre2api\fP diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index eff19ab..61753fb 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "21 September 2018" "PCRE2 10.33" +.TH PCRE2API 3 "16 October 2018" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -1237,13 +1237,19 @@ NULL. NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block so that they can be referenced by the substring extraction functions. After running a match, you -must not free a compiled pattern (or a subject string) until after all +must not free a compiled pattern or a subject string until after all operations on the .\" HTML .\" match data block .\" -have taken place. 
+have taken place, unless, in the case of the subject string, you have used the +PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled +"Option bits for \fBpcre2_match()\fP" +.\" HTML +.\" +below. +.\" .P The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit settings that affect the compilation. It should be zero if no options are @@ -2390,7 +2396,13 @@ When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block so that they can be referenced by the extraction functions. After running a match, you must not free a compiled pattern or a subject string until after all operations on the -match data block (for that match) have taken place. +match data block (for that match) have taken place, unless, in the case of the +subject string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is +described in the section entitled "Option bits for \fBpcre2_match()\fP" +.\" HTML +.\" +below. +.\" .P When a match data block itself is no longer needed, it should be freed by calling \fBpcre2_match_data_free()\fP. If this function is called with a NULL @@ -2507,10 +2519,10 @@ the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \eA. .rs .sp The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be -zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED, -PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, -PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. -Their action is described below. +zero. The only bits that may be set are PCRE2_ANCHORED, +PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, +PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, +PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. 
.P Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the @@ -2524,6 +2536,22 @@ matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out to be anchored by virtue of its contents, it cannot be made unachored at matching time. Note that setting the option at match time disables JIT matching. +.sp + PCRE2_COPY_MATCHED_SUBJECT +.sp +By default, a pointer to the subject is remembered in the match data block so +that, after a successful match, it can be referenced by the substring +extraction functions. This means that the subject's memory must not be freed +until all such operations are complete. For some applications where the +lifetime of the subject string is not guaranteed, it may be necessary to make a +copy of the subject string, but it is wasteful to do this unless the match is +successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the +subject is copied and the new pointer is remembered in the match data block +instead of the original subject pointer. The memory allocator that was used for +the match block itself is used. The copy is automatically freed when +\fBpcre2_match_data_free()\fP is called to free the match data block. It is also +automatically freed if the match data block is re-used for another match +operation. .sp PCRE2_ENDANCHORED .sp @@ -2961,7 +2989,8 @@ The backtracking match limit was reached. If a pattern contains many nested backtracking points, heap memory is used to remember them. This error is given when the memory allocation function (default or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given -if the amount of memory needed exceeds the heap limit. +if the amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is +also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails. 
.sp PCRE2_ERROR_NULL .sp @@ -3579,11 +3608,12 @@ Here is an example of a simple call to \fBpcre2_dfa_match()\fP: .rs .sp The unused bits of the \fIoptions\fP argument for \fBpcre2_dfa_match()\fP must -be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED, -PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, -PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, -and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as -for \fBpcre2_match()\fP, so their description is not repeated here. +be zero. The only bits that may be set are PCRE2_ANCHORED, +PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, +PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, +PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last +four of these are exactly the same as for \fBpcre2_match()\fP, so their +description is not repeated here. .sp PCRE2_PARTIAL_HARD PCRE2_PARTIAL_SOFT @@ -3737,6 +3767,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 21 September 2018 +Last updated: 16 October 2018 Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/doc/pcre2jit.3 b/doc/pcre2jit.3 index c3b916b..26f320c 100644 --- a/doc/pcre2jit.3 +++ b/doc/pcre2jit.3 @@ -1,4 +1,4 @@ -.TH PCRE2JIT 3 "28 June 2018" "PCRE2 10.32" +.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT" @@ -124,9 +124,10 @@ pattern. .rs .sp The \fBpcre2_match()\fP options that are supported for JIT matching are -PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, -PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The -PCRE2_ANCHORED option is not supported at match time. +PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, +PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and +PCRE2_PARTIAL_SOFT. 
The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not +supported at match time. .P If the PCRE2_NO_JIT option is passed to \fBpcre2_match()\fP it disables the use of JIT, forcing matching by the interpreter code. @@ -376,10 +377,13 @@ available, and which need the best possible performance, can instead use a processed by \fBpcre2_jit_compile()\fP). .P The fast path function is called \fBpcre2_jit_match()\fP, and it takes exactly -the same arguments as \fBpcre2_match()\fP. The return values are also the same, -plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is -requested that was not compiled. Unsupported option bits (for example, -PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT option. +the same arguments as \fBpcre2_match()\fP. However, the subject string must be +specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported +option bits (for example, PCRE2_ANCHORED, PCRE2_ENDANCHORED and +PCRE2_COPY_MATCHED_SUBJECT) are ignored, as is the PCRE2_NO_JIT option. The +return values are also the same as for \fBpcre2_match()\fP, plus +PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested +that was not compiled. .P When you call \fBpcre2_match()\fP, as well as testing for invalid options, a number of other sanity checks are performed on the arguments. For example, if @@ -412,6 +416,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 28 June 2018 +Last updated: 16 October 2018 Copyright (c) 1997-2018 University of Cambridge. .fi diff --git a/src/pcre2.h.in b/src/pcre2.h.in index 68e7768..4d24f0d 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -167,36 +167,27 @@ D is inspected during pcre2_dfa_match() execution #define PCRE2_JIT_PARTIAL_HARD 0x00000004u #define PCRE2_JIT_INVALID_UTF 0x00000100u -/* These are for pcre2_match(), pcre2_dfa_match(), and pcre2_jit_match(). 
Note -that PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK can also be passed to these -functions (though pcre2_jit_match() ignores the latter since it bypasses all -sanity checks). */ +/* These are for pcre2_match(), pcre2_dfa_match(), pcre2_jit_match(), and +pcre2_substitute(). Some are allowed only for one of the functions, and in +these cases it is noted below. Note that PCRE2_ANCHORED, PCRE2_ENDANCHORED and +PCRE2_NO_UTF_CHECK can also be passed to these functions (though +pcre2_jit_match() ignores the latter since it bypasses all sanity checks). */ -#define PCRE2_NOTBOL 0x00000001u -#define PCRE2_NOTEOL 0x00000002u -#define PCRE2_NOTEMPTY 0x00000004u /* ) These two must be kept */ -#define PCRE2_NOTEMPTY_ATSTART 0x00000008u /* ) adjacent to each other. */ -#define PCRE2_PARTIAL_SOFT 0x00000010u -#define PCRE2_PARTIAL_HARD 0x00000020u - -/* These are additional options for pcre2_dfa_match(). */ - -#define PCRE2_DFA_RESTART 0x00000040u -#define PCRE2_DFA_SHORTEST 0x00000080u - -/* These are additional options for pcre2_substitute(), which passes any others -through to pcre2_match(). */ - -#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u -#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u -#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u -#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u -#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u - -/* A further option for pcre2_match(), not allowed for pcre2_dfa_match(), -ignored for pcre2_jit_match(). */ - -#define PCRE2_NO_JIT 0x00002000u +#define PCRE2_NOTBOL 0x00000001u +#define PCRE2_NOTEOL 0x00000002u +#define PCRE2_NOTEMPTY 0x00000004u /* ) These two must be kept */ +#define PCRE2_NOTEMPTY_ATSTART 0x00000008u /* ) adjacent to each other. 
*/ +#define PCRE2_PARTIAL_SOFT 0x00000010u +#define PCRE2_PARTIAL_HARD 0x00000020u +#define PCRE2_DFA_RESTART 0x00000040u /* pcre2_dfa_match() only */ +#define PCRE2_DFA_SHORTEST 0x00000080u /* pcre2_dfa_match() only */ +#define PCRE2_SUBSTITUTE_GLOBAL 0x00000100u /* pcre2_substitute() only */ +#define PCRE2_SUBSTITUTE_EXTENDED 0x00000200u /* pcre2_substitute() only */ +#define PCRE2_SUBSTITUTE_UNSET_EMPTY 0x00000400u /* pcre2_substitute() only */ +#define PCRE2_SUBSTITUTE_UNKNOWN_UNSET 0x00000800u /* pcre2_substitute() only */ +#define PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 0x00001000u /* pcre2_substitute() only */ +#define PCRE2_NO_JIT 0x00002000u /* Not for pcre2_dfa_match() */ +#define PCRE2_COPY_MATCHED_SUBJECT 0x00004000u /* Options for pcre2_pattern_convert(). */ diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index 2db8f96..51c05c3 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -85,7 +85,8 @@ in others, so I abandoned this code. */ #define PUBLIC_DFA_MATCH_OPTIONS \ (PCRE2_ANCHORED|PCRE2_ENDANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \ PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \ - PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART) + PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART| \ + PCRE2_COPY_MATCHED_SUBJECT) /************************************************* @@ -3228,6 +3229,8 @@ pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length, pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount) { int rc; +int was_zero_terminated = 0; + const pcre2_real_code *re = (const pcre2_real_code *)code; PCRE2_SPTR start_match; @@ -3267,7 +3270,11 @@ rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE; /* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated subject string. 
*/ -if (length == PCRE2_ZERO_TERMINATED) length = PRIV(strlen)(subject); +if (length == PCRE2_ZERO_TERMINATED) + { + length = PRIV(strlen)(subject); + was_zero_terminated = 1; + } /* Plausibility checks */ @@ -3520,10 +3527,21 @@ if ((re->flags & PCRE2_LASTSET) != 0) } } +/* If the match data block was previously used with PCRE2_COPY_MATCHED_SUBJECT, +free the memory that was obtained. */ + +if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0) + { + match_data->memctl.free((void *)match_data->subject, + match_data->memctl.memory_data); + match_data->flags &= ~PCRE2_MD_COPIED_SUBJECT; + } + /* Fill in fields that are always returned in the match data. */ match_data->code = re; match_data->subject = subject; +match_data->flags = 0; match_data->mark = NULL; match_data->matchedby = PCRE2_MATCHEDBY_DFA_INTERPRETER; @@ -3818,6 +3836,17 @@ for (;;) match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject); match_data->startchar = (PCRE2_SIZE)(start_match - subject); match_data->rc = rc; + + if (rc >= 0 &&(options & PCRE2_COPY_MATCHED_SUBJECT) != 0) + { + length = CU2BYTES(length + was_zero_terminated); + match_data->subject = match_data->memctl.malloc(length, + match_data->memctl.memory_data); + if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY; + memcpy((void *)match_data->subject, subject, length); + match_data->flags |= PCRE2_MD_COPIED_SUBJECT; + } + goto EXIT; } diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index b13868a..4f50eef 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -534,6 +534,10 @@ bytes in a code unit in that mode. */ enum { PCRE2_MATCHEDBY_INTERPRETER, /* pcre2_match() */ PCRE2_MATCHEDBY_DFA_INTERPRETER, /* pcre2_dfa_match() */ PCRE2_MATCHEDBY_JIT }; /* pcre2_jit_match() */ + +/* Values for the flags field in a match data block. */ + +#define PCRE2_MD_COPIED_SUBJECT 0x01u /* Magic number to provide a small check against being handed junk. 
*/
diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h
index d8ca92f..dc707b2 100644
--- a/src/pcre2_intmodedep.h
+++ b/src/pcre2_intmodedep.h
@@ -658,7 +658,8 @@ typedef struct pcre2_real_match_data {
   PCRE2_SIZE       leftchar;      /* Offset to leftmost code unit */
   PCRE2_SIZE       rightchar;     /* Offset to rightmost code unit */
   PCRE2_SIZE       startchar;     /* Offset to starting code unit */
-  uint16_t         matchedby;     /* Type of match (normal, JIT, DFA) */
+  uint8_t          matchedby;     /* Type of match (normal, JIT, DFA) */
+  uint8_t          flags;         /* Various flags */
   uint16_t         oveccount;     /* Number of pairs */
   int              rc;            /* The return code from the match */
   PCRE2_SIZE       ovector[131072]; /* Must be last in the structure */
diff --git a/src/pcre2_jit_match.c b/src/pcre2_jit_match.c
index 5a66545..484151c 100644
--- a/src/pcre2_jit_match.c
+++ b/src/pcre2_jit_match.c
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
 
                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016 University of Cambridge
+          New API code Copyright (c) 2016-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -174,6 +174,7 @@ if (rc > (int)oveccount) rc = 0;
 
 match_data->code = re;
 match_data->subject = subject;
+match_data->flags = 0;
 match_data->rc = rc;
 match_data->startchar = arguments.startchar_ptr - subject;
 match_data->leftchar = 0;
diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index 8700592..7f39d08 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -69,11 +69,12 @@ information, and fields within it.
*/ #define PUBLIC_MATCH_OPTIONS \ (PCRE2_ANCHORED|PCRE2_ENDANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \ PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \ - PCRE2_PARTIAL_SOFT|PCRE2_NO_JIT) + PCRE2_PARTIAL_SOFT|PCRE2_NO_JIT|PCRE2_COPY_MATCHED_SUBJECT) #define PUBLIC_JIT_MATCH_OPTIONS \ (PCRE2_NO_UTF_CHECK|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY|\ - PCRE2_NOTEMPTY_ATSTART|PCRE2_PARTIAL_SOFT|PCRE2_PARTIAL_HARD) + PCRE2_NOTEMPTY_ATSTART|PCRE2_PARTIAL_SOFT|PCRE2_PARTIAL_HARD|\ + PCRE2_COPY_MATCHED_SUBJECT) /* Non-error returns from and within the match() function. Error returns are externally defined PCRE2_ERROR_xxx codes, which are all negative. */ @@ -5014,7 +5015,7 @@ fprintf(stderr, "++ op=%d\n", *Fecode); must record a backtracking point and also set up a chained frame. */ case OP_ONCE: - case OP_SCRIPT_RUN: + case OP_SCRIPT_RUN: case OP_SBRA: Lframe_type = GF_NOCAPTURE | Fop; @@ -5526,14 +5527,14 @@ fprintf(stderr, "++ op=%d\n", *Fecode); case OP_ASSERT_NOT: case OP_ASSERTBACK_NOT: RRETURN(MATCH_MATCH); - - /* At the end of a script run, apply the script-checking rules. This code - will never by exercised if Unicode support it not compiled, because in + + /* At the end of a script run, apply the script-checking rules. This code + will never by exercised if Unicode support it not compiled, because in that environment script runs cause an error at compile time. */ - + case OP_SCRIPT_RUN: if (!PRIV(script_run)(P->eptr, Feptr, utf)) RRETURN(MATCH_NOMATCH); - break; + break; /* Whole-pattern recursion is coded as a recurse into group 0, so it won't be picked up here. Instead, we catch it when the OP_END is reached. 
@@ -6009,10 +6010,11 @@ pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length, pcre2_match_context *mcontext) { int rc; +int was_zero_terminated = 0; const uint8_t *start_bits = NULL; - const pcre2_real_code *re = (const pcre2_real_code *)code; + BOOL anchored; BOOL firstline; BOOL has_first_cu = FALSE; @@ -6052,7 +6054,11 @@ mb->stack_frames = (heapframe *)stack_frames_vector; /* A length equal to PCRE2_ZERO_TERMINATED implies a zero-terminated subject string. */ -if (length == PCRE2_ZERO_TERMINATED) length = PRIV(strlen)(subject); +if (length == PCRE2_ZERO_TERMINATED) + { + length = PRIV(strlen)(subject); + was_zero_terminated = 1; + } end_subject = subject + length; /* Plausibility checks */ @@ -6166,6 +6172,16 @@ time. */ if (mcontext != NULL && mcontext->offset_limit != PCRE2_UNSET && (re->overall_options & PCRE2_USE_OFFSET_LIMIT) == 0) return PCRE2_ERROR_BADOFFSETLIMIT; + +/* If the match data block was previously used with PCRE2_COPY_MATCHED_SUBJECT, +free the memory that was obtained. */ + +if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0) + { + match_data->memctl.free((void *)match_data->subject, + match_data->memctl.memory_data); + match_data->flags &= ~PCRE2_MD_COPIED_SUBJECT; + } /* If the pattern was successfully studied with JIT support, run the JIT executable instead of the rest of this function. 
Most options must be set at
@@ -6178,7 +6194,19 @@ if (re->executable_jit != NULL && (options & ~PUBLIC_JIT_MATCH_OPTIONS) == 0)
   {
   rc = pcre2_jit_match(code, subject, length, start_offset, options, match_data,
     mcontext);
-  if (rc != PCRE2_ERROR_JIT_BADOPTION) return rc;
+  if (rc != PCRE2_ERROR_JIT_BADOPTION)
+    {
+    if (rc >= 0 && (options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
+      {
+      length = CU2BYTES(length + was_zero_terminated);
+      match_data->subject = match_data->memctl.malloc(length,
+        match_data->memctl.memory_data);
+      if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY;
+      memcpy((void *)match_data->subject, subject, length);
+      match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
+      }
+    return rc;
+    }
   }
 #endif
 
@@ -6819,12 +6847,14 @@ if (mb->match_frames != mb->stack_frames)
 
 match_data->code = re;
 match_data->subject = subject;
+match_data->flags = 0;
 match_data->mark = mb->mark;
 match_data->matchedby = PCRE2_MATCHEDBY_INTERPRETER;
 
 /* Handle a fully successful match. Set the return code to the number of
 captured strings, or 0 if there were too many to fit into the ovector, and then
-set the remaining returned values before returning. */
+set the remaining returned values before returning. Make a copy of the subject
+string if requested. */
 
 if (rc == MATCH_MATCH)
   {
@@ -6834,6 +6864,17 @@ if (rc == MATCH_MATCH)
   match_data->leftchar = mb->start_used_ptr - subject;
   match_data->rightchar = ((mb->last_used_ptr > mb->end_match_ptr)?
     mb->last_used_ptr : mb->end_match_ptr) - subject;
+
+  if ((options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
+    {
+    length = CU2BYTES(length + was_zero_terminated);
+    match_data->subject = match_data->memctl.malloc(length,
+      match_data->memctl.memory_data);
+    if (match_data->subject == NULL) return PCRE2_ERROR_NOMEMORY;
+    memcpy((void *)match_data->subject, subject, length);
+    match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
+    }
+
   return match_data->rc;
   }
 
diff --git a/src/pcre2_match_data.c b/src/pcre2_match_data.c
index b297f32..b480dec 100644
--- a/src/pcre2_match_data.c
+++ b/src/pcre2_match_data.c
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language.
 
                        Written by Philip Hazel
      Original API code Copyright (c) 1997-2012 University of Cambridge
-          New API code Copyright (c) 2016-2017 University of Cambridge
+          New API code Copyright (c) 2016-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -63,6 +63,7 @@ yield = PRIV(memctl_malloc)(
   (pcre2_memctl *)gcontext);
 if (yield == NULL) return NULL;
 yield->oveccount = oveccount;
+yield->flags = 0;
 return yield;
 }
 
@@ -93,7 +94,12 @@ PCRE2_EXP_DEFN void PCRE2_CALL_CONVENTION
 pcre2_match_data_free(pcre2_match_data *match_data)
 {
 if (match_data != NULL)
+  {
+  if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
+    match_data->memctl.free((void *)match_data->subject,
+      match_data->memctl.memory_data);
   match_data->memctl.free(match_data, match_data->memctl.memory_data);
+  }
 }
 
 
diff --git a/src/pcre2test.c b/src/pcre2test.c
index 5bf01bf..6af25cb 100644
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@@ -620,6 +620,7 @@ static modstruct modlist[] = {
   { "convert_glob_separator",     MOD_PAT,  MOD_CHR, 0,                          PO(convert_glob_separator) },
   { "convert_length",             MOD_PAT,  MOD_INT, 0,                          PO(convert_length) },
   { "copy",                       MOD_DAT,  MOD_NN,  DO(copy_numbers),           DO(copy_names) },
+  { "copy_matched_subject",       MOD_DAT,  MOD_OPT, PCRE2_COPY_MATCHED_SUBJECT, DO(options) },
   { "debug",                      MOD_PAT,  MOD_CTL, CTL_DEBUG,                  PO(control) },
   { "depth_limit",                MOD_CTM,  MOD_INT, 0,                          MO(depth_limit) },
   { "dfa",                        MOD_DAT,  MOD_CTL, CTL_DFA,                    DO(control) },
@@ -4180,7 +4181,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s",
   ((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
   ((options & PCRE2_EXTRA_MATCH_WORD) != 0)? " match_word" : "",
   ((options & PCRE2_EXTRA_MATCH_LINE) != 0)? " match_line" : "",
-  ((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",
+  ((options & PCRE2_EXTRA_ESCAPED_CR_IS_LF) != 0)? " escaped_cr_is_lf" : "",
   after);
 }
 
@@ -4196,11 +4197,13 @@ else fprintf(outfile, "%s%s%s%s%s%s%s",
 static void
 show_match_options(uint32_t options)
 {
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s",
   ((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
+  ((options & PCRE2_COPY_MATCHED_SUBJECT) != 0)? " copy_matched_subject" : "",
   ((options & PCRE2_DFA_RESTART) != 0)? " dfa_restart" : "",
   ((options & PCRE2_DFA_SHORTEST) != 0)? " dfa_shortest" : "",
   ((options & PCRE2_ENDANCHORED) != 0)? " endanchored" : "",
+  ((options & PCRE2_NO_JIT) != 0)? " no_jit" : "",
   ((options & PCRE2_NO_UTF_CHECK) != 0)? " no_utf_check" : "",
   ((options & PCRE2_NOTBOL) != 0)? " notbol" : "",
   ((options & PCRE2_NOTEMPTY) != 0)? " notempty" : "",
@@ -7442,6 +7445,25 @@ for (gmatched = 0;; gmatched++)
         }
       }
 
+    /* If PCRE2_COPY_MATCHED_SUBJECT was set, check that things are as they
+    should be, but not for fast JIT, where it isn't supported. */
+
+    if ((dat_datctl.options & PCRE2_COPY_MATCHED_SUBJECT) != 0 &&
+        (pat_patctl.control & CTL_JITFAST) == 0)
+      {
+      if ((FLD(match_data, flags) & PCRE2_MD_COPIED_SUBJECT) == 0)
+        fprintf(outfile,
+          "** PCRE2 error: flag not set after copy_matched_subject\n");
+
+      if (CASTFLD(void *, match_data, subject) == pp)
+        fprintf(outfile,
+          "** PCRE2 error: copy_matched_subject has not copied\n");
+
+      if (memcmp(CASTFLD(void *, match_data, subject), pp, ulen) != 0)
+        fprintf(outfile,
+          "** PCRE2 error: copy_matched_subject mismatch\n");
+      }
+
     /* If this is not the first time round a global loop, check that the
     returned string has changed. If it has not, check for an empty string match
     at different starting offset from the previous match. This is a failed test
diff --git a/testdata/testinput17 b/testdata/testinput17
index 0944151..65bbbb9 100644
--- a/testdata/testinput17
+++ b/testdata/testinput17
@@ -299,9 +299,9 @@
 # ----
 
 /[aC]/mg,firstline,newline=lf
-match\nmatch
+    match\nmatch
 
 /[aCz]/mg,firstline,newline=lf
-match\nmatch
+    match\nmatch
 
 # End of testinput17
diff --git a/testdata/testinput2 b/testdata/testinput2
index ad7f477..565ce18 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -5531,4 +5531,11 @@ a)"xI
 
 /(?(*script_run:xxx)zzz)/
 
+/foobar/
+    the foobar thing\=copy_matched_subject
+    the foobar thing\=copy_matched_subject,zero_terminate
+
+/foobar/g
+    the foobar thing foobar again\=copy_matched_subject
+
 # End of testinput2
diff --git a/testdata/testinput6 b/testdata/testinput6
index f7dedb2..71218a3 100644
--- a/testdata/testinput6
+++ b/testdata/testinput6
@@ -4955,4 +4955,11 @@
 \= Expect no match
     \na
 
+/foobar/
+    the foobar thing\=copy_matched_subject
+    the foobar thing\=copy_matched_subject,zero_terminate
+
+/foobar/g
+    the foobar thing foobar again\=copy_matched_subject
+
 # End of testinput6
diff --git a/testdata/testoutput17 b/testdata/testoutput17
index acf00e0..f5d751a 100644
--- a/testdata/testoutput17
+++ b/testdata/testoutput17
@@ -543,11 +543,11 @@ Failed: error -47: match limit exceeded
 # ----
 
 /[aC]/mg,firstline,newline=lf
-match\nmatch
+    match\nmatch
  0: a (JIT)
 
 /[aCz]/mg,firstline,newline=lf
-match\nmatch
+    match\nmatch
  0: a (JIT)
 
 # End of testinput17
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 1365302..9ecbc9f 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -16821,6 +16821,17 @@ Failed: error 128 at offset 10: assertion expected after (?( or (?(?C)
 /(?(*script_run:xxx)zzz)/
 Failed: error 128 at offset 14: assertion expected after (?( or (?(?C)
 
+/foobar/
+    the foobar thing\=copy_matched_subject
+ 0: foobar
+    the foobar thing\=copy_matched_subject,zero_terminate
+ 0: foobar
+
+/foobar/g
+    the foobar thing foobar again\=copy_matched_subject
+ 0: foobar
+ 0: foobar
+
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index caec833..f78f600 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -7783,4 +7783,15 @@ No match
     \na
 No match
 
+/foobar/
+    the foobar thing\=copy_matched_subject
+ 0: foobar
+    the foobar thing\=copy_matched_subject,zero_terminate
+ 0: foobar
+
+/foobar/g
+    the foobar thing foobar again\=copy_matched_subject
+ 0: foobar
+ 0: foobar
+
 # End of testinput6
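
Note on the ownership rule in the patch above. The following is a minimal, self-contained C sketch of that rule, not the library code. Every sketch_* name and SKETCH_* macro is hypothetical, the "match" is a plain substring search, and malloc()/free() stand in for the memctl callbacks. It models the three places the patch touches: releasing an old copy when a match data block is reused, copying the subject (plus the terminating zero for a zero-terminated subject) only after a successful match, and releasing the copy when the block is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; not the real PCRE2 definitions. */
#define SKETCH_COPY_MATCHED_SUBJECT 0x1u  /* caller option: copy on success */
#define SKETCH_MD_COPIED_SUBJECT    0x1u  /* block flag: subject is owned */

typedef struct {
  const char *subject;   /* borrowed from the caller, or an owned copy */
  unsigned int flags;
  int rc;
} sketch_match_data;

static sketch_match_data *sketch_match_data_create(void)
{
  sketch_match_data *md = malloc(sizeof *md);
  if (md == NULL) return NULL;
  md->subject = NULL;
  md->flags = 0;         /* no copy is owned yet */
  md->rc = 0;
  return md;
}

/* Free a copy left over from an earlier call and clear the flag. */
static void sketch_release_copy(sketch_match_data *md)
{
  assert(md != NULL);
  if ((md->flags & SKETCH_MD_COPIED_SUBJECT) != 0)
    {
    free((void *)md->subject);
    md->subject = NULL;
    md->flags &= ~SKETCH_MD_COPIED_SUBJECT;
    }
}

/* Toy matcher: 1 if needle occurs in subject, -1 for no match, -2 if the
copy cannot be allocated. When zero_terminated is non-zero, length is ignored
and recomputed with strlen(). */
static int sketch_match(const char *needle, const char *subject,
  size_t length, int zero_terminated, unsigned int options,
  sketch_match_data *md)
{
  size_t nlen = strlen(needle);
  size_t i;
  int found = 0;

  sketch_release_copy(md);          /* block is being reused */
  if (zero_terminated) length = strlen(subject);

  md->subject = subject;            /* borrowed unless copied below */
  md->rc = -1;

  for (i = 0; nlen <= length && i <= length - nlen; i++)
    if (memcmp(subject + i, needle, nlen) == 0) { found = 1; break; }
  if (!found) return md->rc;        /* no copy is made on failure */

  md->rc = 1;
  if ((options & SKETCH_COPY_MATCHED_SUBJECT) != 0)
    {
    /* Include the terminating zero when the subject was zero-terminated. */
    size_t bytes = length + (zero_terminated ? 1 : 0);
    char *copy = malloc(bytes != 0 ? bytes : 1);
    if (copy == NULL) return -2;
    memcpy(copy, subject, bytes);
    md->subject = copy;
    md->flags |= SKETCH_MD_COPIED_SUBJECT;
    }
  return md->rc;
}

static void sketch_match_data_free(sketch_match_data *md)
{
  if (md != NULL)
    {
    sketch_release_copy(md);        /* owned copy goes first */
    free(md);
    }
}
```

In this model, a caller that passes SKETCH_COPY_MATCHED_SUBJECT may overwrite or free its own buffer after a successful call, and md->subject stays valid until the block is reused or freed. Without the option, md->subject is the caller's pointer and the caller must keep the buffer alive.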