From 6e245572b875faccccd14dd10b5c242741297e8f Mon Sep 17 00:00:00 2001
From: "Philip.Hazel"
@@ -3634,7 +3635,7 @@ Cambridge, England.
-Last updated: 02 July 2018
+Last updated: 27 July 2018
Copyright © 1997-2018 University of Cambridge.
diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html
index f7c694c..3123111 100644
--- a/doc/html/pcre2compat.html
+++ b/doc/html/pcre2compat.html
@@ -42,13 +42,14 @@ assertion is a condition that has a matching branch (that is, the condition is
false).
-4. The following Perl escape sequences are not supported: \l, \u, \L,
-\U, and \N when followed by a character name or Unicode value. (\N on its
-own, matching a non-newline character, is supported.) In fact these are
+4. The following Perl escape sequences are not supported: \F, \l, \L, \u,
+\U, and \N when followed by a character name. \N on its own, matching a
+non-newline character, and \N{U+dd..}, matching a Unicode code point, are
+supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
-generated by default. However, if the PCRE2_ALT_BSUX option is set,
-\U and \u are interpreted as ECMAScript interprets them.
+generated by default. However, if the PCRE2_ALT_BSUX option is set, \U and \u
+are interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
@@ -61,17 +62,22 @@ internal representation of Unicode characters, there is no need to
implement the somewhat messy concept of surrogates."
-6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
-in between are treated as literals. This is slightly different from Perl in
-that $ and @ are also handled as literals inside the quotes. In Perl, they
-cause variable interpolation (but of course PCRE2 does not have variables).
-Note the following examples:
+6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
+in between are treated as literals. However, this is slightly different from
+Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
+they cause variable interpolation (but of course PCRE2 does not have
+variables). Also, Perl does "double-quotish backslash interpolation" on any
+backslashes between \Q and \E which, its documentation says, "may lead to
+confusing results". PCRE2 treats a backslash between \Q and \E just like any
+other character. Note the following examples:
-  Pattern            PCRE2 matches   Perl matches
+  Pattern            PCRE2 matches   Perl matches
   \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
   \Qabc\$xyz\E       abc\$xyz        abc\$xyz
   \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
+  \QA\B\E            A\B             A\B
+  \Q\\E              \               \\E
The \Q...\E sequence is recognized both inside and outside character classes.
@@ -229,9 +235,9 @@ Cambridge, England.
REVISION
-Last updated: 18 April 2017
+Last updated: 28 July 2018
-Copyright © 1997-2017 University of Cambridge.
+Copyright © 1997-2018 University of Cambridge.
Return to the PCRE2 index page.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 698e7bf..9643a82 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -357,13 +357,18 @@ of the pattern.
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \Q and \E. This is different from Perl in
that $ and @ are handled as literals in \Q...\E sequences in PCRE2, whereas
-in Perl, $ and @ cause variable interpolation. Note the following examples:
+in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
+backslash interpolation" on any backslashes between \Q and \E which, its
+documentation says, "may lead to confusing results". PCRE2 treats a backslash
+between \Q and \E just like any other character. Note the following examples:
   Pattern            PCRE2 matches   Perl matches
   \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
   \Qabc\$xyz\E       abc\$xyz        abc\$xyz
   \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
+  \QA\B\E            A\B             A\B
+  \Q\\E              \               \\E
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
@@ -545,7 +550,7 @@ character class, these sequences have different meanings.
Unsupported escape sequences
-In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
@@ -1635,21 +1640,27 @@ Perl option letters enclosed between "(?" and ")". The option letters are
xx for PCRE2_EXTENDED_MORE
For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended"
-options are not independent; unsetting either one cancels the effects of both
-of them.
+unset these options by preceding the relevant letters with a hyphen, for
+example (?-im). The two "extended" options are not independent; unsetting either
+one cancels the effects of both of them.
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
-permitted. If a letter appears both before and after the hyphen, the option is
-unset. An empty options setting "(?)" is allowed. Needless to say, it has no
-effect.
+permitted. Only one hyphen may appear in the options string. If a letter
+appears both before and after the hyphen, the option is unset. An empty options
+setting "(?)" is allowed. Needless to say, it has no effect.
+
+If the first character following (? is a circumflex, it causes all of the above
+options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
+the circumflex to cause some options to be re-instated, but a hyphen may not
+appear.
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
the same way as the Perl-compatible options by using the characters J and U
-respectively.
+respectively. However, these are not unset by (?^).
When one of these option changes occurs at top level (that is, not inside
@@ -3579,7 +3590,7 @@ Cambridge, England.
-Last updated: 27 July 2018
+Last updated: 28 July 2018
Copyright © 1997-2018 University of Cambridge.
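Reviewer note: the \Q...\E rules documented in the pcre2pattern.html hunks above can be modelled outside PCRE2. The Python sketch below is illustrative only. `expand_quotes` is a hypothetical helper, not part of PCRE2 or of any binding, and Python's `re` module is used merely to check the resulting literals. It assumes that an isolated \E is ignored and that an unterminated \Q runs to the end of the pattern, which is how the PCRE2 documentation describes those cases.

```python
import re


def expand_quotes(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper).

    Models the documented PCRE2 rule: every character between \\Q and \\E,
    including $, @ and backslash, is taken literally.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # Assumption: no closing \E, so the literal runs to the end.
                literal, i = pattern[i + 2:], n
            else:
                literal, i = pattern[i + 2:end], end + 2
            out.append(re.escape(literal))
        elif pattern.startswith("\\E", i):
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])  # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Rows of the table in the patch: pattern and the text PCRE2 matches.
    rows = [
        (r"\Qabc$xyz\E", "abc$xyz"),
        (r"\Qabc\$xyz\E", "abc\\$xyz"),
        (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
        (r"\QA\B\E", "A\\B"),
        (r"\Q\\E", "\\"),
    ]
    for pat, subject in rows:
        print(pat, bool(re.fullmatch(expand_quotes(pat), subject)))
```

The last two rows are the ones this patch adds to the table: a backslash inside the quoted span is an ordinary character, so `\Q\\E` quotes a single backslash.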
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index d087788..0f492a1 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -456,7 +456,15 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
(?x) extended: ignore white space except in classes
(?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
+ (?^) unset imnsx options
+Unsetting x or xx unsets both. Several options may be set at once, and a
+mixture of setting and unsetting such as (?i-x) is allowed, but there may be
+only one hyphen. Setting (but no unsetting) is allowed after (?^ for example
+(?^in). An option setting may appear at the start of a non-capturing group, for
+example (?i:...).
+
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
@@ -624,7 +632,7 @@ Cambridge, England.
-Last updated: 27 July 2018
+Last updated: 28 July 2018
Copyright © 1997-2018 University of Cambridge.
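Reviewer note: the option-setting rules added in this hunk (a single hyphen, (?^) unsetting imnsx, x and xx unset together) can be checked with a small model. The Python sketch below is an assumption-laden illustration, not PCRE2's parser: `apply_setting` and `tokenize` are hypothetical names, options are tracked as plain letter strings rather than PCRE2_* bits, and setting xx is assumed to imply x.

```python
PERL_OPTIONS = {"i", "m", "n", "s", "x", "xx"}  # the options unset by (?^)


def tokenize(letters: str) -> list:
    """Split an option string into tokens, treating 'xx' as one token."""
    tokens, i = [], 0
    while i < len(letters):
        if letters.startswith("xx", i):
            tokens.append("xx")
            i += 2
        elif letters[i] in "imnsxJU":
            tokens.append(letters[i])
            i += 1
        else:
            raise ValueError(f"unrecognized option character {letters[i]!r}")
    return tokens


def apply_setting(setting: str, active=frozenset()) -> frozenset:
    """Apply the text between '(?' and ')' to a set of active options."""
    result = set(active)
    if setting.startswith("^"):
        if "-" in setting:
            raise ValueError("a hyphen may not appear after (?^")
        result -= PERL_OPTIONS  # J and U are deliberately left alone
        to_set, to_unset = tokenize(setting[1:]), []
    else:
        if setting.count("-") > 1:
            raise ValueError("only one hyphen may appear")
        before, _, after = setting.partition("-")
        to_set, to_unset = tokenize(before), tokenize(after)
    for tok in to_set:
        result.add(tok)
        if tok == "xx":
            result.add("x")  # assumption: xx includes the effect of x
    for tok in to_unset:  # unsetting runs last, so it wins over setting
        result.discard(tok)
        if tok in ("x", "xx"):
            result -= {"x", "xx"}  # unsetting either one unsets both
    return frozenset(result)


if __name__ == "__main__":
    print(sorted(apply_setting("im-sx", {"s", "x"})))
    print(sorted(apply_setting("^in", {"s", "J"})))
```

Applying the unset list after the set list is what makes a letter that appears on both sides of the hyphen end up unset, as the documentation requires.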
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 69effd7..dd17020 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -1454,382 +1454,383 @@ COMPILING A PATTERN
this option, a dot does not match when the current position in the sub-
ject is at a newline. This option is equivalent to Perl's /s option,
and it can be changed within a pattern by a (?s) option setting. A neg-
- ative class such as [^a] always matches newline characters, independent
- of the setting of this option.
+ ative class such as [^a] always matches newline characters, and the \N
+ escape sequence always matches a non-newline character, independent of
+ the setting of PCRE2_DOTALL.
PCRE2_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcre2pattern documentation.
PCRE2_ENDANCHORED
- If this bit is set, the end of any pattern match must be right at the
+ If this bit is set, the end of any pattern match must be right at the
end of the string being searched (the "subject string"). If the pattern
match succeeds by reaching (*ACCEPT), but does not reach the end of the
- subject, the match fails at the current starting point. For unanchored
- patterns, a new match is then tried at the next starting point. How-
+ subject, the match fails at the current starting point. For unanchored
+ patterns, a new match is then tried at the next starting point. How-
ever, if the match succeeds by reaching the end of the pattern, but not
- the end of the subject, backtracking occurs and an alternative match
+ the end of the subject, backtracking occurs and an alternative match
may be found. Consider these two patterns:
.(*ACCEPT)|..
.|..
- If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
- "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
- can also be achieved by appropriate constructs in the pattern itself,
+ If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
+ "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
+ can also be achieved by appropriate constructs in the pattern itself,
which is the only way to do it in Perl.
For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
- to the first (that is, the longest) matched string. Other parallel
- matches, which are necessarily substrings of the first one, must obvi-
+ to the first (that is, the longest) matched string. Other parallel
+ matches, which are necessarily substrings of the first one, must obvi-
ously end before the end of the subject.
PCRE2_EXTENDED
- If this bit is set, most white space characters in the pattern are
- totally ignored except when escaped or inside a character class. How-
- ever, white space is not allowed within sequences such as (?> that
+ If this bit is set, most white space characters in the pattern are
+ totally ignored except when escaped or inside a character class. How-
+ ever, white space is not allowed within sequences such as (?> that
introduce various parenthesized subpatterns, nor within numerical quan-
- tifiers such as {1,3}. Ignorable white space is permitted between an
- item and a following quantifier and between a quantifier and a follow-
+ tifiers such as {1,3}. Ignorable white space is permitted between an
+ item and a following quantifier and between a quantifier and a follow-
ing + that indicates possessiveness.
- PCRE2_EXTENDED also causes characters between an unescaped # outside a
- character class and the next newline, inclusive, to be ignored, which
+ PCRE2_EXTENDED also causes characters between an unescaped # outside a
+ character class and the next newline, inclusive, to be ignored, which
makes it possible to include comments inside complicated patterns. Note
- that the end of this type of comment is a literal newline sequence in
+ that the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline do not
- count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
+ count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
changed within a pattern by a (?x) option setting.
Which characters are interpreted as newlines can be specified by a set-
- ting in the compile context that is passed to pcre2_compile() or by a
- special sequence at the start of the pattern, as described in the sec-
- tion entitled "Newline conventions" in the pcre2pattern documentation.
+ ting in the compile context that is passed to pcre2_compile() or by a
+ special sequence at the start of the pattern, as described in the sec-
+ tion entitled "Newline conventions" in the pcre2pattern documentation.
A default is defined when PCRE2 is built.
PCRE2_EXTENDED_MORE
- This option has the effect of PCRE2_EXTENDED, but, in addition,
- unescaped space and horizontal tab characters are ignored inside a
- character class. PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx
- option, and it can be changed within a pattern by a (?xx) option set-
+ This option has the effect of PCRE2_EXTENDED, but, in addition,
+ unescaped space and horizontal tab characters are ignored inside a
+ character class. PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx
+ option, and it can be changed within a pattern by a (?xx) option set-
ting.
PCRE2_FIRSTLINE
If this option is set, the start of an unanchored pattern match must be
- before or at the first newline in the subject string following the
- start of matching, though the matched text may continue over the new-
+ before or at the first newline in the subject string following the
+ start of matching, though the matched text may continue over the new-
line. If startoffset is non-zero, the limiting newline is not necessar-
- ily the first newline in the subject. For example, if the subject
+ ily the first newline in the subject. For example, if the subject
string is "abc\nxyz" (where \n represents a single-character newline) a
- pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
- greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
- general limiting facility. If PCRE2_FIRSTLINE is set with an offset
- limit, a match must occur in the first line and also within the offset
+ pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
+ greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
+ general limiting facility. If PCRE2_FIRSTLINE is set with an offset
+ limit, a match must occur in the first line and also within the offset
limit. In other words, whichever limit comes first is used.
PCRE2_LITERAL
If this option is set, all meta-characters in the pattern are disabled,
- and it is treated as a literal string. Matching literal strings with a
+ and it is treated as a literal string. Matching literal strings with a
regular expression engine is not the most efficient way of doing it. If
- you are doing a lot of literal matching and are worried about effi-
+ you are doing a lot of literal matching and are worried about effi-
ciency, you should consider using other approaches. The only other main
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
- PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
- PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+ PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+ PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_MATCH_UNSET_BACKREF
- If this option is set, a backreference to an unset subpattern group
- matches an empty string (by default this causes the current matching
- alternative to fail). A pattern such as (\1)(a) succeeds when this
- option is set (assuming it can find an "a" in the subject), whereas it
- fails by default, for Perl compatibility. Setting this option makes
+ If this option is set, a backreference to an unset subpattern group
+ matches an empty string (by default this causes the current matching
+ alternative to fail). A pattern such as (\1)(a) succeeds when this
+ option is set (assuming it can find an "a" in the subject), whereas it
+ fails by default, for Perl compatibility. Setting this option makes
PCRE2 behave more like ECMAscript (aka JavaScript).
PCRE2_MULTILINE
- By default, for the purposes of matching "start of line" and "end of
- line", PCRE2 treats the subject string as consisting of a single line
- of characters, even if it actually contains newlines. The "start of
- line" metacharacter (^) matches only at the start of the string, and
- the "end of line" metacharacter ($) matches only at the end of the
+ By default, for the purposes of matching "start of line" and "end of
+ line", PCRE2 treats the subject string as consisting of a single line
+ of characters, even if it actually contains newlines. The "start of
+ line" metacharacter (^) matches only at the start of the string, and
+ the "end of line" metacharacter ($) matches only at the end of the
string, or before a terminating newline (except when PCRE2_DOL-
- LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
+ LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
the "any character" metacharacter (.) does not match at a newline. This
behaviour (for ^, $, and dot) is the same as Perl.
- When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
- constructs match immediately following or immediately before internal
- newlines in the subject string, respectively, as well as at the very
- start and end. This is equivalent to Perl's /m option, and it can be
+ When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. Note that the "start
of line" metacharacter does not match after a newline at the end of the
- subject, for compatibility with Perl. However, you can change this by
- setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
- subject string, or no occurrences of ^ or $ in a pattern, setting
+ subject, for compatibility with Perl. However, you can change this by
+ setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
+ subject string, or no occurrences of ^ or $ in a pattern, setting
PCRE2_MULTILINE has no effect.
PCRE2_NEVER_BACKSLASH_C
- This option locks out the use of \C in the pattern that is being com-
- piled. This escape can cause unpredictable behaviour in UTF-8 or
- UTF-16 modes, because it may leave the current matching point in the
- middle of a multi-code-unit character. This option may be useful in
- applications that process patterns from external sources. Note that
+ This option locks out the use of \C in the pattern that is being com-
+ piled. This escape can cause unpredictable behaviour in UTF-8 or
+ UTF-16 modes, because it may leave the current matching point in the
+ middle of a multi-code-unit character. This option may be useful in
+ applications that process patterns from external sources. Note that
there is also a build-time option that permanently locks out the use of
\C.
PCRE2_NEVER_UCP
- This option locks out the use of Unicode properties for handling \B,
+ This option locks out the use of Unicode properties for handling \B,
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
- described for the PCRE2_UCP option below. In particular, it prevents
- the creator of the pattern from enabling this facility by starting the
- pattern with (*UCP). This option may be useful in applications that
+ described for the PCRE2_UCP option below. In particular, it prevents
+ the creator of the pattern from enabling this facility by starting the
+ pattern with (*UCP). This option may be useful in applications that
process patterns from external sources. The option combination PCRE_UCP
and PCRE_NEVER_UCP causes an error.
PCRE2_NEVER_UTF
- This option locks out interpretation of the pattern as UTF-8, UTF-16,
+ This option locks out interpretation of the pattern as UTF-8, UTF-16,
or UTF-32, depending on which library is in use. In particular, it pre-
- vents the creator of the pattern from switching to UTF interpretation
- by starting the pattern with (*UTF). This option may be useful in
- applications that process patterns from external sources. The combina-
+ vents the creator of the pattern from switching to UTF interpretation
+ by starting the pattern with (*UTF). This option may be useful in
+ applications that process patterns from external sources. The combina-
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
PCRE2_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
be used for capturing (and they acquire numbers in the usual way). This
- is the same as Perl's /n option. Note that, when this option is set,
- references to capturing groups (backreferences or recursion/subroutine
- calls) may only refer to named groups, though the reference can be by
+ is the same as Perl's /n option. Note that, when this option is set,
+ references to capturing groups (backreferences or recursion/subroutine
+ calls) may only refer to named groups, though the reference can be by
name or by number.
PCRE2_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification", which is an
- optimization that, for example, turns a+b into a++b in order to avoid
- backtracks into a+ that can never be successful. However, if callouts
- are in use, auto-possessification means that some callouts are never
+ optimization that, for example, turns a+b into a++b in order to avoid
+ backtracks into a+ that can never be successful. However, if callouts
+ are in use, auto-possessification means that some callouts are never
taken. You can set this option if you want the matching functions to do
- a full unoptimized search and run all the callouts, but it is mainly
+ a full unoptimized search and run all the callouts, but it is mainly
provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when
- .* is the first significant item in a top-level branch of a pattern,
- and all the other branches also start with .* or with \A or \G or ^.
- The optimization is automatically disabled for .* if it is inside an
- atomic group or a capturing group that is the subject of a backrefer-
- ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
- mization is not disabled, such a pattern is automatically anchored if
+ .* is the first significant item in a top-level branch of a pattern,
+ and all the other branches also start with .* or with \A or \G or ^.
+ The optimization is automatically disabled for .* if it is inside an
+ atomic group or a capturing group that is the subject of a backrefer-
+ ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
+ mization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
- for any ^ items. Otherwise, the fact that any match must start either
- at the start of the subject or following a newline is remembered. Like
+ for any ^ items. Otherwise, the fact that any match must start either
+ at the start of the subject or following a newline is remembered. Like
other optimizations, this can cause callouts to be skipped.
PCRE2_NO_START_OPTIMIZE
- This is an option whose main effect is at matching time. It does not
+ This is an option whose main effect is at matching time. It does not
change what pcre2_compile() generates, but it does affect the output of
the JIT compiler.
- There are a number of optimizations that may occur at the start of a
- match, in order to speed up the process. For example, if it is known
- that an unanchored match must start with a specific code unit value,
- the matching code searches the subject for that value, and fails imme-
- diately if it cannot find it, without actually running the main match-
- ing function. This means that a special item such as (*COMMIT) at the
- start of a pattern is not considered until after a suitable starting
- point for the match has been found. Also, when callouts or (*MARK)
- items are in use, these "start-up" optimizations can cause them to be
- skipped if the pattern is never actually used. The start-up optimiza-
- tions are in effect a pre-scan of the subject that takes place before
+ There are a number of optimizations that may occur at the start of a
+ match, in order to speed up the process. For example, if it is known
+ that an unanchored match must start with a specific code unit value,
+ the matching code searches the subject for that value, and fails imme-
+ diately if it cannot find it, without actually running the main match-
+ ing function. This means that a special item such as (*COMMIT) at the
+ start of a pattern is not considered until after a suitable starting
+ point for the match has been found. Also, when callouts or (*MARK)
+ items are in use, these "start-up" optimizations can cause them to be
+ skipped if the pattern is never actually used. The start-up optimiza-
+ tions are in effect a pre-scan of the subject that takes place before
the pattern is run.
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting
position in the subject string.
- Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
+ Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
operation. Consider the pattern
(*COMMIT)ABC
- When this is compiled, PCRE2 records the fact that a match must start
- with the character "A". Suppose the subject string is "DEFABC". The
- start-up optimization scans along the subject, finds "A" and runs the
- first match attempt from there. The (*COMMIT) item means that the pat-
- tern must match the current starting position, which in this case, it
- does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
- set, the initial scan along the subject string does not happen. The
- first match attempt is run starting from "D" and when this fails,
- (*COMMIT) prevents any further matches being tried, so the overall
+ When this is compiled, PCRE2 records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat-
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
+ set, the initial scan along the subject string does not happen. The
+ first match attempt is run starting from "D" and when this fails,
+ (*COMMIT) prevents any further matches being tried, so the overall
result is "no match".
- There are also other start-up optimizations. For example, a minimum
+ There are also other start-up optimizations. For example, a minimum
length for the subject may be recorded. Consider the pattern
(*MARK:A)(X|Y)
- The minimum length for a match is one character. If the subject is
+ The minimum length for a match is one character. If the subject is
"ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
to match an empty string at the end of the subject does not take place,
- because PCRE2 knows that the subject is now too short, and so the
- (*MARK) is never encountered. In this case, the optimization does not
+ because PCRE2 knows that the subject is now too short, and so the
+ (*MARK) is never encountered. In this case, the optimization does not
affect the overall match result, which is still "no match", but it does
affect the auxiliary information that is returned.
PCRE2_NO_UTF_CHECK
- When PCRE2_UTF is set, the validity of the pattern as a UTF string is
- automatically checked. There are discussions about the validity of
- UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
- document. If an invalid UTF sequence is found, pcre2_compile() returns
+ When PCRE2_UTF is set, the validity of the pattern as a UTF string is
+ automatically checked. There are discussions about the validity of
+ UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
+ document. If an invalid UTF sequence is found, pcre2_compile() returns
a negative error code.
- If you know that your pattern is a valid UTF string, and you want to
- skip this check for performance reasons, you can set the
- PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
+ If you know that your pattern is a valid UTF string, and you want to
+ skip this check for performance reasons, you can set the
+ PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
invalid UTF string as a pattern is undefined. It may cause your program
to crash or loop.
Note that this option can also be passed to pcre2_match() and
- pcre_dfa_match(), to suppress UTF validity checking of the subject
+ pcre_dfa_match(), to suppress UTF validity checking of the subject
string.
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
- able the error that is given if an escape sequence for an invalid Uni-
- code code point is encountered in the pattern. In particular, the so-
- called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
- want to allow escape sequences such as \x{d800} you can set the
- PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
- section entitled "Extra compile options" below. However, this is pos-
+ able the error that is given if an escape sequence for an invalid Uni-
+ code code point is encountered in the pattern. In particular, the so-
+ called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
+ want to allow escape sequences such as \x{d800} you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
+ section entitled "Extra compile options" below. However, this is pos-
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
resentable in UTF-16.
PCRE2_UCP
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
- \w, and some of the POSIX character classes. By default, only ASCII
- characters are recognized, but if PCRE2_UCP is set, Unicode properties
- are used instead to classify characters. More details are given in the
+ \w, and some of the POSIX character classes. By default, only ASCII
+ characters are recognized, but if PCRE2_UCP is set, Unicode properties
+ are used instead to classify characters. More details are given in the
section on generic character types in the pcre2pattern page. If you set
- PCRE2_UCP, matching one of the items it affects takes much longer. The
- option is available only if PCRE2 has been compiled with Unicode sup-
+ PCRE2_UCP, matching one of the items it affects takes much longer. The
+ option is available only if PCRE2 has been compiled with Unicode sup-
port (which is the default).
PCRE2_UNGREEDY
- This option inverts the "greediness" of the quantifiers so that they
- are not greedy by default, but become greedy if followed by "?". It is
- not compatible with Perl. It can also be set by a (?U) option setting
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE2_USE_OFFSET_LIMIT
This option must be set for pcre2_compile() if pcre2_set_offset_limit()
- is going to be used to set a non-default offset limit in a match con-
- text for matches that use this pattern. An error is generated if an
- offset limit is set without this option. For more details, see the
- description of pcre2_set_offset_limit() in the section that describes
+ is going to be used to set a non-default offset limit in a match con-
+ text for matches that use this pattern. An error is generated if an
+ offset limit is set without this option. For more details, see the
+ description of pcre2_set_offset_limit() in the section that describes
match contexts. See also the PCRE2_FIRSTLINE option above.
PCRE2_UTF
- This option causes PCRE2 to regard both the pattern and the subject
- strings that are subsequently processed as strings of UTF characters
- instead of single-code-unit strings. It is available when PCRE2 is
- built to include Unicode support (which is the default). If Unicode
- support is not available, the use of this option provokes an error.
- Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
+ This option causes PCRE2 to regard both the pattern and the subject
+ strings that are subsequently processed as strings of UTF characters
+ instead of single-code-unit strings. It is available when PCRE2 is
+ built to include Unicode support (which is the default). If Unicode
+ support is not available, the use of this option provokes an error.
+ Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
the pcre2unicode page.
Extra compile options
- Unlike the main compile-time options, the extra options are not saved
+ Unlike the main compile-time options, the extra options are not saved
with the compiled pattern. The option bits that can be set in a compile
- context by calling the pcre2_set_compile_extra_options() function are
+ context by calling the pcre2_set_compile_extra_options() function are
as follows:
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
- This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
- It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
+ This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
+ It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
"surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
- in UTF-16 to encode code points with values in the range 0x10000 to
- 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
+ in UTF-16 to encode code points with values in the range 0x10000 to
+ 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
They can be represented in UTF-8 and UTF-32, but are defined as invalid
- code points, and cause errors if encountered in a UTF-8 or UTF-32
+ code points, and cause errors if encountered in a UTF-8 or UTF-32
string that is being checked for validity by PCRE2.
- These values also cause errors if encountered in escape sequences such
+ These values also cause errors if encountered in escape sequences such
as \x{d912} within a pattern. However, it seems that some applications,
- when using PCRE2 to check for unwanted characters in UTF-8 strings,
- explicitly test for the surrogates using escape sequences. The
- PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
- because it applies only to the testing of input strings for UTF valid-
+ when using PCRE2 to check for unwanted characters in UTF-8 strings,
+ explicitly test for the surrogates using escape sequences. The
+ PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
+ because it applies only to the testing of input strings for UTF valid-
ity.
- If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
- gate code point values in UTF-8 and UTF-32 patterns no longer provoke
- errors and are incorporated in the compiled pattern. However, they can
- only match subject characters if the matching function is called with
+ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
+ gate code point values in UTF-8 and UTF-32 patterns no longer provoke
+ errors and are incorporated in the compiled pattern. However, they can
+ only match subject characters if the matching function is called with
PCRE2_NO_UTF_CHECK set.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
- This is a dangerous option. Use with care. By default, an unrecognized
- escape such as \j or a malformed one such as \x{2z} causes a compile-
+ This is a dangerous option. Use with care. By default, an unrecognized
+ escape such as \j or a malformed one such as \x{2z} causes a compile-
time error when detected by pcre2_compile(). Perl is somewhat inconsis-
- tent in handling such items: for example, \j is treated as a literal
- "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
- ings are given in both cases if Perl's warning switch is enabled. How-
- ever, a malformed octal number after \o{ always causes an error in
+ tent in handling such items: for example, \j is treated as a literal
+ "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
+ ings are given in both cases if Perl's warning switch is enabled. How-
+ ever, a malformed octal number after \o{ always causes an error in
Perl.
- If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
- pcre2_compile(), all unrecognized or erroneous escape sequences are
- treated as single-character escapes. For example, \j is a literal "j"
- and \x{2z} is treated as the literal string "x{2z}". Setting this
- option means that typos in patterns may go undetected and have unex-
+ If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+ pcre2_compile(), all unrecognized or erroneous escape sequences are
+ treated as single-character escapes. For example, \j is a literal "j"
+ and \x{2z} is treated as the literal string "x{2z}". Setting this
+ option means that typos in patterns may go undetected and have unex-
pected results. This is a dangerous option. Use with care.
PCRE2_EXTRA_MATCH_LINE
- This option is provided for use by the -x option of pcre2grep. It
- causes the pattern only to match complete lines. This is achieved by
- automatically inserting the code for "^(?:" at the start of the com-
- piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
- the matched line may be in the middle of the subject string. This
+ This option is provided for use by the -x option of pcre2grep. It
+ causes the pattern only to match complete lines. This is achieved by
+ automatically inserting the code for "^(?:" at the start of the com-
+ piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
+ the matched line may be in the middle of the subject string. This
option can be used with PCRE2_LITERAL.
PCRE2_EXTRA_MATCH_WORD
- This option is provided for use by the -w option of pcre2grep. It
- causes the pattern only to match strings that have a word boundary at
- the start and the end. This is achieved by automatically inserting the
- code for "\b(?:" at the start of the compiled pattern and ")\b" at the
- end. The option may be used with PCRE2_LITERAL. However, it is ignored
+ This option is provided for use by the -w option of pcre2grep. It
+ causes the pattern only to match strings that have a word boundary at
+ the start and the end. This is achieved by automatically inserting the
+ code for "\b(?:" at the start of the compiled pattern and ")\b" at the
+ end. The option may be used with PCRE2_LITERAL. However, it is ignored
if PCRE2_EXTRA_MATCH_LINE is also set.
@@ -1852,53 +1853,53 @@ JUST-IN-TIME (JIT) COMPILATION
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
- These functions provide support for JIT compilation, which, if the
- just-in-time compiler is available, further processes a compiled pat-
+ These functions provide support for JIT compilation, which, if the
+ just-in-time compiler is available, further processes a compiled pat-
tern into machine code that executes much faster than the pcre2_match()
- interpretive matching function. Full details are given in the pcre2jit
+ interpretive matching function. Full details are given in the pcre2jit
documentation.
- JIT compilation is a heavyweight optimization. It can take some time
- for patterns to be analyzed, and for one-off matches and simple pat-
- terns the benefit of faster execution might be offset by a much slower
- compilation time. Most (but not all) patterns can be optimized by the
+ JIT compilation is a heavyweight optimization. It can take some time
+ for patterns to be analyzed, and for one-off matches and simple pat-
+ terns the benefit of faster execution might be offset by a much slower
+ compilation time. Most (but not all) patterns can be optimized by the
JIT compiler.
LOCALE SUPPORT
- PCRE2 handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character code point. This applies only to characters whose code
- points are less than 256. By default, higher-valued code points never
- match escapes such as \w or \d. However, if PCRE2 is built with Uni-
+ PCRE2 handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character code point. This applies only to characters whose code
+ points are less than 256. By default, higher-valued code points never
+ match escapes such as \w or \d. However, if PCRE2 is built with Uni-
code support, all characters can be tested with \p and \P, or, alterna-
- tively, the PCRE2_UCP option can be set when a pattern is compiled;
- this causes \w and friends to use Unicode property support instead of
+ tively, the PCRE2_UCP option can be set when a pattern is compiled;
+ this causes \w and friends to use Unicode property support instead of
the built-in tables.
- The use of locales with Unicode is discouraged. If you are handling
- characters with code points greater than 128, you should either use
+ The use of locales with Unicode is discouraged. If you are handling
+ characters with code points greater than 127, you should either use
Unicode support, or use locales, but not try to mix the two.
- PCRE2 contains an internal set of character tables that are used by
- default. These are sufficient for many applications. Normally, the
+ PCRE2 contains an internal set of character tables that are used by
+ default. These are sufficient for many applications. Normally, the
internal tables recognize only ASCII characters. However, when PCRE2 is
built, it is possible to cause the internal tables to be rebuilt in the
default "C" locale of the local system, which may cause them to be dif-
ferent.
- The internal tables can be overridden by tables supplied by the appli-
- cation that calls PCRE2. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ The internal tables can be overridden by tables supplied by the appli-
+ cation that calls PCRE2. These may be created in a different locale
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre2_maketables() function,
- in the relevant locale. The result can be passed to pcre2_compile() as
- often as necessary, by creating a compile context and calling
- pcre2_set_character_tables() to set the tables pointer therein. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 128 are
+ External tables are built by calling the pcre2_maketables() function,
+ in the relevant locale. The result can be passed to pcre2_compile() as
+ often as necessary, by creating a compile context and calling
+ pcre2_set_character_tables() to set the tables pointer therein. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 127 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
@@ -1907,15 +1908,15 @@ LOCALE SUPPORT
pcre2_set_character_tables(ccontext, tables);
re = pcre2_compile(..., ccontext);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
- if you are using Windows, the name for the French locale is "french".
- It is the caller's responsibility to ensure that the memory containing
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ if you are using Windows, the name for the French locale is "french".
+ It is the caller's responsibility to ensure that the memory containing
the tables remains available for as long as it is needed.
The pointer that is passed (via the compile context) to pcre2_compile()
- is saved with the compiled pattern, and the same tables are used by
- pcre2_match() and pcre_dfa_match(). Thus, for any single pattern, com-
- pilation and matching both happen in the same locale, but different
+ is saved with the compiled pattern, and the same tables are used by
+ pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
+ pilation and matching both happen in the same locale, but different
patterns can be processed in different locales.
@@ -1923,13 +1924,13 @@ INFORMATION ABOUT A COMPILED PATTERN
int pcre2_pattern_info(const pcre2 *code, uint32_t what, void *where);
- The pcre2_pattern_info() function returns general information about a
+ The pcre2_pattern_info() function returns general information about a
compiled pattern. For information about callouts, see the next section.
- The first argument for pcre2_pattern_info() is a pointer to the com-
+ The first argument for pcre2_pattern_info() is a pointer to the com-
piled pattern. The second argument specifies which piece of information
- is required, and the third argument is a pointer to a variable to
- receive the data. If the third argument is NULL, the first argument is
- ignored, and the function returns the size in bytes of the variable
+ is required, and the third argument is a pointer to a variable to
+ receive the data. If the third argument is NULL, the first argument is
+ ignored, and the function returns the size in bytes of the variable
that is required for the information requested. Otherwise, the yield of
the function is zero for success, or one of the following negative num-
bers:
@@ -1939,9 +1940,9 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_ERROR_BADOPTION the value of what was invalid
PCRE2_ERROR_UNSET the requested field is not set
- The "magic number" is placed at the start of each compiled pattern as
- an simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre2_pattern_info(), to obtain the length of the com-
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre2_pattern_info(), to obtain the length of the com-
piled pattern:
int rc;
@@ -1959,22 +1960,22 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_EXTRAOPTIONS
Return copies of the pattern's options. The third argument should point
- to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
- options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
- TIONS returns the compile options as modified by any top-level (*XXX)
- option settings such as (*UTF) at the start of the pattern itself.
- PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
- compile context by calling the pcre2_set_compile_extra_options() func-
+ to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
+ options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
+ TIONS returns the compile options as modified by any top-level (*XXX)
+ option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
+ compile context by calling the pcre2_set_compile_extra_options() func-
tion.
- For example, if the pattern /(*UTF)abc/ is compiled with the
- PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
- PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
- change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
+ For example, if the pattern /(*UTF)abc/ is compiled with the
+ PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
+ PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
+ change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
TIONS, even if they appear right at the start of the pattern. (This was
different in some earlier releases.)
- A pattern compiled without PCRE2_ANCHORED is automatically anchored by
+ A pattern compiled without PCRE2_ANCHORED is automatically anchored by
PCRE2 if the first significant item in every top-level branch is one of
the following:
@@ -1983,7 +1984,7 @@ INFORMATION ABOUT A COMPILED PATTERN
\G always
.* sometimes - see below
- When .* is the first significant item, anchoring is possible only when
+ When .* is the first significant item, anchoring is possible only when
all the following are true:
.* is not in an atomic group
@@ -1993,94 +1994,94 @@ INFORMATION ABOUT A COMPILED PATTERN
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
- For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
+ For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
the options returned for PCRE2_INFO_ALLOPTIONS.
PCRE2_INFO_BACKREFMAX
- Return the number of the highest backreference in the pattern. The
- third argument should point to an uint32_t variable. Named subpatterns
- acquire numbers as well as names, and these count towards the highest
- backreference. Backreferences such as \4 or \g{12} match the captured
- characters of the given group, but in addition, the check that a cap-
- turing group is set in a conditional subpattern such as (?(3)a|b) is
+ Return the number of the highest backreference in the pattern. The
+ third argument should point to an uint32_t variable. Named subpatterns
+ acquire numbers as well as names, and these count towards the highest
+ backreference. Backreferences such as \4 or \g{12} match the captured
+ characters of the given group, but in addition, the check that a cap-
+ turing group is set in a conditional subpattern such as (?(3)a|b) is
also a backreference. Zero is returned if there are no backreferences.
PCRE2_INFO_BSR
- The output is a uint32_t integer whose value indicates what character
- sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
- means that \R matches any Unicode line ending sequence; a value of
+ The output is a uint32_t integer whose value indicates what character
+ sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
+ means that \R matches any Unicode line ending sequence; a value of
PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
PCRE2_INFO_CAPTURECOUNT
- Return the highest capturing subpattern number in the pattern. In pat-
+ Return the highest capturing subpattern number in the pattern. In pat-
terns where (?| is not used, this is also the total number of capturing
subpatterns. The third argument should point to an uint32_t variable.
PCRE2_INFO_DEPTHLIMIT
- If the pattern set a backtracking depth limit by including an item of
- the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
+ If the pattern set a backtracking depth limit by including an item of
+ the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
third argument should point to a uint32_t integer. If no such value has
- been set, the call to pcre2_pattern_info() returns the error
+ been set, the call to pcre2_pattern_info() returns the error
PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
- ing if it is less than the limit set or defaulted by the caller of the
+ ing if it is less than the limit set or defaulted by the caller of the
match function.
PCRE2_INFO_FIRSTBITMAP
- In the absence of a single first code unit for a non-anchored pattern,
- pcre2_compile() may construct a 256-bit table that defines a fixed set
- of values for the first code unit in any match. For example, a pattern
- that starts with [abc] results in a table with three bits set. When
- code unit values greater than 255 are supported, the flag bit for 255
- means "any code unit of value 255 or above". If such a table was con-
- structed, a pointer to it is returned. Otherwise NULL is returned. The
+ In the absence of a single first code unit for a non-anchored pattern,
+ pcre2_compile() may construct a 256-bit table that defines a fixed set
+ of values for the first code unit in any match. For example, a pattern
+ that starts with [abc] results in a table with three bits set. When
+ code unit values greater than 255 are supported, the flag bit for 255
+ means "any code unit of value 255 or above". If such a table was con-
+ structed, a pointer to it is returned. Otherwise NULL is returned. The
third argument should point to a const uint8_t * variable.
PCRE2_INFO_FIRSTCODETYPE
Return information about the first code unit of any matched string, for
- a non-anchored pattern. The third argument should point to an uint32_t
- variable. If there is a fixed first value, for example, the letter "c"
- from a pattern such as (cat|cow|coyote), 1 is returned, and the value
- can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
- first value, but it is known that a match can occur only at the start
- of the subject or following a newline in the subject, 2 is returned.
+ a non-anchored pattern. The third argument should point to an uint32_t
+ variable. If there is a fixed first value, for example, the letter "c"
+ from a pattern such as (cat|cow|coyote), 1 is returned, and the value
+ can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
+ first value, but it is known that a match can occur only at the start
+ of the subject or following a newline in the subject, 2 is returned.
Otherwise, and for anchored patterns, 0 is returned.
PCRE2_INFO_FIRSTCODEUNIT
- Return the value of the first code unit of any matched string for a
- pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
- The third argument should point to an uint32_t variable. In the 8-bit
- library, the value is always less than 256. In the 16-bit library the
- value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
+ Return the value of the first code unit of any matched string for a
+ pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
+ The third argument should point to an uint32_t variable. In the 8-bit
+ library, the value is always less than 256. In the 16-bit library the
+ value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
mode.
PCRE2_INFO_FRAMESIZE
Return the size (in bytes) of the data frames that are used to remember
- backtracking positions when the pattern is processed by pcre2_match()
- without the use of JIT. The third argument should point to a size_t
+ backtracking positions when the pattern is processed by pcre2_match()
+ without the use of JIT. The third argument should point to a size_t
variable. The frame size depends on the number of capturing parentheses
- in the pattern. Each additional capturing group adds two PCRE2_SIZE
+ in the pattern. Each additional capturing group adds two PCRE2_SIZE
variables.
PCRE2_INFO_HASBACKSLASHC
- Return 1 if the pattern contains any instances of \C, otherwise 0. The
+ Return 1 if the pattern contains any instances of \C, otherwise 0. The
third argument should point to an uint32_t variable.
PCRE2_INFO_HASCRORLF
- Return 1 if the pattern contains any explicit matches for CR or LF
+ Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The third argument should point to an uint32_t
- variable. An explicit match is either a literal CR or LF character, or
- \r or \n or one of the equivalent hexadecimal or octal escape
+ variable. An explicit match is either a literal CR or LF character, or
+ \r or \n or one of the equivalent hexadecimal or octal escape
sequences.
PCRE2_INFO_HEAPLIMIT
@@ -2088,81 +2089,81 @@ INFORMATION ABOUT A COMPILED PATTERN
If the pattern set a heap memory limit by including an item of the form
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
ment should point to a uint32_t integer. If no such value has been set,
- the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
- Note that this limit will only be used during matching if it is less
+ the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
+ Note that this limit will only be used during matching if it is less
than the limit set or defaulted by the caller of the match function.
PCRE2_INFO_JCHANGED
- Return 1 if the (?J) or (?-J) option setting is used in the pattern,
- otherwise 0. The third argument should point to an uint32_t variable.
- (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
+ Return 1 if the (?J) or (?-J) option setting is used in the pattern,
+ otherwise 0. The third argument should point to an uint32_t variable.
+ (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
tively.
PCRE2_INFO_JITSIZE
- If the compiled pattern was successfully processed by pcre2_jit_com-
- pile(), return the size of the JIT compiled code, otherwise return
+ If the compiled pattern was successfully processed by pcre2_jit_com-
+ pile(), return the size of the JIT compiled code, otherwise return
zero. The third argument should point to a size_t variable.
PCRE2_INFO_LASTCODETYPE
- Returns 1 if there is a rightmost literal code unit that must exist in
- any matched string, other than at its start. The third argument should
- point to an uint32_t variable. If there is no such value, 0 is
- returned. When 1 is returned, the code unit value itself can be
- retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
- literal value is recorded only if it follows something of variable
- length. For example, for the pattern /^a\d+z\d+/ the returned value is
- 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
+ Returns 1 if there is a rightmost literal code unit that must exist in
+ any matched string, other than at its start. The third argument should
+ point to an uint32_t variable. If there is no such value, 0 is
+ returned. When 1 is returned, the code unit value itself can be
+ retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
+ literal value is recorded only if it follows something of variable
+ length. For example, for the pattern /^a\d+z\d+/ the returned value is
+ 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
the returned value is 0.
PCRE2_INFO_LASTCODEUNIT
- Return the value of the rightmost literal code unit that must exist in
- any matched string, other than at its start, for a pattern where
+ Return the value of the rightmost literal code unit that must exist in
+ any matched string, other than at its start, for a pattern where
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
ment should point to an uint32_t variable.
PCRE2_INFO_MATCHEMPTY
- Return 1 if the pattern might match an empty string, otherwise 0. The
- third argument should point to an uint32_t variable. When a pattern
+ Return 1 if the pattern might match an empty string, otherwise 0. The
+ third argument should point to an uint32_t variable. When a pattern
contains recursive subroutine calls it is not always possible to deter-
- mine whether or not it can match an empty string. PCRE2 takes a cau-
+ mine whether or not it can match an empty string. PCRE2 takes a cau-
tious approach and returns 1 in such cases.
PCRE2_INFO_MATCHLIMIT
- If the pattern set a match limit by including an item of the form
- (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
- argument should point to a uint32_t integer. If no such value has been
- set, the call to pcre2_pattern_info() returns the error
+ If the pattern set a match limit by including an item of the form
+ (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
+ argument should point to a uint32_t integer. If no such value has been
+ set, the call to pcre2_pattern_info() returns the error
PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
- ing if it is less than the limit set or defaulted by the caller of the
+ ing if it is less than the limit set or defaulted by the caller of the
match function.
PCRE2_INFO_MAXLOOKBEHIND
Return the number of characters (not code units) in the longest lookbe-
- hind assertion in the pattern. The third argument should point to a
- uint32_t integer. This information is useful when doing multi-segment
- matching using the partial matching facilities. Note that the simple
+ hind assertion in the pattern. The third argument should point to a
+ uint32_t integer. This information is useful when doing multi-segment
+ matching using the partial matching facilities. Note that the simple
assertions \b and \B require a one-character lookbehind. \A also regis-
- ters a one-character lookbehind, though it does not actually inspect
- the previous character. This is to ensure that at least one character
- from the old segment is retained when a new segment is processed. Oth-
- erwise, if there are no lookbehinds in the pattern, \A might match
+ ters a one-character lookbehind, though it does not actually inspect
+ the previous character. This is to ensure that at least one character
+ from the old segment is retained when a new segment is processed. Oth-
+ erwise, if there are no lookbehinds in the pattern, \A might match
incorrectly at the start of a second or subsequent segment.
PCRE2_INFO_MINLENGTH
- If a minimum length for matching subject strings was computed, its
- value is returned. Otherwise the returned value is 0. The value is a
- number of characters, which in UTF mode may be different from the num-
- ber of code units. The third argument should point to an uint32_t
- variable. The value is a lower bound to the length of any matching
- string. There may not be any strings of that length that do actually
+ If a minimum length for matching subject strings was computed, its
+ value is returned. Otherwise the returned value is 0. The value is a
+ number of characters, which in UTF mode may be different from the num-
+ ber of code units. The third argument should point to an uint32_t
+ variable. The value is a lower bound to the length of any matching
+ string. There may not be any strings of that length that do actually
match, but every string that does match is at least that long.
PCRE2_INFO_NAMECOUNT
@@ -2170,50 +2171,50 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_NAMETABLE
PCRE2 supports the use of named as well as numbered capturing parenthe-
- ses. The names are just an additional way of identifying the parenthe-
+ ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
- pcre2_substring_get_byname() are provided for extracting captured sub-
- strings by name. It is also possible to extract the data directly, by
- first converting the name to a number in order to access the correct
- pointers in the output vector (described with pcre2_match() below). To
- do the conversion, you need to use the name-to-number map, which is
+ pcre2_substring_get_byname() are provided for extracting captured sub-
+ strings by name. It is also possible to extract the data directly, by
+ first converting the name to a number in order to access the correct
+ pointers in the output vector (described with pcre2_match() below). To
+ do the conversion, you need to use the name-to-number map, which is
described by these three values.
- The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
- COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
- the size of each entry in code units; both of these return a uint32_t
+ The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
+ COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
+ the size of each entry in code units; both of these return a uint32_t
value. The entry size depends on the length of the longest name.
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
- This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
- library, the first two bytes of each entry are the number of the cap-
+ This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
+ library, the first two bytes of each entry are the number of the cap-
turing parenthesis, most significant byte first. In the 16-bit library,
- the pointer points to 16-bit code units, the first of which contains
- the parenthesis number. In the 32-bit library, the pointer points to
- 32-bit code units, the first of which contains the parenthesis number.
+ the pointer points to 16-bit code units, the first of which contains
+ the parenthesis number. In the 32-bit library, the pointer points to
+ 32-bit code units, the first of which contains the parenthesis number.
The rest of the entry is the corresponding name, zero terminated.
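The 8-bit entry layout described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code from the PCRE2 distribution: lookup_group_number() and sample_table are hypothetical names, and plain uint8_t stands in for the PCRE2_SPTR type so that the sketch compiles without the library. In real use the table pointer, entry count, and entry size would be obtained by calling pcre2_pattern_info() with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: search an 8-bit name table for a group name.
   Each entry is entry_size bytes long. The first two bytes hold the
   group number, most significant byte first; the remaining bytes hold
   the zero-terminated name. Returns the group number of the first
   matching entry, or -1 if the name is not in the table. */
static int lookup_group_number(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built sample table (an assumption for illustration only).
   It holds two names, "alpha" for group 2 and "beta" for group 1.
   The longest name has 5 characters, so the entry size is
   2 + 5 + 1 = 8 bytes; the shorter name is padded with zeros. */
static const uint8_t sample_table[] = {
    0x00, 0x02, 'a', 'l', 'p', 'h', 'a', 0x00,
    0x00, 0x01, 'b', 'e', 't', 'a', 0x00, 0x00,
};
```

With this sample table, lookup_group_number(sample_table, 2, 8, "beta") yields 1. A linear scan is used for simplicity. The sketch returns only the first matching entry, so it does not handle duplicate names, which can occur when PCRE2_DUPNAMES is set.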
- The names are in alphabetical order. If (?| is used to create multiple
- groups with the same number, as described in the section on duplicate
- subpattern numbers in the pcre2pattern page, the groups may be given
- the same name, but there is only one entry in the table. Different
+ The names are in alphabetical order. If (?| is used to create multiple
+ groups with the same number, as described in the section on duplicate
+ subpattern numbers in the pcre2pattern page, the groups may be given
+ the same name, but there is only one entry in the table. Different
names for groups of the same number are not permitted.
- Duplicate names for subpatterns with different numbers are permitted,
- but only if PCRE2_DUPNAMES is set. They appear in the table in the
- order in which they were found in the pattern. In the absence of (?|
- this is the order of increasing number; when (?| is used this is not
+ Duplicate names for subpatterns with different numbers are permitted,
+ but only if PCRE2_DUPNAMES is set. They appear in the table in the
+ order in which they were found in the pattern. In the absence of (?|
+ this is the order of increasing number; when (?| is used this is not
necessarily the case because later subpatterns may have lower numbers.
- As a simple example of the name/number table, consider the following
- pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
+ As a simple example of the name/number table, consider the following
+ pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
is set, so white space - including newlines - is ignored):
(?