Documentation update re PCRE2_JIT_INVALID_UTF
parent
7375089fa5
commit
590f65f061
|
@ -40,6 +40,7 @@ bits:
|
||||||
PCRE2_JIT_COMPLETE compile code for full matching
|
PCRE2_JIT_COMPLETE compile code for full matching
|
||||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||||
|
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||||
</pre>
|
</pre>
|
||||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
||||||
|
|
|
@ -312,7 +312,7 @@ document for an overview of all the PCRE2 documentation.
|
||||||
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, </b>
|
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
|
||||||
<b> void *<i>where</i>);</b>
|
<b> void *<i>where</i>);</b>
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -16,16 +16,17 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
|
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
|
||||||
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
|
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
|
||||||
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
|
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
|
||||||
<li><a name="TOC4" href="#SEC4">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
|
<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
|
||||||
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT MATCHING</a>
|
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
|
||||||
<li><a name="TOC6" href="#SEC6">CONTROLLING THE JIT STACK</a>
|
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
|
||||||
<li><a name="TOC7" href="#SEC7">JIT STACK FAQ</a>
|
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
|
||||||
<li><a name="TOC8" href="#SEC8">FREEING JIT SPECULATIVE MEMORY</a>
|
<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
|
||||||
<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
|
<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
|
||||||
<li><a name="TOC10" href="#SEC10">JIT FAST PATH API</a>
|
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
|
||||||
<li><a name="TOC11" href="#SEC11">SEE ALSO</a>
|
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
|
||||||
<li><a name="TOC12" href="#SEC12">AUTHOR</a>
|
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
|
||||||
<li><a name="TOC13" href="#SEC13">REVISION</a>
|
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
|
||||||
|
<li><a name="TOC14" href="#SEC14">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -144,7 +145,29 @@ support is not available, or the pattern was not processed by
|
||||||
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
|
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
|
||||||
pattern.
|
pattern.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
|
||||||
|
<P>
|
||||||
|
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||||
|
function expects its subject string to be a valid sequence of UTF code units.
|
||||||
|
If it is not, the result is undefined. This is also true by default of matching
|
||||||
|
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||||
|
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
|
||||||
|
UTF is compiled.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||||
|
does not match dot, it does not match \p{Any}, it does not even match negative
|
||||||
|
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||||
|
sequence while moving the current point backwards. In other words, an invalid
|
||||||
|
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||||
|
invalid sequence causes an immediate backtrack.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Using this option, an application can run matches in arbitrary data, knowing
|
||||||
|
that any matched strings that are returned will be valid UTF. This can be
|
||||||
|
useful when searching for text in executable or other binary files.
|
||||||
|
</P>
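Reviewer note: the "barrier" wording above can be pictured as splitting the subject into maximal runs of valid UTF, with any non-empty match confined to a single run. The sketch below is a standalone model of that idea for UTF-8 only. It does not call PCRE2 and is not how the JIT code works internally; the names utf8_valid_len and next_valid_run are hypothetical, and the exact extent PCRE2 assigns to an invalid sequence may differ from the byte-at-a-time skipping assumed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 sequence at s (n bytes available),
   or 0 if no valid sequence starts there. Hypothetical helper. */
static size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    size_t len;
    uint32_t cp, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;                               /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;              /* stray continuation byte or invalid lead byte */

    if (n < len) return 0;                                   /* truncated */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;                 /* not a continuation */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                                  /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;              /* surrogate */
    if (cp > 0x10FFFF) return 0;                             /* out of range */
    return len;
}

/* Finds the next maximal run of valid UTF-8 at or after *pos.
   Stores the run's start offset in *start, advances *pos to the end of
   the run, and returns the run length in bytes (0 when nothing is left). */
size_t next_valid_run(const unsigned char *s, size_t len,
                      size_t *pos, size_t *start)
{
    size_t i = *pos;
    size_t k;

    /* Skip invalid bytes one at a time (the "barrier"). */
    while (i < len && utf8_valid_len(s + i, len - i) == 0) i++;
    *start = i;

    /* Extend the run over consecutive valid sequences. */
    while (i < len && (k = utf8_valid_len(s + i, len - i)) != 0) i += k;
    *pos = i;
    return i - *start;
}
```

For the nine bytes a b FF c d C3 A9 80 z, successive calls report runs at offset 0 (length 2), offset 3 (length 4) and offset 8 (length 1). Under this model, a match found in invalid-UTF mode would lie inside one of those runs.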
|
||||||
|
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
||||||
<P>
|
<P>
|
||||||
The <b>pcre2_match()</b> options that are supported for JIT matching are
|
The <b>pcre2_match()</b> options that are supported for JIT matching are
|
||||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
||||||
|
@ -161,7 +184,7 @@ The only unsupported pattern items are \C (match a single data unit) when
|
||||||
running in a UTF mode, and a callout immediately before an assertion condition
|
running in a UTF mode, and a callout immediately before an assertion condition
|
||||||
in a conditional group.
|
in a conditional group.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
|
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
|
||||||
<P>
|
<P>
|
||||||
When a pattern is matched using JIT matching, the return values are the same
|
When a pattern is matched using JIT matching, the return values are the same
|
||||||
as those given by the interpretive <b>pcre2_match()</b> code, with the addition
|
as those given by the interpretive <b>pcre2_match()</b> code, with the addition
|
||||||
|
@ -177,7 +200,7 @@ circumstance when JIT is not used, but the details of exactly what is counted
|
||||||
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
||||||
when JIT matching is used.
|
when JIT matching is used.
|
||||||
<a name="stackcontrol"></a></P>
|
<a name="stackcontrol"></a></P>
|
||||||
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||||
<P>
|
<P>
|
||||||
When the compiled JIT code runs, it needs a block of memory to use as a stack.
|
When the compiled JIT code runs, it needs a block of memory to use as a stack.
|
||||||
By default, it uses 32KiB on the machine stack. However, some large or
|
By default, it uses 32KiB on the machine stack. However, some large or
|
||||||
|
@ -270,7 +293,7 @@ non-default JIT stacks might operate:
|
||||||
</pre>
|
</pre>
|
||||||
All the functions described in this section do nothing if JIT is not available.
|
All the functions described in this section do nothing if JIT is not available.
|
||||||
<a name="stackfaq"></a></P>
|
<a name="stackfaq"></a></P>
|
||||||
<br><a name="SEC7" href="#TOC1">JIT STACK FAQ</a><br>
|
<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
|
||||||
<P>
|
<P>
|
||||||
(1) Why do we need JIT stacks?
|
(1) Why do we need JIT stacks?
|
||||||
<br>
|
<br>
|
||||||
|
@ -349,7 +372,7 @@ stack handling?
|
||||||
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
|
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
|
||||||
out this complicated API.
|
out this complicated API.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
|
<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||||
</P>
|
</P>
|
||||||
|
@ -361,7 +384,7 @@ all possible memory. You can cause this to happen by calling
|
||||||
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
|
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
|
||||||
memory management, or NULL for standard memory management.
|
memory management, or NULL for standard memory management.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
|
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
|
||||||
<P>
|
<P>
|
||||||
This is a single-threaded example that specifies a JIT stack without using a
|
This is a single-threaded example that specifies a JIT stack without using a
|
||||||
callback. A real program should include error checking after all the function
|
callback. A real program should include error checking after all the function
|
||||||
|
@ -390,7 +413,7 @@ calls.
|
||||||
|
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC10" href="#TOC1">JIT FAST PATH API</a><br>
|
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
|
||||||
<P>
|
<P>
|
||||||
Because the API described above falls back to interpreted matching when JIT is
|
Because the API described above falls back to interpreted matching when JIT is
|
||||||
not available, it is convenient for programs that are written for general use
|
not available, it is convenient for programs that are written for general use
|
||||||
|
@ -423,11 +446,11 @@ invalid data is passed, the result is undefined.
|
||||||
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
|
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
|
||||||
speedups of more than 10%.
|
speedups of more than 10%.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC11" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2api</b>(3)
|
<b>pcre2api</b>(3)
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC12" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel (FAQ by Zoltan Herczeg)
|
Philip Hazel (FAQ by Zoltan Herczeg)
|
||||||
<br>
|
<br>
|
||||||
|
@ -436,11 +459,11 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 16 October 2018
|
Last updated: 06 March 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -247,11 +247,34 @@ VALIDITY OF UTF STRINGS
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||||
are (by default) checked for validity on entry to the relevant functions.
|
are (by default) checked for validity on entry to the relevant functions. If an
|
||||||
If an invalid UTF string is passed, a negative error code is returned. The
|
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||||
code unit offset to the offending character can be extracted from the match
|
offset to the offending character can be extracted from the match data block by
|
||||||
data block by calling <b>pcre2_get_startchar()</b>, which is used for this
|
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
|
||||||
purpose after a UTF error.
|
error.
|
||||||
|
</P>
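Reviewer note: to make "the code unit offset to the offending character" concrete, here is a minimal standalone sketch of a UTF-8 validity scan that reports the offset of the first invalid sequence. It is an illustration only, not PCRE2's own checker, and it does not call the library; the names utf8_seq_len and utf8_first_invalid are hypothetical. The rules applied (no overlong forms, no surrogates, nothing above U+10FFFF) follow the UTF-8 definition rather than anything stated in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 sequence at s (n bytes available),
   or 0 if no valid sequence starts there. Hypothetical helper. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len;
    uint32_t cp, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;                               /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;              /* stray continuation byte or invalid lead byte */

    if (n < len) return 0;                                   /* truncated */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;                 /* not a continuation */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                                  /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;              /* surrogate */
    if (cp > 0x10FFFF) return 0;                             /* out of range */
    return len;
}

/* Returns the byte offset of the first invalid sequence in s[0..len),
   or len if the whole buffer is valid UTF-8. */
size_t utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);   /* precondition: a buffer must be given */
    while (i < len) {
        size_t k = utf8_seq_len(s + i, len - i);
        if (k == 0) return i;        /* offending sequence starts here */
        i += k;
    }
    return len;
}
```

With this sketch, the five bytes a b FF c d yield offset 2, and a fully valid buffer yields its own length. An application that has run such a check itself is in the position the documentation describes for setting PCRE2_NO_UTF_CHECK.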
|
||||||
|
<P>
|
||||||
|
In some situations, you may already know that your strings are valid, and
|
||||||
|
therefore want to skip these checks in order to improve performance, for
|
||||||
|
example in the case of a long subject string that is being scanned repeatedly.
|
||||||
|
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||||
|
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||||
|
only valid UTF code unit sequences.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||||
|
is usually undefined and your program may crash or loop indefinitely. There is,
|
||||||
|
however, one mode of matching that can handle invalid UTF subject strings. This
|
||||||
|
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
|
||||||
|
when calling <b>pcre2_jit_compile()</b>. For details, see the
|
||||||
|
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||||
|
the pattern; it does not also apply to subject strings. If you want to disable
|
||||||
|
the check for a subject string you must pass this same option to
|
||||||
|
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
||||||
|
@ -259,13 +282,14 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||||
strings to be in host byte order.
|
strings to be in host byte order.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A UTF string is checked before any other processing takes place. In the case of
|
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
|
||||||
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
|
processing takes place. In the case of <b>pcre2_match()</b> and
|
||||||
offset, the check is applied only to that part of the subject that could be
|
<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
|
||||||
inspected during matching, and there is a check that the starting offset points
|
applied only to that part of the subject that could be inspected during
|
||||||
to the first code unit of a character or to the end of the subject. If there
|
matching, and there is a check that the starting offset points to the first
|
||||||
are no lookbehind assertions in the pattern, the check starts at the starting
|
code unit of a character or to the end of the subject. If there are no
|
||||||
offset. Otherwise, it starts at the length of the longest lookbehind before the
|
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||||
|
Otherwise, it starts at the length of the longest lookbehind before the
|
||||||
starting offset, or at the start of the subject if there are not that many
|
starting offset, or at the start of the subject if there are not that many
|
||||||
characters before the starting offset. Note that the sequences \b and \B are
|
characters before the starting offset. Note that the sequences \b and \B are
|
||||||
one-character lookbehinds.
|
one-character lookbehinds.
|
||||||
|
@ -285,31 +309,12 @@ surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
|
||||||
UTF-32.)
|
UTF-32.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
In some situations, you may already know that your strings are valid, and
|
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
|
||||||
therefore want to skip these checks in order to improve performance, for
|
given if an escape sequence for an invalid Unicode code point is encountered in
|
||||||
example in the case of a long subject string that is being scanned repeatedly.
|
the pattern. If you want to allow escape sequences such as \x{d800} (a
|
||||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
|
||||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
|
||||||
only valid UTF code unit sequences.
|
values are not representable in UTF-16.
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
|
||||||
the pattern; it does not also apply to subject strings. If you want to disable
|
|
||||||
the check for a subject string you must pass this option to <b>pcre2_match()</b>
|
|
||||||
or <b>pcre2_dfa_match()</b>.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
|
||||||
is undefined and your program may crash or loop indefinitely.
|
|
||||||
</P>
|
|
||||||
<P>
|
|
||||||
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
|
||||||
that is given if an escape sequence for an invalid Unicode code point is
|
|
||||||
encountered in the pattern. If you want to allow escape sequences such as
|
|
||||||
\x{d800} (a surrogate code point) you can set the
|
|
||||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
|
|
||||||
only in UTF-8 and UTF-32 modes, because these values are not representable in
|
|
||||||
UTF-16.
|
|
||||||
<a name="utf8strings"></a></P>
|
<a name="utf8strings"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Errors in UTF-8 strings
|
Errors in UTF-8 strings
|
||||||
|
@ -417,7 +422,7 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 03 February 2019
|
Last updated: 06 March 2019
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2019 University of Cambridge.
|
Copyright © 1997-2019 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
179
doc/pcre2.txt
|
@ -180,8 +180,8 @@ REVISION
|
||||||
Last updated: 17 September 2018
|
Last updated: 17 September 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -3681,8 +3681,8 @@ REVISION
|
||||||
Last updated: 14 February 2019
|
Last updated: 14 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4254,8 +4254,8 @@ REVISION
|
||||||
Last updated: 03 March 2019
|
Last updated: 03 March 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4685,8 +4685,8 @@ REVISION
|
||||||
Last updated: 03 February 2019
|
Last updated: 03 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -4890,8 +4890,8 @@ REVISION
|
||||||
Last updated: 12 February 2019
|
Last updated: 12 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -5010,6 +5010,29 @@ SIMPLE USE OF JIT
|
||||||
to handle the pattern.
|
to handle the pattern.
|
||||||
|
|
||||||
|
|
||||||
|
MATCHING SUBJECTS CONTAINING INVALID UTF
|
||||||
|
|
||||||
|
When a pattern is compiled with the PCRE2_UTF option, the interpretive
|
||||||
|
matching function expects its subject string to be a valid sequence of
|
||||||
|
UTF code units. If it is not, the result is undefined. This is also
|
||||||
|
true by default of matching via JIT. However, if the option
|
||||||
|
PCRE2_JIT_INVALID_UTF is passed to pcre2_jit_compile(), code that can
|
||||||
|
process a subject containing invalid UTF is compiled.
|
||||||
|
|
||||||
|
In this mode, an invalid code unit sequence never matches any pattern
|
||||||
|
item. It does not match dot, it does not match \p{Any}, it does not
|
||||||
|
even match negative items such as [^X]. A lookbehind assertion fails if
|
||||||
|
it encounters an invalid sequence while moving the current point back-
|
||||||
|
wards. In other words, an invalid UTF code unit sequence acts as a bar-
|
||||||
|
rier which no match can cross. Reaching an invalid sequence causes an
|
||||||
|
immediate backtrack.
|
||||||
|
|
||||||
|
Using this option, an application can run matches in arbitrary data,
|
||||||
|
knowing that any matched strings that are returned will be valid UTF.
|
||||||
|
This can be useful when searching for text in executable or other
|
||||||
|
binary files.
|
||||||
|
|
||||||
|
|
||||||
UNSUPPORTED OPTIONS AND PATTERN ITEMS
|
UNSUPPORTED OPTIONS AND PATTERN ITEMS
|
||||||
|
|
||||||
The pcre2_match() options that are supported for JIT matching are
|
The pcre2_match() options that are supported for JIT matching are
|
||||||
|
@ -5287,11 +5310,11 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 16 October 2018
|
Last updated: 06 March 2019
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -5360,8 +5383,8 @@ REVISION
|
||||||
Last updated: 02 February 2019
|
Last updated: 02 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -5581,8 +5604,8 @@ REVISION
|
||||||
Last updated: 10 October 2018
|
Last updated: 10 October 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -6021,8 +6044,8 @@ REVISION
|
||||||
Last updated: 22 December 2014
|
Last updated: 22 December 2014
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2014 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9365,8 +9388,8 @@ REVISION
|
||||||
Last updated: 12 February 2019
|
Last updated: 12 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9600,8 +9623,8 @@ REVISION
|
||||||
Last updated: 03 February 2019
|
Last updated: 03 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -9930,8 +9953,8 @@ REVISION
|
||||||
Last updated: 30 January 2019
|
Last updated: 30 January 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -10209,8 +10232,8 @@ REVISION
|
||||||
Last updated: 27 June 2018
|
Last updated: 27 June 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -10710,8 +10733,8 @@ REVISION
|
||||||
Last updated: 11 February 2019
|
Last updated: 11 February 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
||||||
|
|
||||||
|
|
||||||
|
@ -10928,59 +10951,63 @@ VALIDITY OF UTF STRINGS
|
||||||
|
|
||||||
When the PCRE2_UTF option is set, the strings passed as patterns and
|
When the PCRE2_UTF option is set, the strings passed as patterns and
|
||||||
subjects are (by default) checked for validity on entry to the relevant
|
subjects are (by default) checked for validity on entry to the relevant
|
||||||
functions. If an invalid UTF string is passed, a negative error code
|
functions. If an invalid UTF string is passed, a negative error code
|
||||||
is returned. The code unit offset to the offending character can be
|
is returned. The code unit offset to the offending character can be
|
||||||
extracted from the match data block by calling pcre2_get_startchar(),
|
extracted from the match data block by calling pcre2_get_startchar(),
|
||||||
which is used for this purpose after a UTF error.
|
which is used for this purpose after a UTF error.
|
||||||
|
|
||||||
UTF-16 and UTF-32 strings can indicate their endianness by special code
|
In some situations, you may already know that your strings are valid,
|
||||||
known as a byte-order mark (BOM). The PCRE2 functions do not handle
|
and therefore want to skip these checks in order to improve perfor-
|
||||||
this, expecting strings to be in host byte order.
|
mance, for example in the case of a long subject string that is being
|
||||||
|
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
||||||
A UTF string is checked before any other processing takes place. In the
|
pile time or at match time, PCRE2 assumes that the pattern or subject
|
||||||
case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
|
|
||||||
starting offset, the check is applied only to that part of the subject
|
|
||||||
that could be inspected during matching, and there is a check that the
|
|
||||||
starting offset points to the first code unit of a character or to the
|
|
||||||
end of the subject. If there are no lookbehind assertions in the pat-
|
|
||||||
tern, the check starts at the starting offset. Otherwise, it starts at
|
|
||||||
the length of the longest lookbehind before the starting offset, or at
|
|
||||||
the start of the subject if there are not that many characters before
|
|
||||||
the starting offset. Note that the sequences \b and \B are one-charac-
|
|
||||||
ter lookbehinds.
|
|
||||||
|
|
||||||
In addition to checking the format of the string, there is a check to
|
|
||||||
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
|
|
||||||
the surrogate area. The so-called "non-character" code points are not
|
|
||||||
excluded because Unicode corrigendum #9 makes it clear that they should
|
|
||||||
not be.
|
|
||||||
|
|
||||||
Characters in the "Surrogate Area" of Unicode are reserved for use by
|
|
||||||
UTF-16, where they are used in pairs to encode code points with values
|
|
||||||
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
|
|
||||||
are available independently in the UTF-8 and UTF-32 encodings. (In
|
|
||||||
other words, the whole surrogate thing is a fudge for UTF-16 which
|
|
||||||
unfortunately messes up UTF-8 and UTF-32.)
|
|
||||||
|
|
||||||
In some situations, you may already know that your strings are valid,
|
|
||||||
and therefore want to skip these checks in order to improve perfor-
|
|
||||||
mance, for example in the case of a long subject string that is being
|
|
||||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
|
||||||
pile time or at match time, PCRE2 assumes that the pattern or subject
|
|
||||||
it is given (respectively) contains only valid UTF code unit sequences.
|
it is given (respectively) contains only valid UTF code unit sequences.
|
||||||
|
|
||||||
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
||||||
|
result is usually undefined and your program may crash or loop indefi-
|
||||||
|
nitely. There is, however, one mode of matching that can handle invalid
|
||||||
|
UTF subject strings. This is matching via the JIT optimization using
|
||||||
|
the PCRE2_JIT_INVALID_UTF option when calling pcre2_jit_compile(). For
|
||||||
|
details, see the pcre2jit documentation.
|
||||||
|
|
||||||
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
||||||
for the pattern; it does not also apply to subject strings. If you want
|
for the pattern; it does not also apply to subject strings. If you want
|
||||||
to disable the check for a subject string you must pass this option to
|
to disable the check for a subject string you must pass this same
|
||||||
pcre2_match() or pcre2_dfa_match().
|
option to pcre2_match() or pcre2_dfa_match().
|
||||||
|
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
UTF-16 and UTF-32 strings can indicate their endianness by special code
|
||||||
result is undefined and your program may crash or loop indefinitely.
|
known as a byte-order mark (BOM). The PCRE2 functions do not handle
|
||||||
|
this, expecting strings to be in host byte order.
|
||||||
|
|
||||||
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
|
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
|
||||||
the error that is given if an escape sequence for an invalid Unicode
|
other processing takes place. In the case of pcre2_match() and
|
||||||
code point is encountered in the pattern. If you want to allow escape
|
pcre2_dfa_match() calls with a non-zero starting offset, the check is
|
||||||
sequences such as \x{d800} (a surrogate code point) you can set the
|
applied only to that part of the subject that could be inspected during
|
||||||
|
matching, and there is a check that the starting offset points to the
|
||||||
|
first code unit of a character or to the end of the subject. If there
|
||||||
|
are no lookbehind assertions in the pattern, the check starts at the
|
||||||
|
starting offset. Otherwise, it starts at the length of the longest
|
||||||
|
lookbehind before the starting offset, or at the start of the subject
|
||||||
|
if there are not that many characters before the starting offset. Note
|
||||||
|
that the sequences \b and \B are one-character lookbehinds.
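
The starting-offset rule above can be sketched in C. This is an illustration only, not PCRE2's internal code, and the function names utf8_offset_ok and utf16_offset_ok are hypothetical. The sketch assumes that "first code unit of a character" means, in UTF-8, any byte that is not a continuation byte (10xxxxxx) and, in UTF-16, any code unit that is not a low surrogate (0xDC00 to 0xDFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): returns 1 if 'offset' is an
   acceptable starting offset for a UTF-8 subject of 'length' bytes, that is,
   it is the end of the subject or does not point at a continuation byte. */
int utf8_offset_ok(const unsigned char *subject, size_t length, size_t offset)
{
    if (offset > length) return 0;           /* beyond the subject */
    if (offset == length) return 1;          /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80; /* reject 10xxxxxx bytes */
}

/* The same idea for UTF-16: a starting offset must not point at the second
   (low) half of a surrogate pair. Offsets are counted in code units. */
int utf16_offset_ok(const uint16_t *subject, size_t length, size_t offset)
{
    if (offset > length) return 0;
    if (offset == length) return 1;
    return (subject[offset] & 0xFC00) != 0xDC00;
}
```

For the three-byte subject "a" followed by the two-byte encoding of U+00E9, offsets 0, 1 and 3 are accepted by this sketch and offset 2 is rejected, because it points into the middle of a character.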
|
||||||
|
|
||||||
|
In addition to checking the format of the string, there is a check to
|
||||||
|
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
|
||||||
|
the surrogate area. The so-called "non-character" code points are not
|
||||||
|
excluded because Unicode corrigendum #9 makes it clear that they should
|
||||||
|
not be.
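
As an illustration of the two checks just described (well-formed encoding, and code points in the range U+0 to U+10FFFF outside the surrogate area), here is a minimal sketch of a UTF-8 validator in C. It is not the checker that PCRE2 uses internally; the function name utf8_is_valid is hypothetical. Note that non-character code points such as U+FFFF are accepted, in line with the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PCRE2 API): returns 1 if the buffer
   is well-formed UTF-8 whose code points all lie in U+0000..U+10FFFF and
   outside the surrogate area U+D800..U+DFFF, and 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        uint32_t cp, min;
        size_t extra;   /* number of continuation bytes expected */

        if (c < 0x80) { i++; continue; }   /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return 0;          /* truncated character */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;       /* not a continuation */
            cp = (cp << 6) | (uint32_t)(t & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate area */
        i += extra + 1;
    }
    return 1;
}
```

With this sketch, the bytes ED A0 80 (an encoding of the surrogate U+D800) and F4 90 80 80 (U+110000) are rejected, whereas EF BF BF (the non-character U+FFFF) is accepted.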
|
||||||
|
|
||||||
|
Characters in the "Surrogate Area" of Unicode are reserved for use by
|
||||||
|
UTF-16, where they are used in pairs to encode code points with values
|
||||||
|
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
|
||||||
|
are available independently in the UTF-8 and UTF-32 encodings. (In
|
||||||
|
other words, the whole surrogate thing is a fudge for UTF-16 which
|
||||||
|
unfortunately messes up UTF-8 and UTF-32.)
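
The arithmetic behind surrogate pairs can be sketched as follows. This is standard UTF-16 encoding rather than anything specific to PCRE2, and the function names are hypothetical. The encoding helpers assume their argument is in the range 0x10000 to 0x10FFFF.

```c
#include <stdint.h>

/* Hypothetical helpers (not PCRE2 functions). A code point above 0xFFFF is
   reduced by 0x10000, leaving a 20-bit value; the top 10 bits go into the
   high surrogate and the bottom 10 bits into the low surrogate. */
uint16_t utf16_high_surrogate(uint32_t cp)
{
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

uint16_t utf16_low_surrogate(uint32_t cp)
{
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

/* Recombine a high/low surrogate pair into the code point it encodes. */
uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high - 0xD800)) << 10)
                   + (uint32_t)(low - 0xDC00);
}
```

For example, U+1F600 is encoded in UTF-16 as the pair 0xD83D 0xDE00. Because the values 0xD800 to 0xDFFF are taken up in this way, they cannot stand for characters in their own right, which is why the validity check excludes them in UTF-8 and UTF-32 as well.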
|
||||||
|
|
||||||
|
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
||||||
|
that is given if an escape sequence for an invalid Unicode code point
|
||||||
|
is encountered in the pattern. If you want to allow escape sequences
|
||||||
|
such as \x{d800} (a surrogate code point) you can set the
|
||||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
|
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
|
||||||
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
|
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
|
||||||
resentable in UTF-16.
|
resentable in UTF-16.
|
||||||
|
@ -11079,8 +11106,8 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 03 February 2019
|
Last updated: 06 March 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2_JIT_COMPILE 3 "21 October 2014" "PCRE2 10.00"
|
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
|
@ -29,6 +29,7 @@ bits:
|
||||||
PCRE2_JIT_COMPLETE compile code for full matching
|
PCRE2_JIT_COMPLETE compile code for full matching
|
||||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||||
|
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||||
.sp
|
.sp
|
||||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33"
|
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||||
|
@ -120,6 +120,28 @@ support is not available, or the pattern was not processed by
|
||||||
pattern.
|
pattern.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||||
|
function expects its subject string to be a valid sequence of UTF code units.
|
||||||
|
If it is not, the result is undefined. This is also true by default of matching
|
||||||
|
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||||
|
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
|
||||||
|
UTF is compiled.
|
||||||
|
.P
|
||||||
|
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||||
|
does not match dot, it does not match \ep{Any}, it does not even match negative
|
||||||
|
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||||
|
sequence while moving the current point backwards. In other words, an invalid
|
||||||
|
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||||
|
invalid sequence causes an immediate backtrack.
|
||||||
|
.P
|
||||||
|
Using this option, an application can run matches in arbitrary data, knowing
|
||||||
|
that any matched strings that are returned will be valid UTF. This can be
|
||||||
|
useful when searching for text in executable or other binary files.
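
The "barrier" behaviour can be pictured with a small, self-contained C sketch. It does not call PCRE2 and does not show how the JIT code works internally; it only computes, for UTF-8 data, the stretch of valid text that begins at a given offset. Under the description above, any match returned in this mode lies entirely inside one such stretch. The function names utf8_char_len and utf8_valid_run are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the valid UTF-8 character that
   starts at s[0], or 0 if the bytes there do not form a valid character. */
static size_t utf8_char_len(const unsigned char *s, size_t avail)
{
    uint32_t cp, min;
    size_t n;

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* invalid lead or stray continuation */
    if (avail < n) return 0;        /* truncated character */
    for (size_t k = 1; k < n; k++) {
        if ((s[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return n;
}

/* Hypothetical helper: number of bytes of valid UTF-8 beginning at offset
   'start'. The run stops at the first invalid sequence (the "barrier") or
   at the end of the buffer. */
size_t utf8_valid_run(const unsigned char *s, size_t len, size_t start)
{
    size_t i = start;
    while (i < len) {
        size_t n = utf8_char_len(s + i, len - i);
        if (n == 0) break;          /* barrier reached */
        i += n;
    }
    return i - start;
}
```

For the five bytes 'a', 'b', 0xFF, 'c', 'd', the sketch reports a valid run of 2 bytes at offset 0, a run of 0 bytes at offset 2 (the invalid byte itself), and a run of 2 bytes at offset 3. A match could therefore cover "ab" or "cd", but never a span that includes the 0xFF byte.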
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
|
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -416,6 +438,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 16 October 2018
|
Last updated: 06 March 2019
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2UNICODE 3 "03 February 2019" "PCRE2 10.33"
|
.TH PCRE2UNICODE 3 "06 March 2019" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE - Perl-compatible regular expressions (revised API)
|
PCRE - Perl-compatible regular expressions (revised API)
|
||||||
.SH "UNICODE AND UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
|
@ -230,23 +230,46 @@ adjacent characters.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||||
are (by default) checked for validity on entry to the relevant functions.
|
are (by default) checked for validity on entry to the relevant functions. If an
|
||||||
If an invalid UTF string is passed, a negative error code is returned. The
|
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||||
code unit offset to the offending character can be extracted from the match
|
offset to the offending character can be extracted from the match data block by
|
||||||
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
|
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
|
||||||
purpose after a UTF error.
|
error.
|
||||||
|
.P
|
||||||
|
In some situations, you may already know that your strings are valid, and
|
||||||
|
therefore want to skip these checks in order to improve performance, for
|
||||||
|
example in the case of a long subject string that is being scanned repeatedly.
|
||||||
|
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||||
|
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||||
|
only valid UTF code unit sequences.
|
||||||
|
.P
|
||||||
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||||
|
is usually undefined and your program may crash or loop indefinitely. There is,
|
||||||
|
however, one mode of matching that can handle invalid UTF subject strings. This
|
||||||
|
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
|
||||||
|
when calling \fBpcre2_jit_compile()\fP. For details, see the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2jit\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
|
.P
|
||||||
|
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
||||||
|
the pattern; it does not also apply to subject strings. If you want to disable
|
||||||
|
the check for a subject string you must pass this same option to
|
||||||
|
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
|
||||||
.P
|
.P
|
||||||
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
UTF-16 and UTF-32 strings can indicate their endianness by special code known
|
||||||
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
||||||
strings to be in host byte order.
|
strings to be in host byte order.
|
||||||
.P
|
.P
|
||||||
A UTF string is checked before any other processing takes place. In the case of
|
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
|
||||||
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
|
processing takes place. In the case of \fBpcre2_match()\fP and
|
||||||
offset, the check is applied only to that part of the subject that could be
|
\fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
|
||||||
inspected during matching, and there is a check that the starting offset points
|
applied only to that part of the subject that could be inspected during
|
||||||
to the first code unit of a character or to the end of the subject. If there
|
matching, and there is a check that the starting offset points to the first
|
||||||
are no lookbehind assertions in the pattern, the check starts at the starting
|
code unit of a character or to the end of the subject. If there are no
|
||||||
offset. Otherwise, it starts at the length of the longest lookbehind before the
|
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||||
|
Otherwise, it starts at the length of the longest lookbehind before the
|
||||||
starting offset, or at the start of the subject if there are not that many
|
starting offset, or at the start of the subject if there are not that many
|
||||||
characters before the starting offset. Note that the sequences \eb and \eB are
|
characters before the starting offset. Note that the sequences \eb and \eB are
|
||||||
one-character lookbehinds.
|
one-character lookbehinds.
|
||||||
|
@ -263,28 +286,12 @@ independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
|
||||||
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
|
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
|
||||||
UTF-32.)
|
UTF-32.)
|
||||||
.P
|
.P
|
||||||
In some situations, you may already know that your strings are valid, and
|
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
|
||||||
therefore want to skip these checks in order to improve performance, for
|
given if an escape sequence for an invalid Unicode code point is encountered in
|
||||||
example in the case of a long subject string that is being scanned repeatedly.
|
the pattern. If you want to allow escape sequences such as \ex{d800} (a
|
||||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
|
||||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
|
||||||
only valid UTF code unit sequences.
|
values are not representable in UTF-16.
|
||||||
.P
|
|
||||||
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
|
|
||||||
the pattern; it does not also apply to subject strings. If you want to disable
|
|
||||||
the check for a subject string you must pass this option to \fBpcre2_match()\fP
|
|
||||||
or \fBpcre2_dfa_match()\fP.
|
|
||||||
.P
|
|
||||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
|
||||||
is undefined and your program may crash or loop indefinitely.
|
|
||||||
.P
|
|
||||||
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
|
||||||
that is given if an escape sequence for an invalid Unicode code point is
|
|
||||||
encountered in the pattern. If you want to allow escape sequences such as
|
|
||||||
\ex{d800} (a surrogate code point) you can set the
|
|
||||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
|
|
||||||
only in UTF-8 and UTF-32 modes, because these values are not representable in
|
|
||||||
UTF-16.
|
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.\" HTML <a name="utf8strings"></a>
|
.\" HTML <a name="utf8strings"></a>
|
||||||
|
@ -393,6 +400,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 03 February 2019
|
Last updated: 06 March 2019
|
||||||
Copyright (c) 1997-2019 University of Cambridge.
|
Copyright (c) 1997-2019 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|