Documentation update re PCRE2_JIT_INVALID_UTF
parent 7375089fa5
commit 590f65f061
@ -40,6 +40,7 @@ bits:
|
|||
PCRE2_JIT_COMPLETE compile code for full matching
|
||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||
</pre>
|
||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
@ -312,7 +312,7 @@ document for an overview of all the PCRE2 documentation.
|
|||
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, </b>
|
||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
|
||||
<b> void *<i>where</i>);</b>
|
||||
<br>
|
||||
<br>
@ -16,16 +16,17 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
|
||||
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
|
||||
<li><a name="TOC4" href="#SEC4">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
|
||||
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT MATCHING</a>
|
||||
<li><a name="TOC6" href="#SEC6">CONTROLLING THE JIT STACK</a>
|
||||
<li><a name="TOC7" href="#SEC7">JIT STACK FAQ</a>
|
||||
<li><a name="TOC8" href="#SEC8">FREEING JIT SPECULATIVE MEMORY</a>
|
||||
<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
|
||||
<li><a name="TOC10" href="#SEC10">JIT FAST PATH API</a>
|
||||
<li><a name="TOC11" href="#SEC11">SEE ALSO</a>
|
||||
<li><a name="TOC12" href="#SEC12">AUTHOR</a>
|
||||
<li><a name="TOC13" href="#SEC13">REVISION</a>
|
||||
<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
|
||||
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
|
||||
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
|
||||
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
|
||||
<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
|
||||
<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
|
||||
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
|
||||
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
|
||||
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
|
||||
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
|
||||
<li><a name="TOC14" href="#SEC14">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
|
@ -144,7 +145,29 @@ support is not available, or the pattern was not processed by
|
|||
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
|
||||
pattern.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
||||
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
|
||||
<P>
|
||||
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||
function expects its subject string to be a valid sequence of UTF code units.
|
||||
If it is not, the result is undefined. This is also true by default of matching
|
||||
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
|
||||
UTF is compiled.
|
||||
</P>
|
||||
<P>
|
||||
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||
does not match dot, it does not match \p{Any}, it does not even match negative
|
||||
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||
sequence while moving the current point backwards. In other words, an invalid
|
||||
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||
invalid sequence causes an immediate backtrack.
|
||||
</P>
|
||||
<P>
|
||||
Using this option, an application can run matches in arbitrary data, knowing
|
||||
that any matched strings that are returned will be valid UTF. This can be
|
||||
useful when searching for text in executable or other binary files.
|
||||
</P>
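The "barrier" behaviour described above can be pictured with a small stand-alone model. The sketch below is not PCRE2 code and makes no claim about how the JIT compiler implements this mode; the function names utf8_char_len() and valid_utf8_run_end() are hypothetical. It splits arbitrary bytes into maximal runs of valid UTF-8, on the assumption that a match in this mode must lie wholly inside one such run.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the valid UTF-8 character that
   starts at s[0], or 0 if the bytes there do not form a valid character.
   Overlong forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF
   are rejected. */
static size_t utf8_char_len(const uint8_t *s, size_t avail)
{
    uint32_t cp, min;
    size_t len, i;

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (avail < len) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min) return 0;                      /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    return len;
}

/* Hypothetical helper: starting at pos, return the offset at which the
   run of valid characters ends (the next "barrier", or len). */
static size_t valid_utf8_run_end(const uint8_t *data, size_t len, size_t pos)
{
    while (pos < len) {
        size_t n = utf8_char_len(data + pos, len - pos);
        if (n == 0) break;
        pos += n;
    }
    return pos;
}
```

For the five bytes 'a', 'b', 0xFF, 'c', 'd', the first run ends at offset 2 and a second run covers offsets 3 to 5. In this model nothing can match across the 0xFF byte.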
|
||||
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> options that are supported for JIT matching are
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
||||
|
@ -161,7 +184,7 @@ The only unsupported pattern items are \C (match a single data unit) when
|
|||
running in a UTF mode, and a callout immediately before an assertion condition
|
||||
in a conditional group.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
|
||||
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
|
||||
<P>
|
||||
When a pattern is matched using JIT matching, the return values are the same
|
||||
as those given by the interpretive <b>pcre2_match()</b> code, with the addition
|
||||
|
@ -177,7 +200,7 @@ circumstance when JIT is not used, but the details of exactly what is counted
|
|||
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
||||
when JIT matching is used.
|
||||
<a name="stackcontrol"></a></P>
|
||||
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||
<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||
<P>
|
||||
When the compiled JIT code runs, it needs a block of memory to use as a stack.
|
||||
By default, it uses 32KiB on the machine stack. However, some large or
|
||||
|
@ -270,7 +293,7 @@ non-default JIT stacks might operate:
|
|||
</pre>
|
||||
All the functions described in this section do nothing if JIT is not available.
|
||||
<a name="stackfaq"></a></P>
|
||||
<br><a name="SEC7" href="#TOC1">JIT STACK FAQ</a><br>
|
||||
<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
|
||||
<P>
|
||||
(1) Why do we need JIT stacks?
|
||||
<br>
|
||||
|
@ -349,7 +372,7 @@ stack handling?
|
|||
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
|
||||
out this complicated API.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
|
||||
<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
|
||||
<P>
|
||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
|
@ -361,7 +384,7 @@ all possible memory. You can cause this to happen by calling
|
|||
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
|
||||
memory management, or NULL for standard memory management.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
|
||||
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
|
||||
<P>
|
||||
This is a single-threaded example that specifies a JIT stack without using a
|
||||
callback. A real program should include error checking after all the function
|
||||
|
@ -390,7 +413,7 @@ calls.
|
|||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">JIT FAST PATH API</a><br>
|
||||
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
|
||||
<P>
|
||||
Because the API described above falls back to interpreted matching when JIT is
|
||||
not available, it is convenient for programs that are written for general use
|
||||
|
@ -423,11 +446,11 @@ invalid data is passed, the result is undefined.
|
|||
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
|
||||
speedups of more than 10%.
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3)
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel (FAQ by Zoltan Herczeg)
|
||||
<br>
|
||||
|
@ -436,11 +459,11 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 16 October 2018
|
||||
Last updated: 06 March 2019
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
@ -247,11 +247,34 @@ VALIDITY OF UTF STRINGS
|
|||
</b><br>
|
||||
<P>
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||
are (by default) checked for validity on entry to the relevant functions.
|
||||
If an invalid UTF string is passed, a negative error code is returned. The
|
||||
code unit offset to the offending character can be extracted from the match
|
||||
data block by calling <b>pcre2_get_startchar()</b>, which is used for this
|
||||
purpose after a UTF error.
|
||||
are (by default) checked for validity on entry to the relevant functions. If an
|
||||
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||
offset to the offending character can be extracted from the match data block by
|
||||
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
|
||||
error.
|
||||
</P>
|
||||
<P>
|
||||
In some situations, you may already know that your strings are valid, and
|
||||
therefore want to skip these checks in order to improve performance, for
|
||||
example in the case of a long subject string that is being scanned repeatedly.
|
||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||
only valid UTF code unit sequences.
|
||||
</P>
|
||||
<P>
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
is usually undefined and your program may crash or loop indefinitely. There is,
|
||||
however, one mode of matching that can handle invalid UTF subject strings. This
|
||||
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
|
||||
when calling <b>pcre2_jit_compile()</b>. For details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this same option to
|
||||
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
|
||||
|
@ -259,13 +282,14 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
|||
strings to be in host byte order.
|
||||
</P>
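Because the PCRE2 functions expect host byte order and do not interpret a BOM, an application that receives UTF-16 data with a BOM has to deal with it before matching. The following is a minimal sketch of one way to do that in application code. The function name utf16_normalize_bom() is hypothetical and is not part of the PCRE2 API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: inspect the first code unit of a UTF-16 buffer.
   If it is a BOM in the opposite byte order (read as 0xFFFE), swap every
   code unit in place so the buffer is in host order. Returns the number
   of leading code units (0 or 1) that the caller should skip so that the
   BOM itself is not passed on as part of the subject. */
static size_t utf16_normalize_bom(uint16_t *units, size_t count)
{
    size_t i;

    if (count == 0) return 0;
    if (units[0] == 0xFEFF) return 1;      /* BOM, already host order */
    if (units[0] == 0xFFFE) {              /* BOM, opposite byte order */
        for (i = 0; i < count; i++)
            units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
        return 1;
    }
    return 0;                              /* no BOM: assume host order */
}
```

The caller would then pass units + skip, with a length reduced by skip, as the subject. Data without a BOM is assumed here to be in host order already, which is an application-level decision.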
|
||||
<P>
|
||||
A UTF string is checked before any other processing takes place. In the case of
|
||||
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
|
||||
offset, the check is applied only to that part of the subject that could be
|
||||
inspected during matching, and there is a check that the starting offset points
|
||||
to the first code unit of a character or to the end of the subject. If there
|
||||
are no lookbehind assertions in the pattern, the check starts at the starting
|
||||
offset. Otherwise, it starts at the length of the longest lookbehind before the
|
||||
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
|
||||
processing takes place. In the case of <b>pcre2_match()</b> and
|
||||
<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
|
||||
applied only to that part of the subject that could be inspected during
|
||||
matching, and there is a check that the starting offset points to the first
|
||||
code unit of a character or to the end of the subject. If there are no
|
||||
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||
Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \b and \B are
|
||||
one-character lookbehinds.
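The two checks described in this paragraph can be sketched for UTF-8 as follows. This is an illustration only, not the library's implementation; the names is_char_start() and utf_check_start() are hypothetical, and the backward step assumes that the bytes before the starting offset contain recognisable lead bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: does offset point to the first code unit of a
   character, or to the end of the subject? In UTF-8 a continuation byte
   has the bit pattern 10xxxxxx. */
static int is_char_start(const uint8_t *subject, size_t length, size_t offset)
{
    if (offset > length) return 0;
    if (offset == length) return 1;            /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;   /* not a continuation byte */
}

/* Hypothetical helper: move back at most max_lookbehind characters from
   start_offset and return the offset where a validity check would begin.
   With max_lookbehind == 0 the check starts at start_offset itself; if
   there are fewer characters available, it starts at offset 0. */
static size_t utf_check_start(const uint8_t *subject, size_t start_offset,
                              size_t max_lookbehind)
{
    size_t pos = start_offset;

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip back over continuation bytes to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
        max_lookbehind--;
    }
    return pos;
}
```

For a pattern that uses \b, max_lookbehind would be at least 1, so one character before the starting offset is included in the checked region.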
|
||||
|
@ -285,31 +309,12 @@ surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
|
|||
UTF-32.)
|
||||
</P>
|
||||
<P>
|
||||
In some situations, you may already know that your strings are valid, and
|
||||
therefore want to skip these checks in order to improve performance, for
|
||||
example in the case of a long subject string that is being scanned repeatedly.
|
||||
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
||||
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
||||
only valid UTF code unit sequences.
|
||||
</P>
|
||||
<P>
|
||||
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
|
||||
the pattern; it does not also apply to subject strings. If you want to disable
|
||||
the check for a subject string you must pass this option to <b>pcre2_match()</b>
|
||||
or <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
||||
is undefined and your program may crash or loop indefinitely.
|
||||
</P>
|
||||
<P>
|
||||
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
||||
that is given if an escape sequence for an invalid Unicode code point is
|
||||
encountered in the pattern. If you want to allow escape sequences such as
|
||||
\x{d800} (a surrogate code point) you can set the
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
|
||||
only in UTF-8 and UTF-32 modes, because these values are not representable in
|
||||
UTF-16.
|
||||
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
|
||||
given if an escape sequence for an invalid Unicode code point is encountered in
|
||||
the pattern. If you want to allow escape sequences such as \x{d800} (a
|
||||
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
|
||||
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
|
||||
values are not representable in UTF-16.
|
||||
<a name="utf8strings"></a></P>
|
||||
<br><b>
|
||||
Errors in UTF-8 strings
|
||||
|
@ -417,7 +422,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 03 February 2019
|
||||
Last updated: 06 March 2019
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
179
doc/pcre2.txt
|
@ -180,8 +180,8 @@ REVISION
|
|||
Last updated: 17 September 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2API(3) Library Functions Manual PCRE2API(3)
|
||||
|
||||
|
||||
|
@ -3681,8 +3681,8 @@ REVISION
|
|||
Last updated: 14 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
|
||||
|
||||
|
||||
|
@ -4254,8 +4254,8 @@ REVISION
|
|||
Last updated: 03 March 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
|
||||
|
||||
|
||||
|
@ -4685,8 +4685,8 @@ REVISION
|
|||
Last updated: 03 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
|
||||
|
||||
|
||||
|
@ -4890,8 +4890,8 @@ REVISION
|
|||
Last updated: 12 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
|
||||
|
||||
|
||||
|
@ -5010,6 +5010,29 @@ SIMPLE USE OF JIT
|
|||
to handle the pattern.
|
||||
|
||||
|
||||
MATCHING SUBJECTS CONTAINING INVALID UTF
|
||||
|
||||
When a pattern is compiled with the PCRE2_UTF option, the interpretive
|
||||
matching function expects its subject string to be a valid sequence of
|
||||
UTF code units. If it is not, the result is undefined. This is also
|
||||
true by default of matching via JIT. However, if the option
|
||||
PCRE2_JIT_INVALID_UTF is passed to pcre2_jit_compile(), code that can
|
||||
process a subject containing invalid UTF is compiled.
|
||||
|
||||
In this mode, an invalid code unit sequence never matches any pattern
|
||||
item. It does not match dot, it does not match \p{Any}, it does not
|
||||
even match negative items such as [^X]. A lookbehind assertion fails if
|
||||
it encounters an invalid sequence while moving the current point back-
|
||||
wards. In other words, an invalid UTF code unit sequence acts as a bar-
|
||||
rier which no match can cross. Reaching an invalid sequence causes an
|
||||
immediate backtrack.
|
||||
|
||||
Using this option, an application can run matches in arbitrary data,
|
||||
knowing that any matched strings that are returned will be valid UTF.
|
||||
This can be useful when searching for text in executable or other
|
||||
binary files.
|
||||
|
||||
|
||||
UNSUPPORTED OPTIONS AND PATTERN ITEMS
|
||||
|
||||
The pcre2_match() options that are supported for JIT matching are
|
||||
|
@ -5287,11 +5310,11 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 16 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
Last updated: 06 March 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
|
||||
|
||||
|
||||
|
@ -5360,8 +5383,8 @@ REVISION
|
|||
Last updated: 02 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
|
||||
|
||||
|
||||
|
@ -5581,8 +5604,8 @@ REVISION
|
|||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
|
||||
|
||||
|
||||
|
@ -6021,8 +6044,8 @@ REVISION
|
|||
Last updated: 22 December 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
|
||||
|
||||
|
||||
|
@ -9365,8 +9388,8 @@ REVISION
|
|||
Last updated: 12 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
|
||||
|
||||
|
||||
|
@ -9600,8 +9623,8 @@ REVISION
|
|||
Last updated: 03 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
|
||||
|
||||
|
||||
|
@ -9930,8 +9953,8 @@ REVISION
|
|||
Last updated: 30 January 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
|
||||
|
||||
|
||||
|
@ -10209,8 +10232,8 @@ REVISION
|
|||
Last updated: 27 June 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
|
||||
|
||||
|
||||
|
@ -10710,8 +10733,8 @@ REVISION
|
|||
Last updated: 11 February 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
||||
|
||||
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
|
||||
|
||||
|
||||
|
@ -10928,59 +10951,63 @@ VALIDITY OF UTF STRINGS
|
|||
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and
|
||||
subjects are (by default) checked for validity on entry to the relevant
|
||||
functions. If an invalid UTF string is passed, a negative error code
|
||||
functions. If an invalid UTF string is passed, a negative error code
|
||||
is returned. The code unit offset to the offending character can be
|
||||
extracted from the match data block by calling pcre2_get_startchar(),
|
||||
which is used for this purpose after a UTF error.
|
||||
|
||||
UTF-16 and UTF-32 strings can indicate their endianness by a special code
|
||||
known as a byte-order mark (BOM). The PCRE2 functions do not handle
|
||||
this, expecting strings to be in host byte order.
|
||||
|
||||
A UTF string is checked before any other processing takes place. In the
|
||||
case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
|
||||
starting offset, the check is applied only to that part of the subject
|
||||
that could be inspected during matching, and there is a check that the
|
||||
starting offset points to the first code unit of a character or to the
|
||||
end of the subject. If there are no lookbehind assertions in the pat-
|
||||
tern, the check starts at the starting offset. Otherwise, it starts at
|
||||
the length of the longest lookbehind before the starting offset, or at
|
||||
the start of the subject if there are not that many characters before
|
||||
the starting offset. Note that the sequences \b and \B are one-charac-
|
||||
ter lookbehinds.
|
||||
|
||||
In addition to checking the format of the string, there is a check to
|
||||
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
|
||||
the surrogate area. The so-called "non-character" code points are not
|
||||
excluded because Unicode corrigendum #9 makes it clear that they should
|
||||
not be.
|
||||
|
||||
Characters in the "Surrogate Area" of Unicode are reserved for use by
|
||||
UTF-16, where they are used in pairs to encode code points with values
|
||||
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
|
||||
are available independently in the UTF-8 and UTF-32 encodings. (In
|
||||
other words, the whole surrogate thing is a fudge for UTF-16 which
|
||||
unfortunately messes up UTF-8 and UTF-32.)
|
||||
|
||||
In some situations, you may already know that your strings are valid,
|
||||
and therefore want to skip these checks in order to improve perfor-
|
||||
mance, for example in the case of a long subject string that is being
|
||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
||||
pile time or at match time, PCRE2 assumes that the pattern or subject
|
||||
In some situations, you may already know that your strings are valid,
|
||||
and therefore want to skip these checks in order to improve perfor-
|
||||
mance, for example in the case of a long subject string that is being
|
||||
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
|
||||
pile time or at match time, PCRE2 assumes that the pattern or subject
|
||||
it is given (respectively) contains only valid UTF code unit sequences.
|
||||
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
||||
result is usually undefined and your program may crash or loop indefi-
|
||||
nitely. There is, however, one mode of matching that can handle invalid
|
||||
UTF subject strings. This is matching via the JIT optimization using
|
||||
the PCRE2_JIT_INVALID_UTF option when calling pcre2_jit_compile(). For
|
||||
details, see the pcre2jit documentation.
|
||||
|
||||
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
|
||||
for the pattern; it does not also apply to subject strings. If you want
|
||||
to disable the check for a subject string you must pass this option to
|
||||
pcre2_match() or pcre2_dfa_match().
|
||||
to disable the check for a subject string you must pass this same
|
||||
option to pcre2_match() or pcre2_dfa_match().
|
||||
|
||||
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
|
||||
result is undefined and your program may crash or loop indefinitely.
|
||||
UTF-16 and UTF-32 strings can indicate their endianness by a special code
|
||||
known as a byte-order mark (BOM). The PCRE2 functions do not handle
|
||||
this, expecting strings to be in host byte order.
|
||||
|
||||
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
|
||||
the error that is given if an escape sequence for an invalid Unicode
|
||||
code point is encountered in the pattern. If you want to allow escape
|
||||
sequences such as \x{d800} (a surrogate code point) you can set the
|
||||
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
|
||||
other processing takes place. In the case of pcre2_match() and
|
||||
pcre2_dfa_match() calls with a non-zero starting offset, the check is
|
||||
applied only to that part of the subject that could be inspected during
|
||||
matching, and there is a check that the starting offset points to the
|
||||
first code unit of a character or to the end of the subject. If there
|
||||
are no lookbehind assertions in the pattern, the check starts at the
|
||||
starting offset. Otherwise, it starts at the length of the longest
|
||||
lookbehind before the starting offset, or at the start of the subject
|
||||
if there are not that many characters before the starting offset. Note
|
||||
that the sequences \b and \B are one-character lookbehinds.
|
||||
|
||||
In addition to checking the format of the string, there is a check to
|
||||
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
|
||||
the surrogate area. The so-called "non-character" code points are not
|
||||
excluded because Unicode corrigendum #9 makes it clear that they should
|
||||
not be.
|
||||
|
||||
Characters in the "Surrogate Area" of Unicode are reserved for use by
|
||||
UTF-16, where they are used in pairs to encode code points with values
|
||||
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
|
||||
are available independently in the UTF-8 and UTF-32 encodings. (In
|
||||
other words, the whole surrogate thing is a fudge for UTF-16 which
|
||||
unfortunately messes up UTF-8 and UTF-32.)
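The range check and the use of surrogate pairs described in the two paragraphs above can be written out as a short sketch. It is an illustration of the rules as stated, not code from the library; is_valid_code_point() and utf16_encode() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper: a code point is accepted if it lies in the range
   U+0 to U+10FFFF and is not in the surrogate area. "Non-character" code
   points such as U+FFFE are deliberately not excluded. */
static int is_valid_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return 1;
}

/* Hypothetical helper: encode cp as UTF-16 in host byte order. Returns
   the number of code units written (1 or 2), or 0 if cp is not valid.
   Code points above 0xFFFF are written as a high/low surrogate pair. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (!is_valid_code_point(cp)) return 0;
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, U+1F600 is encoded as the pair 0xD83D, 0xDE00, whereas in UTF-8 and UTF-32 the same code point is encoded directly and the values 0xD800 to 0xDFFF are simply unused.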
|
||||
|
||||
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
|
||||
that is given if an escape sequence for an invalid Unicode code point
|
||||
is encountered in the pattern. If you want to allow escape sequences
|
||||
such as \x{d800} (a surrogate code point) you can set the
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
|
||||
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
|
||||
resentable in UTF-16.
|
||||
|
@ -11079,8 +11106,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 03 February 2019
|
||||
Last updated: 06 March 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
------------------------------------------------------------------------------
@ -1,4 +1,4 @@
|
|||
.TH PCRE2_JIT_COMPILE 3 "21 October 2014" "PCRE2 10.00"
|
||||
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH SYNOPSIS
|
||||
|
@ -29,6 +29,7 @@ bits:
|
|||
PCRE2_JIT_COMPLETE compile code for full matching
|
||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
|
||||
.sp
|
||||
The yield of the function is 0 for success, or a negative error code otherwise.
|
||||
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or
@ -1,4 +1,4 @@
|
|||
.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33"
|
||||
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
|
||||
|
@ -120,6 +120,28 @@ support is not available, or the pattern was not processed by
|
|||
pattern.
|
||||
.
|
||||
.
|
||||
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
|
||||
.rs
|
||||
.sp
|
||||
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
|
||||
function expects its subject string to be a valid sequence of UTF code units.
|
||||
If it is not, the result is undefined. This is also true by default of matching
|
||||
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
|
||||
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
|
||||
UTF is compiled.
|
||||
.P
|
||||
In this mode, an invalid code unit sequence never matches any pattern item. It
|
||||
does not match dot, it does not match \ep{Any}, it does not even match negative
|
||||
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
|
||||
sequence while moving the current point backwards. In other words, an invalid
|
||||
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
|
||||
invalid sequence causes an immediate backtrack.
|
||||
.P
|
||||
Using this option, an application can run matches in arbitrary data, knowing
|
||||
that any matched strings that are returned will be valid UTF. This can be
|
||||
useful when searching for text in executable or other binary files.
|
||||
.
|
||||
.
|
||||
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -416,6 +438,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 16 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
Last updated: 06 March 2019
|
||||
Copyright (c) 1997-2019 University of Cambridge.
|
||||
.fi
@ -1,4 +1,4 @@
|
|||
.TH PCRE2UNICODE 3 "03 February 2019" "PCRE2 10.33"
|
||||
.TH PCRE2UNICODE 3 "06 March 2019" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
|
@ -230,23 +230,46 @@ adjacent characters.
|
|||
.rs
|
||||
.sp
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
||||
are (by default) checked for validity on entry to the relevant functions.
|
||||
If an invalid UTF string is passed, a negative error code is returned. The
|
||||
code unit offset to the offending character can be extracted from the match
|
||||
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
|
||||
purpose after a UTF error.
|
||||
are (by default) checked for validity on entry to the relevant functions. If an
|
||||
invalid UTF string is passed, a negative error code is returned. The code unit
|
||||
offset to the offending character can be extracted from the match data block by
|
||||
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
|
||||
error.
.P
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is,
however, one mode of matching that can handle invalid UTF subject strings. This
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
when calling \fBpcre2_jit_compile()\fP. For details, see the
.\" HREF
\fBpcre2jit\fP
.\"
documentation.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this same option to
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
.P
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
.P
A UTF string is checked before any other processing takes place. In the case of
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
offset, the check is applied only to that part of the subject that could be
inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the starting
offset. Otherwise, it starts at the length of the longest lookbehind before the
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
processing takes place. In the case of \fBpcre2_match()\fP and
\fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
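
The two rules in the paragraph above (the starting offset must sit on the
first code unit of a character or at the end of the subject, and the validity
check begins the longest-lookbehind number of characters before it) can be
sketched for UTF-8 as follows. This is a simplified illustration rather than
PCRE2's internal code; the names offset_is_char_boundary() and check_start()
are hypothetical, and the backward walk only looks at continuation-byte bit
patterns without validating the bytes it skips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx), that is,
   not the first code unit of a character. */
int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: a starting offset is acceptable if it is the end of
   the subject or points at the first code unit of a character. */
int offset_is_char_boundary(const uint8_t *subject, size_t length,
                            size_t offset)
{
    assert(offset <= length);
    return offset == length || !is_continuation(subject[offset]);
}

/* Hypothetical helper: step back up to max_lookbehind characters from
   start_offset and return the offset where a validity check would begin.
   If fewer characters are available, the result is 0 (start of subject). */
size_t check_start(const uint8_t *subject, size_t length,
                   size_t start_offset, size_t max_lookbehind)
{
    size_t pos = start_offset;

    assert(start_offset <= length);
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                   /* last unit of previous char */
        while (pos > 0 && is_continuation(subject[pos]))
            pos--;                               /* move to its first unit */
        max_lookbehind--;
    }
    return pos;
}
```

With the four-byte subject 'a', 0xC3, 0xA9, 'z' (that is, "a", a two-byte
character, then "z"), offset 2 is rejected because it falls in the middle of
a character. With a starting offset of 3 and a longest lookbehind of one
character, such as \eb, the check in this sketch would begin at offset 1.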
@ -263,28 +286,12 @@ independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
UTF-32.)
.P
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to \fBpcre2_match()\fP
or \fBpcre2_dfa_match()\fP.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
.P
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
that is given if an escape sequence for an invalid Unicode code point is
encountered in the pattern. If you want to allow escape sequences such as
\ex{d800} (a surrogate code point) you can set the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
only in UTF-8 and UTF-32 modes, because these values are not representable in
UTF-16.
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
given if an escape sequence for an invalid Unicode code point is encountered in
the pattern. If you want to allow escape sequences such as \ex{d800} (a
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
values are not representable in UTF-16.
.
.
.\" HTML <a name="utf8strings"></a>
@ -393,6 +400,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 February 2019
Last updated: 06 March 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi