Documentation update re PCRE2_JIT_INVALID_UTF

Philip.Hazel 2019-03-06 17:38:20 +00:00
parent 7375089fa5
commit 590f65f061
8 changed files with 263 additions and 177 deletions


@ -40,6 +40,7 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
</pre>
The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or


@ -312,7 +312,7 @@ document for an overview of all the PCRE2 documentation.
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, </b>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
<b> void *<i>where</i>);</b>
<br>
<br>


@ -16,16 +16,17 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
<li><a name="TOC4" href="#SEC4">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT MATCHING</a>
<li><a name="TOC6" href="#SEC6">CONTROLLING THE JIT STACK</a>
<li><a name="TOC7" href="#SEC7">JIT STACK FAQ</a>
<li><a name="TOC8" href="#SEC8">FREEING JIT SPECULATIVE MEMORY</a>
<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
<li><a name="TOC10" href="#SEC10">JIT FAST PATH API</a>
<li><a name="TOC11" href="#SEC11">SEE ALSO</a>
<li><a name="TOC12" href="#SEC12">AUTHOR</a>
<li><a name="TOC13" href="#SEC13">REVISION</a>
<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
@ -144,7 +145,29 @@ support is not available, or the pattern was not processed by
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
pattern.
</P>
<br><a name="SEC4" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
<P>
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
function expects its subject string to be a valid sequence of UTF code units.
If it is not, the result is undefined. This is also true by default of matching
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
<b>pcre2_jit_compile()</b>, code that can process a subject containing invalid
UTF is compiled.
</P>
<P>
In this mode, an invalid code unit sequence never matches any pattern item. It
does not match dot, it does not match \p{Any}, it does not even match negative
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
sequence while moving the current point backwards. In other words, an invalid
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
invalid sequence causes an immediate backtrack.
</P>
<P>
Using this option, an application can run matches in arbitrary data, knowing
that any matched strings that are returned will be valid UTF. This can be
useful when searching for text in executable or other binary files.
</P>
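The barrier behaviour described above can be pictured with a small stand-alone C model. This is only a sketch and not how the JIT code is implemented: it assumes an 8-bit (UTF-8) subject, and the helper names utf8_char_len() and next_valid_run() are hypothetical, not part of the PCRE2 API. It splits arbitrary bytes into maximal runs of well-formed UTF-8. Under PCRE2_JIT_INVALID_UTF, any matched string would have to lie entirely inside one such run.

```c
#include <stddef.h>

/* Length in bytes of the well-formed UTF-8 character starting at s[0],
   or 0 if the bytes there are not valid UTF-8 (stray continuation byte,
   truncated or overlong sequence, surrogate, or value above U+10FFFF).
   n is the number of bytes available from s. */
size_t utf8_char_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp, min;
    unsigned char b0;

    if (n == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) return 1;                       /* ASCII */

    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0)      { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                                 /* not a valid lead byte */

    if (n < len) return 0;                         /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;       /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF) return 0;       /* overlong or out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    /* surrogate */
    return len;
}

/* Starting at *pos, skip invalid bytes and report the next maximal run of
   valid UTF-8 as the half-open byte range [*start, *end). Returns 1 if a
   run was found and 0 when the data is exhausted. *pos is advanced so the
   function can be called repeatedly. */
int next_valid_run(const unsigned char *s, size_t n,
                   size_t *pos, size_t *start, size_t *end)
{
    size_t i = *pos, k;

    while (i < n && utf8_char_len(s + i, n - i) == 0) i++;
    if (i >= n) { *pos = n; return 0; }

    *start = i;
    while (i < n && (k = utf8_char_len(s + i, n - i)) != 0) i += k;
    *end = i;
    *pos = i;
    return 1;
}
```

Calling next_valid_run() in a loop over a binary file would yield the stretches of text in which a match could be found. The real JIT code detects the invalid sequences during matching rather than in a separate pass.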
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
The <b>pcre2_match()</b> options that are supported for JIT matching are
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
@ -161,7 +184,7 @@ The only unsupported pattern items are \C (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
</P>
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
<P>
When a pattern is matched using JIT matching, the return values are the same
as those given by the interpretive <b>pcre2_match()</b> code, with the addition
@ -177,7 +200,7 @@ circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
<a name="stackcontrol"></a></P>
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
<P>
When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32KiB on the machine stack. However, some large or
@ -270,7 +293,7 @@ non-default JIT stacks might operate:
</pre>
All the functions described in this section do nothing if JIT is not available.
<a name="stackfaq"></a></P>
<br><a name="SEC7" href="#TOC1">JIT STACK FAQ</a><br>
<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
<P>
(1) Why do we need JIT stacks?
<br>
@ -349,7 +372,7 @@ stack handling?
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
out this complicated API.
</P>
<br><a name="SEC8" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
<P>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
</P>
@ -361,7 +384,7 @@ all possible memory. You can cause this to happen by calling
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
memory management, or NULL for standard memory management.
</P>
<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback. A real program should include error checking after all the function
@ -390,7 +413,7 @@ calls.
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">JIT FAST PATH API</a><br>
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
<P>
Because the API described above falls back to interpreted matching when JIT is
not available, it is convenient for programs that are written for general use
@ -423,11 +446,11 @@ invalid data is passed, the result is undefined.
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
speedups of more than 10%.
</P>
<br><a name="SEC11" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3)
</P>
<br><a name="SEC12" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel (FAQ by Zoltan Herczeg)
<br>
@ -436,11 +459,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2018
Last updated: 06 March 2019
<br>
Copyright &copy; 1997-2018 University of Cambridge.
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.


@ -247,11 +247,34 @@ VALIDITY OF UTF STRINGS
</b><br>
<P>
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions.
If an invalid UTF string is passed, a negative error code is returned. The
code unit offset to the offending character can be extracted from the match
data block by calling <b>pcre2_get_startchar()</b>, which is used for this
purpose after a UTF error.
are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
error.
</P>
<P>
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
</P>
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is,
however, one mode of matching that can handle invalid UTF subject strings. This
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
when calling <b>pcre2_jit_compile()</b>. For details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this same option to
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
</P>
<P>
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
@ -259,13 +282,14 @@ as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
</P>
<P>
A UTF string is checked before any other processing takes place. In the case of
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> calls with a non-zero starting
offset, the check is applied only to that part of the subject that could be
inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the starting
offset. Otherwise, it starts at the length of the longest lookbehind before the
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
processing takes place. In the case of <b>pcre2_match()</b> and
<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.
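The starting-offset condition in the paragraph above can be sketched for the 8-bit library as follows. This is an illustration under the assumption of a UTF-8 subject, with a hypothetical helper name (utf8_start_offset_ok() is not a PCRE2 function). In UTF-8, an offset points to the first code unit of a character exactly when the byte there is not a continuation byte of the form 10xxxxxx.

```c
#include <stddef.h>

/* Returns 1 if `offset` is acceptable as a starting offset into a UTF-8
   subject of `n` bytes: it is either the end of the subject or it points
   at a byte that is not a continuation byte (10xxxxxx). Returns 0
   otherwise, including when the offset is past the end. */
int utf8_start_offset_ok(const unsigned char *subject, size_t n, size_t offset)
{
    if (offset > n) return 0;     /* beyond the subject */
    if (offset == n) return 1;    /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```

For a subject consisting of the bytes C3 A9 21 (an e-acute followed by "!"), offsets 0, 2 and 3 pass this test and offset 1 does not, because it lands in the middle of the two-byte character.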
@ -285,31 +309,12 @@ surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
UTF-32.)
</P>
<P>
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
</P>
<P>
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to <b>pcre2_match()</b>
or <b>pcre2_dfa_match()</b>.
</P>
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
</P>
<P>
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
that is given if an escape sequence for an invalid Unicode code point is
encountered in the pattern. If you want to allow escape sequences such as
\x{d800} (a surrogate code point) you can set the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
only in UTF-8 and UTF-32 modes, because these values are not representable in
UTF-16.
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
given if an escape sequence for an invalid Unicode code point is encountered in
the pattern. If you want to allow escape sequences such as \x{d800} (a
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
values are not representable in UTF-16.
<a name="utf8strings"></a></P>
<br><b>
Errors in UTF-8 strings
@ -417,7 +422,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 03 February 2019
Last updated: 06 March 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>


@ -180,8 +180,8 @@ REVISION
Last updated: 17 September 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
PCRE2API(3) Library Functions Manual PCRE2API(3)
@ -3681,8 +3681,8 @@ REVISION
Last updated: 14 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
@ -4254,8 +4254,8 @@ REVISION
Last updated: 03 March 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
@ -4685,8 +4685,8 @@ REVISION
Last updated: 03 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
@ -4890,8 +4890,8 @@ REVISION
Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
@ -5010,6 +5010,29 @@ SIMPLE USE OF JIT
to handle the pattern.
MATCHING SUBJECTS CONTAINING INVALID UTF
When a pattern is compiled with the PCRE2_UTF option, the interpretive
matching function expects its subject string to be a valid sequence of
UTF code units. If it is not, the result is undefined. This is also
true by default of matching via JIT. However, if the option
PCRE2_JIT_INVALID_UTF is passed to pcre2_jit_compile(), code that can
process a subject containing invalid UTF is compiled.
In this mode, an invalid code unit sequence never matches any pattern
item. It does not match dot, it does not match \p{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if
it encounters an invalid sequence while moving the current point back-
wards. In other words, an invalid UTF code unit sequence acts as a bar-
rier which no match can cross. Reaching an invalid sequence causes an
immediate backtrack.
Using this option, an application can run matches in arbitrary data,
knowing that any matched strings that are returned will be valid UTF.
This can be useful when searching for text in executable or other
binary files.
UNSUPPORTED OPTIONS AND PATTERN ITEMS
The pcre2_match() options that are supported for JIT matching are
@ -5287,11 +5310,11 @@ AUTHOR
REVISION
Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
Last updated: 06 March 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
@ -5360,8 +5383,8 @@ REVISION
Last updated: 02 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
@ -5581,8 +5604,8 @@ REVISION
Last updated: 10 October 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
@ -6021,8 +6044,8 @@ REVISION
Last updated: 22 December 2014
Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
@ -9365,8 +9388,8 @@ REVISION
Last updated: 12 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
@ -9600,8 +9623,8 @@ REVISION
Last updated: 03 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
@ -9930,8 +9953,8 @@ REVISION
Last updated: 30 January 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
@ -10209,8 +10232,8 @@ REVISION
Last updated: 27 June 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
@ -10710,8 +10733,8 @@ REVISION
Last updated: 11 February 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
@ -10928,59 +10951,63 @@ VALIDITY OF UTF STRINGS
When the PCRE2_UTF option is set, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. If an invalid UTF string is passed, a negative error code
functions. If an invalid UTF string is passed, a negative error code
is returned. The code unit offset to the offending character can be
extracted from the match data block by calling pcre2_get_startchar(),
which is used for this purpose after a UTF error.
UTF-16 and UTF-32 strings can indicate their endianness by a special code
known as a byte-order mark (BOM). The PCRE2 functions do not handle
this, expecting strings to be in host byte order.
A UTF string is checked before any other processing takes place. In the
case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
starting offset, the check is applied only to that part of the subject
that could be inspected during matching, and there is a check that the
starting offset points to the first code unit of a character or to the
end of the subject. If there are no lookbehind assertions in the pat-
tern, the check starts at the starting offset. Otherwise, it starts at
the length of the longest lookbehind before the starting offset, or at
the start of the subject if there are not that many characters before
the starting offset. Note that the sequences \b and \B are one-charac-
ter lookbehinds.
In addition to checking the format of the string, there is a check to
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
the surrogate area. The so-called "non-character" code points are not
excluded because Unicode corrigendum #9 makes it clear that they should
not be.
Characters in the "Surrogate Area" of Unicode are reserved for use by
UTF-16, where they are used in pairs to encode code points with values
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 and UTF-32 encodings. (In
other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8 and UTF-32.)
In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
pile time or at match time, PCRE2 assumes that the pattern or subject
In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being
scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
pile time or at match time, PCRE2 assumes that the pattern or subject
it is given (respectively) contains only valid UTF code unit sequences.
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is usually undefined and your program may crash or loop indefi-
nitely. There is, however, one mode of matching that can handle invalid
UTF subject strings. This is matching via the JIT optimization using
the PCRE2_JIT_INVALID_UTF option when calling pcre2_jit_compile(). For
details, see the pcre2jit documentation.
Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want
to disable the check for a subject string you must pass this option to
pcre2_match() or pcre2_dfa_match().
to disable the check for a subject string you must pass this same
option to pcre2_match() or pcre2_dfa_match().
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely.
UTF-16 and UTF-32 strings can indicate their endianness by a special code
known as a byte-order mark (BOM). The PCRE2 functions do not handle
this, expecting strings to be in host byte order.
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
the error that is given if an escape sequence for an invalid Unicode
code point is encountered in the pattern. If you want to allow escape
sequences such as \x{d800} (a surrogate code point) you can set the
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
other processing takes place. In the case of pcre2_match() and
pcre2_dfa_match() calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the
first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the
starting offset. Otherwise, it starts at the length of the longest
lookbehind before the starting offset, or at the start of the subject
if there are not that many characters before the starting offset. Note
that the sequences \b and \B are one-character lookbehinds.
In addition to checking the format of the string, there is a check to
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
the surrogate area. The so-called "non-character" code points are not
excluded because Unicode corrigendum #9 makes it clear that they should
not be.
Characters in the "Surrogate Area" of Unicode are reserved for use by
UTF-16, where they are used in pairs to encode code points with values
greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 and UTF-32 encodings. (In
other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8 and UTF-32.)
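The code point rules in the two paragraphs above can be written out as a short C sketch. The function names are hypothetical and are not part of PCRE2. The sketch only restates the documented rules: values up to U+10FFFF are accepted, surrogates are rejected, non-characters are not rejected, and a UTF-16 surrogate pair combines into a code point above 0xFFFF.

```c
#include <stdint.h>

/* Returns 1 if cp is acceptable in a UTF string under the rules above:
   within U+0..U+10FFFF and not in the surrogate area U+D800..U+DFFF.
   Non-character code points such as U+FFFE are deliberately accepted. */
int code_point_ok(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return 1;
}

/* Combines a UTF-16 high surrogate (0xD800..0xDBFF) and low surrogate
   (0xDC00..0xDFFF) into the code point they encode. Returns 0xFFFFFFFF
   if the two values do not form a valid pair. */
uint32_t utf16_pair_to_code_point(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF) return 0xFFFFFFFFu;
    if (low < 0xDC00 || low > 0xDFFF) return 0xFFFFFFFFu;
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                    + ((uint32_t)low - 0xDC00u);
}
```

The same code point that UTF-16 encodes as the pair D83D DE00 is stored as the single value 0x1F600 in UTF-32, which is why the surrogate values themselves never need to appear in UTF-8 or UTF-32 strings.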
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
that is given if an escape sequence for an invalid Unicode code point
is encountered in the pattern. If you want to allow escape sequences
such as \x{d800} (a surrogate code point) you can set the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
resentable in UTF-16.
@ -11079,8 +11106,8 @@ AUTHOR
REVISION
Last updated: 03 February 2019
Last updated: 06 March 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2_JIT_COMPILE 3 "21 October 2014" "PCRE2 10.00"
.TH PCRE2_JIT_COMPILE 3 "06 March 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@ -29,6 +29,7 @@ bits:
PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
PCRE2_JIT_INVALID_UTF compile code to handle invalid UTF
.sp
The yield of the function is 0 for success, or a negative error code otherwise.
In particular, PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or


@ -1,4 +1,4 @@
.TH PCRE2JIT 3 "16 October 2018" "PCRE2 10.33"
.TH PCRE2JIT 3 "06 March 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@ -120,6 +120,28 @@ support is not available, or the pattern was not processed by
pattern.
.
.
.SH "MATCHING SUBJECTS CONTAINING INVALID UTF"
.rs
.sp
When a pattern is compiled with the PCRE2_UTF option, the interpretive matching
function expects its subject string to be a valid sequence of UTF code units.
If it is not, the result is undefined. This is also true by default of matching
via JIT. However, if the option PCRE2_JIT_INVALID_UTF is passed to
\fBpcre2_jit_compile()\fP, code that can process a subject containing invalid
UTF is compiled.
.P
In this mode, an invalid code unit sequence never matches any pattern item. It
does not match dot, it does not match \ep{Any}, it does not even match negative
items such as [^X]. A lookbehind assertion fails if it encounters an invalid
sequence while moving the current point backwards. In other words, an invalid
UTF code unit sequence acts as a barrier which no match can cross. Reaching an
invalid sequence causes an immediate backtrack.
.P
Using this option, an application can run matches in arbitrary data, knowing
that any matched strings that are returned will be valid UTF. This can be
useful when searching for text in executable or other binary files.
.
.
.SH "UNSUPPORTED OPTIONS AND PATTERN ITEMS"
.rs
.sp
@ -416,6 +438,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 16 October 2018
Copyright (c) 1997-2018 University of Cambridge.
Last updated: 06 March 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "03 February 2019" "PCRE2 10.33"
.TH PCRE2UNICODE 3 "06 March 2019" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -230,23 +230,46 @@ adjacent characters.
.rs
.sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions.
If an invalid UTF string is passed, a negative error code is returned. The
code unit offset to the offending character can be extracted from the match
data block by calling \fBpcre2_get_startchar()\fP, which is used for this
purpose after a UTF error.
are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
error.
.P
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is usually undefined and your program may crash or loop indefinitely. There is,
however, one mode of matching that can handle invalid UTF subject strings. This
is matching via the JIT optimization using the PCRE2_JIT_INVALID_UTF option
when calling \fBpcre2_jit_compile()\fP. For details, see the
.\" HREF
\fBpcre2jit\fP
.\"
documentation.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this same option to
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
.P
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
.P
A UTF string is checked before any other processing takes place. In the case of
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP calls with a non-zero starting
offset, the check is applied only to that part of the subject that could be
inspected during matching, and there is a check that the starting offset points
to the first code unit of a character or to the end of the subject. If there
are no lookbehind assertions in the pattern, the check starts at the starting
offset. Otherwise, it starts at the length of the longest lookbehind before the
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
processing takes place. In the case of \fBpcre2_match()\fP and
\fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
@ -263,28 +286,12 @@ independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
UTF-32.)
.P
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the check for
the pattern; it does not also apply to subject strings. If you want to disable
the check for a subject string you must pass this option to \fBpcre2_match()\fP
or \fBpcre2_dfa_match()\fP.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
.P
Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
that is given if an escape sequence for an invalid Unicode code point is
encountered in the pattern. If you want to allow escape sequences such as
\ex{d800} (a surrogate code point) you can set the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
only in UTF-8 and UTF-32 modes, because these values are not representable in
UTF-16.
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
given if an escape sequence for an invalid Unicode code point is encountered in
the pattern. If you want to allow escape sequences such as \ex{d800} (a
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
values are not representable in UTF-16.
.
.
.\" HTML <a name="utf8strings"></a>
@ -393,6 +400,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 03 February 2019
Last updated: 06 March 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi