2014-09-23 13:35:51 +02:00
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<title>pcre2unicode specification</title>
|
|
|
|
</head>
|
|
|
|
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
|
|
|
<h1>pcre2unicode man page</h1>
|
|
|
|
<p>
|
|
|
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
This page is part of the PCRE2 HTML documentation. It was generated
|
|
|
|
automatically from the original man page. If there is any nonsense in it,
|
|
|
|
please consult the man page, in case the conversion went wrong.
|
|
|
|
<br>
|
|
|
|
<br><b>
|
|
|
|
UNICODE AND UTF SUPPORT
|
|
|
|
</b><br>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
PCRE2 is normally built with Unicode support, though if you do not need it, you
|
|
|
|
can build it without, in which case the library will be smaller. With Unicode
|
|
|
|
support, PCRE2 has knowledge of Unicode character properties and can process
|
2020-02-24 17:35:15 +01:00
|
|
|
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
|
2019-05-24 19:15:48 +02:00
|
|
|
width), but this is not the default. Unless specifically requested, PCRE2
|
|
|
|
treats each code unit in a string as one character.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-10-17 18:39:38 +02:00
|
|
|
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
|
|
|
|
consist of more than one code unit and the range of values is constrained. The
|
2019-05-24 19:15:48 +02:00
|
|
|
program can call
|
2014-09-23 13:35:51 +02:00
|
|
|
<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
|
2019-05-24 19:15:48 +02:00
|
|
|
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
|
|
|
|
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
|
2019-10-17 18:39:38 +02:00
|
|
|
That is, the programmer can prevent the supplier of the pattern from switching
|
2019-05-24 19:15:48 +02:00
|
|
|
to UTF mode.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
Note that the PCRE2_MATCH_INVALID_UTF option (see
|
|
|
|
<a href="#matchinvalid">below)</a>
|
|
|
|
forces PCRE2_UTF to be set.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
In UTF mode, both the pattern and any subject strings that are matched against
|
|
|
|
it are treated as UTF strings instead of strings of individual one-code-unit
|
|
|
|
characters. There are also some other changes to the way characters are
|
|
|
|
handled, as documented below.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
UNICODE PROPERTY SUPPORT
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
When PCRE2 is built with Unicode support, the escape sequences \p{..},
|
2019-02-06 19:11:36 +01:00
|
|
|
\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
|
|
|
|
The Unicode properties that can be tested are limited to the general category
|
|
|
|
properties such as Lu for an upper case letter or Nd for a decimal number, the
|
|
|
|
Unicode script names such as Arabic or Han, and the derived properties Any and
|
|
|
|
L&. Full lists are given in the
|
2014-09-23 13:35:51 +02:00
|
|
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
|
|
|
and
|
|
|
|
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
|
|
|
documentation. Only the short names for properties are supported. For example,
|
|
|
|
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported.
|
|
|
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
|
2017-05-17 19:55:11 +02:00
|
|
|
compatibility with Perl 5.6. PCRE2 does not support this.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
WIDE CHARACTERS AND UTF MODES
|
|
|
|
</b><br>
|
|
|
|
<P>
|
2018-06-17 16:13:28 +02:00
|
|
|
Code points less than 256 can be specified in patterns by either braced or
|
2014-09-23 13:35:51 +02:00
|
|
|
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger
|
|
|
|
values have to use braced sequences. Unbraced octal code points up to \777 are
|
|
|
|
also recognized; larger ones can be coded using \o{...}.
|
|
|
|
</P>
|
|
|
|
<P>
|
2018-09-11 16:27:39 +02:00
|
|
|
The escape sequence \N{U+<hex digits>} is recognized as another way of
|
|
|
|
specifying a Unicode character by code point in a UTF mode. It is not allowed
|
2019-05-24 19:15:48 +02:00
|
|
|
in non-UTF mode.
|
2018-09-11 16:27:39 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
|
2014-09-23 13:35:51 +02:00
|
|
|
individual code units.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
In UTF mode, the dot metacharacter matches one UTF character instead of a
|
2014-09-23 13:35:51 +02:00
|
|
|
single code unit.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
In UTF mode, capture group names are not restricted to ASCII, and may contain
|
2019-02-06 19:11:36 +01:00
|
|
|
any Unicode letters and decimal digits, as well as underscore.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
The escape sequence \C can be used to match a single code unit in UTF mode,
|
2014-09-23 13:35:51 +02:00
|
|
|
but its use can lead to some strange effects because it breaks up multi-unit
|
|
|
|
characters (see the description of \C in the
|
|
|
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
2019-03-04 19:07:04 +01:00
|
|
|
documentation). For this reason, there is a build-time option that disables
|
2019-02-06 19:11:36 +01:00
|
|
|
support for \C completely. There is also a less draconian compile-time option
|
|
|
|
for locking out the use of \C when a pattern is compiled.
|
2016-07-29 10:47:39 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
The use of \C is not supported by the alternative matching function
|
|
|
|
<b>pcre2_dfa_match()</b> when in UTF-8 or UTF-16 mode, that is, when a character
|
|
|
|
may consist of more than one code unit. The use of \C in these modes provokes
|
|
|
|
a match-time error. Also, the JIT optimization does not support \C in these
|
|
|
|
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
|
|
|
|
contains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
|
2019-05-24 19:15:48 +02:00
|
|
|
the matching will be carried out by the interpretive function.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
|
|
|
|
characters of any code value, but, by default, the characters that PCRE2
|
|
|
|
recognizes as digits, spaces, or word characters remain the same set as in
|
|
|
|
non-UTF mode, all with code points less than 256. This remains true even when
|
|
|
|
PCRE2 is built to include Unicode support, because to do otherwise would slow
|
|
|
|
down matching in many common cases. Note that this also applies to \b
|
|
|
|
and \B, because they are defined in terms of \w and \W. If you want
|
|
|
|
to test for a wider sense of, say, "digit", you can use explicit Unicode
|
|
|
|
property tests such as \p{Nd}. Alternatively, if you set the PCRE2_UCP option,
|
|
|
|
the way that the character escapes work is changed so that Unicode properties
|
|
|
|
are used to determine which characters match. There are more details in the
|
|
|
|
section on
|
|
|
|
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
|
|
|
in the
|
|
|
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
|
|
|
documentation.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
Similarly, characters that match the POSIX named character classes are all
|
|
|
|
low-valued characters, unless the PCRE2_UCP option is set.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
However, the special horizontal and vertical white space matching escapes (\h,
|
|
|
|
\H, \v, and \V) do match all the appropriate Unicode characters, whether or
|
|
|
|
not PCRE2_UCP is set.
|
|
|
|
</P>
|
2017-05-17 19:55:11 +02:00
|
|
|
<br><b>
|
2020-02-24 17:35:15 +01:00
|
|
|
UNICODE CASE-EQUIVALENCE
|
2017-05-17 19:55:11 +02:00
|
|
|
</b><br>
|
2014-09-23 13:35:51 +02:00
|
|
|
<P>
|
2020-02-24 17:35:15 +01:00
|
|
|
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
|
|
|
|
of Unicode properties except for characters whose code points are less than 128
|
|
|
|
and that have at most two case-equivalent values. For these, a direct table
|
|
|
|
lookup is used for speed. A few Unicode characters such as Greek sigma have
|
|
|
|
more than two code points that are case-equivalent, and these are treated
|
|
|
|
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
|
|
|
|
processing for non-UTF character encodings such as UCS-2.
|
2018-10-12 19:02:34 +02:00
|
|
|
<a name="scriptruns"></a></P>
|
|
|
|
<br><b>
|
|
|
|
SCRIPT RUNS
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
|
|
|
|
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
|
|
|
|
parentheses is a script run. In concept, a script run is a sequence of
|
|
|
|
characters that are all from the same Unicode script. However, because some
|
|
|
|
scripts are commonly used together, and because some diacritical and other
|
|
|
|
marks are used with multiple scripts, it is not that simple.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-04 19:07:04 +01:00
|
|
|
Every Unicode character has a Script property, mostly with a value
|
2018-10-12 19:02:34 +02:00
|
|
|
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
|
|
|
|
are also three special values:
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
"Unknown" is used for code points that have not been assigned, and also for the
|
|
|
|
surrogate code points. In the PCRE2 32-bit library, characters whose code
|
2019-03-04 19:07:04 +01:00
|
|
|
points are greater than the Unicode maximum (U+10FFFF), which are accessible
|
2018-10-12 19:02:34 +02:00
|
|
|
only in non-UTF mode, are assigned the Unknown script.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
"Common" is used for characters that are used with many scripts. These include
|
|
|
|
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
|
|
|
|
digits 0 to 9.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
"Inherited" is used for characters such as diacritical marks that modify a
|
|
|
|
previous character. These are considered to take on the script of the character
|
|
|
|
that they modify.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-04 19:07:04 +01:00
|
|
|
Some Inherited characters are used with many scripts, but many of them are only
|
|
|
|
normally used with a small number of scripts. For example, U+102E0 (Coptic
|
|
|
|
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
|
|
|
|
possible to check this, a Unicode property called Script Extension exists. Its
|
|
|
|
value is a list of scripts that apply to the character. For the majority of
|
2018-10-12 19:02:34 +02:00
|
|
|
characters, the list contains just one script, the same one as the Script
|
|
|
|
property. However, for characters such as U+102E0 more than one Script is
|
|
|
|
listed. There are also some Common characters that have a single, non-Common
|
|
|
|
script in their Script Extension list.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-04 19:07:04 +01:00
|
|
|
The next section describes the basic rules for deciding whether a given string
|
|
|
|
of characters is a script run. Note, however, that there are some special cases
|
|
|
|
involving the Chinese Han script, and an additional constraint for decimal
|
2018-10-12 19:02:34 +02:00
|
|
|
digits. These are covered in subsequent sections.
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
Basic script run rules
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
A string that is less than two characters long is a script run. This is the
|
|
|
|
only case in which an Unknown character can be part of a script run. Longer
|
|
|
|
strings are checked using only the Script Extensions property, not the basic
|
|
|
|
Script property.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
If a character's Script Extension property is the single value "Inherited", it
|
|
|
|
is always accepted as part of a script run. This is also true for the property
|
|
|
|
"Common", subject to the checking of decimal digits described below. All the
|
|
|
|
remaining characters in a script run must have at least one script in common in
|
|
|
|
their Script Extension lists. In set-theoretic terminology, the intersection of
|
|
|
|
all the sets of scripts must not be empty.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
A simple example is an Internet name such as "google.com". The letters are all
|
|
|
|
in the Latin script, and the dot is Common, so this string is a script run.
|
2019-03-04 19:07:04 +01:00
|
|
|
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
|
2018-10-12 19:02:34 +02:00
|
|
|
string that looks the same, but with Cyrillic "o"s is not a script run.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-04 19:07:04 +01:00
|
|
|
More interesting examples involve characters with more than one script in their
|
2018-10-12 19:02:34 +02:00
|
|
|
Script Extension. Consider the following characters:
|
|
|
|
<pre>
|
|
|
|
U+060C Arabic comma
|
|
|
|
U+06D4 Arabic full stop
|
|
|
|
</pre>
|
2019-03-04 19:07:04 +01:00
|
|
|
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
|
2018-10-12 19:02:34 +02:00
|
|
|
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
|
|
|
|
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
|
|
|
|
appear in Syriac or Thaana script runs, but the second could not.
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
The Chinese Han script
|
|
|
|
</b><br>
|
|
|
|
<P>
|
2019-03-04 19:07:04 +01:00
|
|
|
The Chinese Han script is commonly used in conjunction with other scripts for
|
|
|
|
writing certain languages. Japanese uses the Hiragana and Katakana scripts
|
2018-10-12 19:02:34 +02:00
|
|
|
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
|
|
|
|
and Han. These three combinations are treated as special cases when checking
|
|
|
|
script runs and are, in effect, "virtual scripts". Thus, a script run may
|
|
|
|
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
|
|
|
|
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
|
|
|
|
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
|
|
|
|
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
|
|
|
|
in allowing such mixtures.
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
Decimal digits
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
Unicode contains many sets of 10 decimal digits in different scripts, and some
|
|
|
|
scripts (including the Common script) contain more than one set. Some of these
|
|
|
|
decimal digits them are visually indistinguishable from the common ASCII
|
|
|
|
digits. In addition to the script checking described above, if a script run
|
|
|
|
contains any decimal digits, they must all come from the same set of 10
|
|
|
|
adjacent characters.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
VALIDITY OF UTF STRINGS
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
|
2019-03-06 18:38:20 +01:00
|
|
|
are (by default) checked for validity on entry to the relevant functions. If an
|
2019-05-24 19:15:48 +02:00
|
|
|
invalid UTF string is passed, a negative error code is returned. The code unit
|
2019-03-06 18:38:20 +01:00
|
|
|
offset to the offending character can be extracted from the match data block by
|
|
|
|
calling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
|
|
|
|
error.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
In some situations, you may already know that your strings are valid, and
|
|
|
|
therefore want to skip these checks in order to improve performance, for
|
|
|
|
example in the case of a long subject string that is being scanned repeatedly.
|
|
|
|
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
|
|
|
|
PCRE2 assumes that the pattern or subject it is given (respectively) contains
|
|
|
|
only valid UTF code unit sequences.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
|
2019-05-24 19:15:48 +02:00
|
|
|
is undefined and your program may crash or loop indefinitely or give incorrect
|
|
|
|
results. There is, however, one mode of matching that can handle invalid UTF
|
|
|
|
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
|
|
|
|
<b>pcre2_compile()</b> and is discussed below in the next section. The rest of
|
|
|
|
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
|
2019-03-06 18:38:20 +01:00
|
|
|
</P>
|
|
|
|
<P>
|
2019-05-24 19:15:48 +02:00
|
|
|
Passing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
|
|
|
|
for the pattern; it does not also apply to subject strings. If you want to
|
|
|
|
disable the check for a subject string you must pass this same option to
|
2019-03-06 18:38:20 +01:00
|
|
|
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
UTF-16 and UTF-32 strings can indicate their endianness by special code knows
|
|
|
|
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
|
|
|
|
strings to be in host byte order.
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-06 18:38:20 +01:00
|
|
|
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
|
|
|
|
processing takes place. In the case of <b>pcre2_match()</b> and
|
|
|
|
<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
|
|
|
|
applied only to that part of the subject that could be inspected during
|
|
|
|
matching, and there is a check that the starting offset points to the first
|
|
|
|
code unit of a character or to the end of the subject. If there are no
|
|
|
|
lookbehind assertions in the pattern, the check starts at the starting offset.
|
|
|
|
Otherwise, it starts at the length of the longest lookbehind before the
|
2015-08-30 19:47:36 +02:00
|
|
|
starting offset, or at the start of the subject if there are not that many
|
|
|
|
characters before the starting offset. Note that the sequences \b and \B are
|
|
|
|
one-character lookbehinds.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
In addition to checking the format of the string, there is a check to ensure
|
|
|
|
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
|
|
|
|
area. The so-called "non-character" code points are not excluded because
|
|
|
|
Unicode corrigendum #9 makes it clear that they should not be.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
|
|
|
|
where they are used in pairs to encode code points with values greater than
|
|
|
|
0xFFFF. The code points that are encoded by UTF-16 pairs are available
|
|
|
|
independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
|
|
|
|
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
|
|
|
|
UTF-32.)
|
|
|
|
</P>
|
|
|
|
<P>
|
2019-03-06 18:38:20 +01:00
|
|
|
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
|
|
|
|
given if an escape sequence for an invalid Unicode code point is encountered in
|
|
|
|
the pattern. If you want to allow escape sequences such as \x{d800} (a
|
|
|
|
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
|
|
|
|
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
|
|
|
|
values are not representable in UTF-16.
|
2014-09-23 13:35:51 +02:00
|
|
|
<a name="utf8strings"></a></P>
|
|
|
|
<br><b>
|
|
|
|
Errors in UTF-8 strings
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
The following negative error codes are given for invalid UTF-8 strings:
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR1
|
|
|
|
PCRE2_ERROR_UTF8_ERR2
|
|
|
|
PCRE2_ERROR_UTF8_ERR3
|
|
|
|
PCRE2_ERROR_UTF8_ERR4
|
|
|
|
PCRE2_ERROR_UTF8_ERR5
|
|
|
|
</pre>
|
|
|
|
The string ends with a truncated UTF-8 character; the code specifies how many
|
|
|
|
bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
|
|
|
|
no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
|
|
|
|
allows for up to 6 bytes, and this is checked first; hence the possibility of
|
|
|
|
4 or 5 missing bytes.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR6
|
|
|
|
PCRE2_ERROR_UTF8_ERR7
|
|
|
|
PCRE2_ERROR_UTF8_ERR8
|
|
|
|
PCRE2_ERROR_UTF8_ERR9
|
|
|
|
PCRE2_ERROR_UTF8_ERR10
|
|
|
|
</pre>
|
|
|
|
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
|
|
|
|
character do not have the binary value 0b10 (that is, either the most
|
|
|
|
significant bit is 0, or the next bit is 1).
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR11
|
|
|
|
PCRE2_ERROR_UTF8_ERR12
|
|
|
|
</pre>
|
|
|
|
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
|
|
|
|
these code points are excluded by RFC 3629.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR13
|
|
|
|
</pre>
|
2019-05-24 19:15:48 +02:00
|
|
|
A 4-byte character has a value greater than 0x10ffff; these code points are
|
2014-09-23 13:35:51 +02:00
|
|
|
excluded by RFC 3629.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR14
|
|
|
|
</pre>
|
|
|
|
A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
|
|
|
|
code points are reserved by RFC 3629 for use with UTF-16, and so are excluded
|
|
|
|
from UTF-8.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR15
|
|
|
|
PCRE2_ERROR_UTF8_ERR16
|
|
|
|
PCRE2_ERROR_UTF8_ERR17
|
|
|
|
PCRE2_ERROR_UTF8_ERR18
|
|
|
|
PCRE2_ERROR_UTF8_ERR19
|
|
|
|
</pre>
|
|
|
|
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
|
|
|
|
value that can be represented by fewer bytes, which is invalid. For example,
|
|
|
|
the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
|
|
|
|
one byte.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR20
|
|
|
|
</pre>
|
|
|
|
The two most significant bits of the first byte of a character have the binary
|
|
|
|
value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
|
|
|
|
byte can only validly occur as the second or subsequent byte of a multi-byte
|
|
|
|
character.
|
|
|
|
<pre>
|
|
|
|
PCRE2_ERROR_UTF8_ERR21
|
|
|
|
</pre>
|
|
|
|
The first byte of a character has the value 0xfe or 0xff. These values can
|
|
|
|
never occur in a valid UTF-8 string.
|
|
|
|
<a name="utf16strings"></a></P>
|
|
|
|
<br><b>
|
|
|
|
Errors in UTF-16 strings
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
The following negative error codes are given for invalid UTF-16 strings:
|
|
|
|
<pre>
|
2016-07-29 10:47:39 +02:00
|
|
|
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
|
|
|
|
PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
|
|
|
|
PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
|
2014-09-23 13:35:51 +02:00
|
|
|
|
|
|
|
<a name="utf32strings"></a></PRE>
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
Errors in UTF-32 strings
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
The following negative error codes are given for invalid UTF-32 strings:
|
|
|
|
<pre>
|
2016-07-29 10:47:39 +02:00
|
|
|
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
|
|
|
|
PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
|
2014-09-23 13:35:51 +02:00
|
|
|
|
2019-05-24 19:15:48 +02:00
|
|
|
<a name="matchinvalid"></a></PRE>
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
MATCHING IN INVALID UTF STRINGS
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
You can run pattern matches on subject strings that may contain invalid UTF
|
|
|
|
sequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
|
|
|
|
option. This is supported by <b>pcre2_match()</b>, including JIT matching, but
|
|
|
|
not by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
|
2019-10-17 18:39:38 +02:00
|
|
|
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
|
2019-05-24 19:15:48 +02:00
|
|
|
valid UTF string.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
|
|
|
|
generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
|
|
|
|
generate different code. If JIT is not used, the option affects the behaviour
|
|
|
|
of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
|
|
|
|
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
In this mode, an invalid code unit sequence in the subject never matches any
|
|
|
|
pattern item. It does not match dot, it does not match \p{Any}, it does not
|
|
|
|
even match negative items such as [^X]. A lookbehind assertion fails if it
|
|
|
|
encounters an invalid sequence while moving the current point backwards. In
|
|
|
|
other words, an invalid UTF code unit sequence acts as a barrier which no match
|
|
|
|
can cross.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
You can also think of this as the subject being split up into fragments of
|
|
|
|
valid UTF, delimited internally by invalid code unit sequences. The pattern is
|
|
|
|
matched fragment by fragment. The result of a successful match, however, is
|
|
|
|
given as code unit offsets in the entire subject string in the usual way. There
|
|
|
|
are a few points to consider:
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
The internal boundaries are not interpreted as the beginnings or ends of lines
|
|
|
|
and so do not match circumflex or dollar characters in the pattern.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
If <b>pcre2_match()</b> is called with an offset that points to an invalid
|
|
|
|
UTF-sequence, that sequence is skipped, and the match starts at the next valid
|
|
|
|
UTF character, or the end of the subject.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
At internal fragment boundaries, \b and \B behave in the same way as at the
|
2019-10-17 18:39:38 +02:00
|
|
|
beginning and end of the subject. For example, a sequence such as \bWORD\b
|
2019-05-24 19:15:48 +02:00
|
|
|
would match an instance of WORD that is surrounded by invalid UTF code units.
|
|
|
|
</P>
|
|
|
|
<P>
|
|
|
|
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
|
|
|
|
data, knowing that any matched strings that are returned are valid UTF. This
|
|
|
|
can be useful when searching for UTF text in executable or other binary files.
|
2014-09-23 13:35:51 +02:00
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
AUTHOR
|
|
|
|
</b><br>
|
|
|
|
<P>
|
|
|
|
Philip Hazel
|
|
|
|
<br>
|
|
|
|
University Computing Service
|
|
|
|
<br>
|
2014-11-21 17:45:06 +01:00
|
|
|
Cambridge, England.
|
2014-09-23 13:35:51 +02:00
|
|
|
<br>
|
|
|
|
</P>
|
|
|
|
<br><b>
|
|
|
|
REVISION
|
|
|
|
</b><br>
|
|
|
|
<P>
|
2020-02-24 17:35:15 +01:00
|
|
|
Last updated: 23 February 2020
|
2014-09-23 13:35:51 +02:00
|
|
|
<br>
|
2020-02-24 17:35:15 +01:00
|
|
|
Copyright © 1997-2020 University of Cambridge.
|
2014-09-23 13:35:51 +02:00
|
|
|
<br>
|
|
|
|
<p>
|
|
|
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
|
|
|
</p>
|