Update grapheme breaking rules for Unicode 10.0.0.
parent 41bb787fb3
commit 4f7a608d56
@@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.
 
 48. Add the callout_no_where modifier to pcre2test.
 
+49. Update extended grapheme breaking rules to the latest set that are in
+Unicode Standard Annex #29.
+
 
 Version 10.23 14-February-2017
 ------------------------------

@@ -177,7 +177,7 @@ The pcre2_match() function contains a counter that is incremented every time it
 goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 </P>
 <P>
@@ -198,7 +198,7 @@ limits set by the programmer, but not raise them. If there is more than one
 setting of one of these limits, the lower value is used.
 </P>
 <P>
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 </P>
 <P>
@@ -233,7 +233,7 @@ string with one of the following sequences:
 (*CRLF) carriage return, followed by linefeed
 (*ANYCRLF) any of the three above
 (*ANY) all Unicode newline sequences
-(*NUL) the NUL character (binary zero)
+(*NUL) the NUL character (binary zero)
 </pre>
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@@ -249,7 +249,7 @@ The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
 what the \R escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \R in the section entitled
 <a href="#newlineseq">"Newline sequences"</a>
 below. A change of \R setting can be combined with a change of newline
@@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
 <a href="#atomicgroup">(see below).</a>
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \X always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+</P>
+<P>
+\X always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 </P>
 <P>
 1. End at the end of the subject string.
@@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be followed only by a T character.
 </P>
 <P>
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 </P>
 <P>
 5. Do not end after prepend characters.
 </P>
 <P>
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+</P>
+<P>
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+</P>
+<P>
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+</P>
+<P>
 6. Otherwise, end the cluster.
 <a name="extraprops"></a></P>
 <br><b>
@@ -1562,15 +1579,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
 <pre>
 i for PCRE2_CASELESS
 m for PCRE2_MULTILINE
-n for PCRE2_NO_AUTO_CAPTURE
+n for PCRE2_NO_AUTO_CAPTURE
 s for PCRE2_DOTALL
 x for PCRE2_EXTENDED
 xx for PCRE2_EXTENDED_MORE
 </pre>
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended"
-options are not independent; unsetting either one cancels the effects of both
-of them.
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 </P>
 <P>
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
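The option-letter rules quoted in this hunk can be sketched in a few lines of C. This is only an illustration under stated assumptions: the OPT_* constants and the function apply_option_letters() are hypothetical names invented here, the bit values are arbitrary, and treating "xx" as a single flag is a simplification. It is not PCRE2's parser.

```c
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the real PCRE2_* values. */
enum {
  OPT_CASELESS        = 0x01,  /* i  */
  OPT_MULTILINE       = 0x02,  /* m  */
  OPT_NO_AUTO_CAPTURE = 0x04,  /* n  */
  OPT_DOTALL          = 0x08,  /* s  */
  OPT_EXTENDED        = 0x10,  /* x  */
  OPT_EXTENDED_MORE   = 0x20   /* xx */
};

/* Applies the letters found between "(?" and ")" to *options.
   Letters before a hyphen are set, letters after it are unset.
   Returns 0 on success, -1 on an unknown letter (options untouched). */
static int apply_option_letters(const char *s, uint32_t *options)
{
  uint32_t set = 0, unset = 0;
  int unsetting = 0;

  for (; *s != '\0'; s++)
    {
    uint32_t bit;
    switch (*s)
      {
      case '-': unsetting = 1; continue;
      case 'i': bit = OPT_CASELESS; break;
      case 'm': bit = OPT_MULTILINE; break;
      case 'n': bit = OPT_NO_AUTO_CAPTURE; break;
      case 's': bit = OPT_DOTALL; break;
      case 'x':
        if (s[1] == 'x') { bit = OPT_EXTENDED_MORE; s++; }
        else bit = OPT_EXTENDED;
        break;
      default: return -1;
      }

    if (unsetting)
      {
      /* Unsetting either "extended" option cancels both of them. */
      if (bit == OPT_EXTENDED || bit == OPT_EXTENDED_MORE)
        bit = OPT_EXTENDED | OPT_EXTENDED_MORE;
      unset |= bit;
      }
    else set |= bit;
    }

  *options = (*options | set) & ~unset;
  return 0;
}
```

Calling apply_option_letters("im-sx", &opts) on a value that already has the dotall and both extended bits set leaves only the caseless and multiline bits, which is the combined set-and-unset behaviour the text describes.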
@@ -2249,14 +2266,14 @@ capturing subpatterns within it, these are counted for the purposes of
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 </P>
 <P>
 No capturing is done for a negative assertion unless it is being used as a
 condition in a
 <a href="#subpatternsassubroutines">conditional subpattern</a>
-(see the discussion below). Matching continues after a non-conditional negative
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 </P>
 <P>
@@ -2824,14 +2841,14 @@ if it contained untried alternatives and there was a subsequent matching
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 </P>
 <P>
-Starting with release 10.30, recursive subroutine calls are no longer treated
-as atomic. That is, they can be re-entered to try unused alternatives if there
-is a matching failure later in the pattern. This is now compatible with the way
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 </P>
 <P>
-Supporting backtracking into recursions simplifies certain types of recursive
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 <pre>
 ^((.)(?1)\2|.?)$
@@ -2863,7 +2880,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 <a name="subpatternsassubroutines"></a></P>
 <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
@@ -3398,17 +3415,17 @@ processing; captured substrings are discarded.
 </P>
 <P>
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 </P>
 <P>
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 </P>
 <P>
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion,
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@@ -3455,7 +3472,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 <br>
 Copyright © 1997-2017 University of Cambridge.
 <br>

@@ -6433,27 +6433,41 @@ BACKSLASH
 (see below). Unicode supports various kinds of composite character by
 giving each character a grapheme breaking property, and having rules
 that use these properties to define the boundaries of extended grapheme
-clusters. \X always matches at least one character. Then it decides
-whether to add additional characters according to the following rules
-for ending a cluster:
+clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
+Text Segmentation".
+
+\X always matches at least one character. Then it decides whether to
+add additional characters according to the following rules for ending a
+cluster:
 
 1. End at the end of the subject string.
 
-2. Do not end between CR and LF; otherwise end after any control char-
+2. Do not end between CR and LF; otherwise end after any control char-
 acter.
 
-3. Do not break Hangul (a Korean script) syllable sequences. Hangul
-characters are of five types: L, V, T, LV, and LVT. An L character may
-be followed by an L, V, LV, or LVT character; an LV or V character may
+3. Do not break Hangul (a Korean script) syllable sequences. Hangul
+characters are of five types: L, V, T, LV, and LVT. An L character may
+be followed by an L, V, LV, or LVT character; an LV or V character may
 be followed by a V or T character; an LVT or T character may be followed
 only by a T character.
 
-4. Do not end before extending characters or spacing marks. Characters
-with the "mark" property always have the "extend" grapheme breaking
-property.
+4. Do not end before extending characters or spacing marks or the
+"zero-width joiner" characters. Characters with the "mark" property
+always have the "extend" grapheme breaking property.
 
 5. Do not end after prepend characters.
 
+6. Do not break within emoji modifier sequences (a base character fol-
+lowed by a modifier). Extending characters are allowed before the modi-
+fier.
+
+7. Do not break within emoji zwj sequences (zero-width joiner followed
+by "glue after ZWJ" or "base glue after ZWJ").
+
+8. Do not break within emoji flag sequences. That is, do not break
+between regional indicator (RI) characters if there are an odd number
+of RI characters before the break point.
+
 6. Otherwise, end the cluster.
 
 PCRE2's additional properties
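Rule 3 above is in effect a small lookup over pairs of Hangul syllable types. The following is a minimal sketch of that rule alone, assuming a hypothetical hangul_type enumeration and function name; PCRE2 itself encodes this in its grapheme break table rather than in a function like this.

```c
/* Hypothetical names for the five Hangul syllable types in rule 3. */
enum hangul_type { H_L, H_V, H_T, H_LV, H_LVT };

/* Returns 1 if rule 3 forbids ending the cluster between left and right,
   or 0 if rule 3 does not keep the two characters together. */
static int hangul_no_break(enum hangul_type left, enum hangul_type right)
{
  switch (left)
    {
    case H_L:    /* L may be followed by L, V, LV, or LVT */
      return right == H_L || right == H_V ||
             right == H_LV || right == H_LVT;
    case H_LV:
    case H_V:    /* LV or V may be followed by V or T */
      return right == H_V || right == H_T;
    case H_LVT:
    case H_T:    /* LVT or T may be followed only by T */
      return right == H_T;
    }
  return 0;
}
```

For instance, an L followed by a V stays in one cluster, while a T followed by a V does not qualify under this rule.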
@@ -8744,7 +8758,7 @@ AUTHOR
 
 REVISION
 
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------
 

@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
+.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -145,7 +145,7 @@ The pcre2_match() function contains a counter that is incremented every time it
 goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 .P
 These facilities are provided to catch runaway matches that are provoked by
@@ -164,7 +164,7 @@ for it to have any effect. In other words, the pattern writer can lower the
 limits set by the programmer, but not raise them. If there is more than one
 setting of one of these limits, the lower value is used.
 .P
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 .P
 The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
@@ -203,7 +203,7 @@ string with one of the following sequences:
 (*CRLF) carriage return, followed by linefeed
 (*ANYCRLF) any of the three above
 (*ANY) all Unicode newline sequences
-(*NUL) the NUL character (binary zero)
+(*NUL) the NUL character (binary zero)
 .sp
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@@ -218,7 +218,7 @@ The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
 what the \eR escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \eR in the section entitled
 .\" HTML <a href="#newlineseq">
 .\" </a>
@@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
 .\"
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \eX always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+.P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 .P
 1. End at the end of the subject string.
 .P
@@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be followed only by a T character.
 .P
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 .P
 5. Do not end after prepend characters.
 .P
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+.P
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+.P
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
 6. Otherwise, end the cluster.
 .
 .
@@ -1560,15 +1573,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
 .sp
 i for PCRE2_CASELESS
 m for PCRE2_MULTILINE
-n for PCRE2_NO_AUTO_CAPTURE
+n for PCRE2_NO_AUTO_CAPTURE
 s for PCRE2_DOTALL
 x for PCRE2_EXTENDED
 xx for PCRE2_EXTENDED_MORE
 .sp
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended"
-options are not independent; unsetting either one cancels the effects of both
-of them.
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 .P
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
 and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
@@ -2256,16 +2269,16 @@ capturing subpatterns within it, these are counted for the purposes of
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 .P
 No capturing is done for a negative assertion unless it is being used as a
 condition in a
 .\" HTML <a href="#subpatternsassubroutines">
 .\" </a>
-conditional subpattern
+conditional subpattern
 .\"
-(see the discussion below). Matching continues after a non-conditional negative
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 .P
 For compatibility with Perl, most assertion subpatterns may be repeated; though
@@ -2846,13 +2859,13 @@ once it had matched some of the subject string, it was never re-entered, even
 if it contained untried alternatives and there was a subsequent matching
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 .P
-Starting with release 10.30, recursive subroutine calls are no longer treated
-as atomic. That is, they can be re-entered to try unused alternatives if there
-is a matching failure later in the pattern. This is now compatible with the way
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 .P
-Supporting backtracking into recursions simplifies certain types of recursive
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 .sp
 ^((.)(?1)\e2|.?)$
@@ -2883,7 +2896,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \e1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \e1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 .
 .
@@ -3427,15 +3440,15 @@ negative assertion, (*ACCEPT) causes the assertion to fail without any further
 processing; captured substrings are discarded.
 .P
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 .P
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 .P
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion,
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@@ -3485,6 +3498,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi

@@ -1379,8 +1379,46 @@ for (;;)
 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
 rgb = UCD_GRAPHBREAK(d);
 if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+/* Not breaking between Regional Indicators is allowed only if
+there are an even number of preceding RIs. */
+
+if (lgb == ucp_gbRegionalIndicator &&
+    rgb == ucp_gbRegionalIndicator)
+  {
+  int ricount = 0;
+  PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+  if (utf) BACKCHAR(bptr);
+#endif
+  /* bptr is pointing to the left-hand character */
+
+  while (bptr > mb->start_subject)
+    {
+    bptr--;
+#ifdef SUPPORT_UNICODE
+    if (utf)
+      {
+      BACKCHAR(bptr);
+      GETCHAR(d, bptr);
+      }
+    else
+#endif
+    d = *bptr;
+    if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+    ricount++;
+    }
+  if ((ricount & 1) != 0) break;  /* Grapheme break required */
+  }
+
+/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+any number of Extend before a following E_Modifier. */
+
+if (rgb != ucp_gbExtend ||
+    (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+  lgb = rgb;
+
 ncount++;
-lgb = rgb;
 nptr += dlen;
 }
 count++;
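The Regional Indicator check added in this hunk can be isolated as a small function over an array of property codes. This is a sketch under stated assumptions: the gb_prop enumeration and ri_break_required() are hypothetical names, and the subject is modelled as one property code per character, so no UTF decoding is involved.

```c
#include <stddef.h>

/* Hypothetical property codes for this sketch; the real values come from
   PCRE2's Unicode tables and are not reproduced here. */
enum gb_prop { GB_OTHER, GB_REGIONAL_INDICATOR };

/* props[pos - 1] and props[pos] are both assumed to be Regional Indicators
   and pos must be at least 1. Returns 1 if a grapheme break is required
   between them, 0 if they must stay in the same cluster. As in the patch,
   the RIs that precede the left-hand character are counted, and an odd
   count forces a break. */
static int ri_break_required(const enum gb_prop *props, size_t pos)
{
  size_t ricount = 0;
  size_t i = pos - 1;              /* index of the left-hand character */

  while (i > 0 && props[i - 1] == GB_REGIONAL_INDICATOR)
    {
    ricount++;
    i--;
    }
  return (ricount & 1) != 0;
}
```

With four consecutive RIs the function reports no break after the first, a break after the second, and no break after the third, so the run splits into two pairs, which is the flag-sequence behaviour rule 8 describes.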
@@ -1641,8 +1679,46 @@ for (;;)
 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
 rgb = UCD_GRAPHBREAK(d);
 if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+/* Not breaking between Regional Indicators is allowed only if
+there are an even number of preceding RIs. */
+
+if (lgb == ucp_gbRegionalIndicator &&
+    rgb == ucp_gbRegionalIndicator)
+  {
+  int ricount = 0;
+  PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+  if (utf) BACKCHAR(bptr);
+#endif
+  /* bptr is pointing to the left-hand character */
+
+  while (bptr > mb->start_subject)
+    {
+    bptr--;
+#ifdef SUPPORT_UNICODE
+    if (utf)
+      {
+      BACKCHAR(bptr);
+      GETCHAR(d, bptr);
+      }
+    else
+#endif
+    d = *bptr;
+    if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+    ricount++;
+    }
+  if ((ricount & 1) != 0) break;  /* Grapheme break required */
+  }
+
+/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+any number of Extend before a following E_Modifier. */
+
+if (rgb != ucp_gbExtend ||
+    (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+  lgb = rgb;
+
 ncount++;
-lgb = rgb;
 nptr += dlen;
 }
 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@@ -1912,8 +1988,46 @@ for (;;)
 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
 rgb = UCD_GRAPHBREAK(d);
 if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+/* Not breaking between Regional Indicators is allowed only if
+there are an even number of preceding RIs. */
+
+if (lgb == ucp_gbRegionalIndicator &&
+    rgb == ucp_gbRegionalIndicator)
+  {
+  int ricount = 0;
+  PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+  if (utf) BACKCHAR(bptr);
+#endif
+  /* bptr is pointing to the left-hand character */
+
+  while (bptr > mb->start_subject)
+    {
+    bptr--;
+#ifdef SUPPORT_UNICODE
+    if (utf)
+      {
+      BACKCHAR(bptr);
+      GETCHAR(d, bptr);
+      }
+    else
+#endif
+    d = *bptr;
+    if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+    ricount++;
+    }
+  if ((ricount & 1) != 0) break;  /* Grapheme break required */
+  }
+
+/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+any number of Extend before a following E_Modifier. */
+
+if (rgb != ucp_gbExtend ||
+    (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+  lgb = rgb;
+
 ncount++;
-lgb = rgb;
 nptr += dlen;
 }
 if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@@ -2102,8 +2216,46 @@ for (;;)
 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
 rgb = UCD_GRAPHBREAK(d);
 if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+/* Not breaking between Regional Indicators is allowed only if
+there are an even number of preceding RIs. */
+
+if (lgb == ucp_gbRegionalIndicator &&
+    rgb == ucp_gbRegionalIndicator)
+  {
+  int ricount = 0;
+  PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+  if (utf) BACKCHAR(bptr);
+#endif
+  /* bptr is pointing to the left-hand character */
+
+  while (bptr > mb->start_subject)
+    {
+    bptr--;
+#ifdef SUPPORT_UNICODE
+    if (utf)
+      {
+      BACKCHAR(bptr);
+      GETCHAR(d, bptr);
+      }
+    else
+#endif
+    d = *bptr;
+    if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+    ricount++;
+    }
+  if ((ricount & 1) != 0) break;  /* Grapheme break required */
+  }
+
+/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+any number of Extend before a following E_Modifier. */
+
+if (rgb != ucp_gbExtend ||
+    (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+  lgb = rgb;
+
 ncount++;
-lgb = rgb;
 nptr += dlen;
 }
 if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@@ -2129,7 +2281,7 @@ for (;;)
 case 0x2029:
 #endif /* Not EBCDIC */
 if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
-/* Fall through */
+/* Fall through */
 
 case CHAR_LF:
 ADD_NEW(state_offset + 1, 0);
@@ -3427,7 +3579,7 @@ for (;;)
 while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
 end_subject = t;
 }
-
+
 /* Anchored: check the first code unit if one is recorded. This may seem
 pointless but it can help in detecting a no match case without scanning for
 the required code unit. */

@@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
 if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
 rgb = UCD_GRAPHBREAK(fc);
 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-lgb = rgb;
+
+/* Not breaking between Regional Indicators is allowed only if there
+are an even number of preceding RIs. */
+
+if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
+  {
+  int ricount = 0;
+  PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+  if (utf) BACKCHAR(bptr);
+#endif
+  /* bptr is pointing to the left-hand character */
+
+  while (bptr > mb->start_subject)
+    {
+    bptr--;
+#ifdef SUPPORT_UNICODE
+    if (utf)
+      {
+      BACKCHAR(bptr);
+      GETCHAR(fc, bptr);
+      }
+    else
+#endif
+    fc = *bptr;
+    if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+    ricount++;
+    }
+  if ((ricount & 1) != 0) break;  /* Grapheme break required */
+  }
+
+/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+any number of Extend before a following E_Modifier. */
+
+if (rgb != ucp_gbExtend ||
+    (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+  lgb = rgb;
+
 Feptr += len;
 }
 }
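The second change in this hunk replaces the unconditional "lgb = rgb" with a conditional update. A minimal sketch of that update in isolation, assuming a hypothetical gb_prop enumeration and a helper named next_lgb() that do not exist in PCRE2:

```c
/* Hypothetical property codes for this sketch; not PCRE2's real values. */
enum gb_prop { GB_OTHER, GB_EXTEND, GB_E_BASE, GB_E_BASE_GAZ, GB_E_MODIFIER };

/* Returns the property to carry forward as the left-hand side of the next
   comparison. When an Extend character follows E_Base or E_Base_GAZ the
   left-hand property is kept, so any number of Extend characters can sit
   between an emoji base and a following E_Modifier. In every other case
   the right-hand property becomes the new left-hand property. */
static enum gb_prop next_lgb(enum gb_prop lgb, enum gb_prop rgb)
{
  if (rgb != GB_EXTEND || (lgb != GB_E_BASE && lgb != GB_E_BASE_GAZ))
    return rgb;
  return lgb;
}
```

Folding next_lgb() over the sequence E_Base, Extend, Extend leaves the left-hand property at E_Base, so the table lookup for a following E_Modifier still sees an emoji base on its left.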
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
          if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
          rgb = UCD_GRAPHBREAK(fc);
          if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
          lgb = rgb;

          /* Not breaking between Regional Indicators is allowed only if
          there are an even number of preceding RIs. */

          if (lgb == ucp_gbRegionalIndicator &&
              rgb == ucp_gbRegionalIndicator)
            {
            int ricount = 0;
            PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
            if (utf) BACKCHAR(bptr);
#endif
            /* bptr is pointing to the left-hand character */

            while (bptr > mb->start_subject)
              {
              bptr--;
#ifdef SUPPORT_UNICODE
              if (utf)
                {
                BACKCHAR(bptr);
                GETCHAR(fc, bptr);
                }
              else
#endif
              fc = *bptr;
              if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
              ricount++;
              }
            if ((ricount & 1) != 0) break;  /* Grapheme break required */
            }

          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
          any number of Extend before a following E_Modifier. */

          if (rgb != ucp_gbExtend ||
              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
            lgb = rgb;

          Feptr += len;
          }
        }
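The effect of not updating lgb when Extend follows E_Base can be seen in a small stand-alone loop. This is only a sketch under stated assumptions: the four-class enumeration, the reduced `gbtable`, and the function `cluster_length` are hypothetical stand-ins for the real ucp_gb values and table, and E_Base_GAZ is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced set of break classes. */
enum gb { GB_OTHER, GB_EXTEND, GB_E_BASE, GB_E_MODIFIER, GB_COUNT };

/* Bit r of gbtable[l] is set when class l may be followed by class r
   without a break. */
static const unsigned gbtable[GB_COUNT] = {
  (1u << GB_EXTEND),                          /* Other */
  (1u << GB_EXTEND),                          /* Extend */
  (1u << GB_EXTEND) | (1u << GB_E_MODIFIER),  /* E_Base */
  (1u << GB_EXTEND)                           /* E_Modifier */
};

/* Return how many items of classes[0..n-1] form the first cluster. */
static size_t cluster_length(const enum gb *classes, size_t n)
{
  size_t i;
  enum gb lgb;

  assert(n > 0);
  lgb = classes[0];
  for (i = 1; i < n; i++)
    {
    enum gb rgb = classes[i];
    if ((gbtable[lgb] & (1u << rgb)) == 0) break;

    /* Keep lgb at E_Base while skipping Extend, so that a later
       E_Modifier is still tested against the E_Base row. */
    if (rgb != GB_EXTEND || lgb != GB_E_BASE) lgb = rgb;
    }
  return i;
}
```

With this rule, E_Base Extend Extend E_Modifier stays in one cluster, whereas Other Extend E_Modifier breaks before the modifier because lgb has moved on to Extend.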
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
          if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
          rgb = UCD_GRAPHBREAK(fc);
          if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
          lgb = rgb;

          /* Not breaking between Regional Indicators is allowed only if
          there are an even number of preceding RIs. */

          if (lgb == ucp_gbRegionalIndicator &&
              rgb == ucp_gbRegionalIndicator)
            {
            int ricount = 0;
            PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
            if (utf) BACKCHAR(bptr);
#endif
            /* bptr is pointing to the left-hand character */

            while (bptr > mb->start_subject)
              {
              bptr--;
#ifdef SUPPORT_UNICODE
              if (utf)
                {
                BACKCHAR(bptr);
                GETCHAR(fc, bptr);
                }
              else
#endif
              fc = *bptr;
              if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
              ricount++;
              }
            if ((ricount & 1) != 0) break;  /* Grapheme break required */
            }

          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
          any number of Extend before a following E_Modifier. */

          if (rgb != ucp_gbExtend ||
              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
            lgb = rgb;

          Feptr += len;
          }
        }
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
          if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
          rgb = UCD_GRAPHBREAK(fc);
          if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
          lgb = rgb;

          /* Not breaking between Regional Indicators is allowed only if
          there are an even number of preceding RIs. */

          if (lgb == ucp_gbRegionalIndicator &&
              rgb == ucp_gbRegionalIndicator)
            {
            int ricount = 0;
            PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
            if (utf) BACKCHAR(bptr);
#endif
            /* bptr is pointing to the left-hand character */

            while (bptr > mb->start_subject)
              {
              bptr--;
#ifdef SUPPORT_UNICODE
              if (utf)
                {
                BACKCHAR(bptr);
                GETCHAR(fc, bptr);
                }
              else
#endif
              fc = *bptr;
              if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
              ricount++;
              }
            if ((ricount & 1) != 0) break;  /* Grapheme break required */
            }

          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
          any number of Extend before a following E_Modifier. */

          if (rgb != ucp_gbExtend ||
              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
            lgb = rgb;

          Feptr += len;
          }
        }
@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
  LV or V may be followed by V or T
  LVT or T may be followed by T

4. Do not break before extending characters.
4. Do not break before extending characters or zero-width-joiner (ZWJ).

The next two rules are only for extended grapheme clusters (but that's what we
The following rules are only for extended grapheme clusters (but that's what we
are implementing).

5. Do not break before SpacingMarks.

6. Do not break after Prepend characters.

7. Otherwise, break everywhere.
7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
   by E_Modifier). Extend characters are allowed before the modifier; this
   cannot be represented in this table, so the code has to deal with it.

8. Do not break within emoji ZWJ sequences (ZWJ followed by Glue_After_Zwj or
   E_Base_GAZ).

9. Do not break within emoji flag sequences. That is, do not break between
   regional indicator (RI) symbols if there is an odd number of RI characters
   before the break point. This table encodes "join RI characters"; the code
   has to deal with checking for previous adjoining RIs.

10. Otherwise, break everywhere.
*/

#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)

const uint32_t PRIV(ucp_gbtable)[] = {
   (1<<ucp_gbLF),                                           /*  0 CR */
   0,                                                       /*  1 LF */
   0,                                                       /*  2 Control */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  3 Extend */
   (1<<ucp_gbExtend)|(1<<ucp_gbPrepend)|                    /*  4 Prepend */
     (1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
     (1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
     (1<<ucp_gbLVT)|(1<<ucp_gbOther),

   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  5 SpacingMark */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|   /*  6 L */
     (1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),

   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  7 V */
     (1<<ucp_gbT),

   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /*  8 T */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  9 LV */
     (1<<ucp_gbT),

   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /* 10 LVT */
   ESZ,                                                     /*  3 Extend */
   ESZ|(1<<ucp_gbPrepend)|                                  /*  4 Prepend */
       (1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
       (1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
       (1<<ucp_gbRegionalIndicator)|
       (1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
       (1<<ucp_gbE_Base_GAZ)|
       (1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
   ESZ,                                                     /*  5 SpacingMark */
   ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)|             /*  6 L */
       (1<<ucp_gbLVT),
   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  7 V */
   ESZ|(1<<ucp_gbT),                                        /*  8 T */
   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  9 LV */
   ESZ|(1<<ucp_gbT),                                        /* 10 LVT */
   (1<<ucp_gbRegionalIndicator),                            /* 11 RegionalIndicator */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 12 Other */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 13 E_Base */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 14 E_Modifier */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 15 E_Base_GAZ */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 16 ZWJ */
   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)                 /* 17 Glue_After_Zwj */
   ESZ,                                                     /* 12 Other */
   ESZ|(1<<ucp_gbE_Modifier),                               /* 13 E_Base */
   ESZ,                                                     /* 14 E_Modifier */
   ESZ|(1<<ucp_gbE_Modifier),                               /* 15 E_Base_GAZ */
   ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ),     /* 16 ZWJ */
   ESZ                                                      /* 17 Glue_After_Zwj */
};

#undef ESZ

#ifdef SUPPORT_JIT
/* This table reverses PRIV(ucp_gentype). We can save the cost
of a memory load. */
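How a table of this shape is consulted can be shown with a reduced version. The sketch below is an illustration under stated assumptions rather than the PCRE2 table: the eight-class enumeration, the `EZ` macro, `gbtable`, and `may_join` are hypothetical names, and SpacingMark, Prepend, the Hangul classes, and the RI handling are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced set of break classes (not the real ucp_gb* values). */
enum gbclass {
  GB_CR, GB_LF, GB_CONTROL, GB_EXTEND, GB_ZWJ,
  GB_OTHER, GB_E_BASE, GB_E_MODIFIER, GB_COUNT
};

/* Classes that may follow most characters without a break. */
#define EZ ((1u << GB_EXTEND) | (1u << GB_ZWJ))

/* One row per left-hand class; bit r is set when right-hand class r
   must not be separated from it. */
static const uint32_t gbtable[GB_COUNT] = {
  (1u << GB_LF),               /* CR: only CR LF stays together */
  0,                           /* LF */
  0,                           /* Control */
  EZ,                          /* Extend */
  EZ,                          /* ZWJ */
  EZ,                          /* Other */
  EZ | (1u << GB_E_MODIFIER),  /* E_Base */
  EZ                           /* E_Modifier */
};

/* Return 1 if there is no grapheme break between left and right. */
static int may_join(enum gbclass left, enum gbclass right)
{
  assert(left < GB_COUNT && right < GB_COUNT);
  return (gbtable[left] & (1u << right)) != 0;
}
```

A single shift and mask answers each "may these two classes be adjacent within a cluster" question; the context-dependent cases (Extend between E_Base and E_Modifier, and RI parity) still need code outside the table.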
@ -2041,4 +2041,23 @@
/^(?:(\X)(?C))+$/utf
    \x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where

# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.

/^\X/utf
    A\x{200d}B A ZWJ
    \x{261D}\x{1F3FB}B E_Base E_Modifier
    \x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
    \x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
    \x{200d}\x{1F469}B ZWJ E_Base_GAZ
    \x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier

# Regional indicators

/^(\X)(\X)/utf,aftertext
    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit


# End of testinput5
@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
 0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
 1: \x{11a00}\x{11a07}\x{11a47}

# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.

/^\X/utf
    A\x{200d}B A ZWJ
 0: A\x{200d}
    \x{261D}\x{1F3FB}B E_Base E_Modifier
 0: \x{261d}\x{1f3fb}
    \x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
 0: \x{1f466}\x{1f3ff}
    \x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
 0: \x{200d}\x{1f3a4}
    \x{200d}\x{1F469}B ZWJ E_Base_GAZ
 0: \x{200d}\x{1f469}
    \x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
 0: \x{1f1e6}\x{1f1e7}
    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
** /n is not valid here

# Regional indicators

/^(\X)(\X)/utf,aftertext
    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
 0+ B
 1: \x{1f1e6}\x{1f1e7}
 2: \x{1f1e7}
    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
 0+ B
 1: \x{1f1e6}\x{1f1e7}
 2: \x{1f1e7}\x{1f1e6}


# End of testinput5