Update grapheme breaking rules for Unicode 10.0.0.

Philip.Hazel 2017-07-05 08:55:49 +00:00
parent 41bb787fb3
commit 4f7a608d56
9 changed files with 514 additions and 98 deletions

View File

@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.
48. Add the callout_no_where modifier to pcre2test.
49. Update extended grapheme breaking rules to the latest set that are in
Unicode Standard Annex #29.
Version 10.23 14-February-2017
------------------------------

View File

@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. \X always matches at
least one character. Then it decides whether to add additional characters
according to the following rules for ending a cluster:
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation".
</P>
<P>
\X always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks. Characters with
the "mark" property always have the "extend" grapheme breaking property.
4. Do not end before extending characters or spacing marks or the "zero-width
joiner" characters. Characters with the "mark" property always have the
"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Do not break within emoji modifier sequences (a base character followed by a
modifier). Extending characters are allowed before the modifier.
</P>
<P>
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
</P>
<P>
8. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
</P>
<P>
9. Otherwise, end the cluster.
<a name="extraprops"></a></P>
<br><b>
@ -3455,7 +3472,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 July 2017
Last updated: 05 July 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>

View File

@ -6433,9 +6433,12 @@ BACKSLASH
(see below). Unicode supports various kinds of composite character by
giving each character a grapheme breaking property, and having rules
that use these properties to define the boundaries of extended grapheme
clusters. \X always matches at least one character. Then it decides
whether to add additional characters according to the following rules
for ending a cluster:
clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
Text Segmentation".
\X always matches at least one character. Then it decides whether to
add additional characters according to the following rules for ending a
cluster:
1. End at the end of the subject string.
@ -6448,12 +6451,23 @@ BACKSLASH
be followed by a V or T character; an LVT or T character may be followed
only by a T character.
4. Do not end before extending characters or spacing marks. Characters
with the "mark" property always have the "extend" grapheme breaking
property.
4. Do not end before extending characters or spacing marks or the
"zero-width joiner" characters. Characters with the "mark" property
always have the "extend" grapheme breaking property.
5. Do not end after prepend characters.
6. Do not break within emoji modifier sequences (a base character fol-
lowed by a modifier). Extending characters are allowed before the modi-
fier.
7. Do not break within emoji ZWJ sequences (zero-width joiner followed
by "glue after ZWJ" or "base glue after ZWJ").
8. Do not break within emoji flag sequences. That is, do not break
between regional indicator (RI) characters if there are an odd number
of RI characters before the break point.
9. Otherwise, end the cluster.
PCRE2's additional properties
@ -8744,7 +8758,7 @@ AUTHOR
REVISION
Last updated: 02 July 2017
Last updated: 05 July 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
.\"
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. \eX always matches at
least one character. Then it decides whether to add additional characters
according to the following rules for ending a cluster:
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation".
.P
\eX always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
.P
1. End at the end of the subject string.
.P
@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
.P
4. Do not end before extending characters or spacing marks. Characters with
the "mark" property always have the "extend" grapheme breaking property.
4. Do not end before extending characters or spacing marks or the "zero-width
joiner" characters. Characters with the "mark" property always have the
"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
6. Do not break within emoji modifier sequences (a base character followed by a
modifier). Extending characters are allowed before the modifier.
.P
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
.P
8. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
.P
9. Otherwise, end the cluster.
.
.
@ -3485,6 +3498,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 July 2017
Last updated: 05 July 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi

View File

@ -1379,8 +1379,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
ncount++;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
nptr += dlen;
}
count++;
@ -1641,8 +1679,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
ncount++;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
nptr += dlen;
}
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@ -1912,8 +1988,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
ncount++;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
nptr += dlen;
}
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@ -2102,8 +2216,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
ncount++;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
nptr += dlen;
}
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)

View File

@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if there
are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
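
The Regional Indicator check repeated in the hunks above can be read in isolation as a parity test on the run of RIs that precedes the candidate join. The following is a minimal sketch of that idea, not the PCRE2 code: it works on an array of already-decoded code points rather than on code units, and the names `is_ri` and `ri_break_required` are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero for Regional Indicator symbols,
   U+1F1E6 to U+1F1FF. */
static int is_ri(uint32_t c)
{
  return c >= 0x1F1E6 && c <= 0x1F1FF;
}

/* s[pos-1] and s[pos] are both assumed to be RIs. Count the RIs that
   precede the left-hand one, scanning back towards the start of the
   subject. An odd count means the left-hand RI already closes a pair,
   so a grapheme break is required between s[pos-1] and s[pos]. */
static int ri_break_required(const uint32_t *s, size_t pos)
{
  size_t ricount = 0;
  size_t i;

  assert(pos > 0);
  i = pos - 1;                       /* index of the left-hand RI */
  while (i > 0 && is_ri(s[i - 1]))
    {
    ricount++;
    i--;
    }
  return (ricount & 1) != 0;
}
```

With the subject `\x{1F1E6}\x{1F1E7}\x{1F1E7}B`, `ri_break_required(s, 2)` is nonzero, so the third RI starts a new cluster; this agrees with the captures recorded for that line in testoutput5.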

View File

@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
LV or V may be followed by V or T
LVT or T may be followed by T
4. Do not break before extending characters.
4. Do not break before extending characters or zero-width-joiner (ZWJ).
The next two rules are only for extended grapheme clusters (but that's what we
The following rules are only for extended grapheme clusters (but that's what we
are implementing).
5. Do not break before SpacingMarks.
6. Do not break after Prepend characters.
7. Otherwise, break everywhere.
7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
by E_Modifier). Extend characters are allowed before the modifier; this
cannot be represented in this table, the code has to deal with it.
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
E_Base_GAZ).
9. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) symbols if there are an odd number of RI characters
before the break point. This table encodes "join RI characters"; the code
has to deal with checking for previous adjoining RIs.
10. Otherwise, break everywhere.
*/
#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
const uint32_t PRIV(ucp_gbtable)[] = {
(1<<ucp_gbLF), /* 0 CR */
0, /* 1 LF */
0, /* 2 Control */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 3 Extend */
(1<<ucp_gbExtend)|(1<<ucp_gbPrepend)| /* 4 Prepend */
(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
(1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
(1<<ucp_gbLVT)|(1<<ucp_gbOther),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 5 SpacingMark */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)| /* 6 L */
(1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 7 V */
(1<<ucp_gbT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 8 T */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 9 LV */
(1<<ucp_gbT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 10 LVT */
ESZ, /* 3 Extend */
ESZ|(1<<ucp_gbPrepend)| /* 4 Prepend */
(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
(1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
(1<<ucp_gbRegionalIndicator)|
(1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
(1<<ucp_gbE_Base_GAZ)|
(1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
ESZ, /* 5 SpacingMark */
ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)| /* 6 L */
(1<<ucp_gbLVT),
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 7 V */
ESZ|(1<<ucp_gbT), /* 8 T */
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 9 LV */
ESZ|(1<<ucp_gbT), /* 10 LVT */
(1<<ucp_gbRegionalIndicator), /* 11 RegionalIndicator */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 12 Other */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 13 E_Base */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 14 E_Modifier */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 15 E_Base_GAZ */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 16 ZWJ */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark) /* 12 Glue_After_Zwj */
ESZ, /* 12 Other */
ESZ|(1<<ucp_gbE_Modifier), /* 13 E_Base */
ESZ, /* 14 E_Modifier */
ESZ|(1<<ucp_gbE_Modifier), /* 15 E_Base_GAZ */
ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ), /* 16 ZWJ */
ESZ /* 17 Glue_After_Zwj */
};
#undef ESZ
#ifdef SUPPORT_JIT
/* This table reverses PRIV(ucp_gentype). We can save the cost
of a memory load. */
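
To make the table comment above concrete, here is a small sketch of how a "may follow" bit table and the Extend-after-E_Base exception combine when scanning one cluster. It is illustrative only: the `enum gb` values, the five-row `gbtable`, and `cluster_end` are hypothetical stand-ins, not PCRE2's `ucp_gb*` constants or `PRIV(ucp_gbtable)`, and the Regional Indicator parity check is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature property set. */
enum gb { GB_OTHER, GB_EXTEND, GB_ZWJ, GB_E_BASE, GB_E_MODIFIER };

#define BIT(p) (1u << (p))
#define EZ (BIT(GB_EXTEND) | BIT(GB_ZWJ))

/* Row = left-hand property; set bits = right-hand properties that may
   follow it without a break. */
static const uint32_t gbtable[] = {
  EZ,                        /* GB_OTHER */
  EZ,                        /* GB_EXTEND */
  EZ,                        /* GB_ZWJ */
  EZ | BIT(GB_E_MODIFIER),   /* GB_E_BASE */
  EZ                         /* GB_E_MODIFIER */
};

/* Return the index one past the end of the cluster starting at p[start]. */
static size_t cluster_end(const enum gb *p, size_t n, size_t start)
{
  enum gb lgb;
  size_t i;

  assert(start < n);
  lgb = p[start];
  for (i = start + 1; i < n; i++)
    {
    enum gb rgb = p[i];
    if ((gbtable[lgb] & BIT(rgb)) == 0) break;   /* break required */

    /* Keep lgb at E_Base while skipping Extend, so that a later
       E_Modifier is still looked up in the E_Base row. */
    if (rgb != GB_EXTEND || lgb != GB_E_BASE) lgb = rgb;
    }
  return i;
}
```

For the property sequence E_Base, Extend, E_Modifier, Other, `cluster_end` returns 3. For Other, Extend, E_Modifier it returns 2, because the left-hand property is updated to Extend and the Extend row does not admit E_Modifier.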

testdata/testinput5
View File

@ -2041,4 +2041,23 @@
/^(?:(\X)(?C))+$/utf
\x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where
# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.
/^\X/utf
A\x{200d}B A ZWJ
\x{261D}\x{1F3FB}B E_Base E_Modifier
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
# Regional indicators
/^(\X)(\X)/utf,aftertext
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
# End of testinput5

testdata/testoutput5
View File

@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
1: \x{11a00}\x{11a07}\x{11a47}
# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.
/^\X/utf
A\x{200d}B A ZWJ
0: A\x{200d}
\x{261D}\x{1F3FB}B E_Base E_Modifier
0: \x{261d}\x{1f3fb}
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
0: \x{1f466}\x{1f3ff}
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
0: \x{200d}\x{1f3a4}
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
0: \x{200d}\x{1f469}
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
0: \x{1f1e6}\x{1f1e7}
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
** /n is not valid here
# Regional indicators
/^(\X)(\X)/utf,aftertext
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
0+ B
1: \x{1f1e6}\x{1f1e7}
2: \x{1f1e7}
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
0+ B
1: \x{1f1e6}\x{1f1e7}
2: \x{1f1e7}\x{1f1e6}
# End of testinput5