Update grapheme breaking rules for Unicode 10.0.0.
parent
41bb787fb3
commit
4f7a608d56
|
@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.
|
|||
|
||||
48. Add the callout_no_where modifier to pcre2test.
|
||||
|
||||
49. Update extended grapheme breaking rules to the latest set that are in
|
||||
Unicode Standard Annex #29.
|
||||
|
||||
|
||||
Version 10.23 14-February-2017
|
||||
------------------------------
|
||||
|
|
|
@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
|
|||
<a href="#atomicgroup">(see below).</a>
|
||||
Unicode supports various kinds of composite character by giving each character
|
||||
a grapheme breaking property, and having rules that use these properties to
|
||||
define the boundaries of extended grapheme clusters. \X always matches at
|
||||
least one character. Then it decides whether to add additional characters
|
||||
according to the following rules for ending a cluster:
|
||||
define the boundaries of extended grapheme clusters. The rules are defined in
|
||||
Unicode Standard Annex 29, "Unicode Text Segmentation".
|
||||
</P>
|
||||
<P>
|
||||
\X always matches at least one character. Then it decides whether to add
|
||||
additional characters according to the following rules for ending a cluster:
|
||||
</P>
|
||||
<P>
|
||||
1. End at the end of the subject string.
|
||||
|
@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
|||
character; an LVT or T character may be followed only by a T character.
|
||||
</P>
|
||||
<P>
|
||||
4. Do not end before extending characters or spacing marks. Characters with
|
||||
the "mark" property always have the "extend" grapheme breaking property.
|
||||
4. Do not end before extending characters or spacing marks or the "zero-width
|
||||
joiner" characters. Characters with the "mark" property always have the
|
||||
"extend" grapheme breaking property.
|
||||
</P>
|
||||
<P>
|
||||
5. Do not end after prepend characters.
|
||||
</P>
|
||||
<P>
|
||||
6. Do not break within emoji modifier sequences (a base character followed by a
|
||||
modifier). Extending characters are allowed before the modifier.
|
||||
</P>
|
||||
<P>
|
||||
7. Do not break within emoji zwj sequences (zero-width joiner followed by
|
||||
"glue after ZWJ" or "base glue after ZWJ").
|
||||
</P>
|
||||
<P>
|
||||
8. Do not break within emoji flag sequences. That is, do not break between
|
||||
regional indicator (RI) characters if there are an odd number of RI characters
|
||||
before the break point.
|
||||
</P>
|
||||
<P>
|
||||
9. Otherwise, end the cluster.
|
||||
<a name="extraprops"></a></P>
|
||||
<br><b>
|
||||
|
@ -3455,7 +3472,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 02 July 2017
|
||||
Last updated: 05 July 2017
|
||||
<br>
|
||||
Copyright © 1997-2017 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -6433,27 +6433,41 @@ BACKSLASH
|
|||
(see below). Unicode supports various kinds of composite character by
|
||||
giving each character a grapheme breaking property, and having rules
|
||||
that use these properties to define the boundaries of extended grapheme
|
||||
clusters. \X always matches at least one character. Then it decides
|
||||
whether to add additional characters according to the following rules
|
||||
for ending a cluster:
|
||||
clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
|
||||
Text Segmentation".
|
||||
|
||||
\X always matches at least one character. Then it decides whether to
|
||||
add additional characters according to the following rules for ending a
|
||||
cluster:
|
||||
|
||||
1. End at the end of the subject string.
|
||||
|
||||
2. Do not end between CR and LF; otherwise end after any control char-
|
||||
2. Do not end between CR and LF; otherwise end after any control char-
|
||||
acter.
|
||||
|
||||
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
|
||||
characters are of five types: L, V, T, LV, and LVT. An L character may
|
||||
be followed by an L, V, LV, or LVT character; an LV or V character may
|
||||
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
|
||||
characters are of five types: L, V, T, LV, and LVT. An L character may
|
||||
be followed by an L, V, LV, or LVT character; an LV or V character may
|
||||
be followed by a V or T character; an LVT or T character may be followed
|
||||
only by a T character.
|
||||
|
||||
4. Do not end before extending characters or spacing marks. Characters
|
||||
with the "mark" property always have the "extend" grapheme breaking
|
||||
property.
|
||||
4. Do not end before extending characters or spacing marks or the
|
||||
"zero-width joiner" characters. Characters with the "mark" property
|
||||
always have the "extend" grapheme breaking property.
|
||||
|
||||
5. Do not end after prepend characters.
|
||||
|
||||
6. Do not break within emoji modifier sequences (a base character fol-
|
||||
lowed by a modifier). Extending characters are allowed before the modi-
|
||||
fier.
|
||||
|
||||
7. Do not break within emoji zwj sequences (zero-width joiner followed
|
||||
by "glue after ZWJ" or "base glue after ZWJ").
|
||||
|
||||
8. Do not break within emoji flag sequences. That is, do not break
|
||||
between regional indicator (RI) characters if there are an odd number
|
||||
of RI characters before the break point.
|
||||
|
||||
9. Otherwise, end the cluster.
|
||||
|
||||
PCRE2's additional properties
|
||||
|
@ -8744,7 +8758,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 02 July 2017
|
||||
Last updated: 05 July 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
|
||||
.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
|
|||
.\"
|
||||
Unicode supports various kinds of composite character by giving each character
|
||||
a grapheme breaking property, and having rules that use these properties to
|
||||
define the boundaries of extended grapheme clusters. \eX always matches at
|
||||
least one character. Then it decides whether to add additional characters
|
||||
according to the following rules for ending a cluster:
|
||||
define the boundaries of extended grapheme clusters. The rules are defined in
|
||||
Unicode Standard Annex 29, "Unicode Text Segmentation".
|
||||
.P
|
||||
\eX always matches at least one character. Then it decides whether to add
|
||||
additional characters according to the following rules for ending a cluster:
|
||||
.P
|
||||
1. End at the end of the subject string.
|
||||
.P
|
||||
|
@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
|
|||
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
||||
character; an LVT or T character may be followed only by a T character.
|
||||
.P
|
||||
4. Do not end before extending characters or spacing marks. Characters with
|
||||
the "mark" property always have the "extend" grapheme breaking property.
|
||||
4. Do not end before extending characters or spacing marks or the "zero-width
|
||||
joiner" characters. Characters with the "mark" property always have the
|
||||
"extend" grapheme breaking property.
|
||||
.P
|
||||
5. Do not end after prepend characters.
|
||||
.P
|
||||
6. Do not break within emoji modifier sequences (a base character followed by a
|
||||
modifier). Extending characters are allowed before the modifier.
|
||||
.P
|
||||
7. Do not break within emoji zwj sequences (zero-width joiner followed by
|
||||
"glue after ZWJ" or "base glue after ZWJ").
|
||||
.P
|
||||
8. Do not break within emoji flag sequences. That is, do not break between
|
||||
regional indicator (RI) characters if there are an odd number of RI characters
|
||||
before the break point.
|
||||
.P
|
||||
9. Otherwise, end the cluster.
|
||||
.
|
||||
.
|
||||
|
@ -3485,6 +3498,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 02 July 2017
|
||||
Last updated: 05 July 2017
|
||||
Copyright (c) 1997-2017 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1379,8 +1379,46 @@ for (;;)
|
|||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||
rgb = UCD_GRAPHBREAK(d);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = nptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(d, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
d = *bptr;
|
||||
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
ncount++;
|
||||
lgb = rgb;
|
||||
nptr += dlen;
|
||||
}
|
||||
count++;
|
||||
|
@ -1641,8 +1679,46 @@ for (;;)
|
|||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||
rgb = UCD_GRAPHBREAK(d);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = nptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(d, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
d = *bptr;
|
||||
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
ncount++;
|
||||
lgb = rgb;
|
||||
nptr += dlen;
|
||||
}
|
||||
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
|
||||
|
@ -1912,8 +1988,46 @@ for (;;)
|
|||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||
rgb = UCD_GRAPHBREAK(d);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = nptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(d, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
d = *bptr;
|
||||
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
ncount++;
|
||||
lgb = rgb;
|
||||
nptr += dlen;
|
||||
}
|
||||
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
|
||||
|
@ -2102,8 +2216,46 @@ for (;;)
|
|||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||
rgb = UCD_GRAPHBREAK(d);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = nptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(d, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
d = *bptr;
|
||||
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
ncount++;
|
||||
lgb = rgb;
|
||||
nptr += dlen;
|
||||
}
|
||||
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
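The "do not update lgb" step in the hunks above is easy to misread, so here is a minimal standalone sketch of it. It assumes a hypothetical four-value subset of the grapheme break properties and a simplified join predicate (`may_join`) standing in for the real table lookup; the names `GB_*`, `may_join` and `cluster_len` are illustrative only and do not exist in PCRE2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the grapheme break properties (illustration only). */
enum { GB_Other, GB_Extend, GB_E_Base, GB_E_Modifier };

/* Simplified stand-in for the table lookup: Extend joins to anything on its
   left, and E_Modifier joins to E_Base. Everything else is a break. */
int may_join(int lgb, int rgb)
{
  if (rgb == GB_Extend) return 1;
  return lgb == GB_E_Base && rgb == GB_E_Modifier;
}

/* Returns the number of characters in the first cluster of props[0..n-1]. */
size_t cluster_len(const int *props, size_t n)
{
  size_t len;
  int lgb;
  assert(props != NULL || n == 0);
  if (n == 0) return 0;
  lgb = props[0];
  for (len = 1; len < n; len++)
    {
    int rgb = props[len];
    if (!may_join(lgb, rgb)) break;
    /* Keep lgb at E_Base while skipping Extend characters, so that a later
       E_Modifier is still compared against the base. */
    if (rgb != GB_Extend || lgb != GB_E_Base) lgb = rgb;
    }
  return len;
}
```

With this sketch, the sequence E_Base, Extend, E_Modifier, Other yields a cluster of three characters, whereas Other, Extend, E_Modifier yields two, because the left-hand property becomes Extend once the base is not an emoji base.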
|
||||
|
|
|
@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||
rgb = UCD_GRAPHBREAK(fc);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||
lgb = rgb;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if there
|
||||
are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = Feptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(fc, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
fc = *bptr;
|
||||
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
Feptr += len;
|
||||
}
|
||||
}
|
||||
|
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||
rgb = UCD_GRAPHBREAK(fc);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||
lgb = rgb;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = Feptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(fc, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
fc = *bptr;
|
||||
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
Feptr += len;
|
||||
}
|
||||
}
|
||||
|
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||
rgb = UCD_GRAPHBREAK(fc);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||
lgb = rgb;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = Feptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(fc, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
fc = *bptr;
|
||||
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
Feptr += len;
|
||||
}
|
||||
}
|
||||
|
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
|||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||
rgb = UCD_GRAPHBREAK(fc);
|
||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||
lgb = rgb;
|
||||
|
||||
/* Not breaking between Regional Indicators is allowed only if
|
||||
there are an even number of preceding RIs. */
|
||||
|
||||
if (lgb == ucp_gbRegionalIndicator &&
|
||||
rgb == ucp_gbRegionalIndicator)
|
||||
{
|
||||
int ricount = 0;
|
||||
PCRE2_SPTR bptr = Feptr - 1;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf) BACKCHAR(bptr);
|
||||
#endif
|
||||
/* bptr is pointing to the left-hand character */
|
||||
|
||||
while (bptr > mb->start_subject)
|
||||
{
|
||||
bptr--;
|
||||
#ifdef SUPPORT_UNICODE
|
||||
if (utf)
|
||||
{
|
||||
BACKCHAR(bptr);
|
||||
GETCHAR(fc, bptr);
|
||||
}
|
||||
else
|
||||
#endif
|
||||
fc = *bptr;
|
||||
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||
ricount++;
|
||||
}
|
||||
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||
}
|
||||
|
||||
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||
any number of Extend before a following E_Modifier. */
|
||||
|
||||
if (rgb != ucp_gbExtend ||
|
||||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||
lgb = rgb;
|
||||
|
||||
Feptr += len;
|
||||
}
|
||||
}
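The regional indicator check repeated in these hunks can be sketched in isolation as follows. This is a simplified model, not the PCRE2 code: it works on an array holding one property value per character instead of walking backwards through UTF code units, and the names `GB_*` and `ri_break_required` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-value subset of the grapheme break properties. */
enum { GB_Other, GB_RegionalIndicator };

/* props[] holds one property per character. pos is the index of the
   right-hand character of the candidate break point (pos > 0), and both
   props[pos - 1] and props[pos] are assumed to be RegionalIndicator.
   Returns 1 when a break is required between them, 0 otherwise. */
int ri_break_required(const int *props, size_t pos)
{
  size_t ricount = 0;
  size_t i;
  assert(pos > 0);
  i = pos - 1;                    /* index of the left-hand RI */
  while (i > 0 && props[i - 1] == GB_RegionalIndicator)
    {
    i--;
    ricount++;                    /* count RIs before the left-hand one */
    }
  return (ricount & 1) != 0;      /* odd count: left-hand RI is already paired */
}
```

As in the hunks, the count covers only the RIs that precede the left-hand character, so an odd count means the left-hand RI is the second half of an existing pair and the cluster must end there.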
|
||||
|
|
|
@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
|
|||
LV or V may be followed by V or T
|
||||
LVT or T may be followed by T
|
||||
|
||||
4. Do not break before extending characters.
|
||||
4. Do not break before extending characters or zero-width-joiner (ZWJ).
|
||||
|
||||
The next two rules are only for extended grapheme clusters (but that's what we
|
||||
The following rules are only for extended grapheme clusters (but that's what we
|
||||
are implementing).
|
||||
|
||||
5. Do not break before SpacingMarks.
|
||||
|
||||
6. Do not break after Prepend characters.
|
||||
|
||||
7. Otherwise, break everywhere.
|
||||
7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
|
||||
by E_Modifier). Extend characters are allowed before the modifier; this
|
||||
cannot be represented in this table, the code has to deal with it.
|
||||
|
||||
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
|
||||
E_Base_GAZ).
|
||||
|
||||
9. Do not break within emoji flag sequences. That is, do not break between
|
||||
regional indicator (RI) symbols if there are an odd number of RI characters
|
||||
before the break point. This table encodes "join RI characters"; the code
|
||||
has to deal with checking for previous adjoining RIs.
|
||||
|
||||
10. Otherwise, break everywhere.
|
||||
*/
|
||||
|
||||
#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
|
||||
|
||||
const uint32_t PRIV(ucp_gbtable)[] = {
|
||||
(1<<ucp_gbLF), /* 0 CR */
|
||||
0, /* 1 LF */
|
||||
0, /* 2 Control */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 3 Extend */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbPrepend)| /* 4 Prepend */
|
||||
(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
|
||||
(1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
|
||||
(1<<ucp_gbLVT)|(1<<ucp_gbOther),
|
||||
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 5 SpacingMark */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)| /* 6 L */
|
||||
(1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
|
||||
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 7 V */
|
||||
(1<<ucp_gbT),
|
||||
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 8 T */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 9 LV */
|
||||
(1<<ucp_gbT),
|
||||
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 10 LVT */
|
||||
ESZ, /* 3 Extend */
|
||||
ESZ|(1<<ucp_gbPrepend)| /* 4 Prepend */
|
||||
(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
|
||||
(1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
|
||||
(1<<ucp_gbRegionalIndicator)|
|
||||
(1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
|
||||
(1<<ucp_gbE_Base_GAZ)|
|
||||
(1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
|
||||
ESZ, /* 5 SpacingMark */
|
||||
ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)| /* 6 L */
|
||||
(1<<ucp_gbLVT),
|
||||
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 7 V */
|
||||
ESZ|(1<<ucp_gbT), /* 8 T */
|
||||
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 9 LV */
|
||||
ESZ|(1<<ucp_gbT), /* 10 LVT */
|
||||
(1<<ucp_gbRegionalIndicator), /* 11 RegionalIndicator */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 12 Other */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 13 E_Base */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 14 E_Modifier */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 15 E_Base_GAZ */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 16 ZWJ */
|
||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark) /* 12 Glue_After_Zwj */
|
||||
ESZ, /* 12 Other */
|
||||
ESZ|(1<<ucp_gbE_Modifier), /* 13 E_Base */
|
||||
ESZ, /* 14 E_Modifier */
|
||||
ESZ|(1<<ucp_gbE_Modifier), /* 15 E_Base_GAZ */
|
||||
ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ), /* 16 ZWJ */
|
||||
ESZ /* 17 Glue_After_Zwj */
|
||||
};
|
||||
|
||||
#undef ESZ
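For readers unfamiliar with this kind of table, the following standalone sketch shows how a bitmask row is consulted. The enum is a hypothetical stand-in for the `ucp_gb*` values, ordered to follow the index comments in the table, and the rows mirror the new table in this hunk; `gb_join` and `gb_may_join` are illustrative names, not PCRE2 identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ucp_gb* values, in table-comment order. */
enum {
  GB_CR, GB_LF, GB_Control, GB_Extend, GB_Prepend, GB_SpacingMark,
  GB_L, GB_V, GB_T, GB_LV, GB_LVT, GB_RegionalIndicator, GB_Other,
  GB_E_Base, GB_E_Modifier, GB_E_Base_GAZ, GB_ZWJ, GB_Glue_After_Zwj
};

#define BIT(x) (1u << (x))
#define ESZ (BIT(GB_Extend) | BIT(GB_SpacingMark) | BIT(GB_ZWJ))

/* Row = left-hand property; a set bit = "do not break before this
   right-hand property". */
static const uint32_t gb_join[] = {
  BIT(GB_LF),                                        /*  0 CR */
  0,                                                 /*  1 LF */
  0,                                                 /*  2 Control */
  ESZ,                                               /*  3 Extend */
  ESZ | BIT(GB_Prepend) | BIT(GB_L) | BIT(GB_V) | BIT(GB_T) |
    BIT(GB_LV) | BIT(GB_LVT) | BIT(GB_Other) |
    BIT(GB_RegionalIndicator) | BIT(GB_E_Base) | BIT(GB_E_Modifier) |
    BIT(GB_E_Base_GAZ) | BIT(GB_ZWJ) | BIT(GB_Glue_After_Zwj), /* 4 Prepend */
  ESZ,                                               /*  5 SpacingMark */
  ESZ | BIT(GB_L) | BIT(GB_V) | BIT(GB_LV) | BIT(GB_LVT), /*  6 L */
  ESZ | BIT(GB_V) | BIT(GB_T),                       /*  7 V */
  ESZ | BIT(GB_T),                                   /*  8 T */
  ESZ | BIT(GB_V) | BIT(GB_T),                       /*  9 LV */
  ESZ | BIT(GB_T),                                   /* 10 LVT */
  BIT(GB_RegionalIndicator),                         /* 11 RegionalIndicator */
  ESZ,                                               /* 12 Other */
  ESZ | BIT(GB_E_Modifier),                          /* 13 E_Base */
  ESZ,                                               /* 14 E_Modifier */
  ESZ | BIT(GB_E_Modifier),                          /* 15 E_Base_GAZ */
  ESZ | BIT(GB_Glue_After_Zwj) | BIT(GB_E_Base_GAZ), /* 16 ZWJ */
  ESZ                                                /* 17 Glue_After_Zwj */
};

/* Returns nonzero when the table says "do not break" between lgb and rgb. */
int gb_may_join(int lgb, int rgb)
{
  assert(lgb >= 0 && lgb <= GB_Glue_After_Zwj);
  assert(rgb >= 0 && rgb <= GB_Glue_After_Zwj);
  return (gb_join[lgb] & BIT(rgb)) != 0;
}
```

The lookup alone cannot express the two context-dependent rules (Extend between an emoji base and its modifier, and the parity of preceding regional indicators); as the comment in this hunk says, the calling code has to deal with those.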
|
||||
|
||||
#ifdef SUPPORT_JIT
|
||||
/* This table reverses PRIV(ucp_gentype). We can save the cost
|
||||
of a memory load. */
|
||||
|
|
|
@ -2041,4 +2041,23 @@
|
|||
/^(?:(\X)(?C))+$/utf
|
||||
\x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where
|
||||
|
||||
# These two are here because JIT is not yet updated. Also, the very first data
|
||||
# line is handled differently by Perl.
|
||||
|
||||
/^\X/utf
|
||||
A\x{200d}B A ZWJ
|
||||
\x{261D}\x{1F3FB}B E_Base E_Modifier
|
||||
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
|
||||
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
|
||||
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
|
||||
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
|
||||
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
|
||||
|
||||
# Regional indicators
|
||||
|
||||
/^(\X)(\X)/utf,aftertext
|
||||
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
|
||||
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
|
||||
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
|
|||
0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
|
||||
1: \x{11a00}\x{11a07}\x{11a47}
|
||||
|
||||
# These two are here because JIT is not yet updated. Also, the very first data
|
||||
# line is handled differently by Perl.
|
||||
|
||||
/^\X/utf
|
||||
A\x{200d}B A ZWJ
|
||||
0: A\x{200d}
|
||||
\x{261D}\x{1F3FB}B E_Base E_Modifier
|
||||
0: \x{261d}\x{1f3fb}
|
||||
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
|
||||
0: \x{1f466}\x{1f3ff}
|
||||
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
|
||||
0: \x{200d}\x{1f3a4}
|
||||
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
|
||||
0: \x{200d}\x{1f469}
|
||||
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
|
||||
0: \x{1f1e6}\x{1f1e7}
|
||||
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
|
||||
** /n is not valid here
|
||||
|
||||
# Regional indicators
|
||||
|
||||
/^(\X)(\X)/utf,aftertext
|
||||
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
|
||||
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
|
||||
0+ B
|
||||
1: \x{1f1e6}\x{1f1e7}
|
||||
2: \x{1f1e7}
|
||||
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
|
||||
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
|
||||
0+ B
|
||||
1: \x{1f1e6}\x{1f1e7}
|
||||
2: \x{1f1e7}\x{1f1e6}
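The two expected results above follow from the flag-sequence rule: a run of regional indicators is consumed in pairs from the left, with a single leftover character when the run length is odd. A small sketch of that arithmetic, using a hypothetical helper `ri_run_clusters` that is not part of PCRE2 or pcre2test:

```c
#include <assert.h>
#include <stddef.h>

/* Writes the sizes of the clusters that a run of n consecutive regional
   indicator characters splits into, and returns the number of clusters.
   out must have room for (n + 1) / 2 entries. */
size_t ri_run_clusters(size_t n, size_t *out)
{
  size_t count = 0;
  assert(out != NULL);
  while (n >= 2)
    {
    out[count++] = 2;   /* a complete flag pair */
    n -= 2;
    }
  if (n == 1) out[count++] = 1;   /* unpaired trailing RI */
  return count;
}
```

For a run of three this gives clusters of sizes 2 and 1, and for a run of four it gives 2 and 2, matching capture groups 1 and 2 in the output above.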
|
||||
|
||||
|
||||
# End of testinput5
|
||||
|
|