Update grapheme breaking rules for Unicode 10.0.0.

2017-07-05 08:55:49 +00:00 · 2017-07-05 08:55:49 +00:00 · 4f7a608d56
parent 41bb787fb3
commit 4f7a608d56
9 changed files with 514 additions and 98 deletions
--- a/3
+++ b/3
@@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.

 48. Add the callout_no_where modifier to pcre2test.

+49. Update extended grapheme breaking rules to the latest set that are in 
+Unicode Standard Annex #29.
+

 Version 10.23 14-February-2017
 ------------------------------
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
 <a href="#atomicgroup">(see below).</a>
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \X always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+</P>
+<P>
+\X always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 </P>
 <P>
 1. End at the end of the subject string.
@@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be follwed only by a T character.
 </P>
 <P>
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 </P>
 <P>
 5. Do not end after prepend characters.
 </P>
 <P>
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+</P>
+<P>
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+</P>
+<P>
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+</P>
+<P>
 6. Otherwise, end the cluster.
 <a name="extraprops"></a></P>
 <br><b>
@@ -3455,7 +3472,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 <br>
 Copyright &copy; 1997-2017 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -6433,27 +6433,41 @@ BACKSLASH
       (see  below).  Unicode supports various kinds of composite character by
       giving each character a grapheme breaking property,  and  having  rules
       that use these properties to define the boundaries of extended grapheme
-       clusters. \X always matches at least one  character.  Then  it  decides
-       whether  to  add additional characters according to the following rules
-       for ending a cluster:
+       clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
+       Text Segmentation".
+
+       \X  always  matches  at least one character. Then it decides whether to
+       add additional characters according to the following rules for ending a
+       cluster:

       1. End at the end of the subject string.

-       2. Do not end between CR and LF; otherwise end after any control  char-
+       2.  Do not end between CR and LF; otherwise end after any control char-
       acter.

-       3.  Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
-       characters are of five types: L, V, T, LV, and LVT. An L character  may
-       be  followed by an L, V, LV, or LVT character; an LV or V character may
+       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
+       characters  are of five types: L, V, T, LV, and LVT. An L character may
+       be followed by an L, V, LV, or LVT character; an LV or V character  may
       be followed by a V or T character; an LVT or T character may be follwed
       only by a T character.

-       4.  Do not end before extending characters or spacing marks. Characters
-       with the "mark" property always have  the  "extend"  grapheme  breaking
-       property.
+       4. Do not end before extending  characters  or  spacing  marks  or  the
+       "zero-width  joiner"  characters.  Characters  with the "mark" property
+       always have the "extend" grapheme breaking property.

       5. Do not end after prepend characters.

+       6. Do not break within emoji modifier sequences (a base character  fol-
+       lowed by a modifier). Extending characters are allowed before the modi-
+       fier.
+
+       7. Do not break within emoji zwj sequences (zero-width joiner  followed
+       by "glue after ZWJ" or "base glue after ZWJ").
+
+       8.  Do  not  break  within  emoji flag sequences. That is, do not break
+       between regional indicator (RI) characters if there are an  odd  number
+       of RI characters before the break point.
+
       6. Otherwise, end the cluster.

   PCRE2's additional properties
@@ -8744,7 +8758,7 @@ AUTHOR

 REVISION

-       Last updated: 02 July 2017
+       Last updated: 05 July 2017
       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------
 
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
+.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
 .\"
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \eX always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+.P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 .P
 1. End at the end of the subject string.
 .P
@@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be follwed only by a T character.
 .P
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 .P
 5. Do not end after prepend characters.
 .P
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+.P
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+.P
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
 6. Otherwise, end the cluster.
 .
 .
@@ -3485,6 +3498,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@@ -1379,8 +1379,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        count++;
@@ -1641,8 +1679,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@@ -1912,8 +1988,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@@ -2102,8 +2216,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
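The regional-indicator check added in the hunks above can be illustrated in isolation. What follows is a minimal sketch, not PCRE2 code: it assumes the subject has already been decoded into an array of code points, and the names `is_ri` and `ri_break_required` are hypothetical. It treats U+1F1E6 to U+1F1FF as the Regional Indicator range and, like the `ricount` loop, counts the RIs that immediately precede the left-hand character of a candidate join.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the Regional Indicator symbols
   U+1F1E6..U+1F1FF. PCRE2 itself consults UCD_GRAPHBREAK() instead. */
static bool is_ri(uint32_t c)
{
return c >= 0x1F1E6u && c <= 0x1F1FFu;
}

/* s[left] and s[left + 1] must both be RIs. Returns true when a grapheme
   break is required between them, that is, when an odd number of RIs
   immediately precedes s[left]. */
bool ri_break_required(const uint32_t *s, size_t left)
{
size_t ricount = 0;
assert(is_ri(s[left]) && is_ri(s[left + 1]));
while (left > 0 && is_ri(s[left - 1]))
  {
  left--;
  ricount++;
  }
return (ricount & 1) != 0;
}
```

For the subject U+1F1E6 U+1F1E7 U+1F1E7 U+1F1E6 B used in the regional-indicator test, this sketch reports no break after the first and third RI and a break after the second, which agrees with the two-plus-two split recorded in testoutput5.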
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
        if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
        rgb = UCD_GRAPHBREAK(fc);
        if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-        lgb = rgb;
+
+        /* Not breaking between Regional Indicators is allowed only if there
+        are an even number of preceding RIs. */
+
+        if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
+          {
+          int ricount = 0;
+          PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+          if (utf) BACKCHAR(bptr);
+#endif
+          /* bptr is pointing to the left-hand character */
+
+          while (bptr > mb->start_subject)
+            {
+            bptr--;
+#ifdef SUPPORT_UNICODE
+            if (utf)
+              {
+              BACKCHAR(bptr);
+              GETCHAR(fc, bptr);
+              }
+            else
+#endif
+            fc = *bptr;
+            if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+            ricount++;
+            }
+          if ((ricount & 1) != 0) break;  /* Grapheme break required */
+          }
+
+        /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+        any number of Extend before a following E_Modifier. */
+
+        if (rgb != ucp_gbExtend ||
+            (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+          lgb = rgb;
+
        Feptr += len;
        }
      }
@@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
@@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
@@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
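The conditional update of `lgb` in the hunks above is what lets any number of Extend characters sit between an emoji base and its modifier. The sketch below is an illustration under stated assumptions, not the PCRE2 implementation: it uses a reduced, hypothetical set of four break classes (`GB_*`), a tiny join table, and a hypothetical function `cluster_length` that works on an array of already-classified characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced class set; PCRE2 uses the ucp_gb* values. */
enum gb { GB_OTHER, GB_EXTEND, GB_E_BASE, GB_E_MODIFIER, GB_COUNT };

/* join_table[left] has bit (1u << right) set when no break is allowed. */
static const uint32_t join_table[GB_COUNT] = {
  [GB_OTHER]      = 1u << GB_EXTEND,
  [GB_EXTEND]     = 1u << GB_EXTEND,
  [GB_E_BASE]     = (1u << GB_EXTEND) | (1u << GB_E_MODIFIER),
  [GB_E_MODIFIER] = 1u << GB_EXTEND
};

/* Returns how many of the n classified characters form the first cluster. */
size_t cluster_length(const enum gb *props, size_t n)
{
size_t i;
enum gb lgb;
assert(props != NULL || n == 0);
if (n == 0) return 0;
lgb = props[0];
for (i = 1; i < n; i++)
  {
  enum gb rgb = props[i];
  if ((join_table[lgb] & (1u << rgb)) == 0) break;
  /* Keep lgb as E_Base across Extend so that a later E_Modifier still
     joins; in every other case the right-hand class becomes the new left. */
  if (rgb != GB_EXTEND || lgb != GB_E_BASE) lgb = rgb;
  }
return i;
}
```

In this sketch the sequence E_Base, Extend, E_Modifier is kept as one cluster, while Other, Extend, E_Modifier ends after the Extend, because only an emoji base "remembers" itself across the extending characters.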
--- a/src/pcre2_tables.c
+++ b/src/pcre2_tables.c
@@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
    LV or V may be followed by V or T
    LVT or T may be followed by T

-4. Do not break before extending characters.
+4. Do not break before extending characters or zero-width-joiner (ZWJ).

-The next two rules are only for extended grapheme clusters (but that's what we
+The following rules are only for extended grapheme clusters (but that's what we
 are implementing).

 5. Do not break before SpacingMarks.

 6. Do not break after Prepend characters.

-7. Otherwise, break everywhere.
+7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
+   by E_Modifier). Extend characters are allowed before the modifier; this
+   cannot be represented in this table, so the code has to deal with it.
+
+8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
+   E_Base_GAZ).
+
+9. Do not break within emoji flag sequences. That is, do not break between
+   regional indicator (RI) symbols if there are an odd number of RI characters
+   before the break point. This table encodes "join RI characters"; the code
+   has to deal with checking for previous adjoining RIs.
+
+10. Otherwise, break everywhere.
 */

+#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
+
 const uint32_t PRIV(ucp_gbtable)[] = {
   (1<<ucp_gbLF),                                           /*  0 CR */
   0,                                                       /*  1 LF */
   0,                                                       /*  2 Control */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  3 Extend */
-   (1<<ucp_gbExtend)|(1<<ucp_gbPrepend)|                    /*  4 Prepend */
-     (1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
-     (1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
-     (1<<ucp_gbLVT)|(1<<ucp_gbOther),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  5 SpacingMark */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|   /*  6 L */
-     (1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  7 V */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /*  8 T */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  9 LV */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /* 10 LVT */
+   ESZ,                                                     /*  3 Extend */
+   ESZ|(1<<ucp_gbPrepend)|                                  /*  4 Prepend */
+       (1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
+       (1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
+       (1<<ucp_gbRegionalIndicator)|
+       (1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
+       (1<<ucp_gbE_Base_GAZ)|
+       (1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
+   ESZ,                                                     /*  5 SpacingMark */
+   ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)|             /*  6 L */
+       (1<<ucp_gbLVT),
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  7 V */
+   ESZ|(1<<ucp_gbT),                                        /*  8 T */
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  9 LV */
+   ESZ|(1<<ucp_gbT),                                        /* 10 LVT */
   (1<<ucp_gbRegionalIndicator),                            /* 11 RegionalIndicator */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 12 Other */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 13 E_Base */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 14 E_Modifier */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 15 E_Base_GAZ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 16 ZWJ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)                 /* 12 Glue_After_Zwj */
+   ESZ,                                                     /* 12 Other */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 13 E_Base */
+   ESZ,                                                     /* 14 E_Modifier */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 15 E_Base_GAZ */
+   ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ),     /* 16 ZWJ */
+   ESZ                                                      /* 17 Glue_After_Zwj */
 };

+#undef ESZ
+
 #ifdef SUPPORT_JIT
 /* This table reverses PRIV(ucp_gentype). We can save the cost
 of a memory load. */
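The table above is indexed by the left-hand character's break class, and each row is a bitmask of the right-hand classes that may follow without a break. A minimal sketch of that lookup, using a reduced and hypothetical set of classes (`GB_*`) and a hypothetical function `no_break_between` rather than the real `ucp_gb*` values, might look like this. The sketch also wraps the `ESZ` macro in outer parentheses, which is a defensive choice here and not something the diff does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced set of grapheme break classes. */
enum gb_class {
  GB_CR, GB_LF, GB_CONTROL, GB_EXTEND, GB_SPACINGMARK,
  GB_ZWJ, GB_GLUE_AFTER_ZWJ, GB_OTHER, GB_COUNT
};

/* Classes that may follow most characters without a break. */
#define ESZ ((1u << GB_EXTEND) | (1u << GB_SPACINGMARK) | (1u << GB_ZWJ))

/* Row = left-hand class; a set bit means "do not break before this
   right-hand class". */
static const uint32_t gb_table[GB_COUNT] = {
  [GB_CR]             = 1u << GB_LF,
  [GB_LF]             = 0,
  [GB_CONTROL]        = 0,
  [GB_EXTEND]         = ESZ,
  [GB_SPACINGMARK]    = ESZ,
  [GB_ZWJ]            = ESZ | (1u << GB_GLUE_AFTER_ZWJ),
  [GB_GLUE_AFTER_ZWJ] = ESZ,
  [GB_OTHER]          = ESZ
};

/* True when the table forbids a break between the two classes. */
bool no_break_between(int lgb, int rgb)
{
assert(lgb >= 0 && lgb < GB_COUNT && rgb >= 0 && rgb < GB_COUNT);
return (gb_table[lgb] & (1u << rgb)) != 0;
}
```

A single AND against the row decides most cases; the regional-indicator parity and the Extend-before-E_Modifier case are the two situations the table alone cannot express, which is why the matching code handles them separately.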
--- a/testdata/testinput5
+++ b/testdata/testinput5
@@ -2041,4 +2041,23 @@
 /^(?:(\X)(?C))+$/utf
    \x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where 

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+
+
 # End of testinput5
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
 0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
 1: \x{11a00}\x{11a07}\x{11a47}

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+ 0: A\x{200d}
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+ 0: \x{261d}\x{1f3fb}
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+ 0: \x{1f466}\x{1f3ff}
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+ 0: \x{200d}\x{1f3a4}
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+ 0: \x{200d}\x{1f469}
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+ 0: \x{1f1e6}\x{1f1e7}
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+** /n is not valid here
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}\x{1f1e6}
+
+
 # End of testinput5