Update grapheme breaking rules for Unicode 10.0.0.

2017-07-05 08:55:49 +00:00 · 2017-07-05 08:55:49 +00:00 · 4f7a608d56
commit 4f7a608d56
parent 41bb787fb3
9 changed files with 514 additions and 98 deletions
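
Note on the matcher changes: the same two additions are repeated in every \X loop below, and they are hard to follow in diff form. The following is a minimal standalone sketch of that logic, not the PCRE2 code itself. The names gb_prop, ri_break_required() and next_lgb() are hypothetical, and the subject is modelled as an array of already-decoded grapheme break properties instead of UTF code units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of grapheme break properties, for illustration
   only; the real values are the ucp_gb* constants used in the diff. */
typedef enum {
  GB_OTHER, GB_EXTEND, GB_RI, GB_E_BASE, GB_E_BASE_GAZ, GB_E_MODIFIER
} gb_prop;

/* props[left] and props[left + 1] must both be regional indicators.
   Returns true if a grapheme break is required between them, which is the
   case when an odd number of RIs immediately precede props[left]. */
static bool ri_break_required(const gb_prop *props, size_t left)
{
  size_t ricount = 0;
  assert(props[left] == GB_RI && props[left + 1] == GB_RI);
  while (left > 0 && props[left - 1] == GB_RI)
    {
    left--;
    ricount++;
    }
  return (ricount & 1) != 0;
}

/* Returns the new "left-hand" property after accepting a character whose
   property is rgb. Extend after E_Base or E_Base_GAZ leaves lgb unchanged,
   so any number of Extend characters may precede an E_Modifier. */
static gb_prop next_lgb(gb_prop lgb, gb_prop rgb)
{
  if (rgb == GB_EXTEND && (lgb == GB_E_BASE || lgb == GB_E_BASE_GAZ))
    return lgb;
  return rgb;
}
```

With four consecutive RI characters, this sketch requires a break only between the second and the third, so flags pair up two by two.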
--- a/3
+++ b/3
@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.

 48. Add the callout_no_where modifier to pcre2test.

+49. Update extended grapheme breaking rules to the latest set in
+Unicode Standard Annex #29.
+

 Version 10.23 14-February-2017
 ------------------------------
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@ -177,7 +177,7 @@ The pcre2_match() function contains a counter that is incremented every time it
 goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also 
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 </P>
 <P>
@ -198,7 +198,7 @@ limits set by the programmer, but not raise them. If there is more than one
 setting of one of these limits, the lower value is used.
 </P>
 <P>
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is 
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 </P>
 <P>
@ -233,7 +233,7 @@ string with one of the following sequences:
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
-  (*NUL)       the NUL character (binary zero) 
+  (*NUL)       the NUL character (binary zero)
 </pre>
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@ -249,7 +249,7 @@ The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
 what the \R escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next 
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \R in the section entitled
 <a href="#newlineseq">"Newline sequences"</a>
 below. A change of \R setting can be combined with a change of newline
@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
 <a href="#atomicgroup">(see below).</a>
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \X always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+</P>
+<P>
+\X always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 </P>
 <P>
 1. End at the end of the subject string.
@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be followed only by a T character.
 </P>
 <P>
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 </P>
 <P>
 5. Do not end after prepend characters.
 </P>
 <P>
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+</P>
+<P>
+7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+</P>
+<P>
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+</P>
+<P>
 9. Otherwise, end the cluster.
 <a name="extraprops"></a></P>
 <br><b>
@ -1562,15 +1579,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
 <pre>
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
-  n  for PCRE2_NO_AUTO_CAPTURE 
+  n  for PCRE2_NO_AUTO_CAPTURE
  s  for PCRE2_DOTALL
  x  for PCRE2_EXTENDED
  xx for PCRE2_EXTENDED_MORE
 </pre>
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended" 
-options are not independent; unsetting either one cancels the effects of both 
-of them. 
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 </P>
 <P>
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
@ -2249,14 +2266,14 @@ capturing subpatterns within it, these are counted for the purposes of
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and 
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 </P>
 <P>
 No capturing is done for a negative assertion unless it is being used as a
 condition in a
 <a href="#subpatternsassubroutines">conditional subpattern</a>
-(see the discussion below). Matching continues after a non-conditional negative 
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 </P>
 <P>
@ -2824,14 +2841,14 @@ if it contained untried alternatives and there was a subsequent matching
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 </P>
 <P>
-Starting with release 10.30, recursive subroutine calls are no longer treated 
-as atomic. That is, they can be re-entered to try unused alternatives if there 
-is a matching failure later in the pattern. This is now compatible with the way 
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 </P>
 <P>
-Supporting backtracking into recursions simplifies certain types of recursive 
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 <pre>
  ^((.)(?1)\2|.?)$
@ -2863,7 +2880,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in 
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 <a name="subpatternsassubroutines"></a></P>
 <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
@ -3398,17 +3415,17 @@ processing; captured substrings are discarded.
 </P>
 <P>
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are 
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 </P>
 <P>
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there 
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 </P>
 <P>
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion, 
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@ -3455,7 +3472,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 <br>
 Copyright &copy; 1997-2017 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@ -6433,27 +6433,41 @@ BACKSLASH
       (see  below).  Unicode supports various kinds of composite character by
       giving each character a grapheme breaking property,  and  having  rules
       that use these properties to define the boundaries of extended grapheme
-       clusters. \X always matches at least one  character.  Then  it  decides
-       whether  to  add additional characters according to the following rules
-       for ending a cluster:
+       clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
+       Text Segmentation".
+
+       \X  always  matches  at least one character. Then it decides whether to
+       add additional characters according to the following rules for ending a
+       cluster:

       1. End at the end of the subject string.

-       2. Do not end between CR and LF; otherwise end after any control  char-
+       2.  Do not end between CR and LF; otherwise end after any control char-
       acter.

-       3.  Do  not  break  Hangul (a Korean script) syllable sequences. Hangul
-       characters are of five types: L, V, T, LV, and LVT. An L character  may
-       be  followed by an L, V, LV, or LVT character; an LV or V character may
+       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
+       characters  are of five types: L, V, T, LV, and LVT. An L character may
+       be followed by an L, V, LV, or LVT character; an LV or V character  may
        be followed by a V or T character; an LVT or T character may be followed
       only by a T character.

-       4.  Do not end before extending characters or spacing marks. Characters
-       with the "mark" property always have  the  "extend"  grapheme  breaking
-       property.
+       4. Do not end before extending  characters  or  spacing  marks  or  the
+       "zero-width  joiner"  characters.  Characters  with the "mark" property
+       always have the "extend" grapheme breaking property.

       5. Do not end after prepend characters.

+       6. Do not break within emoji modifier sequences (a base character  fol-
+       lowed by a modifier). Extending characters are allowed before the modi-
+       fier.
+
+       7. Do not break within emoji ZWJ sequences (zero-width joiner  followed
+       by "glue after ZWJ" or "base glue after ZWJ").
+
+       8.  Do  not  break  within  emoji flag sequences. That is, do not break
+       between regional indicator (RI) characters if there are an  odd  number
+       of RI characters before the break point.
+
        9. Otherwise, end the cluster.

   PCRE2's additional properties
@ -8744,7 +8758,7 @@ AUTHOR

 REVISION

-       Last updated: 02 July 2017
+       Last updated: 05 July 2017
       Copyright (c) 1997-2017 University of Cambridge.
 ------------------------------------------------------------------------------
 
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
+.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -145,7 +145,7 @@ The pcre2_match() function contains a counter that is incremented every time it
 goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
 this counter, which therefore limits the amount of computing resource used for
 a match. The maximum depth of nested backtracking can also be limited; this
-indirectly restricts the amount of heap memory that is used, but there is also 
+indirectly restricts the amount of heap memory that is used, but there is also
 an explicit memory limit that can be set.
 .P
 These facilities are provided to catch runaway matches that are provoked by
@ -164,7 +164,7 @@ for it to have any effect. In other words, the pattern writer can lower the
 limits set by the programmer, but not raise them. If there is more than one
 setting of one of these limits, the lower value is used.
 .P
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is 
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 still recognized for backwards compatibility.
 .P
 The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
@ -203,7 +203,7 @@ string with one of the following sequences:
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
-  (*NUL)       the NUL character (binary zero) 
+  (*NUL)       the NUL character (binary zero)
 .sp
 These override the default and the options given to the compiling function. For
 example, on a Unix system where LF is the default newline sequence, the pattern
@ -218,7 +218,7 @@ The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
 PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
 what the \eR escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the next 
+sequence, for Perl compatibility. However, this can be changed; see the next
 section and the description of \eR in the section entitled
 .\" HTML <a href="#newlineseq">
 .\" </a>
@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
 .\"
 Unicode supports various kinds of composite character by giving each character
 a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \eX always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+.P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
 .P
 1. End at the end of the subject string.
 .P
@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be followed only by a T character.
 .P
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
 .P
 5. Do not end after prepend characters.
 .P
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+.P
+7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+.P
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
 9. Otherwise, end the cluster.
 .
 .
@ -1560,15 +1573,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
 .sp
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
-  n  for PCRE2_NO_AUTO_CAPTURE 
+  n  for PCRE2_NO_AUTO_CAPTURE
  s  for PCRE2_DOTALL
  x  for PCRE2_EXTENDED
  xx for PCRE2_EXTENDED_MORE
 .sp
 For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen. The two "extended" 
-options are not independent; unsetting either one cancels the effects of both 
-of them. 
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
 .P
 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
 and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
@ -2256,16 +2269,16 @@ capturing subpatterns within it, these are counted for the purposes of
 numbering the capturing subpatterns in the whole pattern. However, substring
 capturing is carried out only for positive assertions that succeed, that is,
 one of their branches matches, so matching continues after the assertion. If
-all branches of a positive assertion fail to match, nothing is captured, and 
+all branches of a positive assertion fail to match, nothing is captured, and
 control is passed to the previous backtracking point.
 .P
 No capturing is done for a negative assertion unless it is being used as a
 condition in a
 .\" HTML <a href="#subpatternsassubroutines">
 .\" </a>
-conditional subpattern 
+conditional subpattern
 .\"
-(see the discussion below). Matching continues after a non-conditional negative 
+(see the discussion below). Matching continues after a non-conditional negative
 assertion only if all its branches fail to match.
 .P
 For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2846,13 +2859,13 @@ once it had matched some of the subject string, it was never re-entered, even
 if it contained untried alternatives and there was a subsequent matching
 failure. (Historical note: PCRE implemented recursion before Perl did.)
 .P
-Starting with release 10.30, recursive subroutine calls are no longer treated 
-as atomic. That is, they can be re-entered to try unused alternatives if there 
-is a matching failure later in the pattern. This is now compatible with the way 
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
 Perl works. If you want a subroutine call to be atomic, you must explicitly
 enclose it in an atomic group.
 .P
-Supporting backtracking into recursions simplifies certain types of recursive 
+Supporting backtracking into recursions simplifies certain types of recursive
 pattern. For example, this pattern matches palindromic strings:
 .sp
  ^((.)(?1)\e2|.?)$
@ -2883,7 +2896,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
 This pattern matches "bab". The first capturing parentheses match "b", then in
 the second group, when the back reference \e1 fails to match "b", the second
 alternative matches "a" and then recurses. In the recursion, \e1 does now match
-"b" and so the whole match succeeds. This match used to fail in Perl, but in 
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 .
 .
@ -3427,15 +3440,15 @@ negative assertion, (*ACCEPT) causes the assertion to fail without any further
 processing; captured substrings are discarded.
 .P
 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
-a positive assertion and false for a negative one; captured substrings are 
+a positive assertion and false for a negative one; captured substrings are
 retained in both cases.
 .P
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there 
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
 are no more branches to try, (*THEN) causes a positive assertion to be false,
 and a negative assertion to be true.
 .P
 The other backtracking verbs are not treated specially if they appear in a
-standalone positive assertion. In a conditional positive assertion, 
+standalone positive assertion. In a conditional positive assertion,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
 false. However, for both standalone and conditional negative assertions,
 backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@ -3485,6 +3498,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 July 2017
+Last updated: 05 July 2017
 Copyright (c) 1997-2017 University of Cambridge.
 .fi
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@ -1379,8 +1379,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        count++;
@ -1641,8 +1679,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@ -1912,8 +1988,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@ -2102,8 +2216,46 @@ for (;;)
          if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
          rgb = UCD_GRAPHBREAK(d);
          if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
+
+          /* Not breaking between Regional Indicators is allowed only if
+          there are an even number of preceding RIs. */
+
+          if (lgb == ucp_gbRegionalIndicator &&
+              rgb == ucp_gbRegionalIndicator)
+            {
+            int ricount = 0;
+            PCRE2_SPTR bptr = nptr - 1;
+#ifdef SUPPORT_UNICODE
+            if (utf) BACKCHAR(bptr);
+#endif
+            /* bptr is pointing to the left-hand character */
+
+            while (bptr > mb->start_subject)
+              {
+              bptr--;
+#ifdef SUPPORT_UNICODE
+              if (utf)
+                {
+                BACKCHAR(bptr);
+                GETCHAR(d, bptr);
+                }
+              else
+#endif
+              d = *bptr;
+              if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
+              ricount++;
+              }
+            if ((ricount & 1) != 0) break;  /* Grapheme break required */
+            }
+
+          /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+          any number of Extend before a following E_Modifier. */
+
+          if (rgb != ucp_gbExtend ||
+              (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+            lgb = rgb;
+
          ncount++;
-          lgb = rgb;
          nptr += dlen;
          }
        if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@ -2129,7 +2281,7 @@ for (;;)
        case 0x2029:
 #endif  /* Not EBCDIC */
        if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
-        /* Fall through */ 
+        /* Fall through */

        case CHAR_LF:
        ADD_NEW(state_offset + 1, 0);
@ -3427,7 +3579,7 @@ for (;;)
      while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }
-      
+
    /* Anchored: check the first code unit if one is recorded. This may seem
    pointless but it can help in detecting a no match case without scanning for
    the required code unit. */
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
        if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
        rgb = UCD_GRAPHBREAK(fc);
        if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-        lgb = rgb;
+
+        /* Not breaking between Regional Indicators is allowed only if there
+        are an even number of preceding RIs. */
+
+        if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
+          {
+          int ricount = 0;
+          PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+          if (utf) BACKCHAR(bptr);
+#endif
+          /* bptr is pointing to the left-hand character */
+
+          while (bptr > mb->start_subject)
+            {
+            bptr--;
+#ifdef SUPPORT_UNICODE
+            if (utf)
+              {
+              BACKCHAR(bptr);
+              GETCHAR(fc, bptr);
+              }
+            else
+#endif
+            fc = *bptr;
+            if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+            ricount++;
+            }
+          if ((ricount & 1) != 0) break;  /* Grapheme break required */
+          }
+
+        /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+        any number of Extend before a following E_Modifier. */
+
+        if (rgb != ucp_gbExtend ||
+            (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+          lgb = rgb;
+
        Feptr += len;
        }
      }
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
              if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
              rgb = UCD_GRAPHBREAK(fc);
              if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
-              lgb = rgb;
+
+              /* Not breaking between Regional Indicators is allowed only if
+              there are an even number of preceding RIs. */
+
+              if (lgb == ucp_gbRegionalIndicator &&
+                  rgb == ucp_gbRegionalIndicator)
+                {
+                int ricount = 0;
+                PCRE2_SPTR bptr = Feptr - 1;
+#ifdef SUPPORT_UNICODE
+                if (utf) BACKCHAR(bptr);
+#endif
+                /* bptr is pointing to the left-hand character */
+
+                while (bptr > mb->start_subject)
+                  {
+                  bptr--;
+#ifdef SUPPORT_UNICODE
+                  if (utf)
+                    {
+                    BACKCHAR(bptr);
+                    GETCHAR(fc, bptr);
+                    }
+                  else
+#endif
+                  fc = *bptr;
+                  if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
+                  ricount++;
+                  }
+                if ((ricount & 1) != 0) break;  /* Grapheme break required */
+                }
+
+              /* If Extend follows E_Base[_GAZ] do not update lgb; this allows
+              any number of Extend before a following E_Modifier. */
+
+              if (rgb != ucp_gbExtend ||
+                  (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
+                lgb = rgb;
+
              Feptr += len;
              }
            }
--- a/src/pcre2_tables.c
+++ b/src/pcre2_tables.c
@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
    LV or V may be followed by V or T
    LVT or T may be followed by T

-4. Do not break before extending characters.
+4. Do not break before extending characters or zero-width-joiner (ZWJ).

-The next two rules are only for extended grapheme clusters (but that's what we
+The following rules are only for extended grapheme clusters (but that's what we
 are implementing).

 5. Do not break before SpacingMarks.

 6. Do not break after Prepend characters.

-7. Otherwise, break everywhere.
+7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
+   by E_Modifier). Extend characters are allowed before the modifier; this
+   cannot be represented in this table, so the code has to deal with it.
+
+8. Do not break within emoji ZWJ sequences (ZWJ followed by Glue_After_Zwj or
+   E_Base_GAZ).
+
+9. Do not break within emoji flag sequences. That is, do not break between
+   regional indicator (RI) symbols if there is an odd number of RI characters
+   before the break point. This table encodes "join RI characters"; the code
+   has to deal with checking for previous adjoining RIs.
+
+10. Otherwise, break everywhere.
 */

+#define ESZ ((1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ))
+
 const uint32_t PRIV(ucp_gbtable)[] = {
   (1<<ucp_gbLF),                                           /*  0 CR */
   0,                                                       /*  1 LF */
   0,                                                       /*  2 Control */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  3 Extend */
-   (1<<ucp_gbExtend)|(1<<ucp_gbPrepend)|                    /*  4 Prepend */
-     (1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
-     (1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
-     (1<<ucp_gbLVT)|(1<<ucp_gbOther),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /*  5 SpacingMark */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|   /*  6 L */
-     (1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  7 V */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /*  8 T */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)|   /*  9 LV */
-     (1<<ucp_gbT),
-
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT),   /* 10 LVT */
+   ESZ,                                                     /*  3 Extend */
+   ESZ|(1<<ucp_gbPrepend)|                                  /*  4 Prepend */
+       (1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
+       (1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
+       (1<<ucp_gbRegionalIndicator)|
+       (1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
+       (1<<ucp_gbE_Base_GAZ)|
+       (1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
+   ESZ,                                                     /*  5 SpacingMark */
+   ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)|             /*  6 L */
+       (1<<ucp_gbLVT),
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  7 V */
+   ESZ|(1<<ucp_gbT),                                        /*  8 T */
+   ESZ|(1<<ucp_gbV)|(1<<ucp_gbT),                           /*  9 LV */
+   ESZ|(1<<ucp_gbT),                                        /* 10 LVT */
   (1<<ucp_gbRegionalIndicator),                            /* 11 RegionalIndicator */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 12 Other */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 13 E_Base */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 14 E_Modifier */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 15 E_Base_GAZ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark),                /* 16 ZWJ */
-   (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)                 /* 12 Glue_After_Zwj */
+   ESZ,                                                     /* 12 Other */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 13 E_Base */
+   ESZ,                                                     /* 14 E_Modifier */
+   ESZ|(1<<ucp_gbE_Modifier),                               /* 15 E_Base_GAZ */
+   ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ),     /* 16 ZWJ */
+   ESZ                                                      /* 17 Glue_After_Zwj */
 };

+#undef ESZ
+
 #ifdef SUPPORT_JIT
 /* This table reverses PRIV(ucp_gentype). We can save the cost
 of a memory load. */
--- a/testdata/testinput5
+++ b/testdata/testinput5
@ -2041,4 +2041,23 @@
 /^(?:(\X)(?C))+$/utf
    \x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where 

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+
+
 # End of testinput5
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
 0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
 1: \x{11a00}\x{11a07}\x{11a47}

+# These two are here because JIT is not yet updated. Also, the very first data
+# line is handled differently by Perl.
+
+/^\X/utf
+    A\x{200d}B                     A ZWJ
+ 0: A\x{200d}
+    \x{261D}\x{1F3FB}B             E_Base E_Modifier
+ 0: \x{261d}\x{1f3fb}
+    \x{1F466}\x{1F3FF}B            E_Base_GAZ E_Modifier 
+ 0: \x{1f466}\x{1f3ff}
+    \x{200d}\x{1F3A4}B             ZWJ Glue_After_ZWJ
+ 0: \x{200d}\x{1f3a4}
+    \x{200d}\x{1F469}B             ZWJ E_Base_GAZ  
+ 0: \x{200d}\x{1f469}
+    \x{1F1E6}\x{1F1E7}B            RegionalIndicator RegionalIndicator 
+ 0: \x{1f1e6}\x{1f1e7}
+    \x{261D}\x{E0100}\x{1F3FB}B\=no_jit    E_Base Extend E_Modifier
+** /n is not valid here
+    
+# Regional indicators
+
+/^(\X)(\X)/utf,aftertext
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}
+    \x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
+ 0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
+ 0+ B
+ 1: \x{1f1e6}\x{1f1e7}
+ 2: \x{1f1e7}\x{1f1e6}
+
+
 # End of testinput5
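
The two context-dependent checks added to the \X matching loops above (the
Regional Indicator parity test and the "do not update lgb" rule for Extend
after E_Base) can be sketched in isolation. This is only an illustrative
sketch, not PCRE2 code: it works on an array of already-classified break
properties instead of a UTF subject string, and the names enum gb, may_join()
and cluster_len() are hypothetical. The may_join() function stands in for the
PRIV(ucp_gbtable) lookup and covers only the handful of properties needed here.

```c
#include <stddef.h>

/* Hypothetical, reduced set of grapheme break properties, for illustration
only. These are not PCRE2's ucp_gb* values. */

enum gb { GB_OTHER, GB_EXTEND, GB_RI, GB_E_BASE, GB_E_MODIFIER };

/* Context-free "may these two be joined?" test. This plays the role of the
bit table: it says RI may join RI, but cannot express the parity condition. */

static int may_join(enum gb l, enum gb r)
{
if (r == GB_EXTEND) return 1;                       /* Rule 4 */
if (l == GB_E_BASE && r == GB_E_MODIFIER) return 1; /* Rule 7 */
if (l == GB_RI && r == GB_RI) return 1;             /* Rule 9, table part */
return 0;
}

/* Return the number of items in the cluster that starts at props[start].
As in the patch, the backward RI scan runs to the start of the whole subject
(index 0), not just to the start of the current cluster. */

size_t cluster_len(const enum gb *props, size_t n, size_t start)
{
size_t i;
enum gb lgb;

if (start >= n) return 0;
lgb = props[start];

for (i = start + 1; i < n; i++)
  {
  enum gb rgb = props[i];
  if (!may_join(lgb, rgb)) break;

  /* Joining two RIs is allowed only if an even number of RIs precede the
  left-hand one. */

  if (lgb == GB_RI && rgb == GB_RI)
    {
    size_t ricount = 0;
    size_t b = i - 1;                /* Index of the left-hand RI */
    while (b > 0 && props[b - 1] == GB_RI) { b--; ricount++; }
    if ((ricount & 1) != 0) break;   /* Grapheme break required */
    }

  /* If Extend follows E_Base, keep lgb unchanged so that a later
  E_Modifier is still tested against the base. */

  if (rgb != GB_EXTEND || lgb != GB_E_BASE) lgb = rgb;
  }

return i - start;
}
```

With the property sequence RI RI RI Other, cluster_len() returns 2 at offset 0
and 1 at offset 2; with RI RI RI RI Other it returns 2 at both offsets. This
mirrors the two "Regional indicators" test cases above. For E_Base Extend
E_Modifier Other it returns 3, because lgb stays at E_Base across the Extend.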