diff --git a/ChangeLog b/ChangeLog index ee7cc66..ce57091 100644 --- a/ChangeLog +++ b/ChangeLog @@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this. 48. Add the callout_no_where modifier to pcre2test. +49. Update extended grapheme breaking rules to the latest set that are in +Unicode Standard Annex #29. + Version 10.23 14-February-2017 ------------------------------ diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index a582316..2fb3eb5 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -177,7 +177,7 @@ The pcre2_match() function contains a counter that is incremented every time it goes round its main loop. The caller of pcre2_match() can set a limit on this counter, which therefore limits the amount of computing resource used for a match. The maximum depth of nested backtracking can also be limited; this -indirectly restricts the amount of heap memory that is used, but there is also +indirectly restricts the amount of heap memory that is used, but there is also an explicit memory limit that can be set.
@@ -198,7 +198,7 @@ limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower value is used.
-Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is +Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is still recognized for backwards compatibility.
@@ -233,7 +233,7 @@ string with one of the following sequences: (*CRLF) carriage return, followed by linefeed (*ANYCRLF) any of the three above (*ANY) all Unicode newline sequences - (*NUL) the NUL character (binary zero) + (*NUL) the NUL character (binary zero) These override the default and the options given to the compiling function. For example, on a Unix system where LF is the default newline sequence, the pattern @@ -249,7 +249,7 @@ The newline convention affects where the circumflex and dollar assertions are true. It also affects the interpretation of the dot metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect what the \R escape sequence matches. By default, this is any Unicode newline -sequence, for Perl compatibility. However, this can be changed; see the next +sequence, for Perl compatibility. However, this can be changed; see the next section and the description of \R in the section entitled "Newline sequences" below. A change of \R setting can be combined with a change of newline @@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group (see below). Unicode supports various kinds of composite character by giving each character a grapheme breaking property, and having rules that use these properties to -define the boundaries of extended grapheme clusters. \X always matches at -least one character. Then it decides whether to add additional characters -according to the following rules for ending a cluster: +define the boundaries of extended grapheme clusters. The rules are defined in +Unicode Standard Annex 29, "Unicode Text Segmentation". +
++\X always matches at least one character. Then it decides whether to add +additional characters according to the following rules for ending a cluster:
1. End at the end of the subject string. @@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be follwed only by a T character.
-4. Do not end before extending characters or spacing marks. Characters with -the "mark" property always have the "extend" grapheme breaking property. +4. Do not end before extending characters or spacing marks or the "zero-width +joiner" characters. Characters with the "mark" property always have the +"extend" grapheme breaking property.
5. Do not end after prepend characters.
+6. Do not break within emoji modifier sequences (a base character followed by a +modifier). Extending characters are allowed before the modifier. +
++7. Do not break within emoji zwj sequences (zero-width joiner followed by +"glue after ZWJ" or "base glue after ZWJ"). +
++8. Do not break within emoji flag sequences. That is, do not break between +regional indicator (RI) characters if there are an odd number of RI characters +before the break point. +
+6. Otherwise, end the cluster.
i for PCRE2_CASELESS m for PCRE2_MULTILINE - n for PCRE2_NO_AUTO_CAPTURE + n for PCRE2_NO_AUTO_CAPTURE s for PCRE2_DOTALL x for PCRE2_EXTENDED xx for PCRE2_EXTENDED_MOREFor example, (?im) sets caseless, multiline matching. It is also possible to -unset these options by preceding the letter with a hyphen. The two "extended" -options are not independent; unsetting either one cancels the effects of both -of them. +unset these options by preceding the letter with a hyphen. The two "extended" +options are not independent; unsetting either one cancels the effects of both +of them.
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS @@ -2249,14 +2266,14 @@ capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions that succeed, that is, one of their branches matches, so matching continues after the assertion. If -all branches of a positive assertion fail to match, nothing is captured, and +all branches of a positive assertion fail to match, nothing is captured, and control is passed to the previous backtracking point.
No capturing is done for a negative assertion unless it is being used as a condition in a conditional subpattern -(see the discussion below). Matching continues after a non-conditional negative +(see the discussion below). Matching continues after a non-conditional negative assertion only if all its branches fail to match.
@@ -2824,14 +2841,14 @@ if it contained untried alternatives and there was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.)
-Starting with release 10.30, recursive subroutine calls are no longer treated -as atomic. That is, they can be re-entered to try unused alternatives if there -is a matching failure later in the pattern. This is now compatible with the way +Starting with release 10.30, recursive subroutine calls are no longer treated +as atomic. That is, they can be re-entered to try unused alternatives if there +is a matching failure later in the pattern. This is now compatible with the way Perl works. If you want a subroutine call to be atomic, you must explicitly enclose it in an atomic group.
-Supporting backtracking into recursions simplifies certain types of recursive +Supporting backtracking into recursions simplifies certain types of recursive pattern. For example, this pattern matches palindromic strings:
^((.)(?1)\2|.?)$ @@ -2863,7 +2880,7 @@ in PCRE2 these values can be referenced. Consider this pattern: This pattern matches "bab". The first capturing parentheses match "b", then in the second group, when the back reference \1 fails to match "b", the second alternative matches "a" and then recurses. In the recursion, \1 does now match -"b" and so the whole match succeeds. This match used to fail in Perl, but in +"b" and so the whole match succeeds. This match used to fail in Perl, but in later versions (I tried 5.024) it now works.
SUBPATTERNS AS SUBROUTINES
@@ -3398,17 +3415,17 @@ processing; captured substrings are discarded.If the assertion is a condition, (*ACCEPT) causes the condition to be true for -a positive assertion and false for a negative one; captured substrings are +a positive assertion and false for a negative one; captured substrings are retained in both cases.
-The effect of (*THEN) is not allowed to escape beyond an assertion. If there +The effect of (*THEN) is not allowed to escape beyond an assertion. If there are no more branches to try, (*THEN) causes a positive assertion to be false, and a negative assertion to be true.
The other backtracking verbs are not treated specially if they appear in a -standalone positive assertion. In a conditional positive assertion, +standalone positive assertion. In a conditional positive assertion, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be false. However, for both standalone and conditional negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be @@ -3455,7 +3472,7 @@ Cambridge, England.
REVISION
-Last updated: 02 July 2017 +Last updated: 05 July 2017
Copyright © 1997-2017 University of Cambridge.
diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 6a9bb96..1f7be3d 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -6433,27 +6433,41 @@ BACKSLASH (see below). Unicode supports various kinds of composite character by giving each character a grapheme breaking property, and having rules that use these properties to define the boundaries of extended grapheme - clusters. \X always matches at least one character. Then it decides - whether to add additional characters according to the following rules - for ending a cluster: + clusters. The rules are defined in Unicode Standard Annex 29, "Unicode + Text Segmentation". + + \X always matches at least one character. Then it decides whether to + add additional characters according to the following rules for ending a + cluster: 1. End at the end of the subject string. - 2. Do not end between CR and LF; otherwise end after any control char- + 2. Do not end between CR and LF; otherwise end after any control char- acter. - 3. Do not break Hangul (a Korean script) syllable sequences. Hangul - characters are of five types: L, V, T, LV, and LVT. An L character may - be followed by an L, V, LV, or LVT character; an LV or V character may + 3. Do not break Hangul (a Korean script) syllable sequences. Hangul + characters are of five types: L, V, T, LV, and LVT. An L character may + be followed by an L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be follwed only by a T character. - 4. Do not end before extending characters or spacing marks. Characters - with the "mark" property always have the "extend" grapheme breaking - property. + 4. Do not end before extending characters or spacing marks or the + "zero-width joiner" characters. Characters with the "mark" property + always have the "extend" grapheme breaking property. 5. Do not end after prepend characters. + 6. Do not break within emoji modifier sequences (a base character fol- + lowed by a modifier). 
Extending characters are allowed before the modi- + fier. + + 7. Do not break within emoji zwj sequences (zero-width jointer followed + by "glue after ZWJ" or "base glue after ZWJ"). + + 8. Do not break within emoji flag sequences. That is, do not break + between regional indicator (RI) characters if there are an odd number + of RI characters before the break point. + 6. Otherwise, end the cluster. PCRE2's additional properties @@ -8744,7 +8758,7 @@ AUTHOR REVISION - Last updated: 02 July 2017 + Last updated: 05 July 2017 Copyright (c) 1997-2017 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 42ab96b..c3d54a8 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30" +.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -145,7 +145,7 @@ The pcre2_match() function contains a counter that is incremented every time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on this counter, which therefore limits the amount of computing resource used for a match. The maximum depth of nested backtracking can also be limited; this -indirectly restricts the amount of heap memory that is used, but there is also +indirectly restricts the amount of heap memory that is used, but there is also an explicit memory limit that can be set. .P These facilities are provided to catch runaway matches that are provoked by @@ -164,7 +164,7 @@ for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower value is used. .P -Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is +Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. 
This name is still recognized for backwards compatibility. .P The heap limit applies only when the \fBpcre2_match()\fP interpreter is used @@ -203,7 +203,7 @@ string with one of the following sequences: (*CRLF) carriage return, followed by linefeed (*ANYCRLF) any of the three above (*ANY) all Unicode newline sequences - (*NUL) the NUL character (binary zero) + (*NUL) the NUL character (binary zero) .sp These override the default and the options given to the compiling function. For example, on a Unix system where LF is the default newline sequence, the pattern @@ -218,7 +218,7 @@ The newline convention affects where the circumflex and dollar assertions are true. It also affects the interpretation of the dot metacharacter when PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect what the \eR escape sequence matches. By default, this is any Unicode newline -sequence, for Perl compatibility. However, this can be changed; see the next +sequence, for Perl compatibility. However, this can be changed; see the next section and the description of \eR in the section entitled .\" HTML .\" @@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group .\" Unicode supports various kinds of composite character by giving each character a grapheme breaking property, and having rules that use these properties to -define the boundaries of extended grapheme clusters. \eX always matches at -least one character. Then it decides whether to add additional characters -according to the following rules for ending a cluster: +define the boundaries of extended grapheme clusters. The rules are defined in +Unicode Standard Annex 29, "Unicode Text Segmentation". +.P +\eX always matches at least one character. Then it decides whether to add +additional characters according to the following rules for ending a cluster: .P 1. End at the end of the subject string. .P @@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. 
An L character may be followed by an L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be follwed only by a T character. .P -4. Do not end before extending characters or spacing marks. Characters with -the "mark" property always have the "extend" grapheme breaking property. +4. Do not end before extending characters or spacing marks or the "zero-width +joiner" characters. Characters with the "mark" property always have the +"extend" grapheme breaking property. .P 5. Do not end after prepend characters. .P +6. Do not break within emoji modifier sequences (a base character followed by a +modifier). Extending characters are allowed before the modifier. +.P +7. Do not break within emoji zwj sequences (zero-width jointer followed by +"glue after ZWJ" or "base glue after ZWJ"). +.P +8. Do not break within emoji flag sequences. That is, do not break between +regional indicator (RI) characters if there are an odd number of RI characters +before the break point. +.P 6. Otherwise, end the cluster. . . @@ -1560,15 +1573,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are .sp i for PCRE2_CASELESS m for PCRE2_MULTILINE - n for PCRE2_NO_AUTO_CAPTURE + n for PCRE2_NO_AUTO_CAPTURE s for PCRE2_DOTALL x for PCRE2_EXTENDED xx for PCRE2_EXTENDED_MORE .sp For example, (?im) sets caseless, multiline matching. It is also possible to -unset these options by preceding the letter with a hyphen. The two "extended" -options are not independent; unsetting either one cancels the effects of both -of them. +unset these options by preceding the letter with a hyphen. The two "extended" +options are not independent; unsetting either one cancels the effects of both +of them. 
.P A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also @@ -2256,16 +2269,16 @@ capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions that succeed, that is, one of their branches matches, so matching continues after the assertion. If -all branches of a positive assertion fail to match, nothing is captured, and +all branches of a positive assertion fail to match, nothing is captured, and control is passed to the previous backtracking point. .P No capturing is done for a negative assertion unless it is being used as a condition in a .\" HTML .\" -conditional subpattern +conditional subpattern .\" -(see the discussion below). Matching continues after a non-conditional negative +(see the discussion below). Matching continues after a non-conditional negative assertion only if all its branches fail to match. .P For compatibility with Perl, most assertion subpatterns may be repeated; though @@ -2846,13 +2859,13 @@ once it had matched some of the subject string, it was never re-entered, even if it contained untried alternatives and there was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.) .P -Starting with release 10.30, recursive subroutine calls are no longer treated -as atomic. That is, they can be re-entered to try unused alternatives if there -is a matching failure later in the pattern. This is now compatible with the way +Starting with release 10.30, recursive subroutine calls are no longer treated +as atomic. That is, they can be re-entered to try unused alternatives if there +is a matching failure later in the pattern. This is now compatible with the way Perl works. If you want a subroutine call to be atomic, you must explicitly enclose it in an atomic group. 
.P -Supporting backtracking into recursions simplifies certain types of recursive +Supporting backtracking into recursions simplifies certain types of recursive pattern. For example, this pattern matches palindromic strings: .sp ^((.)(?1)\e2|.?)$ @@ -2883,7 +2896,7 @@ in PCRE2 these values can be referenced. Consider this pattern: This pattern matches "bab". The first capturing parentheses match "b", then in the second group, when the back reference \e1 fails to match "b", the second alternative matches "a" and then recurses. In the recursion, \e1 does now match -"b" and so the whole match succeeds. This match used to fail in Perl, but in +"b" and so the whole match succeeds. This match used to fail in Perl, but in later versions (I tried 5.024) it now works. . . @@ -3427,15 +3440,15 @@ negative assertion, (*ACCEPT) causes the assertion to fail without any further processing; captured substrings are discarded. .P If the assertion is a condition, (*ACCEPT) causes the condition to be true for -a positive assertion and false for a negative one; captured substrings are +a positive assertion and false for a negative one; captured substrings are retained in both cases. .P -The effect of (*THEN) is not allowed to escape beyond an assertion. If there +The effect of (*THEN) is not allowed to escape beyond an assertion. If there are no more branches to try, (*THEN) causes a positive assertion to be false, and a negative assertion to be true. .P The other backtracking verbs are not treated specially if they appear in a -standalone positive assertion. In a conditional positive assertion, +standalone positive assertion. In a conditional positive assertion, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be false. However, for both standalone and conditional negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be @@ -3485,6 +3498,6 @@ Cambridge, England. 
.rs .sp .nf -Last updated: 02 July 2017 +Last updated: 05 July 2017 Copyright (c) 1997-2017 University of Cambridge. .fi diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index 7fe6dfe..5ae1394 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -1379,8 +1379,46 @@ for (;;) if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); } rgb = UCD_GRAPHBREAK(d); if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = nptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(d, bptr); + } + else +#endif + d = *bptr; + if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + ncount++; - lgb = rgb; nptr += dlen; } count++; @@ -1641,8 +1679,46 @@ for (;;) if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); } rgb = UCD_GRAPHBREAK(d); if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. 
*/ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = nptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(d, bptr); + } + else +#endif + d = *bptr; + if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + ncount++; - lgb = rgb; nptr += dlen; } ADD_NEW_DATA(-(state_offset + count), 0, ncount); @@ -1912,8 +1988,46 @@ for (;;) if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); } rgb = UCD_GRAPHBREAK(d); if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = nptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(d, bptr); + } + else +#endif + d = *bptr; + if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. 
*/ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + ncount++; - lgb = rgb; nptr += dlen; } if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0) @@ -2102,8 +2216,46 @@ for (;;) if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); } rgb = UCD_GRAPHBREAK(d); if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = nptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(d, bptr); + } + else +#endif + d = *bptr; + if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + ncount++; - lgb = rgb; nptr += dlen; } if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0) @@ -2129,7 +2281,7 @@ for (;;) case 0x2029: #endif /* Not EBCDIC */ if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break; - /* Fall through */ + /* Fall through */ case CHAR_LF: ADD_NEW(state_offset + 1, 0); @@ -3427,7 +3579,7 @@ for (;;) while (t < mb->end_subject && !IS_NEWLINE(t)) t++; end_subject = t; } - + /* Anchored: check the first code unit if one is recorded. This may seem pointless but it can help in detecting a no match case without scanning for the required code unit. 
*/ diff --git a/src/pcre2_match.c b/src/pcre2_match.c index 2461da1..050b7e9 100644 --- a/src/pcre2_match.c +++ b/src/pcre2_match.c @@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode); if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); } rgb = UCD_GRAPHBREAK(fc); if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break; - lgb = rgb; + + /* Not breaking between Regional Indicators is allowed only if there + are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = Feptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(fc, bptr); + } + else +#endif + fc = *bptr; + if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + Feptr += len; } } @@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode); if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); } rgb = UCD_GRAPHBREAK(fc); if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break; - lgb = rgb; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. 
*/ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = Feptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(fc, bptr); + } + else +#endif + fc = *bptr; + if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + Feptr += len; } } @@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode); if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); } rgb = UCD_GRAPHBREAK(fc); if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break; - lgb = rgb; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = Feptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(fc, bptr); + } + else +#endif + fc = *bptr; + if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. 
*/ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + Feptr += len; } } @@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode); if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); } rgb = UCD_GRAPHBREAK(fc); if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break; - lgb = rgb; + + /* Not breaking between Regional Indicators is allowed only if + there are an even number of preceding RIs. */ + + if (lgb == ucp_gbRegionalIndicator && + rgb == ucp_gbRegionalIndicator) + { + int ricount = 0; + PCRE2_SPTR bptr = Feptr - 1; +#ifdef SUPPORT_UNICODE + if (utf) BACKCHAR(bptr); +#endif + /* bptr is pointing to the left-hand character */ + + while (bptr > mb->start_subject) + { + bptr--; +#ifdef SUPPORT_UNICODE + if (utf) + { + BACKCHAR(bptr); + GETCHAR(fc, bptr); + } + else +#endif + fc = *bptr; + if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break; + ricount++; + } + if ((ricount & 1) != 0) break; /* Grapheme break required */ + } + + /* If Extend follows E_Base[_GAZ] do not update lgb; this allows + any number of Extend before a following E_Modifier. */ + + if (rgb != ucp_gbExtend || + (lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ)) + lgb = rgb; + Feptr += len; } } diff --git a/src/pcre2_tables.c b/src/pcre2_tables.c index 3acb762..4398895 100644 --- a/src/pcre2_tables.c +++ b/src/pcre2_tables.c @@ -157,49 +157,62 @@ two code points. The breaking rules are as follows: LV or V may be followed by V or T LVT or T may be followed by T -4. Do not break before extending characters. +4. Do not break before extending characters or zero-width-joiner (ZWJ). -The next two rules are only for extended grapheme clusters (but that's what we +The following rules are only for extended grapheme clusters (but that's what we are implementing). 5. Do not break before SpacingMarks. 6. Do not break after Prepend characters. -7. Otherwise, break everywhere. +7. 
Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed + by E_Modifier). Extend characters are allowed before the modifier; this + cannot be represented in this table, the code has to deal with it. + +8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or + E_Base_GAZ). + +9. Do not break within emoji flag sequences. That is, do not break between + regional indicator (RI) symbols if there are an odd number of RI characters + before the break point. This table encodes "join RI characters"; the code + has to deal with checking for previous adjoining RIs. + +10. Otherwise, break everywhere. */ +#define ESZ (1<