Update grapheme breaking rules for Unicode 10.0.0.

Philip.Hazel 2017-07-05 08:55:49 +00:00
parent 41bb787fb3
commit 4f7a608d56
9 changed files with 514 additions and 98 deletions


@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.
48. Add the callout_no_where modifier to pcre2test.
49. Update extended grapheme breaking rules to the latest set defined in
Unicode Standard Annex #29.
Version 10.23 14-February-2017
------------------------------


@ -177,7 +177,7 @@ The pcre2_match() function contains a counter that is incremented every time it
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
this counter, which therefore limits the amount of computing resource used for
a match. The maximum depth of nested backtracking can also be limited; this
indirectly restricts the amount of heap memory that is used, but there is also
indirectly restricts the amount of heap memory that is used, but there is also
an explicit memory limit that can be set.
</P>
<P>
@ -198,7 +198,7 @@ limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
</P>
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
@ -233,7 +233,7 @@ string with one of the following sequences:
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
(*NUL) the NUL character (binary zero)
(*NUL) the NUL character (binary zero)
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
@ -249,7 +249,7 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the next
sequence, for Perl compatibility. However, this can be changed; see the next
section and the description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. \X always matches at
least one character. Then it decides whether to add additional characters
according to the following rules for ending a cluster:
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation".
</P>
<P>
\X always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks. Characters with
the "mark" property always have the "extend" grapheme breaking property.
4. Do not end before extending characters or spacing marks or the "zero-width
joiner" characters. Characters with the "mark" property always have the
"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Do not break within emoji modifier sequences (a base character followed by a
modifier). Extending characters are allowed before the modifier.
</P>
<P>
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
</P>
<P>
8. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
</P>
<P>
9. Otherwise, end the cluster.
<a name="extraprops"></a></P>
<br><b>
@ -1562,15 +1579,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
<pre>
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
n for PCRE2_NO_AUTO_CAPTURE
n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
@ -2249,14 +2266,14 @@ capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions that succeed, that is,
one of their branches matches, so matching continues after the assertion. If
all branches of a positive assertion fail to match, nothing is captured, and
all branches of a positive assertion fail to match, nothing is captured, and
control is passed to the previous backtracking point.
</P>
<P>
No capturing is done for a negative assertion unless it is being used as a
condition in a
<a href="#subpatternsassubroutines">conditional subpattern</a>
(see the discussion below). Matching continues after a non-conditional negative
(see the discussion below). Matching continues after a non-conditional negative
assertion only if all its branches fail to match.
</P>
<P>
@ -2824,14 +2841,14 @@ if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
</P>
<P>
Supporting backtracking into recursions simplifies certain types of recursive
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
<pre>
^((.)(?1)\2|.?)$
@ -2863,7 +2880,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
@ -3398,17 +3415,17 @@ processing; captured substrings are discarded.
</P>
<P>
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
</P>
<P>
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true.
</P>
<P>
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
standalone positive assertion. In a conditional positive assertion,
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
false. However, for both standalone and conditional negative assertions,
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@ -3455,7 +3472,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 July 2017
Last updated: 05 July 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>


@ -6433,27 +6433,41 @@ BACKSLASH
(see below). Unicode supports various kinds of composite character by
giving each character a grapheme breaking property, and having rules
that use these properties to define the boundaries of extended grapheme
clusters. \X always matches at least one character. Then it decides
whether to add additional characters according to the following rules
for ending a cluster:
clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
Text Segmentation".
\X always matches at least one character. Then it decides whether to
add additional characters according to the following rules for ending a
cluster:
1. End at the end of the subject string.
2. Do not end between CR and LF; otherwise end after any control char-
2. Do not end between CR and LF; otherwise end after any control char-
acter.
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
characters are of five types: L, V, T, LV, and LVT. An L character may
be followed by an L, V, LV, or LVT character; an LV or V character may
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
characters are of five types: L, V, T, LV, and LVT. An L character may
be followed by an L, V, LV, or LVT character; an LV or V character may
be followed by a V or T character; an LVT or T character may be followed
only by a T character.
4. Do not end before extending characters or spacing marks. Characters
with the "mark" property always have the "extend" grapheme breaking
property.
4. Do not end before extending characters or spacing marks or the
"zero-width joiner" characters. Characters with the "mark" property
always have the "extend" grapheme breaking property.
5. Do not end after prepend characters.
6. Do not break within emoji modifier sequences (a base character fol-
lowed by a modifier). Extending characters are allowed before the modi-
fier.
7. Do not break within emoji ZWJ sequences (zero-width joiner followed
by "glue after ZWJ" or "base glue after ZWJ").
8. Do not break within emoji flag sequences. That is, do not break
between regional indicator (RI) characters if there are an odd number
of RI characters before the break point.
9. Otherwise, end the cluster.
PCRE2's additional properties
@ -8744,7 +8758,7 @@ AUTHOR
REVISION
Last updated: 02 July 2017
Last updated: 05 July 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -145,7 +145,7 @@ The pcre2_match() function contains a counter that is incremented every time it
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
this counter, which therefore limits the amount of computing resource used for
a match. The maximum depth of nested backtracking can also be limited; this
indirectly restricts the amount of heap memory that is used, but there is also
indirectly restricts the amount of heap memory that is used, but there is also
an explicit memory limit that can be set.
.P
These facilities are provided to catch runaway matches that are provoked by
@ -164,7 +164,7 @@ for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
.P
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
.P
The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
@ -203,7 +203,7 @@ string with one of the following sequences:
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
(*NUL) the NUL character (binary zero)
(*NUL) the NUL character (binary zero)
.sp
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
@ -218,7 +218,7 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
what the \eR escape sequence matches. By default, this is any Unicode newline
sequence, for Perl compatibility. However, this can be changed; see the next
sequence, for Perl compatibility. However, this can be changed; see the next
section and the description of \eR in the section entitled
.\" HTML <a href="#newlineseq">
.\" </a>
@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
.\"
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. \eX always matches at
least one character. Then it decides whether to add additional characters
according to the following rules for ending a cluster:
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation".
.P
\eX always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
.P
1. End at the end of the subject string.
.P
@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
.P
4. Do not end before extending characters or spacing marks. Characters with
the "mark" property always have the "extend" grapheme breaking property.
4. Do not end before extending characters or spacing marks or the "zero-width
joiner" characters. Characters with the "mark" property always have the
"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
6. Do not break within emoji modifier sequences (a base character followed by a
modifier). Extending characters are allowed before the modifier.
.P
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
"glue after ZWJ" or "base glue after ZWJ").
.P
8. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
.P
9. Otherwise, end the cluster.
.
.
@ -1560,15 +1573,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
.sp
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
n for PCRE2_NO_AUTO_CAPTURE
n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
xx for PCRE2_EXTENDED_MORE
.sp
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
unset these options by preceding the letter with a hyphen. The two "extended"
options are not independent; unsetting either one cancels the effects of both
of them.
.P
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
@ -2256,16 +2269,16 @@ capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
capturing is carried out only for positive assertions that succeed, that is,
one of their branches matches, so matching continues after the assertion. If
all branches of a positive assertion fail to match, nothing is captured, and
all branches of a positive assertion fail to match, nothing is captured, and
control is passed to the previous backtracking point.
.P
No capturing is done for a negative assertion unless it is being used as a
condition in a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
conditional subpattern
conditional subpattern
.\"
(see the discussion below). Matching continues after a non-conditional negative
(see the discussion below). Matching continues after a non-conditional negative
assertion only if all its branches fail to match.
.P
For compatibility with Perl, most assertion subpatterns may be repeated; though
@ -2846,13 +2859,13 @@ once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
.P
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
.P
Supporting backtracking into recursions simplifies certain types of recursive
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
.sp
^((.)(?1)\e2|.?)$
@ -2883,7 +2896,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the back reference \e1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \e1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
.
.
@ -3427,15 +3440,15 @@ negative assertion, (*ACCEPT) causes the assertion to fail without any further
processing; captured substrings are discarded.
.P
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
.P
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true.
.P
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
standalone positive assertion. In a conditional positive assertion,
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
false. However, for both standalone and conditional negative assertions,
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
@ -3485,6 +3498,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 July 2017
Last updated: 05 July 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi


@ -1379,8 +1379,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
lgb = rgb;
nptr += dlen;
}
count++;
@ -1641,8 +1679,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
lgb = rgb;
nptr += dlen;
}
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
@ -1912,8 +1988,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
lgb = rgb;
nptr += dlen;
}
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@ -2102,8 +2216,46 @@ for (;;)
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
rgb = UCD_GRAPHBREAK(d);
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = nptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(d, bptr);
}
else
#endif
d = *bptr;
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
ncount++;
lgb = rgb;
nptr += dlen;
}
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
@ -2129,7 +2281,7 @@ for (;;)
case 0x2029:
#endif /* Not EBCDIC */
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
/* Fall through */
/* Fall through */
case CHAR_LF:
ADD_NEW(state_offset + 1, 0);
@ -3427,7 +3579,7 @@ for (;;)
while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
end_subject = t;
}
/* Anchored: check the first code unit if one is recorded. This may seem
pointless but it can help in detecting a no match case without scanning for
the required code unit. */
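
The regional indicator check added in the hunks above can be read in isolation. The following is a minimal sketch, not the PCRE2 code: it assumes the subject is already an array of code points (so no BACKCHAR/GETCHAR handling is needed), and the names is_ri() and ri_break_required() are hypothetical. It counts the RIs that precede the left-hand character and requires a break when that count is odd.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: Regional Indicator symbols
occupy U+1F1E6..U+1F1FF. */
static int is_ri(uint32_t c)
{
return c >= 0x1F1E6u && c <= 0x1F1FFu;
}

/* text[pos-1] and text[pos] are assumed to both be RIs, with pos >= 1.
Count the RIs that precede text[pos-1]; an odd count means the left-hand RI
already closes a pair, so a break is required before text[pos]. */
static int ri_break_required(const uint32_t *text, size_t pos)
{
size_t ricount = 0;
size_t i = pos - 1;            /* index of the left-hand character */
while (i > 0 && is_ri(text[i - 1]))
  {
  ricount++;
  i--;
  }
return (ricount & 1) != 0;
}
```

For a subject of four consecutive RIs, this sketch reports a required break only before the third one, which agrees with the two-pair result recorded for the last subject line in testoutput5.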


@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
lgb = rgb;
/* Not breaking between Regional Indicators is allowed only if there
are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
lgb = rgb;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
lgb = rgb;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
rgb = UCD_GRAPHBREAK(fc);
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
lgb = rgb;
/* Not breaking between Regional Indicators is allowed only if
there are an even number of preceding RIs. */
if (lgb == ucp_gbRegionalIndicator &&
rgb == ucp_gbRegionalIndicator)
{
int ricount = 0;
PCRE2_SPTR bptr = Feptr - 1;
#ifdef SUPPORT_UNICODE
if (utf) BACKCHAR(bptr);
#endif
/* bptr is pointing to the left-hand character */
while (bptr > mb->start_subject)
{
bptr--;
#ifdef SUPPORT_UNICODE
if (utf)
{
BACKCHAR(bptr);
GETCHAR(fc, bptr);
}
else
#endif
fc = *bptr;
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
ricount++;
}
if ((ricount & 1) != 0) break; /* Grapheme break required */
}
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
any number of Extend before a following E_Modifier. */
if (rgb != ucp_gbExtend ||
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
lgb = rgb;
Feptr += len;
}
}
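
The conditional update of lgb in the hunks above is what lets Extend characters sit between an emoji base and its modifier. A minimal sketch of that update rule follows, assuming a reduced, hypothetical enum (the GB_* names and next_left() are illustrative and are not the ucp_gb* values used by PCRE2).

```c
/* Hypothetical, reduced set of grapheme break properties. */
enum gb { GB_EXTEND, GB_E_BASE, GB_E_BASE_GAZ, GB_E_MODIFIER, GB_OTHER };

/* Return the property to remember as "left-hand" for the next step.
When Extend follows E_Base or E_Base_GAZ the base is kept, so that a later
E_Modifier is still tested against the base rather than against Extend. */
static enum gb next_left(enum gb lgb, enum gb rgb)
{
if (rgb != GB_EXTEND || (lgb != GB_E_BASE && lgb != GB_E_BASE_GAZ))
  return rgb;
return lgb;
}
```

With this rule, a sequence of E_Base, Extend, E_Modifier leaves the remembered property at E_Base when the modifier is examined, which is the case the "E_Base Extend E_Modifier" test line is meant to exercise.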


@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
LV or V may be followed by V or T
LVT or T may be followed by T
4. Do not break before extending characters.
4. Do not break before extending characters or zero-width-joiner (ZWJ).
The next two rules are only for extended grapheme clusters (but that's what we
The following rules are only for extended grapheme clusters (but that's what we
are implementing).
5. Do not break before SpacingMarks.
6. Do not break after Prepend characters.
7. Otherwise, break everywhere.
7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
by E_Modifier). Extend characters are allowed before the modifier; this
cannot be represented in this table, the code has to deal with it.
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
E_Base_GAZ).
9. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) symbols if there are an odd number of RI characters
before the break point. This table encodes "join RI characters"; the code
has to deal with checking for previous adjoining RIs.
10. Otherwise, break everywhere.
*/
#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
const uint32_t PRIV(ucp_gbtable)[] = {
(1<<ucp_gbLF), /* 0 CR */
0, /* 1 LF */
0, /* 2 Control */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 3 Extend */
(1<<ucp_gbExtend)|(1<<ucp_gbPrepend)| /* 4 Prepend */
(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
(1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
(1<<ucp_gbLVT)|(1<<ucp_gbOther),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 5 SpacingMark */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)| /* 6 L */
(1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 7 V */
(1<<ucp_gbT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 8 T */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 9 LV */
(1<<ucp_gbT),
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 10 LVT */
ESZ, /* 3 Extend */
ESZ|(1<<ucp_gbPrepend)| /* 4 Prepend */
(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
(1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
(1<<ucp_gbRegionalIndicator)|
(1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
(1<<ucp_gbE_Base_GAZ)|
(1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
ESZ, /* 5 SpacingMark */
ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)| /* 6 L */
(1<<ucp_gbLVT),
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 7 V */
ESZ|(1<<ucp_gbT), /* 8 T */
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 9 LV */
ESZ|(1<<ucp_gbT), /* 10 LVT */
(1<<ucp_gbRegionalIndicator), /* 11 RegionalIndicator */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 12 Other */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 13 E_Base */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 14 E_Modifier */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 15 E_Base_GAZ */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 16 ZWJ */
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark) /* 12 Glue_After_Zwj */
ESZ, /* 12 Other */
ESZ|(1<<ucp_gbE_Modifier), /* 13 E_Base */
ESZ, /* 14 E_Modifier */
ESZ|(1<<ucp_gbE_Modifier), /* 15 E_Base_GAZ */
ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ), /* 16 ZWJ */
ESZ /* 17 Glue_After_Zwj */
};
#undef ESZ
#ifdef SUPPORT_JIT
/* This table reverses PRIV(ucp_gentype). We can save the cost
of a memory load. */
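
The table above is consulted with a single bit test: the row is selected by the left-hand property and the bit by the right-hand property. The sketch below shows that lookup on a deliberately cut-down table; the GB_* enum, gbtable_sketch and may_join() are hypothetical names, and the rows do not reproduce the full Unicode 10.0.0 table.

```c
#include <stdint.h>

/* Hypothetical, reduced property set; the real table has 18 rows. */
enum { GB_CR, GB_LF, GB_CONTROL, GB_EXTEND, GB_OTHER, GB_COUNT };

/* Row = left-hand property; a set bit means "do not break before a
right-hand character with that property". */
static const uint32_t gbtable_sketch[GB_COUNT] = {
  (1u << GB_LF),       /* CR: only LF may follow without a break */
  0,                   /* LF: always break after */
  0,                   /* Control: always break after */
  (1u << GB_EXTEND),   /* Extend: may be followed by Extend */
  (1u << GB_EXTEND)    /* Other: may be followed by Extend */
};

/* Nonzero if the two characters stay in the same cluster. */
static int may_join(int lgb, int rgb)
{
return (gbtable_sketch[lgb] & (1u << rgb)) != 0;
}
```

Rules that depend on more than two adjacent characters, such as the RI parity rule and Extend before E_Modifier, cannot be expressed this way, which is why the comment in the table says the code has to deal with them.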

19
testdata/testinput5 vendored

@ -2041,4 +2041,23 @@
/^(?:(\X)(?C))+$/utf
\x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where
# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.
/^\X/utf
A\x{200d}B A ZWJ
\x{261D}\x{1F3FB}B E_Base E_Modifier
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
# Regional indicators
/^(\X)(\X)/utf,aftertext
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
# End of testinput5

34
testdata/testoutput5 vendored

@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
1: \x{11a00}\x{11a07}\x{11a47}
# These two are here because JIT is not yet updated. Also, the very first data
# line is handled differently by Perl.
/^\X/utf
A\x{200d}B A ZWJ
0: A\x{200d}
\x{261D}\x{1F3FB}B E_Base E_Modifier
0: \x{261d}\x{1f3fb}
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
0: \x{1f466}\x{1f3ff}
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
0: \x{200d}\x{1f3a4}
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
0: \x{200d}\x{1f469}
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
0: \x{1f1e6}\x{1f1e7}
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
** /n is not valid here
# Regional indicators
/^(\X)(\X)/utf,aftertext
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
0+ B
1: \x{1f1e6}\x{1f1e7}
2: \x{1f1e7}
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
0+ B
1: \x{1f1e6}\x{1f1e7}
2: \x{1f1e7}\x{1f1e6}
# End of testinput5