Update grapheme breaking rules for Unicode 10.0.0.
parent 41bb787fb3
commit 4f7a608d56
|
@@ -213,6 +213,9 @@ unit". Previously only non-anchored patterns did this.
|
||||||
|
|
||||||
48. Add the callout_no_where modifier to pcre2test.
|
48. Add the callout_no_where modifier to pcre2test.
|
||||||
|
|
||||||
|
49. Update extended grapheme breaking rules to the latest set that are in
|
||||||
|
Unicode Standard Annex #29.
|
||||||
|
|
||||||
|
|
||||||
Version 10.23 14-February-2017
|
Version 10.23 14-February-2017
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
|
@@ -177,7 +177,7 @@ The pcre2_match() function contains a counter that is incremented every time it
|
||||||
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
|
goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
|
||||||
this counter, which therefore limits the amount of computing resource used for
|
this counter, which therefore limits the amount of computing resource used for
|
||||||
a match. The maximum depth of nested backtracking can also be limited; this
|
a match. The maximum depth of nested backtracking can also be limited; this
|
||||||
indirectly restricts the amount of heap memory that is used, but there is also
|
indirectly restricts the amount of heap memory that is used, but there is also
|
||||||
an explicit memory limit that can be set.
|
an explicit memory limit that can be set.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -198,7 +198,7 @@ limits set by the programmer, but not raise them. If there is more than one
|
||||||
setting of one of these limits, the lower value is used.
|
setting of one of these limits, the lower value is used.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||||
still recognized for backwards compatibility.
|
still recognized for backwards compatibility.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -233,7 +233,7 @@ string with one of the following sequences:
|
||||||
(*CRLF) carriage return, followed by linefeed
|
(*CRLF) carriage return, followed by linefeed
|
||||||
(*ANYCRLF) any of the three above
|
(*ANYCRLF) any of the three above
|
||||||
(*ANY) all Unicode newline sequences
|
(*ANY) all Unicode newline sequences
|
||||||
(*NUL) the NUL character (binary zero)
|
(*NUL) the NUL character (binary zero)
|
||||||
</pre>
|
</pre>
|
||||||
These override the default and the options given to the compiling function. For
|
These override the default and the options given to the compiling function. For
|
||||||
example, on a Unix system where LF is the default newline sequence, the pattern
|
example, on a Unix system where LF is the default newline sequence, the pattern
|
||||||
|
@@ -249,7 +249,7 @@ The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
||||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
what the \R escape sequence matches. By default, this is any Unicode newline
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||||
section and the description of \R in the section entitled
|
section and the description of \R in the section entitled
|
||||||
<a href="#newlineseq">"Newline sequences"</a>
|
<a href="#newlineseq">"Newline sequences"</a>
|
||||||
below. A change of \R setting can be combined with a change of newline
|
below. A change of \R setting can be combined with a change of newline
|
||||||
|
@@ -1001,9 +1001,12 @@ grapheme cluster", and treats the sequence as an atomic group
|
||||||
<a href="#atomicgroup">(see below).</a>
|
<a href="#atomicgroup">(see below).</a>
|
||||||
Unicode supports various kinds of composite character by giving each character
|
Unicode supports various kinds of composite character by giving each character
|
||||||
a grapheme breaking property, and having rules that use these properties to
|
a grapheme breaking property, and having rules that use these properties to
|
||||||
define the boundaries of extended grapheme clusters. \X always matches at
|
define the boundaries of extended grapheme clusters. The rules are defined in
|
||||||
least one character. Then it decides whether to add additional characters
|
Unicode Standard Annex 29, "Unicode Text Segmentation".
|
||||||
according to the following rules for ending a cluster:
|
</P>
|
||||||
|
<P>
|
||||||
|
\X always matches at least one character. Then it decides whether to add
|
||||||
|
additional characters according to the following rules for ending a cluster:
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
1. End at the end of the subject string.
|
1. End at the end of the subject string.
|
||||||
|
@@ -1018,13 +1021,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
||||||
character; an LVT or T character may be followed only by a T character.
|
character; an LVT or T character may be followed only by a T character.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
4. Do not end before extending characters or spacing marks. Characters with
|
4. Do not end before extending characters or spacing marks or the "zero-width
|
||||||
the "mark" property always have the "extend" grapheme breaking property.
|
joiner" characters. Characters with the "mark" property always have the
|
||||||
|
"extend" grapheme breaking property.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
5. Do not end after prepend characters.
|
5. Do not end after prepend characters.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
6. Do not break within emoji modifier sequences (a base character followed by a
|
||||||
|
modifier). Extending characters are allowed before the modifier.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
|
||||||
|
"glue after ZWJ" or "base glue after ZWJ").
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
8. Do not break within emoji flag sequences. That is, do not break between
|
||||||
|
regional indicator (RI) characters if there are an odd number of RI characters
|
||||||
|
before the break point.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
6. Otherwise, end the cluster.
|
6. Otherwise, end the cluster.
|
||||||
<a name="extraprops"></a></P>
|
<a name="extraprops"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
|
@@ -1562,15 +1579,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
|
||||||
<pre>
|
<pre>
|
||||||
i for PCRE2_CASELESS
|
i for PCRE2_CASELESS
|
||||||
m for PCRE2_MULTILINE
|
m for PCRE2_MULTILINE
|
||||||
n for PCRE2_NO_AUTO_CAPTURE
|
n for PCRE2_NO_AUTO_CAPTURE
|
||||||
s for PCRE2_DOTALL
|
s for PCRE2_DOTALL
|
||||||
x for PCRE2_EXTENDED
|
x for PCRE2_EXTENDED
|
||||||
xx for PCRE2_EXTENDED_MORE
|
xx for PCRE2_EXTENDED_MORE
|
||||||
</pre>
|
</pre>
|
||||||
For example, (?im) sets caseless, multiline matching. It is also possible to
|
For example, (?im) sets caseless, multiline matching. It is also possible to
|
||||||
unset these options by preceding the letter with a hyphen. The two "extended"
|
unset these options by preceding the letter with a hyphen. The two "extended"
|
||||||
options are not independent; unsetting either one cancels the effects of both
|
options are not independent; unsetting either one cancels the effects of both
|
||||||
of them.
|
of them.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
|
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
|
||||||
|
@@ -2249,14 +2266,14 @@ capturing subpatterns within it, these are counted for the purposes of
|
||||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||||
capturing is carried out only for positive assertions that succeed, that is,
|
capturing is carried out only for positive assertions that succeed, that is,
|
||||||
one of their branches matches, so matching continues after the assertion. If
|
one of their branches matches, so matching continues after the assertion. If
|
||||||
all branches of a positive assertion fail to match, nothing is captured, and
|
all branches of a positive assertion fail to match, nothing is captured, and
|
||||||
control is passed to the previous backtracking point.
|
control is passed to the previous backtracking point.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
No capturing is done for a negative assertion unless it is being used as a
|
No capturing is done for a negative assertion unless it is being used as a
|
||||||
condition in a
|
condition in a
|
||||||
<a href="#subpatternsassubroutines">conditional subpattern</a>
|
<a href="#subpatternsassubroutines">conditional subpattern</a>
|
||||||
(see the discussion below). Matching continues after a non-conditional negative
|
(see the discussion below). Matching continues after a non-conditional negative
|
||||||
assertion only if all its branches fail to match.
|
assertion only if all its branches fail to match.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
|
@@ -2824,14 +2841,14 @@ if it contained untried alternatives and there was a subsequent matching
|
||||||
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Starting with release 10.30, recursive subroutine calls are no longer treated
|
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||||
as atomic. That is, they can be re-entered to try unused alternatives if there
|
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||||
is a matching failure later in the pattern. This is now compatible with the way
|
is a matching failure later in the pattern. This is now compatible with the way
|
||||||
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||||
enclose it in an atomic group.
|
enclose it in an atomic group.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Supporting backtracking into recursions simplifies certain types of recursive
|
Supporting backtracking into recursions simplifies certain types of recursive
|
||||||
pattern. For example, this pattern matches palindromic strings:
|
pattern. For example, this pattern matches palindromic strings:
|
||||||
<pre>
|
<pre>
|
||||||
^((.)(?1)\2|.?)$
|
^((.)(?1)\2|.?)$
|
||||||
|
@@ -2863,7 +2880,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
|
||||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||||
the second group, when the back reference \1 fails to match "b", the second
|
the second group, when the back reference \1 fails to match "b", the second
|
||||||
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
later versions (I tried 5.024) it now works.
|
later versions (I tried 5.024) it now works.
|
||||||
<a name="subpatternsassubroutines"></a></P>
|
<a name="subpatternsassubroutines"></a></P>
|
||||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||||
|
@@ -3398,17 +3415,17 @@ processing; captured substrings are discarded.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
|
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
|
||||||
a positive assertion and false for a negative one; captured substrings are
|
a positive assertion and false for a negative one; captured substrings are
|
||||||
retained in both cases.
|
retained in both cases.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
|
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
|
||||||
are no more branches to try, (*THEN) causes a positive assertion to be false,
|
are no more branches to try, (*THEN) causes a positive assertion to be false,
|
||||||
and a negative assertion to be true.
|
and a negative assertion to be true.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The other backtracking verbs are not treated specially if they appear in a
|
The other backtracking verbs are not treated specially if they appear in a
|
||||||
standalone positive assertion. In a conditional positive assertion,
|
standalone positive assertion. In a conditional positive assertion,
|
||||||
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
|
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
|
||||||
false. However, for both standalone and conditional negative assertions,
|
false. However, for both standalone and conditional negative assertions,
|
||||||
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
|
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
|
||||||
|
@@ -3455,7 +3472,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 02 July 2017
|
Last updated: 05 July 2017
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2017 University of Cambridge.
|
Copyright © 1997-2017 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
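Rule 8 above turns on the parity of the regional indicator (RI) run that precedes a candidate break point. The following is a minimal C sketch of that check, not PCRE2's implementation: it works on an array of already-decoded code points, and the names `is_regional_indicator` and `ri_break_allowed` are hypothetical. It assumes the code points on both sides of `pos` are RIs and that `pos` is at least 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF. */
static bool is_regional_indicator(uint32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Decide whether a cluster may end between cps[pos - 1] and cps[pos].
   Assumption: both of those code points are regional indicators and
   pos >= 1. The function counts the unbroken run of RIs that ends just
   before the break point; an odd count means the RI on the left still
   needs its partner, so no break is allowed there. */
static bool ri_break_allowed(const uint32_t *cps, size_t pos)
{
    size_t count = 0;
    size_t i = pos;

    while (i > 0 && is_regional_indicator(cps[i - 1])) {
        count++;
        i--;
    }
    return (count % 2) == 0;
}
```

For four consecutive RIs this permits a break only in the middle, so the run is grouped as two flag pairs.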
@@ -6433,27 +6433,41 @@ BACKSLASH
|
||||||
(see below). Unicode supports various kinds of composite character by
|
(see below). Unicode supports various kinds of composite character by
|
||||||
giving each character a grapheme breaking property, and having rules
|
giving each character a grapheme breaking property, and having rules
|
||||||
that use these properties to define the boundaries of extended grapheme
|
that use these properties to define the boundaries of extended grapheme
|
||||||
clusters. \X always matches at least one character. Then it decides
|
clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
|
||||||
whether to add additional characters according to the following rules
|
Text Segmentation".
|
||||||
for ending a cluster:
|
|
||||||
|
\X always matches at least one character. Then it decides whether to
|
||||||
|
add additional characters according to the following rules for ending a
|
||||||
|
cluster:
|
||||||
|
|
||||||
1. End at the end of the subject string.
|
1. End at the end of the subject string.
|
||||||
|
|
||||||
2. Do not end between CR and LF; otherwise end after any control char-
|
2. Do not end between CR and LF; otherwise end after any control char-
|
||||||
acter.
|
acter.
|
||||||
|
|
||||||
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
|
3. Do not break Hangul (a Korean script) syllable sequences. Hangul
|
||||||
characters are of five types: L, V, T, LV, and LVT. An L character may
|
characters are of five types: L, V, T, LV, and LVT. An L character may
|
||||||
be followed by an L, V, LV, or LVT character; an LV or V character may
|
be followed by an L, V, LV, or LVT character; an LV or V character may
|
||||||
be followed by a V or T character; an LVT or T character may be followed
|
be followed by a V or T character; an LVT or T character may be followed
|
||||||
only by a T character.
|
only by a T character.
|
||||||
|
|
||||||
4. Do not end before extending characters or spacing marks. Characters
|
4. Do not end before extending characters or spacing marks or the
|
||||||
with the "mark" property always have the "extend" grapheme breaking
|
"zero-width joiner" characters. Characters with the "mark" property
|
||||||
property.
|
always have the "extend" grapheme breaking property.
|
||||||
|
|
||||||
5. Do not end after prepend characters.
|
5. Do not end after prepend characters.
|
||||||
|
|
||||||
|
6. Do not break within emoji modifier sequences (a base character fol-
|
||||||
|
lowed by a modifier). Extending characters are allowed before the modi-
|
||||||
|
fier.
|
||||||
|
|
||||||
|
7. Do not break within emoji ZWJ sequences (zero-width joiner followed
|
||||||
|
by "glue after ZWJ" or "base glue after ZWJ").
|
||||||
|
|
||||||
|
8. Do not break within emoji flag sequences. That is, do not break
|
||||||
|
between regional indicator (RI) characters if there are an odd number
|
||||||
|
of RI characters before the break point.
|
||||||
|
|
||||||
6. Otherwise, end the cluster.
|
6. Otherwise, end the cluster.
|
||||||
|
|
||||||
PCRE2's additional properties
|
PCRE2's additional properties
|
||||||
|
@@ -8744,7 +8758,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 02 July 2017
|
Last updated: 05 July 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
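Rule 3 above lists which Hangul syllable types may follow one another without ending a cluster. The table of allowed pairs can be written down directly; the sketch below does so in C. The enum and function names are hypothetical and chosen for illustration only, and PCRE2 itself may encode the same information differently.

```c
#include <stdbool.h>

/* Hypothetical names for the five Hangul syllable types in rule 3. */
typedef enum {
    HANGUL_L,
    HANGUL_V,
    HANGUL_T,
    HANGUL_LV,
    HANGUL_LVT
} hangul_type;

/* Returns true when `right` may directly follow `left` inside one
   extended grapheme cluster, according to rule 3. */
static bool hangul_may_follow(hangul_type left, hangul_type right)
{
    switch (left) {
    case HANGUL_L:
        /* L may be followed by L, V, LV, or LVT. */
        return right == HANGUL_L || right == HANGUL_V ||
               right == HANGUL_LV || right == HANGUL_LVT;
    case HANGUL_LV:
    case HANGUL_V:
        /* LV or V may be followed by V or T. */
        return right == HANGUL_V || right == HANGUL_T;
    case HANGUL_LVT:
    case HANGUL_T:
        /* LVT or T may be followed only by T. */
        return right == HANGUL_T;
    }
    return false;
}
```

A caller would end the cluster whenever this returns false for two adjacent Hangul characters, subject to the other rules in the list.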
@@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "02 July 2017" "PCRE2 10.30"
|
.TH PCRE2PATTERN 3 "05 July 2017" "PCRE2 10.30"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@@ -145,7 +145,7 @@ The pcre2_match() function contains a counter that is incremented every time it
|
||||||
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
|
goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
|
||||||
this counter, which therefore limits the amount of computing resource used for
|
this counter, which therefore limits the amount of computing resource used for
|
||||||
a match. The maximum depth of nested backtracking can also be limited; this
|
a match. The maximum depth of nested backtracking can also be limited; this
|
||||||
indirectly restricts the amount of heap memory that is used, but there is also
|
indirectly restricts the amount of heap memory that is used, but there is also
|
||||||
an explicit memory limit that can be set.
|
an explicit memory limit that can be set.
|
||||||
.P
|
.P
|
||||||
These facilities are provided to catch runaway matches that are provoked by
|
These facilities are provided to catch runaway matches that are provoked by
|
||||||
|
@@ -164,7 +164,7 @@ for it to have any effect. In other words, the pattern writer can lower the
|
||||||
limits set by the programmer, but not raise them. If there is more than one
|
limits set by the programmer, but not raise them. If there is more than one
|
||||||
setting of one of these limits, the lower value is used.
|
setting of one of these limits, the lower value is used.
|
||||||
.P
|
.P
|
||||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||||
still recognized for backwards compatibility.
|
still recognized for backwards compatibility.
|
||||||
.P
|
.P
|
||||||
The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
|
The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
|
||||||
|
@@ -203,7 +203,7 @@ string with one of the following sequences:
|
||||||
(*CRLF) carriage return, followed by linefeed
|
(*CRLF) carriage return, followed by linefeed
|
||||||
(*ANYCRLF) any of the three above
|
(*ANYCRLF) any of the three above
|
||||||
(*ANY) all Unicode newline sequences
|
(*ANY) all Unicode newline sequences
|
||||||
(*NUL) the NUL character (binary zero)
|
(*NUL) the NUL character (binary zero)
|
||||||
.sp
|
.sp
|
||||||
These override the default and the options given to the compiling function. For
|
These override the default and the options given to the compiling function. For
|
||||||
example, on a Unix system where LF is the default newline sequence, the pattern
|
example, on a Unix system where LF is the default newline sequence, the pattern
|
||||||
|
@@ -218,7 +218,7 @@ The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
||||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
what the \eR escape sequence matches. By default, this is any Unicode newline
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||||
section and the description of \eR in the section entitled
|
section and the description of \eR in the section entitled
|
||||||
.\" HTML <a href="#newlineseq">
|
.\" HTML <a href="#newlineseq">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
|
@@ -998,9 +998,11 @@ grapheme cluster", and treats the sequence as an atomic group
|
||||||
.\"
|
.\"
|
||||||
Unicode supports various kinds of composite character by giving each character
|
Unicode supports various kinds of composite character by giving each character
|
||||||
a grapheme breaking property, and having rules that use these properties to
|
a grapheme breaking property, and having rules that use these properties to
|
||||||
define the boundaries of extended grapheme clusters. \eX always matches at
|
define the boundaries of extended grapheme clusters. The rules are defined in
|
||||||
least one character. Then it decides whether to add additional characters
|
Unicode Standard Annex 29, "Unicode Text Segmentation".
|
||||||
according to the following rules for ending a cluster:
|
.P
|
||||||
|
\eX always matches at least one character. Then it decides whether to add
|
||||||
|
additional characters according to the following rules for ending a cluster:
|
||||||
.P
|
.P
|
||||||
1. End at the end of the subject string.
|
1. End at the end of the subject string.
|
||||||
.P
|
.P
|
||||||
|
@@ -1011,11 +1013,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
|
||||||
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
||||||
character; an LVT or T character may be followed only by a T character.
|
character; an LVT or T character may be followed only by a T character.
|
||||||
.P
|
.P
|
||||||
4. Do not end before extending characters or spacing marks. Characters with
|
4. Do not end before extending characters or spacing marks or the "zero-width
|
||||||
the "mark" property always have the "extend" grapheme breaking property.
|
joiner" characters. Characters with the "mark" property always have the
|
||||||
|
"extend" grapheme breaking property.
|
||||||
.P
|
.P
|
||||||
5. Do not end after prepend characters.
|
5. Do not end after prepend characters.
|
||||||
.P
|
.P
|
||||||
|
6. Do not break within emoji modifier sequences (a base character followed by a
|
||||||
|
modifier). Extending characters are allowed before the modifier.
|
||||||
|
.P
|
||||||
|
7. Do not break within emoji ZWJ sequences (zero-width joiner followed by
|
||||||
|
"glue after ZWJ" or "base glue after ZWJ").
|
||||||
|
.P
|
||||||
|
8. Do not break within emoji flag sequences. That is, do not break between
|
||||||
|
regional indicator (RI) characters if there are an odd number of RI characters
|
||||||
|
before the break point.
|
||||||
|
.P
|
||||||
6. Otherwise, end the cluster.
|
6. Otherwise, end the cluster.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@@ -1560,15 +1573,15 @@ Perl option letters enclosed between "(?" and ")". The option letters are
|
||||||
.sp
|
.sp
|
||||||
i for PCRE2_CASELESS
|
i for PCRE2_CASELESS
|
||||||
m for PCRE2_MULTILINE
|
m for PCRE2_MULTILINE
|
||||||
n for PCRE2_NO_AUTO_CAPTURE
|
n for PCRE2_NO_AUTO_CAPTURE
|
||||||
s for PCRE2_DOTALL
|
s for PCRE2_DOTALL
|
||||||
x for PCRE2_EXTENDED
|
x for PCRE2_EXTENDED
|
||||||
xx for PCRE2_EXTENDED_MORE
|
xx for PCRE2_EXTENDED_MORE
|
||||||
.sp
|
.sp
|
||||||
For example, (?im) sets caseless, multiline matching. It is also possible to
|
For example, (?im) sets caseless, multiline matching. It is also possible to
|
||||||
unset these options by preceding the letter with a hyphen. The two "extended"
|
unset these options by preceding the letter with a hyphen. The two "extended"
|
||||||
options are not independent; unsetting either one cancels the effects of both
|
options are not independent; unsetting either one cancels the effects of both
|
||||||
of them.
|
of them.
|
||||||
.P
|
.P
|
||||||
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
|
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
|
||||||
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
||||||
|
@@ -2256,16 +2269,16 @@ capturing subpatterns within it, these are counted for the purposes of
|
||||||
numbering the capturing subpatterns in the whole pattern. However, substring
|
numbering the capturing subpatterns in the whole pattern. However, substring
|
||||||
capturing is carried out only for positive assertions that succeed, that is,
|
capturing is carried out only for positive assertions that succeed, that is,
|
||||||
one of their branches matches, so matching continues after the assertion. If
|
one of their branches matches, so matching continues after the assertion. If
|
||||||
all branches of a positive assertion fail to match, nothing is captured, and
|
all branches of a positive assertion fail to match, nothing is captured, and
|
||||||
control is passed to the previous backtracking point.
|
control is passed to the previous backtracking point.
|
||||||
.P
|
.P
|
||||||
No capturing is done for a negative assertion unless it is being used as a
|
No capturing is done for a negative assertion unless it is being used as a
|
||||||
condition in a
|
condition in a
|
||||||
.\" HTML <a href="#subpatternsassubroutines">
|
.\" HTML <a href="#subpatternsassubroutines">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
conditional subpattern
|
conditional subpattern
|
||||||
.\"
|
.\"
|
||||||
(see the discussion below). Matching continues after a non-conditional negative
|
(see the discussion below). Matching continues after a non-conditional negative
|
||||||
assertion only if all its branches fail to match.
|
assertion only if all its branches fail to match.
|
||||||
.P
|
.P
|
||||||
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
For compatibility with Perl, most assertion subpatterns may be repeated; though
|
||||||
|
@@ -2846,13 +2859,13 @@ once it had matched some of the subject string, it was never re-entered, even
|
||||||
if it contained untried alternatives and there was a subsequent matching
|
if it contained untried alternatives and there was a subsequent matching
|
||||||
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||||
.P
|
.P
|
||||||
Starting with release 10.30, recursive subroutine calls are no longer treated
|
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||||
as atomic. That is, they can be re-entered to try unused alternatives if there
|
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||||
is a matching failure later in the pattern. This is now compatible with the way
|
is a matching failure later in the pattern. This is now compatible with the way
|
||||||
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||||
enclose it in an atomic group.
|
enclose it in an atomic group.
|
||||||
.P
|
.P
|
||||||
Supporting backtracking into recursions simplifies certain types of recursive
|
Supporting backtracking into recursions simplifies certain types of recursive
|
||||||
pattern. For example, this pattern matches palindromic strings:
|
pattern. For example, this pattern matches palindromic strings:
|
||||||
.sp
|
.sp
|
||||||
^((.)(?1)\e2|.?)$
|
^((.)(?1)\e2|.?)$
|
||||||
|
@@ -2883,7 +2896,7 @@ in PCRE2 these values can be referenced. Consider this pattern:
|
||||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||||
the second group, when the back reference \e1 fails to match "b", the second
|
the second group, when the back reference \e1 fails to match "b", the second
|
||||||
alternative matches "a" and then recurses. In the recursion, \e1 does now match
|
alternative matches "a" and then recurses. In the recursion, \e1 does now match
|
||||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
later versions (I tried 5.024) it now works.
|
later versions (I tried 5.024) it now works.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@@ -3427,15 +3440,15 @@ negative assertion, (*ACCEPT) causes the assertion to fail without any further
|
||||||
processing; captured substrings are discarded.
|
processing; captured substrings are discarded.
|
||||||
.P
|
.P
|
||||||
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
|
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
|
||||||
a positive assertion and false for a negative one; captured substrings are
|
a positive assertion and false for a negative one; captured substrings are
|
||||||
retained in both cases.
|
retained in both cases.
|
||||||
.P
|
.P
|
||||||
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
|
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
|
||||||
are no more branches to try, (*THEN) causes a positive assertion to be false,
|
are no more branches to try, (*THEN) causes a positive assertion to be false,
|
||||||
and a negative assertion to be true.
|
and a negative assertion to be true.
|
||||||
.P
|
.P
|
||||||
The other backtracking verbs are not treated specially if they appear in a
|
The other backtracking verbs are not treated specially if they appear in a
|
||||||
standalone positive assertion. In a conditional positive assertion,
|
standalone positive assertion. In a conditional positive assertion,
|
||||||
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
|
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
|
||||||
false. However, for both standalone and conditional negative assertions,
|
false. However, for both standalone and conditional negative assertions,
|
||||||
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
|
backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
|
||||||
|
@ -3485,6 +3498,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 July 2017
|
Last updated: 05 July 2017
|
||||||
Copyright (c) 1997-2017 University of Cambridge.
|
Copyright (c) 1997-2017 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1379,8 +1379,46 @@ for (;;)
|
||||||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||||
rgb = UCD_GRAPHBREAK(d);
|
rgb = UCD_GRAPHBREAK(d);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||||
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = nptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(d, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
d = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
ncount++;
|
ncount++;
|
||||||
lgb = rgb;
|
|
||||||
nptr += dlen;
|
nptr += dlen;
|
||||||
}
|
}
|
||||||
count++;
|
count++;
|
||||||
|
@ -1641,8 +1679,46 @@ for (;;)
|
||||||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||||
rgb = UCD_GRAPHBREAK(d);
|
rgb = UCD_GRAPHBREAK(d);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||||
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = nptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(d, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
d = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
ncount++;
|
ncount++;
|
||||||
lgb = rgb;
|
|
||||||
nptr += dlen;
|
nptr += dlen;
|
||||||
}
|
}
|
||||||
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
|
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
|
||||||
|
@ -1912,8 +1988,46 @@ for (;;)
|
||||||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||||
rgb = UCD_GRAPHBREAK(d);
|
rgb = UCD_GRAPHBREAK(d);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||||
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = nptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(d, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
d = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
ncount++;
|
ncount++;
|
||||||
lgb = rgb;
|
|
||||||
nptr += dlen;
|
nptr += dlen;
|
||||||
}
|
}
|
||||||
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
|
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
|
||||||
|
@ -2102,8 +2216,46 @@ for (;;)
|
||||||
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
|
||||||
rgb = UCD_GRAPHBREAK(d);
|
rgb = UCD_GRAPHBREAK(d);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1u << rgb)) == 0) break;
|
||||||
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = nptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(d, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
d = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(d) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
ncount++;
|
ncount++;
|
||||||
lgb = rgb;
|
|
||||||
nptr += dlen;
|
nptr += dlen;
|
||||||
}
|
}
|
||||||
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
|
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
|
||||||
|
@ -2129,7 +2281,7 @@ for (;;)
|
||||||
case 0x2029:
|
case 0x2029:
|
||||||
#endif /* Not EBCDIC */
|
#endif /* Not EBCDIC */
|
||||||
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
|
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
|
||||||
/* Fall through */
|
/* Fall through */
|
||||||
|
|
||||||
case CHAR_LF:
|
case CHAR_LF:
|
||||||
ADD_NEW(state_offset + 1, 0);
|
ADD_NEW(state_offset + 1, 0);
|
||||||
|
@ -3427,7 +3579,7 @@ for (;;)
|
||||||
while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
|
while (t < mb->end_subject && !IS_NEWLINE(t)) t++;
|
||||||
end_subject = t;
|
end_subject = t;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Anchored: check the first code unit if one is recorded. This may seem
|
/* Anchored: check the first code unit if one is recorded. This may seem
|
||||||
pointless but it can help in detecting a no match case without scanning for
|
pointless but it can help in detecting a no match case without scanning for
|
||||||
the required code unit. */
|
the required code unit. */
|
||||||
|
|
|
@ -2449,7 +2449,44 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||||
rgb = UCD_GRAPHBREAK(fc);
|
rgb = UCD_GRAPHBREAK(fc);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||||
lgb = rgb;
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if there
|
||||||
|
are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator && rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = Feptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(fc, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
fc = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
Feptr += len;
|
Feptr += len;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -2757,7 +2794,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||||
rgb = UCD_GRAPHBREAK(fc);
|
rgb = UCD_GRAPHBREAK(fc);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||||
lgb = rgb;
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = Feptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(fc, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
fc = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
Feptr += len;
|
Feptr += len;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -3527,7 +3602,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||||
rgb = UCD_GRAPHBREAK(fc);
|
rgb = UCD_GRAPHBREAK(fc);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||||
lgb = rgb;
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = Feptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(fc, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
fc = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
Feptr += len;
|
Feptr += len;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -4063,7 +4176,45 @@ fprintf(stderr, "++ op=%d\n", *Fecode);
|
||||||
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
if (!utf) fc = *Feptr; else { GETCHARLEN(fc, Feptr, len); }
|
||||||
rgb = UCD_GRAPHBREAK(fc);
|
rgb = UCD_GRAPHBREAK(fc);
|
||||||
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
|
||||||
lgb = rgb;
|
|
||||||
|
/* Not breaking between Regional Indicators is allowed only if
|
||||||
|
there are an even number of preceding RIs. */
|
||||||
|
|
||||||
|
if (lgb == ucp_gbRegionalIndicator &&
|
||||||
|
rgb == ucp_gbRegionalIndicator)
|
||||||
|
{
|
||||||
|
int ricount = 0;
|
||||||
|
PCRE2_SPTR bptr = Feptr - 1;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf) BACKCHAR(bptr);
|
||||||
|
#endif
|
||||||
|
/* bptr is pointing to the left-hand character */
|
||||||
|
|
||||||
|
while (bptr > mb->start_subject)
|
||||||
|
{
|
||||||
|
bptr--;
|
||||||
|
#ifdef SUPPORT_UNICODE
|
||||||
|
if (utf)
|
||||||
|
{
|
||||||
|
BACKCHAR(bptr);
|
||||||
|
GETCHAR(fc, bptr);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
#endif
|
||||||
|
fc = *bptr;
|
||||||
|
if (UCD_GRAPHBREAK(fc) != ucp_gbRegionalIndicator) break;
|
||||||
|
ricount++;
|
||||||
|
}
|
||||||
|
if ((ricount & 1) != 0) break; /* Grapheme break required */
|
||||||
|
}
|
||||||
|
|
||||||
|
/* If Extend follows E_Base[_GAZ] do not update lgb; this allows
|
||||||
|
any number of Extend before a following E_Modifier. */
|
||||||
|
|
||||||
|
if (rgb != ucp_gbExtend ||
|
||||||
|
(lgb != ucp_gbE_Base && lgb != ucp_gbE_Base_GAZ))
|
||||||
|
lgb = rgb;
|
||||||
|
|
||||||
Feptr += len;
|
Feptr += len;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
@ -157,49 +157,62 @@ two code points. The breaking rules are as follows:
|
||||||
LV or V may be followed by V or T
|
LV or V may be followed by V or T
|
||||||
LVT or T may be followed by T
|
LVT or T may be followed by T
|
||||||
|
|
||||||
4. Do not break before extending characters.
|
4. Do not break before extending characters or zero-width-joiner (ZWJ).
|
||||||
|
|
||||||
The next two rules are only for extended grapheme clusters (but that's what we
|
The following rules are only for extended grapheme clusters (but that's what we
|
||||||
are implementing).
|
are implementing).
|
||||||
|
|
||||||
5. Do not break before SpacingMarks.
|
5. Do not break before SpacingMarks.
|
||||||
|
|
||||||
6. Do not break after Prepend characters.
|
6. Do not break after Prepend characters.
|
||||||
|
|
||||||
7. Otherwise, break everywhere.
|
7. Do not break within emoji modifier sequences (E_Base or E_Base_GAZ followed
|
||||||
|
by E_Modifier). Extend characters are allowed before the modifier; this
|
||||||
|
cannot be represented in this table, the code has to deal with it.
|
||||||
|
|
||||||
|
8. Do not break within emoji zwj sequences (ZWJ followed by Glue_After_Zwj or
|
||||||
|
E_Base_GAZ).
|
||||||
|
|
||||||
|
9. Do not break within emoji flag sequences. That is, do not break between
|
||||||
|
regional indicator (RI) symbols if there are an odd number of RI characters
|
||||||
|
before the break point. This table encodes "join RI characters"; the code
|
||||||
|
has to deal with checking for previous adjoining RIs.
|
||||||
|
|
||||||
|
10. Otherwise, break everywhere.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
|
#define ESZ (1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbZWJ)
|
||||||
|
|
||||||
const uint32_t PRIV(ucp_gbtable)[] = {
|
const uint32_t PRIV(ucp_gbtable)[] = {
|
||||||
(1<<ucp_gbLF), /* 0 CR */
|
(1<<ucp_gbLF), /* 0 CR */
|
||||||
0, /* 1 LF */
|
0, /* 1 LF */
|
||||||
0, /* 2 Control */
|
0, /* 2 Control */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 3 Extend */
|
ESZ, /* 3 Extend */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbPrepend)| /* 4 Prepend */
|
ESZ|(1<<ucp_gbPrepend)| /* 4 Prepend */
|
||||||
(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)|
|
(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbT)|
|
||||||
(1<<ucp_gbV)|(1<<ucp_gbT)|(1<<ucp_gbLV)|
|
(1<<ucp_gbLV)|(1<<ucp_gbLVT)|(1<<ucp_gbOther)|
|
||||||
(1<<ucp_gbLVT)|(1<<ucp_gbOther),
|
(1<<ucp_gbRegionalIndicator)|
|
||||||
|
(1<<ucp_gbE_Base)|(1<<ucp_gbE_Modifier)|
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 5 SpacingMark */
|
(1<<ucp_gbE_Base_GAZ)|
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbL)| /* 6 L */
|
(1<<ucp_gbZWJ)|(1<<ucp_gbGlue_After_Zwj),
|
||||||
(1<<ucp_gbV)|(1<<ucp_gbLV)|(1<<ucp_gbLVT),
|
ESZ, /* 5 SpacingMark */
|
||||||
|
ESZ|(1<<ucp_gbL)|(1<<ucp_gbV)|(1<<ucp_gbLV)| /* 6 L */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 7 V */
|
(1<<ucp_gbLVT),
|
||||||
(1<<ucp_gbT),
|
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 7 V */
|
||||||
|
ESZ|(1<<ucp_gbT), /* 8 T */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 8 T */
|
ESZ|(1<<ucp_gbV)|(1<<ucp_gbT), /* 9 LV */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbV)| /* 9 LV */
|
ESZ|(1<<ucp_gbT), /* 10 LVT */
|
||||||
(1<<ucp_gbT),
|
|
||||||
|
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark)|(1<<ucp_gbT), /* 10 LVT */
|
|
||||||
(1<<ucp_gbRegionalIndicator), /* 11 RegionalIndicator */
|
(1<<ucp_gbRegionalIndicator), /* 11 RegionalIndicator */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 12 Other */
|
ESZ, /* 12 Other */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 13 E_Base */
|
ESZ|(1<<ucp_gbE_Modifier), /* 13 E_Base */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 14 E_Modifier */
|
ESZ, /* 14 E_Modifier */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 15 E_Base_GAZ */
|
ESZ|(1<<ucp_gbE_Modifier), /* 15 E_Base_GAZ */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark), /* 16 ZWJ */
|
ESZ|(1<<ucp_gbGlue_After_Zwj)|(1<<ucp_gbE_Base_GAZ), /* 16 ZWJ */
|
||||||
(1<<ucp_gbExtend)|(1<<ucp_gbSpacingMark) /* 17 Glue_After_Zwj */
|
ESZ /* 17 Glue_After_Zwj */
|
||||||
};
|
};
|
||||||
|
|
||||||
|
#undef ESZ
|
||||||
|
|
||||||
#ifdef SUPPORT_JIT
|
#ifdef SUPPORT_JIT
|
||||||
/* This table reverses PRIV(ucp_gentype). We can save the cost
|
/* This table reverses PRIV(ucp_gentype). We can save the cost
|
||||||
of a memory load. */
|
of a memory load. */
|
||||||
|
|
|
@ -2041,4 +2041,23 @@
|
||||||
/^(?:(\X)(?C))+$/utf
|
/^(?:(\X)(?C))+$/utf
|
||||||
\x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where
|
\x{1E900}\x{1E924}\x{1E953}\x{11C00}\x{11C2D}\x{11C3E}\x{11C70}\x{11C77}\x{11CAB}\x{11400}\x{1142F}\x{11455}\x{104B0}\x{104D8}\x{104FB}\x{16FE0}\x{18800}\x{18AF2}\x{11D00}\x{11D3A}\x{11D59}\x{16FE1}\x{1B170}\x{1B2FB}\x{11A50}\x{11A58}\x{11AA2}\x{11A00}\x{11A07}\x{11A47}\=callout_capture,callout_no_where
|
||||||
|
|
||||||
|
# These two are here because JIT is not yet updated. Also, the very first data
|
||||||
|
# line is handled differently by Perl.
|
||||||
|
|
||||||
|
/^\X/utf
|
||||||
|
A\x{200d}B A ZWJ
|
||||||
|
\x{261D}\x{1F3FB}B E_Base E_Modifier
|
||||||
|
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
|
||||||
|
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
|
||||||
|
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
|
||||||
|
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
|
||||||
|
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
|
||||||
|
|
||||||
|
# Regional indicators
|
||||||
|
|
||||||
|
/^(\X)(\X)/utf,aftertext
|
||||||
|
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
|
||||||
|
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
|
||||||
|
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -4667,4 +4667,38 @@ Callout 0: last capture = 1
|
||||||
0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
|
0: \x{1e900}\x{1e924}\x{1e953}\x{11c00}\x{11c2d}\x{11c3e}\x{11c70}\x{11c77}\x{11cab}\x{11400}\x{1142f}\x{11455}\x{104b0}\x{104d8}\x{104fb}\x{16fe0}\x{18800}\x{18af2}\x{11d00}\x{11d3a}\x{11d59}\x{16fe1}\x{1b170}\x{1b2fb}\x{11a50}\x{11a58}\x{11aa2}\x{11a00}\x{11a07}\x{11a47}
|
||||||
1: \x{11a00}\x{11a07}\x{11a47}
|
1: \x{11a00}\x{11a07}\x{11a47}
|
||||||
|
|
||||||
|
# These two are here because JIT is not yet updated. Also, the very first data
|
||||||
|
# line is handled differently by Perl.
|
||||||
|
|
||||||
|
/^\X/utf
|
||||||
|
A\x{200d}B A ZWJ
|
||||||
|
0: A\x{200d}
|
||||||
|
\x{261D}\x{1F3FB}B E_Base E_Modifier
|
||||||
|
0: \x{261d}\x{1f3fb}
|
||||||
|
\x{1F466}\x{1F3FF}B E_Base_GAZ E_Modifier
|
||||||
|
0: \x{1f466}\x{1f3ff}
|
||||||
|
\x{200d}\x{1F3A4}B ZWJ Glue_After_ZWJ
|
||||||
|
0: \x{200d}\x{1f3a4}
|
||||||
|
\x{200d}\x{1F469}B ZWJ E_Base_GAZ
|
||||||
|
0: \x{200d}\x{1f469}
|
||||||
|
\x{1F1E6}\x{1F1E7}B RegionalIndicator RegionalIndicator
|
||||||
|
0: \x{1f1e6}\x{1f1e7}
|
||||||
|
\x{261D}\x{E0100}\x{1F3FB}B\=no_jit E_Base Extend E_Modifier
|
||||||
|
** /n is not valid here
|
||||||
|
|
||||||
|
# Regional indicators
|
||||||
|
|
||||||
|
/^(\X)(\X)/utf,aftertext
|
||||||
|
\x{1F1E6}\x{1F1E7}\x{1F1E7}B\=no_jit
|
||||||
|
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}
|
||||||
|
0+ B
|
||||||
|
1: \x{1f1e6}\x{1f1e7}
|
||||||
|
2: \x{1f1e7}
|
||||||
|
\x{1F1E6}\x{1F1E7}\x{1F1E7}\x{1F1E6}B\=no_jit
|
||||||
|
0: \x{1f1e6}\x{1f1e7}\x{1f1e7}\x{1f1e6}
|
||||||
|
0+ B
|
||||||
|
1: \x{1f1e6}\x{1f1e7}
|
||||||
|
2: \x{1f1e7}\x{1f1e6}
|
||||||
|
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
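The Regional Indicator handling added in the hunks above can be illustrated in isolation. The following is a minimal sketch, not the PCRE2 implementation: the `gb_class` enum and the functions `ri_break_required` and `count_clusters` are hypothetical names introduced for illustration, and the model knows only two break classes (RI and everything else), so every other pair of characters is treated as a break. It mirrors the `ricount` logic: count the RIs that precede the left-hand character, and require a break when that count is odd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified break classes. The real code uses the
   ucp_gb* values returned by UCD_GRAPHBREAK(). */
typedef enum { GB_OTHER, GB_RI } gb_class;

/* Return 1 if a grapheme break is required between cls[pos-1] and
   cls[pos], 0 if the two characters stay in the same cluster. In this
   reduced model only an RI RI pair can join. */
static int ri_break_required(const gb_class *cls, size_t n, size_t pos)
{
  size_t left;
  size_t ricount = 0;

  assert(pos > 0 && pos < n);
  if (cls[pos - 1] != GB_RI || cls[pos] != GB_RI) return 1;

  /* Count the RIs that precede the left-hand character, scanning
     backwards until a non-RI or the start of the subject. */
  left = pos - 1;
  while (left > 0 && cls[left - 1] == GB_RI)
    {
    left--;
    ricount++;
    }

  /* An odd count means the left-hand RI already closes a flag pair,
     so a break is required before the right-hand RI. */
  return (ricount & 1) != 0;
}

/* Count the clusters in a sequence under the reduced model. */
static size_t count_clusters(const gb_class *cls, size_t n)
{
  size_t i;
  size_t count = (n > 0) ? 1 : 0;

  for (i = 1; i < n; i++)
    if (ri_break_required(cls, n, i)) count++;
  return count;
}
```

Under these assumptions, the class sequence RI RI RI Other splits as (RI RI)(RI)(Other), and RI RI RI RI Other splits as (RI RI)(RI RI)(Other), which agrees with the captures shown for the two "Regional indicators" test lines in the expected output above.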