<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
<!ENTITY version SYSTEM "version.xml">
]>
<chapter id="clusters">
<title>Clusters</title>
<section id="clusters-and-shaping">
<title>Clusters and shaping</title>
<para>
In text shaping, a <emphasis>cluster</emphasis> is a sequence of
characters that needs to be treated as a single, indivisible
unit. A single letter or symbol can be a cluster of its
own. Other clusters correspond to longer subsequences of the
input code points — such as a ligature or conjunct form
— and require the shaper to ensure that the cluster is not
broken during the shaping process.
</para>
<para>
A cluster is distinct from a <emphasis>grapheme</emphasis>,
which is the smallest functional unit of a writing system or
script.
</para>
<para>
The definitions of the two terms are similar. However, clusters
are only relevant for script shaping and glyph layout. In
contrast, graphemes are a property of the underlying script, and
are of interest when client programs implement orthographic
or linguistic functionality.
</para>
<para>
For example, two individual letters are often two separate
graphemes. When two letters form a ligature, however, they
combine into a single glyph. They are then part of the same
cluster and are treated as a unit by the shaping engine —
even though the two original, underlying letters remain separate
graphemes.
</para>
<para>
HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
with graphemes — although client programs using HarfBuzz
may still care about graphemes for other reasons from time to time.
</para>
<para>
During the shaping process, there are several shaping operations
that may merge adjacent characters (for example, when two code
points form a ligature or a conjunct form and are replaced by a
single glyph) or split one character into several (for example,
when decomposing a code point through the
<literal>ccmp</literal> feature). Operations like these alter
clusters; HarfBuzz tracks the changes to ensure that no clusters
get lost or broken during shaping.
</para>
<para>
HarfBuzz records cluster information independently from how
shaping operations affect the individual glyphs returned in an
output buffer. Consequently, a client program using HarfBuzz can
utilize the cluster information to implement features such as:
</para>
<itemizedlist>
<listitem>
<para>
Correctly positioning the cursor within a shaped text run,
even when characters have formed ligatures, composed or
decomposed, reordered, or undergone other shaping operations.
</para>
</listitem>
<listitem>
<para>
Correctly highlighting a text selection that includes some,
but not all, of the characters in a word.
</para>
</listitem>
<listitem>
<para>
Applying text attributes (such as color or underlining) to
part, but not all, of a word.
</para>
</listitem>
<listitem>
<para>
Generating output document formats (such as PDF) with
embedded text that can be fully extracted.
</para>
</listitem>
<listitem>
<para>
Determining the mapping between input characters and output
glyphs, such as which glyphs are ligatures.
</para>
</listitem>
<listitem>
<para>
Performing line-breaking, justification, and other
line-level or paragraph-level operations that must be done
after shaping is complete, but which require examining
character-level properties.
</para>
</listitem>
</itemizedlist>
</section>
<section id="working-with-harfbuzz-clusters">
<title>Working with HarfBuzz clusters</title>
<para>
When you add text to a HarfBuzz buffer, each code point must be
assigned a <emphasis>cluster value</emphasis>.
</para>
<para>
This cluster value is an arbitrary number; HarfBuzz uses it only
to distinguish between clusters. Many client programs will use
the index of each code point in the input text stream as the
cluster value. This is for the sake of convenience; the actual
value does not matter.
</para>
<para>
Some of the shaping operations performed by HarfBuzz —
such as reordering, composition, decomposition, and substitution
— may alter the cluster values of some characters. The
final cluster values in the buffer at the end of the shaping
process will indicate to client programs which subsequences of
glyphs represent a cluster and, therefore, must not be
separated.
</para>
<para>
In addition, client programs can query the final cluster values
to discern other potentially important information about the
glyphs in the output buffer (such as whether or not a ligature
was formed).
</para>
<para>
For example, if the initial sequence of cluster values was:
</para>
<programlisting>
0,1,2,3,4
</programlisting>
<para>
and the final sequence of cluster values is:
</para>
<programlisting>
0,0,3,3
</programlisting>
<para>
then there are two clusters in the output buffer: the first
cluster includes the first two glyphs, and the second cluster
includes the third and fourth glyphs. It is also evident that a
ligature or conjunct has been formed, because there are fewer
glyphs in the output buffer (four) than there were code points
in the input buffer (five).
</para>
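<para>
The reasoning above can be sketched in a few lines of Python. This
is only an illustration of how a client program might read final
cluster values; it is not part of the HarfBuzz API. The function
name <literal>group_clusters</literal> is hypothetical, and the
cluster values are the ones from the example.
</para>
<programlisting><![CDATA[
```python
from itertools import groupby


def group_clusters(cluster_values):
    """Group glyph indices into runs of adjacent glyphs sharing a cluster value."""
    runs = []
    for value, members in groupby(enumerate(cluster_values), key=lambda pair: pair[1]):
        runs.append((value, [index for index, _ in members]))
    return runs


initial = [0, 1, 2, 3, 4]   # one cluster value per input code point
final = [0, 0, 3, 3]        # one cluster value per output glyph

clusters = group_clusters(final)
print(clusters)

# Fewer glyphs than code points suggests that some code points were combined.
print(len(initial) - len(final))
```
]]></programlisting>
<para>
Each entry pairs a cluster value with the indices of the glyphs
that carry it, so a client program can treat every entry as one
unit that must not be separated.
</para>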
<para>
Although client programs using HarfBuzz are free to assign
initial cluster values in any manner they choose to, HarfBuzz
does offer some useful guarantees if the cluster values are
assigned in a monotonic (either non-decreasing or non-increasing)
order.
</para>
<para>
For buffers in the left-to-right (LTR)
or top-to-bottom (TTB) text flow direction,
HarfBuzz will preserve the monotonic property: client programs
are guaranteed that monotonically increasing initial cluster
values will be returned as monotonically increasing final
cluster values.
</para>
<para>
For buffers in the right-to-left (RTL)
or bottom-to-top (BTT) text flow direction,
the directionality of the buffer itself is reversed for final
output as a matter of design. Therefore, HarfBuzz inverts the
monotonic property: client programs are guaranteed that
monotonically increasing initial cluster values will be
returned as monotonically <emphasis>decreasing</emphasis> final
cluster values.
</para>
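<para>
A client program could check these guarantees with a sketch like
the following. Two assumptions are made here: "monotonically
increasing" is read as non-decreasing, since merged clusters repeat
a value, and the direction labels are plain strings chosen for this
example rather than HarfBuzz's own direction values. The function
names are hypothetical.
</para>
<programlisting><![CDATA[
```python
def is_non_decreasing(values):
    """True if no value is smaller than the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def is_non_increasing(values):
    """True if no value is larger than the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


def order_matches_direction(final_clusters, direction):
    """Check the order expected when the initial values were increasing.

    direction is "LTR", "TTB", "RTL", or "BTT" (labels for this sketch only).
    """
    if direction in ("LTR", "TTB"):
        return is_non_decreasing(final_clusters)
    if direction in ("RTL", "BTT"):
        return is_non_increasing(final_clusters)
    raise ValueError("unknown direction: " + direction)


print(order_matches_direction([0, 0, 3, 3], "LTR"))
print(order_matches_direction([4, 3, 3, 0], "RTL"))
```
]]></programlisting>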
<para>
Client programs can adjust how HarfBuzz handles clusters during
shaping by setting the
<literal>cluster_level</literal> of the
buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
clustering support for this property:
</para>
<itemizedlist>
<listitem>
<para><emphasis>Level 0</emphasis> is the default and
reproduces the behavior of the old HarfBuzz library.
</para>
<para>
The distinguishing feature of level 0 behavior is that, at
the beginning of processing the buffer, all code points that
are categorized as <emphasis>marks</emphasis>,
<emphasis>modifier symbols</emphasis>, or
<emphasis>Emoji extended pictographic</emphasis> modifiers,
as well as the <emphasis>Zero Width Joiner</emphasis> and
<emphasis>Zero Width Non-Joiner</emphasis> code points, are
assigned the cluster value of the closest preceding code
point from a <emphasis>different</emphasis> category.
</para>
<para>
In essence, whenever a base character is followed by a mark
character or a sequence of mark characters, those marks are
reassigned to the same initial cluster value as the base
character. This reassignment is referred to as
"merging" the affected clusters. This behavior is based on
the Grapheme Cluster Boundary specification in <ulink
url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
Standard Annex #29</ulink>.
</para>
<para>
Client programs can specify level 0 behavior for a buffer by
setting its <literal>cluster_level</literal> to
<literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
</para>
</listitem>
<listitem>
<para>
<emphasis>Level 1</emphasis> tweaks the old behavior
slightly to produce better results. Therefore, level 1
clustering is recommended for code that is not required to
implement backward compatibility with the old HarfBuzz.
</para>
<para>
Level 1 differs from level 0 by not merging the
clusters of marks and other modifier code points with the
preceding "base" code point's cluster. By preserving the
separate cluster values of these marks and modifier code
points, script shapers can perform additional operations
that might lead to improved results (for example, reordering
a sequence of marks).
</para>
<para>
Client programs can specify level 1 behavior for a buffer by
setting its <literal>cluster_level</literal> to
<literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
</para>
</listitem>
<listitem>
<para>
<emphasis>Level 2</emphasis> differs significantly in how it
treats cluster values. In level 2, HarfBuzz never merges
clusters.
</para>
<para>
This difference can be seen most clearly when HarfBuzz processes
ligature substitutions and glyph decompositions. In level 0
and level 1, ligatures and glyph decomposition both involve
merging clusters; in level 2, neither of these operations
triggers a merge.
</para>
<para>
Client programs can specify level 2 behavior for a buffer by
setting its <literal>cluster_level</literal> to
<literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
</para>
</listitem>
</itemizedlist>
<para>
As mentioned earlier, client programs using HarfBuzz often
assign initial cluster values in a buffer by reusing the indices
of the code points in the input text. This gives a sequence of
cluster values that is monotonically increasing (for example,
0,1,2,3,4).
</para>
<para>
It is not <emphasis>required</emphasis> that the cluster values
in a buffer be monotonically increasing. However, if the initial
cluster values in a buffer are monotonic and the buffer is
configured to use cluster level 0 or 1, then HarfBuzz
guarantees that the final cluster values in the shaped buffer
will also be monotonic. No such guarantee is made for cluster
level 2.
</para>
<para>
In levels 0 and 1, HarfBuzz implements the following conceptual
model for cluster values:
</para>
<itemizedlist spacing="compact">
<listitem>
<para>
If the sequence of input cluster values is monotonic, the
sequence of cluster values will remain monotonic.
</para>
</listitem>
<listitem>
<para>
Each cluster value represents a single cluster.
</para>
</listitem>
<listitem>
<para>
Each cluster contains one or more glyphs and one or more
characters.
</para>
</listitem>
</itemizedlist>
<para>
In practice, this model offers several benefits. Assuming that
the initial cluster values were monotonically increasing
and distinct before shaping began, then, in the final output:
</para>
<itemizedlist spacing="compact">
<listitem>
<para>
All adjacent glyphs having the same final cluster
value belong to the same cluster.
</para>
</listitem>
<listitem>
<para>
Each character belongs to the cluster that has the highest
cluster value <emphasis>not larger than</emphasis> its
initial cluster value.
</para>
</listitem>
</itemizedlist>
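<para>
The second rule lends itself to a short sketch. The code below is
one possible way for a client program to map a character back to
its cluster, assuming that the initial cluster values were distinct
and increasing; the helper name
<literal>cluster_for_character</literal> and the sample values are
made up for this illustration.
</para>
<programlisting><![CDATA[
```python
from bisect import bisect_right


def cluster_for_character(initial_value, final_values):
    """Return the highest final cluster value not larger than initial_value."""
    distinct = sorted(set(final_values))
    position = bisect_right(distinct, initial_value)
    if position == 0:
        raise ValueError("no final cluster value at or below the initial value")
    return distinct[position - 1]


# Five characters with initial cluster values 0..4; shaping left
# three clusters with the values 0, 1, and 4.
final = [0, 1, 1, 1, 1, 4]
mapping = {value: cluster_for_character(value, final) for value in range(5)}
print(mapping)
```
]]></programlisting>
<para>
In this sample, the characters that started with cluster values 2
and 3 are found to belong to cluster 1, because 1 is the highest
final value that does not exceed either of them.
</para>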
</section>
<section id="a-clustering-example-for-levels-0-and-1">
<title>A clustering example for levels 0 and 1</title>
<para>
The basic shaping operations affect clusters in a predictable
manner when using level 0 or level 1:
</para>
<itemizedlist>
<listitem>
<para>
When two or more clusters <emphasis>merge</emphasis>, the
resulting merged cluster takes as its cluster value the
<emphasis>minimum</emphasis> of the incoming cluster values.
</para>
</listitem>
<listitem>
<para>
When a cluster <emphasis>decomposes</emphasis>, all of the
resulting child clusters inherit as their cluster value the
cluster value of the parent cluster.
</para>
</listitem>
<listitem>
<para>
When a character is <emphasis>reordered</emphasis>, the
reordered character and all clusters that the character
moves past as part of the reordering are merged into one cluster.
</para>
</listitem>
</itemizedlist>
<para>
The functionality, guarantees, and benefits of level 0 and level
1 behavior can be seen with some examples. First, let us examine
what happens with cluster values when shaping involves cluster
merging with ligatures and decomposition.
</para>
<para>
Let us say we start with the following character sequence (top row) and
initial cluster values (bottom row):
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
During shaping, HarfBuzz maps these characters to glyphs from
the font. For simplicity, let us assume that each character maps
to the corresponding, identical-looking glyph:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now if, for example, <literal>B</literal> and <literal>C</literal>
form a ligature, then the clusters to which they belong
"merge". This merged cluster takes for its cluster
value the minimum of all the cluster values of the clusters that
went into the ligature. In this case, we get:
</para>
<programlisting>
A,BC,D,E
0,1 ,3,4
</programlisting>
<para>
because 1 is the minimum of the set {1,2}, which were the
cluster values of <literal>B</literal> and
<literal>C</literal>.
</para>
<para>
Next, let us say that the <literal>BC</literal> ligature glyph
decomposes into three components, and <literal>D</literal> also
decomposes into two components. Whenever a cluster decomposes,
its components each inherit the cluster value of their parent:
</para>
<programlisting>
A,BC0,BC1,BC2,D0,D1,E
0,1  ,1  ,1  ,3 ,3 ,4
</programlisting>
<para>
Next, if <literal>BC2</literal> and <literal>D0</literal> form a
ligature, then their clusters (cluster values 1 and 3) merge into
<literal>min(1,3) = 1</literal>:
</para>
<programlisting>
A,BC0,BC1,BC2D0,D1,E
0,1  ,1  ,1    ,1 ,4
</programlisting>
<para>
Note that the entirety of cluster 3 merges into cluster 1, not
just the <literal>D0</literal> glyph. This reflects the fact
that the cluster <emphasis>must</emphasis> be treated as an
indivisible unit.
</para>
<para>
At this point, cluster 1 means: the character sequence
<literal>BCD</literal> is represented by glyphs
<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
further.
</para>
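<para>
The whole walk-through can be replayed with a small Python model.
This is a simplified sketch of the merge and decompose rules as
described in this section, not HarfBuzz's implementation; the
function names are hypothetical, and the model assumes that glyphs
sharing a cluster value are adjacent, as they are in levels 0 and 1.
</para>
<programlisting><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Merge the clusters touched by glyphs[start:end] into their minimum value.

    Every glyph that shares a cluster value with the span is updated too,
    because a cluster is treated as an indivisible unit.
    """
    affected = {cluster for _, cluster in glyphs[start:end]}
    lowest = min(affected)
    return [(name, lowest if cluster in affected else cluster)
            for name, cluster in glyphs]


def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one ligature glyph after merging clusters."""
    merged = merge_clusters(glyphs, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(glyphs, index, names):
    """Replace one glyph with components that inherit its cluster value."""
    cluster = glyphs[index][1]
    return glyphs[:index] + [(n, cluster) for n in names] + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3, "BC")                    # B and C form a ligature
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])   # BC splits into three
glyphs = decompose(glyphs, 4, ["D0", "D1"])            # D splits into two
glyphs = ligate(glyphs, 3, 5, "BC2D0")                 # BC2 and D0 form a ligature
print(glyphs)
```
]]></programlisting>
<para>
The final list carries the cluster values 0,1,1,1,1,4, matching
the last listing above: <literal>D1</literal> ends up in cluster 1
even though it took no part in the second ligature.
</para>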
</section>
<section id="reordering-in-levels-0-and-1">
<title>Reordering in levels 0 and 1</title>
<para>
Another common operation in the more complex shapers is glyph
reordering. In order to maintain a monotonic cluster sequence
when glyph reordering takes place, HarfBuzz merges the clusters
of everything in the reordering sequence.
</para>
<para>
For example, let us again start with the character sequence (top
row) and initial cluster values (bottom row):
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
If <literal>D</literal> is reordered to the position immediately
before <literal>B</literal>, then HarfBuzz merges the
<literal>B</literal>, <literal>C</literal>, and
<literal>D</literal> clusters — all the clusters between
the final position of the reordered glyph and its original
position. This means that we get:
</para>
<programlisting>
A,D,B,C,E
0,1,1,1,4
</programlisting>
<para>
as the final cluster sequence.
</para>
<para>
Merging this many clusters is not ideal, but it is the only
sensible way for HarfBuzz to maintain the guarantee that the
sequence of cluster values remains monotonic and to retain the
true relationship between glyphs and characters.
</para>
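<para>
A minimal Python model of this reordering rule is shown below. It
is a sketch under simplifying assumptions: every glyph between the
old and new positions is given the minimum cluster value, and
neighboring glyphs outside that span are left alone. The function
name <literal>reorder_with_merge</literal> is hypothetical.
</para>
<programlisting><![CDATA[
```python
def reorder_with_merge(glyphs, source, target):
    """Move glyphs[source] to index target and merge every cluster it passes."""
    low, high = min(source, target), max(source, target)
    lowest = min(cluster for _, cluster in glyphs[low:high + 1])
    moved = list(glyphs)
    moved.insert(target, moved.pop(source))
    return [(name, lowest if low <= index <= high else cluster)
            for index, (name, cluster) in enumerate(moved)]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
reordered = reorder_with_merge(glyphs, 3, 1)   # D moves to just before B
print(reordered)
```
]]></programlisting>
<para>
The result keeps the cluster values in non-decreasing order
(0,1,1,1,4), which is the property that the merge is meant to
protect.
</para>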
</section>
<section id="the-distinction-between-levels-0-and-1">
<title>The distinction between levels 0 and 1</title>
<para>
The preceding examples demonstrate the main effects of using
cluster levels 0 and 1. The only difference between the two
levels is this: in level 0, at the very beginning of the shaping
process, HarfBuzz merges the cluster of each base character
with the clusters of all Unicode marks (combining or not) and
modifiers that follow it.
</para>
<para>
For example, let us start with the following character sequence
(top row) and accompanying initial cluster values (bottom row):
</para>
<programlisting>
A,acute,B
0,1    ,2
</programlisting>
<para>
The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
using cluster level 0 on this sequence, then the
<literal>A</literal> and <literal>acute</literal> clusters will
merge, and the result will become:
</para>
<programlisting>
A,acute,B
0,0    ,2
</programlisting>
<para>
This merger is performed before any other script-shaping
steps.
</para>
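<para>
The initial merging step can be approximated in Python with the
standard <literal>unicodedata</literal> module. This sketch only
covers code points whose general category starts with
<literal>M</literal> (marks); the modifier symbols, Emoji modifiers,
and joiner code points that level 0 also handles are left out to
keep the example short. The function name is hypothetical.
</para>
<programlisting><![CDATA[
```python
import unicodedata


def merge_marks_into_bases(text, clusters):
    """Give each mark the cluster value of the code point before it."""
    merged = list(clusters)
    for index, char in enumerate(text):
        is_mark = unicodedata.category(char).startswith("M")
        if is_mark and index > 0:
            # A run of marks chains back to the value of its base character.
            merged[index] = merged[index - 1]
    return merged


text = "A\u0301B"   # A, COMBINING ACUTE ACCENT, B
print(merge_marks_into_bases(text, [0, 1, 2]))
```
]]></programlisting>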
<para>
This initial cluster merging is the default behavior of the
Windows shaping engine, and the old HarfBuzz codebase copied
that behavior to maintain compatibility. Consequently, it has
remained the default behavior in the new HarfBuzz codebase.
</para>
<para>
But this initial cluster-merging behavior makes it impossible
for client programs to implement some features (such as
coloring diacritic marks differently from their base
characters). That is why, in level 1, HarfBuzz does not perform
the initial merging step.
</para>
<para>
For client programs that rely on HarfBuzz cluster values to
perform cursor positioning, level 0 is more convenient. But
relying on cluster boundaries for cursor positioning is wrong: cursor
positions should be determined based on Unicode grapheme
boundaries, not on shaping-cluster boundaries. As such, using
level 1 clustering behavior is recommended.
</para>
<para>
One final facet of levels 0 and 1 is worth noting. HarfBuzz
currently does not allow any
<emphasis>multiple-substitution</emphasis> GSUB lookups to
replace a glyph with zero glyphs (in other words, to delete a
glyph).
</para>
<para>
But, in some other situations, glyphs can be deleted. In
those cases, if the glyph being deleted is the last glyph of its
cluster, HarfBuzz makes sure to merge the deleted glyph's
cluster with a neighboring cluster.
</para>
<para>
This is done primarily to make sure that the first cluster of a
run always has a cluster value that points to the start of the
text for that run; more than one client program currently relies
on this guarantee.
</para>
<para>
Incidentally, Apple's CoreText does something different to
maintain the same promise: it inserts a glyph with id 65535 at
the beginning of the glyph string if the glyph corresponding to
the first character in the run was deleted. HarfBuzz might do
something similar in the future.
</para>
</section>
<section id="level-2">
<title>Level 2</title>
<para>
HarfBuzz's level 2 cluster behavior uses a significantly
different model than that of level 0 and level 1.
</para>
<para>
The level 2 behavior is easy to describe, but it may be
difficult to understand in practical terms. In brief, level 2
performs no merging of clusters whatsoever.
</para>
<para>
This means that there is no initial base-and-mark merging step
(as is done in level 0), and it means that reordering moves and
ligature substitutions do not trigger a cluster merge.
</para>
<para>
Only one shaping operation directly affects clusters when using
level 2:
</para>
<itemizedlist>
<listitem>
<para>
When a cluster <emphasis>decomposes</emphasis>, all of the
resulting child clusters inherit as their cluster value the
cluster value of the parent cluster.
</para>
</listitem>
</itemizedlist>
<para>
When glyphs do form a ligature (or when some other feature
replaces multiple glyphs with one glyph), the cluster value
of the first glyph is retained as the cluster value for the
resulting ligature.
</para>
<para>
This occurrence sounds similar to a cluster merge, but it is
different. In particular, no subsequent characters —
including marks and modifiers — are affected. They retain
their previous cluster values.
</para>
<para>
Level 2 cluster behavior is ultimately less complex than level 0
or level 1, but there are several cases for which processing
cluster values produced at level 2 may be tricky.
</para>
<section id="ligatures-with-combining-marks-in-level-2">
<title>Ligatures with combining marks in level 2</title>
<para>
The first example of how HarfBuzz's level 2 cluster behavior
can be tricky is when the text to be shaped includes combining
marks attached to ligatures.
</para>
<para>
Let us start with an input sequence with the following
characters (top row) and initial cluster values (bottom row):
</para>
<programlisting>
A,acute,B,breve,C,circumflex
0,1    ,2,3    ,4,5
</programlisting>
<para>
If the sequence <literal>A,B,C</literal> forms a ligature,
then these are the cluster values HarfBuzz will return under
the various cluster levels:
</para>
<para>
Level 0:
</para>
<programlisting>
ABC,acute,breve,circumflex
0  ,0    ,0    ,0
</programlisting>
<para>
Level 1:
</para>
<programlisting>
ABC,acute,breve,circumflex
0  ,0    ,0    ,5
</programlisting>
<para>
Level 2:
</para>
<programlisting>
ABC,acute,breve,circumflex
0  ,1    ,3    ,5
</programlisting>
<para>
Making sense of the level 2 result is the hardest for a client
program, because there is nothing in the cluster values that
indicates that <literal>B</literal> and <literal>C</literal>
formed a ligature with <literal>A</literal>.
</para>
<para>
In contrast, the "merged" cluster values of the mark glyphs
that are seen in the level 0 and level 1 output are evidence
that a ligature substitution took place.
</para>
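<para>
The three results can be reproduced with a small Python model of
the rules described in this chapter. It is a sketch written for
this one example, not a description of HarfBuzz internals: the
function name <literal>shape_ligature_with_marks</literal> is
hypothetical, and the assumption that the marks follow the ligature
glyph in the output is taken from the listings above.
</para>
<programlisting><![CDATA[
```python
def shape_ligature_with_marks(level):
    """Model the A,acute,B,breve,C,circumflex example for one cluster level."""
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    names = ["A", "acute", "B", "breve", "C", "circumflex"]
    marks = {"acute", "breve", "circumflex"}
    clusters = [0, 1, 2, 3, 4, 5]

    if level == 0:
        # Initial step: each mark joins the cluster of the preceding base.
        for i, name in enumerate(names):
            if name in marks:
                clusters[i] = clusters[i - 1]

    bases = [i for i, name in enumerate(names) if name not in marks]
    first, last = bases[0], bases[-1]
    if level in (0, 1):
        # Merge every cluster touched by the span from A through C.
        affected = set(clusters[first:last + 1])
        lowest = min(affected)
        clusters = [lowest if c in affected else c for c in clusters]

    # The ligature keeps the first component's value; the marks follow it.
    glyphs = [("ABC", clusters[first])]
    glyphs += [(names[i], clusters[i]) for i in range(len(names)) if names[i] in marks]
    return glyphs


for level in (0, 1, 2):
    print(level, [cluster for _, cluster in shape_ligature_with_marks(level)])
```
]]></programlisting>
<para>
In level 1 the circumflex keeps its own value because it comes
after the last ligature component, whereas in level 0 it was
already merged with <literal>C</literal> before the ligature
formed.
</para>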
</section>
<section id="reordering-in-level-2">
<title>Reordering in level 2</title>
<para>
Another example of how HarfBuzz's level 2 cluster behavior
can be tricky is when glyphs reorder. Consider an input sequence
with the following characters (top row) and initial cluster
values (bottom row):
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now imagine <literal>D</literal> moves before
<literal>B</literal> in a reordering operation. The cluster
values will then be:
</para>
<programlisting>
A,D,B,C,E
0,3,1,2,4
</programlisting>
<para>
Next, if <literal>D</literal> forms a ligature with
<literal>B</literal>, the output is:
</para>
<programlisting>
A,DB,C,E
0,3 ,2,4
</programlisting>
<para>
However, in a different scenario, in which the shaping rules
of the script instead caused <literal>A</literal> and
<literal>B</literal> to form a ligature
<emphasis>before</emphasis> the <literal>D</literal> reordered, the
result would be:
</para>
<programlisting>
AB,D,C,E
0 ,3,2,4
</programlisting>
<para>
There is no way for a client program to differentiate between
these two scenarios based on the cluster values
alone. Consequently, client programs that use level 2 might
need to undertake additional work in order to manage cursor
positioning, text attributes, or other desired features.
</para>
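<para>
The ambiguity can be demonstrated with a short Python model of the
level 2 rules as described here: a reordered glyph keeps its
cluster value, and a ligature keeps the value of its first glyph.
The helper names <literal>move</literal> and
<literal>ligate_level2</literal> are hypothetical.
</para>
<programlisting><![CDATA[
```python
def move(glyphs, source, target):
    """Move glyphs[source] to index target without touching cluster values."""
    moved = list(glyphs)
    moved.insert(target, moved.pop(source))
    return moved


def ligate_level2(glyphs, start, end, name):
    """Replace glyphs[start:end] with one glyph keeping the first glyph's value."""
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


initial = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario 1: D moves before B, then D and B form a ligature.
first = ligate_level2(move(initial, 3, 1), 1, 3, "DB")

# Scenario 2: A and B form a ligature, then D moves before C.
second = move(ligate_level2(initial, 0, 2, "AB"), 2, 1)

first_values = [cluster for _, cluster in first]
second_values = [cluster for _, cluster in second]
print(first_values == second_values)
```
]]></programlisting>
<para>
Both scenarios produce the cluster values 0,3,2,4, so only the
glyph names, which a client program does not get from the cluster
values, tell the two apart.
</para>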
</section>
<section id="other-considerations-in-level-2">
<title>Other considerations in level 2</title>
<para>
There may be other problems encountered with ligatures under
level 2, such as when the direction of the text is forced to
the opposite of its natural direction (for example, Arabic text
that is forced into left-to-right directionality). But,
generally speaking, these other scenarios are minor corner
cases that are too obscure for most client programs to need to
worry about.
</para>
</section>
</section>
</chapter>