Usermanual: expand clusters chapter.
parent
30cb45b3ea
commit
53ac46e974
|
@ -5,306 +5,509 @@
|
||||||
<!ENTITY version SYSTEM "version.xml">
|
<!ENTITY version SYSTEM "version.xml">
|
||||||
]>
|
]>
|
||||||
<chapter id="clusters">
|
<chapter id="clusters">
|
||||||
<sect1 id="clusters">
|
|
||||||
<title>Clusters</title>
|
<title>Clusters</title>
|
||||||
<para>
|
<section id="clusters">
|
||||||
In shaping text, a <emphasis>cluster</emphasis> is a sequence of
|
<title>Clusters</title>
|
||||||
code points that needs to be treated as a single, indivisible unit.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
When you add text to a HB buffer, each character is associated with
|
|
||||||
a <emphasis>cluster value</emphasis>. This is an arbitrary number as
|
|
||||||
far as HB is concerned.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
|
|
||||||
actual number does not matter. Moreover, it is not required for the
|
|
||||||
cluster values to be monotonically increasing, but pretty much all
|
|
||||||
of HB's tests are performed on monotonically increasing cluster
|
|
||||||
numbers. Nevertheless, there is no such assumption in the code
|
|
||||||
itself. With that in mind, let's examine what happens with cluster
|
|
||||||
values during shaping under each cluster-level.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
HarfBuzz provides three <emphasis>levels</emphasis> of clustering
|
|
||||||
support. Level 0 is the default behavior and reproduces the behavior
|
|
||||||
of the old HarfBuzz library. Level 1 tweaks this behavior slightly
|
|
||||||
to produce better results, so level 1 clustering is recommended for
|
|
||||||
code that is not required to implement backward compatibility with
|
|
||||||
the old HarfBuzz.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
Level 2 differs significantly in how it treats cluster values.
|
|
||||||
Levels 0 and 1 both process ligatures and glyph decomposition by
|
|
||||||
merging clusters; level 2 does not.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
The conceptual model for what the cluster values mean, in levels 0
|
|
||||||
and 1, is this:
|
|
||||||
</para>
|
|
||||||
<itemizedlist spacing="compact">
|
|
||||||
<listitem>
|
|
||||||
<para>
|
|
||||||
the sequence of cluster values will always remain monotone
|
|
||||||
</para>
|
|
||||||
</listitem>
|
|
||||||
<listitem>
|
|
||||||
<para>
|
|
||||||
each value represents a single cluster
|
|
||||||
</para>
|
|
||||||
</listitem>
|
|
||||||
<listitem>
|
|
||||||
<para>
|
|
||||||
each cluster contains one or more glyphs and one or more
|
|
||||||
characters
|
|
||||||
</para>
|
|
||||||
</listitem>
|
|
||||||
</itemizedlist>
|
|
||||||
<para>
|
|
||||||
Assuming that initial cluster numbers were monotonically increasing
|
|
||||||
and distinct, then all adjacent glyphs having the same cluster
|
|
||||||
number belong to the same cluster, and all characters belong to the
|
|
||||||
cluster that has the highest number not larger than their initial
|
|
||||||
cluster number. This will become clearer with an example.
|
|
||||||
</para>
|
|
||||||
</sect1>
|
|
||||||
<sect1 id="a-clustering-example-for-levels-0-and-1">
|
|
||||||
<title>A clustering example for levels 0 and 1</title>
|
|
||||||
<para>
|
|
||||||
Let's say we start with the following character sequence and cluster
|
|
||||||
values:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,B,C,D,E
|
|
||||||
0,1,2,3,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
We then map the characters to glyphs. For simplicity, let's assume
|
|
||||||
that each character maps to the corresponding, identical-looking
|
|
||||||
glyph:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,B,C,D,E
|
|
||||||
0,1,2,3,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
Now if, for example, <literal>B</literal> and <literal>C</literal>
|
|
||||||
ligate, then the clusters to which they belong "merge".
|
|
||||||
This merged cluster takes for its cluster number the minimum of all
|
|
||||||
the cluster numbers of the clusters that went in. In this case, we
|
|
||||||
get:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,BC,D,E
|
|
||||||
0,1 ,3,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
Now let's assume that the <literal>BC</literal> glyph decomposes
|
|
||||||
into three components, and <literal>D</literal> also decomposes into
|
|
||||||
two. The components each inherit the cluster value of their parent:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,BC0,BC1,BC2,D0,D1,E
|
|
||||||
0,1 ,1 ,1 ,3 ,3 ,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
|
|
||||||
their clusters (numbers 1 and 3) merge into
|
|
||||||
<literal>min(1,3) = 1</literal>:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,BC0,BC1,BC2D0,D1,E
|
|
||||||
0,1 ,1 ,1 ,1 ,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
At this point, cluster 1 means: the character sequence
|
|
||||||
<literal>BCD</literal> is represented by glyphs
|
|
||||||
<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
|
|
||||||
further.
|
|
||||||
</para>
|
|
||||||
</sect1>
|
|
||||||
<sect1 id="reordering-in-levels-0-and-1">
|
|
||||||
<title>Reordering in levels 0 and 1</title>
|
|
||||||
<para>
|
|
||||||
Another common operation in the more complex shapers is when things
|
|
||||||
reorder. In those cases, to maintain monotone clusters, HB merges
|
|
||||||
the clusters of everything in the reordering sequence. For example,
|
|
||||||
let's again start with the character sequence:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,B,C,D,E
|
|
||||||
0,1,2,3,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
If <literal>D</literal> is reordered before <literal>B</literal>,
|
|
||||||
then the <literal>B</literal>, <literal>C</literal>, and
|
|
||||||
<literal>D</literal> clusters merge, and we get:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,D,B,C,E
|
|
||||||
0,1,1,1,4
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
This is clearly not ideal, but it is the only sensible way to
|
|
||||||
maintain monotone indices and retain the true relationship between
|
|
||||||
glyphs and characters.
|
|
||||||
</para>
|
|
||||||
</sect1>
|
|
||||||
<sect1 id="the-distinction-between-levels-0-and-1">
|
|
||||||
<title>The distinction between levels 0 and 1</title>
|
|
||||||
<para>
|
|
||||||
So, the above is pretty much what cluster levels 0 and 1 do. The
|
|
||||||
only difference between the two is this: in level 0, at the very
|
|
||||||
beginning of the shaping process, we also merge clusters between
|
|
||||||
base characters and all Unicode marks (combining or not) following
|
|
||||||
them. E.g.:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,acute,B
|
|
||||||
0,1 ,2
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
will become:
|
|
||||||
</para>
|
|
||||||
<programlisting>
|
|
||||||
A,acute,B
|
|
||||||
0,0 ,2
|
|
||||||
</programlisting>
|
|
||||||
<para>
|
|
||||||
This is the default behavior. We do it because Windows did it and
|
|
||||||
old HarfBuzz did it, so this remained the default. But this behavior
|
|
||||||
makes it impossible to color diacritic marks differently from their
|
|
||||||
base characters. That's why in level 1 we do not perform this
|
|
||||||
initial merging step.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
For clients, level 0 is more convenient if they rely on HarfBuzz
|
|
||||||
clusters for cursor positioning. But that's wrong anyway: cursor
|
|
||||||
positions should be determined based on Unicode grapheme boundaries,
|
|
||||||
NOT shaping clusters. As such, level 1 clusters are preferred.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
One last note about levels 0 and 1. We currently don't allow a
|
|
||||||
<literal>MultipleSubst</literal> lookup to replace a glyph with zero
|
|
||||||
glyphs (i.e., to delete a glyph). But in some other situations,
|
|
||||||
glyphs can be deleted. In those cases, if the glyph being deleted is
|
|
||||||
the last glyph of its cluster, we make sure to merge the cluster
|
|
||||||
with a neighboring cluster.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
This is, primarily, to make sure that the starting cluster of the
|
|
||||||
text always has the cluster index pointing to the start of the text
|
|
||||||
for the run; more than one client currently relies on this
|
|
||||||
guarantee.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
Incidentally, Apple's CoreText does something else to maintain the
|
|
||||||
same promise: it inserts a glyph with id 65535 at the beginning of
|
|
||||||
the glyph string if the glyph corresponding to the first character
|
|
||||||
in the run was deleted. HarfBuzz might do something similar in the
|
|
||||||
future.
|
|
||||||
</para>
|
|
||||||
</sect1>
|
|
||||||
<sect1 id="level-2">
|
|
||||||
<title>Level 2</title>
|
|
||||||
<para>
|
|
||||||
Level 2 is a different beast from levels 0 and 1. It is simple to
|
|
||||||
describe, but hard to make sense of. It simply doesn't do any
|
|
||||||
cluster merging whatsoever. When things ligate or otherwise multiple
|
|
||||||
glyphs turn into one, the cluster value of the first glyph is
|
|
||||||
retained.
|
|
||||||
</para>
|
|
||||||
<para>
|
|
||||||
Here are a few examples of why processing cluster values produced at
|
|
||||||
this level might be tricky:
|
|
||||||
</para>
|
|
||||||
<sect2 id="ligatures-with-combining-marks">
|
|
||||||
<title>Ligatures with combining marks</title>
|
|
||||||
<para>
|
<para>
|
||||||
Imagine capital letters are bases and lower case letters are
|
In text shaping, a <emphasis>cluster</emphasis> is a sequence of
|
||||||
combining marks. With an input sequence like this:
|
characters that needs to be treated as a single, indivisible
|
||||||
|
unit.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
During the shaping process, some shaping operations may
|
||||||
|
merge adjacent characters (for example, when two code points form
|
||||||
|
a ligature and are replaced by a single glyph) or split one
|
||||||
|
character into several (for example, when performing the Unicode
|
||||||
|
canonical decomposition of a code point).
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
HarfBuzz tracks clusters independently from how these
|
||||||
|
shaping operations alter the individual glyphs that comprise the
|
||||||
|
output HarfBuzz returns in a buffer. Consequently,
|
||||||
|
a client program using HarfBuzz can utilize the cluster
|
||||||
|
information to implement features such as:
|
||||||
|
</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Correctly positioning the cursor between two characters that
|
||||||
|
have combined into a single glyph by forming a ligature.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Correctly highlighting a text selection that includes some,
|
||||||
|
but not all, of the characters comprising a ligature.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Applying text attributes (such as color or underlining) to
|
||||||
|
part, but not all, of a composed base-and-mark combination.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Generating output document formats (such as PDF) with
|
||||||
|
embedded text that can be fully extracted.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Performing line-breaking, justification, and other
|
||||||
|
line-level or paragraph-level operations that must be done
|
||||||
|
after shaping is complete, but which require character-level
|
||||||
|
properties.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>
|
||||||
|
When you add text to a HarfBuzz buffer, each code point is assigned
|
||||||
|
a <emphasis>cluster value</emphasis>.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
This cluster value is an arbitrary number; HarfBuzz uses it only
|
||||||
|
to distinguish between clusters. Many client programs will use
|
||||||
|
the index of each code point in the input text stream as the
|
||||||
|
cluster value, as a matter of convenience; the actual value does
|
||||||
|
not matter.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Client programs can choose how HarfBuzz handles clusters during
|
||||||
|
shaping by setting the
|
||||||
|
<literal>cluster_level</literal> of the
|
||||||
|
buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
|
||||||
|
clustering support for this property:
|
||||||
|
</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para><emphasis>Level 0</emphasis> is the default and
|
||||||
|
reproduces the behavior of the old HarfBuzz library.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
The distinguishing feature of level 0 behavior is that, at
|
||||||
|
the beginning of processing the buffer, all code points that
|
||||||
|
are categorized as <emphasis>marks</emphasis>,
|
||||||
|
<emphasis>modifier symbols</emphasis>, or
|
||||||
|
<emphasis>Emoji extended pictographic</emphasis> modifiers,
|
||||||
|
as well as the <emphasis>Zero Width Joiner</emphasis> and
|
||||||
|
<emphasis>Zero Width Non-Joiner</emphasis> code points, are
|
||||||
|
assigned the cluster value of the closest preceding code
|
||||||
|
point from a <emphasis>different</emphasis> category.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
In essence, whenever a base character is followed by a mark
|
||||||
|
character or a sequence of mark characters, those marks are
|
||||||
|
reassigned to the same initial cluster value as the base
|
||||||
|
character. This reassignment is referred to as
|
||||||
|
"merging" the affected clusters. This behavior is based on
|
||||||
|
the Grapheme Cluster Boundary specification in <ulink
|
||||||
|
url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
|
||||||
|
Standard Annex #29</ulink>.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Client programs can specify level 0 behavior for a buffer by
|
||||||
|
setting its <literal>cluster_level</literal> to
|
||||||
|
<literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<emphasis>Level 1</emphasis> tweaks the old behavior
|
||||||
|
slightly to produce better results. Therefore, level 1
|
||||||
|
clustering is recommended for code that is not required to
|
||||||
|
implement backward compatibility with the old HarfBuzz.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Level 1 differs from level 0 by not merging the
|
||||||
|
clusters of marks and other modifier code points with the
|
||||||
|
preceding "base" code point's cluster. By preserving the
|
||||||
|
cluster values of these marks and modifier code points,
|
||||||
|
script shaping can perform additional operations that might
|
||||||
|
lead to improved results (for example, reordering a sequence
|
||||||
|
of marks).
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Client programs can specify level 1 behavior for a buffer by
|
||||||
|
setting its <literal>cluster_level</literal> to
|
||||||
|
<literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
<emphasis>Level 2</emphasis> differs significantly in how it
|
||||||
|
treats cluster values. In level 2, HarfBuzz never merges
|
||||||
|
clusters.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
This difference can be seen most clearly when HarfBuzz processes
|
||||||
|
ligature substitutions and glyph decompositions. In level 0
|
||||||
|
and level 1, ligatures and glyph decomposition both involve
|
||||||
|
merging clusters; in level 2, neither of these operations
|
||||||
|
triggers a merge.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Client programs can specify level 2 behavior for a buffer by
|
||||||
|
setting its <literal>cluster_level</literal> to
|
||||||
|
<literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>
|
||||||
|
It is not <emphasis>required</emphasis> that the cluster values
|
||||||
|
in a buffer be monotonically increasing. However, if the initial
|
||||||
|
cluster values in a buffer are monotonic and the buffer is
|
||||||
|
configured to use clustering level 0 or 1, then HarfBuzz
|
||||||
|
guarantees that the final cluster values in the shaped buffer
|
||||||
|
will also be monotonic. No such guarantee is made for cluster
|
||||||
|
level 2.
|
||||||
|
</para>
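<para>
The monotonicity guarantee above can be checked on the client side. The
following is a minimal sketch, not part of HarfBuzz: the helper name
<literal>is_monotonic</literal> is hypothetical, and it assumes the final
cluster values have already been copied out of the shaped buffer into a
plain list. It accepts both non-decreasing and non-increasing sequences,
on the assumption that output for right-to-left text may run in
descending order.
</para>

```python
def is_monotonic(clusters):
    """Return True if the values never decrease or never increase."""
    pairs = list(zip(clusters, clusters[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Cluster values as they might look after shaping at level 0 or 1.
print(is_monotonic([0, 1, 1, 1, 4]))  # True
# Cluster values as they might look after a level 2 reordering.
print(is_monotonic([0, 3, 1, 2, 4]))  # False
```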
|
||||||
|
<para>
|
||||||
|
In levels 0 and 1, HarfBuzz implements the following conceptual model for
|
||||||
|
cluster values:
|
||||||
|
</para>
|
||||||
|
<itemizedlist spacing="compact">
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
The sequence of cluster values will always remain monotonic.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Each cluster value represents a single cluster.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Each cluster contains one or more glyphs and one or more
|
||||||
|
characters.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>
|
||||||
|
In practice, this model offers several benefits. Assuming that
|
||||||
|
the initial cluster values were monotonically increasing
|
||||||
|
and distinct before shaping began, then, in the final output:
|
||||||
|
</para>
|
||||||
|
<itemizedlist spacing="compact">
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
All adjacent glyphs having the same final cluster
|
||||||
|
value belong to the same cluster.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Each character belongs to the cluster that has the highest
|
||||||
|
cluster value <emphasis>not larger than</emphasis> its
|
||||||
|
initial cluster value.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
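<para>
The second rule above (each character belongs to the cluster with the
highest value not larger than its initial value) can be sketched as a
lookup. This is an illustration only: the function name
<literal>cluster_of_character</literal> and the sample values are
hypothetical, and the sketch assumes that the initial cluster values were
distinct and monotonically increasing, as stated above.
</para>

```python
from bisect import bisect_right


def cluster_of_character(initial_value, final_values):
    """Pick the highest final cluster value not larger than initial_value."""
    ordered = sorted(set(final_values))
    position = bisect_right(ordered, initial_value)
    if position == 0:
        raise ValueError("no final cluster value at or below the initial value")
    return ordered[position - 1]


# Five characters with initial cluster values 0..4, and a hypothetical
# set of final cluster values read back from a shaped buffer.
final_values = [0, 1, 1, 1, 1, 4]
mapping = {ch: cluster_of_character(i, final_values) for i, ch in enumerate("ABCDE")}
print(mapping)  # {'A': 0, 'B': 1, 'C': 1, 'D': 1, 'E': 4}
```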
|
||||||
|
|
||||||
|
</section>
|
||||||
|
<section id="a-clustering-example-for-levels-0-and-1">
|
||||||
|
<title>A clustering example for levels 0 and 1</title>
|
||||||
|
<para>
|
||||||
|
The guarantees and benefits of level 0 and level 1 can be seen
|
||||||
|
with some examples. First, let us examine what happens with cluster
|
||||||
|
values when shaping involves cluster merging with ligatures and
|
||||||
|
decomposition.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Let's say we start with the following character sequence (top row) and
|
||||||
|
initial cluster values (bottom row):
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,a,B,b,C,c
|
A,B,C,D,E
|
||||||
0,1,2,3,4,5
|
0,1,2,3,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
if <literal>A,B,C</literal> ligate, then here are the cluster
|
During shaping, HarfBuzz maps these characters to glyphs from
|
||||||
values one would get under the various levels:
|
the font. For simplicity, let's assume that each character maps
|
||||||
</para>
|
to the corresponding, identical-looking glyph:
|
||||||
<para>
|
|
||||||
level 0:
|
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
ABC,a,b,c
|
A,B,C,D,E
|
||||||
0 ,0,0,0
|
0,1,2,3,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
level 1:
|
Now if, for example, <literal>B</literal> and <literal>C</literal>
|
||||||
|
form a ligature, then the clusters to which they belong
|
||||||
|
"merge". This merged cluster takes for its cluster
|
||||||
|
value the minimum of all the cluster values of the clusters that
|
||||||
|
went into the ligature. In this case, we get:
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
ABC,a,b,c
|
A,BC,D,E
|
||||||
0 ,0,0,5
|
0,1 ,3,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
level 2:
|
because 1 is the minimum of the set {1,2}, which were the
|
||||||
|
cluster values of <literal>B</literal> and
|
||||||
|
<literal>C</literal>.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Next, let us say that the <literal>BC</literal> ligature glyph
|
||||||
|
decomposes into three components, and <literal>D</literal> also
|
||||||
|
decomposes into two components. These components each inherit the
|
||||||
|
cluster value of their parent:
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
ABC,a,b,c
|
A,BC0,BC1,BC2,D0,D1,E
|
||||||
0 ,1,3,5
|
0,1 ,1 ,1 ,3 ,3 ,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
Making sense of the last example is the hardest for a client,
|
Next, if <literal>BC2</literal> and <literal>D0</literal> form a
|
||||||
because there is nothing in the cluster values to suggest that
|
ligature, then their clusters (cluster values 1 and 3) merge into
|
||||||
<literal>B</literal> and <literal>C</literal> ligated with
|
<literal>min(1,3) = 1</literal>:
|
||||||
<literal>A</literal>.
|
|
||||||
</para>
|
|
||||||
</sect2>
|
|
||||||
<sect2 id="reordering">
|
|
||||||
<title>Reordering</title>
|
|
||||||
<para>
|
|
||||||
Another tricky case is when things reorder. Under level 2:
|
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,B,C,D,E
|
A,BC0,BC1,BC2D0,D1,E
|
||||||
0,1,2,3,4
|
0,1 ,1 ,1 ,1 ,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
Now imagine <literal>D</literal> moves before
|
At this point, cluster 1 means: the character sequence
|
||||||
<literal>B</literal>:
|
<literal>BCD</literal> is represented by glyphs
|
||||||
|
<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
|
||||||
|
further.
|
||||||
|
</para>
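<para>
The walk-through above can be replayed with a small toy model. This is a
sketch of the conceptual model only, not HarfBuzz's implementation: the
helpers <literal>merge_clusters</literal>, <literal>ligate</literal>, and
<literal>decompose</literal> are hypothetical names, and glyphs are
represented as plain (name, cluster value) pairs.
</para>

```python
def merge_clusters(glyphs, start, end):
    """Give glyphs[start:end], and the whole clusters they touch, the minimum value."""
    values = [cluster for _, cluster in glyphs]
    merged = min(values[start:end])
    # Widen the range so that every glyph sharing a cluster with either
    # end of the range is merged as well.
    while start > 0 and values[start - 1] == values[start]:
        start -= 1
    while end < len(glyphs) and values[end] == values[end - 1]:
        end += 1
    return [(name, merged if start <= i < end else cluster)
            for i, (name, cluster) in enumerate(glyphs)]


def ligate(glyphs, start, end):
    """Replace glyphs[start:end] with one ligature glyph, merging clusters."""
    glyphs = merge_clusters(glyphs, start, end)
    name = "".join(n for n, _ in glyphs[start:end])
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


def decompose(glyphs, index, count):
    """Split one glyph into `count` components that inherit its cluster value."""
    name, cluster = glyphs[index]
    parts = [(f"{name}{k}", cluster) for k in range(count)]
    return glyphs[:index] + parts + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3)      # B and C form the BC ligature
glyphs = decompose(glyphs, 1, 3)   # BC splits into BC0, BC1, BC2
glyphs = decompose(glyphs, 4, 2)   # D splits into D0, D1
glyphs = ligate(glyphs, 3, 5)      # BC2 and D0 form a ligature
print(glyphs)
```

<para>
In this model, the final step also changes the cluster value of
<literal>D1</literal>, because it shared a cluster with
<literal>D0</literal> before the merge.
</para>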
|
||||||
|
</section>
|
||||||
|
<section id="reordering-in-levels-0-and-1">
|
||||||
|
<title>Reordering in levels 0 and 1</title>
|
||||||
|
<para>
|
||||||
|
Another common operation in the more complex shapers is glyph
|
||||||
|
reordering. In order to maintain a monotonic cluster sequence
|
||||||
|
when glyph reordering takes place, HarfBuzz merges the clusters
|
||||||
|
of everything in the reordering sequence.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For example, let us again start with the character sequence (top
|
||||||
|
row) and initial cluster values (bottom row):
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,D,B,C,E
|
A,B,C,D,E
|
||||||
0,3,1,2,4
|
0,1,2,3,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
Now, if <literal>D</literal> ligates with <literal>B</literal>, we
|
If <literal>D</literal> is reordered before <literal>B</literal>,
|
||||||
|
then HarfBuzz merges the <literal>B</literal>,
|
||||||
|
<literal>C</literal>, and <literal>D</literal> clusters, and we
|
||||||
get:
|
get:
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,DB,C,E
|
A,D,B,C,E
|
||||||
0,3 ,2,4
|
0,1,1,1,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
In a different scenario, <literal>A</literal> and
|
This is clearly not ideal, but it is the only sensible way to
|
||||||
<literal>B</literal> could have ligated
|
maintain a monotonic sequence of cluster values and retain the
|
||||||
<emphasis>before</emphasis> <literal>D</literal> reordered; that
|
true relationship between glyphs and characters.
|
||||||
would have resulted in:
|
</para>
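<para>
The reordering rule can be sketched in the same toy style. The function
<literal>reorder</literal> below is a hypothetical helper, not a HarfBuzz
API, and it assumes that the initial cluster values are distinct, so that
no widening to whole clusters is needed.
</para>

```python
def reorder(glyphs, source, target):
    """Move glyphs[source] to index target and merge every cluster in between."""
    low, high = min(source, target), max(source, target)
    # Everything between the old and the new position takes the minimum value.
    merged = min(cluster for _, cluster in glyphs[low:high + 1])
    names = [name for name, _ in glyphs]
    names.insert(target, names.pop(source))
    return [(name, merged if low <= i <= high else glyphs[i][1])
            for i, name in enumerate(names)]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
print(reorder(glyphs, source=3, target=1))
# [('A', 0), ('D', 1), ('B', 1), ('C', 1), ('E', 4)]
```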
|
||||||
|
</section>
|
||||||
|
<section id="the-distinction-between-levels-0-and-1">
|
||||||
|
<title>The distinction between levels 0 and 1</title>
|
||||||
|
<para>
|
||||||
|
The preceding examples demonstrate the main effects of using
|
||||||
|
cluster levels 0 and 1. The only difference between the two
|
||||||
|
levels is this: in level 0, at the very beginning of the shaping
|
||||||
|
process, HarfBuzz also merges clusters between any base character
|
||||||
|
and all Unicode marks (combining or not) that follow it.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For example, let us start with the following character sequence
|
||||||
|
(top row) and accompanying initial cluster values (bottom row):
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
AB,D,C,E
|
A,acute,B
|
||||||
0 ,3,2,4
|
0,1 ,2
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
There's no way to differentiate between these two scenarios based
|
The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
|
||||||
on the cluster numbers alone.
|
using cluster level 0 on this sequence, then the
|
||||||
|
<literal>A</literal> and <literal>acute</literal> clusters will
|
||||||
|
merge, and the result will become:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
A,acute,B
|
||||||
|
0,0 ,2
|
||||||
|
</programlisting>
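<para>
The initial merging step of level 0 can be approximated in a few lines.
This sketch is a simplification under stated assumptions: it treats any
code point whose Unicode General Category starts with
<literal>M</literal> as a mark, whereas the real rule follows grapheme
cluster boundaries and covers more cases. The function name
<literal>merge_marks_into_bases</literal> is hypothetical.
</para>

```python
import unicodedata


def merge_marks_into_bases(text):
    """Assign each mark the cluster value of the code point it follows."""
    clusters = []
    for index, char in enumerate(text):
        is_mark = unicodedata.category(char).startswith("M")
        if is_mark and clusters:
            clusters.append(clusters[-1])   # level 0: join the preceding cluster
        else:
            clusters.append(index)          # start a new cluster
    return clusters


# "A", COMBINING ACUTE ACCENT (U+0301), "B"
print(merge_marks_into_bases("A\u0301B"))  # [0, 0, 2]
```

<para>
Under level 1, this step is skipped, so the same input would keep the
values 0, 1, 2.
</para>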
|
||||||
|
<para>
|
||||||
|
This initial cluster merging is the default behavior of the
|
||||||
|
Windows shaping engine, and the old HarfBuzz codebase copied
|
||||||
|
that behavior to maintain compatibility. Consequently, it has
|
||||||
|
remained the default behavior in the new HarfBuzz codebase.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
Another problem happens with ligatures under level 2 if the
|
But this initial cluster-merging behavior makes it impossible to
|
||||||
direction of the text is forced to opposite of its natural
|
color diacritic marks differently from their base
|
||||||
direction (e.g. left-to-right Arabic). But that's too much of a
|
characters. That is why, in level 1, HarfBuzz does not perform
|
||||||
corner case to worry about.
|
the initial merging step.
|
||||||
</para>
|
</para>
|
||||||
</sect2>
|
<para>
|
||||||
</sect1>
|
For client programs that rely on HarfBuzz cluster values to
|
||||||
|
perform cursor positioning, level 0 is more convenient. But
|
||||||
|
relying on cluster boundaries for cursor positioning is wrong: cursor
|
||||||
|
positions should be determined based on Unicode grapheme
|
||||||
|
boundaries, not on shaping-cluster boundaries. As such, level 1
|
||||||
|
clusters are preferred.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
One last note about levels 0 and 1. HarfBuzz currently does not allow a
|
||||||
|
<literal>MultipleSubst</literal> lookup to replace a glyph with zero
|
||||||
|
glyphs (in other words, to delete a glyph). But, in some other situations,
|
||||||
|
glyphs can be deleted. In those cases, if the glyph being deleted is
|
||||||
|
the last glyph of its cluster, HarfBuzz makes sure to merge the cluster
|
||||||
|
with a neighboring cluster.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
This is done primarily to make sure that the starting cluster of the
|
||||||
|
text always has the cluster index pointing to the start of the text
|
||||||
|
for the run; more than one client currently relies on this
|
||||||
|
guarantee.
|
||||||
|
</para>
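<para>
The effect of this guarantee can be illustrated with a toy model. The
sketch below is not HarfBuzz's code: the helper name
<literal>delete_glyph</literal> is hypothetical, and the choice of which
neighbor absorbs the cluster (the preceding one when it exists, otherwise
the following one) is an assumption made for illustration.
</para>

```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index]; if its cluster would vanish, fold it into a neighbor."""
    cluster = glyphs[index][1]
    rest = glyphs[:index] + glyphs[index + 1:]
    if not rest or any(c == cluster for _, c in rest):
        return rest  # nothing left, or the cluster survives on another glyph
    # Assumed policy: merge backward when possible, otherwise forward.
    neighbor = rest[index - 1][1] if index > 0 else rest[0][1]
    merged = min(neighbor, cluster)
    return [(name, merged if c == neighbor else c) for name, c in rest]


glyphs = [("A", 0), ("B", 1), ("C", 2)]
# Deleting the first glyph keeps cluster value 0 at the start of the run.
print(delete_glyph(glyphs, 0))  # [('B', 0), ('C', 2)]
```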
|
||||||
|
<para>
|
||||||
|
Incidentally, Apple's CoreText does something else to maintain the
|
||||||
|
same promise: it inserts a glyph with id 65535 at the beginning of
|
||||||
|
the glyph string if the glyph corresponding to the first character
|
||||||
|
in the run was deleted. HarfBuzz might do something similar in the
|
||||||
|
future.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section id="level-2">
|
||||||
|
<title>Level 2</title>
|
||||||
|
<para>
|
||||||
|
HarfBuzz's level 2 cluster behavior uses a significantly
|
||||||
|
different model than that of level 0 and level 1.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
The level 2 behavior is easy to describe, but it may be
|
||||||
|
difficult to understand in practical terms. In brief, level 2
|
||||||
|
performs no merging of clusters whatsoever.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
When glyphs form a ligature (or when some other feature
|
||||||
|
substitutes multiple glyphs with one glyph), the cluster value
|
||||||
|
of the first glyph is retained as the cluster value for the
|
||||||
|
ligature. However, no subsequent clusters — including
|
||||||
|
marks and modifiers — are affected.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Level 2 cluster behavior is less complex than level 0 or level
|
||||||
|
1, but there are a few cases in which processing cluster values
|
||||||
|
produced at level 2 may be tricky.
|
||||||
|
</para>
|
||||||
|
<section id="ligatures-with-combining-marks-in-level-2">
|
||||||
|
<title>Ligatures with combining marks in level 2</title>
|
||||||
|
<para>
|
||||||
|
The first example of how HarfBuzz's level 2 cluster behavior
|
||||||
|
can be tricky is when the text to be shaped includes combining
|
||||||
|
marks attached to ligatures.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Let us start with an input sequence with the following
|
||||||
|
characters (top row) and initial cluster values (bottom row):
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
A,acute,B,breve,C,circumflex
|
||||||
|
0,1 ,2,3 ,4,5
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
If the sequence <literal>A,B,C</literal> forms a ligature,
|
||||||
|
then these are the cluster values HarfBuzz will return under
|
||||||
|
the various cluster levels:
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Level 0:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
ABC,acute,breve,circumflex
|
||||||
|
0 ,0 ,0 ,0
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Level 1:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
ABC,acute,breve,circumflex
|
||||||
|
0 ,0 ,0 ,5
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Level 2:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
ABC,acute,breve,circumflex
|
||||||
|
0 ,1 ,3 ,5
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Making sense of the level 2 result is the hardest for a client
|
||||||
|
program, because there is nothing in the cluster values that
|
||||||
|
indicates that <literal>B</literal> and <literal>C</literal>
|
||||||
|
formed a ligature with <literal>A</literal>.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
In contrast, the "merged" cluster values of the mark glyphs
|
||||||
|
that are seen in the level 0 and level 1 output are evidence
|
||||||
|
that a ligature substitution took place.
|
||||||
|
</para>
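<para>
The three results above can be reproduced with a toy model of the
cluster levels. This is a sketch under simplifying assumptions, not
HarfBuzz's implementation: the names <literal>MARKS</literal>,
<literal>initial_clusters</literal>, and <literal>ligate_bases</literal>
are hypothetical, every base in the input is assumed to join a single
ligature, and the marks are assumed to stay in the output after it.
</para>

```python
MARKS = {"acute", "breve", "circumflex"}


def initial_clusters(names, level):
    """Level 0 merges each mark into the preceding cluster; levels 1 and 2 do not."""
    clusters = []
    for index, name in enumerate(names):
        if level == 0 and name in MARKS and clusters:
            clusters.append(clusters[-1])
        else:
            clusters.append(index)
    return clusters


def ligate_bases(names, level):
    """Ligate all base glyphs and return (glyph, cluster value) pairs."""
    clusters = initial_clusters(names, level)
    bases = [i for i, name in enumerate(names) if name not in MARKS]
    first, last = bases[0], bases[-1]
    if level in (0, 1):
        # Merge everything from the first to the last component, widened
        # to the end of the last component's cluster.
        end = last
        while end + 1 < len(names) and clusters[end + 1] == clusters[last]:
            end += 1
        merged = min(clusters[first:end + 1])
        clusters = [merged if first <= i <= end else c
                    for i, c in enumerate(clusters)]
    # At level 2 nothing merges; the ligature keeps the first glyph's value.
    ligature = "".join(names[i] for i in bases)
    output = [(ligature, clusters[first])]
    output += [(name, clusters[i]) for i, name in enumerate(names) if name in MARKS]
    return output


names = ["A", "acute", "B", "breve", "C", "circumflex"]
for level in (0, 1, 2):
    print(level, [cluster for _, cluster in ligate_bases(names, level)])
```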
|
||||||
|
</section>
|
||||||
|
<section id="reordering-in-level-2">
|
||||||
|
<title>Reordering in level 2</title>
|
||||||
|
<para>
|
||||||
|
Another example of how HarfBuzz's level 2 cluster behavior
|
||||||
|
can be tricky is when glyphs reorder. Consider an input sequence
|
||||||
|
with the following characters (top row) and initial cluster
|
||||||
|
values (bottom row):
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
A,B,C,D,E
|
||||||
|
0,1,2,3,4
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Now imagine <literal>D</literal> moves before
|
||||||
|
<literal>B</literal> in a reordering operation. The cluster
|
||||||
|
values will then be:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
A,D,B,C,E
|
||||||
|
0,3,1,2,4
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Next, if <literal>D</literal> forms a ligature with
|
||||||
|
<literal>B</literal>, the output is:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
A,DB,C,E
|
||||||
|
0,3 ,2,4
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
However, in a different scenario, in which the shaping rules
|
||||||
|
of the script instead caused <literal>A</literal> and
|
||||||
|
<literal>B</literal> to form a ligature
|
||||||
|
<emphasis>before</emphasis> the <literal>D</literal> reordered, the
|
||||||
|
result would be:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
AB,D,C,E
|
||||||
|
0 ,3,2,4
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
There is no way for a client program to differentiate between
|
||||||
|
these two scenarios based on the cluster values
|
||||||
|
alone. Consequently, client programs that use level 2 might
|
||||||
|
need to undertake additional work in order to manage cursor
|
||||||
|
positioning, text attributes, or other desired features.
|
||||||
|
</para>
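<para>
The ambiguity described above can be demonstrated with a toy model of
level 2. The helpers <literal>ligate</literal> and
<literal>move</literal> are hypothetical and only mimic the rule that no
clusters merge and that a ligature keeps the cluster value of its first
glyph.
</para>

```python
def ligate(glyphs, index):
    """Join glyphs[index] and glyphs[index + 1]; keep the first glyph's value."""
    (first_name, first_cluster), (second_name, _) = glyphs[index], glyphs[index + 1]
    return glyphs[:index] + [(first_name + second_name, first_cluster)] + glyphs[index + 2:]


def move(glyphs, source, target):
    """Move one glyph to a new position without touching any cluster value."""
    glyphs = list(glyphs)
    glyphs.insert(target, glyphs.pop(source))
    return glyphs


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
scenario_one = ligate(move(start, 3, 1), 1)   # D moves before B, then D and B ligate
scenario_two = move(ligate(start, 0), 2, 1)   # A and B ligate, then D moves

print(scenario_one)  # [('A', 0), ('DB', 3), ('C', 2), ('E', 4)]
print(scenario_two)  # [('AB', 0), ('D', 3), ('C', 2), ('E', 4)]
```

<para>
The glyph sequences differ, but the cluster values are 0, 3, 2, 4 in both
cases, so the values alone cannot tell the two scenarios apart.
</para>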
|
||||||
|
</section>
|
||||||
|
<section id="other-considerations-in-level-2">
|
||||||
|
<title>Other considerations in level 2</title>
|
||||||
|
<para>
|
||||||
|
There may be other problems encountered with ligatures under
|
||||||
|
level 2, such as when the direction of the text is forced to be
|
||||||
|
the opposite of its natural direction (for example, left-to-right
|
||||||
|
Arabic). But, generally speaking, these other scenarios are
|
||||||
|
minor corner cases that are too obscure for most client
|
||||||
|
programs to need to worry about.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
</chapter>
|
</chapter>
|
||||||
|
|