Clusters

In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible unit. A cluster is distinct from a grapheme, which is the smallest unit of a writing system or script, because clusters are only relevant for script shaping and the layout of glyphs.

For example, a grapheme may be a letter, a number, a logogram, or a symbol. When two letters form a ligature, however, they combine into a single glyph. They are therefore part of the same cluster and are treated as a unit, even though the two original, underlying letters are separate graphemes.

During the shaping process, there are several shaping operations that may merge adjacent characters (for example, when two code points form a ligature or a conjunct form and are replaced by a single glyph) or split one character into several (for example, when decomposing a code point through the ccmp feature). HarfBuzz tracks clusters independently from how these shaping operations affect the individual glyphs that comprise the output HarfBuzz returns in a buffer.

Consequently, a client program using HarfBuzz can utilize the cluster information to implement features such as:

- Correctly positioning the cursor within a shaped text run, even when characters have formed ligatures, composed or decomposed, reordered, or undergone other shaping operations.
- Correctly highlighting a text selection that includes some, but not all, of the characters in a word.
- Applying text attributes (such as color or underlining) to part, but not all, of a word.
- Generating output document formats (such as PDF) with embedded text that can be fully extracted.
- Determining the mapping between input characters and output glyphs, such as which glyphs are ligatures.
- Performing line-breaking, justification, and other line-level or paragraph-level operations that must be done after shaping is complete, but which require character-level properties.

When you add text to a HarfBuzz buffer, each code point must be assigned a cluster value. This cluster value is an arbitrary number; HarfBuzz uses it only to distinguish between clusters. Many client programs will use the index of each code point in the input text stream as the cluster value. This is for the sake of convenience; the actual value does not matter.

Client programs can choose how HarfBuzz handles clusters during shaping by setting the cluster_level of the buffer. HarfBuzz offers three levels of clustering support for this property:

- Level 0 is the default and reproduces the behavior of the old HarfBuzz library. The distinguishing feature of level 0 behavior is that, at the beginning of processing the buffer, all code points that are categorized as marks, modifier symbols, or Emoji extended pictographic modifiers, as well as the Zero Width Joiner and Zero Width Non-Joiner code points, are assigned the cluster value of the closest preceding code point from a different category. In essence, whenever a base character is followed by a mark character or a sequence of mark characters, those marks are reassigned to the same initial cluster value as the base character. This reassignment is referred to as "merging" the affected clusters. This behavior is based on the Grapheme Cluster Boundary specification in Unicode Technical Report 29. Client programs can specify level 0 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES.

- Level 1 tweaks the old behavior slightly to produce better results. Therefore, level 1 clustering is recommended for code that is not required to implement backward compatibility with the old HarfBuzz. Level 1 differs from level 0 by not merging the clusters of marks and other modifier code points with the preceding "base" code point's cluster. By preserving the separate cluster values of these marks and modifier code points, script shapers can perform additional operations that might lead to improved results (for example, reordering a sequence of marks). Client programs can specify level 1 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS.

- Level 2 differs significantly in how it treats cluster values. In level 2, HarfBuzz never merges clusters. This difference can be seen most clearly when HarfBuzz processes ligature substitutions and glyph decompositions. In level 0 and level 1, ligatures and glyph decomposition both involve merging clusters; in level 2, neither of these operations triggers a merge. Client programs can specify level 2 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_CHARACTERS.

As mentioned earlier, client programs using HarfBuzz often assign initial cluster values in a buffer by reusing the indices of the code points in the input text. This gives a sequence of cluster values that is monotonically increasing (for example, 0,1,2,3,4,5).

It is not required that the cluster values in a buffer be monotonically increasing. However, if the initial cluster values in a buffer are monotonic and the buffer is configured to use cluster level 0 or 1, then HarfBuzz guarantees that the final cluster values in the shaped buffer will also be monotonic. No such guarantee is made for cluster level 2.

In levels 0 and 1, HarfBuzz implements the following conceptual model for cluster values:

- If the sequence of input cluster values is monotonic, the sequence of cluster values will remain monotonic.
- Each cluster value represents a single cluster.
- Each cluster contains one or more glyphs and one or more characters.

In practice, this model offers several benefits. Assuming that the initial cluster values were monotonically increasing and distinct before shaping began, then, in the final output:

- All adjacent glyphs having the same final cluster value belong to the same cluster.
- Each character belongs to the cluster that has the highest cluster value not larger than its initial cluster value.
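
The two properties above can be turned into a small lookup. The following is a minimal Python sketch, not HarfBuzz code: it assumes the client has already copied the final cluster values of the output glyphs into a plain list, and that the initial cluster values were the distinct, increasing character indices. The function names glyph_runs and cluster_of_character are hypothetical.

```python
from bisect import bisect_right
from itertools import groupby


def glyph_runs(glyph_clusters):
    """Group adjacent glyphs that share a final cluster value.

    Returns (cluster_value, first_glyph_index, end_glyph_index) tuples,
    where end_glyph_index is exclusive.
    """
    runs, index = [], 0
    for value, group in groupby(glyph_clusters):
        count = len(list(group))
        runs.append((value, index, index + count))
        index += count
    return runs


def cluster_of_character(initial_value, glyph_clusters):
    """Return the highest final cluster value not larger than initial_value."""
    final_values = sorted(set(glyph_clusters))
    position = bisect_right(final_values, initial_value)
    if position == 0:
        raise ValueError("no final cluster value at or below this character")
    return final_values[position - 1]


# Hypothetical final cluster values of six output glyphs, produced from
# five input characters whose initial cluster values were 0..4.
shaped = [0, 1, 1, 1, 1, 4]
print(glyph_runs(shaped))
# [(0, 0, 1), (1, 1, 5), (4, 5, 6)]
print([cluster_of_character(i, shaped) for i in range(5)])
# [0, 1, 1, 1, 4]
```

Sorting the set of final values keeps the lookup independent of the order in which the glyphs appear in the buffer, so the same helper works when the cluster values run in descending order.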

A clustering example for levels 0 and 1

The guarantees and benefits of level 0 and level 1 can be seen with some examples. First, let us examine what happens with cluster values when shaping involves cluster merging with ligatures and decomposition.

Let's say we start with the following character sequence (top row) and initial cluster values (bottom row):

    A,B,C,D,E
    0,1,2,3,4

During shaping, HarfBuzz maps these characters to glyphs from the font. For simplicity, let us assume that each character maps to the corresponding, identical-looking glyph:

    A,B,C,D,E
    0,1,2,3,4

Now if, for example, B and C form a ligature, then the clusters to which they belong "merge". This merged cluster takes for its cluster value the minimum of all the cluster values of the clusters that went into the ligature. In this case, we get:

    A,BC,D,E
    0,1 ,3,4

because 1 is the minimum of the set {1,2}, which were the cluster values of B and C.

Next, let us say that the BC ligature glyph decomposes into three components, and D also decomposes into two components. These components each inherit the cluster value of their parent:

    A,BC0,BC1,BC2,D0,D1,E
    0,1  ,1  ,1  ,3 ,3 ,4

Next, if BC2 and D0 form a ligature, then their clusters (cluster values 1 and 3) merge into min(1,3) = 1:

    A,BC0,BC1,BC2D0,D1,E
    0,1  ,1  ,1    ,1 ,4

At this point, cluster 1 means: the character sequence BCD is represented by glyphs BC0,BC1,BC2D0,D1 and cannot be broken down any further.
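
The walk-through above can be replayed with a small model. This is a Python sketch of the merging rule as described in this section, not HarfBuzz's implementation; glyphs are modeled as (name, cluster) pairs, and the helper names merge_clusters, ligate, and decompose are hypothetical.

```python
def merge_clusters(glyphs, start, end):
    """Give glyphs[start:end] one cluster value, the minimum in the range.

    The range is first widened so that it covers whole clusters, which is
    why D1 below is pulled into cluster 1 together with D0.
    """
    while start > 0 and glyphs[start - 1][1] == glyphs[start][1]:
        start -= 1
    while end < len(glyphs) and glyphs[end][1] == glyphs[end - 1][1]:
        end += 1
    value = min(cluster for _, cluster in glyphs[start:end])
    merged = [(name, value) for name, _ in glyphs[start:end]]
    return glyphs[:start] + merged + glyphs[end:]


def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one ligature glyph, merging clusters."""
    glyphs = merge_clusters(glyphs, start, end)
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


def decompose(glyphs, index, parts):
    """Split one glyph into parts; each part inherits the parent's cluster."""
    cluster = glyphs[index][1]
    return glyphs[:index] + [(p, cluster) for p in parts] + glyphs[index + 1:]


glyphs = list(zip("ABCDE", range(5)))
glyphs = ligate(glyphs, 1, 3, "BC")                    # B + C -> BC
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])   # BC -> three parts
glyphs = decompose(glyphs, 4, ["D0", "D1"])            # D -> two parts
glyphs = ligate(glyphs, 3, 5, "BC2D0")                 # BC2 + D0 -> BC2D0
print(glyphs)
# [('A', 0), ('BC0', 1), ('BC1', 1), ('BC2D0', 1), ('D1', 1), ('E', 4)]
```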

Reordering in levels 0 and 1

Another common operation in the more complex shapers is glyph reordering. In order to maintain a monotonic cluster sequence when glyph reordering takes place, HarfBuzz merges the clusters of everything in the reordering sequence.

For example, let us again start with the character sequence (top row) and initial cluster values (bottom row):

    A,B,C,D,E
    0,1,2,3,4

If D is reordered to before B, then HarfBuzz merges the B, C, and D clusters, and we get:

    A,D,B,C,E
    0,1,1,1,4

This is clearly not ideal, but it is the only sensible way to maintain a monotonic sequence of cluster values and retain the true relationship between glyphs and characters.
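
As a rough illustration of the rule above, the following Python sketch merges every cluster between the old and new positions of a glyph before moving it. It models only the behavior described here and assumes each glyph in the range starts out in its own cluster; reorder_with_merge is a hypothetical name, not a HarfBuzz function.

```python
def reorder_with_merge(glyphs, src, dst):
    """Move glyphs[src] to index dst, merging the clusters it passes over.

    glyphs is a list of (name, cluster) pairs. Every glyph between the two
    positions (inclusive) receives the minimum cluster value of that range.
    """
    lo, hi = min(src, dst), max(src, dst)
    merged = min(cluster for _, cluster in glyphs[lo:hi + 1])
    out = [
        (name, merged) if lo <= i <= hi else (name, cluster)
        for i, (name, cluster) in enumerate(glyphs)
    ]
    out.insert(dst, out.pop(src))
    return out


glyphs = list(zip("ABCDE", range(5)))
print(reorder_with_merge(glyphs, 3, 1))   # move D to before B
# [('A', 0), ('D', 1), ('B', 1), ('C', 1), ('E', 4)]
```

The resulting cluster values 0,1,1,1,4 are still monotonic, at the cost of no longer being able to tell B, C, and D apart.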

The distinction between levels 0 and 1

The preceding examples demonstrate the main effects of using cluster levels 0 and 1. The only difference between the two levels is this: in level 0, at the very beginning of the shaping process, HarfBuzz also merges clusters between any base character and all Unicode marks (combining or not) that follow it.

For example, let us start with the following character sequence (top row) and accompanying initial cluster values (bottom row):

    A,acute,B
    0,1    ,2

The acute is a Unicode mark. If HarfBuzz is using cluster level 0 on this sequence, then the A and acute clusters will merge, and the result will become:

    A,acute,B
    0,0    ,2

This initial cluster merging is the default behavior of the Windows shaping engine, and the old HarfBuzz codebase copied that behavior to maintain compatibility. Consequently, it has remained the default behavior in the new HarfBuzz codebase.

But this initial cluster-merging behavior makes it impossible to color diacritic marks differently from their base characters. That is why, in level 1, HarfBuzz does not perform the initial merging step.

For client programs that rely on HarfBuzz cluster values to perform cursor positioning, level 0 is more convenient. But relying on cluster boundaries for cursor positioning is wrong: cursor positions should be determined based on Unicode grapheme boundaries, not on shaping-cluster boundaries. As such, level 1 clusters are preferred.

One last note about levels 0 and 1. HarfBuzz currently does not allow a MultipleSubst lookup to replace a glyph with zero glyphs (in other words, to delete a glyph). But, in some other situations, glyphs can be deleted. In those cases, if the glyph being deleted is the last glyph of its cluster, HarfBuzz makes sure to merge the cluster with a neighboring cluster.

This is done primarily to make sure that the starting cluster of the text always has the cluster index pointing to the start of the text for the run; more than one client currently relies on this guarantee.

Incidentally, Apple's CoreText does something else to maintain the same promise: it inserts a glyph with id 65535 at the beginning of the glyph string if the glyph corresponding to the first character in the run was deleted. HarfBuzz might do something similar in the future.
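
The initial merging step that separates level 0 from level 1 can be approximated in a few lines. This Python sketch is a simplification and an assumption, not HarfBuzz's algorithm: it treats only code points in the Unicode general categories Mn, Mc, and Me, plus ZWJ and ZWNJ, as merge candidates, and it leaves out the modifier symbols and Emoji modifiers that level 0 also covers. The name initial_clusters is hypothetical.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"


def initial_clusters(text, level):
    """Return the cluster values in effect when shaping begins.

    Cluster values start out as code point indices. At level 0, a mark (or
    ZWJ/ZWNJ) takes the cluster value of the code point before it; at any
    other level the values are left untouched.
    """
    clusters = list(range(len(text)))
    if level == 0:
        for i in range(1, len(text)):
            ch = text[i]
            if unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ):
                clusters[i] = clusters[i - 1]
    return clusters


text = "A\u0301B"                    # A, combining acute accent, B
print(initial_clusters(text, 0))     # [0, 0, 2]
print(initial_clusters(text, 1))     # [0, 1, 2]
```

With the level 1 result, a client can still address the acute accent on its own, for example to color it differently from the A.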

Level 2

HarfBuzz's level 2 cluster behavior uses a significantly different model from that of level 0 and level 1.

The level 2 behavior is easy to describe, but it may be difficult to understand in practical terms. In brief, level 2 performs no merging of clusters whatsoever.

When glyphs form a ligature (or when some other feature substitutes multiple glyphs with one glyph), the cluster value of the first glyph is retained as the cluster value for the ligature. However, no subsequent clusters, including those of marks and modifiers, are affected.

Level 2 cluster behavior is less complex than level 0 or level 1, but there are a few cases in which processing cluster values produced at level 2 may be tricky.

Ligatures with combining marks in level 2

The first example of how HarfBuzz's level 2 cluster behavior can be tricky is when the text to be shaped includes combining marks attached to ligatures.

Let us start with an input sequence with the following characters (top row) and initial cluster values (bottom row):

    A,acute,B,breve,C,circumflex
    0,1    ,2,3    ,4,5

If the sequence A,B,C forms a ligature, then these are the cluster values HarfBuzz will return under the various cluster levels:

Level 0:

    ABC,acute,breve,circumflex
    0  ,0    ,0    ,0

Level 1:

    ABC,acute,breve,circumflex
    0  ,0    ,0    ,5

Level 2:

    ABC,acute,breve,circumflex
    0  ,1    ,3    ,5

Making sense of the level 2 result is the hardest for a client program, because there is nothing in the cluster values that indicates that B and C formed a ligature with A. In contrast, the "merged" cluster values of the mark glyphs that are seen in the level 0 and level 1 output are evidence that a ligature substitution took place.
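
The three results above follow from the rules given earlier in this chapter, and can be reproduced with a small model. This Python sketch is an illustration under assumptions, not HarfBuzz's code: glyphs are (name, cluster, is_mark) triples, the ligature is placed at the position of its first component, and the function name ligate_with_marks is hypothetical.

```python
def ligate_with_marks(glyphs, components, level):
    """Form a ligature from the glyphs at the given indices.

    Returns (name, cluster) pairs for the output glyphs.
    """
    clusters = [cluster for _, cluster, _ in glyphs]
    if level == 0:
        # Level 0 only: marks first join the cluster of the preceding glyph.
        for i in range(1, len(glyphs)):
            if glyphs[i][2]:
                clusters[i] = clusters[i - 1]
    first, last = components[0], components[-1]
    if level in (0, 1):
        # Merge everything from the first to the last component, widened
        # to whole clusters, into the minimum cluster value.
        start, end = first, last + 1
        while start > 0 and clusters[start - 1] == clusters[start]:
            start -= 1
        while end < len(clusters) and clusters[end] == clusters[end - 1]:
            end += 1
        clusters[start:end] = [min(clusters[start:end])] * (end - start)
    # At level 2 nothing merges; the ligature keeps its first glyph's value.
    name = "".join(glyphs[i][0] for i in components)
    out = []
    for i, (glyph_name, _, _) in enumerate(glyphs):
        if i == first:
            out.append((name, clusters[i]))
        elif i not in components:
            out.append((glyph_name, clusters[i]))
    return out


names = ["A", "acute", "B", "breve", "C", "circumflex"]
glyphs = [(n, i, i % 2 == 1) for i, n in enumerate(names)]
for level in (0, 1, 2):
    print(level, [c for _, c in ligate_with_marks(glyphs, [0, 2, 4], level)])
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 5]
# 2 [0, 1, 3, 5]
```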

Reordering in level 2

Another example of how HarfBuzz's level 2 cluster behavior can be tricky is when glyphs reorder. Consider an input sequence with the following characters (top row) and initial cluster values (bottom row):

    A,B,C,D,E
    0,1,2,3,4

Now imagine D moves before B in a reordering operation. The cluster values will then be:

    A,D,B,C,E
    0,3,1,2,4

Next, if D forms a ligature with B, the output is:

    A,DB,C,E
    0,3 ,2,4

However, in a different scenario, in which the shaping rules of the script instead caused A and B to form a ligature before the D reordered, the result would be:

    AB,D,C,E
    0 ,3,2,4

There is no way for a client program to differentiate between these two scenarios based on the cluster values alone. Consequently, client programs that use level 2 might need to undertake additional work in order to manage cursor positioning, text attributes, or other desired features.
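
The ambiguity can be checked directly by replaying both scenarios with level 2 rules: nothing merges, and a ligature keeps the cluster value of its first glyph. This is a Python sketch of the two scenarios described above, not HarfBuzz code; move and ligate are hypothetical helper names.

```python
def move(glyphs, src, dst):
    """Move glyphs[src] to index dst without touching any cluster value."""
    out = list(glyphs)
    out.insert(dst, out.pop(src))
    return out


def ligate(glyphs, start, end):
    """Replace glyphs[start:end] with one glyph that keeps the cluster
    value of the first glyph in the range."""
    name = "".join(n for n, _ in glyphs[start:end])
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


start = list(zip("ABCDE", range(5)))

# Scenario 1: D moves before B, then D and B form a ligature.
first = ligate(move(start, 3, 1), 1, 3)

# Scenario 2: A and B form a ligature, then D moves before C.
second = move(ligate(start, 0, 2), 2, 1)

print(first)    # [('A', 0), ('DB', 3), ('C', 2), ('E', 4)]
print(second)   # [('AB', 0), ('D', 3), ('C', 2), ('E', 4)]
print([c for _, c in first] == [c for _, c in second])   # True
```

Both runs end with the cluster values 0,3,2,4, so a client that sees only those values cannot tell which glyph absorbed B.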

Other considerations in level 2

There may be other problems encountered with ligatures under level 2, such as when the direction of the text is forced to the opposite of its natural direction (for example, left-to-right Arabic). But, generally speaking, these other scenarios are minor corner cases that are too obscure for most client programs to need to worry about.