Clusters

In shaping text, a cluster is a sequence of code points that needs to be treated as a single, indivisible unit.

When you add text to a HB buffer, each character is associated with a cluster value. This is an arbitrary number as far as HB is concerned. Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the actual number does not matter. Moreover, it is not required for the cluster values to be monotonically increasing, but pretty much all of HB's tests are performed on monotonically increasing cluster numbers. Nevertheless, there is no such assumption in the code itself. With that in mind, let's examine what happens with cluster values during shaping under each cluster level.

HarfBuzz provides three levels of clustering support. Level 0 is the default behavior and reproduces the behavior of the old HarfBuzz library. Level 1 tweaks this behavior slightly to produce better results, so level 1 clustering is recommended for code that is not required to implement backward compatibility with the old HarfBuzz. Level 2 differs significantly in how it treats cluster values. Levels 0 and 1 both process ligatures and glyph decomposition by merging clusters; level 2 does not.

The conceptual model for what the cluster values mean, in levels 0 and 1, is this:

  - the sequence of cluster values will always remain monotone
  - each value represents a single cluster
  - each cluster contains one or more glyphs and one or more characters

Assuming that initial cluster numbers were monotonically increasing and distinct, then all adjacent glyphs having the same cluster number belong to the same cluster, and all characters belong to the cluster that has the highest number not larger than their initial cluster number. This will become clearer with an example.

A clustering example for levels 0 and 1

Let's say we start with the following character sequence and cluster values:

  A,B,C,D,E
  0,1,2,3,4

We then map the characters to glyphs.
For simplicity, let's assume that each character maps to the corresponding, identical-looking glyph:

  A,B,C,D,E
  0,1,2,3,4

Now if, for example, B and C ligate, then the clusters to which they belong "merge". This merged cluster takes for its cluster number the minimum of all the cluster numbers of the clusters that went in. In this case, we get:

  A,BC,D,E
  0,1 ,3,4

Now let's assume that the BC glyph decomposes into three components, and D also decomposes into two. The components each inherit the cluster value of their parent:

  A,BC0,BC1,BC2,D0,D1,E
  0,1  ,1  ,1  ,3 ,3 ,4

Now if BC2 and D0 ligate, then their clusters (numbers 1 and 3) merge into min(1,3) = 1:

  A,BC0,BC1,BC2D0,D1,E
  0,1  ,1  ,1    ,1 ,4

At this point, cluster 1 means: the character sequence BCD is represented by glyphs BC0,BC1,BC2D0,D1 and cannot be broken down any further.

Reordering in levels 0 and 1

Another common operation in the more complex shapers is when things reorder. In those cases, to maintain monotone clusters, HB merges the clusters of everything in the reordering sequence. For example, let's again start with the character sequence:

  A,B,C,D,E
  0,1,2,3,4

If D is reordered before B, then the B, C, and D clusters merge, and we get:

  A,D,B,C,E
  0,1,1,1,4

This is clearly not ideal, but it is the only sensible way to maintain monotone indices and retain the true relationship between glyphs and characters.

The distinction between levels 0 and 1

So, the above is pretty much what cluster levels 0 and 1 do. The only difference between the two is this: in level 0, at the very beginning of the shaping process, we also merge clusters between base characters and all Unicode marks (combining or not) following them. E.g.:

  A,acute,B
  0,1    ,2

will become:

  A,acute,B
  0,0    ,2

This is the default behavior. We do it because Windows did it and old HarfBuzz did it, so this remained the default. But this behavior makes it impossible to color diacritic marks differently from their base characters.
That's why in level 1 we do not perform this initial merging step.

For clients, level 0 is more convenient if they rely on HarfBuzz clusters for cursor positioning. But that's wrong anyway: cursor positions should be determined based on Unicode grapheme boundaries, NOT shaping clusters. As such, level 1 clusters are preferred.

One last note about levels 0 and 1. We currently don't allow a MultipleSubst lookup to replace a glyph with zero glyphs (i.e., to delete a glyph). But in some other situations, glyphs can be deleted. In those cases, if the glyph being deleted is the last glyph of its cluster, we make sure to merge the cluster with a neighboring cluster.

This is, primarily, to make sure that the starting cluster of the text always has the cluster index pointing to the start of the text for the run; more than one client currently relies on this guarantee.

Incidentally, Apple's CoreText does something else to maintain the same promise: it inserts a glyph with id 65535 at the beginning of the glyph string if the glyph corresponding to the first character in the run was deleted. HarfBuzz might do something similar in the future.

Level 2

Level 2 is a different beast from levels 0 and 1. It is simple to describe, but hard to make sense of. It simply doesn't do any cluster merging whatsoever. When things ligate or otherwise multiple glyphs turn into one, the cluster value of the first glyph is retained.

Here are a few examples of why processing cluster values produced at this level might be tricky:

Ligatures with combining marks

Imagine capital letters are bases and lower case letters are combining marks.
With an input sequence like this:

  A,a,B,b,C,c
  0,1,2,3,4,5

if A,B,C ligate, then here are the cluster values one would get under the various levels:

level 0:

  ABC,a,b,c
  0  ,0,0,0

level 1:

  ABC,a,b,c
  0  ,0,0,5

level 2:

  ABC,a,b,c
  0  ,1,3,5

Making sense of the last example is the hardest for a client, because there is nothing in the cluster values to suggest that B and C ligated with A.

Reordering

Another tricky case is when things reorder. Under level 2, start with:

  A,B,C,D,E
  0,1,2,3,4

Now imagine D moves before B:

  A,D,B,C,E
  0,3,1,2,4

Now, if D ligates with B, we get:

  A,DB,C,E
  0,3 ,2,4

In a different scenario, A and B could have ligated before D reordered; that would have resulted in:

  AB,D,C,E
  0 ,3,2,4

There's no way to differentiate between these two scenarios based on the cluster numbers alone.

Another problem happens with ligatures under level 2 if the direction of the text is forced to the opposite of its natural direction (e.g. left-to-right Arabic). But that's too much of a corner case to worry about.
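
The cluster rules described in this chapter can be tried out with a small simulation. The following is a minimal Python sketch, not HarfBuzz code: the names merge_clusters, ligate, decompose, and move_before are hypothetical helpers invented for illustration, and glyphs are modeled as plain (name, cluster) pairs. With merge=True the helpers follow the merging model of levels 0 and 1; with merge=False they follow the level 2 rule, where a ligature keeps the cluster value of its first glyph.

```python
from typing import List, Tuple

Glyph = Tuple[str, int]  # (glyph name, cluster value); illustrative model only


def merge_clusters(glyphs: List[Glyph], start: int, end: int) -> List[Glyph]:
    """Merge every cluster touched by glyphs[start:end] (levels 0 and 1)."""
    lowest = min(cluster for _, cluster in glyphs[start:end])
    # Widen the range so whole clusters are merged, not just parts of them.
    while start > 0 and glyphs[start - 1][1] == glyphs[start][1]:
        start -= 1
    while end < len(glyphs) and glyphs[end][1] == glyphs[end - 1][1]:
        end += 1
    merged = [(name, lowest) for name, _ in glyphs[start:end]]
    return glyphs[:start] + merged + glyphs[end:]


def ligate(glyphs: List[Glyph], start: int, end: int, name: str,
           merge: bool = True) -> List[Glyph]:
    """Replace glyphs[start:end] with a single glyph called `name`."""
    if merge:
        glyphs = merge_clusters(glyphs, start, end)
    # Without merging (level 2), the first glyph's cluster value is retained.
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


def decompose(glyphs: List[Glyph], index: int, names: List[str]) -> List[Glyph]:
    """Replace one glyph with components that inherit its cluster value."""
    cluster = glyphs[index][1]
    return glyphs[:index] + [(n, cluster) for n in names] + glyphs[index + 1:]


def move_before(glyphs: List[Glyph], src: int, dst: int,
                merge: bool = True) -> List[Glyph]:
    """Move glyphs[src] to position dst, where dst < src."""
    if merge:
        glyphs = merge_clusters(glyphs, dst, src + 1)
    glyphs = list(glyphs)
    glyphs.insert(dst, glyphs.pop(src))
    return glyphs


text = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Level 2, scenario 1: D moves before B, then D ligates with B.
one = ligate(move_before(text, 3, 1, merge=False), 1, 3, "DB", merge=False)
# Level 2, scenario 2: A ligates with B, then D moves before C.
two = move_before(ligate(text, 0, 2, "AB", merge=False), 2, 1, merge=False)

print([c for _, c in one])  # [0, 3, 2, 4]
print([c for _, c in two])  # [0, 3, 2, 4]
```

Both level 2 scenarios end with the same cluster values, which is the ambiguity described in the reordering example. Calling the same helpers with the default merge=True reproduces the level 0 and 1 examples, such as the A,D,B,C,E result with clusters 0,1,1,1,4.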