diff --git a/docs/usermanual-clusters.xml b/docs/usermanual-clusters.xml index 7b2c7adc7..f7db0f596 100644 --- a/docs/usermanual-clusters.xml +++ b/docs/usermanual-clusters.xml @@ -5,306 +5,509 @@ ]> - Clusters - - In shaping text, a cluster is a sequence of - code points that needs to be treated as a single, indivisible unit. - - - When you add text to a HB buffer, each character is associated with - a cluster value. This is an arbitrary number as - far as HB is concerned. - - - Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the - actual number does not matter. Moreover, it is not required for the - cluster values to be monotonically increasing, but pretty much all - of HB's tests are performed on monotonically increasing cluster - numbers. Nevertheless, there is no such assumption in the code - itself. With that in mind, let's examine what happens with cluster - values during shaping under each cluster-level. - - - HarfBuzz provides three levels of clustering - support. Level 0 is the default behavior and reproduces the behavior - of the old HarfBuzz library. Level 1 tweaks this behavior slightly - to produce better results, so level 1 clustering is recommended for - code that is not required to implement backward compatibility with - the old HarfBuzz. - - - Level 2 differs significantly in how it treats cluster values. - Levels 0 and 1 both process ligatures and glyph decomposition by - merging clusters; level 2 does not. 
- - - The conceptual model for what the cluster values mean, in levels 0 - and 1, is this: - - - - - the sequence of cluster values will always remain monotone - - - - - each value represents a single cluster - - - - - each cluster contains one or more glyphs and one or more - characters - - - - - Assuming that initial cluster numbers were monotonically increasing - and distinct, then all adjacent glyphs having the same cluster - number belong to the same cluster, and all characters belong to the - cluster that has the highest number not larger than their initial - cluster number. This will become clearer with an example. - - - - A clustering example for levels 0 and 1 - - Let's say we start with the following character sequence and cluster - values: - - - A,B,C,D,E - 0,1,2,3,4 - - - We then map the characters to glyphs. For simplicity, let's assume - that each character maps to the corresponding, identical-looking - glyph: - - - A,B,C,D,E - 0,1,2,3,4 - - - Now if, for example, B and C - ligate, then the clusters to which they belong "merge". - This merged cluster takes for its cluster number the minimum of all - the cluster numbers of the clusters that went in. In this case, we - get: - - - A,BC,D,E - 0,1 ,3,4 - - - Now let's assume that the BC glyph decomposes - into three components, and D also decomposes into - two. The components each inherit the cluster value of their parent: - - - A,BC0,BC1,BC2,D0,D1,E - 0,1 ,1 ,1 ,3 ,3 ,4 - - - Now if BC2 and D0 ligate, then - their clusters (numbers 1 and 3) merge into - min(1,3) = 1: - - - A,BC0,BC1,BC2D0,D1,E - 0,1 ,1 ,1 ,1 ,4 - - - At this point, cluster 1 means: the character sequence - BCD is represented by glyphs - BC0,BC1,BC2D0,D1 and cannot be broken down any - further. - - - - Reordering in levels 0 and 1 - - Another common operation in the more complex shapers is when things - reorder. In those cases, to maintain monotone clusters, HB merges - the clusters of everything in the reordering sequence. 
For example, - let's again start with the character sequence: - - - A,B,C,D,E - 0,1,2,3,4 - - - If D is reordered before B, - then the B, C, and - D clusters merge, and we get: - - - A,D,B,C,E - 0,1,1,1,4 - - - This is clearly not ideal, but it is the only sensible way to - maintain monotone indices and retain the true relationship between - glyphs and characters. - - - - The distinction between levels 0 and 1 - - So, the above is pretty much what cluster levels 0 and 1 do. The - only difference between the two is this: in level 0, at the very - beginning of the shaping process, we also merge clusters between - base characters and all Unicode marks (combining or not) following - them. E.g.: - - - A,acute,B - 0,1 ,2 - - - will become: - - - A,acute,B - 0,0 ,2 - - - This is the default behavior. We do it because Windows did it and - old HarfBuzz did it, so this remained the default. But this behavior - makes it impossible to color diacritic marks differently from their - base characters. That's why in level 1 we do not perform this - initial merging step. - - - For clients, level 0 is more convenient if they rely on HarfBuzz - clusters for cursor positioning. But that's wrong anyway: cursor - positions should be determined based on Unicode grapheme boundaries, - NOT shaping clusters. As such, level 1 clusters are preferred. - - - One last note about levels 0 and 1. We currently don't allow a - MultipleSubst lookup to replace a glyph with zero - glyphs (i.e., to delete a glyph). But in some other situations, - glyphs can be deleted. In those cases, if the glyph being deleted is - the last glyph of its cluster, we make sure to merge the cluster - with a neighboring cluster. - - - This is, primarily, to make sure that the starting cluster of the - text always has the cluster index pointing to the start of the text - for the run; more than one client currently relies on this - guarantee. 
- - - Incidentally, Apple's CoreText does something else to maintain the - same promise: it inserts a glyph with id 65535 at the beginning of - the glyph string if the glyph corresponding to the first character - in the run was deleted. HarfBuzz might do something similar in the - future. - - - - Level 2 - - Level 2 is a different beast from levels 0 and 1. It is simple to - describe, but hard to make sense of. It simply doesn't do any - cluster merging whatsoever. When things ligate or otherwise multiple - glyphs turn into one, the cluster value of the first glyph is - retained. - - - Here are a few examples of why processing cluster values produced at - this level might be tricky: - - - Ligatures with combining marks +
+ Clusters - Imagine capital letters are bases and lower case letters are - combining marks. With an input sequence like this: + In text shaping, a cluster is a sequence of + characters that needs to be treated as a single, indivisible + unit. + + + During the shaping process, some shaping operations may + merge adjacent characters (for example, when two code points form + a ligature and are replaced by a single glyph) or split one + character into several (for example, when performing the Unicode + canonical decomposition of a code point). + + + HarfBuzz tracks clusters independently from how these + shaping operations alter the individual glyphs that comprise the + output HarfBuzz returns in a buffer. Consequently, + a client program using HarfBuzz can utilize the cluster + information to implement features such as: + + + + + Correctly positioning the cursor between two characters that + have combined into a single glyph by forming a ligature. + + + + + Correctly highlighting a text selection that includes some, + but not all, of the characters comprising a ligature. + + + + + Applying text attributes (such as color or underlining) to + part, but not all, of a composed base-and-mark combination. + + + + + Generating output document formats (such as PDF) with + embedded text that can be fully extracted. + + + + + Performing line-breaking, justification, and other + line-level or paragraph-level operations that must be done + after shaping is complete, but which require character-level + properties. + + + + + When you add text to a HarfBuzz buffer, each code point is assigned + a cluster value. + + + This cluster value is an arbitrary number; HarfBuzz uses it only + to distinguish between clusters. Many client programs will use + the index of each code point in the input text stream as the + cluster value, as a matter of convenience; the actual value does + not matter. 
+ + + Client programs can choose how HarfBuzz handles clusters during + shaping by setting the + cluster_level of the + buffer. HarfBuzz offers three levels of + clustering support for this property: + + + + Level 0 is the default and + reproduces the behavior of the old HarfBuzz library. + + + The distinguishing feature of level 0 behavior is that, at + the beginning of processing the buffer, all code points that + are categorized as marks, + modifier symbols, or + Emoji extended pictographic modifiers, + as well as the Zero Width Joiner and + Zero Width Non-Joiner code points, are + assigned the cluster value of the closest preceding code + point from a different category. + + + In essence, whenever a base character is followed by a mark + character or a sequence of mark characters, those marks are + reassigned to the same initial cluster value as the base + character. This reassignment is referred to as + "merging" the affected clusters. This behavior is based on + the Grapheme Cluster Boundary specification in Unicode + Technical Report 29. + + + Client programs can specify level 0 behavior for a buffer by + setting its cluster_level to + HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES. + + + + + Level 1 tweaks the old behavior + slightly to produce better results. Therefore, level 1 + clustering is recommended for code that is not required to + implement backward compatibility with the old HarfBuzz. + + + Level 1 differs from level 0 by not merging the + clusters of marks and other modifier code points with the + preceding "base" code point's cluster. By preserving the + cluster values of these marks and modifier code points, + script shaping can perform additional operations that might + lead to improved results (for example, reordering a sequence + of marks). + + + Client programs can specify level 1 behavior for a buffer by + setting its cluster_level to + HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS. 
+ + + + + Level 2 differs significantly in how it + treats cluster values. In level 2, HarfBuzz never merges + clusters. + + + This difference can be seen most clearly when HarfBuzz processes + ligature substitutions and glyph decompositions. In level 0 + and level 1, ligatures and glyph decomposition both involve + merging clusters; in level 2, neither of these operations + triggers a merge. + + + Client programs can specify level 2 behavior for a buffer by + setting its cluster_level to + HB_BUFFER_CLUSTER_LEVEL_CHARACTERS. + + + + + It is not required that the cluster values + in a buffer be monotonically increasing. However, if the initial + cluster values in a buffer are monotonic and the buffer is + configured to use clustering level 0 or 1, then HarfBuzz + guarantees that the final cluster values in the shaped buffer + will also be monotonic. No such guarantee is made for cluster + level 2. + + + In levels 0 and 1, HarfBuzz implements the following conceptual model for + cluster values: + + + + + The sequence of cluster values will always remain monotonic. + + + + + Each cluster value represents a single cluster. + + + + + Each cluster contains one or more glyphs and one or more + characters. + + + + + In practice, this model offers several benefits. Assuming that + the initial cluster values were monotonically increasing + and distinct before shaping began, then, in the final output: + + + + + All adjacent glyphs having the same final cluster + value belong to the same cluster. + + + + + Each character belongs to the cluster that has the highest + cluster value not larger than its + initial cluster value. + + + + +
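The character-to-cluster rule for levels 0 and 1 can be sketched in a few lines of C. This is only an illustration of the conceptual model, not HarfBuzz code: the helper name cluster_for_character and its parameters are hypothetical, and it assumes the initial cluster values were monotonic and distinct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the HarfBuzz API): given the final
 * cluster values of the shaped glyphs and the initial cluster value of
 * one character, return the cluster that character belongs to, that is,
 * the highest final cluster value not larger than the initial value. */
static unsigned int
cluster_for_character (const unsigned int *glyph_clusters,
                       size_t              n_glyphs,
                       unsigned int        initial_cluster)
{
  unsigned int best = 0;
  int found = 0;

  assert (glyph_clusters != NULL && n_glyphs > 0);

  /* Scan every glyph so the result does not depend on text direction. */
  for (size_t i = 0; i < n_glyphs; i++)
  {
    unsigned int c = glyph_clusters[i];
    if (c <= initial_cluster && (!found || c > best))
    {
      best = c;
      found = 1;
    }
  }

  /* Assumption: under levels 0 and 1 some cluster always qualifies. */
  assert (found);
  return best;
}
```

For instance, if the final glyph cluster values are 0,1,1,1,1,4, a character whose initial cluster value was 3 maps to cluster 1, because 1 is the highest final value that does not exceed 3.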
+
+ A clustering example for levels 0 and 1 + + The guarantees and benefits of level 0 and level 1 can be seen + with some examples. First, let us examine what happens with cluster + values when shaping involves cluster merging with ligatures and + decomposition. + + + Let's say we start with the following character sequence (top row) and + initial cluster values (bottom row): - A,a,B,b,C,c - 0,1,2,3,4,5 - + A,B,C,D,E + 0,1,2,3,4 + - if A,B,C ligate, then here are the cluster - values one would get under the various levels: - - - level 0: + During shaping, HarfBuzz maps these characters to glyphs from + the font. For simplicity, let's assume that each character maps + to the corresponding, identical-looking glyph: - ABC,a,b,c - 0 ,0,0,0 - + A,B,C,D,E + 0,1,2,3,4 + - level 1: + Now if, for example, B and C + form a ligature, then the clusters to which they belong + "merge". This merged cluster takes for its cluster + value the minimum of all the cluster values of the clusters that + went into the ligature. In this case, we get: - ABC,a,b,c - 0 ,0,0,5 - + A,BC,D,E + 0,1 ,3,4 + - level 2: + because 1 is the minimum of the set {1,2}, which were the + cluster values of B and + C. + + + Next, let us say that the BC ligature glyph + decomposes into three components, and D also + decomposes into two components. These components each inherit the + cluster value of their parent: - ABC,a,b,c - 0 ,1,3,5 - + A,BC0,BC1,BC2,D0,D1,E + 0,1 ,1 ,1 ,3 ,3 ,4 + - Making sense of the last example is the hardest for a client, - because there is nothing in the cluster values to suggest that - B and C ligated with - A. - - - - Reordering - - Another tricky case is when things reorder. 
Under level 2: + Next, if BC2 and D0 form a + ligature, then their clusters (cluster values 1 and 3) merge into + min(1,3) = 1: - A,B,C,D,E - 0,1,2,3,4 - + A,BC0,BC1,BC2D0,D1,E + 0,1 ,1 ,1 ,1 ,4 + - Now imagine D moves before - B: + At this point, cluster 1 means: the character sequence + BCD is represented by glyphs + BC0,BC1,BC2D0,D1 and cannot be broken down any + further. + +
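The merge-by-minimum rule walked through above can be modeled with a small C function. This is a sketch of the conceptual model only, under the assumption that cluster values are monotonic; ligate_and_merge is a hypothetical name and does not reflect how HarfBuzz implements merging internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model (not HarfBuzz code): replace the glyphs in
 * [start, end) with one ligature glyph and merge every cluster that
 * took part. The merged cluster takes the minimum cluster value, and
 * every glyph belonging to a merged cluster receives that value.
 * Returns the new number of glyphs. */
static size_t
ligate_and_merge (unsigned int *clusters, size_t len,
                  size_t start, size_t end)
{
  assert (clusters != NULL && start < end && end <= len);

  /* Find the lowest and highest cluster values entering the ligature. */
  unsigned int lo = clusters[start], hi = clusters[start];
  for (size_t i = start + 1; i < end; i++)
  {
    if (clusters[i] < lo) lo = clusters[i];
    if (clusters[i] > hi) hi = clusters[i];
  }

  /* Every glyph of an affected cluster joins the merged cluster. */
  for (size_t i = 0; i < len; i++)
    if (clusters[i] >= lo && clusters[i] <= hi)
      clusters[i] = lo;

  /* Collapse the ligated glyphs into a single array slot. */
  size_t removed = end - start - 1;
  for (size_t i = end; i < len; i++)
    clusters[i - removed] = clusters[i];

  return len - removed;
}
```

Applied to the cluster values 0,1,1,1,3,3,4 with the fourth and fifth glyphs (BC2 and D0) forming a ligature, the function leaves six glyphs with cluster values 0,1,1,1,1,4, matching the example.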
+
+ Reordering in levels 0 and 1 + + Another common operation in the more complex shapers is glyph + reordering. In order to maintain a monotonic cluster sequence + when glyph reordering takes place, HarfBuzz merges the clusters + of everything in the reordering sequence. + + + For example, let us again start with the character sequence (top + row) and initial cluster values (bottom row): - A,D,B,C,E - 0,3,1,2,4 - + A,B,C,D,E + 0,1,2,3,4 + - Now, if D ligates with B, we + If D is reordered before B, + then HarfBuzz merges the B, + C, and D clusters, and we get: - A,DB,C,E - 0,3 ,2,4 - + A,D,B,C,E + 0,1,1,1,4 + - In a different scenario, A and - B could have ligated - before D reordered; that - would have resulted in: + This is clearly not ideal, but it is the only sensible way to + maintain a monotonic sequence of cluster values and retain the + true relationship between glyphs and characters. + +
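The reordering case can be modeled the same way. The following C sketch is illustrative only: reorder_before_and_merge is a hypothetical name, and the code models the documented outcome rather than HarfBuzz's internal logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model (not HarfBuzz code): move the glyph at index
 * `from` so that it sits at index `to` (with to < from), then merge
 * the clusters of everything in the reordered span so that the
 * sequence of cluster values stays monotonic. */
static void
reorder_before_and_merge (unsigned int *clusters, size_t len,
                          size_t from, size_t to)
{
  assert (clusters != NULL && to < from && from < len);

  /* Shift the intervening glyphs right and drop the moved glyph in. */
  unsigned int moved = clusters[from];
  for (size_t i = from; i > to; i--)
    clusters[i] = clusters[i - 1];
  clusters[to] = moved;

  /* The merged cluster takes the minimum value found in the span. */
  unsigned int lo = clusters[to];
  for (size_t i = to + 1; i <= from; i++)
    if (clusters[i] < lo)
      lo = clusters[i];
  for (size_t i = to; i <= from; i++)
    clusters[i] = lo;
}
```

With the cluster values 0,1,2,3,4, moving the glyph at index 3 (D) to index 1 (before B) yields 0,1,1,1,4, as in the example.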
+
+ The distinction between levels 0 and 1 + + The preceding examples demonstrate the main effects of using + cluster levels 0 and 1. The only difference between the two + levels is this: in level 0, at the very beginning of the shaping + process, HarfBuzz also merges clusters between any base character + and all Unicode marks (combining or not) that follow it. + + + For example, let us start with the following character sequence + (top row) and accompanying initial cluster values (bottom row): - AB,D,C,E - 0 ,3,2,4 - + A,acute,B + 0,1 ,2 + - There's no way to differentiate between these two scenarios based - on the cluster numbers alone. + The acute is a Unicode mark. If HarfBuzz is + using cluster level 0 on this sequence, then the + A and acute clusters will + merge, and the result will become: + + + A,acute,B + 0,0 ,2 + + + This initial cluster merging is the default behavior of the + Windows shaping engine, and the old HarfBuzz codebase copied + that behavior to maintain compatibility. Consequently, it has + remained the default behavior in the new HarfBuzz codebase. - Another problem happens with ligatures under level 2 if the - direction of the text is forced to opposite of its natural - direction (e.g. left-to-right Arabic). But that's too much of a - corner case to worry about. + But this initial cluster-merging behavior makes it impossible to + color diacritic marks differently from their base + characters. That is why, in level 1, HarfBuzz does not perform + the initial merging step. - - + + For client programs that rely on HarfBuzz cluster values to + perform cursor positioning, level 0 is more convenient. But + relying on cluster boundaries for cursor positioning is wrong: cursor + positions should be determined based on Unicode grapheme + boundaries, not on shaping-cluster boundaries. As such, level 1 + clusters are preferred. + + + One last note about levels 0 and 1. 
HarfBuzz currently does not allow a + MultipleSubst lookup to replace a glyph with zero + glyphs (in other words, to delete a glyph). But, in some other situations, + glyphs can be deleted. In those cases, if the glyph being deleted is + the last glyph of its cluster, HarfBuzz makes sure to merge the cluster + with a neighboring cluster. + + + This is done primarily to make sure that the starting cluster of the + text always has the cluster index pointing to the start of the text + for the run; more than one client currently relies on this + guarantee. + + + Incidentally, Apple's CoreText does something else to maintain the + same promise: it inserts a glyph with id 65535 at the beginning of + the glyph string if the glyph corresponding to the first character + in the run was deleted. HarfBuzz might do something similar in the + future. + +
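The initial merging step that distinguishes level 0 from level 1 can be sketched as follows. This is a simplified model, not HarfBuzz code: the function name merge_marks_level0 and the is_mark flag array are hypothetical stand-ins for the Unicode category checks that a real shaper performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (not HarfBuzz code) of the level 0 pre-pass:
 * every code point flagged as a mark takes the cluster value of the
 * code point before it, so a base and its trailing marks end up in
 * the same cluster. Level 1 would simply skip this step. */
static void
merge_marks_level0 (unsigned int *clusters, const bool *is_mark,
                    size_t len)
{
  assert (clusters != NULL && is_mark != NULL);

  /* A mark at index 0 has no preceding code point and is left alone. */
  for (size_t i = 1; i < len; i++)
    if (is_mark[i])
      clusters[i] = clusters[i - 1];
}
```

For the sequence A,acute,B with initial cluster values 0,1,2 and only the acute flagged as a mark, the result is 0,0,2.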
+
+ Level 2 + + HarfBuzz's level 2 cluster behavior uses a significantly + different model than that of level 0 and level 1. + + + The level 2 behavior is easy to describe, but it may be + difficult to understand in practical terms. In brief, level 2 + performs no merging of clusters whatsoever. + + + When glyphs form a ligature (or when some other feature + substitutes multiple glyphs with one glyph), the cluster value + of the first glyph is retained as the cluster value for the + ligature. However, no subsequent clusters — including + marks and modifiers — are affected. + + + Level 2 cluster behavior is less complex than level 0 or level + 1, but there are a few cases in which processing cluster values + produced at level 2 may be tricky. + +
+ Ligatures with combining marks in level 2 + + The first example of how HarfBuzz's level 2 cluster behavior + can be tricky is when the text to be shaped includes combining + marks attached to ligatures. + + + Let us start with an input sequence with the following + characters (top row) and initial cluster values (bottom row): + + + A,acute,B,breve,C,circumflex + 0,1 ,2,3 ,4,5 + + + If the sequence A,B,C forms a ligature, + then these are the cluster values HarfBuzz will return under + the various cluster levels: + + + Level 0: + + + ABC,acute,breve,circumflex + 0 ,0 ,0 ,0 + + + Level 1: + + + ABC,acute,breve,circumflex + 0 ,0 ,0 ,5 + + + Level 2: + + + ABC,acute,breve,circumflex + 0 ,1 ,3 ,5 + + + Making sense of the level 2 result is the hardest for a client + program, because there is nothing in the cluster values that + indicates that B and C + formed a ligature with A. + + + In contrast, the "merged" cluster values of the mark glyphs + that are seen in the level 0 and level 1 output are evidence + that a ligature substitution took place. + +
+
+ Reordering in level 2 + + Another example of how HarfBuzz's level 2 cluster behavior + can be tricky is when glyphs reorder. Consider an input sequence + with the following characters (top row) and initial cluster + values (bottom row): + + + A,B,C,D,E + 0,1,2,3,4 + + + Now imagine D moves before + B in a reordering operation. The cluster + values will then be: + + + A,D,B,C,E + 0,3,1,2,4 + + + Next, if D forms a ligature with + B, the output is: + + + A,DB,C,E + 0,3 ,2,4 + + + However, in a different scenario, in which the shaping rules + of the script instead caused A and + B to form a ligature + before the D reordered, the + result would be: + + + AB,D,C,E + 0 ,3,2,4 + + + There is no way for a client program to differentiate between + these two scenarios based on the cluster values + alone. Consequently, client programs that use level 2 might + need to undertake additional work in order to manage cursor + positioning, text attributes, or other desired features. + +
+
+ Other considerations in level 2 + + There may be other problems encountered with ligatures under + level 2, such as when the direction of the text is forced to the + opposite of its natural direction (for example, left-to-right + Arabic). But, generally speaking, these other scenarios are + minor corner cases that are too obscure for most client + programs to need to worry about. + + 
+