diff --git a/docs/usermanual-buffers-language-script-and-direction.xml b/docs/usermanual-buffers-language-script-and-direction.xml index 68ce9bd0b..1c6b5dab1 100644 --- a/docs/usermanual-buffers-language-script-and-direction.xml +++ b/docs/usermanual-buffers-language-script-and-direction.xml @@ -15,14 +15,15 @@
Creating and destroying buffers - As we saw in our initial example, a buffer is created and + As we saw in our Getting Started example, a + buffer is created and initialized with hb_buffer_create(). This produces a new, empty buffer object, instantiated with some default values and ready to accept your Unicode strings. - HarfBuzz manages the memory of objects that it creates (such as - buffers), so you don't have to. When you have finished working on + HarfBuzz manages the memory of objects (such as buffers) that it + creates, so you don't have to. When you have finished working on a buffer, you can call hb_buffer_destroy(): diff --git a/docs/usermanual-clusters.xml b/docs/usermanual-clusters.xml index f48e89c20..228cc560a 100644 --- a/docs/usermanual-clusters.xml +++ b/docs/usermanual-clusters.xml @@ -6,25 +6,41 @@ ]> Clusters -
- Clusters +
+ Clusters and shaping In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible - unit. + unit. A single letter or symbol can be a cluster of its + own. Other clusters correspond to longer subsequences of the + input code points — such as a ligature or conjunct form + — and require the shaper to ensure that the cluster is not + broken during the shaping process. A cluster is distinct from a grapheme, - which is the smallest unit of a writing system or script, - because clusters are only relevant for script shaping and the - layout of glyphs. + which is the smallest unit of meaning in a writing system or + script. - For example, a grapheme may be a letter, a number, a logogram, - or a symbol. When two letters form a ligature, however, they - combine into a single glyph. They are therefore part of the same - cluster and are treated as a unit — even though the two - original, underlying letters are separate graphemes. + The definitions of the two terms are similar. However, clusters + are only relevant for script shaping and glyph layout. In + contrast, graphemes are a property of the underlying script, and + are of interest when client programs implement orthographic + or linguistic functionality. + + + For example, two individual letters are often two separate + graphemes. When two letters form a ligature, however, they + combine into a single glyph. They are then part of the same + cluster and are treated as a unit by the shaping engine — + even though the two original, underlying letters remain separate + graphemes. + + + HarfBuzz is concerned with clusters, not + with graphemes — although client programs using HarfBuzz + may still care about graphemes for other reasons from time to time. 
During the shaping process, there are several shaping operations @@ -32,14 +48,15 @@ points form a ligature or a conjunct form and are replaced by a single glyph) or split one character into several (for example, when decomposing a code point through the - ccmp feature). + ccmp feature). Operations like these alter + clusters; HarfBuzz tracks the changes to ensure that no clusters + get lost or broken during shaping. - HarfBuzz tracks clusters independently from how these - shaping operations affect the individual glyphs that comprise the - output HarfBuzz returns in a buffer. Consequently, - a client program using HarfBuzz can utilize the cluster - information to implement features such as: + HarfBuzz records cluster information independently from how + shaping operations affect the individual glyphs returned in an + output buffer. Consequently, a client program using HarfBuzz can + utilize the cluster information to implement features such as: @@ -77,11 +94,14 @@ Performing line-breaking, justification, and other line-level or paragraph-level operations that must be done - after shaping is complete, but which require character-level - properties. + after shaping is complete, but which require examining + character-level properties. +
+
+ Working with HarfBuzz clusters When you add text to a HarfBuzz buffer, each code point must be assigned a cluster value. @@ -94,7 +114,65 @@ value does not matter. - Client programs can choose how HarfBuzz handles clusters during + Some of the shaping operations performed by HarfBuzz — + such as reordering, composition, decomposition, and substitution + — may alter the cluster values of some characters. The + final cluster values in the buffer at the end of the shaping + process will indicate to client programs which subsequences of + glyphs represent a cluster and, therefore, must not be + separated. + + + In addition, client programs can query the final cluster values + to discern other potentially important information about the + glyphs in the output buffer (such as whether or not a ligature + was formed). + + + For example, if the initial sequence of cluster values was: + + + 0,1,2,3,4 + + + and the final sequence of cluster values is: + + + 0,0,3,3 + + + then there are two clusters in the output buffer: the first + cluster includes the first two glyphs, and the second cluster + includes the third and fourth glyphs. It is also evident that a + ligature or conjunct has been formed, because there are fewer + glyphs in the output buffer (four) than there were code points + in the input buffer (five). + + + Although client programs using HarfBuzz are free to assign + initial cluster values in any manner they choose to, HarfBuzz + does offer some useful guarantees if the cluster values are + assigned in a monotonic (either non-decreasing or non-increasing) + order. + + + For left-to-right scripts (LTR) and top-to-bottom scripts (TTB), + HarfBuzz will preserve the monotonic property: client programs + are guaranteed that monotonically increasing initial cluster + values will be returned as monotonically increasing final + cluster values. 
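As a rough illustration of how a client program might read the final cluster values discussed above, the following Python sketch groups output glyph indices into runs that share a cluster value and checks the monotonic property. It is a self-contained model, not HarfBuzz API code; the function names `cluster_runs` and `is_monotonic` are hypothetical and exist only for this example.

```python
from itertools import groupby


def cluster_runs(final_clusters):
    """Group output glyph indices into runs sharing one cluster value.

    Hypothetical helper: `final_clusters` is a plain list of the
    cluster values read back from a shaped buffer, in glyph order.
    """
    runs = []
    for value, group in groupby(enumerate(final_clusters), key=lambda pair: pair[1]):
        runs.append((value, [index for index, _ in group]))
    return runs


def is_monotonic(values):
    """Return True if values are non-decreasing or non-increasing."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# The final cluster sequence from the example above.
final = [0, 0, 3, 3]
print(cluster_runs(final))   # [(0, [0, 1]), (3, [2, 3])]
print(is_monotonic(final))   # True
```

Each run is a group of glyphs that must not be separated. Comparing the number of glyphs (four) with the number of input code points (five) is what reveals that a ligature or conjunct was formed.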
+ + + For right-to-left scripts (RTL) and bottom-to-top scripts (BTT), + the directionality of the buffer itself is reversed for final + output as a matter of design. Therefore, HarfBuzz inverts the + monotonic property: client programs are guaranteed that + monotonically increasing initial cluster values will be + returned as monotonically decreasing final + cluster values. + + + Client programs can adjust how HarfBuzz handles clusters during shaping by setting the cluster_level of the buffer. HarfBuzz offers three levels of @@ -179,7 +257,7 @@ assign initial cluster values in a buffer by reusing the indices of the code points in the input text. This gives a sequence of cluster values that is monotonically increasing (for example, - 0,1,2,3,4,5). + 0,1,2,3,4). It is not required that the cluster values @@ -233,16 +311,44 @@ -
+
A clustering example for levels 0 and 1 - The guarantees and benefits of level 0 and level 1 can be seen - with some examples. First, let us examine what happens with cluster - values when shaping involves cluster merging with ligatures and - decomposition. + The basic shaping operations affect clusters in a predictable + manner when using level 0 or level 1: + + + + When two or more clusters merge, the + resulting merged cluster takes as its cluster value the + minimum of the incoming cluster values. + + + + + When a cluster decomposes, all of the + resulting child clusters inherit as their cluster value the + cluster value of the parent cluster. + + + + + When a character is reordered, the + reordered character and all clusters that the character + moves past as part of the reordering are merged into one cluster. + + + + + The functionality, guarantees, and benefits of level 0 and level + 1 behavior can be seen with some examples. First, let us examine + what happens with cluster values when shaping involves cluster + merging with ligatures and decomposition. + + Let's say we start with the following character sequence (top row) and initial cluster values (bottom row): @@ -279,8 +385,8 @@ Next, let us say that the BC ligature glyph decomposes into three components, and D also - decomposes into two components. These components each inherit the - cluster value of their parent: + decomposes into two components. Whenever a cluster decomposes, + its components each inherit the cluster value of their parent: A,BC0,BC1,BC2,D0,D1,E @@ -295,6 +401,12 @@ A,BC0,BC1,BC2D0,D1,E 0,1 ,1 ,1 ,1 ,4 + + Note that the entirety of cluster 3 merges into cluster 1, not + just the D0 glyph. This reflects the fact + that the cluster must be treated as an + indivisible unit. 
+ At this point, cluster 1 means: the character sequence BCD is represented by glyphs @@ -319,18 +431,24 @@ 0,1,2,3,4 - If D is reordered to before B, - then HarfBuzz merges the B, - C, and D clusters, and we - get: + If D is reordered to the position immediately + before B, then HarfBuzz merges the + B, C, and + D clusters — all the clusters between + the final position of the reordered glyph and its original + position. This means that we get: A,D,B,C,E 0,1,1,1,4 - This is clearly not ideal, but it is the only sensible way to - maintain a monotonic sequence of cluster values and retain the + as the final cluster sequence. + + + Merging this many clusters is not ideal, but it is the only + sensible way for HarfBuzz to maintain the guarantee that the + sequence of cluster values remains monotonic and to retain the true relationship between glyphs and characters.
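The three level 0 and level 1 rules (merge takes the minimum, decomposition inherits from the parent, reordering merges everything the glyph moves past) can be modeled in a few lines of Python. This is only a sketch of the rules as described in this section, not HarfBuzz's implementation; the buffer is assumed to be a plain list of `(glyph_name, cluster)` tuples, and all function names are hypothetical.

```python
def merge_clusters(buf, start, end):
    """Merge every cluster touched by buf[start:end] into the minimum value.

    Whole clusters are merged, so glyphs outside the range that share
    a cluster value with a glyph inside it are updated as well.
    """
    touched = {cluster for _, cluster in buf[start:end]}
    lowest = min(touched)
    return [(g, lowest if c in touched else c) for g, c in buf]


def ligate(buf, start, end, name):
    """Replace buf[start:end] with one glyph, merging the clusters involved."""
    merged = merge_clusters(buf, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(buf, index, parts):
    """Split one glyph into parts; each part inherits the parent's cluster."""
    _, cluster = buf[index]
    return buf[:index] + [(p, cluster) for p in parts] + buf[index + 1:]


def reorder(buf, src, dst):
    """Move the glyph at src back to dst, merging all clusters it passes."""
    merged = merge_clusters(buf, dst, src + 1)
    merged.insert(dst, merged.pop(src))
    return merged


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Ligature, two decompositions, then a second ligature.
buf = ligate(start, 1, 3, "BC")
buf = decompose(buf, 1, ["BC0", "BC1", "BC2"])
buf = decompose(buf, 4, ["D0", "D1"])
buf = ligate(buf, 3, 5, "BC2D0")
print([c for _, c in buf])  # [0, 1, 1, 1, 1, 4]

# Reordering D to the position immediately before B.
print(reorder(start, 3, 1))
```

The first walk-through reproduces the `0,1,1,1,1,4` sequence from the ligature example, including the detail that `D1` joins cluster 1 along with `D0`. The second reproduces the `A,D,B,C,E` result with clusters `0,1,1,1,4`.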
@@ -340,8 +458,9 @@ The preceding examples demonstrate the main effects of using cluster levels 0 and 1. The only difference between the two levels is this: in level 0, at the very beginning of the shaping - process, HarfBuzz also merges clusters between any base character - and all Unicode marks (combining or not) that follow it. + process, HarfBuzz merges the cluster of each base character + with the clusters of all Unicode marks (combining or not) and + modifiers that follow it. For example, let us start with the following character sequence @@ -361,6 +480,10 @@ A,acute,B 0,0 ,2 + + This merger is performed before any other script-shaping + steps. + This initial cluster merging is the default behavior of the Windows shaping engine, and the old HarfBuzz codebase copied @@ -368,9 +491,10 @@ remained the default behavior in the new HarfBuzz codebase. - But this initial cluster-merging behavior makes it impossible to + But this initial cluster-merging behavior makes it impossible + for client programs to implement some features (such as to color diacritic marks differently from their base - characters. That is why, in level 1, HarfBuzz does not perform + characters). That is why, in level 1, HarfBuzz does not perform the initial merging step. @@ -378,29 +502,34 @@ perform cursor positioning, level 0 is more convenient. But relying on cluster boundaries for cursor positioning is wrong: cursor positions should be determined based on Unicode grapheme - boundaries, not on shaping-cluster boundaries. As such, level 1 - clusters are preferred. + boundaries, not on shaping-cluster boundaries. As such, using + level 1 clustering behavior is recommended. - One last note about levels 0 and 1. HarfBuzz currently does not allow a - MultipleSubst lookup to replace a glyph with zero - glyphs (in other words, to delete a glyph). But, in some other situations, - glyphs can be deleted. 
In those cases, if the glyph being deleted is - the last glyph of its cluster, HarfBuzz makes sure to merge the cluster - with a neighboring cluster. + One final facet of levels 0 and 1 is worth noting. HarfBuzz + currently does not allow any + multiple-substitution GSUB lookups to + replace a glyph with zero glyphs (in other words, to delete a + glyph). + + + But, in some other situations, glyphs can be deleted. In + those cases, if the glyph being deleted is the last glyph of its + cluster, HarfBuzz makes sure to merge the deleted glyph's + cluster with a neighboring cluster. This is done primarily to make sure that the starting cluster of the text always has the cluster index pointing to the start of the text - for the run; more than one client currently relies on this + for the run; more than one client program currently relies on this guarantee. - Incidentally, Apple's CoreText does something else to maintain the - same promise: it inserts a glyph with id 65535 at the beginning of - the glyph string if the glyph corresponding to the first character - in the run was deleted. HarfBuzz might do something similar in the - future. + Incidentally, Apple's CoreText does something different to + maintain the same promise: it inserts a glyph with id 65535 at + the beginning of the glyph string if the glyph corresponding to + the first character in the run was deleted. HarfBuzz might do + something similar in the future.
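The initial base-and-mark merge that separates level 0 from level 1 can be approximated with Python's standard `unicodedata` module. This is a simplified model under stated assumptions: it treats any code point whose general category starts with `M` as a mark, it does not model the "modifiers" mentioned in this section, and the function name `level0_initial_merge` is hypothetical.

```python
import unicodedata


def level0_initial_merge(text, clusters):
    """Model the level 0 pre-shaping step: marks join the preceding cluster.

    Assumption: a "mark" is any code point with general category
    Mn, Mc, or Me. Level 1 would skip this step entirely.
    """
    merged = list(clusters)
    for i in range(1, len(text)):
        if unicodedata.category(text[i]).startswith("M"):
            merged[i] = merged[i - 1]
    return merged


# A, combining acute accent, B with initial clusters 0,1,2.
print(level0_initial_merge("A\u0301B", [0, 1, 2]))  # [0, 0, 2]
```

Under level 1 the same input keeps its initial `0,1,2` values at this stage, which is what lets a client program style the acute accent separately from its base letter.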
@@ -415,16 +544,39 @@ performs no merging of clusters whatsoever. - When glyphs form a ligature (or when some other feature - substitutes multiple glyphs with one glyph), the cluster value - of the first glyph is retained as the cluster value for the - ligature. However, no subsequent clusters — including - marks and modifiers — are affected. + This means that there is no initial base-and-mark merging step + (as is done in level 0), and it means that reordering moves and + ligature substitutions do not trigger a cluster merge. - Level 2 cluster behavior is less complex than level 0 or level - 1, but there are a few cases in which processing cluster values - produced at level 2 may be tricky. + Only one shaping operation directly affects clusters when using + level 2: + + + + + When a cluster decomposes, all of the + resulting child clusters inherit as their cluster value the + cluster value of the parent cluster. + + + + + When glyphs do form a ligature (or when some other feature + substitutes multiple glyphs with one glyph) the cluster value + of the first glyph is retained as the cluster value for the + resulting ligature. + + + This occurrence sounds similar to a cluster merge, but it is + different. In particular, no subsequent characters — + including marks and modifiers — are affected. They retain + their previous cluster values. + + + Level 2 cluster behavior is ultimately less complex than level 0 + or level 1, but there are several cases for which processing + cluster values produced at level 2 may be tricky.
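For contrast with levels 0 and 1, here is a minimal Python model of the level 2 behavior described above: decomposition still inherits the parent's cluster value, but ligatures and reordering never merge clusters. As before, this is an illustrative sketch rather than HarfBuzz code; the buffer is assumed to be a list of `(glyph_name, cluster)` tuples and the function names are hypothetical.

```python
def decompose_level2(buf, index, parts):
    """Each part inherits the parent's cluster (same as levels 0 and 1)."""
    _, cluster = buf[index]
    return buf[:index] + [(p, cluster) for p in parts] + buf[index + 1:]


def ligate_level2(buf, start, end, name):
    """The ligature keeps the first glyph's cluster; nothing else changes."""
    return buf[:start] + [(name, buf[start][1])] + buf[end:]


def reorder_level2(buf, src, dst):
    """Move a glyph without merging any of the clusters it passes."""
    moved = list(buf)
    moved.insert(dst, moved.pop(src))
    return moved


buf = [("A", 0), ("BC0", 1), ("BC1", 1), ("BC2", 1),
       ("D0", 3), ("D1", 3), ("E", 4)]

# D1 keeps cluster 3 here; under level 0 or 1 it would become 1.
print(ligate_level2(buf, 3, 5, "BC2D0"))

# Reordering leaves every cluster value untouched, so the
# resulting sequence is no longer monotonic.
start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
print(reorder_level2(start, 3, 1))
```

The second result carries the cluster values `0,3,1,2,4`. This loss of monotonic order is one reason cluster values produced at level 2 can need extra care in client code.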
Ligatures with combining marks in level 2 @@ -532,10 +684,11 @@ There may be other problems encountered with ligatures under level 2, such as if the direction of the text is forced to - opposite of its natural direction (for example, left-to-right - Arabic). But, generally speaking, these other scenarios are - minor corner cases that are too obscure for most client - programs to need to worry about. + the opposite of its natural direction (for example, Arabic text + that is forced into left-to-right directionality). But, + generally speaking, these other scenarios are minor corner + cases that are too obscure for most client programs to need to + worry about.
diff --git a/docs/usermanual-getting-started.xml b/docs/usermanual-getting-started.xml index 932bd9471..fda1e3b0a 100644 --- a/docs/usermanual-getting-started.xml +++ b/docs/usermanual-getting-started.xml @@ -76,12 +76,41 @@
Terminology + + + script + + + In text shaping, a script is a + writing system: a set of symbols, rules, and conventions + that is used to represent a language or multiple + languages. + + + In general computing lingo, the word "script" can also + be used to mean an executable program (usually one + written in a human-readable programming language). For + the sake of clarity, HarfBuzz documents will always use + more specific terminology when referring to this + meaning, such as "Python script" or "shell script." In + all other instances, "script" refers to a writing system. + + + For developers using HarfBuzz, it is important to note + the distinction between a script and a language. Most + scripts are used to write a variety of different + languages, and many languages may be written in more + than one script. + + + + shaper In HarfBuzz, a shaper is a - handler for a specific script shaping model. HarfBuzz + handler for a specific script-shaping model. HarfBuzz implements separate shapers for Indic, Arabic, Thai and Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the Universal Shaping Engine (USE), and a default shaper for @@ -95,12 +124,12 @@ In text shaping, a cluster is a - sequence of codepoints that must be handled as an - indivisible unit. Clusters can include codepoint + sequence of codepoints that must be treated as an + indivisible unit. Clusters can include code-point sequences that form a ligature or base-and-mark sequences. Tracking and preserving clusters is important when shaping operations might separate or reorder - codepoints. + code points. HarfBuzz provides three cluster @@ -111,7 +140,59 @@ - + + grapheme + + + In linguistics, a grapheme is one + of the indivisible units that make up a writing system or + script. Often, graphemes are individual symbols (letters, + numbers, punctuation marks, logograms, etc.) but, + depending on the writing system, a particular grapheme + might correspond to a sequence of several Unicode code + points. 
+ + + In practice, HarfBuzz and other text-shaping engines + are not generally concerned with graphemes. However, it + is important for developers using HarfBuzz to recognize + that there is a difference between graphemes and shaping + clusters (see above). The two concepts may overlap + frequently, but there is no guarantee that they will be + identical. + + + + + + syllable + + + In linguistics, a syllable is + a sequence of sounds that makes up a building block of a + particular language. Every language has its own set of + rules describing what constitutes a valid syllable. + + + For text-shaping purposes, the various definitions of + "syllable" are important because script-specific shaping + operations may be applied at the syllable level. For + example, a reordering rule might specify that a vowel + mark be reordered to the beginning of the syllable. + + + Syllables will consist of one or more Unicode code + points. The definition of a syllable for a particular + writing system might correspond to how HarfBuzz + identifies clusters (see above) for the same writing + system. However, it is important for developers using + HarfBuzz to recognize that there is a difference between + syllables and shaping clusters. The two concepts may + overlap frequently, but there is no guarantee that they + will be identical. + + + 
diff --git a/docs/usermanual-install-harfbuzz.xml b/docs/usermanual-install-harfbuzz.xml index a6484fc5a..53aa38d31 100644 --- a/docs/usermanual-install-harfbuzz.xml +++ b/docs/usermanual-install-harfbuzz.xml @@ -126,7 +126,7 @@ If you need to build HarfBuzz from source, first put the - ragel binary on your + ragel binary on your PATH, then follow the appveyor CI cmake build @@ -229,6 +229,7 @@ + --with-libstdc++ diff --git a/docs/usermanual-shaping-concepts.xml b/docs/usermanual-shaping-concepts.xml index bc9f1b830..db4e30983 100644 --- a/docs/usermanual-shaping-concepts.xml +++ b/docs/usermanual-shaping-concepts.xml @@ -182,22 +182,23 @@ Southeast Asian scripts are also assigned Unicode Indic Syllabic Category (UISC) and Unicode Indic Positional Category (UIPC) - property that provides more detailed information needed for + properties that provide more detailed information needed for shaping. The UISC property sub-categorizes Letters and Marks according to common script-shaping behaviors. For example, UISC distinguishes between consonant letters, vowel letters, and vowel marks. The - UIPC property sub-categorizes Mark codepoints by the visual + UIPC property sub-categorizes Mark codepoints by the relative visual position that they occupy (above, below, right, left, or in multiple positions). Some complex scripts require that the text run be split into - syllables, and what constitutes a valid syllable in these - scripts is specified in regular expressions of the Letter and - Mark codepoints that take the UISC and UIPC properties into account. + syllables. What constitutes a valid syllable in these + scripts is specified in regular expressions, formed from the + Letter and Mark codepoints, that take the UISC and UIPC + properties into account.