diff --git a/docs/usermanual-buffers-language-script-and-direction.xml b/docs/usermanual-buffers-language-script-and-direction.xml
index 68ce9bd0b..1c6b5dab1 100644
--- a/docs/usermanual-buffers-language-script-and-direction.xml
+++ b/docs/usermanual-buffers-language-script-and-direction.xml
@@ -15,14 +15,15 @@
Creating and destroying buffers
- As we saw in our initial example, a buffer is created and
+ As we saw in our Getting Started example, a
+ buffer is created and
initialized with hb_buffer_create(). This
produces a new, empty buffer object, instantiated with some
default values and ready to accept your Unicode strings.
- HarfBuzz manages the memory of objects that it creates (such as
- buffers), so you don't have to. When you have finished working on
+ HarfBuzz manages the memory of objects (such as buffers) that it
+ creates, so you don't have to. When you have finished working on
a buffer, you can call hb_buffer_destroy():
diff --git a/docs/usermanual-clusters.xml b/docs/usermanual-clusters.xml
index f48e89c20..228cc560a 100644
--- a/docs/usermanual-clusters.xml
+++ b/docs/usermanual-clusters.xml
@@ -6,25 +6,41 @@
]>
Clusters
-
- Clusters
+
+ Clusters and shaping
In text shaping, a cluster is a sequence of
characters that needs to be treated as a single, indivisible
- unit.
+ unit. A single letter or symbol can be a cluster of its
+ own. Other clusters correspond to longer subsequences of the
+ input code points — such as a ligature or conjunct form
+ — and require the shaper to ensure that the cluster is not
+ broken during the shaping process.
A cluster is distinct from a grapheme,
- which is the smallest unit of a writing system or script,
- because clusters are only relevant for script shaping and the
- layout of glyphs.
+ which is the smallest functional unit of a writing system or
+ script.
- For example, a grapheme may be a letter, a number, a logogram,
- or a symbol. When two letters form a ligature, however, they
- combine into a single glyph. They are therefore part of the same
- cluster and are treated as a unit — even though the two
- original, underlying letters are separate graphemes.
+ The definitions of the two terms are similar. However, clusters
+ are only relevant for script shaping and glyph layout. In
+ contrast, graphemes are a property of the underlying script, and
+ are of interest when client programs implement orthographic
+ or linguistic functionality.
+
+
+ For example, two individual letters are often two separate
+ graphemes. When two letters form a ligature, however, they
+ combine into a single glyph. They are then part of the same
+ cluster and are treated as a unit by the shaping engine —
+ even though the two original, underlying letters remain separate
+ graphemes.
+
+
+ HarfBuzz is concerned with clusters, not
+ with graphemes — although client programs using HarfBuzz
+ may still care about graphemes for other reasons from time to time.
During the shaping process, there are several shaping operations
@@ -32,14 +48,15 @@
points form a ligature or a conjunct form and are replaced by a
single glyph) or split one character into several (for example,
when decomposing a code point through the
- ccmp feature).
+ ccmp feature). Operations like these alter
+ clusters; HarfBuzz tracks the changes to ensure that no clusters
+ get lost or broken during shaping.
- HarfBuzz tracks clusters independently from how these
- shaping operations affect the individual glyphs that comprise the
- output HarfBuzz returns in a buffer. Consequently,
- a client program using HarfBuzz can utilize the cluster
- information to implement features such as:
+ HarfBuzz records cluster information independently from how
+ shaping operations affect the individual glyphs returned in an
+ output buffer. Consequently, a client program using HarfBuzz can
+ utilize the cluster information to implement features such as:
@@ -77,11 +94,14 @@
Performing line-breaking, justification, and other
line-level or paragraph-level operations that must be done
- after shaping is complete, but which require character-level
- properties.
+ after shaping is complete, but which require examining
+ character-level properties.
+
+
+ Working with HarfBuzz clusters
When you add text to a HarfBuzz buffer, each code point must be
assigned a cluster value.
@@ -94,7 +114,65 @@
value does not matter.
- Client programs can choose how HarfBuzz handles clusters during
+ Some of the shaping operations performed by HarfBuzz —
+ such as reordering, composition, decomposition, and substitution
+ — may alter the cluster values of some characters. The
+ final cluster values in the buffer at the end of the shaping
+ process will indicate to client programs which subsequences of
+ glyphs represent a cluster and, therefore, must not be
+ separated.
+
+
+ In addition, client programs can query the final cluster values
+ to discern other potentially important information about the
+ glyphs in the output buffer (such as whether or not a ligature
+ was formed).
+
+
+ For example, if the initial sequence of cluster values was:
+
+
+ 0,1,2,3,4
+
+
+ and the final sequence of cluster values is:
+
+
+ 0,0,3,3
+
+
+ then there are two clusters in the output buffer: the first
+ cluster includes the first two glyphs, and the second cluster
+ includes the third and fourth glyphs. It is also evident that a
+ ligature or conjunct has been formed, because there are fewer
+ glyphs in the output buffer (four) than there were code points
+ in the input buffer (five).
+
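The reading of the final cluster values described above can be sketched in C. This is only an illustration: `count_clusters` is a hypothetical helper, not part of the HarfBuzz API, and it works on a plain array of cluster values. In a real program those values would come from the `cluster` field of each `hb_glyph_info_t` returned by `hb_buffer_get_glyph_infos()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the clusters in a sequence of final
 * cluster values. Adjacent glyphs that share a cluster value belong
 * to the same cluster, so each run of equal values counts once. */
size_t count_clusters(const uint32_t *clusters, size_t n)
{
    assert(clusters != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* A new cluster starts wherever the value changes. */
        if (i == 0 || clusters[i] != clusters[i - 1])
            count++;
    }
    return count;
}
```

For the final sequence 0,0,3,3 this helper reports two clusters. Comparing the glyph count (four) with the number of input code points (five) then indicates that some code points were combined.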
+
+ Although client programs using HarfBuzz are free to assign
+ initial cluster values in any manner they choose to, HarfBuzz
+ does offer some useful guarantees if the cluster values are
+ assigned in a monotonic (either non-decreasing or non-increasing)
+ order.
+
+
+ For left-to-right scripts (LTR) and top-to-bottom scripts (TTB),
+ HarfBuzz will preserve the monotonic property: client programs
+ are guaranteed that monotonically increasing initial cluster
+ values will be returned as monotonically increasing final
+ cluster values.
+
+
+ For right-to-left scripts (RTL) and bottom-to-top scripts (BTT),
+ the directionality of the buffer itself is reversed for final
+ output as a matter of design. Therefore, HarfBuzz inverts the
+ monotonic property: client programs are guaranteed that
+ monotonically increasing initial cluster values will be
+ returned as monotonically decreasing final
+ cluster values.
+
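A client program that relies on this guarantee could verify it with a check along the following lines. This is a minimal sketch: `is_monotonic` is a hypothetical helper that operates on a plain array of cluster values, and it is not a HarfBuzz function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the cluster values never
 * decrease (increasing == true, as expected for LTR and TTB output)
 * or never increase (increasing == false, as expected for RTL and
 * BTT output). */
bool is_monotonic(const uint32_t *clusters, size_t n, bool increasing)
{
    assert(clusters != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        if (increasing ? clusters[i] < clusters[i - 1]
                       : clusters[i] > clusters[i - 1])
            return false;
    }
    return true;
}
```

Equal neighboring values are accepted in both directions, because glyphs in the same cluster share one cluster value.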
+
+ Client programs can adjust how HarfBuzz handles clusters during
shaping by setting the
cluster_level of the
buffer. HarfBuzz offers three levels of
@@ -179,7 +257,7 @@
assign initial cluster values in a buffer by reusing the indices
of the code points in the input text. This gives a sequence of
cluster values that is monotonically increasing (for example,
- 0,1,2,3,4,5).
+ 0,1,2,3,4).
It is not required that the cluster values
@@ -233,16 +311,44 @@
-
+
A clustering example for levels 0 and 1
- The guarantees and benefits of level 0 and level 1 can be seen
- with some examples. First, let us examine what happens with cluster
- values when shaping involves cluster merging with ligatures and
- decomposition.
+ The basic shaping operations affect clusters in a predictable
+ manner when using level 0 or level 1:
+
+
+
+ When two or more clusters merge, the
+ resulting merged cluster takes as its cluster value the
+ minimum of the incoming cluster values.
+
+
+
+
+ When a cluster decomposes, all of the
+ resulting child clusters inherit as their cluster value the
+ cluster value of the parent cluster.
+
+
+
+
+ When a character is reordered, the
+ reordered character and all clusters that the character
+ moves past as part of the reordering are merged into one cluster.
+
+
+
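The first two rules can be modeled on a plain array of cluster values. The sketch below is a simplified illustration and not HarfBuzz's implementation: `ligate` and `decompose` are hypothetical names, and `ligate` assumes that each glyph being merged is a whole cluster on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replace glyphs [start, end) with one merged glyph whose cluster
 * value is the minimum of the incoming values. Returns the new
 * glyph count. */
size_t ligate(uint32_t *clusters, size_t n, size_t start, size_t end)
{
    assert(start < end && end <= n);

    uint32_t min = clusters[start];
    for (size_t i = start + 1; i < end; i++)
        if (clusters[i] < min)
            min = clusters[i];

    clusters[start] = min;
    /* Close the gap left by the glyphs that were absorbed. */
    memmove(clusters + start + 1, clusters + end,
            (n - end) * sizeof *clusters);
    return n - (end - start) + 1;
}

/* Replace the glyph at index with `parts` glyphs that all inherit
 * the parent's cluster value. The caller must provide room for
 * n + parts - 1 values. Returns the new glyph count. */
size_t decompose(uint32_t *clusters, size_t n, size_t index, size_t parts)
{
    assert(index < n && parts >= 1);

    /* Shift the tail right to make room for the new components. */
    memmove(clusters + index + parts, clusters + index + 1,
            (n - index - 1) * sizeof *clusters);
    for (size_t i = 1; i < parts; i++)
        clusters[index + i] = clusters[index];
    return n + parts - 1;
}
```

Starting from 0,1,2,3,4, ligating the second and third glyphs gives 0,1,3,4. Decomposing the ligature into three parts and the following glyph into two parts then gives 0,1,1,1,3,3,4.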
+
+ The functionality, guarantees, and benefits of level 0 and level
+ 1 behavior can be seen with some examples. First, let us examine
+ what happens with cluster values when shaping involves cluster
+ merging with ligatures and decomposition.
+
+
Let's say we start with the following character sequence (top row) and
initial cluster values (bottom row):
@@ -279,8 +385,8 @@
Next, let us say that the BC ligature glyph
decomposes into three components, and D also
- decomposes into two components. These components each inherit the
- cluster value of their parent:
+ decomposes into two components. Whenever a cluster decomposes,
+ its components each inherit the cluster value of their parent:
A,BC0,BC1,BC2,D0,D1,E
@@ -295,6 +401,12 @@
A,BC0,BC1,BC2D0,D1,E
0,1 ,1 ,1 ,1 ,4
+
+ Note that the entirety of cluster 3 merges into cluster 1, not
+ just the D0 glyph. This reflects the fact
+ that the cluster must be treated as an
+ indivisible unit.
+
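One way to picture why the whole of cluster 3 is absorbed is to widen the merged range to complete clusters before assigning the minimum value. The following is a simplified model written for this explanation, with the hypothetical name `merge_clusters`. It should not be read as HarfBuzz's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge glyphs [start, end) into one cluster. The range is first
 * widened so that no existing cluster is split, and then every
 * glyph in the widened range receives the minimum cluster value. */
void merge_clusters(uint32_t *clusters, size_t n, size_t start, size_t end)
{
    assert(start < end && end <= n);

    /* Widen the range to whole clusters on both sides. */
    while (start > 0 && clusters[start - 1] == clusters[start])
        start--;
    while (end < n && clusters[end] == clusters[end - 1])
        end++;

    uint32_t min = clusters[start];
    for (size_t i = start; i < end; i++)
        if (clusters[i] < min)
            min = clusters[i];

    for (size_t i = start; i < end; i++)
        clusters[i] = min;
}
```

Applied to 0,1,1,1,3,3,4 with the range covering only the fourth and fifth glyphs, the sixth glyph is pulled in as well, giving 0,1,1,1,1,1,4. Replacing the two merged glyphs with the single ligature glyph then leaves 0,1,1,1,1,4.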
At this point, cluster 1 means: the character sequence
BCD is represented by glyphs
@@ -319,18 +431,24 @@
0,1,2,3,4
- If D is reordered to before B,
- then HarfBuzz merges the B,
- C, and D clusters, and we
- get:
+ If D is reordered to the position immediately
+ before B, then HarfBuzz merges the
+ B, C, and
+ D clusters — all the clusters between
+ the final position of the reordered glyph and its original
+ position. This means that we get:
A,D,B,C,E
0,1,1,1,4
- This is clearly not ideal, but it is the only sensible way to
- maintain a monotonic sequence of cluster values and retain the
+ as the final cluster sequence.
+
+
+ Merging this many clusters is not ideal, but it is the only
+ sensible way for HarfBuzz to maintain the guarantee that the
+ sequence of cluster values remains monotonic and to retain the
true relationship between glyphs and characters.
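The reordering rule can be sketched in the same style. This is an illustrative model only: `reorder_before` is a hypothetical helper that assumes one glyph per cluster and handles only a move toward the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move the glyph at index `from` so that it sits at index `to`
 * (to <= from), then merge it with every cluster it moved past by
 * assigning the minimum cluster value to the whole span. */
void reorder_before(uint32_t *clusters, size_t n, size_t from, size_t to)
{
    assert(to <= from && from < n);

    /* Perform the move itself. */
    uint32_t moved = clusters[from];
    memmove(clusters + to + 1, clusters + to,
            (from - to) * sizeof *clusters);
    clusters[to] = moved;

    /* Merge everything between the new and the old position. */
    uint32_t min = clusters[to];
    for (size_t i = to; i <= from; i++)
        if (clusters[i] < min)
            min = clusters[i];
    for (size_t i = to; i <= from; i++)
        clusters[i] = min;
}
```

With the values 0,1,2,3,4, moving the fourth glyph to the second position yields 0,1,1,1,4, which remains monotonic.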
@@ -340,8 +458,9 @@
The preceding examples demonstrate the main effects of using
cluster levels 0 and 1. The only difference between the two
levels is this: in level 0, at the very beginning of the shaping
- process, HarfBuzz also merges clusters between any base character
- and all Unicode marks (combining or not) that follow it.
+ process, HarfBuzz merges the cluster of each base character
+ with the clusters of all Unicode marks (combining or not) and
+ modifiers that follow it.
For example, let us start with the following character sequence
@@ -361,6 +480,10 @@
A,acute,B
0,0 ,2
+
+ This merger is performed before any other script-shaping
+ steps.
+
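The initial level 0 merge can be approximated as follows. This is a sketch under stated assumptions: the `is_mark` array is a hypothetical stand-in for the Unicode property lookup that a real shaper performs, and the initial cluster values are assumed to be monotonically increasing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Give every mark the cluster value of the character before it.
 * Because the pass runs left to right, a run of several marks all
 * end up with the cluster value of their base character. */
void merge_marks_into_bases(uint32_t *clusters, const bool *is_mark, size_t n)
{
    assert((clusters != NULL && is_mark != NULL) || n == 0);

    for (size_t i = 1; i < n; i++) {
        if (is_mark[i])
            clusters[i] = clusters[i - 1];
    }
}
```

For the sequence A,acute,B with initial values 0,1,2, marking only the second entry as a mark produces 0,0,2.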
This initial cluster merging is the default behavior of the
Windows shaping engine, and the old HarfBuzz codebase copied
@@ -368,9 +491,10 @@
remained the default behavior in the new HarfBuzz codebase.
- But this initial cluster-merging behavior makes it impossible to
+ But this initial cluster-merging behavior makes it impossible for
+ client programs to implement some features (such as the ability to
color diacritic marks differently from their base
- characters. That is why, in level 1, HarfBuzz does not perform
+ characters). That is why, in level 1, HarfBuzz does not perform
the initial merging step.
@@ -378,29 +502,34 @@
perform cursor positioning, level 0 is more convenient. But
relying on cluster boundaries for cursor positioning is wrong: cursor
positions should be determined based on Unicode grapheme
- boundaries, not on shaping-cluster boundaries. As such, level 1
- clusters are preferred.
+ boundaries, not on shaping-cluster boundaries. As such, using
+ level 1 clustering behavior is recommended.
- One last note about levels 0 and 1. HarfBuzz currently does not allow a
- MultipleSubst lookup to replace a glyph with zero
- glyphs (in other words, to delete a glyph). But, in some other situations,
- glyphs can be deleted. In those cases, if the glyph being deleted is
- the last glyph of its cluster, HarfBuzz makes sure to merge the cluster
- with a neighboring cluster.
+ One final facet of levels 0 and 1 is worth noting. HarfBuzz
+ currently does not allow any
+ multiple-substitution GSUB lookups to
+ replace a glyph with zero glyphs (in other words, to delete a
+ glyph).
+
+
+ But, in some other situations, glyphs can be deleted. In
+ those cases, if the glyph being deleted is the last glyph of its
+ cluster, HarfBuzz makes sure to merge the deleted glyph's
+ cluster with a neighboring cluster.
This is done primarily to make sure that the starting cluster of the
text always has the cluster index pointing to the start of the text
- for the run; more than one client currently relies on this
+ for the run; more than one client program currently relies on this
guarantee.
- Incidentally, Apple's CoreText does something else to maintain the
- same promise: it inserts a glyph with id 65535 at the beginning of
- the glyph string if the glyph corresponding to the first character
- in the run was deleted. HarfBuzz might do something similar in the
- future.
+ Incidentally, Apple's CoreText does something different to
+ maintain the same promise: it inserts a glyph with id 65535 at
+ the beginning of the glyph string if the glyph corresponding to
+ the first character in the run was deleted. HarfBuzz might do
+ something similar in the future.
@@ -415,16 +544,39 @@
performs no merging of clusters whatsoever.
- When glyphs form a ligature (or when some other feature
- substitutes multiple glyphs with one glyph), the cluster value
- of the first glyph is retained as the cluster value for the
- ligature. However, no subsequent clusters — including
- marks and modifiers — are affected.
+ This means that there is no initial base-and-mark merging step
+ (as is done in level 0), and it means that reordering moves and
+ ligature substitutions do not trigger a cluster merge.
- Level 2 cluster behavior is less complex than level 0 or level
- 1, but there are a few cases in which processing cluster values
- produced at level 2 may be tricky.
+ Only one shaping operation directly affects clusters when using
+ level 2:
+
+
+
+
+ When a cluster decomposes, all of the
+ resulting child clusters inherit as their cluster value the
+ cluster value of the parent cluster.
+
+
+
+
+ When glyphs do form a ligature (or when some other feature
+ substitutes multiple glyphs with one glyph), the cluster value
+ of the first glyph is retained as the cluster value for the
+ resulting ligature.
+
+
+ This occurrence sounds similar to a cluster merge, but it is
+ different. In particular, no subsequent characters —
+ including marks and modifiers — are affected. They retain
+ their previous cluster values.
+
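The difference from a cluster merge can be seen in a small model. The helper below, `ligate_level2`, is a hypothetical illustration rather than HarfBuzz code: the ligature keeps the cluster value of its first component and no neighboring value is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replace glyphs [start, end) with a single ligature glyph that
 * keeps the cluster value of the first glyph. No other cluster
 * value is changed. Returns the new glyph count. */
size_t ligate_level2(uint32_t *clusters, size_t n, size_t start, size_t end)
{
    assert(start < end && end <= n);

    /* clusters[start] is intentionally left as it is. */
    memmove(clusters + start + 1, clusters + end,
            (n - end) * sizeof *clusters);
    return n - (end - start) + 1;
}
```

Starting from 0,1,1,1,3,3,4 and ligating the fourth and fifth glyphs gives 0,1,1,1,3,4 in this model. The glyph that follows the ligature keeps its value of 3 instead of being absorbed into cluster 1, as it would be under levels 0 and 1.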
+
+ Level 2 cluster behavior is ultimately less complex than level 0
+ or level 1, but there are several cases for which processing
+ cluster values produced at level 2 may be tricky.
Ligatures with combining marks in level 2
@@ -532,10 +684,11 @@
There may be other problems encountered with ligatures under
level 2, such as if the direction of the text is forced to
- opposite of its natural direction (for example, left-to-right
- Arabic). But, generally speaking, these other scenarios are
- minor corner cases that are too obscure for most client
- programs to need to worry about.
+ the opposite of its natural direction (for example, Arabic text
+ that is forced into left-to-right directionality). But,
+ generally speaking, these other scenarios are minor corner
+ cases that are too obscure for most client programs to need to
+ worry about.
diff --git a/docs/usermanual-getting-started.xml b/docs/usermanual-getting-started.xml
index 932bd9471..fda1e3b0a 100644
--- a/docs/usermanual-getting-started.xml
+++ b/docs/usermanual-getting-started.xml
@@ -76,12 +76,41 @@
Terminology
+
+
+ script
+
+
+ In text shaping, a script is a
+ writing system: a set of symbols, rules, and conventions
+ that is used to represent a language or multiple
+ languages.
+
+
+ In general computing lingo, the word "script" can also
+ be used to mean an executable program (usually one
+ written in a human-readable programming language). For
+ the sake of clarity, HarfBuzz documents will always use
+ more specific terminology when referring to this
+ meaning, such as "Python script" or "shell script." In
+ all other instances, "script" refers to a writing system.
+
+
+ For developers using HarfBuzz, it is important to note
+ the distinction between a script and a language. Most
+ scripts are used to write a variety of different
+ languages, and many languages may be written in more
+ than one script.
+
+
+
+
shaper
In HarfBuzz, a shaper is a
- handler for a specific script shaping model. HarfBuzz
+ handler for a specific script-shaping model. HarfBuzz
implements separate shapers for Indic, Arabic, Thai and
Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the
Universal Shaping Engine (USE), and a default shaper for
@@ -95,12 +124,12 @@
In text shaping, a cluster is a
- sequence of codepoints that must be handled as an
- indivisible unit. Clusters can include codepoint
+ sequence of code points that must be treated as an
+ indivisible unit. Clusters can include code-point
sequences that form a ligature or base-and-mark
sequences. Tracking and preserving clusters is important
when shaping operations might separate or reorder
- codepoints.
+ code points.
HarfBuzz provides three cluster
@@ -111,7 +140,59 @@
-
+
+ grapheme
+
+
+ In linguistics, a grapheme is one
+ of the indivisible units that make up a writing system or
+ script. Often, graphemes are individual symbols (letters,
+ numbers, punctuation marks, logograms, etc.) but,
+ depending on the writing system, a particular grapheme
+ might correspond to a sequence of several Unicode code
+ points.
+
+
+ In practice, HarfBuzz and other text-shaping engines
+ are not generally concerned with graphemes. However, it
+ is important for developers using HarfBuzz to recognize
+ that there is a difference between graphemes and shaping
+ clusters (see above). The two concepts may overlap
+ frequently, but there is no guarantee that they will be
+ identical.
+
+
+
+
+
+ syllable
+
+
+ In linguistics, a syllable is
+ a sequence of sounds that makes up a building block of a
+ particular language. Every language has its own set of
+ rules describing what constitutes a valid syllable.
+
+
+ For text-shaping purposes, the various definitions of
+ "syllable" are important because script-specific shaping
+ operations may be applied at the syllable level. For
+ example, a reordering rule might specify that a vowel
+ mark be reordered to the beginning of the syllable.
+
+
+ Syllables will consist of one or more Unicode code
+ points. The definition of a syllable for a particular
+ writing system might correspond to how HarfBuzz
+ identifies clusters (see above) for the same writing
+ system. However, it is important for developers using
+ HarfBuzz to recognize that there is a difference between
+ syllables and shaping clusters. The two concepts may
+ overlap frequently, but there is no guarantee that they
+ will be identical.
+
+
+
diff --git a/docs/usermanual-install-harfbuzz.xml b/docs/usermanual-install-harfbuzz.xml
index a6484fc5a..53aa38d31 100644
--- a/docs/usermanual-install-harfbuzz.xml
+++ b/docs/usermanual-install-harfbuzz.xml
@@ -126,7 +126,7 @@
If you need to build HarfBuzz from source, first put the
- ragel binary on your
+ ragel binary on your
PATH, then follow the appveyor CI cmake
build
@@ -229,6 +229,7 @@
+
--with-libstdc++
diff --git a/docs/usermanual-shaping-concepts.xml b/docs/usermanual-shaping-concepts.xml
index bc9f1b830..db4e30983 100644
--- a/docs/usermanual-shaping-concepts.xml
+++ b/docs/usermanual-shaping-concepts.xml
@@ -182,22 +182,23 @@
Southeast Asian scripts are also assigned
Unicode Indic Syllabic Category (UISC) and
Unicode Indic Positional Category (UIPC)
- property that provides more detailed information needed for
+ properties that provide more detailed information needed for
shaping.
The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
- UIPC property sub-categorizes Mark codepoints by the visual
+ UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in
multiple positions).
Some complex scripts require that the text run be split into
- syllables, and what constitutes a valid syllable in these
- scripts is specified in regular expressions of the Letter and
- Mark codepoints that take the UISC and UIPC properties into account.
+ syllables. What constitutes a valid syllable in these
+ scripts is specified in regular expressions, formed from the
+ Letter and Mark codepoints, that take the UISC and UIPC
+ properties into account.