Usermanual: small updates.

Nathan Willis 2018-11-28 13:48:38 -06:00 committed by Khaled Hosny
parent 26c5b54fb0
commit ed13caddf2
5 changed files with 315 additions and 78 deletions


@ -15,14 +15,15 @@
<section id="creating-and-destroying-buffers"> <section id="creating-and-destroying-buffers">
<title>Creating and destroying buffers</title> <title>Creating and destroying buffers</title>
<para> <para>
As we saw in our initial example, a buffer is created and As we saw in our <emphasis>Getting Started</emphasis> example, a
buffer is created and
initialized with <literal>hb_buffer_create()</literal>. This initialized with <literal>hb_buffer_create()</literal>. This
produces a new, empty buffer object, instantiated with some produces a new, empty buffer object, instantiated with some
default values and ready to accept your Unicode strings. default values and ready to accept your Unicode strings.
</para> </para>
<para> <para>
HarfBuzz manages the memory of objects that it creates (such as HarfBuzz manages the memory of objects (such as buffers) that it
buffers), so you don't have to. When you have finished working on creates, so you don't have to. When you have finished working on
a buffer, you can call <literal>hb_buffer_destroy()</literal>: a buffer, you can call <literal>hb_buffer_destroy()</literal>:
</para> </para>
<programlisting language="C"> <programlisting language="C">


@ -6,25 +6,41 @@
]> ]>
<chapter id="clusters"> <chapter id="clusters">
<title>Clusters</title> <title>Clusters</title>
<section id="clusters"> <section id="clusters-and-shaping">
<title>Clusters</title> <title>Clusters and shaping</title>
<para> <para>
In text shaping, a <emphasis>cluster</emphasis> is a sequence of In text shaping, a <emphasis>cluster</emphasis> is a sequence of
characters that needs to be treated as a single, indivisible characters that needs to be treated as a single, indivisible
unit. unit. A single letter or symbol can be a cluster of its
own. Other clusters correspond to longer subsequences of the
input code points &mdash; such as a ligature or conjunct form
&mdash; and require the shaper to ensure that the cluster is not
broken during the shaping process.
</para> </para>
<para> <para>
A cluster is distinct from a <emphasis>grapheme</emphasis>, A cluster is distinct from a <emphasis>grapheme</emphasis>,
which is the smallest unit of a writing system or script, which is the smallest unit of meaning in a writing system or
because clusters are only relevant for script shaping and the script.
layout of glyphs.
</para> </para>
<para> <para>
For example, a grapheme may be a letter, a number, a logogram, The definitions of the two terms are similar. However, clusters
or a symbol. When two letters form a ligature, however, they are only relevant for script shaping and glyph layout. In
combine into a single glyph. They are therefore part of the same contrast, graphemes are a property of the underlying script, and
cluster and are treated as a unit &mdash; even though the two are of interest when client programs implement orthographic
original, underlying letters are separate graphemes. or linguistic functionality.
</para>
<para>
For example, two individual letters are often two separate
graphemes. When two letters form a ligature, however, they
combine into a single glyph. They are then part of the same
cluster and are treated as a unit by the shaping engine &mdash;
even though the two original, underlying letters remain separate
graphemes.
</para>
<para>
HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
with graphemes &mdash; although client programs using HarfBuzz
may still care about graphemes for other reasons from time to time.
</para> </para>
<para> <para>
During the shaping process, there are several shaping operations During the shaping process, there are several shaping operations
@ -32,14 +48,15 @@
points form a ligature or a conjunct form and are replaced by a points form a ligature or a conjunct form and are replaced by a
single glyph) or split one character into several (for example, single glyph) or split one character into several (for example,
when decomposing a code point through the when decomposing a code point through the
<literal>ccmp</literal> feature). <literal>ccmp</literal> feature). Operations like these alter
clusters; HarfBuzz tracks the changes to ensure that no clusters
get lost or broken during shaping.
</para> </para>
<para> <para>
HarfBuzz tracks clusters independently from how these HarfBuzz records cluster information independently from how
shaping operations affect the individual glyphs that comprise the shaping operations affect the individual glyphs returned in an
output HarfBuzz returns in a buffer. Consequently, output buffer. Consequently, a client program using HarfBuzz can
a client program using HarfBuzz can utilize the cluster utilize the cluster information to implement features such as:
information to implement features such as:
</para> </para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
@ -77,11 +94,14 @@
<para> <para>
Performing line-breaking, justification, and other Performing line-breaking, justification, and other
line-level or paragraph-level operations that must be done line-level or paragraph-level operations that must be done
after shaping is complete, but which require character-level after shaping is complete, but which require examining
properties. character-level properties.
</para> </para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</section>
<section id="working-with-harfbuzz-clusters">
<title>Working with HarfBuzz clusters</title>
<para> <para>
When you add text to a HarfBuzz buffer, each code point must be When you add text to a HarfBuzz buffer, each code point must be
assigned a <emphasis>cluster value</emphasis>. assigned a <emphasis>cluster value</emphasis>.
@ -94,7 +114,65 @@
value does not matter. value does not matter.
</para> </para>
<para> <para>
Client programs can choose how HarfBuzz handles clusters during Some of the shaping operations performed by HarfBuzz &mdash;
such as reordering, composition, decomposition, and substitution
&mdash; may alter the cluster values of some characters. The
final cluster values in the buffer at the end of the shaping
process will indicate to client programs which subsequences of
glyphs represent a cluster and, therefore, must not be
separated.
</para>
<para>
In addition, client programs can query the final cluster values
to discern other potentially important information about the
glyphs in the output buffer (such as whether or not a ligature
was formed).
</para>
<para>
For example, if the initial sequence of cluster values was:
</para>
<programlisting>
0,1,2,3,4
</programlisting>
<para>
and the final sequence of cluster values is:
</para>
<programlisting>
0,0,3,3
</programlisting>
<para>
then there are two clusters in the output buffer: the first
cluster includes the first two glyphs, and the second cluster
includes the third and fourth glyphs. It is also evident that a
ligature or conjunct has been formed, because there are fewer
glyphs in the output buffer (four) than there were code points
in the input buffer (five).
</para>
<para>
Although client programs using HarfBuzz are free to assign
initial cluster values in any manner they choose to, HarfBuzz
does offer some useful guarantees if the cluster values are
assigned in a monotonic (either non-decreasing or non-increasing)
order.
</para>
<para>
For left-to-right scripts (LTR) and top-to-bottom scripts (TTB),
HarfBuzz will preserve the monotonic property: client programs
are guaranteed that monotonically increasing initial cluster
values will be returned as monotonically increasing final
cluster values.
</para>
<para>
For right-to-left scripts (RTL) and bottom-to-top scripts (BTT),
the directionality of the buffer itself is reversed for final
output as a matter of design. Therefore, HarfBuzz inverts the
monotonic property: client programs are guaranteed that
monotonically increasing initial cluster values will be
returned as monotonically <emphasis>decreasing</emphasis> final
cluster values.
</para>
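<para>
A client program could verify the monotonic property on an output
buffer with checks along the following lines. This is only a
sketch: both function names are hypothetical, and the cluster
values are assumed to have been copied into a plain array first.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers, not part of the HarfBuzz API. */

/* True if the values never decrease (the property expected for LTR and
 * TTB output when the initial values were increasing). */
bool
clusters_non_decreasing (const unsigned int *clusters, size_t n)
{
  assert (clusters != NULL || n == 0);
  for (size_t i = 1; i < n; i++)
    if (clusters[i] < clusters[i - 1])
      return false;
  return true;
}

/* True if the values never increase (the property expected for RTL and
 * BTT output when the initial values were increasing). */
bool
clusters_non_increasing (const unsigned int *clusters, size_t n)
{
  assert (clusters != NULL || n == 0);
  for (size_t i = 1; i < n; i++)
    if (clusters[i] > clusters[i - 1])
      return false;
  return true;
}
```

<para>
Equal neighboring values are accepted by both checks, because
glyphs that belong to the same cluster share a cluster value.
</para>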
<para>
Client programs can adjust how HarfBuzz handles clusters during
shaping by setting the shaping by setting the
<literal>cluster_level</literal> of the <literal>cluster_level</literal> of the
buffer. HarfBuzz offers three <emphasis>levels</emphasis> of buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
@ -179,7 +257,7 @@
assign initial cluster values in a buffer by reusing the indices assign initial cluster values in a buffer by reusing the indices
of the code points in the input text. This gives a sequence of of the code points in the input text. This gives a sequence of
cluster values that is monotonically increasing (for example, cluster values that is monotonically increasing (for example,
0,1,2,3,4,5). 0,1,2,3,4).
</para> </para>
<para> <para>
It is not <emphasis>required</emphasis> that the cluster values It is not <emphasis>required</emphasis> that the cluster values
@ -233,16 +311,44 @@
</para> </para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</section> </section>
<section id="a-clustering-example-for-levels-0-and-1"> <section id="a-clustering-example-for-levels-0-and-1">
<title>A clustering example for levels 0 and 1</title> <title>A clustering example for levels 0 and 1</title>
<para> <para>
The guarantees and benefits of level 0 and level 1 can be seen The basic shaping operations affect clusters in a predictable
with some examples. First, let us examine what happens with cluster manner when using level 0 or level 1:
values when shaping involves cluster merging with ligatures and
decomposition.
</para> </para>
<itemizedlist>
<listitem>
<para>
When two or more clusters <emphasis>merge</emphasis>, the
resulting merged cluster takes as its cluster value the
<emphasis>minimum</emphasis> of the incoming cluster values.
</para>
</listitem>
<listitem>
<para>
When a cluster <emphasis>decomposes</emphasis>, all of the
resulting child clusters inherit as their cluster value the
cluster value of the parent cluster.
</para>
</listitem>
<listitem>
<para>
When a character is <emphasis>reordered</emphasis>, the
reordered character and all clusters that the character
moves past as part of the reordering are merged into one cluster.
</para>
</listitem>
</itemizedlist>
<para>
The functionality, guarantees, and benefits of level 0 and level
1 behavior can be seen with some examples. First, let us examine
what happens with cluster values when shaping involves cluster
merging with ligatures and decomposition.
</para>
<para> <para>
Let's say we start with the following character sequence (top row) and Let's say we start with the following character sequence (top row) and
initial cluster values (bottom row): initial cluster values (bottom row):
@ -279,8 +385,8 @@
<para> <para>
Next, let us say that the <literal>BC</literal> ligature glyph Next, let us say that the <literal>BC</literal> ligature glyph
decomposes into three components, and <literal>D</literal> also decomposes into three components, and <literal>D</literal> also
decomposes into two components. These components each inherit the decomposes into two components. Whenever a cluster decomposes,
cluster value of their parent: its components each inherit the cluster value of their parent:
</para> </para>
<programlisting> <programlisting>
A,BC0,BC1,BC2,D0,D1,E A,BC0,BC1,BC2,D0,D1,E
@ -295,6 +401,12 @@
A,BC0,BC1,BC2D0,D1,E A,BC0,BC1,BC2D0,D1,E
0,1 ,1 ,1 ,1 ,4 0,1 ,1 ,1 ,1 ,4
</programlisting> </programlisting>
<para>
Note that the entirety of cluster 3 merges into cluster 1, not
just the <literal>D0</literal> glyph. This reflects the fact
that the cluster <emphasis>must</emphasis> be treated as an
indivisible unit.
</para>
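<para>
The merge rule, including the way it absorbs entire clusters, can
be modeled with a short C function. This is a simplified sketch
written for this example, not HarfBuzz's actual implementation;
<literal>merge_clusters()</literal> here is a hypothetical function
operating on a plain array of cluster values, and it leaves the
number of glyphs unchanged.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model, not HarfBuzz's actual implementation: merges the glyphs
 * in [start, end) into one cluster, as described for levels 0 and 1. */
void
merge_clusters (unsigned int *clusters, size_t n, size_t start, size_t end)
{
  assert (clusters != NULL && start < end && end <= n);

  /* The merged cluster takes the minimum of the incoming values. */
  unsigned int min = clusters[start];
  for (size_t i = start + 1; i < end; i++)
    if (clusters[i] < min)
      min = clusters[i];

  /* Clusters are indivisible, so widen the range to cover every glyph
   * that shares a cluster with either end of the range. */
  while (start > 0 && clusters[start - 1] == clusters[start])
    start--;
  while (end < n && clusters[end] == clusters[end - 1])
    end++;

  for (size_t i = start; i < end; i++)
    clusters[i] = min;
}
```

<para>
Applied to the values 0,1,1,1,3,3,4 with the range covering
<literal>BC2</literal> and <literal>D0</literal> (indices 3 and 4),
the model yields 0,1,1,1,1,1,4: the <literal>D1</literal> glyph is
pulled into cluster 1 along with <literal>D0</literal>.
</para>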
<para> <para>
At this point, cluster 1 means: the character sequence At this point, cluster 1 means: the character sequence
<literal>BCD</literal> is represented by glyphs <literal>BCD</literal> is represented by glyphs
@ -319,18 +431,24 @@
0,1,2,3,4 0,1,2,3,4
</programlisting> </programlisting>
<para> <para>
If <literal>D</literal> is reordered to before <literal>B</literal>, If <literal>D</literal> is reordered to the position immediately
then HarfBuzz merges the <literal>B</literal>, before <literal>B</literal>, then HarfBuzz merges the
<literal>C</literal>, and <literal>D</literal> clusters, and we <literal>B</literal>, <literal>C</literal>, and
get: <literal>D</literal> clusters &mdash; all the clusters between
the final position of the reordered glyph and its original
position. This means that we get:
</para> </para>
<programlisting> <programlisting>
A,D,B,C,E A,D,B,C,E
0,1,1,1,4 0,1,1,1,4
</programlisting> </programlisting>
<para> <para>
This is clearly not ideal, but it is the only sensible way to as the final cluster sequence.
maintain a monotonic sequence of cluster values and retain the </para>
<para>
Merging this many clusters is not ideal, but it is the only
sensible way for HarfBuzz to maintain the guarantee that the
sequence of cluster values remains monotonic and to retain the
true relationship between glyphs and characters. true relationship between glyphs and characters.
</para> </para>
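<para>
The reordering case can be modeled in the same spirit. The
function below is a hypothetical sketch, not HarfBuzz code: it
represents each glyph as a single <literal>char</literal>, and it
assumes that every cluster involved holds exactly one glyph, as in
the example above.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model, not HarfBuzz's implementation: moves the glyph at index
 * `from` back to index `to` (to <= from) and merges every cluster it passes. */
void
reorder_backward (char *glyphs, unsigned int *clusters, size_t n,
                  size_t from, size_t to)
{
  assert (glyphs != NULL && clusters != NULL && to <= from && from < n);

  /* The merged cluster takes the minimum value found in [to, from]. */
  unsigned int min = clusters[to];
  for (size_t i = to + 1; i <= from; i++)
    if (clusters[i] < min)
      min = clusters[i];

  /* Shift the glyphs in [to, from) right by one position, then drop the
   * moved glyph into the gap at `to`. */
  char moved = glyphs[from];
  memmove (&glyphs[to + 1], &glyphs[to], from - to);
  glyphs[to] = moved;

  /* Every position between the new and the old location joins the cluster. */
  for (size_t i = to; i <= from; i++)
    clusters[i] = min;
}
```

<para>
Moving <literal>D</literal> (index 3) to index 1 in
<literal>A,B,C,D,E</literal> gives the glyph order
<literal>A,D,B,C,E</literal> with cluster values 0,1,1,1,4.
</para>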
</section> </section>
@ -340,8 +458,9 @@
The preceding examples demonstrate the main effects of using The preceding examples demonstrate the main effects of using
cluster levels 0 and 1. The only difference between the two cluster levels 0 and 1. The only difference between the two
levels is this: in level 0, at the very beginning of the shaping levels is this: in level 0, at the very beginning of the shaping
process, HarfBuzz also merges clusters between any base character process, HarfBuzz merges the cluster of each base character
and all Unicode marks (combining or not) that follow it. with the clusters of all Unicode marks (combining or not) and
modifiers that follow it.
</para> </para>
<para> <para>
For example, let us start with the following character sequence For example, let us start with the following character sequence
@ -361,6 +480,10 @@
A,acute,B A,acute,B
0,0 ,2 0,0 ,2
</programlisting> </programlisting>
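<para>
As a rough model of this step, the sketch below folds each mark
into the cluster of the character before it. It is an illustration
only: the <literal>is_mark</literal> array stands in for a Unicode
property lookup, the function name is hypothetical, and the initial
cluster values are assumed to be non-decreasing.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the level 0 pre-shaping step, not HarfBuzz code.
 * The is_mark array is a stand-in for a Unicode property lookup. */
void
merge_marks_into_bases (unsigned int *clusters, const bool *is_mark, size_t n)
{
  assert ((clusters != NULL && is_mark != NULL) || n == 0);
  for (size_t i = 1; i < n; i++)
    /* A mark joins the cluster of the character before it. Because the
     * loop runs left to right, a run of marks all reach the same base. */
    if (is_mark[i])
      clusters[i] = clusters[i - 1];
}
```

<para>
With the values 0,1,2 and only the second character flagged as a
mark, the result is 0,0,2.
</para>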
<para>
This merger is performed before any other script-shaping
steps.
</para>
<para> <para>
This initial cluster merging is the default behavior of the This initial cluster merging is the default behavior of the
Windows shaping engine, and the old HarfBuzz codebase copied Windows shaping engine, and the old HarfBuzz codebase copied
@ -368,9 +491,10 @@
remained the default behavior in the new HarfBuzz codebase. remained the default behavior in the new HarfBuzz codebase.
</para> </para>
<para> <para>
But this initial cluster-merging behavior makes it impossible to But this initial cluster-merging behavior makes it impossible
for client programs to implement some features (such as to
color diacritic marks differently from their base color diacritic marks differently from their base
characters. That is why, in level 1, HarfBuzz does not perform characters). That is why, in level 1, HarfBuzz does not perform
the initial merging step. the initial merging step.
</para> </para>
<para> <para>
@ -378,29 +502,34 @@
perform cursor positioning, level 0 is more convenient. But perform cursor positioning, level 0 is more convenient. But
relying on cluster boundaries for cursor positioning is wrong: cursor relying on cluster boundaries for cursor positioning is wrong: cursor
positions should be determined based on Unicode grapheme positions should be determined based on Unicode grapheme
boundaries, not on shaping-cluster boundaries. As such, level 1 boundaries, not on shaping-cluster boundaries. As such, using
clusters are preferred. level 1 clustering behavior is recommended.
</para> </para>
<para> <para>
One last note about levels 0 and 1. HarfBuzz currently does not allow a One final facet of levels 0 and 1 is worth noting. HarfBuzz
<literal>MultipleSubst</literal> lookup to replace a glyph with zero currently does not allow any
glyphs (in other words, to delete a glyph). But, in some other situations, <emphasis>multiple-substitution</emphasis> GSUB lookups to
glyphs can be deleted. In those cases, if the glyph being deleted is replace a glyph with zero glyphs (in other words, to delete a
the last glyph of its cluster, HarfBuzz makes sure to merge the cluster glyph).
with a neighboring cluster. </para>
<para>
But, in some other situations, glyphs can be deleted. In
those cases, if the glyph being deleted is the last glyph of its
cluster, HarfBuzz makes sure to merge the deleted glyph's
cluster with a neighboring cluster.
</para> </para>
<para> <para>
This is done primarily to make sure that the starting cluster of the This is done primarily to make sure that the starting cluster of the
text always has the cluster index pointing to the start of the text text always has the cluster index pointing to the start of the text
for the run; more than one client currently relies on this for the run; more than one client program currently relies on this
guarantee. guarantee.
</para> </para>
<para> <para>
Incidentally, Apple's CoreText does something else to maintain the Incidentally, Apple's CoreText does something different to
same promise: it inserts a glyph with id 65535 at the beginning of maintain the same promise: it inserts a glyph with id 65535 at
the glyph string if the glyph corresponding to the first character the beginning of the glyph string if the glyph corresponding to
in the run was deleted. HarfBuzz might do something similar in the the first character in the run was deleted. HarfBuzz might do
future. something similar in the future.
</para> </para>
</section> </section>
<section id="level-2"> <section id="level-2">
@ -415,16 +544,39 @@
performs no merging of clusters whatsoever. performs no merging of clusters whatsoever.
</para> </para>
<para> <para>
When glyphs form a ligature (or when some other feature This means that there is no initial base-and-mark merging step
substitutes multiple glyphs with one glyph), the cluster value (as is done in level 0), and it means that reordering moves and
of the first glyph is retained as the cluster value for the ligature substitutions do not trigger a cluster merge.
ligature. However, no subsequent clusters &mdash; including
marks and modifiers &mdash; are affected.
</para> </para>
<para> <para>
Level 2 cluster behavior is less complex than level 0 or level Only one shaping operation directly affects clusters when using
1, but there are a few cases in which processing cluster values level 2:
produced at level 2 may be tricky. </para>
<itemizedlist>
<listitem>
<para>
When a cluster <emphasis>decomposes</emphasis>, all of the
resulting child clusters inherit as their cluster value the
cluster value of the parent cluster.
</para>
</listitem>
</itemizedlist>
<para>
When glyphs do form a ligature (or when some other feature
substitutes multiple glyphs with one glyph) the cluster value
of the first glyph is retained as the cluster value for the
resulting ligature.
</para>
<para>
This occurrence sounds similar to a cluster merge, but it is
different. In particular, no subsequent characters &mdash;
including marks and modifiers &mdash; are affected. They retain
their previous cluster values.
</para>
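<para>
The difference from a cluster merge can be seen in a small model.
The function below is a hypothetical sketch, not HarfBuzz code: it
replaces a run of glyphs with one ligature glyph in a plain array
of cluster values and leaves every following value untouched.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model, not HarfBuzz's implementation: replaces `count` glyphs
 * starting at `start` with one ligature glyph and returns the new length. */
size_t
form_ligature_level2 (unsigned int *clusters, size_t n,
                      size_t start, size_t count)
{
  assert (clusters != NULL && count >= 1 && start + count <= n);

  /* clusters[start] is left as-is: the ligature keeps the first glyph's
   * value. Close the gap left by the other components; the values of the
   * glyphs that follow are copied unchanged. */
  memmove (&clusters[start + 1], &clusters[start + count],
           (n - start - count) * sizeof clusters[0]);
  return n - (count - 1);
}
```

<para>
Starting from 0,1,2,3,4 and forming a ligature from the second and
third glyphs gives 0,1,3,4. The glyph that originally carried the
value 3 keeps it, whereas a level 0 or level 1 merge could have
changed the values of neighboring glyphs.
</para>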
<para>
Level 2 cluster behavior is ultimately less complex than level 0
or level 1, but there are several cases for which processing
cluster values produced at level 2 may be tricky.
</para> </para>
<section id="ligatures-with-combining-marks-in-level-2"> <section id="ligatures-with-combining-marks-in-level-2">
<title>Ligatures with combining marks in level 2</title> <title>Ligatures with combining marks in level 2</title>
@ -532,10 +684,11 @@
<para> <para>
There may be other problems encountered with ligatures under There may be other problems encountered with ligatures under
level 2, such as if the direction of the text is forced to level 2, such as if the direction of the text is forced to
the opposite of its natural direction (for example, left-to-right the opposite of its natural direction (for example, Arabic text
Arabic). But, generally speaking, these other scenarios are that is forced into left-to-right directionality). But,
minor corner cases that are too obscure for most client generally speaking, these other scenarios are minor corner
programs to need to worry about. cases that are too obscure for most client programs to need to
worry about.
</para> </para>
</section> </section>
</section> </section>


@ -76,12 +76,41 @@
<section> <section>
<title>Terminology</title> <title>Terminology</title>
<variablelist> <variablelist>
<?dbfo list-presentation="blocks"?>
<varlistentry>
<term>script</term>
<listitem>
<para>
In text shaping, a <emphasis>script</emphasis> is a
writing system: a set of symbols, rules, and conventions
that is used to represent a language or multiple
languages.
</para>
<para>
In general computing lingo, the word "script" can also
be used to mean an executable program (usually one
written in a human-readable programming language). For
the sake of clarity, HarfBuzz documents will always use
more specific terminology when referring to this
meaning, such as "Python script" or "shell script." In
all other instances, "script" refers to a writing system.
</para>
<para>
For developers using HarfBuzz, it is important to note
the distinction between a script and a language. Most
scripts are used to write a variety of different
languages, and many languages may be written in more
than one script.
</para>
</listitem>
</varlistentry>
<varlistentry> <varlistentry>
<term>shaper</term> <term>shaper</term>
<listitem> <listitem>
<para> <para>
In HarfBuzz, a <emphasis>shaper</emphasis> is a In HarfBuzz, a <emphasis>shaper</emphasis> is a
handler for a specific script shaping model. HarfBuzz handler for a specific script-shaping model. HarfBuzz
implements separate shapers for Indic, Arabic, Thai and implements separate shapers for Indic, Arabic, Thai and
Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the
Universal Shaping Engine (USE), and a default shaper for Universal Shaping Engine (USE), and a default shaper for
@ -95,12 +124,12 @@
<listitem> <listitem>
<para> <para>
In text shaping, a <emphasis>cluster</emphasis> is a In text shaping, a <emphasis>cluster</emphasis> is a
sequence of codepoints that must be handled as an sequence of code points that must be treated as an
indivisible unit. Clusters can include codepoint indivisible unit. Clusters can include code-point
sequences that form a ligature or base-and-mark sequences that form a ligature or base-and-mark
sequences. Tracking and preserving clusters is important sequences. Tracking and preserving clusters is important
when shaping operations might separate or reorder when shaping operations might separate or reorder
codepoints. code points.
</para> </para>
<para> <para>
HarfBuzz provides three cluster HarfBuzz provides three cluster
@ -111,7 +140,59 @@
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry>
<term>grapheme</term>
<listitem>
<para>
In linguistics, a <emphasis>grapheme</emphasis> is one
of the indivisible units that make up a writing system or
script. Often, graphemes are individual symbols (letters,
numbers, punctuation marks, logograms, etc.) but,
depending on the writing system, a particular grapheme
might correspond to a sequence of several Unicode code
points.
</para>
<para>
In practice, HarfBuzz and other text-shaping engines
are not generally concerned with graphemes. However, it
is important for developers using HarfBuzz to recognize
that there is a difference between graphemes and shaping
clusters (see above). The two concepts may overlap
frequently, but there is no guarantee that they will be
identical.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>syllable</term>
<listitem>
<para>
In linguistics, a <emphasis>syllable</emphasis> is a
sequence of sounds that makes up a building block of a
particular language. Every language has its own set of
rules describing what constitutes a valid syllable.
</para>
<para>
For text-shaping purposes, the various definitions of
"syllable" are important because script-specific shaping
operations may be applied at the syllable level. For
example, a reordering rule might specify that a vowel
mark be reordered to the beginning of the syllable.
</para>
<para>
Syllables will consist of one or more Unicode code
points. The definition of a syllable for a particular
writing system might correspond to how HarfBuzz
identifies clusters (see above) for the same writing
system. However, it is important for developers using
HarfBuzz to recognize that there is a difference between
syllables and shaping clusters. The two concepts may
overlap frequently, but there is no guarantee that they
will be identical.
</para>
</listitem>
</varlistentry>
</variablelist> </variablelist>
</section> </section>


@ -126,7 +126,7 @@
</para> </para>
<para> <para>
If you need to build HarfBuzz from source, first put the If you need to build HarfBuzz from source, first put the
<program>ragel</program> binary on your <package>ragel</package> binary on your
<literal>PATH</literal>, then follow the appveyor CI cmake <literal>PATH</literal>, then follow the appveyor CI cmake
<ulink <ulink
url="https://github.com/harfbuzz/harfbuzz/blob/master/appveyor.yml">build url="https://github.com/harfbuzz/harfbuzz/blob/master/appveyor.yml">build
@ -229,6 +229,7 @@
</para> </para>
<variablelist> <variablelist>
<?dbfo list-presentation="blocks"?>
<varlistentry> <varlistentry>
<term>--with-libstdc++</term> <term>--with-libstdc++</term>
<listitem> <listitem>


@ -182,22 +182,23 @@
Southeast Asian scripts are also assigned Southeast Asian scripts are also assigned
<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
<emphasis>Unicode Indic Positional Category</emphasis> (UIPC) <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
property that provides more detailed information needed for properties that provide more detailed information needed for
shaping. shaping.
</para> </para>
<para> <para>
The UISC property sub-categorizes Letters and Marks according to The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the visual UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in position that they occupy (above, below, right, left, or in
multiple positions). multiple positions).
</para> </para>
<para> <para>
Some complex scripts require that the text run be split into Some complex scripts require that the text run be split into
syllables, and what constitutes a valid syllable in these syllables. What constitutes a valid syllable in these
scripts is specified in regular expressions of the Letter and scripts is specified in regular expressions, formed from the
Mark codepoints that take the UISC and UIPC properties into account. Letter and Mark codepoints, that take the UISC and UIPC
properties into account.
</para> </para>
</section> </section>