Usermanual: small updates.
parent
26c5b54fb0
commit
ed13caddf2
|
@ -15,14 +15,15 @@
|
||||||
<section id="creating-and-destroying-buffers">
|
<section id="creating-and-destroying-buffers">
|
||||||
<title>Creating and destroying buffers</title>
|
<title>Creating and destroying buffers</title>
|
||||||
<para>
|
<para>
|
||||||
As we saw in our initial example, a buffer is created and
|
As we saw in our <emphasis>Getting Started</emphasis> example, a
|
||||||
|
buffer is created and
|
||||||
initialized with <literal>hb_buffer_create()</literal>. This
|
initialized with <literal>hb_buffer_create()</literal>. This
|
||||||
produces a new, empty buffer object, instantiated with some
|
produces a new, empty buffer object, instantiated with some
|
||||||
default values and ready to accept your Unicode strings.
|
default values and ready to accept your Unicode strings.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
HarfBuzz manages the memory of objects that it creates (such as
|
HarfBuzz manages the memory of objects (such as buffers) that it
|
||||||
buffers), so you don't have to. When you have finished working on
|
creates, so you don't have to. When you have finished working on
|
||||||
a buffer, you can call <literal>hb_buffer_destroy()</literal>:
|
a buffer, you can call <literal>hb_buffer_destroy()</literal>:
|
||||||
</para>
|
</para>
|
||||||
<programlisting language="C">
|
<programlisting language="C">
|
||||||
|
|
|
@ -6,25 +6,41 @@
|
||||||
]>
|
]>
|
||||||
<chapter id="clusters">
|
<chapter id="clusters">
|
||||||
<title>Clusters</title>
|
<title>Clusters</title>
|
||||||
<section id="clusters">
|
<section id="clusters-and-shaping">
|
||||||
<title>Clusters</title>
|
<title>Clusters and shaping</title>
|
||||||
<para>
|
<para>
|
||||||
In text shaping, a <emphasis>cluster</emphasis> is a sequence of
|
In text shaping, a <emphasis>cluster</emphasis> is a sequence of
|
||||||
characters that needs to be treated as a single, indivisible
|
characters that needs to be treated as a single, indivisible
|
||||||
unit.
|
unit. A single letter or symbol can be a cluster of its
|
||||||
|
own. Other clusters correspond to longer subsequences of the
|
||||||
|
input code points — such as a ligature or conjunct form
|
||||||
|
— and require the shaper to ensure that the cluster is not
|
||||||
|
broken during the shaping process.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
A cluster is distinct from a <emphasis>grapheme</emphasis>,
|
A cluster is distinct from a <emphasis>grapheme</emphasis>,
|
||||||
which is the smallest unit of a writing system or script,
|
which is the smallest unit of meaning in a writing system or
|
||||||
because clusters are only relevant for script shaping and the
|
script.
|
||||||
layout of glyphs.
|
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
For example, a grapheme may be a letter, a number, a logogram,
|
The definitions of the two terms are similar. However, clusters
|
||||||
or a symbol. When two letters form a ligature, however, they
|
are only relevant for script shaping and glyph layout. In
|
||||||
combine into a single glyph. They are therefore part of the same
|
contrast, graphemes are a property of the underlying script, and
|
||||||
cluster and are treated as a unit — even though the two
|
are of interest when client programs implement orthographic
|
||||||
original, underlying letters are separate graphemes.
|
or linguistic functionality.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For example, two individual letters are often two separate
|
||||||
|
graphemes. When two letters form a ligature, however, they
|
||||||
|
combine into a single glyph. They are then part of the same
|
||||||
|
cluster and are treated as a unit by the shaping engine —
|
||||||
|
even though the two original, underlying letters remain separate
|
||||||
|
graphemes.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
|
||||||
|
with graphemes — although client programs using HarfBuzz
|
||||||
|
may still care about graphemes for other reasons from time to time.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
During the shaping process, there are several shaping operations
|
During the shaping process, there are several shaping operations
|
||||||
|
@ -32,14 +48,15 @@
|
||||||
points form a ligature or a conjunct form and are replaced by a
|
points form a ligature or a conjunct form and are replaced by a
|
||||||
single glyph) or split one character into several (for example,
|
single glyph) or split one character into several (for example,
|
||||||
when decomposing a code point through the
|
when decomposing a code point through the
|
||||||
<literal>ccmp</literal> feature).
|
<literal>ccmp</literal> feature). Operations like these alter
|
||||||
|
clusters; HarfBuzz tracks the changes to ensure that no clusters
|
||||||
|
get lost or broken during shaping.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
HarfBuzz tracks clusters independently from how these
|
HarfBuzz records cluster information independently from how
|
||||||
shaping operations affect the individual glyphs that comprise the
|
shaping operations affect the individual glyphs returned in an
|
||||||
output HarfBuzz returns in a buffer. Consequently,
|
output buffer. Consequently, a client program using HarfBuzz can
|
||||||
a client program using HarfBuzz can utilize the cluster
|
utilize the cluster information to implement features such as:
|
||||||
information to implement features such as:
|
|
||||||
</para>
|
</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem>
|
<listitem>
|
||||||
|
@ -77,11 +94,14 @@
|
||||||
<para>
|
<para>
|
||||||
Performing line-breaking, justification, and other
|
Performing line-breaking, justification, and other
|
||||||
line-level or paragraph-level operations that must be done
|
line-level or paragraph-level operations that must be done
|
||||||
after shaping is complete, but which require character-level
|
after shaping is complete, but which require examining
|
||||||
properties.
|
character-level properties.
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
|
</section>
|
||||||
|
<section id="working-with-harfbuzz-clusters">
|
||||||
|
<title>Working with HarfBuzz clusters</title>
|
||||||
<para>
|
<para>
|
||||||
When you add text to a HarfBuzz buffer, each code point must be
|
When you add text to a HarfBuzz buffer, each code point must be
|
||||||
assigned a <emphasis>cluster value</emphasis>.
|
assigned a <emphasis>cluster value</emphasis>.
|
||||||
|
@ -94,7 +114,65 @@
|
||||||
value does not matter.
|
value does not matter.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
Client programs can choose how HarfBuzz handles clusters during
|
Some of the shaping operations performed by HarfBuzz —
|
||||||
|
such as reordering, composition, decomposition, and substitution
|
||||||
|
— may alter the cluster values of some characters. The
|
||||||
|
final cluster values in the buffer at the end of the shaping
|
||||||
|
process will indicate to client programs which subsequences of
|
||||||
|
glyphs represent a cluster and, therefore, must not be
|
||||||
|
separated.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
In addition, client programs can query the final cluster values
|
||||||
|
to discern other potentially important information about the
|
||||||
|
glyphs in the output buffer (such as whether or not a ligature
|
||||||
|
was formed).
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For example, if the initial sequence of cluster values was:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
0,1,2,3,4
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
and the final sequence of cluster values is:
|
||||||
|
</para>
|
||||||
|
<programlisting>
|
||||||
|
0,0,3,3
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
then there are two clusters in the output buffer: the first
|
||||||
|
cluster includes the first two glyphs, and the second cluster
|
||||||
|
includes the third and fourth glyphs. It is also evident that a
|
||||||
|
ligature or conjunct has been formed, because there are fewer
|
||||||
|
glyphs in the output buffer (four) than there were code points
|
||||||
|
in the input buffer (five).
|
||||||
|
</para>
|
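<para>
  The reading of the final cluster values described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names <literal>count_clusters</literal> and <literal>cluster_length</literal> are hypothetical helpers, not part of the HarfBuzz API, and the cluster values are passed in as a plain array. In a real client they would typically be read from the <literal>cluster</literal> field of the <literal>hb_glyph_info_t</literal> entries returned by <literal>hb_buffer_get_glyph_infos()</literal>.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the clusters in a shaped run by counting the
 * positions at which the cluster value changes. This assumes that the glyphs
 * of one cluster are adjacent, which holds for monotonic cluster values. */
size_t count_clusters(const uint32_t *clusters, size_t n_glyphs)
{
    assert(clusters != NULL || n_glyphs == 0);
    size_t count = 0;
    for (size_t i = 0; i < n_glyphs; i++) {
        if (i == 0 || clusters[i] != clusters[i - 1])
            count++;
    }
    return count;
}

/* Hypothetical helper: number of glyphs in the cluster that begins at
 * glyph index `start`. */
size_t cluster_length(const uint32_t *clusters, size_t n_glyphs, size_t start)
{
    assert(start < n_glyphs);
    size_t end = start + 1;
    while (end < n_glyphs && clusters[end] == clusters[start])
        end++;
    return end - start;
}
```

<para>
  For the final sequence 0,0,3,3, a call such as <literal>count_clusters(values, 4)</literal> reports two clusters, each of which spans two glyphs.
</para>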
||||||
|
<para>
|
||||||
|
Although client programs using HarfBuzz are free to assign
|
||||||
|
initial cluster values in any manner they choose, HarfBuzz
|
||||||
|
does offer some useful guarantees if the cluster values are
|
||||||
|
assigned in a monotonic (either non-decreasing or non-increasing)
|
||||||
|
order.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For left-to-right scripts (LTR) and top-to-bottom scripts (TTB),
|
||||||
|
HarfBuzz will preserve the monotonic property: client programs
|
||||||
|
are guaranteed that monotonically increasing initial cluster
|
||||||
|
values will be returned as monotonically increasing final
|
||||||
|
cluster values.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For right-to-left scripts (RTL) and bottom-to-top scripts (BTT),
|
||||||
|
the directionality of the buffer itself is reversed for final
|
||||||
|
output as a matter of design. Therefore, HarfBuzz inverts the
|
||||||
|
monotonic property: client programs are guaranteed that
|
||||||
|
monotonically increasing initial cluster values will be
|
||||||
|
returned as monotonically <emphasis>decreasing</emphasis> final
|
||||||
|
cluster values.
|
||||||
|
</para>
|
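<para>
  A client program that wants to rely on the monotonic guarantee might check it with something like the following sketch. The function names are hypothetical and the cluster values are assumed to have been copied into a plain array. Nothing here calls into HarfBuzz.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if no cluster value is smaller than the one before it
 * (the expected shape of LTR and TTB output). */
bool is_non_decreasing(const uint32_t *clusters, size_t n)
{
    assert(clusters != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (clusters[i] < clusters[i - 1])
            return false;
    }
    return true;
}

/* True if no cluster value is larger than the one before it
 * (the expected shape of RTL and BTT output). */
bool is_non_increasing(const uint32_t *clusters, size_t n)
{
    assert(clusters != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (clusters[i] > clusters[i - 1])
            return false;
    }
    return true;
}
```

<para>
  With monotonically increasing initial values, a left-to-right run should satisfy <literal>is_non_decreasing()</literal> after shaping, and a right-to-left run should satisfy <literal>is_non_increasing()</literal>.
</para>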
||||||
|
<para>
|
||||||
|
Client programs can adjust how HarfBuzz handles clusters during
|
||||||
shaping by setting the
|
shaping by setting the
|
||||||
<literal>cluster_level</literal> of the
|
<literal>cluster_level</literal> of the
|
||||||
buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
|
buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
|
||||||
|
@ -179,7 +257,7 @@
|
||||||
assign initial cluster values in a buffer by reusing the indices
|
assign initial cluster values in a buffer by reusing the indices
|
||||||
of the code points in the input text. This gives a sequence of
|
of the code points in the input text. This gives a sequence of
|
||||||
cluster values that is monotonically increasing (for example,
|
cluster values that is monotonically increasing (for example,
|
||||||
0,1,2,3,4,5).
|
0,1,2,3,4).
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
It is not <emphasis>required</emphasis> that the cluster values
|
It is not <emphasis>required</emphasis> that the cluster values
|
||||||
|
@ -233,16 +311,44 @@
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
|
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
<section id="a-clustering-example-for-levels-0-and-1">
|
<section id="a-clustering-example-for-levels-0-and-1">
|
||||||
<title>A clustering example for levels 0 and 1</title>
|
<title>A clustering example for levels 0 and 1</title>
|
||||||
<para>
|
<para>
|
||||||
The guarantees and benefits of level 0 and level 1 can be seen
|
The basic shaping operations affect clusters in a predictable
|
||||||
with some examples. First, let us examine what happens with cluster
|
manner when using level 0 or level 1:
|
||||||
values when shaping involves cluster merging with ligatures and
|
|
||||||
decomposition.
|
|
||||||
</para>
|
</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
When two or more clusters <emphasis>merge</emphasis>, the
|
||||||
|
resulting merged cluster takes as its cluster value the
|
||||||
|
<emphasis>minimum</emphasis> of the incoming cluster values.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
When a cluster <emphasis>decomposes</emphasis>, all of the
|
||||||
|
resulting child clusters inherit as their cluster value the
|
||||||
|
cluster value of the parent cluster.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
When a character is <emphasis>reordered</emphasis>, the
|
||||||
|
reordered character and all clusters that the character
|
||||||
|
moves past as part of the reordering are merged into one cluster.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
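<para>
  The merge and decomposition rules listed above can be modeled on a plain array of cluster values. The sketch below is a simplified illustration, not HarfBuzz's internal implementation: <literal>merge_clusters</literal> and <literal>decompose</literal> are hypothetical names, and the merge first widens the requested range so that it always covers whole clusters.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge the glyphs in [start, end) into one cluster. The range is first
 * widened to whole clusters, then every glyph in it receives the minimum
 * cluster value found in the range. */
void merge_clusters(uint32_t *clusters, size_t n, size_t start, size_t end)
{
    assert(start < end && end <= n);

    /* Widen the range so that no cluster is split. */
    while (start > 0 && clusters[start - 1] == clusters[start])
        start--;
    while (end < n && clusters[end] == clusters[end - 1])
        end++;

    uint32_t min = clusters[start];
    for (size_t i = start + 1; i < end; i++) {
        if (clusters[i] < min)
            min = clusters[i];
    }
    for (size_t i = start; i < end; i++)
        clusters[i] = min;
}

/* Decompose the glyph at `index` into `n_parts` components. Every component
 * inherits the parent's cluster value. The result is written to `out`, which
 * must have room for n + n_parts - 1 values. Returns the new length. */
size_t decompose(const uint32_t *in, size_t n, size_t index,
                 size_t n_parts, uint32_t *out)
{
    assert(index < n && n_parts >= 1);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        size_t copies = (i == index) ? n_parts : 1;
        for (size_t j = 0; j < copies; j++)
            out[k++] = in[i];
    }
    return k;
}
```

<para>
  Widening the range before taking the minimum is a design choice that keeps every cluster indivisible: a merge that touches one glyph of a cluster pulls in the rest of that cluster as well.
</para>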
||||||
|
<para>
|
||||||
|
The functionality, guarantees, and benefits of level 0 and level
|
||||||
|
1 behavior can be seen with some examples. First, let us examine
|
||||||
|
what happens with cluster values when shaping involves cluster
|
||||||
|
merging with ligatures and decomposition.
|
||||||
|
</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
Let's say we start with the following character sequence (top row) and
|
Let's say we start with the following character sequence (top row) and
|
||||||
initial cluster values (bottom row):
|
initial cluster values (bottom row):
|
||||||
|
@ -279,8 +385,8 @@
|
||||||
<para>
|
<para>
|
||||||
Next, let us say that the <literal>BC</literal> ligature glyph
|
Next, let us say that the <literal>BC</literal> ligature glyph
|
||||||
decomposes into three components, and <literal>D</literal> also
|
decomposes into three components, and <literal>D</literal> also
|
||||||
decomposes into two components. These components each inherit the
|
decomposes into two components. Whenever a cluster decomposes,
|
||||||
cluster value of their parent:
|
its components each inherit the cluster value of their parent:
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,BC0,BC1,BC2,D0,D1,E
|
A,BC0,BC1,BC2,D0,D1,E
|
||||||
|
@ -295,6 +401,12 @@
|
||||||
A,BC0,BC1,BC2D0,D1,E
|
A,BC0,BC1,BC2D0,D1,E
|
||||||
0,1 ,1 ,1 ,1 ,4
|
0,1 ,1 ,1 ,1 ,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
Note that the entirety of cluster 3 merges into cluster 1, not
|
||||||
|
just the <literal>D0</literal> glyph. This reflects the fact
|
||||||
|
that the cluster <emphasis>must</emphasis> be treated as an
|
||||||
|
indivisible unit.
|
||||||
|
</para>
|
||||||
<para>
|
<para>
|
||||||
At this point, cluster 1 means: the character sequence
|
At this point, cluster 1 means: the character sequence
|
||||||
<literal>BCD</literal> is represented by glyphs
|
<literal>BCD</literal> is represented by glyphs
|
||||||
|
@ -319,18 +431,24 @@
|
||||||
0,1,2,3,4
|
0,1,2,3,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
If <literal>D</literal> is reordered to before <literal>B</literal>,
|
If <literal>D</literal> is reordered to the position immediately
|
||||||
then HarfBuzz merges the <literal>B</literal>,
|
before <literal>B</literal>, then HarfBuzz merges the
|
||||||
<literal>C</literal>, and <literal>D</literal> clusters, and we
|
<literal>B</literal>, <literal>C</literal>, and
|
||||||
get:
|
<literal>D</literal> clusters — all the clusters between
|
||||||
|
the final position of the reordered glyph and its original
|
||||||
|
position. This means that we get:
|
||||||
</para>
|
</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
A,D,B,C,E
|
A,D,B,C,E
|
||||||
0,1,1,1,4
|
0,1,1,1,4
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>
|
<para>
|
||||||
This is clearly not ideal, but it is the only sensible way to
|
as the final cluster sequence.
|
||||||
maintain a monotonic sequence of cluster values and retain the
|
</para>
|
||||||
|
<para>
|
||||||
|
Merging this many clusters is not ideal, but it is the only
|
||||||
|
sensible way for HarfBuzz to maintain the guarantee that the
|
||||||
|
sequence of cluster values remains monotonic and to retain the
|
||||||
true relationship between glyphs and characters.
|
true relationship between glyphs and characters.
|
||||||
</para>
|
</para>
|
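<para>
  The reordering case can be sketched in the same way. The code below is an illustrative model only: <literal>glyph_t</literal> and <literal>reorder_backward</literal> are hypothetical names, and the sketch assumes that each glyph starts out in its own cluster, as in the example above.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A glyph paired with its cluster value (illustrative type). */
typedef struct {
    char     name;
    uint32_t cluster;
} glyph_t;

/* Move the glyph at index `from` back to index `to` (to <= from) and merge
 * every cluster it moves past, plus its own, into a single cluster that
 * takes the minimum of the affected cluster values. */
void reorder_backward(glyph_t *glyphs, size_t n, size_t from, size_t to)
{
    assert(to <= from && from < n);

    glyph_t moved = glyphs[from];
    uint32_t min = moved.cluster;

    /* Shift the glyphs in [to, from) one slot to the right, tracking the
     * smallest cluster value among them. */
    for (size_t i = from; i > to; i--) {
        glyphs[i] = glyphs[i - 1];
        if (glyphs[i].cluster < min)
            min = glyphs[i].cluster;
    }
    glyphs[to] = moved;

    /* Everything between the new and the old position is now one cluster. */
    for (size_t i = to; i <= from; i++)
        glyphs[i].cluster = min;
}
```

<para>
  Applied to A,B,C,D,E with cluster values 0,1,2,3,4, moving <literal>D</literal> from index 3 to index 1 gives A,D,B,C,E with cluster values 0,1,1,1,4, and the sequence stays monotonic.
</para>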
||||||
</section>
|
</section>
|
||||||
|
@ -340,8 +458,9 @@
|
||||||
The preceding examples demonstrate the main effects of using
|
The preceding examples demonstrate the main effects of using
|
||||||
cluster levels 0 and 1. The only difference between the two
|
cluster levels 0 and 1. The only difference between the two
|
||||||
levels is this: in level 0, at the very beginning of the shaping
|
levels is this: in level 0, at the very beginning of the shaping
|
||||||
process, HarfBuzz also merges clusters between any base character
|
process, HarfBuzz merges the cluster of each base character
|
||||||
and all Unicode marks (combining or not) that follow it.
|
with the clusters of all Unicode marks (combining or not) and
|
||||||
|
modifiers that follow it.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
For example, let us start with the following character sequence
|
For example, let us start with the following character sequence
|
||||||
|
@ -361,6 +480,10 @@
|
||||||
A,acute,B
|
A,acute,B
|
||||||
0,0 ,2
|
0,0 ,2
|
||||||
</programlisting>
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
This merger is performed before any other script-shaping
|
||||||
|
steps.
|
||||||
|
</para>
|
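<para>
  The level 0 base-and-mark merge can be approximated with the sketch below. It is a deliberate simplification: the <literal>is_mark</literal> flags are assumed to be supplied by the caller, the function name is hypothetical, and the initial cluster values are assumed to be non-decreasing. The way HarfBuzz actually classifies marks and modifiers is not modeled here.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Give every mark the cluster value of the character before it. Because the
 * loop runs left to right, a run of several marks ends up sharing the
 * cluster value of the base character that precedes the run. */
void merge_marks_into_bases(uint32_t *clusters, const bool *is_mark, size_t n)
{
    assert((clusters != NULL && is_mark != NULL) || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (is_mark[i])
            clusters[i] = clusters[i - 1];
    }
}
```

<para>
  For A,acute,B with initial cluster values 0,1,2 and only the acute flagged as a mark, the result is 0,0,2.
</para>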
||||||
<para>
|
<para>
|
||||||
This initial cluster merging is the default behavior of the
|
This initial cluster merging is the default behavior of the
|
||||||
Windows shaping engine, and the old HarfBuzz codebase copied
|
Windows shaping engine, and the old HarfBuzz codebase copied
|
||||||
|
@ -368,9 +491,10 @@
|
||||||
remained the default behavior in the new HarfBuzz codebase.
|
remained the default behavior in the new HarfBuzz codebase.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
But this initial cluster-merging behavior makes it impossible to
|
But this initial cluster-merging behavior makes it impossible for
|
||||||
|
client programs to implement some features (such as to
|
||||||
color diacritic marks differently from their base
|
color diacritic marks differently from their base
|
||||||
characters. That is why, in level 1, HarfBuzz does not perform
|
characters). That is why, in level 1, HarfBuzz does not perform
|
||||||
the initial merging step.
|
the initial merging step.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
|
@ -378,29 +502,34 @@
|
||||||
perform cursor positioning, level 0 is more convenient. But
|
perform cursor positioning, level 0 is more convenient. But
|
||||||
relying on cluster boundaries for cursor positioning is wrong: cursor
|
relying on cluster boundaries for cursor positioning is wrong: cursor
|
||||||
positions should be determined based on Unicode grapheme
|
positions should be determined based on Unicode grapheme
|
||||||
boundaries, not on shaping-cluster boundaries. As such, level 1
|
boundaries, not on shaping-cluster boundaries. As such, using
|
||||||
clusters are preferred.
|
level 1 clustering behavior is recommended.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
One last note about levels 0 and 1. HarfBuzz currently does not allow a
|
One final facet of levels 0 and 1 is worth noting. HarfBuzz
|
||||||
<literal>MultipleSubst</literal> lookup to replace a glyph with zero
|
currently does not allow any
|
||||||
glyphs (in other words, to delete a glyph). But, in some other situations,
|
<emphasis>multiple-substitution</emphasis> GSUB lookups to
|
||||||
glyphs can be deleted. In those cases, if the glyph being deleted is
|
replace a glyph with zero glyphs (in other words, to delete a
|
||||||
the last glyph of its cluster, HarfBuzz makes sure to merge the cluster
|
glyph).
|
||||||
with a neighboring cluster.
|
</para>
|
||||||
|
<para>
|
||||||
|
But, in some other situations, glyphs can be deleted. In
|
||||||
|
those cases, if the glyph being deleted is the last glyph of its
|
||||||
|
cluster, HarfBuzz makes sure to merge the deleted glyph's
|
||||||
|
cluster with a neighboring cluster.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
This is done primarily to make sure that the starting cluster of the
|
This is done primarily to make sure that the starting cluster of the
|
||||||
text always has the cluster index pointing to the start of the text
|
text always has the cluster index pointing to the start of the text
|
||||||
for the run; more than one client currently relies on this
|
for the run; more than one client program currently relies on this
|
||||||
guarantee.
|
guarantee.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
Incidentally, Apple's CoreText does something else to maintain the
|
Incidentally, Apple's CoreText does something different to
|
||||||
same promise: it inserts a glyph with id 65535 at the beginning of
|
maintain the same promise: it inserts a glyph with id 65535 at
|
||||||
the glyph string if the glyph corresponding to the first character
|
the beginning of the glyph string if the glyph corresponding to
|
||||||
in the run was deleted. HarfBuzz might do something similar in the
|
the first character in the run was deleted. HarfBuzz might do
|
||||||
future.
|
something similar in the future.
|
||||||
</para>
|
</para>
|
||||||
</section>
|
</section>
|
||||||
<section id="level-2">
|
<section id="level-2">
|
||||||
|
@ -415,16 +544,39 @@
|
||||||
performs no merging of clusters whatsoever.
|
performs no merging of clusters whatsoever.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
When glyphs form a ligature (or when some other feature
|
This means that there is no initial base-and-mark merging step
|
||||||
substitutes multiple glyphs with one glyph), the cluster value
|
(as is done in level 0), and it means that reordering moves and
|
||||||
of the first glyph is retained as the cluster value for the
|
ligature substitutions do not trigger a cluster merge.
|
||||||
ligature. However, no subsequent clusters — including
|
|
||||||
marks and modifiers — are affected.
|
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
Level 2 cluster behavior is less complex than level 0 or level
|
Only one shaping operation directly affects clusters when using
|
||||||
1, but there are a few cases in which processing cluster values
|
level 2:
|
||||||
produced at level 2 may be tricky.
|
</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
When a cluster <emphasis>decomposes</emphasis>, all of the
|
||||||
|
resulting child clusters inherit as their cluster value the
|
||||||
|
cluster value of the parent cluster.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>
|
||||||
|
When glyphs do form a ligature (or when some other feature
|
||||||
|
substitutes multiple glyphs with one glyph), the cluster value
|
||||||
|
of the first glyph is retained as the cluster value for the
|
||||||
|
resulting ligature.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
This occurrence sounds similar to a cluster merge, but it is
|
||||||
|
different. In particular, no subsequent characters —
|
||||||
|
including marks and modifiers — are affected. They retain
|
||||||
|
their previous cluster values.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Level 2 cluster behavior is ultimately less complex than level 0
|
||||||
|
or level 1, but there are several cases for which processing
|
||||||
|
cluster values produced at level 2 may be tricky.
|
||||||
</para>
|
</para>
|
||||||
<section id="ligatures-with-combining-marks-in-level-2">
|
<section id="ligatures-with-combining-marks-in-level-2">
|
||||||
<title>Ligatures with combining marks in level 2</title>
|
<title>Ligatures with combining marks in level 2</title>
|
||||||
|
@ -532,10 +684,11 @@
|
||||||
<para>
|
<para>
|
||||||
There may be other problems encountered with ligatures under
|
There may be other problems encountered with ligatures under
|
||||||
level 2, such as if the direction of the text is forced to
|
level 2, such as if the direction of the text is forced to
|
||||||
opposite of its natural direction (for example, left-to-right
|
opposite of its natural direction (for example, Arabic text
|
||||||
Arabic). But, generally speaking, these other scenarios are
|
that is forced into left-to-right directionality). But,
|
||||||
minor corner cases that are too obscure for most client
|
generally speaking, these other scenarios are minor corner
|
||||||
programs to need to worry about.
|
cases that are too obscure for most client programs to need to
|
||||||
|
worry about.
|
||||||
</para>
|
</para>
|
||||||
</section>
|
</section>
|
||||||
</section>
|
</section>
|
||||||
|
|
|
@ -76,12 +76,41 @@
|
||||||
<section>
|
<section>
|
||||||
<title>Terminology</title>
|
<title>Terminology</title>
|
||||||
<variablelist>
|
<variablelist>
|
||||||
|
<?dbfo list-presentation="blocks"?>
|
||||||
|
<varlistentry>
|
||||||
|
<term>script</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
In text shaping, a <emphasis>script</emphasis> is a
|
||||||
|
writing system: a set of symbols, rules, and conventions
|
||||||
|
that is used to represent a language or multiple
|
||||||
|
languages.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
In general computing lingo, the word "script" can also
|
||||||
|
be used to mean an executable program (usually one
|
||||||
|
written in a human-readable programming language). For
|
||||||
|
the sake of clarity, HarfBuzz documents will always use
|
||||||
|
more specific terminology when referring to this
|
||||||
|
meaning, such as "Python script" or "shell script." In
|
||||||
|
all other instances, "script" refers to a writing system.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For developers using HarfBuzz, it is important to note
|
||||||
|
the distinction between a script and a language. Most
|
||||||
|
scripts are used to write a variety of different
|
||||||
|
languages, and many languages may be written in more
|
||||||
|
than one script.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>shaper</term>
|
<term>shaper</term>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
In HarfBuzz, a <emphasis>shaper</emphasis> is a
|
In HarfBuzz, a <emphasis>shaper</emphasis> is a
|
||||||
handler for a specific script shaping model. HarfBuzz
|
handler for a specific script-shaping model. HarfBuzz
|
||||||
implements separate shapers for Indic, Arabic, Thai and
|
implements separate shapers for Indic, Arabic, Thai and
|
||||||
Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the
|
Lao, Khmer, Myanmar, Tibetan, Hangul, Hebrew, the
|
||||||
Universal Shaping Engine (USE), and a default shaper for
|
Universal Shaping Engine (USE), and a default shaper for
|
||||||
|
@ -95,12 +124,12 @@
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
In text shaping, a <emphasis>cluster</emphasis> is a
|
In text shaping, a <emphasis>cluster</emphasis> is a
|
||||||
sequence of codepoints that must be handled as an
|
sequence of code points that must be treated as an
|
||||||
indivisible unit. Clusters can include codepoint
|
indivisible unit. Clusters can include code-point
|
||||||
sequences that form a ligature or base-and-mark
|
sequences that form a ligature or base-and-mark
|
||||||
sequences. Tracking and preserving clusters is important
|
sequences. Tracking and preserving clusters is important
|
||||||
when shaping operations might separate or reorder
|
when shaping operations might separate or reorder
|
||||||
codepoints.
|
code points.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
HarfBuzz provides three cluster
|
HarfBuzz provides three cluster
|
||||||
|
@ -111,7 +140,59 @@
|
||||||
</listitem>
|
</listitem>
|
||||||
</varlistentry>
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>grapheme</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
In linguistics, a <emphasis>grapheme</emphasis> is one
|
||||||
|
of the indivisible units that make up a writing system or
|
||||||
|
script. Often, graphemes are individual symbols (letters,
|
||||||
|
numbers, punctuation marks, logograms, etc.) but,
|
||||||
|
depending on the writing system, a particular grapheme
|
||||||
|
might correspond to a sequence of several Unicode code
|
||||||
|
points.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
In practice, HarfBuzz and other text-shaping engines
|
||||||
|
are not generally concerned with graphemes. However, it
|
||||||
|
is important for developers using HarfBuzz to recognize
|
||||||
|
that there is a difference between graphemes and shaping
|
||||||
|
clusters (see above). The two concepts may overlap
|
||||||
|
frequently, but there is no guarantee that they will be
|
||||||
|
identical.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>syllable</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
In linguistics, a <emphasis>syllable</emphasis> is
|
||||||
|
a sequence of sounds that makes up a building block of a
|
||||||
|
particular language. Every language has its own set of
|
||||||
|
rules describing what constitutes a valid syllable.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
For text-shaping purposes, the various definitions of
|
||||||
|
"syllable" are important because script-specific shaping
|
||||||
|
operations may be applied at the syllable level. For
|
||||||
|
example, a reordering rule might specify that a vowel
|
||||||
|
mark be reordered to the beginning of the syllable.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Syllables will consist of one or more Unicode code
|
||||||
|
points. The definition of a syllable for a particular
|
||||||
|
writing system might correspond to how HarfBuzz
|
||||||
|
identifies clusters (see above) for the same writing
|
||||||
|
system. However, it is important for developers using
|
||||||
|
HarfBuzz to recognize that there is a difference between
|
||||||
|
syllables and shaping clusters. The two concepts may
|
||||||
|
overlap frequently, but there is no guarantee that they
|
||||||
|
will be identical.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
</variablelist>
|
</variablelist>
|
||||||
|
|
||||||
</section>
|
</section>
|
||||||
|
|
|
@ -126,7 +126,7 @@
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
If you need to build HarfBuzz from source, first put the
|
If you need to build HarfBuzz from source, first put the
|
||||||
<program>ragel</program> binary on your
|
<package>ragel</package> binary on your
|
||||||
<literal>PATH</literal>, then follow the appveyor CI cmake
|
<literal>PATH</literal>, then follow the appveyor CI cmake
|
||||||
<ulink
|
<ulink
|
||||||
url="https://github.com/harfbuzz/harfbuzz/blob/master/appveyor.yml">build
|
url="https://github.com/harfbuzz/harfbuzz/blob/master/appveyor.yml">build
|
||||||
|
@ -229,6 +229,7 @@
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<variablelist>
|
<variablelist>
|
||||||
|
<?dbfo list-presentation="blocks"?>
|
||||||
<varlistentry>
|
<varlistentry>
|
||||||
<term>--with-libstdc++</term>
|
<term>--with-libstdc++</term>
|
||||||
<listitem>
|
<listitem>
|
||||||
|
|
|
@ -182,22 +182,23 @@
|
||||||
Southeast Asian scripts are also assigned
|
Southeast Asian scripts are also assigned
|
||||||
<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
|
<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
|
||||||
<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
|
<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
|
||||||
property that provides more detailed information needed for
|
properties that provide more detailed information needed for
|
||||||
shaping.
|
shaping.
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
The UISC property sub-categorizes Letters and Marks according to
|
The UISC property sub-categorizes Letters and Marks according to
|
||||||
common script-shaping behaviors. For example, UISC distinguishes
|
common script-shaping behaviors. For example, UISC distinguishes
|
||||||
between consonant letters, vowel letters, and vowel marks. The
|
between consonant letters, vowel letters, and vowel marks. The
|
||||||
UIPC property sub-categorizes Mark codepoints by the visual
|
UIPC property sub-categorizes Mark codepoints by the relative visual
|
||||||
position that they occupy (above, below, right, left, or in
|
position that they occupy (above, below, right, left, or in
|
||||||
multiple positions).
|
multiple positions).
|
||||||
</para>
|
</para>
|
||||||
<para>
|
<para>
|
||||||
Some complex scripts require that the text run be split into
|
Some complex scripts require that the text run be split into
|
||||||
syllables, and what constitutes a valid syllable in these
|
syllables. What constitutes a valid syllable in these
|
||||||
scripts is specified in regular expressions of the Letter and
|
scripts is specified in regular expressions, formed from the
|
||||||
Mark codepoints that take the UISC and UIPC properties into account.
|
Letter and Mark codepoints, that take the UISC and UIPC
|
||||||
|
properties into account.
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
</section>
|
</section>
|
||||||
|
|