<chapter id="what-is-harfbuzz">
<title>What is HarfBuzz?</title>
<para>
HarfBuzz is a <emphasis>text-shaping engine</emphasis>. If you
give HarfBuzz a font and a string containing a sequence of Unicode
codepoints, HarfBuzz selects and positions the corresponding
glyphs from the font, applying all of the necessary layout rules
and font features. HarfBuzz then returns the result to you as a
sequence of glyphs that is correctly arranged for the language and
writing system.
</para>
<para>
HarfBuzz can properly shape all of the world's major writing
systems. It runs on all major operating systems and software
platforms, and it supports all of the modern font formats in use
today.
</para>
<section id="what-is-text-shaping">
<title>What is text shaping?</title>
<para>
Text shaping is the process of translating a string of character
codes (such as Unicode codepoints) into a properly arranged
sequence of glyphs that can be rendered onto a screen or into
final output form for inclusion in a document.
</para>
<para>
The shaping process depends on the input string, the active
font, the script (or writing system) that the string is in, and
the language that the string is in.
</para>
<para>
Modern software systems generally deal only with strings in the
Unicode encoding scheme (although legacy systems and documents may
involve other encodings).
</para>
<para>
There are several font formats that a program might
encounter, each of which has a set of standard text-shaping
rules.
</para>
<para>The dominant format is <ulink
url="http://www.microsoft.com/typography/otspec/">OpenType</ulink>. The
OpenType specification defines a series of <ulink url="https://github.com/n8willis/opentype-shaping-documents">shaping models</ulink> for
various scripts from around the world. These shaping models depend on
the font including certain features in its <literal>GSUB</literal>
and <literal>GPOS</literal> tables.
</para>
<para>
Alternatively, OpenType fonts can include shaping features for
the <ulink url="https://graphite.sil.org/">Graphite</ulink> shaping model.
</para>
<para>
TrueType fonts can also include OpenType shaping
features. Alternatively, they can include <ulink url="https://developer.apple.com/fonts/TrueType-Reference-Manual/RM09/AppendixF.html">Apple
Advanced Typography</ulink> (AAT) tables to implement shaping
support. AAT fonts are generally only found on macOS and iOS systems.
</para>
<para>
Text strings will usually be tagged with a script and language
tag that provide the context needed to perform text shaping
correctly. The necessary <ulink
url="https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags">script</ulink>
and <ulink
url="https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags">language</ulink>
tags are defined by OpenType.
</para>
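<para>
As a rough illustration of what such a tag is: OpenType script and
language-system tags are four-byte ASCII identifiers (shorter ones are
padded with trailing spaces), and shaping code commonly handles them as
32-bit integers. The Python sketch below shows that packing. The helper
names <literal>ot_tag</literal> and <literal>tag_to_str</literal> are
invented for this example and are not part of the HarfBuzz API.
</para>
<programlisting language="python"><![CDATA[
```python
def ot_tag(name: str) -> int:
    """Pack a 1-4 character OpenType tag into a 32-bit big-endian integer."""
    if not 1 <= len(name) <= 4 or not name.isascii():
        raise ValueError("an OpenType tag is 1 to 4 ASCII characters")
    # Short tags are padded on the right with spaces, e.g. "ARA" -> "ARA ".
    return int.from_bytes(name.ljust(4).encode("ascii"), "big")


def tag_to_str(tag: int) -> str:
    """Unpack a 32-bit tag back into its four-character form."""
    return tag.to_bytes(4, "big").decode("ascii")


script = ot_tag("arab")    # script tag for Arabic
language = ot_tag("ARA")   # language-system tag for Arabic, stored as "ARA "
print(hex(script), repr(tag_to_str(language)))  # 0x61726162 'ARA '
```
]]></programlisting>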
</section>

<section id="why-do-i-need-a-shaping-engine">
<title>Why do I need a shaping engine?</title>
<para>
Text shaping is an integral part of preparing text for
display. Before a Unicode sequence can be rendered, the
codepoints in the sequence must be mapped to the corresponding
glyphs provided in the font, and those glyphs must be positioned
correctly relative to each other. For many of the scripts
supported in Unicode, these steps involve script-specific layout
rules, including complex joining, reordering, and positioning
behavior. Implementing these rules is the job of the shaping engine.
</para>
<para>
Text shaping is a fairly low-level operation. HarfBuzz is
used directly by graphical rendering libraries like <ulink
url="https://www.pango.org/">Pango</ulink>, as well as by the layout
engines in Firefox, LibreOffice, and Chromium. Unless you are
<emphasis>writing</emphasis> one of these layout engines
yourself, you will probably not need to use HarfBuzz directly: normally,
lower-level libraries will turn text into glyphs for you.
</para>
<para>
However, if you <emphasis>are</emphasis> writing a layout engine
or graphics library yourself, then you will need to perform text
shaping, and this is where HarfBuzz can help you.
</para>
<para>
Here are some specific scenarios where a text-shaping engine
like HarfBuzz helps you:
</para>
<itemizedlist>
<listitem>
<para>
OpenType fonts contain a set of glyphs (that is, shapes
to represent the letters, numbers, punctuation marks, and
all other symbols), which are indexed by a <literal>glyph ID</literal>.
</para>
<para>
A particular glyph ID within the font does not necessarily
correlate to a predictable Unicode codepoint. For instance,
some fonts have the letter "a" as glyph ID 1, but
many others do not. In order to retrieve the right glyph
from the font to display "a", you need to consult
the table inside the font (the <literal>cmap</literal>
table) that maps Unicode codepoints to glyph IDs. In other
words, <emphasis>text shaping turns codepoints into glyph
IDs</emphasis>.
</para>
</listitem>
<listitem>
<para>
Many OpenType fonts contain ligatures: combinations of
characters that are rendered as a single unit. For instance,
it is common for the <literal>fi</literal> letter
combination to appear in print as the single ligature glyph
"ﬁ".
</para>
<para>
Whether you should render an "f, i" sequence
as <literal>fi</literal> or as "ﬁ" does not
depend on the input text. Instead, it depends on whether
or not the font includes an "ﬁ" glyph and on the
level of ligature application you wish to perform. The font
and the amount of ligature application used are under your
control. In other words, <emphasis>text shaping involves
querying the font's ligature tables and determining what
substitutions should be made</emphasis>.
</para>
</listitem>
<listitem>
<para>
While ligatures like "ﬁ" are optional typographic
refinements, some languages <emphasis>require</emphasis> certain
substitutions to be made in order to display text correctly.
</para>
<para>
For example, in Tamil, when the letter "TTA" (ட)
is followed by the vowel sign "U" (ு), the pair
must be replaced by the single glyph "டு". The
two-character Unicode sequence (U+0B9F, U+0BC1) needs to be
substituted with a single "டு" glyph from the
font.
</para>
<para>
But "டு" does not have a Unicode codepoint of its own. To
find this glyph, you need to consult the table inside
the font (the <literal>GSUB</literal> table) that contains
substitution information. In other words, <emphasis>text shaping
chooses the correct glyph for a sequence of characters
provided</emphasis>.
</para>
</listitem>
<listitem>
<para>
Similarly, Arabic letters have up to four different variants,
corresponding to the different positions in which a letter might
appear within a sequence. Inside a font, there will be separate
glyphs for the initial, medial, final, and isolated forms of
each letter, each at a different glyph ID.
</para>
<para>
Unicode only assigns one codepoint per character, so a
Unicode string will not tell you which glyph variant to use
for each character. To decide, you need to analyze the whole
string and determine the appropriate glyph for each character
based on its position. In other words, <emphasis>text
shaping chooses the correct form of the letter by its
position and returns the correct glyph from the font</emphasis>.
</para>
</listitem>
<listitem>
<para>
Other languages involve marks and accents that need to be
rendered in specific positions relative to a base character. For
instance, the Moldovan language includes the Cyrillic letter
"zhe" (ж) with a breve accent, like so: "ӂ".
</para>
<para>
Some fonts will provide this character as a single
zhe-with-breve glyph, but other fonts will not and, instead,
will expect the rendering engine to form the character by
superimposing the separate "ж" and "˘"
glyphs.
</para>
<para>
But exactly where you should draw the breve depends on the
height and width of the preceding zhe glyph. To find the
right position, you need to consult the table inside
the font (the <literal>GPOS</literal> table) that contains
positioning information.
In other words, <emphasis>text shaping tells you whether you
have a precomposed glyph within your font or if you need to
compose a glyph yourself out of combining marks, and,
if so, where to position those marks.</emphasis>
</para>
</listitem>
</itemizedlist>
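<para>
The first two scenarios in the list above can be sketched in a few lines
of Python. This is a toy model only, not how HarfBuzz is implemented:
the <literal>CMAP</literal> and <literal>LIGATURES</literal> dictionaries
stand in for a font's <literal>cmap</literal> and <literal>GSUB</literal>
tables, and every glyph ID in them is invented for the example.
</para>
<programlisting language="python"><![CDATA[
```python
# Toy font data. Every glyph ID below is invented for this example.
CMAP = {ord("f"): 23, ord("i"): 41, ord("n"): 50}  # codepoint -> glyph ID
LIGATURES = {(23, 41): 301}                        # (f, i) -> ligature glyph
NOTDEF = 0                                         # glyph ID 0: missing glyph


def map_codepoints(text):
    """Step 1: turn codepoints into glyph IDs via the cmap stand-in."""
    return [CMAP.get(ord(ch), NOTDEF) for ch in text]


def apply_ligatures(glyphs, enabled=True):
    """Step 2: replace glyph pairs that have a ligature, if enabled."""
    if not enabled:
        return list(glyphs)
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


glyphs = map_codepoints("fin")
print(glyphs)                          # [23, 41, 50]
print(apply_ligatures(glyphs))         # [301, 50]
print(apply_ligatures(glyphs, False))  # [23, 41, 50]
```
]]></programlisting>
<para>
Note that the same input text produces different glyph sequences
depending on the font data and on whether ligatures are enabled, which
is the point made in the list above.
</para>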
<para>
If tasks like these are something that you need to do, then you
need a text-shaping engine. You could use Uniscribe if you are
writing Windows software; you could use CoreText on macOS; or
you could use HarfBuzz.
</para>
<note>
<para>
In the rest of this manual, the text will assume that the reader
is the implementor of a text-layout engine.
</para>
</note>
</section>
<section>
<title>What does HarfBuzz do?</title>
<para>
HarfBuzz provides OpenType text shaping through a cross-platform
C API that accepts sequences of Unicode input text. Currently,
the following OpenType shaping models are supported:
</para>
<itemizedlist>
<listitem>
<para>
Indic (covering Devanagari, Bengali, Gujarati,
Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, and
Sinhala)
</para>
</listitem>
<listitem>
<para>
Arabic (covering Arabic, N'Ko, Syriac, and Mongolian)
</para>
</listitem>
<listitem>
<para>
Thai and Lao
</para>
</listitem>
<listitem>
<para>
Khmer
</para>
</listitem>
<listitem>
<para>
Myanmar
</para>
</listitem>
<listitem>
<para>
Tibetan
</para>
</listitem>
<listitem>
<para>
Hangul
</para>
</listitem>
<listitem>
<para>
Hebrew
</para>
</listitem>
<listitem>
<para>
The Universal Shaping Engine or <emphasis>USE</emphasis>
(covering complex scripts not covered by the above shaping
models)
</para>
</listitem>
<listitem>
<para>
A default shaping model for non-complex scripts
(covering Latin, Cyrillic, Greek, Armenian, Georgian, Tifinagh,
and many others)
</para>
</listitem>
<listitem>
<para>
Emoji (including emoji modifier sequences, flag sequences,
and ZWJ sequences)
</para>
</listitem>
</itemizedlist>
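<para>
To give a flavor of what a script-specific shaping model involves, here
is a toy Python sketch of the positional-form selection at the heart of
Arabic shaping. It is a deliberate simplification and not HarfBuzz's
algorithm: it knows only five letters, ignores combining marks and
joiner characters, and the function name
<literal>positional_forms</literal> is invented for this example.
</para>
<programlisting language="python"><![CDATA[
```python
# Simplified joining types: "D" joins on both sides, "R" joins only to the
# preceding letter. Anything not listed is treated as non-joining.
JOINING_TYPE = {
    "\u0628": "D",  # BEH
    "\u062A": "D",  # TEH
    "\u064A": "D",  # YEH
    "\u0627": "R",  # ALEF
    "\u062F": "R",  # DAL
}


def positional_forms(text):
    """Return the positional form name for each character, in logical order."""
    types = [JOINING_TYPE.get(ch) for ch in text]
    forms = []
    for i, current in enumerate(types):
        previous = types[i - 1] if i > 0 else None
        following = types[i + 1] if i + 1 < len(types) else None
        # A letter connects backwards only if the previous letter is
        # dual-joining; it connects forwards only if it is dual-joining itself.
        joins_previous = current in ("D", "R") and previous == "D"
        joins_following = current == "D" and following in ("D", "R")
        if joins_previous and joins_following:
            forms.append("medial")
        elif joins_previous:
            forms.append("final")
        elif joins_following:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


print(positional_forms("\u0628\u064A\u062A"))  # ['initial', 'medial', 'final']
print(positional_forms("\u0628\u0627\u0628"))  # ['initial', 'final', 'isolated']
```
]]></programlisting>
<para>
In the second word, the final BEH is isolated because the ALEF before it
never connects to the letter that follows it. A shaper then picks the
glyph for each form from the font.
</para>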

<para>
In addition to OpenType shaping, HarfBuzz supports the latest
version of Graphite shaping. HarfBuzz currently supports AAT
shaping only on macOS and iOS systems, and in a pass-through
fashion: HarfBuzz hands off AAT support to the system CoreText
library. However, full, built-in AAT support within HarfBuzz is
under development.
</para>

<para>
HarfBuzz can read and understand TrueType fonts (.ttf), TrueType
collections (.ttc), and OpenType fonts (.otf, including those
fonts that contain TrueType-style outlines and those that
contain PostScript CFF or CFF2 outlines).
</para>

<para>
HarfBuzz can run on top of the FreeType, CoreText, DirectWrite,
or Uniscribe font renderers.
</para>

<para>
In addition to its core shaping functionality, HarfBuzz provides
functions for accessing other font features, including optional
GSUB and GPOS OpenType features, as well as
all color-font formats (<literal>CBDT</literal>,
<literal>sbix</literal>, <literal>COLR/CPAL</literal>, and
<literal>SVG-OT</literal>) and OpenType variable fonts. HarfBuzz
also includes a font-subsetting feature.
</para>

<para>
HarfBuzz can perform some low-level math-shaping operations,
although it does not currently perform full shaping for
mathematical typesetting.
</para>

<para>
A suite of command-line utilities is also provided in the
source-code tree, designed to help users test and debug
HarfBuzz's features on real-world fonts and input.
</para>
</section>
<section id="what-harfbuzz-doesnt-do">
<title>What HarfBuzz doesn't do</title>
<para>
HarfBuzz will take a Unicode string, shape it, and give you the
information required to lay it out correctly on a single
horizontal (or vertical) line using the font provided. That is the
extent of HarfBuzz's responsibility.
</para>
<para>
It is important to note that if you are implementing a complete
text-layout engine you may have other responsibilities that
HarfBuzz will <emphasis>not</emphasis> help you with. For example:
</para>
<itemizedlist>
<listitem>
<para>
HarfBuzz won't help you with bidirectionality. If you want to
lay out text that includes a mix of Hebrew and English, you
will need to ensure that each buffer provided to HarfBuzz has its
characters in the correct layout order. This will be different
from the logical order in which the Unicode text is stored. In
other words, the user will hit the keys in the following
sequence:
</para>
<programlisting>
A B C [space] ג ב א [space] D E F
</programlisting>
<para>
but will expect to see in the output:
</para>
<programlisting>
ABC אבג DEF
</programlisting>
<para>
This reordering is called <emphasis>bidi processing</emphasis>
("bidi" is short for bidirectional). The Unicode Bidirectional
Algorithm, published as Unicode Standard Annex #9, tells you how
to reorder a string from logical order into presentation order.
Before sending your string to HarfBuzz, you may need to apply the
bidi algorithm to it. Libraries such as <ulink
url="http://icu-project.org/">ICU</ulink> and <ulink
url="http://fribidi.org/">fribidi</ulink> can do this for you.
</para>
</listitem>
<listitem>
<para>
HarfBuzz won't help you with text that contains different font
properties. For instance, if you have the string "a
<emphasis>huge</emphasis> breakfast", and you expect
"huge" to be italic, then you will need to send three
strings to HarfBuzz: <literal>a</literal>, in your Roman font;
<literal>huge</literal> using your italic font; and
<literal>breakfast</literal> using your Roman font again.
</para>
<para>
Similarly, if you change the font, font size, script,
language, or direction within your string, then you will
need to shape each run independently and output them
independently. HarfBuzz expects to shape a run of characters
that all share the same properties.
</para>
</listitem>
<listitem>
<para>
HarfBuzz won't help you with line breaking, hyphenation, or
justification. As mentioned above, HarfBuzz lays out the string
along a <emphasis>single line</emphasis> of, notionally,
infinite length. If you want to find out where the potential
word, sentence, and line break points are in your text, you
could use the ICU library's break iterator functions.
</para>
<para>
HarfBuzz can tell you how wide a shaped piece of text is, which is
useful input to a justification algorithm, but it knows nothing
about paragraphs, lines, or line lengths. Nor will it adjust the
space between words to fit them proportionally into a line. If you
want to lay out text in paragraphs, you will probably want to send
each word of your text to HarfBuzz to determine its shaped width
after glyph substitutions, then work out how many words will fit
on a line, and then finally output each word of the line separated
by a space of the correct size to fully justify the paragraph.
</para>
</listitem>
</itemizedlist>
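<para>
As a rough sketch of the kind of run splitting that has to happen
before shaping, the Python code below cuts a string into runs of
uniform direction using the standard <literal>unicodedata</literal>
module. It is emphatically not the Unicode bidi algorithm: it simply
attaches characters with no strong direction to the preceding run and
assumes a left-to-right paragraph. A real layout engine should rely on
a library such as ICU or fribidi. The function names are invented for
this example.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata


def direction_of(ch):
    """Return "ltr", "rtl", or None for characters with no strong direction."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def split_runs(text):
    """Split text (in logical order) into (direction, substring) runs."""
    runs = []  # each entry is [direction, substring]
    for ch in text:
        direction = direction_of(ch)
        if runs and direction in (None, runs[-1][0]):
            # Same direction, or a neutral character: extend the current run.
            runs[-1][1] += ch
        else:
            # Assumption: a leading neutral character starts an "ltr" run.
            runs.append([direction or "ltr", ch])
    return [(direction, substring) for direction, substring in runs]


# "ABC", three Hebrew letters (alef, bet, gimel), then "DEF"
for run in split_runs("ABC \u05d0\u05d1\u05d2 DEF"):
    print(run)
```
]]></programlisting>
<para>
Each resulting run shares a single direction, so it could be shaped on
its own. The same idea extends to splitting on changes of font, script,
or language.
</para>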
<para>
If you are a layout-engine implementor, HarfBuzz will help you with the
interface between your text and your font, and that's something
that you'll need. What you then do with the glyphs that your font
returns is up to you.
</para>
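<para>
For instance, the "work out how many words will fit on a line" step
mentioned in the list above could be sketched as a greedy loop like the
one below. The widths are hypothetical numbers standing in for the
shaped widths you would obtain from HarfBuzz, and
<literal>fill_lines</literal> is a name invented for this example.
Justification and hyphenation are left out.
</para>
<programlisting language="python"><![CDATA[
```python
def fill_lines(words, widths, space_width, line_width):
    """Greedy line filling: put as many words on each line as will fit."""
    lines, current, used = [], [], 0
    for word in words:
        if current:
            needed = used + space_width + widths[word]
        else:
            needed = widths[word]
        if current and needed > line_width:
            # The word does not fit: close this line and start a new one.
            lines.append(current)
            current, used = [word], widths[word]
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    return lines


# Hypothetical shaped widths, in font units.
WIDTHS = {"a": 500, "huge": 2100, "breakfast": 4600}
print(fill_lines(["a", "huge", "breakfast"], WIDTHS, 250, 5000))
# [['a', 'huge'], ['breakfast']]
```
]]></programlisting>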
</section>

<section id="why-is-it-called-harfbuzz">
<title>Why is it called HarfBuzz?</title>
<para>
HarfBuzz began its life as text-shaping code within the FreeType
project (and you will see references to the FreeType authors
within the source code copyright declarations), but was then
split out into its own project. This project is maintained by
Behdad Esfahbod, who named it HarfBuzz. Originally, it was a
shaping engine for OpenType fonts: "HarfBuzz" is
Persian for "open type".
</para>
</section>
</chapter>