[Docs] Usermanual; fill out Buffers chapter.
This commit is contained in:
parent
6d9a86ae75
commit
3b301c5ac6
|
@ -7,30 +7,38 @@
|
|||
<chapter id="buffers-language-script-and-direction">
|
||||
<title>Buffers, language, script and direction</title>
|
||||
<para>
|
||||
The input to HarfBuzz is a series of Unicode characters, stored in a
|
||||
The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
|
||||
buffer. In this chapter, we'll look at how to set up a buffer with
|
||||
the text that we want and then customize the properties of the
|
||||
buffer.
|
||||
the text that we want and how to customize the properties of the
|
||||
buffer. We'll also look at a piece of lower-level machinery that
|
||||
you will need to understand before proceeding: the functions that
|
||||
HarfBuzz uses to retrieve Unicode information.
|
||||
</para>
|
||||
<para>
|
||||
After shaping is complete, HarfBuzz puts its output back
|
||||
into the buffer. But getting that output requires setting up a
|
||||
face and a font first, so we will look at that in the next chapter
|
||||
instead of here.
|
||||
</para>
|
||||
<section id="creating-and-destroying-buffers">
|
||||
<title>Creating and destroying buffers</title>
|
||||
<para>
|
||||
As we saw in our <emphasis>Getting Started</emphasis> example, a
|
||||
buffer is created and
|
||||
initialized with <literal>hb_buffer_create()</literal>. This
|
||||
initialized with <function>hb_buffer_create()</function>. This
|
||||
produces a new, empty buffer object, instantiated with some
|
||||
default values and ready to accept your Unicode strings.
|
||||
</para>
|
||||
<para>
|
||||
HarfBuzz manages the memory of objects (such as buffers) that it
|
||||
creates, so you don't have to. When you have finished working on
|
||||
a buffer, you can call <literal>hb_buffer_destroy()</literal>:
|
||||
a buffer, you can call <function>hb_buffer_destroy()</function>:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_t *buffer = hb_buffer_create();
|
||||
...
|
||||
hb_buffer_destroy(buffer);
|
||||
</programlisting>
|
||||
hb_buffer_t *buf = hb_buffer_create();
|
||||
...
|
||||
hb_buffer_destroy(buf);
|
||||
</programlisting>
|
||||
<para>
|
||||
This will destroy the object and free its associated memory -
|
||||
unless some other part of the program holds a reference to this
|
||||
|
@ -39,46 +47,350 @@
|
|||
else destroying it, you should increase its reference count:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
void somefunc(hb_buffer_t *buffer) {
|
||||
buffer = hb_buffer_reference(buffer);
|
||||
...
|
||||
</programlisting>
|
||||
void somefunc(hb_buffer_t *buf) {
|
||||
buf = hb_buffer_reference(buf);
|
||||
...
|
||||
</programlisting>
|
||||
<para>
|
||||
And then decrease it once you're done with it:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_destroy(buffer);
|
||||
}
|
||||
</programlisting>
|
||||
hb_buffer_destroy(buf);
|
||||
}
|
||||
</programlisting>
|
||||
<para>
|
||||
While we are on the subject of reference-counting buffers, it is
|
||||
worth noting that an individual buffer can only meaningfully be
|
||||
used by one thread at a time.
|
||||
</para>
|
||||
<para>
|
||||
To throw away all the data in your buffer and start from scratch,
|
||||
call <literal>hb_buffer_reset(buffer)</literal>. If you want to
|
||||
call <function>hb_buffer_reset(buf)</function>. If you want to
|
||||
throw away the string in the buffer but keep the options, you can
|
||||
instead call <literal>hb_buffer_clear_contents(buffer)</literal>.
|
||||
instead call <function>hb_buffer_clear_contents(buf)</function>.
|
||||
</para>
|
||||
</section>
|
||||
|
||||
<section id="adding-text-to-the-buffer">
|
||||
<title>Adding text to the buffer</title>
|
||||
<para>
|
||||
Now we have a brand new HarfBuzz buffer. Let's start filling it
|
||||
with text! From HarfBuzz's perspective, a buffer is just a stream
|
||||
of Unicode codepoints, but your input string is probably in one of
|
||||
the standard Unicode character encodings (UTF-8, UTF-16, UTF-32)
|
||||
of Unicode code points, but your input string is probably in one of
|
||||
the standard Unicode character encodings (UTF-8, UTF-16, or
|
||||
UTF-32). HarfBuzz provides convenience functions that accept
|
||||
each of these encodings:
|
||||
<function>hb_buffer_add_utf8()</function>,
|
||||
<function>hb_buffer_add_utf16()</function>, and
|
||||
<function>hb_buffer_add_utf32()</function>. Other than the
|
||||
character encoding they accept, they function identically.
|
||||
</para>
|
||||
<para>
|
||||
You can add UTF-8 text to a buffer by passing in the text array,
|
||||
the array's length, an offset into the array for the first
|
||||
character to add, and the length of the segment to add:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_add_utf8 (hb_buffer_t *buf,
|
||||
const char *text,
|
||||
int text_length,
|
||||
unsigned int item_offset,
|
||||
int item_length)
|
||||
</programlisting>
|
||||
<para>
|
||||
So, in practice, you can say:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
|
||||
</programlisting>
|
||||
<para>
|
||||
This will append your new characters to
|
||||
<parameter>buf</parameter>, not replace its existing
|
||||
contents. Also, note that you can use <literal>-1</literal> in
|
||||
place of the first instance of <function>strlen(text)</function>
|
||||
if your text array is NULL-terminated. Similarly, you can also use
|
||||
<literal>-1</literal> as the final argument want to add its full
|
||||
contents.
|
||||
</para>
|
||||
<para>
|
||||
Whatever start <parameter>item_offset</parameter> and
|
||||
<parameter>item_length</parameter> you provide, HarfBuzz will also
|
||||
attempt to grab the five characters <emphasis>before</emphasis>
|
||||
the offset point and the five characters
|
||||
<emphasis>after</emphasis> the designated end. These are the
|
||||
before and after "context" segments, which are used internally
|
||||
for HarfBuzz to make shaping decisions. They will not be part of
|
||||
the final output, but they ensure that HarfBuzz's
|
||||
script-specific shaping operations are correct. If there are
|
||||
fewer than five characters available for the before or after
|
||||
contexts, HarfBuzz will just grab what is there.
|
||||
</para>
|
||||
<para>
|
||||
For longer text runs, such as full paragraphs, it might be
|
||||
tempting to only add smaller sub-segments to a buffer and
|
||||
shape them in piecemeal fashion. Generally, this is not a good
|
||||
idea, however, because a lot of shaping decisions are
|
||||
dependent on this context information. For example, in Arabic
|
||||
and other connected scripts, HarfBuzz needs to know the code
|
||||
points before and after each character in order to correctly
|
||||
determine which glyph to return.
|
||||
</para>
|
||||
<para>
|
||||
The safest approach is to add all of the text available, then
|
||||
use <parameter>item_offset</parameter> and
|
||||
<parameter>item_length</parameter> to indicate which characters you
|
||||
want shaped, so that HarfBuzz has access to any context.
|
||||
</para>
|
||||
<para>
|
||||
You can also add Unicode code points directly with
|
||||
<function>hb_buffer_add_codepoints()</function>. The arguments
|
||||
to this function are the same as those for the UTF
|
||||
encodings. But it is particularly important to note that
|
||||
HarfBuzz does not do validity checking on the text that is added
|
||||
to a buffer. Invalid code points will be replaced, but it is up
|
||||
to you to do any deep-sanity checking necessary.
|
||||
</para>
|
||||
|
||||
</section>
|
||||
|
||||
<section id="setting-buffer-properties">
|
||||
<title>Setting buffer properties</title>
|
||||
<para>
|
||||
Buffers containing input characters still need several
|
||||
properties set before HarfBuzz can shape their text correctly.
|
||||
</para>
|
||||
</section>
|
||||
<section id="what-about-the-other-scripts">
|
||||
<title>What about the other scripts?</title>
|
||||
<para>
|
||||
Initially, all buffers are set to the
|
||||
<literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
|
||||
type. After adding text, the buffer should be set to
|
||||
<literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
|
||||
indicates that it contains un-shaped input
|
||||
characters. After shaping, the buffer will have the
|
||||
<literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
|
||||
</para>
|
||||
<para>
|
||||
<function>hb_buffer_add_utf8()</function> and the
|
||||
other UTF functions set the content type of their buffer
|
||||
automatically. But if you are reusing a buffer you may want to
|
||||
check its state with
|
||||
<function>hb_buffer_get_content_type(buffer)</function>. If
|
||||
necessary you can set the content type with
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
|
||||
</programlisting>
|
||||
<para>
|
||||
to prepare for shaping.
|
||||
</para>
|
||||
<para>
|
||||
Buffers also need to carry information about the script,
|
||||
language, and text direction of their contents. You can set
|
||||
these properties individually:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
|
||||
hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
|
||||
hb_buffer_set_language(buf, hb_language_from_string("en", -1));
|
||||
</programlisting>
|
||||
<para>
|
||||
However, since these properties are often the repeated for
|
||||
multiple text runs, you can also save them in a
|
||||
<literal>hb_segment_properties_t</literal> for reuse:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_segment_properties_t *savedprops;
|
||||
hb_buffer_get_segment_properties (buf, savedprops);
|
||||
...
|
||||
hb_buffer_set_segment_properties (buf2, savedprops);
|
||||
</programlisting>
|
||||
<para>
|
||||
HarfBuzz also provides getter functions to retrieve a buffer's
|
||||
direction, script, and language properties individually.
|
||||
</para>
|
||||
<para>
|
||||
HarfBuzz recognizes four text directions in
|
||||
<type>hb_direction_t</type>: left-to-right
|
||||
(<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
|
||||
top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
|
||||
bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
|
||||
script property, HarfBuzz uses identifiers based on the
|
||||
<ulink
|
||||
url="https://unicode.org/iso15924/">ISO-15924
|
||||
standard</ulink>. For languages, HarfBuzz uses tags based on the
|
||||
<ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
|
||||
</para>
|
||||
<para>
|
||||
Helper functions are provided to convert character strings into
|
||||
the necessary script and language tag types.
|
||||
</para>
|
||||
<para>
|
||||
Two additional buffer properties to be aware of are the
|
||||
"invisible glyph" and the replacement code point. The
|
||||
replacement code point is inserted into buffer output in place of
|
||||
any invalid code points encountered in the input. By default, it
|
||||
is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
|
||||
point, <literal>U+FFFD</literal> "�". You can change this with
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_set_replacement_codepoint(buf, replacement);
|
||||
</programlisting>
|
||||
<para>
|
||||
The invisible glyph is used to replace all output glyphs that
|
||||
are invisible. By default, the standard space character
|
||||
<literal>U+0020</literal> is used; you can replace this (for
|
||||
example, when using a font that provides script-specific
|
||||
spaces) with
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_buffer_set_invisible_glyph(buf, replacement);
|
||||
</programlisting>
|
||||
<para>
|
||||
HarfBuzz supports a few additional flags you might want to set
|
||||
on your buffer under certain circumstances. The
|
||||
<literal>HB_BUFFER_FLAG_BOT</literal> and
|
||||
<literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
|
||||
that the buffer represents the beginning or end (respectively)
|
||||
of a text element (such as a paragraph or other block). Knowing
|
||||
this allows HarfBuzz to apply certain contextual font features
|
||||
when shaping, such as initial or final variants in connected
|
||||
scripts.
|
||||
</para>
|
||||
<para>
|
||||
<literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
|
||||
tells HarfBuzz not to hide glyphs with the
|
||||
<literal>Default_Ignorable</literal> property in Unicode. This
|
||||
property designates control characters and other non-printing
|
||||
code points, such as joiners and variation selectors. Normally
|
||||
HarfBuzz replaces them in the output buffer with zero-width
|
||||
space glyphs; setting this flag causes them to be printed,
|
||||
which can be helpful for troubleshooting.
|
||||
</para>
|
||||
<para>
|
||||
Conversely, setting the
|
||||
<literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
|
||||
tells HarfBuzz to remove <literal>Default_Ignorable</literal>
|
||||
glyphs from the output buffer entirely. Finally, setting the
|
||||
<literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
|
||||
flag tells HarfBuzz not to insert the dotted-circle glyph
|
||||
(<literal>U+25CC</literal>, "◌"), which is normally
|
||||
inserted into buffer output when broken character sequences are
|
||||
encountered (such as combining marks that are not attached to a
|
||||
base character).
|
||||
</para>
|
||||
</section>
|
||||
|
||||
<section id="customizing-unicode-functions">
|
||||
<title>Customizing Unicode functions</title>
|
||||
<para>
|
||||
HarfBuzz requires some simple functions for accessing
|
||||
information from the Unicode Character Database (such as the
|
||||
<literal>General_Category</literal> (gc) and
|
||||
<literal>Script</literal> (sc) properties) that is useful
|
||||
for shaping, as well as some useful operations like composing and
|
||||
decomposing code points.
|
||||
</para>
|
||||
<para>
|
||||
At build time, HarfBuzz looks first for the GLib library. If
|
||||
it is found, HarfBuzz will use GLib's Unicode functions by
|
||||
default. If there is no GLib, HarfBuzz will fall back to its own
|
||||
internal, lightweight set of Unicode functions instead.
|
||||
</para>
|
||||
<para>
|
||||
If your program has access to other Unicode functions, however,
|
||||
such as through a system library or application framework, you
|
||||
might prefer to use those instead of the built-in
|
||||
options. HarfBuzz supports this by implementing its Unicode
|
||||
functions as a set of virtual methods that you can replace —
|
||||
without otherwise affecting HarfBuzz's functionality.
|
||||
</para>
|
||||
<para>
|
||||
The Unicode functions are specified in a structure called
|
||||
<literal>unicode_funcs</literal> which is attached to each
|
||||
buffer. But even though <literal>unicode_funcs</literal> is
|
||||
associated with a <type>hb_buffer_t</type>, the functions
|
||||
themselves are called by other HarfBuzz APIs that access
|
||||
buffers, so it would be unwise for you to hook different
|
||||
functions into different buffers.
|
||||
</para>
|
||||
<para>
|
||||
In addition, you can mark your <literal>unicode_funcs</literal>
|
||||
as immutable by calling
|
||||
<function>hb_unicode_funcs_make_immutable (ufuncs)</function>. This is especially useful if your code is a
|
||||
library or framework that will have its own client programs. By
|
||||
marking your Unicode function choices as immutable, you prevent
|
||||
your own client programs from changing the
|
||||
<literal>unicode_funcs</literal> configuration and introducing
|
||||
inconsistencies and errors downstream.
|
||||
</para>
|
||||
<para>
|
||||
You can retrieve the Unicode-functions configuration for
|
||||
your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_unicode_funcs_t *ufunctions;
|
||||
ufunctions = hb_buffer_get_unicode_funcs(buf);
|
||||
</programlisting>
|
||||
<para>
|
||||
The current version of <literal>unicode_funcs</literal> uses six functions:
|
||||
</para>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_combining_class_func_t</function>:
|
||||
returns the Canonical Combining Class of a code point.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_general_category_func_t</function>:
|
||||
returns the General Category (gc) of a code point.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_mirroring_func_t</function>: returns
|
||||
the Mirroring Glyph code point (for bi-directional
|
||||
replacement) of a code point.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_script_func_t</function>: returns the
|
||||
Script (sc) property of a code point.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_compose_func_t</function>: returns the
|
||||
canonical composition of a sequence of two code points.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
<function>hb_unicode_decompose_func_t</function>: returns
|
||||
the canonical decomposition of a code point.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
<para>
|
||||
Note, however, that future HarfBuzz releases may alter this set.
|
||||
</para>
|
||||
<para>
|
||||
Each Unicode function has a corresponding setter, with which you
|
||||
can assign a callback to your replacement function. For example,
|
||||
to replace
|
||||
<function>hb_unicode_general_category_func_t</function>, you can call
|
||||
</para>
|
||||
<programlisting language="C">
|
||||
hb_unicode_funcs_set_general_category_func (*ufuncs, func, *user_data, destroy)
|
||||
</programlisting>
|
||||
<para>
|
||||
Virtualizing this set of Unicode functions is primarily intended
|
||||
to improve portability. There is no need for every client
|
||||
program to make the effort to replace the default options, so if
|
||||
you are unsure, do not feel any pressure to customize
|
||||
<literal>unicode_funcs</literal>.
|
||||
</para>
|
||||
</section>
|
||||
|
||||
</chapter>
|
||||
|
|
Loading…
Reference in New Issue