[serialize] document how the serializer works.
This commit is contained in:
parent
45df259538
commit
0fb8e22406
|
@ -0,0 +1,178 @@
|
|||
# Introduction
|
||||
|
||||
In hb-subset serialization is the process of writing the subsetted font
|
||||
tables out to actual bytes in the final format. All serialization works
|
||||
through an object called the serialize context
|
||||
([hb_serialize_context_t](https://github.com/harfbuzz/harfbuzz/blob/main/src/hb-serialize.hh)).
|
||||
|
||||
Internally the serialize context holds a fixed size memory buffer. For simple
|
||||
tables the final bytes are written into the buffer sequentially to produce
|
||||
the final serialized bytes.
|
||||
|
||||
## Simple Tables
|
||||
|
||||
Simple tables are tables that do not use offset graphs.
|
||||
|
||||
To write a struct into the serialization context, first you call an
|
||||
allocation method on the context which requests a writable array of bytes of
|
||||
a fixed size. If the requested array will not exceed the bounds of the fixed
|
||||
buffer the serializer will return a pointer to the next unwritten portion
|
||||
of the buffer. Then the struct is cast onto the returned pointer and values
|
||||
are written to the structs fields.
|
||||
|
||||
Internally the serialization context ends up looking like:
|
||||
|
||||
```
|
||||
+-------+-------+-----+-------+--------------+
|
||||
| Obj 1 | Obj 2 | ... | Obj N | Unused Space |
|
||||
+-------+-------+-----+-------+--------------+
|
||||
```
|
||||
|
||||
Here Obj N, is the object currently being written.
|
||||
|
||||
## Complex Tables
|
||||
|
||||
Complex tables are made up of graphs of objects, where offset's are used
|
||||
to form the edges of the graphs. Each object is a continous slice of bytes
|
||||
that contains zero or more offsets pointing to more objects.
|
||||
|
||||
In this case the serialization buffer has a different layout:
|
||||
|
||||
```
|
||||
|- in progress objects -| |--- packed objects --|
|
||||
+-----------+-----------+--------------+-------+-----+-------+
|
||||
| Obj n+2 | Obj n+1 | Unused Space | Obj n | ... | Obj 0 |
|
||||
+-----------+-----------+--------------+-------+-----+-------+
|
||||
|-----------------------> <---------------------|
|
||||
```
|
||||
|
||||
The buffer holds two stacks:
|
||||
|
||||
1. In progress objects are held in a stack starting from the start of buffer
|
||||
that grows towards the end of the buffer.
|
||||
|
||||
2. Packed objects are held in a stack that starts at the end of the buffer
|
||||
and grows towards the start of the buffer.
|
||||
|
||||
Once the object on the top of the in progress stack is finished being written
|
||||
its bytes are popped from the in progress stack and copied to the top of
|
||||
the packed objects stack. In the example above, finalizing Obj n+1
|
||||
would result in the following state:
|
||||
|
||||
```
|
||||
+---------+--------------+---------+-------+-----+-------+
|
||||
| Obj n+2 | Unused Space | Obj n+1 | Obj n | ... | Obj 0 |
|
||||
+---------+--------------+---------+-------+-----+-------+
|
||||
```
|
||||
|
||||
Each packed object is associated with an ID, it's zero based position in the packed
|
||||
objects stack. In this example Obj 0, would have an ID of 0.
|
||||
|
||||
During serialization offsets that link from one object to another are stored
|
||||
using object ids. The serialize context maintains a list of links between
|
||||
objects. Each link records the parent object id, the child object id, the position
|
||||
of the offset field within the parent object, and the width of the offset.
|
||||
|
||||
Links are always added to the current in progress object and you can only link too
|
||||
objects that have been packed and thus have an ID.
|
||||
|
||||
### Object De-duplication
|
||||
|
||||
An important optimization in packing offset graphs is de-duplicating equivalent objects. If you
|
||||
have two or more parent objects that point to child objects that are equivalent then you only need
|
||||
to encode the child once and can have the parents point to the same child. This can significantly
|
||||
reduce the final size of a serialized graph.
|
||||
|
||||
During packing of an inprogress object the serialization context checks if any existing packed
|
||||
objects are equivalent to the object being packed. Here equivalence means the object has the
|
||||
exact same bytes and all of it's links are equivalent. If an equivalent object is found the
|
||||
in progress object is discarded and not copied to the packed object stack. The object id of
|
||||
the equivalent object is instead returned. Thus parent objects will then link to the existing
|
||||
equivalent object.
|
||||
|
||||
To find equivalent objects the serialization context maintains a hashmap from object to the canonical
|
||||
object id.
|
||||
|
||||
### Link Resolution
|
||||
|
||||
Once all objects have been packed the next step is to assign actual values to all of the offset
|
||||
fields. Prior to this point all links in the graph have been recorded using object id's. For each
|
||||
link the resolver computes the offset between the parent and child and writes the offset into
|
||||
the serialization buffer at the appropriate location.
|
||||
|
||||
### Offset Overflow Resolution
|
||||
|
||||
If during link resolution the resolver finds that an offsets value would exceed what can be encoded
|
||||
in that offset field link resolution is aborted and the offset overflow resolver is invoked.
|
||||
That process is documented [here](reapcker.md).
|
||||
|
||||
|
||||
### Example of Complex Serialization
|
||||
|
||||
|
||||
If we wanted to serialize the following graph:
|
||||
|
||||
```
|
||||
a--b--d
|
||||
\ /
|
||||
c
|
||||
```
|
||||
|
||||
Serializer would be called like this:
|
||||
|
||||
```c++
|
||||
hb_serialize_context_t ctx;
|
||||
|
||||
struct root {
|
||||
char name;
|
||||
Offset16To<child> child_1;
|
||||
Offset16To<child> child_2;
|
||||
}
|
||||
|
||||
struct child {
|
||||
char name;
|
||||
Offset16To<char> leaf;
|
||||
}
|
||||
|
||||
// Object A.
|
||||
ctx->push();
|
||||
root* a = ctx->start_embed<root> ();
|
||||
ctx->extend_min (a);
|
||||
a->name = 'a';
|
||||
|
||||
// Object B.
|
||||
ctx->push();
|
||||
child* b = ctx->start_embed<child> ();
|
||||
ctx->extend_min (b);
|
||||
b->name = 'b';
|
||||
|
||||
// Object D.
|
||||
ctx->push();
|
||||
*ctx->allocate_size<char> (1) = 'd';
|
||||
unsigned d_id = ctx->pop_pack ();
|
||||
|
||||
ctx->add_link (b->leaf, d_id);
|
||||
unsigned b_id = ctx->pop_pack ();
|
||||
|
||||
// Object C
|
||||
ctx->push();
|
||||
child* c = ctx->start_embed<child> ();
|
||||
ctx->extend_min (c);
|
||||
c->name = 'c';
|
||||
|
||||
// Object D.
|
||||
ctx->push();
|
||||
*ctx->allocate_size<char> (1) = 'd';
|
||||
d_id = ctx->pop_pack (); // Serializer will automatically de-dup this with the previous 'd'
|
||||
|
||||
ctx->add_link (c->leaf, d_id);
|
||||
unsigned c_id = ctx->pop_pack ();
|
||||
|
||||
// Object A's links:
|
||||
ctx->add_link (a->child_1, b_id);
|
||||
ctx->add_link (a->child_2, c_id);
|
||||
ctx->pop_pack ();
|
||||
|
||||
ctx->end_serialize ();
|
||||
|
||||
```
|
Loading…
Reference in New Issue