Part of https://github.com/harfbuzz/harfbuzz/issues/3591
"After that, bulk of the time I suppose is spent in binary-searching the
language table. I suggest we split the language table in 2-letter and
3-letter tags, to speed-up the vast majority of cases that are
2-letter."
benchmark-ot, before:
----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN 112 ns 111 ns 6286271
BM_hb_ot_tags_from_script_and_language/COMMON en_US 60.6 ns 60.4 ns 11671176
BM_hb_ot_tags_from_script_and_language/LATIN en_US 61.3 ns 61.1 ns 11442645
BM_hb_ot_tags_from_script_and_language/COMMON none 4.75 ns 4.74 ns 146997235
BM_hb_ot_tags_from_script_and_language/LATIN none 4.65 ns 4.64 ns 150938747
After:
----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN 89.5 ns 89.2 ns 7747649
BM_hb_ot_tags_from_script_and_language/COMMON en_US 38.5 ns 38.4 ns 18199432
BM_hb_ot_tags_from_script_and_language/LATIN en_US 39.0 ns 38.9 ns 18049238
BM_hb_ot_tags_from_script_and_language/COMMON none 4.53 ns 4.52 ns 154895110
BM_hb_ot_tags_from_script_and_language/LATIN none 4.54 ns 4.52 ns 154762105
If an OpenType tag maps to a BCP 47 macrolanguage, that is presumably to
support the use of the macrolanguage as a vague stand-in for one of its
individual languages. For example, "ar" and "zh" are often used for
"arb" and "cmn". When the OpenType tag maps to a macrolanguage and some
but not all of its individual languages, that indicates that the
OpenType tag only corresponds to the listed individual languages (which
may be referred to using the macrolanguage subtag) but not the missing
individual languages. In particular, INUK (Nunavik Inuktitut) is mapped
to "ike" (Eastern Canadian Inuktitut) and "iu" (Inuktitut) but not to
"ikt" (Inuinnaqtun), so "ikt" should not inherit the INUK mapping from
its macrolanguage "iu".
Every macrolanguage not mentioned in the OT language system tag registry
is mapped to every tag of its individual languages, if those have
registered tags.
A BCP 47 language tag with both a script subtag and a region subtag
would be printed as a human-readable name in hb-ot-tag-table.hh as if it
only had its language subtag.
The general rule is that if a BCP 47 macrolanguage maps to an OpenType
language system tag, all its individual languages map to it too.
Previously, a tag like "prs" (Dari) would not map to the language system
tag ('FAR ') of its macrolanguage ("fa") because "prs" already has its
own language system tag ('DRI '). That exception has been removed: now
"prs" maps to 'DRI ' and falls back to 'FAR '.
OpenType only officially maps four ISO 639 codes to Quechua languages,
but prior versions of HarfBuzz also mapped qu to 'QUZ '. Because qu is a
macrolanguage, the mapping now applies to all individual Quechua
languages. OpenType calls 'QUZ ' "Quechua", but it really corresponds to
Cusco Quechua, so the individual Quechua languages should not all
necessarily be mapped to it.
If the second subtag of a BCP 47 tag is three letters long, it denotes
an extended language. The tag converter ignores the language subtag and
uses the extended language instead.
There are some grandfathered exceptions, which are handled earlier.
The new script, gen-tag-table.py, generates `ot_languages` automatically
from the [OpenType language system tag registry][ot] and the [IANA
Language Subtag Registry][bcp47] with some manual modifications. If an
OpenType tag maps to a BCP 47 macrolanguage, all the macrolanguage's
individual languages are mapped to the same OpenType tag, except for
individual languages with their own OpenType mappings. Deprecated
BCP 47 tags are canonicalized.
[ot]: https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags
[bcp47]: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Some OpenType tags correspond to multiple ISO 639 codes. The mapping
from ISO 639 codes lists OpenType tags in priority order, such that more
specific or more likely tags appear first.
Some OpenType tags have no corresponding ISO 639 code in the registry so
their mappings use BCP 47 subtags besides the language. For example, any
BCP 47 tag with a fonipa variant subtag is mapped to 'IPPH', and 'IPPH'
is mapped back to und-fonipa.
Other OpenType tags have no corresponding ISO 639 code because it is not
clear what they are for. HarfBuzz just ignores these tags.
One such ignored tag is 'ZHP ' (Chinese Phonetic). It probably means
zh-Latn. However, it is used in Microsoft JhengHei and Microsoft YaHei
with the script tag 'hani', implying that it is not a romanization
scheme after all. It would be simple enough to add this mapping to
gen-tag-table.py once a definitive mapping is determined.
The manual modifications are mainly either obvious mappings that the
OpenType registry omits or mappings for compatibility with previous
versions of HarfBuzz. Some of the old mappings were discarded, though,
for homophonous language names. For example, OpenType maps 'KUI ' to
kxu; previous versions of HarfBuzz also mapped it to kvd, because kvd
and kxu both happen to be called "Kui".
gen-tag-table.py also generates a function to convert multi-subtag tags
like el-polyton and zh-HK to OpenType tags, replacing `ot_languages_zh`
and the hard-coded list of special cases in `hb_ot_tags_from_language`.
It also generates a function to convert OpenType tags to BCP 47,
replacing the hard-coded list of special cases in
`hb_ot_tag_to_language`.