New approach to fix this:
69f9fbc420
Previous approach was reverted as it was too broad. See context:
https://github.com/behdad/harfbuzz/issues/347#issuecomment-267838368
With U+05E9,U+05B8,U+05C1,U+05DC and Arial Unicode, we now (correctly) disable
GDEF and GPOS, so we get results very close to Uniscribe, but slightly different
since our fallback position logic is not exactly the same:
Before: [gid1166=3+991|gid1142=0+737|gid5798=0+1434]
After: [gid1166=3+991|gid1142=0@402,-26+0|gid5798=0+1434]
Uniscribe: [gid1166=3+991|gid1142=0@348,0+0|gid5798=0+1434]
New API:
- hb_font_get_nominal_glyph_func_t
- hb_font_get_variation_glyph_func_t
- hb_font_funcs_set_nominal_glyph_func()
- hb_font_funcs_set_variation_glyph_func()
- hb_font_get_nominal_glyph()
- hb_font_get_variation_glyph()
Deprecated API:
- hb_font_get_glyph_func_t
- hb_font_funcs_set_glyph_func()
Clients that implement their own font-funcs are encouraged to replace
their get_glyph() implementation with a get_nominal_glyph() and
get_variation_glyph() pair. The variation version can assume that
variation_selector argument is not zero.
Test stats remain unchanged, except for Malayalam, which we investigate:
BENGALI: 353725 out of 354188 tests passed. 463 failed (0.130722%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951190 out of 951913 tests passed. 723 failed (0.0759523%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1047584 out of 1048334 tests passed. 750 failed (0.0715421%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271662 out of 271847 tests passed. 185 failed (0.068053%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
Myanmar, compared to Windows 10 mmrtext.ttf:
MYANMAR: 1123865 out of 1123883 tests passed. 18 failed (0.00160159%)
Looks like Unsicribe responds to the 'mymr' tag by zeroing marks
GDEF_LATE instead of generic-shaper UNICODE_LATE. Implement that.
Fixes
Bug 81775 - Incorrect Rendering with harfbuzz-ng myanmar unicode
https://bugs.freedesktop.org/show_bug.cgi?id=81775
Micro-test added based on Padauk.
The spec and Uniscribe don't allow these, but UTN#11
specifically says the sequence U+104B,U+1038 is valid.
As such, allow all "P V" sequences. There's about
eight sequences that match that structure, but Roozbeh
thinks it's fine to allow all of them.
Test case: U+104B, U+1038
https://bugs.freedesktop.org/show_bug.cgi?id=71947
The spec and Uniscribe treat it as consonant in the grammar, but
it's not in IndicSyllableCategory.txt, so fix up.
Test sequence: U+1004,U+103A,U+1039,U+104E
https://bugs.freedesktop.org/show_bug.cgi?id=71948
This reverts commit d5bd0590ae.
The reasoning behind that logic was flawed and made under
a misunderstanding of the original problem, and caused
regressions as reported by Jonathan Kew in thread titled
"tibetan marks" in Oct 2013. Apparently I have had fixed
the original problem with this commit:
7e08f1258d
So, revert the faulty commit and everything seems to be in good
shape.
After the Ngapi hackfest work, we were assuming that fonts
won't use presentation features to choose specific forms
(eg. conjuncts). As such, we were using auto-joiner behavior
for such features. It proved to be troublesome as many fonts
used presentation forms ('pres') for example to form conjuncts,
which need to be disabled when a ZWJ is inserted.
Two examples:
U+0D2F,U+200D,U+0D4D,U+0D2F with kartika.ttf
U+0995,U+09CD,U+200D,U+09B7 with vrinda.ttf
What we do now is to never do magic to ZWJ during GSUB's main input
match for Indic-style shapers. Note that backtrack/lookahead are still
matched liberally, as is GPOS. This seems to be an acceptable
compromise.
As to the bug that initially started this work, that one needs to
be fixed differently:
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not
provide same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers:
BENGALI: 353689 out of 354188 tests passed. 499 failed (0.140886%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048102 out of 1048334 tests passed. 232 failed (0.0221304%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Not for Arabic, but for Indic-like scripts. ZWJ/ZWNJ have special
meanings in those scripts, so let font lookups take full control.
This undoes the regression caused by automatic-joiners handling
introduced two commits ago.
We only disable automatic joiner handling for the "basic shaping
features" of Indic, Myanmar, and SEAsian shapers. The "presentation
forms" and other features are still applied with automatic-joiner
handling.
This change also changes the test suite failure statistics, such that
a few scripts show more "failures". The most affected is Kannada.
However, upon inspection, we believe that in most, if not all, of the
new failures, we are producing results superior to Uniscribe. Hard to
count those!
Here's an example of what is fixed by the recent joiner-handling
changes:
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers, for future reference:
BENGALI: 353892 out of 354188 tests passed. 296 failed (0.0835714%)
DEVANAGARI: 707336 out of 707394 tests passed. 58 failed (0.00819911%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We have now three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
*in GDEF*, and do that *before* applying GPOS. This seems
to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
from General_Category=Mn Unicode characters, and do so
before applying GPOS. This seems to be what Uniscribe does
for Latin at least.
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See preivous
commit.