If pre-base reordering Ra is NOT formed (or formed and then
broken up), we should consider that Ra as base. This is
observable when there's a left matra or dotreph that positions
before base.
Now, it might be that we shouldn't do this if the Ra happend
to form a below form. We can't quite deduce that right now...
Micro test added. Also at:
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=186#c29
Sometimes font designers form half/pref/etc consonant forms
unconditionally and then undo that conditionally. Try to
recover the OT_H classification in those cases.
No test number changes expected.
Normally if you want to, say, conditionally prevent a 'pref', you
would use blocking contextual matching. Some designers instead
form the 'pref' form, then undo it in context. To detect that
we now also remember glyphs that went through MultipleSubst.
In the only place that this is used, Uniscribe seems to only care
about the "last" transformation between Ligature and Multiple
substitions. Ie. if you ligate, expand, and ligate again, it
moves the pref, but if you ligate and expand it doesn't. That's
why we clear the MULTIPLIED bit when setting LIGATED.
Micro-test added. Test: U+0D2F,0D4D,0D30 with font from:
[1]
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=186#c29
Sinhala and Telugu use "explicit" reph. That is, the reph is formed by
a Ra,H,ZWJ sequence. Previously, upon detecting this sequence, we were
checking checking whether the 'rphf' feature applies to the first two
glyphs of the sequence. This is how the Microsoft fonts are designed.
However, testing with Noto shows that apparently Uniscribe also forms
the reph if the lookup ligates all three glyphs. So, try both
sequences.
Doesn't affect test results for Sinhala or Telugu.
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=232
commit b5a0f69e47
Author: Behdad Esfahbod <behdad@behdad.org>
Date: Thu Oct 17 18:04:23 2013 +0200
[indic] Pass zero-context=false to would_substitute for newer scripts
For scripts without an old/new spec distinction, use zero-context=false.
This changes behavior in Sinhala / Khmer, but doesn't seem to regress.
This will be useful and used in Javanese.
The *intention* was to change zero-context from true to false for scripts that
don't have old-vs-new specs. However, checking the code, looks like we
essentially change zero-context to always be true; ie. we only changed things
for old-spec, and we broke them. That's what causes this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=76705
The root of the bug is here:
/* Use zero-context would_substitute() matching for new-spec of the main
* Indic scripts, but not for old-spec or scripts with one spec only. */
bool zero_context = indic_plan->config->has_old_spec || !indic_plan->is_old_spec;
Note that is_old_spec itself is:
indic_plan->is_old_spec = indic_plan->config->has_old_spec && ((plan->map.chosen_script[0] & 0x000000FF) != '2');
It's easy to show that zero_context is now always true. What we really meant was:
bool zero_context = indic_plan->config->has_old_spec && !indic_plan->is_old_spec;
Ie, "&&" instead of "||". We made this change supposedly to make Javanese
work. But apparently we got it working regardless! So I'm going to fix this
to only change the logic for old-spec and not touch other cases.
This reverts commit d5bd0590ae.
The reasoning behind that logic was flawed and made under
a misunderstanding of the original problem, and caused
regressions as reported by Jonathan Kew in thread titled
"tibetan marks" in Oct 2013. Apparently I have had fixed
the original problem with this commit:
7e08f1258d
So, revert the faulty commit and everything seems to be in good
shape.
For Javanese (pref_len == 1) only reorder if it didn't ligate. That's
sensible, and what the spec says. For other Indic (pref_len > 1)
only reorder if ligated.
Doesn't change any test numbers.
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not provide
same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
Test with U+0CB0,U+200D,U+0CCD,U+0C95,U+0CBF and tunga.ttf.
Improves some scripts. Improves Bengali too, but numbers
are up because we produce better results than Uniscribe for some
sequences now.
New numbers:
BENGALI: 353724 out of 354188 tests passed. 464 failed (0.131004%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951190 out of 951913 tests passed. 723 failed (0.0759523%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1048140 out of 1048334 tests passed. 194 failed (0.0185056%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271662 out of 271847 tests passed. 185 failed (0.068053%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
See comments from caveat! Seems to work fine.
This is useful for Javanese which has an atomically encoded pre-base
reordering Ra which should only be reordered if it was substituted
by the pref feature.
For scripts without an old/new spec distinction, use zero-context=false.
This changes behavior in Sinhala / Khmer, but doesn't seem to regress.
This will be useful and used in Javanese.
Whic means these twp are applied per-syllable now. Apparently
in some Khmer fonts the clig interacts with presentation features.
Test case: U+1781,U+17D2,U+1789,U+17BB,U+17C6 with Mondulkiri-R.ttf
should produce one big ligature.
First, we were abusing OT_VD instead of OT_A. Fix that
but moving OT_A in the grammar where it belongs (which
is different from what the spec says).
Also, only allow medial consonants after all other
consonants. This doesn't affect any current character.
Finally, fix Halant attachment in presence of medial
consonants. Again, this currently doesn't affect any
sequence.
I lied. There's Gurmukhi U+0A75 which is Consonant_Medial.
Uniscribe allows one of those in each of these positions:
before matras, after matras and before syllable modifiers,
and after syllable modifiers! We currently just allow
unlimited numbers of it, before matras.
Seems to better match Uniscribe.
Note: NotoSansTelugu-Regular has kern feature, so this fixes most of the
positioning failures there, except for the kern pairs blocked by a
(non-)joiner, in which case we (correctly) kern, but Uniscribe doesn't.
More like Uniscribe... We still allow user-defined features to
work across syllables, but not pres,blws,abs,psts,etc.
This "regressed" Sinhala numbers by 11. These are cases were
there's Consonant followed by Ra,Halant,ZWJ at the of text.
The Ra,Halant,ZWJ ends up forming reph, which is wrong...
But before we were also ligating that reph with the previous
consonant. That's even more wrong. That's also what Uniscribe
does.
Current numbers:
BENGALI: 353732 out of 354188 tests passed. 456 failed (0.128745%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1048140 out of 1048334 tests passed. 194 failed (0.0185056%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271655 out of 271847 tests passed. 192 failed (0.070628%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
After the Ngapi hackfest work, we were assuming that fonts
won't use presentation features to choose specific forms
(eg. conjuncts). As such, we were using auto-joiner behavior
for such features. It proved to be troublesome as many fonts
used presentation forms ('pres') for example to form conjuncts,
which need to be disabled when a ZWJ is inserted.
Two examples:
U+0D2F,U+200D,U+0D4D,U+0D2F with kartika.ttf
U+0995,U+09CD,U+200D,U+09B7 with vrinda.ttf
What we do now is to never do magic to ZWJ during GSUB's main input
match for Indic-style shapers. Note that backtrack/lookahead are still
matched liberally, as is GPOS. This seems to be an acceptable
compromise.
As to the bug that initially started this work, that one needs to
be fixed differently:
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not
provide same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers:
BENGALI: 353689 out of 354188 tests passed. 499 failed (0.140886%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048102 out of 1048334 tests passed. 232 failed (0.0221304%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Before, we were marking them as below-form for initial reordering.
However, there is a rule that says "post consonants should follow
below consonsnts" for base determination purposes. Malayalam has
port-form YA/VA, and RA is pre-base. As such, for a sequence like
YA,Virama,YA,Virama,RA, the correct base is at index 0. But
because the code was seeing RA as a below-base, it was stopping at
the second YA as base, instead of jumping it as a post-base.
By treating prebase-reordering consonants like post-forms, this
is fixed.
MALAYALAM went down from 351 to 265. Other numbers didn't change:
BENGALI: 353686 out of 354188 tests passed. 502 failed (0.141733%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048069 out of 1048334 tests passed. 265 failed (0.0252782%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Not for Arabic, but for Indic-like scripts. ZWJ/ZWNJ have special
meanings in those scripts, so let font lookups take full control.
This undoes the regression caused by automatic-joiners handling
introduced two commits ago.
We only disable automatic joiner handling for the "basic shaping
features" of Indic, Myanmar, and SEAsian shapers. The "presentation
forms" and other features are still applied with automatic-joiner
handling.
This change also changes the test suite failure statistics, such that
a few scripts show more "failures". The most affected is Kannada.
However, upon inspection, we believe that in most, if not all, of the
new failures, we are producing results superior to Uniscribe. Hard to
count those!
Here's an example of what is fixed by the recent joiner-handling
changes:
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers, for future reference:
BENGALI: 353892 out of 354188 tests passed. 296 failed (0.0835714%)
DEVANAGARI: 707336 out of 707394 tests passed. 58 failed (0.00819911%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Ouch, how did things ever work without this?! The added test that has a
dot-reph as well as a pre-base reordering Ra perfectly demonstrates the
bug (tested with Nirmala font from Win8 for example). Testing suggests
that Win8 shaper has the *exact* same bug / behavior that we used to
have. Odd.
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We have now three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
*in GDEF*, and do that *before* applying GPOS. This seems
to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
from General_Category=Mn Unicode characters, and do so
before applying GPOS. This seems to be what Uniscribe does
for Latin at least.
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See preivous
commit.
This is important for the Sinhala U+0DDA split matra since it decomposes
to U+0DD9,U+0DCA where U+0DD9 is a left matra and U+0DCA is the virama.
We don't want to move the virama with the left matra.
TEST: U+0D9A,U+0DDA
Note that we were already doing this in the Uniscribe bug compatibility
mode. We now do it all the time.
Had to do some refactoring to make this happen...
Under uniscribe bug compatibility mode, we still plit them
Uniscrie-style, but Jonathan and I convinced ourselves that there is no
harm doing this the Unicode way. This change makes that happen, and
unbreaks free Sinhala fonts.
Uniscribe doesn't. And some fonts abuse this feature to get Indic
shaping working in non-complex applications like Adobe's apps.
No change in numbers:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The logic for pre-base reordering follows the left matra logic.
We had an exception for Malayalam/Tamil in the left matra repositioning
which was not reflected in pre-base reordering.
Malayalam failures down from 337 to 323.
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
With FreeSerif, it seems that the 'ccmp' feature does ligature
substituttions. That was then causing syllable match failures. We now
find syllables before any features have been applied.
Test sequence: U+0D9A,U+0DCA,U+200D,U+0DBB,U+0DCF
The Win7 Tamil font does not realy on this behavior, but the WinXP
version does. Handle Tamil like Malayalam: Matras always move to
before base.
WinXP Tamil failures went down from 168964 (15.4752%) to 167
(0.0152953%) (two orders of magnitude reduction!).
Included in this is a minor fixup that actually fixed a few tests
with non-Tamil too. Numbers at:
BENGALI: 353997 out of 354285 tests passed. 288 failed (0.0812905%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271747 out of 271847 tests passed. 100 failed (0.0367854%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
No panic, we reeally insert dotted circle when it's absolutely broken.
Fixes most of the dotted-circle cases against Uniscribe. (for Devanagari
fixes 80% of them, for Khmer 70%; the rest look like Uniscribe being
really bogus...)
I had to make a decision. Apparently Uniscribe adds one dotted circle
to each broken character. I tried that, but that goes wrong easily with
split matras. So I made it add only one dotted circle to an entire
broken syllable tail. As in: "if there was a dotted circle here, this
would have formed a correct cluster." That works better for split
stuff, and I like it more.
Fixes consonant-position with old-spec Malayalam. Uniscribe seem to be
doing this. Fixes below-base La (eg. Pa,H,La) with AnjaliNewLipi.ttf.
Doesn't regress new-spec or other scripts.
Free up syllables and let features work across syllables for the
presentation forms features and GPOS.
Fixed:
- 1 GURMUKHI test (remains 40)
- 12 KHMER tests (remains 18)
- 11 SINHALA tests (remains 121)
Regresses:
- 5 MALAYALAM tests (up to 312)
Current numbers:
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The merger of normalizer and glyph-mapping broke shapers that
modified text stream. Unbreak them by adding a new preprocess_text
shaping stage that happens before normalizing/cmap and disallow
setup_mask modification of actual text.
We need the font for glyph lookup during GSUB pauses in Indic shaper.
Could perhaps be avoided, but at this point, we don't mean to support
separate substitute()/position() entry points (anymore), so there is
no point in not providing the font to GSUB.