I don't expect ragel to be creating too much noise in its generated
output, and including this in-tree helps users right now. We can
revisit this later if it proved to be too much trouble.
With FreeSerif, it seems that the 'ccmp' feature does ligature
substituttions. That was then causing syllable match failures. We now
find syllables before any features have been applied.
Test sequence: U+0D9A,U+0DCA,U+200D,U+0DBB,U+0DCF
With this in place, you can remove GDEF/GSUB/GPOS tables from Arabic
fonts and still get per-component marks positioned on
oh-yeah-fallback-formed LAM-ALEF ligatures with marks in between the LAM
and ALEF.
Now *that*'s pretty cool, if a bit anachronistic...
Uniscribe accepts a Halant,ZWJ before matras. Allow that.
BENGALI down from 295 to 291
DEVANAGARI down from 69 to 57
GUJARATI down from 19 to 17
KANNADA down from 871 to 867
MALAYALAM down from 340 to 337
TELUGU down from 20 to 16
Currently at:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047997 out of 1048334 tests passed. 337 failed (0.0321462%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Now that we insert dotted-circle, tests break more easily when our indic
machine breaks.
In particular, a few Devanagari tests were having sequences like
"C,H,ZWJ,N", and because of the ZWJ the Nukta does NOT get reordered to
before the Halant as the grammar used to expect... Fixup.
Another case is as simple as "C,ZWJ,SM".
Fixes 10 out of 79 failures:
DEVANAGARI: 707325 out of 707394 tests passed. 69 failed (0.00975411%)
Brings down Khmer failures from 162 to 47.
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
Also rebaselined some of the test files that had only-inherited lines.
Removing those, the stats are:
BENGALI: 353893 out of 354188 tests passed. 295 failed (0.0832891%)
DEVANAGARI: 707315 out of 707394 tests passed. 79 failed (0.0111678%)
GUJARATI: 366438 out of 366457 tests passed. 19 failed (0.00518478%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951042 out of 951913 tests passed. 871 failed (0.0915%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047994 out of 1048334 tests passed. 340 failed (0.0324324%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970553 out of 970573 tests passed. 20 failed (0.00206064%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Still some regressions, but some of the more egregious cases are
addressed.
After we implemented dotted-circle, we were still ignoring any tests
that had dottedcircle in it for any of the shapers. That meant that if
we wrongly outputted dottedcircle, the test was being ignored. Ouch!
Fixing that shows regressions across the board. Most are Uniscribe
bugs: NOT inserting dotted-circle when it should. Some are arou
machine bugs. This is in fact a nice way to catch Indic-machine
deficiencies and when I fix the regressions, our clusters should be
much closer to Uniscribe. For now, we regressed from:
BENGALI: 353997 out of 354285 tests passed. 288 failed (0.0812905%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271747 out of 271847 tests passed. 100 failed (0.0367854%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
To:
BENGALI: 353990 out of 354285 tests passed. 295 failed (0.0832663%)
DEVANAGARI: 707315 out of 707394 tests passed. 79 failed (0.0111678%)
GUJARATI: 366447 out of 366506 tests passed. 59 failed (0.016098%)
GURMUKHI: 60707 out of 60809 tests passed. 102 failed (0.167738%)
KANNADA: 951042 out of 951913 tests passed. 871 failed (0.0915%)
KHMER: 298962 out of 299124 tests passed. 162 failed (0.0541581%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048074 out of 1048416 tests passed. 342 failed (0.0326206%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091835 out of 1091837 tests passed. 2 failed (0.000183178%)
TELUGU: 970553 out of 970573 tests passed. 20 failed (0.00206064%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Investigating.
The Win7 Tamil font does not realy on this behavior, but the WinXP
version does. Handle Tamil like Malayalam: Matras always move to
before base.
WinXP Tamil failures went down from 168964 (15.4752%) to 167
(0.0152953%) (two orders of magnitude reduction!).
Included in this is a minor fixup that actually fixed a few tests
with non-Tamil too. Numbers at:
BENGALI: 353997 out of 354285 tests passed. 288 failed (0.0812905%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271747 out of 271847 tests passed. 100 failed (0.0367854%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Unfortunately if the font has GPOS and 'mark' feature does
not position mark on dotted-circle, our inserted dotted-circle
will not get the mark repositioned to itself. Uniscribe cheats
here.
If there is no GPOS however, the fallback positioning kicks in
and sorts this out.
I'm not willing to address the first case.
No panic, we reeally insert dotted circle when it's absolutely broken.
Fixes most of the dotted-circle cases against Uniscribe. (for Devanagari
fixes 80% of them, for Khmer 70%; the rest look like Uniscribe being
really bogus...)
I had to make a decision. Apparently Uniscribe adds one dotted circle
to each broken character. I tried that, but that goes wrong easily with
split matras. So I made it add only one dotted circle to an entire
broken syllable tail. As in: "if there was a dotted circle here, this
would have formed a correct cluster." That works better for split
stuff, and I like it more.