More like Uniscribe... We still allow user-defined features to
work across syllables, but not pres,blws,abs,psts,etc.
This "regressed" Sinhala numbers by 11. These are cases were
there's Consonant followed by Ra,Halant,ZWJ at the of text.
The Ra,Halant,ZWJ ends up forming reph, which is wrong...
But before we were also ligating that reph with the previous
consonant. That's even more wrong. That's also what Uniscribe
does.
Current numbers:
BENGALI: 353732 out of 354188 tests passed. 456 failed (0.128745%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1048140 out of 1048334 tests passed. 194 failed (0.0185056%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271655 out of 271847 tests passed. 192 failed (0.070628%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
With FreeSerif, it seems that the 'ccmp' feature does ligature
substituttions. That was then causing syllable match failures. We now
find syllables before any features have been applied.
Test sequence: U+0D9A,U+0DCA,U+200D,U+0DBB,U+0DCF
Apparently if there is C,V,ZWJ,C, the first C will be base, but if
it's C,ZWJ,V,C, the second one will be.
Note that Uniscribe implements this differently, by breaking syllable in
the case of C,ZWJ,V,C and putting the first consonant in one syllable
and the rest in the next syllable.
Sinhala failures down from 208 to 158 (0.0581209%). No changes to
Khmer.
Sinhala does not have half forms. And most (all?) consonants can be
base, except when preceded by ZWJ, which would request a subjoined form.
Hence switch the base algorithm to categorize with Khmer, start search
at start, and stop at a ZWJ.
Also, mark all pos=base consonants after base to be subjoined. Mark
base itself to have pos=base.
Finally, adjust Sinhala's reph position to after-main.
Brings down Sinhala failures from 455 to 328 (0.120656%).
If, say, a H,ZWJ,C ligature was formed, we don't want the code to detec
that as a Halant. So, ignore ligatures when matching category in
final_reordering.
Sinhala failures down from 514 to 455 (0.167374%).
In Sinhala, Rakar is formed by Al-Lakuna,ZWJ,Ra. If you put that at the
end of a Consonant,Matra syllable, you get a dotted-circle from
Uniscribe. Apparently adding a ZWJ before the Al-Lakuna "fixes" that.
And people have been encoding that sequence... So, allow a forced
"ZWJ,Virama,ZWJ,Ra" sequence at the of syllables.
Fixes some 100 or more of Sinhala failures. Now at 622 only (0.23%).
POS_BASE can disappear if base ligated backward. Define base as last
with position not after base.
Fixes a few hundred of Sinhala failures with Iskoola Pota.