Commit Graph

2292 Commits

Author SHA1 Message Date
Behdad Esfahbod c7c4de2fb9 [Indic] Remove syllable length check before sorting
We now limit syllable lengths in the machine.  No need to match here.
2012-07-23 18:25:02 -04:00
Behdad Esfahbod 9fa052733e [Indic] Limit syllables to at most five consonants
Seems to be about what Uniscribe does.  Not exactly.  But close enough.
More consonants will start a new cluster.

A few scripts went way down in failures.  In particular:

  - Devanagari failures went down from 490 to 56.
  - Telugu went down from 113 to 49.

Other scripts went down slightly or didn't change.  New numbers:

BENGALI: 353908 out of 354285 tests passed. 377 failed (0.106412%)
DEVANAGARI: 693572 out of 693628 tests passed. 56 failed (0.00807349%)
GUJARATI: 366485 out of 366506 tests passed. 21 failed (0.00572978%)
GURMUKHI: 60750 out of 60809 tests passed. 59 failed (0.0970251%)
KANNADA: 950730 out of 951913 tests passed. 1183 failed (0.124276%)
KHMER: 298613 out of 299124 tests passed. 511 failed (0.170832%)
MALAYALAM: 1046881 out of 1048416 tests passed. 1535 failed (0.146411%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271333 out of 271847 tests passed. 514 failed (0.189077%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970524 out of 970573 tests passed. 49 failed (0.00504856%)

Some of the remaining Telugu and Devanagari issues seem to be Uniscribe
eating Anusvara when placed before a non-joiner.  Ouch!
2012-07-23 18:19:17 -04:00
Behdad Esfahbod 093cd58326 [Thai] Fix SARA AM handling
Oops, thinko.
2012-07-23 14:04:42 -04:00
Behdad Esfahbod 42848453bf [Thai] Reorder U+0E3A THAI VOWEL SIGN PHINTHU
Uniscribe reorders U+0E3A to be after U+0E38 and U+0E39.  We do that by
modifying the ccc for U+0E3A.

Fixes the two remaining Thai failures (see previous commit).
2012-07-23 13:52:07 -04:00
Behdad Esfahbod 4a7f4f3e56 [Thai] Adjust SARA AM reordering to match Uniscribe
Adjust the list of marks before SARA AM that get the reordering
treatment.  Also adjust cluster formation to match Uniscribe.

With Wikipedia test data, now I see:

  - For Thai, with the Angsana New font from Win7, I see 54 failures out
    of over 4M tests  (0.00129107%).  Of the 54, two are legitimate
    reordering issues (fix coming soon), and the other 52 are simply
    Uniscribe using a zero-width space char instead of an unknown
    character for missing glyphs.  No idea why.  The missing-glyph
    sequences include one that is a Thai character followed by an Arabic
    Sokun.  Someone confused it with Nikhahit I assume!

  - For Lao, with the Dokchampa font from Win7, 33 tests fail out of
    54k (0.0615167%).  All seem to be insignificant mark positioning
    with two marks on a base.  Have to investigate.
2012-07-23 13:15:33 -04:00
Behdad Esfahbod 2cc933aff9 [Indic] Fix cluster formation with left-matras and conjunct forms
Test case was: <U+0D15,U+0D4D,U+0D15,U+0D4A>.
2012-07-23 08:23:44 -04:00
Behdad Esfahbod e6b01a878c [Indic] Further streamline cluster formation
This should address all possible cluster misformations that I had in
mind.
2012-07-23 00:11:26 -04:00
Behdad Esfahbod 7b2a7dadd6 [Indic] Merge clusters before sorting
This should fix any instabilities in cluster formation that we were
speculating may happen with surrounding syllables.  Or most of it
perhaps.
2012-07-22 23:58:55 -04:00
Behdad Esfahbod abb3239ef9 [Indic] Update clusters for left-matra even if matra didn't move
Fixes crashes reported with left matra under
non-uniscribe-bug-compatibilty mode.
2012-07-22 23:55:19 -04:00
Behdad Esfahbod 60554f14d8 [Indic] Merge in Malayalam tests
From:
http://silpa.org.in/pub/tests/hb/ml/ml-harfbuzz-testdata.txt
2012-07-22 23:23:56 -04:00
Behdad Esfahbod 5c7081770c [Indic] Add extensive Sinhala tests
Generated by:
http://git.savannah.gnu.org/cgit/sinhala.git/plain/utils/gen-unicode-sinhala.py
2012-07-22 23:20:27 -04:00
Behdad Esfahbod 2efe4707b1 [Indic] Add Sinhala tests
Merge tests from:
http://git.savannah.gnu.org/cgit/sinhala.git/plain/patches/icu-sinhala-rendering.txt
2012-07-22 23:17:59 -04:00
Behdad Esfahbod 3d4c111b7a Add a test case 2012-07-20 19:34:39 -04:00
Behdad Esfahbod 92a1ad7bef [Indic] Stop searching for base if a post form is found before below form
Improves Bengali and Gurmukhi.  Malayalam regressed a bit.  We will deal
with that later.
2012-07-20 18:55:15 -04:00
Behdad Esfahbod 4c450c703f [Indic] Recompose Bengali Ya,Nukta
This is a bunch of hacks for now.

Improves Bengali a bit.
2012-07-20 18:13:04 -04:00
Behdad Esfahbod e9c0f152a3 [Uniscribe] Fix script fallback
Gurmukhi failures half now.  Others changed slightly.
2012-07-20 17:37:48 -04:00
Behdad Esfahbod 5791f32915 [Indic] Allow a ZWNJ after SM's
Malayalam failures go way down.  Other scripts benefitted slightly too.
Sinhala had one or two test regressions, but...
2012-07-20 16:26:55 -04:00
Behdad Esfahbod 34ae336f3f [Indic] Improve Reph AfterMain positioning
Fixes 20 out of 48 failing Oriya tests.  Failure rate down to 0.066% now.
2012-07-20 16:17:28 -04:00
Behdad Esfahbod bdd080431a [Indic] Reposition Oriya Candrabindu
Oriya failures down from 0.65% to 0.20%.
2012-07-20 16:03:09 -04:00
Behdad Esfahbod 5f0eaaad12 [Indic] Fix base search in final_reordering
Fixes most Malayalam failures.  Down from 1.6% to 0.38% now.  Fixes a
few more in other scripts too.
2012-07-20 15:47:24 -04:00
Behdad Esfahbod 81202bd860 [Indic] Don't attach SM/VD to other characters 2012-07-20 15:14:51 -04:00
Behdad Esfahbod efb4ad7356 Fix compiler warnings
If x is not constant, we cannot ASSERT_STATIC on it.
2012-07-20 14:27:38 -04:00
Behdad Esfahbod f31d97e44e [Indic] Form Telugu Reph out of Ra,Virama,ZWJ
Apparently this was approved in Feb 2012.  No font yet.
2012-07-20 14:13:35 -04:00
Behdad Esfahbod 2e193b240e [Indic] Don't split U+0AC9
Althought IndicMatraCategory.txt classifies it as Top_And_Right matra,
it does not have Unicode decomposition, and Uniscribe does not do
anything special about it either.

Gujarati failures down from 0.672% to 0.0130966%.
2012-07-20 14:02:35 -04:00
Behdad Esfahbod 30c3d5e9fc [Indic] Simplify Uniscribe cluster emulation
Now that we break syllables on Halant,ZWNJ, this code can be simplified.
2012-07-20 13:56:32 -04:00
Behdad Esfahbod decf6ffca4 [Indic] Minor! 2012-07-20 13:51:31 -04:00
Behdad Esfahbod 9e4f94a72c [Indic] Break syllables at Halant,ZWNJ
That's really what Uniscribe does, and explains a lot of pecularities of
Halant,ZWNJ before the base.

Sent Telugu from 1% failures to 0.03%.  Improved Kannada and Malayalam
slightly.  Fixed half of Bengali, and did NOT break anything!
2012-07-20 13:48:03 -04:00
Behdad Esfahbod 2c372b80f6 [Indic] Better check for applying 'init'
Specifically, don't apply 'init' if previous char is a joiner.

Fixes some more of Bengali.
2012-07-20 13:37:48 -04:00
Behdad Esfahbod 34a7440b7c [GPOS] Don't zero mark advances
Fixes more of Telugu, Kannada, and Oriya.

May break things (outside Indic...), but we cannot think of any font relying
on this immediately.
2012-07-20 12:40:39 -04:00
Behdad Esfahbod 8ed248de77 [Indic] Minor 2012-07-20 11:42:24 -04:00
Behdad Esfahbod d0e68dbd0b [Indic] Implement reph positioning step 5
Not tuned, just copied from step 2.  Fixes another 0.5% of Kannada
failures.  1% to go.
2012-07-20 11:25:41 -04:00
Behdad Esfahbod a9e45c32e4 [Indic] Don't let ZWNJ at the end of syllable affect base search
Fixes a few Devanagari, half of remaining Kannada failures, quarter for
Telugu, and others slightly improved or unchanged.
2012-07-20 11:04:15 -04:00
Behdad Esfahbod 20b68e699f [Indic] Apply 'cjct' globally
Fixes 5 Devanagari failures, and no regressions.
2012-07-20 10:47:46 -04:00
Behdad Esfahbod 51e764de44 [Indic] Unbreak old scriptures
Brings down failures with Lohit-Telugu from 57% to 1.40%.
2012-07-20 10:30:24 -04:00
Behdad Esfahbod 900cf3d449 Minor 2012-07-20 10:18:23 -04:00
Behdad Esfahbod 87cd63266e [Indic] Recategorize some Kannada right matras
Kannada failures down from 3.5% to 2.93%.
2012-07-19 21:25:46 -04:00
Behdad Esfahbod 3604d64ced [Indic] Recategorize GURMUKHI ADDAK
It's not in IndicSyllabicCategory.txt.  Fixes most of Gurmukhi failures.
Failures down from 7.7% to 0.222%!
2012-07-19 21:13:04 -04:00
Behdad Esfahbod 8932858123 Minor 2012-07-19 21:02:38 -04:00
Behdad Esfahbod 47ef931f13 [buffer] Make sure out_info = info during GPOS 2012-07-19 20:52:44 -04:00
Behdad Esfahbod ae63cf2062 Print line number during return when tracing 2012-07-19 20:45:41 -04:00
Behdad Esfahbod 5249f3aee1 [Indic] Unbreak Khmer
For Khmer, all consonants are subjoining.  No need to look in the font.
We were looking in the wrong order anyway.
2012-07-19 20:30:22 -04:00
Behdad Esfahbod e0475345d5 [Indic] Apply 'akhn' globally
Fixes 1.5% more failures for Telugu, 2% for Kannada.
Breaks one test in Devanagari.
2012-07-19 20:24:14 -04:00
Behdad Esfahbod c87bcddb10 [Indic] Add failing test for Kannada 2012-07-19 20:03:25 -04:00
Behdad Esfahbod fa247ebe52 [Indic] Better position U+0CD5
Fixes another 5% of Kannada failures.
2012-07-19 19:52:19 -04:00
Behdad Esfahbod f055442716 [Indic] Lookup consonant position in the font
Fixes most failures of Oriya, and improves others a bit.
2012-07-19 16:20:21 -04:00
Behdad Esfahbod 74d1d88781 [GSUB] Fix would_apply() for LigatureSubst 2012-07-19 16:14:23 -04:00
Behdad Esfahbod 787f7d1e9b [TODO] Minor 2012-07-19 15:29:13 -04:00
Behdad Esfahbod be73a5f936 Add src/test-would-substitute tool 2012-07-19 15:12:18 -04:00
Behdad Esfahbod e72b360ac6 Refactor / finish would_apply() operation
Untested.
2012-07-19 14:44:46 -04:00
Behdad Esfahbod 8c973ebf0f [Indic] Implement per-script matra positioning
Following what the spec says.

Brings down Telugu failures from 40% to 3.75%, and Kannada failures from
44% to 10%.  Does NOT affect other scripts' test results.
2012-07-19 13:25:08 -04:00