Follows the order of the Arabic/Syriac specs. Also don't stop
between rlig and calt in non-Arabic scripts.
Micro-tests for Arabic and Mongolian added for the latter.
We now handle U+FFFD replacement in hb_buffer_add_utf*(). Any other
manipulation can happen in user callbacks. No need for this.
efe74214bb (commitcomment-7039404)
This reverts commit efe74214bb.
Conflicts:
src/hb-ot-shape-normalize.cc
With this change, we now by default replace broken UTF-8/16/32 bits
with U+FFFD. This can be changed by calling new API on the buffer.
Previously the replacement value used to be (hb_codepoint_t)-1.
Note that hb_buffer_clear_contents() does NOT reset the replacement
character.
See discussion here:
6f13b6d62d
New API:
hb_buffer_set_replacement_codepoint()
hb_buffer_get_replacement_codepoint()
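As an illustration of the replacement behavior described above, here is a minimal, self-contained sketch of a UTF-8 decoder that substitutes a caller-chosen replacement value for malformed input. It is not the library's decoder; `sketch_utf8_next` and `SKETCH_REPLACEMENT` are hypothetical names, and the policy of consuming exactly one byte per error is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical default, mirroring the U+FFFD default described above. */
#define SKETCH_REPLACEMENT 0xFFFDu

/* Decode one code point from text[0..len). Stores it in *out and returns
 * the number of bytes consumed (0 only when len == 0). Malformed input
 * yields `replacement` and consumes a single byte (an assumed policy). */
size_t
sketch_utf8_next (const uint8_t *text, size_t len,
                  uint32_t replacement, uint32_t *out)
{
  uint32_t cp, min;
  size_t need, i;
  uint8_t c;

  if (len == 0) { *out = replacement; return 0; }
  c = text[0];

  if (c < 0x80) { *out = c; return 1; }
  else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; need = 1; min = 0x80; }
  else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; need = 2; min = 0x800; }
  else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; need = 3; min = 0x10000; }
  else { *out = replacement; return 1; }   /* stray continuation or 0xF8+ */

  if (len < need + 1) { *out = replacement; return 1; }   /* truncated */
  for (i = 1; i <= need; i++)
  {
    if ((text[i] & 0xC0) != 0x80) { *out = replacement; return 1; }
    cp = (cp << 6) | (text[i] & 0x3F);
  }

  /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
  { *out = replacement; return 1; }

  *out = cp;
  return need + 1;
}
```

Passing `(uint32_t) -1` as `replacement` models the previous behavior; passing `SKETCH_REPLACEMENT` models the new default.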
Originally we fixed those in 79d1007a50.
However, fonts like MongolianWhite don't have GDEF, but have IgnoreMarks
in their LigatureSubstitute init/etc features. We were synthesizing a
GDEF class of mark for Mongolian Variation Selectors and as such the
ligature lookups were not matching. Uniscribe doesn't do that.
I tried with more sophisticated fixes, like, if there is no GDEF and
a lookup-flag mismatch happens, instead of rejecting a match, try
skipping that glyph. That surely produces some interesting behavior,
but since we don't want to support fonts missing GDEF more than we have
to, I went for this simpler fix which is to always mark
default-ignorables as base when synthesizing GDEF.
Micro-test added.
Fixes rest of https://bugs.freedesktop.org/show_bug.cgi?id=65258
Only if the font doesn't support it. Ie, this lets the user
use non-Unicode codepoints as private values and return a meaningful
glyph for them. But if it's invalid and font callback doesn't
like it, and if font has U+FFFD, show that instead.
Font functions that do not want this automatic replacement to
happen should return true from get_glyph() if unicode > 0x10FFFF.
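The fallback order described above can be sketched roughly as follows. This is a toy model, not the library's code: `sketch_get_glyph_func_t`, `sketch_map_glyph`, and `sketch_toy_font` are hypothetical names, and the callback signature is simplified for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical font callback: code point in, glyph id out. */
typedef bool (*sketch_get_glyph_func_t) (uint32_t unicode, uint32_t *glyph);

/* Map a code point to a glyph. If the font callback rejects a value that
 * is beyond U+10FFFF, retry with U+FFFD. Returns false when no glyph
 * could be found. */
bool
sketch_map_glyph (sketch_get_glyph_func_t get_glyph,
                  uint32_t unicode, uint32_t *glyph)
{
  if (get_glyph (unicode, glyph))
    return true;  /* The font handled it, even if it is not Unicode. */
  if (unicode > 0x10FFFFu)
    return get_glyph (0xFFFDu, glyph);
  return false;
}

/* Toy font: has glyphs for 'A', U+FFFD, and the private value 0x110000. */
bool
sketch_toy_font (uint32_t unicode, uint32_t *glyph)
{
  switch (unicode)
  {
    case 0x41:     *glyph = 1; return true;
    case 0xFFFD:   *glyph = 2; return true;
    case 0x110000: *glyph = 3; return true;
    default:       return false;
  }
}
```

With the toy font, the private value 0x110000 keeps its own glyph because the callback returns true for it, while 0x110001 falls back to the U+FFFD glyph.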
Replaces https://github.com/behdad/harfbuzz/pull/27
There may be more. There are members that are by definition
redundant or reserved and not needed, NOT what we *currently*
don't use.
I'm sure there's more...
Add hb_ot_layout_language_get_required_feature_index() again, which
is used in Pango. This was removed in
da13293798 in favor of
hb_ot_layout_language_get_required_feature().
API changes:
- Added hb_ot_layout_language_get_required_feature_index back.
HB_VERSION_CHECK's comparison was originally written wrongly
by mistake. When API tests were written, they were also written
wrongly to pass given the wrong implementation... Sigh.
Given the purpose of this API, there's no point in fixing it
without renaming it. As such, rename.
API changes:
HB_VERSION_CHECK -> HB_VERSION_ATLEAST
hb_version_check -> hb_version_atleast
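For reference, an "at least" version test can be sketched as below. The `SKETCH_*` names and the version numbers are hypothetical stand-ins, and the packing scheme assumes minor and micro stay below 100; this is not necessarily how the library implements it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the library's compile-time version numbers. */
#define SKETCH_VERSION_MAJOR 0
#define SKETCH_VERSION_MINOR 9
#define SKETCH_VERSION_MICRO 30

/* Pack a version triple into one comparable integer
 * (assumes minor < 100 and micro < 100). */
#define SKETCH_PACK(major, minor, micro) \
  ((major) * 10000 + (minor) * 100 + (micro))

/* True when the compiled-against version is at least major.minor.micro. */
#define SKETCH_VERSION_ATLEAST(major, minor, micro) \
  (SKETCH_PACK (major, minor, micro) <= \
   SKETCH_PACK (SKETCH_VERSION_MAJOR, SKETCH_VERSION_MINOR, SKETCH_VERSION_MICRO))

/* Runtime variant: is the library version lib_* at least the requested one? */
bool
sketch_version_atleast (unsigned lib_major, unsigned lib_minor, unsigned lib_micro,
                        unsigned major, unsigned minor, unsigned micro)
{
  return SKETCH_PACK (major, minor, micro) <=
         SKETCH_PACK (lib_major, lib_minor, lib_micro);
}
```

The direction of the comparison is the whole point: the requested version goes on the left of `<=`, the available version on the right.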
If pre-base reordering Ra is NOT formed (or formed and then
broken up), we should consider that Ra as base. This is
observable when there's a left matra or dotreph that positions
before base.
Now, it might be that we shouldn't do this if the Ra happened
to form a below form. We can't quite deduce that right now...
Micro test added. Also at:
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=186#c29
Sometimes font designers form half/pref/etc consonant forms
unconditionally and then undo that conditionally. Try to
recover the OT_H classification in those cases.
No test number changes expected.
Normally if you want to, say, conditionally prevent a 'pref', you
would use blocking contextual matching. Some designers instead
form the 'pref' form, then undo it in context. To detect that
we now also remember glyphs that went through MultipleSubst.
In the only place that this is used, Uniscribe seems to only care
about the "last" transformation between Ligature and Multiple
substitutions. Ie. if you ligate, expand, and ligate again, it
moves the pref, but if you ligate and expand it doesn't. That's
why we clear the MULTIPLIED bit when setting LIGATED.
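The "last transformation wins" bookkeeping can be sketched like this. The flag names and helpers are hypothetical and only model the idea; they are not the actual glyph-property code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-glyph property bits. */
enum {
  SKETCH_LIGATED    = 0x01,
  SKETCH_MULTIPLIED = 0x02
};

/* A ligature substitution sets LIGATED and forgets any earlier expansion. */
void
sketch_set_ligated (uint8_t *props)
{
  *props = (uint8_t) ((*props | SKETCH_LIGATED) & ~SKETCH_MULTIPLIED);
}

/* A multiple substitution only adds MULTIPLIED; LIGATED is kept. */
void
sketch_set_multiplied (uint8_t *props)
{
  *props = (uint8_t) (*props | SKETCH_MULTIPLIED);
}

/* True when the most recent of the two transformations was a ligation. */
bool
sketch_ligated_and_didnt_multiply (uint8_t props)
{
  return (props & SKETCH_LIGATED) && !(props & SKETCH_MULTIPLIED);
}
```

Ligate then expand leaves both bits set, so the check fails; ligating again clears MULTIPLIED and the check passes.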
Micro-test added. Test: U+0D2F,0D4D,0D30 with font from:
[1]
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=186#c29
Roboto was hitting this. FreeType also has pretty much the
same code for this, in ttcmap.c:tt_cmap4_validate():
/* in certain fonts, the `length' field is invalid and goes */
/* out of bound. We try to correct this here... */
if ( table + length > valid->limit )
{
if ( valid->level >= FT_VALIDATE_TIGHT )
FT_INVALID_TOO_SHORT;
length = (FT_UInt)( valid->limit - table );
}
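The same clamping idea, written as a standalone sketch rather than as FreeType or HarfBuzz code. `sketch_clamp_length` and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Clamp a declared subtable length to the bytes actually available.
 * `offset` is where the subtable starts inside a blob of `blob_len` bytes.
 * Returns false if the subtable starts outside the blob; otherwise stores
 * the usable length in *usable and returns true. */
bool
sketch_clamp_length (size_t blob_len, size_t offset, size_t declared,
                     size_t *usable)
{
  size_t available;

  if (offset > blob_len)
    return false;

  available = blob_len - offset;
  *usable = declared < available ? declared : available;
  return true;
}
```

A lenient reader uses `*usable` instead of the declared length; a strict validator would instead reject the table when the two differ.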
Sinhala and Telugu use "explicit" reph. That is, the reph is formed by
a Ra,H,ZWJ sequence. Previously, upon detecting this sequence, we were
checking whether the 'rphf' feature applies to the first two
glyphs of the sequence. This is how the Microsoft fonts are designed.
However, testing with Noto shows that apparently Uniscribe also forms
the reph if the lookup ligates all three glyphs. So, try both
sequences.
Doesn't affect test results for Sinhala or Telugu.
https://code.google.com/a/google.com/p/noto-alpha/issues/detail?id=232
The grammar in the OT spec, and the existing Windows implementation
seem to be confused around where to allow Asat around the medial
consonants.
The previous grammar for medial group was allowing an Asat after
the medial group only if there was a medial Wa or Ha, but not if
there was only a medial Ya. This doesn't make sense to me and
sounds reversed, as both medial Wa and Ha are below marks while
Asat is an above mark. An Asat can come before the medial group
already (in fact, multiple ones can. Why?!). The medial Ya
however is a spacing mark and according to Roozbeh it's valid
to want an Asat on the medial Ya instead of the base, so it looks
to me like we want to allow an Asat after the medial group if
there *was* a Ya but not if there wasn't any. Not wanting to
produce dotted-circle where Windows is not, this commit changes
the grammar to allow one Asat after the medial group no matter
what comes in the group.
Test: U+1002,103A,103B vs U+1002,103B,103A
Before we were just relying on the compiler inlining them and not
leaving a trace in our public API. Try to fix. Hopefully not
breaking anyone's build.
commit b5a0f69e47
Author: Behdad Esfahbod <behdad@behdad.org>
Date: Thu Oct 17 18:04:23 2013 +0200
[indic] Pass zero-context=false to would_substitute for newer scripts
For scripts without an old/new spec distinction, use zero-context=false.
This changes behavior in Sinhala / Khmer, but doesn't seem to regress.
This will be useful and used in Javanese.
The *intention* was to change zero-context from true to false for scripts that
don't have old-vs-new specs. However, checking the code, it looks like we
essentially changed zero-context to always be true; ie. we only changed things
for old-spec, and we broke them. That's what causes this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=76705
The root of the bug is here:
/* Use zero-context would_substitute() matching for new-spec of the main
* Indic scripts, but not for old-spec or scripts with one spec only. */
bool zero_context = indic_plan->config->has_old_spec || !indic_plan->is_old_spec;
Note that is_old_spec itself is:
indic_plan->is_old_spec = indic_plan->config->has_old_spec && ((plan->map.chosen_script[0] & 0x000000FF) != '2');
It's easy to show that zero_context is now always true. What we really meant was:
bool zero_context = indic_plan->config->has_old_spec && !indic_plan->is_old_spec;
Ie, "&&" instead of "||". We made this change supposedly to make Javanese
work. But apparently we got it working regardless! So I'm going to fix this
to only change the logic for old-spec and not touch other cases.
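To make the "always true" claim concrete, here is a small standalone sketch of both expressions. The function names are hypothetical; the boolean logic is copied from the two lines quoted above.

```c
#include <stdbool.h>

/* Mirrors the quoted definition: old spec is in use when the script has an
 * old spec and the chosen script tag does not end in '2'. */
bool
sketch_is_old_spec (bool has_old_spec, char script_tag_last)
{
  return has_old_spec && script_tag_last != '2';
}

/* The buggy expression, using "||". */
bool
sketch_zero_context_buggy (bool has_old_spec, char script_tag_last)
{
  return has_old_spec || !sketch_is_old_spec (has_old_spec, script_tag_last);
}

/* The intended expression, using "&&". */
bool
sketch_zero_context_fixed (bool has_old_spec, char script_tag_last)
{
  return has_old_spec && !sketch_is_old_spec (has_old_spec, script_tag_last);
}
```

If has_old_spec is true the "||" short-circuits to true; if it is false then is_old_spec is false as well, so its negation is true. The "&&" form is true only for new-spec shaping of a script that has an old spec.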
This is a higher-priority shaper than default shaper ("ot"), but
only picks up fonts that have AAT "morx"/"mort" table.
Note that for this to work the font face's get_table() implementation
should know how to return the full font blob.
Based on patch from Konstantin Ritt.
Not exhaustively tested, but I think I got the intended logic
right.
The logic can perhaps be simplified. Maybe we should disable
normalization with this shaper. Then again, for now focusing on
correctness.
When seeing U+2044 FRACTION SLASH in the text, find decimal
digits (Unicode General Category Decimal_Number) around it,
and mark the pre-slash digits with 'numr' feature, the post-slash
digits with 'dnom' feature, and the whole sequence with 'frac'
feature.
This beautifully renders fractions with major Windows fonts,
and any other font that implements those features (numr/dnom is
enough for most fonts.)
Not the fastest way to do this, but good enough for a start.
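A rough sketch of the marking pass, under simplifying assumptions: digits are limited to ASCII 0-9 instead of the full Decimal_Number category, and the feature masks are hypothetical bit values rather than real feature masks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits. */
enum { SKETCH_NUMR = 1, SKETCH_DNOM = 2, SKETCH_FRAC = 4 };

/* Assumption: only ASCII digits count; the real check is General Category
 * Decimal_Number. */
static int
sketch_is_digit (uint32_t cp)
{
  return cp >= '0' && cp <= '9';
}

/* For each U+2044 in text[0..len), OR feature bits into masks[0..len):
 * numr|frac on the digits before the slash, frac on the slash itself,
 * dnom|frac on the digits after it. */
void
sketch_mark_fractions (const uint32_t *text, uint8_t *masks, size_t len)
{
  size_t i, j;

  for (i = 0; i < len; i++)
  {
    size_t start = i, end = i + 1;
    if (text[i] != 0x2044u)
      continue;

    while (start > 0 && sketch_is_digit (text[start - 1]))
      start--;
    while (end < len && sketch_is_digit (text[end]))
      end++;

    for (j = start; j < i; j++)
      masks[j] |= SKETCH_NUMR | SKETCH_FRAC;
    masks[i] |= SKETCH_FRAC;
    for (j = i + 1; j < end; j++)
      masks[j] |= SKETCH_DNOM | SKETCH_FRAC;
  }
}
```

Like the change it illustrates, this rescans the neighbours of every slash, so it is simple rather than fast.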
CoreText does automatic font fallback (AKA "cascading") for characters
not supported by the requested font, and provides no way to turn it off,
so detect if the returned run uses a font other than the requested one
and fill in the buffer with .notdef glyphs instead of random glyph
indices from a different font.
The spec and Uniscribe don't allow these, but UTN#11
specifically says the sequence U+104B,U+1038 is valid.
As such, allow all "P V" sequences. There's about
eight sequences that match that structure, but Roozbeh
thinks it's fine to allow all of them.
Test case: U+104B, U+1038
https://bugs.freedesktop.org/show_bug.cgi?id=71947
The spec and Uniscribe treat it as consonant in the grammar, but
it's not in IndicSyllableCategory.txt, so fix up.
Test sequence: U+1004,U+103A,U+1039,U+104E
https://bugs.freedesktop.org/show_bug.cgi?id=71948
This is broken sequence according to OpenType spec, Uniscribe,
and current HarfBuzz implementation. But Roozbeh says this
is a valid sequence, so allow it. There are multiple
"(DB As?)?" constructs in the grammar, but Roozbeh thinks only
this one needs changing.
Test case: 1014,1063,103A
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=71949
Based on research into latest SIL and Windows fonts, pulling in
the latest OpenType language tag proposal from Microsoft, and updating
to latest language tags and names from ISO 639.
This reverts commit d5bd0590ae.
The reasoning behind that logic was flawed and made under
a misunderstanding of the original problem, and caused
regressions as reported by Jonathan Kew in thread titled
"tibetan marks" in Oct 2013. Apparently I had fixed
the original problem with this commit:
7e08f1258d
So, revert the faulty commit and everything seems to be in good
shape.
For Javanese (pref_len == 1) only reorder if it didn't ligate. That's
sensible, and what the spec says. For other Indic (pref_len > 1)
only reorder if ligated.
Doesn't change any test numbers.
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not provide
same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
Test with U+0CB0,U+200D,U+0CCD,U+0C95,U+0CBF and tunga.ttf.
Improves some scripts. Improves Bengali too, but numbers
are up because we produce better results than Uniscribe for some
sequences now.
New numbers:
BENGALI: 353724 out of 354188 tests passed. 464 failed (0.131004%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951190 out of 951913 tests passed. 723 failed (0.0759523%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1048140 out of 1048334 tests passed. 194 failed (0.0185056%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271662 out of 271847 tests passed. 185 failed (0.068053%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
Lohit-Punjabi has a upem of 769! We were losing one unit in our
code, and FreeType is losing another one... Test with U+0A06.
Has an advance of 854 in the font. We were producing 852.
Now we do 853, which is what FreeType is telling us.
See comments from caveat! Seems to work fine.
This is useful for Javanese which has an atomically encoded pre-base
reordering Ra which should only be reordered if it was substituted
by the pref feature.
For scripts without an old/new spec distinction, use zero-context=false.
This changes behavior in Sinhala / Khmer, but doesn't seem to regress.
This will be useful and used in Javanese.
Which means these two are applied per-syllable now. Apparently
in some Khmer fonts the clig interacts with presentation features.
Test case: U+1781,U+17D2,U+1789,U+17BB,U+17C6 with Mondulkiri-R.ttf
should produce one big ligature.
Commit 6b65a76b40. "end" was becoming
negative. Was triggered by Lohit-Kannada 2.5.3 and the sequence:
U+0CB0,U+200D,U+0CBE,U+0CB7,U+0CCD,U+0C9F,U+0CCD,U+0CB0,U+0C97,U+0CB3
Two glyphs were being duplicated.
First, we were abusing OT_VD instead of OT_A. Fix that
by moving OT_A in the grammar where it belongs (which
is different from what the spec says).
Also, only allow medial consonants after all other
consonants. This doesn't affect any current character.
Finally, fix Halant attachment in presence of medial
consonants. Again, this currently doesn't affect any
sequence.
I lied. There's Gurmukhi U+0A75 which is Consonant_Medial.
Uniscribe allows one of those in each of these positions:
before matras, after matras and before syllable modifiers,
and after syllable modifiers! We currently just allow
unlimited numbers of it, before matras.
Unicode 6.2.0 Section 16.2 / Figure 16.3 says:
"For backward compatibility, between Arabic characters a ZWJ acts just
like the sequence <ZWJ, ZWNJ, ZWJ>, preventing a ligature from forming
instead of requesting the use of a ligature that would not normally be
used. As a result, there is no plain text mechanism for requesting the
use of a ligature in Arabic text."
As such, we flip internal zwj to zwnj flags for GSUB matching, which
means it will block ligation in all features, unless the font
explicitly matches U+200D glyph. This doesn't affect joining behavior.
Bug 70509 - Candrabindu+Visarga doesn't work in Devanagari
https://bugs.freedesktop.org/show_bug.cgi?id=70509
We categorize both bindus and visarga as syllable-modifiers.
OT spec doesn't actually say what characters go in the syllable
modifier category, and allows one. We just allow up to two now.
Test case: U+0930,U+0941,U+0901,U+0903
Uniscribe currently doesn't support that and produces a
dotted circle.
Seems to better match Uniscribe.
Note: NotoSansTelugu-Regular has kern feature, so this fixes most of the
positioning failures there, except for the kern pairs blocked by a
(non-)joiner, in which case we (correctly) kern, but Uniscribe doesn't.
More like Uniscribe... We still allow user-defined features to
work across syllables, but not pres,blws,abvs,psts,etc.
This "regressed" Sinhala numbers by 11. These are cases where
there's Consonant followed by Ra,Halant,ZWJ at the end of text.
The Ra,Halant,ZWJ ends up forming reph, which is wrong...
But before we were also ligating that reph with the previous
consonant. That's even more wrong. That's also what Uniscribe
does.
Current numbers:
BENGALI: 353732 out of 354188 tests passed. 456 failed (0.128745%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60732 out of 60747 tests passed. 15 failed (0.0246926%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
MALAYALAM: 1048140 out of 1048334 tests passed. 194 failed (0.0185056%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271655 out of 271847 tests passed. 192 failed (0.070628%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
Previously we only supported recursive sublookups with
ascending indices. We were also not correctly handling
non-1-to-1 recursed lookups.
Fix all that!
Fixes the three tests in test/shaping/tests/context-matching.tests,
which were derived from NotoSansBengali and NotoSansDevanagari
among others.
Should move these out of the public header...
We're "clean" of introspection warnings now. Remaining ones are about
graphite2 / freetype types not being introspectable.
This reverts commit 10f964623f.
See discussion with Khaled Hosny on mailing list. In short, since
integers here can be negative, and int division is "round towards
zero", proper rounding should take sign into account. Just skip
doing it again; not rounding has been serving us well before.
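For illustration only, the sign issue looks like this in isolation. Both helpers are hypothetical, assume a positive divisor, and ignore overflow at the extremes of int; neither is code from the reverted commit.

```c
/* Naive rounding: add half the divisor, then divide. Because C integer
 * division truncates toward zero, this is only right for a >= 0
 * (b > 0 assumed). */
int
sketch_div_round_naive (int a, int b)
{
  return (a + b / 2) / b;
}

/* Sign-aware rounding: round the magnitude, then restore the sign, so
 * halves go away from zero (b > 0 assumed; a == INT_MIN not handled). */
int
sketch_div_round (int a, int b)
{
  return a >= 0 ? (a + b / 2) / b : -((-a + b / 2) / b);
}
```

For -8 / 3 (about -2.67) the naive form computes -7 / 3, which truncates to -2, while the sign-aware form gives -3.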
They've been disabled for a while and no one cared. We're past
the point to need them for testing, and if we ever need to
resurrect them again, well, they're in git graveyard somewhere.
Initial setup of gtk-doc. Straight forward setup following the gtk-doc
instructions. Ignore some troublesome types in src/hb-gobject.h. To
build use "./autogen.sh --enable-gtk-doc" then "make". Docs are in
harfbuzz/docs/reference/html/index.html.
Based on patch from Jonathan Kew and data from Apple.
It's not working correctly though, and I suspect I'm hitting a bug in
CoreText. When I do this:
hb-shape /Library/Fonts/Zapfino.ttf ZapfinoZapfino --shaper coretext \
--features=-liga
I expect both ligatures to turn off, but only the second one does:
[Z_a_p_f_i_n_o=0+2333|Z=7+395|a=8+285|p_f=9+433|i=11+181|n=12+261|o=13+250]
whereas if I disable 'dlig' instead of 'liga', both are turned off.
Smells...
Doesn't resolve conflicting feature settings.
When installing per-process fonts using AddFontMemResourceEx(),
if a font with the same family name is already installed, sometimes
that one gets used. Which is problematic for us. As such, we
now mangle the font to install a new 'name' table with a unique
name, which we then use to choose the font.
Patch from Jonathan Kew.
During GSUB, if a ligation happens, subsequent context input matching
matches the new indexing. During GPOS however, the indices never
change. So just go one by one.
Fixes 'dist' positioning with mmrtext.ttf and the following sequence:
U+1014,U+1039,U+1011,U+1014,U+1039,U+1011,U+1014,U+1039,U+1011
Reported by Jonathan Kew.
Email from Jonathan Kew:
My cygwin build kept aborting on certain test words when run with the
uniscribe backend. Turned out this was caused by a bug in the allocation
of scratch buffers in hb-uniscribe.cc.
Commit 2a17f9568d introduced a new line
ALLOCATE_ARRAY (SCRIPT_VISATTR, vis_attr, glyphs_size);
but it failed to account for this in the computation of glyphs_size
(the number of glyphs for which scratch buffer space is available),
with the result that the vis_clusters array ends up overrunning the
end of the scratch buffer and clobbering the beginning of the buffer's
info[].
AFAICS, the vis_attr array is not actually used, so the simple fix is
to remove the line that allocates it. (If/when we -do- need to use
vis_attr for something, we'll need to add another term to the earlier
calculation of glyphs_size.)
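The accounting problem can be sketched as follows: the glyph capacity of a scratch buffer has to be derived from the sizes of all per-glyph arrays carved out of it. `sketch_glyphs_capacity` is a hypothetical helper, and the sketch ignores alignment padding.

```c
#include <stddef.h>

/* How many glyphs fit in `scratch_bytes` when every glyph needs one element
 * in each of `n_arrays` parallel arrays with the given element sizes.
 * Alignment padding is ignored (an assumption of this sketch). */
size_t
sketch_glyphs_capacity (size_t scratch_bytes,
                        const size_t *elem_sizes, size_t n_arrays)
{
  size_t per_glyph = 0;
  size_t i;

  for (i = 0; i < n_arrays; i++)
    per_glyph += elem_sizes[i];

  return per_glyph ? scratch_bytes / per_glyph : 0;
}
```

Allocating one more per-glyph array without adding its element size to the sum overstates the capacity, which is how the last array ends up running past the end of the scratch space.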
With this patch, the uniscribe backend runs reliably again.
JK
This changes the semantics of the get_glyph() callback and expects that
callbacks return false if the requested variant is not available, and
then we will call them back with variation_selector=0 and will retain
the glyph for the selector in the glyph stream.
Apparently most Mongolian fonts implement the Mongolian Variation
Selectors using GSUB, not cmap.
https://bugs.freedesktop.org/show_bug.cgi?id=65258
Note that this doesn't fix the Mongolian shaping yet, because the way
that's implemented is that the, say, 'init' feature ligates the letter
and the variation-selector. However, since currently the variation
selector doesn't have the 'init' mask on, it will not be matched...
If there's a mark ligating forward with non-mark, they were
inheriting the GC of the mark and later get advance-zeroed.
Don't do that if there's any non-mark glyph in the ligature.
Sample test: U+1780,U+17D2,U+179F with Kh-Metal-Chrieng.ttf
Also:
Bug 58922 - Issue with mark advance zeroing in generic shaper
We were not initializing the digests properly and as a result they were
being initialized to zero, making digest1 never do any useful work.
Speeds up Amiri shaping significantly.
See thread "an issue regarding discrepancy between Korean and Unicode
standards" on the mailing list for the rationale. In short: Uniscribe
doesn't, so fonts are designed to work without it.
Testing shows that this is closer to what Uniscribe does.
Reported by Khaled Hosny:
"""
commit 568000274c
...
This commit is causing a regression with Amiri, the string “هَٰذ” with
Uniscribe and HarfBuzz before this commit, gives:
[uni0630.fina=3+965|uni0670.medi=0+600|uni064E=0@-256,0+0|uni0647.init=0+926]
But now it gives:
[uni0630.fina=3+965|uni0670.medi=0+0|uni064E=0@-256,0+0|uni0647.init=0+926]
i.e. uni0670.medi is zeroed though it has a base glyph GDEF class.
"""
The test case is U+0647,U+064E,U+0670,U+0630 with Amiri.
After the Ngapi hackfest work, we were assuming that fonts
won't use presentation features to choose specific forms
(eg. conjuncts). As such, we were using auto-joiner behavior
for such features. It proved to be troublesome as many fonts
used presentation forms ('pres') for example to form conjuncts,
which need to be disabled when a ZWJ is inserted.
Two examples:
U+0D2F,U+200D,U+0D4D,U+0D2F with kartika.ttf
U+0995,U+09CD,U+200D,U+09B7 with vrinda.ttf
What we do now is to never do magic to ZWJ during GSUB's main input
match for Indic-style shapers. Note that backtrack/lookahead are still
matched liberally, as is GPOS. This seems to be an acceptable
compromise.
As to the bug that initially started this work, that one needs to
be fixed differently:
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not
provide same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers:
BENGALI: 353689 out of 354188 tests passed. 499 failed (0.140886%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048102 out of 1048334 tests passed. 232 failed (0.0221304%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
That flag is redundant, deprecated, and ignored since April 2011.
From FreeType git log:
commit 8c82ec5b17d0cfc9b0876a2d848acc207a62a25a
Author: Behdad Esfahbod <behdad@behdad.org>
Date: Thu Apr 21 08:21:37 2011 +0200
Always ignore global advance.
This makes FT_LOAD_IGNORE_GLOBAL_ADVANCE_WIDTH redundant,
deprecated, and ignored. The new behavior is what every major user
of FreeType has been requesting. Global advance is broken in many
CJK fonts. Just ignoring it by default makes most sense.
* src/truetype/ttdriver.c (tt_get_advances),
src/truetype/ttgload.c (TT_Get_HMetrics, TT_Get_VMetrics,
tt_get_metrics, compute_glyph_metrics, TT_Load_Glyph),
src/truetype/ttgload.h: Implement it.
* docs/CHANGES: Updated.
I added these because the older mingw32 toolchain didn't have
MemoryBarrier(). The newer mingw-w64 toolchain however has it.
As reported by John Emmas this was causing build failure with
MSVC (on glib) because of inline issues. But that reminded me
that we may be taking this path even if the system implements
MemoryBarrier as a function, which is a waste. So, just remove
it.
Before, we were marking them as below-form for initial reordering.
However, there is a rule that says "post consonants should follow
below consonants" for base determination purposes. Malayalam has
post-form YA/VA, and RA is pre-base. As such, for a sequence like
YA,Virama,YA,Virama,RA, the correct base is at index 0. But
because the code was seeing RA as a below-base, it was stopping at
the second YA as base, instead of jumping it as a post-base.
By treating prebase-reordering consonants like post-forms, this
is fixed.
MALAYALAM went down from 351 to 265. Other numbers didn't change:
BENGALI: 353686 out of 354188 tests passed. 502 failed (0.141733%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048069 out of 1048334 tests passed. 265 failed (0.0252782%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Such fonts are *definitely* really broken. Give up.
Limits time spent in sanitize for extremely / deliberately broken
fonts. For example, two fonts with these md5sum / names:
9343f0a1b8c84b8123e7d201cae62ffd.ttf
eb8c978547f09d368fc204194fb34688.ttf
were spending over a second in sanitize! Not anymore.
This fixes a design bug with sanitize and sub-blobs that can
cause crashes. Jonathan and I found and debugged this issue
when we tested a corrupt font with the md5sum / filename:
ea395483d37af0cb933f40689ff7b60a. After two hours of intense
debugging we found out that the font has overlapping GSUB/GPOS
tables, and as such, sanitizing the second table can modify
the first one, which can cause all kinds of undefined behavior.
The correct way to fix this is to make sure sub-blobs are
always created readonly, since we consider the parent blob
to be a shared resource and can't modify it, even if it *is*
writable.
This essentially makes the READONLY_MAY_MAKE_WRITABLE mode
unused... Maybe we should simply remove / deprecate it.
When a match_func was not set on the matcher_t object (ie. from GPOS),
then the Default_Ignorables (including joiners) were never skipped.
This meant that they were not skipped as they should during GPOS
matching. Fix that.
A few Indic numbers have "regressed": BENGALI and DEVANAGARI went
up from 290 and 58 respectively, but in both cases new results are
superior to Uniscribe, as they apply GPOS when we weren't (and
Uniscribe isn't) before.
BENGALI: 353686 out of 354188 tests passed. 502 failed (0.141733%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The code was confused because it was expecting left matra to have
POS_PRE_M, like we do in the Myanmar shaper, but that is not what
we were doing in this shaper. Rewrite to rely on category only.
Test case: U+AA06,U+AA34,U+AA2F
Before, if one called hb_shape() without setting script, language, and
direction on the buffer, hb_shape() was calling
hb_buffer_guess_segment_properties() on the user's behalf to guess
these.
This is very dangerous, since any serious user of HarfBuzz must set
these properly (especially important is direction). So now, we don't
guess properties by default. People not setting direction will get
an abort() now. If the old behavior is desired (fragile, good for
simple testing only), users can call
hb_buffer_guess_segment_properties() on the buffer just before calling
hb_shape().
Surprisingly, if a user ever tried to turn a default feature off partially
(say, disable liga for a range), the feature was being turned off
globally! Fixed now.
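A toy model of the intended behavior, with per-position masks: a default-on feature is switched off only inside the requested range. The struct, names, and bit value are hypothetical and much simpler than the real feature map.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LIGA_BIT 0x01u

/* Hypothetical user feature setting over the half-open range [start, end). */
typedef struct {
  uint8_t bit;
  int     enable;
  size_t  start;
  size_t  end;
} sketch_feature_t;

/* Start every position from the default mask, then apply each user feature
 * only to the positions inside its own range. */
void
sketch_apply_masks (uint8_t *masks, size_t len, uint8_t default_mask,
                    const sketch_feature_t *features, size_t n_features)
{
  size_t i, f;

  for (i = 0; i < len; i++)
    masks[i] = default_mask;

  for (f = 0; f < n_features; f++)
  {
    size_t end = features[f].end < len ? features[f].end : len;
    for (i = features[f].start; i < end; i++)
    {
      if (features[f].enable)
        masks[i] = (uint8_t) (masks[i] | features[f].bit);
      else
        masks[i] = (uint8_t) (masks[i] & ~features[f].bit);
    }
  }
}
```

The bug described above corresponds to letting the ranged "off" setting overwrite `default_mask` itself, which turns the feature off everywhere.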
Originally we meant to match backtrack/lookahead across syllable
boundaries. But a bug in the code meant that this was NOT done for
backtrack. We "fixed" that in 2c7d0b6b80,
but that broke Myanmar shaping.
We now believe that for Indic-like shapers (which is where syllables are
used), all basic shaping forms should be fully contained within their
syllables, so now we limit backtrack/lookahead matching to the syllable
too. Unbreaks Myanmar.
Not for Arabic, but for Indic-like scripts. ZWJ/ZWNJ have special
meanings in those scripts, so let font lookups take full control.
This undoes the regression caused by automatic-joiners handling
introduced two commits ago.
We only disable automatic joiner handling for the "basic shaping
features" of Indic, Myanmar, and SEAsian shapers. The "presentation
forms" and other features are still applied with automatic-joiner
handling.
This change also changes the test suite failure statistics, such that
a few scripts show more "failures". The most affected is Kannada.
However, upon inspection, we believe that in most, if not all, of the
new failures, we are producing results superior to Uniscribe. Hard to
count those!
Here's an example of what is fixed by the recent joiner-handling
changes:
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers, for future reference:
BENGALI: 353892 out of 354188 tests passed. 296 failed (0.0835714%)
DEVANAGARI: 707336 out of 707394 tests passed. 58 failed (0.00819911%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
When matching lookups, be smart about default-ignorable characters.
In particular:
Do nothing specific about ZWNJ, but for the other default-ignorables:
If the lookup in question uses the ignorable character in a sequence,
then match it as we used to do. However, if the sequence match will
fail because the default-ignorable blocked it, try skipping the
ignorable character and continue.
The most immediate thing it means is that if Lam-Alef forms a ligature,
then Lam-ZWJ-Alef will do too. Finally!
One exception: when matching for GPOS, or for backtrack/lookahead of
GSUB, we ignore ZWNJ too. That's the right thing to do.
It certainly is possible to build fonts for which this feature will
result in undesirable glyphs, but it's hard to think of a real-world
case where that would happen.
This *does* break Indic shaping right now, since Indic Unicode has
specific rules for what ZWJ/ZWNJ mean, and skipping ZWJ is breaking
those rules. That will be fixed in upcoming commits.
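The input-matching rule can be sketched on plain code points as below. This is a simplified model with hypothetical names: it treats ZWJ as the only skippable default-ignorable, never skips ZWNJ, and does not model the GPOS or backtrack/lookahead exception.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ZWNJ 0x200Cu
#define SKETCH_ZWJ  0x200Du

/* Try to match pattern[0..plen) at the start of text[0..tlen).
 * A ZWJ in the text is matched literally when the pattern asks for it at
 * that position; otherwise it is skipped instead of blocking the match.
 * Returns the number of text items consumed, or 0 if there is no match. */
size_t
sketch_match (const uint32_t *text, size_t tlen,
              const uint32_t *pattern, size_t plen)
{
  size_t t = 0, p;

  for (p = 0; p < plen; p++)
  {
    /* Skip only between pattern items, and only when the ZWJ would
     * otherwise make this position fail. */
    while (p > 0 && t < tlen &&
           text[t] == SKETCH_ZWJ && pattern[p] != SKETCH_ZWJ)
      t++;

    if (t >= tlen || text[t] != pattern[p])
      return 0;
    t++;
  }
  return t;
}
```

With a Lam-Alef pattern (U+0644, U+0627), the text Lam-ZWJ-Alef matches and consumes three items, while Lam-ZWNJ-Alef does not match.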
Ouch, how did things ever work without this?! The added test that has a
dot-reph as well as a pre-base reordering Ra perfectly demonstrates the
bug (tested with Nirmala font from Win8 for example). Testing suggests
that Win8 shaper has the *exact* same bug / behavior that we used to
have. Odd.
This is a followup to 568000274c.
Looks like in the Latin shaper, Uniscribe zeroes all Unicode NSM
advances *after* GPOS, not before. Match that.
Can be tested using DejaVu Sans Mono, since that font has GPOS
rules to zero the mark advances on its own.
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We now have three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
  is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
  *in GDEF*, and do that *before* applying GPOS. This seems
  to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
  from General_Category=Mn Unicode characters, and do so
  before applying GPOS. This seems to be what Uniscribe does
  for Latin at least.
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See previous
commit.
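A rough sketch of the three zeroing behaviors listed above, as they would run before GPOS. The policy table, the shaper names and the dict-based glyph records are hypothetical stand-ins, not the real data structures.

```python
import unicodedata

# Hypothetical policy table: how marks are identified for zeroing.
#   None      -> no explicit zeroing (Indic)
#   "gdef"    -> glyphs whose GDEF class is mark (Myanmar)
#   "unicode" -> characters with General_Category=Mn (everything else)
ZEROING_POLICY = {
    "indic": None,
    "myanmar": "gdef",
    "default": "unicode",
}


def zero_mark_advances(shaper, glyphs):
    """Zero mark advances in place; meant to be called before GPOS."""
    policy = ZEROING_POLICY.get(shaper, ZEROING_POLICY["default"])
    if policy is None:
        return glyphs
    for g in glyphs:
        if policy == "gdef":
            is_mark = g["gdef_class"] == "mark"
        else:
            is_mark = unicodedata.category(g["char"]) == "Mn"
        if is_mark:
            g["advance"] = 0
    return glyphs
```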
Implemented as a hack for now. Myanmar failures down from 23 to 15.
MYANMAR: 1123868 out of 1123883 tests passed. 15 failed (0.00133466%)
The remaining 15 cases are all where the syllable is wrong according to
the OpenType spec. We insert dottedcircle. Uniscribe fails to do that,
but it also fails to reorder the prebase-reordering medial-Ra. So it
gets it wrong.
Before, when matching ligatures, we never skipped over base / liga
glyphs even if that was what the LookupFlags asked for.
Fixed now. We carefully reviewed all instances of this, and tested with
Amiri as well as some Indic scripts, and are confident that this should
NOT break anyone's fonts. It's also how Uniscribe does it, from what
we can tell.
Before, for most scripts, we were not trying to recompose two characters
if the second one had ccc=0. That fails for Myanmar where U+1026
decomposes to U+1025,U+102E, both of which have ccc=0. However, we do
want to try to recompose those. We now check whether the second is a
mark, using general category instead.
At the same time, remove optimization that was conflicting with this.
[Let the Ngapi hackfest begin!]
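The difference between the two checks can be seen with Python's `unicodedata` module. This is only a sketch of the condition, with a made-up helper name (`should_try_recompose`); it is not the normalizer code.

```python
import unicodedata


def should_try_recompose(second, use_general_category=True):
    """Decide whether to attempt composing a base with `second`.

    Old rule: only when `second` has a non-zero combining class (ccc).
    New rule: whenever `second` is a mark by general category (M*).
    """
    if use_general_category:
        return unicodedata.category(second).startswith("M")
    return unicodedata.combining(second) != 0


# U+1026 decomposes to U+1025,U+102E, and both have ccc=0.
first, second = "\u1025", "\u102e"
old_rule = should_try_recompose(second, use_general_category=False)
new_rule = should_try_recompose(second)
recomposed = unicodedata.normalize("NFC", first + second)
```

Under the old rule the pair is never considered; under the new one it is, and the pair composes back to U+1026.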
This reverts commit fab7a71f11.
Conflicts:
src/hb-ot-shape-complex-indic-machine.hh
Keeping that generated file in-tree causes problems with processes like
tinderbox that automatically fetch and build harfbuzz. It's harder to
bootstrap harfbuzz now (as it was previously), but I'm willing to give this
another chance and see how it goes.
If in a MarkPos table, a base has no anchor for a particular mark class,
return NULL such that the subsequent subtables get a chance at it.
Test case:
hb-shape ./EBGaramond12-Regular.otf ἂ --features="ss20","smcp"
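The fall-through behavior can be sketched like this. The subtable representation (a dict from base glyph to a dict of per-class anchors) and the function names are hypothetical; only the "return nothing so the next subtable gets a chance" logic is the point.

```python
def apply_mark_to_base(subtable, base_glyph, mark_class):
    """Return the anchor for (base, mark class), or None if not applied."""
    anchors = subtable.get(base_glyph, {})
    anchor = anchors.get(mark_class)
    if anchor is None:
        # The base has no anchor for this mark class: report "not
        # applied" instead of positioning the mark at a bogus anchor.
        return None
    return anchor


def apply_lookup(subtables, base_glyph, mark_class):
    """Try each subtable in order; the first one that applies wins."""
    for subtable in subtables:
        anchor = apply_mark_to_base(subtable, base_glyph, mark_class)
        if anchor is not None:
            return anchor
    return None
```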
API additions:
hb_segment_properties_t
HB_SEGMENT_PROPERTIES_DEFAULT
hb_segment_properties_equal()
hb_segment_properties_hash()
hb_buffer_set_segment_properties()
hb_buffer_get_segment_properties()
hb_ot_layout_glyph_class_t
hb_shape_plan_t
hb_shape_plan_create()
hb_shape_plan_create_cached()
hb_shape_plan_get_empty()
hb_shape_plan_reference()
hb_shape_plan_destroy()
hb_shape_plan_set_user_data()
hb_shape_plan_get_user_data()
hb_shape_plan_execute()
hb_ot_shape_plan_collect_lookups()
API changes:
Rename hb_ot_layout_feature_get_lookup_indexes() to
hb_ot_layout_feature_get_lookups().
New header file:
hb-shape-plan.h
And a bunch of prototyped but not implemented stuff. Coming soon.
(Tests fail because of the prototypes right now.)
This is important for the Sinhala U+0DDA split matra since it decomposes
to U+0DD9,U+0DCA where U+0DD9 is a left matra and U+0DCA is the virama.
We don't want to move the virama with the left matra.
TEST: U+0D9A,U+0DDA
Note that we were already doing this in the Uniscribe bug compatibility
mode. We now do it all the time.
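A small sketch of the intended result for the test sequence, using Python's `unicodedata` for the decomposition. The reordering helper is a simplification invented for this example (one base followed by signs); it is not the Indic shaper's logic.

```python
import unicodedata

LEFT_MATRA = "\u0dd9"  # SINHALA VOWEL SIGN KOMBUVA
VIRAMA = "\u0dca"      # SINHALA SIGN AL-LAKUNA


def reorder_left_matra(cluster):
    """Decompose the cluster and move only the left matra before the base."""
    decomposed = unicodedata.normalize("NFD", cluster)
    base, rest = decomposed[0], decomposed[1:]
    left = [c for c in rest if c == LEFT_MATRA]
    # Everything else, including the virama, stays after the base.
    others = [c for c in rest if c != LEFT_MATRA]
    return "".join(left) + base + "".join(others)


# TEST sequence from above: U+0D9A,U+0DDA
visual = reorder_left_matra("\u0d9a\u0dda")
```

Here `visual` ends up as U+0DD9,U+0D9A,U+0DCA: the left matra precedes the base and the virama stays behind it.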
New API:
hb_buffer_flags_t
HB_BUFFER_FLAGS_DEFAULT
HB_BUFFER_FLAG_BOT
HB_BUFFER_FLAG_EOT
HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES
hb_buffer_set_flags()
hb_buffer_get_flags()
We use the BOT flag to decide whether to insert dottedcircle if the
first char in the buffer is a combining mark.
The PRESERVE_DEFAULT_IGNORABLES flag prevents removal of characters like
ZWNJ/ZWJ/...
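The effect of the two flags can be modelled in a few lines. Everything here is a stand-in: the `BufferFlag` enum and its numeric values, the tiny ignorable set and `prepare_text` are assumptions for illustration, and the sketch works on text rather than on shaped glyphs. It also assumes that BOT being set is what enables the dotted circle.

```python
import enum
import unicodedata


class BufferFlag(enum.IntFlag):
    # Hypothetical mirror of the hb_buffer_flags_t values.
    DEFAULT = 0
    BOT = 1
    EOT = 2
    PRESERVE_DEFAULT_IGNORABLES = 4


DOTTED_CIRCLE = "\u25cc"
# Toy subset of default-ignorables (ZWNJ, ZWJ).
DEFAULT_IGNORABLES = {"\u200c", "\u200d"}


def prepare_text(text, flags):
    """Model the visible effect of the BOT and PRESERVE flags."""
    starts_with_mark = bool(text) and unicodedata.category(text[0]).startswith("M")
    if (flags & BufferFlag.BOT) and starts_with_mark:
        # Beginning of text and the first character is a combining mark.
        text = DOTTED_CIRCLE + text
    if not (flags & BufferFlag.PRESERVE_DEFAULT_IGNORABLES):
        text = "".join(c for c in text if c not in DEFAULT_IGNORABLES)
    return text
```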
Had to do some refactoring to make this happen...
Under Uniscribe bug compatibility mode, we still split them
Uniscribe-style, but Jonathan and I convinced ourselves that there is no
harm doing this the Unicode way. This change makes that happen, and
unbreaks free Sinhala fonts.
Windows 8 adds a Myanmar shaper using the 'mym2' tag. Route that
through the Indic shaper. It's still very broken, but at least this
does NOT break old-style Myanmar shaping using the generic shaper.
For Arabic and Indic shapers, if the font doesn't have a script system
for the script, use default shaper.
Make an exception for Arabic script since we have fallback logic for
that one.
As reported on the list:
I am seeing a similar problem building harfbuzz 0.9.5 with Apple gcc
4.0.1 on OS X 10.5 Leopard:
hb-ot-layout-common-private.hh:406: error: 'struct
OT::CoverageFormat1::Iter' is private
hb-ot-layout-common-private.hh:646: error: within this context
hb-ot-layout-common-private.hh:500: error: 'struct
OT::CoverageFormat2::Iter' is private
hb-ot-layout-common-private.hh:647: error: within this context
make[4]: *** [libharfbuzz_la-hb-ot-layout.lo] Error 1
Also reported as happening with MSVC 2005.
Uniscribe doesn't. And some fonts abuse this feature to get Indic
shaping working in non-complex applications like Adobe's apps.
No change in numbers:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Patch from Jonathan Kew.
Part of fixing:
Mozilla Bug 801410 - avoid inserting dotted-circle for run-initial
Unicode combining characters in "simple" scripts such as Latin
https://bugzilla.mozilla.org/show_bug.cgi?id=801410
The logic for pre-base reordering follows the left matra logic.
We had an exception for Malayalam/Tamil in the left matra repositioning
which was not reflected in pre-base reordering.
Malayalam failures down from 337 to 323.
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Incidentally, this makes it not crash with icu-le-hb anymore...
I'm not smart / stupid enough to spend two more days debugging C++
linking issues, and this is ABI-stable at least.
That's really the logic desired. Except that MONGOLIAN VOWEL SEPARATOR
is not default_ignorable but it really should be. Reported to Unicode.
Based on suggestion from Konstantin Ritt.
To be used for a variety of purposes. We save up to five characters
in each direction. No public API changes, everything is taken care
of already. All clients need to do is to call hb_buffer_add_utf* with
the full text + segment info (or at least some context) instead of
just passing in the segment.
Various operations (hb_buffer_reset, hb_buffer_set_length,
hb_buffer_add*) automatically reset the relevant contexts.
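What gets saved can be sketched as a simple slicing exercise. `split_context` and `CONTEXT_LENGTH` are names made up for this example, and the offset/length parameters only mimic the item offset and item length a client would pass along with the full text.

```python
# Up to five characters of context are kept in each direction.
CONTEXT_LENGTH = 5


def split_context(text, item_offset, item_length):
    """Return (pre_context, item, post_context) for a segment of `text`."""
    end = item_offset + item_length
    pre = text[max(0, item_offset - CONTEXT_LENGTH):item_offset]
    item = text[item_offset:end]
    post = text[end:end + CONTEXT_LENGTH]
    return pre, item, post
```

Passing only the segment (offset 0, length equal to the text length) leaves both contexts empty, which is why clients are asked to pass the full text instead.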
I don't expect ragel to be creating too much noise in its generated
output, and including this in-tree helps users right now. We can
revisit this later if it proves to be too much trouble.
With FreeSerif, it seems that the 'ccmp' feature does ligature
substitutions. That was then causing syllable match failures. We now
find syllables before any features have been applied.
Test sequence: U+0D9A,U+0DCA,U+200D,U+0DBB,U+0DCF
With this in place, you can remove GDEF/GSUB/GPOS tables from Arabic
fonts and still get per-component marks positioned on
oh-yeah-fallback-formed LAM-ALEF ligatures with marks in between the LAM
and ALEF.
Now *that*'s pretty cool, if a bit anachronistic...
Uniscribe accepts a Halant,ZWJ before matras. Allow that.
BENGALI down from 295 to 291
DEVANAGARI down from 69 to 57
GUJARATI down from 19 to 17
KANNADA down from 871 to 867
MALAYALAM down from 340 to 337
TELUGU down from 20 to 16
Currently at:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047997 out of 1048334 tests passed. 337 failed (0.0321462%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Now that we insert dotted-circle, tests break more easily when our indic
machine breaks.
In particular, a few Devanagari tests were having sequences like
"C,H,ZWJ,N", and because of the ZWJ the Nukta does NOT get reordered to
before the Halant as the grammar used to expect... Fixup.
Another case is as simple as "C,ZWJ,SM".
Fixes 10 out of 79 failures:
DEVANAGARI: 707325 out of 707394 tests passed. 69 failed (0.00975411%)
Brings down Khmer failures from 162 to 47.
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
Also rebaselined some of the test files that had only-inherited lines.
Removing those, the stats are:
BENGALI: 353893 out of 354188 tests passed. 295 failed (0.0832891%)
DEVANAGARI: 707315 out of 707394 tests passed. 79 failed (0.0111678%)
GUJARATI: 366438 out of 366457 tests passed. 19 failed (0.00518478%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951042 out of 951913 tests passed. 871 failed (0.0915%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047994 out of 1048334 tests passed. 340 failed (0.0324324%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970553 out of 970573 tests passed. 20 failed (0.00206064%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Still some regressions, but some of the more egregious cases are
addressed.
The Win7 Tamil font does not rely on this behavior, but the WinXP
version does. Handle Tamil like Malayalam: Matras always move to
before base.
WinXP Tamil failures went down from 168964 (15.4752%) to 167
(0.0152953%) (three orders of magnitude reduction!).
Included in this is a minor fixup that actually fixed a few tests
with non-Tamil too. Numbers at:
BENGALI: 353997 out of 354285 tests passed. 288 failed (0.0812905%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271747 out of 271847 tests passed. 100 failed (0.0367854%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Unfortunately if the font has GPOS and 'mark' feature does
not position mark on dotted-circle, our inserted dotted-circle
will not get the mark repositioned to itself. Uniscribe cheats
here.
If there is no GPOS however, the fallback positioning kicks in
and sorts this out.
I'm not willing to address the first case.
No panic, we really insert dotted circle when it's absolutely broken.
Fixes most of the dotted-circle cases against Uniscribe. (for Devanagari
fixes 80% of them, for Khmer 70%; the rest look like Uniscribe being
really bogus...)
I had to make a decision. Apparently Uniscribe adds one dotted circle
to each broken character. I tried that, but that goes wrong easily with
split matras. So I made it add only one dotted circle to an entire
broken syllable tail. As in: "if there was a dotted circle here, this
would have formed a correct cluster." That works better for split
stuff, and I like it more.
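The choice can be illustrated with a toy model in which syllables have already been classified. The `(kind, chars)` pairs and the function name are hypothetical; the real classification comes from the Indic machine.

```python
DOTTED_CIRCLE = "\u25cc"


def insert_dotted_circles(syllables):
    """Insert one dotted circle per broken syllable tail.

    `syllables` is a list of (kind, chars) pairs, where kind is either
    "ok" or "broken". A broken tail gets a single dotted circle in
    front of it, no matter how many characters it contains.
    """
    out = []
    for kind, chars in syllables:
        if kind == "broken":
            out.append(DOTTED_CIRCLE + chars)
        else:
            out.append(chars)
    return "".join(out)
```

A broken tail made of the two halves of a split matra therefore gets one dotted circle, where the per-character approach would have produced two.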
This will eventually allow us to skip marks, as well as (fallback)
attach marks to ligature components of fallback-shaped Arabic.
That would be pretty cool. I kludged GDEF props in, so mark-skipping
works, but the produced ligature id/components will be cleared later
by substitute_start() et al.
Perhaps using a synthetic table for Arabic fallback shaping would have
been a better idea. The current approach has way too many layering
violations...
Fixes consonant-position with old-spec Malayalam. Uniscribe seems to be
doing this. Fixes below-base La (eg. Pa,H,La) with AnjaliNewLipi.ttf.
Doesn't regress new-spec or other scripts.
This reverts commit 24dd4e5674.
Oops. My bad. The change _regressed_ Malayalam test suite, not
improved it. I'll redo it, differentiating between old-spec and
new-spec cases.
The MS Indic specs say "...all classifications are determined ... using
context-free substitutions." However, testing shows that MS's Malayalam
shapers (both old and new), "match" even if there is no zero-context rule.
We follow.
Fixes below-base La (eg. Pa,H,La) with AnjaliNewLipi.ttf (old spec).
Moreover, test suite Malayalam failures are down to 312 from 875! No
change in other scripts.
Current numbers:
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047541 out of 1048416 tests passed. 875 failed (0.0834592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Free up syllables and let features work across syllables for the
presentation forms features and GPOS.
Fixed:
- 1 GURMUKHI test (remains 40)
- 12 KHMER tests (remains 18)
- 11 SINHALA tests (remains 121)
Regresses:
- 5 MALAYALAM tests (up to 312)
Current numbers:
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The merger of normalizer and glyph-mapping broke shapers that
modified the text stream. Unbreak them by adding a new preprocess_text
shaping stage that happens before normalizing/cmap and disallow
setup_mask modification of actual text.
The change is very subtle. If we have a single-char cluster that
decomposes to three or more characters, then try recomposition, in
case the farther mark may compose with the base.
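A sketch of the idea, using U+1E09 (which decomposes to c, cedilla, acute) and a pretend font that has a glyph for U+0107 but none for U+00E7. The `recompose` helper, the `has_glyph` callback and the simplified blocking check are assumptions for this example, not the normalizer's code.

```python
import unicodedata


def recompose(chars, has_glyph):
    """Try to compose the base with each following mark in turn.

    `chars` is a decomposed cluster (base first); `has_glyph` is a
    callback saying whether the font supports a character.
    """
    base = chars[0]
    kept = []
    for mark in chars[1:]:
        ccc = unicodedata.combining(mark)
        # Simplified blocking rule: a mark cannot reach the base past a
        # kept mark with an equal or higher combining class.
        blocked = bool(kept) and unicodedata.combining(kept[-1]) >= ccc
        composed = unicodedata.normalize("NFC", base + mark)
        if not blocked and len(composed) == 1 and has_glyph(composed):
            base = composed
        else:
            kept.append(mark)
    return base + "".join(kept)


# Pretend font: supports c, c-acute and the two marks, but not c-cedilla.
font = {"c", "\u0107", "\u0327", "\u0301"}
cluster = unicodedata.normalize("NFD", "\u1e09")
result = recompose(cluster, lambda ch: ch in font)
```

The cedilla cannot compose because the font lacks U+00E7, but the farther acute still composes with the base, giving U+0107 followed by U+0327.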
Essentially move the glyph mapping to normalization process.
The effect on Devanagari is small (but observable). Should be more
observable in simple text, like ASCII.