If a mark ligated forward with a non-mark, the resulting ligature was
inheriting the GC of the mark and later got its advance zeroed.
Don't do that if there's any non-mark glyph in the ligature.
Sample test: U+1780,U+17D2,U+179F with Kh-Metal-Chrieng.ttf
Also:
Bug 58922 - Issue with mark advance zeroing in generic shaper
We were not initializing the digests properly and as a result they were
being initialized to zero, so digest1 never did any useful work.
Speeds up Amiri shaping significantly.
See thread "an issue regarding discrepancy between Korean and Unicode
standards" on the mailing list for the rationale. In short: Uniscribe
doesn't, so fonts are designed to work without it.
Testing shows that this is closer to what Uniscribe does.
Reported by Khaled Hosny:
"""
commit 568000274c
...
This commit is causing a regression with Amiri, the string “هَٰذ” with
Uniscribe and HarfBuzz before this commit, gives:
[uni0630.fina=3+965|uni0670.medi=0+600|uni064E=0@-256,0+0|uni0647.init=0+926]
But now it gives:
[uni0630.fina=3+965|uni0670.medi=0+0|uni064E=0@-256,0+0|uni0647.init=0+926]
i.e. uni0670.medi is zeroed though it has a base glyph GDEF class.
"""
The test case is U+0647,U+064E,U+0670,U+0630 with Amiri.
After the Ngapi hackfest work, we were assuming that fonts
won't use presentation features to choose specific forms
(eg. conjuncts). As such, we were using auto-joiner behavior
for such features. It proved to be troublesome as many fonts
used presentation forms ('pres') for example to form conjuncts,
which need to be disabled when a ZWJ is inserted.
Two examples:
U+0D2F,U+200D,U+0D4D,U+0D2F with kartika.ttf
U+0995,U+09CD,U+200D,U+09B7 with vrinda.ttf
What we do now is to never do magic to ZWJ during GSUB's main input
match for Indic-style shapers. Note that backtrack/lookahead are still
matched liberally, as is GPOS. This seems to be an acceptable
compromise.
As to the bug that initially started this work, that one needs to
be fixed differently:
Bug 58714 - Kannada u+0cb0 u+200d u+0ccd u+0c95 u+0cbe does not
provide same results as Windows8
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers:
BENGALI: 353689 out of 354188 tests passed. 499 failed (0.140886%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366349 out of 366457 tests passed. 108 failed (0.0294714%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 951030 out of 951913 tests passed. 883 failed (0.0927606%)
KHMER: 299070 out of 299124 tests passed. 54 failed (0.0180527%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048102 out of 1048334 tests passed. 232 failed (0.0221304%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
That flag is redundant, deprecated, and ignored since April 2011.
From FreeType git log:
commit 8c82ec5b17d0cfc9b0876a2d848acc207a62a25a
Author: Behdad Esfahbod <behdad@behdad.org>
Date: Thu Apr 21 08:21:37 2011 +0200
Always ignore global advance.
This makes FT_LOAD_IGNORE_GLOBAL_ADVANCE_WIDTH redundant,
deprecated, and ignored. The new behavior is what every major user
of FreeType has been requesting. Global advance is broken in many
CJK fonts. Just ignoring it by default makes most sense.
* src/truetype/ttdriver.c (tt_get_advances),
src/truetype/ttgload.c (TT_Get_HMetrics, TT_Get_VMetrics,
tt_get_metrics, compute_glyph_metrics, TT_Load_Glyph),
src/truetype/ttgload.h: Implement it.
* docs/CHANGES: Updated.
I added these because the older mingw32 toolchain didn't have
MemoryBarrier(). The newer mingw-w64 toolchain has it, however.
As reported by John Emmas this was causing build failure with
MSVC (on glib) because of inline issues. But that reminded me
that we may be taking this path even if the system implements
MemoryBarrier as a function, which is a waste. So, just remove
it.
Before, we were marking them as below-form for initial reordering.
However, there is a rule that says "post consonants should follow
below consonants" for base determination purposes. Malayalam has
post-form YA/VA, and RA is pre-base. As such, for a sequence like
YA,Virama,YA,Virama,RA, the correct base is at index 0. But
because the code was seeing RA as a below-base, it was stopping at
the second YA as base, instead of jumping it as a post-base.
By treating prebase-reordering consonants like post-forms, this
is fixed.
MALAYALAM went down from 351 to 265. Other numbers didn't change:
BENGALI: 353686 out of 354188 tests passed. 502 failed (0.141733%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048069 out of 1048334 tests passed. 265 failed (0.0252782%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Such fonts are *definitely* really broken. Give up.
Limits time spent in sanitize for extremely / deliberately broken
fonts. For example, two fonts with these md5sum / names:
9343f0a1b8c84b8123e7d201cae62ffd.ttf
eb8c978547f09d368fc204194fb34688.ttf
were spending over a second in sanitize! Not anymore.
This fixes a design bug with sanitize and sub-blobs that can
cause crashes. Jonathan and I found and debugged this issue
when we tested a corrupt font with the md5sum / filename:
ea395483d37af0cb933f40689ff7b60a. After two hours of intense
debugging we found out that the font has overlapping GSUB/GPOS
tables, and as such, sanitizing the second table can modify
the first one, which can cause all kinds of undefined behavior.
The correct way to fix this is to make sure sub-blobs are
always created readonly, since we consider the parent blob
to be a shared resource and can't modify it, even if it *is*
writable.
This essentially makes the READONLY_MAY_MAKE_WRITABLE mode
unused... Maybe we should simply remove / deprecate it.
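A rough Python sketch of the read-only sub-blob idea, not HarfBuzz's actual
code. The helper names (make_sub_blob, sanitize_with_edit) are hypothetical,
and a memoryview stands in for a blob. Two overlapping ranges of one parent
buffer are handed out as read-only views, so a sanitizer that wants to patch
a byte has to work on a private copy and cannot disturb the other table:

```python
def make_sub_blob(parent: bytearray, offset: int, length: int) -> memoryview:
    """Return a read-only view of parent[offset:offset+length]."""
    return memoryview(parent)[offset:offset + length].toreadonly()


def sanitize_with_edit(sub: memoryview, index: int, value: int) -> bytes:
    """Hypothetical sanitizer that must patch one byte: it edits a copy."""
    private = bytearray(sub)  # private copy; the shared parent is untouched
    private[index] = value
    return bytes(private)


parent = bytearray(b"ABCDEFGH")
gsub = make_sub_blob(parent, 0, 6)  # bytes 0..5
gpos = make_sub_blob(parent, 2, 6)  # bytes 2..7, overlapping gsub on 2..5

patched = sanitize_with_edit(gpos, 0, ord("x"))
```

Writing through either view raises TypeError, which is the point: the
overlap can no longer let one table's sanitize pass corrupt the other.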
When a match_func was not set on the matcher_t object (ie. from GPOS),
then the Default_Ignorables (including joiners) were never skipped.
This meant that they were not skipped as they should during GPOS
matching. Fix that.
A few Indic numbers have "regressed": BENGALI and DEVANAGARI went
up from 290 and 58 respectively, but in both cases new results are
superior to Uniscribe, as they apply GPOS when we weren't (and
Uniscribe isn't) before.
BENGALI: 353686 out of 354188 tests passed. 502 failed (0.141733%)
DEVANAGARI: 707305 out of 707394 tests passed. 89 failed (0.0125814%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The code was confused because it was expecting left matra to have
POS_PRE_M, like we do in the Myanmar shaper, but that is not what
we were doing in this shaper. Rewrite to rely on category only.
Test case: U+AA06,U+AA34,U+AA2F
Before, if one called hb_shape() without setting script, language, and
direction on the buffer, hb_shape() was calling
hb_buffer_guess_segment_properties() on the user's behalf to guess
these.
This is very dangerous, since any serious user of HarfBuzz must set
these properly (especially important is direction). So now, we don't
guess properties by default. People not setting direction will get
an abort() now. If the old behavior is desired (fragile, good for
simple testing only), users can call
hb_buffer_guess_segment_properties() on the buffer just before calling
hb_shape().
Surprisingly, if the user ever tried to turn a default feature off partially
(say, disable liga for a range), the feature was being turned off
globally! Fixed now.
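To illustrate the intended behavior, here is a small Python sketch. It is a
model only: feature_values and its (tag, value, start, end) tuples are
hypothetical and are not the HarfBuzz map-builder code. A range override on a
default-on feature should change only the positions inside the range:

```python
def feature_values(length, default_on, overrides):
    """Per-position feature values; overrides are (tag, value, start, end),
    with end exclusive."""
    values = [{tag: 1 for tag in default_on} for _ in range(length)]
    for tag, value, start, end in overrides:
        for i in range(max(start, 0), min(end, length)):
            values[i][tag] = value
    return values


# Disable 'liga' for positions 2 and 3 only.
per_char = feature_values(6, ["liga", "kern"], [("liga", 0, 2, 4)])
liga = [v["liga"] for v in per_char]
kern = [v["kern"] for v in per_char]
```

Here liga is off only at positions 2 and 3, and kern is untouched. The bug
described above corresponds to liga being off at every position.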
Originally we meant to match backtrack/lookahead across syllable
boundaries. But a bug in the code meant that this was NOT done for
backtrack. We "fixed" that in 2c7d0b6b80,
but that broke Myanmar shaping.
We now believe that for Indic-like shapers (which is where syllables are
used), all basic shaping forms should be fully contained within their
syllables, so now we limit backtrack/lookahead matching to the syllable
too. Unbreaks Myanmar.
Not for Arabic, but for Indic-like scripts. ZWJ/ZWNJ have special
meanings in those scripts, so let font lookups take full control.
This undoes the regression caused by automatic-joiners handling
introduced two commits ago.
We only disable automatic joiner handling for the "basic shaping
features" of Indic, Myanmar, and SEAsian shapers. The "presentation
forms" and other features are still applied with automatic-joiner
handling.
This change also changes the test suite failure statistics, such that
a few scripts show more "failures". The most affected is Kannada.
However, upon inspection, we believe that in most, if not all, of the
new failures, we are producing results superior to Uniscribe. Hard to
count those!
Here's an example of what is fixed by the recent joiner-handling
changes:
https://bugs.freedesktop.org/show_bug.cgi?id=58714
New numbers, for future reference:
BENGALI: 353892 out of 354188 tests passed. 296 failed (0.0835714%)
DEVANAGARI: 707336 out of 707394 tests passed. 58 failed (0.00819911%)
GUJARATI: 366262 out of 366457 tests passed. 195 failed (0.0532122%)
GURMUKHI: 60706 out of 60747 tests passed. 41 failed (0.067493%)
KANNADA: 950680 out of 951913 tests passed. 1233 failed (0.129529%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047983 out of 1048334 tests passed. 351 failed (0.0334817%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271539 out of 271847 tests passed. 308 failed (0.113299%)
TAMIL: 1091753 out of 1091754 tests passed. 1 failed (9.15957e-05%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
When matching lookups, be smart about default-ignorable characters.
In particular:
Do nothing specific about ZWNJ, but for the other default-ignorables:
If the lookup in question uses the ignorable character in a sequence,
then match it as we used to do. However, if the sequence match will
fail because the default-ignorable blocked it, try skipping the
ignorable character and continue.
The most immediate thing it means is that if Lam-Alef forms a ligature,
then Lam-ZWJ-Alef will do too. Finally!
One exception: when matching for GPOS, or for backtrack/lookahead of
GSUB, we ignore ZWNJ too. That's the right thing to do.
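The matching rule can be sketched roughly as follows, in Python. This is a
simplified model and not the actual matcher: match_input, the skip_zwnj
parameter, and the two-character ignorable set are assumptions made for
illustration. An ignorable that the pattern asks for is matched as usual; one
that would block the match is skipped instead:

```python
ZWNJ, ZWJ = "\u200c", "\u200d"
DEFAULT_IGNORABLES = {ZWNJ, ZWJ}  # tiny stand-in for the real property


def match_input(text, start, pattern, skip_zwnj=False):
    """Return the index just past the match, or None if it fails.

    skip_zwnj=True models GPOS and GSUB backtrack/lookahead matching,
    where ZWNJ is ignored as well.
    """
    i = start
    for expected in pattern:
        while True:
            if i >= len(text):
                return None
            ch = text[i]
            if ch == expected:  # includes ignorables the pattern names
                i += 1
                break
            skippable = ch in DEFAULT_IGNORABLES and (ch != ZWNJ or skip_zwnj)
            if not skippable:
                return None
            i += 1  # skip the blocking ignorable and retry
    return i


LAM, ALEF = "\u0644", "\u0627"
plain = match_input(LAM + ALEF, 0, [LAM, ALEF])
with_zwj = match_input(LAM + ZWJ + ALEF, 0, [LAM, ALEF])
with_zwnj = match_input(LAM + ZWNJ + ALEF, 0, [LAM, ALEF])
with_zwnj_context = match_input(LAM + ZWNJ + ALEF, 0, [LAM, ALEF], skip_zwnj=True)
explicit_zwj = match_input(LAM + ZWJ + ALEF, 0, [LAM, ZWJ, ALEF])
```

In this model Lam-ZWJ-Alef matches the Lam-Alef pattern, Lam-ZWNJ-Alef does
not during GSUB input matching, and a pattern that names ZWJ explicitly
still matches it directly.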
It certainly is possible to build fonts for which this feature will result
in undesirable glyphs, but it's hard to think of a real-world case
where that would happen.
This *does* break Indic shaping right now, since Indic Unicode has
specific rules for what ZWJ/ZWNJ mean, and skipping ZWJ is breaking
those rules. That will be fixed in upcoming commits.
Ouch, how did things ever work without this?! The added test that has a
dot-reph as well as a pre-base reordering Ra perfectly demonstrates the
bug (tested with Nirmala font from Win8 for example). Testing suggests
that Win8 shaper has the *exact* same bug / behavior that we used to
have. Odd.
This is a followup to 568000274c.
Looks like in the Latin shaper, Uniscribe zeroes all Unicode NSM
advances *after* GPOS, not before. Match that.
Can be tested using DejaVu Sans Mono, since that font has GPOS
rules to zero the mark advances on its own.
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We have now three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
*in GDEF*, and do that *before* applying GPOS. This seems
to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
from General_Category=Mn Unicode characters, and do so
before applying GPOS. This seems to be what Uniscribe does
for Latin at least.
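The three behaviors can be modeled with a short Python sketch. All names here
(ZeroMarks, zero_marks_mode, the shaper strings, the glyph dicts) are
hypothetical and only mirror the list above; the real code lives in the
shapers and differs in detail. The sketch only decides which advances get
zeroed, and leaves out the ordering relative to GPOS:

```python
from enum import Enum


class ZeroMarks(Enum):
    NONE = "none"              # Indic: no explicit zeroing
    BY_GDEF = "by-gdef"        # Myanmar: glyphs classed as marks in GDEF
    BY_UNICODE = "by-unicode"  # everything else: General_Category=Mn


def zero_marks_mode(shaper):
    """Pick the zeroing mode for a (hypothetical) shaper name."""
    if shaper == "indic":
        return ZeroMarks.NONE
    if shaper == "myanmar":
        return ZeroMarks.BY_GDEF
    return ZeroMarks.BY_UNICODE


def zeroed_advances(glyphs, mode):
    """glyphs: dicts with 'advance', 'gdef_mark' (bool) and 'gc' (str)."""
    out = []
    for g in glyphs:
        is_mark = (mode is ZeroMarks.BY_GDEF and g["gdef_mark"]) or (
            mode is ZeroMarks.BY_UNICODE and g["gc"] == "Mn"
        )
        out.append(0 if is_mark else g["advance"])
    return out


glyphs = [
    {"advance": 600, "gdef_mark": False, "gc": "Lo"},
    {"advance": 200, "gdef_mark": True, "gc": "Mc"},
    {"advance": 150, "gdef_mark": False, "gc": "Mn"},
]
indic = zeroed_advances(glyphs, zero_marks_mode("indic"))
myanmar = zeroed_advances(glyphs, zero_marks_mode("myanmar"))
default = zeroed_advances(glyphs, zero_marks_mode("default"))
```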
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See previous
commit.
Implemented as a hack for now. Myanmar failures down from 23 to 15.
MYANMAR: 1123868 out of 1123883 tests passed. 15 failed (0.00133466%)
The remaining 15 cases are all where the syllable is wrong according to
the OpenType spec. We insert dottedcircle. Uniscribe fails to do that,
but it also fails to reorder the prebase-reordering medial-Ra. So it
gets it wrong.
Before, when matching ligatures, we were never skipping over base / liga
glyphs even if that was what the LookupFlags asked for.
Fixed now. We carefully reviewed all instances of this, and tested with
Amiri as well as some Indic scripts, and are confident that this should
NOT break anyone's fonts. It's also how Uniscribe does it, from what
we can tell.
Before, for most scripts, we were not trying to recompose two characters
if the second one had ccc=0. That fails for Myanmar where U+1026
decomposes to U+1025,U+102E, both of which have ccc=0. However, we do
want to try to recompose those. We now check whether the second is a
mark, using general category instead.
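The Myanmar case can be checked with Python's unicodedata module, which
reflects the Unicode data bundled with the interpreter. The two helper
functions are hypothetical names for the old and new tests described above,
not the normalizer's real code:

```python
import unicodedata

BASE, SECOND = "\u1025", "\u102e"  # MYANMAR LETTER U, VOWEL SIGN II


def old_should_try_recompose(second):
    """Old rule: only try when the second character has ccc != 0."""
    return unicodedata.combining(second) != 0


def new_should_try_recompose(second):
    """New rule: try when the second character is a mark (M* category)."""
    return unicodedata.category(second).startswith("M")


decomposition = unicodedata.decomposition("\u1026")
ccc = unicodedata.combining(SECOND)
composed = unicodedata.normalize("NFC", BASE + SECOND)
```

U+102E has ccc=0 but is a mark by general category, so only the new rule
attempts the recomposition to U+1026.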
At the same time, remove optimization that was conflicting with this.
[Let the Ngapi hackfest begin!]
This reverts commit fab7a71f11.
Conflicts:
src/hb-ot-shape-complex-indic-machine.hh
Keeping that generated file in-tree causes problems with processes like
tinderbox that automatically fetch and build harfbuzz. It's harder to
bootstrap harfbuzz now (as it was previously), but I'm willing to give this
another chance and see how it goes.
If in a MarkPos table, a base has no anchor for a particular mark class,
return NULL such that the subsequent subtables get a chance at it.
Test case:
hb-shape ./EBGaramond12-Regular.otf ἂ --features="ss20","smcp"
API additions:
hb_segment_properties_t
HB_SEGMENT_PROPERTIES_DEFAULT
hb_segment_properties_equal()
hb_segment_properties_hash()
hb_buffer_set_segment_properties()
hb_buffer_get_segment_properties()
hb_ot_layout_glyph_class_t
hb_shape_plan_t
hb_shape_plan_create()
hb_shape_plan_create_cached()
hb_shape_plan_get_empty()
hb_shape_plan_reference()
hb_shape_plan_destroy()
hb_shape_plan_set_user_data()
hb_shape_plan_get_user_data()
hb_shape_plan_execute()
hb_ot_shape_plan_collect_lookups()
API changes:
Rename hb_ot_layout_feature_get_lookup_indexes() to
hb_ot_layout_feature_get_lookups().
New header file:
hb-shape-plan.h
And a bunch of prototyped but not implemented stuff. Coming soon.
(Tests fail because of the prototypes right now.)
This is important for the Sinhala U+0DDA split matra since it decomposes
to U+0DD9,U+0DCA where U+0DD9 is a left matra and U+0DCA is the virama.
We don't want to move the virama with the left matra.
TEST: U+0D9A,U+0DDA
Note that we were already doing this in the Uniscribe bug compatibility
mode. We now do it all the time.
New API:
hb_buffer_flags_t
HB_BUFFER_FLAGS_DEFAULT
HB_BUFFER_FLAG_BOT
HB_BUFFER_FLAG_EOT
HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES
hb_buffer_set_flags()
hb_buffer_get_flags()
We use the BOT flag to decide whether to insert dottedcircle if the
first char in the buffer is a combining mark.
The PRESERVE_DEFAULT_IGNORABLES flag prevents removal of characters like
ZWNJ/ZWJ/...
Had to do some refactoring to make this happen...
Under Uniscribe bug compatibility mode, we still split them
Uniscribe-style, but Jonathan and I convinced ourselves that there is no
harm doing this the Unicode way. This change makes that happen, and
unbreaks free Sinhala fonts.
Windows 8 adds a Myanmar shaper using the 'mym2' tag. Route that
through the Indic shaper. It's still very broken, but at least this
does NOT break old-style Myanmar shaping using the generic shaper.
For Arabic and Indic shapers, if the font doesn't have a script system
for the script, use default shaper.
Make an exception for Arabic script since we have fallback logic for
that one.
As reported on the list:
I am seeing a similar problem building harfbuzz 0.9.5 with Apple gcc
4.0.1 on OS X 10.5 Leopard:
hb-ot-layout-common-private.hh:406: error: 'struct
OT::CoverageFormat1::Iter' is private
hb-ot-layout-common-private.hh:646: error: within this context
hb-ot-layout-common-private.hh:500: error: 'struct
OT::CoverageFormat2::Iter' is private
hb-ot-layout-common-private.hh:647: error: within this context
make[4]: *** [libharfbuzz_la-hb-ot-layout.lo] Error 1
Also reported as happening with MSVC 2005.
Uniscribe doesn't. And some fonts abuse this feature to get Indic
shaping working in non-complex applications like Adobe's apps.
No change in numbers:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299074 out of 299124 tests passed. 50 failed (0.0167155%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Patch from Jonathan Kew.
Part of fixing:
Mozilla Bug 801410 - avoid inserting dotted-circle for run-initial
Unicode combining characters in "simple" scripts such as Latin
https://bugzilla.mozilla.org/show_bug.cgi?id=801410
The logic for pre-base reordering follows the left matra logic.
We had an exception for Malayalam/Tamil in the left matra repositioning
which was not reflected in pre-base reordering.
Malayalam failures down from 337 to 323.
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048011 out of 1048334 tests passed. 323 failed (0.0308108%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Incidentally, this makes it not crash with icu-le-hb anymore...
I'm not smart / stupid enough to spend two more days debugging C++
linking issues, and this is ABI-stable at least.
That's really the logic desired. Except that MONGOLIAN VOWEL SEPARATOR
is not default_ignorable but it really should be. Reported to Unicode.
Based on suggestion from Konstantin Ritt.
To be used for a variety of purposes. We save up to five characters
in each direction. No public API changes, everything is taken care
of already. All clients need to do is to call hb_buffer_add_utf* with
the full text + segment info (or at least some context) instead of
just passing in the segment.
Various operations (hb_buffer_reset, hb_buffer_set_length,
hb_buffer_add*) automatically reset the relevant contexts.
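As a rough model of what gets saved, here is a Python sketch. The function
add_with_context and the way it returns the context are assumptions for
illustration; the buffer's internal representation is not described here.
Given the full text plus the segment's offset and length, up to five
characters on each side are kept as context:

```python
CONTEXT_LENGTH = 5  # "up to five characters in each direction"


def add_with_context(text, item_offset, item_length):
    """Return (pre_context, segment, post_context) for one segment."""
    end = item_offset + item_length
    pre = text[max(0, item_offset - CONTEXT_LENGTH):item_offset]
    post = text[end:end + CONTEXT_LENGTH]
    return pre, text[item_offset:end], post


text = "The quick brown fox"
first = add_with_context(text, 4, 5)    # segment "quick"
second = add_with_context(text, 10, 5)  # segment "brown"
```

Passing only the segment (offset 0, full length) yields empty context on both
sides, which is the situation the paragraph above asks clients to avoid.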
I don't expect ragel to be creating too much noise in its generated
output, and including this in-tree helps users right now. We can
revisit this later if it proves to be too much trouble.
With FreeSerif, it seems that the 'ccmp' feature does ligature
substitutions. That was then causing syllable match failures. We now
find syllables before any features have been applied.
Test sequence: U+0D9A,U+0DCA,U+200D,U+0DBB,U+0DCF
With this in place, you can remove GDEF/GSUB/GPOS tables from Arabic
fonts and still get per-component marks positioned on
oh-yeah-fallback-formed LAM-ALEF ligatures with marks in between the LAM
and ALEF.
Now *that*'s pretty cool, if a bit anachronistic...
Uniscribe accepts a Halant,ZWJ before matras. Allow that.
BENGALI down from 295 to 291
DEVANAGARI down from 69 to 57
GUJARATI down from 19 to 17
KANNADA down from 871 to 867
MALAYALAM down from 340 to 337
TELUGU down from 20 to 16
Currently at:
BENGALI: 353897 out of 354188 tests passed. 291 failed (0.0821598%)
DEVANAGARI: 707337 out of 707394 tests passed. 57 failed (0.00805774%)
GUJARATI: 366440 out of 366457 tests passed. 17 failed (0.00463902%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951046 out of 951913 tests passed. 867 failed (0.0910798%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047997 out of 1048334 tests passed. 337 failed (0.0321462%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970557 out of 970573 tests passed. 16 failed (0.00164851%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Now that we insert dotted-circle, tests break more easily when our indic
machine breaks.
In particular, a few Devanagari tests were having sequences like
"C,H,ZWJ,N", and because of the ZWJ the Nukta does NOT get reordered to
before the Halant as the grammar used to expect... Fixup.
Another case is as simple as "C,ZWJ,SM".
Fixes 10 out of 79 failures:
DEVANAGARI: 707325 out of 707394 tests passed. 69 failed (0.00975411%)
Brings down Khmer failures from 162 to 47.
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
Also rebaselined some of the test files that had only-inherited lines.
Removing those, the stats are:
BENGALI: 353893 out of 354188 tests passed. 295 failed (0.0832891%)
DEVANAGARI: 707315 out of 707394 tests passed. 79 failed (0.0111678%)
GUJARATI: 366438 out of 366457 tests passed. 19 failed (0.00518478%)
GURMUKHI: 60704 out of 60747 tests passed. 43 failed (0.0707854%)
KANNADA: 951042 out of 951913 tests passed. 871 failed (0.0915%)
KHMER: 299077 out of 299124 tests passed. 47 failed (0.0157125%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047994 out of 1048334 tests passed. 340 failed (0.0324324%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271666 out of 271847 tests passed. 181 failed (0.0665816%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970553 out of 970573 tests passed. 20 failed (0.00206064%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Still some regressions, but some of the more egregious cases are
addressed.
The Win7 Tamil font does not rely on this behavior, but the WinXP
version does. Handle Tamil like Malayalam: Matras always move to
before base.
WinXP Tamil failures went down from 168964 (15.4752%) to 167
(0.0152953%) (three orders of magnitude reduction!).
Included in this is a minor fixup that actually fixed a few tests
with non-Tamil too. Numbers at:
BENGALI: 353997 out of 354285 tests passed. 288 failed (0.0812905%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271747 out of 271847 tests passed. 100 failed (0.0367854%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Unfortunately if the font has GPOS and 'mark' feature does
not position mark on dotted-circle, our inserted dotted-circle
will not get the mark repositioned to itself. Uniscribe cheats
here.
If there is no GPOS however, the fallback positioning kicks in
and sorts this out.
I'm not willing to address the first case.
No panic, we really insert dotted circle when it's absolutely broken.
Fixes most of the dotted-circle cases against Uniscribe. (for Devanagari
fixes 80% of them, for Khmer 70%; the rest look like Uniscribe being
really bogus...)
I had to make a decision. Apparently Uniscribe adds one dotted circle
to each broken character. I tried that, but that goes wrong easily with
split matras. So I made it add only one dotted circle to an entire
broken syllable tail. As in: "if there was a dotted circle here, this
would have formed a correct cluster." That works better for split
stuff, and I like it more.
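The "one dotted circle per broken syllable" choice can be sketched in Python
as below. This is a toy model: syllables are assumed to arrive already
classified as (kind, text) pairs, which in the real shaper is the job of the
syllable machine, and insert_dotted_circles is a hypothetical name:

```python
DOTTED_CIRCLE = "\u25cc"


def insert_dotted_circles(syllables):
    """syllables: list of (kind, text) with kind 'ok' or 'broken'."""
    out = []
    for kind, text in syllables:
        if kind == "broken":
            # One circle for the whole broken tail, not one per character.
            out.append(DOTTED_CIRCLE)
        out.append(text)
    return "".join(out)


# KA, then an orphaned AA matra + anusvara with no base of their own.
result = insert_dotted_circles([("ok", "\u0915"), ("broken", "\u093e\u0902")])
```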
This will eventually allow us to skip marks, as well as (fallback)
attach marks to ligature components of fallback-shaped Arabic.
That would be pretty cool. I kludged GDEF props in, so mark-skipping
works, but the produced ligature id/components will be cleared later
by substitute_start() et al.
Perhaps using a synthetic table for Arabic fallback shaping would have been a better
idea. The current approach has way too many layering violations...
Fixes consonant-position with old-spec Malayalam. Uniscribe seems to be
doing this. Fixes below-base La (eg. Pa,H,La) with AnjaliNewLipi.ttf.
Doesn't regress new-spec or other scripts.
This reverts commit 24dd4e5674.
Oops. My bad. The change _regressed_ Malayalam test suite, not
improved it. I'll redo it, differentiating between old-spec and
new-spec cases.
The MS Indic specs say "...all classifications are determined ... using
context-free substitutions." However, testing shows that MS's Malayalam
shapers (both old and new), "match" even if there is no zero-context rule.
We follow.
Fixes below-base La (eg. Pa,H,La) with AnjaliNewLipi.ttf (old spec).
Moreover, test suite Malayalam failures are down to 312 from 875! No
change in other scripts.
Current numbers:
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1047541 out of 1048416 tests passed. 875 failed (0.0834592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
Free up syllables and let features work across syllables for the
presentation forms features and GPOS.
Fixed:
- 1 GURMUKHI test (remains 40)
- 12 KHMER tests (remains 18)
- 11 SINHALA tests (remains 121)
Regresses:
- 5 MALAYALAM tests (up to 312)
Current numbers:
BENGALI: 353996 out of 354285 tests passed. 289 failed (0.0815727%)
DEVANAGARI: 707339 out of 707394 tests passed. 55 failed (0.00777502%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60769 out of 60809 tests passed. 40 failed (0.0657797%)
KANNADA: 951086 out of 951913 tests passed. 827 failed (0.0868777%)
KHMER: 299106 out of 299124 tests passed. 18 failed (0.00601757%)
LAO: 53611 out of 53644 tests passed. 33 failed (0.0615167%)
MALAYALAM: 1048104 out of 1048416 tests passed. 312 failed (0.0297592%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271726 out of 271847 tests passed. 121 failed (0.0445103%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970558 out of 970573 tests passed. 15 failed (0.00154548%)
TIBETAN: 208469 out of 208469 tests passed. 0 failed (0%)
The merger of normalizer and glyph-mapping broke shapers that
modified text stream. Unbreak them by adding a new preprocess_text
shaping stage that happens before normalizing/cmap and disallow
setup_mask modification of actual text.
The change is very subtle. If we have a single-char cluster that
decomposes to three or more characters, then try recomposition, in
case the farther mark may compose with the base.
Essentially move the glyph mapping to normalization process.
The effect on Devanagari is small (but observable). Should be more
observable in simple text, like ASCII.
This reverts commit 0981068b75.
I was confused. Even if we access coverage[0] unconditionally, we don't
need bound checks since the array machinery already handles that.
Apparently even that doesn't make check-internal-symbols.sh happy with
mingw32. Going to disable that for DLLs again, but hopefully the
export-file is doing *something*.
'rclt' is "Required Contextual Forms" being proposed by Microsoft.
It's like 'calt', but supposedly always on. We apply 'calt' anyway,
and now apply this too.
At this point, the GDEF glyph synthesis looks pointless. Not that I
have many fonts without GDEF lying around.
As for mark advance zeroing when GPOS not available, that also is being
replaced by proper fallback mark positioning soon.
We need the font for glyph lookup during GSUB pauses in Indic shaper.
Could perhaps be avoided, but at this point, we don't mean to support
separate substitute()/position() entry points (anymore), so there is
no point in not providing the font to GSUB.
Gives me a good 10% speedup for the Devanagari test case. Less so
for less lookup-intensive tests.
For the Devanagari test case, the false positive rate of the GSUB digest
is 4%.
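For readers unfamiliar with the idea, a set digest is a cheap approximate
membership test: it may say "maybe" for a glyph that is not covered (a false
positive) but never says "no" for one that is. The Python sketch below is a
generic single-bitmask version written for illustration; the class, its
shift parameter, and the 64-bit width are assumptions, and the digest
actually used in the code may be built differently:

```python
class SetDigest:
    """Approximate glyph set: one 64-bit mask indexed by low glyph bits."""

    def __init__(self, shift=0):
        self.shift = shift
        self.mask = 0

    def _bit(self, glyph):
        return 1 << ((glyph >> self.shift) & 63)

    def add(self, glyph):
        self.mask |= self._bit(glyph)

    def may_have(self, glyph):
        return bool(self.mask & self._bit(glyph))


def false_positive_rate(digest, members, probes):
    """Fraction of non-member probes that the digest fails to reject."""
    negatives = [g for g in probes if g not in members]
    hits = sum(digest.may_have(g) for g in negatives)
    return hits / len(negatives)


members = {3, 67, 130}
digest = SetDigest()
for g in members:
    digest.add(g)

rate = false_positive_rate(digest, members, range(128))
```

A lookup whose digest rejects the current glyph can be skipped without
touching its coverage table, which is where the speedup comes from.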
If there is no GPOS, zero mark advances.
If there *is* GPOS and the shaper requests so, zero mark advances for
attached marks.
Fixes regression with Tibetan, where the font has GPOS, and marks a
glyph as mark where it shouldn't get zero advance.
When we removed the separate Hangul shaper, the specific normalization
preference of Hangul was lost. Fix that. Also, the Thai shaper was
copied from Hangul, so had the fully-composed normalization behavior,
which was unnecessary. So, fix that too.
The d1d69ec52e change broke Kannada badly,
since it was ligating consonants, pushing matra out, and then ligating
with the matra. Adjust for that. See comments.
If two marks form a ligature, retain their previous lig_id, such that
the mark ligature can attach to ligature components...
Fixes https://bugzilla.gnome.org/show_bug.cgi?id=676343
In fact, I noticed that we should not let ligatures form between glyphs
coming from different components of a previous ligature. For example,
if the sequence is: LAM,SHADDA,LAM,FATHA,HEH, the LAM,LAM,HEH form a
ligature, putting SHADDA and FATHA next to each other. However, it would
be wrong to ligate them. Uniscribe has this bug also.
This commit: a3313e5400 broke MarkMarkPos
when one of the marks itself is a ligature. That regressed 26 Tibetan
tests (up from zero!). Fix that. Tibetan back to zero.
And use it to speed up the hotspot by checking coverage directly in
the main loop, not 10 functions deep in.
Gives me a solid 20% boost with Indic test suite. Less so for less
lookup-intensive scenarios.
Remove the "fast_path" hack from before.
Does not provide Uniscribe-compatible results, but should at least avoid
breaking hb-view due to out-of-order cluster values.
For RTL runs, ensure cluster values are non-increasing (instead of
non-decreasing).
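One possible way to force monotonic cluster values is a running minimum, walked from the end for LTR and from the start for RTL. This is a sketch under that assumption; the function name is made up, and this may not be how the backend actually does it.

```python
def make_clusters_monotonic(clusters, rtl=False):
    """Return a copy that is non-decreasing (LTR) or non-increasing (RTL)."""
    out = list(clusters)
    # LTR: walk backwards so each value is capped by everything after it.
    # RTL: walk forwards so each value is capped by everything before it.
    indices = range(len(out)) if rtl else range(len(out) - 1, -1, -1)
    running = None
    for i in indices:
        running = out[i] if running is None else min(running, out[i])
        out[i] = running
    return out
```

For example, an LTR run with clusters 0,2,1,3 becomes 0,1,1,3, and an RTL run with 3,1,2,0 becomes 3,1,1,0. Out-of-order glyphs end up merged into the neighbouring cluster, which is lossy but keeps consumers that assume monotonic clusters from breaking.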
Backporting from upstream:
commit b847f24ce855d24f6822bcd9c0006905e81b94d8
Author: Behdad Esfahbod <behdad@behdad.org>
Date: Wed Jul 25 19:29:16 2012 -0400
[arabic] Fix Arabic cursive positioning
This was clearly broken in testing. Who knows... Fixes for me.
Test with a Nastaleeq font, or with Arabic Typesetting.
Backporting from Chromium.
This was broken as a result of 7b84c536c1.
As Khaled reported, MarkMark positioning was broken with glyphs
resulting from a MultipleSubst. Fixed. Test with the ALLAH character
in Amiri.
Does not attempt to handle clusters in a Uniscribe- or HarfBuzz-compatible way;
just returns the original string indexes that CT maintains. These may even be
out-of-order in the case of reordrant glyphs.
The font is forming a post-base consonant in some samples, and Uniscribe
positions top matra on the post-base. Do the same.
Gurmukhi failures down from 59 to 41 (0.0674242%).
Just put it before base, which is what's expected.
Malayalam failures down from 1559 to 1197 (0.114172%).
BENGALI: 353988 out of 354285 tests passed. 297 failed (0.0838308%)
DEVANAGARI: 693571 out of 693628 tests passed. 57 failed (0.00821766%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60750 out of 60809 tests passed. 59 failed (0.0970251%)
KANNADA: 950956 out of 951913 tests passed. 957 failed (0.100534%)
KHMER: 299094 out of 299124 tests passed. 30 failed (0.0100293%)
MALAYALAM: 1047219 out of 1048416 tests passed. 1197 failed (0.114172%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271699 out of 271847 tests passed. 148 failed (0.0544424%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970524 out of 970573 tests passed. 49 failed (0.00504856%)
The sequence Ra,H,Ya in Bengali is ambiguous, and Unicode specifies that to
get Ya-Phalaa, one places ZWJ before Halant. I.e. a ZWJ,H sequence
requests subjoining, while a H,ZWJ requests Half form. Implement that.
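A minimal sketch of the distinction, assuming a plain string of codepoints and a hypothetical helper name. It only classifies what the joiner next to a Halant asks for; it does no actual shaping.

```python
ZWJ = "\u200D"
BENGALI_HALANT = "\u09CD"


def halant_request(text, i):
    """Classify the Halant at index i by the position of an adjacent ZWJ."""
    if text[i] != BENGALI_HALANT:
        raise ValueError("index does not point at a Halant")
    if i > 0 and text[i - 1] == ZWJ:
        return "subjoined"  # ZWJ,H: the following consonant is subjoined
    if i + 1 < len(text) and text[i + 1] == ZWJ:
        return "half"       # H,ZWJ: the preceding consonant takes a half form
    return "default"


RA, YA = "\u09B0", "\u09AF"
```

With these names, Ra,ZWJ,H,Ya is the sequence that asks for Ya-Phalaa, and Ra,H,ZWJ,Ya asks for the Half form.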
Bengali failures go down from 377 to 297 (0.0838308%).
Gujarati is down by 4 to 17 (0.0046384%).
Kannada is down by 226 to 957 (0.100534%).
Current status:
BENGALI: 353988 out of 354285 tests passed. 297 failed (0.0838308%)
DEVANAGARI: 693571 out of 693628 tests passed. 57 failed (0.00821766%)
GUJARATI: 366489 out of 366506 tests passed. 17 failed (0.0046384%)
GURMUKHI: 60750 out of 60809 tests passed. 59 failed (0.0970251%)
KANNADA: 950956 out of 951913 tests passed. 957 failed (0.100534%)
KHMER: 299094 out of 299124 tests passed. 30 failed (0.0100293%)
MALAYALAM: 1046857 out of 1048416 tests passed. 1559 failed (0.148701%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271699 out of 271847 tests passed. 148 failed (0.0544424%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970524 out of 970573 tests passed. 49 failed (0.00504856%)
Also limit joiners.
This limits our syllable length to a constant, and is
closer to what Uniscribe does anyway.
Two Devanagari tests regressed, but who cares about tests with 20
joiners in a row?! Devanagari at 57 (0.00821766%) now.
In Khmer coeng model, a V,Ra can go *after* matras. If it goes after a
split matra, it should be reordered to *before* the left part of such matra.
Khmer failures down from 136 to 39 (0.0130381%).
Apparently if there is C,V,ZWJ,C, the first C will be base, but if
it's C,ZWJ,V,C, the second one will be.
Note that Uniscribe implements this differently, by breaking syllable in
the case of C,ZWJ,V,C and putting the first consonant in one syllable
and the rest in the next syllable.
Sinhala failures down from 208 to 158 (0.0581209%). No changes to
Khmer.
Sinhala does not have half forms. And most (all?) consonants can be
base, except when preceded by ZWJ, which would request a subjoined form.
Hence switch the base algorithm to categorize with Khmer, start search
at start, and stop at a ZWJ.
Also, mark all pos=base consonants after base to be subjoined. Mark
base itself to have pos=base.
Finally, adjust Sinhala's reph position to after-main.
Brings down Sinhala failures from 455 to 328 (0.120656%).
If, say, a H,ZWJ,C ligature was formed, we don't want the code to detect
that as a Halant. So, ignore ligatures when matching category in
final_reordering.
Sinhala failures down from 514 to 455 (0.167374%).
Seems to be about what Uniscribe does. Not exactly. But close enough.
More consonants will start a new cluster.
A few scripts went way down in failures. In particular:
- Devanagari failures went down from 490 to 56.
- Telugu went down from 113 to 49.
Other scripts went down slightly or didn't change. New numbers:
BENGALI: 353908 out of 354285 tests passed. 377 failed (0.106412%)
DEVANAGARI: 693572 out of 693628 tests passed. 56 failed (0.00807349%)
GUJARATI: 366485 out of 366506 tests passed. 21 failed (0.00572978%)
GURMUKHI: 60750 out of 60809 tests passed. 59 failed (0.0970251%)
KANNADA: 950730 out of 951913 tests passed. 1183 failed (0.124276%)
KHMER: 298613 out of 299124 tests passed. 511 failed (0.170832%)
MALAYALAM: 1046881 out of 1048416 tests passed. 1535 failed (0.146411%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271333 out of 271847 tests passed. 514 failed (0.189077%)
TAMIL: 1091837 out of 1091837 tests passed. 0 failed (0%)
TELUGU: 970524 out of 970573 tests passed. 49 failed (0.00504856%)
Some of the remaining Telugu and Devanagari issues seem to be Uniscribe
eating Anusvara when placed before a non-joiner. Ouch!
Uniscribe reorders U+0E3A to be after U+0E38 and U+0E39. We do that by
modifying the ccc for U+0E3A.
Fixes the two remaining Thai failures (see previous commit).
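A sketch of what "modifying the ccc" buys us. In Unicode, U+0E38 and U+0E39 have ccc=103 while U+0E3A has ccc=9, so plain canonical ordering puts U+0E3A first. The override value 104 below is an assumption picked for illustration (anything above 103 works here), and the function names are hypothetical.

```python
import unicodedata

# Assumed override: give U+0E3A a class that sorts after ccc=103.
MODIFIED_CCC = {0x0E3A: 104}


def modified_ccc(ch):
    return MODIFIED_CCC.get(ord(ch), unicodedata.combining(ch))


def reorder_marks(text):
    """Stable-sort each run of combining marks by the modified class."""
    out = list(text)
    i = 0
    while i < len(out):
        if modified_ccc(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and modified_ccc(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=modified_ccc)  # sorted() is stable
        i = j
    return "".join(out)
```

With the override in place, U+0E01,U+0E3A,U+0E38 comes out as U+0E01,U+0E38,U+0E3A, which is the order Uniscribe produces according to the note above.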
Adjust the list of marks before SARA AM that get the reordering
treatment. Also adjust cluster formation to match Uniscribe.
With Wikipedia test data, now I see:
- For Thai, with the Angsana New font from Win7, I see 54 failures out
of over 4M tests (0.00129107%). Of the 54, two are legitimate
reordering issues (fix coming soon), and the other 52 are simply
Uniscribe using a zero-width space char instead of an unknown
character for missing glyphs. No idea why. The missing-glyph
sequences include one that is a Thai character followed by an Arabic
Sokun. Someone confused it with Nikhahit I assume!
- For Lao, with the Dokchampa font from Win7, 33 tests fail out of
54k (0.0615167%). All seem to be insignificant mark positioning
with two marks on a base. Have to investigate.
Although IndicMatraCategory.txt classifies it as Top_And_Right matra,
it does not have Unicode decomposition, and Uniscribe does not do
anything special about it either.
Gujarati failures down from 0.672% to 0.0130966%.
That's really what Uniscribe does, and explains a lot of peculiarities of
Halant,ZWNJ before the base.
Sent Telugu from 1% failures to 0.03%. Improved Kannada and Malayalam
slightly. Fixed half of Bengali, and did NOT break anything!
Following what the spec says.
Brings down Telugu failures from 40% to 3.75%, and Kannada failures from
44% to 10%. Does NOT affect other scripts' test results.
This is a hack for now. Will be fixed when we do complex-shaper-driven
normalization properly.
The results with or without decomposition are the same, but Uniscribe
does not normalize, so this matches better.
In Sinhala, Rakar is formed by Al-Lakuna,ZWJ,Ra. If you put that at the
end of a Consonant,Matra syllable, you get a dotted-circle from
Uniscribe. Apparently adding a ZWJ before the Al-Lakuna "fixes" that.
And people have been encoding that sequence... So, allow a forced
"ZWJ,Virama,ZWJ,Ra" sequence at the end of syllables.
Fixes some 100 or more of Sinhala failures. Now at 622 only (0.23%).
POS_BASE can disappear if base ligated backward. Define base as last
with position not after base.
Fixes a few hundred of Sinhala failures with Iskoola Pota.
It's a visual Repha.
Still not positioning logical Repha as occurs in Malayalam.
Another 200 Khmer failures fixed. 547 to go. That's better than
Devanagari!
This reorders glyphs within the cluster to a nominal order. This should
have no visible effect on the output, but helps with testing, for
getting the same hb-shape output for visually-equal glyphs for each
cluster.
In such scripts (ie. Khmer), a ZWJ/ZWNJ shouldn't stop the search for
base. So, instead just choose the first consonant as base directly.
Test sequence:
U+1798,200c,U+17C9,U+17D2,U+179B,U+17C1,U+17C7
Mark stuff after a pre-base reordering Ro 'cfar'. Used in Khmer.
This allows distinguishing the following cases with MS Khmer fonts:
U+1784,U+17D2,U+179A,U+17D2,U+1782
U+1784,U+17D2,U+1782,U+17D2,U+179A
In Khmer, a final subjoined consonant or independent vowel can occur
after matras. This final subjoined thing should NOT be reordered to
before the matra even though it's subjoined.
Fixes another 1k of the Khmer failures. Not much left really.
Amend the syllable structure to allow a final subscripted consonant
(Coeng+C) and a final subscripted independent vowel (Coeng+V).
Fixes another 2k of Khmer failures.
Normally, we attach the Halant to the previous character and move it
with it. For after-base consonants however, the Halant "belongs" to the
consonant after, so attach it so.
This fixes Bengali sequences involving post-base consonant Ya, which
should ligate with the Halant to form Ya Phala, but previously a
reordered matra was blocking the ligation.
Seems like this is what Uniscribe is doing, and does not break any fonts
we tested (with Devanagari, Malayalam, Khmer, and Bengali), while fixing
some Ra Phala sequences for Bengali with Vrinda. Fixes another 2% of
Bengali failures (a couple more to go).
Uniscribe does not apply 'kern' in the Indic module. Some of the Khmer
fonts they ship have small adjustments in the 'kern' table. Disable
'kern' in the Indic module under Uniscribe bug compatibility mode.
Fixes some 10% of the Khmer failures. Remains under 3% (excluding
dotted-circle ones).
Mark them the same as the Register Shifters for now. Need to rename
that category to something more sensible after all is settled.
Fixes another percent of Khmer failures. Down to under 3%!
We are going to split matras without a Unicode decomposition in a way
that the second half takes the codepoint of the whole matra. So,
position them where the second half is supposed to end up.
Since we use the OpenType versions of Uniscribe functions, we are
relying on that version of the WINNT API. Otherwise, usp10.h will hide
those symbols.
Previously was failing to match fonts that didn't support CHARSET_ANSI.
There still remains a problem with the Uniscribe backend, in that if a
font with the same family name is installed, and is newer, the native
one is preferred over the font we provide. Fixing it requires rewriting
the name table with a unique family name...
This was causing all object types to be non-POD and have static
initializers. We don't need that!
Now, most nil objects just moved from .bss to .data. Fixing for that
coming soon.
According to Tom Hacohen this was breaking build with some compilers.
In file included from hb-buffer-private.hh:35:0,
from hb-ot-map-private.hh:32,
from hb-ot-shape-private.hh:32,
from hb-ot-shape.cc:29:
hb-object-private.hh: In constructor '_hb_object_header_t::_hb_object_header_t()':
hb-object-private.hh:97:8: error: uninitialized const member in 'struct hb_reference_count_t'
hb-object-private.hh:51:25: note: 'hb_reference_count_t::ref_count' should be initialized
In file included from hb-ot-shape.cc:33:0:
hb-set-private.hh: In constructor '_hb_set_t::_hb_set_t()':
hb-set-private.hh:37:8: note: synthesized method '_hb_object_header_t::_hb_object_header_t()' first required here
hb-ot-shape.cc: In function 'void hb_ot_shape_glyphs_closure(hb_font_t*, hb_buffer_t*, const hb_feature_t*, unsigned int, hb_set_t*)':
hb-ot-shape.cc:521:12: note: synthesized method '_hb_set_t::_hb_set_t()' first required here
And this, concludes the HarfBuzz Massala Hackfest.
I like to specially thank Jonathan Kew for doing all the decription and
letting me get commit points.
For dotted-circle independent clusters, Uniscribe does no Reph shaping
for the exact sequence Ra+Halant+25CC. Which also is the only possible
sequence with 25CC at the end.
Uniscribe allows up to two nuktas per consonant and one per matra. It does so
independent of whether the consonant already has a nukta in it. Tests:
* U+0916,U+093C,U+0941
* U+0959,U+093C,U+0941
* U+0916,U+093C,U+093C,U+0941
* U+0959,U+093C,U+093C,U+0941
* U+0916,U+093C,U+093C,U+093C,U+0941
* U+0959,U+093C,U+093C,U+093C,U+0941
* 915,93c,93c,,94d,U+0916,U+093C,U+093C,U+093e,93c,93c
This does not apply to the context matchings.
This regresses tests right now. And we are not sure whether this is
the right thing to do for GPOS. But we'll figure out.
Uniscribe doesn't do it, but we want to do as it gives the Reph the
opportunity to interact with the Matras. Test with mangal for example.
Sequence: <0930,094d,0915,094b,094d>
In test suite already.
This introduced a failure, which we tracked down to a test case like this:
U+092E,U+094B,U+094D,U+0930
The final character is a Ra that should be put in a syllable of its
own. And we do. But it will interact with the Halant before it. So
now we finally are convinced that we have to limit features to syllable
boundaries. That's coming after lunch!
Also remove shaper_options argument to hb_shape_full(). That was
unused and for "future". Let it go.
More shaper API coming in preparation for plan/planned API.
Users should #include <hb.h> (or hb-ft.h, hb-glib.h, etc), but
never things like hb-shape.h directly. This makes it easier to
refactor headers later on without breaking compatibility.
This is what old HB does. Moreover, fixes rendering with Win8 malgun
font. The Win7 version doesn't compose with either Uniscribe or HB,
but Win8 version works as expected, like Uniscribe, with this change.
Let's call Hangul done for now.
I couldn't measure significant performance gains out of this; maybe
about 5% (with one million Malayalam strings). Still, not bad.
But reminds me that optimizing this codebase without profiling first
is simply not going to work. Oh well...
Previously, we were NOT actually recomposing Hangul Jamo. We do now.
The two lines in:
test/shaping/texts/in-tree/shaper-default/script-hangul/misc/misc.txt
Now render the same with the UnDotum.ttf font. Previously the second
line was rendering boxes.
We can also start applying OpenType Jamo features later. At this time,
I have no idea how the 'ljmo', 'vjmo', 'tjmo' features are supposed to
work. Maybe someone can explain them to me?
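For reference, the Jamo recomposition mentioned above can be done with the standard Unicode Hangul syllable arithmetic. The sketch below uses the constants from the Unicode Standard; the function name is hypothetical and this is not a copy of the shaper code.

```python
# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_jamo(l, v, t=None):
    """Compose leading + vowel (+ optional trailing) Jamo into one syllable.

    Returns the syllable codepoint, or None if the inputs are out of range.
    """
    l_index = l - L_BASE
    v_index = v - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        return None
    t_index = 0
    if t is not None:
        t_index = t - T_BASE
        # T_BASE itself is one below the first real trailing consonant.
        if not (0 < t_index < T_COUNT):
            return None
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index
```

For instance, U+1100,U+1161 composes to U+AC00, and adding the trailing consonant U+11A8 gives U+AC01.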
As requested by Jonathan Kew.
We need to devise a better mechanism to choose which scripts to
pass through the Indic shaper. Moreover, currently we are storing
data for some scripts in the Indic shaper that are not even going
through that shaper. Need to find a better way...
Makes number of failures against Uniscribe with hi_IN dictionary from
OO.o to go down from 6334 to 4290. Not bad for a one-line change!
Mozilla Bug 729626 - ASAN: heap-buffer-overflow HTML
So, apparently there's no atomic int 'get' method on Apple. You have to
add(0) to get. And that's not const-friendly. So switch inert-object
checking to a non-atomic get. This, however, is safe, and a negligible
performance boost too.
Patch and description from Jonathan Kew:
It turns out that some legacy Thai fonts provide OpenType substitution
features to implement mark positioning, but (incorrectly) put those
features/lookups under the 'latn' script tag instead of using 'thai' (or
possibly 'DFLT'). See
https://bugzilla.mozilla.org/show_bug.cgi?id=719366 for an example and
more detailed description.
Although this is really a font bug, I suggest that we could improve the
rendering of such fonts by looking for the 'latn' as a fallback if
neither the requested script nor "default" is found in
hb_ot_layout_table_choose_script. Suggested patch against harfbuzz
master is attached.
This does _not_ affect the other kind of legacy Thai font, where custom
code to support vendor-specific PUA codepoints would be needed. I'm not
keen to go down that path; IMO, such fonts should be ruthlessly stamped
out in favour of standards-based solutions. :)
JK
In hb_ot_tag_from_language(), if first component of an unknown
language is three letters long, use it directly as OpenType language
tag (after case conversion and padding).
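A rough sketch of that fallback rule, assuming BCP 47 style input with "-" separators, uppercase conversion, and space padding to four characters (OpenType language tags are uppercase and space-padded). The function name is hypothetical and the table of known languages that comes first is omitted.

```python
def ot_tag_from_unknown_language(lang):
    """Fallback only: derive an OpenType language tag from a 3-letter subtag.

    Returns a 4-character string, or None if the rule does not apply.
    """
    first = lang.split("-")[0]
    if len(first) == 3 and first.isascii() and first.isalpha():
        return first.upper().ljust(4)  # e.g. "xyz" -> "XYZ "
    return None
```

So an unknown "xyz-Latn" maps to the tag "XYZ ", while a two-letter first component such as "qq" is left for the caller to handle some other way.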
Still not sure about:
1) Case. We pass lowercase for now. Would be nice if graphite was
uppercase 3letter like OpenType,
2) Padding. IMO, tag padding is always with spaces, but Martin was
talking about NUL bytes.
Can be -1 for NUL-terminated string. This is useful for passing parts
of a larger string to a function without having to copy or modify the
string first.
Affected functions:
hb_tag_t hb_tag_from_string()
hb_direction_from_string()
hb_language_from_string()
hb_script_from_string()
As reported by Khaled on the list:
"After the introduction of canonical reordering of combining marks
(commit 34c22f8), I'm no longer able to do mark/mark substitution or
positioning for mark sequences that involve shadda as a first mark (or
most interesting sequences at least).
"After some digging, it turned out that shadda have a ccc=33 while most
Arabic marks that combine with it have a lower ccc value, which results
in the shadda being reordered after the other mark which,
unsurprisingly, breaks my contextual substitution and mkmk anchors."
See:
http://unicode.org/faq/normalization.html#8
http://unicode.org/faq/normalization.html#9
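The reordering Khaled describes can be reproduced with any conformant normalizer. The snippet below uses Python's unicodedata purely as a demonstration; it is not HarfBuzz code. SHADDA has ccc=33 and FATHA has ccc=30, so canonical ordering moves the FATHA in front.

```python
import unicodedata

BEH, SHADDA, FATHA = "\u0628", "\u0651", "\u064E"

# What the user typed and what fonts usually expect: shadda first.
typed = BEH + SHADDA + FATHA

# Canonical ordering sorts marks by combining class, lowest first.
reordered = unicodedata.normalize("NFD", typed)

shadda_ccc = unicodedata.combining(SHADDA)
fatha_ccc = unicodedata.combining(FATHA)
```

After normalization the string is BEH,FATHA,SHADDA, so a contextual lookup or mkmk anchor written for SHADDA followed by FATHA no longer matches.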
For two reasons:
1. User can always call hb_buffer_pre_allocate() themselves, and
2. Now we do a pre_alloc in add_utfX anyway, so the total number of
reallocs is limited to a small number (~3) anyway. This just makes the
API cleaner.
According to Peter Constable this is indeed what Uniscribe has been
doing for years.
Mozilla Bug 667166 - wrong shape of letter when it comes at the end of
word in the arabic version of Firefox 5.0
Remove hb_ft_get_font_funcs() as it cannot be used by the user anyway.
Add hb_ft_font_set_funcs(), which makes the font internally use
FreeType. That is, the font need not have been created using the
hb-ft API. Just create it using hb_face_create()/hb_font_create() and
then call this on the font (after having set the font scale). This
internally creates an FT_Face and attaches it to the font.
hb_shape() now accepts a shaper_options and a shaper_list argument.
Both can be set to NULL to emulate previous API. And in most situations
they are expected to be set to NULL.
hb_shape() also returns a boolean for now. If shaper_list is NULL, the
return value can be ignored.
shaper_options is ignored for now, but otherwise it should be a
NULL-terminated list of strings.
shaper_list is a NULL-terminated list of strings. Currently recognized
strings are "ot" for native OpenType Layout implementation, "uniscribe"
for the Uniscribe backend, and "fallback" for the non-complex backend
(that will be implemented shortly). The fallback backend never fails.
The env var HB_SHAPER_LIST is also parsed and honored. It's a
colon-separated list of shaper names. The fallback shaper is invoked if
none of the env-listed shapers succeed.
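The selection logic described above might look roughly like the sketch below. Everything except the shaper names and the HB_SHAPER_LIST variable is an assumption: the function name, the dict of callables, and the choice to skip unknown names silently are illustrative only.

```python
import os


def choose_shaper(shapers, shaper_list=None, env=None):
    """Try shapers in order; return the name of the one that succeeded.

    shapers maps a name to a callable returning True on success
    (a hypothetical stand-in for the real backends).
    """
    env = os.environ if env is None else env
    if shaper_list is not None:
        # Explicit list: the caller must check the result, it can fail.
        for name in shaper_list:
            if name in shapers and shapers[name]():
                return name
        return None
    # No explicit list: honor the colon-separated environment variable.
    names = [n for n in env.get("HB_SHAPER_LIST", "").split(":") if n]
    for name in names:
        if name in shapers and shapers[name]():
            return name
    # The fallback shaper never fails, so this path always succeeds.
    shapers["fallback"]()
    return "fallback"


demo_shapers = {
    "ot": lambda: False,        # pretend the OpenType shaper failed
    "uniscribe": lambda: True,
    "fallback": lambda: True,
}
```

This is why the return value can be ignored when shaper_list is NULL: that path always ends in the fallback shaper.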
New API hb_buffer_guess_properties() added.
In old-style Indic OT standards, the post-base Halants are moved after
their base. Emulate that by moving first post-base Halant to
post-last-consonant.
Brings test-shape-complex failures down from 88 to 54. Getting there!
Find the base consonant and apply basic Indic features accordingly.
Nothing complete, but does something for now. Specifically:
no Ra handling right now, and no ZWJ/ZWNJ.
Number of failing shape-complex tests goes from 174 down to 125.
Next: reorder matras.
I've messed up a lot of stuff recently; different parts of the
shaping process are stepping on each other's toes because
manually tracking what's in which buffer var is hard. I'm
going to add some internal API to track those such that mistakes
are discovered as soon as they are introduced.
Instead of always applying those two features before the complex shaper,
let the complex shaper decide whether they should be applied first.
Also add stub for Indic's final_reordering().
Add compose() and decompose() unicode funcs. These implement
pair-wise canonical composition/decomposition.
The glib/icu implementations are lacking for now. We are adding
API for this to glib, but I cannot find any useful API in ICU.
May end up implementing these in-house.
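To illustrate what "pair-wise" means here, a sketch on top of Python's unicodedata. The function names mirror the new callbacks but the signatures are assumptions, and this is not how the glib or ICU backends would do it. Note that unicodedata.decomposition() returns nothing for Hangul syllables (they decompose algorithmically), so this sketch does not cover them on the decompose side.

```python
import unicodedata


def decompose(ab):
    """Return (a, b) for one step of canonical decomposition, else None.

    b is 0 for singleton decompositions.
    """
    mapping = unicodedata.decomposition(chr(ab))
    if not mapping or mapping.startswith("<"):
        return None  # no mapping, or a compatibility (non-canonical) one
    parts = [int(p, 16) for p in mapping.split()]
    return (parts[0], parts[1] if len(parts) > 1 else 0)


def compose(a, b):
    """Return the canonical composition of the pair (a, b), else None."""
    composed = unicodedata.normalize("NFC", chr(a) + chr(b))
    return ord(composed) if len(composed) == 1 else None
```

For example, U+00E9 decomposes to the pair U+0065,U+0301, and that pair composes back to U+00E9. A pair with no canonical composition, such as two Latin letters, yields None.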
Changed all unicode_funcs callback names to remove the "_get" part.
Eg, hb_unicode_get_script_func_t is now hb_unicode_script_func_t,
and hb_unicode_get_script() is hb_unicode_script() now.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=644184
among others.
Shapers now can request segmented feature application by calling
add_gsub_pause() or add_gpos_pause(). They can also provide a
callback to be called at the pause. Currently the Arabic shaper
uses pauses to enforce certain feature application. The Indic
shaper can use the same facility to pause and do reordering in the
callback.
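A toy model of the pause mechanism, to show the intended control flow. Only the name add_gsub_pause() comes from the text above; the class, its other methods and the way features are applied are hypothetical.

```python
class FeatureMap:
    """Hypothetical stand-in for the feature map a shaper fills in."""

    def __init__(self):
        self.stages = [[]]   # feature tags, grouped between pauses
        self.callbacks = []  # one optional callback per pause

    def add_feature(self, tag):
        self.stages[-1].append(tag)

    def add_gsub_pause(self, callback=None):
        # Close the current stage; later features start a new one.
        self.callbacks.append(callback)
        self.stages.append([])

    def apply(self, buffer, apply_feature):
        for i, stage in enumerate(self.stages):
            for tag in stage:
                apply_feature(tag, buffer)
            if i < len(self.callbacks) and self.callbacks[i] is not None:
                self.callbacks[i](buffer)


def demo():
    """Record the order in which features and the pause callback run."""
    log = []
    fmap = FeatureMap()
    fmap.add_feature("nukt")
    fmap.add_gsub_pause(lambda buf: log.append("reorder"))
    fmap.add_feature("pres")
    fmap.apply([], lambda tag, buf: log.append(tag))
    return log
```

Features added before the pause are fully applied before the callback runs, and features added after it only see the buffer as the callback left it. That is the hook an Indic shaper could use for reordering.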