From the issue:
"In this font, the virama,ya first forms a ligature, then decomposes back to
virama,ya. This causes those two to be marked parts of a MultipleSubst
sequence. When attaching the matra, we look for the first of the MultipleSubst
sequence because that's where we attach to (because of eg #740). In this case,
the first glyph in the MultipleSubst sequence is a mark, so we skip it and
attach to the base char before it."
Font in question is Nirmala UI from Windows 10. Test sequence:
U+0926,U+094D,U+092F,U+0941
Fixes https://github.com/harfbuzz/harfbuzz/issues/1020
Not optimized to use sortedness yet. Also start putting in place infra
to faster reject bad data.
A version of Chandas.ttf found on some Chrome bots has 660kb of GPOS,
mostly junk. That is causing 48 million of set->add() calls in
collect_glyphs(), which is insane.
In the upcoming commits, I'll be speeding that up by optimizing
add_sorted_array(), while also reducing work by rejecting out-of-sort
arrays quickly and propagate the rejection.
Part of https://bugs.chromium.org/p/chromium/issues/detail?id=794896
Apparently a base glyph can also become an attached component of a
ligature if the ligature-forming lookup used IgnoreBase. This was
being confused with a non-first component of a MultipleSubst and
hence not matched for mark-attachment. Tweak test to fix.
Fixes https://github.com/behdad/harfbuzz/issues/543
That commit moved the advance adjustment for mark positioning to
be applied immediately, instead of doing late before. This breaks
if mark advances are zeroed late, like in Arabic. Also, easier to
hit it in RTL scripts since a single mark with non-zero advance is
enough to hit the bug, whereas in LTR, at least two marks are needed.
This reopens https://github.com/behdad/harfbuzz/issues/211
The cursive+mark interaction is broken again. To be fixed in a
different way.
Fixes https://github.com/behdad/harfbuzz/issues/211
What happens in that bug is that a mark is attached to base first,
then a second mark is cursive-chained to the first mark. This only
"works" because it's in the Indic shaper where mark advances are
not zeroed.
Before, we didn't allow cursive to run on marks at all. Fix that.
We also where updating mark major offsets at the end of GPOS, such
that changes in advance of base will not change the mark attachment
position. That was superior to the alternative (which is what Uniscribe
does BTW), but made it hard to apply cursive to the mark after it
was positioned. We could track major-direction offset changes and
apply that to cursive in the post process, but that's a much trickier
thing to do than the fix here, which is to immediately apply the
major-direction advance-width offsets... Ie.:
https://github.com/behdad/harfbuzz/issues/211#issuecomment-183194739
If this breaks any fonts, the font should be fixed to do mark attachment
after all the advances are set up first (kerning, etc).
Finally, this, still doesn't make us match Uniscribe, for I explained
in that bug. Looks like Uniscribe applies minor-direction cursive
adjustment immediate as well. We don't, and we like it our way, at
least for now. Eg. the sequence in the test case does this:
- The first subscript attaches with mark-to-base, moving in x only,
- The second subscript attaches with cursive attachment to first subscript
moving in x only,
- A final context rule moves the first subscript up by 104 units.
The way we do, the final shift-up, also shifts up the second subscript
mark because it's cursively-attached. Uniscribe doesn't. We get:
[ttaorya=0+1307|casubscriptorya=0@-242,104+-231|casubscriptnarroworya=0@20,104+507]
while Uniscribe gets:
[ttaorya=0+1307|casubscriptorya=0@-242,104+-211|casubscriptnarroworya=0+487]
note the different y-offset of the last glyph. In our view, after cursive,
things move together, period.
See thread "Issue with cursive attachment" started by Khaled.
Turned out fixing this wasn't as bad as I had assumed. I like the
new code better; we now have a theoretical model of cursive
connections that is easier to reason about.
This makes a lot of code safer. We only try modifying the object in one
place, after making sure it's safe to do so. So, do a const_cast<> in
that one place...
Currently:
- Initializing skippy is very expensive,
- Our lookup accelerator (using set-digests) can be very ineffecite,
As such, we end up many times initializing skippy but then failing
coverage check. Reordering fixes that.
When, later, we fix our accelerator to have truly small false-positive
rate (for example by using the frozen-sets), then we might want to
reorder these checks such that we wouldn't calculate coverage number
if skippy is going to fail.
This shows a 5% speedup with Roboto already.
Roboto has glyphs (like 'F') that have 200 kerning pairs.
Add a handcoded bsearch instead of previous linear search.
This doesn't show much speedup though, apparently we spend the
bulk of the time somewhere before here.
Before we were just relying on the compiler inlining them and not
leaving a trace in our public API. Try to fix. Hopefully not
breaking anyone's build.
When matching lookups, be smart about default-ignorable characters.
In particular:
Do nothing specific about ZWNJ, but for the other default-ignorables:
If the lookup in question uses the ignorable character in a sequence,
then match it as we used to do. However, if the sequence match will
fail because the default-ignorable blocked it, try skipping the
ignorable character and continue.
The most immediate thing it means is that if Lam-Alef forms a ligature,
then Lam-ZWJ-Alef will do to. Finally!
One exception: when matching for GPOS, or for backtrack/lookahead of
GSUB, we ignore ZWNJ too. That's the right thing to do.
It certainly is possible to build fonts that this feature will result
in undesirable glyphs, but it's hard to think of a real-world case
that that would happen.
This *does* break Indic shaping right now, since Indic Unicode has
specific rules for what ZWJ/ZWNJ mean, and skipping ZWJ is breaking
those rules. That will be fixed in upcoming commits.
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We have now three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
*in GDEF*, and do that *before* applying GPOS. This seems
to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
from General_Category=Mn Unicode characters, and do so
before applying GPOS. This seems to be what Uniscribe does
for Latin at least.
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See preivous
commit.
Before, when matching ligatures, we never skipping over base / liga
glyphs even if that was what the LookupFlags asked for.
Fixed now. We carefully reviewed all instances of this, and tested with
Amiri as well as some Indic scripts, and are confident that this should
NOT break anyone's fonts. It's also how Uniscribe does it, from what
we can tell.
If in a MarkPos table, a base has no anchor for a particular mark class,
return NULL such that the subsequent subtables get a chance at it.
Test case:
hb-shape ./EBGaramond12-Regular.otf ἂ --features="ss20","smcp"
Gives me a good 10% speedup for the Devanagari test case. Less so
for less lookup-intensive tests.
For the Devanagari test case, the false positive rate of the GSUB digest
is 4%.
If there is no GPOS, zero mark advances.
If there *is* GPOS and the shaper requests so, zero mark advances for
attached marks.
Fixes regression with Tibetan, where the font has GPOS, and marks a
glyph as mark where it shouldn't get zero advance.
This commit: a3313e5400 broke MarkMarkPos
when one of the marks itself is a ligature. That regressed 26 Tibetan
tests (up from zero!). Fix that. Tibetan back to zero.
And use it to speed up the hotspot by checking coverage directly in
the main loop, not 10 functions deep in.
Gives me a solid 20% boost with Indic test suite. Less so for less
lookup-intensive scenarios.
Remove the "fast_path" hack from before.
This was broken as a result of 7b84c536c1.
As Khaled reported, MarkMark positioning was broken with glyphs
resulting from a MultipleSubst. Fixed. Test with the ALLAH character
in Amiri.
I couldn't measure significant performance gains out of this; maybe
about 5% (with one million Malayalam strings). Still, not bad.
But reminds me that optimizing this codebase without profiling first
is simply not going to work. Oh well...