New approach to fix this:
69f9fbc420
Previous approach was reverted as it was too broad. See context:
https://github.com/behdad/harfbuzz/issues/347#issuecomment-267838368
With U+05E9,U+05B8,U+05C1,U+05DC and Arial Unicode, we now (correctly) disable
GDEF and GPOS, so we get results very close to Uniscribe, but slightly different
since our fallback position logic is not exactly the same:
Before: [gid1166=3+991|gid1142=0+737|gid5798=0+1434]
After: [gid1166=3+991|gid1142=0@402,-26+0|gid5798=0+1434]
Uniscribe: [gid1166=3+991|gid1142=0@348,0+0|gid5798=0+1434]
Back in the old days, we used to apply 'calt' and 'cswh' in Arabic shaper,
with a pause in between. Then we disabled the 'cswh' because Microsoft
disabled it, but forgot to remove the unnecessary pause. Do that now.
This has the benefit that it fixes shaping with monbaiti from Windows 10.
In that version of that font, the lookups from 'calt' are duplicated in
'rclt', and Mongolian was changed to go through Universal Shaping Engine.
We still use the Arabic shaper for Mongolian. With a pause after 'calt',
we were applying the duplicate lookups from 'calt' and 'rclt' twice. It
happened to be the case that these lookups were NOT idempotent. So we
were getting wrong shaping. See thread "Windows 10 monbaiti.ttf upgrade
(5.01 -> 5.51) caused loss of diacritical marks when shaped with harfbuz"
on the mailing list. This fixes that.
This aims to make Syriac Abbr Mark sizing more accurate when repeating segments are used, by adding an extra repeat and tightening up the spacing slightly rather than leaving a shortfall corresponding to a partial repeat-width.
The feature is enabled for any character in the Arabic shaper.
We should experiment with using it for Arabic subtending marks.
Though, that has a directionality problem as well, since those
are used with digits...
Fixes https://github.com/behdad/harfbuzz/issues/141
Previously, we expected users to provide BOT/EOT flags when the
text *segment* was at paragraph boundaries. This meant that for
clients that provide full paragraph to HarfBuzz (eg. Pango), they
had code like this:
hb_buffer_set_flags (hb_buffer,
(item_offset == 0 ? HB_BUFFER_FLAG_BOT : 0) |
(item_offset + item_length == paragraph_length ?
HB_BUFFER_FLAG_EOT : 0));
hb_buffer_add_utf8 (hb_buffer,
paragraph_text, paragraph_length,
item_offset, item_length);
After this change such clients can simply say:
hb_buffer_set_flags (hb_buffer,
HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT);
hb_buffer_add_utf8 (hb_buffer,
paragraph_text, paragraph_length,
item_offset, item_length);
Ie, HarfBuzz itself checks whether the segment is at the beginning/end
of the paragraph. Clients that only pass item-at-a-time to HarfBuzz
continue not setting any flags whatsoever.
Another way to put it is: if there's pre-context text in the buffer,
HarfBuzz ignores the BOT flag. If there's post-context, it ignores
EOT flag.
Follows the order of the Arabic/Syriac specs. Also don't stop
between rlig and calt in non-Arabic scripts.
Micro-tests for Arabic and Mongolian added for the latter.
This reverts commit d5bd0590ae.
The reasoning behind that logic was flawed and made under
a misunderstanding of the original problem, and caused
regressions as reported by Jonathan Kew in thread titled
"tibetan marks" in Oct 2013. Apparently I have had fixed
the original problem with this commit:
7e08f1258d
So, revert the faulty commit and everything seems to be in good
shape.
Unicode 6.2.0 Section 16.2 / Figure 16.3 says:
"For backward compatibility, between Arabic characters a ZWJ acts just
like the sequence <ZWJ, ZWNJ, ZWJ>, preventing a ligature from forming
instead of requesting the use of a ligature that would not normally be
used. As a result, there is no plain text mechanism for requesting the
use of a ligature in Arabic text."
As such, we flip internal zwj to zwnj flags for GSUB matching, which
means it will block ligation in all features, unless the font
explicitly matches U+200D glyph. This doesn't affect joining behavior.
Testing shows that this is closer to what Uniscribe does.
Reported by Khaled Hosny:
"""
commit 568000274c
...
This commit is causing a regression with Amiri, the string “هَٰذ” with
Uniscribe and HarfBuzz before this commit, gives:
[uni0630.fina=3+965|uni0670.medi=0+600|uni064E=0@-256,0+0|uni0647.init=0+926]
But now it gives:
[uni0630.fina=3+965|uni0670.medi=0+0|uni064E=0@-256,0+0|uni0647.init=0+926]
i.e. uni0670.medi is zeroed though it has a base glyph GDEF class.
"""
The test case is U+0647,U+064E,U+0670,U+0630 with Amiri.
Before, we were zeroing advance width of attached marks for
non-Indic scripts, and not doing it for Indic.
We have now three different behaviors, which seem to better
reflect what Uniscribe is doing:
- For Indic, no explicit zeroing happens whatsoever, which
is the same as before,
- For Myanmar, zero advance width of glyphs marked as marks
*in GDEF*, and do that *before* applying GPOS. This seems
to be what the new Win8 Myanmar shaper does,
- For everything else, zero advance width of glyphs that are
from General_Category=Mn Unicode characters, and do so
before applying GPOS. This seems to be what Uniscribe does
for Latin at least.
With these changes, positioning of all tests matches for Myanmar,
except for the glitch in Uniscribe not applying 'mark'. See preivous
commit.
New API:
hb_buffer_flags_t
HB_BUFFER_FLAGS_DEFAULT
HB_BUFFER_FLAG_BOT
HB_BUFFER_FLAG_EOT
HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES
hb_buffer_set_flags()
hb_buffer_get_flags()
We use the BOT flag to decide whether to insert dottedcircle if the
first char in the buffer is a combining mark.
The PRESERVE_DEFAULT_IGNORABLES flag prevents removal of characters like
ZWNJ/ZWJ/...
Had to do some refactoring to make this happen...
Under uniscribe bug compatibility mode, we still plit them
Uniscrie-style, but Jonathan and I convinced ourselves that there is no
harm doing this the Unicode way. This change makes that happen, and
unbreaks free Sinhala fonts.
This will eventually allow us to skip marks, as well as (fallback)
attach marks to ligature components of fallback-shaped Arabic.
That would be pretty cool. I kludged GDEF props in, so mark-skipping
works, but the produced ligature id/components will be cleared later
by substitute_start() et al.
Perhaps using a synthetic table for Arabic fallback shaping was a better
idea. The current approach has way too many layering violations...
The merger of normalizer and glyph-mapping broke shapers that
modified text stream. Unbreak them by adding a new preprocess_text
shaping stage that happens before normalizing/cmap and disallow
setup_mask modification of actual text.
If there is no GPOS, zero mark advances.
If there *is* GPOS and the shaper requests so, zero mark advances for
attached marks.
Fixes regression with Tibetan, where the font has GPOS, and marks a
glyph as mark where it shouldn't get zero advance.
When we removed the separate Hangul shaper, the specific normalization
preference of Hangul was lost. Fix that. Also, the Thai shaper was
copied from Hangul, so had the fully-composed normalization behavior,
which was unnecessary. So, fix that too.
According to Peter Constable this is indeed what Uniscribe has been
doing for years.
Mozilla Bug 667166 - wrong shape of letter when it comes at the end of
word in the arabic version of Firefox 5.0
Instead of always applying those two features before the complex shaper,
let the complex shaper decide whether they should be applied first.
Also add stub for Indic's final_reordering().
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=644184
among others.
Shapers now can request segmented feature application by calling
add_gsub_pause() or add_gpos_pause(). They can also provide a
callback to be called at the pause. Currently the Arabic shaper
uses pauses to enforce certain feature application. The Indic
shaper can use the same facility to pause and do reordering in the
callback.
Nothing functional in there yet.
So far, we're parsing IndicSyllabicCategory.txt and IndicMatraCategory.txt
fils from Unicode Character Database and store them in an array to be used
by the shaper. Also hooked up the shaper, but it does not do anything
right now.
That better matches OpenType spec. Note that we enable it for all
Arabic-shaper scripts. Ie. we enable it by default for Syriac too,
but the SyriacOT spec does not require it. I think this is a more
useful compromise than special-casing for Arabic script alone.
Add support for classic Mongolian script to the Arabic shaper.
Still work to be done around U+180E MONGOLIAN VOWEL SEPARATOR as it
should not be included in the final glyph stream the same way that
ZWNJ, etc should not appear in the final glyph stream.
But the joining part should be done.
There remains the question of how should the U+18A9 MONGOLIAN LETTER ALI
GALI DAGALGA be handled as it has General Category NSM but a letter
nonetheless. For now, our generic logic makes this a joining T instead
of joining D as other Mongolian letters are.
Mandaic was added to Unicode 6.0, but the joining data was not updated.
Draft ArabicShaping.txt from 6.1 includes the joining data for Mandaic.
Use that.