From 291d30b1ff5cf938d5d408c08c2d2accab6a6fda Mon Sep 17 00:00:00 2001
From: Behdad Esfahbod
Date: Sun, 8 Dec 2019 18:59:17 -0600
Subject: [PATCH] [simd] Comments

---
 src/hb-simd.hh | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/hb-simd.hh b/src/hb-simd.hh
index b9fad6837..2e4000abc 100644
--- a/src/hb-simd.hh
+++ b/src/hb-simd.hh
@@ -47,15 +47,21 @@
  * be a binary search in the Coverage table, but that's where the
  * hb_set_digest_t speedup came from: hb_set_digest_t are narrow (3 or 4
  * integers) structures that implement approximate matching, similar to Bloom
- * Filters or Quotient Filters. These digests do all bitwise operations, so
- * they can be easily vectorized. Combined with a gather operation, or just
- * multiple fetches in a row (which should parallelize) when gather is not
- * available. This will allow us to * skip over 8 or 16 glyphs at a time.
+ * Filters or Quotient Filters. These digests do all their work using bitwise
+ * operations, so they can be easily vectorized. Combined with a gather
+ * operation, or just multiple fetches in a row (which should parallelize)
+ * when gather is not available, this will allow us to skip over 8 or 16
+ * glyphs at a time.
  *
  * For fast fonts, like simple Latin fonts, like Roboto, the majority of time
  * is spent in binary searching in the Coverage table of kern and liga lookups.
  * We can, again, use vector gather and comparison operations to implement a
- * 8ary or 16ary search instead of binary search.
+ * 9ary or 17ary search instead of binary search, which will reduce search
+ * depth by roughly 3x / 4x respectively. It's important to keep in mind
+ * that a 16-at-a-time 17ary search is /not/ in any way 17 times faster: it
+ * is at best about 4 times faster, since the reduction in search steps
+ * relative to binary search is log(17)/log(2) ~= 4. That should be taken
+ * into account while assessing various designs.
  *
  * The rest of this files adds facilities to implement those, and possibly
  * more.
@@ -137,7 +143,9 @@
  * example, my 2019 ThinkPad Yoga X1 does *not* support it. We should
  * definitely explore that, but not initially. Also, it is possible that the
  * extra memory load that puts will defeat the speedup we can gain from it.
- * Must be implemented and measured carefully.
+ * Also note that for the search use case, doubling the bit width from 256
+ * to 512, as discussed, has a hard upper bound of less than 30% speedup
+ * (log(17) / log(9) ~= 1.29). Must be implemented and measured carefully.
  */
 
 /* DESIGN
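
As an illustrative aside (not part of the patch or of HarfBuzz itself), the sketch below models the k-ary search idea from the comments above in plain scalar C++. The function name kary_search, the K parameter, and the std::vector input are assumptions made for this example; a real SIMD version would replace the per-pivot comparisons with a gather plus a single vector compare, which is where the 8-or-16-glyphs-at-a-time win would come from.

// Hypothetical scalar sketch of a K-ary search over a sorted array of 16-bit
// glyph IDs (the shape of data in a Coverage table).  Each round probes K-1
// evenly spaced pivots, so the number of rounds is about log(N)/log(K)
// instead of log(N)/log(2): for K = 9 that is a ~3.2x reduction, for K = 17
// a ~4.1x reduction, matching the numbers quoted in the patch comments.
#include <cstdint>
#include <cstdio>
#include <vector>

static int
kary_search (const std::vector<uint16_t> &arr, uint16_t target, unsigned K = 9)
{
  unsigned lo = 0, hi = (unsigned) arr.size (); /* Search within [lo, hi). */
  while (hi - lo > K)
  {
    unsigned step = (hi - lo) / K;
    unsigned next_lo = lo, next_hi = lo + step; /* Default: first bucket. */
    for (unsigned i = 1; i < K; i++)
    {
      unsigned p = lo + i * step;               /* Pivot index; a SIMD version
                                                 * would gather all K-1 pivots
                                                 * in one operation. */
      if (arr[p] == target) return (int) p;
      if (arr[p] < target)
      {
        /* Target lies after this pivot; remember the bucket that follows. */
        next_lo = p + 1;
        next_hi = (i + 1 < K) ? lo + (i + 1) * step : hi;
      }
    }
    lo = next_lo;
    hi = next_hi;
  }
  for (unsigned i = lo; i < hi; i++)            /* Small tail: linear scan. */
    if (arr[i] == target) return (int) i;
  return -1;
}

int
main ()
{
  std::vector<uint16_t> coverage;
  for (unsigned g = 10; g < 1000; g += 3)       /* Fake sorted glyph list. */
    coverage.push_back ((uint16_t) g);
  printf ("%d\n", kary_search (coverage, 103)); /* Present: prints its index. */
  printf ("%d\n", kary_search (coverage, 104)); /* Absent: prints -1. */
  return 0;
}

With 8 pivots fetched per round this is the 9-ary case; widening to 16 pivots (the 17-ary case) only shrinks the round count by a further log(17)/log(9) ~= 1.29, which is the sub-30% ceiling the patch's AVX-512 comment refers to.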