Thanks to our macros that abstract SSE use, the functions can use
AVX2 when available (at compile time).
This brings an extra 23% speed improvement on bench_dwt in 64-bit builds
with AVX2 compared to SSE2.
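
A minimal sketch of how such a macro layer can select the instruction set at compile time. The macro names (VREG, VREG_INT_COUNT, LOADU, STOREU, ADD) and the helper add_rows() are assumptions made for illustration and are not claimed to match the names used in the source tree. The loop body is written once, and the preprocessor decides whether it works on 8 or 4 integers per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative macro layer: the same loop body maps to AVX2 or SSE2
 * depending on what the compiler was told to target. */
#if defined(__AVX2__)
#include <immintrin.h>
#define VREG            __m256i
#define VREG_INT_COUNT  8
#define LOADU(p)        _mm256_loadu_si256((const __m256i *)(p))
#define STOREU(p, v)    _mm256_storeu_si256((__m256i *)(p), (v))
#define ADD(a, b)       _mm256_add_epi32((a), (b))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define VREG            __m128i
#define VREG_INT_COUNT  4
#define LOADU(p)        _mm_loadu_si128((const __m128i *)(p))
#define STOREU(p, v)    _mm_storeu_si128((__m128i *)(p), (v))
#define ADD(a, b)       _mm_add_epi32((a), (b))
#endif

/* Hypothetical helper: dst[i] += src[i] for i in [0, n). */
static void add_rows(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i = 0;
#ifdef VREG
    /* Vector part: VREG_INT_COUNT integers per iteration. */
    for (; i + VREG_INT_COUNT <= n; i += VREG_INT_COUNT) {
        STOREU(dst + i, ADD(LOADU(dst + i), LOADU(src + i)));
    }
#endif
    /* Scalar tail, and full fallback when no SIMD target is defined. */
    for (; i < n; ++i) {
        dst[i] += src[i];
    }
}
```

Building with -mavx2 (GCC/Clang) selects the 256-bit path; otherwise a 64-bit x86 build falls back to SSE2, and other targets use the scalar loop.
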
* Use single-pass lifting inverse wavelet transform.
* For vertical pass, use SSE2 when available so as to process 8 columns
in parallel. This is the most beneficial improvement, since the
vertical pass involves a lot of cache thrashing.
With the bench_dwt utility with default arguments (16383x16383 image),
time goes from 4.064 s to 1.212 s.
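
To illustrate why handling 8 columns together helps the vertical pass, here is a simplified sketch in portable C. It is not the actual SSE2 code: it applies a single lifting-style update (odd row += average of the rows above and below) rather than the full inverse transform, and the names COLS_PER_STRIP, vertical_step_strip() and vertical_step() are hypothetical. The point is the access pattern: each row visit touches 8 adjacent values, so a fetched cache line is used for several columns instead of one.

```c
#include <stddef.h>
#include <stdint.h>

#define COLS_PER_STRIP 8

/* Update odd rows of one strip of columns:
 *   row[r][c] += (row[r-1][c] + row[r+1][c]) >> 1
 * Odd rows without a row below are skipped in this sketch; a real
 * transform would handle that boundary explicitly. */
static void vertical_step_strip(int32_t *img, size_t width, size_t height,
                                size_t x0, size_t ncols)
{
    size_t r, c;
    for (r = 1; r + 1 < height; r += 2) {
        const int32_t *above = img + (r - 1) * width + x0;
        int32_t *cur = img + r * width + x0;
        const int32_t *below = img + (r + 1) * width + x0;
        /* Inner loop over adjacent columns: contiguous memory, and a
         * natural candidate for SIMD (8 x 32-bit lanes). */
        for (c = 0; c < ncols; ++c) {
            cur[c] += (above[c] + below[c]) >> 1;
        }
    }
}

/* Walk the image strip by strip instead of column by column. */
static void vertical_step(int32_t *img, size_t width, size_t height)
{
    size_t x0;
    for (x0 = 0; x0 < width; x0 += COLS_PER_STRIP) {
        size_t left = width - x0;
        size_t ncols = left < COLS_PER_STRIP ? left : COLS_PER_STRIP;
        vertical_step_strip(img, width, height, x0, ncols);
    }
}
```
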
Commit 29313eb5 introduced those flags to avoid issues with
-fsanitize=unsigned-integer-overflow.
However, it is better to rewrite the loop so that the condition
cannot occur.
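
The message does not show the loop in question, so the following is only a generic sketch of this kind of rewrite, with hypothetical function names. A countdown of the form `while (n-- > 0)` on an unsigned counter wraps around on its final evaluation, which is well defined in C but reported by the sanitizer. Counting upwards with a separate index gives the same result without any wrap.

```c
#include <stdint.h>

/* Flagged form: the last evaluation of `n--` happens with n == 0 and
 * wraps n to UINT32_MAX, which the sanitizer reports. */
static uint32_t count_nonzero_wrapping(const uint8_t *v, uint32_t n)
{
    uint32_t count = 0;
    while (n-- > 0) {
        if (*v++ != 0) {
            ++count;
        }
    }
    return count;
}

/* Rewritten form: the index only ever increases up to n, so no
 * unsigned wrap-around can occur. */
static uint32_t count_nonzero(const uint8_t *v, uint32_t n)
{
    uint32_t i, count = 0;
    for (i = 0; i < n; ++i) {
        if (v[i] != 0) {
            ++count;
        }
    }
    return count;
}
```
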
There are situations where, given a tile size, at a resolution level,
there are sub-bands with x0==x1 or y0==y1, that consequently don't have any
valid codeblocks, but the other sub-bands may be non-empty.
Given that we recycle the memory from one tile to the next, those
ghost codeblocks might be non-zero and thus candidates for packet inclusion.
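
One way to guard against this, sketched under assumptions: test the sub-band extent before looking at any of its codeblocks. The band_t structure and the helpers band_is_empty() and count_non_empty_bands() are simplified stand-ins for illustration, not the library's actual types.

```c
#include <stdint.h>

/* Simplified stand-in for a sub-band: only its bounding box. */
typedef struct {
    int32_t x0, y0, x1, y1;
} band_t;

/* A band with zero width or zero height holds no valid codeblocks,
 * whatever stale data its recycled memory may still contain. */
static int band_is_empty(const band_t *band)
{
    return band->x0 == band->x1 || band->y0 == band->y1;
}

/* Example use: only non-empty bands are considered further. */
static int count_non_empty_bands(const band_t *bands, int nbands)
{
    int i, count = 0;
    for (i = 0; i < nbands; ++i) {
        if (band_is_empty(&bands[i])) {
            continue; /* skip ghost codeblocks entirely */
        }
        ++count;
    }
    return count;
}
```
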
This saves comparing the current pointer with the end of buffer pointer.
This results in at least a tiny speed improvement for raw decoding, and
smaller code size for MQC as well.
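
The message does not say how the end-of-buffer comparison is avoided. One common technique, shown here purely as an assumption, is to place a sentinel after the data so that the scanning loop is guaranteed to stop without checking a pointer against the end. The function find_marker() is hypothetical; it requires two writable spare bytes after the data, which it saves and restores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the offset of the first 0xFF byte followed by a byte > 0x8F
 * that lies entirely inside buf[0..len), or len if there is none.
 * Assumption: buf has at least len + 2 writable bytes. */
static size_t find_marker(uint8_t *buf, size_t len)
{
    uint8_t saved[2];
    const uint8_t *p = buf;
    size_t off;

    /* Install the sentinel: a marker the loop cannot miss. */
    memcpy(saved, buf + len, 2);
    buf[len] = 0xFF;
    buf[len + 1] = 0xFF;

    /* No comparison against the end of the buffer is needed here. */
    while (!(p[0] == 0xFF && p[1] > 0x8F)) {
        ++p;
    }

    /* Restore the bytes the sentinel overwrote. */
    memcpy(buf + len, saved, 2);

    off = (size_t)(p - buf);
    /* A match that involves a sentinel byte is not a real marker. */
    return off + 1 < len ? off : len;
}
```
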
This kills the remains of the raw.h/.c files that were only used for
decoding. Encoding already uses the mqc structure.
Ported from Carl Hetherington's work (actually through Matthieu Darbois's port
on top of OpenJPEG 2.1.0).
Can reduce total encoding time by 10-15%.
WARNING: VSC mode is not implemented, so this is a temporary regression
that must be fixed.
Decoding some valid .jp2 files like Sentinel2 datasets leads to warnings like:
No incltree created.
tgt_create tree->numnodes == 0, no tree created.
No imsbtree created.
tgt_create tree->numnodes == 0, no tree created.
Besides that, the image is correctly decoded. So there is no reason to emit
those warnings.