The previous constant opj_c13318 was mysteriously equal to 2/K , and in
the DWT, we had to divide K and opj_c13318 by 2... The issue was that the
band->stepsize computation in tcd.c didn't take into account the log2gain of
the band.
The effect of this change is expected to be mostly equivalent to the previous
situation, except some difference in rounding. But it leads to a dramatic
reduction of the mean square error and peak error in the irreversible encoding
of issue141.tif !
Update the bench_dwt utility to have a -decode/-encode switch
Measured performance gains for DWT encoder on a
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (4 cores, hyper threaded)
Encoding time:
$ ./bin/bench_dwt -encode -num_threads 1
time for dwt_encode: total = 8.348 s, wallclock = 8.352 s
$ ./bin/bench_dwt -encode -num_threads 2
time for dwt_encode: total = 9.776 s, wallclock = 4.904 s
$ ./bin/bench_dwt -encode -num_threads 4
time for dwt_encode: total = 13.188 s, wallclock = 3.310 s
$ ./bin/bench_dwt -encode -num_threads 8
time for dwt_encode: total = 30.024 s, wallclock = 4.064 s
Scaling is probably limited by memory access patterns causing
memory access to be the bottleneck.
The slightly worse results with threads==8 than with thread==4
is due to hyperthreading being not appropriate here.
* -PLT switch added to opj_compress
* Add a opj_encoder_set_extra_options() function that
accepts a PLT=YES option, and could be expanded later
for other uses.
-------
Testing with a Sentinel2 10m band, T36JTT_20160914T074612_B02.jp2,
coming from S2A_MSIL1C_20160914T074612_N0204_R135_T36JTT_20160914T081456.SAFE
Decompress it to TIFF:
```
opj_uncompress -i T36JTT_20160914T074612_B02.jp2 -o T36JTT_20160914T074612_B02.tif
```
Recompress it with similar parameters as original:
```
opj_compress -n 5 -c [256,256],[256,256],[256,256],[256,256],[256,256] -t 1024,1024 -PLT -i T36JTT_20160914T074612_B02.tif -o T36JTT_20160914T074612_B02_PLT.jp2
```
Dump codestream detail with GDAL dump_jp2.py utility (https://github.com/OSGeo/gdal/blob/master/gdal/swig/python/samples/dump_jp2.py)
```
python dump_jp2.py T36JTT_20160914T074612_B02.jp2 > /tmp/dump_sentinel2_ori.txt
python dump_jp2.py T36JTT_20160914T074612_B02_PLT.jp2 > /tmp/dump_sentinel2_openjpeg_plt.txt
```
The diff between both show very similar structure, and identical number of packets in PLT markers
Now testing with Kakadu (KDU803_Demo_Apps_for_Linux-x86-64_200210)
Full file decompression:
```
kdu_expand -i T36JTT_20160914T074612_B02_PLT.jp2 -o tmp.tif
Consumed 121 tile-part(s) from a total of 121 tile(s).
Consumed 80,318,806 codestream bytes (excluding any file format) = 5.329697
bits/pel.
Processed using the multi-threaded environment, with
8 parallel threads of execution
```
Partial decompresson (presumably using PLT markers):
```
kdu_expand -i T36JTT_20160914T074612_B02.jp2 -o tmp.pgm -region "{0.5,0.5},{0.01,0.01}"
kdu_expand -i T36JTT_20160914T074612_B02_PLT.jp2 -o tmp2.pgm -region "{0.5,0.5},{0.01,0.01}"
diff tmp.pgm tmp2.pgm && echo "same !"
```
-------
Funded by ESA for S2-MPC project
This adds a opj_set_decoded_components(opj_codec_t *p_codec,
OPJ_UINT32 numcomps, const OPJ_UINT32* comps_indices) function,
and equivalent "opj_decompress -c compno[,compno]*" option.
When specified, neither the MCT transform nor JP2 channel transformations
will be applied.
Tests added for various combinations of whole image vs tiled-based decoding,
full or reduced resolution, use of decode area or not.
However the intermediate buffer for decoding must still be smaller than 4
billion pixels, so this is useful for decoding at a lower resolution level,
or subtile decoding.
Instead of being the full tile size.
* Use a sparse array mechanism to store code-blocks and intermediate stages of
IDWT.
* IDWT, DC level shift and MCT stages are done just on that smaller array.
* Improve copy of tile component array to final image, by saving an intermediate
buffer.
* For full-tile decoding at reduced resolution, only allocate the tile buffer to
the reduced size, instead of the full-resolution size.
Currently we allocate at least 8192 bytes for each codeblock, and copy
the relevant parts of the codestream in that per-codeblock buffer as we
decode packets.
As the whole codestream for the tile is ingested in memory and alive
during the decoding, we can directly point to it instead of copying. But
to do that, we need an intermediate concept, a 'chunk' of code-stream segment,
given that segments may be made of data at different places in the code-stream
when quality layers are used.
With that change, the decoding of MAPA_005.jp2 goes down from the previous
improvement of 2.7 GB down to 1.9 GB.
New profile:
n4: 1885648469 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
n1: 1610689344 0x4E78287: opj_aligned_malloc (opj_malloc.c:61)
n1: 1610689344 0x4E71D7B: opj_alloc_tile_component_data (tcd.c:676)
n1: 1610689344 0x4E7272C: opj_tcd_init_decode_tile (tcd.c:816)
n1: 1610689344 0x4E4BDD9: opj_j2k_read_tile_header (j2k.c:8618)
n1: 1610689344 0x4E4C8A2: opj_j2k_decode_tiles (j2k.c:10349)
n1: 1610689344 0x4E4E36E: opj_j2k_decode (j2k.c:7847)
n1: 1610689344 0x4E52FA2: opj_jp2_decode (jp2.c:1564)
n0: 1610689344 0x40374E: main (opj_decompress.c:1459)
n1: 219232541 0x4E4BBF0: opj_j2k_read_tile_header (j2k.c:4685)
n1: 219232541 0x4E4C8A2: opj_j2k_decode_tiles (j2k.c:10349)
n1: 219232541 0x4E4E36E: opj_j2k_decode (j2k.c:7847)
n1: 219232541 0x4E52FA2: opj_jp2_decode (jp2.c:1564)
n0: 219232541 0x40374E: main (opj_decompress.c:1459)
n1: 39822000 0x4E727A9: opj_tcd_init_decode_tile (tcd.c:1219)
n1: 39822000 0x4E4BDD9: opj_j2k_read_tile_header (j2k.c:8618)
n1: 39822000 0x4E4C8A2: opj_j2k_decode_tiles (j2k.c:10349)
n1: 39822000 0x4E4E36E: opj_j2k_decode (j2k.c:7847)
n1: 39822000 0x4E52FA2: opj_jp2_decode (jp2.c:1564)
n0: 39822000 0x40374E: main (opj_decompress.c:1459)
n0: 15904584 in 52 places, all below massif's threshold (1.00%)
When components don't have the same width, unaligned load/store are possible.
Fixes openjeg-crashes-2017-07-27/id:000000,sig:11,src:001342,op:flip4,pos:162.jp2 of #895
There are situations where, given a tile size, at a resolution level,
there are sub-bands with x0==x1 or y0==y1, that consequently don't have any
valid codeblocks, but the other sub-bands may be non-empty.
Given that we recycle the memory from one tile to another one, those
ghost codeblocks might be non-0 and thus candidate for packet inclusion.