Compression and other filter options #1

Open · mkitti opened this issue May 27, 2022 · 16 comments

mkitti commented May 27, 2022

The hdf5plugin package is available via pip and conda-forge:
https://github.com/silx-kit/hdf5plugin
http://www.silx.org/doc/hdf5plugin/latest/

It would be nice if one could specify an arbitrary filter as in h5repack:
https://portal.hdfgroup.org/display/HDF5/h5repack

               FILT - is a string with the format: 
             
                 <list of objects>:<name of filter>=<filter parameters> 
             
                 <list of objects> is a comma separated list of object names, meaning apply 
                   compression only to those objects. If no names are specified, the filter 
                   is applied to all objects 
                 <name of filter> can be: 
                   GZIP, to apply the HDF5 GZIP filter (GZIP compression) 
                   SZIP, to apply the HDF5 SZIP filter (SZIP compression) 
                   SHUF, to apply the HDF5 shuffle filter 
                   FLET, to apply the HDF5 checksum filter 
                   NBIT, to apply the HDF5 NBIT filter (NBIT compression) 
                   SOFF, to apply the HDF5 Scale/Offset filter 
                   UD,   to apply a user defined filter 
                   NONE, to remove all filters 
                 <filter parameters> is optional filter parameter information 
                   GZIP=<deflation level> from 1-9 
                   SZIP=<pixels per block,coding> pixels per block is a even number in 
                       2-32 and coding method is either EC or NN 
                   SHUF (no parameter) 
                   FLET (no parameter) 
                   NBIT (no parameter) 
                   SOFF=<scale_factor,scale_type> scale_factor is an integer and scale_type 
                       is either IN or DS 
                   UD=<filter_number,filter_flag,cd_value_count,value_1[,value_2,...,value_N]> 
                       required values for filter_number,filter_flag,cd_value_count,value_1 
                       optional values for value_2 to value_N 
                   NONE (no parameter) 

Also note the SOFF above, which is the scale-offset filter that I previously discussed.

Here, filters can be specified by their registered filter numbers:
https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins

Beyond the hdf5plugin Python package, filters can also be dynamically loaded, as detailed here:
https://portal.hdfgroup.org/display/HDF5/HDF5+Dynamically+Loaded+Filters

The HDF Group maintains a repository of plugins here:
https://github.com/hdfGroup/hdf5_plugins
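
For example, with hdf5plugin the extra filters become available to h5py just by importing the package; a minimal sketch (assuming a recent h5py and hdf5plugin; file and dataset names are placeholders):

```python
import h5py
import hdf5plugin  # registers the extra filters with HDF5 on import
import numpy as np

data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)

with h5py.File("example.h5", "w") as f:
    # the hdf5plugin filter objects expand into the usual
    # compression / compression_opts keyword arguments
    f.create_dataset(
        "image",
        data=data,
        chunks=(256, 256),
        **hdf5plugin.Zstd(),
    )
```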


mkitti commented May 27, 2022

For h5py support, see custom compression filters here:
https://docs.h5py.org/en/stable/high/dataset.html#custom-compression-filters
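
As a rough sketch of that mechanism: the registered filter number can be passed to `compression` directly. This assumes the Zstandard plugin (registered filter ID 32015, which, as I understand it, takes the compression level as its single cd_value) is discoverable by the HDF5 library:

```python
import h5py
import numpy as np

with h5py.File("custom_filter.h5", "w") as f:
    f.create_dataset(
        "image",
        data=np.zeros((512, 512), dtype=np.uint16),
        chunks=(256, 256),
        compression=32015,      # raw registered filter number (Zstandard)
        compression_opts=(3,),  # filter-specific cd_values; here, a compression level
    )
```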


clbarnes commented May 27, 2022

Is there a large demand for other filters? In most datasets I've come across in the wild (e.g. N5), I've mainly seen gzip. As I understand it, these raw images compress so poorly anyway that compression itself isn't necessarily worth the hassle, let alone variations between compression algorithms. There is certainly a benefit to keeping things simple and widely supported. However, if there are significant gains in storage efficiency and performance from using e.g. blosc and different compressors, I'd be willing to add them.

These HDF5 files are unlikely to be the final form of these data - practically any downstream use will require scaling, contrast correction, and alignment, at which point other forms of filter and compression could be applied. My goal here is to produce a widely-compatible first form of the data so that everyone can use Jeiss images without having to concern themselves with the .dat format.


mkitti commented May 27, 2022

We have experimented with compression filters in the past:
https://docs.google.com/presentation/d/1d1xH93uxTnUBlr5IrWOQjTljEvakwmgZu1kweWYMGAo/edit?usp=sharing

Basically, we can get better compression using bitshuffle/zstd, either directly or via Blosc. The file is about 70% of the original size and decompresses about 4x faster than gzip.
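
Both routes look something like this sketch, if I understand the hdf5plugin API correctly (parameters are illustrative):

```python
import h5py
import hdf5plugin
import numpy as np

data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)

with h5py.File("blosc.h5", "w") as f:
    # bitshuffle + zstd through the Blosc meta-compressor
    f.create_dataset(
        "via_blosc", data=data, chunks=(256, 256),
        **hdf5plugin.Blosc(cname="zstd", clevel=5,
                           shuffle=hdf5plugin.Blosc.BITSHUFFLE),
    )
    # or the standalone bitshuffle filter (compresses with LZ4 by default)
    f.create_dataset(
        "via_bitshuffle", data=data, chunks=(256, 256),
        **hdf5plugin.Bitshuffle(),
    )
```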

clbarnes commented:

That does look like a significant gain - are these plugins easily available in standard channels alongside HDF5 libraries? Don't want to impede adoption by getting too experimental!


mkitti commented May 27, 2022

For Python, the plugins are very easily installable:
https://pypi.org/project/hdf5plugin/
https://anaconda.org/conda-forge/hdf5plugin

In general, the HDF Group also provides downloadable binaries for each release:
https://www.hdfgroup.org/downloads/hdf5/

I'm currently working on improving access via Java:
scijava/pom-scijava#181
https://github.com/JaneliaSciComp/jhdf5/tree/mkitti/hdf5_libsh

I'm hoping to put together a plugin package for ImageJ / FIJI soon once I can update the base jhdf5 library.


mkitti commented May 27, 2022

The main issue with Java is that the jhdf5 library currently distributed with FIJI statically links the original HDF5 library:
https://sissource.ethz.ch/sispub/jhdf5/-/tree/master/libs/native/jhdf5

The library only exports JNI symbols and not the original HDF5 symbols, which some of the plugins need. The branch I posted above fixes this by splitting the library into two shared libraries: hdf5 and jhdf5 (with the JNI symbols).

Some plugins, such as ZSTD, do not actually call back into the HDF5 library. In this case, setting HDF5_PLUGIN_PATH to point at either the HDF Group plugins or the Python package's plugins may be sufficient.
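
A sketch of that environment-variable route (the plugin directory below is hypothetical; the variable must be set before the HDF5 library is loaded, i.e. before importing h5py or launching the JVM):

```python
import os

# Point HDF5 at a directory of filter plugins before the library loads.
# "/opt/hdf5/plugins" is a placeholder for wherever your plugins live.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/plugins"

import h5py  # noqa: E402  (imported after setting the env var on purpose)

with h5py.File("zstd_compressed.h5") as f:
    data = f["image"][:]  # filter is resolved from the plugin path on read
```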

clbarnes commented:

I've added h5py's built-in byteshuffle, scale-offset, and checksum options on the basis that they're probably pretty ubiquitous. I'd like to be cautious about the others: I want to avoid users getting an HDF5 file and finding they can't open it with standard tooling, and even hdf5plugin requires all openers of the file to have the package imported.
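
For reference, a sketch of how those built-ins are switched on in h5py (parameter values illustrative; as I understand it h5py disallows combining fletcher32 with the potentially lossy scale-offset filter, so they are shown on separate datasets):

```python
import h5py
import numpy as np

data = np.zeros((512, 512), dtype=np.uint16)

with h5py.File("builtins.h5", "w") as f:
    # byte shuffle + checksum alongside gzip
    f.create_dataset(
        "shuffled", data=data, chunks=(256, 256),
        shuffle=True,     # H5Z_FILTER_SHUFFLE (byte shuffle)
        fletcher32=True,  # H5Z_FILTER_FLETCHER32 checksum
        compression="gzip",
    )
    # scale-offset on its own dataset
    f.create_dataset(
        "scaleoffset", data=data, chunks=(256, 256),
        scaleoffset=0,    # 0 = HDF5 picks per-chunk parameters, lossless for ints
        compression="gzip",
    )
```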


mkitti commented May 31, 2022

These are the filters within the HDF5 code base itself:

Filter identifiers for the filters distributed with the HDF5 Library are as follows:

H5Z_FILTER_DEFLATE The gzip compression, or deflation, filter
H5Z_FILTER_SZIP The SZIP compression filter
H5Z_FILTER_NBIT The N-bit compression filter
H5Z_FILTER_SCALEOFFSET The scale-offset compression filter
H5Z_FILTER_SHUFFLE The shuffle algorithm filter
H5Z_FILTER_FLETCHER32 The Fletcher32 checksum, or error checking, filter

https://portal.hdfgroup.org/display/HDF5/Filters

The main one that might be disabled is SZIP due to patent issues.
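
One way to check what a given HDF5 build actually provides is h5py's low-level `filter_avail`; a minimal sketch:

```python
from h5py import h5z

# Check which bundled filters this HDF5 build provides;
# SZIP in particular may be absent or decode-only.
for name, filter_id in [
    ("deflate", h5z.FILTER_DEFLATE),
    ("szip", h5z.FILTER_SZIP),
    ("shuffle", h5z.FILTER_SHUFFLE),
    ("fletcher32", h5z.FILTER_FLETCHER32),
    ("nbit", h5z.FILTER_NBIT),
    ("scaleoffset", h5z.FILTER_SCALEOFFSET),
    ("lzf", h5z.FILTER_LZF),
]:
    print(name, h5z.filter_avail(filter_id))
```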

clbarnes commented:

Got it, so even lzf isn't a given.

I've done some very loose benchmarking (one single-channel image, one run per configuration, writing to memory) and came up with this:

| rel_write_time | rel_read_time | rel_size | write_time (s) | read_time (s) | size (B) | filters |
|---|---|---|---|---|---|---|
| 1.04 | 0.91 | 1.00 | 2.10 | 0.19 | 527883320 | (none) |
| 11.23 | 12.95 | 0.79 | 22.61 | 2.72 | 418421427 | gzip |
| 2.12 | 5.38 | 1.00 | 4.28 | 1.13 | 525460561 | lzf |
| 2.19 | 8.48 | 0.78 | 4.41 | 1.78 | 413065378 | scaleoffset |
| 7.39 | 15.05 | 0.77 | 14.87 | 3.16 | 403841668 | scaleoffset+gzip |
| 3.02 | 8.66 | 0.78 | 6.08 | 1.82 | 412653629 | scaleoffset+lzf |
| 1.13 | 1.69 | 1.00 | 2.28 | 0.35 | 527883320 | byteshuffle |
| 6.90 | 6.80 | 0.70 | 13.90 | 1.43 | 366895195 | byteshuffle+gzip |
| 2.09 | 4.43 | 0.82 | 4.20 | 0.93 | 434552392 | byteshuffle+lzf |
| 2.26 | 9.07 | 0.78 | 4.56 | 1.90 | 413065402 | byteshuffle+scaleoffset |
| 7.69 | 15.62 | 0.77 | 15.49 | 3.28 | 404077271 | byteshuffle+scaleoffset+gzip |
| 3.02 | 9.25 | 0.78 | 6.08 | 1.94 | 412653798 | byteshuffle+scaleoffset+lzf |
| 1.11 | 1.59 | 1.00 | 2.23 | 0.33 | 527883320 | bitshuffle |
| 1.17 | 1.80 | 0.74 | 2.36 | 0.38 | 390987495 | bitshuffle+lz4 |
| 1.03 | 1.01 | 1.00 | 2.07 | 0.21 | 527200382 | lz4 |
| 1.29 | 2.85 | 0.79 | 2.60 | 0.60 | 416709700 | zstd |
| 1.03 | 0.94 | 1.00 | 2.07 | 0.20 | 527883320 | blosc+blosclz+0sh |
| 1.57 | 2.61 | 0.87 | 3.15 | 0.55 | 458438347 | blosc+blosclz+Bsh |
| 1.07 | 0.97 | 1.00 | 2.16 | 0.20 | 527883320 | blosc+blosclz+bsh |
| 1.07 | 0.94 | 1.00 | 2.16 | 0.20 | 527120922 | blosc+lz4+0sh |
| 1.31 | 1.87 | 0.85 | 2.63 | 0.39 | 450420924 | blosc+lz4+Bsh |
| 1.10 | 0.94 | 1.00 | 2.22 | 0.20 | 527120922 | blosc+lz4+bsh |
| 4.66 | 1.23 | 1.00 | 9.38 | 0.26 | 525304316 | blosc+lz4hc+0sh |
| 7.13 | 1.59 | 0.76 | 14.37 | 0.33 | 403198524 | blosc+lz4hc+Bsh |
| 4.69 | 1.39 | 1.00 | 9.44 | 0.29 | 525304316 | blosc+lz4hc+bsh |
| 11.86 | 14.17 | 0.79 | 23.88 | 2.97 | 418710050 | blosc+zlib+0sh |
| 11.28 | 5.65 | 0.68 | 22.72 | 1.18 | 361219997 | blosc+zlib+Bsh |
| 11.47 | 13.81 | 0.79 | 23.11 | 2.90 | 418710050 | blosc+zlib+bsh |
| 3.97 | 3.80 | 0.79 | 7.99 | 0.80 | 416265416 | blosc+zstd+0sh |
| 9.90 | 2.88 | 0.70 | 19.94 | 0.61 | 368839149 | blosc+zstd+Bsh |
| 3.88 | 3.56 | 0.79 | 7.82 | 0.75 | 416265416 | blosc+zstd+bsh |

Some of it doesn't seem to make much sense (e.g. not seeing any significant size decrease for some compressors), but it does look like blosc+zstd+byteshuffle is a good combination, at least for size and reading. bitshuffle+lz4 is nearly as good while being quite a lot faster.
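
The harness was roughly of this shape (a sketch rather than the exact script: in-memory writes via h5py's core driver, a single run per configuration, so cache effects make the read timings especially loose):

```python
import time
import h5py
import numpy as np

data = np.random.randint(0, 2**16, size=(4096, 4096), dtype=np.uint16)

def bench(**dataset_kwargs):
    """Time one write and one read of `data` against an in-memory file."""
    with h5py.File("bench", "w", driver="core", backing_store=False) as f:
        t0 = time.perf_counter()
        ds = f.create_dataset("image", data=data, chunks=(256, 256),
                              **dataset_kwargs)
        write_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        _ = ds[:]
        read_time = time.perf_counter() - t0
        size = ds.id.get_storage_size()  # bytes actually stored on "disk"
    return write_time, read_time, size

print(bench())                                # no filters
print(bench(compression="gzip"))
print(bench(compression="lzf", shuffle=True))
```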


mkitti commented May 31, 2022

Some of these are not compressors at all; the shuffles just permute the data.
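
To illustrate with byte shuffle on uint16 data: it regroups the low bytes of all elements, then the high bytes, which often leaves longer uniform runs for a downstream compressor. A toy numpy sketch:

```python
import numpy as np

a = np.array([258, 259, 260], dtype="<u2")  # little-endian uint16
raw = a.tobytes()                           # b'\x02\x01\x03\x01\x04\x01'

# byte shuffle: byte 0 of every element, then byte 1 of every element
shuffled = a.view(np.uint8).reshape(-1, 2).T.tobytes()  # b'\x02\x03\x04\x01\x01\x01'

print(raw, shuffled)  # same bytes, just permuted; no compression has happened
```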


mkitti commented May 31, 2022

lz4 basically does really fast run-length encoding or similar. I found it can be very sensitive to the distribution of the data.


mkitti commented May 31, 2022

I just heard that MathWorks is thinking about bundling some plugins with MATLAB.

https://www.mathworks.com/help/matlab/import_export/read-and-write-hdf5-datasets-using-dynamically-loaded-filters.html


clbarnes commented Jun 1, 2022

Yeah, I know that some filters shouldn't be expected to compress, but there are a few blosc+compressor combinations with various shuffles without even 1% compression, which surprised me.


mkitti commented Jun 1, 2022

By the way, what are Bsh and bsh? I'm assuming they are the different shuffles, but I'm not clear which is which. For scale-offset, what were the scale and offset?


clbarnes commented Jun 1, 2022

0sh = no shuffling, bsh = bit shuffling, Bsh = byte shuffling.

For scale-offset, I used 0 if enabled, so HDF5 figures out the parameters on a per-chunk basis for lossless compression, as documented here: https://docs.h5py.org/en/stable/high/dataset.html#dataset-scaleoffset


mkitti commented Jun 1, 2022

That's what I had thought. I'm surprised that byte shuffle results in smaller files than bit shuffle; in my experience, bit shuffle tends to beat byte shuffle in terms of compressed size, so now I'm trying to imagine a scenario in which the converse could be true.
