Compression and other filter options #1

Open · mkitti opened this issue May 27, 2022 · 16 comments

mkitti commented May 27, 2022

The hdf5plugin package is available via pip and conda-forge:
https://github.com/silx-kit/hdf5plugin
http://www.silx.org/doc/hdf5plugin/latest/

It would be nice if one could specify an arbitrary filter as in h5repack:
https://portal.hdfgroup.org/display/HDF5/h5repack

               FILT - is a string with the format: 
             
                 <list of objects>:<name of filter>=<filter parameters> 
             
                 <list of objects> is a comma separated list of object names, meaning apply 
                   compression only to those objects. If no names are specified, the filter 
                   is applied to all objects 
                 <name of filter> can be: 
                   GZIP, to apply the HDF5 GZIP filter (GZIP compression) 
                   SZIP, to apply the HDF5 SZIP filter (SZIP compression) 
                   SHUF, to apply the HDF5 shuffle filter 
                   FLET, to apply the HDF5 checksum filter 
                   NBIT, to apply the HDF5 NBIT filter (NBIT compression) 
                   SOFF, to apply the HDF5 Scale/Offset filter 
                   UD,   to apply a user defined filter 
                   NONE, to remove all filters 
                 <filter parameters> is optional filter parameter information 
                   GZIP=<deflation level> from 1-9 
                   SZIP=<pixels per block,coding> pixels per block is a even number in 
                       2-32 and coding method is either EC or NN 
                   SHUF (no parameter) 
                   FLET (no parameter) 
                   NBIT (no parameter) 
                   SOFF=<scale_factor,scale_type> scale_factor is an integer and scale_type 
                       is either IN or DS 
                   UD=<filter_number,filter_flag,cd_value_count,value_1[,value_2,...,value_N]> 
                       required values for filter_number,filter_flag,cd_value_count,value_1 
                       optional values for value_2 to value_N 
                   NONE (no parameter) 

Also note the SOFF above, which is the scale-offset filter that I previously discussed.

Here, filters can be specified by their registered filter numbers:
https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins

Beyond the hdf5plugin Python package, filters can also be dynamically loaded, as detailed here:
https://portal.hdfgroup.org/display/HDF5/HDF5+Dynamically+Loaded+Filters

The HDF Group maintains a repository of plugins here:
https://github.com/hdfGroup/hdf5_plugins
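
For example, with hdf5plugin the extra filters become available to h5py just by importing the package; a minimal sketch (assuming a recent h5py and hdf5plugin; file and dataset names are placeholders):

```python
import h5py
import hdf5plugin  # registers the extra filters with HDF5 on import
import numpy as np

data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)

with h5py.File("example.h5", "w") as f:
    # the hdf5plugin filter objects expand into the usual
    # compression / compression_opts keyword arguments
    f.create_dataset(
        "image",
        data=data,
        chunks=(256, 256),
        **hdf5plugin.Zstd(),
    )
```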


mkitti commented May 27, 2022

For h5py support, see custom compression filters here:
https://docs.h5py.org/en/stable/high/dataset.html#custom-compression-filters
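
As a rough sketch of that mechanism: the registered filter number can be passed to `compression` directly. This assumes the Zstandard plugin (registered filter ID 32015, which, as I understand it, takes the compression level as its single cd_value) is discoverable by the HDF5 library:

```python
import h5py
import numpy as np

with h5py.File("custom_filter.h5", "w") as f:
    f.create_dataset(
        "image",
        data=np.zeros((512, 512), dtype=np.uint16),
        chunks=(256, 256),
        compression=32015,      # raw registered filter number (Zstandard)
        compression_opts=(3,),  # filter-specific cd_values; here, a compression level
    )
```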


clbarnes commented May 27, 2022

Is there a large demand for other filters? In most datasets I've come across in the wild (e.g. N5), I've mainly seen gzip. As I understand it, these raw images compress so poorly anyway that compression itself isn't necessarily worth the hassle, let alone variations between compression algorithms. There is certainly a benefit to keeping things simple and widely supported. However, if there are significant gains in storage efficiency and performance from using e.g. blosc and different compressors, I'd be willing to add them.

These HDF5 files are unlikely to be the final form of these data - practically any downstream use will require scaling, contrast correction, and alignment, at which point other forms of filter and compression could be applied. My goal here is to produce a widely-compatible first form of the data so that everyone can use Jeiss images without having to concern themselves with the .dat format.


mkitti commented May 27, 2022

We have experimented with compression filters in the past:
https://docs.google.com/presentation/d/1d1xH93uxTnUBlr5IrWOQjTljEvakwmgZu1kweWYMGAo/edit?usp=sharing

Basically, we can get better compression using bitshuffle/zstd, either directly or via Blosc. The file is about 70% of the original size and decompresses about 4x faster than gzip.
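
Both routes look something like this sketch, if I understand the hdf5plugin API correctly (parameters are illustrative):

```python
import h5py
import hdf5plugin
import numpy as np

data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)

with h5py.File("blosc.h5", "w") as f:
    # bitshuffle + zstd through the Blosc meta-compressor
    f.create_dataset(
        "via_blosc", data=data, chunks=(256, 256),
        **hdf5plugin.Blosc(cname="zstd", clevel=5,
                           shuffle=hdf5plugin.Blosc.BITSHUFFLE),
    )
    # or the standalone bitshuffle filter (compresses with LZ4 by default)
    f.create_dataset(
        "via_bitshuffle", data=data, chunks=(256, 256),
        **hdf5plugin.Bitshuffle(),
    )
```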

clbarnes commented:

That does look like a significant gain - are these plugins easily available in standard channels alongside HDF5 libraries? Don't want to impede adoption by getting too experimental!


mkitti commented May 27, 2022

For Python, the plugins are very easily installable:
https://pypi.org/project/hdf5plugin/
https://anaconda.org/conda-forge/hdf5plugin

In general, the HDF Group also provides downloadable binaries for each release:
https://www.hdfgroup.org/downloads/hdf5/

I'm currently working on improving access via Java:
scijava/pom-scijava#181
https://github.com/JaneliaSciComp/jhdf5/tree/mkitti/hdf5_libsh

I'm hoping to put together a plugin package for ImageJ / FIJI soon once I can update the base jhdf5 library.


mkitti commented May 27, 2022

The main issue with Java is that the jhdf5 library currently distributed with FIJI statically links the original HDF5 library:
https://sissource.ethz.ch/sispub/jhdf5/-/tree/master/libs/native/jhdf5

The library only exports JNI symbols and not the original HDF5 symbols, which some of the plugins need. The branch I posted above fixes this by splitting the library into two shared libraries: hdf5 and jhdf5 (with the JNI symbols).

Some plugins, such as ZSTD, do not actually call back into the HDF5 library. In this case, setting HDF5_PLUGIN_PATH to point at either the HDF Group plugins or the Python package's plugins may be sufficient.
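
A sketch of that environment-variable route (the plugin directory below is hypothetical; the variable must be set before the HDF5 library is loaded, i.e. before importing h5py or launching the JVM):

```python
import os

# Point HDF5 at a directory of filter plugins before the library loads.
# "/opt/hdf5/plugins" is a placeholder for wherever your plugins live.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/plugins"

import h5py  # noqa: E402  (imported after setting the env var on purpose)

with h5py.File("zstd_compressed.h5") as f:
    data = f["image"][:]  # filter is resolved from the plugin path on read
```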

clbarnes commented:

I've added h5py's built-in byteshuffle, scale-offset, and checksum options on the basis that they're probably pretty ubiquitous. I'd like to be cautious about the others: I want to avoid users getting an HDF5 file and finding they can't open it with standard tooling, and even hdf5plugin requires all openers of the file to have the package imported.
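
For reference, a sketch of how those built-ins are switched on in h5py (parameter values illustrative; as I understand it h5py disallows combining fletcher32 with the potentially lossy scale-offset filter, so they are shown on separate datasets):

```python
import h5py
import numpy as np

data = np.zeros((512, 512), dtype=np.uint16)

with h5py.File("builtins.h5", "w") as f:
    # byte shuffle + checksum alongside gzip
    f.create_dataset(
        "shuffled", data=data, chunks=(256, 256),
        shuffle=True,     # H5Z_FILTER_SHUFFLE (byte shuffle)
        fletcher32=True,  # H5Z_FILTER_FLETCHER32 checksum
        compression="gzip",
    )
    # scale-offset on its own dataset
    f.create_dataset(
        "scaleoffset", data=data, chunks=(256, 256),
        scaleoffset=0,    # 0 = HDF5 picks per-chunk parameters, lossless for ints
        compression="gzip",
    )
```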


mkitti commented May 31, 2022

These are the filters within the HDF5 code base itself:

Filter identifiers for the filters distributed with the HDF5 Library are as follows:

H5Z_FILTER_DEFLATE The gzip compression, or deflation, filter
H5Z_FILTER_SZIP The SZIP compression filter
H5Z_FILTER_NBIT The N-bit compression filter
H5Z_FILTER_SCALEOFFSET The scale-offset compression filter
H5Z_FILTER_SHUFFLE The shuffle algorithm filter
H5Z_FILTER_FLETCHER32 The Fletcher32 checksum, or error checking, filter

https://portal.hdfgroup.org/display/HDF5/Filters

The main one that might be disabled is SZIP due to patent issues.
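
One way to check what a given HDF5 build actually provides is h5py's low-level `filter_avail`; a minimal sketch:

```python
from h5py import h5z

# Check which bundled filters this HDF5 build provides;
# SZIP in particular may be absent or decode-only.
for name, filter_id in [
    ("deflate", h5z.FILTER_DEFLATE),
    ("szip", h5z.FILTER_SZIP),
    ("shuffle", h5z.FILTER_SHUFFLE),
    ("fletcher32", h5z.FILTER_FLETCHER32),
    ("nbit", h5z.FILTER_NBIT),
    ("scaleoffset", h5z.FILTER_SCALEOFFSET),
    ("lzf", h5z.FILTER_LZF),
]:
    print(name, h5z.filter_avail(filter_id))
```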

clbarnes commented:

Got it, so even lzf isn't a given.

I've done some very loose benchmarking (one single-channel image, one run per configuration, writing to memory) and came up with this:

| rel_write_time | rel_read_time | rel_size | write_time (s) | read_time (s) | size (B) | filters |
|---|---|---|---|---|---|---|
| 1.04 | 0.91 | 1.00 | 2.10 | 0.19 | 527883320 | (none) |
| 11.23 | 12.95 | 0.79 | 22.61 | 2.72 | 418421427 | gzip |
| 2.12 | 5.38 | 1.00 | 4.28 | 1.13 | 525460561 | lzf |
| 2.19 | 8.48 | 0.78 | 4.41 | 1.78 | 413065378 | scaleoffset |
| 7.39 | 15.05 | 0.77 | 14.87 | 3.16 | 403841668 | scaleoffset+gzip |
| 3.02 | 8.66 | 0.78 | 6.08 | 1.82 | 412653629 | scaleoffset+lzf |
| 1.13 | 1.69 | 1.00 | 2.28 | 0.35 | 527883320 | byteshuffle |
| 6.90 | 6.80 | 0.70 | 13.90 | 1.43 | 366895195 | byteshuffle+gzip |
| 2.09 | 4.43 | 0.82 | 4.20 | 0.93 | 434552392 | byteshuffle+lzf |
| 2.26 | 9.07 | 0.78 | 4.56 | 1.90 | 413065402 | byteshuffle+scaleoffset |
| 7.69 | 15.62 | 0.77 | 15.49 | 3.28 | 404077271 | byteshuffle+scaleoffset+gzip |
| 3.02 | 9.25 | 0.78 | 6.08 | 1.94 | 412653798 | byteshuffle+scaleoffset+lzf |
| 1.11 | 1.59 | 1.00 | 2.23 | 0.33 | 527883320 | bitshuffle |
| 1.17 | 1.80 | 0.74 | 2.36 | 0.38 | 390987495 | bitshuffle+lz4 |
| 1.03 | 1.01 | 1.00 | 2.07 | 0.21 | 527200382 | lz4 |
| 1.29 | 2.85 | 0.79 | 2.60 | 0.60 | 416709700 | zstd |
| 1.03 | 0.94 | 1.00 | 2.07 | 0.20 | 527883320 | blosc+blosclz+0sh |
| 1.57 | 2.61 | 0.87 | 3.15 | 0.55 | 458438347 | blosc+blosclz+Bsh |
| 1.07 | 0.97 | 1.00 | 2.16 | 0.20 | 527883320 | blosc+blosclz+bsh |
| 1.07 | 0.94 | 1.00 | 2.16 | 0.20 | 527120922 | blosc+lz4+0sh |
| 1.31 | 1.87 | 0.85 | 2.63 | 0.39 | 450420924 | blosc+lz4+Bsh |
| 1.10 | 0.94 | 1.00 | 2.22 | 0.20 | 527120922 | blosc+lz4+bsh |
| 4.66 | 1.23 | 1.00 | 9.38 | 0.26 | 525304316 | blosc+lz4hc+0sh |
| 7.13 | 1.59 | 0.76 | 14.37 | 0.33 | 403198524 | blosc+lz4hc+Bsh |
| 4.69 | 1.39 | 1.00 | 9.44 | 0.29 | 525304316 | blosc+lz4hc+bsh |
| 11.86 | 14.17 | 0.79 | 23.88 | 2.97 | 418710050 | blosc+zlib+0sh |
| 11.28 | 5.65 | 0.68 | 22.72 | 1.18 | 361219997 | blosc+zlib+Bsh |
| 11.47 | 13.81 | 0.79 | 23.11 | 2.90 | 418710050 | blosc+zlib+bsh |
| 3.97 | 3.80 | 0.79 | 7.99 | 0.80 | 416265416 | blosc+zstd+0sh |
| 9.90 | 2.88 | 0.70 | 19.94 | 0.61 | 368839149 | blosc+zstd+Bsh |
| 3.88 | 3.56 | 0.79 | 7.82 | 0.75 | 416265416 | blosc+zstd+bsh |

Some of it doesn't seem to make much sense (e.g. not seeing any significant size decrease for some compressors), but it does look like blosc+zstd+byteshuffle is a good combination, at least for size and reading. bitshuffle+lz4 is nearly as good while being quite a lot faster.
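
The harness was roughly of this shape (a sketch rather than the exact script: in-memory writes via h5py's core driver, a single run per configuration, so cache effects make the read timings especially loose):

```python
import time
import h5py
import numpy as np

data = np.random.randint(0, 2**16, size=(4096, 4096), dtype=np.uint16)

def bench(**dataset_kwargs):
    """Time one write and one read of `data` against an in-memory file."""
    with h5py.File("bench", "w", driver="core", backing_store=False) as f:
        t0 = time.perf_counter()
        ds = f.create_dataset("image", data=data, chunks=(256, 256),
                              **dataset_kwargs)
        write_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        _ = ds[:]
        read_time = time.perf_counter() - t0
        size = ds.id.get_storage_size()  # bytes actually stored on "disk"
    return write_time, read_time, size

print(bench())                                # no filters
print(bench(compression="gzip"))
print(bench(compression="lzf", shuffle=True))
```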


mkitti commented May 31, 2022

Some of these are not compressors at all; the shuffles just permute the data.
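
To illustrate with byte shuffle on uint16 data: it regroups the low bytes of all elements, then the high bytes, which often leaves longer uniform runs for a downstream compressor. A toy numpy sketch:

```python
import numpy as np

a = np.array([258, 259, 260], dtype="<u2")  # little-endian uint16
raw = a.tobytes()                           # b'\x02\x01\x03\x01\x04\x01'

# byte shuffle: byte 0 of every element, then byte 1 of every element
shuffled = a.view(np.uint8).reshape(-1, 2).T.tobytes()  # b'\x02\x03\x04\x01\x01\x01'

print(raw, shuffled)  # same bytes, just permuted; no compression has happened
```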


mkitti commented May 31, 2022

lz4 basically does really fast run-length encoding or similar. I found it can be very sensitive to the distribution of the data.


mkitti commented May 31, 2022

I just heard that MathWorks is thinking about bundling some plugins with MATLAB.

https://www.mathworks.com/help/matlab/import_export/read-and-write-hdf5-datasets-using-dynamically-loaded-filters.html


clbarnes commented Jun 1, 2022

Yeah, I know that some filters shouldn't be expected to compress, but there are a few blosc+compressor combinations with various shuffles without even 1% compression, which surprised me.


mkitti commented Jun 1, 2022

By the way, what are Bsh and bsh? I'm assuming they are the different shuffles, but I'm not clear which is which. For scale-offset, what were the scale and offset?


clbarnes commented Jun 1, 2022

0sh = no shuffling, bsh = bit shuffling, Bsh = byte shuffling.

For scale-offset, I used 0 if enabled, so HDF5 figures out the parameters on a per-chunk basis for lossless compression, as documented here: https://docs.h5py.org/en/stable/high/dataset.html#dataset-scaleoffset


mkitti commented Jun 1, 2022

That's what I had thought. I'm surprised that byte shuffle results in smaller files than bit shuffle; in my experience, bit shuffle tends to beat byte shuffle in terms of compressed size, so now I'm trying to imagine a scenario in which the converse could be true.
