Zstd Support #530
An undergrad in the lab is looking into how modern compression algorithms work with genomic data. Long story short, it looks like there are justifications to use LZ4, ZSTD, and Brotli in different cases.

If we produced a pull request to add support for these libraries to htslib, maybe initially to CRAM, would it be likely to be accepted?

Comments
I have also observed good performance from ZSTD, although it appears that the CRAM compressor is also a very modern approach. See my issue on squash (the "unifying compressor API" project) and its related compression benchmark:
In particular, I was able to convince the squash benchmark folks to add the genome FASTA data to their compressor benchmark. I then ran the benchmark on chr1 and observed very good compression and speed for zstd at the highest (21) compression level: https://gist.github.com/kyleabeauchamp/e5f5d79aa153bc85d854a705a25c9166
@kyleabeauchamp A programmer in the lab has produced a version of htslib that uses zstd instead of zlib, relying on the wrapper provided by zstd to port the code: https://github.com/zmertens/htslib/tree/zstd. Ideally, we would like to have something in which zstd and zlib work together.
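As context for what such a port has to cover, here is a hedged sketch of the raw-deflate zlib call path that bgzf-style code relies on. The function name is illustrative, not an htslib symbol; the zlibWrapper shipped in the zstd source tree is intended as a drop-in for this same zlib API, so code like this is what it would intercept.

#include <string.h>
#include <zlib.h>   /* zstd's zlibWrapper aims to be swappable for this header */

/* Illustrative sketch (not htslib source): compress one block as raw deflate,
 * the style bgzf uses inside its own BGZF framing. */
static int raw_deflate_block(void *dst, size_t *dst_len,
                             const void *src, size_t src_len, int level)
{
    z_stream zs;
    memset(&zs, 0, sizeof(zs));
    /* -15 window bits => raw deflate with no zlib/gzip header */
    if (deflateInit2(&zs, level, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    zs.next_in   = (Bytef *)src;
    zs.avail_in  = (uInt)src_len;
    zs.next_out  = (Bytef *)dst;
    zs.avail_out = (uInt)*dst_len;
    int ret = deflate(&zs, Z_FINISH);     /* one-shot: whole block in one call */
    deflateEnd(&zs);
    if (ret != Z_STREAM_END) return -1;   /* output buffer too small or error */
    *dst_len = zs.total_out;
    return 0;
}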
Cool. I'm not actually an HTSLib developer :) but I wanted to voice my support for any project that improves the compression ratio and/or read throughput of HTSLib-supported formats (particularly BAM and CRAM).
I've experimented with Zstd and CRAM before. At some compression levels it's a definite win over zlib in both speed and size. I experimented on this in Staden io_lib for the original Scramble implementation of CRAM (https://github.com/jkbonfield/io_lib). There is an earlier Zstd branch there, but also a cram_modules branch which has lots of other experimental plugins.

The problem isn't adding Zstd, which is actually very easy, but being able to put together a group of changes that jointly justify a change to the specification. We don't want to do this piecemeal with lots of variations of the specification, but rather as a significant step forward, as we did between CRAM v2 and v3. I have various other experiments on the go and in particular have a significant improvement for compression of query names (beating pretty much everything bar the excessively slow context-mixing algorithms).

If you're interested in compression and looking at other formats then you may also want to check out https://github.com/IlyaGrebnov/libbsc. On some CRAM data series that's the winner out of the general purpose tools.

LZ4 does have potential use within samtools, though, without needing to change any specifications. It's ideal for temporary file output, e.g. when spooling the blocks out in a samtools sort process.
Regarding FASTA compression, it's really a special case and not so helpful here. Also note that the higher levels mainly work better by using larger block sizes, harming random access. SAM & BAM sequence encoding is predominantly about how efficient the LZ step is, as a genome-sorted file will have lots of overlap from one sequence to the next. CRAM is (usually) reference based, so the efficiency is then down to how well the deltas get encoded; Zstd doesn't buy that much there really.

I experimented with column-oriented compression of VCF (so similar to how CRAM is vs BAM). It gains a bit, but not as much as I'd hoped for. There are other reasons why that may be good to do though, but it would be a larger project as it's really then designing a new format from scratch.
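As a rough sketch of what column-oriented staging of VCF could look like before handing each stream to a general-purpose compressor (the column count and per-column output files are illustrative assumptions, not an existing format):

#include <stdio.h>
#include <string.h>

#define MAX_COLS 8   /* CHROM..INFO for a sites-only VCF; illustrative only */

/* Hedged sketch: append each tab-separated field of a VCF data line to its own
 * per-column stream, so a compressor later sees homogeneous data
 * (all POS values together, all QUAL values together, and so on). */
static void split_vcf_line(char *line, FILE *col_out[MAX_COLS])
{
    int col = 0;
    for (char *tok = strtok(line, "\t\n"); tok && col < MAX_COLS;
         tok = strtok(NULL, "\t\n"), col++) {
        fputs(tok, col_out[col]);
        fputc('\n', col_out[col]);
    }
}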
Thanks, this is helpful information in terms of the challenges and also what's been tried. I suspected someone did a lot of benchmarking at some point.
I think one could get a lot out of using Zstd over Zlib due to its faster compression and decompression speed. Here is a summary of our results for bcf, vcf, sam, gff3, gtf, and fasta. I think it is clear that if your problem is limited by compression/decompression speed then Zstd is more efficient: for the same amount of compression, Zstd is faster, and for the same amount of time Zstd compresses more. Zstd is also more controllable, allowing for a wider range of compression levels. As you say @jkbonfield, enabling random access will probably reduce these advantages, but I poked around in the Zstd source code: the parameters behind each compression level are adjusted based on the size of the source block, and there are parameter sets for blocks below 16KB.

We have other metrics based on transmitting a dataset over a 'pipe' that include compression size, compression speed, and decompression speed. In those, zstd and lz4 do well depending on the 'pipe' size because of their fast compression and decompression speed.
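For concreteness, a minimal sketch of the one-shot zstd API with an explicit compression level, the kind of call used in benchmarks like the above (the toy payload and function name are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>    /* link with -lzstd */

/* Compress one in-memory block at a chosen level and report the ratio. */
static int demo_compress(const void *src, size_t src_len, int level)
{
    size_t bound = ZSTD_compressBound(src_len);   /* worst-case output size */
    void *dst = malloc(bound);
    if (!dst) return -1;

    size_t csize = ZSTD_compress(dst, bound, src, src_len, level);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(csize));
        free(dst);
        return -1;
    }
    printf("level %d: %zu -> %zu bytes (%.2fx)\n",
           level, src_len, csize, (double)src_len / csize);
    free(dst);
    return 0;
}

int main(void)
{
    const char msg[] = "ACGTACGTACGTACGTNNNNACGTACGTACGT";  /* toy payload */
    for (int level = 1; level <= ZSTD_maxCLevel(); level += 5)
        demo_compress(msg, sizeof(msg) - 1, level);
    return 0;
}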
The pipe transfer point is an interesting one: if you're only talking about data transfer then you'll be willing to use any format, as it's temporary. However, if it's a process of BAM -> BAM.raw -> BAM+zstd -> ... -> BAM.raw -> BAM (deflate) then that's possibly a lot of compression and uncompression steps. It's better then to just use something "good enough" like CRAM directly without needing to transcode twice for the transmission step. Some examples from my scramble experiments:

$ awk '{a+=$6} END {print a}' NA12878_S1.1.orig.cram.10k.size
6.64126e+10
$ awk '{a+=$6} END {print a}' NA12878_S1.1.bam.fqz+zstd.10k.cram.size
6.23181e+10
$ awk '{a+=$6} END {print a}' NA12878_S1.1.bam.fqz+zstd.100k.cram.size
5.96217e+10

Here 10k and 100k refer to 10,000 and 100,000 reads per CRAM slice, i.e. the granularity of random access. Adding zstd and fqzcomp (quality) shaved off around 3% unless we want to get coarser grained on random access. It's OK, but by itself isn't anything ground-breaking. The .cram.size files were outputs from the cram_size tool. Examples of the original one and the zstd.10k one are:

$ cat NA12878_S1.1.orig.cram.size
Block CORE , total size 0
Block content_id 11, total size 9291477075 g RN
Block content_id 12, total size 50497889145 R QS
Block content_id 13, total size 10619884 g IN
Block content_id 14, total size 316666117 g rR SC
Block content_id 15, total size 453510486 R BF
Block content_id 16, total size 218338526 r CF
Block content_id 17, total size 497168802 r AP
Block content_id 19, total size 128590411 rR MQ
Block content_id 20, total size 8716899 r NS
Block content_id 21, total size 17488877 g R MF
Block content_id 22, total size 154859102 g R TS
Block content_id 23, total size 188220982 g NP
Block content_id 24, total size 777242598 g NF
Block content_id 26, total size 219022726 rR FN
Block content_id 27, total size 85217034 r FC
Block content_id 28, total size 529726388 g R FP
Block content_id 29, total size 7825109 g DL
Block content_id 30, total size 1784256524 g BA
Block content_id 31, total size 127139382 g rR BS
Block content_id 32, total size 123102176 rR TL
Block content_id 4279619, total size 117877737 rR AMC
Block content_id 5063770, total size 496 g MDZ
Block content_id 5131587, total size 139 g NMC
Block content_id 5459267, total size 95402099 rR SMC
Block content_id 5779523, total size 43436123 r X0C
Block content_id 5779529, total size 1584 g X0I
Block content_id 5779539, total size 7159543 g X0S
Block content_id 5779779, total size 63076858 g rR X1C
Block content_id 5779785, total size 52 g X1I
Block content_id 5779795, total size 9021181 g X1S
Block content_id 5783898, total size 279421941 g XAZ
Block content_id 5784387, total size 53329435 g XCC
Block content_id 5785411, total size 36359075 r XGC
Block content_id 5786947, total size 188586224 rR XMC
Block content_id 5787203, total size 4936 g XNC
Block content_id 5787459, total size 31961173 r XOC
Block content_id 5788737, total size 49860294 r XTA

and

$ cat NA12878_S1.1.bam.fqz+zstd.10k.cram.size
Block CORE , total size 0
Block content_id 11, total size 9123818505 [Z] RN
Block content_id 12, total size 46964081560 [q] QS
Block content_id 13, total size 6800171 [Z] IN
Block content_id 14, total size 316582561 g R[Z] SC
Block content_id 15, total size 453510486 R BF
Block content_id 16, total size 218338526 r CF
Block content_id 17, total size 497168802 r AP
Block content_id 19, total size 128581699 rR MQ
Block content_id 20, total size 7979514 r [Z] NS
Block content_id 21, total size 17488877 g R MF
Block content_id 22, total size 105277278 R[Z] TS
Block content_id 23, total size 122722059 [Z] NP
Block content_id 24, total size 786520929 [Z] NF
Block content_id 26, total size 219025326 rR FN
Block content_id 27, total size 85217034 r FC
Block content_id 28, total size 501211157 g [Z] FP
Block content_id 29, total size 7630147 [Z] DL
Block content_id 30, total size 1594899103 g [Z] BA
Block content_id 31, total size 127143928 g rR[Z] BS
Block content_id 32, total size 127375184 rR TL
Block content_id 4279619, total size 117875360 rR AMC
Block content_id 5063770, total size 491 [Z] MDZ
Block content_id 5131587, total size 148 g [Z] NMC
Block content_id 5459267, total size 95402117 rR SMC
Block content_id 5779523, total size 26177637 r [Z] X0C
Block content_id 5779529, total size 3998 [Z] X0I
Block content_id 5779539, total size 6603397 g [Z] X0S
Block content_id 5779779, total size 63077067 g rR[Z] X1C
Block content_id 5779785, total size 39 [Z] X1I
Block content_id 5779795, total size 9070776 g [Z] X1S
Block content_id 5783898, total size 228438521 [Z] XAZ
Block content_id 5784387, total size 53329470 g XCC
Block content_id 5785411, total size 36359075 r XGC
Block content_id 5786947, total size 188585670 rR XMC
Block content_id 5787203, total size 5976 g [Z] XNC
Block content_id 5787459, total size 31961173 r XOC
Block content_id 5788737, total size 49860294 r XTA

The various g, r and R in the first one refer to the gzip (deflate), rANS order-0 and rANS order-1 codecs. The second example has [q] and [Z], which is how the plugins referred to their codecs: the fqzcomp quality codec and Zstd respectively. I can't recall which compression level I was using though. Basically most of the space savings come from RN (read names), QS (quality scores), BA (literal bases, likely from the unmapped reads at the end of the file) and the XA:Z aux tag.

I already have a better read name codec that beats zstd anyway, plus qualities aren't compressed well by that scheme either. See https://github.com/jkbonfield/mpeg_misc/blob/master/CE5/RESULTS.md for some tokenisation and compression of the first 10,000 read names (1 CRAM block) from a variety of test files covering multiple platforms. Libbsc in other tests does better still. E.g. on SRR1238539, for RN and XA:Z I get:

Block content_id 11, total size 529026032 g RN
Block content_id 11, total size 245988043 [Z] RN
Block content_id 11, total size 217363907 [Z][b] RN
Block content_id 5783898, total size 242569444 g XAZ
Block content_id 5783898, total size 234564415 [Z] XAZ
Block content_id 5783898, total size 191399854 [Z][b] XAZ

This really leaves auxiliary blocks, and maybe the BA fields for unmapped data, as the key source of improvement (for CRAM anyway). With BAM and BCF it's a totally different scenario that I haven't really investigated. As a general purpose codec though libbsc really is quite impressive and undervalued IMO.

Edit: some more stats from pure CRAM vs BSC+FQZ on the platinum genomes sample are more significant - a 12% boost - but that needs large block sizes to work well.
Have you ever experimented with the pre-calculated dictionary abilities of zstd or brotli? I imagine that for some file types, including a pre-calculated dictionary along with htslib/bgzip would improve ratios.
No, I haven't, nor have I looked into how best to create a dictionary. I expect it wouldn't make much difference except for small block sizes (i.e. when supporting a high degree of random access). In that situation I think it may well help rescue a lot of the file growth.

I've thought about this for the pure entropy encoders though. E.g. order-1 rANS works well on qualities but becomes a costly overhead if we shrink to 1000 records per block. In CRAM we could put the table in the container header (Vadim's original suggestion in fact) and have, say, 100 small slices per container such that we share the header meta-data cost between many blocks. Such an approach ought to work well for specifying an external LZ dictionary too. For a direct replacement of bgzf in BAM/BCF though it's less clear how to do this, as the file contents have no block-oriented structure of their own (other than by the bgzf layer itself).
@jkbonfield I'm reviving this ancient issue rather than making a new duplicate because I think that nothing has fundamentally changed, even though zstd has significantly improved over the last 4 years. The bigger issue is not implementing zstd in code but creating a new file format that existing toolchains and old versions of htslib may not be able to work with.

However, I've been experimenting with a forked version of bgzf that uses zstd instead of zlib / libdeflate, and I believe that there are some very significant gains to be had both for disk usage and compression / decompression speed. The lab that I'm in has a significant number of bgzip-compressed VCFs, and a relatively simple ZSTD swap reduces them to 60-70% of their current bgzip-ed size while being much more CPU-efficient to decompress as well. There's less benefit on .bams, as you measured in this issue, but I think that there are options for using dictionaries to further improve .bam compression. Zstd will still not achieve the compression ratios that lzma or especially CRAM can, but it's so much faster that it would still be a significant improvement over zlib / libdeflate.

If I were to develop a bzstdf layer for accessing block-compressed files using zstd, with metadata frames holding info about block sizes for the indexer to read, and a dictionary included for several types of data (.bam, .vcf) or the ability to include a dictionary as a metadata frame to enable much greater compression with small block sizes, would you be interested in accepting it?
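As a rough illustration of the per-block layer being proposed (every name here - bzstdf_compress_block, the 64 KiB block size - is hypothetical, not an existing htslib or bzstdf API):

#include <stddef.h>
#include <zstd.h>

#define BZSTDF_BLOCK_SIZE (64 * 1024)   /* hypothetical uncompressed block size */

/* Hypothetical bzstdf-style helper: compress one block with a reusable context.
 * Returns the compressed size, or 0 on error. dst_cap should be at least
 * ZSTD_compressBound(src_len). */
static size_t bzstdf_compress_block(ZSTD_CCtx *cctx,
                                    void *dst, size_t dst_cap,
                                    const void *src, size_t src_len,
                                    int level)
{
    if (src_len > BZSTDF_BLOCK_SIZE) return 0;   /* caller chunks input first */
    size_t csize = ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_len, level);
    return ZSTD_isError(csize) ? 0 : csize;
}

A real format would also need per-block framing recording compressed and uncompressed sizes, as described above, so that bgzf-style virtual-offset indexing can be reproduced.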
I personally think it would be eminently sensible, although I'd have to check with others whether it would be accepted. Certainly for new projects I've tried to push for such things (VGP/vgp-tools#1). It ought to be an entirely new tool though, with a new suffix, as it'd be incompatible with bgzip. A choice of block size would help too; the default for bgzf is way too small IMO.

One other thing to consider is also benchmarking against libbsc. This is another high-performance tool which may offer better compression ratios, particularly with the newer releases and the faster entropy encoding mode: https://github.com/IlyaGrebnov/libbsc. It may still be too slow for most use cases though, particularly as its decompression speed won't match zstd (which is a tough one to beat). Plus zstd has the weight of being a published RFC behind it. Basically zstd is to gzip as bsc is to bzip2.
Having a |
Yes, I've been experimenting with a new tool to compress / decompress (bgzstd) and a completely different layer with the same API as bgzf (bzstdf). I still need to finish the metadata blocks. We would also have to update

Using a dictionary would get around most of the overhead that comes from choosing a smaller block size, so I really do believe it would be possible to have the best of all worlds: a higher compression ratio, faster compression & decompression, and still extremely fast random I/O enabled by the smaller block size.
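One way such metadata blocks could be stored without confusing ordinary zstd decoders is a zstd skippable frame (magic numbers 0x184D2A50-0x184D2A5F, which standard decoders skip). The payload layout below, a plain list of block sizes, is a hypothetical choice for illustration:

#include <stdint.h>
#include <stdio.h>

/* Write a zstd skippable frame: 4-byte little-endian magic, 4-byte little-endian
 * payload size, then the payload. The payload here (little-endian u32 per block)
 * is a hypothetical layout, not part of any existing spec. */
static int write_skippable_frame(FILE *fp, const uint32_t *block_sizes, uint32_t n)
{
    uint32_t magic = 0x184D2A50u;            /* lowest of the 16 skippable magics */
    uint32_t payload_len = (uint32_t)(n * sizeof(uint32_t));
    uint8_t hdr[8];
    /* serialise both header words as little-endian regardless of host endianness */
    for (int i = 0; i < 4; i++) hdr[i]     = (uint8_t)(magic >> (8 * i));
    for (int i = 0; i < 4; i++) hdr[4 + i] = (uint8_t)(payload_len >> (8 * i));
    if (fwrite(hdr, 1, 8, fp) != 8) return -1;
    for (uint32_t b = 0; b < n; b++) {
        uint8_t le[4];
        for (int i = 0; i < 4; i++) le[i] = (uint8_t)(block_sizes[b] >> (8 * i));
        if (fwrite(le, 1, 4, fp) != 4) return -1;
    }
    return 0;
}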
Agreed. How would you envisage the dictionary being used? There is a Zstd command (and, I assume, an associated function) to compute a dictionary from a file, so we could build the dictionary from, say, the first few MB of file data and then apply it for all blocks. Maybe it would be good if the format permitted periodic rebuilding of the dictionary, with a mechanism in the index that records where dictionary blocks are found.
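The training entry points being alluded to are presumably the zstd CLI's --train mode and the ZDICT_trainFromBuffer function from zdict.h; a minimal sketch, with an illustrative wrapper name:

#include <stdio.h>
#include <zdict.h>

/* Train a dictionary from concatenated samples.
 * samples:      all sample bytes laid out back to back
 * sample_sizes: length of each individual sample
 * Returns the dictionary size written into dict_buf, or 0 on error. */
static size_t train_dict(void *dict_buf, size_t dict_cap,
                         const void *samples, const size_t *sample_sizes,
                         unsigned n_samples)
{
    size_t dsize = ZDICT_trainFromBuffer(dict_buf, dict_cap,
                                         samples, sample_sizes, n_samples);
    if (ZDICT_isError(dsize)) {
        fprintf(stderr, "dictionary training failed: %s\n",
                ZDICT_getErrorName(dsize));
        return 0;
    }
    return dsize;
}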
I was actually thinking that, as part of the specification that includes Zstd compression, an appropriate dictionary could be included for each type of commonly block-compressed data (that is, BAMs and VCFs). That would mean a default dictionary is available out of the box, which could potentially be overridden by a dictionary encoded in a metadata frame. I need to test this to get real measurements, but it would make compression faster because there wouldn't have to be a training step, and it would make output file sizes smaller because the dictionary would live in the bzstd binary, not in the input or output files. A model for this is the Brotli compression format, which has seen significant benefits from having a dictionary as part of its spec that ends up being built into the binary: https://datatracker.ietf.org/doc/html/rfc7932#appendix-A
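To illustrate how a spec-level default dictionary could be applied per block with the plain zstd dictionary API (the embedded array is a placeholder, not a real trained dictionary, and the function names are illustrative):

#include <stddef.h>
#include <zstd.h>

/* Placeholder for a dictionary that would ship inside the binary;
 * a real one would come from ZDICT training on representative VCF/BAM blocks. */
static const unsigned char default_vcf_dict[] = { 0 };

/* Compress one block against the built-in dictionary. Returns 0 on error. */
static size_t compress_with_default_dict(ZSTD_CCtx *cctx,
                                         void *dst, size_t dst_cap,
                                         const void *src, size_t src_len,
                                         int level)
{
    size_t csize = ZSTD_compress_usingDict(cctx, dst, dst_cap, src, src_len,
                                           default_vcf_dict,
                                           sizeof(default_vcf_dict), level);
    return ZSTD_isError(csize) ? 0 : csize;
}

/* Decompression must use the same dictionary. Returns 0 on error. */
static size_t decompress_with_default_dict(ZSTD_DCtx *dctx,
                                           void *dst, size_t dst_cap,
                                           const void *src, size_t src_len)
{
    size_t dsize = ZSTD_decompress_usingDict(dctx, dst, dst_cap, src, src_len,
                                             default_vcf_dict,
                                             sizeof(default_vcf_dict));
    return ZSTD_isError(dsize) ? 0 : dsize;
}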
Definitely sounds like an interesting research project.
This is certainly worth looking into, especially as most users should now be on Linux distributions that include zstd packages. We'll need to think carefully about exactly how we integrate it, though, as there are a number of points to consider.