Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

jmarshall · 2021-10-05T11:44:12Z

There has been some recent discussion of using Zstandard compression (cf #530).

As a very small first step towards this, this adds xz_compression and zstd_compression to enum htsCompression and recognises their magic numbers, as we do for bzip2_compression. As htslib already uses liblzma, this also adds decompress_peek_xz() (if liblzma is configured in) to peek at the contents. Unlike libbz2's decompression routines, an lzma_stream_decoder can decompress a few hundred bytes of a stream without needing to read in an entire multi-hundred-KB block — hence is suitable for use with peeking.

There is also some talk of an intersection between D4 and htsget (though IMHO I'm not yet sure this intersection is non-empty), and of upcoming work in that area. So this also adds basic recognition of D4 format. Note that the D4 spec is fairly incomplete and (as far as I am aware) the format currently has only one implementation. So you may wish to leave the D4 commit off for now. This PR does not set version.major/.minor for D4; see 38/d4-format#34.

In the unlikely event of an error from decompress_peek() (now renamed in anticipation of similar functions for other compression formats), ensure hts_detect_format() also returns an error rather than incorrectly detecting the contents as being empty.

Similarly to hts_detect_format()'s basic detection of bzip2-compressed data, recognise the magic numbers of these two formats. If HAVE_LIBLZMA is defined, we can decompress the first few KB (as we do for gzipped data) to detect the contents. Otherwise, and for Zstandard compression, we can just report "Unknown {XZ|Zstandard}-compressed data".

jmarshall · 2021-10-05T13:09:11Z

Also fixes a minor cram_io.c memory leak I noticed while studying the lzma_* code.

jkbonfield · 2021-10-06T14:37:59Z

Thanks.

I'm curious where you're going with this, but useful additions to have.

jmarshall · 2021-10-06T21:02:44Z

No particular destination in mind 😄

Both XZ and Zstd formats are block-orientated and have the ability to embed metadata blocks that won't appear in the decompressed stream — though the lzma and zstd libraries have only embryonic API support for random access. (Both those statements based on a cursory reading for both and may be inaccurate/outdated.)

Hence I can imagine a future in which one of these supplants BGZF, ideally being used in a fairly vanilla way and with the index embedded (rather than as a separate .bai file). Listing them in htsCompression and recognising them is a tiny prerequisite of this. But me, I'm not currently doing anything towards this.

38 · 2021-10-13T22:12:40Z

Hi @jmarshall,

the PR looks really great, I am very happy to see htslib is supporting d4 format!

We are assuming you folks are interested in adding D4 reading/writing ability later. We are very glad to help with that.
Currently we only have the Rust implementation, but we are not sure if you folks want a C implementation of D4 completely or you are interested in using the C binding of the Rust code.

Also we are very happy to either supporting the C binding or help on implementing a parser in C.

Thanks for working on this and please let us know if you need any help.

jmarshall · 2021-10-13T22:38:34Z

This PR is only a very tiny step towards reading D4.

To my mind, in order to be a stable serious format with proven interchange ability, D4 would need to have more than one implementation.* So from that point of view, IMHO a C implementation as part of or usable from htslib would be more interesting than using a binding to the (solitary) Rust implementation. To be worth investing that effort, D4 would need to be in widespread use — to be sure, this is something of a chicken and egg problem…

(The htslib maintainers' opinions are of more interest than mine, of course.)

* Are D4 files intended to be archived or shared with others? i.e., are they akin to BAM/CRAM which you would store long-term, or are they more something you would generate temporarily during your local analysis of a sample, somewhat akin to a BAI index? (If the latter, the need for stability, interchange, and multiple implementations might be a bit lower.)

arq5x · 2021-10-26T14:51:16Z

@jmarshall you can think of D4 as a direct alternative to bigwig. Not an index, but a complete representation of quantitative genomics (e.g., depths from RNA-seq, WGS, ChIP, etc.) experiments.

arq5x · 2021-10-26T14:58:17Z

Also, I discussed a C implementation of the format with @38. He feels that a single-threaded version would not be difficult, but a multi-threaded version that achieves the R/W performance of the Rust implementation would be much harder. That said, a single-threaded version is likely fine for this purpose.

jmarshall added 4 commits October 5, 2021 11:51

Fix lzma_mem_inflate() error handling memory leak

5a806f5

Add basic recognition of D4 quantitative interval format

95e2d6d

jmarshall force-pushed the xz-zstd branch from 0690c37 to 95e2d6d Compare October 5, 2021 13:08

daviesrob assigned jkbonfield Oct 5, 2021

jkbonfield merged commit 8b38a1b into samtools:develop Oct 6, 2021

jmarshall deleted the xz-zstd branch October 6, 2021 20:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

jmarshall commented Oct 5, 2021

jmarshall commented Oct 5, 2021 •

edited

Loading

jkbonfield commented Oct 6, 2021

jmarshall commented Oct 6, 2021

38 commented Oct 13, 2021 •

edited

Loading

jmarshall commented Oct 13, 2021 •

edited

Loading

arq5x commented Oct 26, 2021

arq5x commented Oct 26, 2021

Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

Conversation

jmarshall commented Oct 5, 2021

jmarshall commented Oct 5, 2021 • edited Loading

jkbonfield commented Oct 6, 2021

jmarshall commented Oct 6, 2021

38 commented Oct 13, 2021 • edited Loading

jmarshall commented Oct 13, 2021 • edited Loading

arq5x commented Oct 26, 2021

arq5x commented Oct 26, 2021

jmarshall commented Oct 5, 2021 •

edited

Loading

38 commented Oct 13, 2021 •

edited

Loading

jmarshall commented Oct 13, 2021 •

edited

Loading