Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340

Merged
merged 4 commits into from
Oct 6, 2021

Conversation

jmarshall
Copy link
Member

There has been some recent discussion of using Zstandard compression (cf #530).

As a very small first step towards this, this adds xz_compression and zstd_compression to enum htsCompression and recognises their magic numbers, as we do for bzip2_compression. As htslib already uses liblzma, this also adds decompress_peek_xz() (if liblzma is configured in) to peek at the contents. Unlike libbz2's decompression routines, an lzma_stream_decoder can decompress a few hundred bytes of a stream without needing to read in an entire multi-hundred-KB block — hence is suitable for use with peeking.

There is also some talk of an intersection between D4 and htsget (though IMHO I'm not yet sure this intersection is non-empty), and of upcoming work in that area. So this also adds basic recognition of D4 format. Note that the D4 spec is fairly incomplete and (as far as I am aware) the format currently has only one implementation. So you may wish to leave the D4 commit off for now. This PR does not set version.major/.minor for D4; see 38/d4-format#34.

In the unlikely event of an error from decompress_peek() (now renamed
in anticipation of similar functions for other compression formats),
ensure hts_detect_format() also returns an error rather than incorrectly
detecting the contents as being empty.
Similarly to hts_detect_format()'s basic detection of bzip2-compressed
data, recognise the magic numbers of these two formats. If HAVE_LIBLZMA
is defined, we can decompress the first few KB (as we do for gzipped
data) to detect the contents. Otherwise, and for Zstandard compression,
we can just report "Unknown {XZ|Zstandard}-compressed data".
@jmarshall
Copy link
Member Author

jmarshall commented Oct 5, 2021

Also fixes a minor cram_io.c memory leak I noticed while studying the lzma_* code.

@jkbonfield jkbonfield merged commit 8b38a1b into samtools:develop Oct 6, 2021
@jkbonfield
Copy link
Contributor

Thanks.

I'm curious where you're going with this, but useful additions to have.

@jmarshall jmarshall deleted the xz-zstd branch October 6, 2021 20:06
@jmarshall
Copy link
Member Author

No particular destination in mind 😄

Both XZ and Zstd formats are block-orientated and have the ability to embed metadata blocks that won't appear in the decompressed stream — though the lzma and zstd libraries have only embryonic API support for random access. (Both those statements based on a cursory reading for both and may be inaccurate/outdated.)

Hence I can imagine a future in which one of these supplants BGZF, ideally being used in a fairly vanilla way and with the index embedded (rather than as a separate .bai file). Listing them in htsCompression and recognising them is a tiny prerequisite of this. But me, I'm not currently doing anything towards this.

@38
Copy link

38 commented Oct 13, 2021

Hi @jmarshall,

the PR looks really great, I am very happy to see htslib is supporting d4 format!

We are assuming you folks are interested in adding D4 reading/writing ability later. We are very glad to help with that.
Currently we only have the Rust implementation, but we are not sure if you folks want a C implementation of D4 completely or you are interested in using the C binding of the Rust code.

Also we are very happy to either supporting the C binding or help on implementing a parser in C.

Thanks for working on this and please let us know if you need any help.

@jmarshall
Copy link
Member Author

jmarshall commented Oct 13, 2021

This PR is only a very tiny step towards reading D4.

To my mind, in order to be a stable serious format with proven interchange ability, D4 would need to have more than one implementation.* So from that point of view, IMHO a C implementation as part of or usable from htslib would be more interesting than using a binding to the (solitary) Rust implementation. To be worth investing that effort, D4 would need to be in widespread use — to be sure, this is something of a chicken and egg problem…

(The htslib maintainers' opinions are of more interest than mine, of course.)

* Are D4 files intended to be archived or shared with others? i.e., are they akin to BAM/CRAM which you would store long-term, or are they more something you would generate temporarily during your local analysis of a sample, somewhat akin to a BAI index? (If the latter, the need for stability, interchange, and multiple implementations might be a bit lower.)

@arq5x
Copy link
Contributor

arq5x commented Oct 26, 2021

@jmarshall you can think of D4 as a direct alternative to bigwig. Not an index, but a complete representation of quantitative genomics (e.g., depths from RNA-seq, WGS, ChIP, etc.) experiments.

@arq5x
Copy link
Contributor

arq5x commented Oct 26, 2021

Also, I discussed a C implementation of the format with @38. He feels that a single-threaded version would not be difficult, but a multi-threaded version that achieves the R/W performance of the Rust implementation would be much harder. That said, a single-threaded version is likely fine for this purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants