-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add basic recognition for XZ- & Zstd-compressed data, and D4 format #1340
Conversation
In the unlikely event of an error from decompress_peek() (now renamed in anticipation of similar functions for other compression formats), ensure hts_detect_format() also returns an error rather than incorrectly detecting the contents as being empty.
Similarly to hts_detect_format()'s basic detection of bzip2-compressed data, recognise the magic numbers of these two formats. If HAVE_LIBLZMA is defined, we can decompress the first few KB (as we do for gzipped data) to detect the contents. Otherwise, and for Zstandard compression, we can just report "Unknown {XZ|Zstandard}-compressed data".
Also fixes a minor cram_io.c memory leak I noticed while studying the |
Thanks. I'm curious where you're going with this, but useful additions to have. |
No particular destination in mind 😄 Both XZ and Zstd formats are block-orientated and have the ability to embed metadata blocks that won't appear in the decompressed stream — though the lzma and zstd libraries have only embryonic API support for random access. (Both those statements based on a cursory reading for both and may be inaccurate/outdated.) Hence I can imagine a future in which one of these supplants BGZF, ideally being used in a fairly vanilla way and with the index embedded (rather than as a separate .bai file). Listing them in |
Hi @jmarshall, the PR looks really great, I am very happy to see htslib is supporting d4 format! We are assuming you folks are interested in adding D4 reading/writing ability later. We are very glad to help with that. Also we are very happy to either supporting the C binding or help on implementing a parser in C. Thanks for working on this and please let us know if you need any help. |
This PR is only a very tiny step towards reading D4. To my mind, in order to be a stable serious format with proven interchange ability, D4 would need to have more than one implementation.* So from that point of view, IMHO a C implementation as part of or usable from htslib would be more interesting than using a binding to the (solitary) Rust implementation. To be worth investing that effort, D4 would need to be in widespread use — to be sure, this is something of a chicken and egg problem… (The htslib maintainers' opinions are of more interest than mine, of course.) * Are D4 files intended to be archived or shared with others? i.e., are they akin to BAM/CRAM which you would store long-term, or are they more something you would generate temporarily during your local analysis of a sample, somewhat akin to a BAI index? (If the latter, the need for stability, interchange, and multiple implementations might be a bit lower.) |
@jmarshall you can think of D4 as a direct alternative to bigwig. Not an index, but a complete representation of quantitative genomics (e.g., depths from RNA-seq, WGS, ChIP, etc.) experiments. |
Also, I discussed a C implementation of the format with @38. He feels that a single-threaded version would not be difficult, but a multi-threaded version that achieves the R/W performance of the Rust implementation would be much harder. That said, a single-threaded version is likely fine for this purpose. |
There has been some recent discussion of using Zstandard compression (cf #530).
As a very small first step towards this, this adds
xz_compression
andzstd_compression
toenum htsCompression
and recognises their magic numbers, as we do forbzip2_compression
. As htslib already uses liblzma, this also addsdecompress_peek_xz()
(if liblzma is configured in) to peek at the contents. Unlike libbz2's decompression routines, anlzma_stream_decoder
can decompress a few hundred bytes of a stream without needing to read in an entire multi-hundred-KB block — hence is suitable for use with peeking.There is also some talk of an intersection between D4 and htsget (though IMHO I'm not yet sure this intersection is non-empty), and of upcoming work in that area. So this also adds basic recognition of D4 format. Note that the D4 spec is fairly incomplete and (as far as I am aware) the format currently has only one implementation. So you may wish to leave the D4 commit off for now. This PR does not set
version.major
/.minor
for D4; see 38/d4-format#34.