Cram fuzzing improvements. #1701
Conversation
This is triggered when our container has only unmapped reads, but those reads have CIGAR strings consuming no bases (eg 0M). Fixes samtools#1691 Many thanks to Octavio Galland for reporting this and the subsequent commits covering other CRAM-write fuzzing problems.
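For illustration only, a minimal sketch of the condition involved, using htslib's public bam_cigar2rlen() helper; this is not the actual patch, just a check for a read whose CIGAR consumes no reference bases.

```c
#include <htslib/sam.h>

/* Sketch: report whether a read's CIGAR consumes zero reference bases
 * (eg an unmapped read carrying "0M"), the situation described above. */
static int consumes_no_ref_bases(const bam1_t *b) {
    return bam_cigar2rlen(b->core.n_cigar, bam_get_cigar(b)) == 0;
}
```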
If we write to stdout and get an error, samtools error printing does an hts_flush of stdout, followed by an hts_close which then also does its own flush. We weren't resetting fd->ctr to NULL, causing a second attempt to encode the container, even though by then we've already discarded the c->bams array. We double-check that too for safety, but it should be unnecessary now that we correctly free and nullify the container in cram_flush rather than relying on it being done in cram_close, so multiple flushes don't trigger this bug.
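A sketch of that guard, using simplified stand-in types rather than htslib's real cram_fd/cram_container structures; only the fd->ctr handling reflects the commit message above.

```c
/* Stand-ins for the CRAM types; flush_container is a placeholder for
 * the real encode/write/free step. */
typedef struct container container;
typedef struct { container *ctr; } fd_like;

int flush_container(fd_like *fd);   /* placeholder: encodes and frees fd->ctr */

static int flush_once(fd_like *fd) {
    if (!fd->ctr)
        return 0;                   /* already flushed; second call is a no-op */
    int ret = flush_container(fd);
    fd->ctr = NULL;                 /* the key fix: never encode the same
                                       container twice on flush + close */
    return ret;
}
```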
Most of the CRAM code (and hopefully all) already has hts_pos_t throughout when manipulating positions, but for some reason we never updated the structures. Surprising given I thought I'd tested CRAM 4 as working on long references. (They're already 64-bit in io_lib.) The code that fills out these variables already very carefully calls the 32-bit (CRAM3) or 64-bit (CRAM4) query function and then copies the variable over, so that should still be fine. Fixes samtools#1692
Long references shouldn't silently produce a wrapped-around CRAM header and then get detected as erroneous in decode.
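A small illustration of the copy pattern described above; the struct and function names here are placeholders, not htslib's, and only the hts_pos_t widening is taken from the commit.

```c
#include <htslib/hts.h>   /* hts_pos_t: htslib's 64-bit position type */
#include <stdint.h>

/* Placeholder record: the position field is hts_pos_t rather than a
 * 32-bit integer, so long references no longer wrap around. */
typedef struct { hts_pos_t apos; } example_record;

/* The decode path reads into a width-appropriate local (32-bit for
 * CRAM 3, 64-bit for CRAM 4) and copies it over; the int32_t ->
 * hts_pos_t widening is lossless. */
static void store_pos32(example_record *cr, int32_t v) { cr->apos = v; }
static void store_pos64(example_record *cr, int64_t v) { cr->apos = v; }
```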
We now continue the close process, freeing up everything else as we go.
Fixes samtools#1693 which read 1 byte beyond the end of the allocated c->ref array.
These are shared refs between containers / slices, but auto-generated per container. We set c->ref_free in encode_container, but didn't check this in all cases. Additionally fixed memory leaks. Fixes samtools#1696
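A sketch of the ownership check this describes, with a simplified stand-in type; the c->ref and c->ref_free names come from the commit message, the rest is illustrative.

```c
#include <stdlib.h>

/* Stand-in for the container: the reference buffer may be shared
 * between containers/slices or auto-generated per container, so it is
 * only freed when this container owns it. */
typedef struct {
    char *ref;       /* reference bases used while encoding */
    int   ref_free;  /* non-zero if this container owns (and must free) ref */
} container_like;

static void drop_container_ref(container_like *c) {
    if (c->ref && c->ref_free)
        free(c->ref);            /* owned: release it here */
    c->ref = NULL;               /* shared or owned: never reuse the pointer */
}
```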
This is OK for CRAM 4, but not CRAM 3 and triggers an assertion. This is now a proper error return instead. Fixes samtools#1697
This was checked for with aligned data, but unmapped "placed" reads could start beyond the reference end and cr->aend was dutifully set where it claimed to start. This could cause buffer read-overruns when computing the reference md5. Fixes samtools#1698, fixes samtools#1700
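An illustrative clamp, not the actual patch: the idea is that any derived end coordinate is bounded by the reference length before it is used to index c->ref for the MD5 calculation. Variable names follow the commit text loosely.

```c
#include <htslib/hts.h>

/* Sketch: keep [apos, aend] inside the reference so unmapped "placed"
 * reads claiming a start beyond the reference end cannot cause a
 * read-overrun when hashing the reference. */
static hts_pos_t clamp_aend(hts_pos_t apos, hts_pos_t aend, hts_pos_t ref_len) {
    if (apos > ref_len) apos = ref_len;
    if (aend > ref_len) aend = ref_len;   /* never index beyond c->ref */
    if (aend < apos)    aend = apos;      /* keep the interval well-formed */
    return aend;
}
```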
Slowly beating this into shape. The good news is that I can confirm that cherry-picking the last commit here (hts_open_fuzzer) onto develop quite quickly detected one of the recently reported fuzzing issues, so I have confidence that this is exercising a much better portion of our code base. With this PR it hasn't (YET!) failed though, so it's looking good for us to be able to detect more problems. (I'm not naive enough to assume they've all been discovered.)

I haven't yet determined whether we need to make the same change for the BAM writer, as there may also be some things it throws up as errors that SAM does not, such as out-of-bounds reference coordinates. In theory that ought to trigger a similar issue, with an appropriately similar test harness fix.
If we have a few reads with 64-bit positions wildly far apart, we could attempt to allocate huge amounts of memory. The embedded reference isn't even helpful in this scenario, so we can limit this allocation to something more respectable. Fixes samtools#1699
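A sketch of such a limit, not the real code: refuse to build an embedded-reference buffer when the span between the first and last position exceeds a fixed cap and fall back to non-reference encoding. The cap value here is illustrative (the discussion later in this thread mentions a span of roughly 100Mb).

```c
#include <stdint.h>

#define MAX_EMBED_REF_SPAN (100*1000*1000)   /* illustrative cap, not htslib's */

/* Sketch: is the [first_pos, last_pos] span small enough to justify
 * allocating an embedded reference for it? */
static int embed_ref_span_ok(int64_t first_pos, int64_t last_pos) {
    if (last_pos < first_pos)
        return 0;
    return last_pos - first_pos + 1 <= MAX_EMBED_REF_SPAN;
}
```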
This tests many more code paths (encoding).
The c->bams list is normally moved over to the spare_bams list on success, but on encode failure it's still there, so we need to free it. Additionally, failure to encode a set of aux tags leaves some aux blocks still in use in the container which haven't yet been migrated over to s->aux_blocks.
Commit 0ab97e3 limited memory use for excessively large spanning slices. This could be extremely sparse data, in which case embedding a reference isn't helpful, or it could be deep data with a huge gap (eg spanning a centromere). Our maximum gap is still big - approx 100Mb span - so this is unlikely to happen. However if we do trigger it, we never get the option of going back from no-ref to embed-ref=2 (ie own consensus) again. This commit adds a counter of the number of times we've recently failed to generate a reference, incrementing or decrementing it accordingly. Once it hits a certain level we give up, as we did before. So, for example, name-sorted data will try 5 times and then never attempt to use embed_ref again.
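A sketch of that heuristic under the description above: failures push a counter up, successes pull it back down, and past a threshold we permanently fall back to no-ref. The structure and function names are placeholders; only the "give up after 5 failures" figure comes from the commit text.

```c
/* Placeholder state tracking recent embedded-reference failures. */
typedef struct {
    int embed_ref_fail;      /* recent-failure score */
    int embed_ref_given_up;  /* set once we stop trying embed_ref */
} ref_state;

static void note_embed_ref_result(ref_state *st, int ok) {
    if (st->embed_ref_given_up)
        return;
    if (ok) {
        if (st->embed_ref_fail > 0)
            st->embed_ref_fail--;            /* recover slowly on success */
    } else if (++st->embed_ref_fail >= 5) {  /* eg name-sorted data */
        st->embed_ref_given_up = 1;          /* fall back to no-ref for good */
    }
}
```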
SAM aux tags have codes matching [A-Za-z][A-Za-z0-9]. Our content ids also include the type, but we can't have the same code with more than one type, so the maximum number of tags possible is 52 x 62 = 3224. Each code could be using multiple blocks (eg via byte-array-len), and we have other blocks too for the fixed fields, but a maximum of 10,000 possible blocks in any slice is a generous upper bound. This avoids memory issues with files where num_content_ids is in the hundreds of millions. We already blocked memory wrap-around and malloc woes, but it's safer to be more restrictive. This also fixes out-of-memory errors in fuzz testing. Credit to OSS-Fuzz. Fixes oss-fuzz 61840
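The bound itself is a one-line sanity check; a sketch, with the constant name invented here for illustration:

```c
#define MAX_SLICE_CONTENT_IDS 10000   /* generous ceiling per the reasoning above */

/* Sketch: reject a declared num_content_ids that no well-formed slice
 * could need, treating it as corrupt (or fuzzed) input. */
static int content_id_count_ok(long num_content_ids) {
    return num_content_ids >= 0 && num_content_ids <= MAX_SLICE_CONTENT_IDS;
}
```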
Doing a write-SAM, write-BAM, write-CRAM misses some issues as we can bail out of the loop early if one of the other formats fails. The full test really needs repeating the entire thing three times over. So the main interface now takes the memory block and redoes all the mem: opening itself. It's more work, but it also covers more paths.
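The rough shape of that change, not the real harness code: the whole open/copy/close cycle is repeated per output format instead of sharing one open across all three, so an early failure no longer skips the later formats. The two helpers are placeholders for the harness's own "mem:" open and copy logic.

```c
#include <stddef.h>
#include <stdint.h>
#include <htslib/hts.h>

/* Placeholders standing in for the harness's own helpers. */
htsFile *open_input_from_memory(const uint8_t *data, size_t size);
void copy_records_to_format(htsFile *in, const char *out_mode);

static void fuzz_write_formats(const uint8_t *data, size_t size) {
    /* "w" = SAM, "wb" = BAM, "wc" = CRAM output modes. */
    static const char *out_modes[] = { "w", "wb", "wc" };

    for (size_t i = 0; i < sizeof(out_modes)/sizeof(*out_modes); i++) {
        htsFile *in = open_input_from_memory(data, size);
        if (!in)
            continue;             /* one format failing no longer skips the rest */
        copy_records_to_format(in, out_modes[i]);
        hts_close(in);
    }
}
```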
2GB is too high, as bcf is loading multiple things and that's the limit on total memory, not per allocation.
This is triggered when the aux data hasn't been validated and contains a partial buffer. Eg a non-NUL-terminated string, or an 'i' integer with < 4 bytes remaining in the buffer. This is a bit involved as the htslib API also lacks bounds checking in the utility functions like bam_aux2i. I added a local bounds-checking equivalent, but in time maybe this and other similar functions should be moved to become an official part of the API.
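An illustrative bounds-checked reader in that spirit (this is not the PR's actual helper): only decode an 'i' (int32) value when the type byte plus all four value bytes are present, instead of trusting the buffer to be complete.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: BAM aux values are little-endian; for brevity this assumes a
 * little-endian host rather than using endian-conversion helpers. */
static int safe_aux2i32(const uint8_t *aux, size_t remaining, int32_t *out) {
    if (remaining < 1 + 4 || aux[0] != 'i')
        return -1;                 /* truncated, or not a 32-bit int tag */
    memcpy(out, aux + 1, 4);       /* memcpy avoids unaligned access */
    return 0;
}
```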
This is resisting fuzzing better since the last update. I will merge the progress so far.
Thanks. I think that's the right decision as it's all a series of incremental steps and I don't think this introduces new bugs (I hope not anyway). If we find a bunch more, I'll start a cram-fuzz-2 PR :) Let's hope it'll be a while coming!
Fixes a raft of CRAM errors, with thanks to Octavio Galland for finding and reporting.
Draft as I have a few memory leaks when erroring out after failure to encode.