
Cram fuzzing improvements. #1701

Merged
daviesrob merged 17 commits into samtools:develop from cram_fuzz on Nov 23, 2023

Conversation

jkbonfield
Contributor

Fixes a raft of CRAM errors, with thanks to Octavio Galland for finding and reporting.

Draft, as I still have a few memory leaks when erroring out after a failure to encode.

This is triggered when our container has only unmapped reads, but those
reads have CIGAR strings consuming no bases (eg 0M).

Fixes samtools#1691

Many thanks to Octavio Galland for reporting this and the other
CRAM-write fuzzing problems covered by the subsequent commits.
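
For illustration, a hypothetical SAM record of the shape that triggered this (invented data, not from the original report; fields are tab-separated). The read is unmapped (flag 0x4) yet carries a CIGAR whose only operation consumes zero bases, something only unvalidated input would contain:

```
r1	4	ref1	100	0	0M	*	0	0	ACGT	IIII
```
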
If we write to stdout and get an error, samtools error printing does
an hts_flush of stdout, followed by an hts_close which then also does
its own flush.

We weren't resetting fd->ctr to NULL, causing a second attempt to
encode the container even though we'd already discarded the c->bams
array.  We double-check that too for safety, but it should be
unnecessary now that we correctly free and nullify the container in
cram_flush rather than relying on cram_close to do it, so multiple
flushes no longer trigger this bug.
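
A minimal sketch of the free-and-nullify pattern described above. Illustrative only: the real cram_flush lives in cram/cram_io.c and differs in detail (threading, error paths), though cram_flush_container, cram_free_container, and fd->ctr are the genuine names.

```c
#include "cram/cram.h"   /* internal htslib headers: cram_fd etc. */

/* Sketch: flush the pending container exactly once. */
int cram_flush_sketch(cram_fd *fd) {
    if (fd->ctr) {
        int ret = cram_flush_container(fd, fd->ctr);
        cram_free_container(fd->ctr);
        fd->ctr = NULL;   /* a later flush from hts_close finds nothing */
        if (ret < 0)
            return -1;
    }
    return 0;
}
```
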
Most of the CRAM code (and hopefully all of it) already uses hts_pos_t
throughout when manipulating positions, but for some reason we never
updated the structures.  Surprising, given I thought I'd tested CRAM 4
as working on long references.  (They're already 64-bit in io_lib.)

The code that fills out these variables already very carefully calls
the 32-bit (CRAM3) or 64-bit (CRAM4) query function and then copies
the variable over, so that should still be fine.

Fixes samtools#1692
Long references shouldn't silently produce a wrapped around CRAM
header and then get detected as erroneous in decode.
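
A sketch of the shape of the change, widening position fields that were previously 32-bit. The struct and field names below are invented for illustration, not the actual members:

```c
#include "htslib/hts.h"   /* hts_pos_t is int64_t */

/* Illustrative only: positions beyond 2^31-1 no longer wrap. */
typedef struct {
    hts_pos_t ref_seq_start;  /* was int32_t */
    hts_pos_t ref_seq_span;   /* was int32_t */
} cram_header_pos_sketch;
```
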
We now continue the close process, freeing up everything else as we go.
Fixes samtools#1693 which read 1 byte beyond the end of the allocated c->ref
array.
These are shared refs between containers / slices, but auto-generated
per container.  We set c->ref_free in encode_container, but didn't
check this in all cases.  Additionally fixed memory leaks.

Fixes samtools#1696
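
A sketch of the ownership test described above: free c->ref only when this container auto-generated it (c->ref and c->ref_free are named in the commit text; the helper itself is hypothetical and simplified):

```c
#include <stdlib.h>
#include "cram/cram.h"   /* cram_container (internal header) */

/* Shared references are owned and freed elsewhere; only free our own. */
static void release_container_ref_sketch(cram_container *c) {
    if (c->ref && c->ref_free)
        free(c->ref);
    c->ref = NULL;   /* avoid dangling pointers to shared references */
}
```
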
This is OK for CRAM 4, but not CRAM 3, where it triggers an assertion.
This is now a proper error return instead.

Fixes samtools#1697
This was checked for with aligned data, but unmapped "placed" reads
could start beyond the reference end, and cr->aend was dutifully set
to where the read claimed to start.  This could cause buffer
read-overruns when computing the reference md5.

Fixes samtools#1698, fixes samtools#1700
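
A sketch of the bounds clamp this implies (names and signature are hypothetical; the real fix sits inside the CRAM encoder):

```c
#include "htslib/hts.h"   /* hts_pos_t */

/* Clamp the alignment end of a "placed" unmapped read so the
 * reference-MD5 code never reads past the end of c->ref. */
static hts_pos_t clamp_aend_sketch(hts_pos_t apos, hts_pos_t aend,
                                   hts_pos_t ref_len) {
    if (apos > ref_len)
        return ref_len;   /* read claims to start off the end entirely */
    return aend > ref_len ? ref_len : aend;
}
```
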
jkbonfield force-pushed the cram_fuzz branch 4 times, most recently from 54af36e to e7b7ccd on November 17, 2023 12:34
jkbonfield
Contributor Author

jkbonfield commented Nov 17, 2023

Slowly beating this into shape. The good news is that I can confirm that cherry-picking the last commit here (hts_open_fuzzer) onto develop quite quickly detected one of the recently reported fuzzing issues, so I'm confident that this is exercising a much better portion of our code base.

With this PR, it hasn't (YET!) failed though, so it's looking good for us to be able to detect more problems. (I'm not naive enough to assume they've all been discovered.)

To anyone curious about hts_close_or_abort vs hts_close on CRAM: this is because there are records which we can write to SAM but cannot write to CRAM. One such example is a tag of "XX:d:0.1". This isn't SAM compliant, and CRAM rejects it, but SAM and BAM have an undocumented extension to store that format. Couple this with CRAM's encoding strategy of buffering up a bunch of BAM records and then transcribing them as a block once enough have accumulated: all the cram_write calls can succeed, as each is simply adding to the to-do list, and the cram_encode_container triggered by the flush from closing is where the errors then get detected.
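
For example, a record carrying that exact tag (the rest of the record is hypothetical; fields are tab-separated) writes fine as SAM or BAM via the undocumented 'd' (double) type, is queued without error by cram_write, and is only rejected later in cram_encode_container:

```
r1	0	ref1	100	60	4M	*	0	0	ACGT	IIII	XX:d:0.1
```
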

I haven't yet determined whether we need to make the same change for the BAM writer, as there may also be some things it reports as errors that SAM does not, such as out-of-bounds reference coordinates. In theory that ought to trigger a similar issue, with an appropriately similar test-harness fix.

If we have a few reads with 64-bit positions wildly far apart, we could
attempt to allocate huge amounts of memory.  The embedded reference isn't
even helpful in this scenario, so we can limit this allocation to
something more respectable.

Fixes samtools#1699
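
A sketch of the cap (the ~100Mb figure comes from a later commit message in this PR; the constant and check here are illustrative, not the committed code):

```c
#include "htslib/hts.h"   /* hts_pos_t */

#define MAX_EMBED_REF_SPAN (100*1024*1024)   /* approx 100Mb, see below */

/* For a handful of wildly spaced reads, a span-sized embedded-reference
 * buffer is useless; refuse rather than allocate it. */
static int embed_ref_span_ok(hts_pos_t first, hts_pos_t last) {
    return last - first + 1 <= MAX_EMBED_REF_SPAN;
}
```
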
This tests many more code paths (encoding).
jkbonfield force-pushed the cram_fuzz branch 3 times, most recently from 0f4431c to be16302 on November 21, 2023 10:10
daviesrob self-assigned this Nov 21, 2023
The c->bams list is normally moved over to the spare_bams list on
success, but on an encode failure it's still there, so we need to free it.

Additionally, failure to encode a set of aux tags leaves some aux
blocks still in use in the container which haven't yet been migrated
over to s->aux_blocks.
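
A sketch of the failure-path cleanup (c->bams is from the text above; the loop bound c->curr_rec and the helper are assumptions, and the real code is more involved):

```c
#include <stdlib.h>
#include "cram/cram.h"      /* cram_container (internal header) */
#include "htslib/sam.h"     /* bam1_t, bam_destroy1 */

/* On success c->bams is handed to the spare_bams pool, so this only
 * runs when encoding failed and the records were left behind. */
static void free_failed_bams_sketch(cram_container *c) {
    if (!c->bams)
        return;
    for (int i = 0; i < c->curr_rec; i++)
        bam_destroy1((bam1_t *)c->bams[i]);
    free(c->bams);
    c->bams = NULL;
}
```
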
Commit 0ab97e3 limited memory use for excessively large spanning
slices.  This could be extremely sparse data, in which case embedding
a reference isn't helpful, or it could be deep data with a huge gap
(eg spanning a centromere).  Our maximum gap is still big - approx
100Mb span - so this is unlikely to happen.  However if we do trigger
it, we never get the option of going back from no-ref to embed-ref=2
(ie own consensus) again.

This commit adds a counter on the number of times we've recently
failed to generate a reference, incrementing or decrementing it
accordingly.  Once it hits a certain level we give up, as we did
before.  So, for example, name-sorted data will try 5 times and then
never attempt to use embed_ref again.
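
A sketch of that heuristic (the threshold of 5 is from the text above; the struct, field names, and helper are invented for illustration):

```c
typedef struct {
    int embed_ref;         /* 2 = generate own consensus, 0 = disabled */
    int embed_ref_fails;   /* recent failures to build a reference */
} embed_ref_state;

/* Decay the counter on success so a transient problem doesn't disable
 * embed_ref forever, but give up for good once failures accumulate. */
static void update_embed_ref(embed_ref_state *st, int failed) {
    if (failed) {
        if (++st->embed_ref_fails >= 5)
            st->embed_ref = 0;        /* stop trying altogether */
    } else if (st->embed_ref_fails > 0) {
        st->embed_ref_fails--;        /* success earns back credit */
    }
}
```
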
SAM aux tags have codes matching [A-Za-z][A-Za-z0-9].  Our content ids
also include the type, but we can't have the same code with more than
one type, so the maximum number of tags possible is 52 × 62 = 3224.
Each code could be using multiple blocks (eg via byte-array-len), and
we have other blocks too for the fixed fields, but a maximum of 10,000
possible blocks in any slice is a generous upper bound.

This avoids memory issues with files where num_content_ids is 100s of
millions.  We already blocked memory wrap-around and malloc woes, but
it's safer to be more restrictive.  This also fixes out-of-memory
errors in fuzz testing.

Credit to OSS-Fuzz
Fixes oss-fuzz 61840
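
A sketch of the resulting sanity check (the 10,000 constant is from the text above; the function itself is illustrative):

```c
#include <stdint.h>

#define MAX_SLICE_BLOCKS 10000   /* generous: 3224 tag codes + extras */

/* Reject absurd num_content_ids values up front instead of attempting
 * an allocation sized by hostile or corrupt input. */
static int check_num_content_ids(int64_t n) {
    return (n >= 0 && n <= MAX_SLICE_BLOCKS) ? 0 : -1;
}
```
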
Doing write-SAM, write-BAM, write-CRAM in one pass misses some issues,
as we can bail out of the loop early if one of the other formats
fails.  The full test really needs to repeat the entire thing 3 times
over, so the main interface now takes the memory block and redoes all
of the mem: opening.  More work, but it also covers more paths.
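
A sketch of the restructured flow (the helper, its signature, and the open modes are assumptions; the real harness is hts_open_fuzzer in the htslib test suite):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: opens the fuzz input afresh via a mem: URL and
 * copies every record to a scratch output in the given open mode. */
void copy_all_records(const uint8_t *data, size_t size, const char *mode);

/* One full pass per output format, so a CRAM-write failure can no
 * longer cut short the SAM and BAM passes. */
static void fuzz_all_formats(const uint8_t *data, size_t size) {
    static const char *modes[] = { "w", "wb", "wc" };  /* SAM, BAM, CRAM */
    for (size_t i = 0; i < sizeof(modes)/sizeof(*modes); i++)
        copy_all_records(data, size, modes[i]);
}
```
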
2GB is too high, as bcf loads multiple things and that's a limit on
total memory, not per allocation.
jkbonfield force-pushed the cram_fuzz branch 2 times, most recently from 1519d48 to 6de54cd on November 23, 2023 13:09
This is triggered when the aux data hasn't been validated and contains
a partial buffer.  Eg a non-null-terminated string, or an 'i' integer
with < 4 bytes remaining in the buffer.

This is a bit involved, as the htslib API also lacks bounds checking in
utility functions like bam_aux2i.  I added a local bounds-checking
equivalent, but in time maybe this and other similar functions should
be moved to become an official part of the API.
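
A minimal sketch of such a bounds-checked fetch for an 'i' (int32) value (the helper is hypothetical; the real bam_aux2i takes no end pointer, which is exactly the gap described):

```c
#include <stdint.h>

/* Check that the full 4-byte payload is present before decoding,
 * unlike a blind bam_aux2i call on unvalidated data. */
static int aux2i_checked(const uint8_t *s, const uint8_t *end, int32_t *val) {
    if (end - s < 5 || *s != 'i')
        return -1;   /* wrong type, or < 4 bytes follow the type byte */
    /* BAM aux integers are little-endian; assemble portably */
    uint32_t u = (uint32_t)s[1] | (uint32_t)s[2] << 8
               | (uint32_t)s[3] << 16 | (uint32_t)s[4] << 24;
    *val = (int32_t)u;
    return 0;
}
```
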
daviesrob marked this pull request as ready for review November 23, 2023 16:01
daviesrob
Member

This is resisting the fuzzing better since the last update. I will merge the progress so far.

daviesrob merged commit 4bb1ddf into samtools:develop Nov 23, 2023
8 checks passed
jkbonfield
Contributor Author

Thanks. I think that's the right decision as it's all a series of incremental steps and I don't think this introduces new bugs (I hope not anyway). If we find a bunch more, I'll start a cram-fuzz-2 PR :) Let's hope it'll be a while coming!
