
Cram fuzzing improvements. #1701

Merged
daviesrob merged 17 commits into samtools:develop from cram_fuzz on Nov 23, 2023

Conversation

jkbonfield
Contributor

Fixes a raft of CRAM errors, with thanks to Octavio Galland for finding and reporting.

Draft, as I still have a few memory leaks when erroring out after a failure to encode.

This is triggered when our container has only unmapped reads, but those
reads have CIGAR strings consuming no bases (eg 0M).

Fixes samtools#1691

Many thanks to Octavio Galland for reporting this and the other
CRAM-write fuzzing problems covered by the subsequent commits.
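
For illustration, a hypothetical SAM record of the shape that triggered this (invented data, not from the original report; fields are tab-separated). The read is unmapped (flag 0x4) yet carries a CIGAR whose only operation consumes zero bases, something only unvalidated input would contain:

```
r1	4	ref1	100	0	0M	*	0	0	ACGT	IIII
```
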
If we write to stdout and get an error, samtools error printing does
an hts_flush of stdout, followed by an hts_close which then also does
its own flush.

We weren't resetting fd->ctr to NULL, causing a second attempt to
encode the container even though we'd already discarded the c->bams
array.  We double-check that too for safety, but it should be
unnecessary now that we correctly free and nullify the container in
cram_flush rather than relying on cram_close to do it, so multiple
flushes no longer trigger this bug.
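
A minimal sketch of the free-and-nullify pattern described above. Illustrative only: the real cram_flush lives in cram/cram_io.c and differs in detail (threading, error paths), though cram_flush_container, cram_free_container, and fd->ctr are the genuine names.

```c
#include "cram/cram.h"   /* internal htslib headers: cram_fd etc. */

/* Sketch: flush the pending container exactly once. */
int cram_flush_sketch(cram_fd *fd) {
    if (fd->ctr) {
        int ret = cram_flush_container(fd, fd->ctr);
        cram_free_container(fd->ctr);
        fd->ctr = NULL;   /* a later flush from hts_close finds nothing */
        if (ret < 0)
            return -1;
    }
    return 0;
}
```
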
Most of the CRAM code (and hopefully all of it) already uses hts_pos_t
throughout when manipulating positions, but for some reason we never
updated the structures.  Surprising, given I thought I'd tested CRAM 4
as working on long references.  (They're already 64-bit in io_lib.)

The code that fills out these variables already very carefully calls
the 32-bit (CRAM3) or 64-bit (CRAM4) query function and then copies
the variable over, so that should still be fine.

Fixes samtools#1692
Long references shouldn't silently produce a wrapped around CRAM
header and then get detected as erroneous in decode.
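
A sketch of the shape of the change, widening position fields that were previously 32-bit. The struct and field names below are invented for illustration, not the actual members:

```c
#include "htslib/hts.h"   /* hts_pos_t is int64_t */

/* Illustrative only: positions beyond 2^31-1 no longer wrap. */
typedef struct {
    hts_pos_t ref_seq_start;  /* was int32_t */
    hts_pos_t ref_seq_span;   /* was int32_t */
} cram_header_pos_sketch;
```
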
We now continue the close process, freeing up everything else as we go.
Fixes samtools#1693 which read 1 byte beyond the end of the allocated c->ref
array.
These are shared refs between containers / slices, but auto-generated
per container.  We set c->ref_free in encode_container, but didn't
check this in all cases.  Additionally fixed memory leaks.

Fixes samtools#1696
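
A sketch of the ownership test described above: free c->ref only when this container auto-generated it (c->ref and c->ref_free are named in the commit text; the helper itself is hypothetical and simplified):

```c
#include <stdlib.h>
#include "cram/cram.h"   /* cram_container (internal header) */

/* Shared references are owned and freed elsewhere; only free our own. */
static void release_container_ref_sketch(cram_container *c) {
    if (c->ref && c->ref_free)
        free(c->ref);
    c->ref = NULL;   /* avoid dangling pointers to shared references */
}
```
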
This is OK for CRAM 4, but not CRAM 3, where it triggers an assertion.
This is now a proper error return instead.

Fixes samtools#1697
This was checked for with aligned data, but unmapped "placed" reads
could start beyond the reference end, and cr->aend was dutifully set
to where the read claimed to start.  This could cause buffer
read-overruns when computing the reference md5.

Fixes samtools#1698, fixes samtools#1700
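
A sketch of the bounds clamp this implies (names and signature are hypothetical; the real fix sits inside the CRAM encoder):

```c
#include "htslib/hts.h"   /* hts_pos_t */

/* Clamp the alignment end of a "placed" unmapped read so the
 * reference-MD5 code never reads past the end of c->ref. */
static hts_pos_t clamp_aend_sketch(hts_pos_t apos, hts_pos_t aend,
                                   hts_pos_t ref_len) {
    if (apos > ref_len)
        return ref_len;   /* read claims to start off the end entirely */
    return aend > ref_len ? ref_len : aend;
}
```
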
jkbonfield force-pushed the cram_fuzz branch 4 times, most recently from 54af36e to e7b7ccd on November 17, 2023 12:34
jkbonfield
Contributor Author

jkbonfield commented Nov 17, 2023

Slowly beating this into shape. The good news is that I can confirm that cherry-picking the last commit here (hts_open_fuzzer) onto develop quite quickly detected one of the recently reported fuzzing issues, so I'm confident that this is exercising a much better portion of our code base.

With this PR, it hasn't (YET!) failed though, so it's looking good for us to be able to detect more problems. (I'm not naive enough to assume they've all been discovered.)

To anyone curious about hts_close_or_abort vs hts_close on CRAM: this is because there are records which we can write to SAM but cannot write to CRAM. One such example is a tag of "XX:d:0.1". This isn't SAM compliant, and CRAM rejects it, but SAM and BAM have an undocumented extension to store that format. Couple this with CRAM's encoding strategy of buffering up a bunch of BAM records and then transcribing them as a block once enough have accumulated: all the cram_write calls can succeed, as each is simply adding to the to-do list, and the cram_encode_container triggered by the flush from closing is where the errors then get detected.
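
For example, a record carrying that exact tag (the rest of the record is hypothetical; fields are tab-separated) writes fine as SAM or BAM via the undocumented 'd' (double) type, is queued without error by cram_write, and is only rejected later in cram_encode_container:

```
r1	0	ref1	100	60	4M	*	0	0	ACGT	IIII	XX:d:0.1
```
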

I haven't yet determined whether we need to make the same change for the BAM writer, as there may also be some things it reports as errors that SAM does not, such as out-of-bounds reference coordinates. In theory that ought to trigger a similar issue, with an appropriately similar test-harness fix.

If we have a few reads with 64-bit positions wildly far apart, we could
attempt to allocate huge amounts of memory.  The embedded reference isn't
even helpful in this scenario, so we can limit this allocation to
something more respectable.

Fixes samtools#1699
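
A sketch of the cap (the ~100Mb figure comes from a later commit message in this PR; the constant and check here are illustrative, not the committed code):

```c
#include "htslib/hts.h"   /* hts_pos_t */

#define MAX_EMBED_REF_SPAN (100*1024*1024)   /* approx 100Mb, see below */

/* For a handful of wildly spaced reads, a span-sized embedded-reference
 * buffer is useless; refuse rather than allocate it. */
static int embed_ref_span_ok(hts_pos_t first, hts_pos_t last) {
    return last - first + 1 <= MAX_EMBED_REF_SPAN;
}
```
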
This tests many more code paths (encoding).
jkbonfield force-pushed the cram_fuzz branch 3 times, most recently from 0f4431c to be16302 on November 21, 2023 10:10
daviesrob self-assigned this Nov 21, 2023
The c->bams list is normally moved over to the spare_bams list on
success, but on an encode failure it's still there, so we need to free it.

Additionally, failure to encode a set of aux tags leaves some aux
blocks still in use in the container which haven't yet been migrated
over to s->aux_blocks.
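
A sketch of the failure-path cleanup (c->bams is from the text above; the loop bound c->curr_rec and the helper are assumptions, and the real code is more involved):

```c
#include <stdlib.h>
#include "cram/cram.h"      /* cram_container (internal header) */
#include "htslib/sam.h"     /* bam1_t, bam_destroy1 */

/* On success c->bams is handed to the spare_bams pool, so this only
 * runs when encoding failed and the records were left behind. */
static void free_failed_bams_sketch(cram_container *c) {
    if (!c->bams)
        return;
    for (int i = 0; i < c->curr_rec; i++)
        bam_destroy1((bam1_t *)c->bams[i]);
    free(c->bams);
    c->bams = NULL;
}
```
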
Commit 0ab97e3 limited memory use for excessively large spanning
slices.  This could be extremely sparse data, in which case embedding
a reference isn't helpful, or it could be deep data with a huge gap
(eg spanning a centromere).  Our maximum gap is still big - approx
100Mb span - so this is unlikely to happen.  However if we do trigger
it, we never get the option of going back from no-ref to embed-ref=2
(ie own consensus) again.

This commit adds a counter on the number of times we've recently
failed to generate a reference, incrementing or decrementing it
accordingly.  Once it hits a certain level we give up, as we did
before.  So, for example, name-sorted data will try 5 times and then
never attempt to use embed_ref again.
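
A sketch of that heuristic (the threshold of 5 is from the text above; the struct, field names, and helper are invented for illustration):

```c
typedef struct {
    int embed_ref;         /* 2 = generate own consensus, 0 = disabled */
    int embed_ref_fails;   /* recent failures to build a reference */
} embed_ref_state;

/* Decay the counter on success so a transient problem doesn't disable
 * embed_ref forever, but give up for good once failures accumulate. */
static void update_embed_ref(embed_ref_state *st, int failed) {
    if (failed) {
        if (++st->embed_ref_fails >= 5)
            st->embed_ref = 0;        /* stop trying altogether */
    } else if (st->embed_ref_fails > 0) {
        st->embed_ref_fails--;        /* success earns back credit */
    }
}
```
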
SAM aux tags have codes matching [A-Za-z][A-Za-z0-9].  Our content ids
also include the type, but we can't have the same code with more than
one type, so the maximum number of tags possible is 52 × 62 = 3224.
Each code could be using multiple blocks (eg via byte-array-len), and
we have other blocks too for the fixed fields, but a maximum of 10,000
possible blocks in any slice is a generous upper bound.

This avoids memory issues with files where num_content_ids is 100s of
millions.  We already blocked memory wrap-around and malloc woes, but
it's safer to be more restrictive.  This also fixes out-of-memory
errors in fuzz testing.

Credit to OSS-Fuzz
Fixes oss-fuzz 61840
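
A sketch of the resulting sanity check (the 10,000 constant is from the text above; the function itself is illustrative):

```c
#include <stdint.h>

#define MAX_SLICE_BLOCKS 10000   /* generous: 3224 tag codes + extras */

/* Reject absurd num_content_ids values up front instead of attempting
 * an allocation sized by hostile or corrupt input. */
static int check_num_content_ids(int64_t n) {
    return (n >= 0 && n <= MAX_SLICE_BLOCKS) ? 0 : -1;
}
```
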
Doing write-SAM, write-BAM, write-CRAM in one pass misses some issues,
as we can bail out of the loop early if one of the other formats
fails.  The full test really needs to repeat the entire thing 3 times
over, so the main interface now takes the memory block and redoes all
of the mem: opening.  More work, but it also covers more paths.
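
A sketch of the restructured flow (the helper, its signature, and the open modes are assumptions; the real harness is hts_open_fuzzer in the htslib test suite):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: opens the fuzz input afresh via a mem: URL and
 * copies every record to a scratch output in the given open mode. */
void copy_all_records(const uint8_t *data, size_t size, const char *mode);

/* One full pass per output format, so a CRAM-write failure can no
 * longer cut short the SAM and BAM passes. */
static void fuzz_all_formats(const uint8_t *data, size_t size) {
    static const char *modes[] = { "w", "wb", "wc" };  /* SAM, BAM, CRAM */
    for (size_t i = 0; i < sizeof(modes)/sizeof(*modes); i++)
        copy_all_records(data, size, modes[i]);
}
```
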
2GB is too high, as bcf loads multiple things and that's a limit on
total memory, not per allocation.
jkbonfield force-pushed the cram_fuzz branch 2 times, most recently from 1519d48 to 6de54cd on November 23, 2023 13:09
This is triggered when the aux data hasn't been validated and contains
a partial buffer.  Eg a non-null-terminated string, or an 'i' integer
with < 4 bytes remaining in the buffer.

This is a bit involved, as the htslib API also lacks bounds checking in
utility functions like bam_aux2i.  I added a local bounds-checking
equivalent, but in time maybe this and other similar functions should
be moved to become an official part of the API.
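
A minimal sketch of such a bounds-checked fetch for an 'i' (int32) value (the helper is hypothetical; the real bam_aux2i takes no end pointer, which is exactly the gap described):

```c
#include <stdint.h>

/* Check that the full 4-byte payload is present before decoding,
 * unlike a blind bam_aux2i call on unvalidated data. */
static int aux2i_checked(const uint8_t *s, const uint8_t *end, int32_t *val) {
    if (end - s < 5 || *s != 'i')
        return -1;   /* wrong type, or < 4 bytes follow the type byte */
    /* BAM aux integers are little-endian; assemble portably */
    uint32_t u = (uint32_t)s[1] | (uint32_t)s[2] << 8
               | (uint32_t)s[3] << 16 | (uint32_t)s[4] << 24;
    *val = (int32_t)u;
    return 0;
}
```
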
daviesrob marked this pull request as ready for review November 23, 2023 16:01
daviesrob
Member

This is resisting the fuzzing better since the last update. I will merge the progress so far.

daviesrob merged commit 4bb1ddf into samtools:develop Nov 23, 2023
8 checks passed
jkbonfield
Contributor Author

Thanks. I think that's the right decision as it's all a series of incremental steps and I don't think this introduces new bugs (I hope not anyway). If we find a bunch more, I'll start a cram-fuzz-2 PR :) Let's hope it'll be a while coming!
