Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cram embed ref cons #1449

Merged
merged 7 commits into from
Jul 19, 2022
Merged

Conversation

jkbonfield
Copy link
Contributor

A better implementation of embed_ref=2 now I've realised I was mistaken in the main code structure. It's much easier than I thought to do a proper consensus calculation, so I now use this. This means:

  • With MD:Z tags, the embed_ref=2 (implicit) is the same as embed_ref=1 (explicit via external .fa file).
  • Without MD:Z tags, embed_ref=2 is now usually slightly smaller than an explicit embed_ref=1 as the consensus is always at least as accurate as the official reference. The old code was sometimes a bit suspect in the generated reference.

This also addresses some of the issues raised in #1445, for better or worse. My decisions to the questions raised there were:

  1. "If we use the embed_ref option, should it automatically use level 2 if no external reference is found?"
    Yes. An unadorned "embed_ref" is still level 1, but when a reference cannot be found it switches to level 2. Note it does this on the file descriptor so it applies to all containers.

  2. "If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? "
    Yes, for both when MD:Z are known and when not, as it's not easy to know whether they are present, and there are middle-ground cases such as partially present, or present but incorrect so it's not good to try and separate logic by content. A warning message is emitted stating that the file will likely be larger than when using an external reference.

  3. "Should we be providing some hint mechanism in the SAM header, eg by adding @co htslib_cram_options:embed_ref and similar."
    Currently no. With the notion of simply auto-generating an embedded reference when one doesn't exist, this is less required anyway.

Note this isn't a complete panacea. Unsorted data cannot use embedded references, and neither can multi_seq_per_slice mode (eg when encoding a highly fragmented assembly).

Thoughts?

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 10, 2022

Bah I found another corner case.

The SAM header must have M5 tags when outputting in CRAM, unless an explicit embed_ref=2 or no_ref is used. An example is the classic platinum genome NA12878_S1.bam from Illumina. This has no meta-data whatsoever; just SQ SN and LN fields leaving the user totally guessing if it's GRCh37, GRCh38, or IIRC it's actually NCBI's 37 with a subtly different MT genome (sigh). Platinum data, but hardly platinum level provenance.

I'm not sure how best to handle that. It's just another hard failure right now, but with an explicit embed_ref=2 it works fine and there are MD:Z tags present so it'll even be efficient for retaining those.

@daviesrob daviesrob self-assigned this Jun 21, 2022
@jkbonfield jkbonfield marked this pull request as ready for review July 6, 2022 11:01
// Returns the next cigar op code: one of the BAM_C* codes,
// or -1 if no more are present.
static inline
int next_cigar_op(uint32_t *cigar, int *ncigar, int *skip, int *spos,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ncigar should be uint32_t given that bam1_core_t::ncigar is. Also, why is it passed in as a pointer?

skip could also be moved in here as a const array, couldn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question on the mystery pointer. I assume it had an earlier incarnation.

Thanks for the review comments. I'll work through them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding "skip", it could be moved there, but the next_cigar_op is a general purpose function (albeit static), with skip being a parameterisation of it. I could move it there, but then it'd need renaming to something less general. I doubt it matters much so inclined to leave as-is for now.

cram/cram_encode.c Outdated Show resolved Hide resolved
cram/cram_encode.c Outdated Show resolved Hide resolved
Comment on lines 1579 to 1580
int ncigar = b->core.n_cigar;
int i, j;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint32_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there is an ncigar field in the internal cram record struct too, which is signed. Sloppy! (Sorry)

I'll look into these as a separate issue though.

cram/cram_encode.c Outdated Show resolved Hide resolved
cram/cram_encode.c Outdated Show resolved Hide resolved
cram/cram_encode.c Outdated Show resolved Hide resolved
cram/cram_encode.c Outdated Show resolved Hide resolved
cram/cram_encode.c Outdated Show resolved Hide resolved
if (fd->embed_ref <= 0)
hts_log_warning("NOTE: the CRAM file will be bigger than"
" using an external reference");
pthread_mutex_lock(&fd->ref_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lock should happen earlier, to guard the fd->embed_ref reads as well. Or possibly fd->embed_ref should be copied into a local variable as happens elsewhere.

It looks like there are a few other unlocked accesses to this variable elsewhere, which will also need attention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea for putting into a local variable, as is already done just about with nref. Regarding elsewhere (such as process_read) I did think about locks but the various automated tools seemed to think it was impossible to trigger data races there.

I'll explore again with more tools. It's possible it can still happen if we fail to find a reference part way through a file instead of only at the first one (which essentially means it hasn't yet dispatched a single thread job, hence impossible to trigger). So I'll stress test more avenues.

@jkbonfield jkbonfield force-pushed the cram_embed_ref_cons branch from 8145686 to 29569a0 Compare July 15, 2022 13:09
@jkbonfield
Copy link
Contributor Author

I think this should be working now. Thanks for the review. It can all be squashed together when you're OK with it.

When these tags are present, the reference is inferred so it's much
the same as before.

However when not present, we previously took a first-come first-serve
approach to consensus generation as the just-in-time reference
generation was in the process_one_read loop.  As we'd already updated
the reference and delta-encoded against it, if a later read comes
along with conflicts then we couldn't correct anything.

33ff2bc incorrectly made the assumption that we were being fed a bunch
of bam records and didn't have all the data available up-front.
However this isn't true.  For the sake of threading efficiency, the
encoder buffers up all the BAMs for a container before dispatching the
cram_encode_container call, so we can do a consensus generation stage
prior to the process_one_read loop.  This improves consensus accuracy
and also simplifies the MD:Z using code too as it doesn't have to
worry about inconsistencies any more.

On 10MB worth of ONT data we had the following total file and SEQ sizes:

                   Total      SEQ only
    no_ref         205899946  74416907
    embed_ref=0    135303919  27583376  use external ref
    embed_ref=1    139709759  32008185  embed external ref, from file
    embed_ref=2+MD 139710386  32007405  embed external ref, inferred
    embed_ref=2-MD 149371457  41670856  OLD: first-come first-serve cons
    embed_ref=2-MD 139598781  31898253  NEW: proper consensus

The old embed_ref=2 needed 30% more space to delta-encode seqs,
although it was still a major win over using no_ref mode.  The new
consensus is is a significant improvement and it's now a better match
for the sequence (as we'd expect) than the external reference.
Obviously this can't be done when we have MD tags as otherwise we'd
need to store them verbatim which then takes more space up again.

On Illumina NovaSeq we also see a significant reduction to the
embedding overheads:

                   Total      SEQ only
    no_ref         256252015  87171589  48.4s
    embed_ref=0    207793068  38761828  34.7s
    embed_ref=1    219064424  50021186  35.3s
    embed_ref=2-MD 231012254  61929861  46.2s OLD
    embed_ref=2-MD 218014361  48935989  38.9s NEW

The code has also been refactored in a few places to speed it up.
Specifically with the newer consensus building strategy, we no longer
produce fake MD tags and just call the earlier algorithm, and instead
simply add directly to consensus histograms.

TODO: figure out how to enable this automatically, without penalty
when we do have the external reference available.
This choice is debatable, as it means people may get larger files than
they expect, but it also means the CRAM can be built even if
suboptimal.  A warning is given notifying the user of this action.

If a user explicitly species embed_ref=0 then this disables this
automatic code and will turn the warning back into a hard failure.

Also added more checks for handling things like unsorted data or
multi_seq_per_slice mode, neither of which are compatible with
embedding references (both the original =1 mode or =2).
During the initial set when transitioning from -1 to 2 due to a lack
of both UR and M5 header tags, or a failure to load the reference
specified, we need to briefly lock.  This shouldn't cause any slow
downs as it's guarded by the transition status.
It's too harsh to simply bail out with invalid data.  Instead we just
nullify the portion of reference computed with this record and drop
back to the consensus method.

Also replaced hts_log_warning messages with hts_log_info, as typically
we either get no messages or a huge proportion, as MD being wrong
quite often means some large scale reprocessing happened.

Tested using soft-clip adjusted covid19 samples where almost every MD
tag was incorrect.
- Update Makefile dependencies
- Switch to uint32_t for ncigar usage
- Fix an hts_pos_t issue for ref position
- Improve thread locks on fd->embed_ref
- Fix memory leak on error handling
- Remove redundant pass-by-reference to next_cigar_op
@jkbonfield jkbonfield force-pushed the cram_embed_ref_cons branch from 29569a0 to 81946e3 Compare July 19, 2022 10:55
@jkbonfield
Copy link
Contributor Author

Added a commit to move fd->embed_ref to local variables / arguments, so one container changing the global default doesn't affect the running behaviour of code in another thread.

Also rebased so the tests should now pass.

This field can be changed on-the-fly by cram_encode_container when it
discoveres it cannot find a reference.  It switches to embed_ref=2 in
this case.  Due to the multi-threaded nature, it is important that all
running code elsewhere is using their embed_ref parameter at the time
the container was created.

This opens up an avenue for downgrounding embed_ref again if we
require (eg a reference is absent, but subsequent ones can be found),
but for now we don't support this.
@jkbonfield jkbonfield force-pushed the cram_embed_ref_cons branch from 81946e3 to f0f3eec Compare July 19, 2022 10:57
@daviesrob
Copy link
Member

Looks better thanks. I see it might be possible to get more than one warning printed when it changes mode, but that's hopefully unlikely. We can try to fix it later if it turns out to be an annoyance.

@daviesrob daviesrob merged commit 9562aeb into samtools:develop Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants