-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cram embed ref cons #1449
Cram embed ref cons #1449
Conversation
Bah I found another corner case. The SAM header must have M5 tags when outputting in CRAM, unless an explicit embed_ref=2 or no_ref is used. An example is the classic platinum genome NA12878_S1.bam from Illumina. This has no meta-data whatsoever; just SQ SN and LN fields leaving the user totally guessing if it's GRCh37, GRCh38, or IIRC it's actually NCBI's 37 with a subtly different MT genome (sigh). Platinum data, but hardly platinum level provenance. I'm not sure how best to handle that. It's just another hard failure right now, but with an explicit embed_ref=2 it works fine and there are MD:Z tags present so it'll even be efficient for retaining those. |
cram/cram_encode.c
Outdated
// Returns the next cigar op code: one of the BAM_C* codes, | ||
// or -1 if no more are present. | ||
static inline | ||
int next_cigar_op(uint32_t *cigar, int *ncigar, int *skip, int *spos, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ncigar
should be uint32_t
given that bam1_core_t::ncigar
is. Also, why is it passed in as a pointer?
skip
could also be moved in here as a const array, couldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question on the mystery pointer. I assume it had an earlier incarnation.
Thanks for the review comments. I'll work through them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding "skip", it could be moved there, but the next_cigar_op
is a general purpose function (albeit static), with skip being a parameterisation of it. I could move it there, but then it'd need renaming to something less general. I doubt it matters much so inclined to leave as-is for now.
cram/cram_encode.c
Outdated
int ncigar = b->core.n_cigar; | ||
int i, j; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there is an ncigar
field in the internal cram record struct too, which is signed. Sloppy! (Sorry)
I'll look into these as a separate issue though.
if (fd->embed_ref <= 0) | ||
hts_log_warning("NOTE: the CRAM file will be bigger than" | ||
" using an external reference"); | ||
pthread_mutex_lock(&fd->ref_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lock should happen earlier, to guard the fd->embed_ref
reads as well. Or possibly fd->embed_ref
should be copied into a local variable as happens elsewhere.
It looks like there are a few other unlocked accesses to this variable elsewhere, which will also need attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea for putting into a local variable, as is already done just about with nref
. Regarding elsewhere (such as process_read
) I did think about locks but the various automated tools seemed to think it was impossible to trigger data races there.
I'll explore again with more tools. It's possible it can still happen if we fail to find a reference part way through a file instead of only at the first one (which essentially means it hasn't yet dispatched a single thread job, hence impossible to trigger). So I'll stress test more avenues.
8145686
to
29569a0
Compare
I think this should be working now. Thanks for the review. It can all be squashed together when you're OK with it. |
When these tags are present, the reference is inferred so it's much the same as before. However when not present, we previously took a first-come first-serve approach to consensus generation as the just-in-time reference generation was in the process_one_read loop. As we'd already updated the reference and delta-encoded against it, if a later read comes along with conflicts then we couldn't correct anything. 33ff2bc incorrectly made the assumption that we were being fed a bunch of bam records and didn't have all the data available up-front. However this isn't true. For the sake of threading efficiency, the encoder buffers up all the BAMs for a container before dispatching the cram_encode_container call, so we can do a consensus generation stage prior to the process_one_read loop. This improves consensus accuracy and also simplifies the MD:Z using code too as it doesn't have to worry about inconsistencies any more. On 10MB worth of ONT data we had the following total file and SEQ sizes: Total SEQ only no_ref 205899946 74416907 embed_ref=0 135303919 27583376 use external ref embed_ref=1 139709759 32008185 embed external ref, from file embed_ref=2+MD 139710386 32007405 embed external ref, inferred embed_ref=2-MD 149371457 41670856 OLD: first-come first-serve cons embed_ref=2-MD 139598781 31898253 NEW: proper consensus The old embed_ref=2 needed 30% more space to delta-encode seqs, although it was still a major win over using no_ref mode. The new consensus is is a significant improvement and it's now a better match for the sequence (as we'd expect) than the external reference. Obviously this can't be done when we have MD tags as otherwise we'd need to store them verbatim which then takes more space up again. On Illumina NovaSeq we also see a significant reduction to the embedding overheads: Total SEQ only no_ref 256252015 87171589 48.4s embed_ref=0 207793068 38761828 34.7s embed_ref=1 219064424 50021186 35.3s embed_ref=2-MD 231012254 61929861 46.2s OLD embed_ref=2-MD 218014361 48935989 38.9s NEW The code has also been refactored in a few places to speed it up. Specifically with the newer consensus building strategy, we no longer produce fake MD tags and just call the earlier algorithm, and instead simply add directly to consensus histograms. TODO: figure out how to enable this automatically, without penalty when we do have the external reference available.
This choice is debatable, as it means people may get larger files than they expect, but it also means the CRAM can be built even if suboptimal. A warning is given notifying the user of this action. If a user explicitly species embed_ref=0 then this disables this automatic code and will turn the warning back into a hard failure. Also added more checks for handling things like unsorted data or multi_seq_per_slice mode, neither of which are compatible with embedding references (both the original =1 mode or =2).
During the initial set when transitioning from -1 to 2 due to a lack of both UR and M5 header tags, or a failure to load the reference specified, we need to briefly lock. This shouldn't cause any slow downs as it's guarded by the transition status.
It's too harsh to simply bail out with invalid data. Instead we just nullify the portion of reference computed with this record and drop back to the consensus method. Also replaced hts_log_warning messages with hts_log_info, as typically we either get no messages or a huge proportion, as MD being wrong quite often means some large scale reprocessing happened. Tested using soft-clip adjusted covid19 samples where almost every MD tag was incorrect.
- Update Makefile dependencies - Switch to uint32_t for ncigar usage - Fix an hts_pos_t issue for ref position - Improve thread locks on fd->embed_ref - Fix memory leak on error handling - Remove redundant pass-by-reference to next_cigar_op
29569a0
to
81946e3
Compare
Added a commit to move Also rebased so the tests should now pass. |
This field can be changed on-the-fly by cram_encode_container when it discoveres it cannot find a reference. It switches to embed_ref=2 in this case. Due to the multi-threaded nature, it is important that all running code elsewhere is using their embed_ref parameter at the time the container was created. This opens up an avenue for downgrounding embed_ref again if we require (eg a reference is absent, but subsequent ones can be found), but for now we don't support this.
81946e3
to
f0f3eec
Compare
Looks better thanks. I see it might be possible to get more than one warning printed when it changes mode, but that's hopefully unlikely. We can try to fix it later if it turns out to be an annoyance. |
A better implementation of embed_ref=2 now I've realised I was mistaken in the main code structure. It's much easier than I thought to do a proper consensus calculation, so I now use this. This means:
This also addresses some of the issues raised in #1445, for better or worse. My decisions to the questions raised there were:
"If we use the embed_ref option, should it automatically use level 2 if no external reference is found?"
Yes. An unadorned "embed_ref" is still level 1, but when a reference cannot be found it switches to level 2. Note it does this on the file descriptor so it applies to all containers.
"If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? "
Yes, for both when MD:Z are known and when not, as it's not easy to know whether they are present, and there are middle-ground cases such as partially present, or present but incorrect so it's not good to try and separate logic by content. A warning message is emitted stating that the file will likely be larger than when using an external reference.
"Should we be providing some hint mechanism in the SAM header, eg by adding @co htslib_cram_options:embed_ref and similar."
Currently no. With the notion of simply auto-generating an embedded reference when one doesn't exist, this is less required anyway.
Note this isn't a complete panacea. Unsorted data cannot use embedded references, and neither can multi_seq_per_slice mode (eg when encoding a highly fragmented assembly).
Thoughts?