Cram embed ref cons #1449

jkbonfield · 2022-06-10T14:35:28Z

A better implementation of embed_ref=2 now I've realised I was mistaken in the main code structure. It's much easier than I thought to do a proper consensus calculation, so I now use this. This means:

With MD:Z tags, the embed_ref=2 (implicit) is the same as embed_ref=1 (explicit via external .fa file).
Without MD:Z tags, embed_ref=2 is now usually slightly smaller than an explicit embed_ref=1 as the consensus is always at least as accurate as the official reference. The old code was sometimes a bit suspect in the generated reference.

This also addresses some of the issues raised in #1445, for better or worse. My decisions to the questions raised there were:

"If we use the embed_ref option, should it automatically use level 2 if no external reference is found?"
Yes. An unadorned "embed_ref" is still level 1, but when a reference cannot be found it switches to level 2. Note it does this on the file descriptor so it applies to all containers.
"If an external reference is not found during encode and no embed_ref option has been set, should we automatically switch to embed_ref=2 mode? "
Yes, for both when MD:Z are known and when not, as it's not easy to know whether they are present, and there are middle-ground cases such as partially present, or present but incorrect so it's not good to try and separate logic by content. A warning message is emitted stating that the file will likely be larger than when using an external reference.
"Should we be providing some hint mechanism in the SAM header, eg by adding @co htslib_cram_options:embed_ref and similar."
Currently no. With the notion of simply auto-generating an embedded reference when one doesn't exist, this is less required anyway.

Note this isn't a complete panacea. Unsorted data cannot use embedded references, and neither can multi_seq_per_slice mode (eg when encoding a highly fragmented assembly).

Thoughts?

jkbonfield · 2022-06-10T14:50:15Z

Bah I found another corner case.

The SAM header must have M5 tags when outputting in CRAM, unless an explicit embed_ref=2 or no_ref is used. An example is the classic platinum genome NA12878_S1.bam from Illumina. This has no meta-data whatsoever; just SQ SN and LN fields leaving the user totally guessing if it's GRCh37, GRCh38, or IIRC it's actually NCBI's 37 with a subtly different MT genome (sigh). Platinum data, but hardly platinum level provenance.

I'm not sure how best to handle that. It's just another hard failure right now, but with an explicit embed_ref=2 it works fine and there are MD:Z tags present so it'll even be efficient for retaining those.

daviesrob · 2022-07-14T10:57:56Z

cram/cram_encode.c

+// Returns the next cigar op code: one of the BAM_C* codes,
+// or -1 if no more are present.
+static inline
+int next_cigar_op(uint32_t *cigar, int *ncigar, int *skip, int *spos,


ncigar should be uint32_t given that bam1_core_t::ncigar is. Also, why is it passed in as a pointer?

skip could also be moved in here as a const array, couldn't it?

Good question on the mystery pointer. I assume it had an earlier incarnation.

Thanks for the review comments. I'll work through them.

Regarding "skip", it could be moved there, but the next_cigar_op is a general purpose function (albeit static), with skip being a parameterisation of it. I could move it there, but then it'd need renaming to something less general. I doubt it matters much so inclined to leave as-is for now.

cram/cram_encode.c

daviesrob · 2022-07-14T11:32:13Z

cram/cram_encode.c

+    int ncigar = b->core.n_cigar;
+    int i, j;


I see there is an ncigar field in the internal cram record struct too, which is signed. Sloppy! (Sorry)

I'll look into these as a separate issue though.

cram/cram_encode.c

daviesrob · 2022-07-14T12:57:29Z

cram/cram_encode.c

+                if (fd->embed_ref <= 0)
+                    hts_log_warning("NOTE: the CRAM file will be bigger than"
+                                    " using an external reference");
+                pthread_mutex_lock(&fd->ref_lock);


This lock should happen earlier, to guard the fd->embed_ref reads as well. Or possibly fd->embed_ref should be copied into a local variable as happens elsewhere.

It looks like there are a few other unlocked accesses to this variable elsewhere, which will also need attention.

Good idea for putting into a local variable, as is already done just about with nref. Regarding elsewhere (such as process_read) I did think about locks but the various automated tools seemed to think it was impossible to trigger data races there.

I'll explore again with more tools. It's possible it can still happen if we fail to find a reference part way through a file instead of only at the first one (which essentially means it hasn't yet dispatched a single thread job, hence impossible to trigger). So I'll stress test more avenues.

cram/cram_encode.c

jkbonfield · 2022-07-15T14:06:43Z

I think this should be working now. Thanks for the review. It can all be squashed together when you're OK with it.

When these tags are present, the reference is inferred so it's much the same as before. However when not present, we previously took a first-come first-serve approach to consensus generation as the just-in-time reference generation was in the process_one_read loop. As we'd already updated the reference and delta-encoded against it, if a later read comes along with conflicts then we couldn't correct anything. 33ff2bc incorrectly made the assumption that we were being fed a bunch of bam records and didn't have all the data available up-front. However this isn't true. For the sake of threading efficiency, the encoder buffers up all the BAMs for a container before dispatching the cram_encode_container call, so we can do a consensus generation stage prior to the process_one_read loop. This improves consensus accuracy and also simplifies the MD:Z using code too as it doesn't have to worry about inconsistencies any more. On 10MB worth of ONT data we had the following total file and SEQ sizes: Total SEQ only no_ref 205899946 74416907 embed_ref=0 135303919 27583376 use external ref embed_ref=1 139709759 32008185 embed external ref, from file embed_ref=2+MD 139710386 32007405 embed external ref, inferred embed_ref=2-MD 149371457 41670856 OLD: first-come first-serve cons embed_ref=2-MD 139598781 31898253 NEW: proper consensus The old embed_ref=2 needed 30% more space to delta-encode seqs, although it was still a major win over using no_ref mode. The new consensus is is a significant improvement and it's now a better match for the sequence (as we'd expect) than the external reference. Obviously this can't be done when we have MD tags as otherwise we'd need to store them verbatim which then takes more space up again. On Illumina NovaSeq we also see a significant reduction to the embedding overheads: Total SEQ only no_ref 256252015 87171589 48.4s embed_ref=0 207793068 38761828 34.7s embed_ref=1 219064424 50021186 35.3s embed_ref=2-MD 231012254 61929861 46.2s OLD embed_ref=2-MD 218014361 48935989 38.9s NEW The code has also been refactored in a few places to speed it up. Specifically with the newer consensus building strategy, we no longer produce fake MD tags and just call the earlier algorithm, and instead simply add directly to consensus histograms. TODO: figure out how to enable this automatically, without penalty when we do have the external reference available.

This choice is debatable, as it means people may get larger files than they expect, but it also means the CRAM can be built even if suboptimal. A warning is given notifying the user of this action. If a user explicitly species embed_ref=0 then this disables this automatic code and will turn the warning back into a hard failure. Also added more checks for handling things like unsorted data or multi_seq_per_slice mode, neither of which are compatible with embedding references (both the original =1 mode or =2).

During the initial set when transitioning from -1 to 2 due to a lack of both UR and M5 header tags, or a failure to load the reference specified, we need to briefly lock. This shouldn't cause any slow downs as it's guarded by the transition status.

It's too harsh to simply bail out with invalid data. Instead we just nullify the portion of reference computed with this record and drop back to the consensus method. Also replaced hts_log_warning messages with hts_log_info, as typically we either get no messages or a huge proportion, as MD being wrong quite often means some large scale reprocessing happened. Tested using soft-clip adjusted covid19 samples where almost every MD tag was incorrect.

- Update Makefile dependencies - Switch to uint32_t for ncigar usage - Fix an hts_pos_t issue for ref position - Improve thread locks on fd->embed_ref - Fix memory leak on error handling - Remove redundant pass-by-reference to next_cigar_op

jkbonfield · 2022-07-19T10:56:23Z

Added a commit to move fd->embed_ref to local variables / arguments, so one container changing the global default doesn't affect the running behaviour of code in another thread.

Also rebased so the tests should now pass.

This field can be changed on-the-fly by cram_encode_container when it discoveres it cannot find a reference. It switches to embed_ref=2 in this case. Due to the multi-threaded nature, it is important that all running code elsewhere is using their embed_ref parameter at the time the container was created. This opens up an avenue for downgrounding embed_ref again if we require (eg a reference is absent, but subsequent ones can be found), but for now we don't support this.

daviesrob · 2022-07-19T14:44:04Z

Looks better thanks. I see it might be possible to get more than one warning printed when it changes mode, but that's hopefully unlikely. We can try to fix it later if it turns out to be an annoyance.

daviesrob self-assigned this Jun 21, 2022

jkbonfield marked this pull request as ready for review July 6, 2022 11:01

daviesrob reviewed Jul 14, 2022

View reviewed changes

cram/cram_encode.c Show resolved Hide resolved

jkbonfield force-pushed the cram_embed_ref_cons branch from 8145686 to 29569a0 Compare July 15, 2022 13:09

jkbonfield added 6 commits July 19, 2022 11:54

Also enable embed_ref=2 when converting BAMs with no UR or M5 tags.

a381ca6

Amendments following review.

5bdd37a

- Update Makefile dependencies - Switch to uint32_t for ncigar usage - Fix an hts_pos_t issue for ref position - Improve thread locks on fd->embed_ref - Fix memory leak on error handling - Remove redundant pass-by-reference to next_cigar_op

jkbonfield force-pushed the cram_embed_ref_cons branch from 29569a0 to 81946e3 Compare July 19, 2022 10:55

jkbonfield force-pushed the cram_embed_ref_cons branch from 81946e3 to f0f3eec Compare July 19, 2022 10:57

daviesrob merged commit 9562aeb into samtools:develop Jul 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cram embed ref cons #1449

Cram embed ref cons #1449

jkbonfield commented Jun 10, 2022

jkbonfield commented Jun 10, 2022 •

edited

Loading

daviesrob Jul 14, 2022

jkbonfield Jul 15, 2022

jkbonfield Jul 15, 2022

daviesrob Jul 14, 2022

jkbonfield Jul 15, 2022

daviesrob Jul 14, 2022

jkbonfield Jul 15, 2022

jkbonfield commented Jul 15, 2022

jkbonfield commented Jul 19, 2022

daviesrob commented Jul 19, 2022

Cram embed ref cons #1449

Cram embed ref cons #1449

Conversation

jkbonfield commented Jun 10, 2022

jkbonfield commented Jun 10, 2022 • edited Loading

daviesrob Jul 14, 2022

Choose a reason for hiding this comment

jkbonfield Jul 15, 2022

Choose a reason for hiding this comment

jkbonfield Jul 15, 2022

Choose a reason for hiding this comment

daviesrob Jul 14, 2022

Choose a reason for hiding this comment

jkbonfield Jul 15, 2022

Choose a reason for hiding this comment

daviesrob Jul 14, 2022

Choose a reason for hiding this comment

jkbonfield Jul 15, 2022

Choose a reason for hiding this comment

jkbonfield commented Jul 15, 2022

jkbonfield commented Jul 19, 2022

daviesrob commented Jul 19, 2022

jkbonfield commented Jun 10, 2022 •

edited

Loading