
Imprecise variant calling -- gather clustered evidence links #3469

Merged
1 commit merged into master from cw_imprecise_variants on Sep 13, 2017

Conversation

cwhelan
Member

@cwhelan cwhelan commented Aug 21, 2017

This PR is the initial stage of implementing the calling of IMPRECISE variants in the SV pipeline. It introduces the concept of an evidence-target link, which joins an evidence interval to its distal target. This is an extension of the 'coherent' evidence concept previously used in determining evidence thresholds for assembly. The code in this PR contains the following changes:

  • Evidence intervals and distal targets are now treated as stranded, and evidence-target link clustering depends on overlap of both intervals and agreement of both strands.
  • Evidence target interval and distal target interval calculations have been modified to make sure that evidence supporting the same event clusters together (has overlapping intervals). This includes several changes such as extending the 'rest-of-fragment-size' calculation to try to capture almost all non-outlier fragment sizes in the library; increasing the split read location uncertainty a little; and being more precise about the boundaries of distal target intervals by taking advantage of information in the MD and MC tags if available.
  • Evidence target links are gathered for every piece of evidence supporting a high-quality distal target.
  • Evidence target links are clustered together and store the amount of split-read and read-pair evidence that went into each cluster.
  • All evidence target link clusters that are composed of at least 1 split read or at least 2 read pairs are collected in the driver and emitted in a BEDPE formatted file specified in the command line parameters.
  • A PairedStrandedIntervalTree data structure is introduced to allow SVIntervalTree-style lookups for paired intervals.

To finish this work, future PRs will 1) use the collected evidence target links to annotate our assembly called-variants with the number of split reads and read pairs observed in the original mappings and 2) create IMPRECISE VCF records for events that have enough evidence-target-link support, first for deletions and then possibly for other variant types.
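The evidence-target-link idea described above can be sketched roughly as follows. This is a hypothetical, simplified illustration of the concept only, not the GATK classes themselves (the real SVInterval, EvidenceTargetLink, and PairedStrandedIntervalTree carry more state):

```java
// Hypothetical, simplified sketch of an evidence-target link: an evidence
// interval joined to its distal target, both stranded. Not the GATK classes.
public class EvidenceTargetLinkSketch {

    // A genomic interval plus a strand. Overlap requires the same contig,
    // overlapping coordinates, AND a matching strand.
    record StrandedInterval(int contig, int start, int end, boolean forwardStrand) {
        boolean overlaps(final StrandedInterval other) {
            return contig == other.contig
                    && start < other.end && other.start < end
                    && forwardStrand == other.forwardStrand;
        }
    }

    // An evidence interval linked to its distal target, carrying the amount
    // of split-read and read-pair evidence accumulated in its cluster.
    record Link(StrandedInterval source, StrandedInterval target,
                int splitReads, int readPairs) {

        // Two links cluster when both interval pairs overlap, strands included.
        boolean clustersWith(final Link other) {
            return source.overlaps(other.source) && target.overlaps(other.target);
        }

        // The emission threshold described above: at least 1 split read
        // or at least 2 read pairs.
        boolean passesEvidenceThreshold() {
            return splitReads >= 1 || readPairs >= 2;
        }
    }
}
```

Clusters passing the threshold are what get emitted to the BEDPE output file.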

Initial testing shows that these changes slightly increase the number of variants called by the current pipeline, on both the CHM mix and NA12878 data sets, without greatly affecting run time.

@cwhelan cwhelan requested a review from tedsharpe August 21, 2017 16:54
@tedsharpe tedsharpe self-assigned this Aug 21, 2017
@codecov-io

codecov-io commented Aug 21, 2017

Codecov Report

Merging #3469 into master will increase coverage by 0.026%.
The diff coverage is 81.29%.

@@               Coverage Diff               @@
##              master     #3469       +/-   ##
===============================================
+ Coverage     79.905%   79.932%   +0.026%     
- Complexity     17918     17978       +60     
===============================================
  Files           1199      1205        +6     
  Lines          65102     65481      +379     
  Branches       10142     10226       +84     
===============================================
+ Hits           52020     52340      +320     
- Misses          9042      9051        +9     
- Partials        4040      4090       +50
Impacted Files Coverage Δ Complexity Δ
...tructuralVariationDiscoveryArgumentCollection.java 96.552% <ø> (ø) 0 <0> (ø) ⬇️
...llbender/tools/spark/sv/evidence/ReadMetadata.java 83.758% <ø> (-0.318%) 47 <0> (-1)
...bender/tools/spark/sv/evidence/ReadClassifier.java 85.526% <0%> (-1.14%) 33 <9> (ø)
...der/tools/spark/sv/evidence/LibraryStatistics.java 88.095% <100%> (+0.29%) 12 <1> (+1) ⬆️
...ols/spark/sv/evidence/BreakpointDensityFilter.java 97.183% <100%> (ø) 27 <3> (-1) ⬇️
...llbender/tools/spark/sv/evidence/SVReadFilter.java 76.19% <50%> (-12.698%) 24 <1> (ø)
...te/hellbender/tools/spark/sv/utils/SVInterval.java 86.364% <50%> (-1.732%) 28 <1> (+1)
.../tools/spark/sv/utils/PairedStrandedIntervals.java 50% <50%> (ø) 4 <4> (?)
.../sv/StructuralVariationDiscoveryPipelineSpark.java 97.143% <66.667%> (ø) 4 <0> (ø) ⬇️
...lbender/tools/spark/sv/utils/StrandedInterval.java 71.429% <71.429%> (ø) 7 <7> (?)
... and 18 more

Contributor

@tedsharpe tedsharpe left a comment

I'm just beginning to dig in, and I still have lots more to look at, but I wanted to record this overarching question:
It seems to me that, in finding consistent evidence, the exact strands of the source and target are irrelevant. All that matters is whether the source and target are on the same strand or on opposite strands. True, or not true?

return new Tuple2<>(intervals, evidenceTargetLinks);
}

private static void writeTargetLinks(final FindBreakpointEvidenceSparkArgumentCollection params, final Broadcast<ReadMetadata> broadcastMetadata, final List<EvidenceTargetLink> targetLinks) {
Contributor

This is a triviality, but it would be better to make the signature:

private static void writeTargetLinks(final String targetLinkFile,
                 final ReadMetadata readMetadata,
                 final List<EvidenceTargetLink> targetLinks) {

And then on line 645 where you call it, it's just:

        writeTargetLinks(params.targetLinkFile, broadcastMetadata.getValue(), evidenceTargetLinks);

Member Author

done

}

public String toBedpeString(ReadMetadata readMetadata) {
final Map<Integer, String> contigIdToContigNameMap =
Contributor

You don't need this. ReadMetadata stores the inverse map. Use readMetadata.getContigName( int id ) to turn a contigID into a contigName.

Member Author

Oops, I missed that, thanks. Done.

* the links will be combined as long as strands agree. Returned links have the intersection of two paired intervals as
* source and target, with the lower-coordinate interval appearing as source.
*/
public static List<EvidenceTargetLink> deduplicateTargetLinks(final List<EvidenceTargetLink> evidenceTargetLinks) {
Contributor

@tedsharpe tedsharpe Aug 22, 2017

Wouldn't it be easier to always record the links in canonical order (source < target)?
You could just have the EvidenceTargetLink constructor do the canonicalization.
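Concretely, the canonicalization being suggested could look something like this sketch (hypothetical, simplified stand-ins for the SVInterval fields; not the PR's actual EvidenceTargetLink):

```java
// Hypothetical sketch of canonicalizing links in the constructor (not GATK
// code): store the lower-coordinate interval as the source, so the same link
// built in either order ends up with identical fields and no separate
// de-duplication pass is needed.
public class CanonicalLink {
    final int sourceStart, targetStart;          // stand-ins for SVIntervals
    final boolean sourceForward, targetForward;

    public CanonicalLink(final int aStart, final boolean aForward,
                         final int bStart, final boolean bForward) {
        if (aStart <= bStart) {
            sourceStart = aStart; sourceForward = aForward;
            targetStart = bStart; targetForward = bForward;
        } else {                                 // swap into canonical order
            sourceStart = bStart; sourceForward = bForward;
            targetStart = aStart; targetForward = aForward;
        }
    }
}
```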

Member Author

This is an interesting idea that I'm still pondering.

oldLink.sourceForwardStrand,
newTarget,
oldLink.targetForwardStrand,
nextEvidence instanceof BreakpointEvidence.DiscordantReadPairEvidence
Contributor

Should be BreakpointEvidence.SplitRead, right?

Member Author

I don't think so. These ternaries are hard to read, but in this clause and the next I'm saying:

  • if it's a read pair use the old split read count, otherwise increment the split read count
  • if it's a read pair increment the read pair count, otherwise use the old read pair count

Maybe if I pull these out into named variables it will be clearer.
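Pulled out into named variables, the counting logic being described might read like this (a hypothetical sketch with stand-in evidence classes, not the PR's code):

```java
// Hypothetical sketch of the split-read / read-pair counting logic with
// the ternaries pulled out into named variables (stand-in types, not GATK).
public class EvidenceCounter {
    interface Evidence {}
    static class DiscordantReadPairEvidence implements Evidence {}
    static class SplitReadEvidence implements Evidence {}

    // Returns { newSplitReadCount, newReadPairCount }.
    static int[] updatedCounts(final Evidence nextEvidence,
                               final int oldSplitReads, final int oldReadPairs) {
        final boolean isReadPair = nextEvidence instanceof DiscordantReadPairEvidence;
        // if it's a read pair, keep the old split-read count; otherwise increment it
        final int newSplitReads = isReadPair ? oldSplitReads : oldSplitReads + 1;
        // if it's a read pair, increment the read-pair count; otherwise keep it
        final int newReadPairs = isReadPair ? oldReadPairs + 1 : oldReadPairs;
        return new int[] { newSplitReads, newReadPairs };
    }
}
```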

Contributor

It's fine. I just misread it.
The code does assume that if it's not a discordant pair, it must be a split read, however. Might be a little fragile.

Member Author

I changed it to check SplitRead for the split read count.

Contributor

Might also want to test that it's one or the other.

Member Author

Added a validation check in my iteration loop.

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

Re: strandedness, while it would be much easier not to have to track strand (and would probably save me from writing a lot of bugs), I think that we do need to do so if we ever want these data elements to be used for calling more complicated events like inversions or translocations. For example, for inversions we need to record whether we saw evidence linking +/+ strands or -/- strands, since the breakpoints might appear different at the two junctions due to insertions or deletions. For translocations, you definitely need to know which strands were involved so that you can extrapolate the rest of the event from the breakpoint. Agreed that it's a pain to keep track of it, though, and I find myself introducing strand errors all the time.

@tedsharpe
Contributor

I'm not suggesting that you ignore strand.
I'm suggesting that these two alignments are consistent, even though their strandedness doesn't agree:
readX +flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;
readY -flag chr1 10001 60 75M76S ... SA:chr1,85275,+,75S76M,60,1;

Whereas, this alignment is inconsistent:
readZ -flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;

@tedsharpe
Contributor

Should have said,
Whereas, these alignments are inconsistent with the pair above, but consistent with each other:
readZ -flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;
readZ +flag chr1 10001 60 75M76S ... SA:chr1,85275,+,75S76M,60,1;

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

That's right, but for split reads the strand I'm using is not the actual strand of the alignment but the side of each aligned region on which the clipping occurred (clip on the left == negative strand, clip on the right == positive strand). Strandedness is more important for read pair data but I wanted to give it an analogue for split reads so that the evidence from the two types of reads could play together. Let's discuss more in person.
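The clip-side convention described here can be sketched as follows (a hypothetical, simplified illustration using CIGAR strings; not the PR's implementation):

```java
// Hypothetical sketch of the split-read "strand" convention described above:
// the pseudo-strand is the side of the aligned block on which clipping
// occurred, not the alignment's actual strand.
// Clip on the left -> negative (false); clip on the right -> positive (true).
public class ClipSideStrand {

    // cigar is a simplified CIGAR string like "75M76S" or "75S76M";
    // returns true (positive) when the clip is on the right.
    static boolean clipSideStrand(final String cigar) {
        final boolean leftClipped = cigar.matches("^\\d+[SH].*");
        final boolean rightClipped = cigar.matches(".*\\d+[SH]$");
        if (leftClipped == rightClipped) {
            throw new IllegalArgumentException("expected exactly one clipped side: " + cigar);
        }
        return rightClipped;
    }
}
```

Under this convention, both alignments of a split read get a pseudo-strand, so split-read evidence can be clustered with read-pair evidence using the same stranded-interval machinery.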

@tedsharpe
Contributor

Light begins to dawn on Marblehead.
You're modeling a breakpoint, not a pair of alignments.

}
}

for (Iterator<SVIntervalTree.Entry<EvidenceTargetLink>> it = currentIntervalsWithTargets.iterator(); it.hasNext(); ) {
Contributor

@tedsharpe tedsharpe Aug 23, 2017

SVIntervalTree implements Iterable, so this could be expressed a little more tersely if you wanted to:

        currentIntervalsWithTargets.forEach(entry -> links.add(entry.getValue()));

Member Author

done

while (breakpointEvidenceIterator.hasNext()) {
final BreakpointEvidence nextEvidence = breakpointEvidenceIterator.next();
if (nextEvidence.hasDistalTargets(readMetadata)) {
final SVInterval target = nextEvidence.getDistalTargets(readMetadata).get(0);
Contributor

Unused variable.

Member Author

sorry, removed

link.source.intersect(existingLink.target), link.sourceForwardStrand,
link.splitReads + existingLink.splitReads, Math.max(link.readPairs, existingLink.readPairs));
psiOverlappers.remove();
break;
Contributor

Is this right? You intended to break here?
If so, the while on line 102 should just be an if, but I'm guessing you intended to process all the overlappers rather than just the first.

Member Author

You're right, this logic is wrong / incomplete.

Member Author

I realized that, given the way I'm setting up the data I'm calling this on, there can never be more than one overlapper. I'll either fix this to make it clearer or redo it if I decide to take your suggestion to canonicalize the links.

@@ -119,6 +119,10 @@
@Argument(doc = "output dir for assemblies", fullName = "gfaDir", optional = true)
public String gfaDir;

@Argument(doc = "file for non-assembled breakpoints in bedpe format",
Contributor

seems obvious to us, but you might want to say "output file"

Member Author

done

Contributor

@tedsharpe tedsharpe left a comment

Seems odd to add an SVReadFilter to the ReadMetadata, and pass that around to the distal target methods, all so that we can query the param minEvidenceMapQ. ReadMetadata is becoming quite a pot-au-feu. I assume you tried passing that param through the call stack and it became cumbersome?

import java.util.stream.Collectors;

@DefaultSerializer(EvidenceTargetLink.Serializer.class)
public class EvidenceTargetLink {
Contributor

Seems like this could be final.

Member Author

done

final int splitReads;
final int readPairs;

public EvidenceTargetLink(final SVInterval source, final boolean sourceForwardStrand, final SVInterval target, final boolean targetForwardStrand, final int splitReads, final int readPairs) {
Contributor

If you make sure source and target are non-null in the constructor, you could simplify equals and hashCode by eliminating the null testing.

Member Author

done

}


private static List<BreakpointEvidence> collectAllEvidence( final JavaRDD<BreakpointEvidence> evidenceRDD,
Contributor

It appears that this is not called. Should it have been?

Member Author

No, sorry. Dev code that I mistakenly left in. Removed.

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

Re: Adding the SVReadFilter to the ReadMetadata object. It would have been a little bit of a hassle to pass this mapq threshold around -- It'd have to be passed into BreakpointDensityFilter and EvidenceTargetLinkClusterer. Not the end of the world, though.

My thought was that the metadata is really tied to the filter's parameters -- if we were to save out the metadata for a bam, for example, it would be helpful to know if the metadata was generated with an override of the default filter values if we wanted to parse and reinterpret it. But maybe that's not something we need to be overly concerned about right now. If you object to adding the filter to the metadata I'll remove it and pass the values around instead.

@tedsharpe
Contributor

tedsharpe commented Aug 23, 2017

Sorry. I get your thinking, but I think it would be cleaner to pass the param to each of those two classes. It cleans up the dependency tree -- those two classes don't depend on any of the ReadMetadata state but for that one param -- and avoids having to have the SVReadFilter be both Java and Kryo serializable. I'd prefer it, but won't insist upon it.
Adding the filter values to the read metadata output file makes sense, though. Doesn't need to be done now.

@cwhelan
Member Author

cwhelan commented Aug 24, 2017

No worries, I'll take it out. I didn't like the dual serialization that was necessary either. No big deal.

@@ -0,0 +1,69 @@
package org.broadinstitute.hellbender.tools.spark.sv.utils;

public final class PairedStrandedIntervals {
Contributor

I think it might be helpful to have this class more clearly model an uncertain breakpoint, rather than just a pair of intervals with strand. As it stands, it's not clear what the relationship between the two sequences is. I think I've figured out that in the existing code the two sequences are extensions of the alignments of a read pair -- and therefore expected to be on opposite strands.
So the data held by this class is fine, but the interpretation of the relationship between left and right could be made simpler and more explicit, I think.
I'd like to see a head sequence and a tail sequence (rather than a left and right) with the strandedness manipulated so that the actual sample sequence over the breakpoint -- if we were able to figure it out exactly -- would be some prefix of the head sequence followed by some suffix of the tail sequence.
This is probably totally unclear, and we can chat.

@tedsharpe
Contributor

I've tried to study the code that turns a split read into a stranded interval pair so that it can be compared directly to the discordant read pair info, and I can't convince myself that it's correct.

There seem to be a bunch of cases that aren't covered. For example, it seems to me that the first step would have to be reversing the cigars on negative strand SA's, so that you can see which pairs of SA's are genuinely supplemental to -- and potentially partially overlapping -- each other, and the actual order of the aligned sequences. (I'm sure this is unclear, too. We can talk.)

Is there a unit test that mocks the reads we'd expect to get from an inversion breakpoint, both discordant and split, and that demonstrates that the stranded interval pairs generated from the reads are consistent with each other and with that breakpoint?

Contributor

@tedsharpe tedsharpe left a comment

I'm done. I leave it to you to figure out whether my latest suggestions have merit, and, if so, whether we'll address them now or in another PR.

@cwhelan
Member Author

cwhelan commented Sep 9, 2017

@tedsharpe I've addressed some of your comments here -- all the simpler stuff plus:

  • I now only make distal targets for split reads with one supplementary alignment. We can make a ticket to handle more complex cases at some point.
  • I renamed the concept of strand in the EvidenceTargetLink and related classes -- I'm now calling it evidenceUpstreamOfBreakpoint.
  • I canonicalize EvidenceTargetLinks and only create them when the source is upstream of the target. This allowed me to get rid of the de-duplication code, so thanks for the suggestion.

It seemed tricky to me to try to cluster these links during the initial pass over the reads while at the same time keeping track of coherent evidence. In my testing it doesn't seem like it is slow to run over the EvidenceRDD again to do this, but we could think about trying to change this sometime if we're looking for optimizations.

Want to take another look?

@tedsharpe
Contributor

It would be nice to have a StrandedInterval class, and then the distal targets could be stored as one of these in the various BreakpointEvidence subclasses that have distal targets, and the PairedStrandedIntervals class could just have two of them. This would also let you eliminate the methods hasDistalTargets and getDistalTargetsUpstreamOfBreakpoints -- getDistalTargets would do the job of all three.
I assume that PairedStrandedIntervals shouldn't ever have null intervals for source or target, so you could just validate in the constructor and then you wouldn't have to test nullness in equals and hashCode.
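This suggestion might look roughly like the following sketch (hypothetical simplified types; the real SVInterval carries contig/start/end and more, and these are not the GATK classes):

```java
import java.util.Objects;

// Hypothetical sketch of the refactoring suggested above (not the GATK code):
// a StrandedInterval bundles an interval with its strand, and
// PairedStrandedIntervals is just two of them, validated non-null up front,
// so the generated equals/hashCode never need null checks.
public class PairedIntervalsSketch {

    record StrandedInterval(int contig, int start, int end, boolean forwardStrand) {
        StrandedInterval {
            if (end < start) throw new IllegalArgumentException("end < start");
        }
    }

    record PairedStrandedIntervals(StrandedInterval left, StrandedInterval right) {
        PairedStrandedIntervals {
            // validating here means equals/hashCode never see nulls
            Objects.requireNonNull(left);
            Objects.requireNonNull(right);
        }
    }
}
```

With the strand folded into StrandedInterval, a distal target is a single value, and a method like getDistalTargets can stand in for the hasDistalTargets / getDistalTargetsUpstreamOfBreakpoints trio.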

@cwhelan
Member Author

cwhelan commented Sep 13, 2017

@tedsharpe Thanks for the suggestions, I think it makes the code a bit cleaner. Do these changes match what you were thinking?

@tedsharpe
Contributor

Definitely cleaner. Thanks. Merge away.

@cwhelan cwhelan merged commit c5f4665 into master Sep 13, 2017
@cwhelan cwhelan deleted the cw_imprecise_variants branch September 13, 2017 15:06