
Imprecise variant calling -- gather clustered evidence links #3469

Merged
1 commit merged into master from cw_imprecise_variants on Sep 13, 2017

Conversation

cwhelan
Member

@cwhelan cwhelan commented Aug 21, 2017

This PR is the initial stage of implementing the calling of IMPRECISE variants in the SV pipeline. It introduces the concept of an evidence-target link, which joins an evidence interval to its distal target. This is an extension of the 'coherent' evidence concept previously used in determining evidence thresholds for assembly. The code in this PR contains the following changes:

  • Evidence intervals and distal targets are now treated as stranded, and evidence-target link clustering depends on overlap of both intervals and agreement of both strands.
  • Evidence target interval and distal target interval calculations have been modified to make sure that evidence supporting the same event clusters together (has overlapping intervals). This includes several changes such as extending the 'rest-of-fragment-size' calculation to try to capture almost all non-outlier fragment sizes in the library; increasing the split read location uncertainty a little; and being more precise about the boundaries of distal target intervals by taking advantage of information in the MD and MC tags if available.
  • Evidence target links are gathered for every piece of evidence supporting a high-quality distal target.
  • Evidence target links are clustered together and store the amount of split-read and read-pair evidence that went into each cluster.
  • All evidence target link clusters that are composed of at least 1 split read or at least 2 read pairs are collected in the driver and emitted in a BEDPE formatted file specified in the command line parameters.
  • A PairedStrandedIntervalTree data structure is introduced to allow SVIntervalTree-style lookups for paired intervals.

To finish this work, future PRs will 1) use the collected evidence target links to annotate our assembly called-variants with the number of split reads and read pairs observed in the original mappings and 2) create IMPRECISE VCF records for events that have enough evidence-target-link support, first for deletions and then possibly for other variant types.
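The evidence-target-link idea described above can be sketched roughly as follows. This is a hypothetical, simplified illustration of the concept only, not the GATK classes themselves (the real SVInterval, EvidenceTargetLink, and PairedStrandedIntervalTree carry more state):

```java
// Hypothetical, simplified sketch of an evidence-target link: an evidence
// interval joined to its distal target, both stranded. Not the GATK classes.
public class EvidenceTargetLinkSketch {

    // A genomic interval plus a strand. Overlap requires the same contig,
    // overlapping coordinates, AND a matching strand.
    record StrandedInterval(int contig, int start, int end, boolean forwardStrand) {
        boolean overlaps(final StrandedInterval other) {
            return contig == other.contig
                    && start < other.end && other.start < end
                    && forwardStrand == other.forwardStrand;
        }
    }

    // An evidence interval linked to its distal target, carrying the amount
    // of split-read and read-pair evidence accumulated in its cluster.
    record Link(StrandedInterval source, StrandedInterval target,
                int splitReads, int readPairs) {

        // Two links cluster when both interval pairs overlap, strands included.
        boolean clustersWith(final Link other) {
            return source.overlaps(other.source) && target.overlaps(other.target);
        }

        // The emission threshold described above: at least 1 split read
        // or at least 2 read pairs.
        boolean passesEvidenceThreshold() {
            return splitReads >= 1 || readPairs >= 2;
        }
    }
}
```

Clusters passing the threshold are what get emitted to the BEDPE output file.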

Initial testing shows that these changes slightly increase the number of variants called by the current pipeline, on both the CHM mix and NA12878 data sets, without greatly affecting run time.

@cwhelan cwhelan requested a review from tedsharpe August 21, 2017 16:54
@tedsharpe tedsharpe self-assigned this Aug 21, 2017
@codecov-io

codecov-io commented Aug 21, 2017

Codecov Report

Merging #3469 into master will increase coverage by 0.026%.
The diff coverage is 81.29%.

@@               Coverage Diff               @@
##              master     #3469       +/-   ##
===============================================
+ Coverage     79.905%   79.932%   +0.026%     
- Complexity     17918     17978       +60     
===============================================
  Files           1199      1205        +6     
  Lines          65102     65481      +379     
  Branches       10142     10226       +84     
===============================================
+ Hits           52020     52340      +320     
- Misses          9042      9051        +9     
- Partials        4040      4090       +50
Impacted Files Coverage Δ Complexity Δ
...tructuralVariationDiscoveryArgumentCollection.java 96.552% <ø> (ø) 0 <0> (ø) ⬇️
...llbender/tools/spark/sv/evidence/ReadMetadata.java 83.758% <ø> (-0.318%) 47 <0> (-1)
...bender/tools/spark/sv/evidence/ReadClassifier.java 85.526% <0%> (-1.14%) 33 <9> (ø)
...der/tools/spark/sv/evidence/LibraryStatistics.java 88.095% <100%> (+0.29%) 12 <1> (+1) ⬆️
...ols/spark/sv/evidence/BreakpointDensityFilter.java 97.183% <100%> (ø) 27 <3> (-1) ⬇️
...llbender/tools/spark/sv/evidence/SVReadFilter.java 76.19% <50%> (-12.698%) 24 <1> (ø)
...te/hellbender/tools/spark/sv/utils/SVInterval.java 86.364% <50%> (-1.732%) 28 <1> (+1)
.../tools/spark/sv/utils/PairedStrandedIntervals.java 50% <50%> (ø) 4 <4> (?)
.../sv/StructuralVariationDiscoveryPipelineSpark.java 97.143% <66.667%> (ø) 4 <0> (ø) ⬇️
...lbender/tools/spark/sv/utils/StrandedInterval.java 71.429% <71.429%> (ø) 7 <7> (?)
... and 18 more

Contributor

@tedsharpe tedsharpe left a comment

I'm just beginning to dig in, and I still have lots more to look at, but I wanted to record this overarching question:
It seems to me that, in finding consistent evidence, the exact strands of the source and target are irrelevant. All that matters is whether the source and target are on the same strand or on opposite strands. True, or not true?

return new Tuple2<>(intervals, evidenceTargetLinks);
}

private static void writeTargetLinks(final FindBreakpointEvidenceSparkArgumentCollection params, final Broadcast<ReadMetadata> broadcastMetadata, final List<EvidenceTargetLink> targetLinks) {
Contributor

This is a triviality, but it would be better to make the signature:

private static void writeTargetLinks(final String targetLinkFile,
                 final ReadMetadata readMetadata,
                 final List<EvidenceTargetLink> targetLinks) {

And then on line 645 where you call it, it's just:

        writeTargetLinks(params.targetLinkFile, broadcastMetadata.getValue(), evidenceTargetLinks);

Member Author

done

}

public String toBedpeString(ReadMetadata readMetadata) {
final Map<Integer, String> contigIdToContigNameMap =
Contributor

You don't need this. ReadMetadata stores the inverse map. Use readMetadata.getContigName( int id ) to turn a contigID into a contigName.

Member Author

Oops, I missed that, thanks. Done.

* the links will be combined as long as strands agree. Returned links have the intersection of two paired intervals as
* source and target, with the lower-coordinate interval appearing as source.
*/
public static List<EvidenceTargetLink> deduplicateTargetLinks(final List<EvidenceTargetLink> evidenceTargetLinks) {
Contributor

@tedsharpe tedsharpe Aug 22, 2017

Wouldn't it be easier to always record the links in canonical order (source < target)?
You could just have the EvidenceTargetLink constructor do the canonicalization.
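Concretely, the canonicalization being suggested could look something like this sketch (hypothetical, simplified stand-ins for the SVInterval fields; not the PR's actual EvidenceTargetLink):

```java
// Hypothetical sketch of canonicalizing links in the constructor (not GATK
// code): store the lower-coordinate interval as the source, so the same link
// built in either order ends up with identical fields and no separate
// de-duplication pass is needed.
public class CanonicalLink {
    final int sourceStart, targetStart;          // stand-ins for SVIntervals
    final boolean sourceForward, targetForward;

    public CanonicalLink(final int aStart, final boolean aForward,
                         final int bStart, final boolean bForward) {
        if (aStart <= bStart) {
            sourceStart = aStart; sourceForward = aForward;
            targetStart = bStart; targetForward = bForward;
        } else {                                 // swap into canonical order
            sourceStart = bStart; sourceForward = bForward;
            targetStart = aStart; targetForward = aForward;
        }
    }
}
```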

Member Author

This is an interesting idea that I'm still pondering.

oldLink.sourceForwardStrand,
newTarget,
oldLink.targetForwardStrand,
nextEvidence instanceof BreakpointEvidence.DiscordantReadPairEvidence
Contributor

Should be BreakpointEvidence.SplitRead, right?

Member Author

I don't think so. These ternaries are hard to read, but in this clause and the next I'm saying:

  • if it's a read pair use the old split read count, otherwise increment the split read count
  • if it's a read pair increment the read pair count, otherwise use the old read pair count

Maybe if I pull these out into named variables it will be clearer.
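Pulled out into named variables, the counting logic being described might read like this (a hypothetical sketch with stand-in evidence classes, not the PR's code):

```java
// Hypothetical sketch of the split-read / read-pair counting logic with
// the ternaries pulled out into named variables (stand-in types, not GATK).
public class EvidenceCounter {
    interface Evidence {}
    static class DiscordantReadPairEvidence implements Evidence {}
    static class SplitReadEvidence implements Evidence {}

    // Returns { newSplitReadCount, newReadPairCount }.
    static int[] updatedCounts(final Evidence nextEvidence,
                               final int oldSplitReads, final int oldReadPairs) {
        final boolean isReadPair = nextEvidence instanceof DiscordantReadPairEvidence;
        // if it's a read pair, keep the old split-read count; otherwise increment it
        final int newSplitReads = isReadPair ? oldSplitReads : oldSplitReads + 1;
        // if it's a read pair, increment the read-pair count; otherwise keep it
        final int newReadPairs = isReadPair ? oldReadPairs + 1 : oldReadPairs;
        return new int[] { newSplitReads, newReadPairs };
    }
}
```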

Contributor

It's fine. I just misread it.
The code does assume that if it's not a discordant pair, it must be a split read, however. Might be a little fragile.

Member Author

I changed it to check SplitRead for the split read count.

Contributor

Might also want to test that it's one or the other.

Member Author

Added a validation check in my iteration loop.

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

Re: strandedness, while it would be much easier not to have to track strand (and would probably save me from writing a lot of bugs), I think that we do need to do so if we ever want these data elements to be used for calling more complicated events like inversions or translocations. For example, for inversions we need to record whether we saw evidence linking +/+ strands or -/- strands, since the breakpoints might appear different at the two junctions due to insertions or deletions. For translocations, you definitely need to know which strands were involved so that you can extrapolate the rest of the event from the breakpoint. Agreed that it's a pain to keep track of it, though, and I find myself introducing strand errors all the time.

@tedsharpe
Contributor

I'm not suggesting that you ignore strand.
I'm suggesting that these two alignments are consistent, even though their strandedness doesn't agree:
readX +flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;
readY -flag chr1 10001 60 75M76S ... SA:chr1,85275,+,75S76M,60,1;

Whereas, this alignment is inconsistent:
readZ -flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;

@tedsharpe
Contributor

Should have said,
Whereas, these alignments are inconsistent with the pair above, but consistent with each other:
readZ -flag chr1 10001 60 75M76S ... SA:chr1,85275,-,75S76M,60,1;
readZ +flag chr1 10001 60 75M76S ... SA:chr1,85275,+,75S76M,60,1;

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

That's right, but for split reads the strand I'm using is not the actual strand of the alignment but the side of each aligned region on which the clipping occurred (clip on the left == negative strand, clip on the right == positive strand). Strandedness is more important for read pair data but I wanted to give it an analogue for split reads so that the evidence from the two types of reads could play together. Let's discuss more in person.
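The clip-side convention described here can be sketched as follows (a hypothetical, simplified illustration using CIGAR strings; not the PR's implementation):

```java
// Hypothetical sketch of the split-read "strand" convention described above:
// the pseudo-strand is the side of the aligned block on which clipping
// occurred, not the alignment's actual strand.
// Clip on the left -> negative (false); clip on the right -> positive (true).
public class ClipSideStrand {

    // cigar is a simplified CIGAR string like "75M76S" or "75S76M";
    // returns true (positive) when the clip is on the right.
    static boolean clipSideStrand(final String cigar) {
        final boolean leftClipped = cigar.matches("^\\d+[SH].*");
        final boolean rightClipped = cigar.matches(".*\\d+[SH]$");
        if (leftClipped == rightClipped) {
            throw new IllegalArgumentException("expected exactly one clipped side: " + cigar);
        }
        return rightClipped;
    }
}
```

Under this convention, both alignments of a split read get a pseudo-strand, so split-read evidence can be clustered with read-pair evidence using the same stranded-interval machinery.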

@tedsharpe
Contributor

Light begins to dawn on Marblehead.
You're modeling a breakpoint, not a pair of alignments.

}
}

for (Iterator<SVIntervalTree.Entry<EvidenceTargetLink>> it = currentIntervalsWithTargets.iterator(); it.hasNext(); ) {
Contributor

@tedsharpe tedsharpe Aug 23, 2017

SVIntervalTree implements Iterable, so this could be expressed a little more tersely if you wanted to:

        currentIntervalsWithTargets.forEach(entry -> links.add(entry.getValue()));

Member Author

done

while (breakpointEvidenceIterator.hasNext()) {
final BreakpointEvidence nextEvidence = breakpointEvidenceIterator.next();
if (nextEvidence.hasDistalTargets(readMetadata)) {
final SVInterval target = nextEvidence.getDistalTargets(readMetadata).get(0);
Contributor

Unused variable.

Member Author

sorry, removed

link.source.intersect(existingLink.target), link.sourceForwardStrand,
link.splitReads + existingLink.splitReads, Math.max(link.readPairs, existingLink.readPairs));
psiOverlappers.remove();
break;
Contributor

Is this right? You intended to break here?
If so, the while on line 102 should just be an if, but I'm guessing you intended to process all the overlappers rather than just the first.

Member Author

You're right, this logic is wrong / incomplete.

Member Author

I realized that, given the way I'm setting up the data I'm calling this on, there can never be more than one overlapper. I'll either fix this to make it clearer or redo it if I decide to take your suggestion to canonicalize the links.

@@ -119,6 +119,10 @@
@Argument(doc = "output dir for assemblies", fullName = "gfaDir", optional = true)
public String gfaDir;

@Argument(doc = "file for non-assembled breakpoints in bedpe format",
Contributor

seems obvious to us, but you might want to say "output file"

Member Author

done

Contributor

@tedsharpe tedsharpe left a comment

Seems odd to add an SVReadFilter to the ReadMetadata, and pass that around to the distal target methods, all so that we can query the param minEvidenceMapQ. ReadMetadata is becoming quite a pot-au-feu. I assume you tried passing that param through the call stack and it became cumbersome?

import java.util.stream.Collectors;

@DefaultSerializer(EvidenceTargetLink.Serializer.class)
public class EvidenceTargetLink {
Contributor

Seems like this could be final.

Member Author

done

final int splitReads;
final int readPairs;

public EvidenceTargetLink(final SVInterval source, final boolean sourceForwardStrand, final SVInterval target, final boolean targetForwardStrand, final int splitReads, final int readPairs) {
Contributor

If you make sure source and target are non-null in the constructor, you could simplify equals and hashCode by eliminating the null testing.

Member Author

done

}


private static List<BreakpointEvidence> collectAllEvidence( final JavaRDD<BreakpointEvidence> evidenceRDD,
Contributor

It appears that this is not called. Should it have been?

Member Author

No, sorry. Dev code that I mistakenly left in. Removed.

@cwhelan
Member Author

cwhelan commented Aug 23, 2017

Re: Adding the SVReadFilter to the ReadMetadata object. It would have been a little bit of a hassle to pass this mapq threshold around -- It'd have to be passed into BreakpointDensityFilter and EvidenceTargetLinkClusterer. Not the end of the world, though.

My thought was that the metadata is really tied to the filter's parameters -- if we were to save out the metadata for a bam, for example, it would be helpful to know if the metadata was generated with an override of the default filter values if we wanted to parse and reinterpret it. But maybe that's not something we need to be overly concerned about right now. If you object to adding the filter to the metadata I'll remove it and pass the values around instead.

@tedsharpe
Contributor

tedsharpe commented Aug 23, 2017

Sorry. I get your thinking, but I think it would be cleaner to pass the param to each of those two classes. It cleans up the dependency tree -- those two classes don't depend on any of the ReadMetadata state but for that one param -- and avoids having to have the SVReadFilter be both Java and Kryo serializable. I'd prefer it, but won't insist upon it.
Adding the filter values to the read metadata output file makes sense, though. Doesn't need to be done now.

@cwhelan
Member Author

cwhelan commented Aug 24, 2017

No worries, I'll take it out. I didn't like the dual serialization that was necessary either. No big deal.

@@ -0,0 +1,69 @@
package org.broadinstitute.hellbender.tools.spark.sv.utils;

public final class PairedStrandedIntervals {
Contributor

I think it might be helpful to have this class more clearly model an uncertain breakpoint, rather than just a pair of intervals with strand. As it stands, it's not clear what the relationship between the two sequences is. I think I've figured out that in the existing code the two sequences are extensions of the alignments of a read pair -- and therefore expected to be on opposite strands.
So the data held by this class is fine, but the interpretation of the relationship between left and right could be made simpler and more explicit, I think.
I'd like to see a head sequence and a tail sequence (rather than a left and right) with the strandedness manipulated so that the actual sample sequence over the breakpoint -- if we were able to figure it out exactly -- would be some prefix of the head sequence followed by some suffix of the tail sequence.
This is probably totally unclear, and we can chat.

@tedsharpe
Contributor

I've tried to study the code that turns a split read into a stranded interval pair so that it can be compared directly to the discordant read pair info, and I can't convince myself that it's correct.

There seem to be a bunch of cases that aren't covered. For example, it seems to me that the first step would have to be reversing the cigars on negative strand SA's, so that you can see which pairs of SA's are genuinely supplemental to -- and potentially partially overlapping -- each other, and the actual order of the aligned sequences. (I'm sure this is unclear, too. We can talk.)

Is there a unit test that mocks the reads we'd expect to get from an inversion breakpoint, both discordant and split, and that demonstrates that the stranded interval pairs generated from the reads are consistent with each other and with that breakpoint?

Contributor

@tedsharpe tedsharpe left a comment

I'm done. I leave it to you to figure out whether my latest suggestions have merit, and, if so, whether we'll address them now or in another PR.

@cwhelan
Member Author

cwhelan commented Sep 9, 2017

@tedsharpe I've addressed some of your comments here -- all the simpler stuff plus:

  • I now only make distal targets for split reads with one supplementary alignment. We can make a ticket to handle more complex cases at some point.
  • I renamed the concept of strand in the EvidenceTargetLink and related classes -- I'm now calling it evidenceUpstreamOfBreakpoint.
  • I canonicalize EvidenceTargetLinks and only create them when the source is upstream of the target. This allowed me to get rid of the de-duplication code, so thanks for the suggestion.

It seemed tricky to me to try to cluster these links during the initial pass over the reads while at the same time keeping track of coherent evidence. In my testing it doesn't seem like it is slow to run over the EvidenceRDD again to do this, but we could think about trying to change this sometime if we're looking for optimizations.

Want to take another look?

@tedsharpe
Contributor

It would be nice to have a StrandedInterval class, and then the distal targets could be stored as one of these in the various BreakpointEvidence subclasses that have distal targets, and the PairedStrandedIntervals class could just have two of them. This would also let you eliminate the methods hasDistalTargets and getDistalTargetsUpstreamOfBreakpoints -- getDistalTargets would do the job of all three.
I assume that PairedStrandedIntervals shouldn't ever have null intervals for source or target, so you could just validate in the constructor and then you wouldn't have to test nullness in equals and hashCode.
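This suggestion might look roughly like the following sketch (hypothetical simplified types; the real SVInterval carries contig/start/end and more, and these are not the GATK classes):

```java
import java.util.Objects;

// Hypothetical sketch of the refactoring suggested above (not the GATK code):
// a StrandedInterval bundles an interval with its strand, and
// PairedStrandedIntervals is just two of them, validated non-null up front,
// so the generated equals/hashCode never need null checks.
public class PairedIntervalsSketch {

    record StrandedInterval(int contig, int start, int end, boolean forwardStrand) {
        StrandedInterval {
            if (end < start) throw new IllegalArgumentException("end < start");
        }
    }

    record PairedStrandedIntervals(StrandedInterval left, StrandedInterval right) {
        PairedStrandedIntervals {
            // validating here means equals/hashCode never see nulls
            Objects.requireNonNull(left);
            Objects.requireNonNull(right);
        }
    }
}
```

With the strand folded into StrandedInterval, a distal target is a single value, and a method like getDistalTargets can stand in for the hasDistalTargets / getDistalTargetsUpstreamOfBreakpoints trio.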

@cwhelan
Member Author

cwhelan commented Sep 13, 2017

@tedsharpe Thanks for the suggestions, I think it makes the code a bit cleaner. Do these changes match what you were thinking?

@tedsharpe
Contributor

Definitely cleaner. Thanks. Merge away.

@cwhelan cwhelan merged commit c5f4665 into master Sep 13, 2017
@cwhelan cwhelan deleted the cw_imprecise_variants branch September 13, 2017 15:06