-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Imprecise variant calling -- gather clustered evidence links #3469
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3469 +/- ##
===============================================
+ Coverage 79.905% 79.932% +0.026%
- Complexity 17918 17978 +60
===============================================
Files 1199 1205 +6
Lines 65102 65481 +379
Branches 10142 10226 +84
===============================================
+ Hits 52020 52340 +320
- Misses 9042 9051 +9
- Partials 4040 4090 +50
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just beginning to dig in, and I still have lots more to look at, but I wanted to record this overarching question:
It seems to me that in finding consistent evidence, that the exact strand of the source and target are irrelevant. All that matters, it seems to me, are whether the source and target are same strand or opposite strand. True, or not true?
return new Tuple2<>(intervals, evidenceTargetLinks); | ||
} | ||
|
||
private static void writeTargetLinks(final FindBreakpointEvidenceSparkArgumentCollection params, final Broadcast<ReadMetadata> broadcastMetadata, final List<EvidenceTargetLink> targetLinks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a triviality, but it would be better to make the signature:
private static void writeTargetLinks(final String targetLinkFile,
final ReadMetadata readMetadata,
final List<EvidenceTargetLink> targetLinks) {
And then on line 645 where you call it, it's just:
writeTargetLinks(params.targetLinkFile, broadcastMetadata.getValue(), evidenceTargetLinks);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
public String toBedpeString(ReadMetadata readMetadata) { | ||
final Map<Integer, String> contigIdToContigNameMap = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this. ReadMetadata stores the inverse map. Use readMetadata.getContigName( int id ) to turn a contigID into a contigName.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I missed that, thanks. done.
* the links will be combined as long as strands agree. Returned links have the intersection of two paired intervals as | ||
* source and target, with the lower-coordinate interval appearing as source. | ||
*/ | ||
public static List<EvidenceTargetLink> deduplicateTargetLinks(final List<EvidenceTargetLink> evidenceTargetLinks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to always record the links in canonical order (source < target)?
You could just have the EvidenceTargetLink constructor do the canonicalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting idea that I'm still pondering..
oldLink.sourceForwardStrand, | ||
newTarget, | ||
oldLink.targetForwardStrand, | ||
nextEvidence instanceof BreakpointEvidence.DiscordantReadPairEvidence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be BreakpointEvidence.SplitRead, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.. These ternaries are hard to read but in this clause and the next I'm saying:
- if it's a read pair use the old split read count, otherwise increment the split read count
- if it's a read pair increment the read pair count, otherwise use the old read pair count
Maybe if I pull these out into named variables it will be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine. I just misread it.
The code does assume that if it's not a discordant pair, it must be a split read, however. Might be a little fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to check SplitRead for the split read count..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also want to test that it's one or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a validation check in my iteration loop.
Re: strandedness, while it would be much easier not to have to track strand (and would probably save me from writing a lot of bugs), I think that we do need to do so if we ever want these data elements to be used for calling more complicated events like inversions or translocations. For example, for inversions we need to record whether we saw evidence linking +/+ strands or -/- strands, since the breakpoints might appear different at the two junctions due to insertions or deletions. For translocations, you definitely need to know which strands were involved so that you can extrapolate the rest of the event from the breakpoint. Agreed that it's a pain to keep track of it, though, and I find myself introducing strand errors all the time. |
I'm not suggesting that you ignore strand. Whereas, this alignment is inconsistent: |
Should have said, |
That's right, but for split reads the strand I'm using is not the actual strand of the alignment but the side of each aligned region on which the clipping occurred (clip on the left == negative strand, clip on the right == positive strand). Strandedness is more important for read pair data but I wanted to give it an analogue for split reads so that the evidence from the two types of reads could play together. Let's discuss more in person. |
Light begins to dawn on Marblehead. |
} | ||
} | ||
|
||
for (Iterator<SVIntervalTree.Entry<EvidenceTargetLink>> it = currentIntervalsWithTargets.iterator(); it.hasNext(); ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SVIntervalTree implements Iterable, so this could be expressed a little more tersely if you wanted to:
currentIntervalsWithTargets.forEach(entry -> links.add(entry.getValue()));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
while (breakpointEvidenceIterator.hasNext()) { | ||
final BreakpointEvidence nextEvidence = breakpointEvidenceIterator.next(); | ||
if (nextEvidence.hasDistalTargets(readMetadata)) { | ||
final SVInterval target = nextEvidence.getDistalTargets(readMetadata).get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, removed
link.source.intersect(existingLink.target), link.sourceForwardStrand, | ||
link.splitReads + existingLink.splitReads, Math.max(link.readPairs, existingLink.readPairs)); | ||
psiOverlappers.remove(); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right? You intended to break here?
If so, the while on line 102 should just be an if, but I'm guessing you intended to process all the overlappers rather than just the first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this logic is wrong / incomplete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that way I'm setting up the data that I'm calling this on, there can never be more than one overlapper. I'll either fix this to make it more clear or re-do it if I decide to take your suggestion to canonicalize the links.
@@ -119,6 +119,10 @@ | |||
@Argument(doc = "output dir for assemblies", fullName = "gfaDir", optional = true) | |||
public String gfaDir; | |||
|
|||
@Argument(doc = "file for non-assembled breakpoints in bedpe format", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems obvious to us, but you might want to say "output file"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd to add an SVReadFilter to the ReadMetadata, and pass that around to the distal target methods, all so that we can query the param minEvidenceMapQ. ReadMetadata is becoming quite a pot-au-feu. I assume you tried passing that param through the call stack and it became cumbersome?
import java.util.stream.Collectors; | ||
|
||
@DefaultSerializer(EvidenceTargetLink.Serializer.class) | ||
public class EvidenceTargetLink { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this could be final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
final int splitReads; | ||
final int readPairs; | ||
|
||
public EvidenceTargetLink(final SVInterval source, final boolean sourceForwardStrand, final SVInterval target, final boolean targetForwardStrand, final int splitReads, final int readPairs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make sure source and target are non-null in the constructor, you could simplify equals and hashCode by eliminating the null testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
|
||
private static List<BreakpointEvidence> collectAllEvidence( final JavaRDD<BreakpointEvidence> evidenceRDD, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that this is not called. Should it have been?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, sorry. Dev code that I mistakenly left in. Removed.
Re: Adding the My thought was that the metadata is really tied to the filter's parameters -- if we were to save out the metadata for a bam, for example, it would be helpful to know if the metadata was generated with an override of the default filter values if we wanted to parse and reinterpret it. But maybe that's not something we need to be overly concerned about right now. If you object to adding the filter to the metadata I'll remove it and pass the values around instead. |
Sorry. I get your thinking, but I think it would be cleaner to pass the param to each of those two classes. It cleans up the dependency tree -- those two classes don't depend on any of the ReadMetadata state but for that one param -- and avoids having to have the SVReadFilter be both Java and Kryo serializable. I'd prefer it, but won't insist upon it. |
No worries, I'll take it out. I didn't like the dual serialization that was necessary either. No big deal. |
@@ -0,0 +1,69 @@ | |||
package org.broadinstitute.hellbender.tools.spark.sv.utils; | |||
|
|||
public final class PairedStrandedIntervals { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be helpful to have this class more clearly model an uncertain breakpoint, rather than just a pair of intervals with strand. As it stands, it's not clear what the relationship between the two sequences is. I think I've figured out that in the existing code the two sequences are extensions of the alignments of a read pair -- and therefore expected to be on opposite strands.
So the data held by this class is fine, but the interpretation of the relationship between left and right could be made simpler and more explicit, I think.
I'd like to see a head sequence and a tail sequence (rather than a left and right) with the strandedness manipulated so that the actual sample sequence over the breakpoint -- if we were able to figure it out exactly -- would be some prefix of the head sequence followed by some suffix of the tail sequence.
This is probably totally unclear, and we can chat.
I've tried to study the code that turns a split read into a stranded interval pair so that it can be compared directly to the disparate read pair info, and I can't convince myself that it's correct. There seem to be a bunch of cases that aren't covered. For example, it seems to me that the first step would have to be reversing the cigars on negative strand SA's, so that you can see which pairs of SA's are genuinely supplemental to -- and potentially partially overlapping -- each other, and the actual order of the aligned sequences. (I'm sure this is unclear, too. We can talk.) Is there a unit test that mocks the reads we'd expect to get from an inversion breakpoint, both discordant and split. And that demonstrates that the stranded interval pairs generated from the reads are consistent with each other and with that breakpoint? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm done. I leave it to you to figure out whether my latest suggestions have merit, and, if so, whether we'll address them now or in another PR.
fb75966
to
b77565c
Compare
@tedsharpe I've addressed some of your comments here -- all the simpler stuff plus:
It seemed tricky to me to try to cluster these links during the initial pass over the reads while at the same time keeping track of coherent evidence. In my testing it doesn't seem like it is slow to run over the Want to take another look? |
It would be nice to have a StrandedInterval class, and then the distal targets could be stored as one of these in the various BreakpointEvidence subclasses that have distal targets, and the PairedStrandedIntervals class could just have two of them. This would also let you eliminate the methods hasDistalTargets and getDistalTargetsUpstreamOfBreakpoints -- getDistalTargets would do the job of all three. |
d43c987
to
af2e42e
Compare
@tedsharpe Thanks for the suggestions, I think it makes the code a bit cleaner. Do these changes match what you were thinking? |
Definitely cleaner. Thanks. Merge away. |
…ls to make links coherent in most cases
af2e42e
to
14fa736
Compare
This PR is the initial stage of implementing the calling of IMPRECISE variants in the SV pipeline. It introduces the concept of an evidence-target link, which joins an evidence interval to its distal target. This is an extension of the 'coherent' evidence concept previously used in determining evidence thresholds for assembly. The code in this PR contains the following changes:
PairedStrandedIntervalTree
data structure is introduced to allowSVIntervalTree
-style lookups for paired intervals.To finish this work, future PRs will 1) use the collected evidence target links to annotate our assembly called-variants with the number of split reads and read pairs observed in the original mappings and 2) create IMPRECISE VCF records for events that have enough evidence-target-link support, first for deletions and then possibly for other variant types.
Initial testing shows that these changes slightly increase the number of variants called by the current pipeline, on both the CHM mix and NA12878 data sets, without greatly affecting run time.