PR series for complex SV, part 3 #3457

SHuang-Broad · 2017-08-17T22:51:18Z

This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, but NOT significantly overlapping each other.

We used to call inversions from such alignments, but it is more appropriate to emit BND records, because such signals are often generated by inverted segmental duplications or simply by inverted mobile element insertions. Confidently interpreting and distinguishing between such events requires other types of evidence, and is better handled by downstream logic units.

Inverted duplications are NOT dealt with in this PR and will be handled in the next one.

NEEDS TO WAIT UNTIL PART 1 & 2 ARE IN.

codecov-io · 2017-08-17T23:31:37Z

Codecov Report

Merging #3457 into master will decrease coverage by 0.174%.
The diff coverage is 41.199%.

@@               Coverage Diff               @@
##              master     #3457       +/-   ##
===============================================
- Coverage     80.073%   79.899%   -0.174%     
- Complexity     17798     17814       +16     
===============================================
  Files           1194      1198        +4     
  Lines          64540     64743      +203     
  Branches       10021     10056       +35     
===============================================
+ Hits           51679     51729       +50     
- Misses          8859      8998      +139     
- Partials        4002      4016       +14

Impacted Files	Coverage Δ	Complexity Δ
.../sv/discovery/prototype/InsDelVariantDetector.java	`0% <ø> (ø)`	`0 <0> (ø)`	⬇️
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java	`0% <ø> (ø)`	`0 <0> (ø)`	⬇️
...tute/hellbender/tools/spark/sv/utils/RDDUtils.java	`0% <0%> (ø)`	`0 <0> (?)`
...iscoverFromLocalAssemblyContigAlignmentsSpark.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
...v/evidence/experimental/FindSmallIndelRegions.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
.../tools/spark/sv/discovery/BreakEndVariantType.java	`0% <0%> (ø)`	`0 <0> (?)`
...er/tools/spark/sv/discovery/AlignmentInterval.java	`91.057% <100%> (-0.469%)`	`26 <2> (+1)`
...nder/tools/spark/sv/discovery/SvTypeInference.java	`73.529% <100%> (ø)`	`9 <0> (ø)`	⬇️
...te/hellbender/tools/spark/sv/discovery/SvType.java	`100% <100%> (ø)`	`6 <1> (ø)`	⬇️
...bender/tools/spark/sv/discovery/AlignedContig.java	`91.489% <100%> (ø)`	`13 <0> (ø)`	⬇️
... and 12 more

SHuang-Broad · 2017-08-21T17:23:50Z

Step 4 towards #2703

pshapiro4broad · 2017-08-21T17:23:14Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AlignedContig.java

- * its name
- * its sequence as produced by the assembler (no reverse complement like in the SAM record if it maps to '-' strand), and
- * its stripped-down alignment information.
+ *   its name


When formatted as javadoc this will all appear as a single paragraph of text. To write a list in javadoc you need to use some basic HTML:

/**
 * Locally assembled contig:
 * <ul><li>its name
 * <li>its sequence ...
 * <li>its stripped-down ...
 * </ul>
 */

Note that you don't need </li> as it's automatically closed by <li>.

pshapiro4broad · 2017-08-21T17:27:48Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/FileUtils.java

+
+public final class FileUtils {
+
+    public static void writeLinesToSingleFile(final Iterator<String> linesToWrite, final String fileName) {


I believe that this can be replaced with Files.write()

Thanks for taking a look, @pshapiro4broad. This would also allow us to write to the local FS, Hadoop, and Google buckets, i.e. it is more general than Files.write().
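For readers following along, here is a minimal sketch of streaming an Iterator of lines to a java.nio.file.Path. It is not the helper from this PR: the class and method names are hypothetical, and whether a given Path resolves to local disk, HDFS, or a bucket depends on which NIO filesystem providers are installed, which is an assumption here. Note that Files.write() takes an Iterable rather than an Iterator, so an Iterator-based helper has to do the loop itself.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

// Hypothetical sketch, not the FileUtils class under review.
final class LineWriterSketch {

    // Writes every line from the iterator to the target, one per line.
    // The target is a Path, so the filesystem provider behind it decides
    // where the bytes actually go.
    static void writeLines(final Iterator<String> lines, final Path target) {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                writer.write(lines.next());
                writer.newLine();
            }
        } catch (final IOException e) {
            // Wrap so callers (e.g. lambdas) do not need a checked exception.
            throw new UncheckedIOException("Could not write to " + target, e);
        }
    }
}
```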

pshapiro4broad · 2017-08-21T17:29:55Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/FileUtils.java

+    public static boolean createDirInBucketToWriteTo(final String pathString) {
+        try {
+            Utils.nonNull(pathString);
+            if ( java.nio.file.Files.exists(java.nio.file.Paths.get(pathString)) )


If possible, you should modify your imports so a fully qualified name isn't required for java.nio.file symbols.

This is to deal with name clashes between nio Files and hadoop Files.

pshapiro4broad · 2017-08-21T17:32:22Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AlignmentInterval.java

@@ -36,9 +37,11 @@
    public final int mismatches;
    public final int alnScore;

-    // if this is true, fields "mapQual", "mismatches", "alnScore" should be viewed with care as they were simply copied from the
-    // original alignment (not for "mismatches"), which after the split are wrong (we didn't recompute them because that would require expensive SW re-alignment)
+    // if any of the following boolean fields are true, fields "mapQual", "mismatches", "alnScore" should be viewed


It looks like at least part of this should be put in a javadoc for this field.

pshapiro4broad · 2017-08-21T17:34:20Z

...in/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AnnotatedVariantProducer.java

+                            novelAdjacencyReferenceLocations.complication, it.next(), contigAlignments, broadcastReference);
+            result.add(mateRecord);
+        }
+        return result.iterator();


Why not return List<> here? A caller can easily convert it to an iterator if needed. Returning List provides more flexibility to the caller.

That's constrained by the Spark API.

This method is only being invoked in a lambda as opposed to via method reference right now. I do think that @pshapiro4broad is right in that returning lists is more flexible if we have them. Why not just return the list and then call iterator on it in the lambda in dealWithSimpleStrandSwitchBkpts?
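A rough sketch of that suggestion, with no Spark dependency: the producer returns a List, and the lambda handed to a flatMap-style API calls iterator() on it. The flatMap method below is only a stand-in for an API that wants an Iterator per input element, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; String stands in for VariantContext.
final class ListVsIteratorSketch {

    // Returns a List so callers can check its size, index into it, or stream it.
    static List<String> produceRecords(final String breakpoint) {
        return Arrays.asList(breakpoint + "_1", breakpoint + "_2");
    }

    // Stand-in for a flatMap-style API that requires an Iterator per input.
    static <T, R> List<R> flatMap(final List<T> inputs, final Function<T, Iterator<R>> f) {
        final List<R> out = new ArrayList<>();
        for (final T input : inputs) {
            f.apply(input).forEachRemaining(out::add);
        }
        return out;
    }

    public static void main(final String[] args) {
        // The conversion to Iterator happens only at the call site that needs it.
        System.out.println(flatMap(Arrays.asList("bnd_a", "bnd_b"),
                bp -> produceRecords(bp).iterator()));
    }
}
```

The producer stays usable from tests and other callers that want a List, and the Iterator requirement is confined to the one lambda.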

SHuang-Broad · 2017-08-23T23:34:08Z

@cwhelan this is now ready. Please review. Thanks!

cwhelan

Looks pretty good but I think it could use a little more commenting and/or descriptions for some of the meaty logic, and I'm a little unsure that I understood all of it.

cwhelan · 2017-08-28T14:37:26Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AlignmentInterval.java

+     * Computes overlap between reference span of the two input alignment intervals.
+     */
+    static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
+        Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,


could you rename the sortAlignments() method to alignmentIntervalComparator() or something similar? That method doesn't actually sort the alignments so the name doesn't seem right.

cwhelan · 2017-08-28T14:40:35Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AlignmentInterval.java

+     */
+    static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) {
+        Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0,
+                "assumption that first input AI reside a place earlier than second input is violated: \n" +


This message might read more clearly: "assumption that the first input AI is upstream of the second is violated". But, why have this requirement? Why not swap the order of comparison depending on which AI actually comes first? The signature of the method doesn't specify that the intervals need to be ordered so this could be frustrating to others trying to use it.

removed the assertion because it is really unnecessary when computing overlaps (order doesn't matter).

also added test
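To illustrate why the ordering requirement is unnecessary, here is a minimal sketch of an overlap computation that gives the same answer regardless of argument order. The Span class is a hypothetical stand-in for a reference span with 1-based closed coordinates; it is not AlignmentInterval or SVInterval.

```java
// Hypothetical sketch of an order-independent overlap on reference spans.
final class OverlapSketch {

    // Closed interval [start, end] on one contig.
    static final class Span {
        final String contig;
        final int start;
        final int end;

        Span(final String contig, final int start, final int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }
    }

    // Number of reference bases shared by the two spans, 0 if none.
    // max/min make the result symmetric, so no ordering check is needed.
    static int overlap(final Span one, final Span two) {
        if (!one.contig.equals(two.contig)) {
            return 0;
        }
        final int overlapStart = Math.max(one.start, two.start);
        final int overlapEnd = Math.min(one.end, two.end);
        return Math.max(0, overlapEnd - overlapStart + 1);
    }
}
```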

cwhelan · 2017-08-28T14:41:32Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AlignmentInterval.java

+        if ( !one.referenceSpan.getContig().equals(two.referenceSpan.getContig()) ) return  0;
+
+        // dummy number for chr to be used in constructing SVInterval, since input CA has 2 AI & both map to the same chr
+        final int dummyChr = 1;


What if you made this -1 instead of 1? That way it can't be mistaken for a real contig ID by someone who doesn't know what's going on.

done. though this is a temp var that calling functions couldn't reach

cwhelan · 2017-08-28T14:46:46Z

...in/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AnnotatedVariantProducer.java

    private static final long serialVersionUID = 1L;

+    public static Iterator<VariantContext> produceMultipleAnnotatedVcFromNovelAdjacency(final NovelAdjacencyReferenceLocations novelAdjacencyReferenceLocations,


Can you add a comment to this method describing what the parameters should be? It's not clear what should be in the Iterables from the signature. And why pass in Iterables instead of eg lists?

added and changed assertion to size 2 exactly

cwhelan · 2017-08-28T14:52:07Z

...in/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AnnotatedVariantProducer.java

+        final List<VariantContext> result = new ArrayList<>();
+        result.add(record);
+        // hack for now because up to this point inferredType would have max of 2 only
+        while (it.hasNext()) {


I don't really understand why you're iterating over this if there are only going to be two things in the list. If there were more inferredTypes, you'd create a bunch of duplicate records that differed only in their type field. Why not: make inferredTypes a list, validate that there are only two of them, and then just grab the second one and remove the loop?

changed to asserting only 2 (and checked)
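The shape of that change can be sketched as follows: take a List, reject anything that is not exactly two elements, and build the two records directly with no loop. Strings stand in for the inferred types and the resulting records, and the names are hypothetical rather than taken from AnnotatedVariantProducer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "validate exactly two, then take both".
final class MatePairSketch {

    static List<String> toRecords(final List<String> inferredTypes) {
        if (inferredTypes.size() != 2) {
            throw new IllegalArgumentException(
                    "Expected exactly 2 inferred types but got " + inferredTypes.size());
        }
        // No loop: exactly one record per mate.
        final String first = "record for " + inferredTypes.get(0);
        final String second = "record for " + inferredTypes.get(1);
        return Arrays.asList(first, second);
    }
}
```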

cwhelan · 2017-08-28T17:10:39Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                    final List<CigarElement> resultCEs = threeSections._1();
+                    final int a = readBasesConsumed + ce.getLength() - clipLength;
+                    final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S;
+                    if (clipFrom3PrimeEnd) {


This code might be a little more concise if you put these things in the sublist in the same order all the time, and then reverse the sublist if clipFrom3PrimeEnd with Collections.reverse(), and then append the sublist to resultCEs.
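A minimal sketch of that idea, using plain strings as stand-ins for CigarElement: always build the pieces in one fixed order (soft clip first, then the remainder of the split element), reverse them when clipping from the 3' end, and append. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; strings such as "15S" stand in for CigarElement.
final class ReverseSublistSketch {

    static List<String> appendClipPieces(final List<String> resultSoFar,
                                         final String softClip,
                                         final String remainder,
                                         final boolean clipFrom3PrimeEnd) {
        // Always the same order: clipped piece first, remainder second.
        final List<String> pieces = new ArrayList<>(Arrays.asList(softClip, remainder));
        if (clipFrom3PrimeEnd) {
            // For a 3' clip the remainder precedes the soft clip.
            Collections.reverse(pieces);
        }
        final List<String> result = new ArrayList<>(resultSoFar);
        result.addAll(pieces);
        return result;
    }
}
```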

cwhelan · 2017-08-28T17:11:17Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                        }
+                        resultCEs.addAll(cigarElements.subList(idx+1, cigarElements.size()));
+                    }
+                    if (!threeSections._3().isEmpty())


You don't really need this if check, you could still add it in if it's empty.

cwhelan · 2017-08-28T17:37:02Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+    }
+
+    /**
+     * Taking advantage of the fact that for input read, we know it has only two alignments that map to the same reference


I'd add an assertion to this method too if that's the case.

cwhelan · 2017-08-28T17:39:23Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+
+        final JavaPairRDD<ChimericAlignment, byte[]> simpleStrandSwitchBkpts =
+                longReads
+                        .mapToPair(SimpleStrandSwitchVariantDetector::convertAlignmentIntervalToChimericAlignment)


To me a more straightforward flow would be to filter them on splitPairStrongEnoughEvidenceForCA and then convert the remaining ones to ChimericAlignments, rather than filtering on nonNull.
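Sketched with plain Java streams rather than a Spark RDD, the suggested flow decides first and converts only the survivors, so the conversion never needs to return null. The mapping-quality threshold and the string output are hypothetical placeholders for the real evidence check and ChimericAlignment construction.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of "filter, then convert" instead of "convert, then drop nulls".
final class FilterThenMapSketch {

    // Placeholder for an evidence check such as splitPairStrongEnoughEvidenceForCA.
    static boolean strongEnough(final int mapQual, final int minMapQual) {
        return mapQual >= minMapQual;
    }

    static List<String> convert(final List<Integer> mapQuals, final int minMapQual) {
        return mapQuals.stream()
                .filter(mq -> strongEnough(mq, minMapQual)) // decide first
                .map(mq -> "chimeric(mq=" + mq + ")")       // convert only what passed
                .collect(Collectors.toList());
    }
}
```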

cwhelan · 2017-08-28T17:53:08Z

.../discovery/prototype/VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest.java

+
+import java.io.IOException;
+
+public class VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest extends BaseTest {


It might be worth adding a test for extractCigarElements as well. Perhaps some of the other methods in the class, too.

added test for extractCigarElements for now. I am planning to get the prototype code in first, then test the code extensively, just in case I need to leave for several days...

SHuang-Broad

Implemented requested changes (mostly, but not all) in 3 commits. @cwhelan please take another look. Thanks!

SHuang-Broad · 2017-08-29T16:50:05Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                        .noneMatch(op -> op.equals(CigarOperator.N) || op.isPadding()),
+                "Input alignment contains padding or skip operations, which is currently unsupported: " + input.toPackedString());
+
+        final Tuple3<List<CigarElement>, List<CigarElement>, List<CigarElement>> threeSections = extractCigarElements(input);


SHuang-Broad · 2017-08-29T16:50:14Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                else { // enough read bases would be clipped
+
+                    if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I))
+                        throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here");


SHuang-Broad · 2017-08-29T16:50:30Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                    if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I))
+                        throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here");
+
+                    // dead with cigar first


it should have been "deal with"....

SHuang-Broad · 2017-08-29T17:28:54Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+                    final List<CigarElement> resultCEs = threeSections._1();
+                    final int a = readBasesConsumed + ce.getLength() - clipLength;
+                    final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S;
+                    if (clipFrom3PrimeEnd) {


SHuang-Broad · 2017-09-05T18:09:30Z

ping @cwhelan .

cwhelan

This looks good now, just a few minor comments.

cwhelan · 2017-09-05T18:18:01Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

@@ -67,7 +72,8 @@ public boolean test(final AlignedContig contig) {

    /**
     * Removes overlap from a designated alignment interval, so that the inverted duplicated reference span is minimal.
-     * If the two alignment intervals are NOT overlapping, return the original read.
+     * If the two alignment intervals are NOT overlapping, return the original aligned contig.
+     * For algorithm {@see <a href="https://github.com/broadinstitute/dsde-methods-sv/pull/8"}


This is our private repo so I'm not sure we should reference it in comments in the public code -- it would just be frustrating for someone external.

Agreed and removed.
But we are more and more in need of a central place for documenting the algorithms used in the various stages of the pipeline.

cwhelan · 2017-09-05T18:36:35Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

@@ -77,6 +83,8 @@ private static AlignedContig removeOverlap(final AlignedContig contig) {
        } else {
            final AlignmentInterval one = contig.alignmentIntervals.get(0),
                                    two = contig.alignmentIntervals.get(1);
+            // js is for "the shoot-off reference location of a jump that linked to alignment intervals", and


Is this supposed to be "linked two alignment intervals"?

I'm not sure I understand what "shoot-off" and "landing" are -- or, I guess I can imagine, but maybe something like "jumpStart" and "jumpEnd" might be more intuitive?

!OoO!, yes, it should be "two" instead of "to".

I have in mind a cartoonish depiction of a tiny anime figure sliding along the reference guided by the alignments of the read, sometimes jumping back and forth. Sorry for this too-pictorial naming scheme.

cwhelan · 2017-09-05T18:43:25Z

...stitute/hellbender/tools/spark/sv/discovery/prototype/SimpleStrandSwitchVariantDetector.java

+
+                    // deal with cigar first
+                    newMiddleSection.add( new CigarElement(clipLengthOnRead, CigarOperator.S) );
+                    final int a = readBasesConsumed + ce.getLength() - clipLengthOnRead;


Would a more descriptive name for a be basesRemainingOnCigarElementAfterClipping? Just trying to make sure I understand the algorithm.

yes, that's right and updated with documentation
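The arithmetic behind that name can be sketched in isolation. This assumes readBasesConsumed counts the read bases consumed by CIGAR elements before the current one, and that the current element is the one straddling the clip boundary; both are assumptions about the surrounding code, and the class and method names are hypothetical.

```java
// Hypothetical sketch of splitting one CIGAR element at a clip boundary.
final class ClipSplitSketch {

    // Returns {basesClippedFromThisElement, basesRemainingOnCigarElementAfterClipping}.
    static int[] splitElement(final int readBasesConsumed,
                              final int elementLength,
                              final int clipLengthOnRead) {
        // The element must straddle (or end exactly at) the clip boundary.
        if (readBasesConsumed >= clipLengthOnRead
                || readBasesConsumed + elementLength < clipLengthOnRead) {
            throw new IllegalArgumentException("Element does not contain the clip boundary");
        }
        final int clippedFromThisElement = clipLengthOnRead - readBasesConsumed;
        // Same expression as "a" in the diff above.
        final int basesRemaining = readBasesConsumed + elementLength - clipLengthOnRead;
        return new int[]{clippedFromThisElement, basesRemaining};
    }
}
```

For example, with 30 read bases already consumed, a 50-base element, and a 45-base clip, 15 bases of the element are clipped and 35 remain; the two parts always sum to the element length.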

cwhelan · 2017-09-05T18:46:57Z

...in/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/AnnotatedVariantProducer.java

@@ -90,7 +90,7 @@ static VariantContext produceAnnotatedVcFromInferredTypeAndRefLocations(final Si
                .id(inferredType.getInternalVariantId())
                .attribute(GATKSVVCFConstants.SVTYPE, inferredType.toString());

-        if (inferredType instanceof SimpleSVType)


I don't think there's anything wrong with using instanceof.

Strangely, it failed... Plus, it might be better to take an alternative route, if one exists, that doesn't use reflection and takes the same number of lines of code.

* added BreakEndVariantType for outputting BND formatted VCF records, and a corresponding method in AnnotatedVariantProducer (no SVLEN for such records)
* added code path SimpleStrandSwitchVariantDetector for dealing with simple strand switch BND variants, which we now emit; also added utility classes, including RDDUtils

SHuang-Broad mentioned this pull request Aug 17, 2017

Provide a tool for outputting possible pathogen injection site on (human) host #3458

Closed

pshapiro4broad reviewed Aug 21, 2017

SHuang-Broad force-pushed the sh_cpx_sv_pr_3 branch from 68a5e44 to 12bed02 Compare August 23, 2017 23:32

SHuang-Broad requested a review from cwhelan August 23, 2017 23:34

SHuang-Broad mentioned this pull request Aug 24, 2017

Add MATEID annotation to BND records #3508

Closed

cwhelan requested changes Aug 28, 2017

SHuang-Broad commented Aug 29, 2017

SHuang-Broad assigned cwhelan Sep 5, 2017

cwhelan approved these changes Sep 5, 2017

SHuang-Broad force-pushed the sh_cpx_sv_pr_3 branch from 897ef07 to c10d064 Compare September 5, 2017 22:18

SHuang-Broad force-pushed the sh_cpx_sv_pr_3 branch from c10d064 to 967283e Compare September 6, 2017 16:38

SHuang-Broad merged commit a8d9fd4 into master Sep 6, 2017

SHuang-Broad deleted the sh_cpx_sv_pr_3 branch September 6, 2017 17:43

