PR series for complex SV, part 4 #3464

SHuang-Broad · 2017-08-21T00:06:07Z

This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, significantly overlapping each other on their reference spans.

We used to call inversions from such alignments when feasible, but it is more appropriate to emit inverted duplication records.

NEEDS TO WAIT UNTIL PARTS 1, 2 AND 3 ARE IN.

codecov-io · 2017-08-21T00:46:07Z

Codecov Report

Merging #3464 into master will decrease coverage by 0.123%.
The diff coverage is 54.856%.

@@               Coverage Diff               @@
##              master     #3464       +/-   ##
===============================================
- Coverage     79.377%   79.254%   -0.123%     
- Complexity     17314     17379       +65     
===============================================
  Files           1140      1142        +2     
  Lines          62643     62889      +246     
  Branches        9497      9546       +49     
===============================================
+ Hits           49724     49842      +118     
- Misses          9133      9241      +108     
- Partials        3786      3806       +20

Impacted Files	Coverage Δ	Complexity Δ
...te/hellbender/tools/spark/sv/discovery/SvType.java	`100% <ø> (ø)`	`6 <0> (ø)`	⬇️
...ls/spark/sv/discovery/GappedAlignmentSplitter.java	`95% <ø> (ø)`	`18 <0> (ø)`	⬇️
.../hellbender/tools/spark/sv/utils/SVFastqUtils.java	`65.152% <ø> (+1.849%)`	`5 <0> (ø)`	⬇️
.../sv/StructuralVariationDiscoveryPipelineSpark.java	`94.872% <ø> (-0.065%)`	`6 <0> (ø)`
...e/hellbender/tools/spark/sv/utils/SVVCFWriter.java	`86.047% <ø> (ø)`	`10 <0> (ø)`	⬇️
...y/prototype/SimpleStrandSwitchVariantDetector.java	`27.848% <0%> (-4.744%)`	`13 <0> (ø)`
.../sv/discovery/prototype/InsDelVariantDetector.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
...tools/spark/sv/sga/AlignAssembledContigsSpark.java	`100% <100%> (+2.222%)`	`12 <0> (ø)`	⬇️
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java	`94.505% <100%> (ø)`	`23 <1> (ø)`	⬇️
.../DiscoverVariantsFromContigAlignmentsSGASpark.java	`80.556% <28.571%> (-6.012%)`	`13 <0> (ø)`
... and 16 more

SHuang-Broad · 2017-08-21T17:24:24Z

Step 5 towards #2703

SHuang-Broad · 2017-08-25T20:04:27Z

The most mind-numbing details are explained here.

SHuang-Broad · 2017-09-07T18:21:17Z

This brings to us approximately 60 variants (without any filter applied).

@cwhelan Please take time to review, another PR (supposedly dealing with simple "translocation"s) is going to line up after this. Then the major graph-based one, but expected to take sometime to codeup. Thanks!

cwhelan

Looks good overall, just a few comments here and there mostly on naming, and a question about the logic in the ref-walk-distance code.

cwhelan · 2017-09-11T02:09:13Z

.travis.yml

@@ -1,6 +1,7 @@
 language: java
 sudo: required
 dist: trusty
+group: deprecated-2017Q3


Is this supposed to go in here?

nope, this was done to see if a fix for travis hiccup would work. Will be removed before final clean commit.

cwhelan · 2017-09-11T17:19:10Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/SvCigarUtils.java

+                        + totalRefLen + ") of reference bases spanned by the cigar, indicated by cigar " + TextCigarCodec.encode(cigarAlong5To3DirOfRead));
+
+        final int readLength = immutableViewOnOriginalCigar.stream().mapToInt(ce -> ce.getOperator().consumesReadBases() ? ce.getLength() : 0).sum();
+        final List<CigarElement> cigarElements = walkBackward ? Lists.reverse(immutableViewOnOriginalCigar) : immutableViewOnOriginalCigar;


There's a CigarUtils.invertCigar() method -- could you use that?

Done with some changes in lines above.

cwhelan · 2017-09-11T18:27:47Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/GATKSVVCFHeaderLines.java

@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

+        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,


As discussed in person, I feel like the word 'strands' is a little overloaded here, but I leave it up to you to decide whether or not to change this.

Using the suggested "orientation" word, and expanded a bit in its header description.

cwhelan · 2017-09-11T18:28:19Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/GATKSVVCFHeaderLines.java

@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

+        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,
+                "Strands of the duplicated sequence on alt allele, one group for each alt allele (currently only available for inverted duplication variants)"));
+        addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INV_TRANS_INS_REF_SPAN, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String,


As discussed in person it might be a little simpler to just re-use the insert sequence mappings annotation for this.

While changing the code for re-using this annotation, I realized that the case is a bit more complicated than what we thought of:
This annotation is for the area that is NOT covered by the alignment ref span overlap

two <--------------------- one ------------------------> ||||||||||| these bar bases are the inv ins annotated part

But there could be a part of the contig that is not at all covered by these two alignments, i.e. that part of the contig is unmapped. Such sequence would go into the annotation INSERTED_SEQUENCE, and without any INSERTED_SEQUENCE_MAPPINGS.
And for the case when there's 3 alignments (not the case being covered here but will be in the graph-based cpx case), where the 1st and last are like the case here, with the middle one mapped somewhere else on the reference, we need these two different annotations.

Do you agree?

Do we allow multiple "INSERTED_SEQUENCE" values? If so I'd think this would still be fine, right? What if we made a 1-to-1 mapping between INSERTED_SEQUENCE values and INSERTED_SEQUENCE_MAPPINGS values, and gave the ones without a mapping a value of "." or something to indicate that it doesn't have a mapping? Would that solve the issue?

I like this proposal better, so I am going to implement it in this PR in the next commit.

To make it more precise, based on our discussions this morning, and the fact that some of the insertion mappings are "extracted" from a larger alignment block, I'm going to strip down the mapping information stored in the INSERTED_SEQUENCE_MAPPINGS, because

the extracted info could not make guarantee that MQ, NM is correct; and

if the analyst is keen on the exact mapping and alignment, he/she can always take the provided inserted sequence and the given mapping location and run alignment with his/her choice of alignment parameters.

So the information stored in INSERTED_SEQUENCE_MAPPINGS will be formatted as

refChr_startInclusive_endInclusive_ORIENTATION_O/E

where the O/E is to signal if the mapping location is extracted from a bigger alignment block or not.

What do you think?

cwhelan · 2017-09-11T18:38:48Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/SvCigarUtils.java

+        int readBasesConsumed = 0;
+        int refWalkDist = 0;
+
+        for (final CigarElement ce : cigarElements) {


What about the case where the start on read occurs in the middle of a cigar operation? E.g. if the cigar is 10M1D5M and I say start on read is 5 and distanceOnRead = 10? It looks to me like this would return 15, and not the value I would be expecting given the description, 11.

Corrected and added proposed test case accordingly. Turns out very symmetrical to the similar function below for computing read walk distance.

cwhelan · 2017-09-11T18:45:36Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/BreakpointComplications.java

@@ -30,6 +36,10 @@
    @SuppressWarnings("unchecked")
    private static final List<String> DEFAULT_CIGAR_STRINGS_FOR_DUP_SEQ_ON_CTG = new ArrayList<>(Collections.EMPTY_LIST);

+    private static final List<Strand> DEFAULT_INV_DUP_REF_STRAND = Collections.singletonList(Strand.POSITIVE);
+    private static final List<Strand> DEFAULT_INV_DUP_CTG_STRANDs_FR = Arrays.asList(Strand.POSITIVE, Strand.NEGATIVE);


The lower case 's' in the name of this constant looks really weird -- I'd just capitalize everything.

Oops, unintended. Fixed.

cwhelan · 2017-09-11T19:10:43Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/BreakpointComplications.java

+        if (firstAlignmentInterval.forwardStrand) {
+            final int alpha = firstAlignmentInterval.referenceSpan.getStart(),
+                      omega = secondAlignmentInterval.referenceSpan.getStart();
+            dupSeqRepeatUnitRefSpan = new SimpleInterval(firstAlignmentInterval.referenceSpan.getContig(),


This is dependent on the two reference span intervals overlapping if I'm reading this right; it might be nice to document or validate that in this method to remind the reader/maintainer.

Yes, that is why this method is called initForInvDup(), hence assuming the input has "significant" overlaps on their ref spans. Documented.

cwhelan · 2017-09-11T19:21:47Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/BreakpointComplications.java

+
+        final int start, end; // intended to be 0-based, semi-open [start, end)
+        final boolean needRC;
+        if (firstAlignmentInterval.forwardStrand) {


Not a big deal, and maybe you feel like it might make things more confusing, but it seems like you could extract this boolean out and xor it with the booleans below and not need if/else clauses with largely duplicated code.

Actually, the details are not duplicated (though they look very similar), so trying to go for code simplicity might, as you suspected, make the code much less readable, so I'm going to keep it the way as it is right now.
But, I do feel that we need some "contig normalization procedure" up front, which would guarantee us that the representation we get from the previous stages are restricted. But that's for future.

cwhelan · 2017-09-11T19:24:43Z

...ain/java/org/broadinstitute/hellbender/tools/spark/sv/discovery/BreakpointComplications.java

            cigarStringsForDupSeqOnCtg = null;
            dupAnnotIsFromOptimization = false;
        }
+
+        if (input.readBoolean()) {


Is there a reason not to use readObjectOrNull and writeObjectOrNull in this serialization code?

Nope, except personal preference for native types. Plus SimpleInternal is marked Java serializable but not using Kryo.

cwhelan · 2017-09-11T19:34:17Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/GATKSVVCFConstants.java

@@ -31,6 +32,8 @@
    public static final String INSERTED_SEQUENCE_MAPPINGS = "INSERTED_SEQUENCE_MAPPINGS";
    public static final String HOMOLOGY = "HOMOLOGY";
    public static final String HOMOLOGY_LENGTH = "HOMOLOGY_LENGTH";
+    public static final String ALT_HAPLOTYPE_SEQ = "ALT_HT_SEQ";


Might as well spell out HAPLOTYPE here to make it clear in the VCF file

SHuang-Broad · 2017-09-25T16:11:35Z

@cwhelan , after our offline discussion about how to make the insertion annotation easier for downstream analysis, I went back and did a check on how insertion annotation are extracted, and here's a summary:

Keys: NARL—NovelAdjacencyReferenceLocations, CA—ChimericAlignment, BC—BreakpointComplications

NARL contains {mate breakpoint locations, orientation change between the breakpoints, complications around the breakpoint}, where the complication is mainly used for two purposes: 1) adjusting the exact locations of the breakpoints, and 2) useful for downstream annotations for the associated VCF records.
NARL is constructed from an input CA, which in turn has two ways of being constructed: 1) deliberate construction with two neighboring alignments on the contig, 2) construction from an input contig whose alignments are scanned through in a semi pair-wise fashion. The second way is the master version currently in use in out pipeline, and is planned to be phased out eventually. Note that the second way extracts neighboring alignments that it considers good quality and send the information to the first version for actual construction.
The ctor for CA that takes two neighboring alignments also takes in a string representation of mapping locations of suspected insertions. In master version, this information is extracted from the alignments that are considered not strong enough, e.g. lower MQ, shorter alignment length, but the actual inserted sequence is not extracted in CA, but rather in BC.
The BC field in NARL, holding inserted sequence, micro-homology and other information (e.g. for duplication), does not contain where the inserted sequence, if any, is mapped, and the inserted sequence is extracted from the distance on the contig between two neighboring alignments stored in the input CA.
The variant, after its NARL locations are pinned down, is annotated by information stored in both CA and BC, where the information for BC is critical for reconstructing the alt haplotype and those in CA mainly for evaluating evidence strength.

Hence we should put the suspected inserted sequence and mapping in the same class, proposed to be in BC.

But I think this PR is getting bigger than it should be, and the cigar operation is what I now need for the graph-based cpx sv detection. So I am thinking of creating a ticket for this issue, and follow up in a month. What you do think?

cwhelan · 2017-09-27T01:53:48Z

This looks good to me. Fine to leave the other stuff for future PRs.

SHuang-Broad · 2017-10-02T15:01:52Z

opened ticket #3647 to tackle this discussed problem.

* adding capability to discover inverted duplications, BreakpointComplications (and associated classes) expanded to cover this new type of variants, as well as VCF annotation; * lifted Strand class in SVFastqUtils to upper level; * added some tests for checking headers in test files

SHuang-Broad force-pushed the sh_cpx_sv_pr_invDup branch from 19cd56b to bea70ff Compare September 6, 2017 22:14

SHuang-Broad added the SV label Sep 6, 2017

SHuang-Broad force-pushed the sh_cpx_sv_pr_invDup branch 2 times, most recently from 47a3f6c to 3480acf Compare September 7, 2017 15:26

SHuang-Broad requested a review from cwhelan September 7, 2017 18:18

SHuang-Broad assigned cwhelan Sep 7, 2017

SHuang-Broad force-pushed the sh_cpx_sv_pr_invDup branch from 3480acf to 1e57989 Compare September 7, 2017 19:21

cwhelan requested changes Sep 11, 2017

View reviewed changes

cwhelan approved these changes Sep 27, 2017

View reviewed changes

SHuang-Broad force-pushed the sh_cpx_sv_pr_invDup branch from 8462bbf to 8e15ee6 Compare October 2, 2017 16:31

SHuang-Broad force-pushed the sh_cpx_sv_pr_invDup branch from 8e15ee6 to 28b726c Compare October 3, 2017 19:59

SHuang-Broad merged commit 1dae6b5 into master Oct 3, 2017

SHuang-Broad deleted the sh_cpx_sv_pr_invDup branch October 3, 2017 22:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR series for complex SV, part 4 #3464

PR series for complex SV, part 4 #3464

SHuang-Broad commented Aug 21, 2017

codecov-io commented Aug 21, 2017 •

edited

Loading

SHuang-Broad commented Aug 21, 2017

SHuang-Broad commented Aug 25, 2017

SHuang-Broad commented Sep 7, 2017

cwhelan left a comment

cwhelan Sep 11, 2017

SHuang-Broad Sep 13, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 20, 2017

SHuang-Broad Sep 21, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017 •

edited

Loading

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

cwhelan Sep 11, 2017

SHuang-Broad Sep 14, 2017

SHuang-Broad commented Sep 25, 2017

cwhelan commented Sep 27, 2017

SHuang-Broad commented Oct 2, 2017

PR series for complex SV, part 4 #3464

PR series for complex SV, part 4 #3464

Conversation

SHuang-Broad commented Aug 21, 2017

codecov-io commented Aug 21, 2017 • edited Loading

Codecov Report

SHuang-Broad commented Aug 21, 2017

SHuang-Broad commented Aug 25, 2017

SHuang-Broad commented Sep 7, 2017

cwhelan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SHuang-Broad Sep 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SHuang-Broad commented Sep 25, 2017

cwhelan commented Sep 27, 2017

SHuang-Broad commented Oct 2, 2017

codecov-io commented Aug 21, 2017 •

edited

Loading

SHuang-Broad Sep 14, 2017 •

edited

Loading