Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR series for complex SV, part 4 #3464

Merged
merged 1 commit into from
Oct 3, 2017
Merged

Conversation

SHuang-Broad
Copy link
Contributor

This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, significantly overlapping each other on their reference spans.

We used to call inversions from such alignments when feasible, but it is more appropriate to emit inverted duplication records.

NEEDS TO WAIT UNTIL PARTS 1, 2 AND 3 ARE IN.

@codecov-io
Copy link

codecov-io commented Aug 21, 2017

Codecov Report

Merging #3464 into master will decrease coverage by 0.123%.
The diff coverage is 54.856%.

@@               Coverage Diff               @@
##              master     #3464       +/-   ##
===============================================
- Coverage     79.377%   79.254%   -0.123%     
- Complexity     17314     17379       +65     
===============================================
  Files           1140      1142        +2     
  Lines          62643     62889      +246     
  Branches        9497      9546       +49     
===============================================
+ Hits           49724     49842      +118     
- Misses          9133      9241      +108     
- Partials        3786      3806       +20
Impacted Files Coverage Δ Complexity Δ
...te/hellbender/tools/spark/sv/discovery/SvType.java 100% <ø> (ø) 6 <0> (ø) ⬇️
...ls/spark/sv/discovery/GappedAlignmentSplitter.java 95% <ø> (ø) 18 <0> (ø) ⬇️
.../hellbender/tools/spark/sv/utils/SVFastqUtils.java 65.152% <ø> (+1.849%) 5 <0> (ø) ⬇️
.../sv/StructuralVariationDiscoveryPipelineSpark.java 94.872% <ø> (-0.065%) 6 <0> (ø)
...e/hellbender/tools/spark/sv/utils/SVVCFWriter.java 86.047% <ø> (ø) 10 <0> (ø) ⬇️
...y/prototype/SimpleStrandSwitchVariantDetector.java 27.848% <0%> (-4.744%) 13 <0> (ø)
.../sv/discovery/prototype/InsDelVariantDetector.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...tools/spark/sv/sga/AlignAssembledContigsSpark.java 100% <100%> (+2.222%) 12 <0> (ø) ⬇️
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java 94.505% <100%> (ø) 23 <1> (ø) ⬇️
.../DiscoverVariantsFromContigAlignmentsSGASpark.java 80.556% <28.571%> (-6.012%) 13 <0> (ø)
... and 16 more

@SHuang-Broad
Copy link
Contributor Author

Step 5 towards #2703

@SHuang-Broad
Copy link
Contributor Author

The most mind-numbing details are explained here.

@SHuang-Broad
Copy link
Contributor Author

This brings to us approximately 60 variants (without any filter applied).

@cwhelan Please take time to review, another PR (supposedly dealing with simple "translocation"s) is going to line up after this. Then the major graph-based one, but expected to take sometime to codeup. Thanks!

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few comments here and there mostly on naming, and a question about the logic in the ref-walk-distance code.

.travis.yml Outdated
@@ -1,6 +1,7 @@
language: java
sudo: required
dist: trusty
group: deprecated-2017Q3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to go in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, this was done to see if a fix for travis hiccup would work. Will be removed before final clean commit.

+ totalRefLen + ") of reference bases spanned by the cigar, indicated by cigar " + TextCigarCodec.encode(cigarAlong5To3DirOfRead));

final int readLength = immutableViewOnOriginalCigar.stream().mapToInt(ce -> ce.getOperator().consumesReadBases() ? ce.getLength() : 0).sum();
final List<CigarElement> cigarElements = walkBackward ? Lists.reverse(immutableViewOnOriginalCigar) : immutableViewOnOriginalCigar;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a CigarUtils.invertCigar() method -- could you use that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with some changes in lines above.

@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person, I feel like the word 'strands' is a little overloaded here, but I leave it up to you to decide whether or not to change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the suggested "orientation" word, and expanded a bit in its header description.

@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,
"Strands of the duplicated sequence on alt allele, one group for each alt allele (currently only available for inverted duplication variants)"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INV_TRANS_INS_REF_SPAN, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person it might be a little simpler to just re-use the insert sequence mappings annotation for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While changing the code for re-using this annotation, I realized that the case is a bit more complicated than what we thought of:
This annotation is for the area that is NOT covered by the alignment ref span overlap

two              <---------------------
one ------------------------>
                             |||||||||||
                             these bar bases are the inv ins annotated part

But there could be a part of the contig that is not at all covered by these two alignments, i.e. that part of the contig is unmapped. Such sequence would go into the annotation INSERTED_SEQUENCE, and without any INSERTED_SEQUENCE_MAPPINGS.
And for the case when there's 3 alignments (not the case being covered here but will be in the graph-based cpx case), where the 1st and last are like the case here, with the middle one mapped somewhere else on the reference, we need these two different annotations.

Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow multiple "INSERTED_SEQUENCE" values? If so I'd think this would still be fine, right? What if we made a 1-to-1 mapping between INSERTED_SEQUENCE values and INSERTED_SEQUENCE_MAPPINGS values, and gave the ones without a mapping a value of "." or something to indicate that it doesn't have a mapping? Would that solve the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this proposal better, so I am going to implement it in this PR in the next commit.

To make it more precise, based on our discussions this morning, and the fact that some of the insertion mappings are "extracted" from a larger alignment block, I'm going to strip down the mapping information stored in the INSERTED_SEQUENCE_MAPPINGS, because

  1. the extracted info could not make guarantee that MQ, NM is correct; and
  2. if the analyst is keen on the exact mapping and alignment, he/she can always take the provided inserted sequence and the given mapping location and run alignment with his/her choice of alignment parameters.

So the information stored in INSERTED_SEQUENCE_MAPPINGS will be formatted as

refChr_startInclusive_endInclusive_ORIENTATION_O/E

where the O/E is to signal if the mapping location is extracted from a bigger alignment block or not.

What do you think?

int readBasesConsumed = 0;
int refWalkDist = 0;

for (final CigarElement ce : cigarElements) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where the start on read occurs in the middle of a cigar operation? E.g. if the cigar is 10M1D5M and I say start on read is 5 and distanceOnRead = 10? It looks to me like this would return 15, and not the value I would be expecting given the description, 11.

Copy link
Contributor Author

@SHuang-Broad SHuang-Broad Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected and added proposed test case accordingly. Turns out very symmetrical to the similar function below for computing read walk distance.

@@ -30,6 +36,10 @@
@SuppressWarnings("unchecked")
private static final List<String> DEFAULT_CIGAR_STRINGS_FOR_DUP_SEQ_ON_CTG = new ArrayList<>(Collections.EMPTY_LIST);

private static final List<Strand> DEFAULT_INV_DUP_REF_STRAND = Collections.singletonList(Strand.POSITIVE);
private static final List<Strand> DEFAULT_INV_DUP_CTG_STRANDs_FR = Arrays.asList(Strand.POSITIVE, Strand.NEGATIVE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower case 's' in the name of this constant looks really weird -- I'd just capitalize everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, unintended. Fixed.

if (firstAlignmentInterval.forwardStrand) {
final int alpha = firstAlignmentInterval.referenceSpan.getStart(),
omega = secondAlignmentInterval.referenceSpan.getStart();
dupSeqRepeatUnitRefSpan = new SimpleInterval(firstAlignmentInterval.referenceSpan.getContig(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dependent on the two reference span intervals overlapping if I'm reading this right; it might be nice to document or validate that in this method to remind the reader/maintainer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is why this method is called initForInvDup(), hence assuming the input has "significant" overlaps on their ref spans. Documented.


final int start, end; // intended to be 0-based, semi-open [start, end)
final boolean needRC;
if (firstAlignmentInterval.forwardStrand) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, and maybe you feel like it might make things more confusing, but it seems like you could extract this boolean out and xor it with the booleans below and not need if/else clauses with largely duplicated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the details are not duplicated (though they look very similar), so trying to go for code simplicity might, as you suspected, make the code much less readable, so I'm going to keep it the way as it is right now.
But, I do feel that we need some "contig normalization procedure" up front, which would guarantee us that the representation we get from the previous stages are restricted. But that's for future.

cigarStringsForDupSeqOnCtg = null;
dupAnnotIsFromOptimization = false;
}

if (input.readBoolean()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to use readObjectOrNull and writeObjectOrNull in this serialization code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, except personal preference for native types. Plus SimpleInternal is marked Java serializable but not using Kryo.

@@ -31,6 +32,8 @@
public static final String INSERTED_SEQUENCE_MAPPINGS = "INSERTED_SEQUENCE_MAPPINGS";
public static final String HOMOLOGY = "HOMOLOGY";
public static final String HOMOLOGY_LENGTH = "HOMOLOGY_LENGTH";
public static final String ALT_HAPLOTYPE_SEQ = "ALT_HT_SEQ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well spell out HAPLOTYPE here to make it clear in the VCF file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SHuang-Broad
Copy link
Contributor Author

@cwhelan , after our offline discussion about how to make the insertion annotation easier for downstream analysis, I went back and did a check on how insertion annotation are extracted, and here's a summary:

Keys: NARL—NovelAdjacencyReferenceLocations, CA—ChimericAlignment, BC—BreakpointComplications

  • NARL contains {mate breakpoint locations, orientation change between the breakpoints, complications around the breakpoint}, where the complication is mainly used for two purposes: 1) adjusting the exact locations of the breakpoints, and 2) useful for downstream annotations for the associated VCF records.
  • NARL is constructed from an input CA, which in turn has two ways of being constructed: 1) deliberate construction with two neighboring alignments on the contig, 2) construction from an input contig whose alignments are scanned through in a semi pair-wise fashion. The second way is the master version currently in use in out pipeline, and is planned to be phased out eventually. Note that the second way extracts neighboring alignments that it considers good quality and send the information to the first version for actual construction.
  • The ctor for CA that takes two neighboring alignments also takes in a string representation of mapping locations of suspected insertions. In master version, this information is extracted from the alignments that are considered not strong enough, e.g. lower MQ, shorter alignment length, but the actual inserted sequence is not extracted in CA, but rather in BC.
  • The BC field in NARL, holding inserted sequence, micro-homology and other information (e.g. for duplication), does not contain where the inserted sequence, if any, is mapped, and the inserted sequence is extracted from the distance on the contig between two neighboring alignments stored in the input CA.
  • The variant, after its NARL locations are pinned down, is annotated by information stored in both CA and BC, where the information for BC is critical for reconstructing the alt haplotype and those in CA mainly for evaluating evidence strength.

Hence we should put the suspected inserted sequence and mapping in the same class, proposed to be in BC.

But I think this PR is getting bigger than it should be, and the cigar operation is what I now need for the graph-based cpx sv detection. So I am thinking of creating a ticket for this issue, and follow up in a month. What you do think?

@cwhelan
Copy link
Member

cwhelan commented Sep 27, 2017

This looks good to me. Fine to leave the other stuff for future PRs.

@SHuang-Broad
Copy link
Contributor Author

opened ticket #3647 to tackle this discussed problem.

* adding capability to discover inverted duplications, BreakpointComplications (and associated classes) expanded to cover this new type of variants, as well as VCF annotation;
* lifted Strand class in SVFastqUtils to upper level;
* added some tests for checking headers in test files
@SHuang-Broad SHuang-Broad merged commit 1dae6b5 into master Oct 3, 2017
@SHuang-Broad SHuang-Broad deleted the sh_cpx_sv_pr_invDup branch October 3, 2017 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants