PR series for complex SV, part 4 #3464
Codecov Report
@@ Coverage Diff @@
## master #3464 +/- ##
===============================================
- Coverage 79.377% 79.254% -0.123%
- Complexity 17314 17379 +65
===============================================
Files 1140 1142 +2
Lines 62643 62889 +246
Branches 9497 9546 +49
===============================================
+ Hits 49724 49842 +118
- Misses 9133 9241 +108
- Partials 3786 3806 +20
Step 5 towards #2703
The most mind-numbing details are explained here.
19cd56b to bea70ff
47a3f6c to 3480acf
This brings us approximately 60 variants (without any filter applied). @cwhelan Please take time to review; another PR (expected to deal with simple "translocations") will line up after this one, followed by the major graph-based one, which is expected to take some time to code up. Thanks!
3480acf to 1e57989
Looks good overall, just a few comments here and there mostly on naming, and a question about the logic in the ref-walk-distance code.
.travis.yml
Outdated
@@ -1,6 +1,7 @@
language: java
sudo: required
dist: trusty
group: deprecated-2017Q3
Is this supposed to go in here?
Nope, this was added to see whether a fix for a Travis hiccup would work. It will be removed before the final clean commit.
+ totalRefLen + ") of reference bases spanned by the cigar, indicated by cigar " + TextCigarCodec.encode(cigarAlong5To3DirOfRead));

final int readLength = immutableViewOnOriginalCigar.stream().mapToInt(ce -> ce.getOperator().consumesReadBases() ? ce.getLength() : 0).sum();
final List<CigarElement> cigarElements = walkBackward ? Lists.reverse(immutableViewOnOriginalCigar) : immutableViewOnOriginalCigar;
There's a CigarUtils.invertCigar() method; could you use that?
Done with some changes in lines above.
@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,
As discussed in person, I feel like the word 'strands' is a little overloaded here, but I leave it up to you to decide whether or not to change this.
Switched to the suggested word "orientation" and expanded its header description a bit.
@@ -90,6 +93,11 @@ private static void addSymbAltAlleleLine(final VCFSimpleHeaderLine line) {
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUPLICATION_NUMBERS, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "Number of times the sequence is duplicated on reference and on the alternate alleles"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.DUP_ANNOTATIONS_IMPRECISE, 0, VCFHeaderLineType.Flag, "Whether the duplication annotations are from an experimental optimization procedure"));

addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INVDUP_STRANDS, VCFHeaderLineCount.A, VCFHeaderLineType.String,
"Strands of the duplicated sequence on alt allele, one group for each alt allele (currently only available for inverted duplication variants)"));
addInfoLine(new VCFInfoHeaderLine(GATKSVVCFConstants.INV_TRANS_INS_REF_SPAN, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String,
As discussed in person it might be a little simpler to just re-use the insert sequence mappings annotation for this.
While changing the code for re-using this annotation, I realized that the case is a bit more complicated than what we thought of:
This annotation is for the area that is NOT covered by the alignment ref span overlap
two <---------------------
one ------------------------>
|||||||||||
these bar bases are the inv ins annotated part
But there could be a part of the contig that is not covered by either of these two alignments at all, i.e. that part of the contig is unmapped. Such a sequence would go into the annotation INSERTED_SEQUENCE, without any INSERTED_SEQUENCE_MAPPINGS.
And for the case with three alignments (not covered here, but it will be in the graph-based cpx case), where the first and last are like the case here and the middle one is mapped somewhere else on the reference, we need these two different annotations.
Do you agree?
Do we allow multiple "INSERTED_SEQUENCE" values? If so I'd think this would still be fine, right? What if we made a 1-to-1 mapping between INSERTED_SEQUENCE values and INSERTED_SEQUENCE_MAPPINGS values, and gave the ones without a mapping a value of "." or something to indicate that it doesn't have a mapping? Would that solve the issue?
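The 1-to-1 pairing proposed above can be sketched as follows. This is only an illustration under stated assumptions: the class name, the method name, and the idea of looking mappings up by sequence are hypothetical and do not come from the GATK code base; only the "." placeholder comes from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the GATK implementation. */
final class PairedAnnotationSketch {

    /** Placeholder for an inserted sequence that has no mapping. */
    static final String NO_MAPPING = ".";

    /**
     * Returns exactly one mapping value per inserted sequence, in the same order,
     * so that the i-th INSERTED_SEQUENCE value pairs with the i-th mapping value.
     * Looking mappings up by sequence is an assumption made for this sketch.
     */
    static List<String> alignMappings(final List<String> insertedSequences,
                                      final Map<String, String> knownMappings) {
        final List<String> mappings = new ArrayList<>(insertedSequences.size());
        for (final String sequence : insertedSequences) {
            mappings.add(knownMappings.getOrDefault(sequence, NO_MAPPING));
        }
        return mappings;
    }
}
```

With both lists the same length, a downstream reader can pair values by index and treat "." as "unmapped".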
I like this proposal better, so I am going to implement it in this PR in the next commit.
To make it more precise, based on our discussions this morning and the fact that some of the insertion mappings are "extracted" from a larger alignment block, I'm going to strip down the mapping information stored in INSERTED_SEQUENCE_MAPPINGS, because
- the extracted information cannot guarantee that MQ and NM are correct; and
- an analyst who needs the exact mapping and alignment can always take the provided inserted sequence and the given mapping location and rerun the alignment with their own choice of parameters.
So the information stored in INSERTED_SEQUENCE_MAPPINGS will be formatted as
refChr_startInclusive_endInclusive_ORIENTATION_O/E
where the O/E field signals whether the mapping location was extracted from a bigger alignment block.
What do you think?
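A minimal sketch of how a string in the proposed refChr_startInclusive_endInclusive_ORIENTATION_O/E layout could be written and read back. The class and field names are hypothetical. The "+"/"-" orientation tokens, and reading "E" as "extracted" and "O" as "not extracted", are assumptions; the comment does not spell them out. Because contig names can themselves contain underscores, the sketch splits from the right.

```java
/** Hypothetical sketch of the proposed mapping string; not the GATK implementation. */
final class InsertedSequenceMappingSketch {
    final String contig;
    final int start;          // inclusive
    final int end;            // inclusive
    final boolean forward;    // assumed to be encoded as "+" or "-"
    final boolean extracted;  // assumed: "E" = extracted from a bigger block, "O" = otherwise

    InsertedSequenceMappingSketch(final String contig, final int start, final int end,
                                  final boolean forward, final boolean extracted) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.forward = forward;
        this.extracted = extracted;
    }

    String encode() {
        return contig + "_" + start + "_" + end + "_" + (forward ? "+" : "-") + "_" + (extracted ? "E" : "O");
    }

    static InsertedSequenceMappingSketch decode(final String text) {
        // Peel the four trailing fields off from the right, so that
        // underscores inside the contig name are left untouched.
        String rest = text;
        final String[] tail = new String[4];
        for (int i = 3; i >= 0; i--) {
            final int idx = rest.lastIndexOf('_');
            if (idx < 0) {
                throw new IllegalArgumentException("Malformed mapping: " + text);
            }
            tail[i] = rest.substring(idx + 1);
            rest = rest.substring(0, idx);
        }
        return new InsertedSequenceMappingSketch(rest, Integer.parseInt(tail[0]), Integer.parseInt(tail[1]),
                tail[2].equals("+"), tail[3].equals("E"));
    }
}
```

Splitting from the right keeps a contig such as chrUn_gl000220 intact, which a plain split on "_" would break apart.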
int readBasesConsumed = 0;
int refWalkDist = 0;

for (final CigarElement ce : cigarElements) {
What about the case where the start on read occurs in the middle of a cigar operation? E.g. if the cigar is 10M1D5M and I say start on read is 5 and distanceOnRead = 10? It looks to me like this would return 15, and not the value I would be expecting given the description, 11.
Corrected, and added the proposed test case. The fix turns out to be very symmetrical to the similar function below that computes read walk distance.
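To make the reviewer's example concrete, here is a stand-alone sketch of a reference-walk computation that handles a start position falling in the middle of a CIGAR operation. It is not the code from this PR and does not use htsjdk: the Element class, the parser, and the choice of a 1-based inclusive start are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not the GATK implementation. */
final class RefWalkSketch {

    /** One CIGAR element, e.g. 10M: a length and a one-letter operator. */
    static final class Element {
        final int length;
        final char op;
        Element(final int length, final char op) { this.length = length; this.op = op; }
        boolean consumesRead() { return op == 'M' || op == 'I' || op == 'S' || op == '=' || op == 'X'; }
        boolean consumesRef()  { return op == 'M' || op == 'D' || op == 'N' || op == '=' || op == 'X'; }
    }

    /** Parses a text CIGAR such as "10M1D5M". */
    static List<Element> parse(final String cigar) {
        final List<Element> elements = new ArrayList<>();
        int length = 0;
        for (final char c : cigar.toCharArray()) {
            if (Character.isDigit(c)) {
                length = 10 * length + (c - '0');
            } else {
                elements.add(new Element(length, c));
                length = 0;
            }
        }
        return elements;
    }

    /**
     * Reference bases spanned while walking distanceOnRead read bases,
     * starting at the 1-based read position startOnRead (inclusive).
     */
    static int refWalkDistance(final List<Element> cigar, final int startOnRead, final int distanceOnRead) {
        final int lastOnRead = startOnRead + distanceOnRead - 1; // 1-based, inclusive
        int readPos = 0; // read bases consumed before the current element
        int refDist = 0;
        for (final Element e : cigar) {
            if (readPos >= lastOnRead) {
                break; // the walk is complete
            }
            if (e.consumesRead()) {
                final int first = readPos + 1;
                final int last = readPos + e.length;
                // Only the part of this element inside [startOnRead, lastOnRead] counts.
                final int overlap = Math.min(last, lastOnRead) - Math.max(first, startOnRead) + 1;
                if (overlap > 0 && e.consumesRef()) {
                    refDist += overlap;
                }
                readPos = last;
            } else if (e.consumesRef() && readPos >= startOnRead) {
                // A deletion or skip counts only once the walk has started.
                refDist += e.length;
            }
        }
        return refDist;
    }
}
```

For the cigar 10M1D5M with start 5 and distance 10, the walk covers 6 bases of the first match block, the 1-base deletion, and 4 bases of the second match block, giving the 11 the reviewer expected.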
@@ -30,6 +36,10 @@
@SuppressWarnings("unchecked")
private static final List<String> DEFAULT_CIGAR_STRINGS_FOR_DUP_SEQ_ON_CTG = new ArrayList<>(Collections.EMPTY_LIST);

private static final List<Strand> DEFAULT_INV_DUP_REF_STRAND = Collections.singletonList(Strand.POSITIVE);
private static final List<Strand> DEFAULT_INV_DUP_CTG_STRANDs_FR = Arrays.asList(Strand.POSITIVE, Strand.NEGATIVE);
The lower case 's' in the name of this constant looks really weird -- I'd just capitalize everything.
Oops, unintended. Fixed.
if (firstAlignmentInterval.forwardStrand) {
final int alpha = firstAlignmentInterval.referenceSpan.getStart(),
omega = secondAlignmentInterval.referenceSpan.getStart();
dupSeqRepeatUnitRefSpan = new SimpleInterval(firstAlignmentInterval.referenceSpan.getContig(),
This is dependent on the two reference span intervals overlapping if I'm reading this right; it might be nice to document or validate that in this method to remind the reader/maintainer.
Yes, that is why this method is called initForInvDup(): it assumes the inputs have "significant" overlap on their ref spans. Documented.
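The validation the reviewer asks about could look roughly like the sketch below. It works on plain contig names and closed coordinates rather than on the SimpleInterval class used in the PR, and the class and method names are hypothetical. What counts as a "significant" overlap is not defined in this thread, so the sketch only rejects inputs with no overlap at all.

```java
/** Hypothetical sketch of an up-front overlap check; not the GATK implementation. */
final class OverlapCheckSketch {

    /** Number of bases shared by the closed intervals [s1, e1] and [s2, e2]. */
    static int overlapLength(final int s1, final int e1, final int s2, final int e2) {
        return Math.max(0, Math.min(e1, e2) - Math.max(s1, s2) + 1);
    }

    /** Fails fast when the two reference spans are on different contigs or do not overlap. */
    static void requireOverlappingRefSpans(final String contig1, final int s1, final int e1,
                                           final String contig2, final int s2, final int e2) {
        if (!contig1.equals(contig2) || overlapLength(s1, e1, s2, e2) == 0) {
            throw new IllegalArgumentException("Reference spans do not overlap: "
                    + contig1 + ":" + s1 + "-" + e1 + " and " + contig2 + ":" + s2 + "-" + e2);
        }
    }
}
```

Calling such a check at the top of the method turns the documented assumption into an explicit failure when it is violated.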
final int start, end; // intended to be 0-based, semi-open [start, end)
final boolean needRC;
if (firstAlignmentInterval.forwardStrand) {
Not a big deal, and maybe you feel like it might make things more confusing, but it seems like you could extract this boolean out and xor it with the booleans below and not need if/else clauses with largely duplicated code.
Actually, the details are not duplicated (though they look very similar), so going for simpler code might, as you suspected, make it much less readable. I'm going to keep it the way it is for now.
But I do feel that we need some "contig normalization procedure" up front, which would guarantee that the representations we get from the previous stages are restricted. That's for the future, though.
cigarStringsForDupSeqOnCtg = null;
dupAnnotIsFromOptimization = false;
}

if (input.readBoolean()) {
Is there a reason not to use readObjectOrNull and writeObjectOrNull in this serialization code?
Nope, except a personal preference for native types. Plus, SimpleInterval is marked Java-serializable but does not use Kryo.
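The input.readBoolean() guard in the hunk above is a presence-flag pattern for nullable fields: write a boolean saying whether the value exists, then the value itself only when it does. A minimal sketch of that pattern follows. It uses the standard java.io data streams instead of Kryo's Input and Output so that it is self-contained, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of presence-flag serialization; not the GATK implementation. */
final class NullableFieldSketch {

    /** Writes a flag first, then the value only if it is non-null. */
    static void writeNullable(final DataOutputStream out, final String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    /** Reads the flag and, if it is set, the value that follows it. */
    static String readNullable(final DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    /** Serializes and deserializes a value in memory, for demonstration. */
    static String roundTrip(final String value) {
        try {
            final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeNullable(out, value);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readNullable(in);
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Kryo's writeObjectOrNull and readObjectOrNull fold the flag and the value into a single call, which is the simplification the reviewer is pointing at.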
@@ -31,6 +32,8 @@
public static final String INSERTED_SEQUENCE_MAPPINGS = "INSERTED_SEQUENCE_MAPPINGS";
public static final String HOMOLOGY = "HOMOLOGY";
public static final String HOMOLOGY_LENGTH = "HOMOLOGY_LENGTH";
public static final String ALT_HAPLOTYPE_SEQ = "ALT_HT_SEQ";
Might as well spell out HAPLOTYPE here to make it clear in the VCF file
done
@cwhelan, after our offline discussion about how to make the insertion annotation easier for downstream analysis, I went back and checked how insertion annotations are extracted, and here's a summary: Keys: NARL = NovelAdjacencyReferenceLocations, CA = ChimericAlignment, BC = BreakpointComplications
Hence we should put the suspected inserted sequence and its mapping in the same class, proposed to be BC. But I think this PR is getting bigger than it should be, and the cigar operation is what I need now for the graph-based cpx SV detection. So I am thinking of creating a ticket for this issue and following up in a month. What do you think?
This looks good to me. Fine to leave the other stuff for future PRs.
Opened ticket #3647 to tackle the problem discussed above.
8462bbf to 8e15ee6
* adding capability to discover inverted duplications; BreakpointComplications (and associated classes) expanded to cover this new type of variant, as well as VCF annotation
* lifted Strand class in SVFastqUtils to upper level
* added some tests for checking headers in test files
8e15ee6 to 28b726c
This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, significantly overlapping each other on their reference spans.
We used to call inversions from such alignments when feasible, but it is more appropriate to emit inverted duplication records.
NEEDS TO WAIT UNTIL PARTS 1, 2 AND 3 ARE IN.