-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Structural Variant Context #3476
Conversation
This is part of a large piece of work, I'm trying to submit different part in separated self-standing pull-requests. @SHuang-Broad please review. |
Codecov Report
@@ Coverage Diff @@
## master #3476 +/- ##
===============================================
- Coverage 80.098% 80.087% -0.011%
- Complexity 17741 17760 +19
===============================================
Files 1187 1188 +1
Lines 64305 64410 +105
Branches 9992 10004 +12
===============================================
+ Hits 51507 51584 +77
- Misses 8825 8840 +15
- Partials 3973 3986 +13
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vruano Thanks for introducing this class.
However, I have several change requests:
- override
getType
in its base class because there's no way this is a SNP - override the various
is***
methods appropriately - test vcf is outdated (and is actually invalid, but my bad) in the sense that some annotations are not in the header and
ALT
allele header lines are missing. Simply use the latest master version ofDiscoverVariantsFromContigAlignmentsSAMSpark
to generate a valid vcf. In addition, there are records in the VCF that should be excluded for now - the method
composeHaplotypeBasedOnReference()
is a little heavy-weighted for this class, as well as the methodgetBreakPointIntervals()
. I suggest (no strong feelings though) them being put in aSVContextUtils
class or something similar
@@ -0,0 +1,280 @@ | |||
package org.broadinstitute.hellbender.tools.spark.sv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you object putting this in the utils
subpackage? To me it fits there better.
And if you don't want to, then I insist this class be package-private
because this class is experimental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok utils it is.
import java.util.Arrays; | ||
import java.util.Collections; | ||
import java.util.List; | ||
import java.util.function.Supplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import
's found. Please remove or "optimize".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ain't there anymore.
import java.util.Arrays; | ||
import java.util.Collections; | ||
import java.util.List; | ||
import java.util.stream.Collectors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import
's found. Please remove or "optimize".
import java.util.function.Supplier; | ||
|
||
/** | ||
* Variant context with additional method to mind the structural variant specific information from structural |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean "mind" or "mine"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter, thanks.
private int length = -1; | ||
|
||
/** | ||
* Returns a instance given a plain {@link VariantContext} instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"an" instatance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
|
||
public static ReferenceMultiSource referenceMultiSource() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private
instead of public
* @param reference the reference to use as source. | ||
* @return never {@code null}. | ||
*/ | ||
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should have extra parameter pipelineoptions instead of a implicit null below when accessing reference bases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I don't have a grasp on the PipelineOptions stuff yet, we follow your advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I just talked with the Engine team yesterday and they suggest, for tool developers, when PipelineOptions
are needed, always use null
and they'll have a smart way of dealing with the case when null
is not enough. So please keep it the old way. Sorry!
Math.max(1, getStart() - paddingSize + 1), | ||
Math.min(contigLength, getEnd() + paddingSize)); | ||
|
||
final ReferenceBases bases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code above this line could be refactored into a "argument checking and get reference interval" utility method (perferably either private or package private to be tested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the concern? that the method code is too long? @SHuang-Broad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not too long, but seems a utility method would serve the purpose. not a must though.
} catch (final IOException ex) { | ||
throw new GATKException("could not read reference file"); | ||
} | ||
if (index == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the implicit assumption that ref allele is at index 0. I remember from the days of fixing HaplotypeCaller this kind of implicit assumptions can give surprises. Please document clearly in the Javadoc for the parameters of this function that index==0 <=> reference allele
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think that given that this is a VC it is guaranteed that ref is at index 0. I imagine those problems where due to assume that the index 0 in order Allele containing objects also meant reference for example Allele array that was populated in a arbitrary manner... that said there is no much harm in calling getAllele(i) to verify, is just a added instructions.
result.setCigar(new Cigar(Collections.singletonList(new CigarElement(referenceInterval.size(), CigarOperator.M)))); | ||
result.setGenomeLocation(referenceInterval); | ||
return result; | ||
} else { //index == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This demands documentation for the biallelic assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now StructuralVariantContext is intrinsically balletic. What I can do is to make it "explode" if that is not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create() is going to take care of making sure that is the case before returning a value.
86911ee
to
8d77ce9
Compare
@SHuang-Broad take a look and tell me if anything else needs to be changed. |
/** | ||
* Returns the absolute structural variant length as recorded in the SVLEN annotation. | ||
* <p> | ||
* When SVLEN is absent this method returns -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest changing the "-1"'s mentioned in this doc to
public static final int MISSING_LENGTH = -1; // note currently it is private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment. Merge at will. 👍
@SHuang-Broad I made a few "last" changes... MISSING_XX constants are only used internally to identify when the values have not been calculated. For length I have NO_LENGTH to indicate that in fact there is no length for that variant. Please take a look; hope to have this merged in today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready. Go. 👍
@@ -127,7 +140,8 @@ public int getEnd() { | |||
* @param pipelineOptions pipeline options. | |||
* @return never {@code null}. | |||
*/ | |||
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference, final PipelineOptions pipelineOptions) { | |||
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference, | |||
final PipelineOptions pipelineOptions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My last comment about PipelineOptions
is probably overwelmed by other changes, but here's the summary:
My suggestion about adding PipelineOptions are wrong, and please remove them. We need only null for them practically.
end = getStart() + getStructuralVariantLength(); | ||
} else { | ||
end = super.getEnd(); | ||
if (end == MISSING_END) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not suggesting that you change them now (though it will be good if you do so), I still believe that BND records should have no END
concept associated, so having a hasEnd()
method is good, and preferably users should call hasEnd()
first, then getEnd()
.
But as said, you don't need to add this now.
No problem adding hasEnd()... but can you break down what SV types have it which don't .... BUT I think is important to consider here what "end" means in reality.. I think that "end" here should be the last position continuously overlapped by the variant from its start position. So for insertions, translocations and bnds, typically it would be set equal to start. Think about "start" itself.... it does not make reference to the first overlapped based but the ones before it. If a BND would not have an "end" why should it have an "start"? I think start-end is just defined to what is practical for the sake of working with VCFs. What do you think? @SHuang-Broad |
@SHuang-Broad what is PipelineOptions needed for ... does one need it to access the reference if it stored in something that is not a ordinary file? (e.g. GS bucket?) |
|
Wrapper around VC object to access SVContext specific annotations.
Wrapper around VC object to access SVContext specific annotations.