
Structural Variant Context #3476

Merged · merged 4 commits from vrr_svc into master on Aug 27, 2017

Conversation

vruano
Contributor

@vruano vruano commented Aug 22, 2017

Wrapper around VC object to access SVContext specific annotations.
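
A minimal sketch of what such a wrapper might look like (the class name, constructor, and annotation keys are illustrative assumptions, not necessarily this PR's final code), assuming htsjdk's VariantContext accessors:

import htsjdk.variant.variantcontext.VariantContext;

// Illustrative composition-based wrapper exposing SV-specific INFO annotations.
public final class SVContextSketch {
    private final VariantContext vc;

    public SVContextSketch(final VariantContext vc) {
        this.vc = vc;
    }

    // SVTYPE annotation (e.g. DEL, INS, DUP, INV, BND); null when absent.
    public String getStructuralVariantType() {
        return vc.getAttributeAsString("SVTYPE", null);
    }

    // Absolute SVLEN annotation value; -1 when the annotation is absent.
    public int getStructuralVariantLength() {
        return vc.hasAttribute("SVLEN") ? Math.abs(vc.getAttributeAsInt("SVLEN", -1)) : -1;
    }
}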

@vruano
Contributor Author

vruano commented Aug 22, 2017

This is part of a larger piece of work; I'm trying to submit the different parts in separate, self-standing pull requests. @SHuang-Broad please review.

@SHuang-Broad SHuang-Broad self-assigned this Aug 22, 2017
@codecov-io

codecov-io commented Aug 22, 2017

Codecov Report

Merging #3476 into master will decrease coverage by 0.011%.
The diff coverage is 74.286%.

@@               Coverage Diff               @@
##              master     #3476       +/-   ##
===============================================
- Coverage     80.098%   80.087%   -0.011%     
- Complexity     17741     17760       +19     
===============================================
  Files           1187      1188        +1     
  Lines          64305     64410      +105     
  Branches        9992     10004       +12     
===============================================
+ Hits           51507     51584       +77     
- Misses          8825      8840       +15     
- Partials        3973      3986       +13
Impacted Files Coverage Δ Complexity Δ
...tools/spark/sv/utils/StructuralVariantContext.java 74.286% <74.286%> (ø) 19 <19> (?)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 76.316% <0%> (-1.316%) 38% <0%> (ø)
...er/tools/spark/sv/discovery/AlignmentInterval.java 90.678% <0%> (-0.847%) 24% <0%> (ø)
...e/hellbender/engine/spark/SparkContextFactory.java 73.973% <0%> (+2.74%) 11% <0%> (ø) ⬇️

Contributor

@SHuang-Broad SHuang-Broad left a comment

@vruano Thanks for introducing this class.

However, I have several change requests:

  1. override getType in its base class, because there is no way this is a SNP (a sketch follows this list)
  2. override the various is*** methods appropriately
  3. the test vcf is outdated (and is actually invalid, but my bad) in the sense that some annotations are not in the header and ALT allele header lines are missing. Simply use the latest master version of DiscoverVariantsFromContigAlignmentsSAMSpark to generate a valid vcf. In addition, there are records in the VCF that should be excluded for now
  4. the methods composeHaplotypeBasedOnReference() and getBreakPointIntervals() are a little heavyweight for this class. I suggest (no strong feelings though) putting them in a SVContextUtils class or something similar
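
A rough sketch for items 1 and 2, assuming StructuralVariantContext ends up extending htsjdk's VariantContext, that these methods are not final in the htsjdk version in use, and that the protected copy constructor is available (all assumptions, not the PR's actual code):

import htsjdk.variant.variantcontext.VariantContext;

final class StructuralVariantContextSketch extends VariantContext {

    StructuralVariantContextSketch(final VariantContext other) {
        super(other); // assumes htsjdk's protected copy constructor
    }

    @Override
    public Type getType() {
        return Type.SYMBOLIC; // an SV call can never be a SNP
    }

    @Override
    public boolean isSNP() {
        return false;
    }
}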

@@ -0,0 +1,280 @@
package org.broadinstitute.hellbender.tools.spark.sv;
Contributor

Would you object to putting this in the utils subpackage? To me it fits there better.
And if you don't want to, then I insist this class be made package-private, because it is experimental.

Contributor Author

Ok utils it is.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
Contributor

Unused imports found. Please remove or "optimize".

Contributor Author

Ain't there anymore.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
Contributor

Unused imports found. Please remove or "optimize".

import java.util.function.Supplier;

/**
* Variant context with additional method to mind the structural variant specific information from structural
Contributor

do you mean "mind" or "mine"?

Contributor Author

The latter, thanks.

private int length = -1;

/**
* Returns a instance given a plain {@link VariantContext} instance.
Contributor

"an" instatance

Contributor Author

Done.

}
}

public static ReferenceMultiSource referenceMultiSource() {
Contributor

private instead of public

* @param reference the reference to use as source.
* @return never {@code null}.
*/
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference) {
Contributor

should have an extra PipelineOptions parameter instead of an implicit null below when accessing reference bases

Contributor Author

To be honest I don't have a grasp on the PipelineOptions stuff yet, so I'll follow your advice.

Contributor

My bad. I just talked with the Engine team yesterday and they suggest, for tool developers, when PipelineOptions are needed, always use null and they'll have a smart way of dealing with the case when null is not enough. So please keep it the old way. Sorry!

Math.max(1, getStart() - paddingSize + 1),
Math.min(contigLength, getEnd() + paddingSize));

final ReferenceBases bases;
Contributor

the code above this line could be refactored into an "argument checking and get reference interval" utility method (preferably either private or package-private so it can be tested)

Contributor Author

What is the concern? That the method code is too long? @SHuang-Broad

Contributor

Not too long, but it seems a utility method would serve the purpose. Not a must though.
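
A hypothetical shape for that helper (the name is an assumption; the clipping arithmetic is taken from the padding snippet shown above), assuming GATK's SimpleInterval:

// Argument checking plus padded reference-interval computation, kept package-private
// so it can be unit-tested directly.
static SimpleInterval composePaddedInterval(final String contig, final int contigLength,
                                            final int start, final int end, final int paddingSize) {
    if (paddingSize < 0) {
        throw new IllegalArgumentException("the padding size cannot be negative");
    }
    return new SimpleInterval(contig,
            Math.max(1, start - paddingSize + 1),
            Math.min(contigLength, end + paddingSize));
}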

} catch (final IOException ex) {
throw new GATKException("could not read reference file");
}
if (index == 0) {
Contributor

This has the implicit assumption that the ref allele is at index 0. I remember from the days of fixing HaplotypeCaller that this kind of implicit assumption can give surprises. Please document clearly in the Javadoc for the parameters of this function that index == 0 <=> reference allele.

Contributor Author

I think that, given that this is a VC, it is guaranteed that the ref allele is at index 0. I imagine those problems were due to assuming that index 0 in other Allele-containing objects also meant the reference, for example an Allele array populated in an arbitrary manner... That said, there is not much harm in calling getAllele(i) to verify; it is just a few added instructions.
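
A cheap defensive check along those lines could look like this inside the method (hypothetical, not the PR's code; htsjdk's getAlleles() lists the reference allele first for well-formed contexts):

// Fail fast if the allele ordering ever violates the index == 0 <=> reference assumption.
if (getAlleles().get(index).isReference() != (index == 0)) {
    throw new IllegalStateException("unexpected allele ordering: the reference allele must be at index 0");
}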

result.setCigar(new Cigar(Collections.singletonList(new CigarElement(referenceInterval.size(), CigarOperator.M))));
result.setGenomeLocation(referenceInterval);
return result;
} else { //index == 1
Contributor

This demands documentation for the biallelic assumption.

Contributor Author

For now StructuralVariantContext is intrinsically biallelic. What I can do is make it "explode" if that is not the case.

Contributor Author

create() is going to take care of making sure that is the case before returning a value.

@SHuang-Broad SHuang-Broad assigned vruano and unassigned SHuang-Broad Aug 22, 2017
@vruano vruano force-pushed the vrr_svc branch 2 times, most recently from 86911ee to 8d77ce9 on August 23, 2017 20:59
@vruano
Contributor Author

vruano commented Aug 24, 2017

@SHuang-Broad take a look and tell me if anything else needs to be changed.

/**
* Returns the absolute structural variant length as recorded in the SVLEN annotation.
* <p>
* When SVLEN is absent this method returns -1.
Contributor

I suggest changing the "-1"s mentioned in this doc to

public static final int MISSING_LENGTH = -1; // note currently it is private

Contributor

@SHuang-Broad SHuang-Broad left a comment

One minor comment. Merge at will. 👍

@vruano
Contributor Author

vruano commented Aug 25, 2017

@SHuang-Broad I made a few "last" changes... MISSING_XX constants are only used internally to identify when the values have not been calculated. For length I have NO_LENGTH to indicate that in fact there is no length for that variant. Please take a look; I hope to have this merged in today.
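
A sketch of the caching scheme described here, assumed to live inside the class extending VariantContext (the sentinel values and SVLEN access are assumptions; only the MISSING_LENGTH/NO_LENGTH naming comes from the comment above):

private static final int MISSING_LENGTH = Integer.MIN_VALUE; // internal sentinel: not computed yet
public static final int NO_LENGTH = -1;                      // public: this variant has no SVLEN

private int length = MISSING_LENGTH; // lazily populated cache

public int getStructuralVariantLength() {
    if (length == MISSING_LENGTH) { // compute and cache on first access
        length = hasAttribute("SVLEN")
                ? Math.abs(getAttributeAsInt("SVLEN", NO_LENGTH))
                : NO_LENGTH;
    }
    return length;
}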

Contributor

@SHuang-Broad SHuang-Broad left a comment

Ready. Go. 👍

@@ -127,7 +140,8 @@ public int getEnd() {
* @param pipelineOptions pipeline options.
* @return never {@code null}.
*/
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference, final PipelineOptions pipelineOptions) {
public Haplotype composeHaplotypeBasedOnReference(final int index, final int paddingSize, final ReferenceMultiSource reference,
final PipelineOptions pipelineOptions) {
Contributor

My last comment about PipelineOptions was probably overwhelmed by other changes, but here's the summary:
My suggestion about adding PipelineOptions was wrong, so please remove them. In practice we only ever need null for them.

end = getStart() + getStructuralVariantLength();
} else {
end = super.getEnd();
if (end == MISSING_END) {
Contributor

I'm not suggesting that you change this now (though it would be good if you do), but I still believe that BND records should have no END concept associated with them, so having a hasEnd() method is good, and preferably users should call hasEnd() first, then getEnd().
But as said, you don't need to add this now.
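
A hypothetical shape for that suggestion, as a method of the class (treating only BND records as end-less is an assumption taken from this comment, not settled behavior):

public boolean hasEnd() {
    // BND records describe a novel adjacency rather than a covered interval.
    return !"BND".equals(getAttributeAsString("SVTYPE", null));
}

Callers would then check hasEnd() before calling the inherited getEnd().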

@vruano
Contributor Author

vruano commented Aug 25, 2017

No problem adding hasEnd()... but can you break down which SV types have it and which don't?

BUT I think it is important to consider here what "end" means in reality. I think that "end" here should be the last position continuously overlapped by the variant from its start position. So for insertions, translocations and BNDs, typically it would be set equal to start.

Think about "start" itself.... it does not make reference to the first overlapped based but the ones before it. If a BND would not have an "end" why should it have an "start"?

I think start/end is just defined by what is practical for the sake of working with VCFs.

What do you think? @SHuang-Broad

@vruano
Contributor Author

vruano commented Aug 25, 2017

@SHuang-Broad what is PipelineOptions needed for? Does one need it to access the reference if it is stored in something that is not an ordinary file (e.g. a GS bucket)?

@SHuang-Broad
Contributor

  • For PipelineOptions, my understanding is that it is used for the Google Genomics API, and we seem to use it very infrequently in GATK (and never in SV), so it is safe to use null whenever an engine-level or other utility function's API needs it
  • For the END and START annotations, there is no START in the VCF spec, only POS, so I don't know where the start comes from. And yes, I agree that BND records don't have a start either; a BND is merely a novel adjacency between two genomic locations, neither of which is a start or an end.

@vruano vruano merged commit bfa9af4 into master Aug 27, 2017
@vruano vruano deleted the vrr_svc branch August 27, 2017 05:48