
prevent users from requesting g.vcf.gz in Spark #4277

Merged · 3 commits merged into master from lb_fix_spark_g_vcf_gz on Jan 30, 2018

Conversation

lbergelson (Member) commented:

* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantsSparkSink and throw a clear exception in this case
* added a test for GVCF writing in VariantsSparkSink, which previously didn't exist
* added new UserException.UnimplementedFeature class (a minimal sketch follows this list)
* closes #4275
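A minimal sketch of the new exception class, assembled from the constructor quoted in the review below; its placement as a static nested class of UserException is inferred from the dotted name, so the merged code may differ in detail:

    // Sketch only: a static nested subclass of UserException, matching the dotted name
    // UserException.UnimplementedFeature used throughout this PR.
    public static class UnimplementedFeature extends UserException {
        private static final long serialVersionUID = 0L;

        public UnimplementedFeature(String message) {
            super(message);
        }
    }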

codecov-io commented Jan 29, 2018

Codecov Report

Merging #4277 into master will increase coverage by 0.06%.
The diff coverage is 87.5%.

@@              Coverage Diff               @@
##              master     #4277      +/-   ##
==============================================
+ Coverage     79.087%   79.147%   +0.06%     
- Complexity     16612     16707      +95     
==============================================
  Files           1049      1051       +2     
  Lines          59580     59808     +228     
  Branches        9730      9758      +28     
==============================================
+ Hits           47120     47336     +216     
- Misses          8681      8685       +4     
- Partials        3779      3787       +8
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...stitute/hellbender/tools/HaplotypeCallerSpark.java | 83.673% <100%> (+0.516%) | 26 <4> (+3) ⬆️ |
| ...institute/hellbender/exceptions/UserException.java | 71.111% <100%> (+0.434%) | 3 <0> (ø) ⬇️ |
| ...er/engine/spark/datasources/VariantsSparkSink.java | 86.047% <66.667%> (+2.713%) | 17 <0> (+4) ⬆️ |
| ...ollections/RequiredIntervalArgumentCollection.java | 83.333% <0%> (ø) | 6% <0%> (+3%) ⬆️ |
| ...ellbender/engine/filters/ReadLengthReadFilter.java | 100% <0%> (ø) | 8% <0%> (+4%) ⬆️ |
| ...ender/engine/filters/FragmentLengthReadFilter.java | 100% <0%> (ø) | 8% <0%> (+4%) ⬆️ |
| ...ender/engine/filters/MappingQualityReadFilter.java | 100% <0%> (ø) | 10% <0%> (+5%) ⬆️ |
| ...e/hellbender/engine/filters/LibraryReadFilter.java | 100% <0%> (ø) | 6% <0%> (+3%) ⬆️ |
| ...llbender/engine/filters/OverclippedReadFilter.java | 100% <0%> (ø) | 18% <0%> (+9%) ⬆️ |
| ...te/hellbender/engine/filters/ReadStrandFilter.java | 66.667% <0%> (ø) | 6% <0%> (+3%) ⬆️ |
| ... and 11 more | | |

droazen (Contributor) left a comment:
Review complete, back to @lbergelson, merge after addressing comments.


In VariantsSparkSink.java:

    //TODO remove me when https://github.com/broadinstitute/gatk/issues/4274 is fixed
    if (writeGvcf && (AbstractFeatureReader.hasBlockCompressedExtension(outputFile) || outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION))) {
        throw new UserException.UnimplementedFeature("It is currently not possible to write a compressed g.vcf or bcf.gz on spark. See https://github.com/broadinstitute/gatk/issues/4274 for more details.");
    }
droazen (Contributor):
Doesn't this forbid writing any BCFs when in GVCF mode (not just bcf.gz)? Was this the intent? Both the error message and the ticket #4274 seem incomplete, since they don't mention disallowing g.bcf.

lbergelson (Member, Author):
That was the intent; it's currently not possible to write g.bcf at all, compressed or not. That said, I think we are actually compressing bcf automatically on Spark, but it interacts strangely with the .gz codepaths. I've updated the comment and opened a new issue, #4303.
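To make the scope of that guard concrete, here is a small, self-contained sketch (class and method names are ad hoc) that mirrors the condition used in the PR; it assumes htsjdk's AbstractFeatureReader.hasBlockCompressedExtension flags block-compressed suffixes such as .gz and that IOUtil.BCF_FILE_EXTENSION is ".bcf":

    import htsjdk.samtools.util.IOUtil;
    import htsjdk.tribble.AbstractFeatureReader;

    public class GvcfOutputGuardSketch {
        // Mirrors the PR's condition: in GVCF mode, block-compressed outputs and any BCF output are rejected.
        static boolean isDisallowedGvcfOutput(final String outputFile) {
            return AbstractFeatureReader.hasBlockCompressedExtension(outputFile)
                    || outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION);
        }

        public static void main(String[] args) {
            System.out.println(isDisallowedGvcfOutput("calls.g.vcf.gz")); // true: block-compressed VCF
            System.out.println(isDisallowedGvcfOutput("calls.g.bcf.gz")); // true: block-compressed
            System.out.println(isDisallowedGvcfOutput("calls.g.bcf"));    // true: plain BCF is caught by the endsWith branch
            System.out.println(isDisallowedGvcfOutput("calls.g.vcf"));    // false: an uncompressed g.vcf is still allowed
        }
    }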

In UserException.java, the new exception subclass:

    private static final long serialVersionUID = 0L;

    public UnimplementedFeature(String message){
        super(message);
    }
droazen (Contributor):
New UserException subclasses should generally prepend some boilerplate text to the error message in their constructors (e.g., "Feature is not currently implemented: %s").

lbergelson (Member, Author):
I feel like this generally leads to more confusing error messages since you have to fight the existing text to get it to say what you want.

droazen (Contributor):
It leads to more consistent error messages. You don't have to "fight" the existing text so much as just look at it before adding your message.
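For comparison, the kind of constructor being suggested would prepend standard boilerplate to the caller's message; a hedged sketch using the example wording from the comment above (this illustrates the reviewer's suggestion, not what was merged):

    public static class UnimplementedFeature extends UserException {
        private static final long serialVersionUID = 0L;

        // Prepend a consistent prefix so every UnimplementedFeature message reads the same way.
        public UnimplementedFeature(String message) {
            super(String.format("Feature is not currently implemented: %s", message));
        }
    }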

In HaplotypeCallerSpark.java:

    if (hcArgs.emitReferenceConfidence == ReferenceConfidenceMode.GVCF
            && (AbstractFeatureReader.hasBlockCompressedExtension(output) || output.endsWith(IOUtil.BCF_FILE_EXTENSION))) {
        throw new UserException.UnimplementedFeature("It is currently not possible to write a compressed g.vcf or bcf.gz from HaplotypeCallerSpark. " +
                "See https://github.com/broadinstitute/gatk/issues/4274 for more details.");
    }
droazen (Contributor):
Are you repeating this check so that the tool can fail early, or would the check in VariantSparkSink not catch this?

Either way, my comment above also applies here: Doesn't this forbid writing any BCFs when in GVCF mode (not just bcf.gz)? Was this the intent? Both the error message and the ticket #4274 seem incomplete, since they don't mention disallowing g.bcf.

lbergelson (Member, Author):
Repeating it to catch the error early; otherwise it will run the entire HaplotypeCaller and only then fail. Updated here as well.

In the new VariantsSparkSink test:

    @Test(expectedExceptions = UserException.UnimplementedFeature.class)
    public void testGVCF_GZ_isDisallowed() throws IOException {
        JavaSparkContext ctx = SparkContextFactory.getTestSparkContext();
        VariantsSparkSink.writeVariants(ctx, createTempFile("test",".g.vcf.gz").toString(), null,
droazen (Contributor):
Also test that .g.bcf and .g.bcf.gz are disallowed.

lbergelson (Member, Author):
Added
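The added coverage presumably looks something like the sketch below (names and structure are assumptions): a TestNG data provider inside the existing test class feeding each disallowed extension through the same guard condition that VariantsSparkSink applies. The merged tests would instead route the outputs through writeVariants and expect UserException.UnimplementedFeature, as in the snippet above.

    // Hypothetical sketch of the extra coverage inside the existing test class.
    @DataProvider(name = "disallowedGvcfOutputs")
    public Object[][] disallowedGvcfOutputs() {
        return new Object[][]{{".g.vcf.gz"}, {".g.bcf"}, {".g.bcf.gz"}};
    }

    @Test(dataProvider = "disallowedGvcfOutputs")
    public void testDisallowedExtensionsAreDetected(final String extension) {
        final String output = createTempFile("test", extension).toString();
        // Each of these outputs should trip the block-compressed / BCF check.
        Assert.assertTrue(AbstractFeatureReader.hasBlockCompressedExtension(output)
                || output.endsWith(IOUtil.BCF_FILE_EXTENSION));
    }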

Test code that indexes the output with IndexFeatureFile via an anonymous test helper:

        public String getTestedToolName(){
            return IndexFeatureFile.class.getSimpleName();
        }
    }.runCommandLine(new String[]{"-F", output.getAbsolutePath()});
droazen (Contributor):
Isn't new Main().instanceMain(args) simpler here?

lbergelson (Member, Author):
It is, but it misses out on some things, like setting the default test verbosity.
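Roughly, the trade-off being described (a hedged sketch; the exact defaults applied by the test helper are assumptions):

    // Reviewer's suggestion: call the GATK entry point directly. Main.instanceMain takes the full
    // argv, so the tool name must be supplied by hand and no test defaults (e.g. --verbosity) are applied.
    new Main().instanceMain(new String[]{"IndexFeatureFile", "-F", output.getAbsolutePath()});

    // The PR instead goes through the anonymous test helper quoted above, whose runCommandLine
    // fills in getTestedToolName() and the default test verbosity before invoking the tool.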

In the HaplotypeCallerSpark integration test:

    final String[] args = {
            "-I", NA12878_20_21_WGS_bam,
            "-R", b37_2bit_reference_20_21,
            "-O", createTempFile("testGVCF_GZ_throw_exception", ".g.vcf.gz").getAbsolutePath(),
droazen (Contributor):
Also test that .g.bcf and .g.bcf.gz are disabled.

lbergelson (Member, Author):
done

droazen assigned lbergelson and unassigned droazen on Jan 29, 2018
droazen (Contributor) left a comment:
👍

droazen merged commit d6e7635 into master on Jan 30, 2018
droazen deleted the lb_fix_spark_g_vcf_gz branch on January 30, 2018 at 15:10
lbergelson added a commit that referenced this pull request on Jan 31, 2018: "prevent users from requesting g.vcf.gz in Spark" (the commit message repeats the PR description above).
Development

Successfully merging this pull request may close these issues:
prevent running HaplotypeCallerSpark with broken output formats (#4275)

3 participants