Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tribble truncated VCF lines reading certain vcf.gzs without an index #4224

Closed
ldgauthier opened this issue Jan 22, 2018 · 22 comments
Closed

Comments

@ldgauthier
Copy link
Contributor

My struggles with indexes in GATK4 continue. I forgot to pull down the corresponding *.tbi index for my .vcf.gz, but SelectVariants just toodled along until chr14:

htsjdk.tribble.TribbleException: Line 3889836: there aren't enough columns for line chr14       24737838        rs1101636       C       T       .       PASS    AC=2;AF=1.00;AN=2;DB;DP=38;ExcessHet=0.7420;FS=0.000;InbreedingCoeff=0.0357;MQ=59.98;MQRankSum=0.026;MQ_DP=24332;POSITIVE_TRAIN_SITE;QD=22.41;QUALapprox=537 (we expected 9 tokens, and saw 8 ), for input source: file:///humgen/gsa-hpprojects/dev/gauthier/reblockGVCF/gnomADaccuracyTest.noMQinSNPVQSR.SynDip.vcf.gz
        at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
        at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
        at org.broadinstitute.hellbender.Main.main(Main.java:275)

I ran java -jar gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar SelectVariants -V gnomADaccuracyTest.noMQinSNPVQSR.SynDip.vcf.gz -O testNoIndex.vcf.gz. Data is at /humgen/gsa-hpprojects/dev/gauthier/reblockGVCF If I remember to pull down the index everything works swimmingly. I'd love for this to either work without an index or fail early with an appropriate message about the index being missing.

@droazen
Copy link
Contributor

droazen commented Jan 22, 2018

@ldgauthier Is line 3889836 of your vcf corrupt (which is what is implied by the "wrong number of tokens" error)?

In general we only require an index in GATK4 if -L is specified.

@lbergelson
Copy link
Member

Why does this work if you have an index?

@droazen
Copy link
Contributor

droazen commented Jan 22, 2018

Also, this is orthogonal to the issue at hand, but you should really be running ./gatk rather than java -jar when you run GATK4, as the gatk frontend script sets a number of important system/htsjdk properties.

@ldgauthier
Copy link
Contributor Author

ldgauthier commented Jan 22, 2018

It looks perfectly reasonable from that log output and when I select that line in particular (granted, after I've added the index back in) it looks fine. I count 10 columns. I dug into this before I figured out the index fix. I don't know why it's expecting 8 (apparently it thinks header == null) or getting 9 (apparently that's what condenseTrailingTokens does) If I pad that site to get the VC above and below they all look well behaved, whitespace and everything:

chr14^I24736867^Irs854413^IT^IC^I.^IPASS^IAC=2;AF=1.00;AN=2;DB;DP=31;ExcessHet=0.5837;FS=0.000;InbreedingCoeff=0.0411;MQ=59.97;MQRankSum=-1.800e-02;MQ_DP=24250;POSITIVE_TRAIN_SITE;QD=25.91;QUALapprox=616134;ReadPosRankSum=0.576;SOR=0.704;VQSLOD=2.30;VarDP=23778;culprit=MQRankSum^IGT:AD:DP:GQ:PGT:PID:PL:SB^I1/1:0,31:31:93:0|1:24736864_A_G:1036,93,0:0,0,15,16$
chr14^I24737838^Irs1101636^IC^IT^I.^IPASS^IAC=2;AF=1.00;AN=2;DB;DP=38;ExcessHet=0.7420;FS=0.000;InbreedingCoeff=0.0357;MQ=59.98;MQRankSum=0.026;MQ_DP=24332;POSITIVE_TRAIN_SITE;QD=22.41;QUALapprox=537499;ReadPosRankSum=0.567;SOR=0.685;VQSLOD=2.09;VarDP=23982;culprit=DP^IGT:AD:DP:GQ:PL:SB^I1/1:1,37:38:79:1150,79,0:0,1,17,20$
chr14^I24739444^Irs12435407^IA^IT^I.^IPASS^IAC=2;AF=1.00;AN=2;DB;DP=30;ExcessHet=0.4770;FS=0.000;InbreedingCoeff=0.0430;MQ=59.99;MQRankSum=-6.000e-02;MQ_DP=17420;POSITIVE_TRAIN_SITE;QD=27.90;QUALapprox=478116;ReadPosRankSum=0.293;SOR=0.698;VQSLOD=2.27;VarDP=17137;culprit=MQRankSum^IGT:AD:DP:GQ:PGT:PID:PL:SB^I1/1:1,29:30:21:0|1:24739444_A_T:1167,21,0:1,0,15,14$

@droazen droazen added the bug label Jan 22, 2018
@droazen droazen changed the title Tribble segfault reading .vcf.gz without index Tribble truncated VCF lines reading certain vcf.gzs Jan 22, 2018
@droazen
Copy link
Contributor

droazen commented Jan 22, 2018

@ldgauthier @yfarjoun came to me just now with a similar issue involving a truncated line from a .vcf.gz, only his issue happened even with a tabix index present. We think it's possible that people are somehow creating block-compressed files that are not compatible with htsjdk. Can you give me the complete provenance of your vcf file above (in particular, which tool block-compressed it)?

@yfarjoun
Copy link
Contributor

yfarjoun commented Jan 22, 2018

Correction: the tbi was NOT present in my (actually @tlangs 's ) case. I've asked him to run it again with the index present and see what happens.

@cmnbroad
Copy link
Collaborator

@tlangs 's issue sounds similar to #3837.

@ldgauthier
Copy link
Contributor Author

My file came from a workflow that's pretty similar to the production joint calling workflow: blah, blah, GenomicsDBImport, my own GATK4 walker, VariantAnnotator, VariantFiltration, MakeSitesOnlyVcf, ApplyVQSR, SelectVariants. The SelectVariants is from my own jar which was branched off of probably 4beta6.

@tlangs
Copy link
Contributor

tlangs commented Jan 23, 2018

Running it with the index works.

@lbergelson
Copy link
Member

If you uncompress the file does it work without the index?

@tlangs
Copy link
Contributor

tlangs commented Jan 23, 2018

@lbergelson Yep!

@jamesemery
Copy link
Collaborator

jamesemery commented Jan 23, 2018

An interesting note, when I run this on my own machine over the same input file I'm getting the error triggering at a different site. Namely at the site indicated in this stack trace:

htsjdk.tribble.TribbleException: Line 2607098: there aren't enough columns for line 
chr8	95336218	rs4735361	C	G	PASS	AC=2;AF=1.00;AN=2;DB;DP=30;ExcessHet=6.5471;FS=0.000;InbreedingCoeff=-0.0338;MQ=60.00;MQRankSum=-5.200e-02;MQ_DP=19905;POSITIVE_TRAIN_SITE;QD=19.71;QUALapprox=388030;ReadP (we expected 9 tokens, and saw 8 ), for input source: gnomADaccuracyTest.SynDip.vcf.gz
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:372)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:353)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:314)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
	at org.broadinstitute.hellbender.Main.main(Main.java:275)```

@droazen droazen changed the title Tribble truncated VCF lines reading certain vcf.gzs Tribble truncated VCF lines reading certain vcf.gzs without an index Jan 23, 2018
@droazen
Copy link
Contributor

droazen commented Jan 23, 2018

It's pretty clear at this point that there is a bug in tribble with iteration over block-compressed inputs that lack an index. This is a completely different codepath (and even a different FeatureReader) than you get if an index is present.

To buy us some time to nail this down, we are going to patch GATK to always require an index for block-compressed tribble files, even if -L is not specified. This change will go out in the bug fix release this Friday.

@droazen
Copy link
Contributor

droazen commented Jan 24, 2018

@ldgauthier @yfarjoun We have an update on this! We've identified the bug:

  • When AbstractFeatureReader.getFeatureReader() tries to open a .vcf.gz that doesn't have an index, it returns a TribbleIndexedFeatureReader instead of a TabixFeatureReader, because methods.isTabix() returns false when an index is not present.
  • TribbleIndexedFeatureReader, in turn, opens a Java vanilla GZIPInputStream, instead of the BlockCompressedInputStream that gets opened when you create a TabixFeatureReader.
  • GZIPInputStream, in turn, has a confirmed bug filed against it in Oracle's bug tracker (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144#), that it inappropriately relies on the available() method to detect end-of-file, which is never safe to do given the contract of available()
  • As the final piece in the ghastly puzzle, implementations of SeekableStream in htsjdk do not implement available() at all, instead using the default implementation which always returns 0.

As a result of this combination of bugs in Java's GZIPInputStream itself and bugs in htsjdk's SeekableStream classes, end-of-file can be detected prematurely when within 26 bytes of the end of a block, due to the following code in GZIPInputStream.readTrailer():

        if (this.in.available() > 0 || n > 26) {
            ....
        }
        return true;  // EOF

Where n is the number of bytes left to inflate in the current block.

The solution is to replace all usages of the bugged GZIPInputStream with BlockCompressedInputStream in tribble in htsjdk (at least, for points in the code where the input is known to be block-gzipped rather than regular gzipped). For due diligence we should also implement available() correctly for all implementations of SeekableStream in htsjdk.

@yfarjoun
Copy link
Contributor

yfarjoun commented Jan 24, 2018 via email

@tlangs
Copy link
Contributor

tlangs commented Jan 24, 2018

😱 Oh my goodness that's some digging!

@magicDGS
Copy link
Contributor

I did a PR in HTSJDK (samtools/htsjdk#1077) for the SeekableStream part, for tracking also here the progress on this.

@droazen droazen removed this from the Engine-4.0.1.0 milestone Jan 26, 2018
@droazen droazen added this to the Engine-4.1 milestone Jan 26, 2018
jamesemery added a commit that referenced this issue Jan 26, 2018
* Temporarily disabling unindexed tabix feature files, see #4224
lbergelson pushed a commit that referenced this issue Jan 31, 2018
* Temporarily disabling unindexed tabix feature files, see #4224
@droazen droazen modified the milestones: Engine-4.1, Engine-1Q2018 Feb 5, 2018
@droazen droazen modified the milestones: Engine-1Q2018, Engine-2Q2018 Apr 6, 2018
tomwhite added a commit that referenced this issue Apr 27, 2018
@droazen droazen removed this from the Engine-2Q2018 milestone Oct 4, 2018
@droazen droazen assigned lbergelson and unassigned jamesemery Oct 4, 2018
@droazen droazen added this to the Engine-Q42018 milestone Oct 4, 2018
@droazen
Copy link
Contributor

droazen commented Oct 4, 2018

@lbergelson Is this still an issue in the latest HTSJDK?

@magicDGS
Copy link
Contributor

@droazen - I believe that this is fixed: I did the SeekableStream part, and @cmnbroad did some stuff with the tribble/bgzip...

@droazen
Copy link
Contributor

droazen commented Oct 10, 2018

Thanks @magicDGS -- @cmnbroad , would you also be comfortable closing this one?

@cmnbroad
Copy link
Collaborator

Yes, I think there was more than one issue, and more than one fix here. I'm closing. I think we can close #3837 as well ?

@lbergelson
Copy link
Member

This is fixed in htsjdk as the others said.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants