Tribble truncated VCF lines reading certain vcf.gzs without an index #4224
Comments
@ldgauthier Is line 3889836 of your vcf corrupt (which is what is implied by the "wrong number of tokens" error)? In general we only require an index in GATK4 if |
Why does this work if you have an index? |
Also, this is orthogonal to the issue at hand, but you should really be running |
It looks perfectly reasonable from that log output and when I select that line in particular (granted, after I've added the index back in) it looks fine. I count 10 columns. I dug into this before I figured out the index fix. I don't know why it's expecting 8 (apparently it thinks
|
@ldgauthier @yfarjoun came to me just now with a similar issue involving a truncated line from a |
Correction: the tbi was NOT present in my (actually @tlangs 's ) case. I've asked him to run it again with the index present and see what happens. |
My file came from a workflow that's pretty similar to the production joint calling workflow: blah, blah, GenomicsDBImport, my own GATK4 walker, VariantAnnotator, VariantFiltration, MakeSitesOnlyVcf, ApplyVQSR, SelectVariants. The SelectVariants is from my own jar which was branched off of probably 4beta6. |
Running it with the index works. |
If you uncompress the file does it work without the index? |
@lbergelson Yep! |
An interesting note: when I run this on my own machine over the same input file, the error triggers at a different site, namely the site indicated in this stack trace:
|
It's pretty clear at this point that there is a bug in tribble with iteration over block-compressed inputs that lack an index. This is a completely different codepath (and even a different
To buy us some time to nail this down, we are going to patch GATK to always require an index for block-compressed tribble files, even if |
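A fail-early check of the kind described here could look roughly like the sketch below. This is not the actual GATK patch: IndexGuard and its methods are hypothetical names, and the sketch assumes that block-compressed inputs can be recognised by a .gz or .bgz extension and that the tabix index sits next to the file as <name>.tbi.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only (not GATK or htsjdk code).
class IndexGuard {

    // Assumption: the tabix index lives next to the file as "<name>.tbi".
    static Path expectedTabixIndex(Path featureFile) {
        return featureFile.resolveSibling(featureFile.getFileName() + ".tbi");
    }

    // Throws before any records are read if a compressed file has no index.
    static void requireIndexIfCompressed(Path featureFile) {
        String name = featureFile.getFileName().toString();
        // Assumption: compression is inferred from the extension alone.
        boolean compressed = name.endsWith(".gz") || name.endsWith(".bgz");
        if (compressed && !Files.exists(expectedTabixIndex(featureFile))) {
            throw new IllegalStateException(
                "Missing index for block-compressed file " + featureFile
                + " (expected " + expectedTabixIndex(featureFile) + ")");
        }
    }

    public static void main(String[] args) {
        // Pass the path of a feature file on the command line to check it.
        Path input = Paths.get(args.length > 0 ? args[0] : "example.vcf.gz");
        try {
            requireIndexIfCompressed(input);
            System.out.println("OK: " + input);
        } catch (IllegalStateException e) {
            System.out.println("Refusing to read: " + e.getMessage());
        }
    }
}
```

Checking up front means the user sees the missing-index message immediately, rather than a "wrong number of tokens" error millions of lines into the file.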
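@ldgauthier @yfarjoun We have an update on this! We've identified the bug:
- When AbstractFeatureReader.getFeatureReader() tries to open a .vcf.gz that doesn't have an index, it returns a TribbleIndexedFeatureReader instead of a TabixFeatureReader, because methods.isTabix() returns false when an index is not present.
- TribbleIndexedFeatureReader, in turn, opens a Java vanilla GZIPInputStream, instead of the BlockCompressedInputStream that gets opened when you create a TabixFeatureReader.
- GZIPInputStream, in turn, has a confirmed bug filed against it in Oracle's bug tracker (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144#), that it inappropriately relies on the available() method to detect end-of-file, which is never safe to do given the contract of available().
- As the final piece in the ghastly puzzle, implementations of SeekableStream in htsjdk do not implement available() at all, instead using the default implementation which always returns 0.
As a result of this combination of bugs in Java's GZIPInputStream itself and bugs in htsjdk's SeekableStream classes, end-of-file can be detected prematurely when within 26 bytes of the end of a block, due to the following code in GZIPInputStream.readTrailer():
    if (this.in.available() > 0 || n > 26) {
        ....
    }
    return true; // EOF
Where n is the number of bytes left to inflate in the current block.
The solution is to replace all usages of the bugged |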
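To make the interaction concrete, here is a minimal Java sketch. It is not code from htsjdk or the JDK: NoAvailableStream and PrematureEofDemo are hypothetical names. The first stands in for any stream that leaves available() at the InputStream default, and treatedAsEof() only mirrors the shape of the readTrailer() condition quoted in this thread, with n passed in as a plain number.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stream that, like the SeekableStream classes described in
// this thread, does not override available().
class NoAvailableStream extends InputStream {
    private final byte[] data;
    private int pos = 0;

    NoAvailableStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }
    // available() is deliberately not overridden, so the InputStream
    // default applies and it returns 0 even though data remains.
}

class PrematureEofDemo {

    // Mirrors the quoted condition: returns true when the check would
    // report end-of-file instead of looking for another gzip block.
    static boolean treatedAsEof(int available, int n) {
        if (available > 0 || n > 26) {
            return false; // would continue with the next block
        }
        return true; // EOF
    }

    // Convenience wrapper so callers need not handle IOException.
    static int reportedAvailable(InputStream in) {
        try {
            return in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new NoAvailableStream(new byte[1000]);
        int available = reportedAvailable(in);
        System.out.println("available() reports " + available);
        System.out.println("EOF when n = 20? " + treatedAsEof(available, 20));
        System.out.println("EOF when n = 27? " + treatedAsEof(available, 27));
    }
}
```

With available() stuck at 0, the only thing keeping the reader going is n > 26, which would explain why the failure shows up only at certain positions in the file.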
Wow. How on Earth did we avoid this for so long?
On Wed, Jan 24, 2018 at 4:39 PM droazen ***@***.***> wrote:
@ldgauthier @yfarjoun We have an update on this! We've identified the bug:
- When AbstractFeatureReader.getFeatureReader() tries to open a .vcf.gz that doesn't have an index, it returns a TribbleIndexedFeatureReader instead of a TabixFeatureReader, because methods.isTabix() returns false when an index is not present.
- TribbleIndexedFeatureReader, in turn, opens a Java vanilla GZIPInputStream, instead of the BlockCompressedInputStream that gets opened when you create a TabixFeatureReader.
- GZIPInputStream, in turn, has a *confirmed bug* filed against it in Oracle's bug tracker (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7036144#), that it inappropriately relies on the available() method to detect end-of-file, which is never safe to do given the contract of available().
- As the final piece in the ghastly puzzle, implementations of SeekableStream in htsjdk do not implement available() at all, instead using the default implementation which always returns 0.
As a result of this combination of bugs in Java's GZIPInputStream itself and bugs in htsjdk's SeekableStream classes, end-of-file can be detected prematurely when within 26 bytes of the end of a block, due to the following code in GZIPInputStream.readTrailer():
    if (this.in.available() > 0 || n > 26) {
        ....
    }
    return true; // EOF
Where n is the number of bytes left to inflate in the current block.
The solution is to replace all usages of the bugged GZIPInputStream with BlockCompressedInputStream in tribble in htsjdk (at least, for points in the code where the input is known to be block-gzipped rather than regular gzipped). For due diligence we should also implement available() correctly for all implementations of SeekableStream in htsjdk.
|
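On telling block-gzipped input apart from regular gzip: one option is to look in the gzip header for the "BC" extra subfield that the BGZF format uses. The sketch below is an illustration only. BgzfSniffer is a hypothetical name, the caller is assumed to have already read the first bytes of the file into an array, and htsjdk ships its own check that real code should prefer.

```java
// Hypothetical helper, for illustration only (not htsjdk code).
class BgzfSniffer {

    // Returns true if the given leading bytes look like a BGZF block header:
    // gzip magic, deflate method, FEXTRA flag set, and a "BC" extra subfield
    // whose payload is 2 bytes long.
    static boolean looksBlockGzipped(byte[] header) {
        if (header.length < 16) {
            return false; // too short to hold the fixed header plus a subfield
        }
        boolean gzipMagic = (header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b;
        boolean deflate = header[2] == 8;
        boolean hasExtra = (header[3] & 0x04) != 0;
        if (!gzipMagic || !deflate || !hasExtra) {
            return false;
        }
        // XLEN is a little-endian 16-bit length of the extra field.
        int xlen = (header[10] & 0xff) | ((header[11] & 0xff) << 8);
        int end = Math.min(header.length, 12 + xlen);
        int pos = 12;
        // Walk the extra subfields: 2 id bytes, 2 length bytes, then payload.
        while (pos + 4 <= end) {
            int slen = (header[pos + 2] & 0xff) | ((header[pos + 3] & 0xff) << 8);
            if (header[pos] == 'B' && header[pos + 1] == 'C' && slen == 2) {
                return true;
            }
            pos += 4 + slen;
        }
        return false;
    }

    public static void main(String[] args) {
        // Header of a plain gzip stream with no extra field (FLG = 0).
        byte[] plainGzip = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println("plain gzip looks block-gzipped? " + looksBlockGzipped(plainGzip));
    }
}
```

A content-based check like this avoids relying on the .gz extension, which regular gzip and bgzip output share.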
😱 Oh my goodness that's some digging! |
I did a PR in HTSJDK (samtools/htsjdk#1077) for the |
* Temporarily disabling unindexed tabix feature files, see #4224
@lbergelson Is this still an issue in the latest HTSJDK? |
Yes, I think there was more than one issue, and more than one fix here. I'm closing. I think we can close #3837 as well? |
This is fixed in htsjdk as the others said. |
My struggles with indexes in GATK4 continue. I forgot to pull down the corresponding *.tbi index for my .vcf.gz, but SelectVariants just toodled along until chr14:
I ran
    java -jar gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar SelectVariants -V gnomADaccuracyTest.noMQinSNPVQSR.SynDip.vcf.gz -O testNoIndex.vcf.gz
Data is at /humgen/gsa-hpprojects/dev/gauthier/reblockGVCF
If I remember to pull down the index everything works swimmingly. I'd love for this to either work without an index or fail early with an appropriate message about the index being missing.