Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible bug: slowdown and crashing when using uncompressed reference. #53

Closed
matthdsm opened this issue May 19, 2017 · 9 comments
Closed

Comments

@matthdsm
Copy link
Contributor

matthdsm commented May 19, 2017

Hi Will,

We're experiencing the following:
using the command:

 unset PERL5LIB && \
vep \
--vcf -o stdout \
-i GiaB.vcf.gz \
--fork 8 \
--species homo_sapiens \
--no_stats \
--cache \
--merged \
--offline \
--dir /genomes/Hsapiens/hg38/vep \
--symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot \
--domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac \
--pubmed --variant_class --sift b --polyphen b \
--plugin LoF,human_ancestor_fa:false \
--plugin MaxEntScan,/share/maxentscan-0_2004.04.21-0 \
--plugin GeneSplicer,/share/genesplicer-1.0-1/genesplicer,/genomes/Hsapiens/hg38/variation/genesplicer \
--hgvs --shift_hgvs 1 --fasta /genomes/Hsapiens/hg38/seq/hg38.fa \
| sed '/^#/! s/;;/;/g' \
| bgzip -c > GiaB-vepeffects.vcf.gz

I'm getting a gzip: stdout: Broken pipe error, which doesn't really do anything, but just looks ugly.
What's worse is that this command takes forever to run (8h+) and then crashes. The issue is solved by removing the --hgvs --shift_hgvs 1 --fasta /genomes/Hsapiens/hg38/seq/hg38.fa flags, but then we are left without some crucial annotations.
Since our vep installation has some HTSLIB issues, we're not able to test this with the default bgzipped fasta file.
I have a suspicion that the issue stems from the indexing of the fasta file. Instead of hg38.fa.index as was usual. I'm getting hg38.fa.index.dir and hg38.fa.index.pag.

Another (smaller) annoyance is that both Genesplicer and MaxEntScan don't work without specifying a fasta using the --fasta flag.

Any idea's in this?

Cheers
M

@matthdsm
Copy link
Contributor Author

matthdsm commented May 19, 2017

As a follow up,
is there any way we could create the fasta index ourselves outside of VEP? It doesn't seem to be a standard .fai format, but something bioperl specific.

I've also come across an old support thread, saying an older version of bioperl could be to blame, but afaik, we're using 1.6.924

[node]$ conda list bioperl                                                                                                                                                             # packages in environment at /home/galaxy/bcbio/anaconda:
#
perl-bioperl              1.6.924                       4    bioconda

Cheers
M

@willmclaren
Copy link
Contributor

I'm surprised to see the index.dir and index.pag files, especially as you are pointing to a file and not a directory. Perhaps the indexer is confused by a symlink or similar?

VEP uses Bio::DB::Fasta if it can't use Bio::DB::HTS::Faidx, and this has a number of issues:

  • older versions of BioPerl can't index large FASTA files (yours should be OK)
  • if you interrupt a script in the middle of index creation, this goes undetected and Bio::DB::Fasta attempts to use the incomplete/corrupted index (often without any apparent issues apart from the appearance of Ns in fetched sequence). To be sure, remove all index files and run again.
  • it is a lot slower than Bio::DB::HTS::Faidx, particularly compared to when using a compressed FASTA file

However, even using Bio::DB::HTS::Faidx, generating HGVS adds significant runtime to VEP due to various internal overheads (sequence lookup being one of them), so if you can in any way avoid this that will help. It is something we are looking at improving in future versions.

@matthdsm
Copy link
Contributor Author

Hi Will,
thanks for commenting. So, if I understand you correctly, fixing the HTSLIB issue in the conda recipe should improve performance, even with an uncompressed fasta file.
I know what I'm doing today ;)

Thanks
M

@willmclaren
Copy link
Contributor

I'd say that you may not see much faster performance using an uncompressed file, but the library is infinitely better with regards to index stability etc. If possible definitely use a bgzipped FASTA file; it's faster (I think around 10x) and of course takes up way less disk space too.

@matthdsm
Copy link
Contributor Author

Hi Will,

I did some debugging and found the issue is caused by a "missing" htslib file.
Curiously, this only occurs with htslib 1.4.1. htslib 1.3 works fine.

Cheers
M

@willmclaren
Copy link
Contributor

Presumably the lzma library? We also had this issue and had to switch to using a fixed release of htslib for Travis...

@matthdsm
Copy link
Contributor Author

matthdsm commented May 23, 2017

Actually, the missing library is libhts.so.1.
Apparently this is caused by some changes in htslib 1.4.1 ( see here)
I think we're going to have to pin the htslib release for the conda vep, unless Brad can find a fix.

M

@matthdsm
Copy link
Contributor Author

Hi Will,

I was checking out the automated installer en noticed that the default install of htslib is v1.3.2.
Are there any plans to update this to the latest version? (being 1.4.1)

Cheers
M

@willmclaren
Copy link
Contributor

Eventually we may, yes, but really it makes little difference - the differences between the versions AFAIK have no bearing on the functionality that VEP uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants