Error with makeblastdb using GTDB taxonkit create-taxdump taxid.map #70

Closed
BenjaminJPerry opened this issue Nov 23, 2022 · 4 comments

@BenjaminJPerry

Hello Wei Shen,

This is not strictly an error with taxonkit create-taxdump; it's more of a feature request.

I'm trying to use the taxid.map generated using taxonkit create-taxdump for the GTDB database (r207) when making a blastn database of the complete set of GTDB representative genomes (r207).

Making the taxdump using taxonkit,

(/home/perrybe/conda-envs/taxonkit) inscrutable$ taxonkit --help | head
TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit

Version: 0.13.0

Author: Wei Shen <shenwei356@gmail.com>

Source code: https://github.com/shenwei356/taxonkit
Documents  : https://bioinf.shenwei.me/taxonkit
Citation   : https://www.sciencedirect.com/science/article/pii/S1673852721000837

(/home/perrybe/conda-envs/taxonkit) inscrutable$ taxonkit create-taxdump --gtdb --force ar53_taxonomy.tsv bac120_taxonomy.tsv -O ./
09:02:31.932 [INFO] 317542 records saved to taxid.map
09:02:32.366 [INFO] 401815 records saved to nodes.dmp
09:02:32.642 [INFO] 401815 records saved to names.dmp
09:02:32.644 [INFO] 0 records saved to merged.dmp
09:02:32.644 [INFO] 0 records saved to delnodes.dmp

Using it to make the blast database (where the error occurs),

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ makeblastdb -version
makeblastdb: 2.9.0+
 Package: blast 2.9.0, build May 31 2019 20:53:30

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ makeblastdb -in GTDB-latest.fna -dbtype nucl -parse_seqids -taxid_map taxid.map


Building a new DB, current time: 11/24/2022 08:56:25
New DB name:   /bifo/scratch/2022-BJP-GTDB_Benchmarking/gtdb-latest/GTDB-latest.fna
New DB title:  GTDB-latest.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1559335677723/work/c++/src/corelib/ncbistr.cpp", line 578: Error: ncbi::NStr::StringToInt() - Cannot convert string '2988443261' to int, overflow (m_Pos = 0)

In the taxid.map generated with taxonkit,

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ cat taxid.map | wc -l
317542
(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ cat taxid.map | grep -n "2988443261"
4:GCF_000980105.1       2988443261
(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ head taxid.map
GCF_000979375.1 1349515035
GCF_000970165.1 1457399847
GCF_000979555.1 732503645
GCF_000980105.1 2988443261 <---
GCF_000007065.1 369781300
GCF_000980175.1 148096987
GCF_000970205.1 3005035806
GCA_002506415.1 1847834409
GCF_000970245.1 977990156
GCF_000970185.1 3122581739

It seems like the size of the value is too large for makeblastdb to handle when building?

It may be more of an issue with makeblastdb, but I thought I would pass it on as it might be an easy fix in taxonkit 😋
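
To see how many entries are affected, the taxid.map can be scanned for values above the signed 32-bit limit. The following is a minimal sketch, not part of taxonkit or BLAST: the function name `find_overflowing_taxids` is hypothetical, and it assumes each line holds an accession and a taxid separated by whitespace, as in the `head` output above.

```python
INT32_MAX = (1 << 31) - 1  # 2147483647, largest value a signed 32-bit int can hold


def find_overflowing_taxids(lines):
    """Return (line_number, accession, taxid) for every taxid above INT32_MAX."""
    overflowing = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, taxid = fields[0], int(fields[1])
        if taxid > INT32_MAX:
            overflowing.append((line_number, accession, taxid))
    return overflowing


# The ten lines shown by `head taxid.map` above.
sample = """\
GCF_000979375.1 1349515035
GCF_000970165.1 1457399847
GCF_000979555.1 732503645
GCF_000980105.1 2988443261
GCF_000007065.1 369781300
GCF_000980175.1 148096987
GCF_000970205.1 3005035806
GCA_002506415.1 1847834409
GCF_000970245.1 977990156
GCF_000970185.1 3122581739
""".splitlines()

for line_number, accession, taxid in find_overflowing_taxids(sample):
    print(f"{line_number}:{accession}\t{taxid}")
```

For a real file, an open file handle can be passed in place of `sample`, for example `with open("taxid.map") as handle: find_overflowing_taxids(handle)`.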

Thank you for all the excellent bioinformatic software 🥇 😁

Ben

@BenjaminJPerry
Author

I tried the latest release of makeblastdb and had the same error,

inscrutable$ makeblastdb -version
makeblastdb: 2.13.0+
 Package: blast 2.13.0, build Jul 18 2022 22:49:37

inscrutable$ makeblastdb -in GTDB-latest.fna -input_type fasta -dbtype nucl -taxid_map taxid.map -parse_seqids -out GTDB-r207


Building a new DB, current time: 11/24/2022 09:38:13
New DB name:   /bifo/scratch/2022-BJP-GTDB_Benchmarking/gtdb-latest/GTDB-r207
New DB title:  GTDB-latest.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1658184301332/work/blast/c++/src/corelib/ncbistr.cpp", line 640: Error: (CStringException::eConvert) ncbi::NStr::StringToInt() - Cannot convert string '2988443261' to int, overflow (m_Pos = 0)


@shenwei356
Owner

shenwei356 commented Nov 24, 2022

We hashed the taxon name (in lower case) of each taxon node to uint64 using xxhash and converted it to uint32 (max value: (1<<32) - 1 = 4294967295). However, it looks like more than one tool (shenwei356/gtdb-taxdump#4) stores a taxid as an int32 (max value: (1<<31) - 1 = 2147483647).

It's time for a change.
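
The difference between the two ranges can be illustrated with a rough sketch. It is not taxonkit's code: BLAKE2b from Python's standard library stands in for xxhash, the function names are hypothetical, and masking to the low bits is only one assumed way to reduce a 64-bit hash. The thread does not say how the actual conversion is done.

```python
import hashlib

UINT32_MAX = (1 << 32) - 1  # 4294967295
INT32_MAX = (1 << 31) - 1   # 2147483647


def hash64(name):
    """64-bit hash of the lower-cased name (BLAKE2b as a stand-in for xxhash)."""
    digest = hashlib.blake2b(name.lower().encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def taxid_uint32(name):
    """Keep the low 32 bits: the result can be as large as UINT32_MAX."""
    return hash64(name) & UINT32_MAX


def taxid_int32_safe(name):
    """Keep the low 31 bits: the result never exceeds INT32_MAX."""
    return hash64(name) & INT32_MAX


# Hypothetical taxon names, used only for illustration.
for name in ["Archaea", "Methanosarcina", "Methanosarcina mazei"]:
    wide, narrow = taxid_uint32(name), taxid_int32_safe(name)
    print(f"{name}\tuint32={wide}\tfits_int32={wide <= INT32_MAX}\tint32_safe={narrow}")
```

Any value above 2147483647, such as the 2988443261 reported above, fits in a uint32 but overflows a parser that expects int32. Narrowing the range halves the number of available IDs, so hash collisions become somewhat more likely and would need to be handled separately.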

@shenwei356
Owner

shenwei356 commented Nov 24, 2022

Just updated the code. Please test it.

$ grep GCF_000980105.1 gtdb-taxdump/R207/taxid.map 
GCF_000980105.1 840959613

I'll update https://github.com/shenwei356/gtdb-taxdump later.

@shenwei356
Owner

Tagged a new release: v0.14.0
