GTDB taxonomy taxdump files with trackable TaxIds

Metagenomic tools like Kraken2, Centrifuge and KMCP support NCBI taxonomy in the format of NCBI taxdump files. GTDB, a catalogue of prokaryotic genomes, has its own taxonomy data. Although its genomes, derived from GenBank and RefSeq, can be mapped to NCBI taxonomy TaxIds, there is an urgent need for GTDB to have its own taxdump files with stable and trackable TaxIds.

A TaxonKit command, taxonkit create-taxdump, is used to create NCBI-style taxdump files for any taxonomy dataset, including GTDB and ICTV.

Related projects:

ictv-taxdump: NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV)
taxid-changelog: NCBI taxonomic identifier (taxid) changelog
taxonkit: A Practical and Efficient NCBI Taxonomy Toolkit

Method

Taxonomic hierarchy

A GTDB species cluster contains one or more assemblies, each of which can be treated as a strain. We therefore assign each assembly a TaxId with the rank "no rank" below the species rank, which also lets us track changes to these assemblies via their TaxIds later.
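To make the hierarchy concrete, here is a minimal sketch that expands one line of a GTDB taxonomy file into rank/name pairs and appends the assembly as a "no rank" leaf. The function name `expand_gtdb_line` is hypothetical, and the mapping of the `d__` prefix to "superkingdom" is an assumption; this only illustrates the layout and is not how taxonkit parses the files. The real taxdump files name the leaf by the trimmed accession number rather than the full accession printed here.

```shell
# Expand "accession<TAB>d__..;p__..;...;s__.." into one "rank<TAB>name" line
# per named node, followed by the assembly itself as a "no rank" leaf.
# Assumption: rank letters follow the d/p/c/o/f/g/s prefixes of GTDB lineages.
expand_gtdb_line() {
    printf '%s\n' "$1" | awk -F'\t' '
        BEGIN {
            rank["d"] = "superkingdom"; rank["p"] = "phylum"; rank["c"] = "class"
            rank["o"] = "order"; rank["f"] = "family"; rank["g"] = "genus"
            rank["s"] = "species"
        }
        {
            n = split($2, node, ";")
            for (i = 1; i <= n; i++) {
                prefix = substr(node[i], 1, 1)   # e.g. "d" from "d__Bacteria"
                name = substr(node[i], 4)        # text after the "x__" prefix
                if (name != "") print rank[prefix] "\t" name   # skip empty ranks such as "s__"
            }
            print "no rank\t" $1
        }'
}

expand_gtdb_line "$(printf 'GB_GCA_002778535.1\td__Bacteria;p__Patescibacteria;c__ABY1;o__UBA2196;f__UBA2196;g__1-14-0-10-36-11;s__1-14-0-10-36-11 sp002778535')"
```

The last line of the output is the assembly, sitting one level below its species.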

Generation of TaxIds

We just hash the rank+taxon_name (in lower case) of each taxon node to uint64 using xxhash and convert it to int32.

For an NCBI assembly accession:
1. The prefix (GCA_ or GCF_) is not used, because some GenBank entries (GCA_000176655.2 in R80) were moved to RefSeq (GCF_000176655.2 in R83) and the prefix changed.
2. The version number is trimmed, because it may change. So, 000176655 is hashed to get the TaxId.
For a non-NCBI assembly accession, the accession per se is hashed, e.g., UBA12275.
For the name of a node, the taxon name per se is hashed, e.g., Bacteria.
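The accession rules above can be sketched as a small shell function. The name `accession_key` is hypothetical, and the extra stripping of the `RS_`/`GB_` prefixes (as seen in GTDB taxonomy files) is an assumption. The sketch only produces the string that would be hashed; it does not reproduce the xxhash step or the conversion to int32.

```shell
# Reduce an assembly accession to the string that gets hashed.
# NCBI accessions lose the GCA_/GCF_ prefix and the version number;
# anything else (e.g. UBA12275) passes through unchanged.
accession_key() {
    printf '%s\n' "$1" \
        | sed -E -e 's/^(RS|GB)_//' -e 's/^GC[AF]_([0-9]+)(\.[0-9]+)?$/\1/'
}

accession_key GCA_000176655.2   # GenBank entry in R80     -> 000176655
accession_key GCF_000176655.2   # same assembly in R83     -> 000176655
accession_key UBA12275          # non-NCBI accession       -> UBA12275
```

Both forms of the NCBI accession reduce to the same key, which is why the TaxId survives the move from GenBank to RefSeq.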

Data and tools

GTDB taxonomy files are downloaded from https://data.gtdb.ecogenomic.org/releases/ and organized as:

tree taxonomy/
taxonomy/
├── R080
│   └── bac_taxonomy_r80.tsv
├── R083
│   └── bac_taxonomy_r83.tsv
├── R086
│   ├── ar122_taxonomy_r86.2.tsv
│   └── bac120_taxonomy_r86.2.tsv
├── R089
│   ├── ar122_taxonomy_r89.tsv
│   └── bac120_taxonomy_r89.tsv
├── R095
│   ├── ar122_taxonomy_r95.tsv.gz
│   └── bac120_taxonomy_r95.tsv.gz
├── R202
│   ├── ar122_taxonomy_r202.tsv.gz
│   └── bac120_taxonomy_r202.tsv.gz
├── R207
│   ├── ar53_taxonomy_r207.tsv.gz
│   └── bac120_taxonomy_r207.tsv.gz
├── R214
│   ├── ar53_taxonomy_r214.tsv.gz
│   └── bac120_taxonomy_r214.tsv.gz
└── R220
    ├── ar53_taxonomy_r220.tsv.gz
    └── bac120_taxonomy_r220.tsv.gz

TaxonKit v0.12.0 or a later version is needed. v0.16.0 or a later version is preferred.

Since v0.14.0, taxonkit create-taxdump stores TaxIds in int32 following BLAST and DIAMOND, rather than uint32 in previous versions.
Since v0.16.0, duplicated names with different ranks are allowed.

Steps

Generating taxdump files for the first version r80:

 taxonkit create-taxdump taxonomy/R080/*.tsv* --gtdb --out-dir gtdb-taxdump/R080 --force
 22:23:09.195 [INFO] 94759 records saved to gtdb-taxdump/R080/taxid.map
 22:23:09.249 [INFO] 111705 records saved to gtdb-taxdump/R080/nodes.dmp
 22:23:09.293 [INFO] 111705 records saved to gtdb-taxdump/R080/names.dmp
 22:23:09.293 [INFO] 0 records saved to gtdb-taxdump/R080/merged.dmp
 22:23:09.293 [INFO] 0 records saved to gtdb-taxdump/R080/delnodes.dmp

For later versions, we need the taxdump files of the previous version to track merged and deleted nodes.

 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R080/ \
     taxonomy/R083/*.tsv*  --out-dir gtdb-taxdump/R083  --force
     
 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R083/ \
     taxonomy/R086/*.tsv*  --out-dir gtdb-taxdump/R086  --force

 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R086/ \
     taxonomy/R089/*.tsv*  --out-dir gtdb-taxdump/R089  --force
     
 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R089/ \
     taxonomy/R095/*.tsv*  --out-dir gtdb-taxdump/R095  --force
     
 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R095/ \
     taxonomy/R202/*.tsv*  --out-dir gtdb-taxdump/R202  --force
     
 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R202/ \
     taxonomy/R207/*.tsv*  --out-dir gtdb-taxdump/R207  --force

 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R207/ \
     taxonomy/R214/*.tsv*  --out-dir gtdb-taxdump/R214  --force

 taxonkit create-taxdump --gtdb -x gtdb-taxdump/R214/ \
     taxonomy/R220/*.tsv*  --out-dir gtdb-taxdump/R220  --force
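The chain of commands above follows one pattern: each release is built against the output directory of the release before it. A minimal sketch of that pattern as a loop is shown below; the function name `print_create_taxdump_commands` is hypothetical, and it only prints the commands (using the same flags as above) instead of running them.

```shell
# Print the create-taxdump command for each release, chaining every release
# to the output directory of the previous one via -x. Nothing is executed.
print_create_taxdump_commands() {
    prev=""
    for release in "$@"; do
        if [ -z "$prev" ]; then
            # First release: no previous taxdump to compare against.
            echo "taxonkit create-taxdump --gtdb taxonomy/$release/*.tsv* --out-dir gtdb-taxdump/$release --force"
        else
            echo "taxonkit create-taxdump --gtdb -x gtdb-taxdump/$prev/ taxonomy/$release/*.tsv* --out-dir gtdb-taxdump/$release --force"
        fi
        prev=$release
    done
}

print_create_taxdump_commands R080 R083 R086 R089 R095 R202 R207 R214 R220
```

After reviewing the printed commands, piping the output to `sh` would run them in order.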

Generating the TaxId changelog (note that it is not perfect for GTDB taxonomy).

We only check and eliminate TaxId collisions within a single version of the taxonomy data. Therefore, if you create a changelog with "taxonkit taxid-changelog", different taxa in multiple versions might have the same TaxIds, and some change events might be wrong.

A single version of taxonomic data created by "taxonkit create-taxdump" has no such problem; only the changelog might be imperfect.

    taxonkit taxid-changelog -i gtdb-taxdump -o gtdb-taxid-changelog.csv.gz --verbose
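One way to spot cross-version collisions is to look for TaxIds that carry different names in two releases. The sketch below is a hedged illustration, not part of taxonkit: the function name `taxid_name_conflicts` is hypothetical, and it assumes names.dmp uses the NCBI-style `<TAB>|<TAB>` field separator. Names are compared in lower case because the hashed string is lower-cased.

```shell
# List TaxIds that carry different names in two names.dmp files.
# Assumes NCBI-style rows: taxid<TAB>|<TAB>name<TAB>|<TAB>...
# Output: taxid<TAB>name in first file<TAB>name in second file
taxid_name_conflicts() {
    awk -F'\t[|]\t' '
        NR == FNR { old[$1] = $2; next }     # first file: remember taxid -> name
        ($1 in old) && tolower(old[$1]) != tolower($2) {
            print $1 "\t" old[$1] "\t" $2
        }
    ' "$1" "$2"
}

# Usage (paths as produced in the steps above):
#   taxid_name_conflicts gtdb-taxdump/R214/names.dmp gtdb-taxdump/R220/names.dmp
```

An empty result means no TaxId changed its name between the two releases; any line printed is a candidate for a wrong change event in the changelog.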

Download

The release page contains taxdump files for all GTDB versions, and a TaxId changelog file (gtdb-taxid-changelog.csv.gz).

Learn more about the taxid-changelog.

Results

Basic usage

Set the environment variable for simplicity

export TAXONKIT_DB=gtdb-taxdump/R220/

Query the TaxId via an assembly accession

grep GCA_905234495.1 gtdb-taxdump/R220/taxid.map
GCA_905234495.1 254122285

Query the TaxId via taxon name

echo Escherichia coli \
    | taxonkit name2taxid
Escherichia coli        599451526

Complete lineage

# with lineage
echo 599451526 \
    | taxonkit lineage -nr
599451526       Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    Escherichia coli        species

# with reformat
echo 599451526 \
    | taxonkit reformat -I 1
599451526       Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

Complete lineage (GTDB style)

echo 599451526 \
    | taxonkit reformat -I 1 -P --prefix-k d__
599451526       d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli

All lineages

taxonkit list --ids 1 -I "" \
    | taxonkit filter -E species \
    | taxonkit reformat -I 1 -P --prefix-k d__ \
    > gtdb_species.txt

Checking consistency

$ zcat taxonomy/R220/* | cut -f 2 | sort | uniq | md5sum
f9e0f5268ab65026894703db3eab7b4b  -

$ cut -f 2 gtdb_species.txt | sort | md5sum
f9e0f5268ab65026894703db3eab7b4b  -

TaxId changes

Notes:

The Y axis is the number of TaxIds, not that of species.
The data is generated by "taxonkit taxid-changelog", which was originally designed for NCBI taxonomy, where the TaxIds are stable. For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump, some change events might be wrong, because
- There may be dramatic changes between two versions.
- Different taxa in multiple versions might have the same TaxIds, because we only check and eliminate TaxId collisions within a single version.

Species changes

How many species are there in R220?

$ taxonkit list --data-dir gtdb-taxdump/R220/ --ids 1 -I "" \
    | taxonkit filter --data-dir gtdb-taxdump/R220/ -E species \
    | wc -l
113104

How many species are added in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f version -p R220 \
    | csvtk grep -f change -p NEW \
    | csvtk grep -f rank -p species \
    | csvtk nrow
31987

How many species are deleted in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f version -p R220 \
    | csvtk grep -f change -p DELETE \
    | csvtk grep -f rank -p species \
    | csvtk nrow
3127

How many species are merged into others in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f version -p R220 \
    | csvtk grep -f change -p MERGE \
    | csvtk grep -f rank -p species \
    | csvtk nrow
1182
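The three counts above can also be collected in a single pass over the changelog. The following is a rough sketch with plain awk rather than csvtk; the function name `species_change_counts` is hypothetical, the column order (taxid, version, change, change-value, name, rank, ...) is taken from the changelog header shown in this document, and it assumes no field contains a quoted comma.

```shell
# Count species-level change events per change type for one release.
# Reads the changelog CSV from stdin; $1 is the release, e.g. R220.
# Assumption: plain comma-separated fields without quoted commas.
species_change_counts() {
    awk -F',' -v release="$1" '
        NR == 1 { next }                                  # skip the header line
        $2 == release && $6 == "species" { count[$3]++ }  # $3 is the change type
        END { for (c in count) print c "\t" count[c] }
    ' | sort
}

# Usage:
#   pigz -cd gtdb-taxid-changelog.csv.gz | species_change_counts R220
```

If the changelog ever contains quoted fields with commas, a CSV-aware tool such as csvtk remains the safer choice.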

Summary

Complete lineages (R220)

$ cat gtdb-taxdump/R220/taxid.map  \
    | csvtk freq -Ht -f 2 -nr \
    | taxonkit lineage -r -n -L --data-dir gtdb-taxdump/R220/ \
    | taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir gtdb-taxdump/R220/ \
    | csvtk add-header -t -n 'taxid,count,name,rank,superkingdom,phylum,class,order,family,genus,species' \
    > taxid.map.stats.tsv

Frequency of species

$ csvtk freq -t -nr -f species taxid.map.stats.tsv \
    > taxid.map.stats.freq-species.tsv
    
$ head -n 21 taxid.map.stats.freq-species.tsv \
    | csvtk pretty -t
species                      frequency
--------------------------   ---------
Escherichia coli             38926
Klebsiella pneumoniae        18499
Staphylococcus aureus        16021
Salmonella enterica          15089
Streptococcus pneumoniae     9133
Acinetobacter baumannii      8536
Pseudomonas aeruginosa       8390
Mycobacterium tuberculosis   7337
Enterococcus_B faecium       3202
Enterococcus faecalis        3044
Clostridioides difficile     2991
Campylobacter_D jejuni       2873
Listeria monocytogenes       2517
Neisseria meningitidis       2336
Vibrio parahaemolyticus      2264
Streptococcus pyogenes       2258
Mycobacterium abscessus      2029
Listeria monocytogenes_B     2025
Burkholderia mallei          1934
Streptococcus agalactiae     1893

Taxon history of Escherichia coli

csvtk is used to help handle the results.

Get the TaxId:

$ echo Escherichia coli \
    | taxonkit name2taxid --data-dir gtdb-taxdump/R220/
Escherichia coli        599451526

Any changes in the past? Hmm, of course, it appeared in R80.

$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 599451526 \
    | csvtk cut -f -lineage-taxids \
    | csvtk csv2md

taxid	version	change	change-value	name	rank	lineage
599451526	R080	NEW		Escherichia coli	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
599451526	R207	ABSORB	1223627963;1584917910;1670897256;2030830777	Escherichia coli	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
599451526	R214	CHANGE_LIN_TAX		Escherichia coli	species	Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

In R214, the phylum Proteobacteria was renamed to Pseudomonadota, as also mentioned in the release announcement.

Escherichia coli also absorbed four taxa in R207. Let's see what happened to them:

$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 1223627963,1584917910,1670897256,2030830777 \
    | csvtk cut -f -lineage-taxids \
    | csvtk csv2md

taxid	version	change	change-value	name	rank	lineage
1223627963	R089	NEW		Escherichia dysenteriae	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia dysenteriae
1223627963	R207	MERGE	599451526	Escherichia dysenteriae	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia dysenteriae
1584917910	R089	NEW		Escherichia coli_C	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C
1584917910	R089	ABSORB	174151795;266865208	Escherichia coli_C	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C
1584917910	R207	MERGE	599451526	Escherichia coli_C	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C
1670897256	R089	NEW		Escherichia coli_D	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_D
1670897256	R207	MERGE	599451526	Escherichia coli_D	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_D
2030830777	R089	NEW		Escherichia flexneri	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri
2030830777	R207	MERGE	599451526	Escherichia flexneri	species	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri

Yes, Escherichia flexneri is merged into Escherichia coli as reported in the release note of R207.

We can also check the history of an Escherichia flexneri assembly. Listing assemblies:

$ taxonkit list --data-dir gtdb-taxdump/R202/ --ids 2030830777 -n -r -I "" \
    | head -n 5
2030830777 [species] Escherichia flexneri
188562 [no rank] 009882745
246688 [no rank] 003982535
530007 [no rank] 003981095
930852 [no rank] 005393725

E.g., the taxon node 013185635 (taxid 169219442). Let's check the history via the TaxId:

$ echo 013185635 | taxonkit  name2taxid --data-dir gtdb-taxdump/R202/
013185635       169219442

$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 169219442 \
    | csvtk cut -f -lineage-taxids \
    | csvtk csv2md

taxid	version	change	name	rank	lineage
169219442	R202	NEW	013185635	no rank	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri;013185635
169219442	R207	CHANGE_LIN_TAX	013185635	no rank	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;013185635
169219442	R214	CHANGE_LIN_TAX	013185635	no rank	Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;013185635

Note that we removed the prefix (GCA_ or GCF_) and the version number (see Method). So the original assembly accession should be GCA_013185635.X or GCF_013185635.X, which can be found in the taxid.map file:

$ cat gtdb-taxdump/R214/taxid.map \
    | csvtk grep -Ht -f 2 -p 169219442
GCF_013185635.1 169219442
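Since the prefix and version of an accession may differ between releases, it can be handy to search taxid.map by the numeric part alone. The sketch below is one way to do that with awk; the function name `find_assembly` is hypothetical, and taxid.map is assumed to be a two-column, tab-separated file (accession, TaxId) as shown above.

```shell
# Find taxid.map rows for an assembly regardless of GCA_/GCF_ prefix and version.
# $1: numeric part of the accession (e.g. 013185635), $2: path to taxid.map
find_assembly() {
    awk -F'\t' -v num="$1" '
        {
            key = $1
            sub(/^GC[AF]_/, "", key)     # drop the prefix
            sub(/\.[0-9]+$/, "", key)    # drop the version
        }
        (key "") == (num "")             # string comparison keeps leading zeros
    ' "$2"
}

# Usage:
#   find_assembly 013185635 gtdb-taxdump/R214/taxid.map
```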

The GCA_013185635.1 page also shows the taxonomic information of current version (R207) and the taxon history:

Release	Domain	Phylum	Class	Order	Family	Genus	Species
R220	d__Bacteria	p__Pseudomonadota	c__Gammaproteobacteria	o__Enterobacterales	f__Enterobacteriaceae	g__Escherichia	s__Escherichia coli
R214	d__Bacteria	p__Pseudomonadota	c__Gammaproteobacteria	o__Enterobacterales	f__Enterobacteriaceae	g__Escherichia	s__Escherichia coli
R207	d__Bacteria	p__Proteobacteria	c__Gammaproteobacteria	o__Enterobacterales	f__Enterobacteriaceae	g__Escherichia	s__Escherichia coli
R202	d__Bacteria	p__Proteobacteria	c__Gammaproteobacteria	o__Enterobacterales	f__Enterobacteriaceae	g__Escherichia	s__Escherichia flexneri

Species of the genus Escherichia

# set the directory of taxdump files
export TAXONKIT_DB=gtdb-taxdump/R220

$ echo Escherichia | taxonkit name2taxid 
Escherichia     1028471294

$ taxonkit list --ids 1028471294 -I "" \
    | taxonkit filter  -E species \
    | taxonkit lineage -Lnr \
    | tee Escherichia.tsv
300575795       Escherichia sp005843885 species
599451526       Escherichia coli        species
1004016418      Escherichia sp004211955 species
1083756244      Escherichia ruysiae     species
1155214706      Escherichia fergusonii  species
1627494196      Escherichia sp002965065 species
1705205476      Escherichia whittamii   species
1831350832      Escherichia coli_F      species
1854306313      Escherichia marmotae    species
1904681918      Escherichia coli_E      species
2087647928      Escherichia albertii    species

$ csvtk join -Ht Escherichia.tsv \
    <(cut -f 1 Escherichia.tsv \
        | rush 'echo -ne "{}\t$(taxonkit list --ids {} -I "" \
        | taxonkit filter -L species | wc -l)\n"') \
    | csvtk add-header -t -n "taxid,name,rank,#assembly" \
    | csvtk sort -t -k "#assembly:nr" -k name \
    | csvtk csv2md -t

|taxid     |name                   |rank   |#assembly|
|:---------|:----------------------|:------|:--------|
|599451526 |Escherichia coli       |species|38926    |
|2087647928|Escherichia albertii   |species|239      |
|1155214706|Escherichia fergusonii |species|161      |
|1854306313|Escherichia marmotae   |species|141      |
|1831350832|Escherichia coli_F     |species|97       |
|1083756244|Escherichia ruysiae    |species|62       |
|300575795 |Escherichia sp005843885|species|37       |
|1705205476|Escherichia whittamii  |species|4        |
|1904681918|Escherichia coli_E     |species|2        |
|1627494196|Escherichia sp002965065|species|2        |
|1004016418|Escherichia sp004211955|species|2        |

What is Escherichia coli_E? There are only two genomes: GCF_011881725.1 and GCF_023276905.1 (brand new in R214).

$ taxonkit list --ids 1904681918 -nr
1904681918 [species] Escherichia coli_E
  231798968 [no rank] 011881725
  1417695290 [no rank] 023276905

$ grep 011881725 gtdb-taxdump/R220/taxid.map
GCF_011881725.1 231798968

Common manipulations

Besides the four taxdump files, we provide a taxid.map file, which maps genome accessions to TaxIds.

$ wc -l gtdb-taxdump/R220/*
    23767 gtdb-taxdump/R220/delnodes.dmp
     1322 gtdb-taxdump/R220/merged.dmp
   743239 gtdb-taxdump/R220/names.dmp
   743239 gtdb-taxdump/R220/nodes.dmp
      107 gtdb-taxdump/R220/ranks.txt
   596859 gtdb-taxdump/R220/taxid.map

List all the genomes of a species, e.g., Akkermansia muciniphila:

# Retrieve the TaxId
$ echo Akkermansia muciniphila | taxonkit name2taxid --data-dir gtdb-taxdump/R220
Akkermansia muciniphila 791276584

# list subtree
$ taxonkit list --data-dir gtdb-taxdump/R220 -nr --ids  791276584 | head -n 5
791276584 [species] Akkermansia muciniphila
  2229511 [no rank] 948901395
  3636769 [no rank] 948711495
  7496143 [no rank] 949510945
  7567111 [no rank] 949384685

# mapping TaxIds to Genome accessions with taxid.map
$ taxonkit list --data-dir gtdb-taxdump/R220 -I "" --ids  791276584 \
    | csvtk join -Ht -f '1;2' - gtdb-taxdump/R220/taxid.map \
    | head -n 5
2229511 GCA_948901395.1
3636769 GCA_948711495.1
7496143 GCA_949510945.1
7567111 GCA_949384685.1
7776528 GCA_959604705.1

Find the history of a taxon using its scientific name:

$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f name -i -r -p "Escherichia dysenteriae" \
    | csvtk cut -f -lineage,-lineage-taxids \
    | csvtk csv2md
|taxid     |version|change|change-value|name                   |rank   |
|:---------|:------|:-----|:-----------|:----------------------|:------|
|1223627963|R089   |NEW   |            |Escherichia dysenteriae|species|
|1223627963|R207   |MERGE |599451526   |Escherichia dysenteriae|species|


# another example
$ zcat gtdb-taxid-changelog.csv.gz \
    | csvtk grep -f name -i -r -p "Escherichia coli" \
    | csvtk cut -f -lineage,-lineage-taxids \
    | csvtk csv2md

|taxid     |version|change        |change-value                               |name              |rank   |
|:---------|:------|:-------------|:------------------------------------------|:-----------------|:------|
|174151795 |R080   |NEW           |                                           |Escherichia coli_A|species|
|174151795 |R089   |MERGE         |1584917910                                 |Escherichia coli_A|species|
|266865208 |R086   |NEW           |                                           |Escherichia coli_B|species|
|266865208 |R089   |MERGE         |1584917910                                 |Escherichia coli_B|species|
|599451526 |R080   |NEW           |                                           |Escherichia coli  |species|
|599451526 |R207   |ABSORB        |1223627963;1584917910;1670897256;2030830777|Escherichia coli  |species|
|599451526 |R214   |CHANGE_LIN_TAX|                                           |Escherichia coli  |species|
|1584917910|R089   |NEW           |                                           |Escherichia coli_C|species|
|1584917910|R089   |ABSORB        |174151795;266865208                        |Escherichia coli_C|species|
|1584917910|R207   |MERGE         |599451526                                  |Escherichia coli_C|species|
|1670897256|R089   |NEW           |                                           |Escherichia coli_D|species|
|1670897256|R207   |MERGE         |599451526                                  |Escherichia coli_D|species|
|1831350832|R220   |NEW           |                                           |Escherichia coli_F|species|
|1904681918|R202   |NEW           |                                           |Escherichia coli_E|species|
|1904681918|R214   |CHANGE_LIN_TAX|                                           |Escherichia coli_E|species|

Check more TaxonKit commands and usages.

Known issues

Note: the TaxIds below may not be the latest (since v0.14.0, taxonkit saves TaxIds in int32 instead of uint32).

Inaccurate delnodes.dmp and merged.dmp for a few taxa with the same names

In old versions, some taxa had the same names, e.g., 1-14-0-10-36-11.

# r86.2

# taxid of 1-14-0-10-36-11: 810514457
GB_GCA_002762845.1	d__Archaea;p__Nanoarchaeota;c__Woesearchaeia;o__GW2011-AR9;f__GW2011-AR9;g__1-14-0-10-36-11;s__    

# taxid of 1-14-0-10-36-11: 810514458
GB_GCA_002778535.1	d__Bacteria;p__Patescibacteria;c__ABY1;o__Kuenenbacterales;f__UBA2196;g__1-14-0-10-36-11;s__

Later in r89, the Archaea genus 1-14-0-10-36-11 was renamed, while taxid 3509163818 was assigned to Bacteria genus 1-14-0-10-36-11 and taxid 3509163819 was marked in delnodes.dmp.

# genus changed, and assigned a new species
GB_GCA_002762845.1	d__Archaea;p__Nanoarchaeota;c__Nanoarchaeia;o__Woesearchaeales;f__GW2011-AR9;g__PCYB01;s__PCYB01 sp002762845

# assigned a new species
# taxid of 1-14-0-10-36-11: 3509163818
GB_GCA_002778535.1	d__Bacteria;p__Patescibacteria;c__ABY1;o__UBA2196;f__UBA2196;g__1-14-0-10-36-11;s__1-14-0-10-36-11 sp002778535

As a result, the taxid-changelog showed:

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 810514457
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
810514457,R086,NEW,,1-14-0-10-36-11,genus,Archaea;Nanoarchaeota;Woesearchaeia;GW2011-AR9;GW2011-AR9;1-14-0-10-36-11,1337977286;479299029;1556208458;912946924;930607342;810514457
810514457,R089,CHANGE_LIN_TAX,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;UBA2196;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;1881906388;2078787713;810514457

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 810514458
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
810514458,R086,NEW,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;Kuenenbacterales;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;2147262481;2078787713;810514458
810514458,R089,DELETE,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;Kuenenbacterales;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;2147262481;2078787713;810514458
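To see which names are affected in a given release, one can look for genus names that occur under more than one parent lineage. This is a minimal sketch with awk, assuming the two-column GTDB taxonomy format (accession, lineage) shown above; the function name `ambiguous_genera` is hypothetical. The demo feeds in the two r86.2 rows quoted earlier.

```shell
# Report genus names that occur under more than one parent lineage.
# Reads a GTDB taxonomy file (accession<TAB>lineage) from stdin.
# Output: genus<TAB>number of distinct parent lineages
ambiguous_genera() {
    awk -F'\t' '
        {
            n = split($2, node, ";")
            genus = ""; parent = ""
            for (i = 1; i <= n; i++) {
                if (node[i] ~ /^g__./) { genus = substr(node[i], 4); break }
                parent = parent node[i] ";"       # lineage above the genus
            }
            if (genus != "" && !seen[genus SUBSEP parent]++) parents[genus]++
        }
        END { for (g in parents) if (parents[g] > 1) print g "\t" parents[g] }
    ' | sort
}

printf '%s\t%s\n' \
    'GB_GCA_002762845.1' 'd__Archaea;p__Nanoarchaeota;c__Woesearchaeia;o__GW2011-AR9;f__GW2011-AR9;g__1-14-0-10-36-11;s__' \
    'GB_GCA_002778535.1' 'd__Bacteria;p__Patescibacteria;c__ABY1;o__Kuenenbacterales;f__UBA2196;g__1-14-0-10-36-11;s__' \
    | ambiguous_genera
# prints: 1-14-0-10-36-11	2
```

On real data, the taxonomy files of one release (decompressed if needed) would be piped into the function instead of the two demo rows.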

Unstable delnodes.dmp and merged.dmp for a few taxa whose genomes are merged into different taxa

An example: in R95, some genomes of Sphingobium japonicum_A (e.g., GCF_000445085.1) were merged into Sphingobium chinhatense, while others (e.g., GCF_000091125.1) were merged into Sphingobium indicum. Check details
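Cases like this can be located by comparing the species assignment of each genome in two releases. The sketch below is an illustration under stated assumptions, not the method taxonkit uses: the function name `split_species` is hypothetical, inputs are two uncompressed GTDB taxonomy files (accession, lineage), and genomes are matched by accession with the `RS_`/`GB_` prefix, the `GCA_`/`GCF_` prefix and the version removed.

```shell
# List species of the old release whose genomes ended up in more than one
# species in the new release.
# $1: old taxonomy TSV, $2: new taxonomy TSV (accession<TAB>lineage)
# Output: old species<TAB>new species separated by "; "
split_species() {
    awk -F'\t' '
        function key(acc) {                      # accession without prefix/version
            sub(/^(RS|GB)_/, "", acc)
            if (acc ~ /^GC[AF]_/) { sub(/^GC[AF]_/, "", acc); sub(/\.[0-9]+$/, "", acc) }
            return acc
        }
        function species(lineage,   n, node) {   # last node without the "s__" prefix
            n = split(lineage, node, ";")
            return substr(node[n], 4)
        }
        NR == FNR { old[key($1)] = species($2); next }
        {
            k = key($1)
            if (!(k in old) || old[k] == "") next
            from = old[k]; to = species($2)
            if (seen[from SUBSEP to]++) next     # count each old/new pair once
            ntargets[from]++
            targets[from] = targets[from] (ntargets[from] > 1 ? "; " : "") to
        }
        END { for (s in ntargets) if (ntargets[s] > 1) print s "\t" targets[s] }
    ' "$1" "$2" | sort
}

# Usage (decompress the .gz files first):
#   split_species bac120_taxonomy_r89.tsv bac120_taxonomy_r95.tsv
```

A species also shows up here when only part of its genomes moved elsewhere, which is exactly the situation that makes merged.dmp ambiguous.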

Merging GTDB and NCBI taxonomy

If you need the taxdump files and the taxid.map file mapping genome assembly accessions to TaxIds, please follow Merging the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).
If you just need the taxdump files, please follow Merging GTDB and NCBI taxonomy.

Citation

Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006

Contributing

We welcome pull requests, bug fixes and issue reports.

License

MIT License

Similar tools

gtdb_to_taxdump, Convert GTDB taxonomy to NCBI taxdump format
