Metagenomic tools like Kraken2, Centrifuge and KMCP support NCBI taxonomy in the form of NCBI taxdump files. GTDB, a catalogue of prokaryotic genomes, has its own taxonomy data. Though the genomes, derived from GenBank and RefSeq, can be mapped to NCBI taxonomy TaxIds, there's an urgent need to create GTDB's own taxdump files with stable and trackable TaxIds.
A TaxonKit command, taxonkit create-taxdump, is used to create NCBI-style taxdump files for any taxonomy dataset, including GTDB and ICTV.
Related projects:
- ictv-taxdump: NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV)
- taxid-changelog: NCBI taxonomic identifier (taxid) changelog
- taxonkit: A Practical and Efficient NCBI Taxonomy Toolkit
A GTDB species cluster contains one or more assemblies, each of which can be treated as a strain. So we can assign each assembly a TaxId with the rank of "no rank" below the species rank. This also lets us track the changes of these assemblies via their TaxIds later.
We just hash the rank+taxon_name (in lower case) of each taxon node to uint64 using xxhash and convert it to int32.
- For the NCBI assembly accession.
    - The prefix GCA_ is not used because some GenBank entries (GCA_000176655.2 in R80) were moved to RefSeq (GCF_000176655.2 in R83) and the prefix changed.
    - The version number is trimmed because it may change.

    So, 000176655 is hashed to get the TaxId.
- For the non-NCBI assembly accession. The accession per se is hashed. E.g., UBA12275.
- For the name of a node. The taxon name per se is hashed. E.g., Bacteria.
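The accession handling described above can be sketched in shell. This is only an illustration: the function name normalize_accession is hypothetical, the optional GB_/RS_ prefix is an assumption based on the accessions that appear in GTDB taxonomy files, and the xxhash step that taxonkit performs afterwards is not reproduced here.

```shell
# Hypothetical helper: reduce an assembly accession to the string that is hashed.
# NCBI accessions lose their prefix (GCA_/GCF_, optionally preceded by GB_/RS_)
# and their version suffix; anything else is passed through unchanged.
normalize_accession() {
    printf '%s\n' "$1" \
        | sed -E 's/^(GB_|RS_)?GC[AF]_([0-9]+)(\.[0-9]+)?$/\2/'
}

normalize_accession GCA_000176655.2   # GenBank accession
normalize_accession GCF_000176655.2   # RefSeq accession, same numeric part
normalize_accession UBA12275          # non-NCBI accession, kept as is
```

Both NCBI accessions reduce to the same string, which is why a genome keeps its TaxId when it moves from GenBank to RefSeq or gets a new version.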
GTDB taxonomy files are downloaded from https://data.gtdb.ecogenomic.org/releases/ and organized as:
tree taxonomy/
taxonomy/
├── R080
│ └── bac_taxonomy_r80.tsv
├── R083
│ └── bac_taxonomy_r83.tsv
├── R086
│ ├── ar122_taxonomy_r86.2.tsv
│ └── bac120_taxonomy_r86.2.tsv
├── R089
│ ├── ar122_taxonomy_r89.tsv
│ └── bac120_taxonomy_r89.tsv
├── R095
│ ├── ar122_taxonomy_r95.tsv.gz
│ └── bac120_taxonomy_r95.tsv.gz
├── R202
│ ├── ar122_taxonomy_r202.tsv.gz
│ └── bac120_taxonomy_r202.tsv.gz
├── R207
│ ├── ar53_taxonomy_r207.tsv.gz
│ └── bac120_taxonomy_r207.tsv.gz
├── R214
│ ├── ar53_taxonomy_r214.tsv.gz
│ └── bac120_taxonomy_r214.tsv.gz
└── R220
  ├── ar53_taxonomy_r220.tsv.gz
  └── bac120_taxonomy_r220.tsv.gz
TaxonKit v0.12.0 or a later version is needed. v0.16.0 or a later version is preferred.
- Since v0.14.0, taxonkit create-taxdump stores TaxIds in int32 following BLAST and DIAMOND, rather than uint32 in previous versions.
- Since v0.16.0, duplicated names with different ranks are allowed.
- Generating taxdump files for the first version r80:

    taxonkit create-taxdump taxonomy/R080/*.tsv* --gtdb --out-dir gtdb-taxdump/R080 --force

    22:23:09.195 [INFO] 94759 records saved to gtdb-taxdump/R080/taxid.map
    22:23:09.249 [INFO] 111705 records saved to gtdb-taxdump/R080/nodes.dmp
    22:23:09.293 [INFO] 111705 records saved to gtdb-taxdump/R080/names.dmp
    22:23:09.293 [INFO] 0 records saved to gtdb-taxdump/R080/merged.dmp
    22:23:09.293 [INFO] 0 records saved to gtdb-taxdump/R080/delnodes.dmp
- For later versions, we need the taxdump files of the previous version to track merged and deleted nodes.

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R080/ \
        taxonomy/R083/*.tsv* --out-dir gtdb-taxdump/R083 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R083/ \
        taxonomy/R086/*.tsv* --out-dir gtdb-taxdump/R086 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R086/ \
        taxonomy/R089/*.tsv* --out-dir gtdb-taxdump/R089 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R089/ \
        taxonomy/R095/*.tsv* --out-dir gtdb-taxdump/R095 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R095/ \
        taxonomy/R202/*.tsv* --out-dir gtdb-taxdump/R202 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R202/ \
        taxonomy/R207/*.tsv* --out-dir gtdb-taxdump/R207 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R207/ \
        taxonomy/R214/*.tsv* --out-dir gtdb-taxdump/R214 --force

    taxonkit create-taxdump --gtdb -x gtdb-taxdump/R214/ \
        taxonomy/R220/*.tsv* --out-dir gtdb-taxdump/R220 --force
- Generating the TaxId changelog (note that it's not perfect for GTDB taxonomy).

    We only check and eliminate TaxId collisions within a single version of taxonomy data. Therefore, if you create a changelog with "taxonkit taxid-changelog", different taxa in multiple versions might have the same TaxIds and some change events might be wrong.

    A single version of taxonomic data created by "taxonkit create-taxdump" has no problem; it's just that the changelog might not be perfect.
taxonkit taxid-changelog -i gtdb-taxdump -o gtdb-taxid-changelog.csv.gz --verbose
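Since every release after the first one only differs in the release name and in the previous taxdump directory passed via -x, the create-taxdump commands can also be generated by a loop. The following is a minimal sketch, assuming the directory layout shown earlier; the function name gen_commands is hypothetical, and the commands are printed rather than executed so that they can be reviewed first.

```shell
# Releases in chronological order; each release needs the taxdump of the previous one.
releases="R080 R083 R086 R089 R095 R202 R207 R214 R220"

# Hypothetical helper: print one create-taxdump command per release.
gen_commands() {
    prev=""
    for r in $releases; do
        if [ -z "$prev" ]; then
            # first release: nothing to compare against
            echo "taxonkit create-taxdump taxonomy/$r/*.tsv* --gtdb --out-dir gtdb-taxdump/$r --force"
        else
            # later releases: -x points to the previous taxdump directory
            echo "taxonkit create-taxdump --gtdb -x gtdb-taxdump/$prev/ taxonomy/$r/*.tsv* --out-dir gtdb-taxdump/$r --force"
        fi
        prev=$r
    done
}

gen_commands
```

Piping the output to sh (gen_commands | sh) would run the commands in order; adding a new release then only requires appending its name to the list.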
The release page contains taxdump files for all GTDB versions, and a TaxId changelog file (gtdb-taxid-changelog.csv.gz).
Learn more about the taxid-changelog.
Set the environment variable for simplicity
export TAXONKIT_DB=gtdb-taxdump/R220/
Query the TaxId via an assembly accession
grep GCA_905234495.1 gtdb-taxdump/R220/taxid.map
GCA_905234495.1 254122285
Query the TaxId via taxon name
echo Escherichia coli \
| taxonkit name2taxid
Escherichia coli 599451526
Complete lineage
# with lineage
echo 599451526 \
| taxonkit lineage -nr
599451526 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli Escherichia coli species
# with reformat
echo 599451526 \
| taxonkit reformat -I 1
599451526 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
Complete lineage (GTDB style)
echo 599451526 \
| taxonkit reformat -I 1 -P --prefix-k d__
599451526 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
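Going the other way, the rank prefixes of a GTDB-style lineage can be stripped with sed, e.g., to compare lineages from the original GTDB files with the plain lineages printed by taxonkit. This is a small sketch; the function name strip_rank_prefixes is hypothetical, and it assumes the seven standard prefixes d__, p__, c__, o__, f__, g__ and s__.

```shell
# Hypothetical helper: remove rank prefixes at the start of the lineage
# and after every ";" separator. Reads lineages from stdin.
strip_rank_prefixes() {
    sed -E 's/(^|;)[dpcofgs]__/\1/g'
}

echo 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli' \
    | strip_rank_prefixes
```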
All lineages
taxonkit list --ids 1 -I "" \
| taxonkit filter -E species \
| taxonkit reformat -I 1 -P --prefix-k d__ \
> gtdb_species.txt
Checking consistency
$ zcat taxonomy/R220/* | cut -f 2 | sort | uniq | md5sum
f9e0f5268ab65026894703db3eab7b4b -
$ cut -f 2 gtdb_species.txt | sort | md5sum
f9e0f5268ab65026894703db3eab7b4b -
Notes:
- The Y axis is the number of TaxId, not that of species.
- The data is generated by "taxonkit taxid-changelog", which was originally designed for NCBI taxonomy, where the TaxIds are stable.
For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump, some change events might be wrong, because
    - There would be dramatic changes between the two versions.
    - Different taxa in multiple versions might have the same TaxIds, because we only check and eliminate TaxId collisions within a single version.
How many species are there in R220?
$ taxonkit list --data-dir gtdb-taxdump/R220/ --ids 1 -I "" \
| taxonkit filter --data-dir gtdb-taxdump/R220/ -E species \
| wc -l
113104
How many species are added in R220?
$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p NEW \
| csvtk grep -f rank -p species \
| csvtk nrow
31987
How many species are deleted in R220?
$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p DELETE \
| csvtk grep -f rank -p species \
| csvtk nrow
3127
How many species are merged into others in R220?
$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p MERGE \
| csvtk grep -f rank -p species \
| csvtk nrow
1182
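The three csvtk pipelines above each scan the changelog once. As an alternative, all change types of one rank in one release can be tallied in a single pass with awk. This is a sketch under assumptions: the helper names are hypothetical, the column order is taken from the changelog header (taxid, version, change, change-value, name, rank, ...), and no field is assumed to contain a comma. The sample rows are abridged from the changelog (lineage columns left empty).

```shell
# Hypothetical helper: count change events per change type.
# Usage: count_changes VERSION RANK < changelog.csv
count_changes() {
    awk -F, -v ver="$1" -v rank="$2" '
        NR > 1 && $2 == ver && $6 == rank { n[$3]++ }   # skip header, tally by change type
        END { for (c in n) print c "\t" n[c] }
    ' | sort
}

# Abridged sample rows for a quick check.
sample_changelog() {
    cat <<'EOF'
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
810514457,R086,NEW,,1-14-0-10-36-11,genus,,
810514457,R089,CHANGE_LIN_TAX,,1-14-0-10-36-11,genus,,
810514458,R086,NEW,,1-14-0-10-36-11,genus,,
810514458,R089,DELETE,,1-14-0-10-36-11,genus,,
EOF
}

sample_changelog | count_changes R089 genus
```

With the real file, the call would be: pigz -cd gtdb-taxid-changelog.csv.gz | count_changes R220 species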
Complete lineages (R220)
$ cat gtdb-taxdump/R220/taxid.map \
| csvtk freq -Ht -f 2 -nr \
| taxonkit lineage -r -n -L --data-dir gtdb-taxdump/R220/ \
| taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir gtdb-taxdump/R220/ \
| csvtk add-header -t -n 'taxid,count,name,rank,superkingdom,phylum,class,order,family,genus,species' \
> taxid.map.stats.tsv
Frequency of species
$ csvtk freq -t -nr -f species taxid.map.stats.tsv \
> taxid.map.stats.freq-species.tsv
$ head -n 21 taxid.map.stats.freq-species.tsv \
| csvtk pretty -t
species frequency
-------------------------- ---------
Escherichia coli 38926
Klebsiella pneumoniae 18499
Staphylococcus aureus 16021
Salmonella enterica 15089
Streptococcus pneumoniae 9133
Acinetobacter baumannii 8536
Pseudomonas aeruginosa 8390
Mycobacterium tuberculosis 7337
Enterococcus_B faecium 3202
Enterococcus faecalis 3044
Clostridioides difficile 2991
Campylobacter_D jejuni 2873
Listeria monocytogenes 2517
Neisseria meningitidis 2336
Vibrio parahaemolyticus 2264
Streptococcus pyogenes 2258
Mycobacterium abscessus 2029
Listeria monocytogenes_B 2025
Burkholderia mallei 1934
Streptococcus agalactiae 1893
csvtk is used to help handle the results.
Get the TaxId:
$ echo Escherichia coli \
| taxonkit name2taxid --data-dir gtdb-taxdump/R220/
Escherichia coli 599451526
Any changes in the past? Hmm, of course, it first appeared in R80.
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 599451526 \
| csvtk cut -f -lineage-taxids \
| csvtk csv2md
taxid | version | change | change-value | name | rank | lineage |
---|---|---|---|---|---|---|
599451526 | R080 | NEW | | Escherichia coli | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli |
599451526 | R207 | ABSORB | 1223627963;1584917910;1670897256;2030830777 | Escherichia coli | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli |
599451526 | R214 | CHANGE_LIN_TAX | | Escherichia coli | species | Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli |
In R214, the phylum Proteobacteria changed to Pseudomonadota, as also mentioned in the release announcement.
Escherichia coli also absorbed four taxa in R207. Let's see what happened to them:
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 1223627963,1584917910,1670897256,2030830777 \
| csvtk cut -f -lineage-taxids \
| csvtk csv2md
taxid | version | change | change-value | name | rank | lineage |
---|---|---|---|---|---|---|
1223627963 | R089 | NEW | | Escherichia dysenteriae | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia dysenteriae |
1223627963 | R207 | MERGE | 599451526 | Escherichia dysenteriae | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia dysenteriae |
1584917910 | R089 | NEW | | Escherichia coli_C | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C |
1584917910 | R089 | ABSORB | 174151795;266865208 | Escherichia coli_C | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C |
1584917910 | R207 | MERGE | 599451526 | Escherichia coli_C | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_C |
1670897256 | R089 | NEW | | Escherichia coli_D | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_D |
1670897256 | R207 | MERGE | 599451526 | Escherichia coli_D | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli_D |
2030830777 | R089 | NEW | | Escherichia flexneri | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri |
2030830777 | R207 | MERGE | 599451526 | Escherichia flexneri | species | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri |
Yes, Escherichia flexneri was merged into Escherichia coli, as reported in the release notes of R207.
We can also check the history of an Escherichia flexneri assembly. Listing assemblies:
$ taxonkit list --data-dir gtdb-taxdump/R202/ --ids 2030830777 -n -r -I "" \
| head -n 5
2030830777 [species] Escherichia flexneri
188562 [no rank] 009882745
246688 [no rank] 003982535
530007 [no rank] 003981095
930852 [no rank] 005393725
E.g., the taxon node 013185635 (taxid 169219442). Let's check the history via the TaxId:
$ echo 013185635 | taxonkit name2taxid --data-dir gtdb-taxdump/R202/
013185635 169219442
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 169219442 \
| csvtk cut -f -lineage-taxids \
| csvtk csv2md
taxid | version | change | change-value | name | rank | lineage |
---|---|---|---|---|---|---|
169219442 | R202 | NEW | | 013185635 | no rank | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri;013185635 |
169219442 | R207 | CHANGE_LIN_TAX | | 013185635 | no rank | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;013185635 |
169219442 | R214 | CHANGE_LIN_TAX | | 013185635 | no rank | Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;013185635 |
Note that we removed the prefix (GCA_ or GCF_) and the version number (see method).
So the original assembly accession should be GCA_013185635.X or GCF_013185635.X, which can be found in the taxid.map file:
$ cat gtdb-taxdump/R214/taxid.map \
| csvtk grep -Ht -f 2 -p 169219442
GCF_013185635.1 169219442
The GCA_013185635.1 page also shows the taxonomic information of the current version (R220) and the taxon history:
Release | Domain | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|
R220 | d__Bacteria | p__Pseudomonadota | c__Gammaproteobacteria | o__Enterobacterales | f__Enterobacteriaceae | g__Escherichia | s__Escherichia coli |
R214 | d__Bacteria | p__Pseudomonadota | c__Gammaproteobacteria | o__Enterobacterales | f__Enterobacteriaceae | g__Escherichia | s__Escherichia coli |
R207 | d__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Enterobacterales | f__Enterobacteriaceae | g__Escherichia | s__Escherichia coli |
R202 | d__Bacteria | p__Proteobacteria | c__Gammaproteobacteria | o__Enterobacterales | f__Enterobacteriaceae | g__Escherichia | s__Escherichia flexneri |
# set the directory of taxdump files
export TAXONKIT_DB=gtdb-taxdump/R220
$ echo Escherichia | taxonkit name2taxid
Escherichia 1028471294
$ taxonkit list --ids 1028471294 -I "" \
| taxonkit filter -E species \
| taxonkit lineage -Lnr \
| tee Escherichia.tsv
300575795 Escherichia sp005843885 species
599451526 Escherichia coli species
1004016418 Escherichia sp004211955 species
1083756244 Escherichia ruysiae species
1155214706 Escherichia fergusonii species
1627494196 Escherichia sp002965065 species
1705205476 Escherichia whittamii species
1831350832 Escherichia coli_F species
1854306313 Escherichia marmotae species
1904681918 Escherichia coli_E species
2087647928 Escherichia albertii species
$ csvtk join -Ht Escherichia.tsv \
<(cut -f 1 Escherichia.tsv \
| rush 'echo -ne "{}\t$(taxonkit list --ids {} -I "" \
| taxonkit filter -L species | wc -l)\n"') \
| csvtk add-header -t -n "taxid,name,rank,#assembly" \
| csvtk sort -t -k "#assembly:nr" -k name \
| csvtk csv2md -t
taxid | name | rank | #assembly |
---|---|---|---|
599451526 | Escherichia coli | species | 38926 |
2087647928 | Escherichia albertii | species | 239 |
1155214706 | Escherichia fergusonii | species | 161 |
1854306313 | Escherichia marmotae | species | 141 |
1831350832 | Escherichia coli_F | species | 97 |
1083756244 | Escherichia ruysiae | species | 62 |
300575795 | Escherichia sp005843885 | species | 37 |
1705205476 | Escherichia whittamii | species | 4 |
1904681918 | Escherichia coli_E | species | 2 |
1627494196 | Escherichia sp002965065 | species | 2 |
1004016418 | Escherichia sp004211955 | species | 2 |
What's Escherichia coli_E? There are only two genomes: GCF_011881725.1 and GCF_023276905.1 (brand new in R214).
$ taxonkit list --ids 1904681918 -nr
1904681918 [species] Escherichia coli_E
231798968 [no rank] 011881725
1417695290 [no rank] 023276905
$ grep 011881725 gtdb-taxdump/R220/taxid.map
GCF_011881725.1 231798968
Besides the four taxdump files, we provide a taxid.map file, which maps genome accessions to TaxIds.
$ wc -l gtdb-taxdump/R220/*
23767 gtdb-taxdump/R220/delnodes.dmp
1322 gtdb-taxdump/R220/merged.dmp
743239 gtdb-taxdump/R220/names.dmp
743239 gtdb-taxdump/R220/nodes.dmp
107 gtdb-taxdump/R220/ranks.txt
596859 gtdb-taxdump/R220/taxid.map
List all the genomes of a species, e.g., Akkermansia muciniphila:
# Retrieve the TaxId
$ echo Akkermansia muciniphila | taxonkit name2taxid --data-dir gtdb-taxdump/R220
Akkermansia muciniphila 791276584
# list subtree
$ taxonkit list --data-dir gtdb-taxdump/R220 -nr --ids 791276584 | head -n 5
791276584 [species] Akkermansia muciniphila
2229511 [no rank] 948901395
3636769 [no rank] 948711495
7496143 [no rank] 949510945
7567111 [no rank] 949384685
# mapping TaxIds to Genome accessions with taxid.map
$ taxonkit list --data-dir gtdb-taxdump/R220 -I "" --ids 791276584 \
| csvtk join -Ht -f '1;2' - gtdb-taxdump/R220/taxid.map \
| head -n 5
2229511 GCA_948901395.1
3636769 GCA_948711495.1
7496143 GCA_949510945.1
7567111 GCA_949384685.1
7776528 GCA_959604705.1
Find the history of a taxon using scientific name:
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f name -i -r -p "Escherichia dysenteriae" \
| csvtk cut -f -lineage,-lineage-taxids \
| csvtk csv2md
|taxid |version|change|change-value|name |rank |
|:---------|:------|:-----|:-----------|:----------------------|:------|
|1223627963|R089 |NEW | |Escherichia dysenteriae|species|
|1223627963|R207 |MERGE |599451526 |Escherichia dysenteriae|species|
# another example
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f name -i -r -p "Escherichia coli" \
| csvtk cut -f -lineage,-lineage-taxids \
| csvtk csv2md
taxid | version | change | change-value | name | rank |
---|---|---|---|---|---|
174151795 | R080 | NEW | | Escherichia coli_A | species |
174151795 | R089 | MERGE | 1584917910 | Escherichia coli_A | species |
266865208 | R086 | NEW | | Escherichia coli_B | species |
266865208 | R089 | MERGE | 1584917910 | Escherichia coli_B | species |
599451526 | R080 | NEW | | Escherichia coli | species |
599451526 | R207 | ABSORB | 1223627963;1584917910;1670897256;2030830777 | Escherichia coli | species |
599451526 | R214 | CHANGE_LIN_TAX | | Escherichia coli | species |
1584917910 | R089 | NEW | | Escherichia coli_C | species |
1584917910 | R089 | ABSORB | 174151795;266865208 | Escherichia coli_C | species |
1584917910 | R207 | MERGE | 599451526 | Escherichia coli_C | species |
1670897256 | R089 | NEW | | Escherichia coli_D | species |
1670897256 | R207 | MERGE | 599451526 | Escherichia coli_D | species |
1831350832 | R220 | NEW | | Escherichia coli_F | species |
1904681918 | R202 | NEW | | Escherichia coli_E | species |
1904681918 | R214 | CHANGE_LIN_TAX | | Escherichia coli_E | species |
Check more TaxonKit commands and usages.
Note: the TaxIds below may not be the latest (since v0.14.0, taxonkit saves TaxIds in int32 instead of uint32).
In old versions, some taxa had the same names, e.g., 1-14-0-10-36-11.
# r86.2
# taxid of 1-14-0-10-36-11: 810514457
GB_GCA_002762845.1 d__Archaea;p__Nanoarchaeota;c__Woesearchaeia;o__GW2011-AR9;f__GW2011-AR9;g__1-14-0-10-36-11;s__
# taxid of 1-14-0-10-36-11: 810514458
GB_GCA_002778535.1 d__Bacteria;p__Patescibacteria;c__ABY1;o__Kuenenbacterales;f__UBA2196;g__1-14-0-10-36-11;s__
Later in r89, the Archaea genus 1-14-0-10-36-11 was renamed, while taxid 3509163818 was assigned to the Bacteria genus 1-14-0-10-36-11 and taxid 3509163819 was marked in delnodes.dmp.
# genus changed, and assigned a new species
GB_GCA_002762845.1 d__Archaea;p__Nanoarchaeota;c__Nanoarchaeia;o__Woesearchaeales;f__GW2011-AR9;g__PCYB01;s__PCYB01 sp002762845
# assigned a new species
# taxid of 1-14-0-10-36-11: 3509163818
GB_GCA_002778535.1 d__Bacteria;p__Patescibacteria;c__ABY1;o__UBA2196;f__UBA2196;g__1-14-0-10-36-11;s__1-14-0-10-36-11 sp002778535
As a result, the taxid-changelog showed:
$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 810514457
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
810514457,R086,NEW,,1-14-0-10-36-11,genus,Archaea;Nanoarchaeota;Woesearchaeia;GW2011-AR9;GW2011-AR9;1-14-0-10-36-11,1337977286;479299029;1556208458;912946924;930607342;810514457
810514457,R089,CHANGE_LIN_TAX,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;UBA2196;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;1881906388;2078787713;810514457
$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 810514458
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
810514458,R086,NEW,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;Kuenenbacterales;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;2147262481;2078787713;810514458
810514458,R089,DELETE,,1-14-0-10-36-11,genus,Bacteria;Patescibacteria;ABY1;Kuenenbacterales;UBA2196;1-14-0-10-36-11,81602897;1771153889;802220661;2147262481;2078787713;810514458
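The duplicated name behind this example can be spotted directly in a GTDB taxonomy file before building the taxdump. The following is a minimal sketch with hypothetical helper names; it assumes the two-column layout (accession, tab, seven-rank lineage) shown above and reports genus names that occur in more than one domain. The sample rows are the two r86.2 records quoted earlier.

```shell
# Hypothetical helper: report genus names used in more than one domain.
# Reads a GTDB taxonomy TSV (accession <TAB> lineage) from stdin.
dup_genera() {
    awk -F'\t' '
        {
            split($2, r, ";")                 # r[1] = d__..., r[6] = g__...
            if (r[6] == "g__") next           # skip records without a genus name
            key = r[6] SUBSEP r[1]
            if (!(key in seen)) { seen[key] = 1; n[r[6]]++ }
        }
        END { for (g in n) if (n[g] > 1) print g }
    ' | sort
}

# The two r86.2 records from the example above.
sample_taxonomy() {
    printf '%s\t%s\n' \
        'GB_GCA_002762845.1' 'd__Archaea;p__Nanoarchaeota;c__Woesearchaeia;o__GW2011-AR9;f__GW2011-AR9;g__1-14-0-10-36-11;s__' \
        'GB_GCA_002778535.1' 'd__Bacteria;p__Patescibacteria;c__ABY1;o__Kuenenbacterales;f__UBA2196;g__1-14-0-10-36-11;s__'
}

sample_taxonomy | dup_genera
```

For a real release, the input would be something like: cat taxonomy/R086/*.tsv | dup_genera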
An example: in R95, some Sphingobium japonicum_A genomes (GCF_000445085.1) were merged into Sphingobium chinhatense, while others (GCF_000091125.1) were merged into Sphingobium indicum. Check details.
- If you need the taxdump files and the taxid.map file mapping genome assembly accessions to TaxIds, please follow Merging the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).
- If you just need the taxdump files, please follow Merging GTDB and NCBI taxonomy.
Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
We welcome pull requests, bug fixes and issue reports.
- gtdb_to_taxdump, Convert GTDB taxonomy to NCBI taxdump format