Missing ranks for GTDB genomes with duplicate names at different ranks #92

donovan-parks · 2024-03-04T21:46:13Z

I'm using v0.15.1. It appears the generated nodes.dmp is missing rank information for genomes with repeated GTDB taxon names at different ranks, e.g.:

GCA_009780445.1 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445

For this genome, the nodes.dmp file contains a species entry and a family entry, but no genus entry. This doesn't seem correct, since this genome (and species) does have a named genus in the GTDB taxonomy.

Cheers,
Donovan

shenwei356 · 2024-03-04T22:08:47Z

Yes. For cases where child and parent taxa share the same name, I assumed, based on some observations, that there is no classified genus (which is common in NCBI taxonomy, especially in viruses). E.g.,

$ taxonkit list --ids 1698208185 --data-dir . -nr
1698208185 [family] WRAA01
  788434200 [species] WRAA01 sp009780445
    1595698180 [no rank] 009780445
  1373682363 [species] WRAA01 sp009780015
    561963250 [no rank] 009780015

If there's a genus, there should be only one, WRAA01. If there is more than one genus, the genus name would not be the same as the parent (family).

Ah, I should have asked you before writing this command. I'm open to discussion now.

With the current implementation, the genus name can still be output by filling the missing rank from the parent taxon:

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/ -F -p "" -s ""
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445

donovan-parks · 2024-03-04T22:23:23Z

Hi.

Thank you for the quick response.

The GTDB taxonomy is always "complete". That is, every genome is assigned to all 7 ranks. The genome GCA_009780445.1 is assigned to the genus g__WRAA01 in the family f__WRAA01. In the GTDB taxonomy these are distinct labels as designated by the different rank suffix. As such, I would expect the nodes.dmp file to contain both a genus and a family entry, each with the name WRAA01.

Cheers,
Donovan
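
The completeness property described above (every genome assigned to all seven ranks, each label carrying its rank prefix) can be checked directly on a GTDB lineage string. The following is a minimal sketch, not part of TaxonKit; the function name `check_lineage` is hypothetical, and it assumes the usual `d__;p__;c__;o__;f__;g__;s__` prefix order.

```shell
#!/bin/sh
# Lineage string copied from the original report in this thread.
lineage='d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445'

# check_lineage (hypothetical helper): succeed only if the lineage names
# all seven ranks, in order, each with a non-empty label.
check_lineage() {
    expected='d p c o f g s'
    # Split on ';', keep only the one-letter rank prefix of each field,
    # then join the prefixes with single spaces.
    found=$(printf '%s\n' "$1" \
        | tr ';' '\n' \
        | sed 's/^\(.\)__..*/\1/' \
        | tr '\n' ' ' \
        | sed 's/ $//')
    [ "$found" = "$expected" ]
}

if check_lineage "$lineage"; then
    echo "complete: all 7 ranks present"
else
    echo "incomplete lineage"
fi
```

Because f__WRAA01 and g__WRAA01 occupy different fields with different prefixes, the check treats them as two distinct labels even though the name is the same.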

donovan-parks · 2024-03-04T22:24:21Z

Note that the missing genus entry in nodes.dmp is potentially a problem. For example, I'm using the output of TaxonKit as input files for MetaCache and I would want it to be understood that GCA_009780445.1 belongs to the genus WRAA01.
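
Before handing a taxdump to a downstream tool, one way to spot affected nodes is to list species whose parent is not a genus. This is a rough sketch assuming the NCBI-style nodes.dmp layout (tax_id, parent tax_id, rank, separated by tab-pipe-tab). The toy file reuses the taxids from the `taxonkit list` output earlier in this thread; the family's parent (1) is a placeholder, not a real GTDB TaxId.

```shell
#!/bin/sh
# Build a toy nodes.dmp. The WRAA01 taxids and ranks come from this thread;
# the parent "1" of the family node is a made-up placeholder.
tmp=$(mktemp)
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1698208185 1 family \
    788434200 1698208185 species \
    1373682363 1698208185 species > "$tmp"

# The file is read twice: the first pass records each node's rank,
# the second prints every species whose parent is not a genus,
# together with the parent's actual rank.
orphans=$(awk -F '\t[|]\t?' '
    NR == FNR { rank[$1] = $3; next }
    $3 == "species" && rank[$2] != "genus" { print $1 "\t" rank[$2] }
' "$tmp" "$tmp")
rm -f "$tmp"

printf '%s\n' "$orphans"
```

On a real taxdump, replace the toy file with the path to nodes.dmp; an empty result would mean every species has a genus parent.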

shenwei356 · 2024-03-04T22:50:54Z

I see. So I have to change the whole logic.

> In the GTDB taxonomy these are distinct labels as designated by the different rank suffix.

Do you mean prefix?
This gives me an idea. Thank you! To distinguish duplicated names, like the family WRAA01 and the genus WRAA01, I think I can just hash names together with the rank prefix, e.g., f__WRAA01 and g__WRAA01, to get a unique TaxId.
I'll do it tomorrow; it's late in the UK.

Best,
Wei

donovan-parks · 2024-03-04T22:55:43Z

Thanks - much appreciated. And yes, I meant prefix (f__, g__).

shenwei356 · 2024-03-05T22:40:33Z

Now, duplicated names with different ranks are allowed.
TaxIds are now generated from the hash value of rank+taxon_name (in lower case).

$ grep GCA_009780445.1 gtdb-taxdump/R214/taxid.map \
    | taxonkit reformat -I 2 --data-dir gtdb-taxdump/R214/
GCA_009780445.1 1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445

$ echo WRAA01 | taxonkit name2taxid --data-dir gtdb-taxdump/R214/ -r
WRAA01  718672132       genus
WRAA01  1562716195      family
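
The effect of hashing the rank together with the name can be illustrated with a small sketch. This is not TaxonKit's implementation: `cksum` (POSIX CRC-32) stands in for whatever hash function TaxonKit actually uses, the key format (rank prefix followed by the lower-cased name) is an assumption based on the f__WRAA01 / g__WRAA01 example above, and `toy_taxid` is a hypothetical helper. The numbers it prints will not match the TaxIds shown in this thread.

```shell
#!/bin/sh
# toy_taxid (hypothetical): derive an illustrative numeric ID from a rank
# prefix plus a taxon name. cksum is only a stand-in hash.
toy_taxid() {
    rank_prefix=$1   # e.g. f__ or g__
    name=$2
    # Lower-case the combined key so the ID is case-insensitive.
    key=$(printf '%s%s' "$rank_prefix" "$name" | tr '[:upper:]' '[:lower:]')
    printf '%s' "$key" | cksum | cut -d ' ' -f 1
}

family_id=$(toy_taxid f__ WRAA01)
genus_id=$(toy_taxid g__ WRAA01)

echo "family WRAA01 -> $family_id"
echo "genus  WRAA01 -> $genus_id"
```

Hashing the bare name would give the family and the genus the same key and therefore the same ID, so only one node could exist; with the rank included, the two keys differ and each rank gets its own node.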

donovan-parks · 2024-03-06T16:21:00Z

Thanks Wei - much appreciated.

shenwei356 · 2024-03-06T16:43:20Z

Oh right, here's how to generate the GTDB-like format:

$ echo 599451526 \
    | taxonkit reformat -I 1 -P --prefix-k d__
599451526       d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli

shenwei356 added a commit that referenced this issue Mar 5, 2024

TaxIds are generated from the hash value of "rank+taxon_name". #92

3c276bc

shenwei356 mentioned this issue Mar 6, 2024

Update TaxonKit to v0.16.0 bioconda/bioconda-recipes#46227

Merged

donovan-parks closed this as completed Mar 6, 2024

aababc1 mentioned this issue Apr 29, 2024

question on GTDBr214.1 gtdb taxdump file regarding taxID shenwei356/gtdb-taxdump#8

Open
