Missing ranks for GTDB genomes with duplicate names at different ranks #92

Closed
donovan-parks opened this issue Mar 4, 2024 · 8 comments

@donovan-parks

I'm using v0.15.1. It appears the generated nodes.dmp is missing rank information for genomes with repeated GTDB taxon names at different ranks, e.g.:

GCA_009780445.1 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445

For this genome, the nodes.dmp file contains a species entry and a family entry, but no genus entry. This doesn't seem correct, since this genome (and species) does have a named genus in the GTDB taxonomy.

Cheers,
Donovan

@shenwei356
Owner

Yes. For cases where the child and parent taxa share the same name, I assumed, based on some observations, that there is no classified genus (which is common in the NCBI taxonomy, especially in Viruses). E.g.,

$ taxonkit list --ids 1698208185 --data-dir . -nr
1698208185 [family] WRAA01
  788434200 [species] WRAA01 sp009780445
    1595698180 [no rank] 009780445
  1373682363 [species] WRAA01 sp009780015
    561963250 [no rank] 009780015

If there's a genus, there should be only one, WRAA01. If there is more than one genus, the genus name would not be the same as the parent (family).

Ah, I should have asked you before writing this command. I'm open to discussion now.

For the current implementation, we can also output the genus name, according to the parent taxon.

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/ -F -p "" -s ""
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445
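
The difference between the two outputs above is only the empty genus field being filled from the parent taxon. A minimal Python sketch of that fill step, inferred from the two outputs rather than taken from TaxonKit's code (`fill_missing_ranks` and its parameters are hypothetical names for illustration):

```python
def fill_missing_ranks(names, prefix="", suffix=""):
    """Replace empty entries with prefix + nearest higher name + suffix."""
    filled = []
    last_named = ""  # most recent non-empty name seen at a higher rank
    for name in names:
        if name:
            last_named = name
            filled.append(name)
        else:
            filled.append(f"{prefix}{last_named}{suffix}")
    return filled


# Lineage copied from the first command's output (genus field is empty).
fields = (
    "Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445"
).split(";")

print(";".join(fill_missing_ranks(fields)))
# Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445
```

This only patches the formatted string; the genus still has no node of its own in nodes.dmp.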

@donovan-parks
Author

Hi.

Thank you for the quick response.

The GTDB taxonomy is always "complete". That is, every genome is assigned to all 7 ranks. The genome GCA_009780445.1 is assigned to the genus g__WRAA01 in the family f__WRAA01. In the GTDB taxonomy these are distinct labels as designated by the different rank suffix. As such, I would expect the nodes.dmp file to contain a genus and family entry, both with the name WRAA01.
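
To make the point concrete, here is a small Python sketch (not part of TaxonKit; `parse_gtdb_lineage` is a hypothetical helper) that splits the lineage of GCA_009780445.1 into (rank, name) pairs. It shows that all 7 ranks are present and that the family and genus stay distinct once the rank is kept alongside the name:

```python
# Single-letter GTDB rank labels, in lineage order.
GTDB_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb_lineage(lineage):
    """Return a list of (rank, name) tuples, one per label in the lineage."""
    taxa = []
    for label in lineage.split(";"):
        prefix, sep, name = label.strip().partition("__")
        if not sep or prefix not in GTDB_PREFIXES:
            raise ValueError(f"unexpected label: {label!r}")
        taxa.append((GTDB_PREFIXES[prefix], name))
    return taxa


lineage = ("d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;"
           "f__WRAA01;g__WRAA01;s__WRAA01 sp009780445")
taxa = parse_gtdb_lineage(lineage)

# All 7 ranks are present, in order.
assert [rank for rank, _ in taxa] == list(GTDB_PREFIXES.values())

# The bare name WRAA01 occurs twice, but the (rank, name) pairs are unique.
names = [name for _, name in taxa]
print(names.count("WRAA01"))        # 2
print(len(set(taxa)) == len(taxa))  # True
```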

Cheers,
Donovan

@donovan-parks
Author

Note that the missing genus entry in nodes.dmp is potentially a problem. For example, I'm using the output of TaxonKit as input files for MetaCache and I would want it to be understood that GCA_009780445.1 belongs to the genus WRAA01.
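
One way to check for this before handing the files to a downstream tool is to walk the parent links in nodes.dmp and list the ranks on the path to the root. The sketch below assumes the usual NCBI-style `tax_id | parent_tax_id | rank | ...` layout. The three WRAA01 rows use the TaxIds from the `taxonkit list` output earlier in this thread; attaching the family directly to a root with TaxId 1 is a simplification for illustration, not what the real file contains.

```python
import io


def load_nodes(handle):
    """Map tax_id -> (parent_tax_id, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in handle:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def ranks_in_lineage(nodes, taxid):
    """Walk parent links from taxid up to the root, collecting ranks."""
    ranks = []
    while taxid in nodes:
        parent, rank = nodes[taxid]
        ranks.append(rank)
        if parent == taxid:  # the root points at itself
            break
        taxid = parent
    return ranks


# Hypothetical excerpt: higher ranks omitted, family attached to the root.
SAMPLE = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "1698208185\t|\t1\t|\tfamily\t|\n"
    "788434200\t|\t1698208185\t|\tspecies\t|\n"
    "1595698180\t|\t788434200\t|\tno rank\t|\n"
)

nodes = load_nodes(SAMPLE)
ranks = ranks_in_lineage(nodes, 1595698180)
print(ranks)             # ['no rank', 'species', 'family', 'no rank']
print("genus" in ranks)  # False
```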

@shenwei356
Owner

I see. So I have to change the whole logic.

> In the GTDB taxonomy these are distinct labels as designated by the different rank suffix.

Do you mean prefix?
And this gives me an idea. Thank you! To distinguish duplicated names, like the family WRAA01 and the genus WRAA01, I think I can just hash the names together with the rank prefix, e.g., f__WRAA01 and g__WRAA01, to get a unique TaxId.
I'll do it tomorrow; it's late in the UK.
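
A rough Python sketch of the idea. The hash function here (truncated SHA-256) is only a stand-in chosen for illustration; the thread does not say which hash TaxonKit uses, so the numbers produced will not match TaxonKit's TaxIds, and `taxid_from_label` is a hypothetical name.

```python
import hashlib


def taxid_from_label(label):
    """Derive a non-negative 31-bit integer from a label (illustrative only)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# Hashing the bare name: the family and the genus get the same TaxId,
# so only one node can exist for both.
bare_family_id = taxid_from_label("WRAA01")
bare_genus_id = taxid_from_label("WRAA01")
print(bare_family_id == bare_genus_id)  # True

# Hashing the rank-prefixed label: the inputs differ, so the TaxIds can differ.
family_id = taxid_from_label("f__WRAA01")
genus_id = taxid_from_label("g__WRAA01")
print(family_id, genus_id)
```

As with any hash-derived identifier, two different labels could still collide by chance, so a collision check while building the taxdump remains useful.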

Best,
Wei

@donovan-parks
Author

Thanks - much appreciated. And yes, I meant prefix (f__, g__).

@shenwei356
Owner

shenwei356 commented Mar 5, 2024

Now, duplicated names with different ranks are allowed.
TaxIds are generated from the hash value of rank+taxon_name (in lower case).

$ grep GCA_009780445.1 gtdb-taxdump/R214/taxid.map \
    | taxonkit reformat -I 2 --data-dir gtdb-taxdump/R214/
GCA_009780445.1 1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445

$ echo WRAA01 | taxonkit name2taxid --data-dir gtdb-taxdump/R214/ -r
WRAA01  718672132       genus
WRAA01  1562716195      family
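
Since one name can now map to several TaxIds, downstream code that resolves names has to key on the rank as well. A minimal Python sketch of such a lookup, using the two rows printed above (`lookup` and `by_name` are hypothetical names, not TaxonKit API):

```python
from collections import defaultdict

# (name, TaxId, rank) rows as printed by `taxonkit name2taxid -r` above.
ROWS = [
    ("WRAA01", 718672132, "genus"),
    ("WRAA01", 1562716195, "family"),
]

# name -> {rank -> TaxId}
by_name = defaultdict(dict)
for name, taxid, rank in ROWS:
    by_name[name][rank] = taxid


def lookup(name, rank):
    """Return the TaxId for (name, rank), or None if there is no such node."""
    return by_name.get(name, {}).get(rank)


print(lookup("WRAA01", "genus"))   # 718672132
print(lookup("WRAA01", "family"))  # 1562716195
print(lookup("WRAA01", "order"))   # None
```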

@donovan-parks
Author

Thanks Wei - much appreciated.

@shenwei356
Owner

Oh right, here's how to generate the GTDB-like format:

echo 599451526 \
    | taxonkit reformat -I 1 -P --prefix-k d__
599451526       d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
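
For anyone post-processing plain lineages outside TaxonKit, the same string can be assembled by hand. A small Python sketch, assuming the input always carries exactly the 7 GTDB ranks in order (`to_gtdb_format` is a hypothetical helper):

```python
GTDB_RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def to_gtdb_format(names):
    """Join 7 taxon names into a rank-prefixed GTDB-style lineage string."""
    if len(names) != len(GTDB_RANK_PREFIXES):
        raise ValueError(f"expected 7 names, got {len(names)}")
    return ";".join(prefix + name
                    for prefix, name in zip(GTDB_RANK_PREFIXES, names))


names = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Escherichia",
         "Escherichia coli"]
gtdb_lineage = to_gtdb_format(names)
print(gtdb_lineage)
# d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
```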
