Missing ranks for GTDB genomes with duplicate names at different ranks #92
Comments
Yes. For these cases where child and parent taxa share the same name, I assumed, based on some observations, that there is no classified genus (which is common in NCBI taxonomy, especially in Viruses), e.g.,
If there's a genus, there should be only one. Ah, I should have asked you before writing this command. I'm open to discussion now. With the current implementation, we can also output the genus name, according to the parent taxon.
Hi. Thank you for the quick response. The GTDB taxonomy is always "complete". That is, every genome is assigned to all 7 ranks. The genome GCA_009780445.1 is assigned to the genus g__WRAA01 in the family f__WRAA01. In the GTDB taxonomy these are distinct labels, as designated by the different rank suffix. As such, I would expect the nodes.dmp file to contain both a genus entry and a family entry, each with the name WRAA01. Cheers,
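To illustrate the point about rank prefixes, here is a minimal Python sketch (not TaxonKit's implementation). It splits the lineage from this issue on its rank prefixes and shows that keying taxa by (rank, name) keeps the family and genus WRAA01 apart, whereas keying by name alone merges them. The function name `parse_gtdb_lineage` and the `GTDB_RANKS` table are hypothetical helpers introduced for this example.

```python
# Hypothetical helper table: GTDB rank prefix -> rank name.
GTDB_RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_gtdb_lineage(lineage):
    """Split a GTDB lineage string into (rank, name) pairs."""
    pairs = []
    for label in lineage.split(";"):
        # "g__WRAA01" -> prefix "g", name "WRAA01"
        prefix, _, name = label.strip().partition("__")
        pairs.append((GTDB_RANKS[prefix], name))
    return pairs

lineage = ("d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;"
           "f__WRAA01;g__WRAA01;s__WRAA01 sp009780445")
pairs = parse_gtdb_lineage(lineage)

# Keyed by (rank, name): family and genus WRAA01 stay distinct.
nodes = set(pairs)
# Keyed by name only: the two WRAA01 labels collapse into one.
names_only = {name for _, name in pairs}

print(len(nodes), len(names_only))
```

Keying on the pair rather than the bare name is the design choice that matters here: all seven ranks survive, while the name-only set loses one entry.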
Note that the missing genus entry in nodes.dmp is potentially a problem. For example, I'm using the output of TaxonKit as input files for MetaCache and I would want it to be understood that GCA_009780445.1 belongs to the genus WRAA01.
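One way to check for this kind of gap is to walk from a species node up to the root and list the ranks found along the way. The sketch below assumes the usual NCBI-style nodes.dmp layout (tax_id, parent tax_id and rank as the first three `|`-separated fields). The tax IDs and the toy lines are made up for illustration and are not taken from any real TaxonKit output.

```python
def parse_nodes(lines):
    """Map tax_id -> (parent_id, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[fields[0]] = (fields[1], fields[2])
    return nodes

def ranks_on_path(nodes, tax_id):
    """Collect ranks from tax_id up to the root (root is its own parent)."""
    ranks, seen = [], set()
    while tax_id in nodes and tax_id not in seen:
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        ranks.append(rank)
        tax_id = parent
    return ranks

# Toy data with hypothetical tax IDs: a species whose parent is a
# family, with no genus node in between.
toy = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tfamily\t|",
    "30\t|\t10\t|\tspecies\t|",
]
nodes = parse_nodes(toy)
found = ranks_on_path(nodes, "30")
missing = [r for r in ("family", "genus", "species") if r not in found]
print(missing)
```

For a real file, the same functions could be fed `open("nodes.dmp")` instead of the toy list; an empty `missing` list for every species would indicate that no ranks were dropped.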
I see. So I have to change the whole logic.
Do you mean prefix? Best,
Thanks - much appreciated. And yes, I meant prefix (f__, g__).
Now, duplicated names with different ranks are allowed.
Thanks Wei - much appreciated.
Oh right, here's the way to generate a GTDB-like format:
I'm using v0.15.1. It appears the generated nodes.dmp is missing rank information for genomes with repeated GTDB taxon names at different ranks, e.g.:
GCA_009780445.1 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445
For this genome, the nodes.dmp file contains a species entry and a family entry, but no genus entry. This doesn't seem correct, since this genome (and species) does have a named genus in the GTDB taxonomy.
Cheers,
Donovan