Deleted nodes are kept in merged.dmp #2
Comments
I see. Please try the fixed version:
Previous data:
Fixed data:
The history of
|
Similar TaxIds in NCBI taxonomy:
|
Thanks for the quick response! I'm now getting an error because of a circular definition in
|
I'd like to say, The history of GCF_001405015.1 showed
Also see the taxid-changelog:
|
Hope there's no bug. The change histories were not changed for |
Thanks again! The problem now is a chain of taxids being merged:
I guess in NCBI they would just remove |
|
Worked! :) |
Finally! I didn't think it would be so complicated. Thank you for reporting, Antônio! Please let me know if you find more issues. I might release the new version next week. |
Sure! Thanks for being so quick to fix this! |
Hi, Antônio. Is everything all right? I'm going to tag new releases for taxonkit and gtdb-taxdump. |
Hmm, there are still some issues ... will check it tomorrow. |
I didn't have any problem with the last round of fixes. But |
Added a check for another situation: the old taxid is merged into a new taxid. It's really complicated :(. But I think all possible changes should have been taken into consideration now. |
This gets really convoluted at this point. I think taxonkit's logic is as close as we can get to a "documentation" |
I'm still fixing it. |
I got it. For the case of "a chain of taxids being merged":
Previously:
Fixed:
|
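For readers following the "chain of taxids being merged" case, here is a minimal Python sketch of one way to flatten such chains and to catch the circular definitions mentioned earlier in the thread. It is not taxonkit's implementation; the function name `resolve_merge_chains` and the small taxids in the usage note are hypothetical, and the input is assumed to be a plain old-taxid to new-taxid mapping like the two columns of merged.dmp.

```python
def resolve_merge_chains(merged):
    """Flatten chained merges so every old taxid points at its final target.

    `merged` maps old taxid -> new taxid (assumed to mirror merged.dmp).
    Raises ValueError if following the merges ever loops back on itself.
    """
    resolved = {}
    for start in merged:
        seen = {start}
        current = merged[start]
        # Keep following the chain while the target was itself merged away.
        while current in merged:
            if current in seen:
                raise ValueError(f"circular merge involving taxid {current}")
            seen.add(current)
            current = merged[current]
        resolved[start] = current
    return resolved
```

With hypothetical taxids, `resolve_merge_chains({1: 2, 2: 3})` maps both 1 and 2 directly to 3, so no merge target is itself a merged taxid.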
I'm exhausted; I'll leave it for another day.
I don't get it. |
I just meant that (as far as I know) NCBI doesn't provide documentation for the taxdump format. So the logic you came up with to generate files that are compliant with NCBI's is the closest thing an external user can get to documentation of NCBI's format. |
Here's how I defined a merging event.
But yesterday I found a problem with R095. In one run, 123112611 was merged into 3095577279.
In another run, 123112611 was merged into 251684323.
Let's check the genomes:
It turns out that in R95 some Sphingobium japonicum_A genomes (GCF_000445085.1) were merged into Sphingobium chinhatense, while others (GCF_000091125.1) were merged into Sphingobium indicum. BTW, Sphingobium chinhatense appears only in R95, not in other releases. I don't know how to handle this case appropriately. Any ideas? @apcamargo |
As far as I remember (and according to the release notes), there was no change in the way species were delimited between r89 and r95. It's just that the centroid can change between releases, which makes this sort of thing happen. Is that right, @dparks1134? I can try to come up with a smarter solution tomorrow. Right now, the simplest thing that comes to mind is to not make a taxon depend on a single genome. In NCBI, even if the taxonomy of a single genome changed from A to B, taxon A would not necessarily be merged into B. But (as far as I know) they do all the merging/deletion manually, which defeats the purpose of taxonkit. So, one possible solution is to make a taxon depend on a "consensus". Let's say you have taxon A with 5 members and a single one changes to taxon B. Regardless of the order in which the genomes are processed, taxon A would not merge into B because the majority is still there. If 4 genomes were assigned to B and 1 to C, taxon A would be merged into B (not perfect, but the most parsimonious merge). If 4 genomes moved to B and 1 remained in A, A would not merge with B. Does that make sense? There might be some edge cases; I'll try to think of something more robust tomorrow. |
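The "consensus" idea above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this comment, not what taxonkit does; the function name `consensus_merge_target` is hypothetical, and the tie-breaking rule (pick the smallest name among equally common destinations) is an added assumption so that the result does not depend on processing order.

```python
from collections import Counter


def consensus_merge_target(old_taxon, new_assignments):
    """Return the taxon that `old_taxon` should merge into, or None.

    `new_assignments` lists, for each former member genome of `old_taxon`,
    the taxon it belongs to in the new release.
    """
    if not new_assignments or old_taxon in new_assignments:
        # No members to judge by, or at least one genome stayed: no merge.
        return None
    counts = Counter(new_assignments)
    best = max(counts.values())
    # Assumption: break ties by taking the smallest name, for determinism.
    return min(taxon for taxon, n in counts.items() if n == best)
```

For the three scenarios in the comment: one of five genomes moving to B gives `None`, four moving to B and one to C gives `"B"`, and four moving to B with one remaining in A gives `None`.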
Hi. Yes, the representative (centroid) genome used to define a GTDB species cluster can change between releases. This is a rare event and is only done when a sufficiently better genome assembly becomes available, but it does happen. You can find the exact criteria used to nominate a new representative in the "Updating GTDB species cluster" section of our NAR manuscript: https://academic.oup.com/nar/article/50/D1/D785/6370255 |
I'm trying to parse the GTDB r207 taxdump with taxopy and I got the following error:
I've never had this problem with NCBI before, so my guess is that they remove deleted nodes from merged.dmp.
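One way to check this guess is to look for entries in merged.dmp whose new taxid is missing from nodes.dmp. The Python sketch below assumes the usual NCBI taxdump layout (fields separated by tab-pipe-tab, each line ending in tab-pipe, with the taxid in the first column of nodes.dmp and old/new taxids in the first two columns of merged.dmp). The function names are hypothetical and are not part of taxopy or taxonkit.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return line.split("\t|\t")


def find_dangling_merges(nodes_lines, merged_lines):
    """Return (old, new) pairs from merged.dmp whose new taxid has no node.

    Both arguments are iterables of lines, e.g. open file handles.
    """
    node_ids = {
        int(parse_dmp_line(line)[0]) for line in nodes_lines if line.strip()
    }
    dangling = []
    for line in merged_lines:
        if not line.strip():
            continue
        old, new = (int(field) for field in parse_dmp_line(line)[:2])
        if new not in node_ids:
            dangling.append((old, new))
    return dangling
```

Passing the opened nodes.dmp and merged.dmp files as the two arguments returns an empty list when every merge target still exists as a node.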