duplicate taxid from name2taxid #29

Closed
slaperriere opened this issue Apr 6, 2020 · 5 comments

@slaperriere

Hello,

I am getting duplicate values from name2taxid when running

taxonkit name2taxid -i 2 filename

My input:
ESP_3 Bacteria
ESP_84 Bacteria
ESP_136 Bacteria
ESP_149 Bacteria
ESP_166 Bacteria
ESP_169 Bacteria
ESP_181 Bacteria
ESP_187 Bacteria
ESP_196 Bacteria

Output:
ESP_3 Bacteria 2
ESP_3 Bacteria 629395
ESP_84 Bacteria 2
ESP_84 Bacteria 629395
ESP_136 Bacteria 2
ESP_136 Bacteria 629395
ESP_149 Bacteria 2
ESP_149 Bacteria 629395
ESP_166 Bacteria 2
ESP_166 Bacteria 629395

Some lines as seen above are duplicated with a different taxid. There are no duplicates in the input.

Do you know what could be causing this?

Thank you!

@shenwei356
Owner

Thanks for reporting this.

taxonkit name2taxid searches both scientific names and synonyms, and taxid 629395 has the synonym "Bacteria":

629395  |       Bacteria        |       Bacteria <stick insect> |       synonym |
629395  |       Bacteria Latreille et al. 1825  |               |       scientific name |
629395  |       Bacteria Latreille, Peletier de Saint Fargeau, Serville & Guerin, 1825  |               |       authority       |
629395  |       Bacteria stick insect   |               |       common name     |
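The records above follow the NCBI names.dmp layout (taxid, name, unique name, name class, separated by tab-pipe-tab). As a rough sketch, a lookup like the following can show which taxids claim a given name and under which name class. The function name lookup_name is hypothetical, and the path in the usage comment assumes the taxonomy dump sits in ~/.taxonkit; adjust it to wherever your names.dmp lives.

```shell
# Hypothetical helper: print "taxid <TAB> name class" for every names.dmp
# record whose name field is exactly $1. $2 is the path to names.dmp.
# Assumes the standard "<TAB>|<TAB>" field separator of the NCBI dump.
lookup_name() {
    awk -F'\t[|]\t' -v name="$1" '
        $2 == name {
            sub(/\t[|]$/, "", $4)   # strip the trailing "<TAB>|" from the last field
            print $1 "\t" $4
        }
    ' "$2"
}

# Usage (path is an assumption):
#   lookup_name Bacteria ~/.taxonkit/names.dmp
```

Any name that comes back with more than one taxid will produce more than one output line per input line in name2taxid.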

A new flag, -s/--sci-name, has been added to search scientific names only.

@slaperriere
Author

Great, thank you! It looks like it solved most of the problem.

However, I am still getting some duplicates. Some examples are:

ESP_48538 Paracoccus 265
ESP_48538 Paracoccus 249411
ESP_764 Actinobacteria 1760
ESP_764 Actinobacteria 201174
ESP_17204 Vertebrata 7742
ESP_17204 Vertebrata 1261581
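To see how many names are affected, something like the following sketch could list every name that received more than one distinct taxid. It assumes the name2taxid output is tab-separated with the ID in column 1, the name in column 2, and the taxid in column 3; the function name ambiguous_names is hypothetical.

```shell
# Hypothetical helper: print "name <TAB> taxid1,taxid2,..." for each name
# that maps to more than one distinct taxid.
# Assumes tab-separated input: ID <TAB> name <TAB> taxid.
ambiguous_names() {
    awk -F'\t' '
        # Count each (name, taxid) pair once, however many IDs share it.
        !seen[$2 FS $3]++ {
            count[$2]++
            taxids[$2] = taxids[$2] (taxids[$2] == "" ? "" : ",") $3
        }
        END {
            for (n in count)
                if (count[n] > 1)
                    print n "\t" taxids[n]
        }
    ' | sort
}

# Usage (file name as in the original command):
#   taxonkit name2taxid -s -i 2 filename | ambiguous_names
```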

@shenwei356
Owner

It's not a bug if you have switched on -s. Some taxids really do share the same scientific name; you can check their lineages to tell them apart. For these names, I output one line per matching taxid. You can deduplicate the lines with awk or csvtk, or I can add a new flag.
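A minimal sketch of the awk route, assuming tab-separated output with the ID in column 1, the name in column 2, and the taxid in column 3. It keeps only the first taxid reported for each input line, which is an arbitrary choice; checking the lineage is the safer way to decide which taxid is the intended one. The function name keep_first_taxid is hypothetical.

```shell
# Hypothetical helper: keep only the first line seen for each (ID, name) pair.
# Assumes tab-separated input: ID <TAB> name <TAB> taxid.
keep_first_taxid() {
    awk -F'\t' '!seen[$1 FS $2]++'
}

# Usage (file name as in the original command):
#   taxonkit name2taxid -s -i 2 filename | keep_first_taxid
```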

@shenwei356
Owner

@slaperriere Can I close this issue?

@slaperriere
Author

Yes. Thank you for your help!
