[Feature request] name2taxid optional warnings with duplicates, name2taxid go 'up' taxonomy if no taxonomy ID #103

Closed
2 tasks done
philippbayer opened this issue Sep 25, 2024 · 3 comments

Comments

@philippbayer

Prerequisites

  • make sure you're using the latest version (check with taxonkit version)
  • read the usage

Describe your issue

Optional warnings with duplicates

When a name matches several TaxIds, name2taxid returns all of them, as shown in the help:

$ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L
Drosophila      7215    genus
Drosophila      32281   subgenus
Drosophila      2081351 genus

which is totally fine. However, there are a few cases on NCBI where two different kingdoms share a species name! Here's an example:

$ echo 'Centropogon australis' | taxonkit name2taxid
Centropogon australis   390307
Centropogon australis   1274815

That's a fish and a plant.

With large lists of fish species this happens once in a while, and all my lists are suddenly off by one or two rows. It would be very useful to have an optional flag that reports these cases to STDERR, e.g. "WARNING: Found duplicate IDs for 'Centropogon australis'", and likewise for any other affected species. That would let me filter these cases manually (deciding on the kingdom I'd prefer).
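Until such a flag exists, the same report can be approximated by post-processing the output. The following is only a minimal sketch: `warn_duplicate_names` is a hypothetical helper (not part of taxonkit), and it assumes the two-column, tab-separated name/TaxId output shown above.

```shell
# Hypothetical helper, not part of taxonkit.
# Reads name<TAB>taxid rows on stdin, passes every row through to stdout
# unchanged, and reports names with more than one distinct TaxId on stderr.
warn_duplicate_names() {
    awk -F '\t' '
        { print }                                  # pass rows through unchanged
        $2 != "" && !seen[$1 FS $2]++ { n[$1]++ }  # count distinct TaxIds per name
        END {
            for (name in n)
                if (n[name] > 1)
                    printf "WARNING: found %d TaxIds for \"%s\"\n", n[name], name > "/dev/stderr"
        }
    '
}
```

Assuming a file names.txt with one name per line (a placeholder name), it could be used as `taxonkit name2taxid < names.txt | warn_duplicate_names > name2taxid.tsv 2> ambiguous.log`, which keeps the table and the warnings in separate files.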

name2taxid go 'up' taxonomy if no taxonomy ID

There are many fish in BOLD and other non-NCBI databases that have no NCBI taxonomy ID. Often these are also weird 'sp.' or 'cf.' species. An example is BOLD:AAF5083, Unio cf. crassus.

$ echo 'Unio cf. crassus' | taxonkit name2taxid
Unio cf. crassus

I usually replace these manually with the genus-level name:

$ echo 'Unio' | taxonkit name2taxid
Unio    55836 

Would it be possible to add an optional flag that moves 'up' the taxonomic levels until it finds something, with some logging to STDERR? As in, 'Replaced Unio cf. crassus by Unio' or something like that.
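For reference, my manual workaround can be sketched roughly as follows. `genus_for_unmatched` is a hypothetical helper (not a taxonkit feature); it assumes the first word of a name is the genus and only goes one step up, to genus level.

```shell
# Hypothetical helper, not a taxonkit feature.
# Reads name<TAB>taxid rows on stdin. For rows with an empty TaxId it prints
# the first word of the name (assumed to be the genus) to stdout so it can be
# looked up again, and logs each substitution to stderr.
genus_for_unmatched() {
    awk -F '\t' '
        $2 == "" {
            split($1, words, " ")
            if (words[1] != $1) {     # skip names that are already a single word
                printf "Replaced \"%s\" by \"%s\"\n", $1, words[1] > "/dev/stderr"
                print words[1]
            }
        }
    '
}
```

With a placeholder input file names.txt, a second lookup would then be `taxonkit name2taxid < names.txt | genus_for_unmatched | sort -u | taxonkit name2taxid`. Rows that already have a TaxId are dropped by the helper, so the two result sets still have to be merged afterwards.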

Thank you for all of your hard work on this amazing tool!

@shenwei356
Owner

Optional warnings with duplicates

I'll add a warning for that.

name2taxid go 'up' taxonomy if no taxonomy ID

That seems hard to do. However, there's a fuzzy mode (-f), which I've tried; it's slow. It looks like there's a species named "Unio crassus":

$ time echo 'Unio cf. crassus' | taxonkit name2taxid -f --verbose 
14:54:47.211 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp
14:54:50.154 [INFO] 3942782 names parsed
14:54:50.154 [INFO] creating indexing for name searching ...
14:55:22.962 [INFO] indexing finished
Unio cf. crassus        143297

real    0m35.838s
user    0m39.414s
sys     0m0.733s


$ echo 143297 | taxonkit  lineage 
143297  cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Spiralia;Lophotrochozoa;Mollusca;Bivalvia;Autobranchia;Heteroconchia;Palaeoheterodonta;Unionida;Unionoidea;Unionidae;Unioninae;Unio;Unio crassus

$ grep -w 143297 ~/.taxonkit/names.dmp | csvtk pretty -Ht
143297    |   Crassiana crassa (Philippson, 1788)   |                                        |   authority         |
143297    |   Crassiana crassa                      |                                        |   synonym           |
143297    |   Crassiana nana (Lamarck, 1819)        |                                        |   authority         |
143297    |   Crassiana nana                        |                                        |   synonym           |
143297    |   Unio crassus Philipsson, 1788         |                                        |   authority         |
143297    |   Unio crassus                          |                                        |   scientific name   |
143297    |   Unio musivus Spengler, 1793           |                                        |   authority         |
143297    |   Unio musivus                          |                                        |   synonym           |
143297    |   Unio nana Lamarck, 1819               |                                        |   authority         |
143297    |   Unio nana                             |                                        |   synonym           |
2491031   |   CBS 143297                            |   CBS 143297 <culture from holotype>   |   type material     |

@shenwei356
Owner

The warning has been added.

$ echo 'Centropogon australis' | taxonkit name2taxid
15:13:36.488 [WARN] multiple TaxIds found for 'Centropogon australis'
Centropogon australis   390307
Centropogon australis   1274815

You can also just count the names and keep only the duplicated ones.

$ echo 'Centropogon australis' | taxonkit name2taxid  \
    | csvtk freq -Ht | csvtk filter2 -t -f '$2>1'
15:14:40.684 [WARN] multiple TaxIds found for 'Centropogon australis'
Centropogon australis   2

@philippbayer
Author

Wow, super fast. Thank you very much!!! Love your work!
