id2taxid mapping file format #766

amirkarger · 2023-12-15T14:22:25Z

Hi.

I build custom databases from a bunch of proteomes, mostly Uniprot but also random stuff from individual websites. Since I can't just download the NCBI prot.accession2taxid file, it would be helpful to learn more about the format for the file given to diamond makedb --taxonmap.

Does diamond care about the accession column, the accession.version column, or both?
If I have IDs that have periods in them that aren't related to a version, do I need to remove the period? What if the stuff after the period isn't a number? What if the IDs for two proteins are the same before the period, like abc.a1 and abc.a2?
I've had some success taking just the "middle" part of a Uniprot ID in the mapping file, using e.g. S7PI84 for tr|S7PI84|S7PI84_MYOBR.
Are there changes I need to make to FASTA headers too?

I got hints from reports like "xxx|" in the log file, but any more info would be helpful. I'm happy to munge FASTA or taxonmap files as needed, but I need the taxon mappings to work.

Thanks.
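
For anyone building a similar custom mapping, here is a minimal sketch of the "middle part" approach mentioned above. It pulls the accession out of a UniProt-style header and formats a row in the four-column layout of NCBI's prot.accession2taxid (accession, accession.version, taxid, gi). The function names and the taxid are placeholders, and whether diamond ends up matching the bare accession or the full header ID is an assumption to check against the makedb log, not something this sketch settles.

```python
# Header line used by NCBI's prot.accession2taxid files.
HEADER = "accession\taccession.version\ttaxid\tgi"

EXAMPLE_TAXID = 12345  # placeholder value, not a real taxon assignment


def uniprot_accession(header: str) -> str:
    """Return the middle field of an ID such as tr|S7PI84|S7PI84_MYOBR.

    IDs that do not look like sp|ACC|NAME or tr|ACC|NAME are returned unchanged.
    """
    seq_id = header.lstrip(">").split()[0]
    parts = seq_id.split("|")
    if len(parts) == 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    return seq_id


def taxonmap_row(accession: str, taxid: int) -> str:
    """Format one mapping row; the gi column is unused here, so it is set to 0."""
    return f"{accession}\t{accession}\t{taxid}\t0"


if __name__ == "__main__":
    print(HEADER)
    print(taxonmap_row(uniprot_accession(">tr|S7PI84|S7PI84_MYOBR"), EXAMPLE_TAXID))
```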

amirkarger · 2023-12-15T19:54:15Z

amirkarger
Dec 15, 2023
Author

I forgot to add that using --no-parse-seqids didn't help, and in fact stopped the one single species that was mapping (because it had simpler IDs with no period or pipe characters) from mapping.

1 reply

amirkarger Dec 17, 2023
Author

Of note, I had some IDs like blah.1-blah, and the period seemed to be confusing makedb. I arbitrarily changed them to % in both the FASTA and taxonmap files, and makedb successfully mapped all sequences. I was able to confirm with a diamond blastp that I could hit those modified sequences, as well as the modified Uniprot-style sequences and RefSeq sequences. Yay!

But I'd still love to get a better explanation of what to expect. I could imagine wanting to add more sequences and running into more trouble. Getting a better sense of what diamond makedb is doing to munge sequence IDs, and where problems can arise, would be helpful. I'd also like to clarify whether makedb cares about the accession.version column, and what should be done for sequences whose IDs have periods that aren't the start of a version suffix.
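
The workaround described above (swapping periods for % in both files) could be sketched roughly as follows. This is only an illustration: the helper names are made up, the mapping file is assumed to be tab-separated with a header line and the IDs in the first two columns, and the sequence ID is assumed to end at the first space in the FASTA header.

```python
def sanitize_id(seq_id: str, replacement: str = "%") -> str:
    """Replace every period in an ID with the replacement character."""
    return seq_id.replace(".", replacement)


def sanitize_fasta_line(line: str) -> str:
    """Rewrite only the ID of a FASTA header line; other lines pass through."""
    if not line.startswith(">"):
        return line
    # Assumption: the ID runs up to the first space; the description is untouched.
    head, sep, rest = line[1:].partition(" ")
    return ">" + sanitize_id(head) + sep + rest


def sanitize_taxonmap(lines):
    """Rewrite the two ID columns of each mapping row, leaving the header alone."""
    out = []
    for i, line in enumerate(lines):
        cols = line.rstrip("\n").split("\t")
        if i > 0 and len(cols) >= 2:
            cols[0] = sanitize_id(cols[0])
            cols[1] = sanitize_id(cols[1])
        out.append("\t".join(cols))
    return out
```

The important part is that the same substitution is applied to both files, so the IDs still agree after the rewrite.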

bbuchfink · 2023-12-19T14:15:11Z

bbuchfink
Dec 19, 2023
Maintainer

> Does diamond care about the accession column, the accession.version column, or both?

Only accession.version

> If I have IDs that have periods in them that aren't related to a version, do I need to remove the period? What if the stuff after the period isn't a number? What if the IDs for two proteins are the same before the period, like abc.a1 and abc.a2?

It will always ignore everything after the last dot unless you use --no-parse-seqids.

> Are there changes I need to make to FASTA headers too?

Probably not; for your use case it should be easier with --no-parse-seqids.

> I forgot to add that using --no-parse-seqids didn't help, and in fact stopped the one single species that was mapping (because it had simpler IDs with no period or pipe characters) from mapping.

It should work, can you make a simple test case where it fails and send it to me?
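
To make the "everything after the last dot" rule concrete, here is a small sketch that models only that one sentence. It is not diamond's actual code; the function name and the parse_seqids argument are invented for illustration, and any other parsing diamond does (for example of pipe-delimited IDs) is not modeled.

```python
def strip_after_last_dot(seq_id: str, parse_seqids: bool = True) -> str:
    """Drop the last dot and everything after it, as described above.

    With parse_seqids=False (standing in for --no-parse-seqids) the ID is
    returned untouched.
    """
    if not parse_seqids:
        return seq_id
    base, dot, _suffix = seq_id.rpartition(".")
    return base if dot else seq_id


if __name__ == "__main__":
    for example in ("abc.a1", "abc.a2", "blah.1-blah", "S7PI84"):
        print(example, "->", strip_after_last_dot(example))
```

Under this rule abc.a1 and abc.a2 both reduce to abc, and blah.1-blah reduces to blah, which is consistent with the trouble reported earlier in the thread.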

3 replies

amirkarger Dec 29, 2023
Author

Sorry for the delay.

I ran:
diamond makedb --no-parse-seqids --in mini.fasta --db mini --threads 1 --taxonmap id2_diamond_taxid_mini.txt >& makedb_diamond_mini.log

And the log says:
Database sequences 49
Database letters 24810
Accessions in database 49
Entries in accession to taxid file 49
Database accessions mapped to taxid 0
Database sequences mapped to taxid 0

I expect the last two lines to also have 49, not 0. Right?

I had to rename mini.fasta -> mini.fasta.txt to be allowed to paste it.
makedb_diamond_mini.log
id2_diamond_taxid_mini.txt
mini.fasta.txt

-Amir

bbuchfink Jan 8, 2024
Maintainer

Something seems to be going wrong here, I'll have to look into it.

bbuchfink Oct 25, 2024
Maintainer

In your example, only one accession in the FASTA file is also contained in the mapping file, which is tr|S7MBE3|S7MBE3_MYOBR. It is not matched to the entry in the mapping file because accession parsing rules are only applied to the accessions in the database file, but not to those in the mapping file. As this is quite confusing, I will change this behaviour in the next version. The entry is also not matched when using --no-parse-seqids because the accessions in your mapping file have a trailing dot that does not occur in the FASTA file.
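
A quick pre-flight check along these lines can catch this kind of mismatch before running makedb. The sketch below compares the raw FASTA IDs with the accession.version column of the mapping file and flags entries that differ only by a trailing dot. The function names are hypothetical, the mapping file is assumed to be tab-separated with a header line, and the comparison is a plain string match, which is closest to the --no-parse-seqids case.

```python
def fasta_ids(lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def mapping_ids(lines, column=1):
    """Collect one column (default: accession.version) from the mapping rows.

    Assumption: the first line is a header and is skipped.
    """
    ids = set()
    for line in list(lines)[1:]:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > column:
            ids.add(cols[column])
    return ids


def diagnose(fasta, mapping):
    """Return (exact matches, mapping IDs that match only without trailing dots)."""
    exact = fasta & mapping
    near = {m for m in mapping if m not in fasta and m.rstrip(".") in fasta}
    return exact, near
```

For example, a FASTA ID of tr|S7MBE3|S7MBE3_MYOBR and a mapping entry of tr|S7MBE3|S7MBE3_MYOBR. (with a trailing dot) would show up in the second set rather than the first.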
