Skip to content

Looking (exhaustively) for a particular function

Javier Tamames edited this page Oct 31, 2019 · 10 revisions

Suppose you are VERY interested in evaluating the presence of particular functions in a metagenome. For instance, when analysing thermal metagenomes (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6176055), we wanted to know who in the pool was capable to fix nitrogen. Therefore, we were looking for the presence of the nitrogenase gene in the samples. But in some samples, even if we were pretty sure that N2 fixation should be taking place, we could not find the nitrogenase gene. How so?

Take into account assembly is not perfect and it will recover just part of the genes represented in the metagenome. Some of the reads will not assemble. Or even many of them, if sequencing depth is low and species richness and/or diversity is high. SqueezeMeta gives you the mappingstat file in which you will find the percentage of reads that could be mapped to the assembly, so that you can evaluate how comprehensive such assembly is. But even if you have very high mapping percentages, still some of your reads are not represented in the metagenome. And therefore you could miss a function that is rare and only present in these fraction of non-assembled reads. What then?

You must rely on working with raw reads. These are (way) shorter than contigs, therefore the assignment will be more difficult and risky. And in addition you lose the valuable contextual information (the rest of the gene, and the other genes in the contig). Therefore, you must be careful and extra-check your results when doing so. This said, SqueezeMeta gives you, from version 1.0, a couple of tools for annotating raw reads directly. You can find then in the utils directory.

First, the sqm_reads.pl uses the same SqueezeMeta machinery that annotates contigs, that is, aligning the reads against nr/COG/KEGG databases using Diamond, and using the lca or fun3 algorithms to annotate taxa and functions for the reads. This can be slow, of course, because you will probably have many millions of reads to process. The results are very nice, with a small error rate between 1-4% in our tests. You can see a benchmarking of this method in our paper https://www.biorxiv.org/content/10.1101/522292v1. But still, many of these reads could not be assigned, mainly because homology searching methods are not so sensitive for short reads. Therefore, still you cannot be sure that your function is or is not there, can you?

Then you can use our second tool: sqm_hmm_reads.pl. This is using the Hidden Markov Model algorithm of the Short Pair software (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657049/) to provide the highest sensitivity when annotating short reads. These guys did an excellent work adapting the HMM search for working with short reads. The drawback is that HMM profile search is way slower. Consequently, it is not practical to do it for all functions. You must provide the script with the pfam ids of the functions you want to look for.

Combining the three tools (standard SqueezeMeta, sqm_reads and sqm_hmm_reads), you will have the most complete information regarding the presence or not of your target functions in your dataset. From there, interpretation is all yours.