A discussion on taxonomic annotation and the meaning of disparity values of contigs

Let's take a look at the file contiglog (08.<project>.contiglog) from the Hadza example we worked on the SqueezeMeta paper. You can see something like this:

The table shows, for each contig, the putative taxonomic assignment, the corresponding taxonomic rank, a value named "disparity index", and the number of genes in the contig. What is that "disparity index"?

Disparity measures the consistency of the taxonomic annotations for the genes in the contig. Suppose you have five genes in a contig. Four of them belong to taxon A, while one is annotated to taxon B. The disparity index measures the percentage of paired comparisons between genes that belong to different taxa. That is, in our example we will do 20 paired comparisons between genes (gene 1 vs gene 2, gene 1 vs gene 3, ...), and eigth will report different taxa. Disparity index for this contig is then 8/20 = 0.4. But, to which rank is this value calculated? That is, which ranks are being compared? Phylum, families, species? The answer is, the disparity index is calculated for the rank of the consensus annotation. In the table above, it would be genus, species and order.

An additional consideration: A inconsistent comparison at a given rank it will be also inconsistent at lower ranks. Suppose we have two genes assigned to phyla Proteobacteria and Firmicutes, respectively. They do not have annotations for the genus rank. Suppose also that the consensus of the contig is genus Escherichia (a Proteobacteria). Are these two going to be inconsistent at genus rank, even if they have no annotations for that rank? Yes, because they are inconsistent at a higher rank.

Let us examine one example of the Hadza analysis to understand better the meaning of the disparity index. One of the highest disparity values is for contig k119_409504, assigned to genus Acidiphilium:

It is very easy to inspect the annotations for the genes in the contig. These can be found in the file 06.<project>.contiglog. We can just grep this file: grep k119_409504 06.\<project\>.contiglog. The result is:

The high disparity results from the inclusion of three genes assigned to another phylum (Firmicutes) in the predominantly Proteobacterial contig. Let us get the aminoacid sequence of these ORFs from the 03.<project>.faa file, and let's run it a NCBI's blastp to graphically show the results.

Most ORFs match a single species of Acidiphilium, for instance, gene 3 (81% identity):

The same is true for most of the genes in the contig. But take a look at gene 1. Its closest match is with a Ruminococcaceae bacteria (Firmicutes), with 68% identity

and gene 2, whose closest match is with an unidentified Firmicutes bacteria, just 36% identity (but having other matches with as much as 45% identity, also with Firmicutes bacteria):

What is the most likely situation here? The species in the Hadza metagenome is a Acidiphilium bacteria that happens to have just one relative from Acidiphilium genus in the current databases. Genes that are present in the Hadza Acidiphilium but not in the reference Acidiphilium in the database cannot be identified as belonging to that genus, because they do not have matches with it. Therefore, they are assigned to the closest remaining taxa: that of Firmicutes. This is most likely not a case of contig chimerism, but of incorrect assignment because of database uncompleteness.

We have good news and bad news. Good news, SqueezeMeta was able to annotate correctly the contig even in the presence of these unconsistences. Working with all annotations for the contig, consensus annotation is capable of correcting these errors. We could even re-annotate the genes considering the annotation of the full contig.

The bad news, the annotation of individual genes was still wrong for these two. Couldn't SqueezeMeta be able to add some filters for preventing these instances? Indeed, it does. We set identity filters for preventing annotation to ranks for which the identity level is not conclussive. We get these values from a great paper from Luo et al, Nucleic Acids Res. 2014 Apr; 42(8): e73. They compared the identity values between pairs of proteins, and came out with a distribution of identity values at different ranks. Take a look:

The lower plot shows these distributions. You can see, for instance, that if your protein has a hit with another at 50% identity, they are likely not to come from the same genus, because proteins from the same genus are expected to have more than 50% identity. We implemented these ranks in our annotations, so that if the identity is below the minimum level for a particular rank, the protein will not be assigned to that rank. The values we are using are 85, 60, 55, 50, 46, 42 and 40 % identity for species, genus, family, order, class, phylum and kingdom ranks, respectively. Now, take a look again to the identity values for the closest hits for these two conflicting genes: With 68% and 45%, they passed the threshold value to be assigned to the phylum rank. Therefore, they were.

Couldn't we be more strict in these cutoff values? Yes, of course we could. But that will mean to lose a large share of annotations, most of them correct, that are just close to the current threshold values.It is probably not worth it. Fortunately, the number of cases like the one above is very low, as you can see in the disparity values for the contigs. In the Hadza dataset, just 3722 of the original 983353 contigs had disparity greater than zero. That is, less than 0.4%. Take the disparity values as a warning flag indicating that something can be not entirely correct. Nevertheless, if you feel more strict than us, you can set your own values for the filter in the script 06.lca.pl.

But wait, 68% identity is way higher than the one proposed by Luo et al for proteins from different phyla. Why it resembles so much to a Proteobacterial protein? Several explanations are possible. We could think of a recent gene transfer event, for instance, where a Proteobacterial gene was transferred to a Firmicutes bacteria. But taking into account that the Ruminococcus protein is a metagenomic sequence, and has no similarity with any other Ruminococcus protein, I'd rather think of a wrong taxonomic annotation for that sequence.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A discussion on taxonomic annotation and the meaning of disparity values of contigs

Clone this wiki locally