Skip to content

Working with long reads (ONT, PacBio)

Javier Tamames edited this page Mar 5, 2020 · 2 revisions

The choice of many users is to analyse their sequencing reads directly, skipping the assembly and performing taxonomic & functional annotation on reads directly. You can do this easily with sqm_reads.pl. But, if you are working with long reads, be careful with it. This script was designed to work on short reads resulting mostly from Illumina sequencing. Sqm_reads.pl runs homology searches and uses the results to annotate one single ORF. This is fine, because in these short reads you expect to have just one gene, most of the times. Should the read contains more than one gene, the script will annotate the one with the highest bitscore values, that is, the one with the most powerful homology. The other one (often will be a small piece of the next gene) will not be annotated. This will not harm the taxonomic annotation (it is enough just annotating one of the genes), and probably neither the functional one, since it is likely that the remaining gene is such a short fragment that it is not suitable for annotation anyways.

But, this is no longer true for long reads where you have > 1 Kb of sequence. Then it is very likely that many genes are represented in the read, and we will be annotating just one. This could be a problem especially for functional annotation.

What to do then? We suggest that you run these samples using the standard SqueezeMeta.pl pipeline. Since you want to work on reads, you must skip the assembly. You can do this easily with the –ext-assembly option, providing the reads fasta or fastq file as if it was the result of the assembly. Then you will proceed with the standard SqueezeMeta pipeline, that is, predicting genes with prodigal, and running homology searches for annotating taxa and functions

But wait, ONT long reads are prone to have sequencing errors, and gene predictors can be very sensitive to them. Probably we will miss many genes and wrongly predicted some more. This is when the doublepass flag, --D, comes to help. This will run an additional Diamond blastx search on the parts of the read in which: 1) No ORF was predicted, or 2) An ORF was predicted but no hits in the databases were found. Then SqueezeMeta will join the results of gene prediction with the blastx analysis, and proceeds with taxonomic and functional annotation for each ORF in the read. Finally, the taxonomic annotation for the read is done using the taxonomic consensus of all ORFs in the read, as it is done for contigs.

So, you will end with a set of SqueezeMeta results, in which you have to take into account that all references to “contigs” are instead referring to “reads”. Also, it is not possible to do binning, because you are working with reads and thus losing all differential coverage information.

Therefore, a command line for running SqueezeMeta on long reads could be as follows:

SqueezeMeta.pl -f <read file dir> -s minion.samples -m sequential -extassembly myreads.fasta --minion --nobins --nopfam --D

And the minion.samples file being as simple as this:

minionreads myreads.fasta pair1

I skipped the pfam annotation just for speeding things up. Take into account that this will be slower than the usual sqm_reads.pl procedure. At the end you will have a “minionreads” directory full of results ready for you to have a good time!

PD, we would probably add a "longreads" mode in upcoming SqueezeMeta version 1.2 which will select this type of analysis automatically.

PD2, if you already analyzed long reads with sqm_reads and are wondering if your results are valid, we did a comparison of the results using SQM in this way and the ones of sqm_reads. The picture below shows a Procustes plot with the differences in taxonomic annotations between the two methods, based on their PCA ordinations. The differences are minimal between both at this genus level, and it is that way for all other taxonomic ranks. So if you used sqm_reads, you will have less annotations, but your interpretation of the results is still valid.

Procustes plot showing the differences between taxonomic annotations for five samples