
How to cluster huge datasets

Benjamin Buchfink edited this page Mar 6, 2023 · 6 revisions

Clustering huge input files of several billion protein sequences may not finish in a reasonable time using the standard cluster workflow on a single server; such jobs should usually be run on a compute cluster with a multi-node setup.

Say we want to cluster at thresholds of 30% sequence identity and 90% coverage. The first step should always be to run the linclust workflow:

```shell
diamond linclust -d input.faa -o clusters.tsv -M 500G --approx-id 30 --member-cover 90
```

The linclust workflow is highly efficient and can process input files of several billion sequences within days on a single server. For clustering at 90% identity and above, running linclust is likely sufficient (although some clusterable homologs may be missed).
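
As a quick sanity check on the linclust result, the number of clusters equals the number of distinct representatives in the first column of the output TSV. A minimal sketch on a toy file (the file name and accessions are illustrative only, mimicking the two-column representative/member format):

```shell
# Toy stand-in for the two-column TSV (representative<TAB>member)
# written by diamond linclust; contents are illustrative only.
printf 'repA\trepA\nrepA\tseq2\nrepB\trepB\nrepB\tseq4\n' > toy_clusters.tsv

# The number of clusters equals the number of distinct representatives:
n_clusters=$(cut -f1 toy_clusters.tsv | sort -u | wc -l)
echo "$n_clusters"
```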

For deeper, more sensitive clustering, the representatives from the first run should be subjected to a further round of all-vs-all alignment. To get a FASTA file of the representatives, a third-party tool such as seqtk can be used:

```shell
seqtk subseq input.faa <(cut -f1 clusters.tsv | uniq) > reps.faa
diamond makedb --in reps.faa -d reps
```
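
If seqtk is not available, the representative records can also be pulled out with a generic awk two-pass filter. This is a sketch on hypothetical toy files, assuming the accession is the first whitespace-delimited token of each FASTA header:

```shell
# Toy stand-ins for input.faa and the linclust output;
# names and sequences are illustrative only.
printf '>repA some description\nMKV\n>seq2\nMKL\n>repB\nMQQ\n' > toy_input.faa
printf 'repA\trepA\nrepA\tseq2\nrepB\trepB\n' > toy_clusters.tsv

# First pass (NR==FNR) stores representative IDs from column 1; second
# pass prints a FASTA record iff its accession (header token minus the
# leading '>') is a stored representative.
awk 'NR==FNR { reps[$1]; next }
     /^>/   { keep = (substr($1, 2) in reps) }
     keep' toy_clusters.tsv toy_input.faa > toy_reps.faa
```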

To set up a multi-node run, we use the --mp-init option once on a head node:

```shell
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --mp-init --parallel-tmpdir $PTMP
```

We use -k1000 to limit the output to a reasonable size, but other values can be used here. The $PTMP variable needs to point to a directory accessible to all compute nodes. Once initialized, we can run the actual alignment from our submit script or similar:

```shell
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --parallel-tmpdir $PTMP
```
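
On a SLURM cluster, for example, this could be wrapped in a submit script that starts one identical DIAMOND process per node; the processes coordinate through the shared --parallel-tmpdir directory. The partition-independent sketch below is hypothetical: node count, CPU count, and the scratch path are placeholders to adapt to your site.

```shell
#!/bin/bash
#SBATCH --job-name=diamond-mp
#SBATCH --nodes=16               # placeholder: one DIAMOND process per node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64       # placeholder: cores available per node

# Must point to a directory visible to all compute nodes (placeholder path):
export PTMP=/shared/scratch/diamond_ptmp

srun diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k1000 -c1 --fast \
    --multiprocessing --parallel-tmpdir $PTMP
```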