How to cluster huge datasets
Clustering huge input files of several billion protein sequences may not finish in reasonable time using the standard cluster workflow on a single server; such runs should usually be done on a compute cluster with a multi-node setup.
Say we want to cluster at thresholds of 30% sequence identity and 90% coverage. The first step should always be to run the linclust workflow:
diamond linclust -d input.faa -o clusters.tsv -M 500G --approx-id 30 --member-cover 90
The linclust workflow is highly efficient and can process input files of several billion sequences within days on a single server. For clustering at 90% identity and above, running linclust alone is likely sufficient (although some clusterable homologs may be missed).
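For example, a 90% identity clustering could be run directly like this (a sketch only: the output file name and the 500G memory limit are assumptions to be adapted to your data and hardware):
diamond linclust -d input.faa -o clusters90.tsv -M 500G --approx-id 90 --member-cover 90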
When clustering deeper and more sensitively, the representatives from the first run should be subjected to a further round of all-vs-all alignment. To get a FASTA file of the representatives, a third-party tool like seqtk can be used:
seqtk subseq input.faa <(cut -f1 clusters.tsv | uniq) > reps.faa
diamond makedb --in reps.faa -d reps
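As an optional sanity check (assuming standard Unix tools only), the number of sequences in reps.faa should equal the number of distinct representatives in the clustering output:
grep -c '>' reps.faa
cut -f1 clusters.tsv | sort -u | wc -l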
To set up a multi-node run, we use the --mp-init option once on a head node:
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --mp-init --parallel-tmpdir $PTMP
We use -k1000 to limit the output to a reasonable size, but other values can be used here. The $PTMP variable needs to point to a directory accessible by all compute nodes. Once initialized, we can run the actual alignment within our submit script or similar:
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --parallel-tmpdir $PTMP
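For illustration, a minimal SLURM submit script could launch one such worker per array task (a sketch only, not part of the official workflow: the array size, CPU count, memory limit, and the /shared/scratch path are assumptions to be adapted to your cluster):
#!/bin/bash
#SBATCH --job-name=diamond-mp
#SBATCH --array=1-16           # one DIAMOND worker per array task; size to your cluster (assumption)
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64     # assumption; match your node size
#SBATCH --mem=500G             # assumption; match your node memory
# $PTMP must point to a directory visible to all nodes (hypothetical path)
export PTMP=/shared/scratch/diamond_tmp
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore \
  --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing \
  --parallel-tmpdir $PTMP
The workers coordinate through the shared temporary directory, so each array task can run the identical command and pick up unprocessed work until the job is complete.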