How to cluster huge datasets
Clustering huge input files of several billion protein sequences may not finish in reasonable time using the standard cluster
workflow on a single server and should usually be done on a compute cluster with a multi-node setup.
Say we want to cluster at threshold values of 30% sequence identity and 90% coverage. The first step should always be to run the linclust workflow:
diamond linclust -d input.faa -o clusters.tsv -M 500G --approx-id 30 --member-cover 90
The linclust workflow is highly efficient and can process input files of several billion sequences within days on a single server. For clustering at 90% identity and above, running linclust alone is likely sufficient (although some clusterable homologs may be missed).
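The clusters.tsv output lists one representative–member pair per line (first column representative, second column member), so standard Unix tools can summarize the result. A minimal sketch, using a synthetic demo_clusters.tsv in place of the real output (filename and data are illustrative):

```shell
# Build a tiny rep<TAB>member table like the one linclust produces.
printf 'r1\tr1\nr1\ts2\nr2\tr2\nr1\ts3\n' > demo_clusters.tsv

# Number of clusters = number of distinct representatives.
n_clusters=$(cut -f1 demo_clusters.tsv | sort -u | wc -l)
echo "clusters: $n_clusters"

# Cluster size distribution: members per representative, largest first.
cut -f1 demo_clusters.tsv | sort | uniq -c | sort -rn | head
```

The same two commands applied to the real clusters.tsv give a quick sanity check before deciding whether a second, deeper round is worthwhile.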
When clustering deeper and more sensitively, the representatives from the first run should be subjected to a further round of all-vs-all alignment. To get a FASTA file of the representatives, a third-party tool such as seqtk can be used:
seqtk subseq input.faa <(cut -f1 clusters.tsv | uniq) > reps.faa
diamond makedb --in reps.faa -d reps
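If seqtk is not available, the same subsetting can be sketched with awk alone. The demo filenames below stand in for input.faa and the representative list derived from clusters.tsv:

```shell
# Illustrative stand-ins for input.faa and the representative ID list.
printf '>s1 desc\nMKV\n>s2\nMAA\n>s3\nMCC\n' > demo_input.faa
printf 's1\ns3\n' > demo_reps.txt

# First pass loads the wanted IDs; second pass toggles printing on each
# header depending on whether its ID (text after ">") is in the set.
awk 'NR==FNR {keep[$1]=1; next}
     /^>/   {name=substr($1,2); p=(name in keep)}
     p'     demo_reps.txt demo_input.faa > demo_reps.faa
```

This keeps each selected header line (including any description) together with all of its sequence lines, which is sufficient as input for diamond makedb.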
To set up a multi-node run, we use the --mp-init option once on a head node:
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --mp-init --parallel-tmpdir $PTMP
We use -k1000 to limit the output to a reasonable size, but other values can be used here. The $PTMP variable needs to point to a directory accessible by all compute nodes. Once initialized, we can run the actual alignment within our submit script or similar:
diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore --approx-id 30 --query-cover 90 -k1000 -c1 --fast --multiprocessing --parallel-tmpdir $PTMP
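On a SLURM cluster, one way to launch the worker call above is an array job; the sketch below is a hedged example, and the array size, resource requests, and shared-filesystem path are assumptions that must be adapted to your site:

```shell
#!/bin/bash
#SBATCH --array=1-16          # one worker per array task; count is illustrative
#SBATCH --cpus-per-task=32
#SBATCH --mem=500G

# All workers must see the same parallel tmpdir on a shared filesystem;
# the path here is a placeholder.
export PTMP=/shared/scratch/diamond_mp

diamond blastp -q reps.faa -d reps -o out -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k1000 -c1 --fast \
    --multiprocessing --parallel-tmpdir $PTMP
```

Each array task runs one identical worker; DIAMOND's multiprocessing mode coordinates the work distribution through the files in $PTMP.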
The call above should be executed in a parallel way on many compute nodes. When finished, we compute the actual clustering based on the alignment output:
cat out_* > out.tsv
samtools faidx reps.faa
diamond greedy-vertex-cover --edges out.tsv -d reps.faa.fai --edge-format triplet -o clusters_round_2.tsv
The resulting representative sequences of this clustering can then be subjected to another round of all-vs-all alignment and clustering. Note that the greedy-vertex-cover workflow does not group the output file by clusters, so use cut -f1 clusters_round_2.tsv | sort -u to obtain a list of representatives. To compute a deeper clustering, it is recommended to chain multiple rounds with increasing sensitivity, e.g. --fast, default, --sensitive, --very-sensitive.
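After a second round, each original member can be mapped to its final representative by composing the two rep-to-member tables. A minimal sketch, assuming both files use the rep<TAB>member layout and using synthetic stand-ins for clusters.tsv and clusters_round_2.tsv:

```shell
# Round 1: rep -> member; round 2 clusters the round-1 reps themselves.
printf 'r1\ts1\nr1\ts2\nr2\ts3\n' > demo_round1.tsv
printf 'R\tr1\nR\tr2\n'           > demo_round2.tsv

# First pass records the final rep of each round-1 rep; second pass
# rewrites every round-1 pair to final_rep<TAB>original_member.
awk 'NR==FNR {final[$2]=$1; next} {print final[$1] "\t" $2}' \
    demo_round2.tsv demo_round1.tsv > demo_final.tsv
```

Applied to the real files, this yields a single table assigning every input sequence to its deepest-round representative, which can be carried through further rounds the same way.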
See also: Distributed computing