-
Notifications
You must be signed in to change notification settings - Fork 182
1. Tutorial
We will demonstrate an example run of DIAMOND on a small dataset here. For this, we assume the tool has been installed according to the installation instructions.
First we will download the SCOPe database of structured domains in FASTA format, containing 14,323 sequences:
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
The next step is to setup a binary DIAMOND database file that can be used for subsequent searches against the database:
diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40
A binary file (astral40.dmnd
) containing the database sequences will be created
in the current working directory. We will now conduct a search against this database
using the same FASTA file of SCOPe domains as queries:
diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive
Here we set the output file to be out.tsv
and also use the --very-sensitive
setting. DIAMOND has a number of sensitivity settings to accomodate different
applications. The default mode is the fastest and tailored towards finding
homologies of >70% sequence identity, the --sensitive
mode is tailored to hits
of >40% identity, while the --very-sensitive
and --ultra-sensitive
modes
provide sensitivity accross the whole range of pairwise alignments.
If the run completed successfully, at the end we will see this console output providing some statistics about the number of hits that were found:
Total time = 5.494s
Reported 74440 pairwise alignments, 74440 HSPs.
14323 queries aligned.
We will inspect the beginning of the output file like so:
head out.tsv
Here we will see the first ten lines of the file, showing 10 pairwise alignments of the first 3 query sequences:
d1dlwa_ d1dlwa_ 100 116 0 0 1 116 1 116 6.42e-77 220
d1dlwa_ d2gkma_ 35.4 113 73 0 1 113 13 125 1.43e-21 80.9
d1dlwa_ d4i0va_ 31.9 119 75 2 1 113 2 120 9.11e-13 58.2
d2gkma_ d2gkma_ 100 127 0 0 1 127 1 127 1.51e-87 248
d2gkma_ d1dlwa_ 34.8 115 75 0 13 127 1 115 6.90e-23 84.3
d2gkma_ d4i0va_ 33.6 110 69 1 13 118 2 111 1.35e-18 73.6
d2gkma_ d6bmea_ 35.5 110 67 1 13 118 2 111 1.32e-16 68.6
d2gkma_ d2bkma_ 37.3 67 38 2 13 76 5 70 5.18e-06 40.8
d1ngka_ d1ngka_ 100 126 0 0 1 126 1 126 4.34e-91 257
d1ngka_ d2bkma_ 38.4 125 73 2 1 125 4 124 1.42e-24 89.0
The file is generated in tabular-separated (TSV) format composed of 12 fields,
a format corresponding to the format generated by BLAST using the option
-outfmt 6
. The 12 fields are:
-
Query accession: the accession of the sequence that was the search query against
the database, as specified in the input FASTA file after the
>
character until the first blank. - Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
- Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
- Length: The total length of the local alignment, which including matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
- Mismatches: The number of non-identical amino acid residues aligned against each other.
- Gap openings: The number of gap openings.
- Query start: The starting coordinate of the local alignment in the query (1-based).
- Query end: The ending coordinate of the local alignment in the query (1-based).
- Target start: The starting coordinate of the local alignment in the target (1-based).
- Target end: The ending coordinate of the local alignment in the target (1-based).
- E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
- Bit score: The bit score is a scoring matrix independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith Waterman alignments.
Note that this output format can be customized with a number of non-default fields that are available. It is generally advisable to customize the format and limit it to the information required by downstream processing, as this may substantially increase performance. If no fields are selected that require alignment traceback (such as coordinates, length, identity, gap openings and mismatches), performance will be gained by omitting traceback computations.
We download a FASTA file and use it as the input file for clustering:
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa
diamond cluster -d astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa -o clusters.tsv \
--approx-id 40 -M 64G --header
We specify a 40% (approximate) identity cutoff as the clustering criterion, combined with the
default value of 80% coverage cutoff. This way we require that every cluster member sequence
have an alignment against its cluster representative covering at least 80% of the member
sequence and being above 40% identity. We also allow the tool to use 64 GB of memory for
better performance. Using --header
generates a header line for output files.
We inspect the output file:
head clusters.tsv
The first column of the output file denotes the accession of the cluster representative and the second column the accession of the member sequence:
centroid member
d1dlwa_ d1dlwa_
d2gkma_ d2gkma_
d4i0va_ d1dlya_
d4i0va_ d1s69a_
d4i0va_ d4i0va_
d4i0va_ d6bmea_
d4i0va_ d4xdia_
d4i0va_ d3aq9a_
d2dc3a_ d2dc3a_
For example, the sequence with accession d4i0va_
is the representative of a cluster
comprising six members. To get alignments of the cluster members against their representative
we run the realign workflow:
diamond realign -d astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa --clusters clusters.tsv \
-o aln.tsv -M 64G --header