Skip to content
Benjamin Buchfink edited this page Feb 27, 2023 · 8 revisions

Protein alignment

We will demonstrate an example run of DIAMOND on a small dataset here. For this, we assume the tool has been installed according to the installation instructions.

First we will download the SCOPe database of structured domains in FASTA format, containing 14,323 sequences:

wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa

The next step is to setup a binary DIAMOND database file that can be used for subsequent searches against the database:

diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40

A binary file (astral40.dmnd) containing the database sequences will be created in the current working directory. We will now conduct a search against this database using the same FASTA file of SCOPe domains as queries:

diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive

Here we set the output file to be out.tsv and also use the --very-sensitive setting. DIAMOND has a number of sensitivity settings to accomodate different applications. The default mode is the fastest and tailored towards finding homologies of >70% sequence identity, the --sensitive mode is tailored to hits of >40% identity, while the --very-sensitive and --ultra-sensitive modes provide sensitivity accross the whole range of pairwise alignments.

If the run completed successfully, at the end we will see this console output providing some statistics about the number of hits that were found:

Total time = 5.494s
Reported 74440 pairwise alignments, 74440 HSPs.
14323 queries aligned.

We will inspect the beginning of the output file like so:

head out.tsv

Here we will see the first ten lines of the file, showing 10 pairwise alignments of the first 3 query sequences:

d1dlwa_ d1dlwa_ 100     116     0       0       1       116     1       116     6.42e-77        220
d1dlwa_ d2gkma_ 35.4    113     73      0       1       113     13      125     1.43e-21        80.9
d1dlwa_ d4i0va_ 31.9    119     75      2       1       113     2       120     9.11e-13        58.2
d2gkma_ d2gkma_ 100     127     0       0       1       127     1       127     1.51e-87        248
d2gkma_ d1dlwa_ 34.8    115     75      0       13      127     1       115     6.90e-23        84.3
d2gkma_ d4i0va_ 33.6    110     69      1       13      118     2       111     1.35e-18        73.6
d2gkma_ d6bmea_ 35.5    110     67      1       13      118     2       111     1.32e-16        68.6
d2gkma_ d2bkma_ 37.3    67      38      2       13      76      5       70      5.18e-06        40.8
d1ngka_ d1ngka_ 100     126     0       0       1       126     1       126     4.34e-91        257
d1ngka_ d2bkma_ 38.4    125     73      2       1       125     4       124     1.42e-24        89.0

The file is generated in tabular-separated (TSV) format composed of 12 fields, a format corresponding to the format generated by BLAST using the option -outfmt 6. The 12 fields are:

  1. Query accession: the accession of the sequence that was the search query against the database, as specified in the input FASTA file after the > character until the first blank.
  2. Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
  3. Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
  4. Length: The total length of the local alignment, which including matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
  5. Mismatches: The number of non-identical amino acid residues aligned against each other.
  6. Gap openings: The number of gap openings.
  7. Query start: The starting coordinate of the local alignment in the query (1-based).
  8. Query end: The ending coordinate of the local alignment in the query (1-based).
  9. Target start: The starting coordinate of the local alignment in the target (1-based).
  10. Target end: The ending coordinate of the local alignment in the target (1-based).
  11. E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
  12. Bit score: The bit score is a scoring matrix independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith Waterman alignments.

Note that this output format can be customized with a number of non-default fields that are available. It is generally advisable to customize the format and limit it to the information required by downstream processing, as this may substantially increase performance. If no fields are selected that require alignment traceback (such as coordinates, length, identity, gap openings and mismatches), performance will be gained by omitting traceback computations.

Protein clustering

We download a FASTA file and use it as the input file for clustering (runtime ~4s on 8 cores):

wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa
diamond cluster -d astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa -o clusters.tsv \
  --approx-id 40 -M 64G --header

We specify a 40% (approximate) identity cutoff as the clustering criterion, combined with the default value of 80% coverage cutoff. This way we require that every cluster member sequence have an alignment against its cluster representative covering at least 80% of the member sequence and being above 40% identity. We also allow the tool to use 64 GB of memory for better performance. Using --header generates a header line for output files.

We inspect the output file:

head clusters.tsv

The first column of the output file denotes the accession of the cluster representative and the second column the accession of the member sequence:

centroid member
d1dlwa_  d1dlwa_
d2gkma_  d2gkma_
d4i0va_  d1dlya_
d4i0va_  d1s69a_
d4i0va_  d4i0va_
d4i0va_  d6bmea_
d4i0va_  d4xdia_
d4i0va_  d3aq9a_
d2dc3a_  d2dc3a_

For example, the sequence with accession d4i0va_ is the representative of a cluster comprising six members. To get alignments of the cluster members against their representative we run the realign workflow:

diamond realign -d astral-scopedom-seqres-gd-sel-gs-bib-95-2.07.fa --clusters clusters.tsv \
  -o aln.tsv -M 64G --header

The resulting output from head aln.tsv:

cseqid  mseqid  approx_pident   cstart  cend    mstart  mend    evalue          bitscore
d1dlwa_ d1dlwa_ 100             1       116     1       116     1.09e-75        218
d2gkma_ d2gkma_ 100             1       127     1       127     2.56e-86        246
d4i0va_ d6bmea_ 41.8            2       115     2       115     8.28e-23        85.1
d4i0va_ d3aq9a_ 45.3            2       123     2       117     3.99e-29        100
d4i0va_ d1dlya_ 47.1            2       122     1       117     9.61e-31        105
d4i0va_ d4xdia_ 50.1            2       115     6       119     1.99e-31        107
d4i0va_ d4i0va_ 100             1       123     1       123     3.16e-85        243
d4i0va_ d1s69a_ 65.2            2       123     2       123     3.10e-51        157
d2dc3a_ d1urva_ 94.5            2       155     1       154     2.17e-107       302