\section{Introduction}
Pairwise alignments were the first that could be automated using
optimal algorithms based on dynamic programming. Today, these early
optimal algorithms have largely been superseded by heuristic versions
implemented in programs like Blast. However, optimal pairwise
alignment remains a staple in bioinformatics, if only as background to
current algorithms.
In the package \ty{pal} we implement the computation of three types of
alignment, global, overlap, and local. In a global alignment, homology
extends across the two sequences compared, which we call query, $q$,
and subject, $s$. Figure~\ref{fig:pal}A shows such an alignment. In an
overlap alignment, query and subject overlap at their beginning and
end, or \emph{vice versa} (Figure~\ref{fig:pal}B). You can see that
global alignment is a special case of overlap alignment, where the
overlap happens to extend across the full length of $q$ and $s$. In a
local alignment, finally, there is local homology between $q$ and $s$
(Figure~\ref{fig:pal}C). This is the most general type of alignment,
which subsumes the overlap and global alignments as special cases.
\begin{figure}
\begin{center}
\begin{psmatrix}[rowsep=0.2cm]
\textbf{A} & \textbf{B} & \textbf{C}\\
\input{gal} & \input{oal} & \input{lal}
\end{psmatrix}
\end{center}
\caption{The three types of alignment, global (\textbf{A}), overlap
(\textbf{B}), and local (\textbf{C}). Homology in
black.}\label{fig:pal}
\end{figure}
Regardless of its type, alignments are computed using an
$(m+1)\times(n+1)$ matrix, where $m$ is the length of $q$ and $n$ the
length of $s$. Let $p$ be the name of this matrix depicted in
Figure~\ref{fig:dpm}. Each row corresponds to a residue in $q$ and
each column to a residue in $s$. The sequences are preceded by a gap,
hence $p$ consists of $m+1$ rows and $n+1$ columns, which are numbered
starting from zero.
\begin{figure}
\begin{center}
\begin{pspicture} (0,0)(4,3.5)
\psline(0,0)(4,0)(4,3)(0,3)(0,0)
\rput[bl](0,3.1){\texttt{-ACCGTACC...}}
\rput[bl](0.15,3.4){$s\rightarrow$}
\rput[br](-0.1,0.9){\rotatebox{-90}{\texttt{-ACCCTACC...}}}
\rput[br](-0.5,2.2){\rotatebox{-90}{$q\rightarrow$}}
\rput[br](4,0){\fbox{\pnode{p40}}}
\rput[br](0.9,2.5){\rnode{ge}{$p_{0,0}$}}
\rput[tl](0,3){\fbox{\pnode{p03}}}
\rput[br](3.7,0.2){$p_{m,n}$}
\end{pspicture}
\end{center}
\caption{Setting up the dynamic programming matrix, $p$; the query, $q$, is $m$
residues long, the subject, $s$, $n$ residues.}\label{fig:dpm}
\end{figure}
Gaps are scored with affine costs, that is, gap opening, $g_{\rm o}$,
is distinguished from gap extension, $g_{\rm e}$ and a gap of length
$l$ has score
\[
g(l) = g_{\rm o} + (l-1)g_{\rm e}.
\]
To accommodate these
affine gap costs, each entry in $p$, $p_{ij}$, consists of four
subentries, $p_{ij}^{\rm e}$, $p_{ij}^{\rm f}$, $p_{ij}^{\rm g}$,
and $p_{ij}^{\rm v}$.
The last entry,
$p_{ij}^{\rm v}$, is the score of the subalignment of $q[1...i]$ and
$s[1...j]$. It is computed as the maximum of the other three entries
\cite[p. 405]{ohl13:bio}.
\begin{equation}\label{eq:alMax}
p_{ij}^{\rm v} = \mbox{max}\left(p_{ij}^{\rm e},p_{ij}^{\rm f},p_{ij}^{\rm g}\right).
\end{equation}
The three terms on the right hand side are computed with respect to
their vertical, horizontal, or diagonal neighbors:
\begin{eqnarray*}
p_{ij}^{\rm e} & = & \mbox{max}\left(p_{i-1,j}^{\rm e}+g_{\rm e},
p_{i-1,j}^{\rm v}+g_{\rm o}\right)\\
p_{ij}^{\rm f} & = & \mbox{max}\left(p_{i,j-1}^{\rm f}+g_{\rm e},
p_{i,j-1}^{\rm v}+g_{\rm o}\right)\\
p_{ij}^{\rm g} & = & p_{i-1,j-1}^{\rm g} + \mbox{score}\left(q[i],s[j]\right)
\end{eqnarray*}
In the last equation the \emph{score} of two residues is either
match/mismatch, or taken from an explicit score matrix, for example
the BLOSUM62 matrix in Figure~\ref{fig:scoreMat}.
\begin{figure}
\small
\verbatiminput{blosum62.txt}
\caption{The BLOSUM62 amino acid substitution matrix for scoring
aligned pairs of amino acids.}\label{fig:scoreMat}
\end{figure}
For local alignment, the maximum in equation (\ref{eq:alMax}) includes
zero,
\[
p_{ij}^{\rm v} = \mbox{max}\left(p_{ij}^{\rm e},p_{ij}^{\rm f},p_{ij}^{\rm g},0\right).
\]
Initialization of $p$ also depends on the alignment type. For global
we set the top left cell to zero.
\[
p_{00}^{\rm v} = p_{00}^{\rm e} = p_{00}^{\rm f} = p_{00}^{\rm g} = 0
\]
In addition we set initialize the costs of gaps along the top row and
first column.
\begin{eqnarray*}
p_{i0}^{\rm v} & = & p_{i0}^{\rm e} = g_{\rm o} + ig_{\rm e}; i>0\\
p_{i0}^{\rm f} & = & p_{i0}^{\rm g} = -\infty; i>0\\
p_{0j}^{\rm v} & = & p_{0j}^{\rm f} = g_{\rm o} + jg_{\rm e}; j>0\\
p_{0j}^{\rm e} & = & p_{0j}^{\rm g} = -\infty; j>0
\end{eqnarray*}
For local and overlap alignment, the scores in the first column and
first row are set to zero.
\[
p_{i0}^{\rm v}=p_{0j}^{\rm v}=0\\
p_{i0}^{\rm e}=p_{0j}^{\rm e}=0\\
p_{i0}^{\rm f}=p_{0j}^{\rm f}=0\\
p_{i0}^{\rm g}=p_{0j}^{\rm g}=0\\
\]
To extract the actual alignment from $p$, we follow its path. For
global alignment the path starts in the bottom right hand cell,
$p_{m,n}$ (Figure~\ref{fig:dpm}). If $p_{m,n}^{\rm v} = p_{m,n}^{\rm
e}$, we move to $p_{m-1,n}$ and add $q[m]$ to the nascent aligned
query sequence, $q_{\rm a}$, and a gap, '-', to the nascent aligned
subject sequence, $s_{\rm a}$. If $p_{m,n}^{\rm v}=p_{m,n}^{\rm f}$,
we move to $p_{m,n-1}$ and add a gap to $q_{\rm a}$ and $s[nj]$ to
$s_{\rm a}$. Finally, if $p_{m,n}^{\rm v}=p_{m,n}^{\rm g}$, we move to
$p_{m-1,n-1}$ and add $q[m]$ to $q_{\rm a}$ and $s[n]$ to $s_{\rm
a}$. This is repeated until we reach $p_{0,0}$. Reversing $q_{\rm
a}$ and $s_{\rm a}$ gives the alignment.
In overlap alignments, end gaps are free, so the trace back starts at
the maximum entry in the last row or column of $p$ and proceeds until
the first row or column is reached. Local alignment starts at the
maximum entry anywhere in $p$ and ends where $p_{ij}^{\rm v}=0$.
To summarize, an alignment is computed in three steps, matrix
initialization, filling in, and trace back. The details of each step
depend on alignment type. But the variations are so small that we can
implement all three types in a single package.
! Package \ty{pal} implements data structures, methods, and functions
! for computing and printing optimal pairwise alignments.
\section{Implementation}
The outline of \ty{pal} contains hooks for imports, types, methods,
and functions.
package pal
import (
//<<Imports>>
)
//<<Types>>
//<<Methods>>
//<<Functions>>
\subsection{Data Structure \ty{ScoreMatrix}}
Alignments are based on scoring pairs of residues. So we begin by
declaring the structure \ty{ScoreMatrix}.
! A \ty{ScoreMatrix} stores the scores of residue pairs.
It consists of a set of residues, a two-dimensional matrix of floats,
and a dictionary to map residues to matrix positions.
type ScoreMatrix struct {
res string
mat [][]float64
dic map[byte]int
}
\subsubsection{Function \ty{NewScoreMatrix}}
!Function \ty{NewScoreMatrix} generates a new score matrix from a
!match and a mismatch score.
We construct the residue string, allocate space for the dictionary and
fill it. Then we allocate space for the matrix, fill-in the match
scores, and fill-in the mismatch scores.
func NewScoreMatrix(match, mismatch float64) *ScoreMatrix {
sm := new(ScoreMatrix)
//<<Construct residue string>>
//<<Allocate dictionary>>
//<<Fill-in dictionary>>
//<<Allocate score matrix>>
//<<Fill-in match scores>>
//<<Fill-in mismatch scores>>
return sm
}
The BLOSUM62 matrix in Figure~\ref{fig:scoreMat} contains 24
characters, the BLAST PAM70 matrix also lists \ty{J}. Nucleotide
sequences add one more, \ty{U} for uracil.
sm.res = "ARNDCQEGHIJLKMFPSTWYVBZX*U"
We allocate the map holding the dictionary.
sm.dic = make(map[byte]int)
We go through the characters in the residue string and set the
corresponding index in the score matrix.
for i, r := range sm.res {
sm.dic[byte(r)] = i
}
We allocate the score matrix to hold entries for all residue pairs.
n := len(sm.res)
sm.mat = make([][]float64, n)
for i := 0; i < n; i++ {
sm.mat[i] = make([]float64, n)
}
The main diagonal contains the match score.
for i := 0; i < n; i++ {
sm.mat[i][i] = match
}
The off-diagonal elements contain the mismatch score.
for i := 0; i < n - 1; i++ {
for j := i + 1; j < n; j++ {
sm.mat[i][j] = mismatch
sm.mat[j][i] = mismatch
}
}
\subsubsection{Function \ty{NewByteScoreMatrix}}
!Function \ty{NewByteScoreMatrix} generates a new score matrix for
!all bytes from a match and a mismatch score.
We do this by constructing a new residue string and then reusing the
code for constructing the score matrix for molecular sequences.
func NewByteScoreMatrix(match, mismatch float64) *ScoreMatrix {
sm := new(ScoreMatrix)
//<<Construct byte string>>
//<<Allocate dictionary>>
//<<Fill-in dictionary>>
//<<Allocate score matrix>>
//<<Fill-in match scores>>
//<<Fill-in mismatch scores>>
return sm
}
We construct the residue string from all 256 bytes.
nr := 256
res := make([]byte, nr)
for i := 0; i < nr; i++ {
res[i] = byte(i)
}
sm.res = string(res)
\subsubsection{Function \ty{ReadScores}}
!Function \ty{ReadScores} reads the entries of a \ty{ScoreMatrix} from
!an \ty{io.Reader}.
Recall that Figure~\ref{fig:scoreMat} shows the BLOSUM62 score
matrix. It starts with optional comment lines marked by hash tags,
followed by a row of residues as column headers. Next are rows of
values, each preceded by the residue in question. Columns are
delimited by white space.
For reading scores, we prepare two variables. The boolean
\texttt{first} marks the first line of the table; the slice of byte
slices \texttt{res} holds the residues used as column headers in that
line. Then we scan the input.
func ReadScoreMatrix(r io.Reader) *ScoreMatrix {
s := NewScoreMatrix(1, -1)
first := true
var res [][]byte
sc := bufio.NewScanner(r)
for sc.Scan() {
b := sc.Bytes()
//<<Deal with score table>>
}
return s
}
We import \ty{io} and \ty{bufio}.
As shown in Figure~\ref{fig:scoreMat}, the first line that doesn't
begin with a hash tag is the table header, the others make up its
body.
if b[0] != '#' {
if first {
//<<Deal with header>>
} else {
//<<Deal with body>>
}
}
The header line is split into column headers, and we remember not to
do that again.
res = bytes.Fields(b)
first = false
We import \texttt{bytes}.
The body of the table is split into entries, of which the first is a
residue and the others are scores.
entries := bytes.Fields(b)
for i := 1; i < len(entries); i++ {
r1 := entries[0][0]
r2 := res[i-1][0]
score, err := strconv.ParseFloat(string(entries[i]), 64)
if err != nil {
log.Fatalf("couldn't parse %q\n", entries[i])
}
s.setScore(r1, r2, score)
}
We import \texttt{strconv} and \texttt{log}.
In the private method \ty{setScore}, we take two residues and their
score as input. We make sure the residues are upper case and store the
triple.
func (s *ScoreMatrix) setScore(r1, r2 byte, sc float64) {
//<<Make residues upper case>>
i1, ok1 := s.dic[r1]
i2, ok2 := s.dic[r2]
if !ok1 || !ok2 {
log.Fatalf("can't set score for " +
"residues (%c, %c)", r1, r2)
}
s.mat[i1][i2] = sc
}
If one of the residues is lower case, it is folded to upper.
if r1 >= 97 { r1 -= 32 }
if r2 >= 97 { r2 -= 32 }
\subsubsection{Method \ty{Score}}
!The method \texttt{Score} takes two characters as arguments and
!returns their score. If one of the characters is not in the
!dictionary of residues, it exits with message.
We first check we have a pair of valid residues, then return the
corresponding score.
func (s *ScoreMatrix) Score(r1, r2 byte) float64 {
//<<Make residues upper case>>
i1, ok1 := s.dic[r1]
i2, ok2 := s.dic[r2]
if !ok1 || !ok2 {
log.Fatalf("can't score (%c, %c)", r1, r2)
}
return s.mat[i1][i2]
}
\subsubsection{Method \ty{String}}
!\ty{String} converts the \ty{ScoreMatrix} to a printable string.
We line up the columns of the matrix using a tab writer, to which we
first write the header of the matrix, then its body. When we're done
writing the matrix, we flush the writer and return the underlying
buffer as a string.
func (s *ScoreMatrix) String() string {
b := make([]byte, 0)
buf := bytes.NewBuffer(b)
w := tabwriter.NewWriter(buf, 1, 1, 1, ' ', 0)
//<<Write matrix header>>
//<<Write matrix body>>
w.Flush()
return buf.String()
}
We import \ty{bytes} and \ty{tabwriter}.
The matrix header consists of the characters for which scores
exist. It starts with a blank column.
for _, r := range s.res {
fmt.Fprintf(w, "\t%2c", r)
}
fmt.Fprintf(w, "\n")
We print the matrix body avoiding a trailing newline.
n := len(s.res)
for i := 0; i < n; i++ {
fmt.Fprintf(w, "%c", s.res[i])
for j := 0; j < n; j++ {
fmt.Fprintf(w, "\t%2.3g", s.mat[i][j])
}
if i < n-1 {
fmt.Fprintf(w, "\n")
}
}
\subsection{Data Structure \ty{alignment}}
The three types of alignments, global, overlap, and local
(Figure~\ref{fig:pal}) have much in common. So we construct them by
composition based on a private alignment type, which we later wrap in
exported types for our three kinds of alignments.
All alignments are based on the query and subject to be aligned using
a score scheme consisting of a score matrix, gap opening, and gap
extension. So we declare the alignment type to initially hold these
four elements and add a hook for additional fields.
type alignment struct {
q, s *fasta.Sequence
m *ScoreMatrix
gapO, gapE float64
//<<Alignment fields>>
}
\subsubsection{Method \ty{new}}
We fill in the fields of a new alignment using the method \ty{new},
which takes as parameters the four elements we just declared.
func (a *alignment) new(q, s *fasta.Sequence,
sm *ScoreMatrix,
gapO, gapE float64) {
//<<Construct alignment>>
}
We begin constructing the alignment by storing the five variables
passed.
a.q = q
a.s = s
a.m = sm
a.gapO = gapO
a.gapE = gapE
We also store the query and subject lengths in separate variables. We
declare these.
a.ql = len(q.Data())
a.sl = len(s.Data())
An alignment also contains a dynamic programming matrix, which
consists of cells as described in the Introduction.
We give the programming matrix a getter.
! The Method \ty{ProgrammingMatrix} returns the (m+1) x (n+1) matrix
! used for calculating an alignment of two sequences of lengths m and
! n.
func (a *alignment) ProgrammingMatrix() [][]Cell {
return a.p
}
A cell holds four fields for carrying out the dynamic programming
algorithm. We also add a hook for additional fields.
type Cell struct {
E, F, G, V float64
//<<Cell fields>>
}
We construct the dynamic programming matrix consisting of $(m+1)\times
(n+1)$ cells, where $m$ is the length of the query, $n$ the length of
the subject.
m := len(a.q.Data()) + 1
n := len(a.s.Data()) + 1
a.p = make([][]Cell, m)
for i := 0; i < m; i++ {
a.p[i] = make([]Cell, n)
}
We also need space for the trace back of the alignment. So we allocate
two byte slices, one for the query alignment, the other for the
subject alignment.
We initialize the byte slices during alignmnet construction.
a.qa = make([]byte, 0)
a.sa = make([]byte, 0)
\subsubsection{Method \ty{RawAlignment}}
We also declare a getter for the raw alignment.
!\ty{RawAlignment} returns the aligned query and subject sequences.
func (a *alignment) RawAlignment() (q, s []byte) {
return a.qa, a.sa
}
\subsubsection{Method \ty{String}}
!Method \ty{String} returns the alignment as a string ready for pretty
!printing.
The string returned has two parts, a header and the actual
alignment. They are generated by writing to a byte buffer. An
alignment is terminated by \verb+//+
func (a *alignment) String() string {
var buf []byte
buffer := bytes.NewBuffer(buf)
//<<Write alignment header>>
//<<Write alignment data>>
buffer.Write([]byte("//"))
return buffer.String()
}
We write the header as a table, which is formatted via \ty{tabwriter}.
//<<Construct alignment tabwriter>>
//<<Write alignment header to tabwriter>>
Our \ty{tabwriter} writes to the buffer. Its minimal cell width is 1,
the width of the tab character is zero, and a single blank is added
for padding.
w := new(tabwriter.Writer)
w.Init(buffer, 1, 0, 1, ' ', 0)
We import \ty{tabwriter}.
The header consists of the names of the sequences, their role (query
or subject), their lengths, the alignment score, and the number of
errors. Once we've written them, we
flush the \ty{tabwriter}.
h := a.q.Header()
//<<Write number of query residues>>
h = a.s.Header()
//<<Write number of subject residues>>
fmt.Fprintf(w, "Score\t%g\n", a.score)
//<<Write number of errors>>
w.Flush()
We distinguish a single query residue from other residue counts.
fmt.Fprintf(w, "Query\t%s\t(%d residue", h, a.ql)
if a.ql != 1 { fmt.Fprint(w, "s") }
fmt.Fprint(w, ")\n")
Similarly, we distinguish a single subject residue from other residue
counts.
fmt.Fprintf(w, "Subject\t%s\t(%d residue", h, a.sl)
if a.sl != 1 {
fmt.Fprint(w, "s")
}
fmt.Fprint(w, ")\n")
Errors consist of gaps and mismatches and we again distinguish
singular from plural.
er := a.gaps + a.mismatches
fmt.Fprintf(w, "Error")
if er != 1 { fmt.Fprintf(w, "s") }
fmt.Fprintf(w, "\t%d", er)
fmt.Fprintf(w, " (%d gap", a.gaps)
if a.gaps != 1 { fmt.Fprint(w, "s") }
fmt.Fprintf(w, ", %d mismatch", a.mismatches)
if a.mismatches != 1 { fmt.Fprint(w, "es") }
fmt.Fprint(w, ")\n")
We declare the alignment fields \ty{score}, \ty{gaps}, and
\ty{mismatches}.
score float64
gaps, mismatches int
Having completed the header, we loop over the alignment and format it
in triplets of lines consisting of the two sequences sandwiching a row
of match characters. Figure~\ref{fig:al} shows an example, the
alignment of two short peptides. Pairs of identical residues get
vertical lines, distinct residues with scores greater than zero like
phenylalanine (\ty{F}) and tyrosine (\ty{Y}) a colon, and mismatches
or gaps blanks. We use the same \ty{tabwriter} as for the header.
\begin{figure}
\begin{center}
\begin{tabular}{lll}
Query: 1 & \texttt{MKFLAL-F} & 7\\
& \texttt{||:| | |}\\
Subject: 1 & \texttt{MKYLILLF} & 8
\end{tabular}
\end{center}
\caption{Example alignment with the match characters sandwiched by the
sequences.}\label{fig:al}
\end{figure}
At this point we use the query and subject alignment, \ty{qa} and
\ty{sa}, which we have already declared. In addition, we need to know
the length of a line in the alignment, we declare an alignment field
for it.
We initialize line length to the default line length of
FASTA-formatted sequences.
a.ll = fasta.DefaultLineLength
"github.com/evolbioinf/fasta"
Now we can break up the alignment into lines for printing. Each line
has an end position we compute before we write the residues and match
characters. We also add a hook for the ``global'' variables we need
inside the loop. After the loop we flush the writer.
//<<Declare variables for alignment loop>>
for i := 0; i < len(a.qa); i += a.ll {
//<<Compute line end>>
//<<Write residues and match characters>>
}
w.Flush()
The variable \ty{i} refers to the beginning of the line. Its end is
either the beginning plus line length, or, if fewer residues than
``line length'' remain, the end of the alignment.
if i + a.ll < len(a.qa) {
end = i + a.ll
} else {
end = len(a.qa)
}
We declare the variable end.
As shown in Figure~\ref{fig:al}, the sequences sandwich match
characters and our code reflects this structure. First we write the
query residues, then the match characters, then the subject residues.
//<<Write query>>
//<<Write match characters>>
//<<Write subject>>
We generate a row-length slice of query characters. The slice consists
of residues and gaps, of which we count the residues. From the number
of residues we compute the start and end positions in the underlying
query sequence. If the row contains at least one residue, the left
border of the interval is advanced by one from the previous end,
otherwise it remains unchanged.
data := a.qa[i:end]
nr := len(data) - bytes.Count(data, []byte("-"))
l := qs
if nr > 0 { l++ }
fmt.Fprintf(w, "\n\nQuery\t%d\t%s\t%d\n", l, data, qs+nr)
qs += nr
We declare a variable for the query start, \ty{qs}, which we
initialize to the corresponding alignment field. This is because local
alignments usually don't start at the first sequence position.
We declare the alignment field \ty{qs}.
In contrast to the query and the subject, where we can use the
alignment as given, the match characters still need to be
determined.
k := 0
for j := i; j < end; j++ {
m := ' '
//<<Determine match character>>
matches[k] = m
k++
}
fmt.Fprintf(w, "\t\t%s\n", string(matches[:k]))
The match character is initialized to blank, but it might in fact be a
a pipe for a match or a colon for a mismatch with positive score.
qc := a.qa[j]
sc := a.sa[j]
if qc != '-' && sc != '-' {
if qc == sc {
m = '|'
} else if a.m.Score(qc, sc) > 0 {
m = ':'
}
}
We declare the slice of matches.
matches := make([]rune, a.ll)
Like the query, we write the subject framed by start and end
positions.
data = a.sa[i:end]
nr = len(data) - bytes.Count(data, []byte("-"))
l = ss
if nr > 0 { l++ }
fmt.Fprintf(w, "Subject\t%d\t%s\t%d\n", l, data, ss+nr)
ss += nr
We declare a variable for the subject start, \ty{ss}, which we
initialize to the corresponding alignment field.
We declare the alignment field \ty{ss}.
\subsubsection{Method \ty{reverse}}
A trace back returns sequences that read from right to left, so we
implement an auxiliary method to reverse them. It calls the function
\ty{revStr} on the query and the subject alignments.
func (a *alignment) reverse() {
revStr(a.qa)
revStr(a.sa)
}
In the function \ty{revStr} we move simultaneously from the start and
the end of the slice to the middle swapping a pair of residues at each
step. I took this neat code from \cite[p. 86]{don16:go}.
func revStr(s []byte) {
for i, j := 0, len(s)-1; i < j; i, j = i+1, j-1 {
s[i], s[j] = s[j], s[i]
}
}
\subsubsection{Method \ty{errors}}
The header of an alignment contains the number of mismatches and gaps.
We compute these with another auxiliary method, \ty{errors}. At the
beginning of \ty{errors} we reset the counts of mismatches and
gaps. Then we check that the two alignment slices have the same length
and bail otherwise. Next, we count the errors.
func (a *alignment) errors() {
a.mismatches = 0
a.gaps = 0
if len(a.qa) != len(a.sa) {
log.Fatal("aligned sequences don't " +
"have same length")
}
//<<Count erors>>
}
for i, r1 := range a.qa {
r2 := a.sa[i]
if r1 == '-' || r2 == '-' {
a.gaps++
} else if r1 != r2 {
a.mismatches++
}
}
\subsubsection{Method \ty{SetLineLength}}
!\ty{SetLineLength} sets the length of alignment lines in \ty{String}.
If the length passed is less than one, we make no change.
func (a *alignment) SetLineLength(ll int) {
if ll > 0 {
a.ll = ll
}
}
\subsubsection{Method \ty{Score}}
!\ty{Score} returns the score of the alignment.
func (a *alignment) Score() float64 {
return a.score
}
\subsubsection{Method \ty{Mismatches}}
!\ty{Mismatches} returns the number of mismatches of the alignment.
func (a *alignment) Mismatches() int {
return a.mismatches
}
\subsubsection{Method \ty{Gaps}}
!\ty{Gaps} returns the number of gap characters in the alignment.
func (a *alignment) Gaps() int {
return a.gaps
}
\subsubsection{Method \ty{SetSubjectStart}}
In global/local alignments it's convenient to be able to set the start
of the subject sequence.
!\ty{SetSubjectStart} sets the start of the subject sequence, which is
!useful in global/local alignment.
func (a *alignment) SetSubjectStart(s int) {
a.ss = s
}
\subsubsection{Method \ty{SetSubjectLength}}
In global/local alignments the subject sequence is just a fragment of
the subject being searched. So its convenient to be able to set the
subject length.
!\ty{SetSubjectLength} sets the length of the subject sequence, which
!is useful in global/local alignment.
func (a *alignment) SetSubjectLength(l int) {
a.sl = l
}
\subsubsection{Method \ty{PrintMatrix}}
When teaching bioinformatics, it is often convenient to print the
dynamic programming matrix.
! The method \ty{PrintMatrix} returns a pretty string version of the
! dynamic programming matrix with back pointers. It takes as argument
! the type of printing desired, v, e, f, g for the cell elements, and
! t for trace back.
func (a *alignment) PrintMatrix(t byte) string {
q, s := a.RawAlignment()
str := ""
if t == 'v' || t == 'e' || t == 'f' || t == 'g' {
//<<Print entries>>
} else if t == 't' {
//<<Print trace back>>
} else {
log.Fatalf("pal.PrintMatrix: can't print %c", t)
}
return str
}
To print the entries, we construct a table with one cell for each cell
in the dynamic programming matrix. To align the column of our table,
we use a \ty{tabwriter} to write to the string we return
eventually. Since the first row and the firs column of our table
consist of the query and subject sequences, we extract these, before
we print the matrix header and its first row. Then we print the rest
of the matrix. At the end we flush the \ty{tabwriter}.
buf := new(bytes.Buffer)
w := tabwriter.NewWriter(buf, 0, 0, 1, ' ', tabwriter.AlignRight)
//<<Extract query and subject sequences>>
//<<Print matrix header>>
//<<Print first row of matrix>>
//<<Print rest of matrix>>
w.Flush()
str = buf.String()
We import \ty{tabwriter}.
We extract the query and subject sequences without the gaps of the raw
alignment.
q = a.q.Data()
s = a.s.Data()
The header of the matrix consists of a gap followed by the subject sequence.
fmt.Fprintf(w, "\t-")
for _, c := range s {
fmt.Fprintf(w, "\t%c", c)
}
fmt.Fprint(w, "\t\n")
The first row of the matrix is preceded by a gap followed by the
back pointer and the cell entry.
m := len(q)
n := len(s)
fmt.Fprint(w, "-")
var c byte
for i, j := 0, 0; j <= n; j++ {
v := a.p[i][j].V
//<<Determine back pointer>>
//<<Determine cell entry>>
fmt.Fprintf(w, "\t%c%g", c, v)
}
fmt.Fprint(w, "\t\n")
The back pointer is either vertical, horizontal, or diagonal. If its a
local alignment and the cell entry is zero, it is blank. Also, there
is no step back from the top left cell.
if v == a.p[i][j].E {
c = '^'
} else if v == a.p[i][j].F {
c = '<'
} else {
c = '\\'
}
if a.isLocal && v == 0 { c = ' ' }
if i == 0 && j == 0 { c = ' '}
The cell entry might be something other than $V$.
if t == 'e' {
v = a.p[i][j].E
} else if t == 'f' {
v = a.p[i][j].F
} else if t == 'g' {
v = a.p[i][j].G
}
The remaining rows are prefixed by a query residue.
for i := 1; i <= m; i++ {
fmt.Fprintf(w, "%c", q[i-1])
for j := 0; j <= n; j++ {
v := a.p[i][j].V
//<<Determine back pointer>>
//<<Determine cell entry>>
fmt.Fprintf(w, "\t%c%g", c, v)
}
fmt.Fprint(w, "\t\n")
}
For the trace back we abandon the character-centric approach used for
printing the matrix entries. The idea here is that users might like to
view a trace back in the context of a large matrix. So we just print
the x and y coordinates of the trace back, which can then be viewed
with a plotting tool. The starting point of the trace back is given by
the query and subject start positions.
buf := new(bytes.Buffer)
x := a.ss
y := a.qs
//<<Print x/y coordinates>>
str = buf.String()
We'd print the y coordinates.
for p, c := range q {
fmt.Fprintf(buf, "%d %d\n", x, y)
if c == '-' {
x++
} else if s[p] == '-' {
y++
} else {
x++
y++
}
}
fmt.Fprintf(buf, "%d %d\n", x, y)
\subsection{Data Structure \ty{GlobalAlignment}}
We build a global alignment from our basic alignment type.
!\ty{GlobalAlignment} holds the global alignment of two sequences.
type GlobalAlignment struct {
alignment
}
\subsubsection{Function \ty{NewGlobalAlignment}}
!\ty{NewGlobalAlignment} constructs a new global alignment from a
!query and a subject sequence, and a substitution matrix. To compute
!the actual alignment, call the method \ty{Align}.
We allocate a new global alignment and delegate its construction to the
\ty{new} method inherited from \ty{alignment}.
func NewGlobalAlignment(q, s *fasta.Sequence,
sm *ScoreMatrix,
gapO, gapE float64) *GlobalAlignment {
ga := new(GlobalAlignment)
ga.new(q, s, sm, gapO, gapE)
return ga
}
"github.com/evolbioinf/fasta"
\subsubsection{Align}
!Method \ty{Align} computes the alignment.
We declare a set of
convenient variables, initialize the dynamic programming matrix, fill
it, and carry out the trace back.
func (a *GlobalAlignment) Align() {
//<<Declare convenient variables>>
//<<Initialize global>>
//<<Fill global>>
//<<Trace back global>>
}
We unpack some of the \ty{alignment} data structure into more
convenient variables.
q := a.q.Data()
s := a.s.Data()
m := len(q)
n := len(s)
p := a.p
We initialize according to the recursions in the Introduction.
for i := 1; i <= m; i++ {
p[i][0].E = a.gapO + float64(i-1) * a.gapE
p[i][0].V = p[i][0].E
p[i][0].F = math.Inf(-1)
p[i][0].G = math.Inf(-1)
}
for j := 1; j <= n; j++ {
p[0][j].F = a.gapO + float64(j-1) * a.gapE
p[0][j].V = p[0][j].F
p[0][j].E = math.Inf(-1)
p[0][j].G = math.Inf(-1)
}
We iterate over all cells in the programming matrix and fill them. The
last cell contains the alignment score.
for i := 1; i <= m; i++ {
for j := 1; j <= n; j++ {
//<<Fill in cell>>
}
}
a.score = a.p[m][n].V
We use the recursions stated in the Introduction to assign the four
compound variables of a cell.
p[i][j].E = math.Max(p[i-1][j].E + a.gapE, p[i-1][j].V + a.gapO)
p[i][j].F = math.Max(p[i][j-1].F + a.gapE, p[i][j-1].V + a.gapO)
p[i][j].G = p[i-1][j-1].V + a.m.Score(q[i-1], s[j-1])
p[i][j].V = math.Max(math.Max(p[i][j].E, p[i][j].F), p[i][j].G)
Global trace back begins in the bottom right hand corner of the
dynamic programming matrix (Figure~\ref{fig:dpm}) and ends in its top
right hand corner. Trace
back returns the alignments in right to left notation, so we reverse
them. Then we calculate their errors in the alignment.
i := m
j := n
for i > 0 || j > 0 {
//<<Trace back>>
}
a.reverse()
a.errors()
During each step of the trace back we move either horizontally,
vertically, or diagonally, depending on which of the three neighboring
cells originated the final entry in the current cell. if $p^{\rm
v}_{ij}=p^{\rm e}_{ij}$, we move vertically, if $p^{\rm
v}_{ij}=p^{\rm f}_{ij}$ horizontally, and if $p^{\rm v}_{ij}=p^{\rm
g}_{ij}$ diagonally.
if p[i][j].V == p[i][j].E {
//<<Move vertically>>
} else if p[i][j].V == p[i][j].F {
//<<Move horizontally>>
} else {
//<<Move diagnoally>>
}
Moving vertically implies adding a residue ot the query alignment and
a gap to the subject alignment.
a.qa = append(a.qa, q[i-1])
a.sa = append(a.sa, '-')
i--
Similarly, moving horizontally implies adding a gap to the query
alignment and a residue ot the subject alignment.
a.qa = append(a.qa, '-')
a.sa = append(a.sa, s[j-1])
j--
When moving diagonally, we add a residue to both alignments.
a.qa = append(a.qa, q[i-1])
a.sa = append(a.sa, s[j-1])
i--
j--
\subsection{Data Structure \ty{OverlapAlignment}}
Like the global alignments, we build overlap alignments from our
basic alignment type.
!\ty{OverlapAlignment} holds the overlap alignment of two sequences.
type OverlapAlignment struct {
alignment
}
\subsubsection{Function \ty{NewOverlapAlignment}}
!\ty{NewOverlapAlignment} constructs an overlap alignment from a
!query and a subject sequence, and a substitution matrix. To compute
!the actual alignment, call the method \ty{Align}.
We allocate an overlap alignment and delegate its construction to the
\ty{new} method inherited from \ty{alignment}.
func NewOverlapAlignment(q, s *fasta.Sequence,
sm *ScoreMatrix,
gapO, gapE float64) *OverlapAlignment {
oa := new(OverlapAlignment)
oa.new(q, s, sm, gapO, gapE)
return oa
}
\subsubsection{Method \ty{Align}}
In outline, overlap alignment looks similar to global alignment. In
fact, its table filling procedure is identical and we just recycle the
code from global. The essence of overlap alignment is that end-gaps
are free, so the first column and the first row are initialized to
0. The compiler does this for us.
func (a *OverlapAlignment) Align() {
//<<Declare convenient variables>>
//<<Fill global>>
//<<Trace back overlap>>
}
The trace back starts at $p_{ij}$, where $i$ and $j$ remain to be
determined, and proceeds to the first row or column. We conclude it by
revering the aligned residues and counting the errors they contain.
//<<Find overlap starting point>>
//<<First round of end gaps>>
for i > 0 && j > 0 {
//<<Trace back>>
}
//<<Second round of end gaps>>
a.reverse()
a.errors()
The starting point of the overlap alignment is the maximum entry in
the last row or column.
//<<Scan last row>>
//<<Scan last column>>
When scanning the last row, we set the maximum entry to negative
infinity and determine the column-coordinate, $j$, of the starting
point. Its row coordinate is initialized to the last row, $i=m$.
max := math.Inf(-1)
j := 0
i := m
for k := 0; k <= n; k++ {
if max < p[i][k].V {
max = p[i][k].V
j = k
}
}
If we find a greater entry in the last column, $i$ is set to its row
and $j$ to the last column, $j=n$. We have now also found the score.
for k := 0; k <= m; k++ {
if max < p[k][n].V {
max = p[k][n].V
i = k
j = n
}
}
a.score = p[i][j].V
The first round of end gaps is added either to the subject alignment,
if the maximum was found in the last column, or to the query
alignment, if it was found in the last row.
for k := m; k > i; k-- {
a.qa = append(a.qa, q[k-1])
a.sa = append(a.sa, '-')
}
for k := n; k > j; k-- {
a.qa = append(a.qa, '-')
a.sa = append(a.sa, s[k-1])
}
The second round of end gaps is added either to the subject alignment,
if the trace back ended in the first column, or to the query
alignment, if it ended in the first row.
for k := i; k > 0; k-- {
a.sa = append(a.sa, '-')
a.qa = append(a.qa, q[k-1])
}
for k := j; k > 0; k-- {
a.sa = append(a.sa, s[k-1])
a.qa = append(a.qa, '-')
}
\subsubsection{Method \ty{TrimQuery}}
We might be interested in an alignment that is global in the query and
local in the subject. Such an alignment can be obtained from an overlap
alignment by trimming any flanking gaps the query might contain.
!\ty{TrimQuery} removes gaps flanking the query.
func (a *OverlapAlignment) TrimQuery() {
if len(a.qa) != len(a.sa) {
log.Fatal("can't trim alignments " +
"of unequal length")
}
//<<Trim from left>>
//<<Trim from right>>
}
While trimming from the left, we increment the subject start and
decrement gap count.
for a.qa[0] == '-' {
a.qa = a.qa[1:]
a.sa = a.sa[1:]
a.ss++
a.gaps--
}
We also trim from the right and decrement the gap count.
for a.qa[len(a.qa)-1] == '-' {
a.qa = a.qa[:len(a.qa)-1]
a.sa = a.sa[:len(a.sa)-1]
a.gaps--
}
\subsection{Data Structure \ty{LocalAlignment}}
Like global and overlap alignments, we build local alignments from our
basic alignment type.
!\ty{LocalAlignment} holds the local alignment of two sequences.
type LocalAlignment struct {
alignment
}
\subsubsection{Function \ty{NewLocalAlignment}}
!\ty{NewLocalAlignment} constructs a local alignment from a
!query and a subject sequence, and a substitution matrix. To compute
!the actual alignment, call the method \ty{Align}.
We allocate a local alignment and delegate its construction to the
\ty{new} method inherited from \ty{alignment}.
func NewLocalAlignment(q, s *fasta.Sequence,
sm *ScoreMatrix,
gapO, gapE float64) *LocalAlignment {
la := new(LocalAlignment)
la.new(q, s, sm, gapO, gapE)
la.isLocal = true
return la
}
We declare the alignment field \ty{isLocal}.
\subsubsection{Method \ty{Align}}
Local alignment is the most general of the three alignment types. The
most important deviation from global and overlap is that we can
extract more than one local alignment from the dynamic programming
matrix.
!\ty{Align} computes the next alignment in the dynamic programming
!matrix. It returns \ty{false} if none was found.
On the first call to \ty{Align}, we prepare the first alignment. On
the second call, we prepare all subsequent alignments. After the
preparation we try to find the next alignment and report whether or
not we were successful.
func (a *LocalAlignment) Align() bool {
//<<Declare convenient variables>>
a.count++
if a.count == 1 {
//<<Prepare first alignment>>
} else if a.count == 2 {
//<<Prepare subsequent alignments>>
}
found := false
//<<Find alignment>>
return found
}
We add the alignment field \ty{count}.
We prepare our first alignment by filling it and storing the optimal
starting point.
//<<Fill local>>
//<<Store optimal starting point>>
We fill the dynamic programming matrix for local alignment.
for i := 1; i <= m; i++ {
for j := 1; j <= n; j++ {
//<<Determine local entry>>
}
}
As explained in the Introduction, in a local alignment the score,
$p^{\rm v}_{ij}$, is formed as the maximum of $p^{\rm e}_{ij}, p^{\rm
f}_{ij}, p^{\rm g}_{ij}$, and zero. Since global alignment maximizes
over the first three of these terms, we can use it here and append the
maximization over zero.
//<<Fill in cell>>
if p[i][j].V < 0 {
p[i][j].V = 0
}
We store the starting point of the optimal alignment in the same
structure we later use to store all possible starting points, a slice
of coordinates. Each coordinate consists of a pair of indexes and a
score.
type coordinate struct {
i, j int
s float64
}
To store the alignment coordinates, we declare a slice of them as the
alignment field \ty{coords}.
We initialize the slice during alignment construction.
a.coords = make([]coordinate, 0)
We look for the maximum entry in the dynamic programming matrix and
store it as the first entry in the coordinates slice. In this search
we exclude the first row and the first column, as their scores are
always zero.
var c coordinate
c.s = math.Inf(-1)
for i := 1; i <= m; i++ {
for j := 1; j <= n; j++ {
if c.s < p[i][j].V {
c.s = p[i][j].V
c.i = i
c.j = j
}
}
}
a.coords = append(a.coords, c)
In preparation of all subsequent alignments, we reset the previous
alignment, store all possible alignment starting points of the new
alignment, and sort them.
//<<Reset previous alignment>>
//<<Store alignment starting points>>
//<<Sort alignment starting points>>
We erase the previous query and subject alignments and their starting
points.
a.qa = a.qa[:0]
a.sa = a.sa[:0]
a.coords = a.coords[:0]
A legitimate local alignment doesn't intersect the paths of previous
alignments. So we add the field \ty{visited} to the cells in the
programming matrix.
We iterate over the alignment cells minus the first row and column,
and store each one that hasn't been visited in the first round as a
coordinate.
var c coordinate
for i := 1; i <= m; i++ {
for j := 1; j <= n; j++ {
if !p[i][j].visited {
c.i = i
c.j = j
c.s = p[i][j].V
a.coords = append(a.coords, c)
}
}
}
When looking for alignments, we traverse the coordinates starting at
the point with the highest score. So we sort the coordinates by
score. For this we declare the type \ty{coordSlice}.
type coordSlice []coordinate
We implement the three methods of the \ty{Sort} interface on
\ty{coordSlice}, \ty{Len}, \ty{Swap}, and \ty{Less}. Note that we sort
in \emph{ascending} order.
func (c coordSlice) Len() int {
return len(c)
}
func (c coordSlice) Swap(i, j int) {
c[i], c[j] = c[j], c[i]
}
func (c coordSlice) Less(i, j int) bool {
return c[i].s > c[j].s
}
Now we can sort the coordinates.
sort.Sort(coordSlice(a.coords))
We iterate over the coordinates and run a trace back from each one. If
the trace back was successful, i. e. we found an alignment, we lop off
the coordinates we've used up and break from the loop.
for k, c := range a.coords {
i := c.i
j := c.j
found = true
//<<Trace back local>>
if found {
a.coords = a.coords[k+1:]
break
}
}
At the beginning of the trace back we set the alignment score and walk
until we reach a zero cell. Each cell is marked as visited, before we
take one step back. Then we check whether we've arrived at a cell that
has previously been visited. After the loop we check whether we've
found an alignment.
a.score = p[i][j].V
for p[i][j].V > 0 {
p[i][j].visited = true
//<<Trace back>>
if p[i][j].visited {
//<<Cell visited>>
}
}
if found {
//<<Alignment found>>
}
A previously visited cell means that we collide, we've found no
alignment, the alignments so far constructed are reset, and we break
from the trace back loop.
found = false
a.qa = a.qa[:0]
a.sa = a.sa[:0]
break
If we've found an alignment, we note its start positions in the query
and the subject, reverse its residues, and count its errors.
a.qs = i
a.ss = j
a.reverse()
a.errors()