\section{Introduction}
Programs like BLAST are based on exact matching of potentially large
sets of words, or \emph{patterns}, in a long \emph{text}. Such set
matching is often done using keyword trees~\cite{aho75:eff}. Say we'd
like to match the five patterns
\begin{center}
\begin{tabular}{ll}
$p_1$ & \texttt{ATTT}\\
$p_2$ & \texttt{ATTC}\\
$p_3$ & \texttt{AT}\\
$p_4$ & \texttt{TG}\\
$p_5$ & \texttt{TT}
\end{tabular}
\end{center}
The corresponding keyword tree consists of nodes, edges, and their
labels. It is constructed by inserting each pattern into a growing
intermediate tree. We begin with the first pattern, draw a root, node
1 in Figure~\ref{fig:kt1}A, and a node for each character in
$p_1$. Except for the root, each node, $v$, thus has one incoming edge
labeled with a character, $c=\mbox{in}(v)$.
The last node we add is labeled $p_1$ to mark that its path label,
that is, the concatenated characters of the path from the root to node
5, corresponds to $p_1$. Next, we add $p_2$ by matching it from the
root into the tree. The prefix \texttt{ATT} is found and extended at
node 4 by \texttt{C} (Figure~\ref{fig:kt1}B). Similarly, $p_3$ is
inserted, but stops at internal node 3, which is labeled $p_3$
(Figure~\ref{fig:kt1}C). The next patter, $p_4$, cannot be matched
into the tree, so it branches off at the root
(Figure~\ref{fig:kt1}D). This branch is branched at node 7 by the
addition of the last pattern, $p_5$ (Figure~\ref{fig:kt1}E).
Now each pattern is contained in the tree, and common prefixes such as
\texttt{ATT} are summarized into unique path labels. This compression
of common prefixes is the starting point for simultaneously searching
for all five patterns in a text like
\[
T=\texttt{ATGATTC}
\]
\begin{figure}
\begin{center}
\resizebox{\textwidth}{!}{
\begin{tabular}{ccccc}
\textbf{A} & \textbf{B} & \textbf{C} & \textbf{D} & \textbf{E}\\
\input{kt1a} & \input{kt1b} & \input{kt1c} & \input{kt1d} & \input{kt1e}
\end{tabular}
}
\end{center}
\caption{Stepwise construction of a keyword tree by adding
$p_1=\texttt{ATTT}$ (\textbf{A}), $p_2=\texttt{ATTC}$
(\textbf{B}), $p_3=\texttt{AT}$ (\textbf{C}), $p_4=\texttt{TG}$
(\textbf{D}), and $p_5=\texttt{TT}$ (\textbf{E}).}\label{fig:kt1}
\end{figure}
Algorithm~\ref{alg:sm1} summarizes the steps for doing this. Matching
begins at the first position in $T$, and at the root of the tree
(lines 1 \& 2). While there is an edge $v\rightarrow v'$ labeled
$T[j]$ (line 4), check whether node $v'$ is labeled $p_i$ (line 5). If
so, output that $p_i$ was found (line 6). The match to $p_i$ ends at
$j$, but it is customary to report its start rather than its end. The
start is at $j-|p_i|+1$. Then move $v$ to $v'$ in the tree and advance
to the next character in the text (lines 8 \& 9). With our example
tree and text, we'd match \texttt{AT} and report finding $p_3$.
\begin{algorithm}
\caption{Provisional set matching
algorithm~\cite{aho75:eff}.}\label{alg:sm1}
\begin{algorithmic}[1]
\input{ahoCor1}
\end{algorithmic}
\end{algorithm}
We've now run out of matches, at which point we might be tempted to
give up and return to the root. However, the $\texttt{T}$ we last
matched can be extended to possibly discover the patterns that start
with a \texttt{T}, $p_4$ or $p_5$. Returning from node 3 to the root
would require matching that \texttt{T} a second time. To avoid this,
every node, $v$, has a \emph{failure link} connecting it to the node
with the longest path label that is a suffix of the path label of
$v$. The failure links of the root and its children are initialized to
the root. The remaining failure links are added in a breadth-first
traversal. By default, the failure link of any other node, $v$, is
also the root. However, a more favorable failure link might be found
by visiting climbing the chain of failure links of $v$'s parent. The
first that has a child matching $v$ becomes the target of $v$'s
failure link In our example, the failure link of node 3 with path
label \texttt{AT} is node 7 with path label \texttt{T}. If no match is
found, the procedure is repeated until the root is reached. If no
failure link is found at all, use the root. Figure~\ref{fig:kt2} shows
our example tree with failure links.
\begin{figure}
\begin{center}
\input{kt2}
\end{center}
\caption{The keyword tree in Figure~\ref{fig:kt1}E with failure links.}\label{fig:kt2}
\end{figure}
In this way we find in our example text $p_4$ and $p_2$. Now we see
the utility of the sentinel character, $\$$, which terminates
$T$. Without it, $T$ would end in a match and we'd attempt to access a
character beyond the end of $T$ in line 4.
Another thing to realize at this stage of the algorithm is that by
walking down the path label of node 6 to find $p_2$, we've missed
$p_5$, the path label of node 9. This is fixed by constructing for
each node $v$ an \emph{output set} consisting of all the patterns
found following the failure links from $v$ to the
root. Figure~\ref{fig:kt3} shows Figure~\ref{fig:kt2} with output
sets, one of which, $\{p_1, p_5\}$, contains two patterns. Now,
whenever \texttt{ATT} has been matched, the match to \texttt{TT} is
also detected. Algorithm~\ref{alg:sm2} is derived from
Algorithm~\ref{alg:sm1} by amending the output routine in the
if-clause of lines 5--7.
The application of Algorithm~\ref{alg:sm2} to our example text,
$T=\texttt{ATGATTC}$ gives the following results:
\begin{center}
\begin{tabular}{cl}
\hline
Position & Pattern\\\hline
1 & \texttt{AT}\\
2 & \texttt{TG}\\
4 & \texttt{AT}\\
5 & \texttt{TT}\\
4 & \texttt{ATTC}\\\hline
\end{tabular}
\end{center}
The last three matches start at positions 4, 5, and again 4. So
although we always move forward in $T$, the match positions are not
necessarily ordered. This may confuse the user and we shall therefore
order the match positions before returning them.
\begin{figure}
\begin{center}
\input{kt3}
\end{center}
\caption{The keyword tree in Figure~\ref{fig:kt2} with output sets.}\label{fig:kt3}
\end{figure}
\begin{algorithm}
\caption{Final set matching
algorithm~\cite{aho75:eff}.}\label{alg:sm2}
\begin{algorithmic}[1]
\input{ahoCor2}
\end{algorithmic}
\end{algorithm}
To summarize, we construct a keyword tree by adding patterns, setting
the failure links, and constructing the output sets. This tree can
then be used to efficiently look up the starting positions of all
patters in a text.
\section{Implementation}
!Package \texttt{kt} implements set matching using a keyword tree.
The package outline contains hooks for imports, types, variables,
methods, and functions.
package kt
import (
//<<Imports>>
)
//<<Types>>
//<<Variables>>
//<<Methods>>
//<<Functions>>
We first construct a keyword tree, then use it to search a text.
\subsection{Construction}
A keyword tree consists of labeled nodes and labeled edges, both of
which are denoted in a structure with seven fields:
\begin{enumerate}
\item child node
\item sibling node
\item parent node
\item character on incoming edge
\item depth, that is, the length of the path label starting at the
root
\item output set
\item identifier
\end{enumerate}
type Node struct {
Child, Sib, Parent, Fail *Node
In byte
Depth int
Output []int
Id int
}
\subsection{Function \texttt{NewKeywordTree}}
!Function \texttt{NewKeywordTree} takes as argument a set of patterns
!and returns the root of the tree representing these patterns.
We construct the root, add the patterns, construct the failure links
and output sets, and return the root.
func NewKeywordTree(patterns []string) *Node {
root := new(Node)
for i, pattern := range patterns {
//<<Add pattern>>
}
//<<Construct failure links>>
//<<Construct output sets>>
return root
}
A pattern is added by first finding the matching prefix in the
intermediary tree, then adding the rest.
//<<Match prefix>>
//<<Add rest>>
The prefix is matched simultaneously by moving along the pattern and
into the tree. For each character in the pattern, the matching child
is located in the tree and moved to. Since locating a child is also
used later when constructing failure links, we delegate it to a
method.
v := root
var j int
for j = 0; j < len(pattern); j++ {
c := v.findChild(pattern[j])
if c != nil {
v = c
} else {
break
}
}
To locate a match among the children of node $v$, start at the first
child and move along its siblings.
func (v *Node) findChild(c byte) *Node {
child := v.Child
for child != nil && child.In != c {
child = child.Sib
}
return child
}
To add the rest of the pattern, extend $v$ by one node for every
character left in the pattern. The last node added gets the pattern's
label written into its output set.
for k := j; k < len(pattern); k++ {
child := new(Node)
//<<Initialize child>>
//<<Insert child>>
v = child
}
v.Output = append(v.Output, i)
To initialize a child, set its parent, the label of its incoming edge,
its depth, and its identifier.
child.Parent = v
child.In = pattern[k]
child.Depth = v.Depth + 1
child.Id = nodeId
nodeId++
The variable \texttt{nodeId} is global. We start counting at 1, which
leaves zero as the root's ID.
The \texttt{nodeId} also doubles as the node count, which the user may
want to access.
!\texttt{NodeCount} returns the number of nodes generated so far.
func NodeCount() int {
return nodeId
}
The child is inserted as the latest sibling among its parent's
children.
if v.Child == nil {
v.Child = child
} else {
cp := v.Child
for cp.Sib != nil {
cp = cp.Sib
}
cp.Sib = child
}
We have now added all patterns to the tree, which concludes the first
step in tree construction. In the second step, we add the failure link
to each node.
Failure links are constructed in two steps, initialization and
breadth-first tree traversal.
//<<Initialize failure links>>
BreadthFirst(root, setFailureLink, root)
The initial failure links are those of the root and its children. They
all point to the root.
root.Fail = root
v := root.Child
for v != nil {
v.Fail = root
v = v.Sib
}
Breadth-first tree traversal requires a queue.
A node can be added to the queue or gotten from it.
func (q *queue) add(n *Node) {
*q = append(*q, n)
}
func (q *queue) get() *Node {
n := (*q)[0]
*q = (*q)[1:]
return n
}
The function \texttt{BreadthFirst} takes as arguments the root and a
function applied to every node. This function might in turn have
arguments, which are also part of the argument list of
\texttt{BreadthFirst}. In our case, the function is called
\texttt{setFailureLink} and its argument is the default failure link,
the root. During traversal we first iterate over the siblings and then
find the next child.
func BreadthFirst(v *Node, fn NodeAction, args ...interface{}) {
q := new(queue)
for v != nil {
fn(v, args...)
q.add(v)
v = v.Sib
//<<Iterate over siblings>>
//<<Find next child>>
}
}
Each sibling is visited and added to the queue.
for v != nil {
fn(v, args...)
q.add(v)
v = v.Sib
}
We take nodes from the queue until we find one with a child and move
to it.
for v == nil && len(*q) > 0 {
v = q.get()
v = v.Child
}
A function of type \texttt{NodeAction} takes as argument a node and a
variadic list of empty interfaces.
type NodeAction func(*Node, ...interface{})
In the function \texttt{setFailureLink}, the root passed is retrieved
via reflection. The failure links of nodes on levels zero (the root)
and level one refer to the root. The failure links of all other nodes
require a bit more thought.
func setFailureLink(v *Node, args ...interface{}) {
root := args[0].(*Node)
if v.Depth > 1 {
//<<Search for failure link>>
}
}
A failure link of some node $v$ on level 2 or greater is discovered by
walking up the chain of failure links starting at $v$'s parent and
stopping at the root. At each node we look for a match to one of the
children. As soon as we've found such a match, we've got the failure
link. Without a match, the failure link remains in its default
position, that is, pointed to the root.
v.Fail = root
w := v.Parent
for w.Parent != nil {
w = w.Fail
c := w.findChild(v.In)
if c != nil {
v.Fail = c
break
}
}
The output set is constructed by the same tree traversal used for
constructing failure links, but with \texttt{addOutput} instead of
\texttt{setFailureLink} applied to each node.
BreadthFirst(root, addOutput)
For each node, $v$, the failure links are traversed up to the root and
any output found along the way is added to the output of $v$.
func addOutput(v *Node, args ...interface{}) {
if v.Parent == nil { return }
for w := v.Fail; w.Parent != nil; w = w.Fail{
v.Output = append(v.Output, w.Output...)
}
}
\subsection{Method \texttt{String}}
!\texttt{String} visualizes a keyword tree as plain text.
It is implemented by applying the function \texttt{writeTree} to every node
in the tree.
func (n *Node) String() string {
w := new(bytes.Buffer)
writeTree(n, w)
return(w.String())
}
We import \texttt{bytes}.
The function \texttt{writeTree} writes the tree in a slightly modified
version of the Newick
format\footnote{\texttt{evolution.genetics.washington.edu/phylip/newick\char`_doc.html}}. The
modification is that we label all nodes, not just the leaves as in
phylogenies. As a result, we get the following rules:
\begin{eqnarray*}
\mbox{tree} & \rightarrow & \mbox{children}[\mbox{label}][:\mbox{length}];\\
\mbox{children} & \rightarrow & \left(\mbox{child} \{,\mbox{children}\}\right)\\
\mbox{child} & \rightarrow & \mbox{children}[\mbox{label}][:\mbox{length}]
\end{eqnarray*}
These rules mean that
\begin{itemize}
\item A tree consists a root, indicated by the semicolon, an
optional root label and branch length to the root, and children.
\item Children are placed in paired parentheses and consists of the
first child followed by one or more sets of children separated by
commas.
\item A child may in turn have children. In this case it is an
\emph{internal node} and its children are followed by the child's
optional label and branch length.
\end{itemize}
In other words, elements in curly brackets may appear zero, one, or
more than one times, and elements in square brackets may or may no
appear once. The other punctuation marks, colon, comma, semicolon, and
parentheses, are syntactically significant. The syntax also allows for
comments, which are placed in square brackets.
Consider our example tree in Figure~\ref{fig:kt3}. Its Newick
representation might look like
\begin{verbatim}
((((5[T->9{1,5}],6[C->1{2}])4[T->9{5}])3[T->7{3}])2[A->1],
(8[G->1{4}],9[T->7{5}])7[T->1])1[->1];
\end{verbatim}
Here we have added comments to each node, for example
\begin{verbatim}
T->9{1,5}
\end{verbatim}
to node 5. This means node 5 has an incoming link labeled \texttt{T},
a failure link pointing to node 9, and an output set containing $p_1$
and $p_5$.
To implement the four rewrite rules of the Newick format, we traverse
the tree and ask three questions about each node, $v$, one for each
rule:
\begin{enumerate}
\item Is $v$ not a first child?
\item Is $v$ an internal node?
\item Is $v$ the root?
\end{enumerate}
func writeTree(v *Node, w *bytes.Buffer) {
if v == nil { return }
//<<Is $v$ not a first child?>>
//<<Is $v$ an internal node?>>
//<<Is $v$ the root?>>
}
A node is a first child, if its identifier is equal to the identifier
of it's parent's first child. If that isn't the case, it must be one
of the subsequent children, which are separated by a comma.
if v.Parent != nil && v.Parent.Child.Id != v.Id {
fmt.Fprint(w, ",")
}
If $v$ does have at least one child, it is an internal node. In this
case the subtree rooted on $v$ is placed in parentheses. We also label
the current node.
if v.Child != nil {
fmt.Fprint(w, "(")
}
writeTree(v.Child, w)
label(v, w)
writeTree(v.Sib, w)
if v.Parent != nil && v.Sib == nil {
fmt.Fprint(w, ")")
}
If $v$ has no parent, it's the root marked by a semicolon.
if v.Parent == nil {
fmt.Fprint(w, ";")
}
The label of a node consists of its identifier using one-based
counting, plus a comments section. The comment section contains the
character labeling the incoming branch, the failure link, and the
output set. The root doesn't have an incoming branch.
func label(v *Node, w *bytes.Buffer) {
fmt.Fprintf(w, "%d[", v.Id+1)
if v.Parent != nil {
fmt.Fprintf(w, "%c", v.In)
}
fmt.Fprintf(w, "->%d", v.Fail.Id+1)
//<<Write output set>>
fmt.Fprintf(w, "]")
}
Empty output sets are ignored.
if len(v.Output) > 0 {
fmt.Fprintf(w, "{%d", v.Output[0]+1)
for i := 1; i < len(v.Output); i++ {
fmt.Fprintf(w, ",%d", v.Output[i]+1)
}
fmt.Fprintf(w, "}")
}
\subsection{Searching}
!\texttt{Search} takes as input a text and the patterns, and returns
!the positions of all patterns in the text.
We do this by following Algorithm~\ref{alg:sm2}, where the text is
traversed using two kinds of actions: waking into the tree, and
following failure links. Before returning the matches, we sort them.
func (root *Node) Search(t []byte, p []string) []Match {
//<<Prepare search>>
v := root
j := 0
for j < len(t) - 1 {
//<<Walk into tree>>
//<<Follow failure link>>
}
sort.Sort(MatchSlice(matches))
return matches
}
By way of preparation, we first append the null character to $T$,
where it serves as sentinel that ensures $T$ doesn't end in a
match. In addition, we prepare variables for storing the matches.
t = append(t, 0)
matches := make([]Match, 0)
var match Match
A match consists of a position and a pattern identifier.
type Match struct {
Position, Pattern int
}
To sort matches, we also declare the type \texttt{MatchSlice},
and implement the sorting interface, where we first sort by position,
then by pattern identifier. The secondary sorting by pattern
identifier ensures identical output across runs.
func (m MatchSlice) Len() int {
return len(m)
}
func (m MatchSlice) Less(i, j int) bool {
if m[i].Position == m[j].Position {
return m[i].Pattern < m[j].Pattern
}
return m[i].Position < m[j].Position
}
func (m MatchSlice) Swap(i, j int) {
m[i], m[j] = m[j], m[i]
}
While walking into the tree, we store any match found on the
way.
for c := v.findChild(t[j]); c != nil; c = v.findChild(t[j]) {
if len(c.Output) > 0 {
//<<Store matches>>
}
v = c
j++
}
A node might refer to several matches.
for _, o := range c.Output {
match.Position = j - len(p[o]) + 1
match.Pattern = o
matches = append(matches, match)
}
When we run out of matches, follow the failure link, unless we're at
the root, in which case we advance by one character.
if v.Parent == nil {
j++
} else {
v = v.Fail
}
\section*{Testing}
The testing framework contains hooks for imports and functions.
package kt
import (
"testing"
//<<Testing imports>>
)
//<<Testing functions>>
There are two testing functions and an auxiliary function for
constructing a keyword tree.
func constructKt() (*Node, []string) {
//<<Construct test tree>>
}
func TestConstruction(t *testing.T) {
//<<Test construction>>
}
func TestSearching(t *testing.T) {
//<<Test searching>>
}
The tree is built from the five patterns used throughout, for example
in Figure~\ref{fig:kt1}E.
var p []string
p = append(p, "ATTT")
p = append(p, "ATTC")
p = append(p, "AT")
p = append(p, "TG")
p = append(p, "TT")
tree := NewKeywordTree(p)
return tree, p
To test tree construction, we compare the two textual representations
of the tree we get with the precomputed result we want.
tree, _ := constructKt()
get := tree.String() + "\n"
fn := "constr.txt"
want, err := ioutil.ReadFile(fn)
if err != nil {
t.Errorf("couldn't open %q", fn)
}
if !bytes.Equal([]byte(get), want) {
t.Errorf("get:\n%s\nwant:\n%s\n", get, want)
}
We import \texttt{ioutil} and \texttt{bytes}.
When searching, we read a sequence and search it with our prefab
keyword tree, print the matches to a buffer and compare the results we
get to the results we want.
//<<Read sequence>>
//<<Search sequence>>
//<<Print matches to buffer>>
//<<Compare matches>>
We read an input sequence.
fn := "test.fasta"
file, err := os.Open(fn)
if err != nil {
t.Errorf("couldn't open %q\n", fn)
}
scanner := fasta.NewScanner(file)
if !scanner.ScanSequence() {
t.Errorf("%q doesn't contain a sequence\n", fn)
}
seq := scanner.Sequence()
We import \texttt{os} and \texttt{fasta}.
"os"
"github.com/evolbioinf/fasta"
We construct the tree from the patterns and search the text.
tree, p := constructKt()
matches := tree.Search(seq.Data(), p)
We iterate over the matches to print them to a buffer.
buf := new(bytes.Buffer)
for _, match := range matches {
fmt.Fprintf(buf, "%s:%d\n", p[match.Pattern],
match.Position+1)
}
get := buf.Bytes()
The results we want are stored in \texttt{search.txt}.
fn = "search.txt"
want, err := ioutil.ReadFile(fn)
if err != nil {
t.Errorf("couldn't read %q\n", fn)
}
if !bytes.Equal(want, get) {
t.Errorf("want:\n%s\nget:\n%s\n", want, get)
}