This golang package implements Wavefront alignment algorithm (WFA), not BiWFA (maybe in the future).
Implemented features:
- Distance metrics: gap-affine.
- Alignment types: global, semi-global.
- Heuristics: wf-adaptive.
Executable binaries for common operating systems are available to download, for benchmarking with lots of sequences, or for fast aligning two sequences in the command line.
$ wfa-go -g "Bioinformatics helps Biology" "We learn bioinformatics to help biologists"
query ---------Bioinformatics ---helps Biology---
|||||||||||||| |||| | |||||
target We learn bioinformatics to help- biologists
cigar 9I1X14M3I4M1D1M1X5M1X3I
align-score : 32
match-region: q[2, 27]/28 vs t[11, 38]/42
align-length: 29, matches: 24 (82.76%), gaps: 4, gap regions: 2
- I need a fast DNA alignment package in Golang for my project.
- WFA "looks easy" to implement as it does not heavily reply on SIMD intrinsics, though there are some other algorithms performing well in a benchmark.
- I'm not familiar with C++, and I found it difficult to understand the official code. I only found one (in Rust) 3rd party implementation.
- After reading the WFA paper, I thought the algorithm is easy, so I implemented WFA from the scratch.
- Later I found it not easy, there were so many details.
After reading the
next
step in the rust version, I realised it's because I don't know gap-affine penalties. - The
backtrace
step is the most difficult part. In v0.1.0, I used 3 extra components to store the source offsets for each cell in I, D, M components. But it needs more memory. And the speed is also not ideal, 1/20 of official version.- Besides, I checked the bases again when tracing back matches. WFA did this too, but WFA2 did not.
- Next, aftering reading thofficial implementation, I rewrote the whole project, using similar backtrace workfow with WFA2. The speed increased to 1/10 of the official version.
- C++ is wild, it even support accessing list/array elements with negative indexes (the diagonal k).
-
A WFA component is saved as a list of WaveFront
[]*WaveFront
.- Score
s
is the index. - If the value is
nil
, it means the score does not exsit.
- Score
-
A WaveFront is saved as a list of offsets
[]uint32
. We preset the list with a big length (2048) to avoid frequentappend
operations.-
Diagonal
k
is the index. To support negativek
values, we use this layout:index: 0, 1, 2, 3, 4, 5, 6 k: 0, -1, 1, -2, 2, -3, 3
Value
0
means thek
does not exist. -
Offsets are saved with
uint32
integers, with the lower 3 bits for saving 6 possible paths which are used for backtrace.wfaInsertOpen uint32 = iota + 1 wfaInsertExt wfaDeleteOpen wfaDeleteExt wfaMismatch wfaMatch
-
-
Maximum sequence length: 512 Mb,
536,870,911
(1<<(32-3) - 1). -
All objects are saved in object pools for computation efficiency. Just don't forget to recycle them.
Each WFA component can be visualized as a table. A table cell contains the alignment type symbol and the score.
⊕ Unknown,
⟼ Gap open (Insertion)
🠦 Gap extension (Insertion)
↧ Gap open (Deletion)
🠧 Gap extension (Deletion)
⬂ Mismatch
⬊ Match
Global alignment
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ||
---|---|---|---|---|---|---|---|---|---|---|---|
A | G | G | A | T | G | C | T | C | G | ||
1 | A | ⬊ 0 | ⟼ 8 | 🠦10 | 🠦12 | . | . | . | . | . | . |
2 | C | ↧ 8 | ⬂ 4 | ⬂12 | . | . | . | . | . | . | . |
3 | C | 🠧10 | ⬂12 | ⬂ 8 | . | . | . | . | . | . | . |
4 | A | 🠧12 | . | . | ⬊ 8 | . | . | . | . | . | . |
5 | T | . | . | . | . | ⬊ 8 | . | . | . | . | . |
6 | A | . | . | . | . | . | ⬂12 | . | . | . | . |
7 | C | . | . | . | . | . | . | ⬊12 | . | . | . |
8 | T | . | . | . | . | . | . | . | ⬊12 | . | . |
9 | C | . | . | . | . | . | . | . | . | ⬊12 | . |
10 | G | . | . | . | . | . | . | . | . | . | ⬊12 |
CIGAR: 1M2X2M1X4M
query ACCATACTCG
| || ||||
target AGGATGCTCG
align-score : 12
align-region: q[1, 10] vs t[1, 10]
align-length: 10, matches: 7 (70.00%), gaps: 0, gapRegions: 0
Semi-global alignment
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C | A | G | G | C | T | C | C | T | C | G | G | ||
1 | A | ⬂ 4 | ⬊ 0 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 | ⬂ 4 |
2 | C | ⬊ 0 | ⬂ 8 | ⬂ 4 | ⬂ 8 | ⬊ 4 | ⬂ 8 | ⬊ 4 | ⬊ 4 | ⬂ 8 | ⬊ 4 | ⬂ 8 | ⬂ 8 |
3 | G | ⬂ 4 | ⬂ 4 | ⬊ 8 | ⬊ 4 | ⬂12 | ⬂ 8 | ⬂12 | ⬂ 8 | ⬂ 8 | . | ⬊ 4 | ⬊ 8 |
4 | A | ⬂ 4 | ⬊ 4 | ⬂ 8 | ⬂12 | ⬂ 8 | 🠧14 | ⬂12 | 🠧14 | ⬂12 | ⬂12 | ↧12 | ⬂ 8 |
5 | T | ⬂ 4 | ⬂ 8 | ⬂ 8 | ⬂12 | . | ⬊ 8 | . | ⬂16 | ⬊14 | . | 🠧14 | ⬂16 |
6 | C | ⬊ 0 | ⬂ 8 | 🠦10 | ⬂12 | ⬊12 | 🠦18 | ⬊ 8 | ⟼16 | 🠦18 | ⬊14 | 🠧16 | ⬂18 |
7 | T | ⬂ 4 | ⬂ 4 | . | . | . | ⬊12 | ↧16 | ⬂12 | ⬊16 | . | ⬂18 | ⬂20 |
8 | C | ⬊ 0 | ⬂ 8 | ⬂ 8 | 🠦12 | 🠦14 | 🠦16 | ⬊12 | ⬊16 | ⬂16 | ⬊16 | 🠧20 | . |
9 | G | ⬂ 4 | ⬂ 4 | ⬊ 8 | ⬊ 8 | ⬂16 | ⬂18 | ⬂20 | ⬂16 | ⬂20 | ⬂20 | ⬊16 | ⬊20 |
CIGAR: 1I1M1X1M1X1M1I4M1I
query -ACGAT-CTCG-
| | | ||||
target CAGGCTCCTCGG
align-score : 16
align-region: q[1, 9] vs t[2, 11]
align-length: 10, matches: 7 (70.00%), gaps: 1, gapRegions: 1
import "github.com/shenwei356/wfa"
// ------------------[ initialization ]------------------
// aligner
algn := wfa.New(
&wfa.Penalties{
Mismatch: 4,
GapOpen: 6,
GapExt: 2,
},
&wfa.Options{
GlobalAlignment: false,
})
// set adaptive reduction parameters
algn.AdaptiveReduction(&wfa.AdaptiveReductionOption{
MinWFLen: 10,
MaxDistDiff: 50,
CutoffStep: 1,
})
// ------------------[ align one pair of seqs ]------------------
q := []byte("ACCATACTCG")
t := []byte("AGGATGCTCG")
// align
result, err := algn.Align(q, t)
checkErr(err)
// score table of M
algn.Plot(&q, &t, os.Stdout, algn.M, true)
if outputAlignment {
fmt.Println()
fmt.Printf("CIGAR: %s\n", result.CIGAR())
Q, A, T := result.AlignmentText(&q, &t)
fmt.Printf("query %s\n", *Q)
fmt.Printf(" %s\n", *A)
fmt.Printf("target %s\n", *T)
fmt.Println()
fmt.Printf("align-score : %d\n", result.Score)
fmt.Printf("align-region: q[%d, %d] vs t[%d, %d]\n",
result.QBegin, result.QEnd, result.TBegin, result.TEnd)
fmt.Printf("align-length: %d, matches: %d (%.2f%%), gaps: %d, gapRegions: %d\n",
result.AlignLen, result.Matches, float64(result.Matches)/float64(result.AlignLen)*100,
result.Gaps, result.GapRegions)
fmt.Println()
wfa.RecycleAlignmentText(Q, A, T) // !! important, recycle objects
}
wfa.RecycleAlignmentResult(result) // !! important, recycle objects
// ------------------[ clean ]------------------
wfa.RecycleAligner(algn) // !! important, recycle objects
A CLI (download binaries) is available to align two sequences from either positional arguments or an input file (format).
Example
Fast alignment.
$ wfa-go AGCTAGTGTCAATGGCTACTTTTCAGGTCCT AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
query AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
| ||| ||||| |||||||| | |||||||||
target AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
cigar 1M1X3M1I5M2X8M3I1M1X9M
align-score : 36
match-region: q[1, 31]/31 vs t[1, 35]/35
align-length: 35, matches: 27 (77.14%), gaps: 4, gap regions: 2
From a input file, for benchmark.
$ wfa-go -i seqs.txt
query A-TTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
|||||||||||||| ||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||| ||||||||||||
target GATTGGAAAATAGGAT-GGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTT-GTCGTCCTTACGTTTCCGGAAGGGAGTGGTT-GCTCGAAGCCCA
cigar 1X1I14M1D39M1D31M1D12M
align-score : 36
match-region: q[2, 100]/100 vs t[3, 98]/98
align-length: 99, matches: 96 (96.97%), gaps: 3, gap regions: 3
Usage
WFA alignment in Golang
Author: Wei Shen <shenwei356@gmail.com>
Code: https://github.com/shenwei356/wfa
Version: v0.2.0
Input file format:
see https://github.com/smarco/WFA-paper?tab=readme-ov-file#41-introduction-to-benchmarking-wfa-simple-tests
Example:
>ATTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
<GATTGGAAAATAGGATGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTGCTCGAAGCCCA
>CCGTAGAGTTAGACACTCGACCGTGGTGAATCCGCGACCACCGCTTTGACGGGCGCTCTACGGTATCCCGCGATTTGTGTACGTGAAGCAGTGATTAAAC
<CCTAGAGTTAGACACTCGACCGTGGTGAATCCGCGATCTACCGCTTTGACGGGCGCTCTACGGTATCCCGCGATTTGTGTACGTGAAGCGAGTGATTAAAC
Usage:
1. Align two sequences from the positional arguments.
wfa-go [options] <query seq> <target seq>
2. Align sequence pairs from the input file (described above).
wfa-go [options] -i input.txt
Options/Flags:
-N do not output alignment (for benchmark)
-a do not use adaptive reduction
-g do not use global alignment
-h print help message
-i string
input file.
-m mem pprof. go tool pprof -http=:8080 mem.pprof
-p cpu pprof. go tool pprof -http=:8080 cpu.pprof
Generate datasets with WFA (e634175) or WFA2-lib (v2.3.5):
./bin/generate_dataset -n 100000 -l 1000 -e 0.05 -o l1000-e0.05.seq
./bin/generate_dataset -n 100000 -l 1000 -e 0.10 -o l1000-e0.10.seq
./bin/generate_dataset -n 100000 -l 1000 -e 0.20 -o l1000-e0.20.seq
./bin/generate_dataset -n 500 -l 50000 -e 0.05 -o l50000-e0.05.seq
./bin/generate_dataset -n 500 -l 50000 -e 0.10 -o l50000-e0.10.seq
./bin/generate_dataset -n 500 -l 50000 -e 0.20 -o l50000-e0.20.seq
Commands (all tools use wfa-adaptive heuristic 10,50,1):
# memusg: https://github.com/shenwei356/memusg
# WFA
memusg -t -s "./bin/align_benchmark -i l1000-e0.05.seq -a gap-affine-wfa-adaptive"
# WFA2-lib
memusg -t -s "./bin/align_benchmark -i l1000-e0.05.seq -a gap-affine-wfa --wfa-heuristic wfa-adaptive --wfa-heuristic-parameters 10,50,1"
# WFA-go (this package, binary files are availabe in the release page)
# global alignment && do not output results
memusg -t -s "wfa-go -N -i l1000-e0.05.seq"
csvtk csv2md -t benchmark.tsv -a c,c,c,l,r,r
Results:
Seq-len | Seq-num | Error-rate | Package | Time | Memory |
---|---|---|---|---|---|
1000 | 100000 | 0.05 | WFA1 | 4.523s | 3.21 MB |
WFA2 | 3.597s | 4.90 MB | |||
WFA-go | 15.424s | 9.97 MB | |||
1000 | 100000 | 0.1 | WFA1 | 8.031s | 2.04 MB |
WFA2 | 6.973s | 6.55 MB | |||
WFA-go | 41.790s | 12.06 MB | |||
1000 | 100000 | 0.2 | WFA1 | 15.538s | 4.04 MB |
WFA2 | 13.450s | 9.24 MB | |||
WFA-go | 1m:51s | 10.22 MB | |||
50000 | 500 | 0.05 | WFA1 | 2.180s | 56.12 MB |
WFA2 | 1.481s | 107.76 MB | |||
WFA-go | 6.107s | 86.61 MB | |||
50000 | 500 | 0.1 | WFA1 | 4.144s | 55.3 MB |
WFA2 | 3.296s | 190.32 MB | |||
WFA-go | 17.908s | 174.62 MB | |||
50000 | 500 | 0.2 | WFA1 | 7.574s | 81.15 MB |
WFA2 | 6.842s | 314.08 MB | |||
WFA-go | 48.122s | 275.80 MB |
Run in a laptop PC, with single-thread.
- Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, Antonio Espinosa. "Fast gap-affine pairwise alignment using the wavefront algorithm." Bioinformatics, 2020.
- Santiago Marco-Sola, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.
- https://github.com/smarco/WFA-paper/
- https://github.com/smarco/WFA2-lib
- https://github.com/rchikhi/rust-alignbench
Please open an issue to report bugs, propose new functions or ask for help.