Wavefront alignment algorithm (WFA) in Golang

This golang package implements Wavefront alignment algorithm (WFA), not BiWFA (maybe in the future).

Implemented features:

Distance metrics: gap-affine.
Alignment types: global, semi-global.
Heuristics: wf-adaptive.

Executable binaries for common operating systems are available to download, for benchmarking with lots of sequences, or for fast aligning two sequences in the command line.

$ wfa-go -g "Bioinformatics helps Biology" "We learn bioinformatics to help biologists"
query   ---------Bioinformatics ---helps Biology---
                  ||||||||||||||   |||| | |||||
target  We learn bioinformatics to help- biologists
cigar   9I1X14M3I4M1D1M1X5M1X3I

align-score : 32
match-region: q[2, 27]/28 vs t[11, 38]/42
align-length: 29, matches: 24 (82.76%), gaps: 4, gap regions: 2

History

I need a fast DNA alignment package in Golang for my project.
WFA "looks easy" to implement as it does not heavily reply on SIMD intrinsics, though there are some other algorithms performing well in a benchmark.
I'm not familiar with C++, and I found it difficult to understand the official code. I only found one (in Rust) 3rd party implementation.
After reading the WFA paper, I thought the algorithm is easy, so I implemented WFA from the scratch.
Later I found it not easy, there were so many details. After reading the next step in the rust version, I realised it's because I don't know gap-affine penalties.
The backtrace step is the most difficult part. In v0.1.0, I used 3 extra components to store the source offsets for each cell in I, D, M components. But it needs more memory. And the speed is also not ideal, 1/20 of official version.
- Besides, I checked the bases again when tracing back matches. WFA did this too, but WFA2 did not.
Next, aftering reading thofficial implementation, I rewrote the whole project, using similar backtrace workfow with WFA2. The speed increased to 1/10 of the official version.
C++ is wild, it even support accessing list/array elements with negative indexes (the diagonal k).

Details

A WFA component is saved as a list of WaveFront []*WaveFront.
- Score s is the index.
- If the value is nil, it means the score does not exsit.
A WaveFront is saved as a list of offsets []uint32. We preset the list with a big length (2048) to avoid frequent append operations.
- Diagonal k is the index. To support negative k values, we use this layout:
```
index: 0,  1,  2,  3,  4,  5,  6
k:     0, -1,  1, -2,  2, -3,  3
```
  Value 0 means the k does not exist.
- Offsets are saved with uint32 integers, with the lower 3 bits for saving 6 possible paths which are used for backtrace.
```
wfaInsertOpen uint32 = iota + 1
wfaInsertExt
wfaDeleteOpen
wfaDeleteExt
wfaMismatch
wfaMatch
```
Maximum sequence length: 512 Mb, 536,870,911 (1<<(32-3) - 1).
All objects are saved in object pools for computation efficiency. Just don't forget to recycle them.

Examples

Each WFA component can be visualized as a table. A table cell contains the alignment type symbol and the score.

⊕    Unknown,
⟼    Gap open (Insertion)
🠦    Gap extension (Insertion)
↧    Gap open (Deletion)
🠧    Gap extension (Deletion)
⬂    Mismatch
⬊    Match

Global alignment

		1	2	3	4	5	6	7	8	9	10
		A	G	G	A	T	G	C	T	C	G
1	A	⬊ 0	⟼ 8	🠦10	🠦12	.	.	.	.	.	.
2	C	↧ 8	⬂ 4	⬂12	.	.	.	.	.	.	.
3	C	🠧10	⬂12	⬂ 8	.	.	.	.	.	.	.
4	A	🠧12	.	.	⬊ 8	.	.	.	.	.	.
5	T	.	.	.	.	⬊ 8	.	.	.	.	.
6	A	.	.	.	.	.	⬂12	.	.	.	.
7	C	.	.	.	.	.	.	⬊12	.	.	.
8	T	.	.	.	.	.	.	.	⬊12	.	.
9	C	.	.	.	.	.	.	.	.	⬊12	.
10	G	.	.	.	.	.	.	.	.	.	⬊12

CIGAR:  1M2X2M1X4M
query   ACCATACTCG
        |  || ||||
target  AGGATGCTCG

align-score : 12
align-region: q[1, 10] vs t[1, 10]
align-length: 10, matches: 7 (70.00%), gaps: 0, gapRegions: 0

Semi-global alignment

		1	2	3	4	5	6	7	8	9	10	11	12
		C	A	G	G	C	T	C	C	T	C	G	G
1	A	⬂ 4	⬊ 0	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4	⬂ 4
2	C	⬊ 0	⬂ 8	⬂ 4	⬂ 8	⬊ 4	⬂ 8	⬊ 4	⬊ 4	⬂ 8	⬊ 4	⬂ 8	⬂ 8
3	G	⬂ 4	⬂ 4	⬊ 8	⬊ 4	⬂12	⬂ 8	⬂12	⬂ 8	⬂ 8	.	⬊ 4	⬊ 8
4	A	⬂ 4	⬊ 4	⬂ 8	⬂12	⬂ 8	🠧14	⬂12	🠧14	⬂12	⬂12	↧12	⬂ 8
5	T	⬂ 4	⬂ 8	⬂ 8	⬂12	.	⬊ 8	.	⬂16	⬊14	.	🠧14	⬂16
6	C	⬊ 0	⬂ 8	🠦10	⬂12	⬊12	🠦18	⬊ 8	⟼16	🠦18	⬊14	🠧16	⬂18
7	T	⬂ 4	⬂ 4	.	.	.	⬊12	↧16	⬂12	⬊16	.	⬂18	⬂20
8	C	⬊ 0	⬂ 8	⬂ 8	🠦12	🠦14	🠦16	⬊12	⬊16	⬂16	⬊16	🠧20	.
9	G	⬂ 4	⬂ 4	⬊ 8	⬊ 8	⬂16	⬂18	⬂20	⬂16	⬂20	⬂20	⬊16	⬊20

CIGAR:  1I1M1X1M1X1M1I4M1I
query   -ACGAT-CTCG-
         | | | ||||
target  CAGGCTCCTCGG

align-score : 16
align-region: q[1, 9] vs t[2, 11]
align-length: 10, matches: 7 (70.00%), gaps: 1, gapRegions: 1

Usages

import "github.com/shenwei356/wfa"

// ------------------[ initialization ]------------------

// aligner
algn := wfa.New(
    &wfa.Penalties{
        Mismatch: 4,
        GapOpen:  6,
        GapExt:   2,
    },
    &wfa.Options{
        GlobalAlignment: false,
    })

// set adaptive reduction parameters
algn.AdaptiveReduction(&wfa.AdaptiveReductionOption{
    MinWFLen:    10,
    MaxDistDiff: 50,
    CutoffStep:  1,
})

// ------------------[ align one pair of seqs ]------------------

q := []byte("ACCATACTCG")
t := []byte("AGGATGCTCG")

// align
result, err := algn.Align(q, t)
checkErr(err)

// score table of M
algn.Plot(&q, &t, os.Stdout, algn.M, true)

if outputAlignment {
    fmt.Println()
    fmt.Printf("CIGAR:  %s\n", result.CIGAR())

    Q, A, T := result.AlignmentText(&q, &t)
    fmt.Printf("query   %s\n", *Q)
    fmt.Printf("        %s\n", *A)
    fmt.Printf("target  %s\n", *T)

    fmt.Println()
    fmt.Printf("align-score : %d\n", result.Score)
    fmt.Printf("align-region: q[%d, %d] vs t[%d, %d]\n",
        result.QBegin, result.QEnd, result.TBegin, result.TEnd)
    fmt.Printf("align-length: %d, matches: %d (%.2f%%), gaps: %d, gapRegions: %d\n",
        result.AlignLen, result.Matches, float64(result.Matches)/float64(result.AlignLen)*100,
        result.Gaps, result.GapRegions)
    fmt.Println()

    
    wfa.RecycleAlignmentText(Q, A, T) // !! important, recycle objects
}

wfa.RecycleAlignmentResult(result) // !! important, recycle objects

// ------------------[ clean ]------------------

wfa.RecycleAligner(algn) // !! important, recycle objects

CLI

A CLI (download binaries) is available to align two sequences from either positional arguments or an input file (format).

Example

Fast alignment.

$ wfa-go AGCTAGTGTCAATGGCTACTTTTCAGGTCCT AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
query   AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
        | ||| |||||  ||||||||   | |||||||||
target  AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
cigar   1M1X3M1I5M2X8M3I1M1X9M

align-score : 36
match-region: q[1, 31]/31 vs t[1, 35]/35
align-length: 35, matches: 27 (77.14%), gaps: 4, gap regions: 2

From a input file, for benchmark.

$ wfa-go -i seqs.txt
query   A-TTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
          |||||||||||||| ||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||| ||||||||||||
target  GATTGGAAAATAGGAT-GGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTT-GTCGTCCTTACGTTTCCGGAAGGGAGTGGTT-GCTCGAAGCCCA
cigar   1X1I14M1D39M1D31M1D12M

align-score : 36
match-region: q[2, 100]/100 vs t[3, 98]/98
align-length: 99, matches: 96 (96.97%), gaps: 3, gap regions: 3

Usage

WFA alignment in Golang

 Author: Wei Shen <shenwei356@gmail.com>
   Code: https://github.com/shenwei356/wfa
Version: v0.2.0

Input file format:
  see https://github.com/smarco/WFA-paper?tab=readme-ov-file#41-introduction-to-benchmarking-wfa-simple-tests
  Example:
  >ATTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
  <GATTGGAAAATAGGATGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTGCTCGAAGCCCA
  >CCGTAGAGTTAGACACTCGACCGTGGTGAATCCGCGACCACCGCTTTGACGGGCGCTCTACGGTATCCCGCGATTTGTGTACGTGAAGCAGTGATTAAAC
  <CCTAGAGTTAGACACTCGACCGTGGTGAATCCGCGATCTACCGCTTTGACGGGCGCTCTACGGTATCCCGCGATTTGTGTACGTGAAGCGAGTGATTAAAC

Usage:
  1. Align two sequences from the positional arguments.

        wfa-go [options] <query seq> <target seq>

  2. Align sequence pairs from the input file (described above).

        wfa-go [options] -i input.txt

Options/Flags:
  -N    do not output alignment (for benchmark)
  -a    do not use adaptive reduction
  -g    do not use global alignment
  -h    print help message
  -i string
        input file.
  -m    mem pprof. go tool pprof -http=:8080 mem.pprof
  -p    cpu pprof. go tool pprof -http=:8080 cpu.pprof

Benchmark

Generate datasets with WFA (e634175) or WFA2-lib (v2.3.5):

./bin/generate_dataset -n 100000 -l 1000 -e 0.05 -o l1000-e0.05.seq
./bin/generate_dataset -n 100000 -l 1000 -e 0.10 -o l1000-e0.10.seq
./bin/generate_dataset -n 100000 -l 1000 -e 0.20 -o l1000-e0.20.seq

./bin/generate_dataset -n 500 -l 50000 -e 0.05 -o l50000-e0.05.seq
./bin/generate_dataset -n 500 -l 50000 -e 0.10 -o l50000-e0.10.seq
./bin/generate_dataset -n 500 -l 50000 -e 0.20 -o l50000-e0.20.seq

Commands (all tools use wfa-adaptive heuristic 10,50,1):

# memusg: https://github.com/shenwei356/memusg

# WFA
memusg -t -s "./bin/align_benchmark -i l1000-e0.05.seq -a gap-affine-wfa-adaptive"

# WFA2-lib
memusg -t -s "./bin/align_benchmark -i l1000-e0.05.seq -a gap-affine-wfa --wfa-heuristic wfa-adaptive --wfa-heuristic-parameters 10,50,1"

# WFA-go (this package, binary files are availabe in the release page)
# global alignment && do not output results
memusg -t -s "wfa-go -N -i l1000-e0.05.seq"

csvtk csv2md -t benchmark.tsv -a c,c,c,l,r,r

Results:

Seq-len	Seq-num	Error-rate	Package	Time	Memory
1000	100000	0.05	WFA1	4.523s	3.21 MB
			WFA2	3.597s	4.90 MB
			WFA-go	15.424s	9.97 MB
1000	100000	0.1	WFA1	8.031s	2.04 MB
			WFA2	6.973s	6.55 MB
			WFA-go	41.790s	12.06 MB
1000	100000	0.2	WFA1	15.538s	4.04 MB
			WFA2	13.450s	9.24 MB
			WFA-go	1m:51s	10.22 MB
50000	500	0.05	WFA1	2.180s	56.12 MB
			WFA2	1.481s	107.76 MB
			WFA-go	6.107s	86.61 MB
50000	500	0.1	WFA1	4.144s	55.3 MB
			WFA2	3.296s	190.32 MB
			WFA-go	17.908s	174.62 MB
50000	500	0.2	WFA1	7.574s	81.15 MB
			WFA2	6.842s	314.08 MB
			WFA-go	48.122s	275.80 MB

Run in a laptop PC, with single-thread.

Reference

Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, Antonio Espinosa. "Fast gap-affine pairwise alignment using the wavefront algorithm." Bioinformatics, 2020.
Santiago Marco-Sola, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.
https://github.com/smarco/WFA-paper/
https://github.com/smarco/WFA2-lib
https://github.com/rchikhi/rust-alignbench

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
wfa-go		wfa-go
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
benchmark.tsv		benchmark.tsv
go.mod		go.mod
go.sum		go.sum
wfa.go		wfa.go
wfa_backtrace_types.go		wfa_backtrace_types.go
wfa_cigar.go		wfa_cigar.go
wfa_component.go		wfa_component.go
wfa_component_plot.go		wfa_component_plot.go
wfa_test.go		wfa_test.go
wfa_wavefront.go		wfa_wavefront.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wavefront alignment algorithm (WFA) in Golang

Table of Contents

History

Details

Examples

Usages

CLI

Benchmark

Reference

Support

License

About

Releases 8

Packages

Languages

License

shenwei356/wfa

Folders and files

Latest commit

History

Repository files navigation

Wavefront alignment algorithm (WFA) in Golang

Table of Contents

History

Details

Examples

Usages

CLI

Benchmark

Reference

Support

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 8

Packages 0

Languages

Packages