The main logic of my program is coded in Python, and I use the Flask framework with HTML, CSS, JS, and Jinja to create the GUI.

    /dna-alignment
        /static
            favicon.png
            script.js
            style.css
        /templates
            home.html
        .gitignore
        app.py
        DESIGN.md
        main.py
        README.md
        results.db
        sequence_alignment.py
        test_sequence_alignment.py
        test.csv

- `/static/favicon.png`: a PNG of a circle of DNA, used as the logo and the loading sequence image
- `/static/script.js`: where I provide client-side functionality such as tooltips for overflowed cells, the ability to switch between "Manual Input" and "Upload CSV", client-side validation of inputs, and the loading sequence
- `/static/style.css`: where I style my HTML webpage
- `/templates/home.html`: where I create the backbone of my webpage, including the form where users submit their sequences as input using the "Manual Input" or "Upload CSV" functionality
- `.gitignore`: list of files and folders that the version control system should not track
- `app.py`: where I define my webpage routes `/` (the bulk of the GUI) and `/download`, as well as where I use the values entered in the form to run the algorithms and interact with `results.db`
- `DESIGN.md`: this design document!
- `main.py`: where my CLI tool can be run from and where I define helper functions like `get_results()` that encapsulate many steps of my program
- `README.md`: the user manual
- `results.db`: where I store a history of the results in the `results` table of this SQLite3 database, with the following columns:
    - `seq1 TEXT NOT NULL`: Sequence 1
    - `seq2 TEXT NOT NULL`: Sequence 2
    - `ga_align1 TEXT`: Global Alignment 1
    - `ga_align2 TEXT`: Global Alignment 2
    - `ga_score INTEGER`: Global Alignment Score
    - `la_align1 TEXT`: Local Alignment 1
    - `la_align2 TEXT`: Local Alignment 2
    - `la_score INTEGER`: Local Alignment Score
    - `time DATETIME DEFAULT CURRENT_TIMESTAMP`: Timestamp
- `sequence_alignment.py`: where I define the abstract class `SequenceAlignment`, which the `GlobalAlignment` and `LocalAlignment` subclasses inherit from, with all the necessary fields and methods to perform their respective algorithms
- `test_sequence_alignment.py`: where I define various unit test cases in `SAME_LENGTH_SEQ_CASES`, `DIFF_LENGTH_SEQ_CASES`, and `EMPTY_SEQ_CASES` to ensure that my program implements the algorithms properly
- `test.csv`: where I list 26 DNA sequences (20 valid, 6 invalid) of approximately 30-40 nucleotides in length to test my program with
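
To make the `results` table concrete, here is a minimal sketch of how a table with the columns listed above could be created and written to with Python's built-in `sqlite3` module. The column definitions come from the list; the `save_result()` helper, the `CREATE TABLE` wrapper, and the in-memory database are assumptions for illustration and not necessarily how `main.py` or `app.py` do it.

```python
import sqlite3

# Column definitions copied from the list above; everything else is assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    seq1 TEXT NOT NULL,
    seq2 TEXT NOT NULL,
    ga_align1 TEXT,
    ga_align2 TEXT,
    ga_score INTEGER,
    la_align1 TEXT,
    la_align2 TEXT,
    la_score INTEGER,
    time DATETIME DEFAULT CURRENT_TIMESTAMP
)
"""


def save_result(conn, row):
    """Hypothetical helper: insert one result row; `time` uses its default."""
    conn.execute(
        "INSERT INTO results (seq1, seq2, ga_align1, ga_align2, ga_score,"
        " la_align1, la_align2, la_score) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        row,
    )
    conn.commit()


# An in-memory database keeps the sketch self-contained; the project uses results.db.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_result(conn, ("TGGTG", "ATCGT", "-TGGTG", "ATCGT-", -2, "GT", "GT", 2))
stored = conn.execute(
    "SELECT seq1, ga_score, la_score, time FROM results"
).fetchone()
```

Because `time` has a default, an insert only needs the eight sequence and score values.
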
I will walk you through how the global alignment algorithm works step by step, using the sample inputs `seq1 = TGGTG` and `seq2 = ATCGT`.
- Set `n` to one more than the number of nucleotides in `seq1`, so `n = 6`
- Set `m` to one more than the number of nucleotides in `seq2`, so `m = 6`
- Using `initialize_matrix()`, create an `n` by `m` matrix of 0s, where we can imagine the nucleotides of `seq1` along the vertical axis and the nucleotides of `seq2` along the horizontal axis

| | | A | T | C | G | T |
|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |
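
The three setup steps above can be sketched in a few lines of Python. This is a standalone illustration, assuming a plain list-of-lists matrix and a free function rather than the method on `SequenceAlignment`.

```python
def initialize_matrix(seq1, seq2):
    """Return an n-by-m matrix of 0s, one row per seq1 position plus one."""
    n = len(seq1) + 1  # rows: seq1 runs down the vertical axis
    m = len(seq2) + 1  # columns: seq2 runs along the horizontal axis
    # Build each row separately so the rows do not share one list object.
    return [[0] * m for _ in range(n)]


matrix = initialize_matrix("TGGTG", "ATCGT")
```

The extra row and column stand for aligning against an empty prefix, which is why `n` and `m` are one more than the sequence lengths.
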

- In `fill_matrix()`, starting from the top left cell, calculate the score of each cell as the maximum of:
    - The score of the cell to its left + a gap penalty (set to -2): this represents aligning the nucleotide in `seq2` with a gap in `seq1`, so we are moving horizontally over `seq2`
    - The score of the cell above it + a gap penalty: this represents aligning the nucleotide in `seq1` with a gap in `seq2`, so we are moving vertically over `seq1`
    - The score of the cell to its upper-left diagonal + a match reward (set to 1) if the nucleotides in `seq1` and `seq2` are the same, or a mismatch penalty (set to -1) otherwise: this represents aligning the nucleotides of `seq1` and `seq2`
- In each cell, also store the direction that the current cell's score was derived from (either left, up, or diagonal)

| | | A | T | C | G | T |
|---|---|---|---|---|---|---|
| | (0, -) | (-2, ←) | (-4, ←) | (-6, ←) | (-8, ←) | (-10, ←) |
| T | (-2, ↑) | (-1, ↖) | (-1, ↖) | (-3, ←) | (-5, ←) | (-7, ←) |
| G | (-4, ↑) | (-3, ↑) | (-2, ↖) | (-2, ↖) | (-2, ↖) | (-4, ←) |
| G | (-6, ↑) | (-5, ↑) | (-4, ↑) | (-3, ↖) | (-1, ↖) | (-3, ←) |
| T | (-8, ↑) | (-7, ↑) | (-4, ↖) | (-5, ↑) | (-3, ↑) | (0, ↖) |
| G | (-10, ↑) | (-9, ↑) | (-6, ↑) | (-5, ↖) | (-4, ↖) | (-2, ↑) |
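
The filled-in table above can be reproduced with a short sketch of the scoring rule. The function name `fill_global()` and the `"left"`/`"up"`/`"diag"` labels (standing in for ←, ↑, ↖) are assumptions for illustration. The tie-breaking order (left, then up, then diagonal) is also an assumption, chosen because it matches the arrows in the table.

```python
GAP, MATCH, MISMATCH = -2, 1, -1  # values stated in the steps above


def fill_global(seq1, seq2):
    """Return a matrix of (score, direction) cells for global alignment."""
    n, m = len(seq1) + 1, len(seq2) + 1
    matrix = [[(0, "-")] * m for _ in range(n)]
    # First row and column: multiples of the gap penalty.
    for j in range(1, m):
        matrix[0][j] = (j * GAP, "left")
    for i in range(1, n):
        matrix[i][0] = (i * GAP, "up")
    for i in range(1, n):
        for j in range(1, m):
            pair = MATCH if seq1[i - 1] == seq2[j - 1] else MISMATCH
            candidates = [
                (matrix[i][j - 1][0] + GAP, "left"),
                (matrix[i - 1][j][0] + GAP, "up"),
                (matrix[i - 1][j - 1][0] + pair, "diag"),
            ]
            # max() keeps the first of equal scores, so ties favour left, then up.
            matrix[i][j] = max(candidates, key=lambda cell: cell[0])
    return matrix


matrix = fill_global("TGGTG", "ATCGT")
scores = [[cell[0] for cell in row] for row in matrix]
```

Each cell only depends on its left, upper, and upper-left neighbours, so filling row by row from the top left guarantees those three values are ready when they are needed.
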

- In `traceback()`, start from the bottom right cell and follow the arrows until the top left, saving the nucleotides in `seq1` and `seq2`, or gaps (represented by `-`) in the case of horizontal/vertical movement, as we go
- Since we started from the bottom right and went to the top left, reverse the saved sequences to get `align1 = -TGGTG` and `align2 = ATCGT-`
- In `set_alignment_score()`, set `alignment_score` to the value of the bottom right cell in the matrix; in this case `alignment_score = -2`
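
To illustrate the traceback, here is a minimal sketch that follows the arrows transcribed from the filled-in table above (`L` = ←, `U` = ↑, `D` = ↖). The names `traceback_global()` and `score_alignment()` are hypothetical. The second helper is only a cross-check that re-scores the finished alignment column by column, which should agree with the bottom right cell.

```python
GAP, MATCH, MISMATCH = -2, 1, -1

# Arrows transcribed from the table above, one string per matrix row.
DIRECTIONS = [
    "-LLLLL",
    "UDDLLL",
    "UUDDDL",
    "UUUDDL",
    "UUDUUD",
    "UUUDDU",
]


def traceback_global(seq1, seq2, directions):
    """Walk from the bottom right cell to the top left, then reverse."""
    i, j = len(seq1), len(seq2)
    align1, align2 = [], []
    while directions[i][j] != "-":
        step = directions[i][j]
        if step == "D":  # align the two nucleotides
            align1.append(seq1[i - 1])
            align2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif step == "U":  # nucleotide of seq1 against a gap in seq2
            align1.append(seq1[i - 1])
            align2.append("-")
            i -= 1
        else:  # "L": nucleotide of seq2 against a gap in seq1
            align1.append("-")
            align2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(align1)), "".join(reversed(align2))


def score_alignment(align1, align2):
    """Cross-check: sum the gap, match, and mismatch values per column."""
    total = 0
    for a, b in zip(align1, align2):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATCH if a == b else MISMATCH
    return total


align1, align2 = traceback_global("TGGTG", "ATCGT", DIRECTIONS)
alignment_score = score_alignment(align1, align2)
```

For the sample inputs this walks up once, diagonally four times, and left once, which is where the leading and trailing gaps in the two aligned sequences come from.
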
I will walk you through how the local alignment algorithm works step by step, using the same sample inputs, `seq1 = TGGTG` and `seq2 = ATCGT`.
- Set `n` to one more than the number of nucleotides in `seq1`, so `n = 6`
- Set `m` to one more than the number of nucleotides in `seq2`, so `m = 6`
- Using `initialize_matrix()`, create an `n` by `m` matrix of 0s, where we can imagine the nucleotides of `seq1` along the vertical axis and the nucleotides of `seq2` along the horizontal axis

| | | A | T | C | G | T |
|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 |

Note that the first three steps of the global and local alignment algorithms are identical, so I refactored my code to reduce redundancy between the two by using an inherited class structure.
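
As a rough sketch of that class structure, the shared setup can live in the abstract base class while each subclass supplies its own scoring rule. The class names and `initialize_matrix()`/`fill_matrix()` come from this document. The constructor signature, the `candidates()` helper, and the score-only matrix (no stored directions) are simplifying assumptions and not the actual contents of `sequence_alignment.py`.

```python
from abc import ABC, abstractmethod


class SequenceAlignment(ABC):
    """Shared setup: the first three steps are the same for both algorithms."""

    def __init__(self, seq1, seq2, gap=-2, match=1, mismatch=-1):
        self.seq1, self.seq2 = seq1, seq2
        self.gap, self.match, self.mismatch = gap, match, mismatch
        self.n = len(seq1) + 1
        self.m = len(seq2) + 1
        self.matrix = self.initialize_matrix()

    def initialize_matrix(self):
        return [[0] * self.m for _ in range(self.n)]

    def candidates(self, i, j):
        """Hypothetical helper: scores via the left, upper, and diagonal cells."""
        same = self.seq1[i - 1] == self.seq2[j - 1]
        return (
            self.matrix[i][j - 1] + self.gap,
            self.matrix[i - 1][j] + self.gap,
            self.matrix[i - 1][j - 1] + (self.match if same else self.mismatch),
        )

    @abstractmethod
    def fill_matrix(self):
        """Each subclass applies its own scoring rule."""


class GlobalAlignment(SequenceAlignment):
    def fill_matrix(self):
        # Borders are multiples of the gap penalty.
        for j in range(self.m):
            self.matrix[0][j] = j * self.gap
        for i in range(self.n):
            self.matrix[i][0] = i * self.gap
        for i in range(1, self.n):
            for j in range(1, self.m):
                self.matrix[i][j] = max(self.candidates(i, j))


class LocalAlignment(SequenceAlignment):
    def fill_matrix(self):
        # Same candidates, but a cell never drops below 0.
        for i in range(1, self.n):
            for j in range(1, self.m):
                self.matrix[i][j] = max(0, *self.candidates(i, j))


global_alignment = GlobalAlignment("TGGTG", "ATCGT")
global_alignment.fill_matrix()
local_alignment = LocalAlignment("TGGTG", "ATCGT")
local_alignment.fill_matrix()
```

Marking `fill_matrix()` as abstract means `SequenceAlignment` itself cannot be instantiated, so every usable alignment object is forced to pick one of the two scoring rules.
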

- In `fill_matrix()`, starting from the top left cell, calculate the score of each cell as the maximum of:
    - The score of the cell to its left + a gap penalty (set to -2): this represents aligning the nucleotide in `seq2` with a gap in `seq1`, so we are moving horizontally over `seq2`
    - The score of the cell above it + a gap penalty: this represents aligning the nucleotide in `seq1` with a gap in `seq2`, so we are moving vertically over `seq1`
    - The score of the cell to its upper-left diagonal + a match reward (set to 1) if the nucleotides in `seq1` and `seq2` are the same, or a mismatch penalty (set to -1) otherwise: this represents aligning the nucleotides of `seq1` and `seq2`
    - `0` (since the minimum local alignment score is 0)
- In each cell, also store the direction that the current cell's score was derived from (either left, up, diagonal, or none, represented by `-`)

| | | A | T | C | G | T |
|---|---|---|---|---|---|---|
| | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) |
| T | (0, 0) | (0, -) | (1, ↖) | (0, -) | (0, -) | (1, ↖) |
| G | (0, 0) | (0, -) | (0, -) | (0, -) | (1, ↖) | (0, -) |
| G | (0, 0) | (0, -) | (0, -) | (0, -) | (1, ↖) | (0, -) |
| T | (0, 0) | (0, -) | (1, ↖) | (0, -) | (0, -) | (2, ↖) |
| G | (0, 0) | (0, -) | (0, -) | (0, -) | (1, ↖) | (0, -) |

Note that, if you analyze how the algorithm sets the score of each cell, you would notice that the first row and column will always have scores of 0, so they are left as such. This optimizes my program for time.
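
The local scoring rule can be sketched the same way as the global one, with `0` added as a fourth candidate. The function name `fill_local()` and the direction labels are assumptions for illustration. Putting the `0` candidate first, so that it wins ties, is also an assumption, chosen because the table shows `-` where a diagonal move would also give 0. Unlike the table, this sketch marks the untouched first row and column with `-` rather than `0`.

```python
GAP, MATCH, MISMATCH = -2, 1, -1  # values stated in the steps above


def fill_local(seq1, seq2):
    """Return a matrix of (score, direction) cells for local alignment."""
    n, m = len(seq1) + 1, len(seq2) + 1
    # The first row and column stay at 0 and are never revisited.
    matrix = [[(0, "-")] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            pair = MATCH if seq1[i - 1] == seq2[j - 1] else MISMATCH
            candidates = [
                (0, "-"),  # restart: a local alignment never scores below 0
                (matrix[i][j - 1][0] + GAP, "left"),
                (matrix[i - 1][j][0] + GAP, "up"),
                (matrix[i - 1][j - 1][0] + pair, "diag"),
            ]
            # max() keeps the first of equal scores, so 0 wins ties.
            matrix[i][j] = max(candidates, key=lambda cell: cell[0])
    return matrix


matrix = fill_local("TGGTG", "ATCGT")
scores = [[cell[0] for cell in row] for row in matrix]
```

The loops start at index 1, which is the time saving described in the note above: the border cells are never recomputed.
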

- In `traceback()`, start from the cell with the highest score and follow the arrows until we reach a nonpositive score or a cell without an arrow, saving the nucleotides in `seq1` and `seq2`, or gaps (represented by `-`) in the case of horizontal/vertical movement, as we go
- Since we started from the highest scoring cell and moved back towards the top left, reverse the saved sequences to get `align1 = GT` and `align2 = GT`
- In `set_alignment_score()`, set `alignment_score` to the score of the highest scoring cell, so `alignment_score = 2`
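
The local traceback can be sketched against the scores and arrows transcribed from the table above (`D` = ↖, `-` = no arrow). The function name `traceback_local()` and the choice to start from the first highest scoring cell found are assumptions for illustration.

```python
# Scores and arrows transcribed from the table above, one entry per matrix row.
SCORES = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 2],
    [0, 0, 0, 0, 1, 0],
]
DIRECTIONS = ["------", "--D--D", "----D-", "----D-", "--D--D", "----D-"]


def traceback_local(seq1, seq2, scores, directions):
    """Walk back from the highest scoring cell until the score runs out."""
    best_i, best_j = 0, 0
    for r, row in enumerate(scores):
        for c, value in enumerate(row):
            if value > scores[best_i][best_j]:
                best_i, best_j = r, c
    i, j = best_i, best_j
    align1, align2 = [], []
    while scores[i][j] > 0 and directions[i][j] != "-":
        step = directions[i][j]
        if step == "D":  # align the two nucleotides
            align1.append(seq1[i - 1])
            align2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif step == "U":  # nucleotide of seq1 against a gap in seq2
            align1.append(seq1[i - 1])
            align2.append("-")
            i -= 1
        else:  # "L": nucleotide of seq2 against a gap in seq1
            align1.append("-")
            align2.append(seq2[j - 1])
            j -= 1
    score = scores[best_i][best_j]
    return "".join(reversed(align1)), "".join(reversed(align2)), score


align1, align2, alignment_score = traceback_local(
    "TGGTG", "ATCGT", SCORES, DIRECTIONS
)
```

For the sample inputs the walk starts at the cell scoring 2, takes two diagonal steps, and stops at a cell scoring 0.
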

- In `main.py`, `get_results()` uses the `combinations()` function of the `itertools` library to get every pair of sequences from the inputted sequences and runs the global and local alignment algorithms on each pair. It saves the set of outputs for each pair in the `results` table of `results.db` and prints them to the console if it's being run as a CLI tool.
- In `app.py`, `home()` first connects to `results.db` and saves URL parameters about how to sort the table of results. Then, if the form has been submitted, it validates the inputs and runs `main.get_results()`. Next, whether the user has just navigated to the website or has submitted the form, it gets all rows from `results` and sends them to be included in the HTML table using Jinja.
- In `sequence_alignment.py`, my implementation of the `initialize_matrix()` method for global alignment sets the first row and column to multiples of the gap penalty and points those arrows towards the top left because, if you analyze how the algorithm sets the score of each cell, you would notice that this is the case for all possible inputs. This optimizes my program for time.
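
To show what the pairing step produces, here is a small sketch using `itertools.combinations()`. The `pair_up()` wrapper and the third sample sequence are made up for illustration; `get_results()` does more than this.

```python
from itertools import combinations


def pair_up(sequences):
    """Hypothetical helper: every unordered pair, in input order."""
    return list(combinations(sequences, 2))


# "GATTACA" is an invented extra input, not one of the project's test sequences.
pairs = pair_up(["TGGTG", "ATCGT", "GATTACA"])
```

Each pair appears once and a sequence is never paired with itself, so `k` input sequences give `k * (k - 1) / 2` alignments to run.
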