This is a Python module to calculate a pairwise alignment between biological sequences (protein or nucleic acid). This module uses the needle, stretcher and water tools from the EMBOSS package to calculate an optimal, global/local pairwise alignment.
I wrote this module for two reasons. First, the needle and water tools are faster than any Python implementation. Second, Biopython has dropped support for tools from the EMBOSS package and recommends running them via the subprocess module directly.
Pairwise sequence alignment identifies regions of similarity between two sequences, which can indicate functional, structural, or evolutionary relationships.
-
Global alignment aligns two sequences from start to finish, ideal for sequences that are similar and of comparable length.
- needle Needleman-Wunsch algorithm) calculates a full-length global alignment by maximizing similarity across both sequences through a dynamic programming approach.
- stretcher documentation performs global alignment using a modified dynamic programming algorithm optimized for linear space efficiency.
-
Local alignment (water; Smith-Waterman algorithm) finds the region with the highest level of similarity between the two sequences. It is suitable for sequences that are not assumed to be similar over the entire length.
- Python >= 3.8
Check with
python3 --version
- EMBOSS >= 6.6.0
Check with
needle -version
orwater -version
.
You can install the module from PyPI:
pip install pairwise-sequence-alignment
or directly from GitHub:
pip install "git+https://github.com/aziele/pairwise-sequence-alignment.git"
or you can use the module without installation. Simply clone or download this repository and you're ready to use it.
import psa
# Global alignment
aln = psa.needle(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGA')
aln = psa.needle(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')
# Local alignment
aln = psa.water(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGAT')
aln = psa.water(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')
print(aln.score) # 20.0
print(aln.pidentity) # 71.4
print(aln.psimilarity) # 85.7
print(aln.pgaps) # 14.3
print(aln.qseq) # MKSTV-W
print(aln.sseq) # MKSSVLW
Attribute | Description |
---|---|
qid |
Query sequence identifier |
sid |
Subject sequence identifier |
qseq |
Query unaligned sequence |
sseq |
Subject unaligned sequence |
qaln |
Query aligned sequence |
saln |
Subject aligned sequence |
qstart |
Start of alignment in query |
qend |
End of alignment in query |
sstart |
Start of alignment in subject |
send |
End of alignment in subject |
length |
Alignment length |
score |
Alignment score |
nidentity |
Number of identical matches in the alignment |
pidentity |
Percentage of identical matches in the alignment |
nsimilarity |
Number of positive-scoring matches in the alignment |
psimilarity |
Percentage of positive-scoring matches in the alignment |
ngaps |
Total number of gaps in the alignment |
pgaps |
Total percentage of gaps in the alignment |
moltype |
nucl/prot |
program |
needle/water |
gapopen |
Gap open penalty |
gapextend |
Gap extension penalty |
matrix |
Name of scoring matrix |
raw |
Raw output obtained from EMBOSS' needle/water |
Method | Description |
---|---|
query_coverage() |
Returns a query coverage [%] |
subject_coverage() |
Returns a subject coverage [%] |
pvalue() |
Returns a p-value of the alignment |
fasta() |
Returns pairwise alignment in FASTA/Pearson format |
import psa
aln = psa.needle(
moltype='prot',
qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
sseq='MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
qid='seq1',
sid='seq2'
)
print(aln.score) # 225.0
print(aln.length) # 59
print(aln.pidentity) # 72.9
print(aln.psimilarity) # 79.7
print(aln.pgaps) # 18.6
print(aln.ngaps) # 11
print(aln.qseq) # MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT
print(aln.sseq) # MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP------
print(aln.qstart) # 1
print(aln.qend) # 54
print(aln.sstart) # 1
print(aln.send) # 53
print(aln.program) # needle
print(aln.gapopen) # 10
print(aln.gapextend) # 0.5
print(aln.matrix) # EBLOSUM62
print(aln.raw)
Output:
#=======================================
#
# Aligned_sequences: 2
# 1: seq1
# 2: seq2
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 59
# Identity: 43/59 (72.9%)
# Similarity: 47/59 (79.7%)
# Gaps: 11/59 (18.6%)
# Score: 225.0
#
#
#=======================================
seq1 1 MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDP 45
||:||.:|||||||| ||||||||||||||||:||:||||||||||
seq2 1 MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDP 50
seq1 46 RKPQLHSRT 54
|||
seq2 51 RKP------ 53
#---------------------------------------
#---------------------------------------
print(aln.fasta(wrap=30))
Output:
>seq1 1-54
MTSPSTKNSDDKGRP-----NLSSTEYFAN
TNVLTCRLKWVNPDTFIMDPRKPQLHSRT
>seq2 1-53
MTTPSRENSDDKGRPIEEASNLSSTEYFAN
TNVLTCKLKYVNPDTFIMDPRKP------
You can iterate over the alignment residue-by-residude:
for aa1, aa2 in aln:
if aa1 != aa2:
print(aa1, aa2)
Output:
S T
T R
K E
- I
- E
- E
- A
- S
R K
W Y
Q -
L -
H -
S -
R -
T -
You can also loop through residudes in an alignment with the information of their position.
for i, (aa1, aa2) in enumerate(aln):
if aa1 != aa2:
print(i, aa1, aa2)
Output:
2 S T
5 T R
6 K E
15 - I
16 - E
17 - E
18 - A
19 - S
36 R K
39 W Y
53 Q -
54 L -
55 H -
56 S -
57 R -
58 T -
Query coverage describes how much of the query sequence is covered in the alignment by the subject sequence. Specifically, query coverage is the percentage of the query sequence length that is included in the alignment. In global alignments, query coverage is always 100% because both the sequences, query and subject, are aligned from end to end. It is thus more useful to calculate query coverage from local alignments.
import psa
aln = psa.water(
moltype='prot',
qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
sseq='NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
qid='seq1',
sid='seq2'
)
# seq1 8 NSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKP 48
# |||||||| ||||||||||||||||:||:|||||||||||||
# seq2 1 NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP 46
print(aln.query_coverage())
# 75.9
print(aln.subject_coverage())
# 100
You can change a scoring matrix and penalties for the gap open and extension to calculate the alignment.
import psa
aln = psa.water(
moltype='prot',
qseq='MKSTWYERNST',
sseq='MKSTGYWTRESA',
matrix='EBLOSUM30',
gapopen=5,
gapextend=0.2
)
print(aln)
Output:
#=======================================
#
# Aligned_sequences: 2
# 1: query
# 2: subject
# Matrix: EBLOSUM30
# Gap_penalty: 5.0
# Extend_penalty: 0.2
#
# Length: 13
# Identity: 7/13 (53.8%)
# Similarity: 8/13 (61.5%)
# Gaps: 3/13 (23.1%)
# Score: 39.8
#
#
#=======================================
query 1 MKST--WYERNST 11
|||| |. |.|:
subject 1 MKSTGYWT-RESA 12
#---------------------------------------
#---------------------------------------
The Needleman-Wunsch and Smith-Waterman algorithms will always find an optimal alignment between two sequences, whether or not they are evolutionarily related. The strength of an alignment is determined by its score. However, often it is necessary to know if a score is high enough to indicate a biologically interesting alignment. The statistical significance of the score is assessed by the P-value, which describes how likely it is that two random sequences of similar length and composition will align with a score equal to or better than our target alignment.
The .pvalue()
method calculates the P-value of the alignment between query and subject sequences. The method shuffles a subject sequence many times (100 by default) and calculates the alignment score between the query and each shuffled subject sequence. It then counts how many times the alignment score was greater than or equal to the alignment score of the original query and subject sequences. For example, if 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the P-value is likely to be less than 0.01.
import psa
aln = psa.needle(moltype='prot', qseq='MKSTVILK', sseq='MKSRSLK')
print(aln.pvalue()) # 0.16
For more than two sequences, you can calculate alignments between every pair of your input sequences.
import itertools
import psa
# Input sequences
sequences = {
'dna1': 'ATCGAGATCGAGATGGCGATAG',
'dna2': 'ATGCTGATCGTAGGGGC',
'dna3': 'GTCGGATCCTCGATGGAGA',
'dna4': 'TTTGGGAATGCGTAGGAGCTA',
'dna5': 'CCGTGATGCGATGCA'
}
# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
qseq = sequences[qid]
sseq = sequences[sid]
aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
Output:
dna1 dna2 38.5% 24.0
dna1 dna3 60.9% 34.0
dna1 dna4 35.7% 20.0
dna1 dna5 45.8% 27.0
dna2 dna3 43.5% 18.0
dna2 dna4 42.3% 39.0
dna2 dna5 39.1% 22.0
dna3 dna4 20.0% 14.0
dna3 dna5 52.4% 26.0
dna4 dna5 40.9% 14.0
If you have multiple sequences in a FASTA file, you can use Biopython to read them and then calculate pairwise alignments.
import itertools
import psa
from Bio import SeqIO
# Input sequences
sequences = {}
for seq_record in SeqIO.parse('sequences.fasta', 'fasta'):
sequences[seq_record.id] = str(seq_record.seq)
# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
qseq = sequences[qid]
sseq = sequences[sid]
aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
Output:
dna1 dna2 38.5% 24.0
dna1 dna3 60.9% 34.0
dna1 dna4 35.7% 20.0
dna1 dna5 45.8% 27.0
dna2 dna3 43.5% 18.0
dna2 dna4 42.3% 39.0
dna2 dna5 39.1% 22.0
dna3 dna4 20.0% 14.0
dna3 dna5 52.4% 26.0
dna4 dna5 40.9% 14.0
If you want to check that everything works as intended, just run:
./test.py