Alignment

Sebastian Keller edited this page Nov 22, 2021 · 3 revisions

The preceding structure search yields, for each given protein sequence, a list of potential protein structure annotations (sequence-to-structure mappings). To obtain a position-specific structural annotation, a global pairwise sequence alignment has to be performed for each protein structure annotation.

Sequence retrieval of protein structures

Annotated protein structures are given in the form of PDB entries or, more precisely, as an individual chain that is part of a PDB entry. To obtain the sequence of amino acids resolved in the entry, we directly parse the ATOM records of the corresponding PDB-formatted file.
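
As an illustration of reading a chain sequence from ATOM records, the following is a minimal sketch and not necessarily the parser used in the pipeline. The function name `chain_sequence_from_atom_records` is hypothetical. Reading only the first model, mapping non-standard residue names to `X`, and ignoring HETATM records are assumptions made for this example. The column positions follow the fixed-width PDB file format.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence_from_atom_records(pdb_lines, chain_id):
    """Return the one-letter sequence of the residues resolved in one chain.

    pdb_lines is any iterable of lines from a PDB-formatted file.
    """
    sequence = []
    seen = set()
    for line in pdb_lines:
        record = line[0:6]
        if record == "ENDMDL":
            break  # assumption: only the first model is read
        if record != "ATOM  " or len(line) < 27:
            continue
        if line[21] != chain_id:  # column 22: chain identifier
            continue
        # Columns 23-26: residue number, column 27: insertion code.
        residue_key = (line[22:26], line[26])
        if residue_key in seen:
            continue  # further atoms of an already counted residue
        seen.add(residue_key)
        # Columns 18-20: residue name; unknown names become "X" (assumption).
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(sequence)


# Usage with a hypothetical file path:
# with open("1abc.pdb") as handle:
#     template_seq = chain_sequence_from_atom_records(handle, "A")
```

Because only resolved residues have ATOM records, a sequence obtained this way can be shorter than the full sequence of the deposited construct, which is one reason the subsequent alignment is needed.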

Needleman-Wunsch alignment

We use the Biopython implementation of the Needleman-Wunsch pairwise alignment algorithm with a gap opening penalty of 10 and a gap extension penalty of 0.5 (passed to the function as the negative scores -10.0 and -0.5). Further, we use the BLOSUM62 substitution matrix and do not penalize terminal gaps. The exact function call is:
pairwise2.align.globalds(target_seq, template_seq, residue_consts.BLOSUM62, -10.0, -0.5, one_alignment_only=True, penalize_end_gaps=False)

Alignment quality scores

To compare and aggregate results from different annotated protein structures, we compute two alignment quality scores:

  1. We calculate the coverage as the number of positions of the input protein sequence that are not aligned to a gap, divided by the sequence length of the given protein.
  2. We calculate the sequence identity as the number of identically matched amino acids in the alignment, divided by the sequence length of the annotated structure.
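
The two definitions above can be sketched as follows. This is a minimal example and not necessarily the project's own code. The function name `alignment_scores` is hypothetical, and it assumes that the two gapped strings are the aligned target and template sequences (the `seqA` and `seqB` fields of a `pairwise2` result) with `-` as the gap character.

```python
def alignment_scores(aligned_target, aligned_template, gap="-"):
    """Return (coverage, sequence_identity) for one pairwise alignment."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")

    # Ungapped lengths of the input protein and of the annotated structure.
    target_len = sum(1 for c in aligned_target if c != gap)
    template_len = sum(1 for c in aligned_template if c != gap)
    if target_len == 0 or template_len == 0:
        raise ValueError("both sequences must contain at least one residue")

    covered = 0    # target positions that are not aligned to a gap
    identical = 0  # columns where both sequences carry the same residue
    for t, s in zip(aligned_target, aligned_template):
        if t == gap or s == gap:
            continue
        covered += 1
        if t == s:
            identical += 1

    coverage = covered / target_len
    sequence_identity = identical / template_len
    return coverage, sequence_identity


# Toy alignment: 5 of 8 target positions are covered,
# and 4 of the 5 template residues are identical matches.
coverage, identity = alignment_scores("MKTAYIAK", "--TAYLA-")
print(coverage, identity)  # 0.625 0.8
```

Note that the two scores use different denominators: coverage is relative to the input protein, while sequence identity is relative to the annotated structure, so a short structure can reach a high identity at a low coverage.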