Protein stability prediction upon point mutation is a key biological problem. Single point mutations can alter protein function resulting in disease incidence. A very significant example is that of sickle cell anemia, wherein a single genomic mutation results in a single amino acid change and impairs the function of hemoglobin. It is therefore essential to develop methods that can predict the impact of point mutations on protein stability. More specifically, such methods should enable the estimation of free energy change (ddG) upon point mutation and be able to tell us whether a mutation reduces, increases or doesn't change the thermodynamic stability of proteins.
In this project, we introduce a novel approach to estimate the changes in free energy (ΔΔG) associated with point mutations. We refer to our approach as Protein Stability Prediction using Gaussian Network Model (PSP-GNM). For a given wildtype-mutant pair, PSP-GNM utilizes the Gaussian Network Model (GNM) to identify putative contacts and the order in which they are broken during simulated partial protein unfolding. We then use the knowledge of these broken contacts to estimate the ΔΔG.
- Python3.8
- scikit-learn=1.0.1
- scipy=1.7.2
- statsmodels=0.13.1
- biopython=1.79
- click=8.0.3
- numpy=1.21.4
- seaborn=0.11.2
The repository includes 4 directories containing the relevant datasets and scripts for running PSP-GNM. It also includes a conda environment file (psp_gnm_env.yaml) that can be used to re-create the environment where the scripts can be run.
- datasets/ - Contains the 3 benchmark datasets on which PSP-GNM was assessed.
- predictions/ - Contains predictions made using different methods and PSP-GNM for different datasets
- scripts/ - Includes the script: psp_gnm.py to run calculations
- test_data/ - Includes test datasets to perform a test run of the 2 scripts
You will need access to a Linux terminal (or a MacOS, or even a Windows 10 having Windows Subsystem for Linux enabled and Ubuntu installed). The specific instructions below work best in a Linux (Ubuntu 18.02/Ubuntu20.04) platform.
To get a local copy up and execute the scripts, follow these simple steps:
- Install miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.11.0-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
- Clone the repository with
git clone https://github.com/sambitmishra0628/PSP-GNM
- On a terminal window, go to the PSP-GNM directory
- Create the conda environment using the provided yaml file:
conda env create -f psp_gnm_env.yaml
- Activate the environment: conda activate psp_gnm
- Checkout the command line arguments needed to run psp_gnm:
python scripts/psp_gnm.py --help
The following help manual should be displayed
Usage: psp_gnm.py [OPTIONS]
Options:
--data_file TEXT Name of the .csv file containing the information on ddG
for the mutants [required]
--outfile TEXT Name of the file to which the PSP-GNM-calculated
energies and experimental energies will be written
[required]
--outdir TEXT Name of the directory to which the intermittent result
files will be written to [required]
--wt_pdb_dir TEXT Directory containing the wild type atomic pdb files. For
a reverse mutant, the wildtype is the forward mutant.
[required]
--num_jobs TEXT Maximum number of jobs to be run in parallel [required]
--dist_cutoff TEXT Distance cutoff for interactions in GNM [default: 9;
required]
--num_modes TEXT Number of modes to be used [default: 10; required]
--rev_mut_pdb Set this option if data_file includes reverse mutants
and all the reverse mutants have the corresponding pdb
files of the forward mutant in wt_pdb_dir. [default:
False]
--help Show this message and exit.
You will need access to a Linux machine (or a MacOS or Windows 10 with Windows Subsystem for Linux enabled). You will need to have conda installed.
Before running make sure that your current working directory is the PSP_GNM/scripts/
directory.
python psp_gnm.py --data_file ../test_data/S350_test_benchmark_run.csv --outdir ../S350_test_run_output --outfile ../S350_test_benchmark_run_out.csv --wt_pdb_dir ../test_data/pdb_test --num_jobs 4 --dist_cutoff 9 --num_modes 10
In the above:
-
num_jobs
is the number of parallel jobs you intend to run. Ideally, it should be set to the number of cores in your machine (N) - 1. -
The above run will create an output directory
S350_test_run_output
, where all the intermediate files containing information on the contacts broken during partial unfolding will be stored. -
The
data_file
should have comma separated values and at least include the following columns (the column names should exactly match as given below)
Column Name | Expected value |
---|---|
PDB_CHAIN | The 4-lettered PDB ID + Chain ID (e.g., 1AJ3A, 1AONU) (Case-sensitive) |
WILD_RES | The single amino acid alphabet of the wildtype residue in the PDB file (Case-sensitive) |
RES_NUM_PDB | The PDB residue number for the mutation position |
MUTANT_RES | The single amino acid alphabet of the variant/mutant residue (Case-sensitive) |
Category | Should be one of Forward or Reverse (case-sensitive). If unsure, then use Forward and then use the value under column ddG_PSP_GNM_unscaled in the output file for your analysis. |
In the above, it is expected that position RES_NUM_PDB
in the PDB file has the residue given by WILD_RES
.
- The output file
S350_test_benchmark_run_out.csv
will include the calculated ddG. Note that this output file will include all the columns present in the data file. Additionally, it will have the columns corresponding to calculations made by PSP-GNM. Explanation of the different output columns are as follows.
Column Name | Explanation |
---|---|
RES_IND_SEQ | The serial index of the mutation position. Starts from 0 |
Calc_ddG | The raw calculated energy difference between wildtype and mutant |
Calc_ddI | The raw calculated entropy difference between wildtype and mutant. The entropy difference is measured as the difference in mean-squared fluctuation in distance. |
Calc_ddG_mean | The average calculated energy difference between wildtype and mutant. Averaged across all residues considered for calculations. |
Calc_ddI_mean | The average calculated entropy difference between wildtype and mutant. Averaged across all residues considered for calculations. |
Num_contacts | Total contacts broken during partial unfolding involving the mutation residue and considered for calculations. We suggest considering only those calculations having Num_contacts > 0. |
Calc_Energy_scaled | The scaled values for Calc_ddG |
Calc_Entropy_scaled | The scaled values for Calc_ddI |
ddG_PSP_GNM_unscaled | The unscaled prediction for ddG that incorporates both energy and entropy changes |
ddG_PSP_GNM_scaled | The scaled prediction for ddG that incorporates both energy and entropy changes. This value is meaningless if the user is not sure of the category of mutation. If that's the case, then use the unscaled values. |
- The output directory will include *_contact_breaks.csv files. These files include information on the contacts broken during the partial unfolding simulation. A separate contact_break file will be created for each mutant listed in the S350_test_benchmark_run.csv file and its corresponding wildtype. For example, the 2nd row of the S350_test_benchmark_run.csv includes the following mutant.
PDB_CHAIN | WILD_RES | RES_NUM_PDB | MUTANT_RES | Category |
---|---|---|---|---|
1AJ3A | I | 23 | A | Forward |
In the above case, the PDB ID is 1AJ3, the chain is A, wildtype residue is I, mutant residue is A and the position of mutation (according to the PDB file) is 23. The resulting contact_break file created during the unfolding simulation of the mutant form is 1AJ3A_I23A_contact_breaks.csv and the wildtype is 1AJ3A_wt_contact_breaks.csv. The first few rows for 1AJ3A_wt_contact_breaks.csv are shown below.
PDB_ID | WT_or_Mut | Mutation_position | Contact_Position | Res_at_Mut_Position | Res_at_Contact_Pos | Energy_MJ | Int_dist_change | Contact_break_rank |
---|---|---|---|---|---|---|---|---|
1AJ3A | wt | 17 | 88 | K | L | -3.37 | 2.355 | 1 |
1AJ3A | wt | 88 | 17 | L | K | -3.37 | 2.355 | 1 |
1AJ3A | wt | 46 | 89 | L | K | -3.37 | 2.301 | 2 |
1AJ3A | wt | 89 | 46 | K | L | -3.37 | 2.301 | 2 |
1AJ3A | wt | 39 | 92 | H | A | -2.41 | 2.305 | 3 |
1AJ3A | wt | 92 | 39 | A | H | -2.41 | 2.305 | 3 |
1AJ3A | wt | 39 | 93 | H | A | -2.41 | 2.308 | 4 |
1AJ3A | wt | 93 | 39 | A | H | -2.41 | 2.308 | 4 |
Note that there are duplicate rows for the same broken contact (e.g., 17 and 88) in this file. However, PSP-GNM only utilizes information on the unique broken contacts for ΔΔG calculations. A newer version of the code will be released in the future to output only the unique broken contacts in the *_contact_breaks file.
An explanation of each column in the table above is as follows.
Column Name | Explanation |
---|---|
PDB_ID | Name of the 4-lettered PDB ID with chain |
WT_of_Mut | Whether the simulation was carried out for the wildtype (wt) or the mutant form |
Mutation_postion | This is the sequential index (not the PDB residue number) of the first residue in the contact pair |
Contact_position | This is just the sequential index (not the PDB residue number) of the second residue in the contact pair |
Res_at_Mut_Position | Residue present at Mutation_position |
Res_at_Contact_Position | Residue present at the Contact_Position |
Energy_MJ | Interaction energy between Res_at_Mut_Position and Res_at_Contact_Position obtained from the MJ potential |
Int_dist_change | Mean-squared fluctuation in distance, which we use to estimate entropy, between the contact pair |
Contact_break_rank | The rank/order based on when this contact was broken during the simulation |
For example, in the first row, residues at position 17 (K) and 88 (L) were originally in contact. That is, the spatial distance between their C-alpha atoms in the native structure was <= 9 Angstroms. During the unfolding simulation, this was the first residue pair that was broken as it had the highest mean-squared fluctuation in distance. Therefore, we assign the residue pair a Contact_break_rank of 1.
1. Predicting ddG for reverse mutants using the structure of the native wildtype
For PDB ID 1AJ3 and chain A (wildtype PDB), the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. The forward mutant then is ASP18 -> PHE18 and the reverse mutant is PHE18 -> ASP18. If you want to test the antisymmetric property (ΔΔG_forward mutant = -ΔΔG_reverse mutant) of PSP-GNM, then use the following instance of the --data_file
to calculate ΔΔG of the forward mutant.
PDB_CHAIN | WILD_RES | RES_NUM_PDB | MUTANT_RES | Category |
---|---|---|---|---|
1AJ3A | D | 18 | F | Forward |
You can then calculate the ΔΔG of the reverse mutant using the same PDB file having the following content in the --data_file
PDB_CHAIN | WILD_RES | RES_NUM_PDB | MUTANT_RES | Category |
---|---|---|---|---|
1AJ3A | F | 18 | D | Reverse |
2. Predicting ddG for reverse mutants using the structure of the forward mutant For PDB ID 1AJ3 and chain A, the wildtype residue at position 18 is ASP. Let us assume a theoretical mutant form of this protein that has PHE at position 18. Let us say you have the PDB structure of the forward mutant i.e, position 18 in your PDB contains PHE. Now to predict ΔΔG for the reverse mutant use the following instance of `--data_file`.
PDB_CHAIN | WILD_RES | RES_NUM_PDB | MUTANT_RES | Category |
---|---|---|---|---|
1AJ3A | F | 18 | D | Reverse |
You will also need to include the --rev_mut_pdb
runtime argument while running PSP-GNM.
Distributed under the MIT License. See LICENSE
for more information.
Sambit Mishra - sambitmishra0628@gmail.com
Project Link: PSP-GNM
Mishra, S.K. PSP-GNM: Predicting Protein Stability Changes upon Point Mutations with a Gaussian Network Model. Int. J. Mol. Sci. 2022, 23, 10711. https://doi.org/10.3390/ijms231810711