Question about non-focus columns and DMS scores #9

Closed
hnisonoff opened this issue Sep 23, 2022 · 7 comments

Comments

@hnisonoff

Hi, I noticed that for some DMS studies there are EVmutation scores for mutations at positions that do not appear to be focus columns in the MSAs you provided. Is EVmutation using a different MSA?

As an example, for the BLAT_ECOLX_Stiffler_2015 dataset, EVmutation has distinct scores for mutations at position 24:

  mutant  EVmutation
0   H24C   -7.206646
1   H24Y   -5.784716
2   H24W   -5.258699
3   H24V   -5.273463
4   H24T   -3.646145

However, in the MSA file the WT sequence is:

msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLS
RVdagQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGgPKELTAFLHNMGDHVTRL
DRWEPelneaiPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGS
RGIIAALGPDGKPSrIVVIYTTGSQatmdernrqiaeigaslikhw

Position 24 is a non-match column (lowercase in the WT sequence) and is filtered out during MSA processing. In this case, how are scores computed for these mutations?
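
For anyone who wants to check this quickly, here is a minimal sketch. It assumes the usual a2m convention in which uppercase letters in the WT row are match (focus) columns and lowercase letters are non-match columns, and that positions are numbered from 1 at the first residue of the row. The function name `focus_positions` is hypothetical, and only the start of the WT sequence is used.

```python
def focus_positions(wt_seq):
    """Return 1-based positions whose WT residue is uppercase (match columns)."""
    return [i for i, aa in enumerate(wt_seq, start=1) if aa.isupper()]


# Start of the WT row from the MSA file (assumption: numbering starts at 1 here).
wt_prefix = "msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGY"

position = 24
residue = wt_prefix[position - 1]
is_focus = position in focus_positions(wt_prefix)
print(f"Position {position}: residue {residue!r}, focus column: {is_focus}")
```

The residue at position 24 is a lowercase `h`, which matches the `H24` mutants in the score table while not being a focus column.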

Thanks!

@pascalnotin
Collaborator

Hi Hunter,

Great question! We are using the same MSAs for all models in the benchmark, including EVmutation. As you noted, EVmutation and other alignment-based approaches (e.g., EVE, DeepSequence, Site Independent) do not typically train on low-coverage positions, and therefore do not make predictions for them. In the first released version of our performance files, we used the standard approach for these alignment-based models, so EVmutation scores were only available for sufficiently covered positions.

However, the ProteinGym benchmarks also include models that are able to score all positions (e.g., Tranception, RITA), including the low-coverage ones. As a result, our initial performance files had two sets of model comparisons: one comparing all models on the subset of well-covered positions, and another comparing the subset of models able to score all positions on all mutants.

We subsequently investigated the effect of training alignment-based models on all positions, not just well-covered ones, as this would allow us to use these models to score all possible (substitution) mutations. We observed in particular that:

  1. The performance of these models (trained on all positions) on the subset of well-covered positions was on average similar to that of the same models trained on sufficiently covered positions only (a bit lower for some proteins, a bit higher for others, but similar in aggregate).
  2. The rank ordering of all models on sufficiently covered positions was nearly identical to the rank ordering of models on all mutants (using the newly trained versions of alignment-based models on all positions).

Consequently, to make things simpler, we are now only reporting one set of performance numbers for all models on all mutants, leveraging these alignment-based models trained on all positions (we made a note of that in the README).

To reproduce the scores we provide for EVmutation (or other alignment-based models), you would just need to pre-process all ProteinGym MSAs to ignore the low-coverage information (i.e., capitalize all sequences) and then train/score using the standard approach.
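
As a rough illustration of the capitalization step, here is a minimal sketch that uppercases every sequence line of a FASTA/a2m-style alignment. The function name `capitalize_msa`, the `dots_to_dashes` option, and the toy records are hypothetical. Converting `.` to `-` is an assumption about how insert-column gaps should be handled once all columns are kept, so check it against the tool you train with.

```python
def capitalize_msa(lines, dots_to_dashes=True):
    """Uppercase every sequence line; header lines ('>') pass through unchanged."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            out.append(line)
            continue
        seq = line.upper()
        if dots_to_dashes:
            # Assumption: '.' marks a gap in an insert column and should be
            # treated like a regular '-' gap once every column is kept.
            seq = seq.replace(".", "-")
        out.append(seq)
    return out


# Toy alignment (made-up records, not taken from a ProteinGym MSA).
example = [">WT/1-8", "msiqARVG", ">homolog_1", "..iqAR-G"]
for row in capitalize_msa(example):
    print(row)
```

On a real file you would pass the lines read from the MSA and write the result to a new file, leaving the original alignment untouched.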

@hnisonoff
Author

Thanks so much for the explanation!

@hnisonoff
Author

@pascalnotin sorry, I just noticed one other thing. It appears that the sequence weights you provided are for the MSAs with non-focus columns removed. Do you happen to have weights for the MSAs used when all positions were considered? Thanks!

@hnisonoff reopened this Oct 4, 2022

@pascalnotin
Collaborator

Hi @hnisonoff -- I just made these sequence weights (computed with all positions considered) available on our servers. You may download them as follows:

curl -o MSA_weights_substitutions_all_positions.zip https://marks.hms.harvard.edu/tranception/MSA_weights_substitutions_all_positions.zip
curl -o MSA_weights_indels_all_positions.zip https://marks.hms.harvard.edu/tranception/MSA_weights_indels_all_positions.zip

Please let me know if you run into any issues!

@hnisonoff
Author

Thank you so much! This saves me a lot of compute.

@brycejoh16

I think it should be noted that this sentence, although correct, can be misleading when computing EVE scores:

To reproduce the scores we provide for EVmutation (or other alignment-based models), you would just need to pre-process all ProteinGym MSAs to ignore the low-coverage information (i.e., capitalize all sequences) and then train/score using the standard approach.

This is because capitalizing the MSA alone will not work if you are having the EVE codebase preprocess your MSA directly. Here is the link to the class for preprocessing the MSA.

I think Pascal has previously mentioned this, but one can add two lines of code to evol_indices.py and train_VAE.py in order to predict on non-focus columns.

Pass in the parameter threshold_focus_cols_frac_gaps=1 at this and this line of code. With that setting, the MSA preprocessing keeps all non-focus positions, so they are included in training and prediction.
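
To illustrate why a threshold of 1 keeps every position, here is a toy sketch of a gap-fraction column filter. It is a simplified stand-in and not the EVE code: the function `columns_kept`, the `<=` comparison, and the toy alignment are all hypothetical, and only the parameter name `threshold_focus_cols_frac_gaps` comes from the comment above. The value 0.3 is just an arbitrary stricter setting for comparison.

```python
def columns_kept(sequences, threshold_focus_cols_frac_gaps):
    """Return 0-based column indices whose gap fraction is <= the threshold."""
    n_seqs = len(sequences)
    kept = []
    for col in range(len(sequences[0])):
        n_gaps = sum(1 for seq in sequences if seq[col] == "-")
        if n_gaps / n_seqs <= threshold_focus_cols_frac_gaps:
            kept.append(col)
    return kept


# Toy alignment (made up): the first three columns are gap-heavy.
toy_msa = ["MSIQA", "--IQA", "---QA", "-S-QA"]

strict = columns_kept(toy_msa, threshold_focus_cols_frac_gaps=0.3)
permissive = columns_kept(toy_msa, threshold_focus_cols_frac_gaps=1)
print("threshold 0.3 keeps columns:", strict)
print("threshold 1 keeps columns:", permissive)
```

Since a gap fraction can never exceed 1, a threshold of 1 means no column is dropped for having too many gaps.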

Hope this helps anyone in the future trying to solve this bug!

Take care,
Bryce

@pascalnotin
Collaborator

That's correct - thank you @brycejoh16!
