Question about non-focus columns and DMS scores #9
Comments
Hi Hunter, great question! We are using the same MSAs for all models in the benchmark, including EVmutation. As you noted, EVmutation and other alignment-based approaches (e.g., EVE, DeepSequence, Site Independent) do not typically train on (and therefore do not make predictions for) low-coverage positions. In the first released version of our performance files, we used the standard approach for these alignment-based models, so the scores for EVmutation were only available for sufficiently covered positions. However, the ProteinGym benchmarks also include models that are able to score all positions (e.g., Tranception, RITA), including the low-coverage ones. As a result, our initial performance files had two sets of model comparisons: one comparing all models on the subset of well-covered positions, and another comparing the subset of models able to score all positions on all mutants. We subsequently investigated the effect of training alignment-based models on all positions, not just well-covered ones, as this would allow us to use these models to score all possible (substitution) mutations. We observed in particular that:
Consequently, to make things simpler, we now report only one set of performance numbers for all models on all mutants, using these alignment-based models trained on all positions (we made a note of this in the README). To reproduce the scores we provide for EVmutation (or other alignment-based models), you would just need to pre-process all ProteinGym MSAs to ignore the low-coverage information (i.e., capitalize all sequences) and then train/score using the standard approach.
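The pre-processing step described above (capitalizing all sequences so that no column is marked as low coverage) could be sketched roughly as follows. This is only an illustration, not the ProteinGym code: the function name `uppercase_msa_lines` is hypothetical, and the optional conversion of `.` gaps to `-` assumes the usual a2m convention in which `.` marks a gap in a lowercase (insert) column.

```python
from typing import Iterable, Iterator


def uppercase_msa_lines(lines: Iterable[str], dots_to_dashes: bool = True) -> Iterator[str]:
    """Yield MSA lines with every residue upper-cased (hypothetical helper).

    Header lines (starting with '>') are passed through unchanged.
    Assumption: '.' denotes a gap in a lowercase column and can be
    rewritten as '-' so that every column looks like a match column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Sequence identifier: keep exactly as is.
            yield line
        else:
            seq = line.upper()
            if dots_to_dashes:
                seq = seq.replace(".", "-")
            yield seq


if __name__ == "__main__":
    example = [">wt", "msiqARVG", ">seq2", "..a.ARVG"]
    for out in uppercase_msa_lines(example):
        print(out)
```

The result can be written to a new file and passed to the model's normal training pipeline, provided that pipeline does not apply its own coverage-based column filtering afterwards.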
Thanks so much for the explanation!
@pascalnotin sorry, I just noticed one other thing. It appears that the sequence weights that you provided are for MSAs with columns removed. Do you happen to have weights for the MSAs used when all positions were considered? Thanks!
Hi @hnisonoff -- I just made these sequence weights (when all positions considered) available on our servers. You may download them as follows:
Please let me know if there are any issues!
Thank you so much! This saves me a lot of compute.
I think it should be noted that this sentence, although correct, can be misleading when computing EVE scores.
This is because capitalizing the MSA alone will not work if you are having the EVE code base preprocess your MSA directly. Here is the link to the class for preprocessing the MSA. I think Pascal has previously mentioned this, but one can add two lines of code to
Pass in parameter
Hope this helps anyone in the future trying to solve this bug!
Take care,
That's correct - thank you @brycejoh16!
Hi, I noticed that for some DMS studies there are EVmutation scores for mutations that do not appear to be in focus columns of the MSAs that you provided. Is EVmutation using a different MSA?
As an example, for the BLAT_ECOLX_Stiffler_2015 dataset, EVmutation has unique scores for mutations at position 24:
However, in the MSA file the WT sequence is
msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLS RVdagQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGgPKELTAFLHNMGDHVTRL DRWEPelneaiPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGS RGIIAALGPDGKPSrIVVIYTTGSQatmdernrqiaeigaslikhw
Position 24 is a non-match column and is filtered out during MSA processing. In this case how are scores computed for these mutations?
Thanks!
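The check behind this question, whether a given position falls in a focus (match) column, can be sketched from the case of the WT sequence alone. This is a minimal sketch under stated assumptions: the helper name `focus_positions` is hypothetical, uppercase letters are taken to mark focus columns and lowercase letters non-focus columns, and numbering is assumed to start at 1 (the `start` argument is there in case the alignment header specifies a different offset).

```python
from typing import List


def focus_positions(wt_aligned: str, start: int = 1) -> List[int]:
    """Return the positions of the WT sequence that are focus columns.

    Assumptions (not taken from the ProteinGym code): uppercase residues
    mark focus/match columns, lowercase residues mark non-focus columns,
    and any whitespace in the pasted sequence is a line-wrap artifact.
    """
    seq = "".join(wt_aligned.split())  # drop spaces and line breaks
    return [start + i for i, residue in enumerate(seq) if residue.isupper()]


if __name__ == "__main__":
    # Beginning of the BLAT_ECOLX WT sequence quoted above.
    wt_prefix = "msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGYIELDLNSGKILESF"
    focus = set(focus_positions(wt_prefix))
    print(24 in focus)  # position 24 is lowercase 'h'
    print(40 in focus)  # position 40 is uppercase 'A'
```

Under these assumptions, position 24 is lowercase in the quoted sequence and would be dropped by the standard focus-column filtering, which is what prompted the question.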