Clarification on uncertainty estimates for dataset metrics #22

Open
jmichel80 opened this issue Oct 23, 2019 · 15 comments

@jmichel80
Contributor

Hi @ppxasjsm, following discussions with @maxkuhn it would also be good to clarify how the uncertainty estimates for the dataset metrics are computed.

However, @maxkuhn says he has processed the FEP+ dataset with freenrgworkflows and obtained different R uncertainties even though similar DDG uncertainties were used. @maxkuhn please post your input with the commands you ran and the output you got so that @ppxasjsm and I can reproduce your findings.

@maxjump

maxjump commented Oct 23, 2019

Hi @jmichel80 and @ppxasjsm,

please find three CSV files containing the experimental and calculated affinities for the Thrombin dataset reported in the FEP+ paper. R is reported to be 0.71 +/- 0.24 ("Errors for [...] R values by use of the bootstrapping method are also reported.")

issue22.zip

R is 0.706, while the R confidence interval as calculated by freenrgworkflows is about 0.128 < 0.282 < 0.423 (i.e. 0.282 +/- 0.154 or 0.140) when taking the "cycle errors" into account, or 0.102 < 0.264 < 0.411 (i.e. 0.264 +/- 0.163 or 0.146) when assuming an error of 1.1 kcal/mol per perturbation.

Any feedback on what may be different about the bootstrapping performed in freenrgworkflows and the FEP+ paper is greatly appreciated.

@ppxasjsm
Collaborator

The first thing I noticed is that you are using absolute and not relative free energies. They are fine to use for R but not for other metrics. I'll write a method that will support reading in absolute free energies.

As for the other issues, I am still debugging the behaviour.

@maxjump

maxjump commented Oct 23, 2019

Thanks for the quick feedback. I forgot to mention that I am generating the files containing the relative binding free energies on the fly. Hence, we do not necessarily need a method implemented in freenrgworkflows to handle that.
I assume you do not need the files, but just in case:
issue22_relativeddG.zip
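
In case it is useful, the on-the-fly conversion is nothing more than pairwise differences of the absolute values; a rough sketch with made-up names and numbers, not the exact script I use:

    import itertools
    import pandas as pd

    # Placeholder absolute free energies (kcal/mol); ligand names and numbers are made up.
    abs_df = pd.DataFrame({
        "ligand": ["lig1", "lig2", "lig3"],
        "dG": [-9.5, -8.2, -10.1],
        "error": [0.3, 0.4, 0.3],
    })

    rows = []
    for (_, a), (_, b) in itertools.combinations(abs_df.iterrows(), 2):
        rows.append({
            "ligand_A": a["ligand"],
            "ligand_B": b["ligand"],
            "ddG": b["dG"] - a["dG"],                         # DDG = dG(B) - dG(A)
            "error": (a["error"]**2 + b["error"]**2) ** 0.5,  # errors combined in quadrature
        })

    pd.DataFrame(rows).to_csv("relative_ddG.csv", index=False)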

@ppxasjsm
Collaborator

So from my understanding of the SI, the analysis I implemented and the one they suggest are very similar. You end up subsampling both the experimental and computed values according to a Gaussian distribution (in fact, I do not take the experimental variation into account).

I draw new samples from a Gaussian centred on each free energy, i.e. from N(DG_i, sigma^2),

with sigma set to 1.1 kcal/mol for computed free energies and 0.4 kcal/mol for experimental free energies.

So my algorithm looks like this:

for each of N samples:
    clear lists
    for each compound:
        generate a new sample from a Gaussian centred on the DDG of that compound,
        with sigma of 1.1 (computed) or 0.4 (experimental)
        append the sample to the new list
    compute the correlation R between the new 'fake' experimental and computed values
    and append this R to a list of R values

compute the mean and standard deviation of the list of R values.
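
In plain Python that loop is roughly the following (a sketch with placeholder array names, not the actual freenrgworkflows code):

    import numpy as np
    from scipy.stats import pearsonr

    def subsampled_r(exp_ddg, calc_ddg, sigma_exp=0.4, sigma_calc=1.1, n_samples=10000):
        """Estimate R and its uncertainty by adding Gaussian noise to both data sets (kcal/mol)."""
        rng = np.random.default_rng()
        r_values = []
        for _ in range(n_samples):
            fake_exp = rng.normal(exp_ddg, sigma_exp)      # 'fake' experimental values
            fake_calc = rng.normal(calc_ddg, sigma_calc)   # 'fake' computed values
            r_values.append(pearsonr(fake_exp, fake_calc)[0])
        return np.mean(r_values), np.std(r_values)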

When I do this for one of your example datasets I get the following:
R = 0.33±0.25
And the generated distribution looks like this:
[histogram of the subsampled R values]

One might naively wonder why R is so much worse after the subsampling and assume there must be something wrong (hence the question). Actually, there is a paper that answers this quite well:
Realistic_samples.pdf

What I suspect has happened is that they computed the correlation coefficient of the original data and then the standard deviation of the subsampled data. The standard deviation is around 0.25, which is exactly what they report; it just so happens that the mean of the subsampled distribution gets drastically shifted by the introduced error, and it is this shifted mean that freenrgworkflows reports. I have attached my calculations for you to take a look at.

Essentially what I am advocating is that:

  1. both the Wang et al. paper and freenrgworkflows use Gaussian subsamples in an almost identical way (freenrgworkflows uses estimated errors on the free energies to generate the subsamples);
  2. freenrgworkflows reports the mean, std, and interval from the Gaussian subsamples;
  3. Wang et al. report R of the original dataset and the std of the subsampling (see the sketch below).
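
In code, the difference between points 2 and 3 is just which two numbers you quote; a toy example with made-up data to illustrate:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    exp_ddg = rng.normal(0.0, 1.5, size=20)             # made-up 'experimental' DDGs (kcal/mol)
    calc_ddg = exp_ddg + rng.normal(0.0, 1.1, size=20)  # made-up 'computed' DDGs

    r_samples = [pearsonr(rng.normal(exp_ddg, 0.4), rng.normal(calc_ddg, 1.1))[0]
                 for _ in range(5000)]

    # 2. freenrgworkflows-style: mean and std of the subsampled R distribution
    print(np.mean(r_samples), np.std(r_samples))

    # 3. my reading of Wang et al.: R of the original data, std from the subsampling
    print(pearsonr(exp_ddg, calc_ddg)[0], np.std(r_samples))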

The only thing is that I may misunderstand how to sample from these Gaussian distributions, but this was pretty much code copied from a script @jmichel80 had given me a long time ago.

explanation.zip

@maxjump

maxjump commented Oct 24, 2019

Thanks for the detailed feedback and clarification! I get the same results using my script.

However, it does not completely match the numbers provided in the paper:

If I understand the SI correctly, the approach above (assuming errors of 1.1 and 0.4 kcal/mol) is what is reported as the "anticipated FEP R-value" (second-to-last line of the table). When running the uncertainty estimation (fifth cell in your notebook) 100 times, the mean R and std are 0.34 +/- 0.26, which is slightly different from the reported 0.37 +/- 0.26.
Testing the same script on Tyk2 gives 0.62 +/- 0.13, which is significantly different from the reported 0.74 +/- 0.10. (input files: issue22_tyk.zip)

Furthermore, please note that there is a small difference between the standard deviation of the "observed R-value FEP" (9th line) and that of the "anticipated FEP R-value" (Thrombin: 0.24 vs 0.26, Tyk2: 0.07 vs. 0.10).

Hence, I assume that

  • we are not calculating the "anticipated FEP R-value" correctly, and
  • the bootstrapping for the "observed" values is performed slightly differently, not according to the method described in the SI.

However, in both cases I have no idea where the difference might be coming from. Any thoughts on this? @ppxasjsm @jmichel80

@ppxasjsm
Collaborator

What I implemented in the notebook is what the SI describes, as far as I understand it. I am not really sure what their DG_i value is, other than the one they report, i.e. the mean of the distribution. I'll give it another careful read.

The data you gave me is the FEP+ data?

@jmichel80
Contributor Author

jmichel80 commented Oct 24, 2019 via email

@ppxasjsm
Collaborator

What I don't understand is equation 1.1 in the SI. Is this just the condition that the overall sums of the experimentally known and the computed free energies must be the same?

@maxjump

maxjump commented Oct 24, 2019

@ppxasjsm I agree that the content of your notebook should reproduce the results outlined in the SI. As we both wrote code independently after reading the SI and get the same result, we might well both be overlooking something. The data is taken from the Excel sheet supplied in the SI.

Regarding your other comment: yes, they reweight the computed results so that the sums of the experimental and computed results are equal.
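
If I read eq. 1.1 correctly, that reweighting is just a constant shift of the computed values so that the two sums (equivalently, the two means) coincide; a sketch of my interpretation with made-up numbers:

    import numpy as np

    exp_dg = np.array([-9.5, -8.2, -7.9, -10.1])   # made-up experimental values (kcal/mol)
    calc_dg = np.array([-8.8, -7.5, -8.4, -9.0])   # made-up computed values

    # Shift every computed value by the same constant so that sum(calc) == sum(exp).
    calc_shifted = calc_dg - (calc_dg.mean() - exp_dg.mean())

    assert np.isclose(calc_shifted.sum(), exp_dg.sum())

If that is indeed the meaning, such a constant shift leaves Pearson R unchanged, so it should not affect the R comparison here.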

@ppxasjsm
Collaborator

Mmmh, so we are using the DGs they provide and are generating new samples based on the Gaussian distributions they are using...

@maxjump

maxjump commented Oct 24, 2019

@jmichel80 Thanks for your suggestions. I have been running it using 1000 repeats now, but the mean and stdev do not change. I find this really strange...

Assessing the uncertainty of R in the way you suggest certainly makes sense. However, reporting the actual R value together with the standard deviation of the subsampled R values does not seem completely correct to me, and I have not felt too confident doing that. What's your view on that?

I am also not a huge fan of the 1.1 kcal/mol estimate, especially since some perturbations are clearly easier than others. I would be okay with this approach if I expected all perturbations to be of similar difficulty (i.e. the same type of transformation). For the Thrombin dataset, for example, it overestimates the difficulty, as there are only fairly easy perturbations.

@ppxasjsm It may sound dumb, but would you mind extracting the data for Thrombin and Tyk2 from the SI yourself and running your script on it, just to make sure I am not messing something up at that early stage?

@ppxasjsm
Collaborator

I get the same paper prediction for Tyk2: 0.89 is the direct correlation coefficient of experimental vs. computed. The std I get from the subsampling is slightly larger than the 0.07, but the correlation is the same.

I also don't think we should use 1.1 kcal/mol as a general error for computed results; some datasets are certainly more uncertain than that, and the variability from repeats seems a better indicator.

I don't think you should report R on the raw data together with the std from the subsampling; that is misleading, even though with the subsampling you are going to get a seemingly worse correlation. Reporting them separately may be useful, but the most meaningful value is really the subsampled mean±std, or the confidence interval, since the distribution is not perfectly Gaussian.
This is also a good read, if you haven't had a look at this yet:
http://practicalcheminformatics.blogspot.com/2019/07/how-good-could-should-my-models-be.html
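
The 0.128 < 0.282 < 0.423 style interval quoted earlier in this thread comes from percentiles of the subsampled distribution rather than from assuming normality; roughly:

    import numpy as np

    # r_samples would be the list of subsampled R values from the notebook;
    # placeholder draws here just so the snippet runs on its own.
    rng = np.random.default_rng(1)
    r_samples = rng.normal(0.28, 0.15, size=10000)

    lower, median, upper = np.percentile(r_samples, [2.5, 50.0, 97.5])
    print(f"R = {median:.3f} (95% interval {lower:.3f} to {upper:.3f})")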

I can extract the data and do a consistency check, yes; I will do that now and get back to you.

@ppxasjsm
Collaborator

So I had another very close look at this and went through all the FEP+ datasets.

I can regenerate the straight-up R value for all of them, but for the values resampled from the distributions I cannot get the right results. I also noticed some seemingly inconsistent results in the spreadsheet.

See for example CDK2: R is 0.48+-0.19, but supposedly the Exp R FEP is 0.73+-0.11, which definitely seems off.

Attached is my complete analysis of this, but the simple print-out looks like this:

/Users/toni_brain/Desktop/issue22/proteins/bace.csv
R from statistics is: 0.535 ± 0.098, FEP+: 0.64+-0.09
Direct R is: 0.781	 FEP+ R is 0.78
====================================
/Users/toni_brain/Desktop/issue22/proteins/cdk2.csv
R from statistics is: 0.242 ± 0.217, FEP+: 0.73+-0.11
Direct R is: 0.476	 FEP+ R is 0.48
====================================
/Users/toni_brain/Desktop/issue22/proteins/jnk1.csv
R from statistics is: 0.662 ± 0.097, FEP+: 0.64+-0.12
Direct R is: 0.846	 FEP+ R is 0.85
====================================
/Users/toni_brain/Desktop/issue22/proteins/mcl1.csv
R from statistics is: 0.602 ± 0.074, FEP+: 0.71+-0.07
Direct R is: 0.772	 FEP+ R is 0.77
====================================
/Users/toni_brain/Desktop/issue22/proteins/p38.csv
R from statistics is: 0.462 ± 0.107, FEP+: 0.67+-0.08
Direct R is: 0.653	 FEP+ R is 0.65
====================================
/Users/toni_brain/Desktop/issue22/proteins/ptp1b.csv
R from statistics is: 0.588 ± 0.110, FEP+: 0.79+-0.07
Direct R is: 0.803	 FEP+ R is 0.80
====================================
/Users/toni_brain/Desktop/issue22/proteins/thrombin.csv
R from statistics is: 0.339 ± 0.255, FEP+: 0.37+-0.26
Direct R is: 0.706	 FEP+ R is 0.71
====================================
/Users/toni_brain/Desktop/issue22/proteins/tyk2.csv
R from statistics is: 0.624 ± 0.133, FEP+: 0.74+-0.10
Direct R is: 0.891	 FEP+ R is 0.89
====================================

For the notebook to work, you'll have to change the file paths, but I guess that is kind of self-explanatory.

proteins.zip
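
For reference, the loop behind that print-out is essentially the following (a sketch; the column names are assumptions and the paths need changing, just like in the notebook):

    import glob
    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    rng = np.random.default_rng()
    for path in sorted(glob.glob("proteins/*.csv")):
        df = pd.read_csv(path)  # assumed columns: "exp", "calc" (kcal/mol)
        exp_dg, calc_dg = df["exp"].to_numpy(), df["calc"].to_numpy()

        r_samples = [pearsonr(rng.normal(exp_dg, 0.4), rng.normal(calc_dg, 1.1))[0]
                     for _ in range(5000)]

        print(path)
        print(f"R from statistics is: {np.mean(r_samples):.3f} ± {np.std(r_samples):.3f}")
        print(f"Direct R is: {pearsonr(exp_dg, calc_dg)[0]:.3f}")
        print("=" * 36)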

In terms of how to proceed: I am at a bit of a loss as to how exactly we are meant to reproduce the stated data with the information given, but I don't really see anything wrong with what we are doing. @jmichel80 What do you think?

@ppxasjsm
Collaborator

Ah, in terms of freenrgworkflows, I can incorporate the sampling of the experimental uncertainties as well, which isn't currently done. Let me know if you want me to do a quick update of the code.

@maxjump

maxjump commented Oct 27, 2019

Hi Toni, thanks again for your feedback, suggestions and for testing the datasets. It's really strange that we cannot reproduce this, especially as the results differ quite a lot (looking at the p38 dataset, for example). As you suggested, I also tend towards reporting the direct R and the subsampled R and std separately.
I will look into this again once I am back from the GCC (I am off this week). In the meantime, it would be great if you could get the code running in freenrgworkflows, which hopefully should not be too much of an issue. Cheers!
