Choice of regression target? #2

Open
maxentile opened this issue Feb 17, 2020 · 6 comments

Comments

@maxentile
Member

Currently the regression target is the total energy of each (molecule, configuration) pair, which includes predicting both a geometry-independent per-molecule offset and a geometry-dependent "strain" energy. However, for the QCArchive subset @yuanqing-wang is looking at, the variation in the per-molecule offsets initially appears much larger in magnitude than the conformation-dependent variation within a molecule's collection of snapshots.

Should we do something to decompose the variance into these two components, i.e. (1) predict the constant offset for each molecule, and (2) assuming away the constant offset, predict geometry-dependent strain energies for a given molecule? (To target (1), we can assume away any geometry-dependence and try to predict just the energy of a molecule's global minimum snapshot from its topology. To target (2), we can assume away any constant offset and try to minimize standard deviation of the residuals.)
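
As a concrete illustration of this decomposition (a minimal sketch only; `energies` and `molecule_ids` are placeholder arrays, not names from the project's code):

```python
import numpy as np

def decompose_energies(energies, molecule_ids):
    """Split each snapshot energy into a per-molecule constant offset (taken here as the
    molecule's minimum-energy snapshot) plus a geometry-dependent "strain" remainder."""
    offsets = {}
    strains = np.empty_like(energies)
    for mol in np.unique(molecule_ids):
        mask = molecule_ids == mol
        offsets[mol] = energies[mask].min()            # task (1): per-molecule offset
        strains[mask] = energies[mask] - offsets[mol]  # task (2): conformation-dependent part
    return offsets, strains
```

For task (2), an error metric that ignores any leftover constant shift is the per-molecule standard deviation of (predicted - true) strain energies.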

Also, the energy prediction currently does not include an electrostatic contribution. Should the regression target be something other than total energy? (Initially, it seems reasonable to target the valence contributions, for example by targeting QM total energy minus an MM-predicted nonbonded contribution, where the MM prediction uses Parsley's partial charges, sigmas, epsilons, combining rules, and exceptions.)
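
A minimal sketch of forming such a target, assuming the openff-toolkit + OpenMM stack (the import paths are the current ones; `qm_energies` and `conformers` are placeholders, and this is illustrative rather than any existing pipeline):

```python
import numpy as np
import openmm
from openmm import unit
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

def mm_nonbonded_energies(molecule: Molecule, conformers,
                          forcefield_name="openff-1.0.0.offxml"):
    """MM nonbonded energy (electrostatics + vdW, including 1-4 exceptions) per conformer,
    using Parsley (openff-1.0.0) parameters."""
    system = ForceField(forcefield_name).create_openmm_system(molecule.to_topology())

    # Put the NonbondedForce in its own force group so its energy can be queried alone.
    for force in system.getForces():
        force.setForceGroup(1 if isinstance(force, openmm.NonbondedForce) else 0)

    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)

    energies = []
    for xyz in conformers:  # each xyz: (n_atoms, 3) positions in nanometers
        context.setPositions(xyz)
        state = context.getState(getEnergy=True, groups={1})
        energies.append(state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole))
    return np.array(energies)

# valence-focused target: QM total energy minus the frozen MM nonbonded contribution
# target = qm_energies - mm_nonbonded_energies(molecule, conformers)
```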

@jchodera
Member

Our target shouldn't care about the target-dependent offset, should it?

We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

@maxentile
Member Author

> Our target shouldn't care about the target-dependent offset, should it?

I guess it depends on what the goal is. For modeling the conformational distribution of a given molecule, any constant offset of the energy is of course irrelevant. For estimating a logZ, a constant offset is relevant. (For estimating logZ differences, some constant offsets become irrelevant again.)
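
To spell out why the offset cancels in one case but not the other:

```latex
p(x) = \frac{e^{-\beta\,(U(x)+c)}}{\int e^{-\beta\,(U(x')+c)}\,dx'}
     = \frac{e^{-\beta U(x)}}{Z}
\qquad \text{(a constant offset $c$ cancels from the conformational distribution)}

\log\!\int e^{-\beta\,(U(x)+c)}\,dx = \log Z - \beta c
\qquad \text{(but it shifts $\log Z$; it cancels from a $\log Z$ difference only when both endpoints share the same $c$)}
```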

I think in this project the priority should be the conformation-dependent part, as the offset is not always needed, is not really modeled in MM, and can be obtained by other means if needed.

At least, I would like to separate our reported regression errors into those two tasks, rather than treating both as a single task.

> We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

Sorry, I didn't mean to suggest that QM_total minus MM_nonbonded was a quantity we should try to get by decomposing the results of a QM calculation.

Instead I was suggesting to "freeze" all the parameters of the MM nonbonded model that we plan to use, and fit the MM valence terms to the residual. (The QM doesn't decompose into valence + electrostatic + vdW, but the MM model does.)

A modeling reason to consider doing this is if our LJ parameters have important information about condensed-phase or intermolecular behavior "baked in" that we (1) don't expect to be able to infer reliably from QM energies of isolated small molecules in vacuum, or (2) risk messing up by fitting to those same vacuum QM energies.

A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants. (Looking at reports @yuanqing-wang has generated from initial experiments that included LJ but not electrostatics in a model for total energy, numerical stability does seem to be a relevant concern here.)

@jchodera
Member

> A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants.

Another widely-supported possibility that is less "aggressive" is to use exponential-6 (Buckingham) instead of LJ 12-6:
https://en.wikipedia.org/wiki/Buckingham_potential
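
For reference, the two functional forms side by side (a sketch; parameters are illustrative, not fitted values):

```python
import numpy as np

def lennard_jones_12_6(r, sigma, epsilon):
    """12-6 Lennard-Jones: steep r**-12 repulsion plus r**-6 dispersion."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def buckingham_exp_6(r, A, B, C):
    """Exponential-6 (Buckingham): softer exponential repulsion, same r**-6 dispersion.
    Note it turns over and goes to -inf as r -> 0, which needs special handling in practice."""
    return A * np.exp(-B * r) - C / r**6
```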

@maxentile
Member Author

In addition to solidifying the choice of which quantities we want to regress on (relative potential energy vs. relative potential energy minus certain nonbonded terms), I think we need to narrow down the collection of molecules, the way the snapshots are generated, and the way the target energies are computed.

I think so far @yuanqing-wang has mostly looked at molecules in the ANI dataset (very off-equilibrium, but with snapshots further filtered by an energy threshold), the QM9 dataset (minimized), and samples from some QCArchive datasets (usually nearly minimized, sometimes generated by torsion scans).

For the positive control experiments where we seek to recover a molecular mechanics energy model, I think we can initially use one of the OpenFF coverage sets as the molecule collection, and generate (snapshot, energy) pairs by vacuum MD at a reasonable temperature (300K? 500K?) using the forcefield we wish to recover. I wouldn't expect to be able to "generalize across molecules" particularly well from a minimal coverage set, since the set may exercise each FF parameter only a few times, but once we're satisfied with training-set performance there we can move onto something bigger and nicer like the Roche or Bayer set.
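
A minimal sketch of that data-generation step, assuming the openff-toolkit + OpenMM stack (the function name, defaults, and temperature here are placeholders rather than an agreed protocol):

```python
import numpy as np
import openmm
from openmm import unit
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

def vacuum_md_snapshots(smiles, forcefield_name="openff-1.0.0.offxml",
                        temperature=300.0 * unit.kelvin,
                        n_snapshots=100, steps_between_snapshots=1000):
    """Generate (snapshot, energy) pairs for one molecule by vacuum Langevin dynamics,
    using the force field we later try to recover."""
    molecule = Molecule.from_smiles(smiles)
    molecule.generate_conformers(n_conformers=1)
    system = ForceField(forcefield_name).create_openmm_system(molecule.to_topology())

    integrator = openmm.LangevinMiddleIntegrator(temperature, 1.0 / unit.picosecond,
                                                 1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    # Depending on the toolkit version, the conformer may need converting to OpenMM units
    # (e.g. via openff.units.openmm.to_openmm).
    context.setPositions(molecule.conformers[0].to_openmm())
    context.setVelocitiesToTemperature(temperature)

    snapshots, energies = [], []
    for _ in range(n_snapshots):
        integrator.step(steps_between_snapshots)
        state = context.getState(getPositions=True, getEnergy=True)
        snapshots.append(state.getPositions(asNumpy=True).value_in_unit(unit.nanometer))
        energies.append(state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole))
    return np.array(snapshots), np.array(energies)
```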

To be a bit more explicit about a control experiment where I expect to be able to make the training error go nearly to 0, to check that the overall regression setup is workable:

@jchodera
Member

jchodera commented May 15, 2020

I doubt small coverage sets are going to be valuable because they sample chemical space very sparsely. There's really no way to "learn" from that kind of information.

I think the only reasonable approaches here are:

  • One molecule: Generate an exhaustive dataset for one molecule. This will let us address how well we can learn a detailed potential, and how much data we need to do so.
  • A very limited but well-sampled set of molecules, like AlkEthOH: This would allow us to see how well we can learn a well-sampled chemical space. We can generate lots of configurations and see again how many conformers/molecule are needed.
  • A larger molecular set with good coverage of chemical space: The FreeSolv set is a small example, and the parm@frosst parameterized set is a larger example. I'm not sure how many examples we need to really learn a whole force field, but it may be a very large (>100K) number.
  • Small molecules in solvent.

The process of generating data and training sounds good, though!

@maxentile
Member Author

These make sense, thanks!
