
define a MDAnalysis.analysis user interface #719

Closed
orbeckst opened this issue Feb 12, 2016 · 51 comments

Comments

@orbeckst
Member

tl;dr: The MDAnalysis.analysis modules do not have a unified user interface, which is bad for users and bad for developers. We need to come up with a set of rules describing the analysis modules' user interface.

Divergent user interface in MDAnalysis.analysis

The MDAnalysis.analysis (and MDAnalysis.visualization) modules collect various kinds of "tools" to analyze simulations; in some sense, they are responsible for the "analysis" in MDAnalysis. However, while we have been pretty stringent about what our API inside the "core" should look like, we have been much less prescriptive with analysis. To a good degree this reflects the reality that code is mainly contributed by researchers who wrote something to get a particular job done and then realized that it might be usable for the rest of the community – of course, that's exactly what we want for a user-driven open source project! On the other hand, there seems to be a growing feeling among developers that we should have a more uniform interface to the analysis tools as well.

Ideally, all our analysis tools should have a common philosophy and share a common set of options. Being able to use different analysis tools "out of the box" once you have a basic understanding of how they work makes for a good overall user experience.

From the developer side, it promotes code re-use and modularization with subsequent improvements in testing coverage and code reliability.

Using AnalysisBase

@richardjgowers wrote a prototype MDAnalysis.analysis.base.AnalysisBase class, and in recent code reviews of contributions to analysis we have been pushing for basing analysis code on this class. But in discussions such as the one on PR #708 it is becoming clear that we should settle on what we expect analysis code to do, not least so that developers, who spend a significant amount of time just cleaning up old mess when they implement code fixes and add new features, know where to set priorities and what is expected of them.

AnalysisBase outlines how to structure typical frame-based analysis but it does not really say (yet) what a user should be able to expect from analysis tools.

Different models for the user interface

Some of the current analysis tools come with additional methods to immediately plot data; many are able to write intermediate and final data to a file for reuse (and some can even re-read that file and perform plotting without needing to reanalyze a trajectory); most of them store results as numpy arrays in an attribute results (often a dict for multiple results).

A more purist approach is to just return the final data structures, throw away intermediates, not even store the final results, and let the user do all downstream processing and plotting.

I can see five broadly defined models for how we could handle the user interface:

  1. Anarchy: Do not prescribe any user interface and let each analysis tool writer decide what's best and most appropriate.

  2. Minimalist (or developer-friendly?):

    • class-based: prescribe use of AnalysisBase and stipulate that run() returns all computed data.
    • function-based: only provide a function that performs the data reduction and returns all computed data
  3. Baroque (or user-friendly?): prescribe AnalysisBase with additional features, for example (discussion needed!)

    • plot() for a simple visualization of the data (remember that sometimes data plotting is pretty involved, see for instance, PSAnalysis.plot()!)
    • save() to store data as a file on disk
    • to_df() to return as a pandas.DataFrame

    For any of these features you need to store the data inside the class somewhere.

  4. Eclecticism: Somewhere between Minimalist and Baroque, with some features mandatory and others optional (but which ones?).

  5. Bauhaus (the emerging consensus from the discussion below): a cohesive reduction to a common set of functional elements together with minimalist inspirations.

    • Prescribe AnalysisBase with a common feature set (like Baroque) with the goal to have a unified and utilitarian interface.
    • Provide the core numerical analysis (especially for frame-based analysis) as a function in the same module. This function is used in the _single_frame() method.

Feel free to edit/add to the list.
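As a rough sketch of what the Bauhaus split could look like (all names here are hypothetical, not an existing MDAnalysis API): the numerical core is a plain function over numpy arrays, and the analysis class merely drives the trajectory loop and delegates to it from _single_frame():

```python
import numpy as np

def radius_of_gyration(positions, masses):
    """Bare numerical core: operates on plain numpy arrays only,
    so it can be tested, reused, and cythonized independently."""
    com = np.average(positions, axis=0, weights=masses)
    squared = np.sum(masses[:, None] * (positions - com) ** 2) / masses.sum()
    return np.sqrt(squared)

# The companion AnalysisBase subclass would then delegate per frame:
#
#     def _single_frame(self):
#         self.results.append(
#             radius_of_gyration(self.atomgroup.positions,
#                                self.atomgroup.masses))
```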

What do we need?

I am asking @MDAnalysis/coredevs (and anyone else interested) to chime in with opinions on what to do. The final outcome of this issue should be a consensus on a set of rules (or a statement of the absence of rules, for option 1) on how code in analysis ought to interface with the user. These rules will then become part of the Developer Guide.

History

  • 2016-02-15: added the Bauhaus model (best-of-both-worlds) to the list of options as it emerged from the discussion below, and added a note to the minimalist option along the lines of what @jandom originally proposed.
  • 2016-02-22: consensus appears to be to go for the Bauhaus design model
@mnmelo
Member

mnmelo commented Feb 13, 2016

#175, on the interchangeability of AtomGroups/selection strings, is still open, partly because I haven't yet come up with a pythonic way to implement it.
I think we should definitely set some analysis standards in that regard (keeping in mind that some analyses might need selection strings instead of groups so that a selection refresh can be done every frame).
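One duck-typed way to make the two interchangeable is to normalise the argument at the top of each analysis (a hypothetical helper, not something MDAnalysis provides); analyses that need a per-frame selection refresh would keep the string instead:

```python
def get_atomgroup(universe, sel):
    """Accept either a selection string or an already-built
    AtomGroup-like object (hypothetical convenience helper)."""
    if isinstance(sel, str):
        # strings go through the usual selection machinery
        return universe.select_atoms(sel)
    # anything else is assumed to already be an AtomGroup
    return sel
```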

@jbarnoud
Contributor

I am all in favour of the baroque option. Some analyses can generate multiple results, each of which can be manipulated in various ways.

With the minimalist option, such an analysis would return a bunch of arrays (or dictionaries, or any object really) that 1) lose their context once returned and 2) may need a bunch of helper functions to get plotted or saved. In addition, returning a tuple of arrays, most of which have the same shape, can rather easily result in these arrays being mixed up.

With the baroque option, these analyses would not return anything. Instead, all the results are accessible from the analysis object. The results keep their context; some can be calculated lazily without the user having to care about what happens under the hood; and all the helper functions specific to the analysis are methods of the analysis.

I think the baroque option is easier for the user, who has everything they need in a single place. It is also easier for the devs because a lot of things can be shared: for the simplest cases, there is no need to reimplement plot, save or to_df, as they can be implemented once in AnalysisBase.

The anarchy option is not sustainable.

@richardjgowers
Member

So what we found on the parallelisation of AnalysisBase is that we need to save results into the class, and then sum/concatenate these when we merge the different instances. So I think saving results in the class might be mandatory.

Also, with the auto-parallelisation, if we make it mandatory for the raw results (before any averaging etc.) to be stored in a prescribed place/format, it becomes easier for this to be automatic. So, e.g., with a dict of arrays we can just concatenate the arrays for each key of the dict.

All that said, I don't want to ever turn something down which is otherwise good just because they didn't use my data structures. But then again, everything being uniform is a nice idea too...
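The dict-of-arrays merge could look like this (a sketch of the idea, not the actual parallel code):

```python
import numpy as np

def merge_results(blocks):
    """Combine per-worker result dicts from a parallel run: each worker
    analysed a contiguous slice of frames, so concatenating key by key
    restores the results for the whole trajectory."""
    merged = {}
    for key in blocks[0]:
        merged[key] = np.concatenate([block[key] for block in blocks])
    return merged
```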

@richardjgowers
Member

And I think @jbarnoud says everything else I was thinking much better than I could put it

@jandom
Contributor

jandom commented Feb 13, 2016

@jbarnoud @richardjgowers sorry to crash the party guys but i couldn't agree less

tl;dr: please keep everybody happy and build fancy functionality in a way that allows a simple-minded person to get just what they asked for.

With no empirical data on 'users', I don't know what they want - but I don't think that they're that different from devs.

When thinking about analysis, the meme "you had one job" comes to mind - the aim of the code in this module is first and foremost to perform some reduction of input data and return the result. Let me share some frustrations about implementing the simple contact analysis in MDA last week.

baroque vs minimal

Have you seen this analysis http://mdtraj.org/latest/examples/native-contact.html?
Do you think anybody is going to pick up a 400-line super-clever object-oriented class over a 10-line function that they can reason about? We can't do user A/B testing, but I'd bet $100 that the pure function will win: it's easier to read and extend.

Analysis is complex, returns multiple things

Here is a familiar example from numpy

H, xedges, yedges = np.histogram2d(y, x, bins=(xedges, yedges))

3 values returned... should we instead have a wrapper class because people don't understand what a 2d histogram is? If an analysis is complex, maybe it should be broken up into smaller bits...

Non-advertised output files written out

The crown example is
https://mdanalysis.googlecode.com/svn/trunk/doc/html/documentation_pages/analysis/helanal.html
It will pollute your cwd with all sorts of files. They're actually useful files! But they're written out in a format that requires a custom reader! Nothing should be written out if no reader is provided or if it cannot be loaded with pandas/numpy. Writing outputs leads to argument bloat in object constructors, with prefix, postfix and such... again, "you had one job".

Intermediate outputs written out

Typically justified for performance reasons. Introduces state and inevitably leads to a mess: perhaps the largest sinner. You have to

  • check if intermediate exists,
  • check if input changed, recompute intermediate if so,
  • keep track of an intermediate checksum/input checksum to catch accidental corruption,
  • clean up the intermediate if something went wrong.

There are only really rare circumstances in which intermediate results will help. Re-computing as the default strategy will save time and sanity.

Plotting functions

Plotting functions are useful only if they add value. In pandas, plotting the dataframe makes sense: it's a complicated matplotlib figure and would be a non-trivial job. Plotting an average contact map, or a timeseries? Thanks, I think I can do that. Drawbacks of including plotting: coupling to plotting code, and it is non-trivial to test.

@jbarnoud
Contributor

@jandom It is interesting to have another point of view. I will try to respond to your comments.

tl;dr The object model barely makes things more complicated, but it factorises a bunch of features. These features become available without effort from the developer of the analysis, and are extendable for free.

baroque vs minimal

With the object-oriented approach, you just have to care about your analysis and almost nothing else. As a developer, you inherit from the AnalysisBase class and write:

  • a method with what comes before you iterate over the frames,
  • a method with what happens for each frame,
  • and a method with what comes after the loop.

All this code has to be written anyway, whatever coding style you choose. But, because you inherited from AnalysisBase, you get a bunch of things for free. For instance, users will be able to run the analysis on a slice of the trajectory without you having to do anything about it. Eventually, you will get parallelisation without touching your code. I do not know if AnalysisBase deals with a progress meter, but this should not be difficult to implement. In the future, who knows what kind of goodness will be added.

The function you wrote in a comment on #702 could be roughly written that way (not tested, just from the top of my mind):

import numpy as np
import pandas as pd

from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.distances import distance_array


class CalculateContacts(AnalysisBase):
    def __init__(self, ref, u, selA, selB, radius=4.5, beta=5.0, alpha=1.5,
                 begin=None, end=None, step=None):
        # OK that part is tedious...
        self.ref = ref
        self.u = u
        self.selA = selA
        self.selB = selB
        self.radius = radius
        self.beta = beta
        self.alpha = alpha
        # Here some magic happens (AnalysisBase keeps track of the frames)
        self._setup_frames(u.trajectory, begin, end, step)

    def _prepare(self):
        # reference groups A and B from selection strings
        grA = self.ref.select_atoms(self.selA)
        grB = self.ref.select_atoms(self.selB)

        # reference distances (r0)
        self.dref = distance_array(grA.positions, grB.positions)

        # select reference distances that are less than the cutoff radius
        self.mask = self.dref < self.radius

        # groups A and B in the trajectory
        self.grA = self.u.select_atoms(self.selA)
        self.grB = self.u.select_atoms(self.selB)
        self._results = []

    def _single_frame(self):
        d = distance_array(self.grA.positions, self.grB.positions)
        x = 1 / (1 + np.exp(self.beta * (d[self.mask]
                                         - self.alpha * self.dref[self.mask])))
        self._results.append((self._ts.time, x.sum() / self.mask.sum()))

    def _conclude(self):
        self.result = pd.DataFrame(self._results, columns=["Time (ps)", "Q"])
        del self._results

This is indeed a bit longer than the version with a single function, especially because of the tedious __init__ method. We are far from the 10 vs 400 lines you mention, though. This version, however, handles working on a slice of a trajectory, which the version in #702 does not. Soon, it should handle parallelisation. The _conclude method is needed in the current state of the code, but we could have helpers in the base class for that kind of conversion, and others. Likewise, AnalysisBase should have helpers so that it alone has to care whether the user provided selection strings or AtomGroups.

Analysis is complex, returns multiple things

You mentioned HELANAL. HELANAL produces several outputs. In the "baroque" model, these outputs would be accessible as object attributes so the user can pick what they want. From what I understand, you suggest that the outputs should be produced by different functions, or that the function should return (at least) 7 different objects. Most of these outputs are based on the same calculations, and some may be needed to calculate others; splitting the analysis would mean calculating the same thing several times. I could live with the seven or so output objects, but I would rather keep them in context.

Non-advertised output files written out / Intermediate outputs written out

I do not think anybody advocates for intermediate output files. I do not see how this plays into the argument. But, clearly, HELANAL would benefit from being refactored in the framework of the AnalysisBase class.

Plotting functions

I agree that most of the time, plotting functions are trivial.

The plot method is just a matter of convenience in most cases. The same goes for the save method. As most analyses simply produce a collection of time series, it should be possible to have these methods implemented in AnalysisBase. Then developers have to overload them only if they have something special. One more thing not to care about.

But the minimal interface is not firmly defined yet. For the moment, plot and save are nothing but suggestions.
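A sketch of how such shared defaults might look (hypothetical mixin and attribute names): any analysis that stores equal-length time series in a results dict would inherit save and to_df unchanged and override them only for special cases:

```python
import numpy as np

class ResultsMixin:
    """Default convenience methods a base class could provide for analyses
    whose self.results is a dict of equal-length 1D arrays."""

    def to_df(self):
        import pandas as pd  # optional dependency, imported lazily
        return pd.DataFrame(self.results)

    def save(self, filename):
        # plain-text table, one column per result key, keys in the header
        keys = sorted(self.results)
        data = np.column_stack([self.results[k] for k in keys])
        np.savetxt(filename, data, header=" ".join(keys))
```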

Finally...

Using a class rather than a function does not make the coding much more complicated. Instead, it allows those who develop a new analysis to focus on the analysis itself and still get all the features we expect from an analysis (sliced trajectories, parallelisation...). The baroque model also gives some homogeneity to the collection of analyses, making them more predictable and thus easier to use.

@dotsdl
Member

dotsdl commented Feb 14, 2016

I'm for the baroque model. I think the MDAnalysis.analysis collection of submodules functions as a collection of dimensionality reductions as @jandom says, and that perhaps one of the best models of a library of dimensionality reductions is scikit-learn. That library is widely praised for having a consistent object-oriented API across its many estimators, which allows users to easily apply any of its offerings without having to figure out each one individually. They also have pipelining functionality that just works because of how swappable these estimators are.

I think our analysis submodules can be made to have a consistent feel, and that using classes instead of functions allows this to happen rather easily. We can also, as @jbarnoud said, figure out things like trajectory or task level parallelism globally.

@kain88-de
Member

That library is widely praised for having a consistent object-oriented API across its many estimators

I think the consistent interface is why most people (including me) like it so much. The way they accomplish this is less important. But objects and inheritance are a convenient way to achieve such an API in python.

And yeah I'm also for a baroque model

@jandom
Contributor

jandom commented Feb 15, 2016

Thanks for all your replies guys, especially @jbarnoud for a very complete response and good counter arguments. I'm somewhat convinced now, actually - much appreciated ;)

AnalysisBase is still the most sane thing out there, and if we can bring uniformity/consistency and avoid confusion, that would be tremendous. AnalysisBase is promising to be a map-reduce-like creature, with _single_frame being a map and _conclude being a reduce. It's a pretty well understood way to introduce parallelism and a sound plan.

Many of the types of analysis wrapped by AnalysisBase have _single_frame at their core. I'd like to cater to the minimalists among us by maybe having _single_frame call a pure function foo (possibly a static method of the class), such that foo can just be run without the class on some raw input.

The aim of this is 2-fold

  • allow user to have a quick try of the feature on sample inputs, potentially loaded outside of MDA (decoupled from our parsers/atom selection)
  • insurance policy in case the parallelism model gets changed (unlikely)

Or is that an unreasonable ask?

@dotsdl
Member

dotsdl commented Feb 15, 2016

@jandom I think this is a reasonable approach, and one we should totally take. We (almost) already do this in MDAnalysis.analysis.rms, in which we provide a basic rmsd function that does a single frame calculation, but also the RMSD class for getting the calculation over a trajectory. What you're suggesting is that in many cases the single frame calculation be made as a standalone function within the same module as the AnalysisBase-derived class, which itself probably calls that function each frame. Are we on the same page?

@kain88-de
Member

Many of the types of analysis wrapped by AnalysisBase have the _single_frame at its core. I'd like to cater to minimalists among us by maybe having _single_frame call a pure function foo (possibly a static method of the class), such that foo that can just be run without the class on some raw-input.

Don't make that function part of the class. Have it as a standalone function. That is what we are doing for the RMSD; see @dotsdl's comment. That function can then be used in _single_frame.
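For illustration, a bare-bones version of such a standalone single-frame function (simplified: no mass weighting or superposition, which the full MDAnalysis.analysis.rms.rmsd also handles):

```python
import numpy as np

def rmsd(a, b):
    """Root mean square deviation between two (N, 3) coordinate arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))
```

Inside the class, _single_frame() can then simply append rmsd() of the current positions against the stored reference coordinates (attribute names vary by analysis).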

@jandom
Contributor

jandom commented Feb 15, 2016

@dotsdl yes, we are on the same page! @kain88-de here we're actually on the same line (of code) on that page.

Ok, fantastic, you guys are great - thanks for being gentle with me, sometimes I'm too grumpy when I don't eat my snickers :)

@cing

cing commented Feb 15, 2016

User here, big fan of the baroque method. Some time ago, I wrote an MDAnalysis/MDTraj wrapper using a base Analysis class much like what is being described. The motivation for that was to include a checkpointing/resume feature so I could "sync up" old or new analyses types with a growing dataset. I suspect others have cooked up similar scripts to accumulate analysis. Anyway, not something that should be in version 1, just something that could be considered for the future that would work nicely in the baroque structure.

@orbeckst
Member Author

The idea of the minimalist numerical core (i.e. the single-frame analysis) existing as a thin function outside the analysis class seems doable for frame-based analysis. (It won't work easily for something like correlation functions without passing work arrays for in-place modification... at which point the class wins in terms of readability.) I assume that for these functions you would not prescribe any user interface, assuming that only experts would use them? Or something generic along the lines of

def calculate_foo(atomgroup, **kwargs):
    ...
    return stuff

possibly with data arrays passed in kwargs for in-place modification.

One advantage is that the calculate_foo() function can be more easily cythonized because the trajectory logic has been separated from the raw computation on positions. It is then easy to slot in the compiled function at a later stage (i.e. once someone finds the energy to cythonize it).
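The in-place variant could look like this (illustrative names throughout; the point is only the `out=` work-array pattern):

```python
import numpy as np

def calculate_contacts(pos_a, pos_b, radius, out=None):
    """Compute a boolean contact map between two coordinate sets.

    A preallocated boolean array can be passed via `out` so that
    repeated per-frame calls avoid reallocating the result."""
    # pairwise distance matrix between the two coordinate sets
    diff = pos_a[:, None, :] - pos_b[None, :, :]
    dist = np.sqrt((diff * diff).sum(axis=-1))
    if out is None:
        out = np.empty(dist.shape, dtype=bool)
    np.less(dist, radius, out=out)  # write the contact map in place
    return out
```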

@orbeckst
Member Author

@cing many thanks – the rare specimen of a user telling developers what they actually want ;-)

@orbeckst
Member Author

I added the Bauhaus model to the list, as the emerging consensus:

Bauhaus (the emerging consensus from the discussion above): a cohesive reduction to a common set of functional elements together with minimalist inspirations.

  • Prescribe AnalysisBase with a common feature set (like Baroque) with the goal to have a unified and utilitarian interface.
  • Provide the core numerical analysis (especially for frame-based analysis) as a function in the same module. This function is used in the _single_frame() method.

@richardjgowers
Member

@jandom having a standalone version of single_frame is actually a really good idea. Also lets people make their own map/reduce if they want to.

I'm not sure how we're coming up with these names, but ich bin ein Bauhauser

@jbarnoud
Contributor

The standalone function for the single-frame treatment is definitely good practice when possible. 👍


@jandom
Contributor

jandom commented Feb 16, 2016

@orbeckst +1 for the clever Bauhaus reference! When possible, if we can make these functions decoupled from any MDAnalysis logic, such as the AtomGroup, it would make them easy to understand and test. For the Best-Hummer Q, this "invariant" function could for example be:

import numpy as np

def foo(r, r0, beta, lambda_constant):
    """
    r : numpy.array, distances at time t
    r0 : numpy.array, distances at time 0
    beta : float, softness of the switching function
    lambda_constant : float, tolerance of the distance
    """
    return 1/(1 + np.exp(beta*(r - lambda_constant * r0)))

Anybody can just import it, or copy&paste into a notebook, and get an intuition for the parameters on some synthetic input.
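For example (repeating foo here so the snippet is self-contained), on synthetic distances the limits of the switching function are easy to see:

```python
import numpy as np

def foo(r, r0, beta, lambda_constant):
    return 1 / (1 + np.exp(beta * (r - lambda_constant * r0)))

# three synthetic native-contact distances and their reference values
r0 = np.array([4.0, 5.0, 6.0])   # distances at time 0
r_native = r0.copy()             # contacts still formed -> Q near 1
r_broken = r0 + 10.0             # contacts far apart    -> Q near 0

q_native = foo(r_native, r0, beta=5.0, lambda_constant=1.5).mean()
q_broken = foo(r_broken, r0, beta=5.0, lambda_constant=1.5).mean()
```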

Digression: for any parallelism efforts, most map-reduce will work with the abstraction operating on a single frame. However, it may be useful for the primitive to be a list of frames, which most of the time will contain a single element (RMSD, Q, etc.); this will allow all sorts of autocorrelations to be calculated in parallel (where access to two frames n and n-lagtime is needed).

@orbeckst
Member Author

tl;dr: I think we converged on the Bauhaus style with the option to factor out the core numerics as a bare-bones (read: numpy arrays) function that can be written independently of MDAnalysis.

When possible, if we can make these functions decoupled from any MDAnalysis logic - such as the AtomGroup - it would make it easy to understand, test.

Just passing coordinate arrays (or similar) will work in many cases and will also simplify writing cython/FORTRAN (hehe) code. So yes, if the code remains readable then by all means factor it out into the bare-bones variant (and add a unit test ;-) ).

On the other hand, a lot of the use of abstractions like AtomGroup comes from the fact that, once you know how to use them, you can write concise code without long argument lists, so I wouldn't want to prescribe that the "single frame" functions cannot use any MDAnalysis-specific functionality.

For correlation-time-like functions you can either come up with clever streaming algorithms (such as the one for computing the variance in one pass, or the blocked algorithm for correlation times in Frenkel & Smit) or, as you say, you don't call the core function a single-frame function and just let it work on a time series.
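The one-pass variance mentioned above is usually Welford's algorithm; it fits the streaming setting because each frame updates a running mean and a sum of squared deviations, with nothing else stored:

```python
def streaming_variance(samples):
    """Welford's one-pass algorithm: consume one value per frame without
    keeping the whole time series in memory."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n  # mean and population variance
```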

In any case, it just remains to put the Bauhaus design model into words for the Style Guide.

@orbeckst
Member Author

I think it has simmered for a sufficiently long time, and people have been using the Bauhaus approach when writing new classes or refactoring.

  • Use AnalysisBase
  • Keep a simple function for the single frame analysis where possible.

Once the essential points are transferred to a Wiki page we can close this issue (which I think is an excellent example for a very productive discussion).

@orbeckst orbeckst removed the proposal label Sep 23, 2016
@orbeckst
Member Author

Implementations for parallel analysis (such as PR #618) should be discussed in the context of the model agreed upon in this thread.

orbeckst pushed a commit that referenced this issue Sep 23, 2016
* fixes #893 
* RMSD refactored with AnalysisBase class and conforms to the new analysis API (see #719)
* Tests added
* Documentation and chaining functions
* Fixed and updated documentation
* Addressed broadcasting and other issues
* Updated CHANGELOG
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this issue Jan 5, 2017
* fixes MDAnalysis#893 
* RMSD refactored with AnalysisBase class and conforms to the new analysis API (see MDAnalysis#719)
* Tests added
* Documentation and chaining functions
* Fixed and updated documentation
* Addressed broadcasting and other issues
* Updated CHANGELOG
@orbeckst
Member Author

orbeckst commented May 7, 2017

Ping... I think all that needs to be done is to

  • summarize the consensus (it's a long discussion...)
  • transfer it to a wiki page,
  • link from the Style Guide

and possibly create an equivalent document in the docs in a developer section of Analysis. Once that is done, the wiki page can just link to the docs.

@orbeckst
Member Author

I added a preliminary wiki page MDAnalysis.analysis user interface with the consensus – please correct me if I overlooked anything.

If there are no requests for changes, I will finally close this issue in a few days.

@orbeckst
Member Author

orbeckst commented Jun 30, 2017

I have two questions/comments that have not been explicitly decided:

  1. We currently do not allow run(start, stop, step) but instead fix start/stop/step in __init__. There wasn't an explicit consensus as far as I can tell, but it is currently implemented this way.
  2. What about allowing another Universe as an argument for run? If this is allowed then _conclude has to be fancier and include a step where old and new data are combined. Furthermore, there are types of analysis (such as correlation functions) where I don't know how you would do this combining.

EDIT: added 2nd question

@orbeckst orbeckst self-assigned this Jun 30, 2017
@orbeckst orbeckst added this to the 1.0 milestone Jun 30, 2017
@kain88-de
Copy link
Member

We are currently not allowing run(start, stop, step) but fix start/stop/step in init. There wasn't an explicit consensus as far as I can tell but it is currently implemented in this way.

I thought for some time about making these arguments also available in run.

What about allowing another Universe as argument for run? If this is allowed then _conclude has to be more fancy and include a step where old and new data are combined. Furthermore, there are types of analysis (such as correlation functions) where I don't know how you would do this combining.

What would be the benefit?

@orbeckst
Member Author

You wanted Universe as an argument for run #719 (comment) ;-)

@jbarnoud
Contributor

jbarnoud commented Jun 30, 2017 via email

@kain88-de
Member

You wanted Universe as an argument for run #719 (comment) ;-)

That is specific to the PCA in my comment. It can't be generalized.

@kain88-de
Member

@orbeckst I added some small changes to your entry. Looks good so far.

@orbeckst
Member Author

orbeckst commented Jul 1, 2017

Excellent additions, thanks!

@orbeckst
Member Author

orbeckst commented Jul 1, 2017

Regarding my previous questions #719 (comment) the consensus so far seems to be:

  1. We should probably have run() take start, stop, and step instead of __init__.
  2. We won't prescribe run() to accept a Universe.

Comments welcome. For 1 I can open an issue. If we want it, then this needs to be settled before 1.0, together with deprecations and a transition period during which both __init__ and run() accept the arguments, sigh.

@jbarnoud
Contributor

jbarnoud commented Jul 3, 2017

I would also move verbose from __init__ to run.

@orbeckst
Member Author

orbeckst commented Jul 4, 2017

Sorry, forgot about verbose.

Being able to set verbose at init, too, might still be useful, because the preparation step might do various complicated things where one might actually want additional output (I'm thinking of hydrogen bond analysis, for example). Then __init__ could set the default and run() could override it.
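Put together, the proposal sketched in this thread would look roughly like this (illustrative skeleton, not the implemented API): __init__ sets the verbose default, run() carries the iteration range and may override verbose per call:

```python
class MyAnalysis:
    """Skeleton of the proposed interface discussed in this thread."""

    def __init__(self, atomgroup, verbose=False):
        self.atomgroup = atomgroup
        self._verbose = verbose  # default, used when run() does not override

    def run(self, start=None, stop=None, step=None, verbose=None):
        verbose = self._verbose if verbose is None else verbose
        if verbose:
            print("analysing frames", start, stop, step)
        # ... _prepare(), loop over trajectory[start:stop:step], _conclude() ...
        return self
```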

@orbeckst
Member Author

For the remaining discussion points (iteration range and verbose in run) I opened #1463.

orbeckst pushed a commit that referenced this issue Aug 13, 2019
* Move the water bridge analysis to the new analysis class as defined in #719
* add higher-order water bridge support – see PR #2087
* Add a brief theory section to the docs
* Add additional tests for water bridges
* Add hydrogen bond testing to the water bridge test cases
* Update CHANGELOG