define a MDAnalysis.analysis user interface #719
Comments
#175, on the interchangeability of AtomGroups/selection strings, is still open, partly because I couldn't yet come up with a pythonic way to implement it. |
I am all in favour of the baroque option. Some analyses can generate multiple results, each of which can be manipulated in various ways. With the minimalist option, such an analysis would return a bunch of arrays (or dictionaries, or any object actually) that 1) lose their context once returned and 2) may need a bunch of helper functions to get plotted or saved. In addition, returning a tuple of arrays, most of which have the same shape, can rather easily result in these arrays being mixed up. With the baroque option, these analyses would not return anything. Instead, all the results would be accessible from the analysis object. The results keep their context; some can be calculated lazily without the user having to care about what happens under the hood; and all the helper functions specific to the analysis are methods of the analysis object. I think the baroque option is easier for users, who have everything they need in a single place. It is also easier for the devs because a lot of things can be shared: for the simplest cases, there is no need to reimplement anything. The anarchy option is not sustainable. |
So what we found on the parallelisation of AnalysisBase is that we need to save results into the class, and then sum/concatenate these when we merge the different instances. So I think saving results in the class might be mandatory. Also, with auto-parallelisation, if we make it mandatory for the raw results (before any averaging etc.) to be stored in a prescribed place/format, it becomes easier for this to be automatic: e.g. for a dict of arrays we can just concatenate each key of the dict together. All that said, I don't ever want to turn something down which is otherwise good just because it didn't use my data structures. But then again, everything being uniform is a nice idea too... |
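The merge step described here, concatenating each key of per-worker result dicts, could look something like the following sketch (illustrative only, not actual MDAnalysis code; `merge_results` is a hypothetical name):

```python
import numpy as np

def merge_results(parts):
    """Merge per-worker result dicts by concatenating each key (illustrative)."""
    merged = {}
    for key in parts[0]:
        merged[key] = np.concatenate([part[key] for part in parts])
    return merged

# two workers analysed consecutive frame ranges of the same trajectory
part1 = {"time": np.array([0.0, 1.0]), "q": np.array([0.9, 0.8])}
part2 = {"time": np.array([2.0, 3.0]), "q": np.array([0.7, 0.6])}
result = merge_results([part1, part2])
print(result["time"])  # [0. 1. 2. 3.]
```

Frame order is preserved as long as the workers' partial results are concatenated in trajectory order.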
And I think @jbarnoud said everything else I was thinking, much better than I could put it. |
@jbarnoud @richardjgowers sorry to crash the party guys, but I couldn't agree less. **tl;dr**: please keep everybody happy and build fancy functionality in a way that allows a simple-minded person to get just what they asked for. With no empirical data on 'users', I don't know what they want, but I don't think they're that different from devs. When thinking about analysis, the meme "you had one job" comes to mind: the aim of the code in this module is first and foremost to perform some reduction of input data and return the result. Let me share some frustrations from implementing the simple contact analysis in MDA last week.

**baroque vs minimal**

Have you seen this analysis http://mdtraj.org/latest/examples/native-contact.html?

**Analysis is complex, returns multiple things**

Here is a familiar example from numpy:
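(The example itself was stripped in rendering; given the "3 values returned" remark, it was presumably something along the lines of `numpy.histogram2d`:)

```python
import numpy as np

# np.histogram2d returns three values -- no wrapper class in sight
x = np.random.randn(1000)
y = np.random.randn(1000)
H, xedges, yedges = np.histogram2d(x, y, bins=10)
print(H.shape, xedges.size, yedges.size)  # (10, 10) 11 11
```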
3 values returned... should we instead have a wrapper class because people don't understand what a 2d histogram is? If an analysis is complex, maybe it should be broken up into smaller bits...

**Non-advertised output files written out**

The crown example is HELANAL.

**Intermediate outputs written out**

Typically justified for performance reasons. Introduces state and inevitably leads to a mess: perhaps the largest sinner. You have to …
**Plotting functions**

Plotting functions are useful only if they add value. In pandas, plotting the dataframe makes sense: it's a complicated matplotlib figure and would be a non-trivial job. Plotting an average contact map, or a timeseries? Thanks, I think I can do that. Drawbacks of including plotting: coupling to plotting code, and it is non-trivial to test. |
@jandom It is interesting to have another point of view. I will try to respond to your comments. **tl;dr**: The object model barely makes things more complicated, but factors out a bunch of features. These features become available without effort from the developer of the analysis, and are extendable for free.

**baroque vs minimal**

With the object-oriented approach, you just have to care about your analysis, and almost nothing else. As a developer, you inherit from the `AnalysisBase` class …
All this code has to be written anyway, whatever coding style you choose. But, because you inherited from `AnalysisBase`, … The function you wrote in a comment of #702 could roughly be written this way (not tested, just from the top of my mind):

import numpy as np
import pandas as pd
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.distances import distance_array

class CalculateContacts(AnalysisBase):
    def __init__(self, ref, u, selA, selB, radius=4.5, beta=5.0, alpha=1.5,
                 begin=None, end=None, step=None):
        # OK that part is tedious...
        self.ref = ref
        self.u = u
        self.selA = selA
        self.selB = selB
        self.radius = radius
        self.beta = beta
        self.alpha = alpha
        # Here some magic happens
        self._prepare_frames(begin, end, step)

    def _prepare(self):
        # reference groups A and B from selection strings
        grA = self.ref.select_atoms(self.selA)
        grB = self.ref.select_atoms(self.selB)
        # reference distances (r0)
        self.dref = distance_array(grA.coordinates(), grB.coordinates())
        # select reference distances that are less than the cutoff radius
        self.mask = self.dref < self.radius
        # groups A and B in the trajectory
        self.grA = self.u.select_atoms(self.selA)
        self.grB = self.u.select_atoms(self.selB)
        self._results = []

    def _single_frame(self):
        # self._ts is the current Timestep provided by AnalysisBase
        d = distance_array(self.grA.coordinates(), self.grB.coordinates())
        x = 1 / (1 + np.exp(self.beta * (d[self.mask] - self.alpha * self.dref[self.mask])))
        self._results.append((self._ts.time, x.sum() / self.mask.sum()))

    def _conclude(self):
        self.result = pd.DataFrame(self._results, columns=["Time (ps)", "Q"])
        del self._results

This is indeed a bit longer than the version with a single function, especially because of the tedious `__init__`.

**Analysis is complex, returns multiple things**

You mentioned HELANAL. HELANAL produces several outputs. In the "baroque" model, these outputs would be accessible as object attributes so the user can pick what they want. From what I understand, you suggest that the output should be produced by different functions, or that the function should return (at least) 7 different objects. Most of these outputs are based on the same calculations, and some may be needed to calculate others; splitting the analysis would mean calculating the same things several times. I could live with the seven or so output objects, but I would rather keep them in context.

**Non-advertised output files written out / Intermediate outputs written out**

I do not think anybody advocates for intermediate output files; I do not see how this plays into the argument. But, clearly, HELANAL would benefit from being refactored in the framework of the `AnalysisBase` class.

**Plotting functions**

I agree that most of the time, plotting functions are trivial. … But the minimal interface is not firmly defined yet. For the moment, …

**Finally...**

Using a class rather than a function does not make the coding much more complicated. Instead, it allows those who develop new analyses to focus on their analysis and still get all the features we expect from an analysis (sliced trajectory, parallelisation, ...). The baroque model also gives some homogeneity to the collection of analyses, making them more predictable and thus easier to use. |
I'm for the baroque model. I think our analysis submodules can be made to have a consistent feel, and using classes instead of functions allows this to happen rather easily. We can also, as @jbarnoud said, figure out things like trajectory- or task-level parallelism globally. |
I think the consistent interface is why most people (including me) like it so much. How it is accomplished is less important, but objects and inheritance are a convenient way to achieve such an API in Python. And yeah, I'm also for the baroque model. |
Thanks for all your replies guys, especially @jbarnoud for a very complete response and good counter-arguments. I'm somewhat convinced now, actually, much appreciated ;) AnalysisBase is still the most sane thing out there, and if we can bring uniformity/consistency and avoid confusion, that would be tremendous. AnalysisBase is promising to be a map-reduce-like creature, with … Many of the types of analysis wrapped by AnalysisBase have the … The aim of this is 2-fold:
Or is that an unreasonable ask? |
@jandom I think this is a reasonable approach, and one we should totally take. We (almost) already do this in … |
Don't make that function part of the class; have it as a standalone function. That is what we are doing for the RMSD, see @dotsdl's comment. That function can be used in … |
@dotsdl yes, we are on the same page! @kain88-de here we're actually on the same line (of code) on that page. Ok, fantastic, you guys are great - thanks for being gentle with me, sometimes I'm too grumpy when I don't eat my snickers :) |
User here, big fan of the baroque method. Some time ago, I wrote an MDAnalysis/MDTraj wrapper using a base … |
The idea of the minimalist numerical core (i.e. the single-frame analysis) existing as a thin function outside the analysis class seems doable for frame-based analysis. (It won't work easily for something like correlation functions without passing work arrays for in-place modification... at which point the class wins in terms of readability.) I assume that for these functions you would not prescribe any user interface, assuming that only experts would use them? Or something generic along the lines of

def calculate_foo(AtomGroup, **kwargs):
    ...
    return stuff

possibly with data arrays passed in kwargs for in-place modification. One advantage is that the … |
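Concretely, such a bare function with an optional work array for in-place modification could look like this sketch (the name `calculate_com` and its signature are illustrative, not an existing MDAnalysis API):

```python
import numpy as np

def calculate_com(positions, masses, out=None):
    """Centre of mass of one frame from bare arrays.

    `out` lets the caller (e.g. a per-frame loop) reuse a
    preallocated array instead of allocating one per frame.
    """
    positions = np.asarray(positions, dtype=np.float64)
    masses = np.asarray(masses, dtype=np.float64)
    if out is None:
        out = np.empty(3)
    np.dot(masses, positions, out=out)  # mass-weighted sum of coordinates
    out /= masses.sum()
    return out

com = calculate_com([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [1.0, 1.0])
print(com)  # [1. 0. 0.]
```

In a class-based analysis, the per-frame method could simply delegate to such a function, reusing one work array across frames.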
@cing many thanks – the rare specimen of a user telling developers what they actually want ;-) |
I added the Bauhaus model to the list, as the emerging consensus: Bauhaus (the emerging consensus from the discussion above: a cohesive reduction to a common set of functional elements together with minimalist inspirations.)
|
@jandom having a standalone version of … I'm not sure how we're coming up with these names, but ich bin ein Bauhauser. |
The standalone function for the single-frame treatment is definitely …
|
@orbeckst +1 for Bauhaus, clever reference! When possible, if we can make these functions decoupled from any MDAnalysis logic, such as the AtomGroup, it would make them easy to understand and test. For the Best-Hummer Q, this "invariant" function could, for example, be:
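(The code block was stripped in rendering; a plausible MDAnalysis-free reconstruction of such a Best-Hummer soft-cutoff Q function, with guessed parameter names, is:)

```python
import numpy as np

def soft_cut_q(r, r0, beta=5.0, lambda_constant=1.8):
    """Soft-cutoff fraction of native contacts (Best-Hummer style).

    r, r0 : distances in the current frame and in the reference;
    beta, lambda_constant : softness and tolerance of the cutoff.
    Pure numpy: easy to import, test, and play with on synthetic input.
    """
    r = np.asarray(r)
    r0 = np.asarray(r0)
    return np.mean(1.0 / (1.0 + np.exp(beta * (r - lambda_constant * r0))))

# synthetic input: at r == lambda_constant * r0 each contact contributes 0.5
print(soft_cut_q(np.array([1.8, 1.8]), np.array([1.0, 1.0])))  # 0.5
```

Because the function only sees plain arrays, one can build intuition for `beta` and `lambda_constant` on synthetic distances before touching a trajectory.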
Anybody can just import it, or copy&paste it into a notebook, and get an intuition for the parameters on some synthetic input. Digression: for any parallelism efforts, most map-reduce will work with the abstraction operating on a single frame. However, it may be useful for the primitive to be a list of frames, which most of the time will contain a single element (RMSD, Q, etc.); this would allow all sorts of autocorrelations to be calculated in parallel (where access to two frames n and n-lagtime is needed). |
tl;dr: I think we converged on the Bauhaus style with the option to factor out the core numerics as a bare-bones (read: numpy arrays) function that can be written independently of MDAnalysis.
Just passing coordinate arrays (or similar) will work in many cases and will also simplify writing cython/FORTRAN (hehe) code. So yes, if the code is readable then by all means factor it out into the bare-bones variant (and add a unit test ;-) ). On the other hand, a lot of the use of abstractions like … For correlation-time-like functions you can either come up with clever streaming algorithms (such as the one for computing the variance in one go, or the blocked algorithm for correlation times in Frenkel & Smit) or, as you say, you don't call the core function a single-frame function and just let it work on a time series. In any case, it just remains to put the Bauhaus design model into words for the Style Guide. |
I think it simmered for a sufficiently long time and people have been using the Bauhaus approach when writing new classes or refactoring.
Once the essential points are transferred to a Wiki page we can close this issue (which I think is an excellent example for a very productive discussion). |
Implementations for parallel analysis (such as PR #618) should be discussed in the context of the model agreed upon in this thread. |
* fixes MDAnalysis#893
* RMSD refactored with the AnalysisBase class; conforms to the new analysis API (see MDAnalysis#719)
* Tests added
* Documentation and chaining functions
* Fixed and updated documentation
* Addressed broadcasting and other issues
* Updated CHANGELOG
Ping... I think all that needs to be done is to …
and possibly create an equivalent document in the docs in a developer section of Analysis. Once that is done, the wiki page can just link to the docs. |
I added a preliminary wiki page MDAnalysis.analysis user interface with the consensus – please correct me if I overlooked anything. If there are no requests for changes, I will finally close this issue in a few days. |
I have …
EDIT: added 2nd question |
I thought for some time about making these arguments also available in run().
What would be the benefit? |
You wanted Universe as an argument for run #719 (comment) ;-) |
I keep trying to set start, stop, and verbose in run(). I do not remember the rationale for having them in __init__.
|
That is specific to the PCA in my comment. It can't be generalized. |
@orbeckst I added some small changes to your entry. Looks good so far. |
Excellent additions, thanks! |
Regarding my previous questions #719 (comment) the consensus so far seems to be:
Comments welcome. For 1 I can open an issue. If we want it, then this needs to be settled before 1.0, together with deprecations before, and a time where both … |
I would also move … |
Sorry, forgot about verbose. Being able to set verbose in __init__, too, might still be useful because the preparation step might actually do various complicated things where one might want additional output (I'm thinking of hydrogen bond analysis, for example). Then __init__ could set the default and run() could override the default during the run. |
For the remaining discussion points (iteration range and verbose in run()) … |
tl;dr: The `MDAnalysis.analysis` modules do not have a unified user interface, which is bad for users and bad for developers. We need to come up with a set of rules describing the analysis modules' user interface.

**Divergent user interface in `MDAnalysis.analysis`**

The `MDAnalysis.analysis` (and `MDAnalysis.visualization`) modules collect various kinds of "tools" to analyze simulations; in some sense, they are responsible for the "analysis" in MDAnalysis. However, while we have been pretty stringent about what our API inside the "core" should look like, we have been much less prescriptive with analysis. To a good degree this reflects the reality that code is mainly contributed by researchers who wrote something to get a particular job done and then realized that it might be usable for the rest of the community – of course, that's exactly what we want for a user-driven open source project! On the other hand, there seems to be a growing feeling among developers that we should have a more uniform interface to the analysis tools as well. Ideally, all our analysis tools should have a common philosophy and share a common set of options. Being able to use different analysis tools "out of the box" once you have a basic understanding of how they work makes for a good overall user experience.
From the developer side, it promotes code re-use and modularization with subsequent improvements in testing coverage and code reliability.
**Using `AnalysisBase`**

@richardjgowers wrote a prototype `MDAnalysis.analysis.base.AnalysisBase` class, and in recent code reviews on contributions to analysis we have been pushing for basing analysis code on this class. But in discussions such as on PR #708 it is becoming clear that we should settle on what we expect the analysis code to do, not least so that developers, who spend a significant amount of time just cleaning up old mess when they implement code fixes and add new features, know where to set priorities and what is expected of them. `AnalysisBase` outlines how to structure typical frame-based analysis, but it does not really say (yet) what a user should be able to expect from analysis tools.

**Different models for the user interface**
Some of the current analysis tools come with additional methods to immediately plot data; many are able to write intermediate and final data to a file for reuse (and perhaps are even able to re-read the file and perform plotting without needing to re-analyze a trajectory); most of them store results as numpy arrays in an attribute `results` (often a `dict` for multiple results). A more purist approach is to just return final data structures, throw away intermediates, not even store final results, and let the user do all downstream processing and plotting.
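For concreteness, the class-based pattern under discussion can be reduced to a toy, MDAnalysis-free sketch (`AnalysisBase` here is a minimal stand-in for the real class, and all names are illustrative):

```python
class AnalysisBase:
    """Minimal stand-in for MDAnalysis's AnalysisBase (illustration only)."""

    def run(self, frames):
        self._prepare()
        for ts in frames:
            self._ts = ts          # current "frame"
            self._single_frame()   # per-frame work (the map step)
        self._conclude()           # final reduction
        return self

class MeanValue(AnalysisBase):
    """Toy analysis: average of a per-frame quantity, stored on the object."""

    def _prepare(self):
        self._values = []

    def _single_frame(self):
        self._values.append(self._ts)

    def _conclude(self):
        self.result = sum(self._values) / len(self._values)

mean = MeanValue().run([1.0, 2.0, 3.0]).result
print(mean)  # 2.0
```

The real `AnalysisBase` adds trajectory slicing, progress output, and so on, but the map-reduce shape is the same.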
I can see four broadly defined models for how we could handle the user interface:

1. **Anarchy**: Do not prescribe any user interface and let each analysis tool writer decide what's best and most appropriate.
2. **Minimalist** (or developer-friendly?): prescribe `AnalysisBase` and stipulate that `run()` returns all computed data.
3. **Baroque** (or user-friendly?): prescribe `AnalysisBase` with additional features, for example (discussion needed!):
   - `plot()` for a simple visualization of the data (remember that sometimes data plotting is pretty involved, see for instance `PSAnalysis.plot()`!)
   - `save()` to store data as a file on disk
   - `to_df()` to return the data as a `pandas.DataFrame`
   For any of these features you need to store the data inside the class somewhere.
4. **Eclecticism**: Somewhere between Minimalist and Baroque, with some features mandatory and others optional (but which ones?).
5. **Bauhaus** (the emerging consensus from the discussion below: a cohesive reduction to a common set of functional elements together with minimalist inspirations): `AnalysisBase` with a common feature set (like Baroque) and the goal of a unified and utilitarian interface; optionally, the core numerics can live as a standalone function outside the `_single_frame()` method.

Feel free to edit/add to the list.
**What do we need?**
I am asking @MDAnalysis/coredevs (and anyone else interested) to chime in with opinions on what to do. The final outcome of this issue should be a consensus on a set of rules (or a statement of the absence of rules, for option 1) for how code in analysis ought to interface with the user. These rules will then become part of the Developer Guide.