-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statistical_inefficiency does not slice data frame #198
Labels
Comments
ptmerz
added a commit
to ptmerz/alchemlyb
that referenced
this issue
Jun 22, 2022
Make sure that both the subsampling series and the dataframe are sliced when calling statistical_inefficieny(). This ensure that the series and the dataframe are in sync before the subsampled indices of the series are applied to the dataframe. Fixes alchemistry#198
@ptmerz Many thanks for reporting this and come up with the PR. I will have a look at it. |
orbeckst
pushed a commit
that referenced
this issue
Jul 7, 2022
* Fixes #198 * Sync series and dataframe when subsampling: Make sure that both the subsampling series and the dataframe are sliced when calling statistical_inefficieny(). This ensure that the series and the dataframe are in sync before the subsampled indices of the series are applied to the dataframe. * Add tests: This adds a number of tests for slicing() and statistical_inefficieny(): * Test that slicing() respects upper and lower time bounds (currently passes) * Test that statistical_inefficieny() respects upper and lower time bounds when it is used without series to subsample (currently passes) * Test that statistical_inefficieny() respects upper and lower time bounds when it is used without series to subsample (currently fails) * Test that first using slicing() on the data frame, then statistical_inefficieny() without time bounds yields the same results as a single call to statistical_inefficieny() with time bounds (currently fails) * Extract input sanity check in separate function * Extract test for unchanged data in separate test * Parametrize tests of statistical_inefficiency to merge cases which use `series` with the ones that don't * Parametrize all tests to use `conservative` `True` and `False` * Remove conservative keyword where it didn't belong * Use pytest fixture to load data at test time * Add notes to docstring, CHANGES, and AUTHORS
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Description
When using
alchemlyb.preprocessing.statistical_inefficiency
with a data frame, a series to calculate statistical inefficiency on, and slicing parameters (lower
,upper
,step
), the slicing is only applied to the series, but not to the data frame. Sincestatistical_inefficiency
uses a raw index to subsample the data frame, the series and the data frame are out of sync by the end of the function. This leads to results that seem, at the very least, unintuitive.Examples
I think that a few examples make this much easier to explain. Assume I have loaded some data, e.g.
and want to use only part of this data, e.g.
then I would expect a few things:
statistical_inefficiency
removes times before and after my set limits → this does not work ❌Specifically, I find that the times are shifted, i.e. that the lowest time of
subsampled_data
is 0, and the highest time isupper - lower
.statistical_inefficiency
, or runningstatistical_inefficiency
with slicing parameters leads to the same result → this does not work ❌The problem here is, unsurprisingly, similar to the one above:
subsampled_data
starts at time 0, ends at timeupper - lower
, whereassubsampled_sliced_data
starts at timelower
and ends at timeupper
.My expectation here would be to analyze windows of different data at every iteration. The current implementation would, however, always subsample from the window 0 to 1000. (The subsampled data might differ between windows, since the statistical inefficiency is actually calculated on a different window of the series each time, but the resulting indices are applied to the same window of the data frame every time.)
step
keyword is used. Imagine that we have data which is correlated such that theconservative
subsampling algorithm picks every 10th frame. If we were to usestatistical_inefficiency
withstep=10
(by chance, or because we have a feeling that we might have sampled data too frequently), thenstatistical_inefficiency
would return every frame of the first 10% of the data! So we would have lost a lot of information, and would continue working with heavily sampled data.Proposed solution
I will propose a change shortly which adds tests (in line with examples 1 to 3) and a fix for this behavior. I think the most stable fix is to call
slicing
on both the series and the dataframe. It might be possible to calculate an offset index plus a stride to avoid slicing the dataframe, but that seems like pointless overengineering since there is aslicing
function readily available that makes sure that the series and dataframe slicing will always be done in the same way.PS: Maybe I've misunderstood everything and this is intended behavior. In this case, please close the PR, but I would then suggest to add some documentation (ideally also a note when calling the function) warning about this behavior 🙂
The text was updated successfully, but these errors were encountered: