nd-rolling #4219
Conversation
xarray/tests/test_dataarray.py
Outdated
@pytest.mark.parametrize("center", (True, False))
@pytest.mark.parametrize("min_periods", (None, 1))
def test_ndrolling_reduce(da, center, min_periods):
    rolling_obj = da.rolling(time=3, a=2, center=center, min_periods=min_periods)
This is a basic test for the new nd-rolling.
This looks very promising; I'm surprised it was possible without more code. I'm being slow, but where is the nd-rolling algo? I had thought bottleneck didn't support more than one dimension (https://bottleneck.readthedocs.io/en/latest/bottleneck.move.html), and that we'd have to implement our own in numbagg (which would be very possible).
Re the API, I think the dict is probably the best option, although it does complicate things as the arguments become differently typed depending on one vs. multiple dimensions. One alternative is to allow fluent args, like:

(
    da
    .rolling(x=3, center=True)
    .rolling(y=5, min_periods=2)
    .mean()
)

...but does that then seem like the second
Hi @max-sixty
I couldn't think of it until just now. But yes, it sounds to me like a repeated rolling operation.
No. With nd-rolling, we need to use numpy reductions.
Another API concern. We now use ... With nd-dimension, I think ... Even if we leave it, we may disallow the nd-argument of ...
Hello @fujiisoup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2020-08-07 22:59:00 UTC
xarray/core/dask_array_ops.py
Outdated
if hasattr(axis, "__len__"):
    for ax, win, cen in zip(axis, window, center):
        a = rolling_window(a, ax, win, cen, fill_value)
    return a
I hope it is true, though I didn't check it. A suspicious part is da.concatenate, which I expect does not copy the original (strided) array.
I noticed that ghosting breaks my expectation. I need to update the algo here.
I think the dictionary is OK. We can allow scalars for ...
I think now it is ready for review, though I'm sure the tests miss a lot of edge cases.
A possible improvement would be nan-reduction methods for nd-rolling. This can be solved by replacing nan with appropriate values and applying non-nan reduction methods: da.rolling(x=2, y=3).construct(x='xw', y='yw').sum(['xw', 'yw']) should be the same as da.rolling(x=2, y=3).construct(x='xw', y='yw', fill_value=0).sum(['xw', 'yw'], skipna=False), and the latter is much more memory efficient. I'd like to leave this improvement to a future PR.
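The equivalence behind this trick can be sketched in plain numpy (an illustration of the idea, not xarray's implementation): filling nan with the reduction's identity element (0 for sum) and reducing without skipping gives the same result as the nan-skipping reduction, while avoiding a per-window nan mask.

```python
import numpy as np

# Two windows with nans, as construct() would produce along a window dim.
windows = np.array([[1.0, np.nan, 3.0],
                    [np.nan, 5.0, 6.0]])

# skipna=True path: masks nans per window.
nan_skipping = np.nansum(windows, axis=-1)

# fill_value=0 path: replace nan by the identity of sum, then plain sum.
filled = np.where(np.isnan(windows), 0.0, windows)
plain = filled.sum(axis=-1)

assert np.array_equal(nan_skipping, plain)
```

The same identity-element argument works for prod (fill with 1), max (fill with -inf), and so on, which is presumably why the follow-up PR could generalize it.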
I got a type-checking error, only in CI and not locally, from code that I didn't change.
Could anyone kindly review this?
Looks really good, great implementation imo
@@ -802,7 +802,7 @@ def rolling(
     Minimum number of observations in window required to have a value
     (otherwise result is NA). The default, None, is equivalent to
     setting min_periods equal to the size of the window.
-center : boolean, default False
+center : boolean, or a mapping, default False
Types on the signature should change too (indeed, I'm surprised mypy doesn't fail)
xarray/core/rolling.py
Outdated
-for k in self._attributes
-if getattr(self, k, None) is not None
+for k in list(self.dim)
++ list(self.window)
Isn't this a list of ints? Is it going to do getattr(da, 3)?
 a = np.pad(a, pads, mode="constant", constant_values=fill_value)
-return _rolling_window(a, window, axis)
+for ax, win in zip(axis, window):
+    a = _rolling_window(a, win, ax)
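The loop above relies on the fact that applying a 1-D rolling window to each axis in turn builds the same nd windows as a single multi-dimensional sliding window. A hedged numpy sketch of that equivalence, using the public sliding_window_view as a stand-in for the strided _rolling_window helper:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(20.0).reshape(4, 5)

# One joint 2-D sliding window of shape (2, 3):
combined = sliding_window_view(a, (2, 3))

# The same window contents, built one axis at a time; each step appends
# a new window dimension at the end, just like the loop in the PR.
iterated = sliding_window_view(a, 2, axis=0)
iterated = sliding_window_view(iterated, 3, axis=1)

assert combined.shape == iterated.shape == (3, 3, 2, 3)
assert np.array_equal(combined, iterated)
```

Both views satisfy combined[i, j, k, l] == a[i + k, j + l], which is why iterating over axes is enough and no nd-specific striding code was needed.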
This is very clever! I spent a while trying to figure out how it works...
xarray/tests/test_dataset.py
Outdated
@pytest.mark.parametrize("ds", (1,), indirect=True)
@pytest.mark.parametrize("center", (True, False))
@pytest.mark.parametrize("min_periods", (None, 1))
@pytest.mark.parametrize("name", ("sum", "mean", "max"))
Will this test fail for something like std? IIUC, iteratively applying the function to the array for each dim only works for some functions. Which is probably fine, no need to write more tests imo, though maybe worth making a note.
In this case, many other functions should have the same result. If there are some nans in random positions, the result should be different. I added a few more tests and included the function std, but I don't think it is necessary to test all nanops.
Thanks @max-sixty for the review ;)
xarray/tests/test_dataset.py
Outdated
assert_allclose(actual, expected)
assert actual.dims == expected.dims

# Do it in the oposite order
Suggested change:
-# Do it in the oposite order
+# Do it in the opposite order
I'm still a bit confused. How does the "roll over each dimension in turn" approach equal the "roll over both dimensions together" approach with a function like std? Here's a proposed counter-example:

import xarray as xr
import numpy as np

da = xr.DataArray(np.asarray([[0, 10, 0], [0, 10, 0], [0, 10, 0]]), dims=list('xy'))
print(da)
<xarray.DataArray (x: 3, y: 3)>
array([[ 0, 10,  0],
       [ 0, 10,  0],
       [ 0, 10,  0]])
Dimensions without coordinates: x, y

x_std = da.rolling(dict(x=2)).std()
print(x_std)
<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Dimensions without coordinates: x, y

x_then_y_std = x_std.rolling(dict(y=2)).std()
print(x_then_y_std)
<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [nan,  0.,  0.],
       [nan,  0.,  0.]])
Dimensions without coordinates: x, y

combined_std = da.rolling(dict(x=2, y=2)).std()
print(combined_std)
<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [nan,  5.,  5.],
       [nan,  5.,  5.]])
Dimensions without coordinates: x, y
I think it's more like
but that gives
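For what it's worth, the general point can be checked in plain numpy (a sketch, using sliding_window_view as a stand-in for the PR's construct-based windows): reducing each dimension in turn matches the joint 2-D window for mean, but not for std.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5))

win = sliding_window_view(a, (2, 2))  # joint 2x2 windows

# mean commutes: mean of per-column means == mean of the whole 2x2 block.
joint_mean = win.mean(axis=(-2, -1))
step_mean = sliding_window_view(
    sliding_window_view(a, 2, axis=0).mean(-1), 2, axis=1
).mean(-1)
assert np.allclose(joint_mean, step_mean)

# std does not: std of per-column stds != std of the whole 2x2 block.
joint_std = win.std(axis=(-2, -1))
step_std = sliding_window_view(
    sliding_window_view(a, 2, axis=0).std(-1), 2, axis=1
).std(-1)
assert not np.allclose(joint_std, step_std)
```

This is consistent with the counter-example above: iterating reductions per dimension is only valid when the construct-based implementation reduces over all window dimensions at once.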
Thanks @max-sixty. You are completely correct. The reason was that the dataset I was using for the test does not have ... Fixed. Now it correctly fails for ...
Great! Thanks @fujiisoup! Ready to go from my POV.
@max-sixty |
* 'master' of github.com:pydata/xarray: (260 commits)
  Increase support window of all dependencies (pydata#4296)
  Implement interp for interpolating between chunks of data (dask) (pydata#4155)
  Add @mathause to current core developers. (pydata#4335)
  install sphinx-autosummary-accessors from conda-forge (pydata#4332)
  Use sphinx-accessors-autosummary (pydata#4323)
  ndrolling fixes (pydata#4329)
  DOC: fix typo argmin -> argmax in DataArray.argmax docstring (pydata#4327)
  pin sphinx to 3.1 (pydata#4326)
  nd-rolling (pydata#4219)
  Implicit dask import 4164 (pydata#4318)
  allow customizing the inline repr of a duck array (pydata#4248)
  silence the known docs CI issues (pydata#4316)
  enh: fixed pydata#4302 (pydata#4315)
  Remove all unused and warn-raising methods from AbstractDataStore (pydata#4310)
  Fix map_blocks example (pydata#4305)
  Fix docstring for missing_dims argument to isel methods (pydata#4298)
  Support for PyCharm remote deployment (pydata#4299)
  Update map_blocks and map_overlap docstrings (pydata#4303)
  Lazily load resource files (pydata#4297)
  warn about the removal of the ufuncs (pydata#4268)
  ...
* upstream/master: (34 commits)
  Fix bug in computing means of cftime.datetime arrays (pydata#4344)
  fix some str accessor inconsistencies (pydata#4339)
  pin matplotlib in ci/requirements/doc.yml (pydata#4340)
  Clarify drop_vars return value. (pydata#4244)
  Support explicitly setting a dimension order with to_dataframe() (pydata#4333)
  Increase support window of all dependencies (pydata#4296)
  Implement interp for interpolating between chunks of data (dask) (pydata#4155)
  Add @mathause to current core developers. (pydata#4335)
  install sphinx-autosummary-accessors from conda-forge (pydata#4332)
  Use sphinx-accessors-autosummary (pydata#4323)
  ndrolling fixes (pydata#4329)
  DOC: fix typo argmin -> argmax in DataArray.argmax docstring (pydata#4327)
  pin sphinx to 3.1 (pydata#4326)
  nd-rolling (pydata#4219)
  Implicit dask import 4164 (pydata#4318)
  allow customizing the inline repr of a duck array (pydata#4248)
  silence the known docs CI issues (pydata#4316)
  enh: fixed pydata#4302 (pydata#4315)
  Remove all unused and warn-raising methods from AbstractDataStore (pydata#4310)
  Fix map_blocks example (pydata#4305)
  ...
- isort -rc . && black . && mypy . && flake8
- whats-new.rst
- api.rst
I noticed that the implementation of nd-rolling is straightforward. The core part is implemented, but I am wondering what the best API is, while keeping it backward-compatible. Obviously, it basically should look like

A problem is the other parameters, center and min_periods. In principle, they can depend on the dimension. For example, we can have center=True only for x but not for y. So, maybe we allow a dictionary for them?

The same thing happens for the .construct method. I'm afraid this dictionary argument is a bit too redundant. Does anyone have another idea?
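One way to keep the scalar form backward-compatible while also allowing the per-dimension dictionary discussed above is to normalize both forms internally. A minimal sketch of the idea; normalize_arg and its default handling are hypothetical, not xarray's actual code:

```python
from collections.abc import Mapping


def normalize_arg(value, dims, default=False):
    """Expand a scalar to {dim: value}; validate a mapping's keys.

    Hypothetical helper: a scalar `center` (or `min_periods`) is
    broadcast to every rolled dimension, while a mapping lets each
    dimension have its own value.
    """
    if isinstance(value, Mapping):
        unknown = set(value) - set(dims)
        if unknown:
            raise KeyError(f"not rolled dimensions: {sorted(unknown)}")
        # Dimensions absent from the mapping fall back to the default.
        return {d: value.get(d, default) for d in dims}
    return {d: value for d in dims}


print(normalize_arg(True, ["x", "y"]))         # {'x': True, 'y': True}
print(normalize_arg({"x": True}, ["x", "y"]))  # {'x': True, 'y': False}
```

With such a helper, the rest of the rolling code can always assume one value per dimension, so the scalar and dictionary APIs share a single code path.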