
cumsum drops index coordinates #6528

Open · dcherian opened this issue Apr 27, 2022 · 18 comments

@dcherian (Contributor) commented Apr 27, 2022

What happened?

cumsum drops index coordinates. Seen in #6525 and #3417.

What did you expect to happen?

The index coordinates should be preserved.

Minimal Complete Verifiable Example

import xarray as xr

ds = xr.Dataset(
    {"foo": (("x",), [7, 3, 1, 1, 1, 1, 1])},
    coords={"x": [0, 1, 2, 3, 4, 5, 6]},
)
ds.cumsum("x")

Output (note that the "x" index coordinate is gone):

<xarray.Dataset>
Dimensions:  (x: 7)
Dimensions without coordinates: x
Data variables:
    foo      (x) int64 7 10 11 12 13 14 15
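For comparison, the expected output would keep x as an index coordinate, along these lines (repr sketched by hand rather than copied from a run):

<xarray.Dataset>
Dimensions:  (x: 7)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6
Data variables:
    foo      (x) int64 7 10 11 12 13 14 15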

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray main

@dcherian dcherian added the bug label Apr 27, 2022
@Illviljan (Contributor) commented Apr 27, 2022

xarray/core/dataset.py, lines 5239-5240 at 7173bd3:

if not reduce_dims:
    variables[name] = var

If you comment out this if-check so that it always adds the variable, the example at least seems to work.
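For concreteness, a rough sketch of that idea (simplified; the real branch in Dataset.reduce differs, and the reuse of keepdims here is purely hypothetical):

# Current behavior (simplified): a coordinate survives only if none of
# its dims are being reduced, which drops "x" for cumsum:
if name in self.coords:
    if not reduce_dims:
        variables[name] = var

# Possible relaxation for ops that preserve the dimension (illustrative):
if name in self.coords:
    if not reduce_dims or keepdims:
        variables[name] = var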

Not sure how to handle this:

  • Add a new keyword argument to reduce? keepcoords, or can keepdims be expanded to handle this as well?
  • Since cumsum is not really a reduction, should we create a new method instead of reduce? I guess it would look very similar to reduce, though.

@dcherian (Contributor, Author)

> Since cumsum is not really a reduction, should we create a new method instead of reduce? I guess it would look very similar to reduce though.

👍 I think we should add cumsum to generate_reductions.py, which maybe should become generate_aggregations.py.

@max-sixty (Collaborator)

There was some discussion a while ago about how to handle reduce on things that aren't reductions. There's some overlap with #1618, and that has some links to other issues.

dcherian mentioned this issue Oct 11, 2022
patrick-naylor added a commit to patrick-naylor/xarray that referenced this issue Oct 21, 2022
dcherian added a commit that referenced this issue Oct 26, 2022
Commit message (squashed):

* Added file to generate docs for cumulatives. generate_cumulatives.py creates _cumulatives.py
* Fixed mypy issues
* Updated cumulatives to fix mypy issues
* Added keep_attrs to groupby funcs
* Combined cumulatives, reductions, and aggregations, and modified dataset cumulative functions
* Merged cumulatives and reductions into aggregations
* Removed test print from dataset.py
* Removed generate_cumulatives and generate_reductions
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Updated _aggregations with docstring changes
* Updated generate_aggregations.py with suggestions from @dcherian. Removed potential solution to #6528
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Removed unnecessary function from dataset.py
* Removed unnecessary function from dataset.py
* Updated api.rst, whats-new.rst and added a cumprod test to test_groupby.py
* Fixed accidental edit to test_dataset.py
* Apply suggestions from code review
* Merge and rename reductions and cumulatives to aggregations
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* minor tweaks
* Update computation.py
* Update generate_aggregations.py
* Update .pre-commit-config.yaml
* fix mypy
* use _group_dim in resample?
* Update resample.py
* Manually fix docstring
* Apply suggestions from code review (co-authored by Deepak Cherian <dcherian@users.noreply.github.com>)
* Use TEMPLATE_SEE_ALSO
* use default example
* add resample test
* remove cumulative function in ops
* Revert "remove cumulative function in ops" (reverts commit 66e7390)
* Add numeric_only=True

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@max-sixty (Collaborator) commented Aug 24, 2023

Was this closed by #7152?

Edit: No, I think there was a suggestion to change the approach from .reduce to .apply_ufunc here, but it didn't make it in.

The general reduce issue overlaps with #6528.

@max-sixty (Collaborator)

I had a look at moving this to .apply_ufunc — if I'm thinking about it correctly — it's actually not trivial, since it's not just a gufunc like the .rolling_exp ones from #8114.

Instead, it's a function that we've written which can execute over multiple dimensions

def _nd_cum_func(cum_func, array, axis, **kwargs):
    array = asarray(array)
    if axis is None:
        axis = tuple(range(array.ndim))
    if isinstance(axis, int):
        axis = (axis,)
    out = array
    for ax in axis:
        out = cum_func(out, axis=ax, **kwargs)
    return out

(Not super difficult! But IIUC, less of a lift-and-shift...)
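For concreteness, a minimal sketch of what routing a single-dimension cumsum through apply_ufunc might look like (cumsum_keep_coords is a hypothetical name; skipna, multiple dims, and dask handling are all omitted):

import numpy as np
import xarray as xr

def cumsum_keep_coords(obj, dim):
    # apply_ufunc retains coordinates for dims that survive into the output,
    # so the index coordinate on `dim` is preserved here.
    return xr.apply_ufunc(
        np.cumsum,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},  # core dims are moved to the last axis
    )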

@max-sixty (Collaborator)

Another alternative is to focus on Cumulative — ds.cumulative('lat').sum() works great!
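For the example at the top of this issue, that looks like the following (assuming a recent xarray release with .cumulative):

import xarray as xr

ds = xr.Dataset(
    {"foo": (("x",), [7, 3, 1, 1, 1, 1, 1])},
    coords={"x": [0, 1, 2, 3, 4, 5, 6]},
)
# Unlike ds.cumsum("x"), this keeps the "x" index coordinate:
ds.cumulative("x").sum()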

@dcherian (Contributor, Author)

Can we just redirect cumsum to that then?

@max-sixty (Collaborator)

> Can we just redirect cumsum to that then?

I looked at this, but cumsum has slightly more features, such as skipna...
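For instance, skipna is available on cumsum directly:

# With skipna=False, NaNs propagate through the running sum instead of
# being treated as missing:
ds.cumsum("x", skipna=False)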

@jhollowed

> Can we just redirect cumsum to that then?

FWIW, I just tested this with a 33 MB Dataset. I observed that

data.cumsum('time')

is anywhere from 18 to 30x faster than

data.cumulative('time').sum()

@max-sixty (Collaborator) commented Oct 7, 2024

@jhollowed, is that the first invocation? .cumulative uses functions from numbagg that are JIT-compiled with numba, so the first invocation of the session will be much slower [1]. But then it should be between a bit faster and much faster if it can parallelize.

If you have a repro then happy to investigate; it would be a big deal if that's real.

[1] Numba is working on enabling caching for parallel functions, which will mean only the first run on the machine, not of the session, will be slow.

@jhollowed commented Oct 7, 2024

@max-sixty No, it's not just the first invocation, from what I can tell. The repo in question is a mess, but here's a minimal working example which mimics my use case:

import time
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

data = xr.DataArray(np.random.rand(100,10,1000), dims=('x','y','z'))

cumsum_time = np.zeros(len(data.z))
cumulative_time = np.zeros(len(data.z))

for i in range(100):
    start = time.time()
    data.cumsum('z')
    cumsum_time[i] = time.time() - start

    start = time.time()
    data.cumulative('z').sum()
    cumulative_time[i] = time.time() - start

plt.hist(cumulative_time/cumsum_time, bins=20)
plt.show()

In this example the difference is even more extreme, with a ~50x slowdown by using cumulative().

[screenshot: histogram of cumulative_time/cumsum_time ratios, centered around ~50]

I also tried saving the return of data.cumulative('z') as a new object before the loop, and calling only sum() inside the loop. This made no qualitative difference to the timing result. I also tried this in a Jupyter notebook, where the difference was even worse but still qualitatively similar. I am running xarray version 2024.6.0.

If I'm making a mistake or have an obvious misunderstanding please let me know.

This is perhaps veering off-topic for this issue, but long story short, I just ended up using cumsum() and assign_coords() to recover the index coordinate dropped by cumsum(), rather than using the much slower cumulative().sum().
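A sketch of that workaround (assuming a dataset ds with a time index coordinate; the names are illustrative):

# Fast cumsum, then restore the index coordinate it drops:
out = ds.cumsum("time").assign_coords(time=ds["time"])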

@max-sixty (Collaborator) commented Oct 7, 2024

Can you share your env? I get something very very different!

[image: histogram of timing ratios; one large outlier]

The one outlier will be the initial JIT instance; here's a second run without that:

[image: histogram of timing ratios, second run without the JIT outlier]

@max-sixty (Collaborator) commented Oct 7, 2024

FWIW, you may also want to use something like %%timeit; the repro above is using a missized array to store the values, etc. (No stress, it's exactly the sort of mistake I'd have made, but that's why %%timeit is great.)
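For example, in IPython or Jupyter (these are IPython magics, not plain Python; data as defined in the repro above):

# %timeit handles repetition and timer bookkeeping itself:
%timeit data.cumsum('z')
%timeit data.cumulative('z').sum()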

@jhollowed

@max-sixty Interesting.

timeit.timeit(lambda: data.cumsum('z'), number=100)

and

timeit.timeit(lambda: data.cumulative('z').sum(), number=100)

yield 1.056 and 50.75 seconds, respectively.

I'm working on a login node of a cluster with many users, so I'm not totally in control of the context I'm running in. But my conda env looks like this.

@max-sixty (Collaborator) commented Oct 7, 2024

Ah, OK, neither numbagg nor bottleneck is installed, so it falls back to the slow routine. I guess that's more of a trap, with much worse perf, than I realized; I should have thought of that when I first saw your issue. We do at least mention that in the .cumulative docstring.

FWIW we're considering including numbagg in the standard xarray release to reduce this confusion, and having a "core" release without any optional dependencies.

In the meantime, you might want to add numbagg to the env. (Disclaimer: I'm also a maintainer of numbagg...)
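For example (numbagg is published on PyPI and conda-forge):

pip install numbagg
# or:
conda install -c conda-forge numbagg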

@max-sixty (Collaborator) commented Oct 7, 2024

Given we now have a reference to numbagg in the .cumulative docstring, I plan to close this.

I thought about whether we should add a warning for .cumulative when neither numbagg nor bottleneck is installed; I think that's a bit too invasive, and it would be screened off if we do the xarray-core package. But open to feedback, ofc.

@max-sixty added the "plan to close" label (May be closeable, needs more eyeballs) Oct 7, 2024
@dcherian (Contributor, Author) commented Nov 6, 2024

The core cumsum is still buggy

@max-sixty (Collaborator)

> The core cumsum is still buggy

Agree; I closed since it seems unlikely we'll fix it. But great if there's a path.

We could deprecate skipna and then defer to .cumulative. But that requires numbagg...

@dcherian removed the "plan to close" label Nov 11, 2024