
cumsum drops index coordinates #6528

Open · dcherian opened this issue Apr 27, 2022 · 18 comments

@dcherian (Contributor) commented Apr 27, 2022

What happened?

cumsum drops index coordinates. Seen in #6525 and #3417.

What did you expect to happen?

The index coordinates should be preserved.

Minimal Complete Verifiable Example

import xarray as xr

ds = xr.Dataset(
    {"foo": (("x",), [7, 3, 1, 1, 1, 1, 1])},
    coords={"x": [0, 1, 2, 3, 4, 5, 6]},
)
ds.cumsum("x")

Output (note that the "x" index coordinate is gone):

<xarray.Dataset>
Dimensions:  (x: 7)
Dimensions without coordinates: x
Data variables:
    foo      (x) int64 7 10 11 12 13 14 15
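For comparison, the expected output would keep x as an index coordinate, along these lines (repr sketched by hand rather than copied from a run):

<xarray.Dataset>
Dimensions:  (x: 7)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6
Data variables:
    foo      (x) int64 7 10 11 12 13 14 15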

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray main

@dcherian dcherian added the bug label Apr 27, 2022
@Illviljan (Contributor) commented Apr 27, 2022

xarray/core/dataset.py, lines 5239-5240 at 7173bd3:

if not reduce_dims:
    variables[name] = var

If you comment out this if-check so that it always adds the variable, the example at least seems to work.
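For concreteness, a rough sketch of that idea (simplified; the real branch in Dataset.reduce differs, and the reuse of keepdims here is purely hypothetical):

# Current behavior (simplified): a coordinate survives only if none of
# its dims are being reduced, which drops "x" for cumsum:
if name in self.coords:
    if not reduce_dims:
        variables[name] = var

# Possible relaxation for ops that preserve the dimension (illustrative):
if name in self.coords:
    if not reduce_dims or keepdims:
        variables[name] = var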

Not sure how to handle this:

  • Add a new keyword argument to reduce? keepcoords, or can keepdims be expanded to handle this as well?
  • Since cumsum is not really a reduction, should we create a new method instead of reduce? I guess it would look very similar to reduce, though.

@dcherian (Contributor, Author)

> Since cumsum is not really a reduction, should we create a new method instead of reduce? I guess it would look very similar to reduce though.

👍 I think we should add cumsum to generate_reductions.py, which maybe should become generate_aggregations.py.

@max-sixty (Collaborator)

There was some discussion a while ago about how to handle reduce on things that aren't reductions. There's some overlap with #1618, and that has some links to other issues.

dcherian mentioned this issue Oct 11, 2022
patrick-naylor added a commit to patrick-naylor/xarray that referenced this issue Oct 21, 2022
dcherian added a commit that referenced this issue Oct 26, 2022
Commit message (squashed):

* Added file to generate docs for cumulatives. generate_cumulatives.py creates _cumulatives.py
* Fixed mypy issues
* Updated cumulatives to fix mypy issues
* Added keep_attrs to groupby funcs
* Combined cumulatives, reductions, and aggregations, and modified dataset cumulative functions
* Merged cumulatives and reductions into aggregations
* Removed test print from dataset.py
* Removed generate_cumulatives and generate_reductions
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Updated _aggregations with docstring changes
* Updated generate_aggregations.py with suggestions from @dcherian. Removed potential solution to #6528
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Removed unnecessary function from dataset.py
* Removed unnecessary function from dataset.py
* Updated api.rst, whats-new.rst and added a cumprod test to test_groupby.py
* Fixed accidental edit to test_dataset.py
* Apply suggestions from code review
* Merge and rename reductions and cumulatives to aggregations
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* minor tweaks
* Update computation.py
* Update generate_aggregations.py
* Update .pre-commit-config.yaml
* fix mypy
* use _group_dim in resample?
* Update resample.py
* Manually fix docstring
* Apply suggestions from code review (co-authored by Deepak Cherian <dcherian@users.noreply.github.com>)
* Use TEMPLATE_SEE_ALSO
* use default example
* add resample test
* remove cumulative function in ops
* Revert "remove cumulative function in ops" (reverts commit 66e7390)
* Add numeric_only=True

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@max-sixty (Collaborator) commented Aug 24, 2023

Was this closed by #7152?

Edit: No, I think there was a suggestion to change the approach from .reduce to .apply_ufunc here, but it didn't make it in.

The general reduce issue overlaps with #6528.

@max-sixty (Collaborator)

I had a look at moving this to .apply_ufunc — if I'm thinking about it correctly — it's actually not trivial, since it's not just a gufunc like the .rolling_exp ones from #8114.

Instead, it's a function that we've written which can execute over multiple dimensions

def _nd_cum_func(cum_func, array, axis, **kwargs):
    array = asarray(array)
    if axis is None:
        axis = tuple(range(array.ndim))
    if isinstance(axis, int):
        axis = (axis,)
    out = array
    for ax in axis:
        out = cum_func(out, axis=ax, **kwargs)
    return out

(Not super difficult! But IIUC, less of a lift-and-shift...)
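For concreteness, a minimal sketch of what routing a single-dimension cumsum through apply_ufunc might look like (cumsum_keep_coords is a hypothetical name; skipna, multiple dims, and dask handling are all omitted):

import numpy as np
import xarray as xr

def cumsum_keep_coords(obj, dim):
    # apply_ufunc retains coordinates for dims that survive into the output,
    # so the index coordinate on `dim` is preserved here.
    return xr.apply_ufunc(
        np.cumsum,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},  # core dims are moved to the last axis
    )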

@max-sixty (Collaborator)

Another alternative is to focus on Cumulative — ds.cumulative('lat').sum() works great!
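For the example at the top of this issue, that looks like the following (assuming a recent xarray release with .cumulative):

import xarray as xr

ds = xr.Dataset(
    {"foo": (("x",), [7, 3, 1, 1, 1, 1, 1])},
    coords={"x": [0, 1, 2, 3, 4, 5, 6]},
)
# Unlike ds.cumsum("x"), this keeps the "x" index coordinate:
ds.cumulative("x").sum()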

@dcherian (Contributor, Author)

Can we just redirect cumsum to that then?

@max-sixty (Collaborator)

> Can we just redirect cumsum to that then?

I looked at this, but cumsum has slightly more features, such as skipna...
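For instance, skipna is available on cumsum directly:

# With skipna=False, NaNs propagate through the running sum instead of
# being treated as missing:
ds.cumsum("x", skipna=False)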

@jhollowed

> Can we just redirect cumsum to that then?

FWIW, I just tested this with a 33 MB Dataset. I observed that

data.cumsum('time')

is anywhere from 18 to 30x faster than

data.cumulative('time').sum()

@max-sixty (Collaborator) commented Oct 7, 2024

@jhollowed, is that the first invocation? .cumulative uses functions from numbagg that are JIT-compiled with numba, so the first invocation of the session will be much slower [1]. But then it should be between a bit faster and much faster if it can parallelize.

If you have a repro then happy to investigate; it would be a big deal if that's real.

[1] Numba is working on enabling caching for parallel functions, which will mean only the first run on the machine, not of the session, will be slow.

@jhollowed commented Oct 7, 2024

@max-sixty No, it's not just the first invocation, from what I can tell. The repo in question is a mess, but here's a minimal working example which mimics my use case:

import time
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

data = xr.DataArray(np.random.rand(100,10,1000), dims=('x','y','z'))

cumsum_time = np.zeros(len(data.z))
cumulative_time = np.zeros(len(data.z))

for i in range(100):
    start = time.time()
    data.cumsum('z')
    cumsum_time[i] = time.time() - start

    start = time.time()
    data.cumulative('z').sum()
    cumulative_time[i] = time.time() - start

plt.hist(cumulative_time/cumsum_time, bins=20)
plt.show()

In this example the difference is even more extreme, with a ~50x slowdown by using cumulative().

[screenshot: histogram of cumulative_time/cumsum_time ratios, centered around ~50]

I also tried saving the return of data.cumulative('z') as a new object before the loop, and calling only sum() inside the loop. This made no qualitative difference to the timing result. I also tried this in a Jupyter notebook, where the difference was even worse but still qualitatively similar. I am running xarray version 2024.6.0.

If I'm making a mistake or have an obvious misunderstanding please let me know.

This is perhaps veering off-topic for this issue, but long story short, I just ended up using cumsum() and assign_coords() to recover the index coordinate dropped by cumsum(), rather than using the much slower cumulative().sum().
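A sketch of that workaround (assuming a dataset ds with a time index coordinate; the names are illustrative):

# Fast cumsum, then restore the index coordinate it drops:
out = ds.cumsum("time").assign_coords(time=ds["time"])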

@max-sixty (Collaborator) commented Oct 7, 2024

Can you share your env? I get something very very different!

[image: histogram of timing ratios; one large outlier]

The one outlier will be the initial JIT instance; here's a second run without that:

[image: histogram of timing ratios, second run without the JIT outlier]

@max-sixty (Collaborator) commented Oct 7, 2024

FWIW, you may also want to use something like %%timeit; the repro above is using a missized array to store the values, etc. (No stress, it's exactly the sort of mistake I'd have made, but that's why %%timeit is great.)
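For example, in IPython or Jupyter (these are IPython magics, not plain Python; data as defined in the repro above):

# %timeit handles repetition and timer bookkeeping itself:
%timeit data.cumsum('z')
%timeit data.cumulative('z').sum()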

@jhollowed

@max-sixty Interesting.

timeit.timeit(lambda: data.cumsum('z'), number=100)

and

timeit.timeit(lambda: data.cumulative('z').sum(), number=100)

yield 1.056 and 50.75 seconds, respectively.

I'm working on a login node of a cluster with many users, so I'm not totally in control of the context I'm running in. But my conda env looks like this.

@max-sixty (Collaborator) commented Oct 7, 2024

Ah, OK, neither numbagg nor bottleneck is installed, so it falls back to the slow routine. I guess that's more of a trap, with much worse perf, than I realized; I should have thought of that when I first saw your issue. We do at least mention that in the .cumulative docstring.

FWIW we're considering including numbagg in the standard xarray release to reduce this confusion, and having a "core" release without any optional dependencies.

In the meantime, you might want to add numbagg to the env. (Disclaimer: I'm also a maintainer of numbagg...)
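For example (numbagg is published on PyPI and conda-forge):

pip install numbagg
# or:
conda install -c conda-forge numbagg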

@max-sixty (Collaborator) commented Oct 7, 2024

Given we now have a reference to numbagg in the .cumulative docstring, I plan to close this.

I thought about whether we should add a warning for .cumulative when neither numbagg nor bottleneck is installed; I think that's a bit too invasive, and it would be screened off if we do the xarray-core package. But open to feedback, ofc.

@max-sixty added the "plan to close" label (May be closeable, needs more eyeballs) Oct 7, 2024
@dcherian (Contributor, Author) commented Nov 6, 2024

The core cumsum is still buggy

@max-sixty (Collaborator)

> The core cumsum is still buggy

Agree; I closed since it seems unlikely we'll fix it. But great if there's a path.

We could deprecate skipna and then defer to .cumulative. But that requires numbagg...

@dcherian removed the "plan to close" label Nov 11, 2024