A broadcasting sum for xarray.Dataset #6053

mjwillson · 2021-12-08T11:24:21Z

I've found it useful to have a version of Dataset.sum which sums variables in a way that's consistent with what would happen if they were broadcast to the full Dataset dimensions.

The difference is in what it does with variables that don't contain some of the dimensions it's asked to sum over: the standard sum just skips the summation over these dimensions for these variables, whereas a broadcasting sum multiplies the variable by the product of the sizes of the missing dimensions, like so:

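import numpy as np

def broadcast_sum(dataset, dims):
    def broadcast_sum_var(var):
        # Split the requested dims by whether this variable actually has them.
        present_sum_dims = [dim for dim in dims if dim in var.dims]
        non_present_sum_dims = [dim for dim in dims if dim not in var.dims]
        # Sum over the present dims, then scale by the sizes of the missing ones.
        scale = np.prod([dataset.sizes[dim] for dim in non_present_sum_dims])
        return var.sum(present_sum_dims) * scale
    return dataset.map(broadcast_sum_var)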

This is consistent with mathematical sum notation, where the sum doesn't become a no-op just because the summand doesn't reference the index being summed over. E.g.:

$\sum_{n=1}^N x = N x$
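The identity above can be checked on a small Dataset. This is only an illustrative sketch: the variable names `a` and `b` and the sizes are made up, and it scales `b` by hand rather than calling any xarray feature that does this.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.ones((2, 3))),      # has both dimensions
        "b": (("x",), np.array([10.0, 20.0])),   # lacks "y"
    }
)

# Standard sum: "b" has no "y" dimension, so it passes through unchanged.
plain = ds.sum("y")

# Broadcasting sum for "b": scale by the size of the missing dimension (N = 3).
scaled_b = ds["b"] * ds.sizes["y"]

print(plain["a"].values)   # summed over "y"
print(plain["b"].values)   # unchanged by the standard sum
print(scaled_b.values)     # N * b
```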

I've found it useful when you need to do some broadcasting operations across different variables after the sum, and you want the summation done in a way that's consistent with the broadcasting logic that will be applied later.

Would you be open to adding this, and if so, do you have a preference for how? (A separate method, or an option on .sum?)

dcherian · 2022-07-07T17:45:08Z

xr.broadcast(ds)[0].sum(dims) should do this.
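A minimal sketch of this suggestion on a toy Dataset (the variable names and sizes are made up for illustration). It compares the broadcast-then-sum result for a variable that lacks the summed dimension against scaling by that dimension's size.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(6.0).reshape(2, 3)),
        "b": (("x",), np.array([10.0, 20.0])),  # no "y" dimension
    }
)

# Broadcast every data variable to the Dataset's full dimensions, then sum.
via_broadcast = xr.broadcast(ds)[0].sum("y")

# For "b" this should match scaling by the size of the missing dimension.
expected_b = ds["b"] * ds.sizes["y"]

print(via_broadcast["b"].values)
print(expected_b.values)
```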

We could add it here: https://xarray.pydata.org/en/latest/howdoi.html and to the docs under Aggregations

headtr1ck · 2022-07-08T09:17:53Z

See discussion in #6749

Maybe the current implementation of sum is not correct?

mjwillson · 2022-07-08T11:49:23Z

Re xr.broadcast(ds)[0].sum(dims) -- Thanks, that's neat and may be useful as a workaround, but it looks like it could incur significant extra CPU and RAM costs (tiling all variables to the full size in memory before summing over the tiled values)? Or is there some clever optimisation under the hood which would avoid this?

I also only wanted it to (behave as though it) broadcast the dims that are summed over, but this looks like it will broadcast all dims including those not summed over?
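To make that second point concrete, here is a small sketch with made-up variable names. Variable `c` lacks both `y` (summed over) and `z` (not summed over). After `xr.broadcast`, the summed result for `c` carries the extra `z` dimension, whereas scaling by the size of `y` leaves its dimensions alone.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y", "z"), np.ones((2, 3, 4))),
        "c": (("x",), np.array([1.0, 2.0])),  # lacks both "y" and "z"
    }
)

# Broadcast-then-sum: "c" is expanded over "z" as well, even though only
# "y" is being summed over.
via_broadcast = xr.broadcast(ds)[0].sum("y")

# Scaling only accounts for the summed dimension and keeps "c" one-dimensional.
scaled_c = ds["c"] * ds.sizes["y"]

print(via_broadcast["c"].dims)  # includes "z" in addition to "x"
print(scaled_c.dims)            # only "x"
```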

Overall I think it'd be better to have an option on sum (like missing_dim='broadcast' as suggested in #6749), rather than documenting a partial workaround like this, given the caveats attached to the workaround and that (to me at least) the broadcasting sum is more in keeping with the usual mathematical semantics of 'sum' than what 'sum' currently does.

dcherian · 2024-06-21T15:16:42Z

A more explicit API could be ds.broadcasting.sum()
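One way to prototype that spelling outside xarray is a custom accessor registered with `xr.register_dataset_accessor`. The sketch below is hypothetical: the `BroadcastingAccessor` class and its `sum` signature are assumptions for illustration, not an existing or planned xarray API, and the body simply reuses the scaling logic from the original post.

```python
import math

import numpy as np
import xarray as xr


@xr.register_dataset_accessor("broadcasting")
class BroadcastingAccessor:
    """Hypothetical accessor; not part of xarray."""

    def __init__(self, ds):
        self._ds = ds

    def sum(self, dims):
        # Accept a single dimension name or an iterable of names.
        dims = [dims] if isinstance(dims, str) else list(dims)

        def _sum_var(var):
            present = [d for d in dims if d in var.dims]
            missing = [d for d in dims if d not in var.dims]
            # Sum over the dims the variable has, if any.
            summed = var.sum(present) if present else var
            # Scale by the sizes of the dims it lacks (empty product is 1).
            return summed * math.prod(self._ds.sizes[d] for d in missing)

        return self._ds.map(_sum_var)


ds = xr.Dataset(
    {
        "a": (("x", "y"), np.ones((2, 3))),
        "b": (("x",), np.array([10.0, 20.0])),  # no "y" dimension
    }
)
out = ds.broadcasting.sum("y")
print(out["a"].values, out["b"].values)
```

This version avoids materialising broadcast arrays and only accounts for the dimensions being summed over, which is the behaviour requested earlier in the thread.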

TomNicholas added the enhancement label Dec 9, 2021

dcherian added topic-documentation and removed enhancement labels Jul 7, 2022

dcherian mentioned this issue Jun 21, 2024: Support for globs #9151 (Open)
