What should `Dataset.count` return for missing dims? #6749

headtr1ck · 2022-07-03T11:49:12Z

What is your issue?

Calling Dataset.count("x") on a dataset with multiple variables returns ones for variables that lack dimension "x", e.g.:

import xarray as xr
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
ds.count("x")
# returns:
# <xarray.Dataset>
# Dimensions:  (y: 2)
# Dimensions without coordinates: y
# Data variables:
#     a        int32 3
#     b        (y) int32 1 1

I can understand why "1" can be a valid answer, but the result is probably a bit philosophical.

For my use case I would like it to return an array of ds.sizes["x"] (or 0 for missing values). I think this is also a valid return value, considering the broadcasting rules, since the size of the missing dimension is actually known in the dataset.

Maybe one could make this behavior adjustable with a kwarg, e.g. missing_dim_value: {int, "size"}, default 1.


dcherian · 2022-07-05T16:09:55Z

This is quite confusing and I doubt it's intentional.

I would've expected b (y) int32 3 3 assuming that it would've been broadcast along the reduction dimension.

The final value is the result of

import numpy as np
from xarray.core.duck_array_ops import isnull

np.sum(np.logical_not(isnull(ds.b.data)), axis=())
# np.sum([True, True], axis=())

What happens when you call a ufunc with an empty axis tuple? I bet this is just casting bool to int.
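A quick way to check this guess, using plain NumPy only: with `axis=()` no axis is reduced, so `np.sum` should hand back an array of the same shape, and boolean input should come back as integers. The array values below are made up for illustration and stand in for the not-null mask.

```python
import numpy as np

# Stand-in for np.logical_not(isnull(ds.b.data)); the values are illustrative.
not_null = np.array([True, False])

# An empty axis tuple reduces over no axes, so the shape is unchanged.
result = np.sum(not_null, axis=())

print(result)        # one integer per element, not a single total
print(result.dtype)  # a platform integer dtype rather than bool
```

So the "count" for a variable that lacks the dimension is just its not-null mask cast to integers.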

headtr1ck · 2022-07-06T17:58:49Z

> What happens when you call a ufunc with an empty axis tuple?

This should also happen with all other ufuncs then?
I guess most of them just work, like mean, sum etc.

dcherian · 2022-07-07T02:05:06Z

We discussed two options:

1. Dropping variables without the dimension.
2. Returning ds.sizes["x"] by broadcasting b along x.
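Both options can already be emulated by hand. The following is a minimal sketch, not a proposed implementation: it assumes `Dataset.map` and `DataArray.expand_dims` are acceptable ways to do the broadcasting, and it puts a NaN into `b` (unlike the original example) so that the broadcast count is not uniform.

```python
import numpy as np
import xarray as xr

# "b" has one missing value so that the broadcast count is not uniform.
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4.0, np.nan])})

# Option 1: keep only the variables that actually have the dimension.
with_x = [name for name, var in ds.data_vars.items() if "x" in var.dims]
dropped = ds[with_x].count("x")

# Option 2: broadcast every variable along "x" first, then count.
broadcast = ds.map(
    lambda var: var if "x" in var.dims else var.expand_dims(x=ds.sizes["x"])
).count("x")

print(dropped)
print(broadcast["b"].values)  # ds.sizes["x"] for valid entries, 0 for NaN
```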

For the other reductions:

import numpy as np
import xarray as xr

from xarray.core.duck_array_ops import count

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd, count]:
    print(f"{func.__name__!s}({ds.b.data}, axis=()) = {func(ds.b.data, axis=())}")

gives

nansum([4 5], axis=()) = [4 5]
nanprod([4 5], axis=()) = [4 5]
nanmean([4 5], axis=()) = [4. 5.]
nanvar([4 5], axis=()) = [0. 0.]
nanstd([4 5], axis=()) = [0. 0.]
count([4 5], axis=()) = [1 1]

I guess the output for nansum and nanprod doesn't match what you would get by broadcasting along the absent dimension.
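To make the mismatch concrete, here is a small NumPy-only sketch that compares the `axis=()` result with a reduction over an explicitly broadcast axis. The names `b_on_x` and `x_size` are illustrative; `x_size = 3` mirrors `ds.sizes["x"]` in the example dataset.

```python
import numpy as np

b = np.array([4, 5])   # values of ds.b, which only has dimension "y"
x_size = 3             # ds.sizes["x"] in the example dataset

# Repeat b along a new leading axis that plays the role of "x".
b_on_x = np.broadcast_to(b, (x_size, b.size))

results = {}
for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd]:
    ignored = func(b, axis=())         # current behaviour: nothing is reduced
    broadcast = func(b_on_x, axis=0)   # reduce over the broadcast "x" axis
    results[func.__name__] = (ignored, broadcast)
    print(f"{func.__name__}: axis=() -> {ignored}, broadcast -> {broadcast}")
```

Under broadcasting, nansum scales each value by the size of x and nanprod raises it to that power, while nanmean, nanvar and nanstd come out the same either way.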

headtr1ck · 2022-07-07T17:07:10Z

I think that changing the behavior of sum is quite a large breaking change.

headtr1ck · 2022-07-08T09:30:56Z

Another option is to add an argument missing_dim: "raise", "ignore" or "broadcast".
The default would then be "ignore", which is the current implementation.

But for workflows that handle variables which are either a DataArray or a Dataset, should this argument be added to DataArray.sum/count/prod as well?
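A rough sketch of what such an option could look like, written as a standalone helper for count only. The function `count_missing_dim` and its `missing_dim` argument are hypothetical and do not exist in xarray; the "ignore" branch simply defers to whatever `Dataset.count` currently does, and the helper assumes `dim` exists somewhere in the dataset.

```python
import xarray as xr


def count_missing_dim(ds, dim, missing_dim="ignore"):
    """Hypothetical helper; `missing_dim` is not an existing xarray argument."""
    if missing_dim not in ("raise", "ignore", "broadcast"):
        raise ValueError(f"unknown missing_dim option: {missing_dim!r}")

    if missing_dim == "ignore":
        # Defer to the existing behaviour of Dataset.count.
        return ds.count(dim)

    lacking = [name for name, var in ds.data_vars.items() if dim not in var.dims]
    if missing_dim == "raise":
        if lacking:
            raise ValueError(f"variables {lacking} have no dimension {dim!r}")
        return ds.count(dim)

    # "broadcast": repeat variables that lack `dim` along it before counting.
    expanded = ds.map(
        lambda var: var if dim in var.dims else var.expand_dims({dim: ds.sizes[dim]})
    )
    return expanded.count(dim)


ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
out = count_missing_dim(ds, "x", missing_dim="broadcast")
print(out["b"].values)  # every entry of b is counted ds.sizes["x"] times
```

The same wrapper pattern would apply to a single DataArray, which is one way to keep the two code paths consistent.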

headtr1ck added the needs triage label Jul 3, 2022

dcherian added the bug label and removed the needs triage label Jul 5, 2022

headtr1ck mentioned this issue Jul 6, 2022

Fix DataArrayRolling.__iter__ with center=True #6744

Merged

4 tasks

headtr1ck mentioned this issue Jul 8, 2022

A broadcasting sum for xarray.Dataset #6053

Open

dcherian added the needs discussion label Jul 14, 2022

headtr1ck mentioned this issue Apr 13, 2023

diff('non existing dimension') does not raise exception #7748

Open

4 tasks
