What should Dataset.count return for missing dims? #6749

Open
headtr1ck opened this issue Jul 3, 2022 · 5 comments

@headtr1ck
Collaborator

headtr1ck commented Jul 3, 2022

What is your issue?

Calling Dataset.count("x") on a dataset with multiple variables returns ones for every variable that lacks dimension "x", e.g.:

import xarray as xr
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
ds.count("x")
# returns:
# <xarray.Dataset>
# Dimensions:  (y: 2)
# Dimensions without coordinates: y
# Data variables:
#     a        int32 3
#     b        (y) int32 1 1

I can understand why "1" can be a valid answer, but the result is probably a bit philosophical.

For my use case I would like it to return an array of ds.sizes["x"] (or 0). I think this is also a valid return value, considering the broadcasting rules, since the size of the missing dimension is actually known in the dataset.

Maybe one could make this behavior adjustable with a kwarg, e.g. missing_dim_value: {int, "size"}, default 1.
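A minimal sketch of what such a keyword could do, written as a standalone helper. The function name `count_with_missing_dim_value` and its `missing_dim_value` parameter are hypothetical and not part of xarray; the sketch also assumes that NaN entries in a variable without the dimension should count as 0.

```python
import xarray as xr


def count_with_missing_dim_value(ds, dim, missing_dim_value=1):
    """Hypothetical helper (not xarray API): count along `dim`, choosing the
    value reported for variables that do not have `dim`."""
    fill = ds.sizes[dim] if missing_dim_value == "size" else int(missing_dim_value)

    def _count(var):
        if dim in var.dims:
            return var.count(dim)
        # Variable lacks `dim`: report `fill` for valid entries and 0 for NaN.
        return var.notnull().astype(int) * fill

    return ds.map(_count)


ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
result = count_with_missing_dim_value(ds, "x", missing_dim_value="size")
```

With `missing_dim_value="size"` the variable `b` is reported as `ds.sizes["x"]` per element; the default of 1 mirrors the behavior described above.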

headtr1ck added the "needs triage" label Jul 3, 2022
@dcherian
Contributor

dcherian commented Jul 5, 2022

This is quite confusing and I doubt it's intentional.

I would've expected b (y) int32 3 3 assuming that it would've been broadcast along the reduction dimension.

The final value is the result of

import numpy as np
from xarray.core.duck_array_ops import isnull

np.sum(np.logical_not(isnull(ds.b.data)), axis=())
# np.sum([True, True], axis=())

What happens when you call a ufunc with an empty axis tuple? I bet this is just casting bool to int.
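The empty-axis case can be checked with NumPy alone. The sketch below uses a made-up boolean mask; with `axis=()` the sum reduces over no axes, so the result keeps the input shape and each boolean is only converted to an integer.

```python
import numpy as np

mask = np.array([True, False, True])

# axis=() reduces over no axes, so every element is "summed" on its own.
no_reduction = np.sum(mask, axis=())

# Without the empty tuple the mask is reduced to a single count.
full_reduction = np.sum(mask)
```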

dcherian added the "bug" label and removed the "needs triage" label Jul 5, 2022
@headtr1ck
Collaborator Author

> What happens when you call a ufunc with an empty axis tuple?

This should also happen with all other ufuncs then?
I guess most of them just work, like mean, sum etc.

@dcherian
Contributor

dcherian commented Jul 7, 2022

We discussed:

  1. Dropping variables without the dimension
  2. Returning ds.sizes["x"] by broadcasting b along x
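As a rough illustration of option 1, the variables that lack the dimension can be filtered out by hand before counting. This is only a sketch of the resulting output, not a proposed implementation; the names `with_x` and `dropped` are made up for the example.

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

# Keep only the data variables that actually have dimension "x".
with_x = [name for name, var in ds.data_vars.items() if "x" in var.dims]

# "b" no longer appears in the result at all.
dropped = ds[with_x].count("x")
```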

For the other reductions

import numpy as np
import xarray as xr

from xarray.core.duck_array_ops import count

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd, count]:
    print(f"{func.__name__!s}({ds.b.data}, axis=()) = {func(ds.b.data, axis=())}")

gives

nansum([4 5], axis=()) = [4 5]
nanprod([4 5], axis=()) = [4 5]
nanmean([4 5], axis=()) = [4. 5.]
nanvar([4 5], axis=()) = [0. 0.]
nanstd([4 5], axis=()) = [0. 0.]
count([4 5], axis=()) = [1 1]

I guess the output for nansum and nanprod doesn't match what you would get by broadcasting b along the absent dimension x.
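For comparison, here is a small NumPy sketch of the broadcast variant, assuming `b` is given an explicit leading axis of length `ds.sizes["x"]` (3 in the example dataset) and is then reduced over that axis. The variable names are illustrative only.

```python
import numpy as np

b = np.array([4, 5])
n_x = 3  # ds.sizes["x"] in the example dataset

# Give b an explicit leading "x" axis of length 3, then reduce over it.
b_along_x = np.broadcast_to(b, (n_x, b.size))

broadcast_sum = np.nansum(b_along_x, axis=0)
broadcast_prod = np.nanprod(b_along_x, axis=0)
broadcast_mean = np.nanmean(b_along_x, axis=0)
```

Here the sum becomes `[12 15]` and the product `[64 125]`, while the mean stays `[4. 5.]`, so only some reductions are affected by the choice.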

@headtr1ck
Collaborator Author

I think that changing the behavior of sum is quite a large breaking change.

@headtr1ck
Collaborator Author

Another possibility is to add an option, missing_dim: "raise", "ignore" or "broadcast".
The default then would be ignore, which is the current implementation.

But for workflows where a variable can be either a DataArray or a Dataset, should this argument be added to DataArray.sum/count/prod as well?
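The three modes could be prototyped outside xarray along these lines. The helper `reduce_with_missing_dim` and its `missing_dim` argument are hypothetical and do not exist in xarray; "ignore" simply defers to the existing reduction, and "broadcast" is assumed to mean expanding each variable to the full length of the dimension first.

```python
import xarray as xr


def reduce_with_missing_dim(ds, name, dim, missing_dim="ignore"):
    """Hypothetical helper (not xarray API) sketching the proposed option."""
    if missing_dim not in ("raise", "ignore", "broadcast"):
        raise ValueError(f"unknown missing_dim: {missing_dim!r}")

    lacking = [k for k, v in ds.data_vars.items() if dim not in v.dims]
    if missing_dim == "raise" and lacking:
        raise ValueError(f"variables without dimension {dim!r}: {lacking}")
    if missing_dim == "broadcast":
        size = ds.sizes[dim]
        # Give every variable that lacks `dim` an explicit axis of full length.
        ds = ds.map(lambda v: v if dim in v.dims else v.expand_dims({dim: size}))
    # "ignore" falls through to the existing reduction unchanged.
    return getattr(ds, name)(dim)


ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
counted = reduce_with_missing_dim(ds, "count", "x", missing_dim="broadcast")
summed = reduce_with_missing_dim(ds, "sum", "x", missing_dim="broadcast")
```

A DataArray version would need the size of the missing dimension from somewhere else, since a single array does not know about dimensions it lacks.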
