xarray.DataArray.where always returns array of float64 regardless of input dtype #3390

Open
pmallas opened this issue Oct 10, 2019 · 12 comments · Fixed by #3408

@pmallas
Contributor

pmallas commented Oct 10, 2019

MCVE Code Sample

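import numpy as np
import xarray as xr

# 5x5 integer array; np.arange defaults to int32 on this Windows build
a = xr.DataArray(np.arange(25).reshape(5, 5), dims=('x', 'y'))
print(a.dtype)      # int32
# Keep values where x + y < 4; all other cells are filled with NaN by default
a_sub = a.where(a.x + a.y < 4)
print(a_sub.dtype)  # float64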

Expected Output

a_sub should be an xarray DataArray of dtype int32.

Problem Description

The documentation (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.where.html)
states that the return type should be the same type as the caller. However, the returned dtype is always float64.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1 | packaged by conda-forge | (default, Mar 13 2019, 13:32:59) [MSC v.1900 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.13.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.2
distributed: None
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: None
pytest: None
IPython: 7.8.0
sphinx: None

@pmallas pmallas changed the title from "xarray.DataArray.where always returns arrray of float64 regardless in input dtype" to "xarray.DataArray.where always returns array of float64 regardless of input dtype" Oct 10, 2019
@pmallas pmallas closed this as completed Oct 10, 2019
@jhamman
Member

jhamman commented Oct 10, 2019

@pmallas - it looks like you figured this out but I'll just report on what was likely the confusion here.

Xarray's where methods use np.nan as the default other argument, which causes the data to be cast to a float dtype. If you want to maintain an integer type, you'll need to specify another value for other.

xref: http://xarray.pydata.org/en/stable/computation.html#missing-values, http://xarray.pydata.org/en/stable/generated/xarray.DataArray.where.html
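
For illustration, here is a minimal sketch of the difference. The fill value -1 is an arbitrary sentinel chosen for this example, not something xarray prescribes, and the input dtype is pinned to int64 here so the result does not depend on the platform default:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(25, dtype="int64").reshape(5, 5), dims=("x", "y"))
cond = a.x + a.y < 4

# Default: masked-out cells are filled with NaN, so the data is promoted to float.
masked = a.where(cond)
print(masked.dtype)

# An explicit integer fill value (-1 is an arbitrary sentinel) keeps an integer dtype.
filled = a.where(cond, other=-1)
print(filled.dtype)
```

The trade-off is that the sentinel has to be a value that cannot occur in the real data.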

@pmallas
Contributor Author

pmallas commented Oct 11, 2019

Yes, I read the return type as the 'same type as caller' and at first I expected the array dtype to be the same. I soon realized that it means a DataArray or Dataset, and that for the output array to support nan values, it has to be float. My bad - sorry for the clutter.

@dcherian
Contributor

@pmallas it would be nice to update the docstring to make that clear if you are up for it

@pmallas
Contributor Author

pmallas commented Oct 16, 2019

@dcherian Ok, I think I proposed a change correctly - never done this before.

@dcherian dcherian reopened this Oct 17, 2019
@dcherian
Contributor

Looks great. You did well!

@chrisroat
Contributor

chrisroat commented Jun 25, 2020

If drop=True, would it be problematic to return the same dtype or allow other?

My use case is a simple slicing of a dataset -- no missing values. The use of where is due to one of the selections being on a non-dimension coordinate (#2028).

I can work around this using astype, but I will say I was mildly surprised by this feature. I now understand why it's there. Our code is old and the data is intermediate and never deeply inspected -- I only noticed this when we started using a memory-intensive algorithm and was surprised by how much space was taken by our supposed uint16 data. :)
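
A rough sketch of that workaround, under assumptions: the array, the non-dimension coordinate name `group`, and its values are all made up for illustration. It selects rows by the non-dimension coordinate with where(..., drop=True), confirms nothing is missing, and casts back to the original dtype:

```python
import numpy as np
import xarray as xr

# Hypothetical uint16 data with a non-dimension coordinate "group" along x.
data = xr.DataArray(
    np.arange(12, dtype="uint16").reshape(3, 4),
    dims=("x", "y"),
    coords={"group": ("x", [0, 1, 0])},
)

# Selecting on the non-dimension coordinate promotes the data to a float dtype.
subset = data.where(data.group == 0, drop=True)
print(subset.dtype)

# Cast back only after confirming that no cell was actually masked out.
if bool(subset.notnull().all()):
    restored = subset.astype(data.dtype)
    print(restored.dtype)
```

The notnull check matters: casting NaN to an unsigned integer type would silently produce meaningless values.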

@shoyer
Member

shoyer commented Jun 26, 2020

The trouble with returning the same dtype for uint16 values is that there's no easy way to have a missing value for uint16.
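
To make that point concrete, here is a small NumPy-only sketch (it illustrates the general dtype constraint, not xarray's internal promotion logic, which may pick a different float width):

```python
import numpy as np

info = np.iinfo(np.uint16)
# All 2**16 bit patterns encode ordinary integers, so none is left over for "missing".
n_values = int(info.max) - int(info.min) + 1
print(n_values)

# Floating-point types reserve NaN for this purpose.
print(np.isnan(np.float32("nan")))

# Combining uint16 with a float type therefore promotes to a float type.
promoted = np.result_type(np.uint16, np.float32)
print(promoted)
```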

I don't entirely remember why we don't allow other in where if drop=True, but indeed that seems like a clean solution.

I suspect it might have something to do with alignment. But as long as other is already aligned with the result of aligning self and other (e.g., if other is a scalar, which is probably typical), then it should be fine to allow the other argument.

@dcherian dcherian reopened this Jun 28, 2020
@chrisroat
Contributor

What about the case of no missing values, when other wouldn't be needed? Could the same dtype be returned then? This is my case, since I'm re-purposing where to do sel for non-dimension coordinates.

I'm capable of just recasting for my use case, if this is becoming an idea that would be difficult to maintain/document.

@shoyer
Member

shoyer commented Jun 29, 2020

> What about the case of no missing values, when other wouldn't be needed? Could the same dtype be returned then? This is my case, since I'm re-purposing where to do sel for non-dimension coordinates.

Could you give a concrete example of what this would look like?

It seems rather unlikely to me to have an example of where with drop=True where the condition is exactly aligned with the grid, such that there are no missing values.

I guess it could happen if you're trying to index out exactly one element along a dimension?

In the long term, the cleaner solution for this will be some form of support for more flexible / multi-dimensional indexing.

@chrisroat
Contributor

chrisroat commented Jun 29, 2020 via email

@dcherian dcherian mentioned this issue Jul 17, 2020
@dcherian
Contributor

dcherian commented Mar 2, 2021

> It seems rather unlikely to me to have an example of where with drop=True where the condition is exactly aligned with the grid, such that there are no missing values.

Actually, this is a really common pattern:

ds = xr.tutorial.open_dataset('air_temperature')
ds.where(ds.time.dt.hour.isin([0, 12]), drop=True)

The efficient way to do this is

ds.loc[{"time": ds.time.dt.hour.isin([0, 12])}]

or

ds.sel(time=ds.time.dt.hour.isin([0, 12]))

At this point

xarray/xarray/core/common.py

Lines 1270 to 1273 in 48378c4

self = self.isel(**indexers)
cond = cond.isel(**indexers)
return ops.where_method(self, cond, other)

cond is all True and applying where is basically a totally useless copy since the isel has already copied.

Shall we raise a warning in where advising the more-efficient syntax? Or shall we skip the call to where_method?
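
As a hedged illustration of the dtype difference between the two patterns, here is a sketch on a small synthetic dataset (the variable name `air` and the int16 values are made up so that the example runs without downloading the tutorial data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Eight timestamps spaced six hours apart: hours 0, 6, 12, 18, 0, 6, 12, 18.
time = pd.to_datetime("2000-01-01") + pd.to_timedelta(np.arange(8) * 6, unit="h")
ds = xr.Dataset(
    {"air": ("time", np.arange(8, dtype="int16"))},
    coords={"time": time},
)
mask = ds.time.dt.hour.isin([0, 12])

# where(..., drop=True) selects the same rows but promotes the data to float.
via_where = ds.where(mask, drop=True)
print(via_where.air.dtype)

# Boolean indexing through sel keeps the original integer dtype.
via_sel = ds.sel(time=mask)
print(via_sel.air.dtype)
```

Both results contain the same four time steps; only the dtype of the data differs.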

@shoyer
Member

shoyer commented Mar 2, 2021

> Shall we raise a warning in where advising the more-efficient syntax? Or shall we skip the call to where_method

I'm not sure that either of these is a good idea.

The problem with raising a warning is that this is well-defined behavior. It may not always be useful, but well-defined but useless behavior arises all the time in programs, so it's annoying to raise a warning for a special case.

The problem with skipping where_method is that now we end up with a potentially inconsistent dtype, depending on the selection. These sorts of special cases can be quite frustrating to program around.
