xarray.DataArray.where always returns array of float64 regardless of input dtype #3390

pmallas · 2019-10-10T20:29:26Z

MCVE Code Sample

import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(25).reshape(5, 5), dims=('x', 'y'))
print(a.dtype)
# int32
a_sub = a.where(a.x + a.y < 4)
print(a_sub.dtype)
# float64

Expected Output

a_sub should be an xarray of dtype int32

Problem Description

The documentation (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.where.html)
states that the return type should be the same type as the caller. However, the return type is always float64.

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1 | packaged by conda-forge | (default, Mar 13 2019, 13:32:59) [MSC v.1900 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.13.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.2
distributed: None
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: None
pytest: None
IPython: 7.8.0
sphinx: None


jhamman · 2019-10-10T23:06:14Z

@pmallas - it looks like you figured this out but I'll just report on what was likely the confusion here.

Xarray's where methods use np.nan as the default other argument, which causes the data to be cast to float. If you want to maintain an integer type, you'll need to specify another value for other.

xref: http://xarray.pydata.org/en/stable/computation.html#missing-values, http://xarray.pydata.org/en/stable/generated/xarray.DataArray.where.html
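
As a minimal sketch of the point above, the following compares the default call with one that passes an explicit `other`. The sentinel value `-1`, the explicit `int64` dtype and the explicit coordinates are assumptions made for illustration; they are not part of the original report.

```python
import numpy as np
import xarray as xr

# int64 is requested explicitly so the result does not depend on the platform default.
a = xr.DataArray(
    np.arange(25, dtype="int64").reshape(5, 5),
    dims=("x", "y"),
    coords={"x": np.arange(5), "y": np.arange(5)},
)
cond = a.x + a.y < 4

# Default: masked positions become NaN, so the data is promoted to a float dtype.
masked_nan = a.where(cond)
print(masked_nan.dtype)

# Assumed sentinel: -1 marks masked positions, so no NaN (and no float cast) is needed.
masked_fill = a.where(cond, other=-1)
print(masked_fill.dtype)
```

A sentinel only makes sense if it cannot occur in the real data; otherwise masked and valid cells become indistinguishable.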

pmallas · 2019-10-11T14:33:02Z

Yes, I read the return type as the 'same type as caller' and at first I expected the array dtype to be the same. I soon realized that it means a DataArray or Dataset. And for your output array to support nan values, it has to be float. My bad - sorry for the clutter.

dcherian · 2019-10-11T15:16:48Z

@pmallas it would be nice to update the docstring to make that clear if you are up for it

pmallas · 2019-10-16T22:52:48Z

@dcherian Ok, I think I proposed a change correctly - never done this before.

dcherian · 2019-10-17T00:32:08Z

Looks great. You did well!

chrisroat · 2020-06-25T23:08:52Z

If drop=True, would it be problematic to return the same dtype or allow other?

My use case is a simple slicing of a dataset -- no missing values. The use of where is due to one of the selections being on a non-dimension coordinate (#2028).

I can work around this using astype, but will say I was mildly surprised by this feature. I now understand why it's there. Our code is old and the data is intermediate and never deeply inspected -- I only noticed this when we started using a memory-intensive algorithm and I was surprised by how much space was taken by our supposed uint16 data. :)
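
The astype workaround mentioned above might look like the following sketch. The array shape, the `channel` dimension name and the stain labels other than `DAPI` are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np
import xarray as xr

# Hypothetical image stack: three stains stored along a "channel" dimension,
# with the stain name attached as a non-dimension coordinate.
data = xr.DataArray(
    np.arange(3 * 4 * 4, dtype="uint16").reshape(3, 4, 4),
    dims=("channel", "y", "x"),
    coords={"stain": ("channel", ["DAPI", "GFP", "RFP"])},
)

# where(..., drop=True) keeps only the matching channel but promotes to a float dtype.
selected = data.where(data.coords["stain"] == "DAPI", drop=True)
print(selected.dtype, selected.nbytes)

# Casting back is safe here only because no cell was actually masked.
restored = selected.astype(data.dtype)
print(restored.dtype, restored.nbytes)
```

If any cell had been masked, the cast back to an integer dtype would turn NaN into an arbitrary integer, so this is only appropriate when the condition lines up exactly with the selection.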

shoyer · 2020-06-26T04:53:32Z

The trouble with returning the same dtype for uint16 values is that there's no easy way to have a missing value for uint16.

I don't entirely remember why we don't allow other in where if drop=True, but indeed that seems like a clean solution.

I suspect it might have something to do with alignment. But as long as other is already aligned with the result of aligning self and cond (e.g., if other is a scalar, which is probably typical), then it should be fine to allow the other argument.

chrisroat · 2020-06-29T03:49:27Z

What about the case of no missing values, when other wouldn't be needed? Could the same dtype be returned then? This is my case, since I'm re-purposing where to do sel for non-dimension coordinates.

I'm capable of just recasting for my use case, if this is becoming an idea that would be difficult to maintain/document.

shoyer · 2020-06-29T04:34:49Z

> What about the case of no missing values, when other wouldn't be needed? Could the same dtype be returned then? This is my case, since I'm re-purposing where to do sel for non-dimension coordinates.

Could you give a concrete example of what this would look like?

It seems rather unlikely to me to have an example of where with drop=True where the condition is exactly aligned with the grid, such that there are no missing values.

I guess it could happen if you're trying to index out exactly one element along a dimension?

In the long term, the cleaner solution for this will be some form of support for more flexible / multi-dimensional indexing.

chrisroat · 2020-06-29T04:52:11Z

> What about the case of no missing values, when other wouldn't be needed? Could the same dtype be returned then? This is my case, since I'm re-purposing where to do sel for non-dimension coordinates.
>
> Could you give a concrete example of what this would look like?
>
> It seems rather unlikely to me to have an example of where with drop=True where the condition is *exactly* aligned with the grid, such that there are no missing values.
>
> I guess it could happen if you're trying to index out exactly one element along a dimension?

That's exactly right. I am just selecting one slice of a data array, using `data.where(data.coords['stain'] == 'DAPI')`.

> In the long term, the cleaner solution for this will be some form of support for more flexible / multi-dimensional indexing.

Agreed. Once I actually get things running, I'll be ready to try and contribute fixes for all my TODOs that reference xarray github issues. :)

dcherian · 2021-03-02T01:42:09Z

> It seems rather unlikely to me to have an example of where with drop=True where the condition is exactly aligned with the grid, such that there are no missing values.

Actually, this is a really common pattern

ds = xr.tutorial.open_dataset('air_temperature')
ds.where(ds.time.dt.hour.isin([0, 12]), drop=True)

The efficient way to do this is

ds.loc[{"time": ds.time.dt.hour.isin([0, 12])}]

or

ds.sel(time=ds.time.dt.hour.isin([0, 12]))

At this point

xarray/xarray/core/common.py

Lines 1270 to 1273 in 48378c4

    
            self = self.isel(**indexers)
            cond = cond.isel(**indexers)

        return ops.where_method(self, cond, other)

cond is all True and applying where is basically a totally useless copy since the isel has already copied.

Shall we raise a warning in where advising the more-efficient syntax? Or shall we skip the call to where_method?
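
The difference between the two spellings above can be sketched on synthetic data. The small in-memory dataset below is an assumption standing in for the tutorial dataset (which requires a download); the variable name `air` and the int16 dtype are chosen only for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the tutorial dataset: 6-hourly int16 samples over two days.
times = np.datetime64("2000-01-01T00:00", "ns") + np.arange(8) * np.timedelta64(6, "h")
ds = xr.Dataset({"air": ("time", np.arange(8, dtype="int16"))}, coords={"time": times})

mask = ds.time.dt.hour.isin([0, 12])

via_where = ds.where(mask, drop=True)  # goes through NaN-aware masking
via_sel = ds.sel(time=mask)            # pure indexing, no fill value involved

print(via_where["air"].dtype, via_sel["air"].dtype)
```

Both calls select the same time steps, but only the indexing form leaves the integer dtype untouched.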

shoyer · 2021-03-02T02:42:37Z

> Shall we raise a warning in where advising the more-efficient syntax? Or shall we skip the call to where_method?

I'm not sure that either of these is a good idea.

The problem with raising a warning is that this is well-defined behavior. It may not always be useful, but well-defined but useless behavior arises all the time in programs, so it's annoying to raise a warning for a special case.

The problem with skipping where_method is that now we end up with a potentially inconsistent dtype, depending on the selection. These sorts of special cases can be quite frustrating to program around.
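
To illustrate the inconsistency described above, here is a rough sketch of what such a shortcut could behave like. The helper `where_drop_with_shortcut` is hypothetical and is not xarray's implementation; it emulates skipping the masking step by casting back to the input dtype whenever nothing ended up masked, and it assumes the input contains no NaN of its own.

```python
import numpy as np
import xarray as xr


def where_drop_with_shortcut(da, cond):
    """Hypothetical variant of where(cond, drop=True), not xarray's behaviour.

    Emulates skipping the masking step when nothing is left to mask after
    dropping. Assumes the input itself holds no NaN (true for integer data).
    """
    masked = da.where(cond, drop=True)
    if not bool(masked.isnull().any()):
        return masked.astype(da.dtype)
    return masked


a = xr.DataArray(
    np.arange(25, dtype="int64").reshape(5, 5),
    dims=("x", "y"),
    coords={"x": np.arange(5), "y": np.arange(5)},
)

rows = where_drop_with_shortcut(a, a.x < 2)            # condition aligned with the grid
triangle = where_drop_with_shortcut(a, a.x + a.y < 4)  # some cells remain masked

print(rows.dtype, triangle.dtype)
```

The same function returns an integer array for one condition and a float array for another, so downstream code would have to handle both cases depending on the data.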

pmallas changed the title ~~xarray.DataArray.where always returns arrray of float64 regardless in input dtype~~ xarray.DataArray.where always returns array of float64 regardless of input dtype Oct 10, 2019

pmallas closed this as completed Oct 10, 2019

dcherian reopened this Oct 17, 2019

pmallas mentioned this issue Oct 17, 2019

Update where docstring to make return value type more clear #3408

Merged

3 tasks

dcherian closed this as completed in #3408 Oct 17, 2019

dcherian reopened this Jun 28, 2020

dcherian mentioned this issue Jul 17, 2020

per-variable fill values #4237

Merged

5 tasks
