[Feature request] Masked operations #4143

Open
Hoeze opened this issue Jun 10, 2020 · 1 comment

Hoeze commented Jun 10, 2020

Xarray already has unstack(sparse=True), which is quite awesome.
However, converting a very dense array (existing values >> missing values) to a sparse representation is often costly. Also, many calculations require converting the sparse array back into a dense array and manually masking the missing values (e.g. Keras).

Logically, a sparse array is equivalent to a masked dense array.
They differ only in their internal data representation.
Therefore, I propose a masked=True option for all operations that can create missing values. These include (among others):

  • .unstack([...], masked=True)
  • .where(<multi-dimensional array>, masked=True)
  • .align([...], masked=True)

This would solve a number of problems:

  • No more conversion of int -> float
  • Explicit value for missingness
  • When stacking data with missing values, the missing values can simply be dropped
  • When converting data with missing values to a DataFrame, the missing values can simply be dropped
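To illustrate the first and third points outside of xarray, here is a minimal NumPy-only sketch. It assumes that numpy.ma masked arrays are a reasonable stand-in for what masked=True would produce; the variable names are made up for this example.

```python
import numpy as np

x = np.array([[25, 35], [10, 24]])  # integer data, as in the example below

# NaN-based padding: the result has to be float, because NaN is a float value.
padded_nan = np.full((3, 2), np.nan)
padded_nan[:2] = x

# Mask-based padding: start fully masked, then fill in the known rows.
# Assigning values unmasks them; the data keep the integer dtype of x.
padded_masked = np.ma.masked_all((3, 2), dtype=x.dtype)
padded_masked[:2] = x

print(padded_nan.dtype)             # float64
print(padded_masked.dtype)          # same integer dtype as x
print(padded_masked.compressed())   # only the non-missing values, flattened
```

Dropping the missing entries is then a single compressed() call, with no NaN check and no float round trip.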

MCVE Code Sample

An example would be outer joins with slightly different coordinates (taken from the documentation):

>>> x
<xarray.DataArray (lat: 2, lon: 2)>
array([[25, 35],
       [10, 24]])
Coordinates:
* lat      (lat) float64 35.0 40.0
* lon      (lon) float64 100.0 120.0

>>> y
<xarray.DataArray (lat: 2, lon: 2)>
array([[20,  5],
       [ 7, 13]])
Coordinates:
* lat      (lat) float64 35.0 42.0
* lon      (lon) float64 100.0 120.0

Non-masked outer join:

>>> a, b = xr.align(x, y, join="outer")
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
array([[25., 35.],
       [10., 24.],
       [nan, nan]])
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
array([[20.,  5.],
       [nan, nan],
       [ 7., 13.]])
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0

The masked version:

>>> a, b = xr.align(x, y, join="outer", masked=True)
>>> a
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[25, 35],
                   [10, 24],
                   [--, --]],
             mask=[[False, False],
                   [False, False],
                   [True, True]],
             fill_value=0)
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
>>> b
<xarray.DataArray (lat: 3, lon: 2)>
masked_array(data=[[20, 5],
                   [--, --],
                   [7, 13]],
             mask=[[False, False],
                   [True, True],
                   [False, False]],
             fill_value=0)
Coordinates:
* lat      (lat) float64 35.0 40.0 42.0
* lon      (lon) float64 100.0 120.0
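Until something like masked=True exists, the result above can be approximated with the existing fill_value argument of xr.align. This is only a sketch: masked_outer_align is a hypothetical helper, not part of xarray, and it returns plain numpy.ma arrays, so the coordinates are dropped.

```python
import numpy as np
import xarray as xr

x = xr.DataArray(
    [[25, 35], [10, 24]],
    dims=("lat", "lon"),
    coords={"lat": [35.0, 40.0], "lon": [100.0, 120.0]},
)
y = xr.DataArray(
    [[20, 5], [7, 13]],
    dims=("lat", "lon"),
    coords={"lat": [35.0, 42.0], "lon": [100.0, 120.0]},
)


def masked_outer_align(*arrays, fill_value=0):
    """Hypothetical helper: outer-align and return numpy masked arrays."""
    # Pad with an integer placeholder so the dtype is not promoted to float.
    aligned = xr.align(*arrays, join="outer", fill_value=fill_value)
    # Align an all-True "validity" array the same way; padding becomes False.
    valid = xr.align(
        *[xr.ones_like(arr, dtype=bool) for arr in arrays],
        join="outer",
        fill_value=False,
    )
    return tuple(
        np.ma.masked_array(a.values, mask=~v.values.astype(bool))
        for a, v in zip(aligned, valid)
    )


a, b = masked_outer_align(x, y)
print(a)
print(b)
```

The cost is a second alignment pass for the validity arrays, which a built-in masked option could presumably avoid.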

Related issue:
#3955

@max-sixty
Collaborator

While searching for issues related to #1887, I came across this one and saw that no one had replied.

More support for missing values would be great; they are a constant source of complications. That said, this proposal would likely require a lot of work to implement and would add some complexity to the API.

Considering this alongside other approaches to missing-value support in the community (e.g. pandas' Int type) would give this more context.
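For comparison, a short sketch of how pandas' nullable integer dtype behaves in a similar situation. It assumes "pandas' Int type" refers to the "Int64" extension dtype; the series and index values are made up to mirror the lat coordinate from the example above.

```python
import pandas as pd

s = pd.Series([25, 35], index=[35.0, 40.0])
new_index = [35.0, 40.0, 42.0]

# Default integer dtype: reindexing introduces NaN and promotes to float64.
r_default = s.reindex(new_index)

# Nullable integer dtype: the missing entry becomes <NA>, dtype stays Int64.
r_nullable = s.astype("Int64").reindex(new_index)

print(r_default.dtype)    # float64
print(r_nullable.dtype)   # Int64
print(r_nullable.dropna())
```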
