Support for __array_function__ implementers (sparse arrays) [WIP] #3117
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master   #3117     +/-   ##
==========================================
+ Coverage    95.7%   95.73%   +0.03%
==========================================
  Files          63       63
  Lines       12861    12870       +9
==========================================
+ Hits        12308    12321      +13
+ Misses        553      549       -4
@nvictus, thank you for your work! I just tried this on CuPy arrays, and it seems to be working during array creation:

In [1]: import cupy as cp
In [2]: import xarray as xr
In [3]: x = cp.arange(6).reshape(2, 3).astype('f')
In [4]: y = cp.ones((2, 3), dtype='int')
In [5]: x
Out[5]:
array([[0., 1., 2.],
[3., 4., 5.]], dtype=float32)
In [6]: y
Out[6]:
array([[1, 1, 1],
[1, 1, 1]])
In [7]: y.device
Out[7]: <CUDA Device 0>
In [8]: x.device
Out[8]: <CUDA Device 0>
In [9]: ds = xr.Dataset()
In [10]: ds['x'] = xr.DataArray(x, dims=['lat', 'lon'])
In [11]: ds['y'] = xr.DataArray(y, dims=['lat', 'lon'])
In [12]: ds
Out[12]:
<xarray.Dataset>
Dimensions: (lat: 2, lon: 3)
Dimensions without coordinates: lat, lon
Data variables:
x (lat, lon) float32 ...
y (lat, lon) int64 ...
In [13]: ds.x.data.device
Out[13]: <CUDA Device 0>
In [14]: ds.y.data.device
Out[14]: <CUDA Device 0>

Even though it failed when I tried applying an operation to the dataset, this is still awesome! I am very excited and looking forward to seeing this feature in xarray:

In [15]: m = ds.mean(dim='lat')
-------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-8e4d5e7d5ee3> in <module>
----> 1 m = ds.mean(dim='lat')
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/common.py in wrapped_func(self, dim, skipna, **kwargs)
65 return self.reduce(func, dim, skipna=skipna,
66 numeric_only=numeric_only, allow_lazy=True,
---> 67 **kwargs)
68 else:
69 def wrapped_func(self, dim=None, **kwargs): # type: ignore
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/dataset.py in reduce(self, func, dim, keep_attrs, keepdims, numeric_only, allow_lazy, **kwargs)
3532 keepdims=keepdims,
3533 allow_lazy=allow_lazy,
-> 3534 **kwargs)
3535
3536 coord_names = set(k for k in self.coords if k in variables)
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/variable.py in reduce(self, func, dim, axis, keep_attrs, keepdims, allow_lazy, **kwargs)
1392 input_data = self.data if allow_lazy else self.values
1393 if axis is not None:
-> 1394 data = func(input_data, axis=axis, **kwargs)
1395 else:
1396 data = func(input_data, **kwargs)
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/duck_array_ops.py in mean(array, axis, skipna, **kwargs)
370 return _to_pytimedelta(mean_timedeltas, unit='us') + offset
371 else:
--> 372 return _mean(array, axis=axis, skipna=skipna, **kwargs)
373
374
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/duck_array_ops.py in f(values, axis, skipna, **kwargs)
257
258 try:
--> 259 return func(values, axis=axis, **kwargs)
260 except AttributeError:
261 if isinstance(values, dask_array_type):
/glade/work/abanihi/devel/pangeo/xarray/xarray/core/nanops.py in nanmean(a, axis, dtype, out)
158 return dask_array.nanmean(a, axis=axis, dtype=dtype)
159
--> 160 return np.nanmean(a, axis=axis, dtype=dtype)
161
162
/glade/work/abanihi/softwares/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/numpy/core/overrides.py in public_api(*args, **kwargs)
163 relevant_args = dispatcher(*args, **kwargs)
164 return implement_array_function(
--> 165 implementation, public_api, relevant_args, args, kwargs)
166
167 if module is not None:
TypeError: no implementation found for 'numpy.nanmean' on types that implement __array_function__: [<class 'cupy.core.core.ndarray'>]
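This TypeError is NEP 18's defined failure mode: when an array type implements `__array_function__` but returns `NotImplemented` for the requested function, NumPy raises instead of silently coercing to a dense ndarray. A minimal sketch of that dispatch behavior (the `Duck` class is hypothetical, not from this PR or CuPy):

```python
import numpy as np

class Duck:
    """Hypothetical minimal NEP-18 duck array: implements only np.mean."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.mean:
            # Unwrap Duck arguments, then delegate to NumPy.
            args = [a.data if isinstance(a, Duck) else a for a in args]
            return np.mean(*args, **kwargs)
        # Returning NotImplemented makes NumPy raise the TypeError seen above.
        return NotImplemented

d = Duck([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
print(np.mean(d))  # dispatches through __array_function__ -> 2.5
try:
    np.nanmean(d)
except TypeError as exc:
    print("nanmean failed:", exc)
```

The same mechanism explains why array creation above works but `ds.mean()` does not: only the functions a library chooses to implement are available through the protocol.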
I am really excited about this!
Yes, it really is! For this specific failure, we should think about adding an option for the default value of `skipna`. If someone is using xarray to wrap a computation-oriented library like CuPy, they probably almost always want to set `skipna=False`.
xarray/tests/test_nep18.py (Outdated)

    sparse = pytest.importorskip('sparse')
Xarray does automatic alignment, so it would also be really good to add a test case with multiple duck arrays with different coordinate labels, to verify that alignment works. xarray/tests/test_dask.py also has a bunch of good examples of things to test.
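To make the alignment suggestion concrete, here is a toy model of what such a test would exercise: an inner join on coordinate labels before a binary operation. The `aligned_add` helper is hypothetical (plain dicts standing in for DataArrays), not xarray API:

```python
import numpy as np

def aligned_add(a, b):
    """Toy model of xarray's automatic alignment: inner-join the
    coordinate labels along 'x', then add the matching values."""
    common = [lab for lab in a["x"] if lab in b["x"]]   # label intersection
    ia = [a["x"].index(lab) for lab in common]
    ib = [b["x"].index(lab) for lab in common]
    return {"x": common, "data": a["data"][ia] + b["data"][ib]}

a = {"x": ["a", "b", "c"], "data": np.array([1.0, 2.0, 3.0])}
b = {"x": ["b", "c", "d"], "data": np.array([10.0, 20.0, 30.0])}
out = aligned_add(a, b)
print(out["x"], out["data"])  # ['b', 'c'] [12. 23.]
```

A real test would build two DataArrays backed by duck arrays with partially overlapping coordinates and check that the result is still a duck array of the aligned shape.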
This version should be much more compatible out of the box with duck typing.
After writing more tests, it turns out ... With a serendipitous shape and density of a sparse array, there were the right number of ...

A simple fix is to special-case ... Would it make sense to just assume that all non-DataArray NEP-18 compliant arrays do not contain an xarray-compliant ...?

Hmm, looks like the ...
xarray/tests/test_nep18.py (Outdated)

    assert isinstance(func(A).data, sparse.SparseArray)

    class TestSparseVariable:
An aside (don't let this slow you down): these are tests that we could run across any eligible array, including the current numpy & dask arrays. You could pull tests from the current dask tests if that would be a shortcut to building up this test suite from scratch. We could do that by inheriting this class, or having `var` as a fixture.
Thanks @max-sixty. I did copy them over from the dask tests. I'm still figuring out which ones are transferable to this use case since I'm not a regular xarray user myself (yet) :)
Great! LMK if I can be helpful at all. Excited about the PR!
Yes, let's switch ... to ...
Thanks for bumping this @mrocklin! I've put in some extra work on my free time, which hasn't been pushed yet. I'll try to write up a summary of my findings today. Briefly, though, it seems like the two limiting factors for NEP-18 duck array support are: ...

I think NEP-18-backed xarray structures can be supported in principle, but that won't prevent some operations from simply failing in some contexts. So maybe xarray will need to define a minimum required implementation subset of the array API for duck arrays.
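One way to make such a "minimum required subset" concrete is to probe a duck array through the protocol and record which functions actually dispatch — essentially the expected-failure matrix the tests in this PR end up encoding. A sketch with a hypothetical `OnlySum` type (not from the PR):

```python
import numpy as np

class OnlySum:
    """Hypothetical duck array implementing a single NumPy function."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            args = [a.data if isinstance(a, OnlySum) else a for a in args]
            return np.sum(*args, **kwargs)
        return NotImplemented

def support_matrix(arr, funcs):
    """Record which unary reductions dispatch successfully for `arr`.
    NEP-18 signals a missing implementation with TypeError."""
    result = {}
    for f in funcs:
        try:
            f(arr)
            result[f.__name__] = True
        except TypeError:
            result[f.__name__] = False
    return result

print(support_matrix(OnlySum([1, 2, 3]), [np.sum, np.mean, np.nanmean]))
# {'sum': True, 'mean': False, 'nanmean': False}
```

Running such a probe against sparse or CuPy arrays would give a data-driven picture of what xarray can rely on, rather than trying to define the subset up front.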
This is totally fine for now, as long as there are clear errors when attempting to do an unsupported operation. We can write unit tests with expected failures, which should provide a clear roadmap for things to fix upstream in sparse. We could attempt to define a minimum required implementation, but in practice I suspect this will be hard to nail down definitively. The ultimate determinant of what works will be xarray's implementation.
So, tests are passing now and I've documented the expected failures on sparse arrays. :) As mentioned before, most fall into the categories of (1) implicit coercion to dense and (2) missing operations on sparse arrays. Turns out a lot of the implicit coercion is due to the use of routines from ... I also modified ... If ...
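The implicit-coercion failure mode can be reproduced without sparse installed: a duck array can refuse densification by raising from `__array__`, so any helper that calls `np.asarray` internally fails, while protocol-dispatched functions keep working. A sketch with a hypothetical `NoDensify` class (loosely modeled on how sparse arrays guard against auto-densification):

```python
import numpy as np

class NoDensify:
    """Hypothetical duck array that forbids implicit densification."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # np.asarray() lands here; refuse, the way sparse arrays may.
        raise RuntimeError("implicit conversion to dense is not allowed")

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            args = [a._data if isinstance(a, NoDensify) else a for a in args]
            return np.sum(*args, **kwargs)
        return NotImplemented

x = NoDensify([1, 2, 3])
print(np.sum(x))   # protocol path: works, returns 6
try:
    np.asarray(x)  # coercion path: the source of many expected failures
except RuntimeError as exc:
    print(exc)
```

This is why replacing `np.asarray`-based helpers with protocol-dispatching calls is the recurring fix in this PR.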
Looks great to me! I think we can merge this in after syncing again with master.
I probably wouldn't call this ready for widespread use until we get a few more features working (e.g., `concat`), but this is a great start, and it will be easier for others to contribute once this is merged.
xarray/tests/test_sparse.py (Outdated)

    (do('astype', int), True),
    (do('broadcast_equals', make_xrarray({'x': 10, 'y': 5})), False),
    (do('clip', min=0, max=1), True),
    # (do('close'), False),
I'm guessing the commented out cases don't work yet? :)
Oops, forgot to finish those! :)
I've added tests for all except those I couldn't figure out how to use, like `swap_dims`, `sel_points` and `isel_points`.

I think the right behavior is probably for ...
thanks!
xarray/tests/test_sparse.py (Outdated)

    def test_binary_op(self):
        assert_sparse_eq((2 * self.var).data, 2 * self.data)
        assert_sparse_eq((self.var + self.var).data, self.data + self.data)
        # assert_eq((self.var[0] + self.var).data, self.data[0] + self.data)
remove?
@nvictus, are we good to go ahead and merge, and do follow-ups in other PRs?

Sounds good!

Thanks a lot, @nvictus!

Woot! Thanks @nvictus!
* master:
  - pyupgrade one-off run (pydata#3190)
  - mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  - reduce the size of example dataset in dask docs (pydata#3187)
  - add climpred to related-projects (pydata#3188)
  - bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  - More annotations (pydata#3177)
  - Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  - Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  - Call darray.compute() in plot() (pydata#3183)
  - BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)
@@ -21,6 +21,7 @@ dependencies:
   - pip
   - scipy
   - seaborn
+  - sparse
   - toolz
Why did you exclude py36 and Windows?
Thanks @nvictus!
Doing a SciPy sprint. Working towards #1375.
Together with pydata/sparse#261, it seems to work.
- `whats-new.rst` for all changes and `api.rst` for new API