
Sparse arrays #1375

Closed
mrocklin opened this issue Apr 14, 2017 · 25 comments
Labels
design question, topic-arrays related to flexible array support

Comments

@mrocklin
Contributor

I would like to have an XArray that has scipy.sparse arrays rather than numpy arrays. Is this in scope?

What would need to happen within XArray to support this?

@shoyer
Member

shoyer commented Apr 14, 2017

Yes, I would say this is in scope, as long as we can keep most of the data-type specific logic out of xarray's core (which seems doable).

Currently, we define most of our operations on duck arrays in https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py

There are a few other hacks throughout the codebase, which you can find by searching for "dask_array_type": https://github.com/pydata/xarray/search?p=1&q=dask_array_type&type=&utf8=%E2%9C%93

It's pretty crude, but basically this would need to be extended to implement many of these methods for sparse arrays, too. Ideally we would move xarray's adapter logic into more cleanly separated submodules, perhaps using multiple dispatch. Even better, we would make this a public API, so you can write something like xarray.register_data_type(MySparseArray) to register a type as valid for xarray's .data attribute.
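A registration mechanism along those lines could be sketched with `functools.singledispatch` (a hypothetical illustration only; `nansum` here and the `FakeSparse` type are made up for the example, not xarray's actual API):

```python
from functools import singledispatch

import numpy as np

@singledispatch
def nansum(arr, axis=None):
    # Default path: plain numpy arrays.
    return np.nansum(arr, axis=axis)

class FakeSparse:
    # Stand-in for a sparse duck array; a real registration would
    # target an actual sparse array type.
    def __init__(self, dense):
        self.dense = np.asarray(dense)

@nansum.register(FakeSparse)
def _(arr, axis=None):
    # Backend-specific implementation registered for the sparse type.
    return np.nansum(arr.dense, axis=axis)

print(nansum(np.array([1.0, np.nan, 2.0])))    # numpy path -> 3.0
print(nansum(FakeSparse([1.0, np.nan, 2.0])))  # sparse path -> 3.0
```

Each backend type would register its own implementations, keeping type-specific logic out of the core dispatch function.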

It looks like __array_ufunc__ will actually finally land in NumPy 1.13, which might make this easier.
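To illustrate what that protocol enables, here is a minimal duck array implementing the `__array_ufunc__` hook (an illustrative sketch; the `WrappedArray` name is invented for the example):

```python
import numpy as np

class WrappedArray:
    # Minimal duck array using the __array_ufunc__ protocol (NumPy >= 1.13):
    # ufuncs applied to it are intercepted and re-wrapped.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any WrappedArray inputs, apply the ufunc, wrap the result.
        unwrapped = [x.data if isinstance(x, WrappedArray) else x
                     for x in inputs]
        return WrappedArray(getattr(ufunc, method)(*unwrapped, **kwargs))

a = WrappedArray([1, 2, 3])
b = np.add(a, 1)  # dispatches to __array_ufunc__, returns a WrappedArray
print(type(b).__name__, b.data)  # WrappedArray [2 3 4]
```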

See also #1118

@rabernat
Contributor

rabernat commented Apr 15, 2017

👍 to the scipy.sparse array suggestion

[While we are discussing supporting other array types, we should keep gpu arrays on the radar]

@benbovy
Member

benbovy commented Apr 15, 2017

Although I don't know much about SciDB, it seems to be another possible application for xarray.register_data_type.

@mrocklin
Contributor Author

Here is a brief attempt at a multi-dimensional sparse array: https://github.com/mrocklin/sparse

It depends on numpy and scipy.sparse and, with the exception of a bit of in-memory data movement and copies, should run at scipy speeds (though I haven't done any benchmarking).

@rabernat do you have an application that we could use to drive this?
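The COO layout used by a library like that can be sketched in a few lines of plain numpy (`MiniCOO` is a toy illustration of the idea, not the linked library's API):

```python
import numpy as np

class MiniCOO:
    # Toy n-dimensional COO array: store only nonzero coordinates and values.
    def __init__(self, coords, data, shape):
        self.coords = np.asarray(coords)  # shape (ndim, nnz)
        self.data = np.asarray(data)      # shape (nnz,)
        self.shape = shape

    def sum(self):
        # A full reduction only needs the stored nonzero values.
        return self.data.sum()

    def todense(self):
        out = np.zeros(self.shape, dtype=self.data.dtype)
        out[tuple(self.coords)] = self.data
        return out

s = MiniCOO(coords=[[0, 1], [2, 0]], data=[5.0, 7.0], shape=(2, 3))
print(s.sum())            # 12.0
print(s.todense()[0, 2])  # 5.0
```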

@rabernat
Contributor

rabernat commented Apr 17, 2017

@rabernat do you have an application that we could use to drive this?

Nothing comes to mind immediately. My data are unfortunately quite dense! 😜

@olgabot

olgabot commented Jun 26, 2017

In case you're still looking for an application, gene expression from single cells (see data/00_original/GSM162679$i_P14Retina_$j.digital_expression.txt.gz) is very sparse due to high gene dropout. The shape is (49300, 24760) and it's mostly zeros or nans. A plain csv from this data was 2.5 gigs, which gzipped to 300 megs.

Here is an example of using xarray to combine these files, but my kernel keeps dying when I do ds.to_netcdf() :(

Hope this is a good example for sparse arrays!
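For data like this, the memory savings of a sparse layout are easy to demonstrate on a small stand-in matrix (illustrative numbers only, using scipy.sparse CSR, not the actual dataset above):

```python
import numpy as np
import scipy.sparse

# Small mostly-zero stand-in for the 49300 x 24760 expression matrix.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=2000)
cols = rng.integers(0, 500, size=2000)
dense[rows, cols] = rng.random(2000)

csr = scipy.sparse.csr_matrix(dense)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"density: {csr.nnz / dense.size:.4f}")
print(f"dense: {dense.nbytes} bytes, CSR: {sparse_bytes} bytes")
```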

@rth
Contributor

rth commented Sep 3, 2017

do you have an application that we could use to drive this?

Other examples where labeled sparse arrays would be useful are,

  • one-hot encodings, which are widely used in machine learning.
  • tokenizing textual data produces large sparse matrices where the column labels correspond to the vocabulary, while row labels correspond to document ids. Here is a minimal example using scikit-learn,
    import os.path

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    
    ds = fetch_20newsgroups()
    vect = CountVectorizer()
    X = vect.fit_transform(ds.data)
    print(X)  # Extracted tokens
    # Returns:
    # <11314x130107 sparse matrix of type '<class 'numpy.int64'>'
    #	with 1787565 stored elements in Compressed Sparse Row format>
    
    column_labels = vect.get_feature_names()
    print(np.asarray(column_labels))
    # Returns:
    # array(['00', '000', '0000', ..., 'íålittin', 'ñaustin', 'ýé'],   dtype='<U180')
    
    row_labels = [int(os.path.split(el)[1]) for el in ds.filenames]
    print(np.asarray(row_labels))
    # Returns:
    # array([102994,  51861,  51879, ...,  60695,  38319, 104440])

@rabernat
Contributor

rabernat commented Sep 3, 2017

Sparse Xarray DataArrays would be useful for the linear regridding operations discussed in JiaweiZhuang/xESMF#3.

@lbybee

lbybee commented Jan 4, 2018

I'm interested to see if there have been any developments on this. I currently have an application where I'm working with multiple dask arrays, some of which are sparse (text data). It'd be worth my time to move my project to xarray, so I'd be interested in contributing something here if there is a need.

@Hoeze

Hoeze commented Jun 6, 2018

I know of a project that could make perfect use of xarray if it supported sparse tensors:
https://github.com/theislab/anndata

Currently I have to work with both xarray and anndata to store counts in sparse arrays separately from other dependent data, which is a little bit annoying :)

@shoyer
Member

shoyer commented Jun 6, 2018

See also: #1938

The major challenge now is the dispatching mechanism, which hopefully http://www.numpy.org/neps/nep-0018-array-function-protocol.html will solve.
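The protocol proposed in that NEP lets whole-array numpy functions (not just ufuncs) dispatch to duck arrays. A minimal sketch (`MiniDuck` is invented for illustration; this requires NumPy 1.17+, where NEP 18 is enabled by default):

```python
import numpy as np

class MiniDuck:
    # Minimal duck array implementing the NEP 18 __array_function__ hook.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Handle only np.concatenate in this sketch; defer everything else.
        if func is not np.concatenate:
            return NotImplemented
        unwrapped = [x.data if isinstance(x, MiniDuck) else x
                     for x in args[0]]
        return MiniDuck(np.concatenate(unwrapped, *args[1:], **kwargs))

out = np.concatenate([MiniDuck([1, 2]), MiniDuck([3])])
print(type(out).__name__, out.data)  # MiniDuck [1 2 3]
```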

@Hoeze

Hoeze commented Jul 5, 2018

Would it be an option to use dask's sparse support?
http://dask.pydata.org/en/latest/array-sparse.html
This way xarray could let dask do the main work.

Currently I load everything into a dask array by hand and pass this dask array to xarray.
This works pretty well.

@Hoeze

Hoeze commented Jul 5, 2018

How should these sparse arrays get stored in NetCDF4?
I know that NetCDF4 has some conventions for how to store sparse data, but do we have to implement our own conversion mechanisms for each sparse type?

@shoyer
Member

shoyer commented Jul 6, 2018

Would it be an option to use dask's sparse support?
http://dask.pydata.org/en/latest/array-sparse.html
This way xarray could let dask do the main work.

In principle this would work, though I would prefer to support it directly in xarray, too.

I know that NetCDF4 has some conventions for how to store sparse data, but do we have to implement our own conversion mechanisms for each sparse type?

Yes, we would need to implement a convention for handling sparse array data.
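One possible convention (a sketch only, not an agreed xarray or netCDF standard) is to store a COO array as three ordinary dense variables (coordinates, values, and shape), which any netCDF writer already handles, then rebuild the sparse array on read:

```python
import numpy as np

def coo_to_variables(coords, data, shape):
    # Flatten a COO-style sparse array into plain dense variables for storage.
    return {"sparse_coords": np.asarray(coords),  # (ndim, nnz)
            "sparse_data": np.asarray(data),      # (nnz,)
            "sparse_shape": np.asarray(shape)}

def variables_to_dense(variables):
    # Rebuild a dense array from the stored convention on read.
    out = np.zeros(tuple(variables["sparse_shape"]),
                   dtype=variables["sparse_data"].dtype)
    out[tuple(variables["sparse_coords"])] = variables["sparse_data"]
    return out

v = coo_to_variables(coords=[[0, 2], [1, 0]], data=[3.0, 4.0], shape=(3, 2))
print(variables_to_dense(v))
```

The variable names here are assumptions; a real convention would also need to record dimension names and a fill value.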

@rabernat
Contributor

Given the recent improvements in numpy duck array typing, how close are we to being able to just wrap a pydata/sparse array in an xarray Dataset?

@shoyer
Member

shoyer commented Jun 23, 2019

It will need some experimentation, but I think things should be pretty close after NumPy 1.17 is released. Potentially it could be as easy as adjusting the rules xarray uses for casting in xarray.core.variable.as_compatible_data.

@rabernat
Contributor

If someone who is good at numpy shows up at our sprint tomorrow, this could be a good issue to try out.

@mrocklin
Contributor Author

@rgommers might be able to recommend someone

@rabernat added the "topic-arrays related to flexible array support" and "design question" labels Jul 12, 2019
@rgommers

I haven't talked to anyone at SciPy'19 yet who was interested in sparse arrays, but I'll keep an eye out today.

And yes, this is a fun issue to work on and would be really nice to have!

@rabernat
Contributor

I personally use the new sparse project for my day-to-day research. I am motivated to work on this, but I probably won't have time today to dive deep.

Maybe CuPy would be more exciting.

@mrocklin
Contributor Author

@nvictus has been working on this at #3117

@fjanoos

fjanoos commented Jul 21, 2019

Wondering what the status on this is? Is there a branch with this functionality implemented? Would love to give it a spin!

@shoyer
Member

shoyer commented Aug 13, 2019

This is working now on the master branch!

Once we get a few more kinks worked out, it will be in the next release.

I've started another issue for discussing how xarray could integrate sparse arrays better into its API: #3213

@shoyer closed this as completed Aug 13, 2019
@fjanoos

fjanoos commented Aug 29, 2019

@shoyer
Is there documentation for using sparse arrays? Could you point me to some example code?

@dcherian
Contributor

@fjanoos there isn't any formal documentation yet but you can look at test_sparse.py for examples. That file will also tell you what works and doesn't work currently.
