Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SparseArray an ExtensionArray #21978

Closed
TomAugspurger opened this issue Jul 19, 2018 · 16 comments · Fixed by #22325
Closed

Make SparseArray an ExtensionArray #21978

TomAugspurger opened this issue Jul 19, 2018 · 16 comments · Fixed by #22325
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Sparse Sparse Data Type
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 19, 2018

We should make SparseArray a proper ExtensionArray.

It seems like this will be somewhat difficult to do properly when SparseArray subclasses ndarray. Basic things like np.asarray(sparse_array) don't match the required ExtensionArray API (#14167). Fixing this, especially when we subclass ndarray, is going to be difficult. I can't override the behavior of np.asarray(sparse_array) in Python.

So, some questions

  1. Do people rely on SparseArray being an ndarray subclass?
  2. Do we want to make a clean break, or introduce deprecations for things that will need changing (but with no clear upgrade path)?

My current preference is to just break things, but I don't use sparse. SparseArray would compose an ndarray of dense values and a SparseIndex, but it would no longer subclass ndarray.

CCing some people who seem to use pandas' sparse: @hexgnu @kernc @Licht-T

@TomAugspurger TomAugspurger added API Design Sparse Sparse Data Type ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 19, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Jul 19, 2018
@hexgnu
Copy link
Contributor

hexgnu commented Jul 19, 2018

I don't see many things relying on SparseArray being an ndarray. Mostly I've seen SparseArray show up out of using a SparseSeries.

Also I'm a fan of clean breaks, although I could imagine someone would show up complaining about something regressing so we'd have to be prepared for that.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2018

even though i normally prefer backwards compat

sparse is holding us back and SparseArray as an EA will allow much internal cleanup

I think ok to just push this change thru - with a note in the whatsnew

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 19, 2018 via email

@jorisvandenbossche
Copy link
Member

I personally also don't think it is necessarily a problem to no longer subclass, but as Tom, I don't use sparse so hard to really say.

But, on the other hand, it would maybe be nice that it is actually possible to use ndarray subclasses in our ExtensionArray interface? In that sense this might be a good test case.

What is exactly not working with our interface?
You mention asarray, but that is something we have to fix anyway given the discussion in #14167?

@TomAugspurger
Copy link
Contributor Author

But, on the other hand, it would maybe be nice that it is actually possible to use ndarray subclasses in our ExtensionArray interface? In that sense this might be a good test case.

Agreed. They should be compatible, but I never got past np.asarray(sparse_array) only returning dense values. I can't figure out where things are going wrong, but I assume it's at the C level. I'd rather not have to write the EA in C / Cython.

@jorisvandenbossche
Copy link
Member

but I never got past np.asarray(sparse_array) only returning dense values.

But I think we agree that we want to change this (see #14167)

I can't figure out where things are going wrong, but I assume it's at the C level.

I suppose this is because the actual 'ndarray' is only the non-fills, the 'subclass' part then adds sp_index attribute and some methods to interpret the values of itself in light of sp_index.

For #14167, the question is if we only want to 'fix' SparseSeries.__array__ or also SparseArray.__array__. But I am not sure if subclassing ndarray still makes sense then (as subclassing should mean you follow the memory model, otherwise making a duck array is more logical I think?)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 20, 2018

I suppose this is because the actual 'ndarray' is only the non-fills, the 'subclass' part then adds sp_index attribute and some methods to interpret the values of itself in light of sp_index.

Ah, yes you're right. The line result = data.view(cls) is likely what's doing it. data is the observed values, and cls is SparseArray.

(as subclassing should mean you follow the memory model, otherwise making a duck array is more logical I think?)

Agreed. I'll put up a WIP PR that tries to clean things up by not subclassing ndarray (early next week probably).

@TomAugspurger
Copy link
Contributor Author

Some notes on difficulties with the current EA API for sparse:

  1. unique: ExtensionArray.unique expects the output type to the same. That doesn't really make sense for SparseArray, we'd like to return a regular ndarray.
  2. factorize: ExtensionArray.factorize returns (labels, uniques) of types (ndarray, extension_array). It'd be nice if the labels could be an extension array, since we'd like to return a SparseArray for labels (and maybe an ndarray for uniques).

will add more notes later.

@jorisvandenbossche
Copy link
Member

unique: ExtensionArray.unique expects the output type to the same.

That's indeed in the example implementation (and I assume in the tests as well), but are there for the rest cases that we require that? Because other dtypes in pandas also give numpy arrays (so I mean in general that does not necessarily need to be a problem)

factorize: It'd be nice if the labels could be an extension array, since we'd like to return a SparseArray for labels

Ah, yes, that's a good point. The question is then maybe how this would be used? Because typically it is used as an indexer, or as the codes for a categorical, and for that sparse is probably not that useful?
For a groupby it's maybe more useful. Do you have any idea what functionality of the label array is used there?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Aug 4, 2018

(note to self/others): List of behavior changes

  1. SparseArray is no longer a subclass of ndarray
  2. SparseArray.dtype is no longer a numpy dtype. Use SparseArray.dtype.subdtype
  3. np.array(SparseArray) contains all the values, not just the non-fill values
  4. Constructing a SpraseArray with data and sparse_index will correctly infer fill_value from data, rather than always using nan.
  5. SparseArray.fillna(method='ffill' / 'bfill') now issues a PerformanceWarning about converting to dense values.
  6. passing fill_value to SparseArray.take no longer implies allow_fill=True.
  7. SparseArray.astype(np.dtype) will create a dense NumPy array. To keep astype to a SparseArray with
    a different subdtype, use .astype(sparse_dtype) or a string like .astype('Sparse[float32]').

expanding on 4: Our handling of fill_value is strange when sparse_index is specified. With 0.23.3

In [4]: pd.SparseArray([1, 2])
Out[4]:
[1, 2]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

In [5]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
Out[5]:
[nan, 1.0, 2.0, nan]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)

I don't think specifying sparse_index should change the inferred fill value. I'd expect

In [2]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
   ...:
Out[2]:
[0, 1, 2, 0]
Fill: 0
IntIndex
Indices: array([1, 2], dtype=int32)

I'll update as I go.

FYI, right now I'm just listing these. Some of them (e.g. .astype(np.dtype) can be deprecated gracefully. Others, I'm not so sure about.

@TomAugspurger
Copy link
Contributor Author

WIP https://github.com/TomAugspurger/pandas/tree/ea-sparse-2

the sparse extension tests pass, but lots of stuff is currently broken elsewhere. Won't have as much time to work on this in the near future if anyone wants to pick it up. Otherwise, I'll return to it later.

@TomAugspurger
Copy link
Contributor Author

Hmm, so SparseArray and SparseSeries default to different sparse kinds. Any objections to making those match? Any preferences on integer vs. block?

@TomAugspurger
Copy link
Contributor Author

sparse_reindex seems buggy. Anyone have an idea what it's supposed to do? (this is in our test suite. probably xfailling for now).

In [71]: s = pd.SparseSeries(np.arange(6, dtype='f8'))

In [72]: s
Out[72]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

In [73]: s.sparse_reindex(pd.core.sparse.array.IntIndex(10, [2, 4, 5]))
Out[73]:
0    NaN
1    NaN
2    2.0
3    NaN
4    4.0
5    5.0
     NaN
     NaN
     NaN
     NaN
dtype: float64
IntIndex
Indices: array([2, 4, 5], dtype=int32)

@TomAugspurger
Copy link
Contributor Author

Ah, seems to be used in pandas.core.sparse.frame.homogenize, which I've never heard of before...

@jorisvandenbossche
Copy link
Member

That let me wonder: should this be public? Or more in general, are the sparse index objects considered public? (you can pass it in the SparseSeries constructor currently, but the objects are never used in the docs / not exposed top-level).

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Aug 13, 2018

Yeah, they're not in the official public API, but they were likely not considered in the pandas.core privatization shuffle.

So, I don't really know. I suppose it depends on if people find them useful (@hexgnu @kernc @Licht-T), otherwise I would default to making them private implementation details of SparseArray.

TomAugspurger added a commit that referenced this issue Oct 13, 2018
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse.

Closes #22367

* Fixed Series[sparse].to_sparse

Closes #22389

Closes #21978
Closes #19506
Closes #22835
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse.

Closes pandas-dev#22367

* Fixed Series[sparse].to_sparse

Closes pandas-dev#22389

Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants