Make SparseArray an ExtensionArray #21978

TomAugspurger · 2018-07-19T11:52:45Z

We should make SparseArray a proper ExtensionArray.

It seems like this will be somewhat difficult to do properly when SparseArray subclasses ndarray. Basic things like np.asarray(sparse_array) don't match the required ExtensionArray API (#14167). Fixing this, especially when we subclass ndarray, is going to be difficult. I can't override the behavior of np.asarray(sparse_array) in Python.

So, some questions

Do people rely on SparseArray being an ndarray subclass?
Do we want to make a clean break, or introduce deprecations for things that will need changing (but with no clear upgrade path)?

My current preference is to just break things, but I don't use sparse. SparseArray would compose an ndarray of dense values and a SparseIndex, but it would no longer subclass ndarray.
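To make the proposed composition concrete, here is a minimal sketch, not the actual pandas implementation. The class name `ToySparseArray` and its attributes `sp_indices` and `length` are hypothetical; it only assumes NumPy's `__array__` protocol, which lets a non-ndarray object decide what `np.asarray` returns.

```python
import numpy as np


class ToySparseArray:
    """Hypothetical stand-in: holds stored values plus their positions."""

    def __init__(self, sp_values, sp_indices, length, fill_value=np.nan):
        self.sp_values = np.asarray(sp_values)
        self.sp_indices = np.asarray(sp_indices, dtype=np.intp)
        self.length = length
        self.fill_value = fill_value

    def __len__(self):
        return self.length

    def __array__(self, dtype=None, copy=None):
        # Densify: start from the fill value, then scatter the stored values.
        out_dtype = np.result_type(
            self.sp_values.dtype, np.asarray(self.fill_value).dtype
        )
        out = np.full(self.length, self.fill_value, dtype=out_dtype)
        out[self.sp_indices] = self.sp_values
        return out if dtype is None else out.astype(dtype)


arr = ToySparseArray([1.0, 2.0], [1, 2], length=4)
dense = np.asarray(arr)  # all four positions, fill values included
```

Because the object is not an ndarray, NumPy goes through `__array__`, so the dense result can include the fill values.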

CCing some people who seem to use pandas' sparse: @hexgnu @kernc @Licht-T


hexgnu · 2018-07-19T14:37:12Z

I don't see many things relying on SparseArray being an ndarray. Mostly I've seen SparseArray show up as a result of using a SparseSeries.

Also I'm a fan of clean breaks, although I could imagine someone would show up complaining about something regressing so we'd have to be prepared for that.

jreback · 2018-07-19T14:42:56Z

even though i normally prefer backwards compat

sparse is holding us back and SparseArray as an EA will allow much internal cleanup

I think ok to just push this change thru - with a note in the whatsnew

TomAugspurger · 2018-07-19T14:48:59Z

Just to be clear, things *should* be backwards compatible at the pandas level. I don't expect to break much, if anything, once the SparseArray is wrapped in a Series / DataFrame (or SparseSeries / SparseDataFrame).


jorisvandenbossche · 2018-07-20T09:45:04Z

I personally also don't think it is necessarily a problem to no longer subclass, but like Tom, I don't use sparse, so it's hard to really say.

But, on the other hand, it would maybe be nice that it is actually possible to use ndarray subclasses in our ExtensionArray interface? In that sense this might be a good test case.

What is exactly not working with our interface?
You mention asarray, but that is something we have to fix anyway given the discussion in #14167?

TomAugspurger · 2018-07-20T10:58:45Z

> But, on the other hand, it would maybe be nice that it is actually possible to use ndarray subclasses in our ExtensionArray interface? In that sense this might be a good test case.

Agreed. They should be compatible, but I never got past np.asarray(sparse_array) only returning dense values. I can't figure out where things are going wrong, but I assume it's at the C level. I'd rather not have to write the EA in C / Cython.

jorisvandenbossche · 2018-07-20T11:19:14Z

> but I never got past np.asarray(sparse_array) only returning dense values.

But I think we agree that we want to change this (see #14167)

> I can't figure out where things are going wrong, but I assume it's at the C level.

I suppose this is because the actual 'ndarray' is only the non-fills, the 'subclass' part then adds sp_index attribute and some methods to interpret the values of itself in light of sp_index.

For #14167, the question is if we only want to 'fix' SparseSeries.__array__ or also SparseArray.__array__. But I am not sure if subclassing ndarray still makes sense then (as subclassing should mean you follow the memory model, otherwise making a duck array is more logical I think?)

TomAugspurger · 2018-07-20T12:05:00Z

> I suppose this is because the actual 'ndarray' is only the non-fills, the 'subclass' part then adds sp_index attribute and some methods to interpret the values of itself in light of sp_index.

Ah, yes you're right. The line result = data.view(cls) is likely what's doing it. data is the observed values, and cls is SparseArray.

> (as subclassing should mean you follow the memory model, otherwise making a duck array is more logical I think?)

Agreed. I'll put up a WIP PR that tries to clean things up by not subclassing ndarray (early next week probably).
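The effect of `data.view(cls)` described above can be reproduced with a small sketch. `ViewSparse`, `sp_indices`, and `full_length` are hypothetical names, not the pandas code: the point is only that the ndarray buffer holds just the observed values, so converting back to a plain ndarray cannot recover the fill positions.

```python
import numpy as np


class ViewSparse(np.ndarray):
    """Hypothetical ndarray subclass mimicking the old layout."""

    def __new__(cls, observed, sp_indices, length):
        data = np.asarray(observed, dtype="f8")
        result = data.view(cls)  # the buffer holds only the observed values
        result.sp_indices = np.asarray(sp_indices)
        result.full_length = length  # logical length, unknown to NumPy
        return result


x = ViewSparse([1.0, 2.0], sp_indices=[1, 2], length=4)
as_ndarray = np.asarray(x)  # a base-class view of the two stored values
print(x.full_length, as_ndarray)
```

The logical length is 4, but NumPy only ever sees the two stored values, which is why subclassing fits poorly when the memory layout differs from the logical array.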

TomAugspurger · 2018-07-27T19:14:46Z

Some notes on difficulties with the current EA API for sparse:

unique: ExtensionArray.unique expects the output type to be the same. That doesn't really make sense for SparseArray; we'd like to return a regular ndarray.
factorize: ExtensionArray.factorize returns (labels, uniques) of types (ndarray, extension_array). It'd be nice if the labels could be an extension array, since we'd like to return a SparseArray for labels (and maybe an ndarray for uniques).
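A rough illustration of why sparse labels are attractive, as a sketch only: `factorize_in_order` is a hypothetical pure-Python helper that mimics the (labels, uniques) shape of a factorize result with plain lists rather than ndarrays or extension arrays.

```python
def factorize_in_order(values):
    """Return (labels, uniques), numbering uniques by first appearance."""
    positions = {}
    labels = []
    for value in values:
        if value not in positions:
            positions[value] = len(positions)
        labels.append(positions[value])
    return labels, list(positions)


# Mostly fill values (0) with a few stored values.
labels, uniques = factorize_in_order([0, 0, 0, 7, 0, 0, 7, 3])
```

The labels are as long as the input and are dominated by the code assigned to the fill value, so a dense labels array gives up the sparsity that the input had.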

will add more notes later.

jorisvandenbossche · 2018-07-27T21:11:25Z

> unique: ExtensionArray.unique expects the output type to be the same.

That's indeed in the example implementation (and I assume in the tests as well), but are there other cases where we actually require that? Other dtypes in pandas also give numpy arrays, so in general that does not necessarily need to be a problem.

> factorize: It'd be nice if the labels could be an extension array, since we'd like to return a SparseArray for labels

Ah, yes, that's a good point. The question is then maybe how this would be used? Because typically it is used as an indexer, or as the codes for a categorical, and for that sparse is probably not that useful?
For a groupby it's maybe more useful. Do you have any idea what functionality of the label array is used there?

TomAugspurger · 2018-08-04T19:15:47Z

(note to self/others): List of behavior changes

1. SparseArray is no longer a subclass of ndarray.
2. SparseArray.dtype is no longer a numpy dtype. Use SparseArray.dtype.subdtype.
3. np.array(SparseArray) contains all the values, not just the non-fill values.
4. Constructing a SparseArray with data and sparse_index will correctly infer fill_value from data, rather than always using nan.
5. SparseArray.fillna(method='ffill' / 'bfill') now issues a PerformanceWarning about converting to dense values.
6. Passing fill_value to SparseArray.take no longer implies allow_fill=True.
7. SparseArray.astype(np.dtype) will create a dense NumPy array. To have astype return a SparseArray with a different subdtype, use .astype(sparse_dtype) or a string like .astype('Sparse[float32]').
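A few of these changes can be checked with a short sketch, assuming a pandas release that ships the ExtensionArray-based `pd.arrays.SparseArray` and `pd.SparseDtype`. Note that released pandas spells the dtype attribute `subtype`, not `subdtype` as in the notes above. The `astype` behavior is left out here because it has differed between releases.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0, 0, 1, 2], fill_value=0)

not_ndarray = not isinstance(arr, np.ndarray)  # no longer an ndarray subclass
dtype = arr.dtype                              # a SparseDtype, not a numpy dtype
dense = np.asarray(arr)                        # all values, fill values included
stored = arr.sp_values                         # only the non-fill values
```

The underlying numpy dtype is still reachable through `dtype.subtype`, which matches the dtype of the stored values.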

expanding on 4: Our handling of fill_value is strange when sparse_index is specified. With 0.23.3

In [4]: pd.SparseArray([1, 2])
Out[4]:
[1, 2]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

In [5]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
Out[5]:
[nan, 1.0, 2.0, nan]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)

I don't think specifying sparse_index should change the inferred fill value. I'd expect

In [2]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
   ...:
Out[2]:
[0, 1, 2, 0]
Fill: 0
IntIndex
Indices: array([1, 2], dtype=int32)
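The expected inference can be sketched with plain NumPy. Both `default_fill_value` and `densify` are hypothetical helpers, and the rule (NaN for floats, zero otherwise) is an assumption chosen to match the outputs shown above, not a copy of the pandas logic.

```python
import numpy as np


def default_fill_value(dtype):
    """Assumed rule: NaN for floating dtypes, zero for other numeric dtypes."""
    dtype = np.dtype(dtype)
    return np.nan if dtype.kind == "f" else dtype.type(0)


def densify(sp_values, sp_indices, length):
    """Expand stored values to full length, inferring the fill from the data."""
    sp_values = np.asarray(sp_values)
    fill = default_fill_value(sp_values.dtype)
    out = np.full(length, fill, dtype=sp_values.dtype)
    out[np.asarray(sp_indices, dtype=np.intp)] = sp_values
    return out


ints = densify([1, 2], [1, 2], 4)        # integer data keeps an integer fill
floats = densify([1.0, 2.0], [1, 2], 4)  # float data falls back to NaN
```

With this rule the fill value depends only on the data, so passing an index does not silently turn integer data into floats.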

I'll update as I go.

FYI, right now I'm just listing these. Some of them (e.g. .astype(np.dtype)) can be deprecated gracefully. Others, I'm not so sure about.

TomAugspurger · 2018-08-07T16:09:18Z

WIP https://github.com/TomAugspurger/pandas/tree/ea-sparse-2

the sparse extension tests pass, but lots of stuff is currently broken elsewhere. Won't have as much time to work on this in the near future if anyone wants to pick it up. Otherwise, I'll return to it later.

TomAugspurger · 2018-08-09T19:12:25Z

Hmm, so SparseArray and SparseSeries default to different sparse kinds. Any objections to making those match? Any preferences on integer vs. block?
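For readers unfamiliar with the two kinds, the difference is only in how the stored positions are encoded: an integer index lists every position, while a block index stores the start and length of each contiguous run. A minimal sketch of the conversion, using a hypothetical helper `int_index_to_blocks` on plain lists:

```python
def int_index_to_blocks(indices):
    """Group sorted, unique positions into (locations, lengths) of runs."""
    locations, lengths = [], []
    for position in indices:
        if locations and position == locations[-1] + lengths[-1]:
            lengths[-1] += 1  # extends the current run
        else:
            locations.append(position)  # starts a new run
            lengths.append(1)
    return locations, lengths


scattered = int_index_to_blocks([2, 4, 5])
contiguous = int_index_to_blocks(list(range(6)))
```

Block encoding is compact when the stored values come in long runs, while the integer encoding is simpler when they are scattered.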

TomAugspurger · 2018-08-13T14:33:45Z

sparse_reindex seems buggy. Anyone have an idea what it's supposed to do? (This is in our test suite; I'm probably xfailing it for now.)

In [71]: s = pd.SparseSeries(np.arange(6, dtype='f8'))

In [72]: s
Out[72]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

In [73]: s.sparse_reindex(pd.core.sparse.array.IntIndex(10, [2, 4, 5]))
Out[73]:
0    NaN
1    NaN
2    2.0
3    NaN
4    4.0
5    5.0
     NaN
     NaN
     NaN
     NaN
dtype: float64
IntIndex
Indices: array([2, 4, 5], dtype=int32)

TomAugspurger · 2018-08-13T14:40:45Z

Ah, seems to be used in pandas.core.sparse.frame.homogenize, which I've never heard of before...

jorisvandenbossche · 2018-08-13T15:52:21Z

That makes me wonder: should this be public? Or more in general, are the sparse index objects considered public? (You can pass them to the SparseSeries constructor currently, but the objects are never used in the docs and are not exposed at the top level.)

TomAugspurger · 2018-08-13T15:56:12Z

Yeah, they're not in the official public API, but they were likely not considered in the pandas.core privatization shuffle.

So, I don't really know. I suppose it depends on if people find them useful (@hexgnu @kernc @Licht-T), otherwise I would default to making them private implementation details of SparseArray.

Makes SparseArray an ExtensionArray.
* Fixed DataFrame.__setitem__ for updating to sparse. Closes #22367
* Fixed Series[sparse].to_sparse. Closes #22389
Closes #21978
Closes #19506
Closes #22835


TomAugspurger added the API Design, Sparse, and ExtensionArray labels Jul 19, 2018

TomAugspurger added this to the 0.24.0 milestone Jul 19, 2018

TomAugspurger mentioned this issue Aug 13, 2018

SparseArray is an ExtensionArray #22325

Merged

4 tasks

kernc mentioned this issue Aug 14, 2018

ENH: SparseDataFrame/SparseSeries value assignment #17785

Closed

4 tasks

TomAugspurger closed this as completed in #22325 Oct 13, 2018

drkarthi mentioned this issue May 9, 2019

Python API produces a datatype error for pandas sparse data structures microsoft/LightGBM#2143

Closed

StrikerRUS mentioned this issue Aug 8, 2019

[python] add sparsity support for new version of pandas and check Series for bad dtypes microsoft/LightGBM#2318

Merged
