Make SparseArray an ExtensionArray #21978
Comments
I don't see many things relying on SparseArray being an ndarray. Mostly I've seen SparseArray show up out of using a SparseSeries. Also I'm a fan of clean breaks, although I could imagine someone would show up complaining about something regressing so we'd have to be prepared for that. |
Even though I normally prefer backwards compatibility, sparse is holding us back, and SparseArray as an EA will allow much internal cleanup. I think it's OK to just push this change through, with a note in the whatsnew. |
Just to be clear, things *should* be backwards compatible at the pandas
level. I don't expect to break much, if anything, once the SparseArray is
wrapped in a Series / DataFrame (or SparseSeries / SparseDataFrame).
…On Thu, Jul 19, 2018 at 9:43 AM Jeff Reback ***@***.***> wrote:
even though i normally prefer backwards compat
sparse is holding us back and SparseArray as an EA will allow much
internal cleanup
I think ok to just push this change thru - with a note in the whatsnew
|
I personally also don't think it is necessarily a problem to no longer subclass, but, like Tom, I don't use sparse, so it's hard to really say. On the other hand, it would be nice if it were actually possible to use ndarray subclasses in our ExtensionArray interface. In that sense this might be a good test case. What exactly is not working with our interface? |
Agreed. They should be compatible, but I never got past |
But I think we agree that we want to change this (see #14167)
I suppose this is because the actual 'ndarray' is only the non-fills, the 'subclass' part then adds For #14167, the question is if we only want to 'fix' |
Ah, yes you're right. The line
Agreed. I'll put up a WIP PR that tries to clean things up by not subclassing ndarray (early next week probably). |
Some notes on difficulties with the current EA API for sparse:
will add more notes later. |
That's indeed in the example implementation (and I assume in the tests as well), but are there other cases where we require that? Other dtypes in pandas also give numpy arrays, so in general that does not necessarily need to be a problem.
Ah, yes, that's a good point. The question is then maybe how this would be used? Because typically it is used as an indexer, or as the codes for a categorical, and for that sparse is probably not that useful? |
(note to self/others): List of behavior changes
expanding on 4: Our handling of
In [4]: pd.SparseArray([1, 2])
Out[4]:
[1, 2]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
In [5]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
Out[5]:
[nan, 1.0, 2.0, nan]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)
I don't think specifying
In [2]: pd.SparseArray([1, 2], sparse_index=pd.core.sparse.array.IntIndex(4, [1, 2]))
...:
Out[2]:
[0, 1, 2, 0]
Fill: 0
IntIndex
Indices: array([1, 2], dtype=int32)
I'll update as I go. FYI, right now I'm just listing these. Some of them (e.g. |
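The two sessions above differ only in the fill value used when the stored values are expanded to full length. As a rough illustration (a pure-Python sketch, not pandas' implementation; `densify` and its parameters are hypothetical names introduced here), expanding stored values at given positions looks like this:

```python
import math


def densify(sp_values, indices, length, fill_value):
    """Expand stored values into a full-length list (hypothetical helper)."""
    dense = [fill_value] * length
    for position, value in zip(indices, sp_values):
        dense[position] = value
    return dense


# An integer fill value keeps the values as integers.
int_dense = densify([1, 2], [1, 2], 4, fill_value=0)
print(int_dense)  # [0, 1, 2, 0]

# A NaN fill value only makes sense for floats, so the stored values
# are written as floats here to mimic the upcast seen in Out[5].
nan_dense = densify([1.0, 2.0], [1, 2], 4, fill_value=math.nan)
print(nan_dense)  # [nan, 1.0, 2.0, nan]
```

The point of the sketch is only that the choice of fill value, not the sparse index itself, determines whether the result stays integer or is upcast to float.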
WIP: https://github.com/TomAugspurger/pandas/tree/ea-sparse-2. The sparse extension tests pass, but lots of stuff is currently broken elsewhere. I won't have as much time to work on this in the near future, so if anyone wants to pick it up, feel free. Otherwise, I'll return to it later. |
Hmm, so |
In [71]: s = pd.SparseSeries(np.arange(6, dtype='f8'))
In [72]: s
Out[72]:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)
In [73]: s.sparse_reindex(pd.core.sparse.array.IntIndex(10, [2, 4, 5]))
Out[73]:
0 NaN
1 NaN
2 2.0
3 NaN
4 4.0
5 5.0
NaN
NaN
NaN
NaN
dtype: float64
IntIndex
Indices: array([2, 4, 5], dtype=int32) |
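One reading of the session above is that `sparse_reindex` keeps only the values stored at the positions named by the new sparse index. The following is a minimal pure-Python sketch of that reading; the function name and signature are hypothetical and do not reproduce the pandas method:

```python
import math


def reindex_stored_values(sp_values, old_indices, new_indices, fill_value=math.nan):
    """Return the values stored at new_indices, using fill_value where none exists.

    Hypothetical helper for illustration only.
    """
    stored = dict(zip(old_indices, sp_values))
    return [stored.get(position, fill_value) for position in new_indices]


old_values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
old_indices = [0, 1, 2, 3, 4, 5]

# Keep only positions 2, 4 and 5, as in the IntIndex(10, [2, 4, 5]) call.
new_values = reindex_stored_values(old_values, old_indices, [2, 4, 5])
print(new_values)  # [2.0, 4.0, 5.0]
```

Note that the sketch says nothing about the total length carried by the new index (10 in the session), which is where the displayed Series ends up with values but no matching index labels.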
Ah, seems to be used in |
That makes me wonder: should this be public? Or, more generally, are the sparse index objects considered public? (you can pass it in the |
Yeah, they're not in the official public API, but they were likely not considered in the So, I don't really know. I suppose it depends on if people find them useful (@hexgnu @kernc @Licht-T), otherwise I would default to making them private implementation details of SparseArray. |
Makes SparseArray an ExtensionArray.
* Fixed DataFrame.__setitem__ for updating to sparse. Closes pandas-dev#22367
* Fixed Series[sparse].to_sparse. Closes pandas-dev#22389
Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835
We should make SparseArray a proper ExtensionArray.
It seems like this will be somewhat difficult to do properly when SparseArray subclasses ndarray. Basic things like np.asarray(sparse_array) don't match the required ExtensionArray API (#14167). Fixing this, especially when we subclass ndarray, is going to be difficult. I can't override the behavior of np.asarray(sparse_array) in Python.
So, some questions
My current preference is to just break things, but I don't use sparse. SparseArray would compose an ndarray of dense values and a SparseIndex, but it would no longer subclass ndarray.
CCing some people who seem to use pandas' sparse: @hexgnu @kernc @Licht-T
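To make the trade-off concrete, here is a small sketch contrasting the two designs. Both classes are hypothetical stand-ins written for this illustration (they are not pandas' SparseArray), and plain integer position arrays stand in for a SparseIndex. The subclass keeps only the non-fill values in its ndarray buffer, so `np.asarray` hands back just those; the composed class can define `__array__` and return the dense result.

```python
import numpy as np


class SubclassedSparse(np.ndarray):
    """Hypothetical ndarray subclass: the buffer holds only non-fill values."""

    def __new__(cls, sp_values, indices, length, fill_value=0):
        obj = np.asarray(sp_values).view(cls)
        obj.indices = np.asarray(indices)
        obj.length = length
        obj.fill_value = fill_value
        return obj


class ComposedSparse:
    """Hypothetical composed design: holds an ndarray instead of being one."""

    def __init__(self, sp_values, indices, length, fill_value=0):
        self.sp_values = np.asarray(sp_values)
        self.indices = np.asarray(indices)
        self.length = length
        self.fill_value = fill_value

    def __len__(self):
        return self.length

    def __array__(self, dtype=None, copy=None):
        # Build the dense array: fill everywhere, then place stored values.
        dense = np.full(self.length, self.fill_value, dtype=self.sp_values.dtype)
        dense[self.indices] = self.sp_values
        return dense if dtype is None else dense.astype(dtype)


subclassed = np.asarray(SubclassedSparse([1, 2], [1, 2], 4))
composed = np.asarray(ComposedSparse([1, 2], [1, 2], 4))

print(subclassed.tolist())  # [1, 2]
print(composed.tolist())    # [0, 1, 2, 0]
```

With composition, the object controls what `np.asarray` returns, which is the behavior the ExtensionArray interface expects.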