SparseArray is an ExtensionArray #22325

TomAugspurger · 2018-08-13T20:02:46Z

Closes #21978
Closes #19506
Closes #22835

High-level summary: SparseArray is an ExtensionArray. It's no longer an ndarray subclass. The actual data model hasn't changed at all, it's still an array and a sparse_index. Only now the sparse values are self.sparse_values, rather than self.
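A minimal sketch of that data model, assuming a pandas release where the class is exposed as `pd.arrays.SparseArray`. In released pandas the stored values and index are named `sp_values` and `sp_index` (the summary above calls them sparse_values and sparse_index); the variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

# Four logical elements, two of which equal the fill value (0).
arr = pd.arrays.SparseArray([0, 0, 1, 2], fill_value=0)

# The array is an ExtensionArray, not an ndarray subclass.
is_extension_array = isinstance(arr, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(arr, np.ndarray)

# Only the non-fill values are stored, together with their positions.
stored_values = arr.sp_values
stored_positions = arr.sp_index.to_int_index().indices

# Converting to a dense ndarray is an explicit step.
dense = np.asarray(arr)
```

Because the values now live on an attribute rather than being `self`, any code that wants an ndarray has to ask for one explicitly.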

This isn't really close to being ready yet. I'm going to go through and self-review a bunch of things right now, will call out for others' opinions in specific places.

API discussions:

- Why is the default kind inconsistent between SparseSeries and SparseArray?
  - Possibly get rid of block in a future PR
- Should .astype(np_dtype) be sparse or dense?
  - sparse
- What should the inferred type of an empty SparseArray be? SparseArray([]).dtype? NumPy defaults to float (Sparse[float64]), pandas typically uses object (Sparse[object])
  - sparse
- Policy for warning when converting to dense (e.g. https://github.com/pandas-dev/pandas/pull/22325/files#diff-71caf9627e9687e837e4b1f86ecc6271R390). In the past, pandas would coerce to an ndarray all over the place. Now we at least have a chance of knowing when we're doing a bad thing, and warning about it.
  - warn when it's implicit

TomAugspurger · 2018-10-12T12:18:30Z

The root problem is ndarray.__getitem__(Sparse[bool]). For NumPy 1.9.3, numpy doesn't treat the Sparse[bool] as a boolean indexer.

jorisvandenbossche · 2018-10-12T12:35:08Z

Yes, I understand that, but we can still manually convert the sparse boolean to a numpy boolean in places where we use it for indexing a numpy array? (although it would not be needed for newer numpy)
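A rough sketch of the manual conversion being suggested here, assuming a pandas release that exposes `pd.arrays.SparseArray`. The array contents and variable names are made up for illustration and are not taken from the PR.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40])
sparse_mask = pd.arrays.SparseArray([True, False, False, True], fill_value=False)

# Densify the sparse boolean mask before handing it to ndarray.__getitem__,
# so that NumPy sees a plain boolean ndarray regardless of its version.
dense_mask = np.asarray(sparse_mask, dtype=bool)
selected = values[dense_mask]
```

On newer NumPy the explicit conversion should be redundant, but it is cheap and keeps the behavior the same on the older versions discussed here.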

jreback

some comments. you may want to address some here (as you have to rebase anyhow). or a followup ok.

jreback · 2018-10-12T12:26:18Z

doc/source/whatsnew/v0.24.0.txt

+  * Passing a scalar for ``indices`` is no longer allowed.
+- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
+- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer supports combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
+- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.


i agree the fill type should match the dtype but since missing value support is allowed here it is prob ok.

jreback · 2018-10-12T12:26:43Z

doc/source/whatsnew/v0.24.0.txt

@@ -566,6 +597,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
 - Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`)
 - Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`)
 - Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
+- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).


really? we allow this. I agree this would be ok, but is it reasonably tested?

TomAugspurger · 2018-10-12

Define reasonable :)

I'm reasonably sure there are places in pandas where we assume we have an ndarray, but may get an ExtensionArray instead.

The common case of .isna().any() is well tested though, I think.

jorisvandenbossche · 2018-10-12

In any case, if you create the masks in other ways than .isna(), e.g. with a comparison like df['sparse_column'] == 1, you get exactly the same issue. Which means we basically have to support indexing with boolean sparse masks anyway, I think.
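To illustrate the point, here is a small sketch (assuming a pandas release with `pd.arrays.SparseArray` and `pd.SparseDtype`; the column data is invented) showing that both `.isna()` and a comparison on a sparse column produce a mask that is itself sparse.

```python
import pandas as pd

df = pd.DataFrame(
    {"sparse_column": pd.arrays.SparseArray([0, 1, 0, 1], fill_value=0)}
)

# Both ways of building a mask yield a Series backed by a sparse boolean array.
na_mask = df["sparse_column"].isna()
eq_mask = df["sparse_column"] == 1

# The common reduction mentioned above.
has_missing = bool(na_mask.any())
```

Either mask can end up being used as an indexer, which is why handling sparse boolean masks cannot be limited to the `.isna()` path.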

jreback · 2018-10-12T12:27:45Z

doc/source/whatsnew/v0.24.0.txt

+Sparse
+^^^^^^
+
+- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`)


really, we have support for this? again i agree this is a nice feature, but we are decreasing support generally for sparse, so not anxious to advertise this

jreback · 2018-10-12T12:28:50Z

pandas/core/arrays/base.py


-        This should return a 1-D array the same length as 'self'.
+        * ``na_values._is_boolean`` should be True
+        * `na_values` should implement :func:`ExtensionArray._reduce`


we should probably have an Indexing EA mixin that implements these as NotImplemented (so one can subclass)

jorisvandenbossche · 2018-10-12

Such an indexing mixin might be useful, but how is this related to the line above?

pandas/core/internals/blocks.py

jreback · 2018-10-12T12:30:25Z

pandas/core/internals/concat.py

-            else:
-                return g, None
+        try:
+            g = np.find_common_type(upcast_classes, [])


ok, yeah would be nice to simplify this

pandas/core/reshape/reshape.py

jreback · 2018-10-12T12:35:21Z

pandas/tests/extension/arrow/bool.py

@@ -67,7 +67,11 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls.from_scalars(scalars)

    def __getitem__(self, item):


see my comment above, we should create an Indexing EA mixin and use the interface here

TomAugspurger · 2018-10-12T13:07:50Z

> Yes, I understand that, but we can still manually convert the sparse boolean to a numpy boolean in places where we use it for indexing a numpy array?

Oh, I see what you're suggesting. I only see two of these FutureWarnings from numpy...

Looking at quantile, right now it'd be fine to convert the ExtensionArray mask to an ndarray because we know that values is an ndarray. However, I could imagine a future where we (or numpy, if an array-like implements __array_function__) dispatch np.percentile. In that case, converting the mask to an ndarray would be counter-productive. But we're a long way from that future, so let's just do the conversion for now, with a note that it may have to change.

The other case is in SparseArray.take, where we create the ndarray, so that's safe.

jorisvandenbossche · 2018-10-12T15:37:46Z

> The other case is in SparseArray.take, where we create the ndarray, so that's safe.

Did you do a change for this? (I only see something in the diff for percentile)

TomAugspurger · 2018-10-12T15:42:48Z

> Did you do a change for this? (I only see something in the diff for percentile)

Sorry forgot to post what I found. The LOC is

return self.take(np.arange(len(key), dtype=np.int32)[key])

We can't hit that with a SparseArray (we go to dense earlier). I see now that we're hitting it with something like

In [4]: arr[[True, False, True]]
pandas/core/sparse/array.py:662: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  return self.take(np.arange(len(key), dtype=np.int32)[key])
Out[4]:
[2, 1, 2]
Fill: 0
IntIndex
Indices: array([0, 1, 2], dtype=int32)

again this is only on older numpys. So we can do an asarray on key there.
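A NumPy-only sketch of why the `asarray` matters. The second result does not run old NumPy; it emulates the legacy interpretation by casting the key to integers. The output above (`[2, 1, 2]`) is consistent with that interpretation if `arr` held `[1, 2, 3]`, which is an assumption, since the contents of `arr` are not shown.

```python
import numpy as np

key = [True, False, True]
positions = np.arange(len(key), dtype=np.int32)

# Coerced to a boolean ndarray, the key selects the positions where it is True.
as_mask = positions[np.asarray(key, dtype=bool)]

# Emulation of the legacy behavior: the booleans are read as the integer
# positions 1, 0, 1, so `take` would receive the wrong indices.
as_ints = positions[np.asarray(key, dtype=np.intp)]
```

With the key coerced up front, `take` receives positions 0 and 2 rather than 1, 0, 1.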

TomAugspurger · 2018-10-12T15:47:29Z

FYI, this boolean indexing behavior affects NumPy versions through 1.11.

TomAugspurger · 2018-10-12T21:21:04Z

All green, if anyone wants to take a look at the dtype raising commit before pushing the green button.

jorisvandenbossche

Last changes look good, just a minor doc comment

jorisvandenbossche · 2018-10-13T08:07:21Z

pandas/core/sparse/dtype.py

@@ -173,9 +173,10 @@ def construct_from_string(cls, string):
            'Sparse[int, 1]' SparseDtype[np.int64, 0]


I think this one now needs to be updated as the above would raise an error. But make it 'Sparse[int, 0]' to show that default fill value is OK?
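A short sketch of the behavior this docstring row is about, as I understand the merged code and assuming `pd.SparseDtype` is available at the top level: a string without a fill value gets the default fill for the subtype, while a non-default fill value in the string is rejected.

```python
import pandas as pd

# No fill value in the string: the default for integers (0) is used.
default_fill = pd.SparseDtype.construct_from_string("Sparse[int]")

# A non-default fill value cannot be expressed in the string form.
try:
    pd.SparseDtype.construct_from_string("Sparse[int, 1]")
    non_default_raised = False
except TypeError:
    non_default_raised = True
```

A non-default fill value is instead passed to the constructor directly, e.g. `pd.SparseDtype(int, 1)`.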

TomAugspurger · 2018-10-13T11:35:48Z

Good catch.

Here we go.

jorisvandenbossche · 2018-10-15T08:37:02Z

Whoohoo!

@TomAugspurger did you open an issue with your follow-up items list?

TomAugspurger · 2018-10-15T11:07:40Z

And there was already one for deprecating SparseDataFrame. I'll add SparseSeries to that discussion.

Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse. Closes pandas-dev#22367
* Fixed Series[sparse].to_sparse. Closes pandas-dev#22389

Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835

MezentsevIlya · 2019-12-18T07:22:17Z

Hi! Due to these changes there is an issue #30316

jbrockmendel · 2021-11-29T23:20:31Z

pandas/tests/series/test_combine_concat.py

        assert result.ftype == 'float64:sparse'

        result = pd.concat([Series(dtype='float64').to_sparse(), Series(
            dtype='object')])
-        assert result.dtype == np.object_
-        assert result.ftype == 'object:dense'
+        # TODO: release-note: concat sparse dtype


@TomAugspurger these TODOs are still present (though now in a different file). Is this actionable?

TomAugspurger · 2021-11-30

Probably not.

TomAugspurger added 30 commits July 12, 2018 15:16

wip

ee187eb

from scratch

32c1372

Updates

b265659

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

8dfc898

WIP

9c57725

wip

13952ab

wip take

7a6e7fa

wip take

1016af1

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

072abec

take

0ad61cc

take working

5b0b524

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

224744a

remove registry

620b5fb

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

164c401

missing

65f83d6

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

0b3c682

wip ops

69a5d13

More ops wip

f2b5862

segfault!

fa80fc5

wip

3f20890

start docs

484adb0

2 failing extension tests

1df1190

wip fillna

4246ac4

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

a849699

registry dtype, asarray

c4da319

astype interface

a2f158f

"passing" extension tests

26b671a

no sparse block

375e160

wip

0a37050

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

3c2cb0f

jreback approved these changes Oct 12, 2018

TomAugspurger added 3 commits October 12, 2018 08:25

COMPAT: NumPy 1.9 bool-like indexing

cc89ec7

misc. comments

3f713d4

Merge remote-tracking branch 'upstream/master' into ea-sparse-2

886fe03

TomAugspurger added 4 commits October 12, 2018 10:49

asarray on bool key for numpy compat

75099af

Raise for non-default values

731fc06

groupby / reduce compat

f91141d

lint

37a4b57

jorisvandenbossche approved these changes Oct 13, 2018

fix docs

4aad8e1

TomAugspurger merged commit 56d8e78 into pandas-dev:master Oct 13, 2018

TomAugspurger deleted the ea-sparse-2 branch October 13, 2018 11:37

This was referenced Oct 15, 2018

DEPR: SparseDataFrame and SparseSeries subclasses #19239

Closed

Fixes np.unique on SparseArray #19651

Closed

SparseSeries.__array__ only returns non-fills #14167

Closed

jorisvandenbossche mentioned this pull request Oct 23, 2018

Should SparseArray.astype be dense or Sparse #23125

Closed

thomasjpfan mentioned this pull request Jun 8, 2019

[MRG] Fix 'SparseSeries deprecated: scipy-dev failing on travis' #14002 scikit-learn/scikit-learn#14005

Closed

jbrockmendel reviewed Nov 29, 2021
