SparseArray is an ExtensionArray #22325

Merged
merged 236 commits into from
Oct 13, 2018
Changes from 228 commits
ee187eb
wip
TomAugspurger Jul 12, 2018
32c1372
from scratch
TomAugspurger Jul 13, 2018
b265659
Updates
TomAugspurger Jul 13, 2018
8dfc898
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Jul 13, 2018
9c57725
WIP
TomAugspurger Jul 13, 2018
13952ab
wip
TomAugspurger Jul 13, 2018
7a6e7fa
wip take
TomAugspurger Jul 13, 2018
1016af1
wip take
TomAugspurger Jul 16, 2018
072abec
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Jul 22, 2018
0ad61cc
take
TomAugspurger Jul 22, 2018
5b0b524
take working
TomAugspurger Jul 22, 2018
224744a
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Jul 23, 2018
620b5fb
remove registry
TomAugspurger Jul 23, 2018
164c401
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Jul 24, 2018
65f83d6
missing
TomAugspurger Jul 24, 2018
0b3c682
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Jul 27, 2018
69a5d13
wip ops
TomAugspurger Jul 27, 2018
f2b5862
More ops wip
TomAugspurger Jul 27, 2018
fa80fc5
segfault!
TomAugspurger Jul 28, 2018
3f20890
wip
TomAugspurger Jul 28, 2018
484adb0
start docs
TomAugspurger Jul 28, 2018
1df1190
2 failing extension tests
TomAugspurger Jul 30, 2018
4246ac4
wip fillna
TomAugspurger Jul 30, 2018
a849699
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 1, 2018
c4da319
registry dtype, asarray
TomAugspurger Aug 1, 2018
a2f158f
astype interface
TomAugspurger Aug 1, 2018
26b671a
"passing" extension tests
TomAugspurger Aug 1, 2018
375e160
no sparse block
TomAugspurger Aug 1, 2018
0a37050
wip
TomAugspurger Aug 2, 2018
3c2cb0f
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 2, 2018
27c6378
wip
TomAugspurger Aug 3, 2018
e52dae9
a bit on concat
TomAugspurger Aug 3, 2018
b6d8430
revert concat changes
TomAugspurger Aug 3, 2018
640c4a5
passing again
TomAugspurger Aug 3, 2018
6b61597
More concat
TomAugspurger Aug 3, 2018
427234f
fillna...
TomAugspurger Aug 3, 2018
e055629
wip
TomAugspurger Aug 6, 2018
a79359c
wip
TomAugspurger Aug 6, 2018
de3aa71
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 6, 2018
21f4ee3
reductions, ufuncs
TomAugspurger Aug 6, 2018
c1e594a
failing on ufuncs
TomAugspurger Aug 6, 2018
dc7f93f
wipo
TomAugspurger Aug 6, 2018
eb09d21
concat is broken
TomAugspurger Aug 7, 2018
7dcf4b2
formatting failing
TomAugspurger Aug 7, 2018
b39658a
more wip
TomAugspurger Aug 7, 2018
a8b76bd
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 8, 2018
e041313
Extension test fixups
TomAugspurger Aug 8, 2018
595535e
some indexing, sparse string
TomAugspurger Aug 9, 2018
7700299
passing indexing
TomAugspurger Aug 9, 2018
f1ff7da
passing pivot
TomAugspurger Aug 9, 2018
33fa6f7
broken broken broken
TomAugspurger Aug 10, 2018
40c035e
sanitize
TomAugspurger Aug 10, 2018
1d49cc7
broken broken broken
TomAugspurger Aug 10, 2018
6f4b6b6
wip
TomAugspurger Aug 13, 2018
6f037b5
working through series
TomAugspurger Aug 13, 2018
7da220e
working through series
TomAugspurger Aug 13, 2018
bfbe4ab
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 13, 2018
c5666b6
series passing
TomAugspurger Aug 13, 2018
ff6037c
more tests
TomAugspurger Aug 13, 2018
5c362ef
wip
TomAugspurger Aug 13, 2018
55cac36
wip
TomAugspurger Aug 13, 2018
c4e8784
More test
TomAugspurger Aug 13, 2018
a00f987
skip internals tests
TomAugspurger Aug 13, 2018
a6d7eac
linting
TomAugspurger Aug 13, 2018
4b4f9bd
cleanup
TomAugspurger Aug 13, 2018
82801be
cleanup
TomAugspurger Aug 13, 2018
1a149dc
cleanup
TomAugspurger Aug 13, 2018
fde19d7
remove debug code
TomAugspurger Aug 13, 2018
a7ba8f6
API: dispatch to EA.astype
TomAugspurger Aug 13, 2018
5064217
API: ExtensionDtype._is_numeric
TomAugspurger Aug 14, 2018
e31e8aa
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 14, 2018
79c8e9c
update type
TomAugspurger Aug 14, 2018
26993fe
Merge remote-tracking branch 'upstream/master' into ea-astype-dispatch
TomAugspurger Aug 14, 2018
6eeec11
py2 compat
TomAugspurger Aug 14, 2018
50de326
fixed test
TomAugspurger Aug 14, 2018
5ef1747
test fill value
TomAugspurger Aug 14, 2018
f31970c
Test nbytes
TomAugspurger Aug 14, 2018
f1b860f
explainers
TomAugspurger Aug 14, 2018
5c44275
linting
TomAugspurger Aug 14, 2018
33bc8f8
Allow concatenating with different sparse dtypes
TomAugspurger Aug 14, 2018
9bf13ad
Linting
TomAugspurger Aug 14, 2018
de1fb5b
lint
TomAugspurger Aug 14, 2018
da580cd
Wip
TomAugspurger Aug 14, 2018
88b73c3
Merge branch 'ea-astype-dispatch' into ea-sparse-2
TomAugspurger Aug 14, 2018
afde64d
Merge branch 'ea-is-numeric' into ea-sparse-2
TomAugspurger Aug 14, 2018
e603d3d
fixup 33bc8f836
TomAugspurger Aug 15, 2018
ec5eb9a
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 15, 2018
a72ee1a
Fixed DataFrame.__setitem__ for updating to sparse.
TomAugspurger Aug 15, 2018
f147635
try removing
TomAugspurger Aug 15, 2018
c35c7c2
Merge branch 'ea-astype-dispatch' into ea-sparse-2
TomAugspurger Aug 15, 2018
e159ef2
wip
TomAugspurger Aug 16, 2018
d48a8fa
Fixup
TomAugspurger Aug 16, 2018
3bcf57e
astype works
TomAugspurger Aug 16, 2018
31d401f
Squashed commit of the following:
TomAugspurger Aug 16, 2018
a4369c2
Squashed commit of the following:
TomAugspurger Aug 16, 2018
608b499
Fixed Series[sparse].to_sparse
TomAugspurger Aug 16, 2018
14e60c9
Shift works
TomAugspurger Aug 16, 2018
550f163
parametrize shift test
TomAugspurger Aug 16, 2018
821cc91
Removed bogus test
TomAugspurger Aug 16, 2018
e21ed21
Un-xfail more
TomAugspurger Aug 16, 2018
aeb8c8c
scalar take raises
TomAugspurger Aug 16, 2018
34c90ed
Move fill_value to dtyep
TomAugspurger Aug 17, 2018
2103959
Move fill_value to dtyep
TomAugspurger Aug 17, 2018
26af959
Merge branch 'ea-sparse-dtype-fill-value' into ea-sparse-2
TomAugspurger Aug 18, 2018
e5920c2
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 18, 2018
084a967
cleanup
TomAugspurger Aug 18, 2018
bb17760
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 20, 2018
dde7852
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 20, 2018
f1b4e6b
Setting fill value (but that's bad)
TomAugspurger Aug 20, 2018
6a31077
Explicit fill value
TomAugspurger Aug 20, 2018
02aa7f7
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 20, 2018
3a7ee2d
Fixed merge conflicts
TomAugspurger Aug 20, 2018
d6fe191
subdtype -> subtype
TomAugspurger Aug 20, 2018
b1ea874
subdtype -> subtype
TomAugspurger Aug 20, 2018
2213b83
Fixed pickle
TomAugspurger Aug 21, 2018
94664c4
test dtype
TomAugspurger Aug 21, 2018
e54160c
astype update
TomAugspurger Aug 21, 2018
04a2dbb
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 21, 2018
fb01d1a
more
TomAugspurger Aug 21, 2018
f78ae81
lint
TomAugspurger Aug 21, 2018
11d5b40
py2 compat
TomAugspurger Aug 21, 2018
ba70753
dtype tests
TomAugspurger Aug 21, 2018
82bab3c
explainer
TomAugspurger Aug 21, 2018
2990124
Delete things
TomAugspurger Aug 21, 2018
a9d0f17
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 22, 2018
0c52c37
NumPy 1.9 compat
TomAugspurger Aug 22, 2018
998f113
implement divmod
TomAugspurger Aug 22, 2018
38b0356
Fix broken fill value setting
TomAugspurger Aug 22, 2018
7206d94
compare with lists
TomAugspurger Aug 22, 2018
fe771b5
clean
TomAugspurger Aug 22, 2018
12e424c
fixed index ctor fail
TomAugspurger Aug 22, 2018
3bd567f
New xfail
TomAugspurger Aug 22, 2018
f816346
Handle sparse reindex
TomAugspurger Aug 22, 2018
1a1dcf4
concat mixed
TomAugspurger Aug 22, 2018
e3d9173
take note
TomAugspurger Aug 22, 2018
2715cdb
Remove test.
TomAugspurger Aug 22, 2018
4e40599
concat NA and empty
TomAugspurger Aug 22, 2018
0aa3934
dum
TomAugspurger Aug 22, 2018
a3becb6
Fix lost fill value
TomAugspurger Aug 22, 2018
5660b9a
override
TomAugspurger Aug 22, 2018
dd3cba5
Handle fill in unique
TomAugspurger Aug 23, 2018
cc65b8a
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 23, 2018
06dce5f
Faster isna
TomAugspurger Aug 23, 2018
f7351d3
Support old numpy
TomAugspurger Aug 23, 2018
2055494
clean
TomAugspurger Aug 23, 2018
f310322
Simplified setter
TomAugspurger Aug 23, 2018
0008164
Inplace not supported.
TomAugspurger Aug 23, 2018
027f6d8
compat
TomAugspurger Aug 24, 2018
c0d9875
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 24, 2018
44b218c
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 28, 2018
47fa73a
32-bit compat
TomAugspurger Aug 28, 2018
c2c489f
Lint
TomAugspurger Aug 28, 2018
3729927
Test fixups
TomAugspurger Aug 28, 2018
9ba49e1
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 29, 2018
543ac7c
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Aug 30, 2018
f66ef6f
CI passing
TomAugspurger Aug 30, 2018
ba8fc9d
Right numpy version
TomAugspurger Aug 30, 2018
9185e33
linting
TomAugspurger Aug 30, 2018
11799ab
Try intp
TomAugspurger Aug 31, 2018
73e7626
32-bit compat
TomAugspurger Aug 31, 2018
ebece16
Doc cleanup
TomAugspurger Aug 31, 2018
7db6990
Simplify is_sparse
TomAugspurger Aug 31, 2018
be21f42
Updated factorize
TomAugspurger Sep 4, 2018
e857363
Use ABC
TomAugspurger Sep 4, 2018
d0ee038
simplify interleave_dtype
TomAugspurger Sep 4, 2018
54f4417
docstring, simplify
TomAugspurger Sep 4, 2018
2082d86
fixup supers
TomAugspurger Sep 4, 2018
f846606
Linting
TomAugspurger Sep 4, 2018
ce8e0ac
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 4, 2018
1f6590e
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 5, 2018
b758469
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 6, 2018
f6b0924
move and fix conflict
TomAugspurger Sep 6, 2018
232518c
doc note
TomAugspurger Sep 6, 2018
e8b37da
ENH: is_homogenous
TomAugspurger Sep 20, 2018
0197e0c
BUG: Preserve dtype on homogeneous EA xs
TomAugspurger Sep 20, 2018
62326ae
asarray test
TomAugspurger Sep 20, 2018
f008c38
Fixed asarray
TomAugspurger Sep 20, 2018
88c6126
Merge remote-tracking branch 'upstream/master' into ea-xs
TomAugspurger Sep 20, 2018
5c8662e
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 20, 2018
78798cf
is_homogeneous -> is_homogeneous_type
TomAugspurger Sep 20, 2018
b051424
lint
TomAugspurger Sep 20, 2018
78979b6
Squashed commit of the following:
TomAugspurger Sep 20, 2018
2333db1
Merge followup
TomAugspurger Sep 20, 2018
b41d473
Followup from merge
TomAugspurger Sep 20, 2018
d6a2479
lint
TomAugspurger Sep 20, 2018
a23c27c
Merge remote-tracking branch 'origin/ea-xs' into ea-sparse-2
TomAugspurger Sep 20, 2018
7372eb3
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 26, 2018
cab8c54
handle unary ops
TomAugspurger Sep 26, 2018
52ae275
linting
TomAugspurger Sep 26, 2018
9c9b49e
compat, lint
TomAugspurger Sep 26, 2018
f5d7492
SparseSeries unary ops
TomAugspurger Sep 26, 2018
b4b4cbc
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 26, 2018
bf98b9d
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 26, 2018
f3d2681
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Sep 29, 2018
7d4d3ba
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Oct 4, 2018
57c03c2
splib
TomAugspurger Oct 4, 2018
0dbc33e
collections -> compat
TomAugspurger Oct 4, 2018
c217cf5
updates
TomAugspurger Oct 8, 2018
2ea7a91
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Oct 8, 2018
8f2f228
Set dtype
TomAugspurger Oct 8, 2018
c83bed7
reveret
TomAugspurger Oct 8, 2018
53e494e
clarify fillna
TomAugspurger Oct 8, 2018
627b9ce
Remove old invert
TomAugspurger Oct 8, 2018
df0293a
some cleanup
TomAugspurger Oct 8, 2018
a590418
remove redundant whatsnew
TomAugspurger Oct 9, 2018
7821f19
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Oct 9, 2018
ee26c52
Update hashing, eq
TomAugspurger Oct 9, 2018
40390f1
wip-comments
TomAugspurger Oct 11, 2018
15a164d
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Oct 11, 2018
88432c8
hashing
TomAugspurger Oct 11, 2018
3e7ec90
dtype and datetime64
TomAugspurger Oct 11, 2018
7b0a179
Updates
TomAugspurger Oct 11, 2018
20d8815
index
TomAugspurger Oct 11, 2018
3e81c69
wip
TomAugspurger Oct 11, 2018
1098a7a
quantile test
TomAugspurger Oct 11, 2018
10d204a
merge conflict
TomAugspurger Oct 11, 2018
69075d8
use is_homogenous_type
TomAugspurger Oct 11, 2018
0764baa
use assert_frame_equal
TomAugspurger Oct 11, 2018
a4a47c5
merge exp construction
TomAugspurger Oct 11, 2018
a5b6c39
API: Allow ExtensionArray.isna to be an EA
TomAugspurger Oct 11, 2018
70d8268
document and test map
TomAugspurger Oct 11, 2018
7aed79f
table formatting
TomAugspurger Oct 11, 2018
11e55aa
fixup! API: Allow ExtensionArray.isna to be an EA
TomAugspurger Oct 11, 2018
11606af
Restore subclass test
TomAugspurger Oct 11, 2018
2f73179
Revert changes to test
TomAugspurger Oct 11, 2018
1b3058a
quote
TomAugspurger Oct 11, 2018
f4ec928
fixup! API: Allow ExtensionArray.isna to be an EA
TomAugspurger Oct 11, 2018
8c67ca2
lint
TomAugspurger Oct 11, 2018
cc89ec7
COMPAT: NumPy 1.9 bool-like indexing
TomAugspurger Oct 12, 2018
3f713d4
misc. comments
TomAugspurger Oct 12, 2018
886fe03
Merge remote-tracking branch 'upstream/master' into ea-sparse-2
TomAugspurger Oct 12, 2018
75099af
asarray on bool key for numpy compat
TomAugspurger Oct 12, 2018
731fc06
Raise for non-default values
TomAugspurger Oct 12, 2018
f91141d
groupby / reduce compat
TomAugspurger Oct 12, 2018
37a4b57
lint
TomAugspurger Oct 12, 2018
4aad8e1
fix docs
jorisvandenbossche Oct 13, 2018
53 changes: 46 additions & 7 deletions doc/source/whatsnew/v0.24.0.txt
@@ -373,6 +373,37 @@ is the case with :attr:`Period.end_time`, for example

p.end_time

.. _whatsnew_0240.api_breaking.sparse_values:

Sparse Data Structure Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``SparseArray``, the array backing ``SparseSeries`` and the columns in a ``SparseDataFrame``,
is now an extension array (:issue:`21978`, :issue:`19056`, :issue:`22835`).
To conform to this interface and for consistency with the rest of pandas, some API breaking
changes were made:

- ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray`. To convert a SparseArray to a NumPy array, use :meth:`numpy.asarray`.
- ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of :class:`SparseDtype`, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subtype``.
- :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`14167`)
- ``SparseArray.take`` now matches the API of :meth:`pandas.api.extensions.ExtensionArray.take` (:issue:`19506`):

* The default value of ``allow_fill`` has changed from ``False`` to ``True``.
* The ``out`` and ``mode`` parameters are no longer accepted (previously, this raised if they were specified).
* Passing a scalar for ``indices`` is no longer allowed.

- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
Member

I can't remember if I asked before, but do we actually want this?

In [31]: s = pd.Series([1, 0, 0])

In [32]: s = s.to_sparse()

In [33]: s
Out[33]: 
0    1
1    0
2    0
dtype: Sparse[int64]
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [34]: s.fill_value = np.nan

In [35]: s.to_dense()
Out[35]: 
0                      1
1   -9223372036854775808
2   -9223372036854775808
dtype: int64

I don't think the above makes much sense, so not sure this is good to allow.

For me it seems logical to restrict the fill_value to the same dtype as the data.

Contributor Author

The somewhat strange thing is that on master we do allow that in the SparseArray constructor

In [13]: s = pd.SparseArray([1, 2, 0], fill_value=np.nan)

In [14]: s
Out[14]:
[1, 2, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2], dtype=int32)

I don't have strong opinions here, other than that people shouldn't be setting .fill_value in the first place. The new way to do it is .astype(SparseDtype(self.dtype.subtype, fill_value)). I'm happy to deprecate this

Contributor

I agree the fill type should match the dtype, but since missing-value support is allowed here, it is probably OK.
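
The concern raised in this thread can be illustrated with a toy model. The sketch below is not pandas code: `to_dense` and the list-based storage are hypothetical stand-ins, assuming only that a sparse array stores the non-fill values, their positions, and one fill value for every gap. It shows why swapping the fill value on existing storage silently reinterprets the gaps.

```python
def to_dense(sp_values, sp_indices, length, fill_value):
    """Expand a toy sparse representation into a plain list (hypothetical helper)."""
    dense = [fill_value] * length
    for position, value in zip(sp_indices, sp_values):
        dense[position] = value
    return dense


# Dense data [1, 0, 0] stored sparsely with fill_value=0:
# only the single non-fill value and its position are kept.
sp_values, sp_indices, length = [1], [0], 3

original = to_dense(sp_values, sp_indices, length, fill_value=0)

# Changing only the fill value, without re-encoding the stored values,
# changes what every gap means.
reinterpreted = to_dense(sp_values, sp_indices, length, fill_value=None)

print(original)       # [1, 0, 0]
print(reinterpreted)  # [1, None, None]
```

In this toy the two zeros become missing values even though the stored data never changed, which mirrors the surprising `to_dense()` output shown in the comment above.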



Some new warnings are issued for operations that require or are likely to materialize a large dense array:

- A :class:`errors.PerformanceWarning` is issued when using fillna with a ``method``, as a dense array is constructed to create the filled array. Filling with a ``value`` is the efficient way to fill a sparse array.
- A :class:`errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used.

In addition to these API breaking changes, many :ref:`performance improvements and bug fixes have been made <whatsnew_0240.bug_fixes.sparse>`.
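
As a rough illustration of the ``take`` contract referenced in the list above, here is a minimal pure-Python sketch. It is a toy, not the pandas implementation: the function operates on plain lists, and ``allow_fill`` is made a required keyword so that the sketch makes no claim about the default. It assumes the documented ``ExtensionArray.take`` semantics: with ``allow_fill=True``, ``-1`` marks a missing position and other negative indices raise; with ``allow_fill=False``, negative indices count from the end.

```python
def take(values, indices, *, allow_fill, fill_value=None):
    """Toy model of the take contract on a plain list (hypothetical helper)."""
    if not allow_fill:
        # NumPy-style: negative indices count from the end.
        return [values[i] for i in indices]

    result = []
    for i in indices:
        if i == -1:
            # -1 is a marker for "missing", not "last element".
            result.append(fill_value)
        elif i < -1:
            raise ValueError("only -1 is allowed as a negative index "
                             "when allow_fill=True")
        else:
            result.append(values[i])
    return result


data = [10, 20, 30]
print(take(data, [0, -1], allow_fill=False))               # [10, 30]
print(take(data, [0, -1], allow_fill=True))                # [10, None]
print(take(data, [0, -1], allow_fill=True, fill_value=0))  # [10, 0]
```

The same ``-1`` produces different results depending on ``allow_fill``, which is why a change to the default is listed as API breaking.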

.. _whatsnew_0240.api_breaking.frame_to_dict_index_orient:

Raise ValueError in ``DataFrame.to_dict(orient='index')``
@@ -566,6 +597,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`)
- Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`)
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
Contributor

Really? We allow this? I agree this would be OK, but is it reasonably tested?

Contributor Author

Define reasonable :)

I'm reasonably sure there are places in pandas where we assume we have an ndarray, but may get an ExtensionArray instead.

The common case of .isna().any() is well tested though, I think.

Member

In any case, if you create the masks in other ways than .isna(), eg with a comparison like df['sparse_column'] == 1, you get exactly the same issue. Which means we basically have to support indexing with boolean sparse masks anyway I think.


.. _whatsnew_0240.api.incompatibilities:

@@ -647,6 +679,7 @@ Other API Changes
- :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`)
- :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`)
- :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`)
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)

.. _whatsnew_0240.deprecations:

@@ -888,13 +921,6 @@ Groupby/Resample/Rolling
- :func:`RollingGroupby.agg` and :func:`ExpandingGroupby.agg` now support multiple aggregation functions as parameters (:issue:`15072`)
- Bug in :meth:`DataFrame.resample` and :meth:`Series.resample` when resampling by a weekly offset (``'W'``) across a DST transition (:issue:`9119`, :issue:`21459`)

Sparse
^^^^^^

-
-
-

Reshaping
^^^^^^^^^

@@ -913,6 +939,19 @@ Reshaping
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue:`22796`)

.. _whatsnew_0240.bug_fixes.sparse:

Sparse
^^^^^^

- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`)
Contributor

Really, we have support for this? Again, I agree this is a nice feature, but we are decreasing support generally for sparse, so I'm not anxious to advertise this.

- Bug in :meth:`Series.to_sparse` with Series already holding sparse data not constructing properly (:issue:`22389`)
- Providing a ``sparse_index`` to the SparseArray constructor no longer defaults the na-value to ``np.nan`` for all dtypes. The correct na_value for ``data.dtype`` is now used.
- Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index.
- Improved performance of :meth:`Series.shift` for non-NA ``fill_value``, as values are no longer converted to a dense array.
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)

Build Changes
^^^^^^^^^^^^^

8 changes: 8 additions & 0 deletions pandas/_libs/sparse.pyx
@@ -68,6 +68,10 @@ cdef class IntIndex(SparseIndex):
output += 'Indices: %s\n' % repr(self.indices)
return output

@property
def nbytes(self):
return self.indices.nbytes

def check_integrity(self):
"""
Checks the following:
@@ -359,6 +363,10 @@ cdef class BlockIndex(SparseIndex):

return output

@property
def nbytes(self):
return self.blocs.nbytes + self.blengths.nbytes

@property
def ngaps(self):
return self.length - self.npoints
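
To make the two ``nbytes`` properties above concrete, here is a small stdlib-only sketch. It is an illustration under stated assumptions, not the Cython code: the helper names are hypothetical, and ``array.array`` stands in for the int32 NumPy arrays. It assumes an integer index stores one entry per stored value, while a block index stores one (start, length) pair per contiguous run.

```python
from array import array


def int_index_nbytes(indices):
    """Bytes used by a toy integer index: one entry per stored value."""
    return indices.itemsize * len(indices)


def to_blocks(indices):
    """Group sorted positions into parallel arrays of block starts and lengths."""
    blocs, blengths = array("i"), array("i")
    for position in indices:
        if len(blocs) and position == blocs[-1] + blengths[-1]:
            blengths[-1] += 1        # position extends the current run
        else:
            blocs.append(position)   # position starts a new run
            blengths.append(1)
    return blocs, blengths


def block_index_nbytes(blocs, blengths):
    """Bytes used by a toy block index: both arrays count toward the total."""
    return blocs.itemsize * len(blocs) + blengths.itemsize * len(blengths)


indices = array("i", [0, 1, 2, 3, 10, 11])
blocs, blengths = to_blocks(indices)

print(list(blocs), list(blengths))           # [0, 10] [4, 2]
print(int_index_nbytes(indices))             # 6 entries * itemsize
print(block_index_nbytes(blocs, blengths))   # 4 entries * itemsize
```

Either way the index occupies memory in addition to the stored values, which is what the ``SparseArray.nbytes`` fix in the whatsnew entry accounts for.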
21 changes: 18 additions & 3 deletions pandas/core/arrays/base.py
@@ -283,10 +283,25 @@ def astype(self, dtype, copy=True):
return np.array(self, dtype=dtype, copy=copy)

def isna(self):
# type: () -> np.ndarray
"""Boolean NumPy array indicating if each value is missing.
# type: () -> Union[ExtensionArray, np.ndarray]
"""
A 1-D array indicating if each value is missing.

Returns
-------
na_values : Union[np.ndarray, ExtensionArray]
In most cases, this should return a NumPy ndarray. For
exceptional cases like ``SparseArray``, where returning
an ndarray would be expensive, an ExtensionArray may be
returned.

Notes
-----
If returning an ExtensionArray, then

This should return a 1-D array the same length as 'self'.
* ``na_values._is_boolean`` should be True
* `na_values` should implement :func:`ExtensionArray._reduce`
Contributor

We should probably have an indexing EA mixin that implements these as NotImplemented (so one can subclass).

Member

Such an indexing mixin might be useful, but how is this related to the line above?

* ``na_values.any`` and ``na_values.all`` should be implemented
"""
raise AbstractMethodError(self)
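
The docstring above asks that a non-ndarray ``isna`` result still support ``any`` and ``all``. The sketch below shows, with a hypothetical ``ToySparseBoolMask`` class that is not part of pandas, how such reductions can be answered from the stored values and the fill value alone, without building the dense mask. It assumes the same toy layout as a sparse array: explicit values at some positions and one fill value for every gap.

```python
class ToySparseBoolMask:
    """Hypothetical boolean mask stored sparsely; illustrative only."""

    def __init__(self, sp_values, sp_indices, length, fill_value=False):
        self.sp_values = list(sp_values)
        self.sp_indices = list(sp_indices)
        self.length = length
        self.fill_value = fill_value

    @property
    def ngaps(self):
        # Number of positions represented implicitly by fill_value.
        return self.length - len(self.sp_values)

    def any(self):
        # True if a stored value is True, or the fill is True and occurs.
        return bool(any(self.sp_values) or (self.fill_value and self.ngaps > 0))

    def all(self):
        # True if every stored value is True and no gap holds a False fill.
        return bool(all(self.sp_values) and (self.fill_value or self.ngaps == 0))


mostly_present = ToySparseBoolMask([True], [2], length=5, fill_value=False)
print(mostly_present.any(), mostly_present.all())   # True False

all_missing = ToySparseBoolMask([], [], length=3, fill_value=True)
print(all_missing.any(), all_missing.all())         # True True
```

Both reductions touch only the stored values, so the common ``.isna().any()`` pattern mentioned in the review stays cheap in this model.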

4 changes: 3 additions & 1 deletion pandas/core/common.py
@@ -14,7 +14,9 @@

from pandas import compat
from pandas.compat import iteritems, PY36, OrderedDict
from pandas.core.dtypes.generic import ABCSeries, ABCIndex, ABCIndexClass
from pandas.core.dtypes.generic import (
ABCSeries, ABCIndex, ABCIndexClass
)
from pandas.core.dtypes.common import (
is_integer, is_bool_dtype, is_extension_array_dtype, is_array_like
)
17 changes: 14 additions & 3 deletions pandas/core/dtypes/common.py
@@ -12,6 +12,7 @@
PeriodDtype, IntervalDtype,
PandasExtensionDtype, ExtensionDtype,
_pandas_registry)
from pandas.core.sparse.dtype import SparseDtype
from pandas.core.dtypes.generic import (
ABCCategorical, ABCPeriodIndex, ABCDatetimeIndex, ABCSeries,
ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex, ABCIndexClass,
@@ -180,8 +181,10 @@ def is_sparse(arr):
>>> is_sparse(bsr_matrix([1, 2, 3]))
False
"""
from pandas.core.sparse.dtype import SparseDtype

return isinstance(arr, (ABCSparseArray, ABCSparseSeries))
dtype = getattr(arr, 'dtype', arr)
return isinstance(dtype, SparseDtype)
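
The dtype-based check in this hunk relies on a small idiom: ``getattr(arr, 'dtype', arr)`` falls back to the object itself, so the same function accepts either a container or a dtype. A minimal sketch of that idiom follows, using hypothetical ``Toy*`` classes in place of the pandas types.

```python
class ToySparseDtype:
    """Hypothetical stand-in for a sparse dtype class."""


class ToySparseArray:
    """Hypothetical container whose dtype marks it as sparse."""
    dtype = ToySparseDtype()


class ToyDenseArray:
    """Hypothetical container with an ordinary dtype."""
    dtype = "int64"


def toy_is_sparse(obj):
    # Use obj.dtype when it exists; otherwise treat obj itself as the dtype.
    dtype = getattr(obj, "dtype", obj)
    return isinstance(dtype, ToySparseDtype)


print(toy_is_sparse(ToySparseArray()))  # True: container with a sparse dtype
print(toy_is_sparse(ToySparseDtype()))  # True: the dtype itself
print(toy_is_sparse(ToyDenseArray()))   # False: container with another dtype
print(toy_is_sparse([1, 2, 3]))         # False: no dtype attribute at all
```

Checking the dtype rather than the container class means any object that carries the sparse dtype is recognized, without listing every container type.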


def is_scipy_sparse(arr):
@@ -1643,8 +1646,9 @@ def is_bool_dtype(arr_or_dtype):
True
>>> is_bool_dtype(pd.Categorical([True, False]))
True
>>> is_bool_dtype(pd.SparseArray([True, False]))
True
"""

if arr_or_dtype is None:
return False
try:
@@ -1751,6 +1755,8 @@ def is_extension_array_dtype(arr_or_dtype):
array interface. In pandas, this includes:

* Categorical
* Sparse
* Interval

Third-party libraries may implement arrays or types satisfying
this interface as well.
@@ -1873,7 +1879,8 @@ def _get_dtype(arr_or_dtype):
return PeriodDtype.construct_from_string(arr_or_dtype)
elif is_interval_dtype(arr_or_dtype):
return IntervalDtype.construct_from_string(arr_or_dtype)
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex)):
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex,
ABCSparseArray, ABCSparseSeries)):
return arr_or_dtype.dtype

if hasattr(arr_or_dtype, 'dtype'):
@@ -1921,6 +1928,10 @@ def _get_dtype_type(arr_or_dtype):
elif is_interval_dtype(arr_or_dtype):
return Interval
return _get_dtype_type(np.dtype(arr_or_dtype))
elif isinstance(arr_or_dtype, (ABCSparseSeries, ABCSparseArray,
SparseDtype)):
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
return dtype.type
try:
return arr_or_dtype.dtype.type
except AttributeError:
72 changes: 18 additions & 54 deletions pandas/core/dtypes/concat.py
@@ -93,11 +93,13 @@ def _get_series_result_type(result, objs=None):
def _get_frame_result_type(result, objs):
"""
return appropriate class of DataFrame-like concat
if all blocks are SparseBlock, return SparseDataFrame
if all blocks are sparse, return SparseDataFrame
otherwise, return 1st obj
"""

if result.blocks and all(b.is_sparse for b in result.blocks):
if (result.blocks and (
all(is_sparse(b) for b in result.blocks) or
Contributor

Related to my comment above: can is_sparse not simply check whether it's an EA and whether it has a Sparse dtype?

Then you simply need to pass the b.values here, yes?

Contributor Author

I'll give that a shot.

Contributor

Can you add a comment here? It's not obvious what you are doing.

Contributor

how can obj be a SparseFrame here? is this tested?

Contributor Author

I think a comment of mine may have been lost.

This is hit in several places (e.g. pandas/tests/sparse/test_combine_concat.py::TestSparseDataFrameConcat::test_concat).

What part can I clarify here?

all(isinstance(obj, ABCSparseDataFrame) for obj in objs))):
from pandas.core.sparse.api import SparseDataFrame
return SparseDataFrame
else:
@@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None):
a single array, preserving the combined dtypes
"""

from pandas.core.sparse.array import SparseArray, _make_index
from pandas.core.sparse.array import SparseArray

def convert_sparse(x, axis):
# coerce to native type
if isinstance(x, SparseArray):
x = x.get_values()
else:
x = np.asarray(x)
x = x.ravel()
if axis > 0:
x = np.atleast_2d(x)
return x
fill_values = [x.fill_value for x in to_concat
if isinstance(x, SparseArray)]

if typs is None:
typs = get_dtype_kinds(to_concat)
if len(set(fill_values)) > 1:
raise ValueError("Cannot concatenate SparseArrays with different "
"fill values")

if len(typs) == 1:
# concat input as it is if all inputs are sparse
# and have the same fill_value
fill_values = {c.fill_value for c in to_concat}
if len(fill_values) == 1:
sp_values = [c.sp_values for c in to_concat]
indexes = [c.sp_index.to_int_index() for c in to_concat]

indices = []
loc = 0
for idx in indexes:
indices.append(idx.indices + loc)
loc += idx.length
sp_values = np.concatenate(sp_values)
indices = np.concatenate(indices)
sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index)

return SparseArray(sp_values, sparse_index=sp_index,
fill_value=to_concat[0].fill_value)

# input may be sparse / dense mixed and may have different fill_value
# input must contain sparse at least 1
sparses = [c for c in to_concat if is_sparse(c)]
fill_values = [c.fill_value for c in sparses]
sp_indexes = [c.sp_index for c in sparses]

# densify and regular concat
to_concat = [convert_sparse(x, axis) for x in to_concat]
result = np.concatenate(to_concat, axis=axis)

if not len(typs - {'sparse', 'f', 'i'}):
# sparsify if inputs are sparse and dense numerics
# first sparse input's fill_value and SparseIndex is used
result = SparseArray(result.ravel(), fill_value=fill_values[0],
kind=sp_indexes[0])
else:
# coerce to object if needed
result = result.astype('object')
return result
fill_value = fill_values[0]

# TODO: Fix join unit generation so we aren't passed this.
to_concat = [x if isinstance(x, SparseArray)
else SparseArray(x.squeeze(), fill_value=fill_value)
for x in to_concat]

return SparseArray._concat_same_type(to_concat)
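
The control flow of the rewritten ``_concat_sparse``, as it appears in this revision of the diff, can be sketched in pure Python. This is a toy under stated assumptions: ``ToySparse`` and ``concat_sparse`` are hypothetical names, plain lists stand in for dense inputs, fill values are restricted to hashable non-NaN scalars, and the final step mimics a same-type concatenation by shifting each input's stored positions by a running offset (the approach the removed code used).

```python
class ToySparse:
    """Hypothetical sparse container: stores only values that differ from fill_value."""

    def __init__(self, values, fill_value=0):
        self.fill_value = fill_value
        self.length = len(values)
        self.sp_indices = [i for i, v in enumerate(values) if v != fill_value]
        self.sp_values = [values[i] for i in self.sp_indices]

    def to_dense(self):
        dense = [self.fill_value] * self.length
        for i, v in zip(self.sp_indices, self.sp_values):
            dense[i] = v
        return dense


def concat_sparse(to_concat):
    # 1. Collect the fill values of the inputs that are already sparse.
    fill_values = [x.fill_value for x in to_concat if isinstance(x, ToySparse)]
    if len(set(fill_values)) > 1:
        raise ValueError("Cannot concatenate sparse arrays with different "
                         "fill values")
    fill_value = fill_values[0]

    # 2. Sparsify dense inputs using the shared fill value.
    to_concat = [x if isinstance(x, ToySparse)
                 else ToySparse(x, fill_value=fill_value)
                 for x in to_concat]

    # 3. Same-type concat: shift each input's positions by a running offset.
    result = ToySparse([], fill_value=fill_value)
    offset = 0
    for x in to_concat:
        result.sp_values.extend(x.sp_values)
        result.sp_indices.extend(i + offset for i in x.sp_indices)
        offset += x.length
    result.length = offset
    return result


combined = concat_sparse([ToySparse([1, 0, 0]), [0, 2]])
print(combined.sp_indices)   # [0, 4]
print(combined.to_dense())   # [1, 0, 0, 0, 2]
```

Nothing is densified along the way in this model: only stored values and their shifted positions are copied, which is the main difference from the removed densify-then-concatenate path.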


def _concat_rangeindex_same_dtype(indexes):
13 changes: 13 additions & 0 deletions pandas/core/dtypes/missing.py
@@ -499,6 +499,19 @@ def na_value_for_dtype(dtype, compat=True):
Returns
-------
np.dtype or a pandas dtype

Examples
--------
>>> na_value_for_dtype(np.dtype('int64'))
0
>>> na_value_for_dtype(np.dtype('int64'), compat=False)
nan
>>> na_value_for_dtype(np.dtype('float64'))
nan
>>> na_value_for_dtype(np.dtype('bool'))
False
>>> na_value_for_dtype(np.dtype('datetime64[ns]'))
NaT
"""
dtype = pandas_dtype(dtype)

2 changes: 1 addition & 1 deletion pandas/core/internals/__init__.py
@@ -5,7 +5,7 @@
make_block, # io.pytables, io.packers
FloatBlock, IntBlock, ComplexBlock, BoolBlock, ObjectBlock,
TimeDeltaBlock, DatetimeBlock, DatetimeTZBlock,
CategoricalBlock, ExtensionBlock, SparseBlock, ScalarBlock,
CategoricalBlock, ExtensionBlock, ScalarBlock,
Block)
from .managers import ( # noqa:F401
BlockManager, SingleBlockManager,