[API/REF]: SparseArray is an ExtensionArray (pandas-dev#22325)
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse.

Closes pandas-dev#22367

* Fixed Series[sparse].to_sparse

Closes pandas-dev#22389

Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835
TomAugspurger authored and tm9k1 committed Nov 19, 2018
1 parent 3feaa79 commit 54d621a
Showing 50 changed files with 3,346 additions and 1,421 deletions.
53 changes: 46 additions & 7 deletions doc/source/whatsnew/v0.24.0.txt
@@ -381,6 +381,37 @@ is the case with :attr:`Period.end_time`, for example

p.end_time

.. _whatsnew_0240.api_breaking.sparse_values:

Sparse Data Structure Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``SparseArray``, the array backing ``SparseSeries`` and the columns in a ``SparseDataFrame``,
is now an extension array (:issue:`21978`, :issue:`19056`, :issue:`22835`).
To conform to this interface and for consistency with the rest of pandas, some API breaking
changes were made:

- ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray`. To convert a SparseArray to a NumPy array, use :meth:`numpy.asarray`.
- ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of :class:`SparseDtype`, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subtype``.
- :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`14167`)
- ``SparseArray.take`` now matches the API of :meth:`pandas.api.extensions.ExtensionArray.take` (:issue:`19506`):

* The default value of ``allow_fill`` has changed from ``False`` to ``True``.
* The ``out`` and ``mode`` parameters are no longer accepted (previously, specifying them raised an error).
* Passing a scalar for ``indices`` is no longer allowed.

- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
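
As a rough illustration of the first few changes above (a minimal sketch; the exact reprs may differ, outputs are indicated in comments)::

    import numpy as np
    import pandas as pd

    arr = pd.SparseArray([1.0, 0.0, 0.0, 2.0], fill_value=0.0)

    isinstance(arr, np.ndarray)         # False: no longer an ndarray subclass
    arr.dtype                           # Sparse[float64, 0.0], a SparseDtype
    arr.dtype.subtype                   # dtype('float64'), the underlying dtype
    np.asarray(arr)                     # dense array([1., 0., 0., 2.]) with all values
    arr.take([0, -1], allow_fill=True)  # -1 now means "missing" and is filled (with NaN here)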


Some new warnings are issued for operations that require or are likely to materialize a large dense array:

- A :class:`errors.PerformanceWarning` is issued when using ``fillna`` with a ``method``, as a dense array is constructed to create the filled array. Filling with a ``value`` is the efficient way to fill a sparse array.
- A :class:`errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used.
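
A sketch of the warning behaviour (hedged; filling with a scalar ``value`` remains efficient)::

    import warnings

    import numpy as np
    import pandas as pd

    sparr = pd.SparseArray([1.0, np.nan, np.nan, 2.0])

    sparr.fillna(0.0)                 # efficient: no dense array is constructed
    with warnings.catch_warnings(record=True):
        warnings.simplefilter("always")
        sparr.fillna(method="ffill")  # builds a dense array; a PerformanceWarning is expected

    a = pd.Series(pd.SparseArray([1.0, np.nan], fill_value=0.0))
    b = pd.Series(pd.SparseArray([1.0, np.nan], fill_value=1.0))
    result = pd.concat([a, b])        # PerformanceWarning expected; fill_value 0.0 is kept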

In addition to these API breaking changes, many :ref:`performance improvements and bug fixes have been made <whatsnew_0240.bug_fixes.sparse>`.

.. _whatsnew_0240.api_breaking.frame_to_dict_index_orient:

Raise ValueError in ``DataFrame.to_dict(orient='index')``
@@ -574,6 +605,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`)
- Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`)
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
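
For the :func:`ExtensionArray.isna` change, a small sketch of what a sparse-backed mask is expected to look like (reprs indicative)::

    import numpy as np
    import pandas as pd

    sparr = pd.SparseArray([1.0, np.nan, np.nan])
    mask = sparr.isna()

    mask.dtype        # expected: Sparse[bool, False], an ExtensionArray rather than an ndarray
    np.asarray(mask)  # densify to array([False, True, True]) when a plain ndarray is needed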

.. _whatsnew_0240.api.incompatibilities:
@@ -656,6 +688,7 @@ Other API Changes
- :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`)
- :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`)
- :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`)
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
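
A sketch of the single-row slicing change for two columns sharing the same extension dtype (expected behaviour; sparse columns used here for illustration)::

    import pandas as pd

    df = pd.DataFrame({
        "A": pd.SparseArray([1.0, 2.0]),
        "B": pd.SparseArray([3.0, 4.0]),
    })

    df.loc[0].dtype   # expected: Sparse[float64, nan] rather than object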

.. _whatsnew_0240.deprecations:

@@ -897,13 +930,6 @@ Groupby/Resample/Rolling
- :func:`RollingGroupby.agg` and :func:`ExpandingGroupby.agg` now support multiple aggregation functions as parameters (:issue:`15072`)
- Bug in :meth:`DataFrame.resample` and :meth:`Series.resample` when resampling by a weekly offset (``'W'``) across a DST transition (:issue:`9119`, :issue:`21459`)

Sparse
^^^^^^

-
-
-

Reshaping
^^^^^^^^^

@@ -922,6 +948,19 @@ Reshaping
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue:`22796`)

.. _whatsnew_0240.bug_fixes.sparse:

Sparse
^^^^^^

- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`)
- Bug in :meth:`Series.to_sparse` not constructing the result correctly when the Series already holds sparse data (:issue:`22389`)
- Providing a ``sparse_index`` to the ``SparseArray`` constructor no longer defaults the na value to ``np.nan`` for all dtypes; the correct na value for ``data.dtype`` is now used.
- Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index.
- Improved performance of :meth:`Series.shift` for non-NA ``fill_value``, as values are no longer converted to a dense array.
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
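
A few of these fixes, sketched against the new sparse extension array (outputs indicative)::

    import numpy as np
    import pandas as pd

    # Updating an existing boolean column to sparse now keeps a sparse bool dtype
    df = pd.DataFrame({"flag": [True, False, True]})
    df["flag"] = pd.SparseArray([True, False, True])
    df.dtypes                     # expected: flag    Sparse[bool, False]

    # A Series already holding sparse values converts cleanly
    s = pd.Series(pd.SparseArray([1.0, np.nan, np.nan]))
    ss = s.to_sparse()

    # nbytes now accounts for the sparse index as well as the sparse values
    arr = pd.SparseArray([0, 0, 1, 2], fill_value=0)
    arr.nbytes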

Build Changes
^^^^^^^^^^^^^

8 changes: 8 additions & 0 deletions pandas/_libs/sparse.pyx
@@ -68,6 +68,10 @@ cdef class IntIndex(SparseIndex):
output += 'Indices: %s\n' % repr(self.indices)
return output

@property
def nbytes(self):
return self.indices.nbytes

def check_integrity(self):
"""
Checks the following:
@@ -359,6 +363,10 @@ cdef class BlockIndex(SparseIndex):

return output

@property
def nbytes(self):
return self.blocs.nbytes + self.blengths.nbytes

@property
def ngaps(self):
return self.length - self.npoints
21 changes: 18 additions & 3 deletions pandas/core/arrays/base.py
@@ -287,10 +287,25 @@ def astype(self, dtype, copy=True):
return np.array(self, dtype=dtype, copy=copy)

def isna(self):
# type: () -> np.ndarray
"""Boolean NumPy array indicating if each value is missing.
# type: () -> Union[ExtensionArray, np.ndarray]
"""
A 1-D array indicating if each value is missing.

Returns
-------
na_values : Union[np.ndarray, ExtensionArray]
In most cases, this should return a NumPy ndarray. For
exceptional cases like ``SparseArray``, where returning
an ndarray would be expensive, an ExtensionArray may be
returned.

Notes
-----
This should return a 1-D array the same length as 'self'.
If returning an ExtensionArray, then

* ``na_values._is_boolean`` should be True
* ``na_values`` should implement :func:`ExtensionArray._reduce`
* ``na_values.any`` and ``na_values.all`` should be implemented
"""
raise AbstractMethodError(self)

4 changes: 3 additions & 1 deletion pandas/core/common.py
@@ -14,7 +14,9 @@

from pandas import compat
from pandas.compat import iteritems, PY36, OrderedDict
from pandas.core.dtypes.generic import ABCSeries, ABCIndex, ABCIndexClass
from pandas.core.dtypes.generic import (
ABCSeries, ABCIndex, ABCIndexClass
)
from pandas.core.dtypes.common import (
is_integer, is_bool_dtype, is_extension_array_dtype, is_array_like
)
17 changes: 14 additions & 3 deletions pandas/core/dtypes/common.py
@@ -12,6 +12,7 @@
PeriodDtype, IntervalDtype,
PandasExtensionDtype, ExtensionDtype,
_pandas_registry)
from pandas.core.sparse.dtype import SparseDtype
from pandas.core.dtypes.generic import (
ABCCategorical, ABCPeriodIndex, ABCDatetimeIndex, ABCSeries,
ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex, ABCIndexClass,
@@ -180,8 +181,10 @@ def is_sparse(arr):
>>> is_sparse(bsr_matrix([1, 2, 3]))
False
"""
from pandas.core.sparse.dtype import SparseDtype

return isinstance(arr, (ABCSparseArray, ABCSparseSeries))
dtype = getattr(arr, 'dtype', arr)
return isinstance(dtype, SparseDtype)


def is_scipy_sparse(arr):
@@ -1643,8 +1646,9 @@ def is_bool_dtype(arr_or_dtype):
True
>>> is_bool_dtype(pd.Categorical([True, False]))
True
>>> is_bool_dtype(pd.SparseArray([True, False]))
True
"""

if arr_or_dtype is None:
return False
try:
@@ -1751,6 +1755,8 @@ def is_extension_array_dtype(arr_or_dtype):
array interface. In pandas, this includes:
* Categorical
* Sparse
* Interval
Third-party libraries may implement arrays or types satisfying
this interface as well.
@@ -1873,7 +1879,8 @@ def _get_dtype(arr_or_dtype):
return PeriodDtype.construct_from_string(arr_or_dtype)
elif is_interval_dtype(arr_or_dtype):
return IntervalDtype.construct_from_string(arr_or_dtype)
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex)):
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex,
ABCSparseArray, ABCSparseSeries)):
return arr_or_dtype.dtype

if hasattr(arr_or_dtype, 'dtype'):
@@ -1921,6 +1928,10 @@ def _get_dtype_type(arr_or_dtype):
elif is_interval_dtype(arr_or_dtype):
return Interval
return _get_dtype_type(np.dtype(arr_or_dtype))
elif isinstance(arr_or_dtype, (ABCSparseSeries, ABCSparseArray,
SparseDtype)):
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
return dtype.type
try:
return arr_or_dtype.dtype.type
except AttributeError:
72 changes: 18 additions & 54 deletions pandas/core/dtypes/concat.py
@@ -93,11 +93,13 @@ def _get_series_result_type(result, objs=None):
def _get_frame_result_type(result, objs):
"""
return appropriate class of DataFrame-like concat
if all blocks are SparseBlock, return SparseDataFrame
if all blocks are sparse, return SparseDataFrame
otherwise, return 1st obj
"""

if result.blocks and all(b.is_sparse for b in result.blocks):
if (result.blocks and (
all(is_sparse(b) for b in result.blocks) or
all(isinstance(obj, ABCSparseDataFrame) for obj in objs))):
from pandas.core.sparse.api import SparseDataFrame
return SparseDataFrame
else:
@@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None):
a single array, preserving the combined dtypes
"""

from pandas.core.sparse.array import SparseArray, _make_index
from pandas.core.sparse.array import SparseArray

def convert_sparse(x, axis):
# coerce to native type
if isinstance(x, SparseArray):
x = x.get_values()
else:
x = np.asarray(x)
x = x.ravel()
if axis > 0:
x = np.atleast_2d(x)
return x
fill_values = [x.fill_value for x in to_concat
if isinstance(x, SparseArray)]

if typs is None:
typs = get_dtype_kinds(to_concat)
if len(set(fill_values)) > 1:
raise ValueError("Cannot concatenate SparseArrays with different "
"fill values")

if len(typs) == 1:
# concat input as it is if all inputs are sparse
# and have the same fill_value
fill_values = {c.fill_value for c in to_concat}
if len(fill_values) == 1:
sp_values = [c.sp_values for c in to_concat]
indexes = [c.sp_index.to_int_index() for c in to_concat]

indices = []
loc = 0
for idx in indexes:
indices.append(idx.indices + loc)
loc += idx.length
sp_values = np.concatenate(sp_values)
indices = np.concatenate(indices)
sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index)

return SparseArray(sp_values, sparse_index=sp_index,
fill_value=to_concat[0].fill_value)

# input may be sparse / dense mixed and may have different fill_value
# input must contain sparse at least 1
sparses = [c for c in to_concat if is_sparse(c)]
fill_values = [c.fill_value for c in sparses]
sp_indexes = [c.sp_index for c in sparses]

# densify and regular concat
to_concat = [convert_sparse(x, axis) for x in to_concat]
result = np.concatenate(to_concat, axis=axis)

if not len(typs - {'sparse', 'f', 'i'}):
# sparsify if inputs are sparse and dense numerics
# first sparse input's fill_value and SparseIndex is used
result = SparseArray(result.ravel(), fill_value=fill_values[0],
kind=sp_indexes[0])
else:
# coerce to object if needed
result = result.astype('object')
return result
fill_value = fill_values[0]

# TODO: Fix join unit generation so we aren't passed this.
to_concat = [x if isinstance(x, SparseArray)
else SparseArray(x.squeeze(), fill_value=fill_value)
for x in to_concat]

return SparseArray._concat_same_type(to_concat)


def _concat_rangeindex_same_dtype(indexes):
13 changes: 13 additions & 0 deletions pandas/core/dtypes/missing.py
@@ -499,6 +499,19 @@ def na_value_for_dtype(dtype, compat=True):
Returns
-------
np.dtype or a pandas dtype

Examples
--------
>>> na_value_for_dtype(np.dtype('int64'))
0
>>> na_value_for_dtype(np.dtype('int64'), compat=False)
nan
>>> na_value_for_dtype(np.dtype('float64'))
nan
>>> na_value_for_dtype(np.dtype('bool'))
False
>>> na_value_for_dtype(np.dtype('datetime64[ns]'))
NaT
"""
dtype = pandas_dtype(dtype)

2 changes: 1 addition & 1 deletion pandas/core/internals/__init__.py
@@ -5,7 +5,7 @@
make_block, # io.pytables, io.packers
FloatBlock, IntBlock, ComplexBlock, BoolBlock, ObjectBlock,
TimeDeltaBlock, DatetimeBlock, DatetimeTZBlock,
CategoricalBlock, ExtensionBlock, SparseBlock, ScalarBlock,
CategoricalBlock, ExtensionBlock, ScalarBlock,
Block)
from .managers import ( # noqa:F401
BlockManager, SingleBlockManager,
(The remaining changed files are not shown here.)
