
Fixes np.unique on SparseArray #19651

Closed · wants to merge 11 commits
doc/source/whatsnew/v0.23.0.txt (1 addition, 0 deletions)

@@ -823,6 +823,7 @@ Sparse
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
 - Bug in :class:`SparseSeries.memory_usage` which caused segfault by accessing non sparse elements (:issue:`19368`)
 - Bug in constructing a ``SparseArray``: if ``data`` is a scalar and ``index`` is defined it will coerce to ``float64`` regardless of scalar's dtype. (:issue:`19163`)
+- Bug in :func:`SparseSeries.unique` which returned only the explicitly stored (non-fill) elements, omitting the fill value (:issue:`19651`)

 Reshaping
 ^^^^^^^^^
pandas/core/algorithms.py (11 additions, 2 deletions)

@@ -10,7 +10,8 @@
     maybe_promote, construct_1d_object_array_from_listlike)
 from pandas.core.dtypes.generic import (
     ABCSeries, ABCIndex,
-    ABCIndexClass, ABCCategorical)
+    ABCIndexClass, ABCCategorical,
+    ABCSparseArray)
 from pandas.core.dtypes.common import (
     is_unsigned_integer_dtype, is_signed_integer_dtype,
     is_integer_dtype, is_complex_dtype,
@@ -362,7 +363,14 @@ def unique(values):
     htable, _, values, dtype, ndtype = _get_hashtable_algo(values)

     table = htable(len(values))
-    uniques = table.unique(values)
+
+    if isinstance(values, ABCSparseArray):
Contributor:

not sure you saw my comment. this should not be handling unique like this. SparseArray should have a method .unique() which can directly handle the unique (and then call algorithms.unique on the sparse values).
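
A rough sketch of the method shape being suggested here — `sp_values`, `sp_index.ngaps`, and `fill_value` are real `SparseArray` attributes, but the body is illustrative, not the PR's actual code:

```python
import numpy as np

class SparseArray:  # stub showing only the suggested method
    def unique(self):
        from pandas.core import algorithms

        # deduplicate the explicitly stored values with the generic
        # algorithm; sp_values is a plain ndarray, so no recursion
        uniques = algorithms.unique(self.sp_values)
        if self.sp_index.ngaps > 0:
            # gaps implicitly hold fill_value, so include it once
            uniques = np.append(uniques, [self.fill_value])
        return type(self)(uniques, fill_value=self.fill_value)
```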

Contributor:

IOW you might be able to get away with doing a

if hasattr(values, 'unique'):
    return values.unique()
...

need to avoid recursion, but here values must be an ndarray or an EA, including Categorical (and NOT a Series).

cc @TomAugspurger
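
A minimal sketch of the dispatch being proposed (not pandas' actual code); the recursion bottoms out because each array's own `.unique()` would re-enter with a plain ndarray:

```python
import numpy as np

def unique(values):
    # EA-like objects (Categorical, SparseArray, ...) that define
    # their own unique() handle deduplication themselves
    if not isinstance(values, np.ndarray) and hasattr(values, 'unique'):
        return values.unique()
    # otherwise values is an ndarray: fall through to the
    # hashtable-based path (elided here)
    ...
```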

Contributor Author:

Hrm I'll have to think about this... I'm hesitant to do this mainly because this code seems to rely on the fact that it always outputs an ndarray, which is why groupby doesn't work with sparse data.

The problem really is with classes that implement their own unique(). It usually outputs something that isn't an ndarray.

Also I don't want to call unique on the class and cast it back to ndarray because that feels sloppy. I'll think of something :) thanks!
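
The concern in concrete terms: for types that implement their own `unique`, the generic entry point already returns a non-ndarray today (runnable against released pandas):

```python
import pandas as pd

# pd.unique defers to Categorical's own unique(), so the result
# is a Categorical, not an ndarray
print(type(pd.unique(pd.Categorical(['a', 'b', 'a']))))
```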

Contributor:

@hexgnu have you been following the ExtensionArray stuff? Eventually SparseArray should be an ExtensionArray like Categorical, and things like groupby, factorize, etc. will then all work correctly on EAs.

> Also I don't want to call unique on the class and cast it back to ndarray because that feels sloppy.

SparseArray.unique() should return a SparseArray, just like Categorical.unique returns a Categorical.
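
For comparison, `Categorical.unique` already round-trips its own type:

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'a'])
uniq = cat.unique()
print(isinstance(uniq, pd.Categorical))  # True: a Categorical, not an ndarray
```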

+        to_unique = values.sp_values
+        if values.sp_index.ngaps > 0:
+            to_unique = np.append(to_unique, [values.fill_value])
+        uniques = table.unique(to_unique)
+    else:
+        uniques = table.unique(values)
     uniques = _reconstruct_data(uniques, dtype, original)

     if isinstance(original, ABCSeries) and is_datetime64tz_dtype(dtype):
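
Read on its own, the new branch says: the candidates for deduplication are the explicitly stored `sp_values`, plus one copy of `fill_value` whenever the sparse index has gaps (positions that implicitly hold the fill value). A standalone sketch of the same logic, with `np.unique` standing in for the Cython hashtable (`np.unique` sorts, whereas `table.unique` keeps first-seen order):

```python
import numpy as np

def sparse_unique(sp_values, fill_value, ngaps):
    # gaps implicitly hold fill_value, so it joins the candidates once
    to_unique = sp_values
    if ngaps > 0:
        to_unique = np.append(to_unique, [fill_value])
    return np.unique(to_unique)

# SparseArray([0, 0, 1, 2], fill_value=0) stores sp_values=[1, 2] with
# ngaps=2; without the append, 0 would be missing from the result
print(sparse_unique(np.array([1, 2]), 0, 2))  # [0 1 2]
```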
@@ -469,6 +477,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
     table = hash_klass(size_hint or len(values))
     uniques = vec_klass()
     check_nulls = not is_integer_dtype(original)
+
     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)

     labels = _ensure_platform_int(labels)
pandas/tests/test_algos.py (12 additions, 1 deletion)

@@ -8,7 +8,8 @@
 from datetime import datetime
 from itertools import permutations
 from pandas import (Series, Categorical, CategoricalIndex,
-                    Timestamp, DatetimeIndex, Index, IntervalIndex)
+                    Timestamp, DatetimeIndex, Index, IntervalIndex,
+                    SparseArray)
 import pandas as pd

 from pandas import compat
@@ -268,6 +269,16 @@ def test_object_refcount_bug(self):
         for i in range(1000):
             len(algos.unique(lst))

+    @pytest.mark.parametrize('fill_value', [0, 1, np.nan, None])
+    def test_sparse(self, fill_value):
Member:

Reference issue number.

+        # GH 19595
+        arr = SparseArray([0, 1, np.nan, None], fill_value=fill_value)
+
+        result = algos.unique(arr)
+
+        assert isinstance(result, np.ndarray)
+        assert len(result) == 3

     def test_on_index_object(self):

         mindex = pd.MultiIndex.from_arrays([np.arange(5).repeat(5), np.tile(
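
Why the expected length is 3 for every parametrized `fill_value`: `None` is coerced to `NaN` when the array is constructed, so the data holds exactly three distinct values (0, 1, NaN), and the parametrization only changes which of them plays the fill role. The coercion mirrors NumPy's own float casting:

```python
import numpy as np

# None coerces to NaN when cast to float, just as it does inside the
# SparseArray constructor, so [0, 1, nan, None] has exactly three
# distinct values: 0.0, 1.0, and NaN
vals = np.array([0, 1, np.nan, None], dtype=float)
print(vals)  # [ 0.  1. nan nan]
```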