Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes np.unique on SparseArray #19651

Closed
wants to merge 11 commits into from

Conversation

hexgnu
Copy link
Contributor

@hexgnu hexgnu commented Feb 12, 2018

@hexgnu hexgnu changed the title Fix groupby sparse Fixes np.unique on SparseArray Feb 12, 2018
# The fill value as well assuming the ngaps are greater than 0

if ngaps_val > 0:
k = kh_get_{dtype}(self.table, fill_value_val)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated 3 times in this class. Would be nice to de-dupe this somehow without making it super complicated with string interpolation...

@gfyoung gfyoung added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type labels Feb 12, 2018
@@ -268,6 +269,16 @@ def test_object_refcount_bug(self):
for i in range(1000):
len(algos.unique(lst))

@pytest.mark.parametrize('fill_value', [0, 1, np.nan, None])
def test_sparse(self, fill_value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference issue number.

@codecov
Copy link

codecov bot commented Feb 12, 2018

Codecov Report

Merging #19651 into master will increase coverage by 0.01%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19651      +/-   ##
==========================================
+ Coverage   91.57%   91.59%   +0.01%     
==========================================
  Files         150      150              
  Lines       48817    48811       -6     
==========================================
+ Hits        44704    44707       +3     
+ Misses       4113     4104       -9
Flag Coverage Δ
#multiple 89.96% <83.33%> (+0.01%) ⬆️
#single 41.73% <66.66%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.2% <ø> (ø) ⬆️
pandas/core/algorithms.py 94.01% <83.33%> (-0.13%) ⬇️
pandas/core/ops.py 95.22% <0%> (-1.62%) ⬇️
pandas/io/sas/sas_xport.py 90.27% <0%> (ø) ⬆️
pandas/core/frame.py 97.17% <0%> (+0.01%) ⬆️
pandas/core/sparse/array.py 91.4% <0%> (+0.02%) ⬆️
pandas/core/series.py 94.61% <0%> (+0.04%) ⬆️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️
pandas/io/parsers.py 95.56% <0%> (+0.24%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9551c8...4a13a75. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are not just passing the sp_values + fill_value to and calling .unique on that?

@@ -251,25 +251,36 @@ cdef class HashTable:
{{py:

# name, dtype, null_condition, float_group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the comment

@pep8speaks
Copy link

pep8speaks commented Feb 13, 2018

Hello @hexgnu! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 13, 2018 at 04:44 Hours UTC

@@ -362,7 +363,14 @@ def unique(values):
htable, _, values, dtype, ndtype = _get_hashtable_algo(values)

table = htable(len(values))
uniques = table.unique(values)

if isinstance(values, ABCSparseArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure you saw my comment. this should not be handling unique like this. SparseArray should have a method .unique() which can directly handle the unique (and then call algorithms.unique on the sparse values).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW you might be able to get aways with doing a

if hasattr(values, 'unique'):
    return values.unique()
...

need to avoid recursion, but here values must be a ndarray or an EA, including Categorical. (and NOT a Series)

cc @TomAugspurger

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm I'll have to think about this... I'm hesitant to do this mainly because this code seems to rely on the fact that it always outputs an ndarray, which is why groupby doesn't work with sparse data.

The problem really is with classes that implement their own unique(). It usually outputs something that isn't ndarray.

Also I don't want to call unique on the class and cast it back to ndarray cause that feels sloppy. I'll think of something :) thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hexgnu have you been following the ExtensionArray stuff? Eventually SparseArray should be an ExtensionArray like Categorical. Eventually things like groupby, factorize, etc will all work correctly on EAs.

Also I don't want to call unique on the class and cast it back to ndarray cause that feels sloppy.

SparseArray.unique() should return a SparseArray, just like Categorical.unique returns a Categorical.

@jreback
Copy link
Contributor

jreback commented Jul 8, 2018

can you rebase / update

@gfyoung
Copy link
Member

gfyoung commented Jul 8, 2018

@jreback : Based on this discussion here, doesn't this PR require a more substantial overhaul to properly anticipate the EA work from @TomAugspurger ? Or is your suggestion here an acceptable patch?

@TomAugspurger
Copy link
Contributor

Let's hold on this for a bit. I'm making SparseArray an ExtensionArray.

@hexgnu
Copy link
Contributor Author

hexgnu commented Jul 16, 2018

Hey everybody here. Let me know what you want me to do. We can close this and I can reopen new work when the ExtensionArray stuff lands or whatever. Just lemme know! 😄

@TomAugspurger
Copy link
Contributor

Superseded by #23186 and #22325

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.unique doesn't actually return unique elements of SparseArray
5 participants