Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT: unique() should preserve the dtype of the input #27874

Merged
merged 2 commits into from
Oct 7, 2019

Conversation

stuarteberg
Copy link
Contributor

@stuarteberg stuarteberg commented Aug 12, 2019

The behavior of pd.unique() is surprising, because -- unlike np.unique() -- the result does not have the same dtype as the input:

In [1]: pd.Series([1,2,3], dtype=np.uint8).unique()
Out[1]: array([1, 2, 3], dtype=uint64)

This PR just casts the output array to match the input dtype. Supercedes #27869.

Update: Augmented the tests to cover narrow dtypes.

I added a new assertion in test_value_counts_unique_nunique(), but it may not be sufficient. From what I can see, there isn't good coverage of Series whose data is not int/float/ etc. There is only good coverage of various index types. Any advice concerning test coverage?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would need actual tests for non 64bit dtypes

@stuarteberg stuarteberg force-pushed the fix-unique-dtype branch 4 times, most recently from 422804c to 7262d9e Compare August 14, 2019 18:25
@stuarteberg
Copy link
Contributor Author

@jreback I added the requested test cases and rebased. The CI is passing now.

BTW, I had a problem with an unrelated test. It was expected to fail, but in this PR it magically started passing -- under Python 3.5 only. Probably somehow related to dict ordering. Anyway, I "fixed" the issue by simply permitting the test to xpass under Python 3.5.

FWIW, I'm not the only one who ran into that xfail problem. It was also encountered recently in PR #27762.

@@ -1489,7 +1490,8 @@ def test_value_counts_with_nan(self):
"unicode_",
"timedelta64[h]",
pytest.param(
"datetime64[D]", marks=pytest.mark.xfail(reason="GH#7996", strict=True)
"datetime64[D]",
marks=pytest.mark.xfail(reason="GH#7996", strict=not PY35),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this xfail. Typically we just reference open issues. What's causing this to fail on 3.5? What's the failure you get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's causing this to fail on 3.5?

Just to be clear: The problem is that this test doesn't fail in Python 3.5. But since it's marked with xfail(..., strict=True), that breaks the test suite.

You can see a failed build log here:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=15986

The relevant lines are:

=================================== FAILURES ===================================
 TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[True-datetime64[D]] 
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
 TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[False-datetime64[D]] 
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
 TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[None-datetime64[D]] 
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
-------- generated xml file: /home/vsts/work/1/s/test-data-multiple.xml --------

What's the failure you get?

In Python 3.7, the test xfails as expected. Removing xfail identifies this line as the problem. And here's what pytest shows in that case:

>   raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [left]:  [False, True, True, True]
E   [right]: [False, False, False, True]

pandas/_libs/testing.pyx:178: AssertionError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this xfail. Typically we just reference open issues.

The referenced issue seems to imply that the trouble is related to converting/comparing datetime[D] to datetime[ns]. In this test, the input is datetime[D], but it's implicitly converted to datetime[ns] when it is loaded into a Categorical.

One simple hack to make this test pass is to change datetime[D] to datetime[ns]. That doesn't seem appropriate, though.

Anyway, to get an idea of what is actually going wrong, here's what happens when I try the test's first three lines in my terminal. Note that the first item becomes NaT for some reason.

In [190]: cat_array = np.array([1, 2, 3, 4, 5], dtype=np.dtype(dtype))

In [191]: input1 = np.array([1, 2, 3, 3], dtype=np.dtype(dtype))

In [192]: tc1 = Series(Categorical(input1, categories=cat_array, ordered=False))

In [193]: tc1
Out[193]:
0          NaT
1   1970-01-03
2   1970-01-04
3   1970-01-04
dtype: category
Categories (5, datetime64[ns]): [1970-01-02, 1970-01-03, 1970-01-04, 1970-01-05, 1970-01-06]

@TomAugspurger TomAugspurger added this to the 1.0 milestone Aug 14, 2019
@TomAugspurger TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Aug 14, 2019
@stuarteberg stuarteberg force-pushed the fix-unique-dtype branch 2 times, most recently from 186a544 to 71c6d1b Compare August 14, 2019 20:19
pandas/core/algorithms.py Show resolved Hide resolved
@@ -187,7 +187,24 @@ def setup_method(self, method):
types = ["bool", "int", "float", "dt", "dt_tz", "period", "string", "unicode"]
self.indexes = [getattr(self, "{}_index".format(t)) for t in types]
self.series = [getattr(self, "{}_series".format(t)) for t in types]
self.objs = self.indexes + self.series

# To test narrow dtypes, we use narrower *data* elements, not *index* elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole thing badly needs parameterization.

can you pull out the unique tests and do that instead of repeating all of this?

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

can you merge master; update the release note to 1.0, will have a look after

@stuarteberg
Copy link
Contributor Author

Sorry I haven't had time to look at this. I'll try to get to it next week.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@stuarteberg looked reasonable, can you merge master and move the note to 1.0.0

@stuarteberg
Copy link
Contributor Author

@jreback

looked reasonable, can you merge master and move the note to 1.0.0

Done. Sorry I haven't cleaned up the tests yet. If its sufficient as-is, great. If not, let me know.

@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

thanks @stuarteberg this looks great! Can you create an issue to parameterize test_base (where indicated)? A PR would be most appreciated for that as well (if you can).

@jreback jreback changed the title unique() should preserve the dtype of the input COMPAT: unique() should preserve the dtype of the input Oct 7, 2019
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 7, 2019
@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

this did not have any issue (other than the PR itself) associated? can you do a quick search to see if this solves any open issues?

@stuarteberg
Copy link
Contributor Author

this did not have any issue (other than the PR itself) associated?
can you do a quick search to see if this solves any open issues?

OK, I couldn't find an issue for this in particular, but I just found #22824, which has a much broader scope. It does mention the return type issue:

  • Change return type for [Series/Index].unique to be same as caller (deprecation cycle by introducing raw=None which at first defaults to True?)

But since this PR doesn't resolve everything mentioned in that issue, it should remain open. But I made a comment there referencing this PR.

Can you create an issue to parameterize test_base (where indicated)?

OK, I have an issue draft ready to go (pasted into the details section below), but github noticed that #23877 is similar. Is that already sufficient, or would you like me to open the new issue anyway?

Issue draft regarding test_base

The tests in tests/test_base.py exercise the behavior of both Series and Indexes of various dtypes, but without using the standard pytest mechanisms for parameterization. In particular, the setup of the Ops base class should be cleaned up (or removed) in favor of proper pytest fixtures.

FWIW, this was noticed while reviewing #27874, so @jreback requested this issue to be opened.

@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

@stuarteberg no #23877 is ok

@jreback jreback merged commit af498fe into pandas-dev:master Oct 7, 2019
@jreback
Copy link
Contributor

jreback commented Oct 7, 2019

thanks @stuarteberg

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants