COMPAT: unique() should preserve the dtype of the input #27874

stuarteberg · 2019-08-12T18:38:24Z

The behavior of pd.unique() is surprising, because -- unlike np.unique() -- the result does not have the same dtype as the input:

In [1]: pd.Series([1,2,3], dtype=np.uint8).unique()
Out[1]: array([1, 2, 3], dtype=uint64)

This PR just casts the output array to match the input dtype. Supersedes #27869.
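To illustrate the idea of casting the result back to the caller's dtype, here is a minimal NumPy-only sketch. It is not the code in pandas/core/algorithms.py: the function name `unique_keep_dtype` is hypothetical, the widening step only stands in for a 64-bit uniqueness pass, and integer input is assumed.

```python
import numpy as np


def unique_keep_dtype(values):
    """Order-preserving unique that returns the caller's dtype.

    Hypothetical helper for illustration; assumes integer input.
    """
    values = np.asarray(values)
    # Stand-in for a 64-bit uniqueness pass: widen to the matching 64-bit type.
    wide_dtype = np.uint64 if values.dtype.kind == "u" else np.int64
    widened = values.astype(wide_dtype)
    # np.unique sorts its output, so recover first-appearance order
    # from the indices of the first occurrences.
    _, first_idx = np.unique(widened, return_index=True)
    uniques = widened[np.sort(first_idx)]
    # Cast back to the input dtype; copy=False skips the copy when the
    # dtype already matches.
    return uniques.astype(values.dtype, copy=False)


narrow = np.array([3, 1, 3, 2], dtype=np.uint8)
result = unique_keep_dtype(narrow)
print(result, result.dtype)
```

For comparison, `np.unique(narrow)` also keeps `uint8`, but it returns the values in sorted order rather than in order of appearance.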

Update: Augmented the tests to cover narrow dtypes.

I added a new assertion in test_value_counts_unique_nunique(), but it may not be sufficient. From what I can see, there isn't good coverage of Series whose data is not int, float, etc.; only the various index types are well covered. Any advice concerning test coverage?

closes DOC: Clarify that unique() promotes dtype to 64-bit #27869
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback

this would need actual tests for non 64bit dtypes

stuarteberg · 2019-08-14T19:28:04Z

@jreback I added the requested test cases and rebased. The CI is passing now.

BTW, I had a problem with an unrelated test. It was expected to fail, but in this PR it magically started passing -- under Python 3.5 only. Probably somehow related to dict ordering. Anyway, I "fixed" the issue by simply permitting the test to xpass under Python 3.5.

FWIW, I'm not the only one who ran into that xfail problem. It was also encountered recently in PR #27762.

TomAugspurger · 2019-08-14T19:48:25Z

pandas/tests/series/test_analytics.py

@@ -1489,7 +1490,8 @@ def test_value_counts_with_nan(self):
            "unicode_",
            "timedelta64[h]",
            pytest.param(
-                "datetime64[D]", marks=pytest.mark.xfail(reason="GH#7996", strict=True)
+                "datetime64[D]",
+                marks=pytest.mark.xfail(reason="GH#7996", strict=not PY35),


I don't understand this xfail. Typically we just reference open issues. What's causing this to fail on 3.5? What's the failure you get?

What's causing this to fail on 3.5?

Just to be clear: The problem is that this test doesn't fail in Python 3.5. But since it's marked with xfail(..., strict=True), that breaks the test suite.

You can see a failed build log here:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=15986

The relevant lines are:

=================================== FAILURES ===================================
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[True-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[False-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
TestCategoricalSeriesAnalytics.test_drop_duplicates_categorical_non_bool[None-datetime64[D]]
[gw1] linux -- Python 3.5.3 /home/vsts/miniconda3/envs/pandas-dev/bin/python
[XPASS(strict)] GH#7996
-------- generated xml file: /home/vsts/work/1/s/test-data-multiple.xml --------
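The XPASS(strict) entries in that log come from how pytest treats a strict xfail: an unexpected pass is reported as a failure. A small sketch of a mark that is strict everywhere except Python 3.5 might look like the following. The `PY35` flag is redefined locally as a stand-in for the one used in the diff, and `xfail_unless_py35` and the test body are hypothetical.

```python
import sys

import pytest

# Local stand-in for the PY35 flag referenced in the diff.
PY35 = sys.version_info[:2] == (3, 5)


def xfail_unless_py35(reason):
    """Build an xfail mark that is strict on every version except 3.5.

    With strict=True, an unexpected pass (XPASS) fails the test run.
    With strict=False, an XPASS is only reported.
    """
    return pytest.mark.xfail(reason=reason, strict=not PY35)


@pytest.mark.parametrize(
    "dtype",
    [
        "int64",
        pytest.param("datetime64[D]", marks=xfail_unless_py35("GH#7996")),
    ],
)
def test_placeholder(dtype):
    # Placeholder body: fails only for the parameter marked xfail.
    assert dtype != "datetime64[D]"
```

Relaxing `strict` this way silences the XPASS on one interpreter, but it does not explain why the test passes there.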

What's the failure you get?

In Python 3.7, the test xfails as expected. Removing xfail identifies this line as the problem. And here's what pytest shows in that case:

>       raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [left]:  [False, True, True, True]
E   [right]: [False, False, False, True]

pandas/_libs/testing.pyx:178: AssertionError

I don't understand this xfail. Typically we just reference open issues.

The referenced issue seems to imply that the trouble is related to converting/comparing datetime64[D] to datetime64[ns]. In this test, the input is datetime64[D], but it's implicitly converted to datetime64[ns] when it is loaded into a Categorical.

One simple hack to make this test pass is to change datetime64[D] to datetime64[ns]. That doesn't seem appropriate, though.

Anyway, to get an idea of what is actually going wrong, here's what happens when I try the test's first three lines in my terminal. Note that the first item becomes NaT for some reason.

In [190]: cat_array = np.array([1, 2, 3, 4, 5], dtype=np.dtype(dtype))
In [191]: input1 = np.array([1, 2, 3, 3], dtype=np.dtype(dtype))
In [192]: tc1 = Series(Categorical(input1, categories=cat_array, ordered=False))
In [193]: tc1
Out[193]:
0          NaT
1   1970-01-03
2   1970-01-04
3   1970-01-04
dtype: category
Categories (5, datetime64[ns]): [1970-01-02, 1970-01-03, 1970-01-04, 1970-01-05, 1970-01-06]

doc/source/whatsnew/v0.25.1.rst

pandas/core/algorithms.py

jreback · 2019-08-15T13:40:11Z

pandas/tests/test_base.py

@@ -187,7 +187,24 @@ def setup_method(self, method):
        types = ["bool", "int", "float", "dt", "dt_tz", "period", "string", "unicode"]
        self.indexes = [getattr(self, "{}_index".format(t)) for t in types]
        self.series = [getattr(self, "{}_series".format(t)) for t in types]
-        self.objs = self.indexes + self.series
+
+        # To test narrow dtypes, we use narrower *data* elements, not *index* elements


this whole thing badly needs parameterization.

can you pull out the unique tests and do that instead of repeating all of this?
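As a rough idea of what a parameterized version could look like, here is a sketch using `pytest.mark.parametrize`. It is not the test that was merged: the test name and the dtype list are assumptions, and it relies on a pandas version that already includes the dtype-preserving behavior from this PR.

```python
import numpy as np
import pandas as pd
import pytest

# Assumed list of narrow dtypes for illustration.
NARROW_DTYPES = ["uint8", "int16", "uint32", "float32"]


@pytest.mark.parametrize("dtype", NARROW_DTYPES)
def test_series_unique_preserves_dtype(dtype):
    # Build the Series from narrow *data*, not from a narrow index.
    data = np.array([1, 2, 2, 3], dtype=dtype)
    result = pd.Series(data).unique()

    # The uniques should come back in order of appearance, in the input dtype.
    assert result.dtype == np.dtype(dtype)
    np.testing.assert_array_equal(result, np.array([1, 2, 3], dtype=dtype))
```

Each dtype becomes its own test case, so a regression for one narrow dtype is reported separately instead of being hidden inside a shared setup method.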

jreback · 2019-09-08T16:02:19Z

can you merge master; update the release note to 1.0, will have a look after

stuarteberg · 2019-09-09T22:52:38Z

Sorry I haven't had time to look at this. I'll try to get to it next week.

jreback · 2019-10-06T22:53:13Z

@stuarteberg looked reasonable, can you merge master and move the note to 1.0.0

stuarteberg · 2019-10-07T00:00:27Z

@jreback

looked reasonable, can you merge master and move the note to 1.0.0

Done. Sorry I haven't cleaned up the tests yet. If it's sufficient as-is, great. If not, let me know.

jreback · 2019-10-07T00:12:31Z

thanks @stuarteberg this looks great! Can you create an issue to parameterize test_base (where indicated)? A PR would be most appreciated for that as well (if you can).

jreback · 2019-10-07T00:16:07Z

this did not have any issue (other than the PR itself) associated? can you do a quick search to see if this solves any open issues?

stuarteberg · 2019-10-07T00:51:04Z

this did not have any issue (other than the PR itself) associated?
can you do a quick search to see if this solves any open issues?

OK, I couldn't find an issue for this in particular, but I just found #22824, which has a much broader scope. It does mention the return type issue:

Change return type for [Series/Index].unique to be same as caller (deprecation cycle by introducing raw=None which at first defaults to True?)

Since this PR doesn't resolve everything mentioned in that issue, it should remain open, but I left a comment there referencing this PR.

Can you create an issue to parameterize test_base (where indicated)?

OK, I have an issue draft ready to go (pasted into the details section below), but GitHub noticed that #23877 is similar. Is that already sufficient, or would you like me to open the new issue anyway?

Issue draft regarding test_base

The tests in tests/test_base.py exercise the behavior of both Series and Indexes of various dtypes, but without using the standard pytest mechanisms for parameterization. In particular, the setup of the Ops base class should be cleaned up (or removed) in favor of proper pytest fixtures.

FWIW, this was noticed while reviewing #27874, so @jreback requested this issue to be opened.

jreback · 2019-10-07T01:08:11Z

@stuarteberg no #23877 is ok

jreback · 2019-10-07T01:08:28Z

thanks @stuarteberg

stuarteberg force-pushed the fix-unique-dtype branch from 21f7003 to 979843e Compare August 12, 2019 18:38

stuarteberg mentioned this pull request Aug 12, 2019

DOC: Clarify that unique() promotes dtype to 64-bit #27869

Closed

2 tasks

jreback requested changes Aug 12, 2019

stuarteberg force-pushed the fix-unique-dtype branch 4 times, most recently from 422804c to 7262d9e Compare August 14, 2019 18:25

TomAugspurger reviewed Aug 14, 2019

TomAugspurger added this to the 1.0 milestone Aug 14, 2019

TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Aug 14, 2019

TomAugspurger reviewed Aug 14, 2019

stuarteberg force-pushed the fix-unique-dtype branch 2 times, most recently from 186a544 to 71c6d1b Compare August 14, 2019 20:19

jreback requested changes Aug 15, 2019

stuarteberg added 2 commits October 6, 2019 19:58

API: unique() should preserve the dtype of the input

602c55d

_reconstruct_data(): Use copy=False when calling astype()

1aec960

stuarteberg force-pushed the fix-unique-dtype branch from 71c6d1b to 1aec960 Compare October 6, 2019 23:59

jreback approved these changes Oct 7, 2019

jreback changed the title ~~unique() should preserve the dtype of the input~~ COMPAT: unique() should preserve the dtype of the input Oct 7, 2019

jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 7, 2019

stuarteberg mentioned this pull request Oct 7, 2019

API/ENH: overhaul/unify/improve .unique #22824

Open

6 tasks

jreback merged commit af498fe into pandas-dev:master Oct 7, 2019
