API: change unique to return Index #13979

sinhrks · 2016-08-13T01:39:54Z

closes #13395 (Index.unique() should always return an Index object of the same type), closes #13565 (Call unique() on a timezone aware datetime series returns non timezone aware result)
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-08-13T01:52:29Z

What about for MultiIndex?

sinhrks · 2016-08-13T01:56:50Z

@jreback changed to return MI. Adding explicit tests.

mi = pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]])
mi 
# MultiIndex(levels=[[1, 2], [1, 2]],
#            labels=[[0, 1, 0, 1], [0, 0, 0, 1]])

# on current master
mi.unique()
# array([(1, 1), (2, 1), (2, 2)], dtype=object)

# after this PR
pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]]).unique()
# MultiIndex(levels=[[1, 2], [1, 2]],
#            labels=[[0, 1, 1], [0, 0, 1]])
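
A minimal sketch of the behaviour described above, assuming a pandas release that includes this change (0.19.0 or later). It checks the type and the values rather than the repr, because the `labels` attribute shown in the output above was renamed in later releases.

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]])
uniq = mi.unique()

# After the change, the result keeps the MultiIndex type
# instead of degrading to an object ndarray of tuples.
print(type(uniq).__name__)

# Unique tuples are kept in order of first appearance.
print(list(uniq))
```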

jreback · 2016-08-13T01:59:28Z

pandas/core/base.py

+            result = unique1d(values)
+
+        if isinstance(self, ABCCategoricalIndex):
+            # CategoricalIndex._shallow_copy uses keeps original categories


I don't think you need a special case for CI;
it defaults to the same categories and ordered.

We have to update categories because Categorical.unique reorders categories if ordered=False. This can be simplified if Categorical.unique doesn't change categories.

https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L1753

I am not sure anymore why we decided to reorder the categories, but that does not sound logical to me. @sinhrks do you recall the reasoning?

The values in the index should be in order of appearance, but that does not mean the categories attribute should be changed

hmm, I think we did have some discussion about this, but I don't remember,
cc @JanSchulz

The PR is #10508, and the why seems to be #10505. So @sinhrks needs to answer this :-)

Somehow this looks a lot like a user-facing API was changed to fix an internal user that gave an undesired result (probably groupby shouldn't return empty groups?). Why does unique change the categories at all? IMO it should just return all categories even if some are unused: you also don't change maxint if it isn't in an integer series.

Thanks for the reminder, I'd clean forgotten it.

It was mainly done to keep the category dtype, and I'm not sure whether keeping the categories order breaks anything. Let me check...

I don't think the actual decision on how to sort (or not sort) the return value of Categorical.unique mattered for the bug fix. But, in retrospect, I think we made the wrong choice there. IMO, categoricals should just follow everything else: unique values in order of appearance, and leave the dtype/categories alone.
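
To make the point of contention concrete, here is a small sketch (the sample data is illustrative, not taken from the PR). The order of the returned values is stable, but what happens to the `categories` attribute is the behaviour being debated and is known to differ between pandas releases, so the example prints it instead of asserting it.

```python
import pandas as pd

# "d" is declared as a category but never used in the values.
cat = pd.Categorical(["b", "a", "b", "c"], categories=["a", "b", "c", "d"])
uniq = cat.unique()

# The unique values come back in order of first appearance.
print(list(uniq))

# Whether the unused category "d" is dropped, and whether the remaining
# categories are reordered, depends on the pandas version in use.
print(list(uniq.categories))
```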

codecov-io · 2016-08-13T04:32:12Z

Current coverage is 85.27% (diff: 100%)

Merging #13979 into master will increase coverage by <.01%

@@             master     #13979   diff @@
==========================================
  Files           139        139          
  Lines         50502      50511     +9   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43063      43071     +8   
- Misses         7439       7440     +1   
  Partials          0          0

Powered by Codecov. Last update be61825...28a2781

jreback · 2016-08-20T13:37:51Z

@sinhrks ok as per discussion

Index -> should return Index
Series -> match return types now (no API change):

In [1]: Series([1,2,3]).unique()
Out[1]: array([1, 2, 3])

In [2]: Series([1,2,3.5]).unique()
Out[2]: array([ 1. ,  2. ,  3.5])

In [3]: Series(['foo','bar','baz']).unique()
Out[3]: array(['foo', 'bar', 'baz'], dtype=object)

In [4]: Series(['foo','bar','baz'],dtype='category').unique()
Out[4]: 
[foo, bar, baz]
Categories (3, object): [foo, bar, baz]

In [6]: Series(pd.date_range('20130101',periods=3)).unique()
Out[6]: 
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
       '2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')

# as discussed in the other issue, this should be an object array of Timestamps (and is an API change)
In [7]: Series(pd.date_range('20130101',periods=3,tz='US/Eastern')).unique()
Out[7]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

# to be honest would change this too to an object array of ``Timedelta``s, but leave for now
In [8]: Series(pd.timedelta_range('1 day',periods=3)).unique()
Out[8]: array([ 86400000000000, 172800000000000, 259200000000000], dtype='timedelta64[ns]')

# good
In [9]: Series(pd.period_range('2013-1',periods=3,freq='M')).unique()
Out[9]: 
array([Period('2013-01', 'M'), Period('2013-02', 'M'),
       Period('2013-03', 'M')], dtype=object)
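
The summary above can be checked with a short sketch. It assumes a pandas release that includes this PR. The container returned for timezone-aware data has varied across releases (an object array of `Timestamp` as proposed here, an extension array in later versions), so the example only inspects the elements.

```python
import numpy as np
import pandas as pd

# Plain numeric data: Series.unique() returns a NumPy array (no API change).
ints = pd.Series([1, 2, 2, 3]).unique()
print(type(ints).__name__, ints.tolist())

# Timezone-aware data: after the change, the timezone is no longer dropped.
aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern")).unique()
first = aware[0]
print(type(first).__name__, first.tz)
```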

sinhrks · 2016-08-20T14:05:11Z

yep, will update.

jorisvandenbossche · 2016-08-21T13:18:00Z

doc/source/whatsnew/v0.19.0.txt

@@ -1005,6 +1034,7 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` with ``engine='c'`` in which fields were not properly cast to float when quoting was specified as non-numeric (:issue:`13411`)
 - Bug in ``pd.read_csv``, ``pd.read_table``, ``pd.read_fwf``, ``pd.read_stata`` and ``pd.read_sas`` where files were opened by parsers but not closed if both ``chunksize`` and ``iterator`` were ``None``. (:issue:`13940`)
 - Bug in ``StataReader``, ``StataWriter``, ``XportReader`` and ``SAS7BDATReader`` where a file was not properly closed when an error was raised. (:issue:`13940`)
+- Bug in ``Series.unique()`` with datetime and timezone returns unique values without timezone, rather than return array of ``Timestamp`` with timezone (:issue:`13565`)


I would put this also in the api changes section, as it was not a bug (it was a deliberate decision I think, but we changed our mind)

jorisvandenbossche · 2016-08-21T13:34:34Z

Looks good to me!
Small comment on the whatsnew note, but that can also be done after merge.

sinhrks · 2016-08-21T22:15:55Z

Moved Series.unique description to API change.

Just in case, categorical Series.unique returns Categorical rather than ndarray now (not changed by this PR)

pd.Series([1, 2], dtype='category').unique()
# [1, 2]
# Categories (2, int64): [1, 2]

jreback · 2016-08-25T10:20:58Z

doc/source/whatsnew/v0.19.0.txt

+``Index.unique`` consistently returns ``Index``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``Index.unique()`` now return unique values as


Index.unique() now returns unique values as an Index of the appropriate dtype
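
A brief sketch of what "an Index of the appropriate dtype" means in practice, assuming a pandas release that includes this PR. The sample indexes are illustrative and not taken from the PR's tests.

```python
import pandas as pd

indexes = [
    pd.Index([1, 1, 2]),
    pd.DatetimeIndex(["2013-01-01", "2013-01-01", "2013-01-02"], tz="US/Eastern"),
    pd.CategoricalIndex(["a", "a", "b"]),
]

for idx in indexes:
    uniq = idx.unique()
    # The result should be the same Index subclass as the input,
    # not a bare ndarray.
    print(type(idx).__name__, "->", type(uniq).__name__, len(uniq))
```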

jorisvandenbossche · 2016-08-28T18:38:32Z

@sinhrks There is a small linting error:

pandas/indexes/base.py:18:1: F401 'pandas.types.generic.ABCCategoricalIndex' imported but unused

Can be merged apart from that.

jorisvandenbossche · 2016-08-29T12:29:13Z

@sinhrks Thanks a lot!

sinhrks added the Bug, Indexing, API Design, and Compat labels Aug 13, 2016

sinhrks added this to the 0.19.0 milestone Aug 13, 2016

sinhrks force-pushed the unique_index branch from c00403e to 3cf0abd Compare August 13, 2016 01:41

sinhrks mentioned this pull request Aug 13, 2016

Call unique() on a timezone aware datetime series returns non timezone aware result #13565

Closed

jreback reviewed Aug 13, 2016

sinhrks force-pushed the unique_index branch 2 times, most recently from d4c1974 to 3839d3b Compare August 13, 2016 04:32

This was referenced Aug 13, 2016

CLN: Datetimelike._can_hold_na #13983

Merged

API: Add Series/Index.unique(dropna=True) #13984

Closed

sinhrks force-pushed the unique_index branch 2 times, most recently from f001cdd to 8dcf98d Compare August 21, 2016 04:22

jorisvandenbossche reviewed Aug 21, 2016

jorisvandenbossche mentioned this pull request Aug 21, 2016

unique docstring extend #13565 unique datetime tz issue #14045

Closed

sinhrks force-pushed the unique_index branch from 8dcf98d to 928b23f Compare August 21, 2016 21:56

jreback reviewed Aug 25, 2016

sinhrks force-pushed the unique_index branch from 928b23f to 252256d Compare August 26, 2016 23:26

sinhrks force-pushed the unique_index branch from 252256d to 55c836c Compare August 26, 2016 23:28

API: change unique to return Index

28a2781

sinhrks force-pushed the unique_index branch from 55c836c to 28a2781 Compare August 29, 2016 01:46

jorisvandenbossche merged commit 5a20ea2 into pandas-dev:master Aug 29, 2016

sinhrks deleted the unique_index branch August 29, 2016 22:41

h-vetinari mentioned this pull request Sep 25, 2018

API/ENH: overhaul/unify/improve .unique #22824

Open

