API: change unique to return Index #13979

Merged
merged 1 commit into from
Aug 29, 2016

Conversation

sinhrks
Member

@sinhrks sinhrks commented Aug 13, 2016

@sinhrks sinhrks added Bug Indexing Related to indexing on series/frames, not to indexes themselves API Design Compat pandas objects compatability with Numpy or Python functions labels Aug 13, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Aug 13, 2016
@jreback
Contributor

jreback commented Aug 13, 2016

What about for MultiIndex?

@sinhrks
Member Author

sinhrks commented Aug 13, 2016

@jreback changed to return MI. Adding explicit tests.

mi = pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]])
mi 
# MultiIndex(levels=[[1, 2], [1, 2]],
#            labels=[[0, 1, 0, 1], [0, 0, 0, 1]])

# on current master
mi.unique()
# array([(1, 1), (2, 1), (2, 2)], dtype=object)

# after this PR
pd.MultiIndex.from_arrays([[1, 2, 1, 2], [1, 1, 1, 2]]).unique()
# MultiIndex(levels=[[1, 2], [1, 2]],
#            labels=[[0, 1, 1], [0, 0, 1]])

result = unique1d(values)

if isinstance(self, ABCCategoricalIndex):
    # CategoricalIndex._shallow_copy keeps original categories
Contributor

I don't think you need a special case for CI;
it defaults to the same categories and ordered.

Member Author

@sinhrks sinhrks Aug 13, 2016

We have to update categories because Categorical.unique reorders categories if ordered=False. This can be simplified if Categorical.unique doesn't change categories.

Member

I am not sure anymore why we decided to reorder the categories, but that does not sound logical to me. @sinhrks do you recall the reasoning?

The values in the index should be in order of appearance, but that does not mean the categories attribute should be changed

Contributor

hmm, I think we did have some discussion about this, but I don't remember,
cc @JanSchulz

Contributor

@jankatins jankatins Aug 18, 2016

The PR is #10508, and the why seems to be #10505. So @sinhrks needs to answer this :-)

Somehow this looks a lot like a user-facing API was changed to fix an internal user which gave an undesired result (probably groupby shouldn't return empty groups?). Why does unique change the categories at all? IMO it should just return all categories even if some are unused: you also don't change maxint if it isn't in an integer series.

Member Author

Thanks for the reminder, I'd clean forgotten it.

It was mainly done to keep the category dtype, and I'm not sure whether keeping the categories order breaks anything. Let me check...

Member

I don't think the actual decision on how to sort or not sort the return value of Categorical.unique mattered for the bug fix. But, in retrospect, I think we made the wrong choice there. IMO, categoricals should just follow everything else: unique values in order of appearance, and leave the dtype/categories alone.
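
A minimal sketch of the point under discussion, assuming a recent pandas install: the unique values of a Categorical come back in order of appearance, while the `categories` attribute is a separate question. The sample data and variable names are illustrative only, and which categories are reported depends on the pandas version in use, so the sketch prints them without claiming an expected result.

```python
import pandas as pd

# Illustrative data: "c" is a declared category that never appears in the values.
cat = pd.Categorical(["b", "a", "b", "a"], categories=["a", "b", "c"])

uniq = cat.unique()

# The unique values follow order of appearance ("b" was seen before "a").
values_in_order = list(uniq)
print(values_in_order)

# Whether unused categories are kept or reordered here is version-dependent,
# which is exactly what this review thread is debating.
print(list(uniq.categories))
```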

@sinhrks sinhrks force-pushed the unique_index branch 2 times, most recently from d4c1974 to 3839d3b Compare August 13, 2016 04:32
@codecov-io

codecov-io commented Aug 13, 2016

Current coverage is 85.27% (diff: 100%)

Merging #13979 into master will increase coverage by <.01%

@@             master     #13979   diff @@
==========================================
  Files           139        139          
  Lines         50502      50511     +9   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43063      43071     +8   
- Misses         7439       7440     +1   
  Partials          0          0          

Powered by Codecov. Last update be61825...28a2781

@jreback
Contributor

jreback commented Aug 20, 2016

@sinhrks ok as per discussion

Index -> should return Index
Series -> match return types now (no API change):

In [1]: Series([1,2,3]).unique()
Out[1]: array([1, 2, 3])

In [2]: Series([1,2,3.5]).unique()
Out[2]: array([ 1. ,  2. ,  3.5])

In [3]: Series(['foo','bar','baz']).unique()
Out[3]: array(['foo', 'bar', 'baz'], dtype=object)

In [4]: Series(['foo','bar','baz'],dtype='category').unique()
Out[4]: 
[foo, bar, baz]
Categories (3, object): [foo, bar, baz]

In [6]: Series(pd.date_range('20130101',periods=3)).unique()
Out[6]: 
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
       '2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')

#### as discussed in the other issue, this should be an object array of Timestamps (and is an API change)
In [7]: Series(pd.date_range('20130101',periods=3,tz='US/Eastern')).unique()
Out[7]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

# to be honest, I would change this too, to an object array of ``Timedelta``s, but leave it for now
In [8]: Series(pd.timedelta_range('1 day',periods=3)).unique()
Out[8]: array([ 86400000000000, 172800000000000, 259200000000000], dtype='timedelta64[ns]')

# good
In [9]: Series(pd.period_range('2013-1',periods=3,freq='M')).unique()
Out[9]: 
array([Period('2013-01', 'M'), Period('2013-02', 'M'),
       Period('2013-03', 'M')], dtype=object)
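
The split summarised above can be sketched as follows, assuming a pandas version at or after this change: `Index.unique()` yields an `Index`, while `Series.unique()` on a plain integer Series still yields a NumPy array. The variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd

data = [1, 2, 1, 3]

# Index.unique() returns an Index after this PR.
idx_unique = pd.Index(data).unique()

# Series.unique() on a NumPy-backed integer Series returns an ndarray.
ser_unique = pd.Series(data).unique()

print(type(idx_unique).__name__, list(idx_unique))
print(type(ser_unique).__name__, ser_unique.tolist())
```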

@sinhrks
Member Author

sinhrks commented Aug 20, 2016

yep, will update.

@sinhrks sinhrks force-pushed the unique_index branch 2 times, most recently from f001cdd to 8dcf98d Compare August 21, 2016 04:22
@@ -1005,6 +1034,7 @@ Bug Fixes
- Bug in ``pd.read_csv()`` with ``engine='c'`` in which fields were not properly cast to float when quoting was specified as non-numeric (:issue:`13411`)
- Bug in ``pd.read_csv``, ``pd.read_table``, ``pd.read_fwf``, ``pd.read_stata`` and ``pd.read_sas`` where files were opened by parsers but not closed if both ``chunksize`` and ``iterator`` were ``None``. (:issue:`13940`)
- Bug in ``StataReader``, ``StataWriter``, ``XportReader`` and ``SAS7BDATReader`` where a file was not properly closed when an error was raised. (:issue:`13940`)
- Bug in ``Series.unique()`` with datetime and timezone returns unique values without timezone, rather than return array of ``Timestamp`` with timezone (:issue:`13565`)
Member

I would also put this in the API changes section, as it was not a bug (it was a deliberate decision, I think, but we changed our mind).

@jorisvandenbossche
Member

Looks good to me!
Small comment on the whatsnew note, but that can also be done after merge.

@sinhrks
Member Author

sinhrks commented Aug 21, 2016

Moved Series.unique description to API change.

Just to note, categorical Series.unique now returns a Categorical rather than an ndarray (not changed by this PR):

pd.Series([1, 2], dtype='category').unique()
# [1, 2]
# Categories (2, int64): [1, 2]

``Index.unique`` consistently returns ``Index``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``Index.unique()`` now return unique values as
Contributor

Index.unique() now returns unique values as an Index of the appropriate dtype
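
As a rough illustration of "an Index of the appropriate dtype", assuming a pandas version that includes this change: calling `unique()` on a timezone-aware `DatetimeIndex` or on a `TimedeltaIndex` should give back the same Index subclass, with the timezone retained. The sample values are made up for the sketch.

```python
import pandas as pd

# Timezone-aware DatetimeIndex with a repeated entry.
dti = pd.DatetimeIndex(["2013-01-01", "2013-01-01", "2013-01-02"], tz="US/Eastern")
dti_unique = dti.unique()

# TimedeltaIndex with a repeated entry.
tdi = pd.to_timedelta(["1 day", "1 day", "2 days"])
tdi_unique = tdi.unique()

# Both results keep their Index subclass instead of dropping to an ndarray.
print(type(dti_unique).__name__, dti_unique.tz)
print(type(tdi_unique).__name__, len(tdi_unique))
```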

@jorisvandenbossche
Member

@sinhrks There is a small linting error:

pandas/indexes/base.py:18:1: F401 'pandas.types.generic.ABCCategoricalIndex' imported but unused

Can be merged apart from that.

@jorisvandenbossche jorisvandenbossche merged commit 5a20ea2 into pandas-dev:master Aug 29, 2016
@jorisvandenbossche
Member

@sinhrks Thanks a lot!

@sinhrks sinhrks deleted the unique_index branch August 29, 2016 22:41
Labels
API Design Bug Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
5 participants