Index.unique() should always return an Index object of the same type #13395

shoyer · 2016-06-08T01:15:07Z

This should also be noted in the docstring for the method.

Currently, it sometimes returns numpy arrays:

>>> pd.DatetimeIndex(['2000-01-01', '2000-01-02']).unique()
DatetimeIndex(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]', freq=None)

>>> pd.Index([0, 1, 2]).unique()
array([0, 1, 2])

Most of the work here is probably writing comprehensive tests to check each index type.

xref: https://github.com/pydata/pandas/pull/13361/files/17209f92330c5e949934aec9dea039b35faf6e40#r66179418

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2016-06-08T08:33:57Z

At the moment, I think DatetimeIndex is rather the exception, as most seem to return a numpy array (and CategoricalIndex a Categorical):

In [18]: pd.Index(['a', 'a', 'b']).unique()
Out[18]: array(['a', 'b'], dtype=object)

In [19]: pd.Int64Index([1,1,2]).unique()
Out[19]: array([1, 2], dtype=int64)

In [20]: pd.Float64Index([1,1,2]).unique()
Out[20]: array([ 1.,  2.])

In [21]: pd.CategoricalIndex(['a','a','b']).unique()
Out[21]:
[a, b]
Categories (2, object): [a, b]

jreback · 2016-06-08T11:04:23Z

this is a dupe of #4126

jreback · 2016-06-08T11:04:46Z

closing the other one actually.

jorisvandenbossche · 2016-06-08T14:56:04Z

Are there possible back compat issues? I think most/all methods of array should work similar for Index (as it was previously a subclass), but I am not sure to what extent users would rely on the fact that it returns an array now.
What with ``Series.unique`? This also returns arrays now.

jreback · 2016-06-08T15:06:14Z

yeah I dont' think we ever changed Series.unique() which was for numpy compat, should change that too. These all need to go in 0.19.0 I think though.

shoyer · 2016-06-08T15:10:14Z

One reason not to change Series.unique() is that there is no meaningful index to associate with the unique values. That said, that is probably more useful overall (especially once you consider non-numpy dtypes).

This definitely needs to go in a major release because it will break some user code.

jorisvandenbossche · 2016-08-13T13:28:32Z

In the PR of @sinhrks, it is now proposed to return an Index of the same type for both Index and Series.

While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about Series.unique returning Index objects (I think that would be surprising). I would personally rather return Series objects with a default range index.

jreback · 2016-08-13T13:31:43Z

I don't agree

then you are then giving meaning to the index of the series that you are returning when it doesn't have any meaning (the ordering actually does have meaning but that is true in either case)

so returning an Index is the correct action here

jorisvandenbossche · 2016-08-13T13:49:12Z

Of course this boils down to not having a good array-like container that can hold all pandas supported types .. (Index is such a container, and can be used for that, but IMO to users it is not, to users it are the labels of the index/columns of a DataFrame/Series).

Options:

leave as is (np.array for most, Categorical for category series, possibly DatetimeIndex for timezone aware (this is currently not yet the case)), and try solve this in pandas 2.0
consistently np.array (with the consequence of loosing type information, or object array of pandas objects to keep type info)
consistently Index (can hold all dtypes, but IMO confusing object)
consistently Series (can hold all dtypes, but has meaningless default index that is not needed)

jreback · 2016-08-13T13:52:00Z

I disagree, Index IS the container object and is most appropriate

Series is plain confusing

sinhrks · 2016-08-13T17:08:07Z

i think it's natural that Index.unique returns Index.

Series may need discussion, but returnig Series with meaningless label does not sound good.

shoyer · 2016-08-13T19:15:56Z

While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about Series.unique returning Index objects (I think that would be surprising).

I agree. Returning an index for Series.unique would be quite surprising.

For Series.unique, I don't think we have any good options prior to pandas 2.0. I would stick with returning numpy arrays for now.

jreback · 2016-08-13T19:43:27Z

@shoyer

I agree. Returning an index for Series.unique would be quite surprising.

you seem to be against natural things and seem to want pandas to be like numpy
can you explain this reasoning

shoyer · 2016-08-13T20:11:31Z

seem to want pandas to be like numpy

You misunderstand me. This is about what feels consistent with the current version of pandas:

For 1D labeled arrays, we use Series
For labels, we use Index
For unlabeled values, we use np.ndarray or Categorical

jreback · 2016-08-17T22:26:39Z

ok, I'll change my opinion here.

I can see Series.unique() returning a Series with a default (range) Index. The reason is that this is a mutable object, rather than an immutable Index. Furthermore the resultant index of the returned object, actually does have meaning. Its the order of appearance in the unique, so this does firm up the API.

shoyer · 2016-08-17T22:32:38Z

As I mentioned in #13944, in pandas 2.0, I think the logical type for the return value of Series.unique() is the appropriate ScalarArray type. If that it where we want to end up, I don't think it makes sense to have additional churn now.

jreback · 2016-08-18T10:03:35Z

@shoyer yes and if pandas 2.0 was around the corner and we DIDN't have a 1.0 I would agree. However, we very-very rarely expose raw ndarrays to the user ATM. Aside from .values which is an impl detail, this is the remaining case (IIRC). SO this is wildly inconsistent, and should not survive a freeze of API.

shoyer · 2016-08-18T16:17:19Z

My opinion is that we should not introduce any breaking changes in 1.0 that
we know we will break again in 2.0. I'm happy to hear what others think,
though.

On Thu, Aug 18, 2016 at 3:03 AM, Jeff Reback notifications@github.com
wrote:

@shoyer https://github.com/shoyer yes and if pandas 2.0 was around the
corner and we DIDN't have a 1.0 I would agree. However, we very-very rarely
expose raw ndarrays to the user ATM. Aside from .values which is an impl
detail, this is the remaining case (IIRC). SO this is wildly inconsistent,
and should not survive a freeze of API.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#13395 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKS1pcj5tZFMXTs2l2enTCgDEvm91oQks5qhC37gaJpZM4IwglC
.

jorisvandenbossche · 2016-08-18T20:25:50Z

In the current pandas, I would vote for returning a Series, although also not ideal. But I agree with @shoyer that if we change it again for 2.0, it is maybe not very beneficial to change this for 1.0 as well.

To be clear, we should care that the "but we will change that for 2.0" does not become a reason to not do any needed changes anymore now. But, in this case, I personally don't think the return value of unique is that much of a pain point currently (with the exception of tz-aware datetime*), in most cases it just gives you what you want.

* we could also return an object array of timestamps for that specific case

jreback · 2016-08-18T22:32:24Z

ok, for 0.19.0 we need to change Index.unique for sure. That's a given as its the only inconsistent method and will ultimately return an Index in any future version.

Ok, so the only question then is to make Series.unique return a Series, Index, or leave as an ndarray.

If it should eventually return a Series, then we should change it now. I suppose if it ultimately will return a 1-d Array type (TBD), which is mutable (and not an Index), then I guess it could leave as a ndarray now.

But these are just way to many iffs. This needs to be resolved asap. @wesm why don't you weigh in here.

wesm · 2016-08-18T23:43:53Z

Just read through this.

In pandas 2.0 Series.unique is almost certainly going to return a pandas.Array (which is just a container type around some other data, possibly just a NumPy array in many cases) as @shoyer said.

Several problems with Series.unique returning a Series right now

The association of a new index with the result feels unclean to me. Let's look at this code:

In [10]: s = pd.Series([1,2,3,4] * 4)

In [11]: unique_vals = s.unique()

In [12]: from pandas.util.testing import rands

In [13]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique
    ...: _vals))])

In [14]: df
Out[14]: 
            uniques
mB2LJrlOw5        1
qPF14xkGNl        2
0nE5HHGM0d        3
AbQEAYpYmW        4

If unique returned a Series, then you know what would happen:

In [18]: unique_vals = pd.Series(unique_vals)

In [19]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique
    ...: _vals))])

In [20]: df
Out[20]: 
            uniques
LiQZXm6K5V      NaN
B8HABWAK2o      NaN
4hIrDH3Ue0      NaN
JpaO9iMWTP      NaN

Contrived as this may be is an enough of a concern to make me -0 on this and very nearly -1

by making Series.unique return a Series now, we introduce an API contract that we probably will want to break in pandas 2.x (since we'll have a consistent array container type that can be returned)

I agree it sort of stinks that we have both ndarray and non-ndarray (e.g. categorical) return values for Series.unique, but this is the bed we've made for ourselves.

wesm · 2016-08-18T23:49:03Z

(I agree that Index.unique should always return an Index)

sinhrks · 2016-08-18T23:58:50Z

I'm +1 to leave Series as it is to avoid repeated API changes.

One issue related to returning ndarray is #13565. We can use drop_duplicates if we actually need Series

jorisvandenbossche · 2016-08-19T08:17:04Z

Nice example of @wesm how series vs array could break code, so let's not do that (although the reindexing behaviour of the constructors is maybe also a point for discussion ...)

For the issue in #13565 (return value for unique of a tz aware series), options are:

document the current situation (looses the tz information, as a datetime64 array is returned)
change to object array of tz aware Timestamp objects
(return tz aware index for this specific case, to be fully inconsistent :-))

I would go for the first or the second, but not really a preference.

jreback · 2016-08-19T10:00:40Z

This was discussed here in the original issue.

In [19]: s
Out[19]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

# this is converted to UTC because numpy cannot hold anything else.
In [20]: s.values
Out[20]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

# this does not lose tz data (though is non-performant).
In [24]: np.array(list(iter(s._values)))
Out[24]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', freq='D'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', freq='D'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', freq='D')], dtype=object)

Originally I had this returning a DTI, however It was suggested that numpy compat was more important here. But it is quite simple to just return an object dtyped array (e.g. [21])``. Whenever we have a tz-aware Series (whether doing unique or not).

So changing to 2) IMHO is the best; we should also change .values to be the same to preserve meta-data.

wesm · 2016-08-19T19:21:57Z

@jorisvandenbossche the conforming / reindexing behavior of the DataFrame ctor is a super valuable feature (and one of the very earliest ones from pandas 0.1) in my experience (it also saves a ctor-then-reindex step which results in an extra sweep of the data and copy). You can pass in a bunch of irregularly indexed data and "pluck" out the data that matches a particular "master" index that you have set. The alternative is to pass in label-naive arrays, which seems like an acceptable compromise

jorisvandenbossche · 2016-08-19T20:56:01Z

@wesm getting a bit off topic ... but: reindexing is undoubtly a very valuable operation, I am personally just not sure if it should be the behaviour of the default constructor (could also be a dedicated method). It also leads to suprises and bugs/ambiguous behaviour. I recall some discussion in #9237, where it was not clear which should happen first (reindex based on given columns, or determine index values based on passed objects), and with Series objects with names resulting in an empty frame when specifying columns=

wesm · 2016-08-19T22:58:19Z

I see. We should discuss this separately, it seems like there was a consistency issue in the treatment of a single Series versus a dict of Series (i.e. determining a row index from the input series prior to selecting only the columns in columns=) -- semantically you would think the single Series should be promoted to a dict and then handled identically. I agree it's an area that can cause people issues

shoyer added Indexing Related to indexing on series/frames, not to indexes themselves Difficulty Novice labels Jun 8, 2016

shoyer mentioned this issue Jun 8, 2016

API/ENH: union Categorical #13361

Closed

jreback closed this as completed Jun 8, 2016

jreback reopened this Jun 8, 2016

jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Advanced and removed Difficulty Novice labels Jun 8, 2016

jreback added this to the Next Major Release milestone Jun 8, 2016

jreback added Difficulty Intermediate and removed Difficulty Advanced labels Jun 8, 2016

jreback modified the milestones: 0.19.0, Next Major Release Jun 8, 2016

jreback mentioned this issue Jul 5, 2016

Call unique() on a timezone aware datetime series returns non timezone aware result #13565

Closed

sinhrks mentioned this issue Aug 13, 2016

API: change unique to return Index #13979

Merged

4 tasks

shoyer mentioned this issue Aug 13, 2016

DOC: Design drafts to assist with next-gen pandas internals discussion #13944

Closed

jreback modified the milestones: 0.19.0, 0.20.0 Aug 18, 2016

jorisvandenbossche closed this as completed in #13979 Aug 29, 2016

jorisvandenbossche mentioned this issue Oct 17, 2016

.unique() inconsistency between Series and Index objects #14437

Closed

jreback mentioned this issue Nov 1, 2016

unique(), values discards timezone information in some cases #14548

Closed

h-vetinari mentioned this issue Sep 25, 2018

API/ENH: overhaul/unify/improve .unique #22824

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index.unique() should always return an Index object of the same type #13395

Index.unique() should always return an Index object of the same type #13395

shoyer commented Jun 8, 2016

jorisvandenbossche commented Jun 8, 2016

jreback commented Jun 8, 2016

jreback commented Jun 8, 2016

jorisvandenbossche commented Jun 8, 2016 •

edited

Loading

jreback commented Jun 8, 2016 •

edited

Loading

shoyer commented Jun 8, 2016

jorisvandenbossche commented Aug 13, 2016

jreback commented Aug 13, 2016 •

edited

Loading

jorisvandenbossche commented Aug 13, 2016

jreback commented Aug 13, 2016

sinhrks commented Aug 13, 2016

shoyer commented Aug 13, 2016

jreback commented Aug 13, 2016

shoyer commented Aug 13, 2016

jreback commented Aug 17, 2016

shoyer commented Aug 17, 2016

jreback commented Aug 18, 2016

shoyer commented Aug 18, 2016

jorisvandenbossche commented Aug 18, 2016

jreback commented Aug 18, 2016 •

edited

Loading

wesm commented Aug 18, 2016

wesm commented Aug 18, 2016 •

edited

Loading

sinhrks commented Aug 18, 2016

jorisvandenbossche commented Aug 19, 2016

jreback commented Aug 19, 2016 •

edited

Loading

wesm commented Aug 19, 2016

jorisvandenbossche commented Aug 19, 2016

wesm commented Aug 19, 2016

Index.unique() should always return an Index object of the same type #13395

Index.unique() should always return an Index object of the same type #13395

Comments

shoyer commented Jun 8, 2016

jorisvandenbossche commented Jun 8, 2016

jreback commented Jun 8, 2016

jreback commented Jun 8, 2016

jorisvandenbossche commented Jun 8, 2016 • edited Loading

jreback commented Jun 8, 2016 • edited Loading

shoyer commented Jun 8, 2016

jorisvandenbossche commented Aug 13, 2016

jreback commented Aug 13, 2016 • edited Loading

jorisvandenbossche commented Aug 13, 2016

jreback commented Aug 13, 2016

sinhrks commented Aug 13, 2016

shoyer commented Aug 13, 2016

jreback commented Aug 13, 2016

shoyer commented Aug 13, 2016

jreback commented Aug 17, 2016

shoyer commented Aug 17, 2016

jreback commented Aug 18, 2016

shoyer commented Aug 18, 2016

jorisvandenbossche commented Aug 18, 2016

jreback commented Aug 18, 2016 • edited Loading

wesm commented Aug 18, 2016

wesm commented Aug 18, 2016 • edited Loading

sinhrks commented Aug 18, 2016

jorisvandenbossche commented Aug 19, 2016

jreback commented Aug 19, 2016 • edited Loading

wesm commented Aug 19, 2016

jorisvandenbossche commented Aug 19, 2016

wesm commented Aug 19, 2016

jorisvandenbossche commented Jun 8, 2016 •

edited

Loading

jreback commented Jun 8, 2016 •

edited

Loading

jreback commented Aug 13, 2016 •

edited

Loading

jreback commented Aug 18, 2016 •

edited

Loading

wesm commented Aug 18, 2016 •

edited

Loading

jreback commented Aug 19, 2016 •

edited

Loading