Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index.unique() should always return an Index object of the same type #13395

Closed
shoyer opened this issue Jun 8, 2016 · 28 comments · Fixed by #13979
Closed

Index.unique() should always return an Index object of the same type #13395

shoyer opened this issue Jun 8, 2016 · 28 comments · Fixed by #13979
Labels
Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@shoyer
Copy link
Member

shoyer commented Jun 8, 2016

This should also be noted in the docstring for the method.

Currently, it sometimes returns numpy arrays:

>>> pd.DatetimeIndex(['2000-01-01', '2000-01-02']).unique()
DatetimeIndex(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]', freq=None)

>>> pd.Index([0, 1, 2]).unique()
array([0, 1, 2])

Most of the work here is probably writing comprehensive tests to check each index type.

xref: https://github.com/pydata/pandas/pull/13361/files/17209f92330c5e949934aec9dea039b35faf6e40#r66179418

@shoyer shoyer added Indexing Related to indexing on series/frames, not to indexes themselves Difficulty Novice labels Jun 8, 2016
@jorisvandenbossche
Copy link
Member

At the moment, I think DatetimeIndex is rather the exception, as most seem to return a numpy array (and CategoricalIndex a Categorical):

In [18]: pd.Index(['a', 'a', 'b']).unique()
Out[18]: array(['a', 'b'], dtype=object)

In [19]: pd.Int64Index([1,1,2]).unique()
Out[19]: array([1, 2], dtype=int64)

In [20]: pd.Float64Index([1,1,2]).unique()
Out[20]: array([ 1.,  2.])

In [21]: pd.CategoricalIndex(['a','a','b']).unique()
Out[21]:
[a, b]
Categories (2, object): [a, b]

@jreback
Copy link
Contributor

jreback commented Jun 8, 2016

this is a dupe of #4126

@jreback jreback closed this as completed Jun 8, 2016
@jreback jreback reopened this Jun 8, 2016
@jreback
Copy link
Contributor

jreback commented Jun 8, 2016

closing the other one actually.

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Advanced and removed Difficulty Novice labels Jun 8, 2016
@jreback jreback added this to the Next Major Release milestone Jun 8, 2016
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 8, 2016

  • Are there possible back compat issues? I think most/all methods of array should work similar for Index (as it was previously a subclass), but I am not sure to what extent users would rely on the fact that it returns an array now.
  • What with ``Series.unique`? This also returns arrays now.

@jreback
Copy link
Contributor

jreback commented Jun 8, 2016

yeah I dont' think we ever changed Series.unique() which was for numpy compat, should change that too. These all need to go in 0.19.0 I think though.

@jreback jreback modified the milestones: 0.19.0, Next Major Release Jun 8, 2016
@shoyer
Copy link
Member Author

shoyer commented Jun 8, 2016

One reason not to change Series.unique() is that there is no meaningful index to associate with the unique values. That said, that is probably more useful overall (especially once you consider non-numpy dtypes).

This definitely needs to go in a major release because it will break some user code.

@jorisvandenbossche
Copy link
Member

In the PR of @sinhrks, it is now proposed to return an Index of the same type for both Index and Series.

While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about Series.unique returning Index objects (I think that would be surprising). I would personally rather return Series objects with a default range index.

@jreback
Copy link
Contributor

jreback commented Aug 13, 2016

I don't agree

then you are then giving meaning to the index of the series that you are returning when it doesn't have any meaning (the ordering actually does have meaning but that is true in either case)

so returning an Index is the correct action here

@jorisvandenbossche
Copy link
Member

Of course this boils down to not having a good array-like container that can hold all pandas supported types .. (Index is such a container, and can be used for that, but IMO to users it is not, to users it are the labels of the index/columns of a DataFrame/Series).

Options:

  • leave as is (np.array for most, Categorical for category series, possibly DatetimeIndex for timezone aware (this is currently not yet the case)), and try solve this in pandas 2.0
  • consistently np.array (with the consequence of loosing type information, or object array of pandas objects to keep type info)
  • consistently Index (can hold all dtypes, but IMO confusing object)
  • consistently Series (can hold all dtypes, but has meaningless default index that is not needed)

@jreback
Copy link
Contributor

jreback commented Aug 13, 2016

I disagree, Index IS the container object and is most appropriate

Series is plain confusing

@sinhrks
Copy link
Member

sinhrks commented Aug 13, 2016

i think it's natural that Index.unique returns Index.

Series may need discussion, but returnig Series with meaningless label does not sound good.

@shoyer
Copy link
Member Author

shoyer commented Aug 13, 2016

While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about Series.unique returning Index objects (I think that would be surprising).

I agree. Returning an index for Series.unique would be quite surprising.

For Series.unique, I don't think we have any good options prior to pandas 2.0. I would stick with returning numpy arrays for now.

@jreback
Copy link
Contributor

jreback commented Aug 13, 2016

@shoyer

I agree. Returning an index for Series.unique would be quite surprising.

you seem to be against natural things and seem to want pandas to be like numpy
can you explain this reasoning

@shoyer
Copy link
Member Author

shoyer commented Aug 13, 2016

seem to want pandas to be like numpy

You misunderstand me. This is about what feels consistent with the current version of pandas:

  • For 1D labeled arrays, we use Series
  • For labels, we use Index
  • For unlabeled values, we use np.ndarray or Categorical

@jreback
Copy link
Contributor

jreback commented Aug 17, 2016

ok, I'll change my opinion here.

I can see Series.unique() returning a Series with a default (range) Index. The reason is that this is a mutable object, rather than an immutable Index. Furthermore the resultant index of the returned object, actually does have meaning. Its the order of appearance in the unique, so this does firm up the API.

@shoyer
Copy link
Member Author

shoyer commented Aug 17, 2016

As I mentioned in #13944, in pandas 2.0, I think the logical type for the return value of Series.unique() is the appropriate ScalarArray type. If that it where we want to end up, I don't think it makes sense to have additional churn now.

@jreback
Copy link
Contributor

jreback commented Aug 18, 2016

@shoyer yes and if pandas 2.0 was around the corner and we DIDN't have a 1.0 I would agree. However, we very-very rarely expose raw ndarrays to the user ATM. Aside from .values which is an impl detail, this is the remaining case (IIRC). SO this is wildly inconsistent, and should not survive a freeze of API.

@shoyer
Copy link
Member Author

shoyer commented Aug 18, 2016

My opinion is that we should not introduce any breaking changes in 1.0 that
we know we will break again in 2.0. I'm happy to hear what others think,
though.

On Thu, Aug 18, 2016 at 3:03 AM, Jeff Reback notifications@github.com
wrote:

@shoyer https://github.com/shoyer yes and if pandas 2.0 was around the
corner and we DIDN't have a 1.0 I would agree. However, we very-very rarely
expose raw ndarrays to the user ATM. Aside from .values which is an impl
detail, this is the remaining case (IIRC). SO this is wildly inconsistent,
and should not survive a freeze of API.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#13395 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKS1pcj5tZFMXTs2l2enTCgDEvm91oQks5qhC37gaJpZM4IwglC
.

@jorisvandenbossche
Copy link
Member

In the current pandas, I would vote for returning a Series, although also not ideal. But I agree with @shoyer that if we change it again for 2.0, it is maybe not very beneficial to change this for 1.0 as well.

To be clear, we should care that the "but we will change that for 2.0" does not become a reason to not do any needed changes anymore now. But, in this case, I personally don't think the return value of unique is that much of a pain point currently (with the exception of tz-aware datetime*), in most cases it just gives you what you want.

* we could also return an object array of timestamps for that specific case

@jreback
Copy link
Contributor

jreback commented Aug 18, 2016

ok, for 0.19.0 we need to change Index.unique for sure. That's a given as its the only inconsistent method and will ultimately return an Index in any future version.

Ok, so the only question then is to make Series.unique return a Series, Index, or leave as an ndarray.

If it should eventually return a Series, then we should change it now. I suppose if it ultimately will return a 1-d Array type (TBD), which is mutable (and not an Index), then I guess it could leave as a ndarray now.

But these are just way to many iffs. This needs to be resolved asap. @wesm why don't you weigh in here.

@jreback jreback modified the milestones: 0.19.0, 0.20.0 Aug 18, 2016
@wesm
Copy link
Member

wesm commented Aug 18, 2016

Just read through this.

In pandas 2.0 Series.unique is almost certainly going to return a pandas.Array (which is just a container type around some other data, possibly just a NumPy array in many cases) as @shoyer said.

Several problems with Series.unique returning a Series right now

  • The association of a new index with the result feels unclean to me. Let's look at this code:
In [10]: s = pd.Series([1,2,3,4] * 4)

In [11]: unique_vals = s.unique()

In [12]: from pandas.util.testing import rands

In [13]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique
    ...: _vals))])

In [14]: df
Out[14]: 
            uniques
mB2LJrlOw5        1
qPF14xkGNl        2
0nE5HHGM0d        3
AbQEAYpYmW        4

If unique returned a Series, then you know what would happen:

In [18]: unique_vals = pd.Series(unique_vals)

In [19]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique
    ...: _vals))])

In [20]: df
Out[20]: 
            uniques
LiQZXm6K5V      NaN
B8HABWAK2o      NaN
4hIrDH3Ue0      NaN
JpaO9iMWTP      NaN

Contrived as this may be is an enough of a concern to make me -0 on this and very nearly -1

  1. by making Series.unique return a Series now, we introduce an API contract that we probably will want to break in pandas 2.x (since we'll have a consistent array container type that can be returned)

I agree it sort of stinks that we have both ndarray and non-ndarray (e.g. categorical) return values for Series.unique, but this is the bed we've made for ourselves.

@wesm
Copy link
Member

wesm commented Aug 18, 2016

(I agree that Index.unique should always return an Index)

@sinhrks
Copy link
Member

sinhrks commented Aug 18, 2016

I'm +1 to leave Series as it is to avoid repeated API changes.

One issue related to returning ndarray is #13565. We can use drop_duplicates if we actually need Series

@jorisvandenbossche
Copy link
Member

Nice example of @wesm how series vs array could break code, so let's not do that (although the reindexing behaviour of the constructors is maybe also a point for discussion ...)

For the issue in #13565 (return value for unique of a tz aware series), options are:

  • document the current situation (looses the tz information, as a datetime64 array is returned)
  • change to object array of tz aware Timestamp objects
  • (return tz aware index for this specific case, to be fully inconsistent :-))

I would go for the first or the second, but not really a preference.

@jreback
Copy link
Contributor

jreback commented Aug 19, 2016

This was discussed here in the original issue.

In [19]: s
Out[19]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

# this is converted to UTC because numpy cannot hold anything else.
In [20]: s.values
Out[20]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

# this does not lose tz data (though is non-performant).
In [24]: np.array(list(iter(s._values)))
Out[24]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', freq='D'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', freq='D'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', freq='D')], dtype=object)

Originally I had this returning a DTI, however It was suggested that numpy compat was more important here. But it is quite simple to just return an object dtyped array (e.g. [21])``. Whenever we have a tz-aware Series (whether doing unique or not).

So changing to 2) IMHO is the best; we should also change .values to be the same to preserve meta-data.

@wesm
Copy link
Member

wesm commented Aug 19, 2016

@jorisvandenbossche the conforming / reindexing behavior of the DataFrame ctor is a super valuable feature (and one of the very earliest ones from pandas 0.1) in my experience (it also saves a ctor-then-reindex step which results in an extra sweep of the data and copy). You can pass in a bunch of irregularly indexed data and "pluck" out the data that matches a particular "master" index that you have set. The alternative is to pass in label-naive arrays, which seems like an acceptable compromise

@jorisvandenbossche
Copy link
Member

@wesm getting a bit off topic ... but: reindexing is undoubtly a very valuable operation, I am personally just not sure if it should be the behaviour of the default constructor (could also be a dedicated method). It also leads to suprises and bugs/ambiguous behaviour. I recall some discussion in #9237, where it was not clear which should happen first (reindex based on given columns, or determine index values based on passed objects), and with Series objects with names resulting in an empty frame when specifying columns=

@wesm
Copy link
Member

wesm commented Aug 19, 2016

I see. We should discuss this separately, it seems like there was a consistency issue in the treatment of a single Series versus a dict of Series (i.e. determining a row index from the input series prior to selecting only the columns in columns=) -- semantically you would think the single Series should be promoted to a dict and then handled identically. I agree it's an area that can cause people issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants