API/ENH: overhaul/unify/improve .unique #22824

Open

h-vetinari opened this issue Sep 24, 2018 · 45 comments
Labels
API Design · Enhancement · Needs Discussion (requires discussion from core team before further action)

Comments

@h-vetinari
Contributor

The state of the various flavours of .unique as of v0.23:

  • [pd/Series/Index].unique does not have keep-kwarg
  • Series.unique returns array, Series.drop_duplicates returns Series. Returning a plain np.ndarray is quite unusual for a Series method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
  • same point for Index
  • DataFrame.unique does not exist, but is a much more natural candidate (from the behaviour of numpy, resp. Series/Index) than .drop_duplicates
  • pd.unique chokes on 2-dimensional data
  • no return_inverse-kwarg for any of the .unique variants; see API: provide a better way of doing np.unique(return_inverses=True) #4087 (milestoned since 0.14), ENH: adding .unique() to DF (or return_inverse for duplicated) #21357
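A quick illustration of the return-type difference and the 2-D limitation (exact reprs are version/platform dependent):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([2, 1, 2], index=['a', 'b', 'c'])
>>> s.unique()              # plain ndarray; the index is lost
array([2, 1], dtype=int64)
>>> s.drop_duplicates()     # Series; index of the first occurrences is kept
a    2
b    1
dtype: int64
>>> pd.unique(np.ones((2, 2)))  # 2-D input raises (exact exception varies by version)
Traceback (most recent call last):
  ...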

I originally wanted to add df.unique(..., return_inverse=True|False) for #21357, but got directed to add it to duplicated instead. After slow progress over 3 months in #21645 (PR essentially finished since 2), @jorisvandenbossche brought up the - justified (IMO) - feedback that:

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

and

[...] it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

This prompted me to have another look at the situation with .unique, which yielded the list of inconsistencies above. To resolve them, I suggest the following:

  • Change return type for [Series/Index].unique to be same as caller (deprecation cycle by introducing raw=None which at first defaults to True?)
  • Add keep-kwarg to [Series/Index].unique (make .unique a wrapper around .drop_duplicates?)
  • Add df.unique (as thin wrapper around .drop_duplicates?)
  • Add keep-kwarg to pd.unique and dispatch to DataFrame/Series/Index as necessary
  • Add return_inverse-kwarg to all of them (and add to EA interface); under the hood by exposing the same kwarg to duplicated and drop_duplicates as well
  • (something for later) solve BUG: df.duplicated treats None as np.nan in object columns #21720 (treatment of np.nan/None in df.duplicated inconsistent vs. Series behaviour)

Each point is essentially self-contained and independent of the others, but of course they make more sense together.

@jreback
Contributor

jreback commented Sep 24, 2018

there are multiple issues about changing .unique (for quite some time)

pls show the references

@jreback
Contributor

jreback commented Sep 24, 2018

-1 on pd.unique being 2-d as this will make an already complicated function much more so

@h-vetinari
Contributor Author

@jreback

there are multiple issues about changing .unique (for quite some time)
pls show the references

I tried as best I could. Searching open issues for "unique" yields 185 results which are at first glance completely unrelated - so I'll admit I didn't comb through all of those. (https://github.com/pandas-dev/pandas/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+unique)

Searching for "df.unique" or "dataframe.unique" yielded only #21357, and now also this issue.

@h-vetinari
Contributor Author

h-vetinari commented Sep 24, 2018

-1 on pd.unique being 2-d as this will make an already complicated function much more so

If df.unique wraps df.drop_duplicates as suggested, then it would be very easy to dispatch to the different types, and to work with existing code at almost no extra cost.
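A minimal sketch of such a wrapper (hypothetical, since DataFrame.unique does not exist; the signature mirrors drop_duplicates):

# hypothetical DataFrame.unique as a thin wrapper
def unique(self, subset=None, keep='first'):
    """Row-wise uniques; keeps the index of the surviving rows."""
    return self.drop_duplicates(subset=subset, keep=keep)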

Also, the implementation of pd.unique is not very complicated at all - just dispatching to hashtables (https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/algorithms.py#L358-L378).

@h-vetinari
Contributor Author

there are multiple issues about changing .unique (for quite some time)
pls show the references

Went over the list again and found one related issue (#15442), one partly related (#13984), and two barely related (#18291, #19595).

@jreback
Contributor

jreback commented Sep 24, 2018

there was tremendous discussion on this - what should Series.unique return - rather than rehash much better to search for it - follow the links in the first issue

@h-vetinari
Contributor Author

Changing Index.unique() to return an Index: #4126, #13395 (where there's the discussion about what Series.unique() should return), finished in #13979.

rather than rehash much better to search for it

For someone who hasn't been through these discussions over the years, it's nigh impossible to find. I'm not going to go through thousands of closed issues to find some orthogonal discussions that may or may not be hiding in there.

there was tremendous discussion on this - what should Series.unique return

This is but one aspect of this issue (in particular, the rest are independent of it). And still, most of the discussion in #13395 was about what index Series.unique() would have - if we postulate unique = drop_duplicates, that question already has an answer.

@jorisvandenbossche
Member

@h-vetinari Thanks for all the discussion

I think the main previous discussions on the return value of unique have been about having an index vs array vs categorical as output (so all non-indexed objects). So that is a bit different than the issue raised here.

But thanks for looking through the older issues! That is indeed not always easy (even for a core dev who might have participated in them...)


Some replies to the inconsistencies / solutions:

Change return type for [Series/Index].unique to be same as caller (deprecation cycle by introducing raw=None which at first defaults to True?)

Personally, I think that boat has sailed (for Series, Index already returns an Index). I don't really see the added value in having our users go through such a deprecation cycle for such a core function.

Also, returning a Series, i.e. returning an indexed object, gives you more complexity, which is what drop_duplicates is dealing with. Adding this to unique would make them essentially the same. This might be a good thing, but now at least they each serve a slightly different purpose.

[pd/Series/Index].unique does not have keep-kwarg

Which is not relevant as long as the return value is not an indexed object?

DataFrame.unique does not exist, but is a much more natural candidate (from the behaviour of numpy, resp. Series/Index) than .drop_duplicates

In naming yes, but what is the "natural" behaviour? If you consider the difference between the current Series.drop_duplicates and Series.unique, then DataFrame.drop_duplicates is the consistent equivalent with Series.drop_duplicates I think?

no return_inverse-kwarg for any of the .unique variants; see #4087 (milestoned since 0.14), #21357

As I said in the other thread, I think there would not be much disagreement on adding return_inverse to the current unique methods (but that is only my opinion of course).

pd.unique chokes on 2-dimensional data

But what would you want it to do? Work on the flattened values, or on a certain axis? (for np.unique they recently added an axis argument)

Add keep-kwarg to pd.unique and dispatch to DataFrame/Series/Index as necessary

For keep the same comment as above. And currently it is the Series/Index version that dispatches to pd.unique (if the underlying values themselves have no unique method); what would be the advantage of changing it around?

@jorisvandenbossche
Member

So maybe the main question for me is this (limiting it to Series methods here): Series.unique and Series.drop_duplicates are now two distinct functions, doing slightly different things. Is this good or not?

@h-vetinari's proposal is to unify them, but now they serve an overlapping but still different purpose, and it can also be fine to have two different methods for that (eg unique does not need to care about the keep argument)
Personally, I think it is fine to have both.

@h-vetinari
Contributor Author

h-vetinari commented Sep 25, 2018

@jorisvandenbossche
Thanks for the detailed response!

So maybe the main question for me is this (limiting it to Series methods here): Series.unique and Series.drop_duplicates are now two distinct functions, doing slightly different things. Is this good or not?

I would argue that they do exactly the same thing (up to and including keep; more on that below), and having them do it in slightly different ways is inconsistent, confusing and error-prone from a user-perspective.

[pd/Series/Index].unique does not have keep-kwarg

Which is not relevant as long as the return value is not an indexed object?

No, it IS relevant even now. Pandas strongly advertises that [pd/Series/Index].unique does not sort, and then keep='first'|'last' makes a real difference. Whenever the order of a Series matters (especially, but not only, with a DatetimeIndex) there is a clear distinction between keeping the first and last occurrence, even if the output does not have an index itself.

pd.unique chokes on 2-dimensional data

But what would you want it to do? Work on the flattened values, or on a certain axis? (for np.unique they recently added an axis argument)

Add axis=0 to the signature and raise NotImplementedError if axis!=0. Then pd.unique can dispatch to the respective .unique method of pandas objects (and default to standard treatment for other objects). All methods would continue to take the row-uniques.
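A rough sketch of that dispatch (illustrative only; keep is only honoured on the pandas branch here, and the error message is made up):

import numpy as np
import pandas as pd

def unique(values, axis=0, keep='first'):
    if axis != 0:
        raise NotImplementedError("axis != 0 is not supported")
    if isinstance(values, (pd.Series, pd.Index, pd.DataFrame)):
        # pandas objects keep their own type; a DataFrame yields the row-uniques
        return values.drop_duplicates(keep=keep)
    # everything else goes through the existing 1-D hashtable-based path
    return pd.unique(np.asarray(values))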

And currently it is the Series/Index version that dispatches to pd.unique (if the underlying values itself have no unique method), what would be the advantage of changing it around?

WRT that, I found another inconsistency, pd.unique(idx) != idx.unique():

>>> idx = pd.Index([1, 2, 3])
>>> idx.unique()
Int64Index([1, 2, 3], dtype='int64')
>>> pd.unique(idx)
array([1, 2, 3], dtype=int64)

Finally,

Personally, I think that boat has sailed (for Series, Index already returns an Index). I don't really see the added value in having our users go through such a deprecation cycle for such a core function.

If the change was ok for Index in 0.19 two years ago, and another such central function (groupby.apply) was changed in a similar way in 0.23 just months ago, then IMO users will be able to deal with it without issue, and the long-term benefits far outweigh that single transition hump.

@jorisvandenbossche
Member

BTW, @h-vetinari in case you would be interested, we are having a dev hangout tomorrow: https://mail.python.org/pipermail/pandas-dev/2018-September/000830.html

@h-vetinari
Contributor Author

@jorisvandenbossche
Thanks for the invitation! I've got a deadline on Friday, so I'm skeptical whether I'll make it. Hoping I could get a go-ahead on #22812 though. ;-)

@jorisvandenbossche
Member

@h-vetinari ah, that wasn't on the agenda yet, will add it to the list of potential topics to discuss, but can't promise we will get to it

@TomAugspurger
Contributor

Whenever the order of a Series matters (especially, but not only, with a DatetimeIndex) there is a clear distinction between keeping the first and last occurrence, even if the output does not have an index itself.

What do you mean by this? I don't see how the output would differ if just an array is being returned.

@h-vetinari
Contributor Author

h-vetinari commented Sep 27, 2018

@TomAugspurger

Since pandas advertises that unique does not sort ("This does NOT sort", https://pandas.pydata.org/pandas-docs/stable/generated/pandas.unique.html), it makes a difference:

>>> s = pd.Series([0, 1, 99, 1, 3, 99, 1], name='error_codes')
>>> s.index = pd.date_range(start='2018-01-01', end='2018-01-07', freq='D')
>>>
>>> # numpy chooses to ignore order
>>> np.unique(s)
array([ 0,  1,  3, 99], dtype=int64)
>>>
>>> # first occurrence
>>> pd.unique(s)
array([ 0,  1, 99,  3], dtype=int64)
>>>
>>> # last occurrence
>>> pd.unique(s, keep='last') # hypothetical; used: s.drop_duplicates(keep='last').values
array([ 0,  3, 99,  1], dtype=int64)

Imagine you want to find out which unique errors occurred in whatever system generated the output, and when they last (or first) occurred - presumably the order of events will have some relevant information. Since .unique doesn't sort, it is a very natural approach for this, but it

  • loses the index information (which I argue should be changed)
  • doesn't allow keep='last'

I fully understand that drop_duplicates has the capabilities I'm talking about, but my point is exactly that unique fundamentally does the same thing (if it were to sort like numpy, this argument would be less strong). The question about the index of the result (e.g. in #13395) has an unambiguous answer (just depending on keep), and if someone truly wants just the array, they can always use .values afterwards.

It's not even a perf-issue, because the hashtable code has the info at which index it is (for StringHashTable.unique, this would just mean also returning uindexer, for example).

Finally, this is how it should look, IMO:

>>> s.unique(keep='last')  # hypothetical; used: s.drop_duplicates(keep='last')
2018-01-01     0
2018-01-05     3
2018-01-06    99
2018-01-07     1
Name: error_codes, dtype: int64

@WillAyd
Member

WillAyd commented Sep 28, 2018

After speaking about this in the dev chat yesterday I'm wondering if it wouldn't make the most sense for unique to return an ndarray and drop_duplicates to return the same type as caller. Doing so delineates the two functions, and if that's the case then return_inverse I think could just be added to drop_duplicates.

Because Index, Series and DataFrame already have drop_duplicates, it's an easier change to add the keyword there than creating new methods. For consistency, the return value of Index.unique would need to change to an ndarray, but I think that causes the least friction and makes sense from an end user perspective.

@h-vetinari
Contributor Author

Sorry I couldn't take part in the dev chat yesterday.

I understand that the hurdle is larger to change the output from Series.unique to a Series, but it's the only consistent choice, and 1.0 is the perfect time for this because people will be very motivated/tolerant of changes in 1.0.

The distinction between unique and drop_duplicates is purely semantic, and it's confusing that things behave differently - and changing the return type of Index.unique is arguably as large a change as changing Series.unique. That aside, if unique were to always return an ndarray:

  • it would be confusing why np.unique and df.drop_duplicates have a return_inverse, but not the pandas-unique methods, and would really feel (IMO) like a left-over from the early days.
  • currently, the user has to remember a lot of relevant differences between pd.unique and np.unique, but both return an ndarray. pd.unique should really be able to deal with its own types, and that would also help differentiate those methods (see the example below).
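For reference, the core behavioural difference the user currently has to keep in mind:

>>> import numpy as np
>>> import pandas as pd
>>> vals = np.array([3, 1, 3, 2])
>>> np.unique(vals)   # sorts
array([1, 2, 3])
>>> pd.unique(vals)   # order of first appearance
array([3, 1, 2])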

Again, 1.0 is the perfect time for this - it only gets harder later.

@h-vetinari
Contributor Author

h-vetinari commented Oct 8, 2018

@jorisvandenbossche @TomAugspurger @jreback @WillAyd

I wanna pick this up again, please, and need your feedback. In addition to the points mentioned directly above, there's another important point:

  • the utility of / need for return_inverse is not in question AFAICT

  • we already had some discussion about the return type of return_inverse in ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645, and I'll make the case again that:

    • the inverse should be a Series
    • otherwise the user has to play "puzzle" with two different kinds of Index objects (or ndarrays holding index info)
  • one of the main goals of return_inverse is to allow reconstruction (perhaps after some other calculation) -- this too would only really be possible if Series.unique maintains its type/index.

    [the array of values would be reconstructible from uniques/inverse (as two ndarrays), but then the original index is still missing. Just assuming that it will be available from somewhere else is not justified, and would feel hacky, inspiring little user confidence in the validity of the result if it is not a direct output of the unique-calculation.]

  • it would be strange to have the uniques be an ndarray, but the inverse a Series (again, the only other option would be to return three times ndarray)

Long story short:

  • the only consistent option is a type-preserving unique, particularly for having return_inverse and enabling reconstruction
  • 1.0 is the perfect time to do this
  • changing the return type of Index.unique to ndarray would be an equally disruptive change, with no gains towards being able to reconstruct.
  • I'm volunteering to make the changes (starting slowly in CLN: prepare unifying hashtable.factorize and .unique; add doc-strings #22986...)
  • I need a go-ahead, or at least some feedback/discussion, please [not on all points of this issue, just the return-type of Series.unique, which needs a deprecation cycle]

@jorisvandenbossche
Member

I understand that the hurdle is larger to change the output from Series.unique to a Series, but it's the only consistent choice

The distinction between unique and drop_duplicates is purely semantic,

I still don't see the point of unifying those two functions. For me, they have clearly different behaviour (and use cases). unique gives you an array-like of the unique values in order of occurrence, drop_duplicates is the more advanced method where you keep the index / can specify which duplicates to drop.

(personal use case: If I quickly want to check the unique values, I use unique (typically a limited number of unique values), and if I have a Series with a few duplicates I want to drop, I use drop_duplicates)

So I am personally fine with the current distinction between both functions.

it would be strange to have the uniques be an ndarray, but the inverse a Series (again, the only other option would be to return three times ndarray)

So for pd.unique, I think return_inverse should always be an array. But for Series.unique I agree that it would be strange to have the inverse be a Series while the uniques are an array. This is not ideal (both this, and having the inverse as an array, which loses information), but IMO not enough to warrant a change in return type for Series.unique.

I need a go-ahead, or at least some feedback/discussion, please [not on all points of this issue, just the return-type of Series.unique, which needs a deprecation cycle]

Let's start with adding the functionality to the hashtables and to pd.unique (where the API is not under discussion I think). We can then further discuss the other points / how to integrate it in Series.unique/Series.drop_duplicates
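For comparison, numpy's existing keyword, and what the pandas analogue would presumably return (the pd.unique call below is hypothetical):

>>> import numpy as np
>>> import pandas as pd
>>> vals = np.array(['b', 'a', 'b'], dtype=object)
>>> np.unique(vals, return_inverse=True)   # numpy sorts the uniques
(array(['a', 'b'], dtype=object), array([1, 0, 1]))
>>> pd.unique(vals, return_inverse=True)   # hypothetical: appearance order
(array(['b', 'a'], dtype=object), array([0, 1, 0]))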

@h-vetinari
Contributor Author

@jorisvandenbossche
Thanks for the response and sorry for the slow reply.

@jreback @TomAugspurger @WillAyd @gfyoung @jschendel @toobaz
Gentle ping to invite feedback about the case I make below. TLDR: Changing the type of Series.unique is the only way that allows using return_inverse without several anti-patterns and disjoint parts.

Return type of Series.unique based on reconstructing with inverse

Instead of arguing again for the similarities of .unique and .drop_duplicates, let me give an example for the main reason to have a return_inverse at all - reconstruction from the uniques.

As a toy example, say we have a log of users interacting with a website, and want to identify the first time they arrive (to calculate behaviour per duration spent or whatever). Let's begin from a common starting point

>>> import pandas as pd
>>> import numpy as np
>>> import pandas.util.testing as tm
>>> from numpy.random import randint
>>> np.random.seed(444)
>>>
>>> N = 7  # number of timestamps
>>> k = 3  # number of unique IDs
>>> 
>>> # generate random timestamps during the day
>>> dti = pd.date_range(start='2018/10/24', end='2018/10/25', periods=N ** 3)
>>> dti = dti[randint(0, N ** 3, (N,))].sort_values()
>>> 
>>> # generate "k" user IDs and assign them to the timestamps
>>> with tm.RNGContext(444):
...     s = pd.Series(tm.makeStringIndex(k).str[:4][randint(0, k, (N,))],
...                   index=dti, name='User_ID')
>>> s
2018-10-24 01:49:28.421052672    dWxi
2018-10-24 03:34:44.210526208    mkqZ
2018-10-24 07:34:44.210526208    dWxi
2018-10-24 07:38:56.842105344    dWxi
2018-10-24 08:46:18.947368448    iPSk
2018-10-24 18:06:18.947368448    iPSk
2018-10-24 19:13:41.052631552    mkqZ
Name: User_ID, dtype: object

and we want to achieve:

>>> goal
                              User_ID first_contact_today
2018-10-24 01:49:28.421052672    dWxi     01:49:28.421052
2018-10-24 03:34:44.210526208    mkqZ     03:34:44.210526
2018-10-24 07:34:44.210526208    dWxi     01:49:28.421052
2018-10-24 07:38:56.842105344    dWxi     01:49:28.421052
2018-10-24 08:46:18.947368448    iPSk     08:46:18.947368
2018-10-24 18:06:18.947368448    iPSk     08:46:18.947368
2018-10-24 19:13:41.052631552    mkqZ     03:34:44.210526

Note: The problems here are not limited to DatetimeIndex or this specific use-case, but about generally needing pandas-capabilities to work on the uniques, and then rebroadcasting to the original. This is similar to the problem of groupby.apply pre-v.0.23, where it was eventually realized that it's necessary to have pandas-objects (and not raw ndarrays) available within the function.

The good: Series.unique returns Series, inverse also a Series

In the world that I'm proposing, the workflow is super simple:

>>> # both uniques/inv are Series in this example
>>> uniques, inv = s.unique(return_inverse=True)  # takes first occurrence
>>> uniques_enh = uniques.to_frame().assign(first_contact_today = uniques.index.time)
>>> # broadcast back to original: first select from uniques, then (re)set index
>>> s_enh = uniques_enh.reindex(inv).set_index(inv.index)
>>> s_enh.equals(goal)
True
>>> uniques_enh  # for comparison
                              User_ID first_contact_today
2018-10-24 01:49:28.421052672    dWxi     01:49:28.421052
2018-10-24 03:34:44.210526208    mkqZ     03:34:44.210526
2018-10-24 08:46:18.947368448    iPSk     08:46:18.947368

The weird: Series.unique returns ndarray, but inverse is a Series

As I wrote further up the thread, "it would be strange to have the uniques be an ndarray, but the inverse a Series", and this really is a half-assed solution IMO. It also makes it necessary to kludge around to get the required info on the level of the uniques -- otherwise we'd need two kwargs/arrays, see "the monstrous" below.

>>> # Series.unique is ndarray in this example, but inverse is a Series
>>> uniques, inv = s.unique(return_inverse=True)
>>>
>>> # cannot operate on "uniques" here to get index, kludge in index-info from "inv"
>>> # using "inv.unique()" while trying to reconstruct from previous unique is circular!
>>> # need to reconstruct - lost all label information
>>> uniques_enh = pd.DataFrame({'User_ID': uniques,
...                             'first_contact_today': inv.index[inv.unique()].time},
...                            index = inv.unique())
>>> 
>>> # broadcast back to original: first select from uniques, then (re)set index
>>> s_enh = uniques_enh.reindex(inv).set_index(inv.index)
>>> s_enh.equals(goal)
True
>>> uniques_enh  # for comparison
  User_ID first_contact_today
0    dWxi     01:49:28.421052
1    mkqZ     03:34:44.210526
4    iPSk     08:46:18.947368

Further deepening the weird mix between ndarray/Series, the inverse would be using two different kinds of indexes, instead of one that's compatible with both s and uniques.

The monstrous: Series.unique returns ndarray, inverse(s) also ndarray

This is not really under discussion AFAICT, but I'm adding it in case someone thinks this scenario would be easier than "the weird" above. This case would necessitate adding two keywords (return_inverse and return_index like for np.unique), and really shouldn't be considered IMO.

>>> # Series.unique is ndarray here; if inverse is not a Series, we need *two* arrays
>>> uniques, idx, inv = s.unique(return_inverse=True, return_index=True)
>>>
>>> # cannot use pd.concat; numpy doesn't handle Datetimes well
>>> # np.concatenate cannot turn 1-d arrays into 2-d array -> use np.stack
>>> # cannot operate on "uniques" here to get index; re-using "dti" is fragile!
>>> uniques_enh = np.stack([uniques, dti[idx]], axis=1)
>>>
>>> # need to reconstruct instead of reindex; lost all label information
>>> # re-using "dti" is fragile! should natively come out of unique-calculation!
>>> s_enh = pd.DataFrame(uniques_enh[inv], index=dti,
...                      columns=['User_ID', 'first_contact_today'])
>>> # need to manually restore Datetime info
>>> s_enh.first_contact_today = s_enh.first_contact_today.astype('datetime64').dt.time
>>> s_enh.equals(goal)
True
>>> uniques_enh  # for comparison
array([['dWxi', 1540345768421052672],
       ['mkqZ', 1540352084210526208],
       ['iPSk', 1540370778947368448]], dtype=object)

Other comments/responses

  1. As @jorisvandenbossche correctly noted in ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645 (comment), the inverse should be first and foremost on unique, not drop_duplicates - i.e. the above discussion shouldn't be circumvented by saying "use drop_duplicates".
  2. Like for groupby.apply, there are strong reasons for being able to stay in pandas-land
  3. Changing the type of Series.unique is the only way that allows using return_inverse without several anti-patterns and disjoint parts.
  4. Regarding: "personal use case: If I quickly want to check the unique values, I use unique (typically a limited number of unique values), and if I have a Series with a few duplicates I want to drop, I use drop_duplicates"
    IMO, if you really need an ndarray, it's trivial to add .values, but it's highly non-trivial to reconstruct the index.
  5. Let's start with adding the functionality to the hashtables [...]. We can then further discuss the other points / how to integrate it in Series.unique/Series.drop_duplicates

Working on it. But it's a slow process, and I think changing this should ideally be done for 1.0, hence deprecated in 0.24, which is soon.

@jreback
Contributor

jreback commented Oct 24, 2018

your example looks like a simple groupby and merge, or just a merge_asof for ordered things

why is that not simply the answer?

@h-vetinari
Contributor Author

h-vetinari commented Oct 24, 2018

@jreback

I tried to make a short enough toy example (as the post is already very long), so obviously some complexity gets brushed under the carpet, which then may allow alternative solutions. But even then, using .groupby for problems like this is orders of magnitude slower than .unique - e.g. 20min vs. 10sec (this is how I started this journey of trying to get return_inverse implemented, for more details see #21357).

A merge with (modified) uniques is fundamentally the same as using return_inverse, but A) that would need (like I'm arguing) for uniques to be Series, and B) it is also much more inefficient (esp. for large sizes, or if the uniques are formed over many columns), because it needs to recalculate the exact same HashTables/HashVectors for joining that are already available from the original computation of those very same uniques!

I had another toy example in mind that does something like "cumsum-only-new-values" (not a simple groupby AFAICT) - I just didn't work that one out as thoroughly:

>>> s = pd.Series([1, 4, 1, 2, 2, 4, 3], index=[x ** 2 for x in range(7)])
>>> uniques, inv = s.unique(return_inverse=True)
>>> # csonv = cumsum-only-new-values
>>> uniques_enh = uniques.to_frame('orig').assign(csonv = uniques.cumsum())
>>> s_enh = uniques_enh.reindex(inv).set_index(inv.index)
>>> s_enh.csonv = s_enh.csonv.expanding().max().astype(int)
>>> s_enh
    orig  csonv 
0      1      1
1      4      5
4      1      5
9      2      7
16     2      7
25     4      7
36     3     10
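For completeness: under the current API, the uniques/inv pair used above can be emulated for keep='first' (reusing s from the example; this only holds as long as no missing values are involved, since factorize collapses all of them):

>>> codes, _ = pd.factorize(s)    # appearance-order codes into the uniques
>>> uniques = s[~s.duplicated()]  # Series; index = labels of the first occurrences
>>> inv = pd.Series(uniques.index[codes], index=s.index)  # row label -> unique's label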

@jorisvandenbossche
Member

@h-vetinari it might be beside the point, since it is only a simplified example and the below might not apply in your actual use case, but I would solve the toy problem with drop_duplicates, and not unique:

In [26]: first_contact = (s.sort_index().drop_duplicates(keep='first')
    ...:                   .reset_index().rename(columns={'index': 'first_contact'}))


In [27]: pd.merge(s.reset_index(), first_contact, how='left')
Out[27]: 
                          index User_ID                 first_contact
0 2018-10-24 04:42:06.315789568    dWxi 2018-10-24 04:42:06.315789568
1 2018-10-24 07:51:34.736841984    mkqZ 2018-10-24 07:51:34.736841984
2 2018-10-24 09:49:28.421052672    dWxi 2018-10-24 04:42:06.315789568
3 2018-10-24 12:46:18.947368448    dWxi 2018-10-24 04:42:06.315789568
4 2018-10-24 16:33:41.052631552    iPSk 2018-10-24 16:33:41.052631552
5 2018-10-24 18:31:34.736841984    iPSk 2018-10-24 16:33:41.052631552
6 2018-10-24 19:34:44.210526208    mkqZ 2018-10-24 07:51:34.736841984

(I won't say this is the most beautiful code, but the annoyances are rather with different things: not being able to set the name of the index in reset_index, losing the index in pd.merge and therefore needing to do the reset_index, and then actually still a set_index to get to the desired result)

@jorisvandenbossche
Member

I see now that in your last post, as an answer to Jeff, you touched on reasons why you don't want the above, because a merge is less efficient?
But for that, you mention the case where the uniques are formed over multiple columns. So for this case, we still need some way of getting the inverse (so Series.unique would not be enough; we then either need return_inverse in DataFrame.drop_duplicates or a to-be-added DataFrame.unique, whatever it would be)


Another way, assuming the uniques and inverse are arrays:

In [50]: uniques = np.array(['dWxi', 'mkqZ', 'iPSk'], dtype=object)

In [51]: inverse = np.array([0, 1, 0, 0, 2, 2, 1])

In [52]: _, index = np.unique(inverse, return_index=True)

In [53]: result = s.to_frame()

In [54]: result['first_contact'] = s.index[index][inverse]

In [55]: result
Out[55]: 
                              User_ID                 first_contact
2018-10-24 04:42:06.315789568    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 07:51:34.736841984    mkqZ 2018-10-24 07:51:34.736841984
2018-10-24 09:49:28.421052672    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 12:46:18.947368448    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 16:33:41.052631552    iPSk 2018-10-24 16:33:41.052631552
2018-10-24 18:31:34.736841984    iPSk 2018-10-24 16:33:41.052631552
2018-10-24 19:34:44.210526208    mkqZ 2018-10-24 07:51:34.736841984

So from that, a return_index might actually be interesting

@h-vetinari
Contributor Author

@jorisvandenbossche
Thanks for your response:

I see now that in your last post, as an answer to Jeff, you touched on reasons why you don't want the above, because a merge is less efficient?

Because to merge with the uniques (or to apply drop_duplicates as in your example), one needs to first calculate the whole hashtable for the calculation of the uniques, and then recalculate it for the merge.

But for that, you mention the case where the uniques are formed over multiple columns. So for this case, we still need some way of getting the inverse (so Series.unique would not be enough; we then either need return_inverse in DataFrame.drop_duplicates or a to-be-added DataFrame.unique, whatever it would be)

This was not as cleanly delineated as it could have been:

  • reconstruction for DataFrame needs to be handled as well (whether in drop_duplicates or a to-be-added unique)
  • the inverse makes much more sense for unique (as you remarked)
  • the longer-term goal for me is to enable df.unique(return_inverse=True)

Another way, assuming the uniques and inverse are arrays:
[...]
So from that, a return_index might actually be interesting

I'm really -1 on having to deal with both return_index and return_inverse, because then you'd need four objects to reconstruct (the uniques, the array-index, the array-inverse, and finally, the original index) - numpy can't avoid it because there are no indexes, but pandas can. Furthermore, that exact functionality is already available in numpy (aside from the issue with sorting).

I'm proposing a solution that's pandas-native, and only needs two objects for reconstruction (instead of four).
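To make the count concrete, a self-contained sketch of both reconstructions (the uniques/inv Series below mimic the proposed return of s.unique(return_inverse=True)):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([10, 20, 10], index=['a', 'b', 'c'])
>>> # pandas-native: two objects (uniques, inv) suffice
>>> uniques = pd.Series([10, 20], index=['a', 'b'])  # index = first occurrences
>>> inv = pd.Series(['a', 'b', 'a'], index=s.index)  # row label -> unique's label
>>> pd.Series(uniques.reindex(inv).values, index=inv.index).equals(s)
True
>>> # numpy-style: four objects (uniques, first-occurrence positions, inverse, original index)
>>> u, idx, inv_arr = np.unique(s.values, return_index=True, return_inverse=True)
>>> pd.Series(u[inv_arr], index=s.index).equals(s)
True
>>> s.index[idx]  # the labels of the uniques still have to be recovered separately
Index(['a', 'b'], dtype='object')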

@h-vetinari
Contributor Author

@jorisvandenbossche
Friendly ping. :)

@h-vetinari
Contributor Author

@jorisvandenbossche
With #23400 nearing completion, could you please have another look at this?

@h-vetinari
Contributor Author

@jorisvandenbossche
Now that #23400 is in, could I please ask you for another comment? I'd like to proceed with:

  1. Changing the return type of Series.unique to Series
  2. Adding return_inverse to Series.unique / Index.unique (this needs the first step to be coherent, see any of my posts in this thread)
  3. ...

@h-vetinari
Contributor Author

Copying from #24108:

@jreback: the question about the return value of .unique needs to be answered first;

This is what #24108 is about, since the issue had stalled for over a month despite several pings.

@jreback: return an .array of the result (for Index and Series) is probably the most reasonable change and is mostly backward compatible

I disagree with this quite strongly:

  • Changing Index.unique from Index->ndarray is as much of a breaking change as changing Series.unique from ndarray->Series (but has no benefits for reconstruction)
  • Series.unique already special-cases Categorical and EA. An ndarray fits even less as the return value of a Series method, given where pandas is heading.
  • Since .unique strongly advertises that it does not sort, there's an implicit index mapping happening already, only that it's very hard to coax out.
  • If it were to keep returning ndarray, having an inverse is basically impossible without running into several antipatterns.
  • etc.

@jreback: however better to bring this up on the issue (for unique return value)

Would you mind chiming in here then?

@h-vetinari
Contributor Author

It's just inconsistencies upon inconsistencies (discovered while writing tests for a precursor PR of #24108):

>>> import pandas as pd
>>> idx = pd.Index([0, 1, 1, 0])
>>> pd.unique(idx)
array([0, 1], dtype=int64)

So pd.unique(Index) yields an array, except if the Index is categorical...?

>>> idx = idx.astype('category')
>>> pd.unique(idx)
CategoricalIndex([0, 1], categories=[0, 1], ordered=False, dtype='category')

@jorisvandenbossche
Member

Some of those inconsistencies (certainly not all :-)) have historical reasons / justifications.
I think in the past, we wanted the return value to be something array-like (and Index here is much more array-like than a Series), but to also not lose too much information.
That's e.g. the reason that we decided to make the exception for a categorical Series to not return a plain ndarray, but to return a Categorical, as otherwise it would lose the type information.
And I think in general, for EA-backed Series, we will now also return the EA array, not an ndarray (which makes this actually a bit more consistent), although we might need to have some discussion about this for the existing cases like datetime with tz for backwards compat.

In a similar vein, for Index, we decided to return an Index object, as this then could already naturally retain all type information, and still return something array-like.
Of course this made it inconsistent with Series, but it made it quite consistent across all Index classes, and some compromises need to be made given the limitations we have concerning dtypes.

Changing Index.unique from Index->ndarray is as much of a breaking change as changing Series.unique from ndarray->Series (but has no benefits for reconstruction)

I don't think this change (Index.unique from Index->ndarray) is necessarily needed, but note that this is certainly a less breaking change than ndarray -> Series. For example, indexing an Index or an ndarray is much more alike.

Series.unique already special-cases Categorical and EA. An ndarray fits even less as the return value of a Series method, given where pandas is heading.

What do you mean here?

If it were to keep returning ndarray, having an inverse is basically impossible without running into several antipatterns.

It certainly needs more effort from the user for certain use cases, but I don't think it is impossible, and in many cases there are other (maybe even nicer) alternatives available**.

To repeat what I said above: I personally don't see the need to change the return type of a Series; I personally like the distinction between unique and drop_duplicates.
And above all, I think a possible deprecation would be quite annoying (every call to unique will raise a warning), and IMO not worth it (but as you did, it is certainly possible to do this with a nice deprecation cycle before changing anything).


** For your example above, this is another alternative solution:

In [33]: result = s.reset_index()

In [34]: result['time_first_contact'] = result.groupby('User_ID').transform('first')

In [35]: result.set_index('index')
Out[35]: 
                              User_ID            time_first_contact
index                                                              
2018-10-24 04:42:06.315789568    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 07:51:34.736841984    mkqZ 2018-10-24 07:51:34.736841984
2018-10-24 09:49:28.421052672    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 12:46:18.947368448    dWxi 2018-10-24 04:42:06.315789568
2018-10-24 16:33:41.052631552    iPSk 2018-10-24 16:33:41.052631552
2018-10-24 18:31:34.736841984    iPSk 2018-10-24 16:33:41.052631552
2018-10-24 19:34:44.210526208    mkqZ 2018-10-24 07:51:34.736841984

@h-vetinari
Contributor Author

@jorisvandenbossche
Sorry for leaving this unattended for so long, I did add #24108 as a concept and #24119 as a first step though.

I think in the past, we wanted the return value to be something array-like (and Index here is much more array-like than a Series), but to also not lose too much information. [my emphasis]

That's the main point to me. Pandas is moving away from numpy arrays in general, towards richer and more versatile array formats (PandasArray, ExtensionArray, maybe something powered by arrow sometime in the not too distant future). This is what I referred to above with "where pandas is heading".

It would just be wrong (IMO) to return an ndarray for these cases (as you've already decided for Categorical) - the correct solution is that .unique keeps the type. The amount of kludges and exceptions-to-the-rule will just get worse otherwise.

In fact, I'd argue that probably the main reason why Series.unique had to be an ndarray for so long is that returning a Series with the correct Index already requires the inverse for the cython version of unique, and that wasn't available until comparatively recently (#22986, #23400).

PS. The example I gave can indeed be solved differently, but that's because I had to simplify the example for reasons of space. In practice, I don't have the option of a simple groupby.

@stuarteberg
Contributor

I don't know what progress has been made regarding the bulk of this issue, but the following bullet point, at least, will be resolved by #27874, FWIW:

  • Change return type for [Series/Index].unique to be same as caller (deprecation cycle by introducing raw=None which at first defaults to True?)

@h-vetinari
Contributor Author

@stuarteberg
Thanks for finding and linking this issue. Unfortunately, #27874 does not solve the point you quote. The type I'm talking about here is that Series.unique should return a Series (where it currently returns an np.ndarray).

You will see that the approach of #27874 must fail for everything that's a pandas dtype, for example, because numpy does not know about them. Said otherwise: maintaining the dtype is part of the reason for aligning the type that's returned by Series.unique.

I had a working POC implementation in #24108 (not including tests), but I haven't followed up on this much because the appetite of the core devs for changing the return type was low. I still consider it essential enough to deal with the transition "pain", and would be willing to work on it.

@stuarteberg
Contributor

@h-vetinari Ah, I misunderstood. Thanks for the clarification! (Sorry for the noise...)

@jreback
Contributor

jreback commented Oct 8, 2019

repeating from the linked issues @h-vetinari if you'd summarize.

why is returning a Series from .unique() actually useful?
why not an .array? then doesn’t .factorize() exactly provide the ability to reconstruct (iirc your main goal); speaking of reconstruction; how is this useful to a user? what is the use case

@h-vetinari
Contributor Author

h-vetinari commented Oct 11, 2019

@jreback
Thanks for the comments.

why is returning a Series from .unique() actually useful? why not an .array?

It's useful for preserving the index of the data (since pandas prominently advertises that unique does not sort, and thus the order contains information), and also the dtype of the data. The latter point is especially relevant for pandas-internal dtypes, because those will return either object-arrays or have already been hacked to preserve the dtype (like CategoricalIndex).

In general, pandas is emancipating itself more and more from numpy, and it is both an artifact** of its development and a burden to usability to have Series.unique return an array.

then doesn’t .factorize() exactly provide the ability to reconstruct (iirc your main goal);

Factorize is a similar but different method (docs here):

  • It aims to turn a set into a categorical variable (cf. here, even though pandas does not consistently return categoricals), which is not the same as what unique does.
  • It treats all missing values as one (also not what unique does). In fact, there is a bug that's due to the internals of unique using factorize at some point: BUG: df.duplicated treats None as np.nan in object columns #21720
  • It returns an Index (resp. CategoricalIndex) for a Series, instead of maintaining its type
  • It is much less obvious as a method than unique, especially for people not knee-deep in statistics or R.

There's actually an ancient issue (#4087, by you, no less ;-)) about adding inverse-handling to unique. It got (wrongly) closed in anticipation of resolution through #21645, but that was eventually abandoned as the API discussion then resulted in the instruction to implement return_inverse not for drop_duplicates, but for unique (which I'm stuck with in #24108 and precursors).

speaking of reconstruction; how is this useful to a user? what is the use case

Among other things, all cases where factorize is (mis-)used for getting the faux-inverse. Of course, you'll be able to mock most of such functionality through joins somehow, but the point is that the cython code already has all the necessary information for the inverse, and we just need to spit it out.

Additionally, perhaps the following rough example of a clean use case may be easier to imagine:

  1. Have a dirty dataset (e.g. address data)
  2. Want to execute something expensive (e.g. fuzzy matching through NN, LSH, word2vec or what have you), without wasting performance on records that you know to be the same a priori.
  3. calculate unique values, but save inverse.
  4. do expensive calculation for unique values.
  5. broadcast back to original dataset with inverse from unique-calculation (see the sketch below).
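A minimal sketch of steps 3-5 with today's API (pd.factorize stands in for the proposed return_inverse, with the missing-value caveat discussed above; expensive_clean is a hypothetical stand-in for the costly step):

>>> import pandas as pd
>>> def expensive_clean(addr):  # in reality e.g. fuzzy matching / NN / LSH
...     return addr.lower().replace('street', 'st')
...
>>> addresses = pd.Series(['1 Main St', '1 Main Street', '1 Main St'])
>>> codes, uniques = pd.factorize(addresses)  # step 3: uniques plus inverse codes
>>> cleaned = pd.Series([expensive_clean(u) for u in uniques])  # step 4: once per unique
>>> cleaned.values[codes]  # step 5: broadcast back to the original rows
array(['1 main st', '1 main st', '1 main st'], dtype=object)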

** it's non-trivial to calculate which index gets preserved after the application of unique. I'd argue that perhaps one of the main reasons why Series.unique had to return an array is that no one had implemented the capability in the cython code to return the (already captured) inverse until #23400.

@jreback
Contributor

jreback commented Oct 11, 2019

In general, pandas is emancipating itself more and more from numpy, and it is both an artifact** of its development and a burden to usability to have Series.unique return an array

you misunderstand - i mean a Pandas Array

@jreback
Contributor

jreback commented Oct 11, 2019

pls answer why .factorize is not just the answer here

@h-vetinari
Contributor Author

pls answer why .factorize is not just the answer here

The whole middle block of my comment deals with that...

you misunderstand - i mean a Pandas Array

Because the index information is valuable too, especially since unique does not sort.

@jreback
Contributor

jreback commented Oct 11, 2019

this doesn’t answer the question

an Array preserves ordering

the actual index values by definition would disappear even for a Series

@h-vetinari
Contributor Author

an Array preserves ordering

Yes, the current implementation already maintains the order, but it's darn hard to correctly reconstruct the index information for the np.array (or pd.Array) the user then has. And factorize does not solve it because it treats all missing values equally, whereas unique distinguishes between None, np.nan, pd.NaT, etc.
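Concretely (object dtype; behaviour as of the 0.23/0.24 era, output lightly abbreviated):

>>> s = pd.Series([None, np.nan, 1], dtype=object)
>>> pd.unique(s)     # None and nan are kept distinct
array([None, nan, 1], dtype=object)
>>> pd.factorize(s)  # both missing values are collapsed to the -1 sentinel
(array([-1, -1,  0]), Index([1], dtype='object'))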

the actual index values by definition would disappear even for a Series

There must be some sort of misunderstanding here? Of course a Series would have an index...? In fact, the result would be the same as s.loc[~s.duplicated()], but this also does not help, as duplicated provides no information about which unique value each original value maps back to.

@h-vetinari
Contributor Author

h-vetinari commented Oct 11, 2019

@jreback: the actual index values by definition would disappear even for a Series

@h-vetinari: There must be some sort of misunderstanding here? [...]

Ok, maybe the misunderstanding was partly on my side. Obviously the Series has an index (that's what I mistakenly thought might be your question), but more than that, the mapping from the original values to the unique values preserves a meaningful index, precisely because it does not sort - and therefore (the implicit) keep='first' selects a well-defined index for each unique value it picks.

@jreback
Contributor

jreback commented Oct 11, 2019

@jreback: the actual index values by definition would disappear even for a Series

@h-vetinari: There must be some sort of misunderstanding here? [...]

Ok, maybe the misunderstanding was partly on my side. Obviously the Series has an index (that's what I mistakenly thought might be your question), but more than that, the mapping from the original values to the unique values preserves a meaningful index, precisely because it does not sort - and therefore (the implicit) keep='first' selects a well-defined index for each unique value it picks.

again this is exactly what factorize does; it returns a tuple of the indexers and the unique values

if there is a specific issue with say not preserving a type of null value i suppose that could be a bug, but i see lots of conflation here

@h-vetinari
Contributor Author

@jreback: if there is a specific issue with say not preserving a type of null value i suppose that could be a bug, [...]

Factorize and unique are different by design: The docs for factorize explicitly say that missing values will be ignored:

@docs: Note: Even if there’s a missing value in values, uniques will not contain an entry for it.

This is not what .unique does or should do. So it's not a bug, but a desired difference.

@jreback: again this is exactly what factorize does; it returns a tuple of the indexers and the unique values

Here as well, the indexer (into an array!) plus an Index of unique values is not nearly the same as just a thinned-out Series containing the (first) unique values and their correct index from the original.

@jreback: [...] but i see lots of conflation here

May I ask you to not assume confusion so quickly? I've been on this case for about 1.5 years, and I do not conflate the various methods and their goals.
