Index.unique() should always return an Index object of the same type #13395
Comments
At the moment, I think DatetimeIndex is rather the exception, as most seem to return a numpy array (and CategoricalIndex a Categorical):
|
this is a dupe of #4126 |
closing the other one actually. |
|
yeah I don't think we ever changed |
One reason not to change This definitely needs to go in a major release because it will break some user code. |
In the PR of @sinhrks, it is now proposed to return an Index of the same type for both Index and Series. While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about |
I don't agree. You would then be giving meaning to the index of the Series that you are returning when it doesn't have any (the ordering actually does have meaning, but that is true in either case), so returning an Index is the correct action here. |
Of course this boils down to not having a good array-like container that can hold all pandas-supported types. (Index is such a container and can be used for that, but IMO to users it is not; to users it is the labels of the index/columns of a DataFrame/Series.) Options:
|
I disagree. Index IS the container object and is most appropriate; Series is plain confusing. |
i think it's natural that
|
I agree. Returning an index for For Series.unique, I don't think we have any good options prior to pandas 2.0. I would stick with returning numpy arrays for now. |
you seem to be against natural things and seem to want pandas to be like numpy |
You misunderstand me. This is about what feels consistent with the current version of pandas:
|
ok, I'll change my opinion here. I can see |
As I mentioned in #13944, in pandas 2.0, I think the logical type for the return value of |
@shoyer yes, and if pandas 2.0 was around the corner and we DIDN'T have a 1.0 I would agree. However, we very rarely expose raw ndarrays to the user ATM. Aside from |
My opinion is that we should not introduce any breaking changes in 1.0 that
|
In the current pandas, I would vote for returning a Series, although that is also not ideal. But I agree with @shoyer that if we change it again for 2.0, it is maybe not very beneficial to change this for 1.0 as well. To be clear, we should take care that "but we will change that for 2.0" does not become a reason to stop making needed changes now. But, in this case, I personally don't think the return value of * we could also return an object array of timestamps for that specific case |
ok, for 0.19.0 we need to change Ok, so the only question then is to make If it should eventually return a But these are just way too many ifs. This needs to be resolved ASAP. @wesm why don't you weigh in here. |
Just read through this. In pandas 2.0 Several problems with
In [10]: s = pd.Series([1,2,3,4] * 4)
In [11]: unique_vals = s.unique()
In [12]: from pandas.util.testing import rands
In [13]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique_vals))])
In [14]: df
Out[14]:
uniques
mB2LJrlOw5 1
qPF14xkGNl 2
0nE5HHGM0d 3
AbQEAYpYmW 4
If unique_vals were a Series instead:
In [18]: unique_vals = pd.Series(unique_vals)
In [19]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique_vals))])
In [20]: df
Out[20]:
uniques
LiQZXm6K5V NaN
B8HABWAK2o NaN
4hIrDH3Ue0 NaN
JpaO9iMWTP NaN
Contrived as this may be, it is enough of a concern to make me -0 on this and very nearly -1.
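For anyone who wants to reproduce the session above without `rands`, here is a minimal sketch. The fixed string labels are placeholders of mine (an assumption, not part of the original session); the point is only that an ndarray is placed positionally while a Series is aligned against the requested index.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4] * 4)
unique_vals = s.unique()  # a plain numpy array in this case

# Hypothetical labels standing in for the random strings from rands(10).
labels = ["w", "x", "y", "z"]

# ndarray input: values are assigned to the new index by position.
from_array = pd.DataFrame({"uniques": unique_vals}, index=labels)

# Series input: the constructor aligns on the Series' own 0..3 index,
# none of which match the string labels, so every value becomes NaN.
from_series = pd.DataFrame({"uniques": pd.Series(unique_vals)}, index=labels)

print(from_array)
print(from_series)
```

So any user code that feeds the result of `unique()` into a constructor together with an explicit index would silently change behaviour if the return type became a Series.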
I agree it sort of stinks that we have both ndarray and non-ndarray (e.g. categorical) return values for |
(I agree that Index.unique should always return an Index) |
I'm +1 to leave One issue related to returning |
Nice example from @wesm of how Series vs array could break code, so let's not do that (although the reindexing behaviour of the constructors is maybe also a point for discussion ...) For the issue in #13565 (return value for unique of a tz aware series), options are:
I would go for the first or the second, but not really a preference. |
This was discussed here in the original issue.
Originally I had this returning a DTI; however, it was suggested that numpy compat was more important here. But it is quite simple to just return an So changing to 2) IMHO is the best; we should also change |
@jorisvandenbossche the conforming / reindexing behavior of the DataFrame ctor is a super valuable feature (and one of the very earliest ones from pandas 0.1) in my experience (it also saves a ctor-then-reindex step which results in an extra sweep of the data and copy). You can pass in a bunch of irregularly indexed data and "pluck" out the data that matches a particular "master" index that you have set. The alternative is to pass in label-naive arrays, which seems like an acceptable compromise |
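A small sketch of the "pluck" behaviour described above, with made-up data: the series names, values, and the master index are all hypothetical and only serve to show the conforming step.

```python
import pandas as pd

# Hypothetical, irregularly indexed inputs.
prices = pd.Series([10.0, 11.0, 12.0], index=["a", "b", "c"])
volumes = pd.Series([100, 300], index=["c", "a"])

# The "master" index we want the resulting frame to conform to.
master = pd.Index(["a", "c", "d"])

# Each Series is reindexed to `master` inside the constructor:
# "b" is dropped, and "d" has no data so it is filled with NaN.
df = pd.DataFrame({"price": prices, "volume": volumes}, index=master)
print(df)
```

Passing `prices.to_numpy()` instead would be the label-naive alternative: the values would be taken by position and would have to match the length of `master`.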
@wesm getting a bit off topic ... but: reindexing is undoubtedly a very valuable operation, I am personally just not sure if it should be the behaviour of the default constructor (could also be a dedicated method). It also leads to surprises and bugs/ambiguous behaviour. I recall some discussion in #9237, where it was not clear which should happen first (reindex based on given columns, or determine index values based on passed objects), and with Series objects with names resulting in an empty frame when specifying |
I see. We should discuss this separately, it seems like there was a consistency issue in the treatment of a single Series versus a dict of Series (i.e. determining a row index from the input series prior to selecting only the columns in |
This should also be noted in the docstring for the method.
Currently, it sometimes returns numpy arrays:
Most of the work here is probably writing comprehensive tests to check each index type.
xref: https://github.com/pydata/pandas/pull/13361/files/17209f92330c5e949934aec9dea039b35faf6e40#r66179418
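The per-index-type check mentioned above could be sketched roughly as follows. This is not the project's actual test suite; the helper name `check_unique_preserves_type` and the sample data are my own, and it assumes a pandas version in which the proposed behaviour has landed.

```python
import pandas as pd

# One small example with duplicates per index type.
indexes = [
    pd.Index([1, 2, 2, 3]),
    pd.Index(["a", "b", "b"]),
    pd.DatetimeIndex(["2016-01-01", "2016-01-01", "2016-01-02"]),
    pd.DatetimeIndex(["2016-01-01", "2016-01-01", "2016-01-02"], tz="UTC"),
    pd.TimedeltaIndex(["1 days", "1 days", "2 days"]),
    pd.PeriodIndex(["2016-01", "2016-01", "2016-02"], freq="M"),
    pd.CategoricalIndex(["x", "y", "x"]),
]


def check_unique_preserves_type(idx):
    """Hypothetical helper: unique() should keep the Index subclass and dtype."""
    result = idx.unique()
    assert type(result) is type(idx), (type(idx), type(result))
    assert result.dtype == idx.dtype
    assert result.is_unique
    return result


results = [check_unique_preserves_type(idx) for idx in indexes]
```

In a real test module this would more naturally be a parametrized test, and it would need to cover the remaining index types as well.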