API Should Index be made opt-in? #48880
Comments
I agree, this is too big to change. Also, I am not convinced that this would be an improvement in general. This is what Did you look into adding this as a metadata flag? Like with |
The doc linked in the OP talks about certain libraries only supporting "default indexing". Could this be recast as libraries only supporting Are there other ops like |
A couple that come to my mind:
|
Doh - of course; anything that uses alignment will immediately be incompatible. For groupby, I think one can just declare something like If we created a default index object, I think we'd need to implement alignment for such objects via ilocs. For simplicity, let's say alignment between a default index and non-default index raises. If the internals use of alignment were sufficiently abstracted, maybe this could work, but I'd have to imagine that isn't the case currently and everything would just break. Edit: And to make them sufficiently abstracted I imagine would be quite a perf hit. |
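For readers less familiar with the alignment behaviour discussed here, a minimal sketch of what pandas does today may help. The variable names are purely illustrative, and nothing below reflects the proposed default index object, only current label-based alignment:

```python
import pandas as pd

# Two Series holding the same labels in a different order.
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([10, 20, 30], index=[2, 1, 0])

# Arithmetic between Series matches rows by label, not by position.
aligned_sum = left + right

# Dropping to NumPy ignores the labels and adds position by position.
positional_sum = left.to_numpy() + right.to_numpy()

# With only partially overlapping labels, unmatched labels become NaN.
partial = pd.Series([1, 2], index=[0, 1]) + pd.Series([1, 2], index=[1, 2])
```

Any operation that relies on this label matching would need a positional fallback if one of the operands carried no index, which is the incompatibility raised above.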
Alignment is a good point. Did not think about this. I don't really see the benefit on our side or for our users here |
Perhaps it's helpful to think about this request as less of a Perhaps a config option would control whether objects are created with an index or not, so that you never have a situation where "default" and "nondefault" indexes coexist. |
Would the idea then be to raise if I'm worried this would add a lot of complexity, and I'm also not really clear what the benefit to users would be (though I'll concede that I've not attended an API Consortium meeting yet, and that maybe if I'd followed the whole discussion then I'd see it) |
Thanks @MarcoGorelli and others! While I'm happy to continue discussing the technical details, of which there are many to iron out, I feel the discussion would benefit greatly from some additional context for this issue. First, I should clarify that I agree on both points below:
However, I'm hoping we can convince Pandas maintainers to make Pandas "index unaware" by default (whatever that looks like), because it would benefit the ecosystem of DataFrame-like libraries and their users in the longer term.

The goal of the Consortium is to develop a "standard" DataFrame API, based on Pandas, that all Pandas-like Python libraries can support. This will enable users and downstream packages to switch out one library for another. A recent example of success of this idea from the Array world can be seen here.

Many Pandas-like libraries today either do not support indexes at all, or support a subset of index functionality. Dask, for example, does not support MultiIndex. Speaking as a cuDF developer, we do support both If we could drop indexes from our API and still match Pandas' default behaviour, we would do so in a heartbeat. I believe many other library maintainers feel the same way.

Opt-in v/s opt-out

The success of a "standard API" depends on matching the default behaviour of Pandas, which is why the request is framed as making Indexes opt-in, rather than opt-out. |
Thanks for sharing context! To clarify:
would you then also drop |
Yes, the idea is that "standard API" would not include |
OK thanks that makes more sense to me then - removing I'll rework the proposal in light of this |
i am pretty skeptical here that you can change this w/o massive changes - nor do i think you should but happy to see a sketch |
I sympathize with the task of the dataframe standard, and that row labels make things complicated, but do we have actual users asking for this who wouldn't be better off thinking through their data model? Most (all?) of the times I've heard requests for not having row labels, the user would have been better off thinking through whether they have one or more columns that "should" be used as the index. And to confirm the intent of the proposal: are you proposing to make indexes opt-in or opt-out? I'd probably be OK with opt-out (users can explicitly say |
I think this comment by @shwina clarifies it best
So, the request is specifically for Index to be opt-in, because otherwise, no-Index wouldn't be the default behaviour. The options seem to be:
@shwina could you please clarify - would you consider option 2 an improvement on the status-quo? Would it be good enough for cuDF to drop Index, or would options 2 and 3 be indifferent to you? |
Thanks for the summary, @MarcoGorelli!
|
Sure, go ahead, I think I've given edit permissions In terms of breakage - |
I don't want to completely rule out option 1 either, but I suspect it's going to be difficult to get widespread agreement on it. Row labels are useful! It might be OK for a standard to ignore / defer them (though IMO it'll hurt the adoption of the standard by users even if it makes the standard easier to implement). But in general I think that pandas users are better off finding natural row labels for their data than thinking about how to get rid of it. |
Honestly, it's hard to come up with stuff that would not change and that is more than a single function. You'd probably have to rewrite everything, I don't think that there is a code base out there that does not use any index related functionality. |
Linking a related prior discussion on optional indexes (but for the "pandas2" rewrite): wesm/pandas2#17 And a point of comparison, how xarray transitioned to optional indexes pydata/xarray#1017. IIUC:
cc @shoyer for any correction or any further experiences with xarray transitioning to an optional index. |
Thanks, that's useful What I have the most difficulty with from that list is
, really think But I'm warming to the idea of there being an Index-free mode, so long as any breakage is loud and clear |
This was mostly a gesture towards backwards compatibility. In principle, I agree -- we could (and maybe should?) make |
I am personally quite interested in seeing this explored and discussed. I think it could be an interesting improvement to the ease of use of pandas in many cases.

I want to say something about the framing of the discussion, though. Personally, I am in the first place interested in this discussion for pandas itself. To put it a bit bluntly: for this discussion, I don't care about the fact that other dataframe libraries have a hard time implementing Index, or that the Consortium would like a standard without a row index. I personally have been pondering this from time to time since before the consortium existed, as have others (eg wesm/pandas2#17), and we should discuss it for pandas' sake. If such a change in pandas would help the broader ecosystem (the other dataframe libraries) and the consortium / standard, then that's a good additional argument of course (and the consortium has also certainly been useful to trigger this discussion), but IMO it shouldn't be the primary argument.

I think this part of "why would this be useful for our users" is also a bit missing from the discussion for now (unless that is clear already? but then it would still be useful to explicitly spell out) The hackmd doc that Marco shared has a brief paragraph on this:
But I think we should try to expand on this. Another example where our default alignment can give surprising results is something like

#34576 raises the issue of pandas by default writing the index in file output, even if that index was meaningless (pandas of course doesn't know whether that's the case, but in my experience a default RangeIndex often is)
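As a small illustration of the file-output point, here is a sketch of the current `to_csv` default next to the explicit opt-out. The frame is made up for the example; calling `to_csv` without a path returns the CSV text as a string:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# By default the row labels are written as an unnamed first column.
with_index = df.to_csv()

# Passing index=False leaves the row labels out of the output.
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)
```

A round trip through the default output therefore gains an extra unnamed column unless the reader is told which column holds the index.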
Tom, could you expand a bit on this? I personally find row labels very useful, in specific cases where I manually set an existing column as the index (timestamps, some actual existing identifier, ..). But I almost never found the index very useful in case of the default RangeIndex (or converted to integer index once you do some operation such as a filter). It might be that sometimes there could have been a column in my data that could be used as index, but 1) I am not sure that is always the case, and 2) even if there is one, it doesn't always make your life easier to set it (if you don't explicitly make use of functionality that relies on it).
One note about |
Yeah, that matches my experience. I was thinking primarily about "Index vs. no Index", in which case I'll always push people a bit to think about their data model. But if we're thinking about improving the ergonomics of using a DataFrame with just the default RangeIndex, then yes, I think that auto resetting RangeIndex is worth exploring. |
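To make the "auto resetting" idea concrete, the sketch below shows what happens today and what a manual reset does. It only uses existing pandas calls; the automatic behaviour under discussion does not exist in pandas, so the explicit `reset_index(drop=True)` stands in for it here:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# Filtering keeps the original labels of the surviving rows.
filtered = df[df["a"] > 1]
labels_after_filter = filtered.index.tolist()

# An explicit reset relabels the rows 0..n-1 again.
relabelled = filtered.reset_index(drop=True)
labels_after_reset = relabelled.index.tolist()

print(labels_after_filter, labels_after_reset)
```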
Lots of really good points here, thanks for the discussion
I'm -1 on auto-resetting RangeIndex though as it'd silently break the example I put in the writeup:
Even if Something to address would be methods which default to setting an index without a way to opt-out. Perhaps to get to opting out of an index, we'd need to ensure that no method sets an index by default. This'd mean going through a deprecation cycle to make |
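One example of a method that puts data into the index by default is `value_counts`. This is only an illustration chosen for this sketch, not necessarily one of the methods the comment above had in mind:

```python
import pandas as pd

colours = pd.Series(["x", "y", "x"])

# The distinct values end up as row labels, the counts as the data.
counts = colours.value_counts()
count_labels = counts.index.tolist()

# Moving the labels back into a regular column takes an extra step.
flat = counts.reset_index()

print(counts)
print(flat)
```

The column names produced by `reset_index` here differ between pandas versions, so the sketch does not rely on them.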
Yes, for me this discussion is about the above (being able to set an index and use it is super useful (and an important part of pandas), we are not planning to remove that, that's not the discussion, but can the default experience (currently RangeIndex) be improved?) And thinking about this as an "auto-resetting RangeIndex" is an interesting approach, as it would be very similar to a "no index" option (it is the cases where it wouldn't be exactly the same that we need to further explore and discuss, eg .loc, alignment, etc)
It would be possible in some cases, but that doesn't help to reduce the amount of code that this would break. |
Right, good point - OK with keeping The simplest way to drive the conversation forward might be to just try implementing this as a POC and seeing what we run into - I'll do some work on this |
I'm happy to be of help here if you need a collaborator, or even just a sounding board. Logistically, one place we could chat is the RAPIDS slack, but I'm open to other suggestions as well. |
In general this sounds good, but the details are tricky. This works as long as you don't get any meaningful data as index (groupby, stack, pivot, ...; anything that drops rows is not straightforward too, because you can use the index of the result to select the rows from another object for example). So if those operations don't return an Index anymore, this causes lots of trouble in places that might not be obvious. @MarcoGorelli do you want to change RangeIndex or implement a new Index object? |
New one, I'd feel really uneasy about changing RangeIndex, mainly because of the example given this comment (unless there's a way around it that wouldn't silently break code?) |
Yeah that is what I was referring to basically, it's not only about loc though, drop_duplicates has a similar effect for example |
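A minimal sketch of the `drop_duplicates` concern, with invented data: the labels that survive deduplication are commonly reused to pick the matching rows from another object, and relabelling the result would silently select different rows.

```python
import pandas as pd

orders = pd.DataFrame(
    {"customer": ["a", "b", "a", "c"], "amount": [10, 20, 30, 40]}
)
notes = pd.Series(["first", "second", "third", "fourth"])

# Keeps the first row per customer; the surviving labels are 0, 1 and 3.
first_orders = orders.drop_duplicates(subset="customer")

# Reusing those labels selects the matching rows from another object.
matching_notes = notes.loc[first_orders.index]

# If the result had been relabelled 0..n-1, the same lookup would
# return different rows without raising any error.
relabelled = first_orders.reset_index(drop=True)
mismatched_notes = notes.loc[relabelled.index]

print(matching_notes.tolist(), mismatched_notes.tolist())
```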
I think there are several good reasons to have it as a separate concept from RangeIndex (while it is useful to think about it as a different RangeIndex). Even if we would eventually want to replace the default behaviour from RangeIndex to this new behaviour, we still need a way for people to try this out without breaking existing RangeIndex-based code. Although it also doesn't necessarily have to be a different object (like
I think that something like
@shwina we nowadays also have a pandas slack! (https://mail.python.org/pipermail/pandas-dev/2022-October/001525.html) You are very welcome to join there as well! |
I'm with Jeff as skeptical-but-open-to-it. Making index opt-in rather than opt-out is probably a non-starter for the near future. |
As of now, I'm +1 for optional Indexes but favor opt-out Indexes |
Thanks everyone for the discussion I've amended the write-up to include a proposal for how this might happen: https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ In short:
How to clearly communicate such a breaking change in 3.0.0 would depend on decisions taken when making the index opt-out, so I'd rather have such a conversation now rather than once it's already opt-out |
How should a user opt-out of an object that already has an index? Also if we move forward with the global config option, maybe |
Good point, thanks - I've added an example of that. I think
In [4]: df
Out[4]:
a b
2000-01-01 1 4
2000-01-02 2 5
2000-01-03 3 6
In [5]: df.reset_index(drop=True)
Out[5]:
a b
1 4
2 5
3 6
There's also the question of whether the no-index should appear in the repr, and I think it looks better if it doesn't (this way it also signals to users that there's something different about this DataFrame compared to the ones they're used to) |
From an annotation perspective, it might be nice to make index-less objects a different class (i've seen "Table" mentioned elsewhere), which I think DataFrame could then subclass. This would require something like #48126 |
While i am sympathetic to adding a NoIndex there are some concerns
i would be more amenable to starting something new if we finish things first (which practically would mean killing array manager) |
I've gone through with another iteration of the writeup: https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ TL;DR: The proposal is to have:
I think #49069 is uncontroversial, so if it's OK I'll let the commenter know that it's OK to get started on it if they're still interested EDIT: still not sure if we really want #49069 , and it may even be orthogonal to this issue, so holding off work on that til there's been further discussion |
closing, as discussion has moved to #49694 I'll update the PDEP soon, but in short, the proposal in its current state wouldn't make it as the default, and there's reluctance to add it as non-default because of the testing and maintenance burden that adding extra options entails I still think it should be possible to do something here, but it'd require a more comprehensive set of changes to be made at the same time. We're talking maybe about pandas 4.0.0 here, it'd be quite a long way away and would take considerable thought and discussion |
One of the points raised in the pandas standardisation doc, in the context of the Consortium for Python Data API Standards, is about making indices opt-in
I've started putting together a write-up on this: https://hackmd.io/@mntOORP3TCesJJyvg-IdFQ/B1EqHfQfo
TLDR: the suggestion is to have a DefaultIndex which would always go from 0 to the length of the DataFrame, and then loc and iloc would be aligned. I think this'd be too big of a breaking change in pandas, and that there'd be a serious risk of silently ruining people's analyses/models. I'd be open to either introducing DefaultIndex but not have it be the default, or to make it the default but not give it a loc method
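For context, here is a sketch of how loc and iloc can disagree in pandas today. DefaultIndex is only a proposal and is not used below; the frame and variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [30, 10, 20]})

# Sorting reorders the rows, but each row keeps its original label.
sorted_df = df.sort_values("a")
labels_after_sort = sorted_df.index.tolist()

# loc looks up the row labelled 0; iloc takes the first row by position.
by_label = sorted_df["a"].loc[0]
by_position = sorted_df["a"].iloc[0]

print(labels_after_sort, by_label, by_position)
```

Under an index that always runs from 0 to the length of the frame, both lookups would return the same row, which is why existing code relying on the label-based result could change behaviour silently.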