Proposal to change behaviour with .loc and missing keys #15747
Comments
(another advantage of 3. would be that it's easier to implement - but still, I think we would want to change the behaviour on missing labels: namely, to drop them) |
Right: I think #10549 is, as I commented, not really a bug as long as 2. is the rule. The reporter is expecting a Of #10695, I clearly don't like the proposal to add a new indexer, but apart from this they are just asking for option 1. above, and I think they are right in preferring it to the current state of things. |
(I can try to see if it is feasible to have a By the way: I guess @shoyer might be interested too Worth mentioning that 1. is more coherent with |
yeah I would be ok with 1). The reasons for the current behavior are twofold:
note that I think we are going to get rid of all of the expanding stuff in pandas2 anyhow. You will have to explicitly
So the driving force was actually 1). |
I think this third option needs clarification, as you can still either drop missing labels or introduce NaNs (so in that regard maybe 4 options or a 3a and 3b).
I don't think this is true? (or is there an issue on the pandas2 tracker for this?) For example being able to assign one or multiple columns is an expanding indexing operation on assignment, and not an ability we want to lose? @jreback When you say "I would be ok with 1)", you are OK with If we would start fresh, I would, with my current knowledge, be in favor of The question for me is more: can we justify such a breaking change? (as it will break people's code) Are there ways to ease a transition? (I think in principle we can first have a warning in case it will raise an error in a next release?) |
cc @pandas-dev/pandas-core |
@jorisvandenbossche I am only talking about row-wise expansion, which is the obvious issue. column assignment of course is unaffected. It is possible that this will still be ok (the issue is can performant
|
I think a warning is possible. |
yes I think moving back to a strict separation between |
how is 3b different from 2? |
xref wesm/pandas2#32 proposing disallowing setting with (row) enlargement |
And actually wesm/pandas2#21 is related to the proposal here (or at least to option 1) |
I would certainly pick (1) if starting from scratch (I did for xarray), but it's a serious breaking change so it's probably best to save it for pandas 2.0. We discussed this in other issues, but I think we need symmetric behavior for (3) seems like a strict improvement over what we have now, so I would definitely go for it. |
@shoyer What do you mean with 3 ? (see my previous question regarding clarification of this option)
I don't see how this is strictly possible. Assigning a new, non-existent column ( |
I meant (3) with introducing NaNs. Although it does indeed get really messy, e.g., with MultiIndex, so arguably we aren't even doing this consistently already.
Sorry, yes, I meant all row indexing, and probably most but not all column indexing. We will need an exception to the rules for setting new columns, at least one at a time. |
One possibility is to limit this discussion to only getting. If we only speak about getting, eg
I think something like that is a lot less common to be intended, and more likely to be a user error. |
@toobaz see the referenced issues (wesm/pandas2#32 and wesm/pandas2#21) |
(Replying in general to comparisons of That said, I agree to keep the setting discussion separated. And I agree with @jorisvandenbossche that with columns it is even more true that missing labels are usually symptoms of an error.
If we go for (3), is there any real downside in just dropping the missing labels?! Because the downside of introducing NaNs is, as you (@shoyer) mention, that we don't know how to extend to partial indexing of MultiIndex, and also the undesired type changes when for instance working with ints or bools. And also that, again, one does not expect a getter indexer to expand the index - that is, to "get" something which is not there. |
2 raises if asked a list of all missing labels, 3b would return an all-NaN object. |
so what is 3a then? |
@toobaz you can feel free to actually edit the top-section (of options) |
Drop the missing labels silently, without introducing NaNs (which is actually what MultiIndex does when indexing certain levels, as consequence of not being able to introduce NaNs) |
(Updated the description) |
… list are selected via .loc closes pandas-dev#15747
I think currently |
MI needs a separate issue (there might be one we can hijack) |
They have exactly the same issue no? (in that they allow non existing labels in a list) Copying from your comment on the PR (#17295 (comment)):
|
Yes, sorry, I wasn't clear. They currently behave like flat indexes, and in this sense they don't exhibit any "particular" issue, and they will need a PR analogous to #17295 for homogeneity with flat indexes.
Opened #17758 ... #15452 is related but does need a different fix. |
… list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747
Is there a quick convenient way to achieve option 3a in the future? Maybe modify to the |
There must be a way to select all matching indexes in order, as @gjimzhou mentioned like a change to pd.Series.get or perhaps pd.DataFrame.get() or pd.DataFrame.set(). This is important for both getting and setting from one DataFrame to another, for example:
or, alternatively, when filtering one DataFrame by another in order:
This is extremely common in many data analysis workflows that rely on loc(). This new change in behavior greatly increases the complexity of both scenarios and is breaking functionality. |
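For what it is worth, both non-raising behaviours discussed in this thread can be obtained explicitly, without relying on .loc tolerating missing labels. The following is only a minimal sketch: the Series s and the list wanted are made-up illustration data, and it uses Series.reindex for the NaN-filling variant and a label filter built with Index.isin for the dropping variant.

```python
import pandas as pd

# Made-up data for illustration only.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
wanted = ["c", "x", "a"]  # "x" is not in s.index

# NaN-filling variant: reindex keeps every requested label, in request order,
# and introduces NaN where the label is missing (the dtype becomes float).
filled = s.reindex(wanted)

# Dropping variant: keep only the requested labels that exist, still in
# request order, then select them with .loc as usual.
wanted_idx = pd.Index(wanted)
present = wanted_idx[wanted_idx.isin(s.index)]
dropped = s.loc[present]
```

Note that the reindex variant changes the integer values to floats because of the introduced NaN, which is the type-change downside mentioned earlier in the thread.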
Problem description
Although coherent (except for some unfortunate side-effects - some of them below) with the docs where they say "At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!", the current behavior is - I claim - a terrible choice for both developers and users.
There are (at least) three ways to behave with missing labels:
1. Raise an error if any label is missing
2. Raise an error only if all labels are missing
2a. ... while if at least one label is present, missing labels become NaN (current)
2b. ... while if at least one label is present, missing labels are silently dropped
3. Never raise an error for missing labels
3a. ... and they become NaN
3b. ... and they are silently dropped
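The options above can be illustrated with a toy model. The sketch below is not pandas code: it uses a plain dict as a stand-in for a labelled Series, and the function names (select_strict, select_lenient, select_never_raise) and the fill parameter are hypothetical, chosen only for this illustration.

```python
def select_strict(data, labels):
    """Option 1: raise if any requested label is missing."""
    missing = [lab for lab in labels if lab not in data]
    if missing:
        raise KeyError(missing)
    return [(lab, data[lab]) for lab in labels]


def select_never_raise(data, labels, fill=False):
    """Option 3: never raise. fill=True fills with NaN, fill=False drops."""
    nan = float("nan")
    if fill:
        return [(lab, data.get(lab, nan)) for lab in labels]
    return [(lab, data[lab]) for lab in labels if lab in data]


def select_lenient(data, labels, fill=True):
    """Option 2: raise only if every requested label is missing."""
    if labels and not any(lab in data for lab in labels):
        raise KeyError(list(labels))
    # Otherwise behave like option 3 (NaN-filling by default, as in "current").
    return select_never_raise(data, labels, fill=fill)
```

In this model an empty list of labels never raises under option 2, which mirrors the .loc[[]] case mentioned in the text.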
For developers
Options 1. and 3. are both much easier to implement, because in both cases you can break the question "am I going to get an error?" into smaller pieces - e.g. when indexing a MultiIndex, you will get an error if you get an error on any of its levels. Option 2. is instead more complicated (and computationally expensive), because you need to first aggregate in some way across levels/axes, and only then can you decide whether to raise an error or not. Several inconsistencies came out as a consequence of this choice, some of them still unsolved, such as #15452, this, the fact that pd.Series(range(2)).loc[[]] does not raise, and the fact that pd.DataFrame.ix[[missing_label]] doesn't either.
Other consequences of 2.
Additionally, it was decided that the behavior with missing labels would be to introduce NaNs (rather than to drop them), and I think this was also not a good choice (and indeed partial indexing of MultiIndexes does not behave this way - it couldn't). I think it is also undocumented.
And finally, since the above wouldn't always tell you what to do when there are missing labels in a MultiIndex, it was decided that .loc would rather behave as .reindex when there are missing and incomplete labels, which is totally unexpected and, I think, undocumented.
Notice that these further issues (and more in general, the question "what to do when some labels are missing and you are not raising an error") would partially still hold with 3, but could be dealt with, I think, more elegantly.
For users
I think the current behavior is annoying to users not just because of those "Other consequences", but also because it is more complicated to describe in terms of set operations on labels/indices. For instance, with options 1. and 3.
pd.concat([chunk.loc[something] for chunk in chunks])
and
pd.concat(chunks).loc[something]
both return the same result (or raise). With 2., instead, it actually depends on how missing labels are distributed across chunks.
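A toy check of this point, restricted to the dropping variant of option 3 and to option 2, might look as follows. It is a sketch under stated assumptions rather than pandas code: plain dicts with disjoint labels stand in for the chunks, the helper names (select_drop, select_lenient, concat) are hypothetical, and results are compared as label-to-value maps, ignoring order.

```python
def select_drop(data, labels):
    """Option 3 (dropping variant): silently drop missing labels."""
    return {lab: data[lab] for lab in labels if lab in data}


def select_lenient(data, labels):
    """Option 2: raise only if every requested label is missing, else drop."""
    if labels and not any(lab in data for lab in labels):
        raise KeyError(list(labels))
    return select_drop(data, labels)


def concat(dicts):
    """Merge chunks with disjoint labels, standing in for pd.concat."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


chunks = [{"a": 1, "b": 2}, {"c": 3}]
labels = ["a", "c", "z"]

# Dropping variant: select-then-concat matches concat-then-select.
per_chunk = concat(select_drop(chunk, labels) for chunk in chunks)
whole = select_drop(concat(chunks), labels)
assert per_chunk == whole == {"a": 1, "c": 3}

# Option 2: the call on the concatenated data succeeds, but a chunk holding
# none of the requested labels raises, so the outcome depends on the chunking.
labels2 = ["a", "z"]
whole2 = select_lenient(concat(chunks), labels2)  # fine: "a" is present
try:
    select_lenient(chunks[1], labels2)  # neither "a" nor "z" is in this chunk
    chunk_raised = False
except KeyError:
    chunk_raised = True
```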
(Why this?)
It is worth understanding why 2. was picked in the first place, and I think the answer is "to be coherent with intervals". But I think it's not worth the damage - after all, an iterable and an interval are different objects. And moreover, introducing NaNs for missing labels is anyway incoherent with intervals.
Backward incompatibility
Option 1. is, I think, the best, because it is also coherent with numpy's behavior with out-of-bounds indices (e.g. np.array([[1,2], [3,4]])[0,[1,3]] raises an IndexError).
But while clearly both 1. and 3. could break some existing code, 3. would be better from this point of view, in the sense that it would break only code assuming that an error is raised. Although one might even claim that 1., by breaking code which looks for missing labels, can help discover bugs in user code (not a great argument, I know).
So overall I am not sure about what we should pick between 1. and 3. But I really think we should abandon 2., and that the later this is done, the worse. @jreback, @jorisvandenbossche if you want to tell me your thoughts about this, I can elaborate on what we could do with the "Other consequences" in the desired option.
Then if you approve the change, I'm willing to help in implementing it.