Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to change behaviour with .loc and missing keys #15747

Closed
toobaz opened this issue Mar 20, 2017 · 31 comments · Fixed by #17295
Closed

Proposal to change behaviour with .loc and missing keys #15747

toobaz opened this issue Mar 20, 2017 · 31 comments · Fixed by #17295
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Mar 20, 2017

In [2]: pd.Series([1, 2, 3]).loc[[2,3]]
Out[2]: 
2    3.0
3    NaN
dtype: float64

In [3]: pd.Series([1, 2, 3]).loc[[3]]
[...]
KeyError: 'None of [[3]] are in the [index]'

Problem description

Although coherent (except for some unfortunate side-effects - some of them below) with the docs where they say "At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!", the current behavior is - I claim - a terrible choice for both developers and users.

There are (at least) three ways to behave with missing labels:

  1. you raise an error if requested at least one missing label
  2. you raise an error if requested only missing labels
    2a ... while if at least one label is present, missing labels become NaN (current)
    2b. ... while if at least one label is present, missing labels are silently dropped
  3. you never raise an error for missing labels
    3a ... and they become NaN
    3b. ... and they are silently dropped

For developers

Options 1. and 3. are both much easier to implement, because in both cases you can reduce the question "am I going to get an error?" in smaller pieces - e.g. when indexing a MultiIndex, you will get an error if you get an error on any of its levels. Option 2. is instead more complicated (and computationally expensive), because you need to first aggregate in some way across levels/axes, and only then can you decide whether to raise an error or not. Several incoherences came out as a consequence of this choice, some of them still unsolved, such as #15452, this, the fact that pd.Series(range(2)).loc[[]] does not raise, and the fact that pd.DataFrame.ix[[missing_label]] doesn't either.

Other consequences of 2.

Additionally, it was decided that the behavior with missing labels would be to introduce NaNs (rather than to drop them), and I think this was also not a good choice (and indeed partial indexing MultiIndexes does not behave this way - it couldn't). I think it is also undocumented.

And finally, since the above wouldn't always tell you what to do when there are missing labels in a MultiIndex, it was decided that .loc would rather behave as .reindex when there are missing and incomplete labels, which is totally unexpected and, I think, undocumented.

Notice that these further issues (and more in general, the question "what to do when some labels are missing and you are not raising an error") would partially still hold with 3, but could be dealt with, I think, more elegantly.

For users

I think the current behavior is annoying to users not just because of those "Other consequences", but also because it is more complicated to describe in terms of set operation on labels/indices. For instance, with options 1. and 3.

pd.concat([chunk.loc[something] for chunk in chunks])

and

pd.concat(chunks).loc[something]

both return the same result (or raise). Instead with 2. it actually depends on how missing labels are distributed across chunks.

(Why this?)

It is worth understanding why 2. was picked in the first place, and I think the answer is "to be coherent with intervals". But I think it's not worth the damage - after all, an iterable and an interval are different objects. And moreover, introducing NaNs for missing labels is anyway incoherent with intervals.

Backward incompatibility

Option 1. is, I think, the best, because it is also coherent with numpy's behavior with out-of-bounds indices (e.g. np.array([[1,2], [3,4]])[0,[1,3]] raises an IndexError).

But while clearly both 1. and 3. could break some existing code, 3. would be better from this point of view, in the sense that it would break only code assuming that an error is raised. Although one might even claim that 1., by breaking code which looks for missing labels, can help discover bugs in user code (not a great argument, I know).

So overall I am not sure about what we should pick between 1. and 3. But I really think we should leave 2., and that the later it is done, the worse. @jreback , @jorisvandenbossche if you want to tell me your thoughts about this, I can elaborate on what we could do with the "Other consequences" in the desired option.

Then if you approve the change, I'm willing to help in implementing it.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

(another advantage of 3. would be that it's easier to implement - but still, I think we would want to change the behaviour on missing labels: namely, to drop them)

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

xref #10549, #10695
xref (slightly) to #9213, #9595
@toobaz if you'd pull any nuggets from those would be helpful

@jreback jreback added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 20, 2017
@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

Right: I think #10549 is, as I commented, not really a bug as long as 2. is the rule. The reporter is expecting a reindex-like behavior, but then he rightly mentions the fact that it would be undefined for partial indexing, and so he states that he would have expected an error instead. He is in some sense asking for option 1. above, but the bug is instead quite orthogonal to the choice between 2. and 3. (and in both cases, I suggest we solve the problem by stopping inserting NaNs, and we drop missing labels instead).

Of #10695, I clearly don't like the proposal to add a new indexer, but apart from this they are just asking for option 1. above, and I think they are right in preferring it to the current state of things.

@toobaz
Copy link
Member Author

toobaz commented Mar 20, 2017

(I can try to see if it is feasible to have a strict argument for .loc which switches from 1. to 3.... but I'm not sure it would be easy to implement)

By the way: I guess @shoyer might be interested too

Worth mentioning that 1. is more coherent with .loc.__setitem__, which raises for missing keys (in lists). But then, it does add individual missing keys, so this is again not a very strong argument (I rather think that we should fix .loc.__setitem__ and make it add missing labels if we decide to pick 3. - unless with partial indexing of MultiIndexes, where it should probably drop them).

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

yeah I would be ok with 1). The reason for the current behavior are twofold:

  1. .loc and .reindex should be equivalent (though .reindex cannot be an lvalue), because it was confusing when to use these. IOW lots of complaints about I want to overwrite this label wether it exists or not (that's why partial setting works).

note that I think we are going to get rid of all of the expanding stuff in pandas2 anyhow. You will have to explicitly .reindex first, then assign.

  1. interactive work with a long list of labels could be confusing when 1 is missing on accident. IOW a typo. Though the counter-argument here is that this error should be raised immediately.

So the driving force was actually 1).

@jorisvandenbossche
Copy link
Member

  1. you never raise an error for missing labels

I think this third option needs clarification, as you can still either drop missing labels or introduce NaNs (so in that regard maybe 4 options or a 3a and 3b).

[@jreback] note that I think we are going to get rid of all of the expanding stuff in pandas2 anyhow. You will have to explicitly .reindex first, then assign.

I don't think this is true? (or is there an issue on the pandas2 tracker for this?) For example being to assign one or multiple columns is an expanding indexing operation on assignment, and not an ability we want to loose?

@jreback When you say "I would be ok with 1)", you are OK with loc and reindex no longer being equivalent? And also with the difference between getting (no enlargement) and setting (enlargement allowed)? As those would be the consequence of choosing for option 1.


If we would start fresh, I would, with my current knowledge, be in favor of loc being strict for getting (no missing labels allowed), i.e. option 1.

The question for me is more: can we justify such a breaking change? (as it will break people's code) Are there ways to ease a transition? (I think in principle we can first have a warning in case it will raise an error in a next release?)

@jorisvandenbossche
Copy link
Member

cc @pandas-dev/pandas-core

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

I don't think this is true? (or is there an issue on the pandas2 tracker for this?) For example being to assign one or multiple columns is an expanding indexing operation on assignment, and not an ability we want to loose?

@jorisvandenbossche I am only talking about row-wise expansion, which is the obvious issue. column assignment of course is unaffected. It possible that this will still be ok (the issue is can performant .appends be done. wesm/pandas2#53), the main issue is the dtype inference / changes. So this would be an error.

In [1]: s = Series([1,2,3])

In [2]: s
Out[2]: 
0    1
1    2
2    3
dtype: int64

In [3]: s.loc[3] = 'foo'

In [4]: s
Out[4]: 
0      1
1      2
2      3
3    foo
dtype: object

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

The question for me is more: can we justify such a breaking change? (as it will break people's code) Are there ways to ease a transition? (I think in principle we can first have a warning in case it will raise an error in a next release?)

I think a warning is possible.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@jreback When you say "I would be ok with 1)", you are OK with loc and reindex no longer being equivalent? And also with the difference between getting (no enlargement) and setting (enlargement allowed)? As those would be the consequence of choosing for option 1.

yes I think moving back to a strict separation between .reindex and .loc is fine. setting enlargement should be unaffected (and can kick the can down the road about whether to allow it).
getting enlargement by-definition will now fail (as its essentially a .reindex with missing labels)

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@jorisvandenbossche

I think this third option needs clarification, as you can still either drop missing labels or introduce NaNs (so in that regard maybe 4 options or a 3a and 3b).

how is 3b different from 2?

@chris-b1
Copy link
Contributor

xref wesm/pandas2#32 proposing disallowing setting with (row) enlargement

@jorisvandenbossche
Copy link
Member

And actually wesm/pandas2#21 is related to the proposal here (or at least to option 1)

@shoyer
Copy link
Member

shoyer commented Mar 20, 2017

I would certainly pick (1) if starting from scratch (I did for xarray), but it's a serious breaking change so it's probably best to save it for pandas 2.0.

We discussed this in other issues, but I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem). So if we pick (1), we should be sure we have matching behavior prohibiting setting and enlarging, too.

(3) seems like a strict improvement over what we have now, so I would definitely go for it.

@jorisvandenbossche
Copy link
Member

(3) seems like a strict improvement over what we have now, so I would definitely go for it.

@shoyer What do you mean with 3 ? (see my previous question regarding clarification of this option)

I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem)

I don't see how this is strictly possible. Assigning a new, non-existant column (df['A']) should work, getting a non-existant column with df['A'] should still raise a KeyError (or would you let .loc differ here?)

@shoyer
Copy link
Member

shoyer commented Mar 20, 2017

@shoyer What do you mean with 3 ? (see my previous question regarding clarification of this option)

I meant (3) with introducing NaNs. Although it does indeed get really messy, e.g., with MultiIndex, so arguably we aren't even doing this consistently already.

I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem)
I don't see how this is strictly possible. Assigning a new, non-existant column (df['A']) should work, getting a non-existant column with df['A'] should still raise a KeyError (or would you let .loc differ here?)

Sorry, yes, I meant all row indexing, and probably most but not all column indexing. We will need an exception to the rules for setting new columns, at least one at a time.

@jorisvandenbossche
Copy link
Member

One possibility is to limit this discussion to only getting.
For the setting, there is clearly still a lot more discussion needed (which kind of row / column enlargment would we like to allow? Which not?), and also a lot more code that will be broken by this (things as s[..] and df.loc[..] are widely used for enlarging series/frames I think). This will not so easily be achieved, and certainly needs to wait until pandas 2.

If we only speak about getting, eg

In [4]: df = pd.DataFrame({'A':[1,2], 'B':[3,4]})

In [5]: df.loc[:,['A', 'C']]
Out[5]: 
   A   C
0  1 NaN
1  2 NaN

I think something like that is a lot less common to be intended, and more likely to be an user error.
Furthermore, for getting you already have the alternative of reindex if you want this behaviour.
And thus (with a good deprecation strategy, I also think a warning should be possible, although maybe complicated to implement) more realistic to tackle on the shorter term?

@jorisvandenbossche
Copy link
Member

@toobaz see the referenced issues (wesm/pandas2#32 and wesm/pandas2#21)

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

for getting you already have the alternative of reindex

(Replying in general to comparisons of loc and reindex, also in other comments): I can understand that on the dev side the two are similar, and that past discussions were based on this comparison... but I think for the typical user they are completely different beasts both conceptually (one is an indexer, the other is a modifier that returns a copy) and in the docs.

That said, I agree to keep the setting discussion separated. And I agree with @jorisvandenbossche that with columns it is even more true that missing labels are usually symptoms of an error.

I meant (3) with introducing NaNs

If we go for (3), is there any real downside in just dropping the missing labels?! Because the downside of introducing NaNs is, as you (@shoyer) mention, that we don't know how to extend to partial indexing of MultiIndex, and also the undesired type changes when for instance working with ints or bools. And also that, again, one does not expect a getter indexer to expand the index - that is, to "get" something which is not there.

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

how is 3b different from 2?

2 raises if asked a list of all missing labels, 3b would return an all-NaN object.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

2 would raise if asked a list of all missing labels, 3b would return an all-NaN object.

so what is 3a then?

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

@toobaz you can feel free to actually edit the top-section (of options)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 21, 2017

so what is 3a then?

Drop the missing labels silently, without introducing NaNs (which is actually what MultiIndex does when indexing certain levels, as consequence of not being able to introduce NaNs)

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

(Updated the description)

jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Oct 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
@jorisvandenbossche
Copy link
Member

@toobaz the PR #17295 is going to close this issue, but only deals with single Index, not MultiIndex. Do you know from the top of your head the situation for MultiIndex or whether there is an issues that describes that aspect ?

jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
@toobaz
Copy link
Member Author

toobaz commented Oct 2, 2017

Do you know from the top of your head the situation for MultiIndex or whether there is an issues that describes that aspect ?

I think currently MultiIndexes don't exhibit any particular issue related to this. Certainly with #17295 flat indexes will emit a warning while MultiIndexes won't, but that can be addressed in a separate PR. I can try to have a look but not immediately.

@jreback
Copy link
Contributor

jreback commented Oct 3, 2017

MI need a separate issue (might be one we can hijack)

@jorisvandenbossche
Copy link
Member

I think currently MultiIndexes don't exhibit any particular issue related to this.

They have exactly the same issue no? (in that they allow non existing labels in a list) Copying from your comment on the PR (#17295 (comment)):

In [132]: df = pd.DataFrame([[i] for i in range(3)], index=pd.MultiIndex.from_product([[1], [5,6,7]]))

In [133]: df
Out[133]: 
     0
1 5  0
  6  1
  7  2

In [134]: df.loc[[(1,5), (3,7)]]
Out[134]: 
       0
1 5  0.0
3 7  NaN

@toobaz
Copy link
Member Author

toobaz commented Oct 3, 2017

They have exactly the same issue no?

Yes, sorry, I wasn't clear. They currently behave like flat indexes, and in this sense they don't exhibit any "particular" issue, and they will need a PR analogous to #17295 for homogeneity with flat indexes.

MI need a separate issue (might be one we can hijack)

Opened #17758 ... #15452 is related but does need a different fix.

jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
jreback added a commit that referenced this issue Oct 3, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
jreback added a commit that referenced this issue Oct 3, 2017
ocefpaf pushed a commit to ocefpaf/pandas that referenced this issue Oct 3, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
@gjimzhou
Copy link

Is there a quick convenient way to achieve option 3a in the future? Maybe modify to the pd.Series.get function to support a list of keys?

@ryananeff
Copy link

There must be a way to select all matching indexes in order, as @gjimzhou mentioned like a change to pd.Series.get or perhaps pd.DataFrame.get() or pd.DataFrame.set(). This is important for both getting and setting from one DataFrame to another, for example:

df_a.loc[df_b.index,"column"] = df_b["column"]

or, alternatively, when filtering one DataFrame by another in order:

new_df_a_in_df_b_order = df_a.loc[df_b.index,:]

This is extremely common in many data analysis workflows that rely on loc(). This new change in behavior greatly increases the complexity of both scenarios and is breaking functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
7 participants