Proposal to change behaviour with .loc and missing keys #15747

toobaz · 2017-03-20T18:25:07Z

In [2]: pd.Series([1, 2, 3]).loc[[2,3]]
Out[2]: 
2    3.0
3    NaN
dtype: float64

In [3]: pd.Series([1, 2, 3]).loc[[3]]
[...]
KeyError: 'None of [[3]] are in the [index]'

Problem description

Although coherent (except for some unfortunate side-effects - some of them below) with the docs where they say "At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!", the current behavior is - I claim - a terrible choice for both developers and users.

There are (at least) three ways to behave with missing labels:

you raise an error if requested at least one missing label
you raise an error if requested only missing labels
2a ... while if at least one label is present, missing labels become NaN (current)
2b. ... while if at least one label is present, missing labels are silently dropped
you never raise an error for missing labels
3a ... and they become NaN
3b. ... and they are silently dropped

For developers

Options 1. and 3. are both much easier to implement, because in both cases you can reduce the question "am I going to get an error?" in smaller pieces - e.g. when indexing a MultiIndex, you will get an error if you get an error on any of its levels. Option 2. is instead more complicated (and computationally expensive), because you need to first aggregate in some way across levels/axes, and only then can you decide whether to raise an error or not. Several incoherences came out as a consequence of this choice, some of them still unsolved, such as #15452, this, the fact that pd.Series(range(2)).loc[[]] does not raise, and the fact that pd.DataFrame.ix[[missing_label]] doesn't either.

Other consequences of 2.

Additionally, it was decided that the behavior with missing labels would be to introduce NaNs (rather than to drop them), and I think this was also not a good choice (and indeed partial indexing MultiIndexes does not behave this way - it couldn't). I think it is also undocumented.

And finally, since the above wouldn't always tell you what to do when there are missing labels in a MultiIndex, it was decided that .loc would rather behave as .reindex when there are missing and incomplete labels, which is totally unexpected and, I think, undocumented.

Notice that these further issues (and more in general, the question "what to do when some labels are missing and you are not raising an error") would partially still hold with 3, but could be dealt with, I think, more elegantly.

For users

I think the current behavior is annoying to users not just because of those "Other consequences", but also because it is more complicated to describe in terms of set operation on labels/indices. For instance, with options 1. and 3.

pd.concat([chunk.loc[something] for chunk in chunks])

and

pd.concat(chunks).loc[something]

both return the same result (or raise). Instead with 2. it actually depends on how missing labels are distributed across chunks.

(Why this?)

It is worth understanding why 2. was picked in the first place, and I think the answer is "to be coherent with intervals". But I think it's not worth the damage - after all, an iterable and an interval are different objects. And moreover, introducing NaNs for missing labels is anyway incoherent with intervals.

Backward incompatibility

Option 1. is, I think, the best, because it is also coherent with numpy's behavior with out-of-bounds indices (e.g. np.array([[1,2], [3,4]])[0,[1,3]] raises an IndexError).

But while clearly both 1. and 3. could break some existing code, 3. would be better from this point of view, in the sense that it would break only code assuming that an error is raised. Although one might even claim that 1., by breaking code which looks for missing labels, can help discover bugs in user code (not a great argument, I know).

So overall I am not sure about what we should pick between 1. and 3. But I really think we should leave 2., and that the later it is done, the worse. @jreback , @jorisvandenbossche if you want to tell me your thoughts about this, I can elaborate on what we could do with the "Other consequences" in the desired option.

Then if you approve the change, I'm willing to help in implementing it.

The text was updated successfully, but these errors were encountered:

toobaz · 2017-03-20T18:29:09Z

(another advantage of 3. would be that it's easier to implement - but still, I think we would want to change the behaviour on missing labels: namely, to drop them)

jreback · 2017-03-20T18:35:51Z

xref #10549, #10695
xref (slightly) to #9213, #9595
@toobaz if you'd pull any nuggets from those would be helpful

toobaz · 2017-03-20T18:53:44Z

Right: I think #10549 is, as I commented, not really a bug as long as 2. is the rule. The reporter is expecting a reindex-like behavior, but then he rightly mentions the fact that it would be undefined for partial indexing, and so he states that he would have expected an error instead. He is in some sense asking for option 1. above, but the bug is instead quite orthogonal to the choice between 2. and 3. (and in both cases, I suggest we solve the problem by stopping inserting NaNs, and we drop missing labels instead).

Of #10695, I clearly don't like the proposal to add a new indexer, but apart from this they are just asking for option 1. above, and I think they are right in preferring it to the current state of things.

toobaz · 2017-03-20T19:05:09Z

(I can try to see if it is feasible to have a strict argument for .loc which switches from 1. to 3.... but I'm not sure it would be easy to implement)

By the way: I guess @shoyer might be interested too

Worth mentioning that 1. is more coherent with .loc.__setitem__, which raises for missing keys (in lists). But then, it does add individual missing keys, so this is again not a very strong argument (I rather think that we should fix .loc.__setitem__ and make it add missing labels if we decide to pick 3. - unless with partial indexing of MultiIndexes, where it should probably drop them).

jreback · 2017-03-20T19:44:04Z

yeah I would be ok with 1). The reason for the current behavior are twofold:

.loc and .reindex should be equivalent (though .reindex cannot be an lvalue), because it was confusing when to use these. IOW lots of complaints about I want to overwrite this label wether it exists or not (that's why partial setting works).

note that I think we are going to get rid of all of the expanding stuff in pandas2 anyhow. You will have to explicitly .reindex first, then assign.

interactive work with a long list of labels could be confusing when 1 is missing on accident. IOW a typo. Though the counter-argument here is that this error should be raised immediately.

So the driving force was actually 1).

jorisvandenbossche · 2017-03-20T22:29:11Z

you never raise an error for missing labels

I think this third option needs clarification, as you can still either drop missing labels or introduce NaNs (so in that regard maybe 4 options or a 3a and 3b).

[@jreback] note that I think we are going to get rid of all of the expanding stuff in pandas2 anyhow. You will have to explicitly .reindex first, then assign.

I don't think this is true? (or is there an issue on the pandas2 tracker for this?) For example being to assign one or multiple columns is an expanding indexing operation on assignment, and not an ability we want to loose?

@jreback When you say "I would be ok with 1)", you are OK with loc and reindex no longer being equivalent? And also with the difference between getting (no enlargement) and setting (enlargement allowed)? As those would be the consequence of choosing for option 1.

If we would start fresh, I would, with my current knowledge, be in favor of loc being strict for getting (no missing labels allowed), i.e. option 1.

The question for me is more: can we justify such a breaking change? (as it will break people's code) Are there ways to ease a transition? (I think in principle we can first have a warning in case it will raise an error in a next release?)

jorisvandenbossche · 2017-03-20T22:29:26Z

cc @pandas-dev/pandas-core

jreback · 2017-03-20T22:30:55Z

I don't think this is true? (or is there an issue on the pandas2 tracker for this?) For example being to assign one or multiple columns is an expanding indexing operation on assignment, and not an ability we want to loose?

@jorisvandenbossche I am only talking about row-wise expansion, which is the obvious issue. column assignment of course is unaffected. It possible that this will still be ok (the issue is can performant .appends be done. wesm/pandas2#53), the main issue is the dtype inference / changes. So this would be an error.

In [1]: s = Series([1,2,3])

In [2]: s
Out[2]: 
0    1
1    2
2    3
dtype: int64

In [3]: s.loc[3] = 'foo'

In [4]: s
Out[4]: 
0      1
1      2
2      3
3    foo
dtype: object

jreback · 2017-03-20T22:32:51Z

The question for me is more: can we justify such a breaking change? (as it will break people's code) Are there ways to ease a transition? (I think in principle we can first have a warning in case it will raise an error in a next release?)

I think a warning is possible.

jreback · 2017-03-20T22:34:58Z

@jreback When you say "I would be ok with 1)", you are OK with loc and reindex no longer being equivalent? And also with the difference between getting (no enlargement) and setting (enlargement allowed)? As those would be the consequence of choosing for option 1.

yes I think moving back to a strict separation between .reindex and .loc is fine. setting enlargement should be unaffected (and can kick the can down the road about whether to allow it).
getting enlargement by-definition will now fail (as its essentially a .reindex with missing labels)

jreback · 2017-03-20T22:37:05Z

@jorisvandenbossche

I think this third option needs clarification, as you can still either drop missing labels or introduce NaNs (so in that regard maybe 4 options or a 3a and 3b).

how is 3b different from 2?

chris-b1 · 2017-03-20T22:38:18Z

xref wesm/pandas2#32 proposing disallowing setting with (row) enlargement

jorisvandenbossche · 2017-03-20T22:41:36Z

And actually wesm/pandas2#21 is related to the proposal here (or at least to option 1)

shoyer · 2017-03-20T23:00:24Z

I would certainly pick (1) if starting from scratch (I did for xarray), but it's a serious breaking change so it's probably best to save it for pandas 2.0.

We discussed this in other issues, but I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem). So if we pick (1), we should be sure we have matching behavior prohibiting setting and enlarging, too.

(3) seems like a strict improvement over what we have now, so I would definitely go for it.

jorisvandenbossche · 2017-03-20T23:04:55Z

(3) seems like a strict improvement over what we have now, so I would definitely go for it.

@shoyer What do you mean with 3 ? (see my previous question regarding clarification of this option)

I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem)

I don't see how this is strictly possible. Assigning a new, non-existant column (df['A']) should work, getting a non-existant column with df['A'] should still raise a KeyError (or would you let .loc differ here?)

shoyer · 2017-03-20T23:10:19Z

@shoyer What do you mean with 3 ? (see my previous question regarding clarification of this option)

I meant (3) with introducing NaNs. Although it does indeed get really messy, e.g., with MultiIndex, so arguably we aren't even doing this consistently already.

I think we need symmetric behavior for .loc with unrecognized keys (for both getitem and setitem)
I don't see how this is strictly possible. Assigning a new, non-existant column (df['A']) should work, getting a non-existant column with df['A'] should still raise a KeyError (or would you let .loc differ here?)

Sorry, yes, I meant all row indexing, and probably most but not all column indexing. We will need an exception to the rules for setting new columns, at least one at a time.

jorisvandenbossche · 2017-03-20T23:21:31Z

One possibility is to limit this discussion to only getting.
For the setting, there is clearly still a lot more discussion needed (which kind of row / column enlargment would we like to allow? Which not?), and also a lot more code that will be broken by this (things as s[..] and df.loc[..] are widely used for enlarging series/frames I think). This will not so easily be achieved, and certainly needs to wait until pandas 2.

If we only speak about getting, eg

In [4]: df = pd.DataFrame({'A':[1,2], 'B':[3,4]})

In [5]: df.loc[:,['A', 'C']]
Out[5]: 
   A   C
0  1 NaN
1  2 NaN

I think something like that is a lot less common to be intended, and more likely to be an user error.
Furthermore, for getting you already have the alternative of reindex if you want this behaviour.
And thus (with a good deprecation strategy, I also think a warning should be possible, although maybe complicated to implement) more realistic to tackle on the shorter term?

jorisvandenbossche · 2017-03-20T23:41:28Z

@toobaz see the referenced issues (wesm/pandas2#32 and wesm/pandas2#21)

toobaz · 2017-03-21T00:02:27Z

for getting you already have the alternative of reindex

(Replying in general to comparisons of loc and reindex, also in other comments): I can understand that on the dev side the two are similar, and that past discussions were based on this comparison... but I think for the typical user they are completely different beasts both conceptually (one is an indexer, the other is a modifier that returns a copy) and in the docs.

That said, I agree to keep the setting discussion separated. And I agree with @jorisvandenbossche that with columns it is even more true that missing labels are usually symptoms of an error.

I meant (3) with introducing NaNs

If we go for (3), is there any real downside in just dropping the missing labels?! Because the downside of introducing NaNs is, as you (@shoyer) mention, that we don't know how to extend to partial indexing of MultiIndex, and also the undesired type changes when for instance working with ints or bools. And also that, again, one does not expect a getter indexer to expand the index - that is, to "get" something which is not there.

toobaz · 2017-03-21T00:08:14Z

how is 3b different from 2?

2 raises if asked a list of all missing labels, 3b would return an all-NaN object.

jreback · 2017-03-21T00:09:30Z

2 would raise if asked a list of all missing labels, 3b would return an all-NaN object.

so what is 3a then?

jreback · 2017-03-21T00:09:59Z

@toobaz you can feel free to actually edit the top-section (of options)

jorisvandenbossche · 2017-03-21T00:10:08Z

so what is 3a then?

Drop the missing labels silently, without introducing NaNs (which is actually what MultiIndex does when indexing certain levels, as consequence of not being able to introduce NaNs)

toobaz · 2017-03-21T00:15:45Z

(Updated the description)

… list are selected via .loc closes pandas-dev#15747

jorisvandenbossche · 2017-10-02T19:46:56Z

@toobaz the PR #17295 is going to close this issue, but only deals with single Index, not MultiIndex. Do you know from the top of your head the situation for MultiIndex or whether there is an issues that describes that aspect ?

… list are selected via .loc closes pandas-dev#15747

toobaz · 2017-10-02T22:32:58Z

Do you know from the top of your head the situation for MultiIndex or whether there is an issues that describes that aspect ?

I think currently MultiIndexes don't exhibit any particular issue related to this. Certainly with #17295 flat indexes will emit a warning while MultiIndexes won't, but that can be addressed in a separate PR. I can try to have a look but not immediately.

jreback · 2017-10-03T08:26:51Z

MI need a separate issue (might be one we can hijack)

jorisvandenbossche · 2017-10-03T08:58:15Z

I think currently MultiIndexes don't exhibit any particular issue related to this.

They have exactly the same issue no? (in that they allow non existing labels in a list) Copying from your comment on the PR (#17295 (comment)):

In [132]: df = pd.DataFrame([[i] for i in range(3)], index=pd.MultiIndex.from_product([[1], [5,6,7]]))

In [133]: df
Out[133]: 
     0
1 5  0
  6  1
  7  2

In [134]: df.loc[[(1,5), (3,7)]]
Out[134]: 
       0
1 5  0.0
3 7  NaN

toobaz · 2017-10-03T09:33:38Z

They have exactly the same issue no?

Yes, sorry, I wasn't clear. They currently behave like flat indexes, and in this sense they don't exhibit any "particular" issue, and they will need a PR analogous to #17295 for homogeneity with flat indexes.

MI need a separate issue (might be one we can hijack)

Opened #17758 ... #15452 is related but does need a different fix.

… list are selected via .loc closes pandas-dev#15747

… list are selected via .loc (#17295) closes #15747

… list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747

gjimzhou · 2019-04-11T17:03:37Z

Is there a quick convenient way to achieve option 3a in the future? Maybe modify to the pd.Series.get function to support a list of keys?

ryananeff · 2020-05-12T16:39:37Z

There must be a way to select all matching indexes in order, as @gjimzhou mentioned like a change to pd.Series.get or perhaps pd.DataFrame.get() or pd.DataFrame.set(). This is important for both getting and setting from one DataFrame to another, for example:

df_a.loc[df_b.index,"column"] = df_b["column"]

or, alternatively, when filtering one DataFrame by another in order:

new_df_a_in_df_b_order = df_a.loc[df_b.index,:]

This is extremely common in many data analysis workflows that rely on loc(). This new change in behavior greatly increases the complexity of both scenarios and is breaking functionality.

jreback added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 20, 2017

jreback mentioned this issue Mar 24, 2017

API: should allow setting with enlargement with list-likes #7887

Closed

toobaz mentioned this issue May 13, 2017

Clean multiindex keys #15615

Closed

4 tasks

chris-b1 mentioned this issue Jun 8, 2017

Request: setting for raising error when Loc inputs are not present in indexes/column #16630

Closed

toobaz mentioned this issue Jul 14, 2017

Regression in s.loc._getitem_iterable(...) #16920

Closed

jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2017

API: warning to raise KeyError in the future if not all elements of a…

9020827

… list are selected via .loc closes pandas-dev#15747

jreback modified the milestones: 0.21.0, Next Major Release Oct 2, 2017

jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017

API: warning to raise KeyError in the future if not all elements of a…

a4fb72c

… list are selected via .loc closes pandas-dev#15747

jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017

API: warning to raise KeyError in the future if not all elements of a…

db52a4f

… list are selected via .loc closes pandas-dev#15747

jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017

API: warning to raise KeyError in the future if not all elements of a…

bf02812

… list are selected via .loc closes pandas-dev#15747

jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017

API: warning to raise KeyError in the future if not all elements of a…

44ce09d

… list are selected via .loc closes pandas-dev#15747

jreback closed this as completed in #17295 Oct 3, 2017

jreback added a commit that referenced this issue Oct 3, 2017

API: warning to raise KeyError in the future if not all elements of a…

8e89cb3

… list are selected via .loc (#17295) closes #15747

jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017

TST: remove warnings, xref pandas-dev#15747

1b772b5

jreback added a commit that referenced this issue Oct 3, 2017

TST: remove warnings, xref #15747 (#17761)

2ff1241

ocefpaf pushed a commit to ocefpaf/pandas that referenced this issue Oct 3, 2017

TST: remove warnings, xref pandas-dev#15747 (pandas-dev#17761)

3590b75

ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017

API: warning to raise KeyError in the future if not all elements of a…

6ac291a

… list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747

ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017

TST: remove warnings, xref pandas-dev#15747 (pandas-dev#17761)

a02e853

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

API: warning to raise KeyError in the future if not all elements of a…

688ac4c

… list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

TST: remove warnings, xref pandas-dev#15747 (pandas-dev#17761)

cdf1f8c

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

API: warning to raise KeyError in the future if not all elements of a…

b1f8e3d

… list are selected via .loc (pandas-dev#17295) closes pandas-dev#15747

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

TST: remove warnings, xref pandas-dev#15747 (pandas-dev#17761)

c5ee339

jreback mentioned this issue Jan 18, 2018

Slicing with no keys found #10695

Closed

matthewgilbert mentioned this issue Feb 19, 2018

NaNs for missing returns data in util.calc_rets matthewgilbert/mapping#3

Closed

WillAyd mentioned this issue Jun 2, 2019

to_csv does not fail with specified columns not in dataframe #26613

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to change behaviour with .loc and missing keys #15747

Proposal to change behaviour with .loc and missing keys #15747

toobaz commented Mar 20, 2017 •

edited

Loading

toobaz commented Mar 20, 2017

jreback commented Mar 20, 2017 •

edited

Loading

toobaz commented Mar 20, 2017

toobaz commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017 •

edited

Loading

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

chris-b1 commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

shoyer commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

shoyer commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

toobaz commented Mar 21, 2017

toobaz commented Mar 21, 2017 •

edited

Loading

jreback commented Mar 21, 2017

jreback commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017 •

edited

Loading

toobaz commented Mar 21, 2017

jorisvandenbossche commented Oct 2, 2017

toobaz commented Oct 2, 2017 •

edited

Loading

jreback commented Oct 3, 2017

jorisvandenbossche commented Oct 3, 2017

toobaz commented Oct 3, 2017

gjimzhou commented Apr 11, 2019

ryananeff commented May 12, 2020

Proposal to change behaviour with .loc and missing keys #15747

Proposal to change behaviour with .loc and missing keys #15747

Comments

toobaz commented Mar 20, 2017 • edited Loading

Problem description

For developers

Other consequences of 2.

For users

(Why this?)

Backward incompatibility

toobaz commented Mar 20, 2017

jreback commented Mar 20, 2017 • edited Loading

toobaz commented Mar 20, 2017

toobaz commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017 • edited Loading

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

chris-b1 commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

shoyer commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

shoyer commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

toobaz commented Mar 21, 2017

toobaz commented Mar 21, 2017 • edited Loading

jreback commented Mar 21, 2017

jreback commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017 • edited Loading

toobaz commented Mar 21, 2017

jorisvandenbossche commented Oct 2, 2017

toobaz commented Oct 2, 2017 • edited Loading

jreback commented Oct 3, 2017

jorisvandenbossche commented Oct 3, 2017

toobaz commented Oct 3, 2017

gjimzhou commented Apr 11, 2019

ryananeff commented May 12, 2020

toobaz commented Mar 20, 2017 •

edited

Loading

jreback commented Mar 20, 2017 •

edited

Loading

jreback commented Mar 20, 2017 •

edited

Loading

toobaz commented Mar 21, 2017 •

edited

Loading

jorisvandenbossche commented Mar 21, 2017 •

edited

Loading

toobaz commented Oct 2, 2017 •

edited

Loading