Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

toobaz · 2018-05-01T23:01:09Z

We need the equivalent of #20770 for the case of partial indexing (which includes both indexing by lists of partial keys, and indexing by specifying a list for each level). It needs a separate approach because resulting indexers are different (as missing keys are dropped rather than replaced with NaN).

In [2]: mi = pd.MultiIndex.from_product([[1, 2], [3,4]])

In [3]: s = pd.Series(-1, index=mi)

In [4]: s.loc[[1, 3],[3, 4]]
Out[4]: 
1  3   -1
   4   -1
dtype: int64

In any case, as for #20770, it should be possible to emit the warning without checking a second (or first) time for absent keys.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: c4da79b
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+824.gc4da79b5b.dirty
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.0.1
Cython: 0.25.2
numpy: 1.14.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

The text was updated successfully, but these errors were encountered:

toobaz · 2019-06-29T22:10:19Z

After further thought, I'm not anymore sure this (and eventually breaking lists with missing values) is a good idea.

When we have missing keys in normal indexing, it means that

the user was looking for a row/column which wasn't found, and
as a consequence, we introduce a N/A.
So it makes sense to raise an error instead.

But when a user does partial indexing, there is no obvious expectation about what can come back. It could be one row, multiple rows... zero rows. "Anything that starts with..." can also be "nothing", it's not a contradiction.
[yes, the case of duplicate indexes is more similar to partial indexing than to normal indexing, but that's another story]

And in fact, in the current codebase

In [2]: mi = pd.MultiIndex.from_product([[1, 2], [3,4]])                                                              

In [3]: s = pd.Series(1, mi)                                                                                          

In [4]: s.loc[[4, 5]]                                                                                                 
Out[4]: Series([], dtype: int64)

In [5]: s.loc[[4], [5]]                                                                                               
Out[5]: Series([], dtype: int64)

which contradicts the "raise if no key is found" rule.

On the other hand, however

s.loc[(4,)]

does raise. But my preferred solution would be to change this.

So my proposal would be:

the rule "raise if key is missing" (currently "warn if key is missing") applies only to full keys
partial indexing instead is considered a bit like slicing, which can legitimately return an empty result
as a consequence, indexing with a single missing partial key (s.loc[(4, )]) will be allowed

Notice that this is much simpler to implement, because we just do not need to care about missing keys.

Assuming we agree, only question is how to transition to this. Does FutureWarning: indexing a MultiIndex with a single missing partial key raises an error but in the future will return an empty object make sense?

jorisvandenbossche · 2019-06-30T03:57:14Z

Additional observation: with the above example, doing a level-wise indexing for only the first level with a list with missing label does raise:

In [67]: s.loc[[4],] 
..
KeyError: '[4] not in index'

Should that also result in an empty series?

toobaz · 2019-06-30T15:39:49Z

Additional observation: with the above example, doing a level-wise indexing for only the first level with a list with missing label does raise:

The crazy part is that this only happens for one element lists!

After discussing with Joris, we agreed that the two kinds of "partial" indexing (but we should maybe find a better name for passing a list per level: "composite"? "combined"?) should be considered separately.

I would say there are 3 possibilities for indexing (one or more) levels with lists of labels (s.loc[[1, 2], [3, 4]]):

never raise for missing labels
raise if and only if there is no valid combination of labels, i.e., if the result is empty
raise if at least one label, in any level, is missing

each requires some backward incompatibility, and I think nobody likes 2), while both 1) and 3) in principle make sense.

We then have 3 possibilities for indexing with incomplete keys, or lists of them (s.loc[(4,)], or s.loc[[(4,)]]):
A) never raise for missing keys/labels
B) raise for missing keys/labels, but not for lists of them
C) raise if at least one key/label is missing

Now consider the case s.loc[present, also], where s has an index with 3 levels, and (present, also) does not appear as combination in the first two. In case B), this should probably not raise, in analogy with s.loc[[present], [also]]. But then, again for analogy, s.loc[present].loc[also] shouldn't either - and that means that we'd rather choose A) over B).

Now: when we have a MultiIndex, s.loc[4] is actually a shortcut for s.loc[(4,)]]. So what is s.loc[[4]]? It could be a shortcut for two different things:

s.loc[[4], :]: "look on the first level for [4]"
s.loc[[(4,)]] (which doesn't work, but might some day): "look for a list of partial keys in the entire index"

In order to make our user's life simpler, we would like the two things to have the same semantics when 4 is missing. Which means that either we pick 1) with A), or 3) with C).

And this brings us back to the "always raise" vs. "never raise".

Advantages of "always raise": consistent with flat index, and with full indexing; "informs" the user that the desired labels do not appear. Easier transition (adding warnings that tell of future error is simpler than adding warning of future error removal).

Advantages of "never raise": in flat indexes, you do expect all the labels you provide to be in the result, while this is not true in MultiIndexes, even if you pass labels which appear in their own levels. Simpler to explain (partial indexing "returns everything that starts with...", regardless of whether this is actually an empty set), more similar to slicing. Probably implies less changes in existing code behavior.

Both options allow us to simplify our codebase.

I think I would go for "always raise", but at this point I don't really have a strong opinion. Thoughts?

toobaz · 2019-06-30T19:09:34Z

OK, after further thought and discussion, I would conclude that

the analogy in my previous comment between s.loc[present, also] and s.loc[[present], [also]] was flawed
we want s.loc[[1], :].loc[[4]] to be consistent with s.loc[[1], [4]] (which for sure won't raise if 1 and 4 are valid labels)
hence, the best option is "raise if scalars, or combinations of them, do not exist; do not raise if labels in lists, or combinations of them, do not exist". That way, you don't need to talk about missing combinations, and you can say it just works like SQL "in" vs. "=".

I will start adding warnings for errors which will be removed.

phofl · 2020-12-01T20:08:11Z

@toobaz Is this still up to date?

idx = MultiIndex.from_product(
    [["A", "B", "C"], ["foo", "bar", "baz"]], names=["one", "two"]
)
s = Series(np.arange(9, dtype="int64"), index=idx).sort_index()
indexer = ["A", "D"]
s.loc[indexer]

raises now while it was changed in #27154 to not raise. So I am wondering what exactly we would want to achieve here

jbrockmendel · 2021-11-15T19:04:32Z

@toobaz is this closed by either #27154 or #42351?

jbrockmendel · 2021-12-21T04:07:31Z

Closing as fixed by a combination of #27154 and #42351. @toobaz feel free to reopen if I'm wrong on this.

toobaz added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Deprecate Functionality to remove in pandas labels May 1, 2018

toobaz mentioned this issue May 1, 2018

Emit warning for missing labels in Multiindex.loc[[...]] (and more) #20770

Merged

6 tasks

jorisvandenbossche added this to the 0.24.0 milestone May 11, 2018

jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018

jreback modified the milestones: 0.25.0, 1.0 Apr 20, 2019

toobaz mentioned this issue Jun 30, 2019

Indexing single level with list of missing labels raises if (and only if) it has length 1 #27148

Closed

TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019

toobaz mentioned this issue Nov 30, 2020

BUG: DataFrame.loc silently drops non-existent elements when using MultiIndex #10549

Closed

jbrockmendel closed this as completed Dec 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

toobaz commented May 1, 2018 •

edited

Loading

INSTALLED VERSIONS

toobaz commented Jun 29, 2019 •

edited

Loading

jorisvandenbossche commented Jun 30, 2019

toobaz commented Jun 30, 2019 •

edited

Loading

toobaz commented Jun 30, 2019 •

edited

Loading

phofl commented Dec 1, 2020

jbrockmendel commented Nov 15, 2021

jbrockmendel commented Dec 21, 2021

Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

Comments

toobaz commented May 1, 2018 • edited Loading

Output of pd.show_versions()

INSTALLED VERSIONS

toobaz commented Jun 29, 2019 • edited Loading

jorisvandenbossche commented Jun 30, 2019

toobaz commented Jun 30, 2019 • edited Loading

toobaz commented Jun 30, 2019 • edited Loading

phofl commented Dec 1, 2020

jbrockmendel commented Nov 15, 2021

jbrockmendel commented Dec 21, 2021

toobaz commented May 1, 2018 •

edited

Loading

Output of `pd.show_versions()`

toobaz commented Jun 29, 2019 •

edited

Loading

toobaz commented Jun 30, 2019 •

edited

Loading

toobaz commented Jun 30, 2019 •

edited

Loading