Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit warning for missing keys in list-likes for partial indexing in MultiIndex #20916

Closed
toobaz opened this issue May 1, 2018 · 7 comments
Closed
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@toobaz
Copy link
Member

toobaz commented May 1, 2018

We need the equivalent of #20770 for the case of partial indexing (which includes both indexing by lists of partial keys, and indexing by specifying a list for each level). It needs a separate approach because resulting indexers are different (as missing keys are dropped rather than replaced with NaN).

In [2]: mi = pd.MultiIndex.from_product([[1, 2], [3,4]])

In [3]: s = pd.Series(-1, index=mi)

In [4]: s.loc[[1, 3],[3, 4]]
Out[4]: 
1  3   -1
   4   -1
dtype: int64

In any case, as for #20770, it should be possible to emit the warning without checking a second (or first) time for absent keys.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: c4da79b
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+824.gc4da79b5b.dirty
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.0.1
Cython: 0.25.2
numpy: 1.14.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz toobaz added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Deprecate Functionality to remove in pandas labels May 1, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone May 11, 2018
@jreback jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.25.0, 1.0 Apr 20, 2019
@toobaz
Copy link
Member Author

toobaz commented Jun 29, 2019

After further thought, I'm not anymore sure this (and eventually breaking lists with missing values) is a good idea.

When we have missing keys in normal indexing, it means that

  • the user was looking for a row/column which wasn't found, and
  • as a consequence, we introduce a N/A.
    So it makes sense to raise an error instead.

But when a user does partial indexing, there is no obvious expectation about what can come back. It could be one row, multiple rows... zero rows. "Anything that starts with..." can also be "nothing", it's not a contradiction.
[yes, the case of duplicate indexes is more similar to partial indexing than to normal indexing, but that's another story]

And in fact, in the current codebase

In [2]: mi = pd.MultiIndex.from_product([[1, 2], [3,4]])                                                              

In [3]: s = pd.Series(1, mi)                                                                                          

In [4]: s.loc[[4, 5]]                                                                                                 
Out[4]: Series([], dtype: int64)

In [5]: s.loc[[4], [5]]                                                                                               
Out[5]: Series([], dtype: int64)

which contradicts the "raise if no key is found" rule.

On the other hand, however

s.loc[(4,)]

does raise. But my preferred solution would be to change this.

So my proposal would be:

  • the rule "raise if key is missing" (currently "warn if key is missing") applies only to full keys
  • partial indexing instead is considered a bit like slicing, which can legitimately return an empty result
  • as a consequence, indexing with a single missing partial key (s.loc[(4, )]) will be allowed

Notice that this is much simpler to implement, because we just do not need to care about missing keys.

Assuming we agree, only question is how to transition to this. Does FutureWarning: indexing a MultiIndex with a single missing partial key raises an error but in the future will return an empty object make sense?

@jorisvandenbossche
Copy link
Member

Additional observation: with the above example, doing a level-wise indexing for only the first level with a list with missing label does raise:

In [67]: s.loc[[4],] 
..
KeyError: '[4] not in index'

Should that also result in an empty series?

@toobaz
Copy link
Member Author

toobaz commented Jun 30, 2019

Additional observation: with the above example, doing a level-wise indexing for only the first level with a list with missing label does raise:

The crazy part is that this only happens for one element lists!

After discussing with Joris, we agreed that the two kinds of "partial" indexing (but we should maybe find a better name for passing a list per level: "composite"? "combined"?) should be considered separately.

I would say there are 3 possibilities for indexing (one or more) levels with lists of labels (s.loc[[1, 2], [3, 4]]):

  1. never raise for missing labels
  2. raise if and only if there is no valid combination of labels, i.e., if the result is empty
  3. raise if at least one label, in any level, is missing

each requires some backward incompatibility, and I think nobody likes 2), while both 1) and 3) in principle make sense.

We then have 3 possibilities for indexing with incomplete keys, or lists of them (s.loc[(4,)], or s.loc[[(4,)]]):
A) never raise for missing keys/labels
B) raise for missing keys/labels, but not for lists of them
C) raise if at least one key/label is missing

Now consider the case s.loc[present, also], where s has an index with 3 levels, and (present, also) does not appear as combination in the first two. In case B), this should probably not raise, in analogy with s.loc[[present], [also]]. But then, again for analogy, s.loc[present].loc[also] shouldn't either - and that means that we'd rather choose A) over B).

Now: when we have a MultiIndex, s.loc[4] is actually a shortcut for s.loc[(4,)]]. So what is s.loc[[4]]? It could be a shortcut for two different things:

  • s.loc[[4], :]: "look on the first level for [4]"
  • s.loc[[(4,)]] (which doesn't work, but might some day): "look for a list of partial keys in the entire index"

In order to make our user's life simpler, we would like the two things to have the same semantics when 4 is missing. Which means that either we pick 1) with A), or 3) with C).

And this brings us back to the "always raise" vs. "never raise".

Advantages of "always raise": consistent with flat index, and with full indexing; "informs" the user that the desired labels do not appear. Easier transition (adding warnings that tell of future error is simpler than adding warning of future error removal).

Advantages of "never raise": in flat indexes, you do expect all the labels you provide to be in the result, while this is not true in MultiIndexes, even if you pass labels which appear in their own levels. Simpler to explain (partial indexing "returns everything that starts with...", regardless of whether this is actually an empty set), more similar to slicing. Probably implies less changes in existing code behavior.

Both options allow us to simplify our codebase.

I think I would go for "always raise", but at this point I don't really have a strong opinion. Thoughts?

@toobaz
Copy link
Member Author

toobaz commented Jun 30, 2019

OK, after further thought and discussion, I would conclude that

  • the analogy in my previous comment between s.loc[present, also] and s.loc[[present], [also]] was flawed
  • we want s.loc[[1], :].loc[[4]] to be consistent with s.loc[[1], [4]] (which for sure won't raise if 1 and 4 are valid labels)
  • hence, the best option is "raise if scalars, or combinations of them, do not exist; do not raise if labels in lists, or combinations of them, do not exist". That way, you don't need to talk about missing combinations, and you can say it just works like SQL "in" vs. "=".

I will start adding warnings for errors which will be removed.

@phofl
Copy link
Member

phofl commented Dec 1, 2020

@toobaz Is this still up to date?

idx = MultiIndex.from_product(
    [["A", "B", "C"], ["foo", "bar", "baz"]], names=["one", "two"]
)
s = Series(np.arange(9, dtype="int64"), index=idx).sort_index()
indexer = ["A", "D"]
s.loc[indexer]

raises now while it was changed in #27154 to not raise. So I am wondering what exactly we would want to achieve here

@jbrockmendel
Copy link
Member

@toobaz is this closed by either #27154 or #42351?

@jbrockmendel
Copy link
Member

Closing as fixed by a combination of #27154 and #42351. @toobaz feel free to reopen if I'm wrong on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

No branches or pull requests

6 participants