API: Meta-issue for making consistent API's to refer to column names and index names #27652

Dr-Irv · 2019-07-30T02:29:40Z

I would like to propose that any pandas API that allows specification of a column name also works when specifying an index level name. Today, this works in some places, but not all. Here is a list of places where things work, and where things don't work and could be improved. (The list is most likely incomplete, so additions are welcome). References to existing issues are given when they already exist (and if I knew about them):

Hoping to include this in the roadmap #27478

Things that work:

DataFrame.query() allows queries to use column names and index level names
DataFrame.merge() and pd.merge() allow both column names and index level names to be specified as key fields to use in the merge (but see below)
DataFrame.groupby() allows both column names and index level names to be mixed in the groupby list

Things that could be improved:

Allow .loc() and .getitem() to specify index level names wherever it allows column names.
Modify DataFrame.rename() to allow renaming of index levels with a dict argument. (API: Allow MultiIndex.rename() to accept a dict as an argument #20421)
Allow index names to be specified like columns in pd.Grouper() (Grouper should support key from index name (just like groupby supports grouping by index names) #19542)
When merging on a subset of MultiIndex levels, preserve the levels not included in the merge. (merge on index vs merge on column - different NaN handling #13371 is somewhat related)
Have .itertuples() return a named tuple that includes index names (ENH: Allow index names to be included in itertuples() result #27407)
Allow .assign to refer to columns that correspond to level names (although the .loc and .getitem() suggestion above might handle this

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-07-31T22:04:01Z

Not sure my thoughts on this but out of curiosity what do you see the distinction being between an index level and a regular column, if any?

Dr-Irv · 2019-08-01T13:32:41Z

@WillAyd I don't think they should be any different in terms of doing column-wise operations. If you put certain "columns" in an index, then you can do slicing with .loc and pd.IndexSlice more efficiently.

To me, it's a bit analogous to a database table where you have a set of columns, and you then designate a few of them as primary keys. I think pandas could do the same thing, with the added benefit that you can do efficiently slicing on the "keys", i.e, the index.

jorisvandenbossche · 2019-08-01T13:54:15Z

This is somewhat a duplicate of #8162 (about allowing to reference an index level like a column in different APIs), a discussion which was also touched upon in #10000 (but many other cross links in #8162) and as mentioned already partly implemented in certain APIs (eg merge, see #14355).

The discussion in #8162 certainly has some relevant content.
I think in general (from that issue and the others), there is already consensus that in methods we want to reduce this distinction (like was already done in groupby and merge).

Personally, for me the bigger question is if we want to go further than that. As @Dr-Irv gives the analogy of a database, for me as well I find the idea of "index as special column" intriguing.

Dr-Irv · 2019-08-01T14:23:06Z

This is somewhat a duplicate of #8162 (about allowing to reference an index level like a column in different APIs), a discussion which was also touched upon in #10000 (but many other cross links in #8162) and as mentioned already partly implemented in certain APIs (eg merge, see #14355).

I think that my idea of allowing .loc and getitem() to refer to index level names is what #8162 is about.

I also found #20452 which has to do with the merge behavior I listed above,

I think in general (from that issue and the others), there is already consensus that in methods we want to reduce this distinction (like was already done in groupby and merge).

Yes, and I created this meta-issue to track this.

Personally, for me the bigger question is if we want to go further than that. As @Dr-Irv gives the analogy of a database, for me as well I find the idea of "index as special column" intriguing.

Glad to hear it!

shoyer · 2020-02-11T17:40:01Z

I think this is a worthy goal, and would certainly improve user experience.

My concern is that doing it well would also require an underlying change to pandas's core data model: indexes should be internally represented as a "type of column" rather than as external metadata. Otherwise the codebase will get filled with lots of special case logic, which is hard to maintain.

jorisvandenbossche · 2020-02-11T17:54:44Z

I also think that we should look into the data model topic of storing the index as a normal column (where the index would just be "a special column" that can be used in eg indexing operations).

However, that's a much bigger change than adding syntax sugar in several APIs, and with possibly much more backwards compatibility concerns (columns and values would then also include the index names/values, while they don't now, etc) But personally, I think this is worthwhile to explore (having the index as optional might be a prerequisite for this).

sangrey · 2020-02-12T17:50:20Z

As someone who is a heavy user of Multiindexes, I couldn't agree more that this would be very useful to have. I find working in a fairly complex production codebase that whether a particular "column" is a best thought of as an index or a column in its own right is constantly switching back and forth, and so my code ends up being littered with reset_index() calls and set_index() calls.

max-sixty · 2020-02-12T17:50:30Z

In the category of "syntax sugar" rather than data model changes: In my experience the most inconvenient occurance is set_index not allowing passing index names. For example:

In [10]: pd.DataFrame(np.random.rand(5,3), columns=list('abc')).set_index('a')                                                                                      
Out[10]: 
                 b         c
a                           
0.745626  0.628709  0.219627
0.071207  0.192600  0.643170
0.517568  0.556245  0.079201
0.819596  0.143708  0.368630
0.475074  0.142911  0.837396

If we want to make a multiindex with a & b, we need to do reset_index first, since passing index names to set_index doesn't work:

In [11]: pd.DataFrame(np.random.rand(10,3), columns=list('abc')).set_index('a').set_index(list('ab'))                                                               
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-11-264ec28ddef6> in <module>
----> 1 pd.DataFrame(np.random.rand(10,3), columns=list('abc')).set_index('a').set_index(list('ab'))

~/workspace/corpfin/.venv/lib64/python3.6/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4409 
   4410         if missing:
-> 4411             raise KeyError("None of {} are in the columns".format(missing))
   4412 
   4413         if inplace:

KeyError: "None of ['a'] are in the columns"

So I have a plethora of .reset_index().set_index(), which could be collapsed to set_index if that were possible.

Edit: @sangrey and I raced; I'm leaving this up as a more fleshed out version

Dr-Irv · 2020-02-12T17:58:44Z

@max-sixty Actually, you can do what you need without using reset_index and using the append argument of set_index:

In [3]: df=pd.DataFrame(np.random.rand(5,3), columns=list('abc')).set_index('a')

In [4]: df
Out[4]:
                 b         c
a
0.369952  0.768527  0.969260
0.587661  0.883011  0.455956
0.944659  0.331729  0.640796
0.528132  0.397122  0.170864
0.959429  0.869463  0.747232

In [5]: df.set_index('b', append=True)
Out[5]:
                          c
a        b
0.369952 0.768527  0.969260
0.587661 0.883011  0.455956
0.944659 0.331729  0.640796
0.528132 0.397122  0.170864
0.959429 0.869463  0.747232

max-sixty · 2020-02-12T19:37:44Z

Yes, thanks @Dr-Irv . That narrows the gap, though not all the way: being able to specify the indexes without the context of the current indexes would be more declarative and clearer, particularly to a reader of the code.

spaceman1313 · 2024-10-05T04:16:54Z

I like @Dr-Irv analogy to a database and primary keys. To me an index is just a regular column that has some super powers, it is not a different entity.

One example use case I'm having to write code around. I'm loading a data set from CSV files. I need to validate each column for specific formatting. One of the columns is a natural index. But now to validate that column I either need to set it as the index AFTER validation, or I need to have special logic in the validation process to check "is the column I'm about to validate an index or a regular column"

Dr-Irv mentioned this issue Jul 30, 2019

Add a Roadmap #27478

Merged

mroeschke added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 2, 2019

jbrockmendel added the API - Consistency Internal Consistency of API/Behavior label Dec 18, 2019

mroeschke added the Enhancement label May 7, 2020

ChrisStuff mentioned this issue Jun 15, 2020

ENH: Inconsistency when string refers both to index level and column label #34791

Closed

mroeschke removed the API Design label Jul 10, 2021

MarcoGorelli mentioned this issue Apr 12, 2023

Allowing the index to be referenced by name, like a column #8162

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Meta-issue for making consistent API's to refer to column names and index names #27652

API: Meta-issue for making consistent API's to refer to column names and index names #27652

Dr-Irv commented Jul 30, 2019

WillAyd commented Jul 31, 2019

Dr-Irv commented Aug 1, 2019

jorisvandenbossche commented Aug 1, 2019

Dr-Irv commented Aug 1, 2019

shoyer commented Feb 11, 2020

jorisvandenbossche commented Feb 11, 2020

sangrey commented Feb 12, 2020

max-sixty commented Feb 12, 2020 •

edited

Loading

Dr-Irv commented Feb 12, 2020

max-sixty commented Feb 12, 2020

spaceman1313 commented Oct 5, 2024

API: Meta-issue for making consistent API's to refer to column names and index names #27652

API: Meta-issue for making consistent API's to refer to column names and index names #27652

Comments

Dr-Irv commented Jul 30, 2019

WillAyd commented Jul 31, 2019

Dr-Irv commented Aug 1, 2019

jorisvandenbossche commented Aug 1, 2019

Dr-Irv commented Aug 1, 2019

shoyer commented Feb 11, 2020

jorisvandenbossche commented Feb 11, 2020

sangrey commented Feb 12, 2020

max-sixty commented Feb 12, 2020 • edited Loading

Dr-Irv commented Feb 12, 2020

max-sixty commented Feb 12, 2020

spaceman1313 commented Oct 5, 2024

max-sixty commented Feb 12, 2020 •

edited

Loading