Allowing the index to be referenced by name, like a column #8162

Closed
3 tasks
makmanalp opened this issue Sep 2, 2014 · 27 comments
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@makmanalp
Contributor

makmanalp commented Sep 2, 2014

What if we allowed the index of a dataframe to be referred to in the usual ways?

data = pd.read_table("...", index_col="id")
data.id  # breaks
data["id"]  # breaks

I find myself setting and resetting indices very often to join to a different dataframe or to pull in the values of the index to a subselection of the dataframe, etc. I figure this is because of how the data is stored under the hood, but wouldn't this be convenient?

@jreback
Contributor

jreback commented Sep 2, 2014

I recall another issue about this - can u have a look for it?

further this is not difficult

want to try a pr?

@makmanalp
Contributor Author

Yeah, I'd love to take a shot at implementing this. I spent a few minutes looking for the old issue but couldn't find anything other than the tangentially relevant #8082. Do you remember any other details?

@jreback
Contributor

jreback commented Sep 2, 2014

I think I am remembering implementing (then reverting) this

you will need to change __getattr__ and _get_item_cached in core/generic.py

need good tests!

@shoyer
Member

shoyer commented Sep 4, 2014

I think this is a great idea. I did something similar in xray.

A few things to consider for a full-fledged implementation:

  1. What should the type of data['id'] be? I think it should be a Series (i.e., data.index.to_series() or pd.Series(data.index, data.index)) rather than an Index (data.index), to follow the rule that the items in a DataFrame are always Series objects.
  2. This should work with a MultiIndex. In this case, you should get a Series where the values are only from the named level (i.e., pd.Series(data.index.get_level_values('id'), data.index)).
  3. Don't forget indexing columns with lists. This should also work, returning a DataFrame: data[['id', 'other_col']]
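The three points above can be approximated with calls that already exist in pandas. This is only a sketch of the proposed return values, not the proposed `__getitem__` change itself; the frames, the `group` level name, and the variable names are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {"other_col": [10, 20, 30]},
    index=pd.Index([1, 2, 3], name="id"),
)

# 1. Single name: the index wrapped in a Series that is aligned to itself.
as_series = data.index.to_series()

# 2. MultiIndex: only the values of the named level, aligned to the full index.
multi = pd.DataFrame(
    {"other_col": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [("x", 1), ("x", 2), ("y", 3)], names=["group", "id"]
    ),
)
level_series = pd.Series(
    multi.index.get_level_values("id"), index=multi.index, name="id"
)

# 3. List of names: copy the index into a column first, then select as usual.
subset = data.assign(id=data.index)[["id", "other_col"]]
```

In the third case the index stays in place, so `subset` ends up with both an index and a column named `id`.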

@makmanalp
Contributor Author

@shoyer - thank you so much! I was pondering the first myself - great point about the type. I wonder if Index follows the Series interface exactly. If so, it shouldn't be a problem. The second and third hadn't even occurred to me.

It looks like Index and Series both inherit from IndexOpsMixin (https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L283)

https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L74 and https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L80

@jreback thoughts?

@jreback
Contributor

jreback commented Sep 4, 2014

this is very simple

just change the methods I showed above
and wrap with _constructor

@TomAugspurger
Contributor

Regarding @shoyer's point 3, with

In [7]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['a', 'b', 'c'])
In [8]: df.index.name = 'idx'

Does df[['idx', 'A', 'B']] return

     A  B
idx      
a    1  4
b    2  5
c    3  6

with idx in the index still, or

  idx  A  B
0   a  1  4
1   b  2  5
2   c  3  6

with idx as a column? It should be the second one IMO.

@shoyer
Member

shoyer commented Sep 4, 2014

@TomAugspurger actually, I think it should either be your first example, or something like:

     idx  A  B
idx
a      a  1  4
b      b  2  5
c      c  3  6

This has the disadvantage of now having a redundant column/index with the same name. But I don't like changing the index based on indexing particular columns -- if you want that, you can use reset_index().
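For reference, both candidate results can be produced today with existing methods. The sketch below reuses the `df` from the example above; the names `redundant` and `flat` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])
df.index.name = "idx"

# Redundant column plus index: move idx into the columns, then set it back
# as the index without dropping the column.
redundant = df.reset_index().set_index("idx", drop=False)

# idx only as a column: the index becomes a default integer range.
flat = df.reset_index()[["idx", "A", "B"]]

print(redundant)
print(flat)
```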

@makmanalp
Contributor Author

I think the first one is simpler too. We're not hiding that it's the index, and we're not promoting it to be a column, we're just allowing it to be referred to and used as a column.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 5, 2014

But it should be consistent, I think. If df['idx'] returns the index wrapped in a Series, then df[['idx', 'A', 'B']] should also return it as a Series, and thus a DataFrame with 3 columns (so the example as @shoyer showed it). df[['idx', 'A', 'B']] and df[['A', 'B']] should not be the same, I think.

@shoyer
Member

shoyer commented Sep 5, 2014

I agree with @jorisvandenbossche. Columns are never going to be fully interchangeable with indexes (even after this change), and if you're explicitly indexing the index as a column you presumably want it as a series, not an index.

Another edge case to test for: let's make sure df.groupby('idx') works. Right now you need to write df.groupby(level='idx').
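A small sketch of the two spellings, on a made-up frame. As far as I know, later pandas releases accept an index level name directly in `by`, so both calls are expected to agree there; at the time of this comment only the `level=` form worked.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]},
    index=pd.Index(["a", "a", "b", "b"], name="idx"),
)

# Explicit form: group on the index level called "idx".
by_level = df.groupby(level="idx").sum()

# Name-based form: assumes a pandas version that resolves "idx" to the
# index level when no column has that name.
by_name = df.groupby("idx").sum()
```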

@TomAugspurger
Contributor

+1 for @shoyer's example. I should have explained why I think that including idx in the slice should return it as a column. First of all, there's the mental model that df[<list>] always returns a DataFrame whose columns are in the list. Second, this would be the only way to do things like df[['idx', 'A', 'B']].sum(1) without resorting to the ugly old way of calling reset_index().

I had an issue and PR about @shoyer's groupby case that I never finished off. We can handle groupby separately, but if this goes into 0.15, I'll finish up that PR.

@makmanalp
Contributor Author

@shoyer didn't know about the level=idx! The groupby was on my list because it's such a pain in the butt.

One question, is wrapping the index in a series and adding it onto the dataframe essentially a no-op, or is it going to be horribly inefficient for larger dataframes?

@shoyer
Member

shoyer commented Oct 2, 2014

I think a broader theme of the issue is that it is intuitive to think of an "index" as a special type of column, rather than as a separate type of entity.

@TomAugspurger
Contributor

Just to re-raise this with another use case: this would help out matplotlib with their labeled data plotting. I haven't looked recently, but an earlier version had to work around not being able to use __getitem__ to get to the index.

I'm less sure about the need to allow df[['index_name', 'other_col']].

@makmanalp
Contributor Author

@TomAugspurger in defense of df[['index_name', 'other_col']], what's nice about it is that it saves you from a ton of gross foo.reset_index().blah.set_index() and other similar cruft that isn't really meaningful and obscures what your code is actually trying to do.

@tacaswell
Contributor

There is currently code on that branch so that

plt.plot('foo', data=df)
plt.plot(df['foo'])

will both grab the index to use as the x values instead of range, but that is only implemented for plot and nothing else.

But, major 👍 from me on this ability. I don't have a view on the list slicing, but the name should be something other than id as that seems like a source of endless collisions.

@jankatins
Contributor

This "problem" was also on the ggplot todo list. I would vote for df["__index__"] being treated specially (i.e., returning df.index) and for a named index also showing up in df[[<...>]]

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Sep 1, 2015
@jbrockmendel
Member

Transplanting from #17061 on convergence in Index/Series behavior.

It would be nice to be able to access foo.dt without first having to check whether foo is an Index or Series. This could be accomplished by having DatetimeIndex, PeriodIndex, and TimedeltaIndex have a property dt that just returns self. If others agree, I'll put together a PR. Thoughts?
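Until something like that exists, the check can be hidden in a small helper. `datetime_props` below is a hypothetical name, not a pandas API; it relies only on the fact that a Series exposes datetime fields through `.dt` while a DatetimeIndex exposes them directly.

```python
import pandas as pd


def datetime_props(obj):
    """Return an object exposing datetime fields for a Series or an Index.

    Hypothetical helper: a Series needs the .dt accessor, while
    DatetimeIndex, PeriodIndex and TimedeltaIndex carry the fields themselves.
    """
    if isinstance(obj, pd.Series):
        return obj.dt
    return obj


stamps = pd.to_datetime(["2014-09-02", "2015-09-01"])  # a DatetimeIndex
years_from_index = datetime_props(stamps).year
years_from_series = datetime_props(stamps.to_series()).year
```

Note that the two results still differ in type (an Index versus a Series), which is part of the convergence question raised here.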

@MarcoGorelli
Member

If I've understood the suggestion correctly, I'm -1 on it, because of the ambiguity in what should happen if a column has the same name as the index

In [5]: df
Out[5]:
   a  b
a
7  1  4
8  2  5
9  3  6

# what does df['a'] return?
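The frame above can be rebuilt as follows to see what happens today. Under current behaviour `df['a']` only looks at the columns, so the index values have to be fetched separately; the variable names are illustrative.

```python
import pandas as pd

# Index named "a" alongside a column that is also named "a".
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=pd.Index([7, 8, 9], name="a"),
)

column_a = df["a"]              # the column, as things stand
index_a = df.index.to_series()  # the index values, fetched explicitly
```

With the proposal, one of these two would have to win for `df['a']`, or the lookup would have to raise.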

@jbrockmendel jbrockmendel removed their assignment Mar 30, 2023
@MarcoGorelli
Member

closing as per today's discussion then - thanks anyway for the issue

@tacaswell
Contributor

@MarcoGorelli Is there a link to any notes from the discussions?

@MarcoGorelli
Member

MarcoGorelli commented Apr 12, 2023

yes but they just say "agreed to close" 😄 https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#

A related issue which was brought up is #27652, which may still be considered

@tacaswell
Contributor

😞

I am curious what the persuasive argument was.

I see how from inside of pandas the index is very special, but from the outside it just looks like any other column. In the case where you need to consume input from users of many types (which may just be a Matplotlib problem), being able to treat dict-of-arrays, dataframes, h5py groups, xarray, [anything that returns an array for __getitem__(key: str) -> Array], etc. the same way is pretty nice. From the outside it is a weird wart that there is data on dataframes that cannot be extracted via __getitem__.
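A consumer library can paper over the wart with a wrapper along these lines. `get_array` is a hypothetical helper, not part of pandas or Matplotlib, and it assumes that a column takes precedence when a name matches both a column and an index level.

```python
import numpy as np
import pandas as pd


def get_array(data, key):
    """Fetch `key` from a dict-like container as a NumPy array.

    Hypothetical helper: for a DataFrame, fall back to an index level
    with that name when no column matches.
    """
    if isinstance(data, pd.DataFrame) and key not in data.columns:
        if key in data.index.names:
            return data.index.get_level_values(key).to_numpy()
    # Plain __getitem__ path: dicts of arrays, DataFrame columns, and so on.
    return np.asarray(data[key])


frame = pd.DataFrame({"y": [1.0, 2.0]}, index=pd.Index([10, 20], name="x"))
mapping = {"x": [10, 20], "y": [1.0, 2.0]}

x_from_frame = get_array(frame, "x")
x_from_dict = get_array(mapping, "x")
```

A name that matches neither a column nor an index level still raises `KeyError` from the underlying container.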

On the other hand I see the namespace problem may be intractable and the above use case might be niche enough that it is not worth the engineering and documentation effort to make it work.

@jbrockmendel
Member

The main pain point was cases where the index name(s) matched a column label

@davidgilbertson

That doesn't seem like a great reason to not proceed. .groupby works seamlessly across columns and indexes. If an index and column share a name, it errors with ValueError: <name> is both an index level and a column label, which is ambiguous.
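The groupby behaviour described here can be checked with a short script. The frame is made up, and the exact error text may vary between pandas versions, so the sketch only captures the message rather than asserting its full wording.

```python
import pandas as pd

# "a" is both the index name and a column label.
df = pd.DataFrame(
    {"a": [1, 1, 2], "b": [4, 5, 6]},
    index=pd.Index([7, 8, 9], name="a"),
)

try:
    df.groupby("a").sum()
    message = None
except ValueError as err:
    # Expected on pandas versions that reject the ambiguous name.
    message = str(err)

print(message)
```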

@MarcoGorelli if there's a deeper reason, it would be great to know so I can properly give up hope :). Otherwise, from all the other comments this doesn't seem like an impossible thing, I'm happy to contribute.

@MarcoGorelli
Member

Personally, I'd rather not add even more auto-magic and inconsistencies. This is going to open up more issues. There's enough to work on. If a PDEP were raised, I'd probably vote down, sorry

But that doesn't mean you need to give up hope 😄 If you can get another core member on board, write a PDEP with them, and then get a 2/3 majority of core members to vote it up, then you could bypass my negativity
