This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

Optional indexes #17

Open
shoyer opened this issue Sep 7, 2016 · 9 comments

@shoyer

shoyer commented Sep 7, 2016

The pandas.Index is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.

Usually, it can be safely ignored when not relevant, especially now that we have RangeIndex (which makes the cost of creating the index minimal), but this is not always the case:

  1. The indexing and join behavior of default RangeIndex is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
  2. When converting a DataFrame into other formats, we need an argument (e.g., index=True) for controlling whether or not to include the index.
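Both problems can be reproduced with current pandas. The following is a minimal sketch (the frame and column names are made up for illustration): filtering leaves gaps in the default RangeIndex, a later arithmetic operation aligns on those labels instead of raising, and the export step needs index=False to drop an index that carries no information.

```python
import pandas as pd

# Two frames that both carry the default RangeIndex.
left = pd.DataFrame({"x": [1, 2, 3, 4]})
right = pd.DataFrame({"x": [10, 20, 30, 40]})

# Filtering keeps the original labels, so `subset` is labelled 1 and 3.
subset = left[left["x"] % 2 == 0]

# Arithmetic aligns on labels rather than position, which silently
# introduces missing values instead of raising an error.
total = subset["x"] + right["x"]
n_missing = int(total.isna().sum())

# Exporting needs an explicit flag to leave the meaningless index out.
csv_text = subset.to_csv(index=False)
```

Here the sum has four rows, two of them missing, even though neither input was ever given a meaningful index.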

I propose that we make the index optional, e.g., by allowing it to be set to None. This entails a need for some rules to handle missing indexes:

  • Operations that explicitly rely on indexes (e.g., .loc and join) should raise TypeError when called on objects without an index.
  • Operations that implicitly rely on indexes for alignment (e.g., the DataFrame constructor and arithmetic) now need to handle three cases:
    1. Index/index operations: These work as before. The result's index is the outer join of the input indexes.
    2. No-index/no-index operations: The inputs must have exactly the same length (otherwise raise TypeError). The result has no index.
    3. Mixed index/no-index operations: The inputs must have the same length. The result takes on the index from the input with an index.
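The three alignment cases could be sketched roughly as follows. Note that result_index is a hypothetical helper, not part of pandas, and None stands in for "no index". The proposal only names TypeError for the no-index/no-index case; raising it for the mixed case too is an assumption made here.

```python
from typing import Optional

import pandas as pd


def result_index(left: Optional[pd.Index], right: Optional[pd.Index],
                 left_len: int, right_len: int) -> Optional[pd.Index]:
    """Hypothetical helper: choose the result index of a binary operation."""
    if left is not None and right is not None:
        # Case 1: both inputs are indexed, so align with an outer join.
        return left.union(right)
    if left_len != right_len:
        # Cases 2 and 3 both require equal lengths (error type for the
        # mixed case is an assumption, see the note above).
        raise TypeError("inputs without a shared index must have equal length")
    # Case 2 yields None (no index); case 3 yields whichever index exists.
    return left if left is not None else right


both = result_index(pd.Index(["a", "b"]), pd.Index(["b", "c"]), 2, 2)
neither = result_index(None, None, 3, 3)
mixed = result_index(None, pd.Index([5, 6]), 2, 2)
```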

Somewhat related: #15

@chris-b1

chris-b1 commented Sep 7, 2016

+1, although I think the opposite approach may also be worth consideration.

What if, instead of being a special property of a DataFrame, an "Index" were just defined by a selection of columns in the frame:

  • 0 (your None)
  • 1 (today's Index)
  • or more (something like a MultiIndex)

This would remove the Index / column distinction, which I think is a stumbling block for many. Some discussion here: pandas-dev/pandas#8162
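Today's pandas can only approximate this with set_index and reset_index, but the approximation shows the three cases. This is a sketch with made-up data; in current pandas the "zero columns" case still carries a default RangeIndex rather than no index at all.

```python
import pandas as pd

table = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome"],
    "year": [2015, 2016, 2015],
    "temp": [5.7, 6.1, 15.2],
})

# Zero index columns: fall back to the default positional index.
no_key = table.reset_index(drop=True)

# One index column: today's ordinary Index.
one_key = table.set_index("city")

# Two index columns: pandas builds a MultiIndex from them.
two_keys = table.set_index(["city", "year"])

# Moving the keys back restores plain columns.
round_trip = two_keys.reset_index()
```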

That said, it's not clear to me how a Series with an Index fits into this world, and it would be a bigger API change.

@shoyer
Author

shoyer commented Sep 7, 2016

@chris-b1 Yes, in fact I almost included "indexes as just a special type of column" as part of this issue, but then decided to save it for another one. Since you brought it up (and it's related), we might as well discuss it here.

I also really like this idea, because the index/column distinction is a major stumbling block for both new and experienced users.

It does entail a major overhaul of the pandas data model, though, which raises a number of questions. In particular: do we still use an Index/MultiIndex for DataFrame.columns? If so, then it follows that column names should now include the names of index columns.

This raises a big issue with how we handle "messy" data (i.e., non-tidy data). Currently, pandas is a pretty capable tool for such datasets, especially with the ability to stack/unstack columns into hierarchies. But if columns is a MultiIndex with multiple levels, adding in index column names is going to make things a mess.

Consider this example adapted from the multi-index docs:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 6), index=pd.Index(['A', 'B', 'C'], name='letter'), columns=index)
print(df)
first        bar                 baz                 foo          
second       one       two       one       two       one       two
letter                                                            
A       1.131677  3.008499 -1.513677  0.379074 -0.546790 -2.221491
B      -1.650027  2.157229 -1.030519 -0.187412  0.711109 -0.334537
C       1.226648  0.631318  0.197816  0.494960 -0.435740  1.098061

Do we now need to add in an extra level to the multi-index for the index column name (e.g., letter)? Or do we disallow a MultiIndex for column names altogether and use an index of tuples instead?
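For reference, here is roughly what current pandas does with both options, on a smaller frame built for illustration. reset_index pads the index name with an empty string to fit the column levels, and a flat Index of tuples can be built with tupleize_cols=False.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    np.zeros((2, 4)),
    index=pd.Index(["A", "B"], name="letter"),
    columns=columns,
)

# Current behaviour: the index name is padded with "" to fit both levels.
flat = df.reset_index()
letter_key = flat.columns[0]

# Alternative from the question above: a flat Index whose labels are tuples.
tuple_columns = pd.Index(list(columns), tupleize_cols=False)
```

The padded key is one way the "extra level" question already surfaces today: the index column has no natural value for the second level.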

In general, I don't think it's worth a huge amount of effort to make it easy to work with such data, given how much nicer tidy data is and the existence of multi-dimensional alternatives with a cleaner data model in the form of xarray, but such a change would certainly break a non-negligible number of workflows. Making indexes optional would be a much less ambiguous win.

> That said, it's not clear to me how a Series with an Index fits into this world

I think it could work in roughly the same way it currently does. Pulling a column out of a DataFrame would return a Series object, with any indexes on the frame attached to it.

@wesm
Owner

wesm commented Sep 7, 2016

I'll put some thought into this when I have a chance, but: one possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

  • Referencing a column would give a pandas.Array.
  • Combining an Array with an Index produces a Series
  • Combining an Index with a Table produces a DataFrame

The Table would be more like an R data.frame / data frames.
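A rough sketch of how these pieces might compose is below. Table here is a hypothetical class written for illustration, not an existing pandas object, and a plain NumPy array stands in for the proposed pandas.Array.

```python
from dataclasses import dataclass
from typing import Dict

import numpy as np
import pandas as pd


@dataclass
class Table:
    """Hypothetical index-free table: named, equal-length columns."""
    columns: Dict[str, np.ndarray]

    def __getitem__(self, name: str) -> np.ndarray:
        # Referencing a column yields a plain array (stand-in for pandas.Array).
        return self.columns[name]

    def with_index(self, index: pd.Index) -> pd.DataFrame:
        # Table + Index -> DataFrame.
        return pd.DataFrame(self.columns, index=index)


table = Table({"x": np.array([1, 2, 3]), "y": np.array([4.0, 5.0, 6.0])})
labels = pd.Index(["a", "b", "c"], name="key")

series = pd.Series(table["x"], index=labels)  # Array + Index -> Series
frame = table.with_index(labels)              # Table + Index -> DataFrame
```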

@chrisaycock

Would the Table have the same functionality as a DataFrame? That is, queries, joins, aggregations, IO, etc.?

@wesm
Owner

wesm commented Sep 7, 2016

One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred. The deferred table DSL I designed for Ibis is one example of such a language; it has effectively 1-1 parity with SQL and could provide some inspiration.
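To make "deferred" concrete, here is a toy sketch in which operations are recorded and only run when execute() is called. DeferredTable and its methods are hypothetical names invented for this illustration; this is not the Ibis API and not a proposed pandas design.

```python
from typing import Callable, List, Optional

import pandas as pd

Step = Callable[[pd.DataFrame], pd.DataFrame]


class DeferredTable:
    """Hypothetical deferred table: records steps, runs them on execute()."""

    def __init__(self, source: pd.DataFrame, steps: Optional[List[Step]] = None):
        self._source = source
        self._steps: List[Step] = steps or []

    def _then(self, step: Step) -> "DeferredTable":
        # Return a new expression so existing ones are never mutated.
        return DeferredTable(self._source, self._steps + [step])

    def filter(self, predicate) -> "DeferredTable":
        return self._then(lambda df: df[predicate(df)])

    def add_column(self, name: str, func) -> "DeferredTable":
        return self._then(lambda df: df.assign(**{name: func(df)}))

    def execute(self) -> pd.DataFrame:
        result = self._source
        for step in self._steps:
            result = step(result)
        # Drop the leftover labels: the table itself has no index.
        return result.reset_index(drop=True)


data = pd.DataFrame({"x": [1, 2, 3, 4]})
expr = (DeferredTable(data)
        .filter(lambda df: df["x"] > 2)
        .add_column("y", lambda df: df["x"] * 10))
out = expr.execute()  # nothing is computed before this call
```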

@shoyer
Author

shoyer commented Sep 8, 2016

> One possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

I'm a really big fan of the Table data structure. It could do enough for most users and client libraries, and DataFrame could be left for those who need indexing and alignment (which are important but niche use cases).

See also the datascience package for teaching introductory data science in Python.

On the other hand, the downside is that now we have two similar core data structures for tabular data. With the dynamic nature of Python, this could easily lead to confusion, and also twice the API to maintain.

> One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred

I'm all for deferred APIs, but I'm less sure that this makes sense for the DataFrame/Table distinction. It dilutes the message of Table as "DataFrame without the Index".

@chrisaycock

I definitely like the idea of a base Table and then a DataFrame that adds an index. Other than indices, there doesn't need to be a distinction in terms of functionality.

@wesm
Owner

wesm commented Sep 8, 2016

@shoyer I agree that having a deferred API as a separate beast would be better, with the basic table being a pared-down, indexless DataFrame (with all operations eagerly evaluated). Was just curious what you all thought =)

@shoyer
Author

shoyer commented Sep 26, 2016

I am probably going to make indexes optional in the next version of xarray (pydata/xarray#1017). (Note that in xarray, an index already is basically just a special kind of column, but currently we always generate an index like range(n).) I guess we'll see how the transition goes, but I am tentatively very optimistic about it.

It occurs to me that an additional virtue of optional indexes is that it could allow us to further clean up DataFrame.__getitem__ with sane mixing between label- and position-based indexing, because we can differentiate between intentional integer indexes and no index at all. I'll elaborate over in #22.
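The ambiguity is easy to reproduce in current pandas. In this small illustrative sketch, the integer labels left over from a default RangeIndex look exactly like intentional integer labels, so the same integer selects different rows depending on whether it is read as a label or as a position.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})
kept = df[df["x"] > 10]         # labels are now 1 and 2

by_label = kept.loc[1, "x"]     # label 1: the first remaining row
by_position = kept.iloc[1, 0]   # position 1: the second remaining row
```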
