Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_dataframe() method to MultiIndex #12397

Closed
relativistic opened this issue Feb 19, 2016 · 8 comments
Closed

Add to_dataframe() method to MultiIndex #12397

relativistic opened this issue Feb 19, 2016 · 8 comments

Comments

@relativistic
Copy link

I find the Index.to_series method is a convenient way to allow indices to act as columns of a dataframe where desired. However, the behavior of MultiIndex.to_series, which gives a Series of tuples, is less useful.

Would it be convenient to provide a to_dataframe method for index classes? This would be a natural extension of the utility of to_series, and more useful for MultiIndex objects I would think.

I'm something equivalent to:

def to_dataframe(self):
   DataFrame(self.tolist(), columns=self.names, index=self)
@jreback
Copy link
Contributor

jreback commented Feb 22, 2016

can you give an actual usecase?

normally you would simply do .reset_index()

@toobaz
Copy link
Member

toobaz commented Mar 4, 2016

I just searched for this. I have a DF filled with boolean values (an adjacency matrix). I want to store the pairs of labels for which the value is True. I start by doing a simple

df = df.unstack()
df = df[df == True]

now I would like to do

df.to_dataframe().to_csv(path)

instead I need to do

df.reset_index()[['indexname1', 'indexname2']].to_csv(path)

(admittedly, not a huge problem, but on the other hand to_series() itself could easily be replaced by .reset_index()['indexname'] I think; to_dataframe seems to me at least as useful)

@relativistic
Copy link
Author

Sorry, missed the notification with @jreback 's comment.

One of the nice things about to_series() is that it maintains the original index, which is useful if you need to interact with the original dataframe. reset_index() does not do this. To get what I want with reset_index(), I would need to do something like:

index_names = df.index.names
df_indx = df.reset_index()[index_names]
df_indx.index = df.index

This is a bit verbose, but admittedly, not bad. The main problem is that this is fragile, and will not work on all DataFrames. It would fail if:

  • any string in index_names has a duplicate in df.columns .
  • The index is unnamed

Any one of these problems is relatively easy to get around, if you know a priori the construction of the dataframe. However, its a bit more effort if you want to write a general-purpose function that is guarantied to work in all conditions. And even in the case where you have some idea of your input, it would simply be nice if pandas could deal with this for you so you can focus data analysis, rather than how to wrangle it into the right format.

@jreback
Copy link
Contributor

jreback commented Mar 29, 2016

@relativistic again a fully worked out example here would be instructive. .reset_index() DOES preserve the index as its now in a column(s). So not sure what you mean.

@relativistic
Copy link
Author

@jreback : Okay, sure. I'll try to pull together an example some evening soon. Can't show the code I was working on unfortunately, so I'll have to come up with another example.

@jreback
Copy link
Contributor

jreback commented Nov 30, 2016

here is an external impl (e.g. could prob be faster / better if we did this inside of a .to_dataframe()) method

In [24]: i = pd.MultiIndex.from_product([np.arange(1000),np.arange(1000)],names=['one','two'])

In [25]: len(i)
Out[25]: 1000000

In [26]: %timeit DataFrame(i.tolist(), columns=i.names)
1 loop, best of 3: 612 ms per loop

In [27]: %timeit DataFrame([], index=i).reset_index()
10 loops, best of 3: 38.5 ms per loop

@toobaz
Copy link
Member

toobaz commented Nov 30, 2016

Just for reference: at the time, I ended up doing:

In [4]: %timeit pd.DataFrame({i.names[n] or n : i.levels[n][i.labels[n]]
                              for n in range(len(i.names))})
100 loops, best of 3: 16.4 ms per loop

which performs slightly better than (same as above, just to compare on the same CPU):

In [5]: %timeit pd.DataFrame([], index=i).reset_index()
10 loops, best of 3: 21.5 ms per loop

@jreback
Copy link
Contributor

jreback commented Nov 30, 2016

@toobaz that's only true for a relatively small frame

jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017
ENH: allow hashing of MultiIndex

closes pandas-dev#12397
@jreback jreback added this to the 0.20.0 milestone Jan 24, 2017
jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017
ENH: allow hashing of MultiIndex

closes pandas-dev#12397
jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017
ENH: allow hashing of MultiIndex

closes pandas-dev#12397
jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017
jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017
jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017
jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#12397

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15216 from jreback/to_dataframe and squashes the following commits:

b744fb5 [Jeff Reback] ENH: add MultiIndex.to_dataframe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants