ENH: read_excel MultiIndex #4679 #10967

chris-b1 · 2015-09-02T00:25:28Z

Output of to_excel should now be fully round-trippable with read_excel with the
right combination of index_col and header.

To make the semantics match read_csv, an index column name (has_index_names=True) is
always assumed if something is passed to index_col - this should be non-breaking;
if there are no names, it will be just filled to None as before.

In [7]: df = pd.DataFrame([[1,2,3,4], [5,6,7,8]],
...:                   columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
...:                                                        names = ['col1', 'col2']),
...:                   index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
...:                                                      names = ['i1', 'i2']))

In [8]: df
Out[8]: 
col1    foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

In [9]: df.to_excel('test.xlsx')

In [10]: df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1])

In [11]: df
Out[11]: 
col1    foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

chris-b1 · 2015-09-02T12:13:28Z

So my "non-breaking" change (always trying to parse index names) does have a corner case, if all values in the first row of the DataFrame are missing.

I could change that back easily enough, but I thought it may make sense to slightly change the default output format of to_excel to remove this ambiguity and further match the to_csv format.

What I would propose is if the index has a name, it is placed at the column level, as shown below. This is also a much easier format to work with in Excel - at least in my workflow, I usually end up manually reshaping the data to look like this anyways.

Thoughts? @jreback @jorisvandenbossche

Current: [screenshot of the current to_excel layout]

Proposed: [screenshot of the proposed to_excel layout]

jreback · 2015-09-03T01:24:35Z

can you show a picture when the index does not have a name (in current and proposed) - iow, is the blank line there?

chris-b1 · 2015-09-03T04:10:11Z

If there isn't a name, both current and proposed have no blank line, like this:

jreback · 2015-09-03T14:02:12Z

@chris-b1 this change seems reasonable. let me open for a few comments

cc @jtratner
cc @hayd
cc @cancan101
cc @flamingbear
cc @onesandzeroes

chris-b1 · 2015-09-03T14:12:59Z

Thanks, I'll note my current branch still has a couple failing edge cases around mixes of names/no names (more tests coming), but those are fixable. Just to restate the goal clearly:

Any output of to_excel can be read with read_excel by specifying only index_col and header
Deprecate has_index_names
Generally match to/from_csv semantics/format
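A minimal sketch of the round trip these goals describe. It uses an in-memory CSV rather than Excel so that no Excel engine is required, on the assumption that the CSV semantics are the ones being matched. The frame and variable names are illustrative and not taken from the PR.

```python
import io

import pandas as pd

# Frame with two-level columns and a single named index level.
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=pd.MultiIndex.from_product(
        [["foo", "bar"], ["a", "b"]], names=["col1", "col2"]
    ),
    index=pd.Index(["l", "k"], name="i1"),
)

# Write to an in-memory CSV, then read it back specifying only which rows
# form the header and which column forms the index.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
roundtrip = pd.read_csv(buf, header=[0, 1], index_col=0)

print(roundtrip.index.name)
print(list(roundtrip.columns.names))
```

The Excel equivalent would call to_excel and read_excel with the same header and index_col arguments.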

jreback · 2015-09-03T14:14:17Z

@chris-b1 awesome!

chris-b1 · 2015-09-03T23:52:08Z

Alright, my latest commit fully represents the new behavior.

I'll note that this does pick up two quirks from the base parser:

1. In the case of a column MultiIndex, a blank row ALWAYS has to be inserted so the format is unambiguous. I believe this is unavoidable (csv does it too), but the output looks a little odd if the index doesn't have names. xref BUG: to_csv extra header line with multiindex columns #6618, picture below.
2. In the case of an index MultiIndex without names, the level names will be read back in as [None, "Unnamed 1:", "Unnamed 2:", etc.]. This is in the base parsing logic (i.e. also happens with csv), so I think it's outside the scope of this PR; added issue BUG? Parser adds empty MultiIndex level names #10984

jorisvandenbossche · 2015-09-04T11:22:42Z

In principle +1 on the change. (and your goals sound very good)

If the columns have a name (df.columns.name), then the current behaviour is kept, I assume? (EDIT: this is still about your question a bit higher, #10967 (comment), but I just noted that to_csv ignores the columns name in such a case, if the columns are not a MultiIndex.)

jorisvandenbossche · 2015-09-04T11:36:00Z

About your point 1 above, do you think the blank row is unavoidable? Given #6618 it seems we would want to change this for csv, and if that is the case, it would make sense to make the change here already.

jreback · 2015-09-04T11:40:05Z

doc/source/whatsnew/v0.17.0.txt

+                      columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
+                                                           names = ['col1', 'col2']),
+                      index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
+                                                         names = ['i1', 'i2']))


I'd like to add your before/after pictures here as well (the ones from above), only in the whatsnew.

chris-b1 · 2015-09-04T11:41:40Z

@jorisvandenbossche - column names in the single index case are ignored, just like csv. The old export worked that way too.

For the blank row, this is the ambiguous case if you don't have it (or another kwarg). Is "a" the index name, or is the first row of data all missing?

jorisvandenbossche · 2015-09-04T11:48:32Z

Good point. But read_csv at the moment interprets it as an empty row? So that seems not consistent with the output? How do you roundtrip that correctly?

chris-b1 · 2015-09-04T11:56:42Z

You roundtrip by assuming 'a' is the index name, (because there would be an all blank row if it wasn't), which is what csv does too.

In [199]: df = pd.read_csv(
    ...: StringIO(""",foo,foo,bar,bar
    ...: ,a,b,a,b,
    ...: a,,,,
    ...: b,1,2,3,4
    ...: a,5,6,7,8"""), index_col=0, header=[0,1])

In [200]: df.index
Out[200]: Index([u'b', u'a'], dtype='object', name=u'a')
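For contrast, a small sketch of how read_csv treats the same header with and without the extra row. The two CSV strings are made up for illustration; only read_csv and its header and index_col arguments come from the discussion.

```python
import io

import pandas as pd

# Third line has only its first field filled in: read as the index name.
with_name_row = """,foo,foo,bar,bar
,a,b,a,b
a,,,,
b,1,2,3,4
a,5,6,7,8
"""

# No such row: the line after the header is ordinary data.
without_name_row = """,foo,foo,bar,bar
,a,b,a,b
b,1,2,3,4
a,5,6,7,8
"""

named = pd.read_csv(io.StringIO(with_name_row), index_col=0, header=[0, 1])
unnamed = pd.read_csv(io.StringIO(without_name_row), index_col=0, header=[0, 1])

print(named.index.name, len(named))
print(unnamed.index.name, len(unnamed))
```

The corner case mentioned earlier in the thread follows from this: a genuine first data row whose values are all missing looks exactly like with_name_row and is consumed as the index name.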

jreback · 2015-09-04T11:57:51Z

@chris-b1 yep this is the convoluted logic in read_csv to handle this (as well as the case where no blank line exists).

jorisvandenbossche · 2015-09-04T12:02:29Z

@chris-b1 yes, indeed, I forgot the index_col=0 in my test case ..

jreback · 2015-09-05T16:26:29Z

pandas/core/format.py

            if index_label and self.header is not False:
-                if self.merge_cells:


is .merge_cells needed any longer?

It is, there is still the non-default option to write the MI as non-merged cells; it just no longer affects this particular offset.

flamingbear · 2015-09-05T19:02:39Z

Excellent, should probably get rid of warnings and docs I put in for #10564.

Let me see if I can get @chris-b1 a PR on your repo.

chris-b1 · 2015-09-05T19:05:45Z

@flamingbear, I think I've got them cleaned up, but if I missed anything, definitely appreciate a PR. I'm going to push more changes in a few minutes, so I would wait just a second before looking.

chris-b1 · 2015-09-05T19:09:54Z

@jreback - cleaned up the things you noted and rebased on the new testing code.

jreback · 2015-09-05T19:20:10Z

gr8. travis is borking atm. for some reason not tagging the versions correctly.....so may fail :<

flamingbear · 2015-09-05T19:49:47Z

I couldn't see how to mention you, @jreback, on this PR: chris-b1#2. Seems reasonable if we're not warning anymore because round trips are ok?

chris-b1 · 2015-09-05T19:54:13Z

@flamingbear - I didn't realize the verbose keyword was only used for that warning, I'll merge it in. Thanks!

jreback · 2015-09-08T15:23:10Z

doc/source/whatsnew/v0.17.0.txt

+In version 0.16.2 a ``DataFrame`` with ``MultiIndex`` columns could not be written to Excel via ``to_excel``.
+That functionality has been added (:issue:`10564`), along with updating  ``read_excel`` so that the data can
+be read back with no loss of information by specifying which columns/rows make up the ``MultiIndex``
+in the `header` and `index_col` parameters (:issue:`4679`)


use double-backticks around header/index_col

jreback · 2015-09-08T15:25:32Z

minor doc fixes. ping when pushed (since it's only docs, it's already green)

chris-b1 · 2015-09-08T23:00:07Z

@jreback - doc changes pushed, thanks.

jreback · 2015-09-09T01:28:19Z

doc/source/whatsnew/v0.17.0.txt

@@ -205,6 +205,52 @@ The support math functions are `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
 These functions map to the intrinsics for the NumExpr engine.  For Python
 engine, they are mapped to NumPy calls.

+Changes to Excel with ``MultiIndex``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


I think you need one more ^ here

jreback · 2015-09-09T01:34:16Z

some minor comments. ping when green (travis is way behind, FYI)

chris-b1 · 2015-09-09T10:38:47Z

@jreback - green. I went ahead and changed the docstring of to_excel and parse to a common template.

jreback · 2015-09-09T12:02:54Z

doc/source/io.rst

@@ -1989,6 +1989,46 @@ advanced strategies
 Reading Excel Files
 '''''''''''''''''''

+.. versionadded:: 0.17


ok for now, but maybe make this have sub-sections to make this a bit easier to navigate

ENH: read_excel MultiIndex #4679

jreback · 2015-09-09T12:06:15Z

@chris-b1 nice change!

jreback · 2015-09-11T19:39:22Z

I don't think the images in the whatsnew:

**Old**

.. image:: _static/old-excel-index.png

**New**

.. image:: _static/new-excel-index.png

got pushed up, can you do a quick pr with those

cfobel · 2016-09-15T13:28:03Z

Thanks to all who have worked on this. Round-trip to/from Excel has been very helpful for me to collaborate with colleagues in my multidisciplinary lab!

One thing that I was thinking was what about trying to infer the multi-index columns based on formatting?

For example, to_excel bolds and puts borders around the index rows/columns.

Any thoughts/suggestions or potential issues here? If there is interest in this, I might try coding this.

chris-b1 · 2016-09-15T13:39:47Z

I wouldn't be opposed, although of course it would need to be implemented pretty carefully, and it's probably not much fun to munge and deduce formats!

One other possibility I've considered but not seriously explored is saving metadata about exported frames in the file itself. For instance, add a hidden sheet, '_metadata', that stores the shape of each saved frame, which could be used on the way back in.

Limiting yourself to .xlsx you could even pack this metadata in the XML document itself, without the ugliness of an additional sheet. Example here, though not sure if xlsxwriter / xlrd support writing/reading arbitrary metadata.
http://thinktibits.blogspot.com/2014/07/read-write-metadata-excel-poi-example.html
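As a rough illustration of what such metadata might record, here is a sketch of a helper that derives the read-side arguments from a frame. The function name layout_metadata and the dictionary keys are hypothetical; nothing like this exists in pandas, and how the result would be stored in the workbook is left open.

```python
import pandas as pd


def layout_metadata(df: pd.DataFrame) -> dict:
    """Hypothetical helper: describe how a frame is laid out when written."""
    return {
        "shape": list(df.shape),
        # Header rows occupied by the column levels (0-based).
        "header": list(range(df.columns.nlevels)),
        # Leading columns occupied by the index levels (0-based).
        "index_col": list(range(df.index.nlevels)),
    }


df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=pd.MultiIndex.from_product([["foo", "bar"], ["a", "b"]]),
    index=pd.MultiIndex.from_product([["j"], ["l", "k"]]),
)
meta = layout_metadata(df)
print(meta)
```

A reader could then pass meta["header"] and meta["index_col"] straight to read_excel instead of asking the user for them.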

jreback added the IO Excel (read_excel, to_excel), API Design, and MultiIndex labels Sep 3, 2015

jreback added this to the 0.17.0 milestone Sep 3, 2015

jreback reviewed Sep 4, 2015

jreback reviewed Sep 5, 2015

jsexauer mentioned this pull request Sep 5, 2015

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

chris-b1 force-pushed the excel-read-multiindex branch from 53d58b7 to 4fec952 Compare September 5, 2015 19:07

jreback reviewed Sep 8, 2015

chris-b1 force-pushed the excel-read-multiindex branch from bec7a7a to 705c34a Compare September 8, 2015 22:59

jreback reviewed Sep 9, 2015

ENH: read_excel MultiIndex pandas-dev#4679

98405f0

chris-b1 force-pushed the excel-read-multiindex branch from 705c34a to 98405f0 Compare September 9, 2015 02:39

jreback reviewed Sep 9, 2015

jreback added a commit that referenced this pull request Sep 9, 2015

Merge pull request #10967 from chris-b1/excel-read-multiindex

0e56279

ENH: read_excel MultiIndex #4679

jreback merged commit 0e56279 into pandas-dev:master Sep 9, 2015

chris-b1 mentioned this pull request Sep 11, 2015

DOC: Missing Excel index images #11067

Merged

chris-b1 deleted the excel-read-multiindex branch September 11, 2015 23:06

chris-b1 mentioned this pull request Oct 12, 2015

Export to excel for multiindex columns #11292

Open

chris-b1 mentioned this pull request Jan 27, 2016

read_excel(..., index_col=0, squeeze=True) raises AttributeError #12157

Closed

gfyoung mentioned this pull request May 28, 2017

MAINT: Drop has_index_names input from read_excel #16522

Merged

jreback mentioned this pull request May 30, 2017

DEPR: deprecations log for removed issues #13777

Closed

ahawryluk mentioned this pull request Mar 21, 2021

BUG: pandas.read_excel creates a DataFrame with incorrect multi-level columns #34188

Open

3 tasks

rhshadrach mentioned this pull request Dec 11, 2021

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

Closed

3 tasks
