-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: read_excel MultiIndex #4679 #10967
Conversation
So my "non-breaking" change (always trying to parse index names) does have a corner case, if all values in the first row of the DataFrame are missing. I could change that back easily enough, but I thought it may make sense to slightly change the default output format of What I would propose is if the index has a name, it is placed at the column level, as shown below. This is also a much easier format to work with in Excel - at least in my workflow, I usually end up manually reshaping the data to look like this anyways. Thoughts? @jreback @jorisvandenbossche CurrentProposed |
can u show a picture when the index does not have a name (in current and proposed) - iow is the blank line their? |
@chris-b1 this change seems reasonable. let me open for a few comments cc @jtratner |
Thanks, I'll note my current branch still has a couple failing edge cases around mixes of names/no names (more tests coming), but those are fixable. Just to restate the goal clearly:
|
@chris-b1 awsome! |
Alright, my latest commit fully represents the new behavior. I'll note that this does pick up two quirks from the base parser
|
In principle +1 on the change. (and your goals sound very good) If the columns have a name ( |
About your points 1 above, do you think the blank row is unavoidable? Given #6618 it seems we would want to change this for csv, and if that is the case, it would make sense to do the change here already. |
columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']], | ||
names = ['col1', 'col2']), | ||
index = pd.MultiIndex.from_product([['j'], ['l', 'k']], | ||
names = ['i1', 'i2'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like adding your pictures of the before after here as well (the onces from above) (only in the whatsnew)
@jorisvandenbossche - column names in the single index case are ignored, just like csv. The old export worked that way too. For the blank row, this is the ambiguous case if you don't have it (or another kwarg). Is "a" the index name, or is the first row of data all missing? |
Good point. But |
You roundtrip by assuming 'a' is the index name, (because there would be an all blank row if it wasn't), which is what csv does too.
|
@chris-b1 yep this is the convoluted logic in |
@chris-b1 yes, indeed, I forgot the |
if index_label and self.header is not False: | ||
if self.merge_cells: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is .merge_cells
needed any longer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, there is still the non-default option to write the MI as non merged cells, it just no longer effects this particular offset.
@flamingbear, I think I've got them cleaned up, but if I missed anything, definitely appreciate a PR. I'm going to push more changes in a few minutes, so I would wait just a second before looking. |
53d58b7
to
4fec952
Compare
@jreback - cleaned up the things you noted and rebased on the new testing code. |
gr8. travis is borking atm. for some reason not tagging the versions correctly.....so may fail :< |
I couldn't see how to mention you @jreback on this pr chris-b1#2 Seems reasonable if we're not warning anymore because round trips are ok? |
@flamingbear - I didn't realize the verbose keyword was only used for that warning, I'll merge it in. Thanks! |
In version 0.16.2 a ``DataFrame`` with ``MultiIndex`` columns could not be written to Excel via ``to_excel``. | ||
That functionality has been added (:issue:`10564`), along with updating ``read_excel`` so that the data can | ||
be read back with no loss of information by specifying which columns/rows make up the ``MultiIndex`` | ||
in the `header` and `index_col` parameters (:issue:`4679`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double-backticks around header/index_col
minor doc fixes. ping when pushed (as only docs its already green) |
bec7a7a
to
705c34a
Compare
@jreback - doc changes pushed, thanks. |
@@ -205,6 +205,52 @@ The support math functions are `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`, | |||
These functions map to the intrinsics for the NumExpr engine. For Python | |||
engine, they are mapped to NumPy calls. | |||
|
|||
Changes to Excel with ``MultiIndex`` | |||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need one more ^
here
some minor comments. ping when green (travis is way behind, FYI) |
705c34a
to
98405f0
Compare
@jreback - green. I went ahead and changed the docstring of |
@@ -1989,6 +1989,46 @@ advanced strategies | |||
Reading Excel Files | |||
''''''''''''''''''' | |||
|
|||
.. versionadded:: 0.17 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok for now, but maybe make this have sub-sections to make this a bit easier to navigate
ENH: read_excel MultiIndex #4679
@chris-b1 nice change! |
I don't think the images in the whatsnew:
got pushed up, can you do a quick pr with those |
Thanks for all who have worked on this. Round-trip to/from Excel has been very helpful for me to collaborate with colleagues in my multidisiplinary lab! One thing that I was thinking was what about trying to infer the multi-index columns based on formatting? For example, Any thoughts/suggestions or potential issues here? If there is interest in this, I might try coding this. |
I wouldn't be opposed, although of course would need to be implemented pretty carefully and probably not much fun to munge and deduce formats! One other possibility I've considered but not seriously explored is saving metadata about exported frames in the file itself. For instance, add a hidden sheet, Limiting yourself to |
closes #4679
xref #10564
Output of
to_excel
should now be fully round-trippable withread_excel
with theright combination of
index_col
andheader
.To make the semantics match
read_csv
, an index column name (has_index_names=True
) isalways assumed if something is passed to
index_col
- this should be non-breaking;if there are no names, it will be just filled to
None
as before.