ENH Read mutiple excel sheets in single API call #9450

jnmclarty · 2015-02-09T11:53:24Z

Enables reading of multiple excel sheets in a single API call, reducing read time substantially.

Essentially, 2O(n) becomes O(1) + O(n).

Before

dfs = {}
for sheet in [‘Sheet1’,’Sheet2’,’Sheet3’]:
    #We have to open the file 3 times = Super Slow
    dfs[sheet] = pd.read_excel("TestData.xlsx",sheetname=sheet)

This PR

#We open the file only 1 time = Much Faster, and it's just one python line
dfs = pd.read_excel("TestData.xlsx",sheetname=[‘Sheet1’,’Sheet2’,’Sheet3’])

...but as a bonus...

#the above is the same as...
dfs = pd.read_excel_sheets("TestData.xlsx",[‘Sheet1’,’Sheet2’,’Sheet3’])
#or, assuming those are there are only the 3 sheets...
dfs = pd.read_excel_sheets("TestData.xlsx")
#and this also works. But the dictionary returned has the integer keys instead of strings.
dfs = pd.read_excel_sheets("TestData.xlsx",[0,1,2])

Return Objects

For the sheetname argument, specifying an int or a string, gets you a DataFrame (same as before the PR). But, specify a list of strings/int, returns a dictionary of dataframes. Specify None, and get all sheets in dictionary of dataframes, where the keys are the sheetnames.

More Discussion

The first commit (2db986d) implements 100% of the functionality, while the next two commits (5c2304c and 1a53b01) layer on optional changes to the documentation and API (of the two, only one is actually mandatory, but keeping both is the most explicit). Since the argument, sheetname, in read_excel was not plural, and defaults to 0, to maintain backwards compatibility AND create the “read all sheets” functionality associated with the non-default of None, the documentation of the argument gets really awkward. Hence, creating read_excel_sheets with an argument sheetnames which defaults to None instead of 0.

Things I need feedback on:

Confirm print statement acceptability
Confirm if sheetname argument documentation should be changed (or not, to avoid confusion of dictionary-based return cases)
Confirm if read_excel_sheets addition to API is right approach

Things I still have to do:

Test/write tests
Update the release notes
Build the docs and check

Note: There might have been another way of doing this, but I couldn't easily figure it out, (SO/google wasn't helpful). I just wanted to read my 623 sheets without it taking hours.

shoyer · 2015-02-10T03:25:44Z

Does it make sense to return a panel instead of a dict of dicts? Just something to think about...

jorisvandenbossche · 2015-02-10T08:41:15Z

@shoyer I think it should at least be an option? Personally I would rather get a dict than a Panel, but that is more because I never worked with Panels before. Plus, there is no requirement that there are common indices in the different sheets, so a Panel is not really desired in that case I think?

@jnmclarty I think this is nice new functionality! To your questions:

I wouldn't put the print statement there as default. If we want it, the user should at least be able to silence it (something like verbose argument, I don't know it something like that is already used somewhere else in pandas).
For the sheetname documentation, I think the additions are certainly good. Only the last part (the different options, will have to be formatted a bit so it displays well, I will put some inline comments).
For me, I don't think this new functionality warrants a new top-level function like read_excel_sheets. As doing read_excel(.., sheetname=None) is rather easy. But lets see what others think of this.
In any case, it could maybe be more prominently documented how to do this. Eg something like By default, it reads the first sheet. Specify the sheetnameargument to select a specific sheet or usesheetname=None to read all sheets. that could be on top of the docstring (after the first line).

jorisvandenbossche · 2015-02-10T08:43:39Z

pandas/io/excel.py

+        where the keys match the sheet specified.  Set to None to get
+        all sheets, and use sheetnames as the dictionary keys.
+
+        Undefined -> Defaults to 0 -> 1st sheet as a DataFrame


You will have to use bullet points here to have this formatted nicely in the html docs.

jnmclarty · 2015-02-10T13:52:27Z

@shoyer Thought about it, and my logic was the same as @jorisvandenbossche, plus, I'd wager that there are many times when the sheetnames are not something where an index makes sense. Eg. "Sheet3", "Sheet4", "Sheet5" have data, but Sheet1 and Sheet2 got used for aggregation/reporting and were renamed properly. Also, returning a dict forces the user to realize that the resulting dataframes are not ordered. And...final point...returning a panel, we would have to make an assumption about the order for the sheetname=None case.

@jorisvandenbossche Thanks for the input. I'll give this another 32 hours or so to let other feedback trickle in, and then begin implementing your comments.

jreback · 2015-02-10T14:03:26Z

this should for sure be a dict of DataFrames. Their is no reason to assume common indexes here. And if one wants it, then Panel(pd.read_excel(...,sheetnames=[.....])) would work (and you can add that to the docs in io.rst, need an example there as well).

agree with @jorisvandenbossche
only a single API, namely read_excel(..) should exist here. sheetname taking a list is good.

shoyer · 2015-02-10T19:00:54Z

Agreed, a dict makes more sense than a panel :).

jnmclarty · 2015-02-14T04:41:18Z

Pretty much done, unless anybody else has feedback, on testing or the sphinx renderings below.

jnmclarty · 2015-02-15T12:19:42Z

Rebased. Ready to merge. (Assuming Travis agrees)

jreback · 2015-02-16T13:22:02Z

doc/source/io.rst


+- The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names``


I would move these first 2 points to a separate note:: box.

Done, commit on it's way.

jreback · 2015-02-16T18:11:16Z

@jnmclarty I am a bit -1 on accepting mixed int/str for sheetname. You can do it provided you account for the possibility of ambiguity (e.g. if I have a sheet named '1', but its the only sheet, what does '1' return? or 1 return?). IOW, we accept an int by default and seek to match it (if it exists).

So you would have to account for duplicates, and raise on ambiguity.

I would maybe suggest making the user specify either all ints (e.g. position based), or all names (e.g. by name).

jnmclarty · 2015-02-16T18:37:25Z

It's currently just as ambiguous as it's always been. You could have made the same comment, about '1' vs 1 before, the fact that the object is now in a list, is indifferent.

The moment we add exclusivity to the object type, THAT's when '1' will get ambiguous.

If you pass ['1',1] (honestly, who would do this?), it will return a dict of DataFrames representing a sheet named "1", and the 2nd sheet if there is one.

Let's say I've got a sheet named "MetaData", in my excel file (but I don't know the position), and I know I want the 5th sheet (but don't know the name). With my PR, you'd just go ['MetaData', 4]. Prohibiting mixing, you'd have no way of doing this (without more code)

jreback · 2015-02-16T18:41:39Z

@jnmclarty ok as long as the behavior is well-defined and obvious. With a single sheetname it is either positional (if the name doesn't match), or is there (or a KeyError)?

with multiples then you can get dups and/or ambiguity. Maybe add a couple of tests to cement the behavior (ok with accepting mixed if its useful).

jnmclarty · 2015-02-16T19:11:26Z

With a single sheetname it is either positional (if the name doesn't match), or is there (or a KeyError)?

The bahviour is the same as before, just inside the list. [1] is positional. ['1'] is name.
No KeyErrors.

with multiples then you can get dups and/or ambiguity

Assume the excel file has a sheet named "1", in the 1st position. I handled 2 of 3 duplicates that would result from ['1',0,0], in my latest commit. So, passing ['1',0,0] would return the same object as ['1',0]. At this point though, maybe the user wants a duplicate copy of the DataFrame? Honestly, at this point, we've got an excel file with a ridiculously unfortunate naming convention, and a Pandas user that insists on a single line API call. I say, at this point, give the user the duplicates they request.

The duplicates the user would get, are definitely not ambiguous (unless the user doesn't understand object types?)

jnmclarty · 2015-02-16T19:19:47Z

Oh, and @jreback my original tests were already written to ensure the mixed-mode int/strings. See line 447 of test_excel.py.

jnmclarty · 2015-02-16T19:28:24Z

Ok, @jreback I think we're good, see 3556753 and a73645b. I'll squash momentarily.

jreback · 2015-02-16T19:46:08Z

@jnmclarty duplicates are ok, but a KeyError is still a problem

e.g. a file with 2 elements, what should [1,2] do? raise a KeyError.

also need a rebase.

jnmclarty · 2015-02-16T20:02:45Z

Implemented the key errors in d8775d7. While it is appropriate, but technically changes previous behaviour. read_excel(...,sheetname=7,...), in the past, would return IndexError: list index out of range

jnmclarty · 2015-02-16T21:50:28Z

Fixed the Dat_e_Frames, Nones and pd.s, and reference to my first implementation, 62d13ec. Got my verbosity stupidity in prev commit.

jnmclarty · 2015-02-16T21:51:32Z

Thanks @jreback, you're a machine.

jreback · 2015-02-16T21:54:59Z

@jnmclarty hahha....happen to be home and its cold out!

anyhow, lgtm.

@jorisvandenbossche
@TomAugspurger
@shoyer
have a look

jnmclarty · 2015-02-17T13:17:58Z

The IndexError exception I added breaks a test. I thought it was a very trivial change, I guess not. I'm wondering if different exceptions are raised depending on the modules, or versions of modules, installed. I reverted back my addition of IndexErrors, in 1783636 , but left it commented out. I believe Travis will not complain, with this final squash. Assuming it passes, this PR can be merged, but either I, or somebody else should look into the exceptions that are thrown, and the exceptions that should be thrown, for all the various cases, and versions. To me, this is low-value add, as the user of any given stack, will get a specific exception, and they will handle it according to their stack.

jreback · 2015-02-17T13:43:08Z

pandas/io/tests/test_excel.py

+        dfs = [tdf(s) for s in sheets]
+        dfs = dict(zip(sheets,dfs))
+
+        w = pd.ExcelWriter(self.multisheet2)


use a temporary filepath here, see like this; you cannot assume that you can write to the test dirctory (e.g. the code); travis lets you , but on a regular install you normally cant'.

jnmclarty · 2015-02-18T01:53:47Z

Ok, @jreback this PR is back to a state where it does not change any behaviour associated with any Exceptions anywhere. It merely extends the same behaviour that would have existed before this PR. So, sheetname=3, will do the same thing as sheetname=[2,3], if there is no 4th sheet. I removed my confusion inducing commented out exceptions that got added half-way through this PR discussion. All our discussion surrounding IndexError is mute.

I also implemented ensure_clean() in the one test.

Acknowledge my latest, and I'll squash sometime within 5 minutes to 12h time.

jreback · 2015-02-18T03:11:18Z

lgtm

jnmclarty · 2015-02-18T22:43:34Z

Explicitly, unless anybody else has any objections, this is ready to be merged.

jreback · 2015-02-18T22:52:12Z

just needs a look for @jorisvandenbossche / @shoyer

shoyer · 2015-02-18T22:55:30Z

I don't use Excel for serious analytics (thank goodness!) so I don't have much to add here.

My main thought is that this might be a good time to clean up the sheetname/sheet_name inconsistency between read_excel and to_excel.

jnmclarty · 2015-02-18T23:27:46Z

Can we merge this, and open an issue for the sheetname/sheet_name thing? I feel like at the same time as doing that, the codebase could use some cleaning up from a self-documenting perspective, for variable naming. IMO, a PR like that, which doesn't change functionality, would be better isolated from one that does (like this PR we just did here).

shoyer · 2015-02-18T23:30:02Z

@jnmclarty sounds good to me.

jorisvandenbossche · 2015-02-22T19:04:36Z

doc/source/io.rst

+~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.16
+``read_excel`` can read more than one sheet, by setting ``sheetname`` to either


You need to add a blank line between .. versionadded:: 0.16 and the explanation (otherwise sphinx gives warnings)

and the same with the two occurences below

jnmclarty · 2015-02-22T21:35:00Z

@jorisvandenbossche all fixed, see e977432 (which, after I rebase momentarily, will be an outdated commit)

ENH Read mutiple excel sheets in single API call

jreback · 2015-02-23T12:53:21Z

@jnmclarty very nice enhancement. thanks for this!

pls have a look at the built docs and verify that everything looks kosher. If not, pls submit a followup!

http://pandas-docs.github.io/pandas-docs-travis/ (usually takes an hour; these are built on the 2.7 build from travis)

jnmclarty · 2015-02-24T00:18:44Z

@jreback it looks to me like there's a problem with the travis built docs. Does not look like my PR was the problem, but the index only has a few pages in it.

jreback · 2015-02-24T00:52:58Z

I don't think the index is built on the travis docs. Only the 'main' parts are built (e.g. API is also not built). These looks ok to me.

jnmclarty changed the title ~~ENH Read mutiple excel in single API call~~ ENH Read mutiple excel sheets in single API call Feb 9, 2015

jorisvandenbossche reviewed Feb 10, 2015
View reviewed changes

jreback added API Design Enhancement IO Excel read_excel, to_excel labels Feb 10, 2015

jreback added this to the 0.16.0 milestone Feb 10, 2015

jnmclarty force-pushed the multixlsheet branch from b1a15d2 to 8bf71cc Compare February 15, 2015 12:18

jreback reviewed Feb 16, 2015
View reviewed changes

jnmclarty force-pushed the multixlsheet branch from 3556753 to 3bb2f6b Compare February 16, 2015 19:34

jnmclarty force-pushed the multixlsheet branch from 3bb2f6b to 8fe4454 Compare February 16, 2015 19:47

jnmclarty force-pushed the multixlsheet branch from 5670823 to ffa0dbf Compare February 16, 2015 21:02

jnmclarty force-pushed the multixlsheet branch from 62d13ec to 003bba1 Compare February 16, 2015 21:50

jnmclarty force-pushed the multixlsheet branch 2 times, most recently from 1783636 to a443a5e Compare February 17, 2015 13:10

jreback reviewed Feb 17, 2015
View reviewed changes

jnmclarty force-pushed the multixlsheet branch from 2ee006a to c92b87f Compare February 18, 2015 10:44

jorisvandenbossche reviewed Feb 22, 2015
View reviewed changes

jnmclarty force-pushed the multixlsheet branch from f043f78 to e977432 Compare February 22, 2015 21:34

ENH read multiple sheets in read_excel()

d8a2893

jnmclarty force-pushed the multixlsheet branch from e977432 to d8a2893 Compare February 22, 2015 21:35

jreback added a commit that referenced this pull request Feb 23, 2015

Merge pull request #9450 from jnmclarty/multixlsheet

c4f2be6

ENH Read mutiple excel sheets in single API call

jreback merged commit c4f2be6 into pandas-dev:master Feb 23, 2015


		- The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names``

ENH Read mutiple excel sheets in single API call #9450

ENH Read mutiple excel sheets in single API call #9450

Conversation

jnmclarty commented Feb 9, 2015

Before

This PR

Return Objects

More Discussion

shoyer commented Feb 10, 2015

jorisvandenbossche commented Feb 10, 2015

jorisvandenbossche Feb 10, 2015

Choose a reason for hiding this comment

jnmclarty commented Feb 10, 2015

jreback commented Feb 10, 2015

shoyer commented Feb 10, 2015

jnmclarty commented Feb 14, 2015

jnmclarty commented Feb 15, 2015

jreback Feb 16, 2015

Choose a reason for hiding this comment

jnmclarty Feb 16, 2015

Choose a reason for hiding this comment

jreback commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jreback commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jreback commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jnmclarty commented Feb 16, 2015

jreback commented Feb 16, 2015

jnmclarty commented Feb 17, 2015

jreback Feb 17, 2015

Choose a reason for hiding this comment

jnmclarty commented Feb 18, 2015

jreback commented Feb 18, 2015

jnmclarty commented Feb 18, 2015

jreback commented Feb 18, 2015

shoyer commented Feb 18, 2015

jnmclarty commented Feb 18, 2015

shoyer commented Feb 18, 2015

jorisvandenbossche Feb 22, 2015

Choose a reason for hiding this comment

jorisvandenbossche Feb 22, 2015

Choose a reason for hiding this comment

jnmclarty commented Feb 22, 2015

jreback commented Feb 23, 2015

jnmclarty commented Feb 24, 2015

jreback commented Feb 24, 2015