Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Read mutiple excel sheets in single API call #9450

Merged
merged 1 commit into from
Feb 23, 2015

Conversation

jnmclarty
Copy link
Contributor

Enables reading of multiple excel sheets in a single API call, reducing read time substantially.

Essentially, 2O(n) becomes O(1) + O(n).

Before

dfs = {}
for sheet in [‘Sheet1’,’Sheet2’,’Sheet3’]:
    #We have to open the file 3 times = Super Slow
    dfs[sheet] = pd.read_excel("TestData.xlsx",sheetname=sheet)

This PR

#We open the file only 1 time = Much Faster, and it's just one python line
dfs = pd.read_excel("TestData.xlsx",sheetname=[‘Sheet1’,’Sheet2’,’Sheet3’])

...but as a bonus...

#the above is the same as...
dfs = pd.read_excel_sheets("TestData.xlsx",[‘Sheet1’,’Sheet2’,’Sheet3’])
#or, assuming those are there are only the 3 sheets...
dfs = pd.read_excel_sheets("TestData.xlsx")
#and this also works. But the dictionary returned has the integer keys instead of strings.
dfs = pd.read_excel_sheets("TestData.xlsx",[0,1,2])

Return Objects

For the sheetname argument, specifying an int or a string, gets you a DataFrame (same as before the PR). But, specify a list of strings/int, returns a dictionary of dataframes. Specify None, and get all sheets in dictionary of dataframes, where the keys are the sheetnames.

More Discussion

The first commit (2db986d) implements 100% of the functionality, while the next two commits (5c2304c and 1a53b01) layer on optional changes to the documentation and API (of the two, only one is actually mandatory, but keeping both is the most explicit). Since the argument, sheetname, in read_excel was not plural, and defaults to 0, to maintain backwards compatibility AND create the “read all sheets” functionality associated with the non-default of None, the documentation of the argument gets really awkward. Hence, creating read_excel_sheets with an argument sheetnames which defaults to None instead of 0.

Things I need feedback on:

  • Confirm print statement acceptability
  • Confirm if sheetname argument documentation should be changed (or not, to avoid confusion of dictionary-based return cases)
  • Confirm if read_excel_sheets addition to API is right approach

Things I still have to do:

  • Test/write tests
  • Update the release notes
  • Build the docs and check

Note: There might have been another way of doing this, but I couldn't easily figure it out, (SO/google wasn't helpful). I just wanted to read my 623 sheets without it taking hours.

@jnmclarty jnmclarty changed the title ENH Read mutiple excel in single API call ENH Read mutiple excel sheets in single API call Feb 9, 2015
@shoyer
Copy link
Member

shoyer commented Feb 10, 2015

Does it make sense to return a panel instead of a dict of dicts? Just something to think about...

@jorisvandenbossche
Copy link
Member

@shoyer I think it should at least be an option? Personally I would rather get a dict than a Panel, but that is more because I never worked with Panels before. Plus, there is no requirement that there are common indices in the different sheets, so a Panel is not really desired in that case I think?

@jnmclarty I think this is nice new functionality! To your questions:

  • I wouldn't put the print statement there as default. If we want it, the user should at least be able to silence it (something like verbose argument, I don't know it something like that is already used somewhere else in pandas).
  • For the sheetname documentation, I think the additions are certainly good. Only the last part (the different options, will have to be formatted a bit so it displays well, I will put some inline comments).
  • For me, I don't think this new functionality warrants a new top-level function like read_excel_sheets. As doing read_excel(.., sheetname=None) is rather easy. But lets see what others think of this.
    In any case, it could maybe be more prominently documented how to do this. Eg something like By default, it reads the first sheet. Specify the sheetnameargument to select a specific sheet or usesheetname=None to read all sheets. that could be on top of the docstring (after the first line).

where the keys match the sheet specified. Set to None to get
all sheets, and use sheetnames as the dictionary keys.

Undefined -> Defaults to 0 -> 1st sheet as a DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will have to use bullet points here to have this formatted nicely in the html docs.

@jnmclarty
Copy link
Contributor Author

@shoyer Thought about it, and my logic was the same as @jorisvandenbossche, plus, I'd wager that there are many times when the sheetnames are not something where an index makes sense. Eg. "Sheet3", "Sheet4", "Sheet5" have data, but Sheet1 and Sheet2 got used for aggregation/reporting and were renamed properly. Also, returning a dict forces the user to realize that the resulting dataframes are not ordered. And...final point...returning a panel, we would have to make an assumption about the order for the sheetname=None case.

@jorisvandenbossche Thanks for the input. I'll give this another 32 hours or so to let other feedback trickle in, and then begin implementing your comments.

@jreback jreback added this to the 0.16.0 milestone Feb 10, 2015
@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

this should for sure be a dict of DataFrames. Their is no reason to assume common indexes here. And if one wants it, then Panel(pd.read_excel(...,sheetnames=[.....])) would work (and you can add that to the docs in io.rst, need an example there as well).

agree with @jorisvandenbossche
only a single API, namely read_excel(..) should exist here. sheetname taking a list is good.

@shoyer
Copy link
Member

shoyer commented Feb 10, 2015

Agreed, a dict makes more sense than a panel :).

@jnmclarty
Copy link
Contributor Author

Pretty much done, unless anybody else has feedback, on testing or the sphinx renderings below.
read_excel
io excel

@jnmclarty
Copy link
Contributor Author

Rebased. Ready to merge. (Assuming Travis agrees)


- The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move these first 2 points to a separate note:: box.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, commit on it's way.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

@jnmclarty I am a bit -1 on accepting mixed int/str for sheetname. You can do it provided you account for the possibility of ambiguity (e.g. if I have a sheet named '1', but its the only sheet, what does '1' return? or 1 return?). IOW, we accept an int by default and seek to match it (if it exists).

So you would have to account for duplicates, and raise on ambiguity.

I would maybe suggest making the user specify either all ints (e.g. position based), or all names (e.g. by name).

@jnmclarty
Copy link
Contributor Author

It's currently just as ambiguous as it's always been. You could have made the same comment, about '1' vs 1 before, the fact that the object is now in a list, is indifferent.

The moment we add exclusivity to the object type, THAT's when '1' will get ambiguous.

If you pass ['1',1] (honestly, who would do this?), it will return a dict of DataFrames representing a sheet named "1", and the 2nd sheet if there is one.

Let's say I've got a sheet named "MetaData", in my excel file (but I don't know the position), and I know I want the 5th sheet (but don't know the name). With my PR, you'd just go ['MetaData', 4]. Prohibiting mixing, you'd have no way of doing this (without more code)

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

@jnmclarty ok as long as the behavior is well-defined and obvious. With a single sheetname it is either positional (if the name doesn't match), or is there (or a KeyError)?

with multiples then you can get dups and/or ambiguity. Maybe add a couple of tests to cement the behavior (ok with accepting mixed if its useful).

@jnmclarty
Copy link
Contributor Author

With a single sheetname it is either positional (if the name doesn't match), or is there (or a KeyError)?

The bahviour is the same as before, just inside the list. [1] is positional. ['1'] is name.
No KeyErrors.

with multiples then you can get dups and/or ambiguity

Assume the excel file has a sheet named "1", in the 1st position. I handled 2 of 3 duplicates that would result from ['1',0,0], in my latest commit. So, passing ['1',0,0] would return the same object as ['1',0]. At this point though, maybe the user wants a duplicate copy of the DataFrame? Honestly, at this point, we've got an excel file with a ridiculously unfortunate naming convention, and a Pandas user that insists on a single line API call. I say, at this point, give the user the duplicates they request.

The duplicates the user would get, are definitely not ambiguous (unless the user doesn't understand object types?)

@jnmclarty
Copy link
Contributor Author

Oh, and @jreback my original tests were already written to ensure the mixed-mode int/strings. See line 447 of test_excel.py.

@jnmclarty
Copy link
Contributor Author

Ok, @jreback I think we're good, see 3556753 and a73645b. I'll squash momentarily.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

@jnmclarty duplicates are ok, but a KeyError is still a problem

e.g. a file with 2 elements, what should [1,2] do? raise a KeyError.

also need a rebase.

@jnmclarty
Copy link
Contributor Author

Implemented the key errors in d8775d7. While it is appropriate, but technically changes previous behaviour. read_excel(...,sheetname=7,...), in the past, would return IndexError: list index out of range

@jnmclarty
Copy link
Contributor Author

Fixed the Dat_e_Frames, Nones and pd.s, and reference to my first implementation, 62d13ec. Got my verbosity stupidity in prev commit.

@jnmclarty
Copy link
Contributor Author

Thanks @jreback, you're a machine.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2015

@jnmclarty hahha....happen to be home and its cold out!

anyhow, lgtm.

@jorisvandenbossche
@TomAugspurger
@shoyer
have a look

@jnmclarty jnmclarty force-pushed the multixlsheet branch 2 times, most recently from 1783636 to a443a5e Compare February 17, 2015 13:10
@jnmclarty
Copy link
Contributor Author

The IndexError exception I added breaks a test. I thought it was a very trivial change, I guess not. I'm wondering if different exceptions are raised depending on the modules, or versions of modules, installed. I reverted back my addition of IndexErrors, in 1783636 , but left it commented out. I believe Travis will not complain, with this final squash. Assuming it passes, this PR can be merged, but either I, or somebody else should look into the exceptions that are thrown, and the exceptions that should be thrown, for all the various cases, and versions. To me, this is low-value add, as the user of any given stack, will get a specific exception, and they will handle it according to their stack.

dfs = [tdf(s) for s in sheets]
dfs = dict(zip(sheets,dfs))

w = pd.ExcelWriter(self.multisheet2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a temporary filepath here, see like this; you cannot assume that you can write to the test dirctory (e.g. the code); travis lets you , but on a regular install you normally cant'.

@jnmclarty
Copy link
Contributor Author

Ok, @jreback this PR is back to a state where it does not change any behaviour associated with any Exceptions anywhere. It merely extends the same behaviour that would have existed before this PR. So, sheetname=3, will do the same thing as sheetname=[2,3], if there is no 4th sheet. I removed my confusion inducing commented out exceptions that got added half-way through this PR discussion. All our discussion surrounding IndexError is mute.

I also implemented ensure_clean() in the one test.

Acknowledge my latest, and I'll squash sometime within 5 minutes to 12h time.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2015

lgtm

@jnmclarty
Copy link
Contributor Author

Explicitly, unless anybody else has any objections, this is ready to be merged.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2015

just needs a look for @jorisvandenbossche / @shoyer

@shoyer
Copy link
Member

shoyer commented Feb 18, 2015

I don't use Excel for serious analytics (thank goodness!) so I don't have much to add here.

My main thought is that this might be a good time to clean up the sheetname/sheet_name inconsistency between read_excel and to_excel.

@jnmclarty
Copy link
Contributor Author

Can we merge this, and open an issue for the sheetname/sheet_name thing? I feel like at the same time as doing that, the codebase could use some cleaning up from a self-documenting perspective, for variable naming. IMO, a PR like that, which doesn't change functionality, would be better isolated from one that does (like this PR we just did here).

@shoyer
Copy link
Member

shoyer commented Feb 18, 2015

@jnmclarty sounds good to me.

~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.16
``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add a blank line between .. versionadded:: 0.16 and the explanation (otherwise sphinx gives warnings)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the same with the two occurences below

@jnmclarty
Copy link
Contributor Author

@jorisvandenbossche all fixed, see e977432 (which, after I rebase momentarily, will be an outdated commit)

jreback added a commit that referenced this pull request Feb 23, 2015
ENH Read mutiple excel sheets in single API call
@jreback jreback merged commit c4f2be6 into pandas-dev:master Feb 23, 2015
@jreback
Copy link
Contributor

jreback commented Feb 23, 2015

@jnmclarty very nice enhancement. thanks for this!

pls have a look at the built docs and verify that everything looks kosher. If not, pls submit a followup!

http://pandas-docs.github.io/pandas-docs-travis/ (usually takes an hour; these are built on the 2.7 build from travis)

@jnmclarty
Copy link
Contributor Author

@jreback it looks to me like there's a problem with the travis built docs. Does not look like my PR was the problem, but the index only has a few pages in it.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2015

I don't think the index is built on the travis docs. Only the 'main' parts are built (e.g. API is also not built). These looks ok to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants