Decouple xlrd reading from ExcelFile class #24423

Merged
merged 6 commits into from
Dec 28, 2018

Conversation

WillAyd
Member

@WillAyd WillAyd commented Dec 25, 2018

To support community engagement on #11499, I figured it was worth refactoring the existing code for read_excel, which is very tightly coupled to xlrd.

There are quite a few ways to go about this but here I am creating a separate private reader class for the xlrd engine which gets dispatched to for parsing and metadata inspection. In theory a similar class could be made for openpyxl.

I didn't make a base class here, to keep the diff minimal, though I certainly could if we see a need for it.
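
For readers following along, the dispatch described above might look roughly like the sketch below. It is not the code from this PR: `_FakeXlrdReader`, `_READERS`, and `ExcelFileSketch` are hypothetical names, the default engine is assumed to be xlrd, and the reader returns canned data instead of calling xlrd.

```python
class _FakeXlrdReader:
    """Hypothetical stand-in for a private, engine-specific reader."""

    def __init__(self, path):
        self.path = path
        # A real reader would open the workbook with xlrd here.
        self._sheets = {"Sheet1": [["a", "b"], [1, 2]]}

    @property
    def sheet_names(self):
        return list(self._sheets)

    def parse(self, sheet_name):
        return self._sheets[sheet_name]


# Registry mapping engine name -> reader class (assumed layout).
_READERS = {"xlrd": _FakeXlrdReader}


class ExcelFileSketch:
    """Public facade that delegates parsing and metadata to a reader."""

    def __init__(self, path, engine=None):
        if engine is None:
            engine = "xlrd"  # assumed default engine
        if engine not in _READERS:
            raise ValueError("Unknown engine: {}".format(engine))
        self._reader = _READERS[engine](path)

    @property
    def sheet_names(self):
        return self._reader.sheet_names

    def parse(self, sheet_name="Sheet1"):
        return self._reader.parse(sheet_name)
```

Under this layout, supporting another engine would mean adding one more reader class and one more registry entry, while the facade stays unchanged.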

@WillAyd WillAyd added IO Excel read_excel, to_excel Clean labels Dec 25, 2018

codecov bot commented Dec 25, 2018

Codecov Report

Merging #24423 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24423      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51950              
==========================================
+ Hits        47953    47954       +1     
+ Misses       3997     3996       -1
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 42.99% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/util/testing.py 87.84% <0%> (+0.09%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 159772d...d015b12. Read the comment docs.


codecov bot commented Dec 25, 2018

Codecov Report

Merging #24423 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24423      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51954       +4     
==========================================
+ Hits        47953    47957       +4     
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.71% <ø> (ø) ⬆️
#single 43% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/base.py 91.83% <0%> (ø) ⬆️
pandas/core/frame.py 96.91% <0%> (ø) ⬆️
pandas/core/base.py 97.68% <0%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update befd324...607dbe5. Read the comment docs.

Contributor

@jreback jreback left a comment

Looks OK. Can you add a test for an invalid engine being passed?

Member Author

WillAyd commented Dec 25, 2018

Latest commit should have changes. In looking at the test cases I think there also needs to be a decoupling done there to really support this (ex: something like test_read_from_http_url should exist in the base class, not just TestXlrdReader).

Can take a stab here or do in separate PR if it makes the diff easier. Lmk

Contributor

jreback commented Dec 25, 2018

@WillAyd might as well do it here.

@@ -119,6 +119,15 @@ def get_exceldf(self, basename, ext, *args, **kwds):
class ReadingTestsBase(SharedItems):
    # This is based on ExcelWriterBase

    @pytest.fixture(autouse=True, params=['xlrd', None])
Member Author

So I originally wanted to put this parametrization in conftest, but I think it got unnecessarily verbose in doing so. Since I don't see much use outside of the existing test class, I figured here was the best spot for this.

Member

@gfyoung gfyoung Dec 28, 2018

I agree with this decision. In the future, in the interest of greater pytest idiom, it would be great (though I'm not sure if it's possible yet) to break up this massive test file into a directory of excel tests, in which case we could put this fixture in the conftest file for excel.

But that's a ways off...I think...🙂

    def set_engine(self, request):
        func_name = "get_exceldf"
        old_func = getattr(self, func_name)
        new_func = partial(old_func, engine=request.param)
Member Author

Long term we probably want to refactor get_exceldf altogether but for the time being the partial should get us the parametrization we want with a relatively minor diff

Member

Well, there isn't much to refactor for that method (it's only two lines), but a greater reorganization would indeed be nice since we're introducing more pytest-like code into this file (perhaps take inspiration from the work I did with the read_csv tests).

Member Author

Yea by refactor I meant more-so replace with a fixture or something besides an instance method on a base class
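
The partial-based parametrization discussed in this thread can be sketched without pytest as follows. This is an illustration under stated assumptions: `ReadingSketch` is a hypothetical class, `get_exceldf` here only reports how it was called, and installing the new function with `setattr` is a guess, since the diff excerpt does not show that step.

```python
from functools import partial


class ReadingSketch:
    """Hypothetical stand-in for the shared test base class."""

    def get_exceldf(self, basename, ext, engine=None):
        # The real helper reads a file; this one just reports the call.
        return (basename + ext, engine)

    def set_engine(self, engine):
        # Shadow the bound method on the instance with a partial that
        # pins the engine keyword, so existing call sites stay unchanged.
        func_name = "get_exceldf"
        old_func = getattr(self, func_name)
        new_func = partial(old_func, engine=engine)
        setattr(self, func_name, new_func)  # assumed installation step


results = []
for engine in ["xlrd", None]:  # mirrors params=['xlrd', None]
    tests = ReadingSketch()
    tests.set_engine(engine)
    results.append(tests.get_exceldf("test1", ".xls"))
```

Each pass through the loop plays the role of one fixture parameter: the test body calls `get_exceldf` exactly as before, and the engine is supplied behind the scenes.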

"""

@td.skip_if_no("xlwt")
def test_read_xlrd_book(self, ext):
Member Author

The diff may be slightly misleading, but I essentially refactored to put all of the tests in the base class save this one, which deals specifically with an xlrd workbook.

Member

@gfyoung gfyoung Dec 28, 2018

That's fair. In the interest of a small diff, it can stay here, though I think we should put this in a separate file in the future for "xlrd-only" tests.

Contributor

yeah i agree. let's do this part in a followup, I think a split of the excel tests to a sub-dir is also in order.

@@ -434,12 +420,13 @@ def parse(self,
index_col=None,
usecols=None,
squeeze=False,
converters=None,
dtype=None,
Contributor

were these just missing?

Member Author

Also somewhat of a misleading diff, but the existing code base has different signatures for parse and _parse_excel. When I moved the latter to be part of the reader class and renamed it to simply parse, git picked up the different signatures in the diff.

I didn't bother to align the signatures here but certainly can as well.

Member Author

Sorry if this is a duplicate (can't see my previous response?) but the diff here is somewhat misleading. Previously there was a parse and _parse_excel function. With the refactor, I moved _parse_excel to the private reader class but simply named it parse.

Git is mixing up the two parse functions: it treats the existing one on the ExcelFile class as brand new (which it wasn't) and compares the reader's implementation against it. The signatures weren't aligned, hence this small diff.

I just moved the code without any change but can look at aligning signatures if you'd like.

@@ -448,72 +435,9 @@ def parse(self,
convert_float=True,
mangle_dupe_cols=True,
**kwds):
Contributor

can we now remove the kwds?

Member Author

Will double check. There is one branch where these would get used and dispatched to the TextParser, though maybe that is dead code

Member

**kwds is passed through TextParser to the Python parser in parsers.py.

This is definitely not dead code, so I am very wary of removing this. I think some more work can be done to better align the signature of read_excel with that of read_csv (in the interest of creating a more unified data IO API).

IMO we should refrain from removing it (that would be an API change), especially as there is enough happening with the refactor.

Member Author

Yea after taking another look agreed with @gfyoung here. I think it's worth aligning the signatures of the different parse calls within the module and potentially removing keyword args if possible (I'm not actually sure what keywords would be applicable here) but would prefer to do in a separate PR since it would be potentially API breaking
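
To illustrate why dropping `**kwds` would be API breaking, here is a minimal sketch of the pass-through pattern under discussion. `parse_sketch` and `text_parser_sketch` are hypothetical stand-ins, not the pandas functions, and `skipfooter` is used only as an example of a keyword that the outer function does not name itself.

```python
def text_parser_sketch(rows, skipfooter=0, **kwds):
    """Hypothetical stand-in for the downstream parser."""
    if kwds:
        raise TypeError("unexpected keywords: {}".format(sorted(kwds)))
    # Drop trailing rows only when a footer length was requested.
    return rows[: len(rows) - skipfooter] if skipfooter else rows


def parse_sketch(rows, header=0, **kwds):
    """Hypothetical parse(): handles header itself, forwards the rest."""
    body = rows[header + 1:]
    # Anything not named in this signature rides along in **kwds.
    return text_parser_sketch(body, **kwds)


rows = [["a"], [1], [2], [3]]
trimmed = parse_sketch(rows, skipfooter=1)
untouched = parse_sketch(rows)
```

If `**kwds` were removed from `parse_sketch`, the `skipfooter=1` call would start raising `TypeError`, which is the kind of breakage callers would see.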

Member

@gfyoung gfyoung left a comment

@jreback @WillAyd : I'm personally OK with the changes as they are. I know there are a couple of other points of discussion that are still open, but they seem to be more clarification of existing changes or better pushed off to another PR (if needed).

@jreback jreback added this to the 0.24.0 milestone Dec 28, 2018

Member Author

WillAyd commented Dec 28, 2018

Opened #24472 as a logical follow up

@jreback jreback merged commit ff28048 into pandas-dev:master Dec 28, 2018
Contributor

jreback commented Dec 28, 2018

thanks @WillAyd

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
@WillAyd WillAyd deleted the excel-read-refactor branch March 14, 2019 15:23