Decouple xlrd reading from ExcelFile class #24423

WillAyd · 2018-12-25T06:17:16Z

To support community engagement on #11499 I figured it was worth refactoring the existing code for read_excel, which is very tightly coupled to xlrd.

There are quite a few ways to go about this but here I am creating a separate private reader class for the xlrd engine which gets dispatched to for parsing and metadata inspection. In theory a similar class could be made for openpyxl.

I didn't make a base class here, to keep the diff minimal, though I certainly could if we see a need for it.
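For readers following along, here is a minimal sketch of the dispatch idea described above. It is not the code from this PR: the class bodies, the `_engines` registry, the error message, and the dict-based "workbook" are all assumptions made so the example runs without xlrd installed.

```python
class _XlrdReader:
    """Hypothetical stand-in for a private, engine-specific reader."""

    def __init__(self, sheets):
        # A real reader would open a workbook through xlrd. Here `sheets`
        # is just a dict mapping sheet names to lists of rows.
        self._sheets = sheets

    @property
    def sheet_names(self):
        # Metadata inspection lives on the reader, not on the front end.
        return list(self._sheets)

    def parse(self, sheet_name=0):
        # Accept either a position or a name (a simplification).
        if isinstance(sheet_name, int):
            sheet_name = self.sheet_names[sheet_name]
        return self._sheets[sheet_name]


class ExcelFile:
    """Toy front end that only dispatches to the selected reader."""

    # Assumed registry; another engine would be added as one more entry.
    _engines = {"xlrd": _XlrdReader}

    def __init__(self, io, engine=None):
        if engine is None:
            engine = "xlrd"
        if engine not in self._engines:
            raise ValueError("Unknown engine: {engine}".format(engine=engine))
        self._reader = self._engines[engine](io)

    @property
    def sheet_names(self):
        return self._reader.sheet_names

    def parse(self, sheet_name=0):
        return self._reader.parse(sheet_name)


book = ExcelFile({"Sheet1": [[1, 2], [3, 4]], "Sheet2": [[5]]}, engine="xlrd")
```

With this shape, `ExcelFile` never touches the engine library directly, so supporting a second engine would mean writing another reader class with the same `sheet_names` and `parse` members.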

codecov · 2018-12-25T06:56:07Z

Codecov Report

Merging #24423 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24423      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51950              
==========================================
+ Hits        47953    47954       +1     
+ Misses       3997     3996       -1

Flag	Coverage Δ
#multiple	`90.71% <ø> (ø)`	⬆️
#single	`42.99% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/util/testing.py	`87.84% <0%> (+0.09%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 159772d...d015b12. Read the comment docs.

codecov · 2018-12-25T06:56:07Z

Codecov Report

Merging #24423 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24423      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51950    51954       +4     
==========================================
+ Hits        47953    47957       +4     
  Misses       3997     3997

Flag	Coverage Δ
#multiple	`90.71% <ø> (ø)`	⬆️
#single	`43% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/base.py	`91.83% <0%> (ø)`	⬆️
pandas/core/frame.py	`96.91% <0%> (ø)`	⬆️
pandas/core/base.py	`97.68% <0%> (+0.02%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update befd324...607dbe5. Read the comment docs.

jreback

looks ok, can you add a test for passing an invalid engine?
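A rough sketch of what such a test could look like. The `check_engine` helper, the registry, and the exact error message are hypothetical; in the pandas test suite this would more likely be written with `pytest.raises(ValueError, match=...)` against the real reader.

```python
import re

# Hypothetical registry of supported engines.
_ENGINES = {"xlrd"}


def check_engine(engine):
    """Return the engine to use, or raise for an unknown one (assumed behaviour)."""
    if engine is None:
        return "xlrd"
    if engine not in _ENGINES:
        raise ValueError("Unknown engine: {engine}".format(engine=engine))
    return engine


def test_bad_engine_raises():
    bad_engine = "foo"
    try:
        check_engine(bad_engine)
    except ValueError as err:
        # The message should name the offending engine.
        assert re.search("Unknown engine: foo", str(err))
    else:
        raise AssertionError("expected ValueError for an unknown engine")


test_bad_engine_raises()
```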

pandas/io/excel.py

WillAyd · 2018-12-25T18:11:16Z

Latest commit should have changes. In looking at the test cases I think there also needs to be a decoupling done there to really support this (ex: something like test_read_from_http_url should exist in the base class, not just TestXlrdReader).

Can take a stab here or do in separate PR if it makes the diff easier. Lmk

jreback · 2018-12-25T18:11:48Z

@WillAyd might as well do it here.

WillAyd · 2018-12-25T19:21:03Z

pandas/tests/io/test_excel.py

@@ -119,6 +119,15 @@ def get_exceldf(self, basename, ext, *args, **kwds):
 class ReadingTestsBase(SharedItems):
    # This is based on ExcelWriterBase

+    @pytest.fixture(autouse=True, params=['xlrd', None])


So I originally wanted to put this parametrization in conftest but I think it got unnecessarily verbose in doing so. Since I don't see much use outside of the existing test class I figured here was the best spot for this

I agree with this decision. In the future, in the interest of greater pytest idiom, it would be great (though not sure if possible yet) to break up this massive test file into a directory of excel tests, in which case we could put this fixture in the conftest file for excel.

But that's a ways off...I think...🙂

WillAyd · 2018-12-25T19:21:57Z

pandas/tests/io/test_excel.py

+    def set_engine(self, request):
+        func_name = "get_exceldf"
+        old_func = getattr(self, func_name)
+        new_func = partial(old_func, engine=request.param)


Long term we probably want to refactor get_exceldf altogether but for the time being the partial should get us the parametrization we want with a relatively minor diff

Well, there isn't much to refactor for that method (it's only two lines), but a greater reorganization would indeed be nice since we're introducing more pytest-like code into this file (perhaps take inspiration from the work I did with the read_csv tests).

Yea by refactor I meant more-so replace with a fixture or something besides an instance method on a base class
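To make the `partial` trick in this thread concrete, here is a small sketch outside of pytest. The `SharedItems` body is a stand-in, and the `setattr` step is an assumption, since the diff above only shows the lines up to `new_func`. In the PR this logic sits inside an autouse, parametrized fixture rather than a plain function.

```python
from functools import partial


class SharedItems:
    def get_exceldf(self, basename, ext, engine=None):
        # Stand-in for the real helper: report what would reach the reader.
        return {"path": basename + ext, "engine": engine}


def set_engine(obj, engine):
    """Rebind obj.get_exceldf so that every call carries engine=engine."""
    func_name = "get_exceldf"
    old_func = getattr(obj, func_name)          # bound method from the class
    new_func = partial(old_func, engine=engine)
    setattr(obj, func_name, new_func)           # shadows it on this instance only


items = SharedItems()
set_engine(items, "xlrd")
result = items.get_exceldf("test1", ".xls")

untouched = SharedItems().get_exceldf("test1", ".xls")
```

Because the partial is stored on the instance, existing tests that call `self.get_exceldf(...)` pick up the engine without any change to their bodies, which is what keeps the diff small.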

WillAyd · 2018-12-25T19:23:10Z

pandas/tests/io/test_excel.py

+    """
+
+    @td.skip_if_no("xlwt")
+    def test_read_xlrd_book(self, ext):


The diff may be slightly misleading but I essentially refactored to put all of the tests in the base class save this one, which deals particularly with an xlrd workbook

That's fair. In the interest of a small diff, it can stay here, though I think we should put this in a separate file in the future for "xlrd-only" tests.

yeah i agree. let's do this part in a followup, I think a split of the excel tests to a sub-dir is also in order.

jreback · 2018-12-27T16:48:18Z

pandas/io/excel.py

@@ -434,12 +420,13 @@ def parse(self,
              index_col=None,
              usecols=None,
              squeeze=False,
-              converters=None,
+              dtype=None,


were these just missing?

Also somewhat of a misleading diff but the existing code base has different signatures for parse and _parse_excel. When I moved the latter to be part of the reader class and renamed it to simply parse, git picked up the different signatures in the diff.

I didn't bother to align the signatures here but certainly can as well

Sorry if this is a duplicate (can't see my previous response?) but the diff here is somewhat misleading. Previously there was a parse and _parse_excel function. With the refactor, I moved _parse_excel to the private reader class but simply named it parse.

Git is mixing up the two parse functions, basically assuming that the existing one for the ExcelFile class is brand new (which it wasn't) and is comparing the reader's implementation to the existing ExcelFile class function. The signatures weren't aligned hence this small diff.

I just moved the code without any change but can look at aligning signatures if you'd like
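If aligning the signatures is picked up later, one quick way to see where two functions disagree is to compare their parameters with `inspect.signature`. The two functions below are toy stand-ins with made-up, shortened signatures, not the real pandas ones.

```python
import inspect


def parse(sheet_name=0, header=0, converters=None, **kwds):
    """Toy stand-in for the public method."""


def _parse_excel(sheet_name=0, header=0, dtype=None, **kwds):
    """Toy stand-in for the function that moved to the reader class."""


def signature_diff(first, second):
    """Return parameter names found only in `first` and only in `second`."""
    a = set(inspect.signature(first).parameters)
    b = set(inspect.signature(second).parameters)
    return sorted(a - b), sorted(b - a)


only_in_parse, only_in_reader = signature_diff(parse, _parse_excel)
```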

jreback · 2018-12-27T16:48:31Z

pandas/io/excel.py

@@ -448,72 +435,9 @@ def parse(self,
              convert_float=True,
              mangle_dupe_cols=True,
              **kwds):


can we now remove the kwds?

Will double check. There is one branch where these would get used and dispatched to the TextParser, though maybe that is dead code

**kwds is passed through TextParser to the Python parser in parsers.py.

This is definitely not dead code, so I am very wary of removing this. I think some more work can be done to better align the signature of read_excel with that of read_csv (in the interest of creating a more unified data IO API)

IMO we should refrain from removing it (that would be an API change), especially as there is enough happening with the refactor.

Yea after taking another look agreed with @gfyoung here. I think it's worth aligning the signatures of the different parse calls within the module and potentially removing keyword args if possible (I'm not actually sure what keywords would be applicable here) but would prefer to do in a separate PR since it would be potentially API breaking
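A toy illustration of why dropping `**kwds` would be API breaking. Both functions are hypothetical stand-ins (not `TextParser` or the real `parse`); the point is only that an option the front end does not name still reaches the downstream parser through the passthrough.

```python
def _text_parser(rows, header=0, skipfooter=0):
    """Hypothetical downstream parser that understands extra options."""
    columns = rows[header]
    body = rows[header + 1:]
    if skipfooter:
        body = body[:-skipfooter]
    return columns, body


def parse(rows, header=0, **kwds):
    # `skipfooter` is not named here, yet callers can pass it because
    # **kwds is forwarded untouched to the downstream parser.
    return _text_parser(rows, header=header, **kwds)


rows = [["a", "b"], [1, 2], [3, 4], ["total", 6]]
columns, body = parse(rows, skipfooter=1)
```

If `**kwds` were removed from `parse`, the call above would fail with a `TypeError`, so existing callers relying on the passthrough would break.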

gfyoung

@jreback @WillAyd : I'm personally OK with the changes as they are. I know there are a couple of other points of discussion that are still open, but they seem to be more clarification of existing changes or better pushed off to another PR (if needed).

jreback · 2018-12-28T14:32:00Z

pandas/tests/io/test_excel.py

+    """
+
+    @td.skip_if_no("xlwt")
+    def test_read_xlrd_book(self, ext):


yeah i agree. let's do this part in a followup, I think a split of the excel tests to a sub-dir is also in order.

WillAyd · 2018-12-28T21:59:14Z

Opened #24472 as a logical follow up

jreback · 2018-12-28T23:21:10Z

thanks @WillAyd

Decoupled xlrd reading from ExcelFile class

d015b12

WillAyd added IO Excel read_excel, to_excel Clean labels Dec 25, 2018

jreback requested changes Dec 25, 2018

WillAyd added 3 commits December 25, 2018 09:50

Merge remote-tracking branch 'upstream/master' into excel-read-refactor

351abae

Name changes / doc updates

2dddf18

Added test for raising on bad engine

d2e4174

WillAyd added 2 commits December 25, 2018 10:23

Moved tests from XlrdReader to base

89a1ebe

Added fixture for reading engine

607dbe5

WillAyd commented Dec 25, 2018

jreback requested changes Dec 27, 2018

gfyoung approved these changes Dec 28, 2018

jreback added this to the 0.24.0 milestone Dec 28, 2018

jreback requested changes Dec 28, 2018

jreback approved these changes Dec 28, 2018

jreback merged commit ff28048 into pandas-dev:master Dec 28, 2018

WillAyd mentioned this pull request Dec 29, 2018

Read_excel signature cleanup #24487

Closed

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Decouple xlrd reading from ExcelFile class (pandas-dev#24423)

fa794e7

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Decouple xlrd reading from ExcelFile class (pandas-dev#24423)

35ef608

WillAyd deleted the excel-read-refactor branch March 14, 2019 15:23

WillAyd mentioned this pull request Mar 14, 2019

Excel Document Passing Kwargs to Engine #25723

Closed
