-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decouple xlrd reading from ExcelFile class #24423
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24423 +/- ##
==========================================
+ Coverage 92.3% 92.3% +<.01%
==========================================
Files 163 163
Lines 51950 51950
==========================================
+ Hits 47953 47954 +1
+ Misses 3997 3996 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24423 +/- ##
==========================================
+ Coverage 92.3% 92.3% +<.01%
==========================================
Files 163 163
Lines 51950 51954 +4
==========================================
+ Hits 47953 47957 +4
Misses 3997 3997
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok, can you add a test on testing an invalid engine passed.
Latest commit should have changes. In looking at the test cases I think there also needs to be a decoupling done there to really support this (ex: something like Can take a stab here or do in separate PR if it makes the diff easier. Lmk |
@WillAyd might as well do it here. |
@@ -119,6 +119,15 @@ def get_exceldf(self, basename, ext, *args, **kwds): | |||
class ReadingTestsBase(SharedItems): | |||
# This is based on ExcelWriterBase | |||
|
|||
@pytest.fixture(autouse=True, params=['xlrd', None]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I originally wanted to put this parametrization in conftest
but I think it got unnecessarily verbose in doing so. Since I don't see much use outside of the existing test class I figured here was the best spot for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this decision. In the future, in the interest of greater pytest
idiom, it would be great (though not sure if possible) yet to break up this massive test file into a directory of excel
tests, in which case we could put this fixture in the conftest
file for excel
.
But that's a ways off...I think...🙂
def set_engine(self, request): | ||
func_name = "get_exceldf" | ||
old_func = getattr(self, func_name) | ||
new_func = partial(old_func, engine=request.param) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long term we probably want to refactor get_exceldf
altogether but for the time being the partial should get us the parametrization we want with a relatively minor diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there isn't much to refactor for that method (it's only two lines), but a greater reorganization would indeed be nice since we're introducing more pytest
-like code into this file (perhaps take inspiration from the work I did with the read_csv
tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea by refactor I meant more-so replace with a fixture or something besides an instance method on a base class
""" | ||
|
||
@td.skip_if_no("xlwt") | ||
def test_read_xlrd_book(self, ext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff may be slightly misleading but I essentially refactored to put all of the tests in the base class save this one, which deals particular with an XLRD workbook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. In the interest of a small diff, it can stay here, though I think we should put this in a separate file in the future for "xlrd-only" tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i agree. let's do this part in a followup, I think a split of the excel tests to a sub-dir is also in order.
@@ -434,12 +420,13 @@ def parse(self, | |||
index_col=None, | |||
usecols=None, | |||
squeeze=False, | |||
converters=None, | |||
dtype=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
were these just missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also somewhat of a misleading diff but the existing code base has different signatures for parse
and _parse_excel
. When I moved the latter to be part of the reader class and renamed to simply parse
git picked it up the different signatures in the diff.
I didn't bother to align the signatures here but certainly can as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this is a duplicate (can't see my previous response?) but the diff here is somewhat misleading. Previously there was a parse
and _parse_excel
function. With the refactor, I moved _parse_excel
to the private reader class but simply named it parse
.
Git is mixing up the two parse
functions, basically assuming that the existing one for the ExcelFile
class is brand new (which it wasn't) and is comparing the reader's implementation to the existing ExcelFile
class function. The signatures weren't aligned hence this small diff.
I just moved the code without any change but can look at aligning signatures if you'd like
@@ -448,72 +435,9 @@ def parse(self, | |||
convert_float=True, | |||
mangle_dupe_cols=True, | |||
**kwds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we now renove the kwds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will double check. There is one branch where these would get used and dispatched to the TextParser
, though maybe that is dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**kwds
is passed through TextParser
to the Python parser in parsers.py
.
This is definitely not dead code, so I am very wary of removing this. I think some more work can be done to better align the signature read_excel
with that of read_csv
(in the interest of creating a more unified data IO API)
IMO we should refrain from removing it (that would be an API IMO), especially as there is enough happening with the refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea after taking another look agreed with @gfyoung here. I think it's worth aligning the signatures of the different parse calls within the module and potentially removing keyword args if possible (I'm not actually sure what keywords would be applicable here) but would prefer to do in a separate PR since it would be potentially API breaking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | ||
|
||
@td.skip_if_no("xlwt") | ||
def test_read_xlrd_book(self, ext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i agree. let's do this part in a followup, I think a split of the excel tests to a sub-dir is also in order.
Opened #24472 as a logical follow up |
thanks @WillAyd |
To support community engagement on #11499 I figured it was worth refactoring the existing code for read_excel which very tightly couples xlrd.
There are quite a few ways to go about this but here I am creating a separate private reader class for the xlrd engine which gets dispatched to for parsing and metadata inspection. In theory a similar class could be made for openpyxl.
I didn't make a base class here to keep diff minimal though I certainly could if we see a need for it