-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Class to read OpenDocument Tables #25427
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25427 +/- ##
===========================================
- Coverage 91.74% 41.68% -50.07%
===========================================
Files 173 175 +2
Lines 52923 53037 +114
===========================================
- Hits 48554 22107 -26447
- Misses 4369 30930 +26561
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25427 +/- ##
===========================================
- Coverage 91.86% 41.12% -50.75%
===========================================
Files 180 175 -5
Lines 50794 50851 +57
===========================================
- Hits 46660 20910 -25750
- Misses 4134 29941 +25807
Continue to review full report at Codecov.
|
Ah the coverage went down but that's because the tests couldn't run because odfpy wasn't available. Also I fixed the whitespace problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you are going to want to put this in pandas/io/excel and model it after one of the readers maybe _xlrd.py
tests in s similar way
then it will just work; read_excel is the generic interface here (we can discuss in a new issue whether aliasing to read_spreadsheet is ok)
Wait a second... I just realized that read_excel already supports using two totally different libraries for reading .xls and .xlsx files. Now it makes more sense why you want it hooked into io.excel.
OK it'll take me a bit to understand how the hook works.
My current reader class is ODFReader, but ODFFile or OpenDocumentFile might be more consistent with ExcelFile. Do you have any preferences?
Lastly is there any place I can add then odfpy dependency so the ci framework will be able to run my tests?
Thanks you for your comments.
Diane
|
you don't need to expose any of this directly. These can all be internal. invoking |
deps you would add in the ci/deps/* files; don't add to every one as you also have to function w/o being included (the framework handles this for you though) |
f5f42e5
to
d8495dc
Compare
Ok everything is moved into pandas.io.excel and tests/io/test_excel.py Currently its triggered by passing engine='odf' to read_excel. It looks like pandas is currently depending on xlrd to figure out what to do with the various excel style file extensions. Should there be something to activate odf reading in ExcelFile.init based on file extension or trying xlrd parsing and catching the error? |
Drat. I forgot the CI dependency change. |
d8495dc
to
050a5e7
Compare
Ok I added odfpy to the travis-36.yaml file since that seemed to be a thorough test configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@detrout looks really good! pls add a whatsnew note, New features in 0.25.0, also pls update the io.rst documentation as needed
references the issue #2311 in your whatsnew |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments from my side but agreed this looks very nice - kudos!
pandas/io/excel/_odfreader.py
Outdated
'Unrecognized sheet identifier type {}. Please use' | ||
'a string or integer'.format(type(name))) | ||
|
||
def parse(self, sheet_name=0, **kwds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already defined in the base class so ideally don't need to override here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. _ODFReader wasn't derived from _BaseExcelReader... and I have a problem making it fit. The sheet object returned by xlrd has properties that I can't easily match.
As far as I can tell from looking at the content.xml the only way to know the nrows, which is currently used by _BaseExcelReader is to actually parse the odf table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is nrows
the only hang up here? Had a similar discussion here:
https://github.com/pandas-dev/pandas/pull/25092/files#r255718071
So if it's a hang up for two open PRs might consider doing that as a precursor to simplify things all around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I'm pretty tired right now I'll try looking if something other than nrows might go also break in a few days.
The convert_float parameter for get_sheet_data doesn't make as much sense as ODF has separate types for integer and floats. (Here's the list of supported cell types: http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1417680_253892949 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so its ok to ignore a passed parameter; if its in conflict then you could raise an error if its explicity passed
pandas/io/excel/_odfreader.py
Outdated
i = self.sheet_names.index(name) | ||
return self.__get_table(self.tables[i]) | ||
|
||
def get_sheet(self, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need this method? I see it's added as a result of the subclassed parse
method but the behavior here to get a sheet via inference of position / name is not something I think the other engine has, so don't want to introduce anything inconsistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending figuring out how to relate _ODFReader to the current _BaseExcelFile and if there's a way to make the base class parse function work, I renamed this to _get_sheet to indicate its private.
pandas/io/excel/_odfreader.py
Outdated
self.document = None | ||
self.tables = None | ||
self.filepath_or_stream = filepath_or_stream | ||
self.document = document_load(filepath_or_stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called self.book
in the existing xlrd code - can we use the same naming convention here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a little harder case... in _xlrd._XlrdReader book looks to be a reference to the xlrd class that provides the interface to excel files.
odfpy provides more of an ElementTree or lxml level interface than anything useful about tables. self.tables is just a dictionary pointing to the root node of table XML nodes found in the OpenDocument file. I'm not sure its high level enough to be considered equivalent to a "book" object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - you'll have to excuse that I'm not terribly familiar with odfpy. This is a rather minor sticking point so not really a blocker for this PR but consistency in the codebase is very important; it may make sense subsequently to change the xlrd code to reflect this convention or come up with an abstraction for both that reads consistently
pandas/tests/io/test_excel.py
Outdated
assert book.sheet_names == ['Sheet1'] | ||
|
||
def test_read_types(self): | ||
"""Make sure we read ODF data types correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK to remove docstring - I don't think we have these for any other tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the docstrings were just restating the function name, but some of them are descriptions of what the test is about. I'd like those more detailed comments to stay, is it ok if they're docstrings or should I turn them into # comments?
Like:
def test_read_invalid_types(self):
"""Make sure we throw an exception when encountering a new value-type
I had to manually create an invalid ods file by directly
editing the extracted xml. So it probably won't open in
LibreOffice correctly.
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments are fine though as minimal as possible is preferred
pandas/tests/io/test_excel.py
Outdated
tm.assert_equal(sheet, expected) | ||
|
||
def test_read_invalid_types(self): | ||
"""Make sure we throw an exception when encountering a new value-type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also no docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really like the discussion about the invalid file to stick around somehow. (It fails xml validation)
Ok I resolved many of the issues and pushed them as separate commits for the moment so its easier to see what I did. There's a few issues I'd like to discuss further, and I left those issues open. |
can you merge master |
@detrout can you merge master and update; let's see if we can get this in |
Sorry I got distracted. I'll try to do it tomorrow. |
Ok I rebased to master. Convered all my important test docstrings to comments and added a whatsnew for v0.25 about the OpenDocument support. I hope I remembered to fix everything (and that rebasing was the right merge strategy) Also since you use python as a macro language for LibreOffice, at some point I'd like to extend my ODFReader class to support reading tables while running in LO. (I only tried it once, and had problems with importing pandas in the LO environment, so it's going to be a while before that works. For now those experiments are going to stay in my detrout/pandasodf repository) |
@detrout looks like there are still some conflicts. Workflow to resolve locally on your branch should be as follows: git fetch upstream
git merge upstream/master
# Resolve conflicts here...
git merge --continue
git push origin YOUR_BRANCH_NAME |
1e43355
to
4b278dc
Compare
I'm not sure what to do about this travis failure?
You asked me to use pandas.compat.string_types, and its present in 0.23.3, but it looks gone in 0.25. There's just a couple of python calls to isinstance(source, bastring) or (isinstance(string, str) |
@detrout just use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback mostly green with good coverage here. Travis failures are a result of defusedxml
and xlrd
being installed alongside one another and not sure there are great options to change. Should just catch warning in tests?
pandas/tests/io/excel/test_odf.py
Outdated
sheet = pd.read_excel("lowerdiagonal.ods", 'Sheet1', | ||
index_col=None, header=None) | ||
|
||
assert sheet.shape == (4, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might change this and subsequent tests below to use tm.assert_frame_equal
in another iteration or PR
@@ -439,6 +445,9 @@ def test_bad_engine_raises(self, read_ext): | |||
|
|||
@tm.network | |||
def test_read_from_http_url(self, read_ext): | |||
if read_ext == '.ods': # TODO: remove once on master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test only works when the file is available on master, so have to merge first and then can try again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this code is still hanging around ill aim to address it when I tackle: #29439
@jreback all green here if you want to take a look for 0.25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the config_init.py options updated? can you add a note in io.rst about this & update install.rst?
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -164,6 +164,7 @@ Other enhancements | |||
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`) | |||
- :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`) | |||
- :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`) | |||
- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify engine='odf' to enable. (:issue:`9070`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put enable='odf'
in double back ticks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation updates done. Dialog on how to best represent parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -3217,7 +3219,20 @@ The look and feel of Excel worksheets created from pandas can be modified using | |||
* ``float_format`` : Format string for floating point numbers (default ``None``). | |||
* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). | |||
|
|||
.. _io.ods: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a versionchanged here
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -164,6 +164,7 @@ Other enhancements | |||
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`) | |||
- :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`) | |||
- :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`) | |||
- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. (:issue:`9070`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a link to the new docs
using the ``odfpy`` module. The semantics and features for reading | ||
OpenDocument spreadsheets match what can be done for `Excel files`_ using | ||
``engine='odf'``. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show an example here (code is ok)
closes #2311
This is primarily intended for LibreOffice calc spreadsheets but will
also work with LO Writer and probably with LO Impress documents.
This is an alternate solution to #9070
There are test cases with several different problematic LibreOffice spread sheets.
git diff upstream/master -u | flake8 appeared to pass.
... I didn't do the whats new entry. Though I expect there's some more work to do before submitting this. I just wanted to get the core code in for comments.
The open issues is, the workaround for #25422 is embedded in the current code (so all my tests pass right now) but that, or a better solution should move closer to the iso 8601 parser.
Also I don't have the parser class hooked up to a read_excel or read_ods function.
Using read_excel bothers me some... because its not an excel file. I think it would be more clear if there was either a separate read_ods function or a generic read_spreadsheet function than just pressing read_excel into being a generic spreadsheet reader.
Also this just a reader. I had never gotten around to implementing a writer.