ENH: Open Document Format ODS support in read_excel() #9070

davidovitch · 2014-12-13T11:45:25Z

This work should eventually lead to a reader and writer for ods spreadsheets (LibreOffice / OpenOffice Open Document Format). I am doing this work in the master of my clone of pandas. From the wiki I understood it is fine for me to commit to my master and rebase with upstream from time to time. Please let me know if I should follow another git workflow (this is my first PR).

This is what I have so far:

It seems to make sense to implement an OdsFile and OdsWriter in pandas/io/excel.py
I created ods test files identical to the other Excel test files.
Basic reading works, but does not handle all corner cases correctly yet
Tests and WritingOds are not implemented yet

I am not sure how to approach the following issues:

For the tests, it would make sense to also use pandas/io/tests/test_excel.py (there would be a lot of code duplication otherwise). However, this would require some kind of a more elaborate test skipping mechanism than currently implemented. One could have ods dependencies installed, but not the Excel ones and vice versa.
When considering pandas could support more spreadsheet formats, does it make sense to rename pandas/io/excel.py to spreadsheet.py, and correspondingly, read_excel to read_spreadsheet? I guess this is not easy since it a lot of stuff out there already counts on the excel naming scheme.

For the record, there is one message on the mailing list, and this PR will fix issue #2311.

ods support depends on ezodf, and issue #2311 mentions some of the alternatives.

davidovitch · 2014-12-13T12:02:48Z

As long as the tests are not implemented, you can quickly get into the action as follows:

import pandas as pd
pd.read_excel('pandas/io/tests/data/test.ods')
pd.read_excel('pandas/io/tests/data/test2.ods')
pd.read_excel('pandas/io/tests/data/test3.ods')
pd.read_excel('pandas/io/tests/data/test_converters.ods')
pd.read_excel('pandas/io/tests/data/test_types.ods')
pd.read_excel('pandas/io/tests/data/test_types2.ods')

shoyer · 2014-12-16T08:24:50Z

@davidovitch Is there a good reason why you need this functionality in pd.read_excel? To me, the natural choice would be a new function pd.read_ods. If it's just about duplicated code/docstrings, we have ways to work around that.

pandas.io.spreadsheet is certainly an option, but we'll need a shim to preserve the pandas.io.excel namespace for backwards compatibility. It would probably be easier to make a separate pandas.io.ods module and import any shared logic (ods/excel) another file.

For similar but not quite identical tests, my usual approach is to use multiple inheritance: http://stackoverflow.com/a/1323554/809705

jreback · 2014-12-16T11:27:20Z

@davidovitch well, this is using a lot of duplicated code.

You can refactor to have the engine parameter decide how to open the file as well (rather than just using ExcelFile).

I don't think adding to the namespace is useful (e.g. read_spreadsheet/ods. This should work like pd.read_excel(......, engine='ods'). Even better would be if those formats have a specific extension / format which can be determined at run-time? which could then just automatically set the engine

davidovitch · 2014-12-24T16:16:46Z

@jreback, I've refactored it as suggested: the engine parameter can now also be set to ezodf when using ods files. Everything is now under ExcelFile, which has 3 additional private methods to deal with ods files (_parse_ods, _parse_datetime, _print_ods_cellinfo).

The entire logic to derive at run-time the file type is extended and lives in ExcelFile.__init__. I don't think what I have done here is a very elegant approach though, but at least it should work in the same way for MS Excel files. If another spreadsheet file reader has to be added (gnumeric for example), the current approach probably has to be reconsidered. Maybe the ExcelWriter approach could work for the reader as well. But that will require more work from my part, and maybe the current approach would suffice for now?

Only the last reader test fails at this stage, and I have some work left in figuring out why it goes wrong.

The test that currently still fails also revealed a small typo in pandas/io/parser.py: davidovitch/pandas@49245c4

The ods writer is still on the todo list.

jreback · 2014-12-24T16:46:42Z

pandas/io/excel.py

+            elif ext in ['xls', 'xlsx', 'xlsm']:
+                self.engine = 'xlrd'
+
+        # required imports for the respective engine


hmm, lots of if/then here. Can you base on the engine return a class which does this (e.g. you create a class hierarchy of a BaseFile, EZDFFile, and XLRDFile, these last 2 are sub-classes of BaseFile). Which you then instantiate based on the engine). will be much cleaner.

jreback · 2014-12-24T16:49:55Z

also need to add ezodf to the build process. This is not available on conda. But can pip install it. I can fix that so this will tests correctly.

Also would need an update to the install.rst docs (as this is now an alternative to xlrd).

davidovitch · 2015-01-18T16:58:31Z

So, after a long pause I am trying to get this going again. I just did a rebase with upstream and pushed my latest changes to my remote clone here. However, I think I did something not entirely correct because now I can see some other commits appearing in this pull request...

@jreback, i have made an attempt to make it less messy, but I have to admit I wasn't completely sure I understood your suggestions correctly. I guess you'll let me know if this is not exactly right :-)

There is now another test failing, so this is not over yet. Anyway, thanks for reviewing this. It has been an interesting and learnfull ride so far.

jreback · 2015-01-18T21:11:01Z

@davidovitch ok, first order of biz is to rebase this. Then can see what the changes.

davidovitch · 2015-01-18T21:59:55Z

@jreback, I think git history is fixed now. All but the most recent commit were doubled previously. I have removed those commits with a rebase, followed by a push origin master -f to here.

jreback · 2015-01-18T22:02:36Z

pandas/io/excel.py

+        self.engine = 'xlrd'
+        self.extensions = ['xls', 'xlsx', 'xlsm']
+        self.io_class = type(None)
+        self.open_workbook = None


I would pass all of these arguments to the BaseFile.__init__ and have it do the assignments

davidovitch · 2015-03-15T17:53:29Z

@jreback, is this looking better? Regarding the tests for the readers (ExcelReaderTests), I added a method to determine which readers are installed, and only test for those. This is different then before where a test would skip if only one out of 3 readers was not installed. I think it looks less compact in the code, but should result in better test coverage when not all the spreadsheet readers are installed.
EDIT: On second thought, I think this helps maintainability, and it should be easier to add another reader and create tests for it.

davidovitch · 2015-04-15T16:32:19Z

@shoyer, @jreback I would like to bring this PR back to life. Is any one of you interested in another round of reviews?

shoyer · 2015-04-15T16:54:02Z

Sure, let us know when you're ready

On Wed, Apr 15, 2015 at 9:32 AM, David Verelst notifications@github.com
wrote:

@shoyer, @jreback I would like to bring this PR back to life. Is any one of you interested in another round of reviews?

Reply to this email directly or view it on GitHub:
#9070 (comment)

davidovitch · 2015-08-28T14:37:18Z

It is not for the lack of trying (you will probably not belief the amount of time I already spent on this PR), or for ignoring you comments, it is that I am failing to properly translate the comments into clean, Pythonic code that clears the quality checks of the review procedure. I agree with your assessment, but I can only conclude that I fail to boil it down into simple and clean logic.

I have some more time today and tomorrow to work on this PR, lets see where we stand after that.

jreback · 2015-08-28T17:41:36Z

@davidovitch no I believe you are trying hard! let's squash down to a single commit and rebase. I won't have time to play with this for a bit though.

davidovitch · 2015-08-29T09:16:24Z

@jreback thanks for your patience. Too bad I couldn't make it for the 0.17 release, but at least I tried and learned something in the process :-)

Before squashing everything into a single commit, what I think makes sense to do is the following:

I create a new PR that contains the refactoring of the writer tests in test_excel.py. This PR would also contain the ods test files, and the renamed MS Excel files (xls, xlsx, xlsm). I've renamed them so it is more consistent, and corresponds to the ods files. This refactoring was necessary to ease the processes of supporting multiple reader mechanisms. Regardless of the outcome of the actual implementation of the new ods reader, I think this work got to an acceptable level, and can be merged independent from the actual new ODS reader. Not that is such a significant contribution, but it would otherwise get kind of lost (or create a more laborious merge later if other changes have appeared in the mean time). I would suggest to take this one up into the 0.17 release.
I close this PR and create a new PR that only contains the changes to the reader in a single commit. In that way this messy PR and its history is kept for "the record". In the description of the new PR I summarize with what I struggled, and some of the key comments of the reviewers, and refer to the old PR.
I think it would be great if someone could step in to fix this faster and better then I can, but until that happens, I hope it is ok if I try to find away to figure out how to do it more lean, clean and Pythonic (but I will not bother the reviewers again until I really think the results have improved significantly).

If you think this it too much whatever, please say so and I will just squash into one commit and forgot about the whole thing, no hard feelings :-)

@andresmrm, if you really want this badly, you can always give this fork a spin. The current implementation actually works, and passes all the tests (if you have a lot of ODS files hanging around you can test it more extensively and report back any bugs in this PR ;-) ). However, the implementation is not on par with the code quality everyone expects from a library as Pandas.

timmie · 2015-08-29T11:54:08Z

@davidovitch
Off topic: I understand fully your situation. I was in a similar case some time ago. Occasional users who are not trained software engineers have difficulties to provide contributions to the level required at pandas.
Ironically, it was with the excel reader. My code worked, passed the test, but style and technical issues remained leaving the work stalled due to time and understanding:
#4340
#4404
#4631

Good luck and perseverance.

…files, more consistant naming of files

…omes available

jreback · 2015-08-29T18:36:46Z

@davidovitch ok on a refactored test pr (though don't include the ods files; just the new naming scheme)

davidovitch · 2015-09-02T07:16:52Z

@jreback, the refactored tests (excluding ODS files and tests as requested) are in PR #10964.
I've moved the EzODF powered ods reader into the branch ezodf_reader of my pandas fork.

hnykda · 2015-11-30T09:37:00Z

I am lost in all related PRs. What is the status of this feature (why is it closed)? I am willing to contribute somehow, but really don't know where to start.

jreback · 2015-11-30T11:32:00Z

well, you can start with this PR and create a new one that addresses the comments.

jreback · 2015-11-30T11:32:32Z

contributing guide is here

davidovitch · 2015-11-30T14:21:32Z

@hnykda I would start with rebasing the fork again on top of the current pandas master. I've been planning to do this for a while, but I haven't got around it. Let me know if you manage/plan to do it so I don't need to that again ;-) Please feel free to dig in or ask any questions you have regarding the current implementation.

hnykda · 2015-11-30T15:52:54Z

Thanks. I unfortunately don't see that in the near future, probably after the new year.

davidovitch · 2016-03-13T12:40:11Z

@hnykda I've rebased the ezodf_reader branch in my pandas fork with the pandas master in case you like to have a look at it. I might try to improve the current implementation in the near future.

TrigonaMinima · 2019-02-21T08:20:06Z

@davidovitch any plans of working on this or can I take it up?

detrout · 2019-02-21T18:22:30Z

For what its worth I wrote a ODS reader that produces a pandas DataFrame with tests for the various ways my spreadsheets became hard to parse.

I tried to stay close to the pandas _XlrdReader class

https://github.com/detrout/pandasodf

jreback · 2019-02-21T18:26:44Z

@detrout that looks pretty good - seems way enough to adapt if you’d like to do a PR

detrout · 2019-02-21T19:43:05Z

@jreback I hadn't done a PR because I hadn't bothered to implement a writer. Is it worth going ahead and submitting just the reader class?

WillAyd · 2019-02-21T19:54:13Z

@detrout that is totally fine

TrigonaMinima · 2019-02-21T21:36:25Z

@detrout, that's really clean code. I was also thinking of going the odfpy route myself. You should submit a PR since you have already finished some major chunk of it. If you won't be able to then please let me know.

detrout · 2019-02-21T21:42:32Z

@TrigonaMinima Ok I'll try to put together over the weekend.

detrout · 2019-02-23T21:55:12Z

Quick question.

odfpy uses ISO 8601 durations and one of my ods test cases has a timeduration, of PT03H45M00S. I had my own handrolled parser, because I wrote this a while ago. But pandas now has a iso8601 duration parser.

Unfortunately it assumes that there will always be a day specifier and fails with PT03H45M00S, but does work with P0DT03H45M00S. So I filed #25422

Below is the quick workaround I came up with.

def pandas_isoduration_normalization(duration):
    """Make libreoffice durations compatible with current pandas Timedelta parser

    The current parser can't handle durations that lack a time component 
    to the left of the of the T. 
    For example PT3H45M0S.
    """
    if duration.startswith('PT'):
        duration = 'P0DT' + duration[2:]
    return pandas.Timedelta(duration)

Is that acceptable for the moment?

jreback · 2019-02-23T22:06:18Z

yes that’s ok but if u can do as a separate PR first
alternatively u can do it for this one and can separate later

detrout · 2019-02-24T07:00:20Z

My rough draft PR with an alternate implementation is here #25427

davidovitch changed the title ~~Open Document Format ODS support in read_excel()~~ ENH: Open Document Format ODS support in read_excel() Dec 13, 2014

shoyer added IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel labels Dec 16, 2014

shoyer added this to the 0.16.0 milestone Dec 16, 2014

jreback reviewed Dec 24, 2014
View reviewed changes

davidovitch force-pushed the master branch from 73e4352 to 0f910db Compare January 18, 2015 21:54

jreback reviewed Jan 18, 2015
View reviewed changes

davidovitch force-pushed the master branch from 0f910db to 8a37f14 Compare February 28, 2015 10:53

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

davidovitch force-pushed the master branch 2 times, most recently from 378e4a5 to 5da115e Compare March 15, 2015 14:34

davidovitch force-pushed the master branch from b0eb3fd to e868dda Compare March 28, 2015 19:26

davidovitch force-pushed the master branch from e868dda to ad0f380 Compare April 15, 2015 16:12

davidovitch force-pushed the master branch from ad0f380 to 6ea3b54 Compare April 30, 2015 08:05

davidovitch mentioned this pull request Apr 30, 2015

Wish: Input / output for Open Document Spreadsheet (ODS) #2311

Closed

davidovitch added 4 commits August 29, 2015 17:53

TST: added test data for reading ods spreadsheets, add missing Excel …

58b1a53

…files, more consistant naming of files

added ezodf to optional dependencies

42cfdd8

TST: refactor Excel reader tests, add ODF (.ods) test when reader bec…

c2ea365

…omes available

ENH: Reading Open Document Format (.ods) spreadsheet using EzODF

f196336

davidovitch force-pushed the master branch from 6c4e3e6 to f196336 Compare August 29, 2015 16:53

davidovitch closed this Sep 1, 2015

davidovitch mentioned this pull request Sep 1, 2015

TST: refactored Excel reader tests #10964

Closed

detrout mentioned this pull request Feb 24, 2019

Class to read OpenDocument Tables #25427

Merged

ENH: Open Document Format ODS support in read_excel() #9070

ENH: Open Document Format ODS support in read_excel() #9070

Conversation

davidovitch commented Dec 13, 2014

davidovitch commented Dec 13, 2014

shoyer commented Dec 16, 2014

jreback commented Dec 16, 2014

davidovitch commented Dec 24, 2014

jreback Dec 24, 2014

Choose a reason for hiding this comment

jreback commented Dec 24, 2014

davidovitch commented Jan 18, 2015

jreback commented Jan 18, 2015

davidovitch commented Jan 18, 2015

jreback Jan 18, 2015

Choose a reason for hiding this comment

davidovitch commented Mar 15, 2015

davidovitch commented Apr 15, 2015

shoyer commented Apr 15, 2015

@shoyer, @jreback I would like to bring this PR back to life. Is any one of you interested in another round of reviews?

davidovitch commented Aug 28, 2015

jreback commented Aug 28, 2015

davidovitch commented Aug 29, 2015

timmie commented Aug 29, 2015

jreback commented Aug 29, 2015

davidovitch commented Sep 2, 2015

hnykda commented Nov 30, 2015

jreback commented Nov 30, 2015

jreback commented Nov 30, 2015

davidovitch commented Nov 30, 2015

hnykda commented Nov 30, 2015

davidovitch commented Mar 13, 2016

TrigonaMinima commented Feb 21, 2019

detrout commented Feb 21, 2019

jreback commented Feb 21, 2019

detrout commented Feb 21, 2019

WillAyd commented Feb 21, 2019

TrigonaMinima commented Feb 21, 2019

detrout commented Feb 21, 2019

detrout commented Feb 23, 2019

jreback commented Feb 23, 2019

detrout commented Feb 24, 2019