Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Open Document Format ODS support in read_excel() #9070

Closed
wants to merge 4 commits into from

Conversation

davidovitch
Copy link
Contributor

This work should eventually lead to a reader and writer for ods spreadsheets (LibreOffice / OpenOffice Open Document Format). I am doing this work in the master of my clone of pandas. From the wiki I understood it is fine for me to commit to my master and rebase with upstream from time to time. Please let me know if I should follow another git workflow (this is my first PR).

This is what I have so far:

  • It seems to make sense to implement an OdsFile and OdsWriter in pandas/io/excel.py
  • I created ods test files identical to the other Excel test files.
  • Basic reading works, but does not handle all corner cases correctly yet
  • Tests and WritingOds are not implemented yet

I am not sure how to approach the following issues:

  • For the tests, it would make sense to also use pandas/io/tests/test_excel.py (there would be a lot of code duplication otherwise). However, this would require some kind of a more elaborate test skipping mechanism than currently implemented. One could have ods dependencies installed, but not the Excel ones and vice versa.
  • When considering pandas could support more spreadsheet formats, does it make sense to rename pandas/io/excel.py to spreadsheet.py, and correspondingly, read_excel to read_spreadsheet? I guess this is not easy since it a lot of stuff out there already counts on the excel naming scheme.

For the record, there is one message on the mailing list, and this PR will fix issue #2311.

ods support depends on ezodf, and issue #2311 mentions some of the alternatives.

@davidovitch
Copy link
Contributor Author

As long as the tests are not implemented, you can quickly get into the action as follows:

import pandas as pd
pd.read_excel('pandas/io/tests/data/test.ods')
pd.read_excel('pandas/io/tests/data/test2.ods')
pd.read_excel('pandas/io/tests/data/test3.ods')
pd.read_excel('pandas/io/tests/data/test_converters.ods')
pd.read_excel('pandas/io/tests/data/test_types.ods')
pd.read_excel('pandas/io/tests/data/test_types2.ods')

@davidovitch davidovitch changed the title Open Document Format ODS support in read_excel() ENH: Open Document Format ODS support in read_excel() Dec 13, 2014
@shoyer
Copy link
Member

shoyer commented Dec 16, 2014

@davidovitch Is there a good reason why you need this functionality in pd.read_excel? To me, the natural choice would be a new function pd.read_ods. If it's just about duplicated code/docstrings, we have ways to work around that.

pandas.io.spreadsheet is certainly an option, but we'll need a shim to preserve the pandas.io.excel namespace for backwards compatibility. It would probably be easier to make a separate pandas.io.ods module and import any shared logic (ods/excel) another file.

For similar but not quite identical tests, my usual approach is to use multiple inheritance: http://stackoverflow.com/a/1323554/809705

@shoyer shoyer added IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel labels Dec 16, 2014
@shoyer shoyer added this to the 0.16.0 milestone Dec 16, 2014
@jreback
Copy link
Contributor

jreback commented Dec 16, 2014

@davidovitch well, this is using a lot of duplicated code.

You can refactor to have the engine parameter decide how to open the file as well (rather than just using ExcelFile).

I don't think adding to the namespace is useful (e.g. read_spreadsheet/ods. This should work like pd.read_excel(......, engine='ods'). Even better would be if those formats have a specific extension / format which can be determined at run-time? which could then just automatically set the engine

@davidovitch
Copy link
Contributor Author

@jreback, I've refactored it as suggested: the engine parameter can now also be set to ezodf when using ods files. Everything is now under ExcelFile, which has 3 additional private methods to deal with ods files (_parse_ods, _parse_datetime, _print_ods_cellinfo).

The entire logic to derive at run-time the file type is extended and lives in ExcelFile.__init__. I don't think what I have done here is a very elegant approach though, but at least it should work in the same way for MS Excel files. If another spreadsheet file reader has to be added (gnumeric for example), the current approach probably has to be reconsidered. Maybe the ExcelWriter approach could work for the reader as well. But that will require more work from my part, and maybe the current approach would suffice for now?

Only the last reader test fails at this stage, and I have some work left in figuring out why it goes wrong.

The test that currently still fails also revealed a small typo in pandas/io/parser.py: davidovitch/pandas@49245c4

The ods writer is still on the todo list.

elif ext in ['xls', 'xlsx', 'xlsm']:
self.engine = 'xlrd'

# required imports for the respective engine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, lots of if/then here. Can you base on the engine return a class which does this (e.g. you create a class hierarchy of a BaseFile, EZDFFile, and XLRDFile, these last 2 are sub-classes of BaseFile). Which you then instantiate based on the engine). will be much cleaner.

@jreback
Copy link
Contributor

jreback commented Dec 24, 2014

also need to add ezodf to the build process. This is not available on conda. But can pip install it. I can fix that so this will tests correctly.

Also would need an update to the install.rst docs (as this is now an alternative to xlrd).

@davidovitch
Copy link
Contributor Author

So, after a long pause I am trying to get this going again. I just did a rebase with upstream and pushed my latest changes to my remote clone here. However, I think I did something not entirely correct because now I can see some other commits appearing in this pull request...

@jreback, i have made an attempt to make it less messy, but I have to admit I wasn't completely sure I understood your suggestions correctly. I guess you'll let me know if this is not exactly right :-)

There is now another test failing, so this is not over yet. Anyway, thanks for reviewing this. It has been an interesting and learnfull ride so far.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

@davidovitch ok, first order of biz is to rebase this. Then can see what the changes.

@davidovitch
Copy link
Contributor Author

@jreback, I think git history is fixed now. All but the most recent commit were doubled previously. I have removed those commits with a rebase, followed by a push origin master -f to here.

self.engine = 'xlrd'
self.extensions = ['xls', 'xlsx', 'xlsm']
self.io_class = type(None)
self.open_workbook = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would pass all of these arguments to the BaseFile.__init__ and have it do the assignments

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@davidovitch davidovitch force-pushed the master branch 2 times, most recently from 378e4a5 to 5da115e Compare March 15, 2015 14:34
@davidovitch
Copy link
Contributor Author

@jreback, is this looking better? Regarding the tests for the readers (ExcelReaderTests), I added a method to determine which readers are installed, and only test for those. This is different then before where a test would skip if only one out of 3 readers was not installed. I think it looks less compact in the code, but should result in better test coverage when not all the spreadsheet readers are installed.
EDIT: On second thought, I think this helps maintainability, and it should be easier to add another reader and create tests for it.

@davidovitch
Copy link
Contributor Author

@shoyer, @jreback I would like to bring this PR back to life. Is any one of you interested in another round of reviews?

@shoyer
Copy link
Member

shoyer commented Apr 15, 2015

Sure, let us know when you're ready

On Wed, Apr 15, 2015 at 9:32 AM, David Verelst notifications@github.com
wrote:

@shoyer, @jreback I would like to bring this PR back to life. Is any one of you interested in another round of reviews?

Reply to this email directly or view it on GitHub:
#9070 (comment)

@davidovitch
Copy link
Contributor Author

It is not for the lack of trying (you will probably not belief the amount of time I already spent on this PR), or for ignoring you comments, it is that I am failing to properly translate the comments into clean, Pythonic code that clears the quality checks of the review procedure. I agree with your assessment, but I can only conclude that I fail to boil it down into simple and clean logic.

I have some more time today and tomorrow to work on this PR, lets see where we stand after that.

@jreback
Copy link
Contributor

jreback commented Aug 28, 2015

@davidovitch no I believe you are trying hard! let's squash down to a single commit and rebase. I won't have time to play with this for a bit though.

@davidovitch
Copy link
Contributor Author

@jreback thanks for your patience. Too bad I couldn't make it for the 0.17 release, but at least I tried and learned something in the process :-)

Before squashing everything into a single commit, what I think makes sense to do is the following:

  • I create a new PR that contains the refactoring of the writer tests in test_excel.py. This PR would also contain the ods test files, and the renamed MS Excel files (xls, xlsx, xlsm). I've renamed them so it is more consistent, and corresponds to the ods files. This refactoring was necessary to ease the processes of supporting multiple reader mechanisms. Regardless of the outcome of the actual implementation of the new ods reader, I think this work got to an acceptable level, and can be merged independent from the actual new ODS reader. Not that is such a significant contribution, but it would otherwise get kind of lost (or create a more laborious merge later if other changes have appeared in the mean time). I would suggest to take this one up into the 0.17 release.
  • I close this PR and create a new PR that only contains the changes to the reader in a single commit. In that way this messy PR and its history is kept for "the record". In the description of the new PR I summarize with what I struggled, and some of the key comments of the reviewers, and refer to the old PR.
  • I think it would be great if someone could step in to fix this faster and better then I can, but until that happens, I hope it is ok if I try to find away to figure out how to do it more lean, clean and Pythonic (but I will not bother the reviewers again until I really think the results have improved significantly).

If you think this it too much whatever, please say so and I will just squash into one commit and forgot about the whole thing, no hard feelings :-)

@andresmrm, if you really want this badly, you can always give this fork a spin. The current implementation actually works, and passes all the tests (if you have a lot of ODS files hanging around you can test it more extensively and report back any bugs in this PR ;-) ). However, the implementation is not on par with the code quality everyone expects from a library as Pandas.

@timmie
Copy link
Contributor

timmie commented Aug 29, 2015

@davidovitch
Off topic: I understand fully your situation. I was in a similar case some time ago. Occasional users who are not trained software engineers have difficulties to provide contributions to the level required at pandas.
Ironically, it was with the excel reader. My code worked, passed the test, but style and technical issues remained leaving the work stalled due to time and understanding:
#4340
#4404
#4631

Good luck and perseverance.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2015

@davidovitch ok on a refactored test pr (though don't include the ods files; just the new naming scheme)

@davidovitch
Copy link
Contributor Author

@jreback, the refactored tests (excluding ODS files and tests as requested) are in PR #10964.
I've moved the EzODF powered ods reader into the branch ezodf_reader of my pandas fork.

@hnykda
Copy link

hnykda commented Nov 30, 2015

I am lost in all related PRs. What is the status of this feature (why is it closed)? I am willing to contribute somehow, but really don't know where to start.

@jreback
Copy link
Contributor

jreback commented Nov 30, 2015

well, you can start with this PR and create a new one that addresses the comments.

@jreback
Copy link
Contributor

jreback commented Nov 30, 2015

contributing guide is here

@davidovitch
Copy link
Contributor Author

@hnykda I would start with rebasing the fork again on top of the current pandas master. I've been planning to do this for a while, but I haven't got around it. Let me know if you manage/plan to do it so I don't need to that again ;-) Please feel free to dig in or ask any questions you have regarding the current implementation.

@hnykda
Copy link

hnykda commented Nov 30, 2015

Thanks. I unfortunately don't see that in the near future, probably after the new year.

@davidovitch
Copy link
Contributor Author

@hnykda I've rebased the ezodf_reader branch in my pandas fork with the pandas master in case you like to have a look at it. I might try to improve the current implementation in the near future.

@TrigonaMinima
Copy link

@davidovitch any plans of working on this or can I take it up?

@detrout
Copy link
Contributor

detrout commented Feb 21, 2019

For what its worth I wrote a ODS reader that produces a pandas DataFrame with tests for the various ways my spreadsheets became hard to parse.

I tried to stay close to the pandas _XlrdReader class

https://github.com/detrout/pandasodf

@jreback
Copy link
Contributor

jreback commented Feb 21, 2019

@detrout that looks pretty good - seems way enough to adapt if you’d like to do a PR

@detrout
Copy link
Contributor

detrout commented Feb 21, 2019

@jreback I hadn't done a PR because I hadn't bothered to implement a writer. Is it worth going ahead and submitting just the reader class?

@WillAyd
Copy link
Member

WillAyd commented Feb 21, 2019

@detrout that is totally fine

@TrigonaMinima
Copy link

@detrout, that's really clean code. I was also thinking of going the odfpy route myself. You should submit a PR since you have already finished some major chunk of it. If you won't be able to then please let me know.

@detrout
Copy link
Contributor

detrout commented Feb 21, 2019

@TrigonaMinima Ok I'll try to put together over the weekend.

@detrout
Copy link
Contributor

detrout commented Feb 23, 2019

Quick question.

odfpy uses ISO 8601 durations and one of my ods test cases has a timeduration, of PT03H45M00S. I had my own handrolled parser, because I wrote this a while ago. But pandas now has a iso8601 duration parser.

Unfortunately it assumes that there will always be a day specifier and fails with PT03H45M00S, but does work with P0DT03H45M00S. So I filed #25422

Below is the quick workaround I came up with.

def pandas_isoduration_normalization(duration):
    """Make libreoffice durations compatible with current pandas Timedelta parser

    The current parser can't handle durations that lack a time component 
    to the left of the of the T. 
    For example PT3H45M0S.
    """
    if duration.startswith('PT'):
        duration = 'P0DT' + duration[2:]
    return pandas.Timedelta(duration)

Is that acceptable for the moment?

@jreback
Copy link
Contributor

jreback commented Feb 23, 2019

yes that’s ok but if u can do as a separate PR first
alternatively u can do it for this one and can separate later

@detrout
Copy link
Contributor

detrout commented Feb 24, 2019

My rough draft PR with an alternate implementation is here #25427

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants