Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Class to read OpenDocument Tables #25427

Merged
merged 63 commits into from
Jul 3, 2019

Conversation

detrout
Copy link
Contributor

@detrout detrout commented Feb 24, 2019

closes #2311

This is primarily intended for LibreOffice calc spreadsheets but will
also work with LO Writer and probably with LO Impress documents.

This is an alternate solution to #9070
There are test cases with several different problematic LibreOffice spread sheets.
git diff upstream/master -u | flake8 appeared to pass.

... I didn't do the whats new entry. Though I expect there's some more work to do before submitting this. I just wanted to get the core code in for comments.

The open issues is, the workaround for #25422 is embedded in the current code (so all my tests pass right now) but that, or a better solution should move closer to the iso 8601 parser.

Also I don't have the parser class hooked up to a read_excel or read_ods function.

Using read_excel bothers me some... because its not an excel file. I think it would be more clear if there was either a separate read_ods function or a generic read_spreadsheet function than just pressing read_excel into being a generic spreadsheet reader.

Also this just a reader. I had never gotten around to implementing a writer.

@pep8speaks
Copy link

pep8speaks commented Feb 24, 2019

Hello @detrout! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-02 12:42:15 UTC

@codecov
Copy link

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25427 into master will decrease coverage by 50.06%.
The diff coverage is 14.91%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25427       +/-   ##
===========================================
- Coverage   91.74%   41.68%   -50.07%     
===========================================
  Files         173      175        +2     
  Lines       52923    53037      +114     
===========================================
- Hits        48554    22107    -26447     
- Misses       4369    30930    +26561
Flag Coverage Δ
#multiple ?
#single 41.68% <14.91%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/opendocument/__init__.py 100% <100%> (ø)
pandas/io/opendocument/odfreader.py 13.39% <13.39%> (ø)
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 134 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85572de...f5f42e5. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25427 into master will decrease coverage by 50.74%.
The diff coverage is 14.81%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25427       +/-   ##
===========================================
- Coverage   91.86%   41.12%   -50.75%     
===========================================
  Files         180      175        -5     
  Lines       50794    50851       +57     
===========================================
- Hits        46660    20910    -25750     
- Misses       4134    29941    +25807
Flag Coverage Δ
#multiple ?
#single 41.12% <14.81%> (-0.86%) ⬇️
Impacted Files Coverage Δ
pandas/io/excel/_base.py 28.26% <100%> (-63.67%) ⬇️
pandas/io/excel/_odfreader.py 14.01% <14.01%> (ø)
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/gcs.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/s3.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
... and 157 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a272b60...8302fd7. Read the comment docs.

@detrout
Copy link
Contributor Author

detrout commented Feb 24, 2019

Ah the coverage went down but that's because the tests couldn't run because odfpy wasn't available.

Also I fixed the whitespace problem.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you are going to want to put this in pandas/io/excel and model it after one of the readers maybe _xlrd.py

tests in s similar way

then it will just work; read_excel is the generic interface here (we can discuss in a new issue whether aliasing to read_spreadsheet is ok)

@detrout
Copy link
Contributor Author

detrout commented Feb 24, 2019 via email

@gfyoung gfyoung added Enhancement IO Data IO issues that don't fit into a more specific label labels Feb 25, 2019
@jreback
Copy link
Contributor

jreback commented Feb 25, 2019

My current reader class is ODFReader, but ODFFile or OpenDocumentFile might be more consistent with ExcelFile. Do you have any preferences?

you don't need to expose any of this directly. These can all be internal. invoking pd.read_excel(...., engine='ods') should just work.

@jreback
Copy link
Contributor

jreback commented Feb 25, 2019

deps you would add in the ci/deps/* files; don't add to every one as you also have to function w/o being included (the framework handles this for you though)

@detrout
Copy link
Contributor Author

detrout commented Feb 27, 2019

Ok everything is moved into pandas.io.excel and tests/io/test_excel.py

Currently its triggered by passing engine='odf' to read_excel. It looks like pandas is currently depending on xlrd to figure out what to do with the various excel style file extensions.

Should there be something to activate odf reading in ExcelFile.init based on file extension or trying xlrd parsing and catching the error?

@detrout
Copy link
Contributor Author

detrout commented Feb 27, 2019

Drat. I forgot the CI dependency change.

@detrout
Copy link
Contributor Author

detrout commented Feb 27, 2019

Ok I added odfpy to the travis-36.yaml file since that seemed to be a thorough test configuration.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@detrout looks really good! pls add a whatsnew note, New features in 0.25.0, also pls update the io.rst documentation as needed

pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
@jreback jreback requested a review from WillAyd February 27, 2019 13:12
@jreback
Copy link
Contributor

jreback commented Feb 27, 2019

references the issue #2311 in your whatsnew

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some comments from my side but agreed this looks very nice - kudos!

pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
'Unrecognized sheet identifier type {}. Please use'
'a string or integer'.format(type(name)))

def parse(self, sheet_name=0, **kwds):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already defined in the base class so ideally don't need to override here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. _ODFReader wasn't derived from _BaseExcelReader... and I have a problem making it fit. The sheet object returned by xlrd has properties that I can't easily match.

As far as I can tell from looking at the content.xml the only way to know the nrows, which is currently used by _BaseExcelReader is to actually parse the odf table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is nrows the only hang up here? Had a similar discussion here:

https://github.com/pandas-dev/pandas/pull/25092/files#r255718071

So if it's a hang up for two open PRs might consider doing that as a precursor to simplify things all around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I'm pretty tired right now I'll try looking if something other than nrows might go also break in a few days.

The convert_float parameter for get_sheet_data doesn't make as much sense as ODF has separate types for integer and floats. (Here's the list of supported cell types: http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1417680_253892949 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so its ok to ignore a passed parameter; if its in conflict then you could raise an error if its explicity passed

i = self.sheet_names.index(name)
return self.__get_table(self.tables[i])

def get_sheet(self, name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this method? I see it's added as a result of the subclassed parse method but the behavior here to get a sheet via inference of position / name is not something I think the other engine has, so don't want to introduce anything inconsistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending figuring out how to relate _ODFReader to the current _BaseExcelFile and if there's a way to make the base class parse function work, I renamed this to _get_sheet to indicate its private.

pandas/io/excel/_odfreader.py Outdated Show resolved Hide resolved
self.document = None
self.tables = None
self.filepath_or_stream = filepath_or_stream
self.document = document_load(filepath_or_stream)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called self.book in the existing xlrd code - can we use the same naming convention here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a little harder case... in _xlrd._XlrdReader book looks to be a reference to the xlrd class that provides the interface to excel files.

odfpy provides more of an ElementTree or lxml level interface than anything useful about tables. self.tables is just a dictionary pointing to the root node of table XML nodes found in the OpenDocument file. I'm not sure its high level enough to be considered equivalent to a "book" object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - you'll have to excuse that I'm not terribly familiar with odfpy. This is a rather minor sticking point so not really a blocker for this PR but consistency in the codebase is very important; it may make sense subsequently to change the xlrd code to reflect this convention or come up with an abstraction for both that reads consistently

pandas/tests/io/test_excel.py Outdated Show resolved Hide resolved
assert book.sheet_names == ['Sheet1']

def test_read_types(self):
"""Make sure we read ODF data types correctly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to remove docstring - I don't think we have these for any other tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the docstrings were just restating the function name, but some of them are descriptions of what the test is about. I'd like those more detailed comments to stay, is it ok if they're docstrings or should I turn them into # comments?

Like:

    def test_read_invalid_types(self):
        """Make sure we throw an exception when encountering a new value-type

        I had to manually create an invalid ods file by directly
        editing the extracted xml. So it probably won't open in
        LibreOffice correctly.
        """

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are fine though as minimal as possible is preferred

pandas/tests/io/test_excel.py Outdated Show resolved Hide resolved
tm.assert_equal(sheet, expected)

def test_read_invalid_types(self):
"""Make sure we throw an exception when encountering a new value-type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also no docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like the discussion about the invalid file to stick around somehow. (It fails xml validation)

pandas/tests/io/test_excel.py Outdated Show resolved Hide resolved
@detrout
Copy link
Contributor Author

detrout commented Feb 28, 2019

Ok I resolved many of the issues and pushed them as separate commits for the moment so its easier to see what I did.

There's a few issues I'd like to discuss further, and I left those issues open.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

can you merge master

@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

@detrout can you merge master and update; let's see if we can get this in

@detrout
Copy link
Contributor Author

detrout commented Apr 5, 2019

Sorry I got distracted. I'll try to do it tomorrow.

@detrout
Copy link
Contributor Author

detrout commented Apr 5, 2019

Ok I rebased to master. Convered all my important test docstrings to comments and added a whatsnew for v0.25 about the OpenDocument support.

I hope I remembered to fix everything (and that rebasing was the right merge strategy)

Also since you use python as a macro language for LibreOffice, at some point I'd like to extend my ODFReader class to support reading tables while running in LO. (I only tried it once, and had problems with importing pandas in the LO environment, so it's going to be a while before that works. For now those experiments are going to stay in my detrout/pandasodf repository)

@WillAyd
Copy link
Member

WillAyd commented Apr 6, 2019

@detrout looks like there are still some conflicts. Workflow to resolve locally on your branch should be as follows:

git fetch upstream
git merge upstream/master
# Resolve conflicts here...
git merge --continue
git push origin YOUR_BRANCH_NAME

@detrout
Copy link
Contributor Author

detrout commented Apr 7, 2019

I'm not sure what to do about this travis failure?

self = <pandas.io.excel._odfreader._ODFReader object at 0x7fbefb3f1c18>
name = 0
    def _get_sheet(self, name):
        """Given a sheet name or index, return the root ODF Table node
        """
>       if isinstance(name, compat.string_types):
E       AttributeError: module 'pandas.compat' has no attribute 'string_types'
pandas/io/excel/_odfreader.py:42: AttributeError

You asked me to use pandas.compat.string_types, and its present in 0.23.3, but it looks gone in 0.25. There's just a couple of python calls to isinstance(source, bastring) or (isinstance(string, str)

@jreback
Copy link
Contributor

jreback commented Apr 7, 2019

@detrout just use
isinstance(name, str) as the string_types has been removed now that we only support PY3

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback mostly green with good coverage here. Travis failures are a result of defusedxml and xlrd being installed alongside one another and not sure there are great options to change. Should just catch warning in tests?

pandas/io/excel/_odfreader.py Show resolved Hide resolved
sheet = pd.read_excel("lowerdiagonal.ods", 'Sheet1',
index_col=None, header=None)

assert sheet.shape == (4, 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might change this and subsequent tests below to use tm.assert_frame_equal in another iteration or PR

@@ -439,6 +445,9 @@ def test_bad_engine_raises(self, read_ext):

@tm.network
def test_read_from_http_url(self, read_ext):
if read_ext == '.ods': # TODO: remove once on master
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only works when the file is available on master, so have to merge first and then can try again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this code is still hanging around ill aim to address it when I tackle: #29439

@WillAyd
Copy link
Member

WillAyd commented Jul 1, 2019

@jreback all green here if you want to take a look for 0.25

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the config_init.py options updated? can you add a note in io.rst about this & update install.rst?

@@ -164,6 +164,7 @@ Other enhancements
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
- :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
- :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify engine='odf' to enable. (:issue:`9070`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put enable='odf' in double back ticks

pandas/io/excel/_odfreader.py Show resolved Hide resolved
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation updates done. Dialog on how to best represent parsing

pandas/io/excel/_odfreader.py Show resolved Hide resolved
@detrout
Copy link
Contributor Author

detrout commented Jul 2, 2019

Hi, ... sorry I was busy...

I finally had a chance to start reading what @WillAyd had done. (Thank you! that was a lot of updating)

I've only read up to 8a9a66c and want to read through the rest of the changes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -3217,7 +3219,20 @@ The look and feel of Excel worksheets created from pandas can be modified using
* ``float_format`` : Format string for floating point numbers (default ``None``).
* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``).

.. _io.ods:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionchanged here

@@ -164,6 +164,7 @@ Other enhancements
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
- :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
- :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. (:issue:`9070`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a link to the new docs

using the ``odfpy`` module. The semantics and features for reading
OpenDocument spreadsheets match what can be done for `Excel files`_ using
``engine='odf'``.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show an example here (code is ok)

pandas/tests/io/excel/test_readers.py Show resolved Hide resolved
@WillAyd WillAyd mentioned this pull request Jul 3, 2019
@jreback jreback added this to the 0.25.0 milestone Jul 3, 2019
@jreback jreback merged commit 23099f7 into pandas-dev:master Jul 3, 2019
@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

thanks @detrout and @WillAyd for the review and patches; very nice feature!

@detrout
Copy link
Contributor Author

detrout commented Jul 4, 2019

Thank you @WillAyd and @jreback I'm glad this made it in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wish: Input / output for Open Document Spreadsheet (ODS)
6 participants