Class to read OpenDocument Tables #25427

detrout · 2019-02-24T06:59:48Z

This is primarily intended for LibreOffice calc spreadsheets but will
also work with LO Writer and probably with LO Impress documents.

This is an alternate solution to #9070
There are test cases with several different problematic LibreOffice spreadsheets.
git diff upstream/master -u | flake8 appeared to pass.

I haven't written the whatsnew entry yet, since I expect there's more work to do before this is ready. I just wanted to get the core code in for comments.

The open issue is that the workaround for #25422 is embedded in the current code (so all my tests pass right now), but that, or a better solution, should move closer to the ISO 8601 parser.

Also, I don't have the parser class hooked up to a read_excel or read_ods function.

Using read_excel bothers me some... because it's not an Excel file. I think it would be clearer if there was either a separate read_ods function or a generic read_spreadsheet function, rather than just pressing read_excel into being a generic spreadsheet reader.

Also, this is just a reader. I never got around to implementing a writer.

pep8speaks · 2019-02-24T06:59:50Z

Hello @detrout! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-02 12:42:15 UTC

codecov · 2019-02-24T07:16:29Z

Codecov Report

Merging #25427 into master will decrease coverage by 50.06%.
The diff coverage is 14.91%.

@@             Coverage Diff             @@
##           master   #25427       +/-   ##
===========================================
- Coverage   91.74%   41.68%   -50.07%     
===========================================
  Files         173      175        +2     
  Lines       52923    53037      +114     
===========================================
- Hits        48554    22107    -26447     
- Misses       4369    30930    +26561

Flag	Coverage Δ
#multiple	`?`
#single	`41.68% <14.91%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/opendocument/__init__.py	`100% <100%> (ø)`
pandas/io/opendocument/odfreader.py	`13.39% <13.39%> (ø)`
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
... and 134 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85572de...f5f42e5. Read the comment docs.

codecov · 2019-02-24T07:16:29Z

Codecov Report

Merging #25427 into master will decrease coverage by 50.74%.
The diff coverage is 14.81%.

@@             Coverage Diff             @@
##           master   #25427       +/-   ##
===========================================
- Coverage   91.86%   41.12%   -50.75%     
===========================================
  Files         180      175        -5     
  Lines       50794    50851       +57     
===========================================
- Hits        46660    20910    -25750     
- Misses       4134    29941    +25807

Flag	Coverage Δ
#multiple	`?`
#single	`41.12% <14.81%> (-0.86%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/excel/_base.py	`28.26% <100%> (-63.67%)`	⬇️
pandas/io/excel/_odfreader.py	`14.01% <14.01%> (ø)`
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/gcs.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/s3.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
... and 157 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a272b60...8302fd7. Read the comment docs.

detrout · 2019-02-24T07:18:03Z

Ah the coverage went down but that's because the tests couldn't run because odfpy wasn't available.

Also I fixed the whitespace problem.

jreback

so you are going to want to put this in pandas/io/excel and model it after one of the readers maybe _xlrd.py

tests in a similar way

then it will just work; read_excel is the generic interface here (we can discuss in a new issue whether aliasing to read_spreadsheet is ok)

detrout · 2019-02-24T20:56:12Z

Wait a second... I just realized that read_excel already supports using two totally different libraries for reading .xls and .xlsx files. Now it makes more sense why you want it hooked into io.excel. OK, it'll take me a bit to understand how the hook works.

My current reader class is ODFReader, but ODFFile or OpenDocumentFile might be more consistent with ExcelFile. Do you have any preferences?

Lastly, is there any place I can add the odfpy dependency so the CI framework will be able to run my tests?

Thank you for your comments.

Diane

jreback · 2019-02-25T12:20:55Z

My current reader class is ODFReader, but ODFFile or OpenDocumentFile might be more consistent with ExcelFile. Do you have any preferences?

you don't need to expose any of this directly. These can all be internal. invoking pd.read_excel(...., engine='ods') should just work.
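As a rough sketch of what that call looks like from the user's side: the snippet below wraps `pd.read_excel` with the engine name the thread settles on in a later comment (`engine='odf'` rather than `'ods'`). The helper name `read_ods_sheet` and the file name in the usage note are made up for illustration, and the import is deferred so the function can be defined even where pandas and odfpy are not installed.

```python
def read_ods_sheet(path, sheet_name=0):
    """Read one sheet of an OpenDocument spreadsheet into a DataFrame.

    Hypothetical convenience wrapper; assumes a pandas version with the
    'odf' engine (0.25 or later) and the odfpy package are installed.
    """
    # Deferred import: pandas (and odfpy behind it) is only needed at call time.
    import pandas as pd

    return pd.read_excel(path, sheet_name=sheet_name, engine="odf")
```

Usage would be along the lines of `df = read_ods_sheet("example.ods")`, with the reader class itself staying internal.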

jreback · 2019-02-25T12:21:35Z

deps you would add in the ci/deps/* files; don't add to every one as you also have to function w/o being included (the framework handles this for you though)
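The "function without being included" requirement can be pictured with a small standard-library check. This is only a sketch of the general pattern, not the mechanism pandas' CI or test suite uses; both helper names are hypothetical. Note that odfpy is imported under the module name `odf`.

```python
import importlib.util


def has_optional_dependency(module_name):
    """Return True if the top-level module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def require_optional_dependency(module_name, feature):
    """Raise a readable ImportError when an optional module is missing."""
    if not has_optional_dependency(module_name):
        raise ImportError(
            "{} requires the optional dependency '{}'".format(
                feature, module_name
            )
        )
```

A test module could call `has_optional_dependency("odf")` to decide whether to skip, while the reader could call `require_optional_dependency("odf", "engine='odf'")` before importing odfpy.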

detrout · 2019-02-27T06:21:07Z

Ok everything is moved into pandas.io.excel and tests/io/test_excel.py

Currently it's triggered by passing engine='odf' to read_excel. It looks like pandas is currently depending on xlrd to figure out what to do with the various Excel-style file extensions.

Should there be something to activate odf reading in ExcelFile.__init__ based on file extension, or by trying xlrd parsing and catching the error?
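For the extension-based option, a minimal sketch might look like the following. It is not what this PR implements (the engine has to be passed explicitly here); `infer_engine` and the mapping are hypothetical, and only the `.ods` to `'odf'` pairing comes from the thread.

```python
import os

# Hypothetical lookup table; extend with other extensions as needed.
_ENGINE_BY_EXTENSION = {".ods": "odf"}


def infer_engine(path, default="xlrd"):
    """Pick a reader engine from the file extension, case-insensitively."""
    extension = os.path.splitext(path)[1].lower()
    return _ENGINE_BY_EXTENSION.get(extension, default)
```

The alternative raised in the question, trying xlrd first and falling back on error, would avoid trusting the file name but costs a failed parse for every ODS file.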

detrout · 2019-02-27T06:21:29Z

Drat. I forgot the CI dependency change.

detrout · 2019-02-27T06:24:50Z

Ok I added odfpy to the travis-36.yaml file since that seemed to be a thorough test configuration.

jreback

@detrout looks really good! pls add a whatsnew note, New features in 0.25.0, also pls update the io.rst documentation as needed

pandas/io/excel/_odfreader.py

jreback · 2019-02-27T13:13:53Z

reference the issue #2311 in your whatsnew

WillAyd

Here are some comments from my side but agreed this looks very nice - kudos!

WillAyd · 2019-02-27T16:11:38Z

pandas/io/excel/_odfreader.py

+                'Unrecognized sheet identifier type {}. Please use'
+                'a string or integer'.format(type(name)))
+
+    def parse(self, sheet_name=0, **kwds):


This is already defined in the base class so ideally don't need to override here

Ok. _ODFReader wasn't derived from _BaseExcelReader... and I have a problem making it fit. The sheet object returned by xlrd has properties that I can't easily match.

As far as I can tell from looking at content.xml, the only way to know nrows, which _BaseExcelReader currently uses, is to actually parse the ODF table.
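To illustrate why the row count is not available up front, here is a small standard-library sketch that walks a table element and sums its rows, expanding the `table:number-rows-repeated` attribute that ODF uses to compress runs of identical rows. The sample XML is a hand-written fragment, not a full content.xml, and `count_rows` is a hypothetical helper rather than code from this PR.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Hand-written fragment: one plain row plus one row repeated three times.
SAMPLE = """\
<table:table xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
             table:name="Sheet1">
  <table:table-row><table:table-cell/></table:table-row>
  <table:table-row table:number-rows-repeated="3">
    <table:table-cell/>
  </table:table-row>
</table:table>
"""


def count_rows(table_xml):
    """Count logical rows, expanding table:number-rows-repeated."""
    root = ET.fromstring(table_xml)
    total = 0
    for row in root.iter("{%s}table-row" % TABLE_NS):
        # The attribute is optional; a missing value means a single row.
        repeat = row.get("{%s}number-rows-repeated" % TABLE_NS, "1")
        total += int(repeat)
    return total
```

Nothing in the table element states the total directly, so the whole row list has to be visited before a number is known.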

Is nrows the only hang up here? Had a similar discussion here:

https://github.com/pandas-dev/pandas/pull/25092/files#r255718071

So if it's a hang up for two open PRs might consider doing that as a precursor to simplify things all around

That's a good question. I'm pretty tired right now; in a few days I'll try looking at whether something other than nrows might also break.

The convert_float parameter for get_sheet_data doesn't make as much sense as ODF has separate types for integer and floats. (Here's the list of supported cell types: http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1417680_253892949 )

so it's ok to ignore a passed parameter; if it's in conflict then you could raise an error if it's explicitly passed
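One common way to tell "explicitly passed" apart from "left at its default" is a sentinel default. The sketch below is a standalone illustration of that idea, not the signature or behaviour of the reader in this PR; the function is a simplified stand-in and the error message is invented.

```python
# Sentinel: distinguishes "argument omitted" from any real value,
# including None, True, and False.
_NOT_PASSED = object()


def get_sheet_data(sheet, convert_float=_NOT_PASSED):
    """Return the rows of `sheet`, rejecting an explicit convert_float."""
    if convert_float is not _NOT_PASSED:
        raise ValueError("convert_float is not supported by this reader")
    return [list(row) for row in sheet]
```

Silently ignoring the parameter, the other option mentioned above, needs no sentinel at all; the sentinel only matters if the conflict should be reported.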

WillAyd · 2019-02-27T16:12:27Z

pandas/io/excel/_odfreader.py

+        i = self.sheet_names.index(name)
+        return self.__get_table(self.tables[i])
+
+    def get_sheet(self, name):


Do we even need this method? I see it's added as a result of the subclassed parse method but the behavior here to get a sheet via inference of position / name is not something I think the other engine has, so don't want to introduce anything inconsistent

Pending figuring out how to relate _ODFReader to the current _BaseExcelFile and whether there's a way to make the base class parse function work, I renamed this to _get_sheet to indicate it's private.

WillAyd · 2019-02-27T16:14:16Z

pandas/io/excel/_odfreader.py

+        self.document = None
+        self.tables = None
+        self.filepath_or_stream = filepath_or_stream
+        self.document = document_load(filepath_or_stream)


This is called self.book in the existing xlrd code - can we use the same naming convention here?

That's a little harder case... in _xlrd._XlrdReader book looks to be a reference to the xlrd class that provides the interface to excel files.

odfpy provides more of an ElementTree or lxml level interface than anything useful about tables. self.tables is just a dictionary pointing to the root node of table XML nodes found in the OpenDocument file. I'm not sure it's high level enough to be considered equivalent to a "book" object.

OK - you'll have to excuse that I'm not terribly familiar with odfpy. This is a rather minor sticking point so not really a blocker for this PR but consistency in the codebase is very important; it may make sense subsequently to change the xlrd code to reflect this convention or come up with an abstraction for both that reads consistently

WillAyd · 2019-02-27T16:19:01Z

pandas/tests/io/test_excel.py

+        assert book.sheet_names == ['Sheet1']
+
+    def test_read_types(self):
+        """Make sure we read ODF data types correctly


OK to remove docstring - I don't think we have these for any other tests

Some of the docstrings were just restating the function name, but some of them are descriptions of what the test is about. I'd like those more detailed comments to stay, is it ok if they're docstrings or should I turn them into # comments?

Like:

def test_read_invalid_types(self):
    """Make sure we throw an exception when encountering a new value-type

    I had to manually create an invalid ods file by directly editing
    the extracted xml. So it probably won't open in LibreOffice correctly.
    """

Comments are fine though as minimal as possible is preferred

WillAyd · 2019-02-27T16:19:54Z

pandas/tests/io/test_excel.py

+        tm.assert_equal(sheet, expected)
+
+    def test_read_invalid_types(self):
+        """Make sure we throw an exception when encountering a new value-type


Also no docstring

I'd really like the discussion about the invalid file to stick around somehow. (It fails xml validation)

detrout · 2019-02-28T07:04:55Z

Ok I resolved many of the issues and pushed them as separate commits for the moment so it's easier to see what I did.

There are a few issues I'd like to discuss further, and I left those issues open.

jreback · 2019-03-20T02:01:53Z

can you merge master

jreback · 2019-04-05T00:54:26Z

@detrout can you merge master and update; let's see if we can get this in

detrout · 2019-04-05T03:41:48Z

Sorry I got distracted. I'll try to do it tomorrow.

detrout · 2019-04-05T18:36:05Z

Ok I rebased to master, converted all my important test docstrings to comments, and added a whatsnew for v0.25 about the OpenDocument support.

I hope I remembered to fix everything (and that rebasing was the right merge strategy).

Also, since you can use Python as a macro language for LibreOffice, at some point I'd like to extend my ODFReader class to support reading tables while running in LO. (I only tried it once, and had problems with importing pandas in the LO environment, so it's going to be a while before that works. For now those experiments are going to stay in my detrout/pandasodf repository.)

WillAyd · 2019-04-06T15:46:33Z

@detrout looks like there are still some conflicts. Workflow to resolve locally on your branch should be as follows:

git fetch upstream
git merge upstream/master
# Resolve conflicts here...
git merge --continue
git push origin YOUR_BRANCH_NAME

detrout · 2019-04-07T06:22:49Z

I'm not sure what to do about this travis failure?

self = <pandas.io.excel._odfreader._ODFReader object at 0x7fbefb3f1c18>
name = 0
    def _get_sheet(self, name):
        """Given a sheet name or index, return the root ODF Table node
        """
>       if isinstance(name, compat.string_types):
E       AttributeError: module 'pandas.compat' has no attribute 'string_types'
pandas/io/excel/_odfreader.py:42: AttributeError

You asked me to use pandas.compat.string_types, and it's present in 0.23.3, but it looks to be gone in 0.25. There are just a couple of Python calls to isinstance(source, basestring) or isinstance(string, str).

jreback · 2019-04-07T21:36:47Z

@detrout just use
isinstance(name, str) as the string_types has been removed now that we only support PY3
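A standalone sketch of the name-or-index lookup with the suggested `isinstance(name, str)` check follows. It mirrors the shape of the `_get_sheet` method in the traceback above but works on a plain list of sheet names; `get_sheet_index` is a hypothetical helper, and the error wording is adapted from the review hunk earlier in the thread.

```python
def get_sheet_index(sheet_names, name):
    """Resolve a sheet name or integer position to a list index."""
    if isinstance(name, str):
        if name not in sheet_names:
            raise ValueError("No sheet named {!r}".format(name))
        return sheet_names.index(name)
    if isinstance(name, int):
        if not 0 <= name < len(sheet_names):
            raise ValueError("Sheet index {} out of range".format(name))
        return name
    raise ValueError(
        "Unrecognized sheet identifier type {}. Please use "
        "a string or integer".format(type(name))
    )
```

On Python 3 a plain `str` check covers every text type, which is why the compat alias is no longer needed.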

WillAyd

@jreback mostly green with good coverage here. Travis failures are a result of defusedxml and xlrd being installed alongside one another and not sure there are great options to change. Should just catch warning in tests?
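For reference, catching a warning around a call can be done with the standard `warnings` module alone. The sketch below uses a stand-in function that emits a `DeprecationWarning`; it does not reproduce the actual defusedxml/xlrd interaction or the way the pandas test helpers were later adjusted, and both function names are hypothetical.

```python
import warnings


def noisy_load():
    """Stand-in for a call that emits a DeprecationWarning."""
    warnings.warn("stand-in deprecation message", DeprecationWarning)
    return "loaded"


def call_ignoring_deprecation(func):
    """Run `func` with DeprecationWarning suppressed for that call only."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return func()
```

The filter change is scoped to the `with` block, so warnings raised elsewhere in the test run are still reported.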

WillAyd · 2019-06-30T19:37:07Z

pandas/tests/io/excel/test_odf.py

+    sheet = pd.read_excel("lowerdiagonal.ods", 'Sheet1',
+                          index_col=None, header=None)
+
+    assert sheet.shape == (4, 4)


Might change this and subsequent tests below to use tm.assert_frame_equal in another iteration or PR

WillAyd · 2019-06-30T19:38:21Z

pandas/tests/io/excel/test_readers.py

@@ -439,6 +445,9 @@ def test_bad_engine_raises(self, read_ext):

    @tm.network
    def test_read_from_http_url(self, read_ext):
+        if read_ext == '.ods':  # TODO: remove once on master


This test only works when the file is available on master, so have to merge first and then can try again

Seems like this code is still hanging around; I'll aim to address it when I tackle #29439.

WillAyd · 2019-07-01T18:28:52Z

@jreback all green here if you want to take a look for 0.25

jreback

are the config_init.py options updated? can you add a note in io.rst about this & update install.rst?

jreback · 2019-07-01T22:19:40Z

doc/source/whatsnew/v0.25.0.rst

@@ -164,6 +164,7 @@ Other enhancements
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
 - :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
 - :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
+- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify engine='odf' to enable. (:issue:`9070`)


can you put engine='odf' in double back ticks

WillAyd

Documentation updates done. Dialog on how to best represent parsing

detrout · 2019-07-02T05:15:59Z

Hi, ... sorry I was busy...

I finally had a chance to start reading what @WillAyd had done. (Thank you! that was a lot of updating)

I've only read up to 8a9a66c and want to read through the rest of the changes.

jreback

lgtm

jreback · 2019-07-02T12:30:45Z

doc/source/user_guide/io.rst

@@ -3217,7 +3219,20 @@ The look and feel of Excel worksheets created from pandas can be modified using
 * ``float_format`` : Format string for floating point numbers (default ``None``).
 * ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``).

+.. _io.ods:



can you add a versionchanged here

jreback · 2019-07-02T12:31:04Z

doc/source/whatsnew/v0.25.0.rst

@@ -164,6 +164,7 @@ Other enhancements
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
 - :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
 - :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
+- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. (:issue:`9070`)


can you add a link to the new docs

jreback · 2019-07-02T12:31:18Z

doc/source/user_guide/io.rst

+using the ``odfpy`` module. The semantics and features for reading
+OpenDocument spreadsheets match what can be done for `Excel files`_ using
+``engine='odf'``.
+


can you show an example here (code is ok)

jreback · 2019-07-03T21:46:03Z

thanks @detrout and @WillAyd for the review and patches; very nice feature!

detrout · 2019-07-04T03:56:49Z

Thank you @WillAyd and @jreback I'm glad this made it in!

detrout mentioned this pull request Feb 24, 2019

ENH: Open Document Format ODS support in read_excel() #9070

Closed

jreback requested changes Feb 24, 2019

gfyoung added Enhancement IO Data IO issues that don't fit into a more specific label labels Feb 25, 2019

detrout force-pushed the libreoffice-support branch from f5f42e5 to d8495dc Compare February 27, 2019 06:16

detrout force-pushed the libreoffice-support branch from d8495dc to 050a5e7 Compare February 27, 2019 06:23

jreback requested changes Feb 27, 2019

jreback requested a review from WillAyd February 27, 2019 13:12

jreback mentioned this pull request Feb 27, 2019

Openpyxl engine for reading excel files #25092

Merged

5 tasks

WillAyd requested changes Feb 27, 2019

detrout force-pushed the libreoffice-support branch from 1e43355 to 4b278dc Compare April 7, 2019 03:45

WillAyd reviewed Jun 30, 2019

WillAyd added 4 commits July 1, 2019 06:36

Removed one-off tests

8ce45b4

Handled defusedxml warnings

f9f88b0

Updated assert_warnings funcs to allow DeprecationWarnings

3e0d758

Merge remote-tracking branch 'upstream/master' into libreoffice-support

ff28993

jreback requested changes Jul 1, 2019

WillAyd added 4 commits July 1, 2019 20:45

Updated to config_init.py

7396ad6

Updated whatsnew

5a440a4

Updated io.rst

250a3d3

Merge remote-tracking branch 'upstream/master' into libreoffice-support

d7e7d05

WillAyd reviewed Jul 2, 2019

WillAyd added 3 commits July 2, 2019 00:04

Refactored to simplify

93adedb

Removed unnecessary test

62a37e7

lint fixup

13fb76f

WillAyd added 2 commits July 2, 2019 07:51

mypy error

fb6c5ee

Merge remote-tracking branch 'upstream/master' into libreoffice-support

5c839f4

jreback requested changes Jul 2, 2019

Doc updates

4026fc1

WillAyd mentioned this pull request Jul 3, 2019

RLS: 0.25.0 #24950

Closed

jreback added this to the 0.25.0 milestone Jul 3, 2019

jreback approved these changes Jul 3, 2019

jreback merged commit 23099f7 into pandas-dev:master Jul 3, 2019

solstag mentioned this pull request Jul 3, 2019

Wish: Write support for Open Document Spreadsheet (ODS) #27222

Closed

detrout deleted the libreoffice-support branch July 4, 2019 03:56

WillAyd mentioned this pull request Jul 7, 2019

Enhancement: XLSB support in read_excel() #8540

Closed

alimcmaster1 mentioned this pull request Nov 13, 2019

CLN: Remove references to user branch (follow up) #29603

Merged

2 tasks
