feat(python): add `openpyxl` as a new/optional engine for `read_excel` #6183

bvanelli · 2023-01-11T22:00:47Z

Improvements to Excel support for polars, including a new exporter and an importer with better type inference, i.e. handling null entries, correctly parsing datetimes, etc.

There are still some things to do:

- Fix pipelines, lint and format properly, and make sure mypy passes.
- Improve test coverage.
- Add openpyxl as an optional installable.
- Improve the API a little.
- ~~(OPTIONAL) Include basic optional styling~~

alexander-beedie · 2023-01-12T06:15:11Z

@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

ritchie46 · 2023-01-12T07:55:32Z

> @bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

Well.. I really don't think we should ship an excel reader / writer in our binary. It is a format I really don't want to push either.

Having this as an optional python dependency could be added as a utility function if it doesn't add much maintenance and complexity.

alexander-beedie · 2023-01-12T08:04:56Z

> Having this as an optional python dependency could be added as a utility function if it doesn't add much maintenance and complexity.

Gotcha; in that case I'll take care of it on the Python side (with xlsxwriter), and if it gets traction or there are further pushes for it on the Rust side later we'll be in a good position to translate (though your point about it ending up in the binary is well-taken; my mental model was defaulting to Python-space there, where it's just a pip install away, and the library doesn't actually live in our code ;)

bvanelli · 2023-01-14T16:52:43Z

Currently, the read_excel API uses non-Pythonic sheet indexing, as sheet indices start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Ideally, the first sheet id should be 0, but I don't want to make changes that could break people's code, so should I:

- keep the behavior of the new parser the same, starting at 1, or
- change the behavior only for the openpyxl driver, leaving xlsx2csv as is, or
- make both behave the same, possibly breaking current implementations?
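To make the first option concrete, here is a minimal sketch of keeping the 1-based `sheet_id` at the API boundary and translating it internally. The helper name `resolve_sheet_name` and the convention that `None` means "all sheets" are assumptions for illustration, not part of polars or xlsx2csv.

```python
from __future__ import annotations


def resolve_sheet_name(sheet_names: list[str], sheet_id: int | None) -> str | None:
    """Translate a 1-based sheet_id into a sheet name (hypothetical helper).

    Returning None stands for "all sheets" in this sketch.
    """
    if sheet_id is None:
        return None
    if not 1 <= sheet_id <= len(sheet_names):
        raise ValueError(
            f"sheet_id must be between 1 and {len(sheet_names)}, got {sheet_id}"
        )
    # Convert the user-facing 1-based id into a 0-based list index.
    return sheet_names[sheet_id - 1]
```

With this shape, both engines could share one translation step, so the user-facing numbering stays the same whichever engine is selected.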

Also thanks @alexander-beedie for fixing the format_path issue, it was driving me insane because it always ran fine in my local machine but not on CI 😄

ghuls · 2023-01-14T22:52:58Z

> Currently, the read_excel API uses non-Pythonic sheet indexing, as sheet indices start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Maybe make an issue there that they should use None for all sheets instead of 0.

bvanelli · 2023-01-21T14:22:04Z

I marked the PR as ready, as I have written some tests and also improved test coverage for the old reader.

If possible, could someone take a look at how I call the methods in the tests? For example:

    df_by_sheet_id = pl.read_excel(  # type: ignore[call-overload]
        example_file, sheet_id=1
    )

I had to introduce the ignore[call-overload] because mypy was complaining. I'm not exactly sure why, because I have little experience with mypy. The same ignore was already used in the existing tests (see the line below, taken from master), so I used it in my tests as well, but there must be a reason why it's triggering.

polars/py-polars/tests/unit/io/test_excel.py

Line 18 in 32c97cf

df = pl.read_excel(example_file, sheet_id=None) # type: ignore[call-overload]
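The thread does not settle why mypy reports `call-overload` here. One possible cause, offered as an assumption rather than a diagnosis of the polars code, is an overload set in which no signature accepts the arguments as passed (for instance because the overloads omit defaults). A minimal sketch with a hypothetical `read_sheets` function whose overloads do carry a default:

```python
from __future__ import annotations

from typing import overload


# Hypothetical function; the names and return types are illustrative only.
@overload
def read_sheets(path: str, sheet_id: None = ...) -> dict[str, list[int]]: ...
@overload
def read_sheets(path: str, sheet_id: int) -> list[int]: ...


def read_sheets(
    path: str, sheet_id: int | None = None
) -> dict[str, list[int]] | list[int]:
    # Stand-in data instead of a real workbook.
    data = {"Sheet1": [1, 2], "Sheet2": [3]}
    if sheet_id is None:
        return data  # all sheets
    return list(data.values())[sheet_id - 1]  # 1-based id, as in the thread
```

Because the first overload declares `sheet_id: None = ...`, a call that leaves `sheet_id` out still matches an overload. If that default were missing from every overload, a type checker would reject the call even though it works at runtime.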

alexander-beedie · 2023-01-22T17:57:02Z

Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

bvanelli · 2023-01-22T18:25:55Z

> Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

Ah, sorry, I misunderstood then. I implemented it through engines, just like pandas, so multiple libraries can be supported as optional dependencies. Maybe the two can be adapted together?
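The engine idea mentioned here can be sketched as a small dispatch table. Everything below (`read_excel_sketch`, the `_read_with_*` stand-ins, and the returned dictionaries) is hypothetical and only illustrates the pattern; it is not the code from this PR.

```python
from __future__ import annotations

from typing import Any, Callable


def _read_with_xlsx2csv(source: str) -> dict[str, Any]:
    # Stand-in for a reader that would call xlsx2csv.
    return {"engine": "xlsx2csv", "source": source}


def _read_with_openpyxl(source: str) -> dict[str, Any]:
    # Stand-in for a reader that would call openpyxl.
    return {"engine": "openpyxl", "source": source}


# Map each engine name to the function that implements it.
_ENGINES: dict[str, Callable[[str], dict[str, Any]]] = {
    "xlsx2csv": _read_with_xlsx2csv,
    "openpyxl": _read_with_openpyxl,
}


def read_excel_sketch(source: str, engine: str = "xlsx2csv") -> dict[str, Any]:
    try:
        reader = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(_ENGINES)}"
        ) from None
    return reader(source)
```

Adding another backend then means registering one more entry in `_ENGINES`, while the default keeps existing calls unchanged.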

cnpryer · 2023-02-19T05:14:19Z

Any update on this? I have some use cases for writing to excel where this would be great to have.

Is there a blocker?

bvanelli · 2023-02-19T09:31:36Z

> Any update on this? I have some use cases for writing to excel where this would be great to have.
>
> Is there a blocker?

I'd like to know also. I can resolve merge conflicts/adapt changes if someone is willing to review it. 😃

alexander-beedie · 2023-02-28T21:12:16Z

Sorry for the hold-up; I've finished the first iteration of write_excel now, and it's running through CI as we speak: #7251. Still very interested in improvements to reading though! 👍

bvanelli · 2023-02-28T21:32:09Z

I took a look at the MR, and it looks pretty good from the exporting perspective. However, I still feel like there is stuff missing on the importing side, which I think xlsxwriter does not touch. The current importer library is fine for simple scenarios, but it does not handle data types well, which openpyxl can do natively. See for example my datatypes test.

Also, I feel like you should include an engine parameter in your exporter, just like pandas does for its read/write excel methods, since in the future a parser/writer written in Rust could be added to the available drivers for, let's say, very fast exports. One example here.

alexander-beedie · 2023-03-01T08:20:20Z

> I still feel like there is stuff missing on the importing perspective

Definitely; if you can update your PR to focus on just the import, I'll review - improvements in that area are very welcome 😄

> Also, I feel like you should include engine in your exporter just like pandas does

I don't see much purpose in doing so. If we were to move to a Rust-based exporter (for instance), it would most likely be the Rust version of xlsxwriter, in which case we could transition the internals/API pretty transparently.

The main reason pandas offers different engines is that it doesn't actually do anything with them itself; it's more like a trivial bootstrap, eg: "here's the data, now you're on your own". The way I've implemented it here, we actually handle all of the table/sheet relative positioning and column range determination so that you can declare what you want on a per-column/dtype basis, while still allowing direct integration with the underlying xlsxwriter library. It's a different (and, imho, a more productive/intuitive) approach...

bvanelli · 2023-06-27T19:55:52Z

@alexander-beedie Finally had some time to fix the MR. There are two open points where I need some input:

I had to import the DataFrame inside the function to avoid circular imports. I wonder if there is a better way to do it.

polars/py-polars/polars/io/excel/functions.py

Lines 228 to 236 in 83cf521

    
    def _read_excel_sheet_openpyxl(
        parser: Any,
        sheet_id: int | None,
        sheet_name: str | None,
        _: dict[str, Any] | None,
    ) -> DataFrame:
        # import here to avoid circular imports
        from polars import DataFrame
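One commonly used companion to the function-local import is a `TYPE_CHECKING` guard, which makes the name available to type checkers without importing it at module load time. The sketch below uses the standard-library `Fraction` as a stand-in for `DataFrame`; `make_half` is a hypothetical function, and whether this fits the polars module layout is an assumption.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime,
    # so it cannot take part in an import cycle.
    from fractions import Fraction


def make_half() -> Fraction:
    # The runtime import is still deferred to call time,
    # because the function has to construct the object.
    from fractions import Fraction

    return Fraction(1, 2)
```

The guard covers the annotation only. A function that instantiates the class still needs the deferred import, so the approach in the excerpt above remains a reasonable choice.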

The readers are fundamentally different, but I tried to wrap them in the same API. Unfortunately, the lazy loading with read_only=True does not fit this pattern, because wb.close() must be called (reference). If the method raises an exception, close() is never called. This is better suited to a context manager. If I remove read_only, the behavior matches, at the cost of reading speed.

polars/py-polars/polars/io/excel/functions.py

Line 190 in 83cf521

parser = openpyxl.load_workbook(source, read_only=True)
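The concern about `close()` being skipped on an exception can be illustrated with `contextlib.closing` from the standard library. `FakeWorkbook` and `read_all_rows` below are hypothetical stand-ins so the sketch runs without openpyxl; it only assumes that the real workbook object exposes a `close()` method, as described above.

```python
from __future__ import annotations

from contextlib import closing


class FakeWorkbook:
    """Stand-in for a lazily loaded workbook that must be closed."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.closed = False

    def iter_rows(self) -> list[tuple[int, str]]:
        if self.fail:
            raise ValueError("simulated parse failure")
        return [(1, "a"), (2, "b")]

    def close(self) -> None:
        self.closed = True


def read_all_rows(wb: FakeWorkbook) -> list[tuple[int, str]]:
    # closing() calls wb.close() on exit, whether or not an exception was raised.
    with closing(wb):
        return list(wb.iter_rows())
```

The same `with closing(...)` wrapper should work for any object with a `close()` method, which would keep `read_only=True` while still releasing the file when parsing fails.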

In any case, could someone review the changes? The biggest improvement of the MR is having native data types like boolean and datetime without manual conversion, which are currently not inferred:

polars/py-polars/tests/unit/io/test_excel.py

Lines 48 to 68 in 83cf521

    
    def test_basic_datatypes_openpyxl_read_excel() -> None:
        df = pl.DataFrame(
            {
                "A": [1, 2, 3, 4, 5],
                "fruits": ["banana", "banana", "apple", "apple", "banana"],
                "floats": [1.1, 1.2, 1.3, 1.4, 1.5],
                "datetime": [datetime(2023, 1, x) for x in range(1, 6)],
                "nulls": [1, None, None, None, 1],
            }
        )
        xls = BytesIO()
        df.write_excel(xls)
        # check if can be read as it was written
        # we use openpyxl because type inference is better
        df_by_default = pl.read_excel(xls, engine="openpyxl")
        df_by_sheet_id = pl.read_excel(xls, sheet_id=1, engine="openpyxl")
        df_by_sheet_name = pl.read_excel(xls, sheet_name="Sheet1", engine="openpyxl")
        assert_frame_equal(df, df_by_default)
        assert_frame_equal(df, df_by_sheet_id)
        assert_frame_equal(df, df_by_sheet_name)

stinodego · 2023-08-26T16:53:20Z

@bvanelli This is good work and I would like to have openpyxl support - it's a more modern/feature rich option when compared to xlsx2csv, in my opinion.

@alexander-beedie Would you mind taking a look if this PR is ready / can be adopted into our current Excel capabilities?

alexander-beedie · 2023-08-27T14:59:01Z

> @alexander-beedie Would you mind taking a look

Will do - I'm definitely in favour of improving the read functionality, and having an openpyxl option is almost certainly the way to go (apologies for missing the update earlier) 👍

@bvanelli: would you mind updating/rebasing the PR?
Let's see if we can get it in :)

bvanelli · 2023-08-27T15:59:43Z

> @bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)

Hello, thanks for the update. I don't mind updating/making changes, but I'm currently on vacation, away from any computer for a week, so I won't be able to do it until the 4th of September.

bvanelli · 2023-09-03T10:10:11Z

@alexander-beedie I merged current main into my branch and solved the conflicts. I also slightly updated the documentation to reflect the added features. As a sidenote, here is a benchmark test comparing both libraries:

    from __future__ import annotations
    import polars as pl
    from pathlib import Path


    def test_openpyxl() -> None:
        excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
        df = pl.read_excel(excel_file_path, sheet_id=0, engine="openpyxl")


    def test_xlsx2csv() -> None:
        excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
        df = pl.read_excel(excel_file_path, sheet_id=0)

xlsx2csv: 1.168 seconds
openpyxl: 3.303 seconds

I used an excel file with 50k rows and 3 columns.
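The comment does not say how the timings above were collected. As a sketch of one way to time such calls, the hypothetical `best_of` helper below uses `time.perf_counter` and keeps the fastest of several runs; it is an illustration, not the harness used for the numbers quoted here.

```python
from __future__ import annotations

import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)
```

It could be applied to either benchmark function, for example `best_of(test_openpyxl)`, assuming the benchmark file exists next to the script.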

alexander-beedie · 2023-09-06T13:58:39Z

Ok, the integration looks fine; I'll merge as-is and follow-up shortly (probably later today) with some additional enhancements that will be common to both engines 👍

Thanks for this one @bvanelli!

feat: Add first version of openpyxl both import and exporter for excel.

25c5ff6

bvanelli added 5 commits January 12, 2023 21:57

fix: Add openpyxl to dev dependencies and format with black.

f580076

fix: Satisfy mypy and make all excel tests run.

90eec07

fix: Fix incorrectly placed ignore call overload.

5e415f9

fix: Fix and test bytes version and get rid of failing format_path on…

7819b78

… CI.

Merge remote-tracking branch 'origin/master' into 5568-excel-writer

dc568bd

bvanelli added 4 commits January 21, 2023 12:16

fix: Standarize apis.

162218d

Merge remote-tracking branch 'origin/master' into 5568-excel-writer

819c894

fix: Remove extra arg.

f12aff9

refactor: Replace use_openpyxl by engine.

e5a5108

bvanelli mentioned this pull request Jan 21, 2023

Indexing for sheets start at 1 instead of 0 dilshod/xlsx2csv#249

Open

bvanelli marked this pull request as ready for review January 21, 2023 14:16

bvanelli added 5 commits March 13, 2023 19:21

Merge remote-tracking branch 'origin/master' into 5568-excel-writer

c9176e1

fix: Exclude previous write excel and rework file position of io.excel.

f6a5b03

Extra fixes from merge request.

2e4e5f2

Merge remote-tracking branch 'origin/master' into 5568-excel-writer

2f74aac

fix: Fix merge request tests and merge origin (except perhaps mypy).

e62a7be

bvanelli added 5 commits March 19, 2023 18:03

refactor: Reformat with black.

da4d048

refactor: Rerun all formatting

a02a774

Merge remote-tracking branch 'origin/master' into 5568-excel-writer

eea78b2

Merge remote-tracking branch 'origin/main' into 5568-excel-writer

1d9a699

fix: Fix mypy warnings

83cf521

bvanelli requested review from ritchie46, stinodego and alexander-beedie as code owners June 27, 2023 19:19

stinodego changed the title ~~Add write_excel and improve read_excel to also use openpyxl for better type inferring~~ feat(python): Improve read_excel to also use openpyxl for better type inferring Aug 10, 2023

stinodego changed the title ~~feat(python): Improve read_excel to also use openpyxl for better type inferring~~ feat(python): Improve read_excel to also use openpyxl for better type inferring Aug 10, 2023

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Aug 10, 2023

bvanelli added 3 commits September 3, 2023 09:21

fix: Use safe loading of workbook.

e77ffd9

Merge remote-tracking branch 'origin/main' into 5568-excel-writer

bc602ea

docs: Update documentation to reflect changes.

b4bf1ca

alexander-beedie approved these changes Sep 6, 2023

alexander-beedie merged commit d3721f1 into pola-rs:main Sep 6, 2023

alexander-beedie changed the title ~~feat(python): Improve read_excel to also use openpyxl for better type inferring~~ feat(python): add openpyxl as a new/optional engine for read_excel Sep 6, 2023
