Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): add openpyxl as a new/optional engine for read_excel #6183

Merged
merged 23 commits into from
Sep 6, 2023

Conversation

bvanelli
Copy link
Contributor

@bvanelli bvanelli commented Jan 11, 2023

Closes #5568

Improvements on Excel support for polars, including a new exporter and an importer that does better type inferring, i.e. handling null entries, correctly parsing datetimes, etc.

There are some things to still do:

  • Fix pipelines, lint and format it properly, make sure mypy is passing.
  • Improve test coverage
  • Add openpyxl as optional installable
  • Improve the api a little bit
  • (OPTIONAL) Include basic optional styling

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 12, 2023

@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

@ritchie46
Copy link
Member

@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

Well.. I really don't think we should ship an excel reader / writer in our binary. It is a format I really don't want to push either.

Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 12, 2023

Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.

Gotcha; in that case I'll take care of it on the Python side (with xlsxwriter), and if it gets traction or there are further pushes for it on the Rust side later we'll be in a good position to translate (though your point about it ending up in the binary is well-taken; my mental model was defaulting to Python-space there, where it's just a pip install away, and the library doesn't actually live in our code ;)

@bvanelli
Copy link
Contributor Author

Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Ideally, the first sheet id should be 0, but I don't want to make changes that could potentially break people's codes, so should I

  • keep the behavior for the new parser the same, starting at 1 or
  • make the new behavior for only openpyxl driver, keeping the same for xlsx2csv or
  • make both behaviors the same, possibly breaking current implementations

Also thanks @alexander-beedie for fixing the format_path issue, it was driving me insane because it always ran fine in my local machine but not on CI 😄

@ghuls
Copy link
Collaborator

ghuls commented Jan 14, 2023

Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Maybe make an issue there that they should use None for all sheets instead of 0.

@bvanelli bvanelli marked this pull request as ready for review January 21, 2023 14:16
@bvanelli
Copy link
Contributor Author

I marked the PR as ready, as I have wrote some tests and improved test coverage for the old reader also.

If possible please someone take a look at how I used the methods on tests, for example the following:

    df_by_sheet_id = pl.read_excel(  # type: ignore[call-overload]
        example_file, sheet_id=1
    )

I had to introduce the ignore[call-overload] because mypy was complaining. I'm not exactly sure why because I have little experience with mypy. This was also used on the tests from before (see line below, taken from master) so I used it also for my tests, but there must be a reason why it's triggering.

df = pl.read_excel(example_file, sheet_id=None) # type: ignore[call-overload]

@alexander-beedie
Copy link
Collaborator

Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

@bvanelli
Copy link
Contributor Author

Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

Ah, sorry, I misunderstood then. I implemented it through engines just like with pandas, so multiple engines can be used, so you can have multiple libraries as optional dependencies. Maybe it can be adapted together?

@cnpryer
Copy link
Contributor

cnpryer commented Feb 19, 2023

Any update on this? I have some use cases for writing to excel where this would be great to have.

Is there a blocker?

@bvanelli
Copy link
Contributor Author

Any update on this? I have some use cases for writing to excel where this would be great to have.

Is there a blocker?

I'd like to know also. I can resolve merge conflicts/adapt changes if someone is willing to review it. 😃

@alexander-beedie
Copy link
Collaborator

Sorry for the hold-up; I've finished the first iteration of write_excel now, and it's running through CI as we speak: #7251. Still very interested in improvements to reading though! 👍

@bvanelli
Copy link
Contributor Author

I took a look at the MR, and it looks pretty good from the exporting perspective, however, I still feel like there is stuff missing on the importing perspective, which I think xlsxwriter does not touch. The importer library is ok for simple scenarios but it does not handle data types well, that openpyxl can do natively. See for example my datatypes test.

Also, I feel like the you should include engine in your exporter just like pandas does for their read/write excel methods, as in the future one parser/writer written in rust could be added to the available drivers for, let's say, very fast exports. One example here.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Mar 1, 2023

I still feel like there is stuff missing on the importing perspective

Definitely; if you can update your PR to focus on just the import, I'll review - improvements in that area are very welcome 😄

Also, I feel like the you should include engine in your exporter just like pandas does

I don't see much purpose in doing so. If we were to move to a Rust-based exporter (for instance), it would most likely be the Rust version of xlsxwriter, in which case we could transition the internals/API pretty transparently.

The main reason pandas offers different engines is that it doesn't actually do anything with them itself; it's more like a trivial bootstrap, eg: "here's the data, now you're on your own". The way I've implemented it here, we actually handle all of the table/sheet relative positioning and column range determination so that you can declare what you want on a per-column/dtype basis, while still allowing direct integration with the underlying xlsxwriter library. It's a different (and, imho, a more productive/intuitive) approach...

@bvanelli
Copy link
Contributor Author

@alexander-beedie Finally had some time to fix the MR. There are maybe two open points I needed some input:

  1. I had to import the DataFrame inside the function to avoid circular imports. I wonder if there is a better way to do it

def _read_excel_sheet_openpyxl(
parser: Any,
sheet_id: int | None,
sheet_name: str | None,
_: dict[str, Any] | None,
) -> DataFrame:
# import here to avoid circular imports
from polars import DataFrame

  1. The readers are fundamentally different, I tried to wrap them on the same API. Unfortunately, the lazy loading of the read_only=True does not allow this behavior, as the wb.close() must be called (reference). If the method raises an exception, then it will not be called. This is better suited for a context handle. If I remove the read_only, then the behavior will match, at the cost of reading speed.

parser = openpyxl.load_workbook(source, read_only=True)

In any case, could someone review the changes? The biggest improvement of the MR is having native data types like boolean and datetime without manual conversion, which are currently not inferred:

def test_basic_datatypes_openpyxl_read_excel() -> None:
df = pl.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"floats": [1.1, 1.2, 1.3, 1.4, 1.5],
"datetime": [datetime(2023, 1, x) for x in range(1, 6)],
"nulls": [1, None, None, None, 1],
}
)
xls = BytesIO()
df.write_excel(xls)
# check if can be read as it was written
# we use openpyxl because type inference is better
df_by_default = pl.read_excel(xls, engine="openpyxl")
df_by_sheet_id = pl.read_excel(xls, sheet_id=1, engine="openpyxl")
df_by_sheet_name = pl.read_excel(xls, sheet_name="Sheet1", engine="openpyxl")
assert_frame_equal(df, df_by_default)
assert_frame_equal(df, df_by_sheet_id)
assert_frame_equal(df, df_by_sheet_name)

@stinodego stinodego changed the title Add write_excel and improve read_excel to also use openpyxl for better type inferring feat(python): Improve read_excel to also use openpyxl for better type inferring Aug 10, 2023
@stinodego stinodego changed the title feat(python): Improve read_excel to also use openpyxl for better type inferring feat(python): Improve read_excel to also use openpyxl for better type inferring Aug 10, 2023
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Aug 10, 2023
@stinodego
Copy link
Contributor

@bvanelli This is good work and I would like to have openpyxl support - it's a more modern/feature rich option when compared to xlsx2csv, in my opinion.

@alexander-beedie Would you mind taking a look if this PR is ready / can be adopted into our current Excel capabilities?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Aug 27, 2023

@alexander-beedie Would you mind taking a look

Will do - I'm definitely in favour of improving the read functionality, and having an openpyxl option is almost certainly the way to go (apologies for missing the update earlier) 👍

@bvanelli: would you mind updating/rebasing the PR?
Let's see if we can get it in :)

@bvanelli
Copy link
Contributor Author

@bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)

Hello, thanks for the update. I do not mind mind updating/doing changes but I'm currently in vacation away from any computer for a week, so I won't be able able to do it until the 4th of September.

@bvanelli
Copy link
Contributor Author

bvanelli commented Sep 3, 2023

@alexander-beedie I merged current main into my branch and solved the conflicts. I also slightly updated the documentation to reflect the added features. As a sidenote, here is a benchmark test comparing both libraries:

from __future__ import annotations
import polars as pl
from pathlib import Path


def test_openpyxl() -> None:
    excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
    df = pl.read_excel(excel_file_path, sheet_id=0, engine="openpyxl")


def test_xlsx2csv() -> None:
    excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
    df = pl.read_excel(excel_file_path, sheet_id=0)
  • xlsx2csv: 1.168 seconds
  • openpyxl: 3.303 seconds

I used an excel file with 50k rows and 3 columns.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 6, 2023

Ok, the integration looks fine; I'll merge as-is and follow-up shortly (probably later today) with some additional enhancements that will be common to both engines 👍

Thanks for this one @bvanelli!

@alexander-beedie alexander-beedie merged commit d3721f1 into pola-rs:main Sep 6, 2023
@alexander-beedie alexander-beedie changed the title feat(python): Improve read_excel to also use openpyxl for better type inferring feat(python): add openpyxl as a new/optional engine for read_excel Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add DataFrame.write_excel
6 participants