feat(python): add openpyxl as a new/optional engine for read_excel #6183
@bvanelli: Thanks for taking a look! However, I suspect that an |
Well.. I really don't think we should ship an Excel reader / writer in our binary. It is a format I really don't want to push either. Having this as an optional Python dependency, added as a utility function, would be fine if it doesn't add much maintenance and complexity. |
Gotcha; in that case I'll take care of it on the Python side (with |
Currently, the read_excel API uses non-pythonic indexing, as the sheet index starts at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438 Ideally, the first sheet id should be 0, but I don't want to make changes that could potentially break people's code, so should I
Also thanks @alexander-beedie for fixing the |
Maybe make an issue there that they should use |
I marked the PR as ready, as I have written some tests and also improved test coverage for the old reader. If possible, could someone please take a look at how I used the methods in the tests, for example the following:

    df_by_sheet_id = pl.read_excel(  # type: ignore[call-overload]
        example_file, sheet_id=1
    )

I had to introduce the
|
Ahh, I feel a little bad as I may not have been clear enough that I was planning to add |
Ah, sorry, I misunderstood then. I implemented it through engines just like with pandas, so multiple engines can be used, so you can have multiple libraries as optional dependencies. Maybe it can be adapted together? |
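For illustration, the engine approach described above could be sketched roughly as follows. This is only a minimal sketch, not the code in this PR: `ENGINE_MODULES`, `DEFAULT_ENGINE`, `resolve_engine`, and the `is_installed` hook are hypothetical names. The only thing taken from the thread is that each engine is backed by an optional library that may or may not be installed.

```python
from __future__ import annotations

import importlib.util
from typing import Callable

# Hypothetical registry: engine name -> the optional module that backs it.
ENGINE_MODULES: dict[str, str] = {
    "xlsx2csv": "xlsx2csv",
    "openpyxl": "openpyxl",
}

# Assumption for this sketch: the existing reader stays the default.
DEFAULT_ENGINE = "xlsx2csv"


def _module_available(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def resolve_engine(
    engine: str | None = None,
    is_installed: Callable[[str], bool] = _module_available,
) -> str:
    """Validate the requested engine and check that its dependency is present."""
    name = engine or DEFAULT_ENGINE
    if name not in ENGINE_MODULES:
        raise ValueError(
            f"unknown engine {name!r}; expected one of {sorted(ENGINE_MODULES)}"
        )
    module = ENGINE_MODULES[name]
    if not is_installed(module):
        raise ImportError(
            f"engine {name!r} requires the optional dependency {module!r}"
        )
    return name
```

With a layout like this, neither library has to be a hard dependency: the import check only runs for the engine the caller actually asks for.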
Any update on this? I have some use cases for writing to excel where this would be great to have. Is there a blocker? |
I'd like to know also. I can resolve merge conflicts/adapt changes if someone is willing to review it. 😃 |
Sorry for the hold-up; I've finished the first iteration of |
I took a look at the MR, and it looks pretty good from the exporting perspective. However, I still feel like there is stuff missing on the importing side, which I think xlsxwriter does not touch. The importer library is OK for simple scenarios, but it does not handle data types well, which openpyxl can do natively. See for example my datatypes test. Also, I feel like you should include |
Definitely; if you can update your PR to focus on just the import, I'll review - improvements in that area are very welcome 😄
I don't see much purpose in doing so. If we were to move to a Rust-based exporter (for instance), it would most likely be the Rust version of The main reason pandas offers different engines is that it doesn't actually do anything with them itself; it's more like a trivial bootstrap, eg: "here's the data, now you're on your own". The way I've implemented it here, we actually handle all of the table/sheet relative positioning and column range determination so that you can declare what you want on a per-column/dtype basis, while still allowing direct integration with the underlying |
@alexander-beedie Finally had some time to fix the MR. There are maybe two open points where I need some input:
polars/py-polars/polars/io/excel/functions.py Lines 228 to 236 in 83cf521
polars/py-polars/polars/io/excel/functions.py Line 190 in 83cf521
In any case, could someone review the changes? The biggest improvement of the MR is having native data types like boolean and datetime without manual conversion, which are currently not inferred: polars/py-polars/tests/unit/io/test_excel.py Lines 48 to 68 in 83cf521
|
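As a rough illustration of why going through CSV text loses native types: the snippet below uses only the standard library `csv` module, not xlsx2csv or polars, so it is an analogy for the problem rather than the actual code path. Once values are written out as text, booleans and datetimes come back as plain strings and have to be converted manually.

```python
import csv
import io
from datetime import datetime

# A row with native Python types, similar to what a typed reader could return.
row = [True, datetime(2023, 1, 2, 3, 4, 5), 1.5]

# Round-trip the row through CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
buffer.seek(0)
round_tripped = next(csv.reader(buffer))

# Every value is now a string; the type information is gone.
all_strings = all(isinstance(value, str) for value in round_tripped)

# Manual conversion is needed to get the original types back.
restored = [
    round_tripped[0] == "True",
    datetime.fromisoformat(round_tripped[1]),
    float(round_tripped[2]),
]
```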
@bvanelli This is good work and I would like to have @alexander-beedie Would you mind taking a look if this PR is ready / can be adopted into our current Excel capabilities? |
Will do - I'm definitely in favour of improving the read functionality, and having an @bvanelli: would you mind updating/rebasing the PR? |
Hello, thanks for the update. I do not mind updating/making changes, but I'm currently on vacation, away from any computer for a week, so I won't be able to do it until the 4th of September. |
@alexander-beedie I merged current main into my branch and solved the conflicts. I also slightly updated the documentation to reflect the added features. As a side note, here is a benchmark test comparing both libraries:

    from __future__ import annotations

    from pathlib import Path

    import polars as pl


    def test_openpyxl() -> None:
        # Read the benchmark file with the new openpyxl engine.
        excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
        df = pl.read_excel(excel_file_path, sheet_id=0, engine="openpyxl")


    def test_xlsx2csv() -> None:
        # Read the same file with the default (xlsx2csv) engine.
        excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
        df = pl.read_excel(excel_file_path, sheet_id=0)

I used an excel file with 50k rows and 3 columns. |
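The thread does not say which harness produced the timings. As a minimal sketch, a comparison like the one above could be timed with the standard library alone; `best_of` is a hypothetical helper, not part of polars or this PR.

```python
import time
from typing import Callable


def best_of(func: Callable[[], object], repeats: int = 5) -> float:
    """Call func `repeats` times and return the fastest wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical usage, assuming polars with the engine parameter from this PR:
# best_of(lambda: pl.read_excel(path, sheet_id=0, engine="openpyxl"))
# best_of(lambda: pl.read_excel(path, sheet_id=0))
```

Taking the minimum over several runs reduces the effect of noise from other processes; it says nothing about memory use.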
Ok, the integration looks fine; I'll merge as-is and follow-up shortly (probably later today) with some additional enhancements that will be common to both engines 👍 Thanks for this one @bvanelli! |
Closes #5568
Improvements to Excel support in polars, including a new exporter and an importer with better type inference, e.g. handling null entries and correctly parsing datetimes.
There are still some things to do:
(OPTIONAL) Include basic optional styling