feat(python): add "calamine" support to read_excel, using fastexcel (~8-10x speedup) #14000

Merged on Jan 26, 2024 (2 commits)

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Jan 25, 2024

Closes #13874.

A long-standing request has been calamine bindings for read_excel. The last time I looked at this, python_calamine [1] was suggested, but there were some issues, so I didn't go ahead. I took another look at the Python/calamine landscape this week and found that python_calamine had improved significantly and would now be suitable for integration, but I also discovered a better alternative: fastexcel [2].

Aside from the clean code, this library can skip materialisation to Python objects and hand us Arrow data directly, leading to further large speedups over python_calamine for our use-case. Originally fastexcel supported Python >= 3.10, but I made a PR to extend support to earlier versions so that it matches our own version matrix (>= 3.8).

My thanks to @lukapeschke for reviewing, accepting, and turning out a new fastexcel release that contains the extended compatibility within ~24 hours! 🎉

Note

Currently the fastexcel module is available for Linux and macOS; no Windows build yet.
(Update: Windows support is coming soon, in 0.8.0 - see ToucanToco/fastexcel#157).

Example

from datetime import date

from codetiming import Timer
import pandas as pd
import polars as pl

# generate five 1,000,000 row columns of various types
df = pl.DataFrame({
    "v": range(1_000_000), 
    "w": 999.999, 
    "x": -42.0, 
    "y": date.today(), 
    "z": "acbdefghijlmnop",
})

# create an excel file to read from
wb = df.write_excel(xl := "test.xlsx")

# benchmark the different engines
with Timer():
    pf0 = pd.read_excel(xl)
with Timer():
    df0 = pl.read_excel(xl, engine="xlsx2csv")
with Timer():
    df1 = pl.read_excel(xl, engine="openpyxl")
with Timer():
    df2 = pl.read_excel(xl, engine="calamine")  # << newly integrated
[Figure: read_excel_bench (benchmark chart for the engines timed above)]

(Hardware: Apple Silicon M2 Max)

Result: Roughly an order of magnitude speedup over our current options, with better than average type inference.

Also

  • Started making the read_excel parameters more generic, deprecating "xlsx2csv_options" in favour of "engine_options".
  • Reduced the amount of copy/paste code we have when importing optional modules by introducing a new import_optional utility function that replaces the current boilerplate with a simple one-liner.
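
As a rough illustration of the second point, a helper of this kind might look like the sketch below. This is not the code from the PR: the signature, the `install_hint` parameter, and the error message are assumptions made for the example; only the name `import_optional` comes from the description above.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(module_name: str, install_hint: Optional[str] = None) -> ModuleType:
    """Import an optional dependency, or raise a helpful error if it is missing.

    Hypothetical sketch: the real utility in Polars may have a different
    signature and message.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Suggest an install command instead of surfacing the bare import failure.
        hint = install_hint or f"pip install {module_name}"
        raise ModuleNotFoundError(
            f"required package {module_name!r} not found; install it with `{hint}`"
        ) from None
```

A call site then shrinks to a single line, for example `fastexcel = import_optional("fastexcel")`, in place of a repeated try/except block.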

Short-term follow-ups

  • The fastexcel/calamine engine can replace our existing ods reader (ezodf + lxml). I intend to make fastexcel the new default engine for read_ods as ezodf is unmaintained, old, and much slower.
  • It looks like fastexcel/calamine can also support the .xls format (the older flavour of Excel file, as compared to .xlsx). This should work out of the box, but I can add some magic-bytes detection to automatically pass such files to the newer engine (if none was specified).
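
The magic-bytes detection mentioned in the last bullet could be sketched roughly as follows. The function name and return values are made up for this example; the two signatures are the standard OLE2 compound-file header (used by legacy .xls) and the ZIP local-file header (used by .xlsx). An OLE2 header alone does not prove a file is a .xls workbook, so treat this as a first-pass hint only.

```python
from io import BytesIO
from typing import BinaryIO

# File signatures: OLE2 compound file (legacy .xls) and ZIP container (.xlsx).
XLS_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_spreadsheet_container(source: BinaryIO) -> str:
    """Return "ole2", "zip", or "unknown" based on the first bytes of the stream.

    Hypothetical helper: reads the header and restores the stream position.
    """
    pos = source.tell()
    header = source.read(8)
    source.seek(pos)  # leave the stream where we found it
    if header.startswith(XLS_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# Usage: decide on an engine only when the caller did not specify one.
buf = BytesIO(XLS_MAGIC + b"\x00" * 8)
print(sniff_spreadsheet_container(buf))
```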

Longer-term

  • We may be able to push the speed even further by integrating directly with Calamine in the lower levels of the Rust engine (which would also simplify loading Excel data into Polars for our Rust users).
  • Until/unless we decide to do that, we can look to replace the current default engine with fastexcel/calamine once it becomes more featureful (at the moment there are limited features, but hopefully this will change!)

Footnotes

  1. python_calamine: https://github.com/dimastbk/python-calamine

  2. fastexcel: https://github.com/ToucanToco/fastexcel

@github-actions (bot) added the enhancement and python labels on Jan 25, 2024
@avimallu
Contributor

Holy smokes. This is amazing. I don't use Excel much anymore, but if I had this "back in the day" 😍.

Great work!

@alexander-beedie added the performance label on Jan 26, 2024
@ritchie46
Member

Great stuff!

> We may be able to push the speed even further by integrating directly with Calamine in the lower levels of the Rust engine (which would also simplify loading Excel data into Polars for our Rust users).

I still think we must do this, as I'd like to have a good architecture that lets us support many (more) exotic readers without bloating our binary. We can create polars-arrow chunks and accept predicate and projection pushdown.

@alexander-beedie changed the title on Jan 26, 2024:
  from: feat(python): add "calamine" support to read_excel, using fastexcel
  to: feat(python): add "calamine" support to read_excel, using fastexcel (~8-10x speedup)
@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 26, 2024

> I still think we must do this, as I'd like to have a good architecture that lets us support many (more) exotic readers without bloating our binary. We can create polars-arrow chunks and accept predicate and projection pushdown.

Yup; and now we have a really strong benchmark to aim for! 🤣 Lower-level integration could lead to some really special performance, and we can look to be fast by default once we have a sufficient set of features.

Contributor

@lukapeschke lukapeschke left a comment

Looks great 🚀 I'll try to work on Windows support ASAP

> I intend to make fastexcel the new default engine for read_ods

That would be really cool, but our tests on .ods files are really minimal for now, so increasing test coverage for that on our side would be needed

Comment on lines +527 to +529
# note: can't read directly from bytes (yet) so
if read_bytesio := isinstance(source, BytesIO):
    temp_data = NamedTemporaryFile(delete=True)
Contributor

This should be pretty easy to add with calamine's open_workbook_auto_from_rs function. If this is something that is used a lot, I can try to add it soon
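
For context, the temporary-file workaround under review can be sketched in isolation as below. This is only an illustration and not the PR's code: the helper name `bytes_as_temp_path` is made up, and it uses a temporary directory rather than `NamedTemporaryFile` so that the path can be reopened by name on any platform.

```python
import os
import tempfile
from contextlib import contextmanager
from io import BytesIO
from typing import Iterator


@contextmanager
def bytes_as_temp_path(source: BytesIO, suffix: str = ".xlsx") -> Iterator[str]:
    """Spill an in-memory workbook to disk and yield a path a file-only reader can open.

    Hypothetical helper: the directory and file are removed when the block exits.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, f"workbook{suffix}")
        with open(path, "wb") as fh:
            fh.write(source.getvalue())
        yield path


# Usage: hand `path` to whichever reader only accepts file paths.
with bytes_as_temp_path(BytesIO(b"not a real workbook")) as path:
    print(os.path.getsize(path))
```

Reading directly from bytes, as suggested above, would make the extra copy to disk unnecessary.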

Collaborator Author

That would be great :)

    *,
    raise_if_empty: bool,
) -> pl.DataFrame:
    ws = parser.load_sheet_by_name(sheet_name)
Contributor

NOTE: Once calamine 0.24.0 is out and ToucanToco/fastexcel#147 is polished and merged, you could use one of the new eager loading methods to lower the memory footprint here

@ritchie46 merged commit 4e19d62 into pola-rs:main on Jan 26, 2024
17 checks passed
@alexander-beedie deleted the fastexcel-calamine branch on January 26, 2024 14:21
@alexander-beedie
Collaborator Author

alexander-beedie commented Jan 27, 2024

> That would be really cool, but our tests on .ods files are really minimal for now, so increasing test coverage for that on our side would be needed

@lukapeschke: Yup, I didn't pull the trigger on that with this PR yet - however, the competition is a module that hasn't been updated in 8 years and only had 7 releases. It's a low bar to beat 😅

@julio-34727

Hi @alexander-beedie,
the engine_options parameters (header_row, skip_rows, and so on) must be passed to load_sheet_by_name/load_sheet_by_idx and not to read_excel (see the fastexcel docs: https://fastexcel.toucantoco.dev/fastexcel.html)

@alexander-beedie
Collaborator Author

> Hi @alexander-beedie, the engine_options parameters (header_row, skip_rows, and so on) must be passed to load_sheet_by_name/load_sheet_by_idx and not to read_excel (see the fastexcel docs: https://fastexcel.toucantoco.dev/fastexcel.html)

Thanks, will adjust accordingly; hadn't really gotten around to exposing those yet, as I'm planning to unify common options in the top-level function signature 👌
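
One way the routing discussed here might be handled is to split a single `engine_options` dict into reader-level and sheet-level parts before calling into the engine. The sketch below is an assumption for illustration, not the eventual Polars implementation: the helper name is made up, and the sheet-level set contains only the two option names mentioned in the comment above.

```python
from typing import Any, Dict, Mapping, Optional, Tuple

# Only the two sheet-level names cited in the comment above; a real list would be longer.
SHEET_OPTION_NAMES = frozenset({"header_row", "skip_rows"})


def split_engine_options(
    engine_options: Optional[Mapping[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split options into (reader_level, sheet_level) dicts (hypothetical helper)."""
    reader_opts = dict(engine_options or {})
    sheet_opts = {
        name: reader_opts.pop(name)
        for name in list(reader_opts)
        if name in SHEET_OPTION_NAMES
    }
    return reader_opts, sheet_opts


reader_opts, sheet_opts = split_engine_options({"header_row": 2, "skip_rows": 1})
print(reader_opts, sheet_opts)
```

The sheet-level dict would then be forwarded to load_sheet_by_name/load_sheet_by_idx, and anything left over to the reader itself.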

@adrivn

adrivn commented Jan 27, 2024

@alexander-beedie Outstanding work. Those of us that must deal with Excel data cannot thank you enough for this addition.

@alexander-beedie
Collaborator Author

> @alexander-beedie Outstanding work. Those of us that must deal with Excel data cannot thank you enough for this addition.

All the thanks here go to the fastexcel folks. I am merely the plumber on this one ;)

@SaelKimberly

This is great news for me, thanks! Excel files really are my headache.

Still, I have three main problems that I think are important and have simple solutions at the reader level. In my experience these problems come up frequently, and I suspect not only for me.

#14036

Thank you very much, polars is getting better and better, day by day 😌

@durgeksh

durgeksh commented Feb 9, 2024

I cannot thank you guys enough for polars. At my place almost everyone has started using it, as we saw an incredible performance improvement over pandas.
I tried the new calamine engine, but found one issue with it. Although it is a new feature, I thought I should report it:
#14388

@alexander-beedie
Collaborator Author

alexander-beedie commented Feb 9, 2024

> I cannot thank you guys enough for polars. At my place almost everyone has started using it, as we saw an incredible performance improvement over pandas. I tried the new calamine engine, but found one issue with it. Although it is a new feature, I thought I should report it: #14388

@durgeksh: Thanks! Looking at your error report you should try and get a copy of your workbook to the fastexcel folks, so they can take a look and come up with a fix. As it looks like the error is raised inside fastexcel itself, we won't be able to fix it on our end (and it will be tricky without a sample workbook to test with - doesn't have to be your original data, just as long as it can reproduce the error) 👌

@durgeksh

Thank you for this wonderful library. Sure, I will report to fastexcel.

@alexander-beedie added the A-io-excel label on Apr 22, 2024
Successfully merging this pull request may close these issues.

Add "calamine" and (maybe) "rxls" engines to read_excel