feat(python): add "calamine" support to read_excel, using fastexcel (~8-10x speedup) #14000
Holy smokes. This is amazing. I don't use Excel much anymore, but if I had this "back in the day" 😍. Great work!
Great stuff!
I still think we must do this, as I'd like us to have a good architecture for supporting many (more) exotic readers without bloating our binary. We can create
Yup; and now we have a really strong benchmark to aim for! 🤣 Lower-level integration could lead to some really special performance, and we can look to be fast by default once we have a sufficient set of features.
Looks great 🚀 I'll try to work on Windows support ASAP
I intend to make fastexcel the new default engine for read_ods
That would be really cool, but our tests on .ods files are really minimal for now, so increasing test coverage for that on our side would be needed
# note: can't read directly from bytes (yet) so
if read_bytesio := isinstance(source, BytesIO):
    temp_data = NamedTemporaryFile(delete=True)
This should be pretty easy to add with calamine's open_workbook_auto_from_rs function. If this is something that is used a lot, I can try to add it soon
That would be great :)
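The temporary-file workaround discussed in this thread can be sketched roughly as below. This is a minimal illustration, not the code from the PR: `read_via_temp_file` and `fake_reader` are hypothetical names, and the sketch uses `delete=False` with explicit cleanup (the diff above uses `delete=True`) so that the file can be reopened by name on platforms that refuse to open an already-open temporary file.

```python
from io import BytesIO
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Callable, TypeVar

T = TypeVar("T")


def read_via_temp_file(
    source: BytesIO, reader: Callable[[str], T], suffix: str = ".xlsx"
) -> T:
    """Spill in-memory bytes to a temp file and pass its path to a path-only reader.

    Hypothetical helper for illustration; `reader` stands in for any loader
    that accepts a filesystem path but not raw bytes.
    """
    # delete=False: we close the handle first, then remove the file ourselves.
    tmp = NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(source.getvalue())
        tmp.close()  # flush and release the handle before the reader opens the path
        return reader(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        Path(tmp.name).unlink(missing_ok=True)


# Usage with a stand-in reader that only reports the file size.
payload = BytesIO(b"placeholder bytes, not a real workbook")
seen_paths = []


def fake_reader(path: str) -> int:
    seen_paths.append(path)
    return Path(path).stat().st_size


size = read_via_temp_file(payload, fake_reader)
```

Once the reader can consume bytes directly, as suggested above, a helper like this becomes unnecessary.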
    *,
    raise_if_empty: bool,
) -> pl.DataFrame:
    ws = parser.load_sheet_by_name(sheet_name)
NOTE: Once calamine 0.24.0 is out and ToucanToco/fastexcel#147 is polished and merged, you could use one of the new eager loading methods to lower the memory footprint here
@lukapeschke: Yup, I didn't pull the trigger on that with this PR yet - however, the competition is a module that hasn't been updated in 8 years and only had 7 releases. It's a low bar to beat 😅
Hi @alexander-beedie,
Thanks, will adjust accordingly; hadn't really gotten around to exposing those yet, as I'm planning to unify common options in the top-level function signature 👌
@alexander-beedie Outstanding work. Those of us that must deal with Excel data cannot thank you enough for this addition.
All the thanks here go to the
This is great news for me, thanks! Excel files are my headache, really. But I still have three main problems, and I think they are important and have a simple solution at the reader level. In my practice these problems have appeared frequently, and I think not only for me. Thank you very much, polars is getting better and better, day by day 😌
I cannot thank you guys enough for polars. At my place almost everyone started using it once we saw an incredible performance gain over Pandas.
@durgeksh: Thanks! Looking at your error report you should try and get a copy of your workbook to the
Thank you for this wonderful library. Sure, I will report to fastexcel.
Closes #13874.
A long-standing request has been calamine bindings for read_excel; the last time I looked at this, python_calamine [1] was suggested, but there were some issues, so I didn't go ahead. I took another look at the python/calamine landscape this week and found that python_calamine had significantly improved and would have been suitable for integration, but also discovered that there is a better alternative: fastexcel [2].

Aside from the clean code, this library can skip materialisation to Python and provide us Arrow data, leading to further large speedups over python_calamine for our use-case. Originally fastexcel supported Python >= 3.10, but I made a PR to extend support to earlier versions so that it matches our own version matrix (>= 3.8).

My thanks to @lukapeschke for reviewing, accepting, and turning out a new fastexcel release that contains the extended compatibility within ~24 hours! 🎉

Note
Currently the fastexcel module is available for Linux and macOS; no Windows build yet. (Update: Windows support is coming soon in 0.8.0; see ToucanToco/fastexcel#157.)

Example
(Hardware: Apple Silicon M2 Max)
Result: Roughly an order of magnitude speedup over our current options, with better than average type inference.
Also
- read_excel parameters more generic, deprecating "xlsx2csv_options" in favour of "engine_options".
- import_optional utility function that replaces the current boilerplate with a simple one-liner.

Short-term follow-ups
- ods reader (ezodf + lxml). I intend to make fastexcel the new default engine for read_ods, as ezodf is unmaintained, old, and much slower.
- .xls format (the older flavour of Excel file, as compared to .xlsx). This should work out of the box, but I can add some magic-bytes detection to automatically pass such files to the newer engine (if none was specified).

Longer-term
Footnotes
[1] python_calamine: https://github.com/dimastbk/python-calamine
[2] fastexcel: https://github.com/ToucanToco/fastexcel
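The magic-bytes detection mentioned under the short-term follow-ups could look something like the sketch below. It is only an illustration of the idea, not the PR's implementation: the function names are hypothetical, and the only facts relied on are that legacy .xls workbooks are OLE2 compound files and that .xlsx and .ods files are ZIP archives, each with a well-known leading signature. The "calamine" engine name comes from the PR title.

```python
from typing import Optional

# Leading file signatures.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file, e.g. legacy .xls
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header, e.g. .xlsx or .ods


def sniff_container(header: bytes) -> str:
    """Classify a file by its first bytes (hypothetical helper)."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def pick_engine(header: bytes, engine: Optional[str] = None) -> Optional[str]:
    """Route legacy workbooks to "calamine" only when no engine was requested.

    Returning None means "leave the caller's default untouched" (an assumption
    made for this sketch).
    """
    if engine is not None:
        return engine  # an explicit choice always wins
    if sniff_container(header) == "ole2":
        return "calamine"
    return None


xls_header = OLE2_MAGIC + b"\x00" * 8
xlsx_header = ZIP_MAGIC + b"\x14\x00\x06\x00"
```

In practice the header would be the first eight bytes read from the source file or buffer.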
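For the import_optional utility mentioned under "Also", the general pattern of replacing try/except import boilerplate with a one-liner might be sketched as follows. The name `import_optional_sketch`, its signature, and its error message are assumptions for illustration and are not taken from the Polars codebase.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional_sketch(
    module_name: str, *, install_hint: Optional[str] = None
) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    Hypothetical helper; `install_hint` is free-form text shown to the user.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"required package {module_name!r} not found"
        if install_hint:
            message += f"; {install_hint}"
        raise ModuleNotFoundError(message) from exc


# One-liner at the call site (stdlib module used so the example always runs).
json_module = import_optional_sketch("json")

try:
    import_optional_sketch(
        "surely_not_an_installed_module_xyz", install_hint="please install it first"
    )
except ModuleNotFoundError as err:
    failure_message = str(err)
```

Centralising the check this way keeps the error wording consistent across every reader that depends on an optional package.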