feat(python): Support DataFrame init from raw SQLAlchemy rows #19820

alexander-beedie · 2024-11-15T19:57:11Z

Closes #19816.

With a trivial extension to the existing Sequence checks we can support direct DataFrame/Series init from a list of SQLAlchemy Row objects, as they intentionally mimic namedtuple in various ways¹.
This allows for a very small (low single-digit percentage) speedup when ingesting SQLAlchemy results via `read_database`, so I've taken advantage of that. Against a 50000x10 test dataset I was seeing ~1-2% gains, but it's not 0%, so why not... 😅

¹ sqlalchemy.engine.Row
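
To illustrate the idea of extending a Sequence check to accept namedtuple-like rows, here is a minimal, stdlib-only sketch. It is not the Polars implementation: `FakeRow`, `is_row_like`, and `rows_to_columns` are hypothetical names, and `FakeRow` only stands in for a SQLAlchemy Row on the assumption that a Row is sequence-like and exposes namedtuple-style `_fields` without necessarily being a `tuple` subclass.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any


class FakeRow(Sequence):
    """Hypothetical stand-in for a SQLAlchemy Row (assumption: a Sequence
    with namedtuple-style `_fields` that is not a tuple subclass)."""

    def __init__(self, fields: tuple[str, ...], values: tuple[Any, ...]) -> None:
        self._fields = fields
        self._values = values

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self) -> int:
        return len(self._values)


def is_row_like(obj: Any) -> bool:
    """Duck-typed check: a non-string Sequence that exposes `_fields`."""
    return (
        isinstance(obj, Sequence)
        and not isinstance(obj, (str, bytes))
        and hasattr(obj, "_fields")
    )


def rows_to_columns(rows: Sequence[Any]) -> dict[str, list[Any]]:
    """Transpose row-like objects into a column-name -> values mapping."""
    if not rows or not is_row_like(rows[0]):
        raise TypeError("expected a non-empty sequence of row-like objects")
    fields = rows[0]._fields
    # zip(*rows) iterates each row directly, so no tuple(row) copy is made first
    return {name: list(col) for name, col in zip(fields, zip(*rows))}


rows = [FakeRow(("id", "name"), (1, "a")), FakeRow(("id", "name"), (2, "b"))]
columns = rows_to_columns(rows)
```

A plain `isinstance(row, tuple)` test would reject these rows, which is why a check based on Sequence plus `_fields` is used in this sketch.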

roganjoshp · 2024-11-15T21:42:32Z

Just to be clear, now that you've measured it as being marginally faster (I never anticipated that), I want to state that my primary goal is to decouple the query from the dataframe construction (be that pandas or polars) - I don't like executing SQL through yet another layer that may change or not be able to take advantage of new developments. I also do other things with query results than just making dataframes.

My motivation here is not just speed gains.

I'm amazed by your fast turn-around on this PR. Thank you.

alexander-beedie · 2024-11-16T05:50:43Z

> I don't like executing SQL through yet another layer that may change or not be able to take advantage of new developments.

On the flip side - you may miss out on optimisations we are able to make; for instance, if you use duckdb_engine for SQLAlchemy-mediated access to DuckDB, we recognise that and fetch Arrow data from the Cursor instead of fetching rows, making for a significant speedup initialising a DataFrame (you could do this yourself of course, but not everyone is aware that it's an option).

Still, allowing you more control over your own init/queries/etc if you want it is a good thing, and it sounds like you have a variety of such use-cases, so happy to enable it.

> My motivation here is not just speed gains.

That's probably a good thing, as the speed gain (inside `read_database`) is only 1-2%, so it's barely above the threshold of noise (we can skip an internal `[tuple(row) for row in result]` comprehension now - that's about it) ;)

> I'm amazed by your fast turn-around on this PR. Thank you.

😎👍

feat(python): Support DataFrame init from raw SQLAlchemy rows

2c57ea7

alexander-beedie requested review from ritchie46, c-peters, MarcoGorelli and reswqa as code owners November 15, 2024 19:57

github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Nov 15, 2024

alexander-beedie added the A-interop (Area: interoperability with other libraries) label Nov 15, 2024

alexander-beedie force-pushed the alchemy-rows-frame-init branch from 74c9c75 to 2c57ea7 Compare November 15, 2024 20:26

update read_database now frame init can take alchemy rows "as-is"

7657872

ritchie46 merged commit 7482315 into pola-rs:main Nov 16, 2024
18 of 19 checks passed

alexander-beedie deleted the alchemy-rows-frame-init branch November 16, 2024 17:54
