Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Support DataFrame init from raw SQLAlchemy rows #19820

Merged
merged 2 commits into from
Nov 16, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Nov 15, 2024

Closes #19816.

  • With a trivial extension to the existing Sequence checks we can support direct DataFrame/Series init from a list of SQLAlchemy Row objects, as they intentionally mimic namedtuple in various ways1.

  • This allows for a very small (low single-digit percentage) speedup ingesting SQLAlchemy results via read_database, so I've taken advantage of that (against a 50000x10 test dataset I was seeing ~1-2% gains - but it's not 0%, so why not...😅)

Footnotes

  1. sqlalchemy.engine.Row

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Nov 15, 2024
@alexander-beedie alexander-beedie added the A-interop Area: interoperability with other libraries label Nov 15, 2024
@roganjoshp
Copy link

Just to be clear, now that you've measured it as being marginally faster (I never anticipated that), I want to state that my primary goal is to decouple the query from the dataframe construction (be that pandas or polars) - I don't like executing SQL through yet another layer that may change or not be able to take advantage of new developments. I also do other things with query results than just making dataframes.

My motivation here is not just speed gains.

I'm amazed by your fast turn-around on this PR. Thank you.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Nov 16, 2024

I don't like executing SQL through yet another layer that may change or not be able to take advantage of new developments.

On the flip side - you may miss out on optimisations we are able to make; for instance, if you use duckdb_engine for SQLAlchemy-mediated access to DuckDB, we recognise that and fetch Arrow data from the Cursor instead of fetching rows, making for a significant speedup initialising a DataFrame (you could do this yourself of course, but not everyone is aware that it's an option).

Still, allowing you more control over your own init/queries/etc if you want it is a good thing, and it sounds like you have a variety of such use-cases, so happy to enable it.

My motivation here is not just speed gains.

That's probably a good thing as the speed gain (inside read_database) is only 1-2%, so it's barely above the threshold of noise (we can skip an internal [tuple(row) for row in result] comprehension now - that's about it) ;)

I'm amazed by your fast turn-around on this PR. Thank you.

😎👍

@ritchie46 ritchie46 merged commit 7482315 into pola-rs:main Nov 16, 2024
18 of 19 checks passed
@alexander-beedie alexander-beedie deleted the alchemy-rows-frame-init branch November 16, 2024 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-interop Area: interoperability with other libraries enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Erroneous ShapeError when unpacking SQLAlchemy raw query results into a dataframe
3 participants