improving performance when converting DuckDB's results to pandas #451

Closed
edublancas opened this issue Apr 25, 2023 · 3 comments · Fixed by #469

@edublancas

we got some competition 😁: https://github.com/iqmo-org/magic_duckdb/blob/main/notebooks/benchmarking.ipynb

SQLAlchemy is adding a lot of overhead when converting DuckDB results to pandas. The fix is simple: we should use DuckDB's native .df() method and bypass SQLAlchemy.

@ned2

ned2 commented Apr 29, 2023

I was recently comparing timings between using DuckDB directly via the Python API and using it via a JupySQL %%sql cell, and noticed a considerable performance drop with JupySQL. In this case I wasn't using pandas conversion, so it seems there are performance issues with bare ResultSets as well.

The idea in #470 to make ResultSets lazy is good. But I also wonder if it could be good to provide more options to users to avoid the need for using ResultSets as much as possible. I like how magic_duckdb has done this, by allowing users to specify the result type as any format that DuckDB can export to (Pandas, Arrow, Polars), or ask for a DuckDB relation back. The option to return a DuckDB relation is nice because then it enables workflows that make use of the DuckDB Python relational API. Personally, when using DuckDB in a notebook, of the two, I'm always going to want a DuckDB relation over a ResultSet.

@edublancas
Author

hi @ned2, thanks a lot for your feedback!

We inherited ResultSet from ipython-sql, so my default thought was to make it better. But I see your point that in many cases users will convert it to another format anyway. I'll ensure we take your suggestions into account. I'm still unsure what the best API is, so I'll keep you in the loop for feedback!

@ned2

ned2 commented Apr 29, 2023

no worries, glad it's helpful!

Oh, and another point I just thought of in favour of making the DuckDB relation available as a result type: it's already lazy.
