Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use an alternative output format when using DuckDB #536

Open
edublancas opened this issue May 29, 2023 · 3 comments
Open

Use an alternative output format when using DuckDB #536

edublancas opened this issue May 29, 2023 · 3 comments

Comments

@edublancas
Copy link

DuckDB provides its own output format that we can leverage instead of converting things into ResultSet so if we're using a DuckDB connection we should use such format instead, see #451 (comment) for details

@ned2
Copy link

ned2 commented Oct 2, 2023

As a DuckDB user, this would make jupysql much more useful. The DuckDB DuckDBPyRelation is a lazy table representation that provides a bunch of useful integrations, including building further lazy queries using the relational (non-SQL) interface and also converting to multiple output formats (beyond just Pandas and Polars dataframes).

Perhaps this could be implemented via keeping the initial default the ResultSet, and then allowing users to opt into producing DuckDB relations by default, similar to how you can set the default output to Pandas or Polars dataframes. eg something like:

%config SqlMagic.autoduckdb = True

This would result in an out of the box experience that's consistent with other back ends, but allow users to opt into the much more useful DuckDBPyRelation as the default output.

@ned2
Copy link

ned2 commented Oct 2, 2023

Hey @edublancas, I've made a proof of concept for my above suggestion here. If you think this proposal has legs, I'd be up for working it into a PR.

I'm not really that familiar with the jupysql codebase though, so I'm sure I'm overlooking some things. Would be keen to hear alternative suggestions!

@edublancas
Copy link
Author

@ned2: thanks for working on this! I ran some quick tests, and here are my thoughts.

You're on the right track; I think the best place to apply this change is in run.py. This is the line where we execute the user's query. So we need to modify both run.py and raw_execute.

The critical consideration here is that there are two ways for users to connect to duckdb: sqlalchemy (via duckdb-engine) or with a native duckdb.connect() connection. So, we need a way for both scenarios to behave consistently. I think let's tackle the case where users have a native connection because it's simpler, then, we can modify the sqlalchemy case.

So to fix this, we need to create a new method similar to DBAPIConnection.raw_execute but that calls sql (or query? are they the same?) , instead of execute. if we do that, we'll get a DuckDBPyRelation

Then, in run.py we can decide which method to run, depending on the configuration (regular raw_execute vs the new one).

I'm unsure if autoduckdb is the best name, but let's get this working first and we can change the name later.

feel free to ask any questions if this explanation isn't clear! happy to review a draft PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants