Use an alternative output format when using DuckDB #536

edublancas · 2023-05-29T16:14:40Z

DuckDB provides its own output format that we can leverage instead of converting things into ResultSet so if we're using a DuckDB connection we should use such format instead, see #451 (comment) for details

ned2 · 2023-10-02T04:25:39Z

As a DuckDB user, this would make jupysql much more useful. The DuckDB DuckDBPyRelation is a lazy table representation that provides a bunch of useful integrations, including building further lazy queries using the relational (non-SQL) interface and also converting to multiple output formats (beyond just Pandas and Polars dataframes).

Perhaps this could be implemented via keeping the initial default the ResultSet, and then allowing users to opt into producing DuckDB relations by default, similar to how you can set the default output to Pandas or Polars dataframes. eg something like:

%config SqlMagic.autoduckdb = True

This would result in an out of the box experience that's consistent with other back ends, but allow users to opt into the much more useful DuckDBPyRelation as the default output.

ned2 · 2023-10-02T05:22:38Z

Hey @edublancas, I've made a proof of concept for my above suggestion here. If you think this proposal has legs, I'd be up for working it into a PR.

I'm not really that familiar with the jupysql codebase though, so I'm sure I'm overlooking some things. Would be keen to hear alternative suggestions!

edublancas · 2023-10-03T00:18:04Z

@ned2: thanks for working on this! I ran some quick tests, and here are my thoughts.

You're on the right track; I think the best place to apply this change is in run.py. This is the line where we execute the user's query. So we need to modify both run.py and raw_execute.

The critical consideration here is that there are two ways for users to connect to duckdb: sqlalchemy (via duckdb-engine) or with a native duckdb.connect() connection. So, we need a way for both scenarios to behave consistently. I think let's tackle the case where users have a native connection because it's simpler, then, we can modify the sqlalchemy case.

So to fix this, we need to create a new method similar to DBAPIConnection.raw_execute but that calls sql (or query? are they the same?) , instead of execute. if we do that, we'll get a DuckDBPyRelation

Then, in run.py we can decide which method to run, depending on the configuration (regular raw_execute vs the new one).

I'm unsure if autoduckdb is the best name, but let's get this working first and we can change the name later.

feel free to ask any questions if this explanation isn't clear! happy to review a draft PR.

edublancas mentioned this issue May 29, 2023

refactoring ResultSet #470

Closed

gilandose mentioned this issue Dec 20, 2023

Allow Spark SQL as a dialect #968

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use an alternative output format when using DuckDB #536

Use an alternative output format when using DuckDB #536

edublancas commented May 29, 2023

ned2 commented Oct 2, 2023 •

edited

Loading

ned2 commented Oct 2, 2023

edublancas commented Oct 3, 2023

Use an alternative output format when using DuckDB #536

Use an alternative output format when using DuckDB #536

Comments

edublancas commented May 29, 2023

ned2 commented Oct 2, 2023 • edited Loading

ned2 commented Oct 2, 2023

edublancas commented Oct 3, 2023

ned2 commented Oct 2, 2023 •

edited

Loading