feat: duckdb support #422

Merged
merged 7 commits into from
May 31, 2022

Conversation

machow
Owner

@machow machow commented May 3, 2022

re-implementation of #380

Notes:

Edit: Updated notes (May 30th):

  • Breaking: LazyTbl now gets the dialect name from sqlalchemy's URL, so it no longer depends on each dialect setting the correct name. This means LazyTbl must take an Engine, not a connection (which seems like an odd thing to pass anyway).
  • I added a temporary _is_dialect_duckdb function to handle these edge cases:
    • collect() uses duckdb's query(...).to_df() method
    • collect() binds literals when compiling the query
    • assert_equal_query coerces a duckdb to_df() result to 64-bit column types (e.g. int64, float64). This matches pandas to_sql behavior (which coerces everything to 64-bit) and is only there to make testing easier. An alternative would be to annotate duckdb translations with result_type="variable" to tell the tests not to check exact dtypes.
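
To illustrate the first note, here is a rough sketch of reading a backend name off a SQLAlchemy-style URL string. It is not siuba's code: backend_name_from_url and is_duckdb_url are hypothetical helpers written for this example, and the real _is_dialect_duckdb may look quite different. The sketch only assumes the usual "backend+driver://..." URL layout.

```python
# Sketch only: shows how a backend name can be derived from a
# SQLAlchemy-style URL string. Both helpers are hypothetical and
# are NOT siuba's internals.

def backend_name_from_url(url: str) -> str:
    """Return the part of the driver name before any '+driver' suffix."""
    # "postgresql+psycopg2://..." -> "postgresql+psycopg2"
    drivername = url.partition(":")[0]
    # "postgresql+psycopg2" -> "postgresql"
    return drivername.partition("+")[0]


def is_duckdb_url(url: str) -> bool:
    """Hypothetical stand-in for a duckdb dialect check."""
    return backend_name_from_url(url) == "duckdb"


print(backend_name_from_url("duckdb:///:memory:"))                # duckdb
print(is_duckdb_url("postgresql+psycopg2://user@localhost/db"))   # False
```

With a real Engine, SQLAlchemy 1.4+ exposes the same information as engine.url.get_backend_name(), which is probably the simpler route when an Engine is already in hand.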

Example

from siuba.data import mtcars
from siuba.sql import LazyTbl
from siuba import _, mutate, group_by, filter

from sqlalchemy import create_engine

engine = create_engine("duckdb:///:memory:")
engine.execute("register", ("some_df", mtcars))

tbl = LazyTbl(engine, "some_df")

# mutation ----
tbl >> mutate(res = _.hp - _.hp.mean(), res2 = _.res * 2)

# group by cylinder, and keep rows whose hp is below the group mean ----
tbl >> group_by(_.cyl) >> filter(_.hp < _.hp.mean())

@machow
Owner Author

machow commented May 3, 2022

AFAICT duckdb is quite a bit faster! In this filter example below..

edit: wait--nevermind, I need to dig a bit deeper into timings

@machow
Owner Author

machow commented May 11, 2022

rebased on main

@machow
Owner Author

machow commented May 31, 2022

Alright--should be working! I made a copy of the duckdb timings colab notebook, and added siuba code.

I'm super interested in shifting a lot of focus to supporting duckdb as a backend, and am super curious to explore nesting/unnesting with it!

@machow machow merged commit 9723d72 into main May 31, 2022
@machow machow deleted the feat-duckdb2 branch May 31, 2022 03:03