feat(datasets) Experimental - sqlframe datasets #694
Conversation
Not a proper review (especially of dataset logic), just left a few quick thoughts.
return OmegaConf.select(data, path)

from kedro.io.data_catalog import DataCatalog
What's this for?
Shouldn't have been committed!
Why do you need a resolver, and just for DuckDB?
So I wanted to discuss this -
Sqlframe doesn't have its own connector logic; it is built to accept a live connection in the dialect of your choice:
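A minimal sketch of what that looks like with DuckDB (assuming sqlframe's per-backend session classes accept a live DBAPI connection via `conn=`; the database file name is hypothetical):

```python
# Sketch, assuming sqlframe's DuckDBSession wraps an existing DuckDB connection.
import duckdb
from sqlframe.duckdb import DuckDBSession

connection = duckdb.connect("warehouse.duckdb")  # hypothetical database file
session = DuckDBSession(conn=connection)

# PySpark-style DataFrame syntax, executed by DuckDB rather than a JVM.
df = session.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.where(df.id > 1).show()
```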
Now the complexity comes in whether we build declarative and complicated constructor logic for each back-end into the dataset itself. What I'm trying to show here is that OmegaConf resolvers are a neat way of keeping the dataset class dynamic and generic, but I wanted to get the group's view on the topic...
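To make the resolver idea concrete, here is a rough sketch; the resolver name, catalog keys, and dataset type below are hypothetical, not the ones in this PR:

```python
# Hypothetical sketch: an OmegaConf resolver turns a plain YAML value into a
# live connection, so the dataset class stays backend-agnostic.
import duckdb
from omegaconf import OmegaConf

# Resolver that builds the live connection from a path given in YAML.
OmegaConf.register_new_resolver("duckdb_connection", lambda path: duckdb.connect(path))

catalog_yaml = """
my_table:
  type: sqlframe.TableDataset   # hypothetical dataset type
  table_name: my_table
  connection: ${duckdb_connection:'warehouse.duckdb'}
"""
catalog = OmegaConf.create(catalog_yaml)

# The connection is only built when the value is resolved/selected.
connection = OmegaConf.select(catalog, "my_table.connection")
```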
TableDataset: Any
FileDataset: Any
Can file and table I/O be consolidated, as in the Ibis dataset?
Yeah, probably. Now I've worked out how to do it, it's much easier.
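For illustration, a rough sketch of what a consolidated dataset could look like; the class and parameter names are purely hypothetical, loosely following the Ibis dataset's single-class pattern:

```python
# Hypothetical sketch of a consolidated sqlframe dataset: one class that loads
# either a table or a file through the session's PySpark-style reader API.
# All names here are illustrative, not the actual implementation in this PR.
from typing import Any


class SQLFrameDataset:
    def __init__(
        self,
        *,
        session: Any,                 # live sqlframe session (any backend)
        table_name: str | None = None,
        filepath: str | None = None,
        file_format: str = "parquet",
    ) -> None:
        # Exactly one of table_name / filepath selects table vs. file I/O.
        if (table_name is None) == (filepath is None):
            raise ValueError("Provide exactly one of 'table_name' or 'filepath'.")
        self._session = session
        self._table_name = table_name
        self._filepath = filepath
        self._file_format = file_format

    def load(self) -> Any:
        if self._table_name is not None:
            return self._session.table(self._table_name)
        return self._session.read.format(self._file_format).load(self._filepath)
```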
What do we want to do about this PR/dataset?
@noklam I'm going to redo it; v2.0.0 just came out and it makes the API much better.
@datajoely Are you still keen on polishing this? And maybe a silly question, but this looks very similar to the two Ibis datasets we now have - does it make sense to have both?
Hi @merelcht, I'm going to close this and reopen it at a later date - the library was changing at a rate of knots, so I held off finishing this. I do think both this and Ibis deserve to be supported in Kedro - the onboarding penalty in Ibis is the sticking point, as you need to change your existing codebase. This is much easier to adopt for existing Spark users and unlocks a bunch of different execution engines. My recommendation would be: use Ibis if you're starting something new, use SQLFrame if you're thinking about migrating an existing project off Spark.
Description
This PR leverages sqlframe, a new dataframe library which targets SQL backends (e.g. duckdb/bigquery/postgres) but exposes the PySpark DataFrame API syntax... without the JVM or actually running Spark itself.
This has two major benefits for users:
- Existing datasets such as pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only storage.

Development notes
Includes some OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML without creating a super complicated dataset class, whilst still supporting dynamic switching of back-ends.

Checklist
- Added a description of this change in the RELEASE.md file