
feat(datasets) Experimental - sqlframe datasets #694

Closed
wants to merge 3 commits

Conversation

datajoely
Contributor

Description

Leveraging sqlframe, a new dataframe library that targets SQL backends (e.g. DuckDB/BigQuery/Postgres) but exposes the PySpark DataFrame API, without the JVM or actually running Spark itself.

This has two major benefits for users:

  • Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only storage.
  • For users already accustomed to Spark syntax, or brownfield projects already written in Spark, this provides a low-friction adoption route.

Development notes

  • This has been tested locally in the terminal; I've not yet written formal tests. Experimental mode, baby 😎.
  • I've also done some funky OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML, without creating a super-complicated dataset class, whilst still supporting dynamic switching of back-ends.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@datajoely datajoely changed the title [Experiment] Add sqlframe support as experimental dataset feat(datasets) Experimental - sqlframe datasets May 22, 2024
Member

@deepyaman deepyaman left a comment


Not a proper review (especially of dataset logic), just left a few quick thoughts.

kedro-datasets/pyproject.toml (comment resolved)
return OmegaConf.select(data, path)


from kedro.io.data_catalog import DataCatalog
Member


What's this for?

Contributor Author


Shouldn't have been committed!

Member


Why do you need a resolver, and just for DuckDB?

Contributor Author


So, I wanted to discuss this:

sqlframe doesn't have its own connector logic; it is built to accept a live connection in the dialect of your choice:

duckdb: [screenshot of a DuckDB connection example]

bigquery: [screenshot of a BigQuery connection example]

Now the complexity comes in whether we build declarative, complicated constructor logic for each back-end into the dataset itself. What I'm trying to show here is that OmegaConf resolvers are a neat way of keeping the dataset class dynamic and generic, but I wanted to get the group's view on the topic...
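For example, the catalog entry could stay generic while the YAML supplies the connection through a resolver. Everything below (the dataset type name, the `connect` resolver, its arguments) is hypothetical, sketched to show the shape, and is not the PR's actual API:

```yaml
motors:
  type: sqlframe.SQLFrameTableDataset   # hypothetical dataset class name
  table_name: motors
  session:
    con: "${connect:duckdb}"            # resolver builds the live connection

# Switching back-ends would then be a one-line YAML change, e.g.:
#   con: "${connect:bigquery}"
```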

Comment on lines +8 to +9
TableDataset: Any
FileDataset: Any
Member


Can file and table I/O be consolidated, as in Ibis dataset?

Contributor Author


Yeah, probably. Now that I've worked out how to do it, it's much easier.

@noklam
Contributor

noklam commented Jul 23, 2024

What do we want to do about this PR/dataset?

@datajoely
Contributor Author

@noklam I'm going to redo it, v2.0.0 just came out and it makes the API much better

@merelcht
Member

@datajoely Are you still keen on polishing this? And maybe a silly question but this looks very similar to the two Ibis datasets we now have - does it make sense to have both?

@datajoely
Contributor Author

Hi @merelcht, I'm going to close this and reopen it at a later date. The library was changing at a rate of knots, so I held off finishing this.

I do think both this and Ibis deserve to be supported in Kedro - the onboarding penalty in Ibis is the sticking point as you need to change your existing codebase. This is much easier to adopt for existing Spark users and unlocks a bunch of different execution engines.

My recommendation would be: use Ibis if you're starting something new; use SQLFrame if you're thinking about migrating an existing project off Spark.

@datajoely datajoely closed this Oct 18, 2024