
feat(datasets) Experimental - sqlframe datasets #694

Closed
wants to merge 3 commits

Conversation

datajoely
Contributor

Description

Leveraging sqlframe, a new dataframe library that targets SQL backends (e.g. DuckDB/BigQuery/Postgres) but exposes the PySpark DataFrame API, without the JVM or actually running Spark itself.

This has two major benefits for users:

  • Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only storage.
  • For users already accustomed to Spark syntax, or brownfield projects already written in Spark, this provides a low-friction adoption route.

Development notes

  • This has been tested locally in the terminal; I've not yet written formal tests. Experimental mode, baby 😎.
  • I've also done some funky OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML, without creating a super-complicated dataset class, whilst still supporting dynamic switching of back-ends.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@datajoely datajoely changed the title [Experiment] Add sqlframe support as experimental dataset feat(datasets) Experimental - sqlframe datasets May 22, 2024
Member

@deepyaman deepyaman left a comment


Not a proper review (especially of dataset logic), just left a few quick thoughts.

kedro-datasets/pyproject.toml (comment resolved)
return OmegaConf.select(data, path)


from kedro.io.data_catalog import DataCatalog
Member


What's this for?

Contributor Author


Shouldn't have been committed!

Member


Why do you need a resolver, and just for DuckDB?

Contributor Author


So, I wanted to discuss this:

sqlframe doesn't have its own connector logic; it is built to accept a live connection in the dialect of your choice:

duckdb: [screenshot of a DuckDB connection example]

bigquery: [screenshot of a BigQuery connection example]

Now the complexity comes in whether we build declarative, complicated constructor logic for each back-end into the dataset itself. What I'm trying to show here is that OmegaConf resolvers are a neat way of keeping the dataset class dynamic and generic, but I wanted to get the group's view on the topic...
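For example, the catalog entry could stay generic while the YAML supplies the connection through a resolver. Everything below (the dataset type name, the `connect` resolver, its arguments) is hypothetical, sketched to show the shape, and is not the PR's actual API:

```yaml
motors:
  type: sqlframe.SQLFrameTableDataset   # hypothetical dataset class name
  table_name: motors
  session:
    con: "${connect:duckdb}"            # resolver builds the live connection

# Switching back-ends would then be a one-line YAML change, e.g.:
#   con: "${connect:bigquery}"
```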

Comment on lines +8 to +9
TableDataset: Any
FileDataset: Any
Member


Can file and table I/O be consolidated, as in Ibis dataset?

Contributor Author


Yeah, probably. Now that I've worked out how to do it, it's much easier.

@noklam
Contributor

noklam commented Jul 23, 2024

What do we want to do about this PR/dataset?

@datajoely
Contributor Author

@noklam I'm going to redo it, v2.0.0 just came out and it makes the API much better

@merelcht
Member

@datajoely Are you still keen on polishing this? And maybe a silly question but this looks very similar to the two Ibis datasets we now have - does it make sense to have both?

@datajoely
Contributor Author

Hi @merelcht, I'm going to close this and reopen it at a later date. The library was changing at a rate of knots, so I held off finishing this.

I do think both this and Ibis deserve to be supported in Kedro - the onboarding penalty in Ibis is the sticking point as you need to change your existing codebase. This is much easier to adopt for existing Spark users and unlocks a bunch of different execution engines.

My recommendation would be: use Ibis if you're starting something new; use SQLFrame if you're thinking about migrating an existing project off Spark.

@datajoely datajoely closed this Oct 18, 2024