
[DRAFT] Separate file format from processing engine in datasets #273

Open
deepyaman opened this issue Jul 19, 2023 · 1 comment

@deepyaman (Member)

Context

  1. There is currently no clear consistency in what a dataset does; it loads (or, in cases like Spark, connects to) data in some format, and then you need to make sure the node consuming the dataset matches that format. This means you can never truly separate node from dataset and swap one out without changing the other, unless the two are "compatible" (by unenforceable rules).
  2. If you want to support a new file format (e.g. Delta), you need to write a connector for each engine. In many cases this makes sense (perhaps there's nothing to reuse between the way Spark loads Delta and the way pandas loads Delta). In other cases, perhaps it should be possible to avoid defining the loader in each place, especially with dataframe interchange protocols coming into the picture (see the sketch after this list).
  3. The current design of kedro-datasets makes datasets entirely independent, so you can't reuse logic from one dataset in another. This is great in many ways (separation of dependencies) but also makes it impossible (I think?) to share loading code.
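To make point 2 concrete, here is a small illustration (not Kedro code) of the dataframe interchange protocol: any loader that returns an object implementing `__dataframe__` can be handed to pandas without a pandas-specific connector. The version requirements noted in the comments are my assumptions.

```python
# Illustration only; not part of kedro-datasets. A pyarrow.Table
# implements the __dataframe__ interchange protocol (in recent pyarrow
# releases), so pandas can consume it without a dedicated loader.
import pandas as pd
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df = pd.api.interchange.from_dataframe(table)  # requires pandas >= 1.5
print(df)
```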

Inspired by:

> > By adding this into kedro-datasets, there will be 3 possible ways of handling a Delta table:
> >
> > 1. Apache Spark
> > 2. delta-rs, a non-Spark approach (this PR)
> > 3. Databricks Unity Catalog
>
> Agree. Although DeltaTable is often associated with Spark, it's actually just a file format, and you can read it via pandas or perhaps other libraries later.
>
> I think the current space is still dominated by Parquet for data processing; Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important, and they deserve wider adoption outside of the Spark ecosystem.
>
> I have no idea how compaction and re-partitioning would work with a non-Spark implementation. This feels like the responsibility of some kind of DB or data processing engine; it's probably too much for the Dataset abstraction. WDYT?

Originally posted by @noklam in #243 (comment)
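For reference, the non-Spark read path mentioned in the quote above already exists in delta-rs (the `deltalake` Python package); the table path below is a placeholder:

```python
# Reading a Delta table into pandas without Spark, via delta-rs.
from deltalake import DeltaTable

df = DeltaTable("path/to/delta_table").to_pandas()
```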

Possible Implementation

The purpose of this issue (thus far) is to raise some potential problems; I don't have a good solution in mind. I'm also not 100% sure this is solvable, or that Kedro wants to solve this problem.

One half-baked thought is to make the "engine" a parameter of a dataset's load/save. Then it becomes the dataset's responsibility to decide when to materialize data more concretely.
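A hedged sketch of what that could look like; none of this is an existing Kedro API, and `DeltaDataset`, the `engine` argument, and the helper structure are invented for illustration:

```python
# Hypothetical sketch only: the dataset owns the file format (Delta),
# and the caller chooses the in-memory representation at load time.
class DeltaDataset:
    def __init__(self, filepath: str):
        self._filepath = filepath

    def load(self, engine: str = "pandas"):
        # Defer choosing a concrete representation until the caller
        # asks for one.
        if engine == "pandas":
            from deltalake import DeltaTable  # delta-rs, no Spark needed

            return DeltaTable(self._filepath).to_pandas()
        if engine == "spark":
            # Assumes a Spark session configured with the Delta Lake
            # package is available.
            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            return spark.read.format("delta").load(self._filepath)
        raise ValueError(f"Unsupported engine: {engine!r}")
```

Under this design, swapping the file format in the catalog would not force a change to the node, as long as the node keeps requesting the same engine.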

@deepyaman deepyaman changed the title Separate file format from processing engine in datasets [DRAFT] Separate file format from processing engine in datasets Jul 19, 2023
@astrojuanlu (Member)

I essentially agree with all your points.

xref #200, kedro-org/kedro#1936, kedro-org/kedro#1981, kedro-org/kedro#1778, and to some extent kedro-org/kedro#2536

It's clear that at some point we need to sit down and see if we can come up with a better design.
