
[DataCatalog]: Convert between dataset formats at the catalog level #3942

Open
ElenaKhaustova opened this issue Jun 6, 2024 · 1 comment
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Users have expressed a need for functionality to convert between different dataset formats at the catalog level. Additionally, integrating Kedro with established data tooling such as dlt (dlthub) and Ibis would give users a convenient way to work with diverse datasets and enable seamless conversion between formats.

We propose to:

  1. Explore the feasibility of developing methods within the framework's API to facilitate conversion between different dataset formats at the catalog level, as sketched below. These methods should support seamless conversion between common formats such as CSV, JSON, and Parquet, giving users flexibility when working with diverse datasets.
  2. Explore the feasibility of integrating Kedro with external data tooling such as dlt (dlthub) and Ibis, allowing users to leverage these tools directly within the framework.
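
To make the first item concrete, here is a purely hypothetical sketch of what a catalog-level conversion API could look like. Nothing below exists in Kedro today: the `as_format` keyword is invented for illustration only.

```python
# Hypothetical sketch -- the conversion API shown here does NOT exist
# in Kedro; only DataCatalog, CSVDataset and catalog.load() are real.
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(
    {"cars": CSVDataset(filepath="data/01_raw/cars.csv")}
)

# Invented `as_format` keyword: ask the catalog to convert the loaded
# data into another in-memory representation on the fly.
cars = catalog.load("cars", as_format="polars")  # hypothetical
```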

Context

  • Currently, users have to add a separate node to handle conversion, for example when using SnowParkTableDataSet (the workaround is sketched after this list): "If there was a way to use that method here rather than having to define a small node just to convert this distributed data frame into a pandas data frame, that would be great. In this scenario, since the entire pipeline is in pandas and only this data frame is distributed, I had to write a separate node which converts this distributed data frame into a pandas data frame. And I think there is a function from Snowflake that allows you to convert it. So we would love to have the functionality at the level of the catalog or a dataset rather than adding a custom node to your pipeline."
  • The DataCatalog is seen as a well-designed and battle-tested system that could greatly benefit from more integration with external ETL tools like dlthub. This would allow users to leverage the strengths of these tools within the Kedro environment, enhancing data format transformations and interoperability without needing to develop extensive new dataset implementations.
  • "If think of the maintenance burden, we probably have too many data sets in there today. I actually think there's only maybe five that we really need to care about Pandas, Polar's, Spark, Sequel, Ibis. Like I'd be happy with those ones being the core and the rest being delegated to some sort of experimental status going broader than that. I'm interested in other ways that we can delegate that maintenance burden to dedicated tooling in the space. So, rather than just taking this philosophy that we have to own everything, why don't we integrate things with DLT hub or Ibis?"
@merelcht
Member

How does this relate to the existing transcoding functionality? (https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-different-datasets-with-transcoding) And for the Ibis part, would that go beyond the IbisDataset that we've added recently?
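
For reference, the transcoding mechanism linked above already lets two catalog entries share one file while loading it through different dataset implementations, selected via the `@` suffix (adapted from the Kedro docs; entry names and paths are illustrative):

```yaml
my_dataframe@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/data.parquet
```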
