
[DataCatalog]: Convert between dataset formats at the catalog level #3942

Open
ElenaKhaustova opened this issue Jun 6, 2024 · 1 comment
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Users have expressed a need for functionality to convert between different dataset formats at the catalog level. Additionally, integrating Kedro with established data tooling such as dlt (dlthub) and Ibis would give users a convenient way to work with diverse datasets and enable seamless conversion between formats.

We propose to:

  1. Explore the feasibility of developing methods within the framework's API to facilitate conversion between different dataset formats at the catalog level, as sketched below. These methods should support seamless conversion between common formats such as CSV, JSON, and Parquet, giving users flexibility when working with diverse datasets.
  2. Explore the feasibility of integrating Kedro with external data tooling such as dlt (dlthub) and Ibis, allowing users to leverage these tools directly within the framework.
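
To make the first item concrete, here is a purely hypothetical sketch of what a catalog-level conversion API could look like. Nothing below exists in Kedro today: the `as_format` keyword is invented for illustration only.

```python
# Hypothetical sketch -- the conversion API shown here does NOT exist
# in Kedro; only DataCatalog, CSVDataset and catalog.load() are real.
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(
    {"cars": CSVDataset(filepath="data/01_raw/cars.csv")}
)

# Invented `as_format` keyword: ask the catalog to convert the loaded
# data into another in-memory representation on the fly.
cars = catalog.load("cars", as_format="polars")  # hypothetical
```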

Context

  • Currently, users have to add a separate node to handle conversion, for example when using SnowParkTableDataSet (the workaround is sketched after this list): "If there was a way to use that method here rather than having to define a small node just to convert this distributed data frame into a pandas data frame, that would be great. In this scenario, since the entire pipeline is in pandas and only this data frame is distributed, I had to write a separate node which converts this distributed data frame into a pandas data frame. And I think there is a function from Snowflake that allows you to convert it. So we would love to have the functionality at the level of the catalog or a dataset rather than adding a custom node to your pipeline."
  • The DataCatalog is seen as a well-designed and battle-tested system that could greatly benefit from more integration with external ETL tools like dlthub. This would allow users to leverage the strengths of these tools within the Kedro environment, enhancing data format transformations and interoperability without needing to develop extensive new dataset implementations.
  • "If think of the maintenance burden, we probably have too many data sets in there today. I actually think there's only maybe five that we really need to care about Pandas, Polar's, Spark, Sequel, Ibis. Like I'd be happy with those ones being the core and the rest being delegated to some sort of experimental status going broader than that. I'm interested in other ways that we can delegate that maintenance burden to dedicated tooling in the space. So, rather than just taking this philosophy that we have to own everything, why don't we integrate things with DLT hub or Ibis?"
@merelcht
Member

How does this relate to the existing transcoding functionality? (https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-different-datasets-with-transcoding) And for the Ibis part, would that go beyond the IbisDataset that we've added recently?
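
For reference, the transcoding mechanism linked above already lets two catalog entries share one file while loading it through different dataset implementations, selected via the `@` suffix (adapted from the Kedro docs; entry names and paths are illustrative):

```yaml
my_dataframe@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/data.parquet
```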
