How to maintain external datasets contributions #535

Open
noklam opened this issue Jun 23, 2023 · 8 comments

Comments

@noklam
Contributor

noklam commented Jun 23, 2023

Description

Why this is raised?

With more incoming dataset PRs, it becomes harder to maintain all the datasets. Particularly for the exotic datasets, we don't have the setup for every possible environment (e.g. Snowflake/Databricks). This creates a challenge for maintaining all the datasets since we don't have the re

This also leads to the question: "Does every dataset belong in kedro-datasets?"

The answer is no, since a few popular datasets are already maintained separately, for example in kedro-mlflow.

Possible Action

  • CSVDataSet is more robust than, say, ManagedTableDataSet. Can we signal this better through our docs? We did something similar in the Deployment docs.

More Discussion

How do we want to maintain the contributions? How do we draw the line between something that should be a separate plugin and something that goes into kedro-datasets? Cc @astrojuanlu

Idea raised during retro:

  1. Datasets could be maintained as separate plugins, e.g. kedro-mlflow has its own datasets.
@noklam
Contributor Author

noklam commented Jan 29, 2024

Link: #517 (comment)

Maybe we can close this ticket?

@astrojuanlu
Member

kedro-org/kedro#517 was a different (although related) discussion. In the middle of it, though, I raised the question "Should we accept every dataset that is in good shape in kedro-datasets?" and the answer seemed to be yes. However, this was at the very end of our meeting and there was not nearly enough time to weigh the pros and cons.

So I'd say we keep it open.

Having said that, there are a number of pull requests open already, and I think it's unfair to hold them up for lack of a firm consensus on this topic.

@astrojuanlu
Member

For example, consider discoverability. The current monorepo approach already hinders the visibility of the individual plugins, as described in #401.

For datasets inside kedro-datasets, the effect is even larger. On top of that, the actual business logic of custom datasets is hidden behind private methods that don't get documented by default: kedro-org/kedro#1936 (comment)

@astrojuanlu
Member

(And this is aside from the maintenance issues @noklam mentioned)

@astrojuanlu
Member

I think we are underestimating the maintenance burden of the current approach.

Lots of people in the team have trouble building the docs locally, because one has to install all the dependencies of all datasets for that to work. @rashidakanchwala can attest - she struggled a lot, and now I'm unable to do it myself (troubleshooting some weird conflicts raised by pip).

On the other hand, some users have been confused in the past and couldn't even run the test suite. It happened for #360 and also for #435.

I think it's time to seriously consider breaking kedro-datasets apart.

@datajoely
Contributor

I do keep wondering if we could have a low-code dataset contribution workflow on the website that allowed us to accept contributions and manage the test suite for users.

@astrojuanlu
Member

A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict #597 (comment)

@lrcouto
Contributor

lrcouto commented Apr 12, 2024

> A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict #597 (comment)

This happened to me this week while running tests to figure out the issues with the kedro-datasets dependencies 😬
