Document usage of Kedro + DVC #2691

astrojuanlu · 2023-06-15T11:19:34Z

Description

It would be nice if we had a page on our docs that described how DVC and Kedro can be used together.

Context

Kedro users have been asking for DVC for some time. For example:

The scope of this issue is to document how such a thing can be done, but there's a chance that the best way of documenting it is to contribute a plugin.

stichbury · 2023-06-16T09:23:06Z

I asked ChatGPT; what's the quality like for this response? Just wondering if we can use it to start off a page...

Kedro and DVC are both tools that can be used for data science and machine learning projects. While they serve different purposes, they can be used together to enhance the reproducibility and version control aspects of your projects. Here's how you can integrate Kedro and DVC:

Initialize a Kedro project: Start by creating a new Kedro project using the kedro new command. This will set up the basic project structure and configuration files.

Set up DVC: Install DVC and initialize it within your Kedro project directory using the dvc init command. This will create the necessary DVC files and directories.

Define your Kedro pipelines: Use Kedro to define your data processing pipelines. This involves creating modular pipeline nodes that represent individual steps in your data science workflow.

Version your data with DVC: Use DVC to version your data files and track their changes over time. By using the dvc add command, you can tell DVC to track specific data files in your Kedro project. This ensures that your data remains under version control and can be easily reproduced.

Integrate DVC with Kedro pipelines: Modify your Kedro pipeline definitions to include DVC commands. For example, you can use the dvc run command within a Kedro pipeline node to run a data processing step and have DVC track the inputs and outputs of that step.

Use DVC for data versioning and sharing: DVC provides features for managing large datasets and collaborating with others. You can use DVC's remote storage options (such as AWS S3 or Google Cloud Storage) to store your data and share it with your team or collaborators.

By combining Kedro and DVC, you can leverage Kedro's pipeline management capabilities with DVC's data versioning and reproducibility features. This allows you to track and reproduce your data science experiments effectively, ensuring the integrity and consistency of your results.

noklam · 2023-06-16T09:28:24Z

@stichbury brilliant idea!

JaynouOliver · 2023-10-14T14:06:36Z

Sure, please assign me, I want to contribute and learn on the go

astrojuanlu · 2023-10-14T15:00:45Z

Hi @JaynouOliver, go ahead! No need to assign the issue, start working on a new documentation page and open a pull request when it's ready for a first review.

JaynouOliver · 2023-10-15T04:37:26Z

Sure!

astrojuanlu · 2023-10-18T20:21:04Z

Interesting perspective from a DVC user: https://fosstodon.org/@blakeNaccarato/111256190959866234

I appreciate the separation of concerns that working with DVC facilitates. Stages as shell commands make non-Python stages trivial. It's good for general processing outside research pipelines too, e.g. document processing.

Stage caching is enabled by hash comparison of deps/outs on disk and avoids costly recompute.

But this design forces disk access between each stage and lots of intermediate files. An abstraction enabling all-in-memory stages could help at the expense of caching.
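To make the hash-comparison caching described in the quoted post concrete, here is a minimal sketch in plain Python. It is not DVC's implementation: the function names, the `stage.lock.json` file and the choice of SHA-256 are all assumptions made for illustration. It only shows the idea of skipping a stage when the hashes of its dependencies and outputs on disk are unchanged.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage_cached(name, deps, outs, func, lock_path="stage.lock.json"):
    """Run func() unless the recorded hashes of deps and outs still match.

    Hypothetical helper: returns True if the stage ran, False if it was skipped.
    """
    lock_path = Path(lock_path)
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}

    # The stage can only be skipped if every output is already on disk.
    if all(Path(p).exists() for p in outs):
        current = {str(p): file_hash(Path(p)) for p in [*deps, *outs]}
        if lock.get(name) == current:
            return False  # nothing changed on disk, so skip the recompute

    func()

    # Record hashes after the run so the next call can compare against them.
    lock[name] = {str(p): file_hash(Path(p)) for p in [*deps, *outs]}
    lock_path.write_text(json.dumps(lock, indent=2))
    return True
```

Note that the sketch has the same trade-off the post points out: every stage must read its inputs from disk and write its outputs to disk for the comparison to work.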

astrojuanlu · 2023-10-26T10:33:39Z

Today @datajoely mentioned this in our Slack. I hadn't realized that our dataset versioning sort of overlaps with DVC: https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#e111a9d2-188c-4cb3-8a64-37f938ad21ff

DVC and Kedro don't gel super nicely together. It can be done, but our support for native DataSet versioning and Delta (spark) (non-spark) also works in this space

stichbury · 2023-10-31T10:03:05Z

Hi @JaynouOliver -- how are you? Today is the last day of October so please do slip any PRs into our queue if you have them for Hacktoberfest.

JaynouOliver · 2023-10-31T10:06:40Z

Hi. I was not doing it for hacktoberfest. Mind if I submit it by tomorrow?

stichbury · 2023-10-31T10:07:59Z

Then that's grand, yes please, that would work for us. Thank you.

astrojuanlu · 2024-01-25T09:22:06Z

For the record, yesterday two users asked me how to combine Kedro and DVC.

stichbury · 2024-01-25T09:31:03Z

For the record, yesterday two users asked me how to combine Kedro and DVC.

Did you tell them? Did you write it down? If not, is the above generated content any use? Shall we publish?

I have many questions.

astrojuanlu · 2024-01-25T10:22:50Z

It was an in-person chat after my talk. I told them to try https://github.com/FactFiber/kedro-dvc/ but also warned them that Kedro versioning is not easily configurable, so it might be hard (#2355). I think this has to be an engineering spike before a documentation issue.

stichbury · 2024-01-25T10:28:28Z

Perfect, thanks for the background and also for the change in the ticket, makes sense to me.

merelcht · 2024-07-15T13:50:42Z

We're looking at this in the context of broader versioning and dataset research. If you have thoughts on this please comment on #3997.

astrojuanlu · 2024-08-22T18:37:51Z

A useful resource I found on DVC https://www.python4data.science/en/latest/productive/dvc/index.html

astrojuanlu · 2024-09-20T15:15:45Z

Did a bit of unprompted investigation into DVC. I don't think it's actually that hard to use DVC and Kedro together.

Level 0: Data files are tracked by DVC, and the Kedro catalog contains pointers to local filepaths. kedro run assumes the files have been pulled (dvc checkout). If they haven't, Kedro will fail with "no such file". Basically, Kedro doesn't know anything about DVC and vice versa.
Level 1: Data files are tracked by DVC, and the Kedro catalog contains pointers to dvc:// filepaths thanks to the fsspec-compatible DVCFileSystem¹. Kedro would then never fail to read the files, because DVC would be in charge of doing an automatic checkout on read. Kedro would not help much with outputs: those would still be written to a local filepath (if tracked by DVC) or to some other remote storage.
Level 2: Cooperative data tracking and versioning. Doesn't exist, and it's unclear what that might look like.
Level 3: Cooperative data tracking and pipeline definition. Probably difficult or not possible, because there is too much overlap.

I think Level 0 and 1 are possible today without any changes in Kedro. The only problem is that it would probably work badly with versioned: true datasets.
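As an illustration of the Level 0 failure mode, here is a minimal sketch of a check that could run before kedro run. The helper name `find_unpulled` is hypothetical and is not part of Kedro or DVC. It relies on the fact that `dvc add data.csv` writes a `data.csv.dvc` pointer file next to the data, so it would not catch files that DVC tracks only as stage outputs in `dvc.yaml`.

```python
from pathlib import Path


def find_unpulled(filepaths):
    """Return paths that are missing locally but have a DVC pointer file.

    Hypothetical helper: a missing file whose `<name>.dvc` pointer exists
    most likely just needs to be pulled before Kedro can read it.
    """
    missing = []
    for raw in filepaths:
        path = Path(raw)
        pointer = path.with_name(path.name + ".dvc")
        if not path.exists() and pointer.exists():
            missing.append(str(path))
    return missing
```

Feeding it the local filepaths listed in the catalog would let a project report "these files need to be pulled with DVC" instead of surfacing a bare "no such file" error from Kedro.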

I'm all for at least documenting what's possible today.

¹ DVCFileSystem has existed in its current form for about 2 years.

astrojuanlu added the Component: Documentation 📄 and Issue: Feature Request labels Jun 15, 2023

stichbury added this to the Improve Kedro documentation used by advanced users milestone Jul 17, 2023

stichbury added the Hacktoberfest label Sep 29, 2023

stichbury mentioned this issue Oct 5, 2023

READ ME FIRST! Join Kedro for Hacktoberfest! #3130

Closed

stichbury assigned JaynouOliver Oct 16, 2023

astrojuanlu mentioned this issue Oct 30, 2023

How can we improve dataset versioning? #1979

Open

JaynouOliver mentioned this issue Nov 1, 2023

Document integration with DVC #3261

Closed

7 tasks

stichbury removed the Hacktoberfest label Jan 11, 2024

astrojuanlu mentioned this issue Jan 25, 2024

Configurable versioning #2355

Open

astrojuanlu changed the title ~~Document integration with DVC~~ [spike] Investigate possible integration with DVC Jan 25, 2024

astrojuanlu removed the Component: Documentation 📄 label Jan 25, 2024

merelcht unassigned JaynouOliver Mar 28, 2024

astrojuanlu modified the milestones: Improve Kedro documentation used by advanced users, Dataset Versioning Jun 6, 2024

astrojuanlu mentioned this issue Jun 6, 2024

Conduct market research on versioning #3933

Closed

astrojuanlu added the Component: Documentation 📄 label and removed the Issue: Feature Request label Sep 20, 2024

astrojuanlu changed the title ~~[spike] Investigate possible integration with DVC~~ Document usage of Kedro + DVC Sep 20, 2024

astrojuanlu mentioned this issue Sep 20, 2024

Parent task: Content on Kedro vs complementary tools #3012

Open
