DVC Plugin to skip Nodes if Data and Code are up to date #837

Spooky-0 · 2021-07-14T21:21:29Z

Spooky-0
Jul 14, 2021

Hi All,

I have been thinking of creating a Kedro Plugin which leverages DVCs Version Controlled Data but Kedros beautiful pipelines. Somewhere I have read a discussion of this idea before, but I can’t find it anymore. Maybe Github, Discord, etc.

Background: At work I had the problem of jobs taking a long time and failing often, but once they ran successfully, all was good. However, when changing parts of the pipeline, after a while, you don’t know if your data is up to date any more - so you have to rerun all. And this takes a long time. So I used DVC to track my input data (e.g. CSV, Hadoop Data*, etc.) but I also tracked the scripts (.py, .sh, etc.). When any one of those changed, the Pipeline would run only that what was necessary. For development this was a game changer.
*) I added an extra step, which simply called ‘DESCRIBE my_hive/impala_table’, saved the output somewhere, and then DVC’ed just this proxy. If the Data on Hive changed, DVC would not know - but this was fine.

For Kedro I was thinking of creating a Plugin/Hook which has very similar behaviour. A Nodes Input is very easily tracked by DVC (especially easy for local data). The Code is not so easily tracked.
Some difficulties: I cannot track all the code, because if I change Node17 then I would have to rerun everything, because accoarding to DVC some code changed.
Thus I want to have a DVC run only look at the current Nodes Code and Data. I was thinking of using python’s inspect module to track the code for this Node. I think this would work. The problem might be, that the Node imports code from a different file. If this file then changes, would DVC know? Is it possible to track all relevant code?

DVC can only run one task at a time. Thus integrating with kedro-accelrator would be impossible. Right?

To reiterate: The goal of kedro-dvc would not be the Data Version Control per se, but rather it would help during devlopment of kedro pipelines. You could be sure that the code/data is always up to date, but only have to run individual nodes - the rest can be skipped as the data is up to date.
The goal is not to keep the data versioned (I guess you should use versioned datasets xD)

Before I embark on this, I would like to ask the community on feedback.

Thanks!

mzjp2 · 2021-07-14T22:39:11Z

mzjp2
Jul 14, 2021

Relevant discussions/resources:

I hadn't though of using DVC, but on the face of it, that sounds like a awesome way to implement the functionality.

Funnily enough, my gist above got a comment by @bvancil (https://github.com/bvancil) that also asked about the code-versioning functionality. I thought along the same lines as you, that comparing the AST of the node would be a cool way to do this, but your point on module-level things like imports changing is something I hadn't considered. Potentially it could be node-code + module-code (minus other functions/nodes)?

11 replies

shaunc Jan 31, 2022

@Spooky-0 -- instead of using inspect, perhaps one could leverage a "designated test generator" for a given artifact (type?) and coverage, parsing the xml or json output.

Afterthought -- thinking about our own code, the problem for us with this approach would be numba.cuda code, which isn't covered. Usually we place cuda compiled code together with wrapper code in the same module, so as long as we weren't trying to resolve beyond the module level this would work. I presume that perhaps inspect could do a bit better (seeing the call, at least) but it would also have problems with calls out into c code, etc. Presumably some config/manual specification of relation between certain code and certain artifacts could close this gap.

... perhaps pipeline steps can be associated with tests (and perhaps many-many?) via a wrapper. Could use coverage contexts to mark what code was run by which of these tests (in CI/CD, for instance).

antonymilne Jan 31, 2022

@shaunc LOVE the idea of using something like coverage to do this. This sounds much more promising to me than trying to parse the AST ourselves. Why try and reinvent the wheel to work out which lines of Python are executed when other people have already figured that out? Also might be worth considering trace and heartrate.

Being compatible with CUDA would be nice, but honestly I think that's quite an unusual/advanced case you have. If we could have something working just with normal Python code then that would be amazing. Even just the naive inspect of the node's function or looking at all modules within a node's directory would be a start.

shaunc Jan 31, 2022

Heartrate seems higher level, but trace looks quite nice... hmm... so one question would be to do this "at runtime" -- as you could with trace, or via a test as I was thinking.

My idea was that you either declared a test with the node, or you wrapped a test in a decorator that referenced the node. Vs CUDA or etc etc my idea was in that wrapper you could also specify some modules to track manually.

I guess you could do this using trace automagically without having a designated test? I guess there is a performance hit vs not having to maintain the test and seeing what branches actually ran in a node execution.

antonymilne Jan 31, 2022

Yeah, I was imagining something where you would do this at runtime rather than needing to write a test for each node that you want to "version control". Otherwise it seems like a bit of an abuse of the test runner, which we're only really using as a vehicle to get coverage. And you'd have to ensure that you're running the tests in between every kedro run, which maybe couples you to doing it in CI/CD only? Depending on the library used, definitely it might affect performance if we're doing this run time though.

The other possible issue I see with the test runner strategy is that you need to write test data for every single node you want to version control. I was thinking that you could avoid writing an individual test for each node by just writing one parametrised test that does something clever by going through each of kedro's nodes. But you'd still need a way to run each node function on its own test input data somehow.

shaunc Jan 31, 2022

Yes -- I guess you could perhaps have the best of both worlds if there was a "record mode" where you would run and write per-node-metadata on source code usage (which you could check in), and a "production mode" where you would just use this data to possibly skip nodes or not without having trace on.

I personally would still like the ability to add additional code manually to the generated list (and not have it clobbered by rerecord, of course).

shaunc · 2022-01-31T14:53:26Z

shaunc
Jan 31, 2022

The flip-side of this discussion on partial rebuilds -- suppose I'm using argo-workflows to run my pipeline. I want to get all the artifacts into DVC so that I can later use them for partial rebuilds. Should I write wrappers that check in outputs? What about logs and error outputs? Also in a distributed environment I can't just rely on the local work tree -- it would seem that I need a versioning policy to guide how the commits are merged.

Has any of this work been done already? Suggestions on where to look, or where to put it if I want to build something myself?

Update -- dvc checkpoints seem to be the thing to integrate with. I guess the approach would be to make kedro the source of authority, and automate generation of dvc.yaml, etc.... (unless there is some reason to think about doing it the other way around?)

3 replies

shaunc Jan 31, 2022

@Spooky-0, @AntonyMilneQB @mzjp2 -- is there any code out there that generates dvc.yaml-s from pipeline definitions? Is that part of the vision here? I would be happy to work on this -- should it go in a kedro-dvc repo/plugin?

datajoely Feb 1, 2022
Collaborator

So I'm not sure we have anything off the shelf - since joining the LF we've had a lot of activity on the topic so we'll try and consolidate it all shortly (see this discssion for details #1181)

Also happy to answer any questions about the internals that may be helpful - there will be some small breaking changes in 0.18.0 that we can keep front of mind.

shaunc Feb 1, 2022

Thanks, @datajoely ... on the dvc side I've gotten some encoragement --
https://discord.com/channels/485586884165107732/485596304961962003/938108282311147552
https://discord.com/channels/485586884165107732/485596304961962003/938119057109037108
Once I consolidate and think through things I'll spin up a kedro-dvc repo with a design document (to start out with, at least :)) -- probably next week. Comments and contributions will be welcome!

[Update -- more info on the dvc side]

shaunc · 2022-02-07T05:27:15Z

shaunc
Feb 7, 2022

An initial stab at a design -- https://github.com/FactFiber/kedro-dvc/blob/main/doc/design.md
Comments welcome!

BTW ...

The current project's Python requirement (>=3.8,<3.10) is not compatible with some of the required packages Python requirement:
    - kedro requires Python >=3.6, <3.9, so it will not be satisfied for Python >=3.9,<3.10

:( ...
Are you adding 3.9 support soon? (If numba can do it, so can you! :)) -- I like to try to use numpy typing which worked (works?) better with 3.9 (I think that was the issue....). I could go lower in this package, but the package I want to consume this in definitely needs 3.9.

7 replies

datajoely Feb 7, 2022
Collaborator

Also feel free to tag us in PRs and issues on your repo if it's to do with Kedro internals and we'll do our best ot help :)

shaunc Feb 7, 2022

We need 3.9 support certainly, but we can try to keep 3.8 at the very least as well for the moment. Hmm... how to handle with CI? Are you perchance going to create an 0.18 pre-release (which is stable enough! :)) anytime soon? I encourage it! :)

As for comments -- I'm sure there is still much to think through. The first issue that comes to mind is versioning. I haven't looked into it yet, but would hypothesize that dvc uses git versioning tools to produce hashes. Perhaps this is a good way to go? Is there a point at which I can intercept how you generate version strings and substitute in the dvc version hash? I encourage you to take the discussion to the plugin discussion board :)

Thank you!

antonymilne Feb 7, 2022

0.18 pre-release is unofficially available already in the develop branch. main always points to the latest development on the minor release branch (currently 0.17.7), while develop points to the latest development on the major release branch (currently 0.18.0). So if you want to work off 0.18 then you can do so already by pip install https://github.com/kedro-org/kedro/archive/develop.zip. Definitely not officially supported, but it should actually be pretty stable I think.

Will take a look at the plugin discussion board for the other stuff! Very exciting 😀

shaunc Feb 7, 2022

develop -- almost!:

 % poetry add https://github.com/kedro-org/kedro/archive/develop.zip
...
  Because no versions of black match >22.1.0,<23.0.0
   and black (22.1.0) depends on click (>=8.0.0), black (>=22.1.0,<23.0.0) requires click (>=8.0.0).
  And because kedro (0.17.6) depends on click (<8.0), black (>=22.1.0,<23.0.0) is incompatible with kedro (0.17.6).

sigh -- I wish poetry would allow overrides.... however -- works with black 21.12b0! :) ... still -- if you can loosen that click dependency -- probably I won't be the only one....

antonymilne Feb 7, 2022

Ah yes, this was pointed out last week and is on the backlog to fix 🙂

shaunc · 2022-02-11T01:59:48Z

shaunc
Feb 11, 2022

I've written some more detail into the basic correspondence between kedro and dvc.

A question -- is there a mechanism for attaching extra metadata to data catalog entries, nodes and pipelines? Perhaps a generic mechanism would suffice. E.g. a meta=dict(kedro_dvc=dict( ...)) or equiv in yaml for data catalog.

This could also be provided externally, but it gets harder to be DRY that way.

catalog entries: cached or uncached? Other dvc metadata
nodes: e.g. whether to turn into checkpoints in kedro experiments
pipelines: naming experiments, mapping persistent experiments onto directories or tags...

9 replies

datajoely Feb 15, 2022
Collaborator

Hi @shaunc off topic for a second - it would be great to join our discord server - it may make sense to make a dedicated kedro-dvc channel, I'd also like to personally send you some Kedro swag as a thank you so DM me on discord when you get a chance!

shaunc Feb 15, 2022

@datajoely -- thanks for the invite! @AntonyMilneQB -- thanks for the replies!

Thanks for the links to additional documentation.
Both kedro and dvc allow extra command line args. Kedro-dvc should support, but at first I'm focusing on node-declared parameters as I have to support those as a minimum.
When I wrote the message 2 days ago, it seemed to me that the best was, if kedro-dvc was installed, to "take over" the standard commands so as to minimize the surface area for users. However, what if you want to run some pipelines "dvc-aware" and others not? Also kedro-dvc will at first only support a subset of kedro features. So it seems best to me now to provide separate commands (with "dvc" command group) at first. Perhaps there will be an install recipe to configure full integration.

shaunc Feb 20, 2022

@datajoely -- do you have an "after_command_run" cli hook? We wanted to add our own conf install after kedro new runs if kedro-dvc is installed?

antonymilne Feb 21, 2022

@shaunc that doesn't exist, but we could probably add it if it's useful. The before_command_run hook was added for a specific purpose (telemetry) and there was no need for after_command_run so it wasn't added. The closest thing at the moment would be after_pipeline_run, but obviously that's only invoked if you're running a pipeline.

shaunc Feb 21, 2022

It would be useful... its not essential though. Right now I'm starting with a sequence kedro new then kedro dvc install with the latter creating the DVC repo and K-D config on the inside. But if you have K-D installed globally, though it would be convenient to do these steps in kedro new (perhaps modulated by an option).

dhavala · 2022-07-11T09:18:58Z

dhavala
Jul 11, 2022

Hello Kedro Devs/Mods,

Any updates in this work?

I read that Kedro provides data versioning -- but it simply writes data to folder with timestamp.
While that is a good starting point -- but taking a more principled and solid approach is needed, such as those taken by DVC (which seperates the metadata from data, and only tracks the metadata via git).

Recently, there are some demo's that show how MLFlow and DVC can work together. I am wondering are there any efforts to make DVC work with Kedro Data Catalogue.

It solves many problems and
core Kedro funct. is modular, opiniated pipelines. avoids working on duplicated responsibiliies
mlflow for tracking model, serving etc
dvc for data versioning
cml for CI/CD

Then Kedro can play nicely along with other mature open source tools in the MLOps space. Otherwise, it feels as if, everybody (tool) becomes a siwss-army knife of MLops.

Appreciate any pointers or comments or suggestions!

1 reply

datajoely Jul 11, 2022
Collaborator

@shaunc how are things going? Is there any support we cna provide?

DVC Plugin to skip Nodes if Data and Code are up to date #837

Replies: 5 comments · 31 replies

datajoely Feb 1, 2022 Collaborator

datajoely Feb 7, 2022 Collaborator

datajoely Feb 15, 2022 Collaborator

datajoely Jul 11, 2022 Collaborator

Replies: 5 comments 31 replies

datajoely Feb 1, 2022
Collaborator

datajoely Feb 7, 2022
Collaborator

datajoely Feb 15, 2022
Collaborator

datajoely Jul 11, 2022
Collaborator