stage 1: dvc.yaml <-> kedro pipeline #11

shaunc · 2022-02-10T15:12:12Z

shaunc
Feb 10, 2022
Maintainer

Representation

Kedro nodes represent explicit steps in an experiment. They are defined by wrappers around Python functions.
Kedro pipelines represent experiments. They are also specified in Python. In a pipeline, each node has inputs and outputs. These input and output names correspond to entries in the data catalog. The names of inputs and outputs are also edge labels in the execution DAG for the pipeline: a node with an output of a given name produces the input for any other node with an input labelled by that name.
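
To make the "names as edge labels" point concrete, here is a minimal sketch in plain Python. It does not use Kedro itself: `NodeSpec` and `dag_edges` are hypothetical stand-ins, and the dataset names are invented for illustration.

```python
from typing import NamedTuple, Tuple


class NodeSpec(NamedTuple):
    """Hypothetical stand-in for a Kedro node: a name plus dataset names."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def dag_edges(nodes):
    """Return sorted (producer, consumer, dataset) triples of the execution DAG."""
    # Map each dataset name to the node that produces it.
    producers = {}
    for n in nodes:
        for dataset in n.outputs:
            producers[dataset] = n.name
    edges = []
    for n in nodes:
        for dataset in n.inputs:
            # Datasets with no producer are free inputs of the pipeline.
            if dataset in producers:
                edges.append((producers[dataset], n.name, dataset))
    return sorted(edges)


nodes = [
    NodeSpec("split", ("raw",), ("train", "test")),
    NodeSpec("fit", ("train",), ("model",)),
    NodeSpec("evaluate", ("model", "test"), ("metrics",)),
]
edges = dag_edges(nodes)
```

Here `raw` has no producer, so it is an input of the whole pipeline rather than an edge.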

Kedro pipelines can load other pipelines as modules.

DVC stages are represented by entries in dvc.yaml pipeline definitions. Entries define inputs and outputs via filename references. The task to execute is defined as a shell command. They also name parameters relevant to the step, and non-data artifacts (metrics and plots) produced by the task.

Implementation

Using the "before-pipeline" hook, we can write/update dvc.yaml. We can also have a utility that does this without running an experiment. By default we can keep dvc.yaml in config/kedro_dvc, named after the pipeline.

Nodes refer to data catalog entries, which we translate to .dvc files. Kedro-dvc maintains a correspondence between registered pipelines and dvc.yaml files. Each of these files is placed in <CONF_ROOT>/dvc/pipelines/<pipeline_name>/ for its pipeline.

dvc.yaml contains a vars section, which we need to use to map parameters (see discussion), and a stages section, which describes the equivalent of kedro nodes. For each stage, we maintain the following fields:

| Field | Generated Value |
| --- | --- |
| cmd | `kedro run --pipeline <pipeline name> --node <node name>` |
| wdir | relative path from `dvc.yaml` to project root |
| deps | `.dvc` files corresponding to node data inputs |
| outs | `.dvc` files corresponding to node data outputs |
| params | references to parameters |
| metrics | references to metric outputs |
| plots | references to plot outputs |
| meta | see below |
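
As a rough illustration of the field mapping above, the sketch below builds the `cmd`, `wdir`, `deps` and `outs` values for one stage as a plain dict. Everything here is an assumption for illustration: `stage_entry` is a hypothetical helper, the `<dataset>.dvc` file naming is a guess, and `params`, `metrics`, `plots` and `meta` are left out.

```python
import posixpath


def stage_entry(pipeline_name, node_name, inputs, outputs, dvc_yaml_dir, project_root):
    """Build a dict with the generated fields for one stage (hypothetical helper)."""
    return {
        # Command that runs just this node of the pipeline.
        "cmd": f"kedro run --pipeline {pipeline_name} --node {node_name}",
        # Relative path from the directory holding dvc.yaml to the project root.
        "wdir": posixpath.relpath(project_root, dvc_yaml_dir),
        # Assumption: each catalog entry maps to a file named "<dataset>.dvc".
        "deps": [f"{name}.dvc" for name in inputs],
        "outs": [f"{name}.dvc" for name in outputs],
    }


entry = stage_entry(
    pipeline_name="ds",
    node_name="fit",
    inputs=["train"],
    outputs=["model"],
    dvc_yaml_dir="/proj/conf/dvc/pipelines/ds",
    project_root="/proj",
)
```

The paths are POSIX-style on purpose so that the `wdir` value does not depend on the platform the generator runs on.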

Names

The node name is the qualified name: <namespace>.<short_name>, if a namespace is provided, and just <short_name> if not. If no name is provided, the name of the function wrapped is used. These names might not be unique. Running the experiment via kedro will avoid problems with non-unique naming; it should also provide potential performance benefits.
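
The naming rule as stated above can be written down directly. This is only a sketch of the rule in this post, not Kedro's own implementation; `qualified_name` and its parameters are hypothetical.

```python
def qualified_name(func, short_name=None, namespace=None):
    """Return '<namespace>.<short_name>' or '<short_name>' per the rule above."""
    # Fall back to the wrapped function's name when no name is given.
    base = short_name if short_name else func.__name__
    return f"{namespace}.{base}" if namespace else base


def train_model():
    """Placeholder function standing in for a wrapped node function."""


name_plain = qualified_name(train_model)
name_qualified = qualified_name(train_model, short_name="fit", namespace="ds")
```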

antonymilne
Feb 25, 2022

Time for my random question and comment spree on this one... 😀

Questions

What is meta for? My understanding of DVC was that this was just ignored, so is it purely just for information purposes or will you use it for something?
What is seq?
I don't quite understand what the proposed mechanism for actually running a pipeline is here - is it through dvc run invoking kedro run as in this? My naive initial idea was that, if you are using kedro's pipeline to construct the DAG and DVC to work out which nodes needed running then it would make sense to directly be doing kedro run rather than a dvc run. Or possibly, like you thought about before, some kedro dvc run or kedro run --new-argument or kedro run --runner=DVCRunner (using a custom runner). Or maybe even using DVC to work out which are the nodes that should be run, and then piping that into kedro run somehow. Or somehow DVC tags (see below) which kedro nodes should be run for a dvc repro. I'm curious to know which solution you've landed on here and why.
Related to the above, I don't quite understand how the dvc run command relates to the dvc repro and dvc exp run commands you explain here. From an end-user perspective, what ultimately are the commands that a user would run if they had kedro-dvc installed?

Maybe useful comments

Not sure if this is what you already meant in your discussion of node naming, but just to be clear: node.name in kedro is unique within a pipeline, i.e. if you try to construct a pipeline which has two nodes with the same name (including the case where you don't explicitly supply a name and kedro just uses the underlying function name) then you'll get an error "Pipeline nodes must have unique names."
Kedro nodes are something which you can attach metadata to through supplying a set of tags. Originally (before we had the notion of modular pipelines) this was used to select which nodes to run, e.g. kedro run --tags=data_science, but it can be used for anything you like really, e.g. if you want to convert a kedro pipeline into some other sort of workflow that divides a pipeline up by grouping nodes in a certain way then you can label nodes that belong to the same group using tags. Pipelines can also be tagged, but the effect is just to apply the tag to all nodes within that pipeline.
Pipeline.filter is quite powerful if you need to select a subset of a pipeline to run. Pipeline also supports various algebraic operations like +.
Depending on how you want to interact with kedro's runner, you can use AbstractRunner.run() or runner.run_node or Node.run or session.run() which might be alternatives to kedro run as a CLI command.


shaunc Feb 25, 2022
Maintainer Author

Answers to questions:

Meta: I had various things I've wanted to put in meta -- but have found other mechanisms. So this isn't a pressing concern. My instinct is it would be useful. Off the top of my head:
1. Per pipeline: do we want to enable kedro-dvc for this pipeline? Allows someone to mix & match in one project w/ multiple pipelines.
2. It might be useful, e.g., to annotate edges as well as nodes. Suppose that you knew a certain dependency was weak -- in a particular phase (say using "test" environment) we are more concerned with the effects of a change on another subgraph of the pipeline DAG, and don't care about computationally intensive changes for a given other subgraph. We could put an annotation of that edge in metadata.
3. Could use to group nodes to figure out how to deal with in-memory pipeline edges when distributing. For example, take graph below. Suppose A->C and B->D were in memory but the cross-links not. If we try to use the strategy "combine nodes that use memory links", then the resulting graph has a cycle. We might want to use meta to mark an edge to "Tee" data.
```mermaid
  graph TD
    A --> C
    A --> D
    B --> C
    B --> D
```
4. Especially for modular pipelines, it might be useful for nodes to be able to access metadata rather than pipeline designers having to know what parameters to pass. Node metadata can be used to "tunnel" info: say a modular pipeline allows a user to specify a predicate function for filtering, etc. This function may want additional information as it processes data, but the module developers might not know what that is.
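
The cycle mentioned in the in-memory grouping item (A->C and B->D in memory, cross-links not) can be checked mechanically. The sketch below is an illustration only: `merge_in_memory` and `has_cycle` are hypothetical helpers, and it needs Python 3.9+ for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter


def merge_in_memory(edges, in_memory):
    """Collapse nodes joined by in-memory edges; return {group: predecessor groups}."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in in_memory:
        parent[find(a)] = find(b)

    merged = {}
    for a, b in edges:
        ga, gb = find(a), find(b)
        merged.setdefault(ga, set())
        merged.setdefault(gb, set())
        if ga != gb:  # edges inside a group disappear after merging
            merged[gb].add(ga)
    return merged


def has_cycle(predecessors):
    """True if the graph given as {node: predecessors} contains a cycle."""
    try:
        TopologicalSorter(predecessors).prepare()
    except CycleError:
        return True
    return False


edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
cyclic_after_merge = has_cycle(merge_in_memory(edges, [("A", "C"), ("B", "D")]))
cyclic_before_merge = has_cycle(merge_in_memory(edges, []))
```

After merging, the group containing A and C feeds the group containing B and D (via A->D) and vice versa (via B->C), which is the cycle.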
So vs your comment 2. -- Excellent! -- so hadn't realized that name uniqueness was enforced in Kedro. seq was to generate an artificial sequence number or hash: will get rid of.
(re your 3 and 4) -- I spell out a little more how we want to run in experiments. Basically we want to use kedro to run. But in prep hook we will ask dvc what it would do, and then propagate that information to before hooks (combining with possible code change info when [stage-2][disc-stage-2] is written) to decide which nodes to skip. (NB dvc run is perhaps misnamed. It is for maintaining pipeline definitions. (I think this is a legacy problem -- it was there before experiment tracking.) dvc repro and dvc exp run are the commands that actually run pipelines.)
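
A possible shape for the "decide which nodes to skip" step, as a sketch only. It assumes the set of stale stages has already been obtained from DVC somehow (no DVC API is called here), and it assumes that everything downstream of a stale node must also run. `nodes_to_skip` is a hypothetical helper.

```python
from collections import deque


def nodes_to_skip(edges, stale):
    """Return nodes that are neither stale nor downstream of a stale node."""
    children = {}
    all_nodes = set()
    for producer, consumer in edges:
        children.setdefault(producer, set()).add(consumer)
        all_nodes.update((producer, consumer))

    # Breadth-first walk from the stale nodes to everything they feed.
    to_run = set(stale)
    queue = deque(stale)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return all_nodes - to_run


edges = [("split", "fit"), ("split", "evaluate"), ("fit", "evaluate")]
skipped = nodes_to_skip(edges, stale={"fit"})
```

A before-node hook could then consult a set like `skipped` to bypass nodes whose outputs are still valid.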

NB: vs kedro run vs kedro dvc ... specific: In (sketchy) issue, I envision kedro dvc repro and kedro dvc exp run as the commands for running. Not sure yet vs kedro run. Should this automatically map to kedro dvc repro? Perhaps only for some pipelines? Is that opt-in or opt-out? And where to specify? (Extra option ... metadata? :))
