stage 1: experiments #14

shaunc · 2022-02-11T00:48:27Z

shaunc
Feb 11, 2022
Maintainer

Representation

For Kedro, an experiment is a "run of a pipeline". Kedro defines pipeline runners and allows the definition of some artifacts. However, it does not represent experiments -- runs of pipelines -- as first-class objects, and versioned data, which can be the "roots" of different experiments, are represented "spatially" in directories with timestamps. Extending this paradigm to full experiments would mean placing them all beside each other.
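To make the "spatial" layout concrete, here is a small sketch of how a versioned Kedro dataset ends up on disk: each save lands in a timestamp-named directory below the dataset's path. The helper names, the example path, and the exact timestamp format are assumptions for illustration, not Kedro's own code.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def version_label(moment: datetime) -> str:
    """Format a timestamp as a directory-safe version label (assumed format)."""
    # Millisecond precision, with dots instead of colons so the label
    # can be used as a directory name.
    return moment.strftime("%Y-%m-%dT%H.%M.%S.%f")[:-3] + "Z"


def versioned_path(filepath: str, moment: datetime) -> PurePosixPath:
    """Return <filepath>/<version>/<filename>, one directory per saved version."""
    base = PurePosixPath(filepath)
    return base / version_label(moment) / base.name


saved_at = datetime(2022, 2, 11, 0, 48, 27, 123456, tzinfo=timezone.utc)
print(versioned_path("data/01_raw/cars.csv", saved_at))
# data/01_raw/cars.csv/2022-02-11T00.48.27.123Z/cars.csv
```

Two runs that start from different versions of the same data differ only in which sibling directory they read, which is why experiments built this way sit side by side rather than in a history.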

In DVC, experiment management is a core feature. With a nod to Borges, experiment runs are represented primarily using the (forking) "time" dimension of the underlying git repo. However, when persisting experiments, different (or hybrid) patterns are possible. By default, when a pipeline is run, stages whose dependencies have not changed are skipped, with outputs restored from the run cache.

DVC supports branching in experiments by means of checkpoints. Although primarily language-neutral, DVC has a Python utility that wraps stage runs written in Python to generate node checkpoints automatically.

Implementation

We rely on DVC to represent experiments. In order to relate experiments with pipelines, we use the [experiment naming mechanism][d-exp-name] to tie particular experiments to particular runs.

DVC has two methods of running experiments. The basic one, dvc repro, does the following:

1. Determine which steps' dependencies have changed (using dvc.lock).
2. Run those steps, saving results to the run cache.
3. Cache results and place hashes in dvc.lock.
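The three repro steps can be sketched as a toy model in plain Python. This is not DVC's implementation: `toy_repro`, the in-memory `files` and `lock` dictionaries, and the `run_stage` callback are all hypothetical stand-ins, and the sketch ignores outputs, the run cache, and stage ordering.

```python
import hashlib
from typing import Callable, Dict, List


def digest(content: str) -> str:
    """Content hash standing in for the checksums recorded in dvc.lock."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def toy_repro(
    stages: Dict[str, List[str]],     # stage name -> dependency paths
    files: Dict[str, str],            # dependency path -> current content
    lock: Dict[str, Dict[str, str]],  # stage name -> recorded dependency hashes
    run_stage: Callable[[str], None],
) -> List[str]:
    """Run only the stages whose dependency hashes differ from the lock."""
    ran = []
    for name, dep_paths in stages.items():
        current = {path: digest(files[path]) for path in dep_paths}
        # Step 1: a stage is stale when its hashes differ from the lock.
        if lock.get(name) == current:
            continue
        # Step 2: run the stale stage.
        run_stage(name)
        # Step 3: record the new hashes so the next call sees it as up to date.
        lock[name] = current
        ran.append(name)
    return ran


files = {"raw.csv": "a,b\n1,2\n", "params.yaml": "lr: 0.1\n"}
stages = {"prepare": ["raw.csv"], "train": ["params.yaml"]}
lock: Dict[str, Dict[str, str]] = {}

print(toy_repro(stages, files, lock, run_stage=lambda name: None))
# ['prepare', 'train']
files["params.yaml"] = "lr: 0.2\n"
print(toy_repro(stages, files, lock, run_stage=lambda name: None))
# ['train']
```

Passing the runner in as a callback keeps the change detection and the lock update separate from how a stage is actually executed.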

The second method, dvc exp run, wraps dvc repro in the context of a particular experiment:

1. Create a tag for the experiment (referencing the previous experiment, if any).
2. Associate the tag with a stash of the current datasets.
3. Proceed with dvc repro.

Kedro-DVC provides a way to use these DVC facilities while replacing repro step 2 ("run those steps") with a Kedro runner, and using the before_node_run hook to skip steps that DVC has determined need not be run.

The plan is to use DVC's Repo.status() for repro step (1), and to call Repo.commit() for repro step (3). The "run exp" steps can be accomplished with DVC's Repo.experiments.new().
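A minimal sketch of that plan, written against a duck-typed `repo` object so it runs without a DVC project. It assumes that `Repo.status()` returns a mapping keyed by stale stage name and that `Repo.commit()` accepts a stage name and a `force` flag. These are assumptions about DVC's internal Python API, which is not a documented stable interface. `repro_with_runner`, `run_stage`, and `FakeRepo` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def repro_with_runner(repo: Any, run_stage: Callable[[str], None]) -> List[str]:
    """Run stale stages with an external runner, then let DVC record the results."""
    # Repro step (1): ask DVC which stages are out of date
    # (assumed shape: {stage_name: changes}).
    stale = repo.status()
    ran = []
    for stage_name in stale:
        # Repro step (2), replaced: run the stage through the supplied runner.
        run_stage(stage_name)
        # Repro step (3): have DVC cache outputs and update dvc.lock
        # (force=True is an assumption, intended to avoid an interactive prompt).
        repo.commit(stage_name, force=True)
        ran.append(stage_name)
    return ran


class FakeRepo:
    """Stand-in for dvc.repo.Repo, used only to exercise the sketch."""

    def __init__(self, stale: Dict[str, list]) -> None:
        self._stale = dict(stale)
        self.committed: List[str] = []

    def status(self) -> Dict[str, list]:
        return dict(self._stale)

    def commit(self, target: str, force: bool = False) -> None:
        self.committed.append(target)
        self._stale.pop(target, None)


repo = FakeRepo({"train": ["changed deps"]})
print(repro_with_runner(repo, run_stage=lambda name: None))  # ['train']
print(repo.status())  # {}
```

With a real project, `repo` would be a `dvc.repo.Repo` instance and `run_stage` would delegate to a Kedro runner. Both substitutions are untested here.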

