Replies: 2 comments 4 replies
-
In my view this is indeed the most natural place for a kedro-dvc plugin to start, since the first thing that occurs to me when I hear DVC is data versioning. For me, the question boils down to: can we easily swap out kedro's dataset versioning for DVC's? My naive initial idea for how that could be achieved is quite different from what you've outlined above. Here's very roughly how I imagined it:
This would be implemented through a custom … I haven't thought through any of the details of this, so I don't really know how feasible it is (e.g. kedro's default …)
-
And a few random comments on the proposed implementation:
-
Representation
For relevant entries in the kedro data catalog, we would like to write or update a `.dvc` file.
When a data file is added to dvc using `dvc add`, the file is moved to the dvc cache, a link is created at its original location, the link is placed in `.gitignore`, and the `.dvc` file is created -- by default with the same name and directory as the data file, but with the `.dvc` suffix added to the name (so `data.parq` would be replaced by a link, and `data.parq.dvc` would be created next to it to store the `.dvc` metadata). The process is detailed in the dvc add documentation.

DVC data dependencies can include both other data files and code modules, explicitly referenced in a dvc file.
Kedro tracks data via the data catalog. The catalog is in a fixed place in the kedro repo (`conf/base/catalog.yml`). Entries can be superseded by a local catalog (`conf/local/catalog.yml`), which is (putatively) in `.gitignore` and can specify different locations, etc. for particular copies of the repo.

There are a number of mismatches between dvc and kedro:

- With kedro's dataset versioning, `data.parq` will be transformed into `data.parq/<version>/data.parq`
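To make that mismatch concrete, here is a small sketch contrasting the two layouts (the paths and version string are made-up examples, and the function names are hypothetical):

```python
from pathlib import PurePosixPath

def kedro_versioned_path(filepath: str, version: str) -> str:
    # Kedro's versioned datasets nest each save under the original
    # path: data.parq -> data.parq/<version>/data.parq
    p = PurePosixPath(filepath)
    return str(p / version / p.name)

def dvc_metafile_path(filepath: str) -> str:
    # dvc keeps the data path as-is and versions via a sibling
    # metafile plus git history: data.parq -> data.parq.dvc
    return filepath + ".dvc"

print(kedro_versioned_path("data.parq", "2022-01-01T00.00.00.000Z"))
# data.parq/2022-01-01T00.00.00.000Z/data.parq
print(dvc_metafile_path("data.parq"))
# data.parq.dvc
```

In other words, kedro versions in "space" (parallel directories), while dvc versions in "repo time" (git history of the metafile) -- which is why the plan below adopts the dvc approach for versioning.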
Kedro-DVC Implementation
At first, we will focus on mapping those parts which correspond most naturally.
- Create `{CONF_ROOT}/{env}/dvc` for each environment.
- Write `.dvc` entries in `{CONF_ROOT}/{env}/dvc`, either on the `after_catalog_created` hook, or using a utility: `kedro dvc data update`.
- Each `.dvc` file will be named after the data catalog entry.
- Renaming: …
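Under those conventions, the location of the generated metafile for a catalog entry could be computed as follows (a sketch only; `entry_metafile` and the example entry name are hypothetical, not part of the plugin):

```python
from pathlib import PurePosixPath

def entry_metafile(conf_root: str, env: str, entry: str) -> str:
    """Per the layout above: one .dvc file per catalog entry, named
    after the entry, under {CONF_ROOT}/{env}/dvc."""
    return str(PurePosixPath(conf_root) / env / "dvc" / f"{entry}.dvc")

print(entry_metafile("conf", "base", "companies"))
# conf/base/dvc/companies.dvc
```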
When it comes to versioning, however, we will do things the dvc way, by using the "repo time" dimension rather than the space dimension.
conf/base vs conf/local and environments
Kedro has "in repo" data in "base" and "out of repo" data in "local". They suggest:
Kedro also supports environments via `KEDRO_ENV` or a CLI option. `local` is, in effect, the default environment, but it can be overridden.

In DVC, a better paradigm would be to store prototype data in a branch that we don't push. (Note: git supports e.g. `git config branch.local.pushRemote no_push`.)

By default, we will handle the mapping this way:
- Create a `.dvc` file for each entry in each environment, in `{CONF_ROOT}/{env}/dvc`, where `env` is the environment name.

Pipelines which use them need to have global definitions, however -- see pipelines for the plan.
(NB: based on feedback from @AntonyMilneQB -- below.)
Partitioned Data
See discussion
External (File) Data
Initially, we won't support external data. However, both kedro and dvc support it. Adding at least limited support shouldn't be too hard.
Other data sources; artifacts.
Full list of supported data
Kedro also supports other data sources (e.g. databases). For the moment, kedro-dvc will not support them. Stage artifacts can be handled, for the most part, like data files.