Replies: 2 comments 4 replies
-
In my view this is indeed the most natural place for a kedro-dvc plugin to start, since the first thing that occurs to me when I hear DVC is data versioning. For me, the question boils down to: can we easily swap out kedro's dataset versioning for DVC's? My naive initial idea for how that could be achieved is quite different from what you've outlined above. Here's very roughly how I imagined it:
This would be implemented through a custom … I haven't thought through any of the details of this, so I don't really know how feasible it is (e.g. kedro's default …)
-
And a few random comments on the proposed implementation:
-
Representation
For relevant entries in the kedro data catalog, we would like to write or update a `.dvc` file.
When a data file is added to dvc using `dvc add`, the file is moved to the dvc cache, a link is created at its original location, the link is placed in `.gitignore`, and the `.dvc` file is created -- by default with the same name and directory as the data file, but with the `.dvc` suffix added to the name (so `data.parq` would be replaced by a link, and `data.parq.dvc` would be created next to it to store the `.dvc` metadata). The process is detailed in the dvc add documentation.

DVC data dependencies can include both other data files and code modules, explicitly referenced in a dvc file.
Kedro tracks data via the data catalog. The catalog is in a fixed place in the kedro repo (`conf/base/catalog.yml`). Entries can be superseded by a local catalog (`conf/local/catalog.yml`), which is (putatively) in `.gitignore` and can specify different locations, etc. for particular copies of the repo.

There are a number of mismatches between dvc and kedro:

- With kedro's dataset versioning, `data.parq` will be transformed into `data.parq/<version>/data.parq`
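To make that mismatch concrete, here is a small sketch contrasting the two layouts (the paths and version string are made-up examples, and the function names are hypothetical):

```python
from pathlib import PurePosixPath

def kedro_versioned_path(filepath: str, version: str) -> str:
    # Kedro's versioned datasets nest each save under the original
    # path: data.parq -> data.parq/<version>/data.parq
    p = PurePosixPath(filepath)
    return str(p / version / p.name)

def dvc_metafile_path(filepath: str) -> str:
    # dvc keeps the data path as-is and versions via a sibling
    # metafile plus git history: data.parq -> data.parq.dvc
    return filepath + ".dvc"

print(kedro_versioned_path("data.parq", "2022-01-01T00.00.00.000Z"))
# data.parq/2022-01-01T00.00.00.000Z/data.parq
print(dvc_metafile_path("data.parq"))
# data.parq.dvc
```

In other words, kedro versions in "space" (parallel directories), while dvc versions in "repo time" (git history of the metafile) -- which is why the plan below adopts the dvc approach for versioning.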
Kedro-DVC Implementation
At first, we will focus on mapping those parts which correspond most naturally.
- Create `{CONF_ROOT}/{env}/dvc` for each environment.
- Write `.dvc` entries in `{CONF_ROOT}/{env}/dvc`, either on the `after_catalog_created` hook, or using a utility: `kedro dvc data update`.
- Each `.dvc` file will be named after the data catalog entry.
- Renaming: …
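Under those conventions, the location of the generated metafile for a catalog entry could be computed as follows (a sketch only; `entry_metafile` and the example entry name are hypothetical, not part of the plugin):

```python
from pathlib import PurePosixPath

def entry_metafile(conf_root: str, env: str, entry: str) -> str:
    """Per the layout above: one .dvc file per catalog entry, named
    after the entry, under {CONF_ROOT}/{env}/dvc."""
    return str(PurePosixPath(conf_root) / env / "dvc" / f"{entry}.dvc")

print(entry_metafile("conf", "base", "companies"))
# conf/base/dvc/companies.dvc
```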
When it comes to versioning, however, we will do things the dvc way, by using the "repo time" dimension rather than the space dimension.
conf/base vs conf/local and environments
Kedro has "in repo" data in "base" and "out of repo" data in "local". They suggest:
Kedro also supports environments via `KEDRO_ENV` or a CLI option. `local` is, in effect, the default environment, but it can be overridden.

In DVC, a better paradigm would be to store prototype data in a branch that we don't push. (Note: git supports e.g. `git config branch.local.pushRemote no_push`.)

By default, we will handle the mapping this way:
- Create a `.dvc` file for each entry in each environment, in `{CONF_ROOT}/{env}/dvc`, where `env` is the environment name.

Pipelines which use them need to have global definitions, however -- see pipelines for the plan.
(NB: based on feedback from @AntonyMilneQB -- below.)
Partitioned Data
See discussion
External (File) Data
Initially, we won't support external data. However, both kedro and dvc support it. Adding at least limited support shouldn't be too hard.
Other data sources; artifacts.
Full list of supported data
Kedro also supports other data sources (e.g. databases). For the moment, kedro-dvc will not support them. Stage artifacts can be handled, for the most part, like data files.