Feature: maintain correspondence between kedro and dvc pipelines #17

Open
45 tasks
shaunc opened this issue Feb 14, 2022 · 0 comments
Labels
breakdown Break down issues w/ >1D expected implementation feature tracking issue for feature

shaunc commented Feb 14, 2022

In order to run kedro pipelines together with dvc experiment tracking, we need to maintain a correspondence between kedro pipelines and dvc pipelines defined in dvc.yaml files.

See discussion.

For each named pipeline in kedro.framework.project.pipelines (which reads pipelines from src/<project_package>/pipeline_registry.py), we create a dvc.yaml file as described in the discussion.
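As a rough sketch of that mapping (not the layout agreed in the discussion, which is not reproduced here), each kedro node could become one dvc stage whose deps and outs are the file paths of the node's datasets. `NodeSpec`, `build_dvc_pipeline`, the `paths` mapping and the default `cmd_template` are hypothetical names introduced for illustration; the `kedro run` flags in the template are an assumption and may need adjusting for the kedro version in use.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class NodeSpec:
    """Hypothetical stand-in for a kedro node: a name plus the dataset names it uses."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def build_dvc_pipeline(
    pipeline_name,
    nodes,
    paths,
    # Assumed command line; adjust to whatever the project actually runs.
    cmd_template="kedro run --pipeline={pipeline} --node={node}",
):
    """Return a dict shaped like a dvc.yaml file, with one stage per node.

    `paths` maps dataset names to file paths (an assumed, simplified view
    of the data catalog). Returns None for an empty pipeline.
    """
    if not nodes:
        warnings.warn(f"pipeline {pipeline_name!r} is empty; nothing generated")
        return None

    stages = {}
    for node in nodes:
        stage = {"cmd": cmd_template.format(pipeline=pipeline_name, node=node.name)}
        if node.inputs:
            stage["deps"] = [paths[name] for name in node.inputs]
        if node.outputs:
            stage["outs"] = [paths[name] for name in node.outputs]
        stages[node.name] = stage
    return {"stages": stages}
```

With real kedro objects, the same loop would presumably iterate over `pipeline.nodes` and read each node's `inputs` and `outputs`; the stand-in class only keeps the sketch runnable without a kedro project.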

NOTE: the generated file does not include stage parameters yet.

Environments are not tested until the last case; before that, all inputs and outputs should be in the base data catalog.
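One way to check that every input and output is declared and of a supported type is sketched below. It assumes a simplified catalog that maps each dataset name to a kind string (`"data"`, `"metric"`, `"plot"` or `"memory"`); that representation, along with `CatalogError`, `classify_dataset` and `group_outputs`, is hypothetical and not part of kedro or dvc.

```python
# Kinds that can be written to dvc.yaml, mapped to the stage key they go under.
KIND_TO_STAGE_KEY = {"data": "outs", "metric": "metrics", "plot": "plots"}


class CatalogError(ValueError):
    """Raised when a dataset is undeclared or of an unsupported type."""


def classify_dataset(name, catalog):
    """Return the kind of `name`, or raise CatalogError if it cannot be tracked."""
    if name not in catalog:
        raise CatalogError(f"dataset {name!r} is not declared in the catalog")
    kind = catalog[name]
    if kind not in KIND_TO_STAGE_KEY:
        raise CatalogError(f"dataset {name!r} has unsupported type {kind!r}")
    return kind


def group_outputs(outputs, catalog):
    """Sort a node's outputs into the outs, metrics and plots lists of a stage."""
    groups = {"outs": [], "metrics": [], "plots": []}
    for name in outputs:
        groups[KIND_TO_STAGE_KEY[classify_dataset(name, catalog)]].append(name)
    return groups
```

Inputs can go through `classify_dataset` alone, since they all end up as deps regardless of kind.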

Cases (given a fixed data catalog):

  • by pipeline type
    • empty pipeline -- warning and no output
    • pipeline with one node and:
      • inputs:
        • no inputs
        • one data input
        • two data inputs
        • Error: data input of an unsupported type (e.g. memory)
        • Error: input not declared
      • outputs:
        • no outputs
        • one data output
        • one plot output
        • one metric output
        • one each of data, metric, plot
        • Error: output of an unsupported type (e.g. memory)
        • Error: output not declared
    • pipeline with 2 nodes: start, intermediate and final data
    • 2 pipelines: each with one node
    • modular pipelines: one pipeline containing another containing a node
    • modular pipelines: one overall pipeline using sub-pipeline twice in different namespaces
  • by dvc.yaml status:
    • dvc.yaml does not exist: create
    • exists, and corresponds: do nothing
    • error: dvc.yaml exists and is different
    • with --force: same as above, except that an existing dvc.yaml that differs is overwritten
  • by data catalog environment:
    • in base environment
    • some catalog items in base, others in test
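The dvc.yaml status cases above amount to a small decision table. A minimal sketch follows, assuming the expected file content is already rendered as a string and that "corresponds" means an exact text match (a real implementation might compare parsed YAML instead). `sync_dvc_yaml` and `DvcYamlConflict` are hypothetical names.

```python
from pathlib import Path


class DvcYamlConflict(RuntimeError):
    """Raised when dvc.yaml exists, differs, and overwriting was not requested."""


def sync_dvc_yaml(path: Path, expected: str, force: bool = False) -> str:
    """Bring the file at `path` in line with `expected`; report what was done."""
    if not path.exists():
        path.write_text(expected)
        return "created"
    if path.read_text() == expected:
        return "unchanged"
    if not force:
        raise DvcYamlConflict(f"{path} exists and differs from the kedro pipeline")
    path.write_text(expected)
    return "overwritten"
```

Returning a status string rather than printing keeps each of the four cases easy to assert on in tests.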

Scenarios:

  • kedro dvc pipelines update
  • on hook before_pipeline_run:
    • same as kedro dvc data update without --force, for the current pipeline
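The hook scenario could be wired up roughly as below. The class name `DvcPipelineSyncHooks` and the injected `update_pipeline` callable are hypothetical; the sketch assumes kedro's `before_pipeline_run` hook receives a `run_params` dict with a `pipeline_name` entry that is empty for the default pipeline. The import fallback exists only so the sketch can run without kedro installed.

```python
from typing import Any, Callable

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch runs without kedro; a real plugin would not need it.
    def hook_impl(func):
        return func


class DvcPipelineSyncHooks:
    """Runs the update (without --force) for the pipeline about to execute."""

    def __init__(self, update_pipeline: Callable[[str, bool], None]):
        # `update_pipeline(name, force)` is a hypothetical callable that
        # performs the same work as the update command for one pipeline.
        self._update_pipeline = update_pipeline

    @hook_impl
    def before_pipeline_run(self, run_params: dict[str, Any]) -> None:
        # Assumption: an unset pipeline name means kedro's default pipeline.
        name = run_params.get("pipeline_name") or "__default__"
        self._update_pipeline(name, False)
```

Passing the update function in through the constructor keeps the hook testable without touching the file system.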

Issues:

@shaunc shaunc added the feature tracking issue for feature label Feb 14, 2022
@shaunc shaunc added this to the Stage 1 milestone Feb 14, 2022
@shaunc shaunc added the breakdown Break down issues w/ >1D expected implementation label Feb 19, 2022