Create how-to documentation for "pipeline reuse" #3282

Open
stichbury opened this issue Nov 7, 2023 · 1 comment
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@stichbury
Contributor

Issue #2627 lays out (some of) the various ways that a pipeline can be dynamic.

Here is a snippet of the intro to that issue discussion:

When people refer to "Dynamic Pipeline", ... we need to make a clear distinction between them before we start to build a solution for it.

We can roughly categorise them into 2 buckets

  1. Dynamic construction of Pipeline
  2. Dynamic behavior at runtime

Dynamic construction of Pipeline (easier)

Examples of these are:

  1. Time series forecasting - one pipeline makes a prediction for Day 1, and the next pipeline requires the Day 1 prediction as input.
  2. Hyperparameter tuning
  3. Combining a variable number of features - feature engineering combines N features into 1 DataFrame
  4. A list of countries - each needs to be saved as a catalog entry, and the data is then combined in a pipeline for further processing
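
As an illustration of the "list of countries" example, here is a minimal sketch of dynamic construction in plain Python. `NodeSpec`, `clean_country`, `combine` and the dataset naming scheme are all hypothetical stand-ins invented for this sketch; in a real project the loop would sit in `pipeline.py` and build Kedro nodes instead. The point is only that the pipeline's shape is decided before the run starts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeSpec:
    """Plain-Python stand-in for a pipeline node (hypothetical, not a Kedro class)."""

    func: Callable
    inputs: List[str]
    output: str
    name: str


def clean_country(raw):
    """Placeholder per-country step."""
    return raw


def combine(*frames):
    """Placeholder step that gathers all per-country results."""
    return list(frames)


def build_country_nodes(countries: List[str]) -> List[NodeSpec]:
    """Create one cleaning node per country, then one node that combines them."""
    nodes = [
        NodeSpec(clean_country, [f"{c}_raw"], f"{c}_clean", f"clean_{c}")
        for c in countries
    ]
    nodes.append(
        NodeSpec(
            combine,
            [f"{c}_clean" for c in countries],  # variable-length input list
            "all_countries",
            "combine_countries",
        )
    )
    return nodes


specs = build_country_nodes(["fr", "de", "es"])
for spec in specs:
    print(spec.name, spec.inputs, "->", spec.output)
```

The list of countries is known when the pipeline is built, which is what places this case in the "easier" bucket.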

Dynamic behavior at runtime (harder)

Examples of these are:

  • 2nd order pipelines - pipelines generated from some node's output
    • I have a scenario where I would like to run model training and model evaluation based on the labels in a dataset. Each label
      would trigger an individual pipeline.

    • A pipeline that makes a prediction for 1 user: fetch a list of N users, then run the pipeline on each of them.

  • Running node conditionally - Run A if B does not exist, otherwise run C

Following the GetInData | Part of Xebia blog post from @marrrcin there has been conversation about how to reduce confusion and help users understand what they should do with Kedro to solve their particular requirements (which may be listed above or different again). There has been some mention of "documentation", but no clear set of requirements for the docs. This issue sets out to identify what is needed so we can deliver it.

From @astrojuanlu (some edits for clarity because this was on Slack)

...we need new content on "how to perform a parameter sweep in Kedro". There are two options as far as I understand: using different environments and the blog post.
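
To make "parameter sweep" concrete, here is a minimal sketch of expanding a parameter grid into one parameter set per run. The `expand_grid` helper and the parameter names are assumptions made for illustration; how each set is then fed to Kedro (for example one configuration environment per set) is exactly what the requested docs would need to spell out.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one parameter dict per combination of the values in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Hypothetical grid: 2 learning rates x 3 depths = 6 runs.
grid = {"learning_rate": [0.01, 0.1], "max_depth": [3, 5, 7]}
runs = list(expand_grid(grid))

for i, params in enumerate(runs):
    print(f"run_{i}", params)
```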

So this is a first requirement: Write up a comparison of the possible approaches for pipeline reuse which covers the following use cases (from @marrrcin's blog)

  • Use case 1: You have a pipeline that you want to re-run on a dataset that evolves over time - e.g. it’s a forecasting model with monthly data tables, where one month consumes data from the previous month and so on, example here.
  • Use case 2: You have a set of similar model training experiments with similar parameters that you want to run in parallel. Model parameters, used features, target columns or types of models could vary in different experiments.
  • Use case 3: You want to implement a “core” / “reusable” pipeline that could be configured for multiple business use cases and be run multiple times.
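
Use case 3 can be sketched in plain Python as follows. This is not Kedro code: `CORE_STEPS`, `with_namespace` and the `namespace.dataset` naming convention are assumptions for the sketch, loosely modelled on how namespaced pipelines keep several instances of the same pipeline from colliding on dataset names.

```python
from typing import Iterable, List, Tuple

# (step name, input datasets, output dataset)
Step = Tuple[str, List[str], str]

# Hypothetical "core" pipeline, defined once.
CORE_STEPS: List[Step] = [
    ("preprocess", ["raw_data"], "features"),
    ("train", ["features", "params:model"], "model"),
]


def with_namespace(
    steps: List[Step], namespace: str, shared: Iterable[str] = ()
) -> List[Step]:
    """Return a copy of the steps with names prefixed by the namespace.

    Datasets listed in `shared` keep their original name, so every
    instance can read the same input.
    """
    shared = set(shared)

    def rename(dataset: str) -> str:
        if dataset in shared:
            return dataset
        if dataset.startswith("params:"):
            return f"params:{namespace}.{dataset[len('params:'):]}"
        return f"{namespace}.{dataset}"

    return [
        (f"{namespace}.{name}", [rename(d) for d in inputs], rename(output))
        for name, inputs, output in steps
    ]


# Two business use cases reuse the same core steps.
churn = with_namespace(CORE_STEPS, "churn", shared={"raw_data"})
pricing = with_namespace(CORE_STEPS, "pricing", shared={"raw_data"})

for step in churn + pricing:
    print(step)
```

Each instance gets its own parameters and outputs while sharing `raw_data`, so the same definition can run once per business use case.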

Continuing from @astrojuanlu:

In the future we will expand that information with the 2 use cases @noklam identified for dynamic pipelines: "how to build pipelines dynamically" and "how to introduce dynamic behavior in pipelines", but these are yet to be researched. The answer to the first one might be "use custom Python in your pipeline.py", and the answer to the second one might be "no, Kedro does not support this".

(This is a requirement TBD so won't be covered further here).

My view is that this specific pipeline content should be covered in an advanced section about pipelines that uses a how-to guide or tutorial approach from Diataxis to explain the practical solutions available to a reader.

So this issue can be distilled to the following:

TL;DR

Create a how-to page (or set of pages) that explains how to solve use cases 1-3 above, with clear guidance that ranks the approaches from most preferred to least preferred, and links to further discussion.

This should include practical information rather than theoretical analysis. We should be very consistent in our naming convention and clear about the problem we solve and those we don't (yet) solve.

@astrojuanlu
Member

Names:

  • Pipeline reuse
  • Parameter sweep
  • Monte-carlo simulations
  • Multi-run
  • Parallel experimentation

Some of these are a bit hand-wavy, which is why I think "parameter sweep" or "pipeline reuse" are the most appropriate, but putting the other names somewhere in the text will help with discoverability.

2 participants