Create how-to documentation for "pipeline reuse" #3282

stichbury · 2023-11-07T10:35:56Z

Issue #2627 lays out (some of) the various ways that a pipeline can be dynamic.

Here is a snippet from the introduction to that issue's discussion:

When people refer to "Dynamic Pipeline", ... we need to make a clear distinction between them before we start to build a solution for it.

We can roughly categorise them into 2 buckets:

Dynamic construction of Pipeline

Dynamic behavior at runtime

Dynamic construction of Pipeline (easier)

Examples of these are:

Time series forecasting - one pipeline makes a prediction for Day 1, and the next pipeline requires the Day 1 prediction as input.

Hyperparameter tuning

Combining a variable number of features - feature engineering combines N features into 1 DataFrame

A list of countries - each needs to be saved as a catalog entry; the data are then combined in a pipeline for further processing
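As a rough illustration of the "dynamic construction" bucket, the list-of-countries case can be handled by ordinary Python that runs before the pipeline does. The sketch below is an assumption-laden stand-in, not Kedro API: `NodeSpec` is a hypothetical dataclass playing the role of a node, and the country codes, function names and dataset names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for a pipeline node: a function plus named inputs/outputs."""
    func: Callable
    inputs: tuple
    outputs: str
    name: str


def clean_country(raw):
    # Placeholder transformation: strip whitespace from every record.
    return [row.strip() for row in raw]


def combine(*tables):
    # Concatenate the per-country tables into one list.
    return [row for table in tables for row in table]


def build_country_nodes(countries):
    """Generate one cleaning node per country, then one node that combines them."""
    specs = [
        NodeSpec(clean_country, (f"{c}_raw",), f"{c}_clean", f"clean_{c}")
        for c in countries
    ]
    cleaned = tuple(spec.outputs for spec in specs)
    specs.append(NodeSpec(combine, cleaned, "all_countries", "combine_countries"))
    return specs


# The graph is fixed once this loop has run; nothing changes at runtime.
nodes = build_country_nodes(["uk", "fr", "de"])
for spec in nodes:
    print(spec.name, spec.inputs, "->", spec.outputs)
```

The point of the sketch is that the number of nodes depends on a list known at construction time, which is why this bucket is labelled "easier".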

Dynamic behavior at runtime (harder)

Examples of these are:

2nd order pipelines - pipelines generated from some node's output

I have a scenario in which I would like to run model training and model evaluation based on the labels in a dataset. Each label would trigger an individual pipeline.

A pipeline that makes a prediction for 1 user - fetch a list of N users, then run the pipeline on each of them.

Running node conditionally - Run A if B does not exist, otherwise run C
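For the conditional case, one workaround that is sometimes suggested is to fold the branch into a single function so that the surrounding graph stays static. The sketch below is plain Python under that assumption; it is not a Kedro feature, and `run_a`, `run_c` and `conditional_step` are hypothetical names.

```python
def choose_branch(b_exists: bool) -> str:
    """Decide which step to run: "A" when B is missing, otherwise "C"."""
    return "C" if b_exists else "A"


def run_a() -> str:
    # Placeholder for the work done when B does not exist.
    return "ran A"


def run_c() -> str:
    # Placeholder for the work done when B already exists.
    return "ran C"


def conditional_step(available_datasets: set) -> str:
    """Single step that hides the branch from the pipeline graph."""
    branch = choose_branch("B" in available_datasets)
    return run_a() if branch == "A" else run_c()


print(conditional_step({"B"}))
print(conditional_step(set()))
```

This keeps the decision inside one step rather than changing which nodes exist, so it does not cover cases where the branch outcome should alter the shape of the pipeline itself.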

Following the GetInData | Part of Xebia blog post from @marrrcin, there has been conversation about how to reduce confusion and help users understand what they should do with Kedro to solve their particular requirements (which may be among those listed above, or different again). "Documentation" has been mentioned, but there is no clear set of requirements for the docs. This issue sets out to identify what is needed so we can deliver it.

From @astrojuanlu (some edits for clarity because this was on Slack)

...we need new content on "how to perform a parameter sweep in Kedro". There are two options as far as I understand: using different environments, or the approach from the blog post.
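To make the term concrete, a parameter sweep boils down to expanding a search space into one parameter set per run. The sketch below shows only that expansion in plain Python; the parameter names and values are made up, and how each set would be fed to Kedro (environments or the blog post's approach) is deliberately left out.

```python
from itertools import product


def parameter_grid(space: dict) -> list:
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in product(*(space[name] for name in names))
    ]


# Hypothetical search space: 2 learning rates x 3 tree depths = 6 runs.
grid = parameter_grid({"learning_rate": [0.01, 0.1], "max_depth": [3, 5, 7]})
for i, params in enumerate(grid):
    print(f"run_{i}", params)
```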

So this is a first requirement: Write up a comparison of the possible approaches for pipeline reuse which covers the following use cases (from @marrrcin's blog)

Use case 1: You have a pipeline that you want to re-run on a dataset that evolves over time - e.g. a forecasting model with monthly data tables, where one month consumes data from the previous month and so on, example here.
Use case 2: You have a set of similar model training experiments with similar parameters that you want to run in parallel. Model parameters, used features, target columns or types of models could vary in different experiments.
Use case 3: You want to implement a “core” / “reusable” pipeline that could be configured for multiple business use cases and be run multiple times.
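Use case 3 is essentially about stamping out copies of one template with distinct dataset names. The sketch below mimics that renaming idea in plain Python, assuming a `namespace.` prefix convention like the one Kedro namespaces use for dataset names; it is not Kedro's implementation, and `namespace_datasets`, the template and the business names are hypothetical.

```python
def namespace_datasets(template: list, namespace: str, shared=()) -> list:
    """Return a copy of `template` with dataset names prefixed by `namespace.`.

    `template` is a list of (inputs, output) pairs. Names listed in `shared`
    are left unprefixed so every copy can read the same source data.
    """
    def rename(name: str) -> str:
        return name if name in shared else f"{namespace}.{name}"

    return [
        (tuple(rename(i) for i in inputs), rename(output))
        for inputs, output in template
    ]


# Hypothetical two-step template: build features, then train a model.
template = [(("raw_data",), "features"), (("features",), "model")]

retail = namespace_datasets(template, "retail", shared={"raw_data"})
wholesale = namespace_datasets(template, "wholesale", shared={"raw_data"})

for name, copy in (("retail", retail), ("wholesale", wholesale)):
    print(name, copy)
```

Because intermediate and output names no longer collide, the two copies can live side by side in one project and be run together or separately.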

Continuing from @astrojuanlu:

In the future we will expand that information with the 2 use cases @noklam identified for dynamic pipelines: "how to build pipelines dynamically" and "how to introduce dynamic behavior in pipelines", but these are yet to be researched. The answer to the first one might be "use custom Python in your pipeline.py", and the answer to the second one might be "no, Kedro does not support this".

(This is a requirement TBD so won't be covered further here).

My view is that this specific pipeline content should be covered in an advanced section about pipelines that uses a how-to guide or tutorial approach from Diataxis to explain the practical solutions available to a reader.

So this issue can be distilled to the following:

TL;DR

Create a how-to page (or set of pages) that explains how to solve use cases 1-3 above, with clear guidance ranking the approaches from most to least preferred, and links to further discussion.

This should include practical information rather than theoretical analysis. We should be very consistent in our naming convention and clear about the problem we solve and those we don't (yet) solve.

astrojuanlu · 2023-11-07T11:05:31Z

Names:

Pipeline reuse
Parameter sweep
Monte Carlo simulations
Multi-run
Parallel experimentation

Some of these are a bit hand-wavy, which is why I think "parameter sweep" or "pipeline reuse" are the most appropriate, but putting the other names somewhere in the text will help with discoverability.

stichbury added the "Component: Documentation 📄" label (Issue/PR for markdown and API documentation) Nov 7, 2023

This was referenced Nov 7, 2023

Dynamic Pipeline #2627

Open

Create one or more blog posts about Kedro & "dynamic pipelines" that sets out the various requirements, solutions and links to docs kedro-org/kedro-devrel#7

Open

stichbury added this to the Improve Kedro documentation used by advanced users milestone Nov 28, 2023

github-actions bot mentioned this issue Dec 1, 2023

Monthly issue metrics report #3375

Open
