Performance analysis and smart re-running of pipelines #2057

merelcht · 2022-11-23T15:53:01Z

Description

Kedro currently doesn't offer any options to analyse the performance of pipelines. Additionally, our users have flagged that they would like to be able to re-run only parts of their pipeline.

Implementation ideas

Stats on performance of pipelines
Record Kedro run metadata and use it for suggesting pipeline optimisations
Make Kedro run faster
Re-running a Kedro pipeline should be smarter: only re-run what is necessary to refresh the data, e.g. "kedro run -xxx"

Questions

noklam · 2022-11-29T16:27:28Z

I think it's better to further split this into two issues, but I will leave my comments for both topics.

We have something like PipelineMonitoringHook in our docs, but it requires some infrastructure and is not easy for regular users to set up.

Pipeline statistics

How would these pipeline stats be useful?

Just having a summary of the statistics would be very useful for understanding pipeline performance, instead of parsing the logs yourself.
In addition, we could add a visualization on the kedro-viz side to show where the bottleneck is and help users optimize their pipelines.
The pipeline statistics could also be useful for ParallelRunner or something similar. Currently the workload is distributed naively, but not every node is equal.
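As a rough illustration of how such statistics could be collected, here is a minimal sketch built on Kedro's `before_node_run` and `after_node_run` hook specs. The class name `NodeTimingHooks` and its `summary` method are hypothetical, not an existing Kedro feature, and the fallback decorator is only there so the snippet also runs without Kedro installed.

```python
import time
from collections import defaultdict

try:
    # Real Kedro marker for hook implementations.
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch still runs without Kedro installed.
    def hook_impl(func):
        return func


class NodeTimingHooks:
    """Hypothetical hook class: records wall-clock time per node."""

    def __init__(self):
        self._started = {}
        self.durations = defaultdict(list)

    @hook_impl
    def before_node_run(self, node):
        # Remember when this node started.
        self._started[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        # Store the elapsed time; a node may run more than once per process.
        elapsed = time.perf_counter() - self._started.pop(node.name)
        self.durations[node.name].append(elapsed)

    def summary(self):
        """Return (node name, total seconds) pairs, slowest first."""
        totals = {name: sum(times) for name, times in self.durations.items()}
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

In a project this would presumably be registered through `HOOKS = (NodeTimingHooks(),)` in `settings.py`. Note that this only measures the node function itself; dataset loading and saving would need the corresponding dataset hooks.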

Smarter way to re-run the pipeline

Similar to the Pipeline's run_only_missing, but more sophisticated. During development, it's common to work on one particular node and only need to refresh that node (or a few dependent nodes). We can do some back-tracking.
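To make the back-tracking idea concrete, here is a minimal sketch that uses plain dictionaries rather than Kedro's own `Pipeline` objects. The function `nodes_to_run` and the example node and dataset names are made up for illustration. Starting from the node being worked on, it walks upstream and only pulls in a producer node when one of the required inputs does not exist yet.

```python
def nodes_to_run(nodes, target, existing):
    """Back-track from `target` to find the nodes that must run.

    nodes:    dict mapping node name -> (input datasets, output datasets)
    target:   name of the node we want to refresh
    existing: set of dataset names that are already persisted
    """
    # Map each dataset to the node that produces it.
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}

    needed = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        for dataset in nodes[name][0]:
            # Only go upstream when the input is missing and can be produced.
            if dataset not in existing and dataset in producer:
                stack.append(producer[dataset])
    return needed


example_nodes = {
    "clean": (["raw"], ["cleaned"]),
    "features": (["cleaned"], ["feats"]),
    "train": (["feats"], ["model"]),
}
# "cleaned" is already on disk, so "clean" does not need to run again.
to_run = nodes_to_run(example_nodes, "train", existing={"raw", "cleaned"})
```

This only checks whether a dataset exists, not whether it is up to date, which is where some memory of previous runs would come in.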

Alternative

Currently, users have to figure out which nodes are unnecessary and use kedro run --from-nodes to skip the unnecessary computation.

Summary

One key realization from this change is that a run needs to have memory. To optimize runtime performance, it needs to know how it was run previously. To re-run the pipeline in a smart way, it needs to know about the previous run(s) and figure out the minimal computation required.
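One possible shape for that memory, sketched under several assumptions: each node gets a fingerprint derived from a caller-supplied version string (standing in for its code and parameters) plus the fingerprints of its upstream nodes, and the fingerprints from the last run are kept somewhere. The function `plan_rerun` and the data layout are hypothetical, and the node dictionary is assumed to be in topological order.

```python
import hashlib


def plan_rerun(nodes, versions, previous):
    """Decide which nodes are stale compared to the previous run.

    nodes:    dict node name -> list of upstream node names,
              assumed to be in topological order
    versions: dict node name -> string identifying the node's code/params
    previous: dict node name -> fingerprint recorded by the last run
    Returns (stale node names in order, current fingerprints).
    """
    current = {}
    stale = []
    for name, upstream in nodes.items():
        # A node's fingerprint depends on itself and everything upstream.
        parts = [versions[name]] + [current[u] for u in upstream]
        current[name] = hashlib.sha256("|".join(parts).encode()).hexdigest()
        if previous.get(name) != current[name]:
            stale.append(name)
    return stale, current


dag = {"clean": [], "features": ["clean"], "train": ["features"], "report": ["clean"]}
versions = {"clean": "v1", "features": "v1", "train": "v1", "report": "v1"}

# First run: nothing is remembered, so everything is stale.
first_stale, remembered = plan_rerun(dag, versions, previous={})

# Only "features" changed: it and its downstream node "train" must re-run.
versions["features"] = "v2"
second_stale, _ = plan_rerun(dag, versions, previous=remembered)
```

Because a changed fingerprint propagates downstream automatically, the set of stale nodes is the minimal set to recompute under these assumptions. Changes to raw input data are not covered here and would need their own fingerprint.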

noklam · 2022-12-02T11:11:32Z

Related Issue:

Draft Pull Request: Add incremental run method #2005

astrojuanlu · 2023-11-24T14:54:08Z

This is actually a very frequent question; I will try to collect more evidence for it going forward.

There are different things to consider when it comes to performance, namely (1) execution time and (2) RAM usage. There are different tools for each of these purposes, so most likely we would need dedicated efforts.
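To illustrate the two measurements, here is a small sketch using only the standard library: `time.perf_counter` for execution time and `tracemalloc` for peak memory. These are stand-ins chosen to keep the example dependency-free, not the tools discussed in this thread, and the helper name `measure` is made up.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run `func` once and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        seconds = time.perf_counter() - start
        # Peak size of Python-level allocations since tracemalloc.start().
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, seconds, peak


def build_list(n):
    return [0] * n


result, seconds, peak = measure(build_list, 100_000)
```

`tracemalloc` only sees allocations made through Python's allocator and slows execution down, so the timing above is inflated. That is one reason to treat time and memory profiling as separate efforts.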

I think execution time is probably the most urgent one. This is how I used pyinstrument: #3033 (comment)
