Don't toposort nodes in non-user-facing operations #3146

deepyaman · 2023-10-09T15:45:56Z

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

https://kedro-org.slack.com/archives/C054U0LJLAC/p1696862244884179

Development notes

Docs failure fixed by #3152

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

astrojuanlu

Why not make def nodes a cached_property instead? Are pipelines supposed to be mutable?

deepyaman · 2023-10-09T15:57:03Z

Why not make def nodes a cached_property instead?

~~I think we can (and quite possibly should), but I also think~~ there's not even a need to sort once on a lot of these operations. Almost none of these operations are user-facing.

Edit: I've changed my mind; accessing nodes isn't that expensive, when a person does it. The issue is happening when we're unnecessarily accessing the toposorted version over and over again from our code.

We could still make it a cached property (depending on the answer to your other question), but I think whether it's a cached property or not doesn't actually make a significant difference for user behaviors, and just using the unsorted version is more efficient regardless for the motivating case (and a lot of the update operations in general).

Are pipelines supposed to be mutable?

I don't know; I suppose they could be? But I don't think anybody would suffer much if they weren't.

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

marrrcin

I've tested it some time ago as part of #3167 - it contributed to a small (~5%) improvement.

noklam · 2023-11-17T05:22:28Z

Add more context, since the original slack thread is in a private channel I cannot get a linen copy.

I’ve stumped upon the performance issue with huge pipeline sums - it appears in the context of (obviously) dynamic pipelines / pipelines that are generated in loops.
It seems like the sum of pipelines is O(n^2) and it starts getting painful for our client 😞
In the chart n is the number of pipelines to sum, time is … time in seconds.

noklam

I'd like to understand this PR more. Is the description of this PR updated? AFAIK, nodes does not do any additional toposorts so I am not sure what's the fix for.

Does this actually helps performance or is just a clean up?

noklam · 2023-11-17T05:29:50Z

An archive of the discussion (terribly formatted)

marrrcin
1 month ago
I’ve stumped upon the performance issue with huge pipeline sums - it appears in the context of (obviously) dynamic pipelines / pipelines that are generated in loops.
It seems like the sum of pipelines is O(n^2) and it starts getting painful for our client 😞
In the chart n is the number of pipelines to sum, time is … time in seconds.
plot.png

plot.png

49 replies

marrrcin
1 month ago
Repro code (modified spaceflights starter)
def create_pipeline(**kwargs) -> Pipeline:
data_engineering_pipeline = pipeline(
[
node(
func=preprocess_companies,
inputs="companies",
outputs="preprocessed_companies",
name="preprocess_companies_node",
),
node(
func=preprocess_shuttles,
inputs="shuttles",
outputs="preprocessed_shuttles",
name="preprocess_shuttles_node",
),
node(
func=create_model_input_table,
inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
outputs="model_input_table",
name="create_model_input_table_node",
),
]
)

# TEST
import time
pipelines = []
MAX = 500
for i in range(MAX + 1):
    pipelines.append(
        pipeline(
            data_engineering_pipeline,
            inputs={"companies": "companies",
                    "shuttles": "shuttles",
                    "reviews": "reviews"},
            namespace=f"namespace_{i}",
        )
    )
data = []
for n in range(1, MAX, 10):
    start = time.monotonic()
    _ = sum(pipelines[:n])
    end = time.monotonic()
    print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
    data.append((n, end - start))

import pandas as pd
df = pd.DataFrame(data, columns=["n", "time"])
df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")

marrrcin
1 month ago
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.010s
Sum of 21 pipelines took: 0.036s
Sum of 31 pipelines took: 0.080s
Sum of 41 pipelines took: 0.145s
Sum of 51 pipelines took: 0.217s
Sum of 61 pipelines took: 0.300s
Sum of 71 pipelines took: 0.477s
Sum of 81 pipelines took: 0.549s
Sum of 91 pipelines took: 0.727s
Sum of 101 pipelines took: 0.868s
Sum of 111 pipelines took: 1.137s
Sum of 121 pipelines took: 1.311s
Sum of 131 pipelines took: 1.568s
Sum of 141 pipelines took: 1.824s
Sum of 151 pipelines took: 2.161s
Sum of 161 pipelines took: 2.836s
Sum of 171 pipelines took: 2.722s
Sum of 181 pipelines took: 3.906s
Sum of 191 pipelines took: 4.064s
Sum of 201 pipelines took: 4.644s
Sum of 211 pipelines took: 5.507s
Sum of 221 pipelines took: 6.610s
Sum of 231 pipelines took: 5.762s
Sum of 241 pipelines took: 6.654s
Sum of 251 pipelines took: 7.717s
Sum of 261 pipelines took: 8.446s
Sum of 271 pipelines took: 9.495s
Sum of 281 pipelines took: 10.725s
Sum of 291 pipelines took: 13.334s
Sum of 301 pipelines took: 12.023s
Sum of 311 pipelines took: 12.660s
Sum of 321 pipelines took: 12.584s
Sum of 331 pipelines took: 12.685s
Sum of 341 pipelines took: 13.855s
Sum of 351 pipelines took: 14.573s
Sum of 361 pipelines took: 15.114s
Sum of 371 pipelines took: 15.624s
Sum of 381 pipelines took: 16.656s
Sum of 391 pipelines took: 26.046s
Sum of 401 pipelines took: 25.595s
Sum of 411 pipelines took: 25.409s
Sum of 421 pipelines took: 23.397s
Sum of 431 pipelines took: 22.718s
Sum of 441 pipelines took: 23.516s
Sum of 451 pipelines took: 31.217s
Sum of 461 pipelines took: 47.941s
Sum of 471 pipelines took: 47.578s
Sum of 481 pipelines took: 37.139s
Sum of 491 pipelines took: 35.174s

Ivan Danov
1 month ago
The code for adding pipelines is this:
def add(self, other):
if not isinstance(other, Pipeline):
return NotImplemented
return Pipeline(set(self.nodes + other.nodes))

marrrcin
1 month ago
I think it’s toposort called N times that causes this

Ivan Danov
1 month ago
It's not the most efficient, but it shouldn't be quadratic... Unless something happens in Pipeline creation, e.g. the toposorting here might be challenging:
self._topo_sorted_nodes = _topologically_sorted(self.node_dependencies)

juanlu
1 month ago
ugh, good insight
@marrrcin

would you please open an issue about this? unlikely to affect small projects of course, but annoying enough for big projects we should be looking at it
👍
1

Ivan Danov
1 month ago
Well, summing more than 10 pipelines together is not a realistic scenario tbh...

datajoely
1 month ago
Well, summing more than 10 pipelines together is not a realistic scenario tbh...
Users will and do hit this
:this:
2

datajoely
1 month ago
I wonder if we can make toposort a bit more lazy

Ivan Danov
1 month ago
It's not the toposort thing... It's the way Python works

Ivan Danov
1 month ago
This one is summing two lists, try the same thing with integers and you will get the same result

marrrcin
1 month ago
We have a client who has ~20 instances of a single pipeline and ~40 pipelines in total (they are 100+ nodes) and kedro run --pipeline=only_single_pipeline takes 10+ minutes for them (to start executing) (edited)
😮
1

Ivan Danov
1 month ago
(we only add extra copying around)

Ivan Danov
1 month ago
Are you sure it's the pipelines, and not the config for those pipelines?

marrrcin
1 month ago
What do you mean by “config for those pipelines”?

datajoely
1 month ago
the reproducible example was above

Ivan Danov
1 month ago
@datajoely
the reproducible example shows that when you sum 100 pipelines, it takes less than 1s...

datajoely
1 month ago
good point!

marrrcin
1 month ago
I guess it also grows with nodes - I will do this test tomorrow - in client’s code base I’ve narrowed it down to the Pipeline constructor that causes the issue

Ivan Danov
1 month ago
@marrrcin
most of the time slowdowns happen due to large catalogs, not large pipelines
👍
1

Ivan Danov
1 month ago
have you checked if the same thing happens with an empty catalog?

marrrcin
1 month ago
Will do tomorrow!

juanlu
1 month ago
we don't need to speculate hopefully,
@marrrcin
if you could produce a flamegraph like #3033 (comment), we could have a much clear view of what's going on and I believe there would be no compromise of confidential data
image.png

image.png
👍
1

Ivan Danov
1 month ago
I think it's definitely worth making a Kedro optimisation sprint to make it faster wherever possible... A big problem has always been the config loading and catalog creation, especially with custom configs, because they tend to use the old .get API quite often, which is always reloading the config.
👍
1

marrrcin
1 month ago
It grows even faster when number of nodes is large (200 nodes):
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.808s
Sum of 21 pipelines took: 2.921s
Sum of 31 pipelines took: 7.425s
Sum of 41 pipelines took: 13.902s

marrrcin
1 month ago
Funny thing is that we’ve been recommending them to keep the nodes small and now it got back on us 😄

Deepyaman Datta
1 month ago
I think your intuition is right re toposort; just looking at the code, using the nodes accessor doesn't make sense here.

Deepyaman Datta
1 month ago
Let me see if something like #3146 will pass
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commi… Show more
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub
❤️
1

Deepyaman Datta
1 month ago
I feel like there's a lot of unnecessary sorting going on 🙂

datajoely
1 month ago
there is also the parallel point about deterministic toposort that people keep asking for

Ivan Danov
1 month ago
It might quite well be toposort, I think I vaguely remember that the library we are using is not very scaleable....

juanlu
1 month ago
leaving a comment directly on the PR 👍:skin-tone-3:

juanlu
1 month ago
also thinking a lot about a certain :knuth: quote
👀
1

datajoely
1 month ago
https://docs.python.org/3/library/graphlib.html
Graphlib was added to python 3.9
Python documentationPython documentation
graphlib — Functionality to operate with graph-like structures
Source code: Lib/graphlib.py Exceptions: The graphlib module defines the following exception classes:
❤️
1

datajoely
1 month ago
we can now use it

juanlu
1 month ago
omg I had no idea about stdlib graphlib (edited)

juanlu
1 month ago
when are we dropping Python 3.8? 😄

datajoely
1 month ago
oh man got 3.7 mixed up

datajoely
1 month ago
we could do the whole conditional stuff and use graphlib if on the right version

Ivan Danov
1 month ago
https://pypi.org/project/graphlib-backport/
PyPIPyPI
graphlib-backport
Backport of the Python 3.9 graphlib module for Python 3.6+ (7 kB)
https://pypi.org/project/graphlib-backport/

🥳
1

datajoely
1 month ago
cannot wait to do this
https://twitter.com/cto_junior/status/1711067515600646585
X (formerly Twitter)X (formerly Twitter)
TDM (e/flλ) on X
A story in 4 parts.
Can't wait to offload more L5 tasks to GPT-4 (114 kB)
https://twitter.com/cto_junior/status/1711067515600646585

Deepyaman Datta
1 month ago
@marrrcin
went ahead and made sure tests and stuff are passing; can you check if #3146 resolves your issue, when you get a chance? Since you've already got your little test script, I assume. 🙂
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commi… Show more
Comments
1
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub

marrrcin
1 month ago
@Deepyaman Datta

There is a small improvement, but it’s less than 5%
😢
1

Deepyaman Datta
1 month ago
I'm happy to look into this more thoroughly at some point (sounds like something I'd like to do), but I'm not sure just how soon. 😅

marrrcin
1 month ago
I’ve diagnosed it down with pyinstrument
@juanlu

the most time is spent on two functions:
init of Pipeline class
add of Pipeline class <--- this one would most likely be fixed partially by your PR
@Deepyaman Datta
, I will run one more test with your code to confirm that
🙌
1

marrrcin
1 month ago
OK,
@Deepyaman Datta

with your fix, the add now takes less time - for the same tests it went down from 10.1s self-time to 2.1s self-time; init is faster too, but not as much. It’s definitely something 🙌
🚀
3

marrrcin
1 month ago
Added #3167 for that
#3167 Low performance of pipeline sums
Description
I have a project where there is a huge number of pipelines generated programatically (in a loop). The process of generating those pipelines takes a lot of time and it seems to be quadratic (see the chart below).
plot
n - number of pipelines to sum
time - time in seconds
The problem has 2 variants:

Large number of small pipelines
Small number of pipelines with large node count (200+).
Context
While Kedro encourages to keep the nodes small and pipelines modular - extensive use of both of those features/approaches lead to slow project startup times.
Steps to Reproduce
Create a project from spaceflights starter.
Change data_processing pipeline to:
Show the code ⬇️
def create_pipeline(…
Show less
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 12th | Added by GitHub
Low performance of pipeline sums #3167

👍:skin-tone-3:👍
2
❤️
1

Deepyaman Datta
3 days ago
Doing some cleanup of my PRs--can I get reviews on #3146, please? 🙂 Since it is an improvement regardless.
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
https://kedro-org.slack.com/archives/C054U0LJLAC/p1696862244884179
Development notes
Docs failure fixed by #3152
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the G… Show more
Comments
1
Assignees
@deepyaman
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub

Click to Expand

noklam · 2024-03-08T00:08:35Z

@deepyaman Are you still interested to continue with this PR?

Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>

idanov

This is still an improvement over the original, since every time we call self.nodes, we copy the whole list. We can save up on making new copies of that list for internal non-mutating tasks. Not sure whether the stated goal of the PR is met though, since the toposorting happens only once anyways. Should we refocus this PR and make it about avoiding unnecessary copies through using list, set and copy in different places internally?

idanov · 2024-03-21T14:56:06Z

RELEASE.md

@@ -2,6 +2,7 @@

 ## Major features and improvements
 * Improved error message when passing wrong value to node.
+* Improved performance of pipeline operations by using non-toposorted nodes in internal code.


Suggested change

* Improved performance of pipeline operations by using non-toposorted nodes in internal code.

* Improved performance of pipeline operations by avoiding unnecessary copies of the nodes.

merelcht · 2024-03-26T11:23:08Z

I'm guessing this PR can be closed now #3730 is merged?

deepyaman · 2024-04-03T13:25:48Z

I'm guessing this PR can be closed now #3730 is merged?

Seems like that is a superset of my changes, so sure.

deepyaman requested a review from marrrcin October 9, 2023 15:47

astrojuanlu reviewed Oct 9, 2023

View reviewed changes

idanov self-requested a review October 9, 2023 20:00

deepyaman force-pushed the deepyaman-patch-1 branch from 6a5370e to 06145cf Compare October 10, 2023 02:06

deepyaman added 2 commits October 9, 2023 20:06

Don't toposort nodes in non-user-facing operations

2a41680

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Reimplement filter to be more like the original

26ba1e5

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

deepyaman force-pushed the deepyaman-patch-1 branch from 06145cf to 26ba1e5 Compare October 10, 2023 02:06

deepyaman added 2 commits October 9, 2023 20:46

Merge branch 'main' into deepyaman-patch-1

00d9a68

Merge branch 'main' into deepyaman-patch-1

ca04c84

deepyaman marked this pull request as ready for review October 10, 2023 14:45

deepyaman requested a review from merelcht as a code owner October 10, 2023 14:45

deepyaman self-assigned this Oct 10, 2023

Merge branch 'main' into deepyaman-patch-1

9adae66

marrrcin mentioned this pull request Oct 12, 2023

Low performance of pipeline sums #3167

Closed

deepyaman added 3 commits October 13, 2023 22:08

Merge branch 'main' into deepyaman-patch-1

8a75160

Update RELEASE.md

5c3804c

Merge branch 'main' into deepyaman-patch-1

70b07f9

deepyaman enabled auto-merge (squash) October 14, 2023 19:17

deepyaman added 3 commits October 17, 2023 20:31

Merge branch 'main' into deepyaman-patch-1

295f671

Merge branch 'main' into deepyaman-patch-1

92e0208

Merge branch 'main' into deepyaman-patch-1

9741c16

marrrcin approved these changes Nov 14, 2023

View reviewed changes

deepyaman requested a review from noklam November 14, 2023 22:02

Merge branch 'main' into deepyaman-patch-1

f615e89

noklam reviewed Nov 17, 2023

View reviewed changes

Merge branch 'main' into deepyaman-patch-1

d71a89d

Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>

Update RELEASE.md

f0892ce

Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>

idanov reviewed Mar 21, 2024

View reviewed changes

idanov mentioned this pull request Mar 21, 2024

Optimise pipeline addition and creation #3730

Merged

7 tasks

deepyaman closed this Apr 3, 2024

auto-merge was automatically disabled April 3, 2024 13:25
Pull request was closed

deepyaman deleted the deepyaman-patch-1 branch April 3, 2024 13:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't toposort nodes in non-user-facing operations #3146

Don't toposort nodes in non-user-facing operations #3146

deepyaman commented Oct 9, 2023 •

edited

Loading

astrojuanlu left a comment

deepyaman commented Oct 9, 2023 •

edited

Loading

marrrcin left a comment

noklam commented Nov 17, 2023

noklam left a comment

noklam commented Nov 17, 2023 •

edited

Loading

noklam commented Mar 8, 2024

idanov left a comment

idanov Mar 21, 2024

merelcht commented Mar 26, 2024

deepyaman commented Apr 3, 2024

	* Improved performance of pipeline operations by using non-toposorted nodes in internal code.
	* Improved performance of pipeline operations by avoiding unnecessary copies of the nodes.

Don't toposort nodes in non-user-facing operations #3146

Don't toposort nodes in non-user-facing operations #3146

Conversation

deepyaman commented Oct 9, 2023 • edited Loading

Description

Development notes

Developer Certificate of Origin

Checklist

astrojuanlu left a comment

Choose a reason for hiding this comment

deepyaman commented Oct 9, 2023 • edited Loading

marrrcin left a comment

Choose a reason for hiding this comment

noklam commented Nov 17, 2023

noklam left a comment

Choose a reason for hiding this comment

noklam commented Nov 17, 2023 • edited Loading

noklam commented Mar 8, 2024

idanov left a comment

Choose a reason for hiding this comment

idanov Mar 21, 2024

Choose a reason for hiding this comment

merelcht commented Mar 26, 2024

deepyaman commented Apr 3, 2024

deepyaman commented Oct 9, 2023 •

edited

Loading

deepyaman commented Oct 9, 2023 •

edited

Loading

noklam commented Nov 17, 2023 •

edited

Loading