Low performance of pipeline sums #3167

Closed
marrrcin opened this issue Oct 12, 2023 · 2 comments

Comments

@marrrcin
Contributor

marrrcin commented Oct 12, 2023

Description

I have a project where a huge number of pipelines are generated programmatically (in a loop). Generating and summing those pipelines takes a long time, and the time appears to grow quadratically with the number of pipelines (see the chart below).

[Plot: time in seconds (time) against the number of pipelines summed (n)]
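
One way to check whether such a chart is really quadratic is to fit a slope on log-log axes. The sketch below is only an illustration: `growth_exponent` is a hypothetical helper, and the data points are synthetic, not the measurements behind the plot. It expects `(n, time)` tuples with strictly positive values, the same shape as the `data` list collected in the reproduction code.

```python
import math


def growth_exponent(points):
    """Least-squares slope of log(time) against log(n).

    A slope near 1 suggests linear growth, a slope near 2 quadratic growth.
    Both n and time must be strictly positive.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic, made-up points that follow time = 0.01 * n**2 exactly.
synthetic = [(n, 0.01 * n**2) for n in range(1, 60, 10)]
exponent = growth_exponent(synthetic)
print(f"estimated exponent: {exponent:.2f}")
```

With real timings the slope will be noisier, so it is better read as a rough indicator than as proof of a particular complexity class.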

The problem has 2 variants:

  1. Large number of small pipelines
  2. Small number of pipelines with large node count (200+).

Context

While Kedro encourages keeping nodes small and pipelines modular, extensive use of both approaches leads to slow project startup times.

The impact is most severe in mono-repo setups, where multiple teams work in the same project but on separate pipelines. In such setups the number of pipelines grows quickly as development proceeds.

Steps to Reproduce

  1. Create a project from spaceflights starter.
  2. Change data_processing pipeline to:
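# Imports as already present at the top of the starter's pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline: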
    data_engineering_pipeline = pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ] + [node(
            func=lambda x: print("YOLO", x),
            inputs="parameters",
            outputs=f"yolo_{i}",
            name=f"yolo_{i}"
        ) for i in range(200)]
    )

    # Poor man's performance test
    import time
    pipelines = []
    MAX = 60
    for i in range(MAX + 1):
        pipelines.append(
            pipeline(
                data_engineering_pipeline,
                inputs={"companies": "companies",
                        "shuttles": "shuttles",
                        "reviews": "reviews"},
                namespace=f"namespace_{i}",
            )
        )
    data = []
    for n in range(1, MAX, 10):
        start = time.monotonic()
        _ = sum(pipelines[:n])
        end = time.monotonic()
        print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
        data.append((n, end - start))


    # uncomment to output chart / data
    # import pandas as pd
    # df = pd.DataFrame(data, columns=["n", "time"])
    # df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")
    return sum(pipelines)
  3. Run kedro registry list

Expected Result

Pipelines are listed quickly.

Actual Result

The pipelines are listed after a few minutes (depending on the number of pipelines/nodes), with the time increasing quadratically (see the chart above).
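
A toy model shows how summing can become quadratic. It is not Kedro's code: `ToyPipeline` and `work_to_sum` are hypothetical names, and the model assumes that every addition builds a new object whose construction cost is proportional to its total node count.

```python
class ToyPipeline:
    """Stand-in for a pipeline: just a set of node names (not Kedro's Pipeline)."""

    work = 0  # total number of nodes processed by all constructors

    def __init__(self, nodes):
        self.nodes = frozenset(nodes)
        # Assumption: construction cost is proportional to the node count.
        ToyPipeline.work += len(self.nodes)

    def __add__(self, other):
        # Every addition rebuilds the combined object from scratch.
        return ToyPipeline(self.nodes | other.nodes)

    def __radd__(self, other):
        # Lets the built-in sum() start from its default value 0.
        if other == 0:
            return self
        return NotImplemented


def work_to_sum(n, nodes_per_pipeline=200):
    """Count constructor work done while summing n disjoint toy pipelines."""
    parts = [
        ToyPipeline(f"p{i}_n{j}" for j in range(nodes_per_pipeline))
        for i in range(n)
    ]
    ToyPipeline.work = 0  # ignore the cost of building the parts themselves
    sum(parts)
    return ToyPipeline.work


for n in (10, 20, 40):
    print(n, work_to_sum(n))
```

Under this assumption the k-th addition touches roughly k times the nodes of a single pipeline, so the total grows with the square of the number of pipelines. Doubling n roughly quadruples the count.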

Possible causes

The main problem is that internally the pipelines are summed via __add__, which in turn goes through __init__ of the Pipeline class. The slowness of the operations inside __add__ itself is partially addressed by #3146, but the problem with __init__ remains. The calls to _topologically_sorted in the constructor may be the root cause, but confirming that would require more detailed profiling.
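
A minimal sketch of how that profiling could be done with the standard library's cProfile. `profile_call` is a hypothetical helper, not part of Kedro, and the workload below is a stand-in so that the snippet runs on its own.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Stand-in workload. In the reproduction it would be something like
# profile_call(sum, pipelines), with `pipelines` being the list built there.
result, report = profile_call(sum, range(1000))
print(report)
```

Sorting by cumulative time should make it visible whether most of the time is spent under the constructor, for example in _topologically_sorted, or elsewhere.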

Your Environment

  • Kedro version used: 0.18.13
  • Python version used: 3.10.13
  • Operating system and version: macOS 13.0.1
@astrojuanlu
Member

Was this fully addressed by #3730?

@marrrcin
Contributor Author

I hope so 🤞🏻
