The pipeline registry is difficult to understand #3233

Open
astrojuanlu opened this issue Oct 26, 2023 · 1 comment

@astrojuanlu
Member

The current code for pipeline_registry.py in the default template is as follows:

"""Project pipelines."""
from __future__ import annotations

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines

Apart from #2526, this is fine and works well. The magic is in kedro.framework.project.find_pipelines, which scans different directories searching for a create_pipeline function:

obj = getattr(pipeline_module, "create_pipeline")()

This is so magical, though, that the moment users want to register pipelines manually, they go crazy. For example, this user was trying something like kedro run --pipeline=data_science+evaluation (which is a beautiful syntax, by the way): https://linen-slack.kedro.org/t/15697047/i-have-a-quick-question-on-running-selected-pipelines-only-i#b93fe172-d54f-4f51-a8a6-b85f9dbcec32

to which I replied: how would I subtract a pipeline?

def register_pipelines() -> dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    pipelines["except-train"] = ???
    return pipelines

in the end I did this:

from .pipelines.model_training import create_pipeline as create_model_training_pipeline

...
pipelines["all"] = sum(pipelines.values())
pipelines["all_except_eval"] = pipelines["all"] - create_model_training_pipeline()

but @noklam suggested this instead:

pipelines["all_except_eval"] = pipelines["all"] - pipelines["eval"]
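
To make the arithmetic above concrete, here is a minimal sketch of why sum(pipelines.values()) and the subtraction both work. ToyPipeline is a hypothetical stand-in that models a pipeline as a set of node names; it is not Kedro's Pipeline class, and the node and pipeline names are made up for illustration. The only assumption it borrows from the snippets above is that pipelines support + and -.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in: a pipeline modelled as a set of node names."""

    nodes: frozenset

    def __add__(self, other: ToyPipeline) -> ToyPipeline:
        # Adding two pipelines keeps the nodes of both.
        return ToyPipeline(self.nodes | other.nodes)

    def __radd__(self, other):
        # sum() starts from the integer 0, so "0 + pipeline" has to work too.
        if other == 0:
            return self
        return NotImplemented

    def __sub__(self, other: ToyPipeline) -> ToyPipeline:
        # Subtracting removes the other pipeline's nodes.
        return ToyPipeline(self.nodes - other.nodes)


# Made-up registry, playing the role of the dict returned by find_pipelines().
registry = {
    "preprocess": ToyPipeline(frozenset({"clean"})),
    "train": ToyPipeline(frozenset({"fit"})),
    "eval": ToyPipeline(frozenset({"score"})),
}
registry["all"] = sum(registry.values())
registry["all_except_eval"] = registry["all"] - registry["eval"]
```

The point of the sketch is that the registry already holds the built pipeline objects, so there is no need to import and call a create_pipeline function again just to subtract one of them.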

This week I saw a user do something similar, but they renamed the functions instead:

https://github.com/pablovdcf/TFM_HADO_Cares/blob/28d5a024b915169a039a5a84996b9ee11ee1f3ee/hado/src/hado/pipeline_registry.py#L5-L7

and since their pipeline creation functions were not named create_pipeline but something else, this completely broke the automagic find_pipelines for them.
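
The failure mode can be reproduced with a toy model of the discovery step. This is a simplified sketch, assuming discovery boils down to looking up an attribute named create_pipeline on each candidate module, as in the getattr line quoted earlier. The discover helper and the in-memory modules are hypothetical, and the real find_pipelines does more than this (it imports modules from the project package, and it may warn about modules it skips).

```python
import types


def discover(modules):
    """Collect create_pipeline() results from modules that define that name."""
    found = {}
    for name, module in modules.items():
        factory = getattr(module, "create_pipeline", None)
        if not callable(factory):
            # No function with the conventional name: the module is skipped.
            continue
        found[name] = factory()
    return found


# Stand-in module that follows the naming convention.
data_science = types.ModuleType("data_science")
data_science.create_pipeline = lambda: ["train", "predict"]

# Stand-in module whose factory was renamed (hypothetical name).
evaluation = types.ModuleType("evaluation")
evaluation.create_evaluation_pipeline = lambda: ["score"]

result = discover({"data_science": data_science, "evaluation": evaluation})
```

In this sketch only data_science ends up in result; the renamed factory in evaluation is never called, which mirrors what happened to the user above.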

@astrojuanlu
Member Author

Another pattern: repeatedly calling create_pipeline by hand (https://linen-slack.kedro.org/t/16062967/i-think-this-might-be-a-versioning-question-i-created-a-kedr#e92a5668-7e06-41e7-8825-3ec18fff1c0c)

from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline

from network_anomaly_detection.pipelines import (
    data_collection as dc,
    data_engineering as de,
    ...
)

def register_pipelines() -> Dict[str, Pipeline]:
    ...
    data_collection_pipeline = dc.create_pipeline()
    data_engineering_pipeline = de.create_pipeline()
    ...

    return {
        "dc": data_collection_pipeline,
        "de": data_engineering_pipeline,
        ...
        "__default__": data_collection_pipeline + data_engineering_pipeline + data_science_pipeline + plot_pipeline
    }
