How to distribute and extend kedro pipelines #795

noklam · 2021-06-22T01:47:52Z

noklam
Jun 22, 2021
Collaborator

Kedro has been doing a nice job in structuring data science projects and build modular pipelines. I have been using it for development, but it is not clear to me that how should I deploy and distribute the pipelines.

For example, use cases as follow.

Distribute the pipeline to 3rd party

Say if I package up the kedro pipeline, it seems that the most straightforward way to run this pipeline is via the CLI. What if I have 2 pipelines distributed and I want them to run together? It is easy to do they if both pipelines exist in 1 single repository, but this will not be the case if I am packaging up the individual pipeline and shared to others.

Integrate pipeline with other python code

For example, I have a machine learning pipeline that trains a model. In order to use it for deployment, I may need to perform the following steps

Run the kedro pipeline, train a new model
Run a web service, it should load the model, receives request & returns predictions
Scheduled update
Run step 1 again and the web service needs to reload the model.

The current Kedro pipeline is pretty much a standalone application, it talks to the pipeline itself and filesystem directly, but not to other python applications. i.e. I want to return the model and the web services should load the model from memory directly instead of reading it from files. The pseudocode is similar to this.

1. model = my_kedro_pipeline()
2. app = PredictionService(model)

Thank you in advance, would be nice to know how others people are deploying the kedro pipeline.
(p.s. I find this issue is a more extensive description of many issues that I have, #795)

datajoely · 2021-06-22T08:07:32Z

datajoely
Jun 22, 2021
Collaborator

Hi @noklam thanks for kicking off the discussion! In this situation what would the PredictionService class do?

1 reply

noklam Jun 22, 2021
Collaborator Author

It could be anything, I just mean how can I make my kedro pipeline talk with other python code rather than a standalone application.

For this case, it may supports inference at 2 modes, receive a single JSON request or a batch array. You can imagine the data scientist is responsible to prepare the model training pipeline. While the application handles the input logic, it could be a JSON/array/data frame, anything is possible, that's up to the application logic.

class PredictionService:
    def predict(x):
         "Return a response via RESTful maybe?"
        ...
    def batch_predict(x):
        "Write to a particular Table"
        ...

antonymilne · 2021-06-22T11:06:02Z

antonymilne
Jun 22, 2021

Hello @noklam. This is a great question, and while I'm not sure I have all the answers maybe I can provide some useful tips anyway.

It is possible to do programmatic runs, i.e. instantiate a kedro run using the Python API rather than through the CLI. This is handled through sessions and is exactly what kedro run is doing under the hood:
https://github.com/quantumblacklabs/kedro/blob/7fd1fe1a089d0f004b05c8f31a3920315a6a4ce2/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/cli.py#L159-L170

KedroSession.create accepts an argument project_path, so that you can load up a project that is outside your current working directory. You could also use this to run two completely different projects in series:

from kedro.framework.session import KedroSession

with KedroSession.create(project_path="path/to/project/1") as session:
    session.run()
with KedroSession.create(project_path="path/to/project/2") as session:
    session.run()

session.run will return a dictionary that contains the free outputs of the pipeline it runs, so it's also possible to capture that output and process it further (model in your case). Through context = session.load_context() it's also possible to access the catalog, pipelines, etc. as well. Note that pipelines can actually be directly accessed without needing to start a kedro session at all through from kedro.framework.project import pipelines. It's possible that other components like the catalog might be accessed this way in future too.

4 replies

noklam Jun 23, 2021
Collaborator Author

I see, this may help downstream tasks integrate easier. How about if I need to accept some variable as my pipeline input?
I have a look at modular pipeline, but it seems only accept input for parameter, instead of changing the catalog.

1. df = get_some_data()
2. model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
3. app = PredictionService(model)

antonymilne Jun 24, 2021

This is not so easy to do because kedro pipeline inputs are based on loading from the dataset name in the catalog, rather than passing in actual Python objects. The modular pipeline input/output/parameters arguments are always meant to transform a dataset/parameter name into another name rather than handling with the underlying data itself.

I think the best way to do this would be to create a catalog entry for df and then give kedro the name of that dataset as an input. If you don't want to be touching your catalog.yml file then it's possible to do this on the fly with some cunning use of hooks:

from kedro.extras.datasets.pandas import CSVDataSet

@hook_impl
def after_catalog_created(self, catalog: DataCatalog):
    catalog.add("dataset_name", CSVDataSet(filepath="df.csv")

The before_pipeline_run could be used similarly here but also gives access to the run_params argument, which contains extra_params. These can be used to parametrise your run through KedroSession.create(extra_params) (this is equivalent to passing parameters through kedro run --params). So you could even add something different to your catalog depending on the parameters:

@hook_spec
def before_pipeline_run(self, run_params: Dict[str, Any], catalog: DataCatalog):
    dataset_name = run_params["extra_params"].get("dataset_name")
    if dataset_name == "df":
        catalog.add("dataset_name", CSVDataSet(filepath="df.csv"))
    elif dataset_name == "blah":
        catalog.add("some_other_dataset_name", CSVDataSet(filepath="blah.csv"))

noklam Jun 28, 2021
Collaborator Author

Thanks @AntonyMilneQB

Upstream task (python program)
Kedro consume upstream variable (hooks to create catalog entries on the fly)
Downstream task consume Kedro output (via the dictionary of output)

For me, I feel there are strong assumptions of people who use the Kedro pipeline need to have a lot of prior knowledge about Kedro (which I think may not be true, the downstream may just want to consume it like a normal python function). This could be a big overhead when working with people, especially when they are not sharing the same codebase. I wish there is a more obvious way to do this in a simple way.

antonymilne Jun 28, 2021

Yeah, I think that is a fair point. Though I think step 2 (needing to add catalog entries on the fly) is maybe an unusual use case. In general I imagine the flow would be more like this:

# catalog.yml defines the pipeline input
dataframe_from_upstream_task:
    type: pandas.CSVDataSet
    filepath: df.csv

And then your program would look like:

# You could do this with kedro's CSVDataSet.save if you preferred
dataframe_from_upstream_task.to_csv("df.csv")

with KedroSession.create(project_path="path/to/project") as session:
    # run the pipeline that has dataframe_from_upstream_task as a input
    outputs = session.run()

do_more_stuff(outputs)

Galileo-Galilei · 2021-07-13T09:58:14Z

Galileo-Galilei
Jul 13, 2021
Collaborator

I see, this may help downstream tasks integrate easier. How about if I need to accept some variable as my pipeline input?
I have a look at modular pipeline, but it seems only accept input for parameter, instead of changing the catalog.
1. df = get_some_data()
2. model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
3. app = PredictionService(model)

Hello @noklam,

sorry for self-advertising, but for your record, kedro-mlflow plugins provides this exact functionality with the following restrictions:

the only parameter you can modify at runtime is the data you want to predict on (it is not possible to pass additional parameters to the underlying function)
We assume that your pipeline is the inference pipeline in a machine learning sense, which means that at least one of the input data of your inference pipeline comes from another "training" pipeline (e.g. the model is an input of inference and the outputs of training pipeline)

You can log the pipeline in mlflow programatically:

from pathlib import Path
from kedro.framework.context import load_context
from kedro_mlflow.mlflow import KedroPipelineModel
from mlflow.models import ModelSignature

context=load_context(".")
pipelines=context.pipelines
catalog = context.io

# convert your two pipelines to a PipelineML object
pipeline_training= pipeline_ml_factory(
    training=pipelines["ml_pipeline"].only_nodes_with_tags("training"), #ssume you have a pipeline with nodes tagged "training
    inference=pipelines["ml_pipeline"].only_nodes_with_tags("inference"),
    input_name="instances"
)

# artifacts are all the inputs of the inference pipelines that are persisted in the catalog
artifacts = pipeline_training.extract_pipeline_artifacts(catalog)

kedro_model = KedroPipelineModel(
    pipeline_ml=pipeline_training,
    catalog=catalog
)

mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=kedro_model,
    artifacts=artifacts
)

You will later be able to reuse it from a script, serve it, or even feed it in the catalog for a downstream task:

PROJECT_PATH = r"<your/project/path>"
RUN_ID = "<your-run-id>"

from kedro.framework.context import load_context
from kedro_mlflow.framework.context import get_mlflow_config
from mlflow.pyfunc import load_model

local_context = load_context(PROJECT_PATH)
mlflow_config = get_mlflow_config(local_context)
mlflow_config.setup(local_context)

instances = local_context.io.load("instances")
model = load_model(f"runs:/{RUN_ID}/kedro_mlflow_tutorial")

predictions = model.predict(instances)

You can find a detailed example with code here: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial.

A very nice feature is that if you declare your pipeline as a PipelineML object, yyour "app" object will be automatically created (with the proper artifacts, parameters, environment, model signature...) each time you launch the training pipeline, without the need to do it explicitly.

2 replies

noklam Jul 14, 2021
Collaborator Author

For details, I may have to look into the plugin, but from your snippets, it would still be reading the artifact from some files? I have something similar, but we are not using mlflow. But we would store artifacts with experiments, so I can extract the model/data file later for any downstream task.

Galileo-Galilei Jul 19, 2021
Collaborator

It would still be reading the artifact from some files?

No. They are stored with the model, and then stored in memory when loading the model.

we would store artifacts with experiments, so I can extract the model/data file later for any downstream task

This is exactly what mlflow models are designed for 😉

How mlflow models work?

Basically, when you save a model in mlflow (with one of the many log_model function available), it creates a folder with the following content:

all the artifacts (=the inputs objects like the model, the encoders...)
the python function to execute as a pickle object
a mlflow specific configuration file to explain how to load the model and execute it
(optional) additional information to validate inputs at runtime
(optional) the conda environment neede to run your model

This can be seen in this image from the mlflow ui.

This folder is entirely self contained. It can be manually downloaded and it is portable across environments. The mlflow-specific configuration file enable to serve it as an API with mlflow models serve -m "runs:/<your-model-run-id>/kedro_mlflow_tutorial" command, or loaded in a python script with load_model function.

If you do not want to use mlflow, it is also very easy to parse this folder manually and just ignore the mlflow configuration file.

How does the kedro-mlflow plugin convert Kedro pipelines as Mlflow Models?

The kedro-mlflow plugin provides several utilities:

First, it adds the possibility to save a kedro pipeline as a (custom) mlflow model. This means that the python function which will be executed when serving this model will be your kedro pipeline. All artifacts will be automatically stored in this dataset. You have to specify the artifacts (to store them within the experiment folder), but the plugin offers a convenient wraper to extract them from the pipeline's inputs. It will also automatically pickle some parameters (like vocabulary_size, stopwords, ...) needed to predict. This is the first code snippet above.
When you will reuse this model in a downstream task, the plugin will automatically recreate the catalog (with all the artifacts as MemoryDataset) and run the pipeline (this means that anyone can serve your pipeline as an API given that it has access to the mlflow model folder. You can litteraly download it manually from the mlflow UI, or directly store it locally and you are good to go). The only argument you have to pass is the new dataset to predict on. This is the second code snippet.
A cool feature is that the plugin provide a hook which automatically creates this "portable and self contained" model each time you run a training pipeline, provided you declared it as a PipelineML object. This means that you are always able to serve/predict with an old experiment with the exact code, the exact parameters and the exact artifacts when you ran it, just given the mlflow run id of this experiment. This should fix the following requirements:

Run step 1 again and the web service needs to reload the model.

Hope this clarifies what the plugin does and wh I think it does answer your question. I don't want to be pushing my solution too aggressively and I won't advocate for it anymore if you think it is not appropriate, but it seems designed exactly to solve your problem (and I initally created it because I had the same one) so I try to demonstrate that tracking experiment is not the only useful thing with mlflow 😄

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to distribute and extend kedro pipelines #795

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 7 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

How to distribute and extend kedro pipelines #795

noklam Jun 22, 2021 Collaborator

Distribute the pipeline to 3rd party

Integrate pipeline with other python code

Replies: 3 comments · 7 replies

datajoely Jun 22, 2021 Collaborator

noklam Jun 22, 2021 Collaborator Author

antonymilne Jun 22, 2021

noklam Jun 23, 2021 Collaborator Author

antonymilne Jun 24, 2021

noklam Jun 28, 2021 Collaborator Author

antonymilne Jun 28, 2021

Galileo-Galilei Jul 13, 2021 Collaborator

noklam Jul 14, 2021 Collaborator Author

Galileo-Galilei Jul 19, 2021 Collaborator

How mlflow models work?

How does the kedro-mlflow plugin convert Kedro pipelines as Mlflow Models?

noklam
Jun 22, 2021
Collaborator

Replies: 3 comments 7 replies

datajoely
Jun 22, 2021
Collaborator

noklam Jun 22, 2021
Collaborator Author

antonymilne
Jun 22, 2021

noklam Jun 23, 2021
Collaborator Author

noklam Jun 28, 2021
Collaborator Author

Galileo-Galilei
Jul 13, 2021
Collaborator

noklam Jul 14, 2021
Collaborator Author

Galileo-Galilei Jul 19, 2021
Collaborator