
Managing KedroPipelineModel predictions outputs #93

Closed
takikadiri opened this issue Oct 17, 2020 · 3 comments · Fixed by #98
Labels: enhancement (New feature or request) · need-design-decision (Several ways of implementation are possible and one must be chosen)

@takikadiri (Collaborator) commented Oct 17, 2020

Description

A PipelineML is a composition of a training pipeline and an inference pipeline.
To separate the inference inputs from the model artifact inputs, we ask the developer to declare the inference input dataset through the input_name attribute.
PipelineML then performs a series of validation checks to make sure these attributes are coherent.

However, we don't have any control over the outputs: we let them flow through the layers Pipeline --> PipelineML --> KedroPipelineModel --> mlflow.pyfunc.scoring_server without any validation.
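
For concreteness, here is a minimal sketch of how the developer currently declares the inference entry point (assuming the pipeline_ml_factory helper; the project module and the "instances" dataset name are illustrative):

```python
from kedro_mlflow.pipeline import pipeline_ml_factory

# hypothetical project pipelines, for illustration only
from my_project.pipelines import inference, training

# Only the *input* side is declared: `input_name` tells kedro_mlflow which
# dataset of the inference pipeline will be fed by the mlflow scoring server.
# Nothing equivalent exists today for the outputs.
training_pipeline_ml = pipeline_ml_factory(
    training=training.create_pipeline(),
    inference=inference.create_pipeline(),
    input_name="instances",
)
```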

Context

In KedroPipelineModel, we collect all the outputs of the leaf nodes (nodes without children) in a dictionary indexed by the output dataset names. Let's call this dictionary "predictions".

On the mlflow.pyfunc.scoring_server side, these "predictions" are retrieved, and the scoring server then tries to convert this dictionary and dump it to a JSON string.

Without any control on our side, the "predictions" dictionary is quite likely to raise a non-JSON-serializable object exception, since a user can output a pickle, image, or similar dataset for logging purposes in the inference pipeline. The error is unfortunately raised very late, in the scoring server, and flows silently between kedro and kedro_mlflow.

Moreover, having multiple datasets as outputs is confusing, because the user does not control the output schema. On top of that, in KedroPipelineModel, indexing the results by dataset name is useless and breaks the dataset schema. Remember that this output (the runner output) is what will be served by the mlflow scoring server.
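
To make the flow concrete, here is a heavily simplified sketch of what KedroPipelineModel roughly does today (attribute names and the catalog handling are simplified; the real class also rebuilds the catalog in load_context):

```python
import mlflow.pyfunc


class KedroPipelineModel(mlflow.pyfunc.PythonModel):
    """Simplified sketch of the current behaviour, not the actual implementation."""

    def __init__(self, pipeline_ml, catalog, runner):
        self.pipeline_ml = pipeline_ml
        self.catalog = catalog
        self.runner = runner

    def predict(self, context, model_input):
        # Feed the scoring-server input into the dataset declared as `input_name`
        self.catalog.save(self.pipeline_ml.input_name, model_input)
        # The runner returns a dict {dataset_name: data} with every leaf output,
        # and this whole dict is currently returned to mlflow.pyfunc.scoring_server.
        predictions = self.runner.run(self.pipeline_ml.inference, self.catalog)
        return predictions
```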

Possible Implementation

  • Introduce an output_name attribute in PipelineML and perform validation checks (pipeline resolution, JSON serializability, ...)

  • Unpack the predictions from the kedro runner in KedroPipelineModel by replacing run_outputs = runner.run(pipeline, catalog) with run_outputs = runner.run(pipeline, catalog)[pipeline_ml.output_name] (see the sketch below)
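
A sketch of what the second point would change in the (simplified) predict method above, assuming the new output_name attribute exists on PipelineML:

```python
    def predict(self, context, model_input):
        self.catalog.save(self.pipeline_ml.input_name, model_input)
        run_outputs = self.runner.run(self.pipeline_ml.inference, self.catalog)
        # Proposed change: return only the dataset declared as `output_name`
        # instead of the whole {dataset_name: data} dict of leaf outputs.
        return run_outputs[self.pipeline_ml.output_name]
```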

@Galileo-Galilei (Owner) commented Oct 18, 2020

Yes, this is really something that I have in mind and it would be a great improvement. Introducing an output_name attribute to PipelineML is easy, and unpacking would undoubtedly offer a better experience.

However, it is not easy to perform checks at logging time on the output schema, because the mlflow logging is triggered by the training pipeline, and consequently the "predictions" object does not exist (you need to run the inference pipeline first). A potential solution would be to automatically trigger the inference pipeline after running the training pipeline to ensure consistency, but this is dangerous:

  • it can trigger lots of side effects (interfere with parameter versioning, for instance) and be hard to understand for users
  • it may significantly increase the training pipeline running time

If we choose this way to perform schema validation, we'll need to design it carefully and very likely add a CLI option to turn it off at run time:

kedro run --pipeline=training --no-output-schema

This is very tricky while kedro-org/kedro#382 is still open. I am not adding this issue to the next milestone, because we need to discuss its advantages and drawbacks in depth first (and make sure the implementation will be user-friendly).

@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label Oct 18, 2020
@takikadiri (Collaborator, Author) commented Oct 18, 2020

It's not really about schema validation (although that is an important thing to manage), it's more about object type validation, but it's true that we are not aware of these objects at training time.

It's clear that we have two DataSets in our PipelineML that are in direct contact with the mlflow scoring server: the dataset given as input_name and the dataset given as output_name. One way to manage these is to create an MlflowInputDataSet that decorates a pandas DataSet (since the mlflow scoring server always gives us pandas as input), and an MlflowOutputDataSet that decorates a range of accepted DataSets matching the mlflow scoring server's expectations. That way we statically control the inputs and output of the model API; a rough sketch of the idea is given below.
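
A rough sketch of what such a wrapper could look like (MlflowOutputDataSet is hypothetical and does not exist in kedro_mlflow; the JSON check is only illustrative of the "accepted types" idea):

```python
import json

import pandas as pd
from kedro.io import AbstractDataSet


class MlflowOutputDataSet(AbstractDataSet):
    """Hypothetical wrapper: delegates I/O to an inner dataset but checks at save
    time that the data is something the mlflow scoring server can return."""

    def __init__(self, wrapped_dataset):
        self._wrapped = wrapped_dataset

    def _save(self, data):
        # Fail fast inside kedro instead of letting scoring_server fail much later.
        if not isinstance(data, (pd.DataFrame, pd.Series)):
            try:
                json.dumps(data)
            except TypeError as err:
                raise TypeError(
                    f"Output of type {type(data)} cannot be served by the mlflow scoring server"
                ) from err
        self._wrapped.save(data)

    def _load(self):
        return self._wrapped.load()

    def _describe(self):
        return {"wrapped": self._wrapped._describe()}
```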

We can leave this validation and checking for a later release, but what is urgent for now is to unpack the predictions from KedroPipelineModel when output_name is explicitly given in PipelineML.

@Galileo-Galilei (Owner) commented
Yes, I agree. I'll add this while creating a PR for #70.
