
Managing KedroPipelineModel predictions outputs #93

Closed
takikadiri opened this issue Oct 17, 2020 · 3 comments · Fixed by #98
Labels: enhancement (New feature or request) · need-design-decision (Several ways of implementation are possible and one must be chosen)

@takikadiri (Collaborator) commented Oct 17, 2020

Description

A PipelineML is a composition of a training pipeline and an inference pipeline.
To separate the inference inputs from the model artifact inputs, we ask the developer to declare the inference input dataset through the input_name attribute.
PipelineML then performs a series of validation checks to make sure these attributes are coherent.

However, we don't have any control over the outputs: we let them flow through the layers Pipeline --> PipelineML --> KedroPipelineModel --> mlflow.pyfunc.scoring_server without any validation.
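
For concreteness, here is a minimal sketch of how the developer currently declares the inference entry point (assuming the pipeline_ml_factory helper; the project module and the "instances" dataset name are illustrative):

```python
from kedro_mlflow.pipeline import pipeline_ml_factory

# hypothetical project pipelines, for illustration only
from my_project.pipelines import inference, training

# Only the *input* side is declared: `input_name` tells kedro_mlflow which
# dataset of the inference pipeline will be fed by the mlflow scoring server.
# Nothing equivalent exists today for the outputs.
training_pipeline_ml = pipeline_ml_factory(
    training=training.create_pipeline(),
    inference=inference.create_pipeline(),
    input_name="instances",
)
```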

Context

In KedroPipelineModel, we collect all the outputs of the leaf nodes (nodes without children) in a dictionary indexed by the output dataset names. Let's call this dictionary "predictions".

On the mlflow.pyfunc.scoring_server side, these "predictions" are retrieved, and the scoring server then tries to convert this dictionary and dump it to a JSON string.

Without any control on our side, the "predictions" dictionary is quite likely to raise a non-JSON-serializable object exception, since a user can output a pickle, image, or similar dataset for logging purposes in the inference pipeline. The error is unfortunately raised very late, in the scoring server, and flows silently between kedro and kedro_mlflow.

Moreover, having multiple datasets as outputs is confusing, because the user does not control the output schema. On top of that, in KedroPipelineModel, indexing the results by dataset name is useless and breaks the dataset schema. Remember that this output (the runner output) is what will be served by the mlflow scoring server.
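
To make the flow concrete, here is a heavily simplified sketch of what KedroPipelineModel roughly does today (attribute names and the catalog handling are simplified; the real class also rebuilds the catalog in load_context):

```python
import mlflow.pyfunc


class KedroPipelineModel(mlflow.pyfunc.PythonModel):
    """Simplified sketch of the current behaviour, not the actual implementation."""

    def __init__(self, pipeline_ml, catalog, runner):
        self.pipeline_ml = pipeline_ml
        self.catalog = catalog
        self.runner = runner

    def predict(self, context, model_input):
        # Feed the scoring-server input into the dataset declared as `input_name`
        self.catalog.save(self.pipeline_ml.input_name, model_input)
        # The runner returns a dict {dataset_name: data} with every leaf output,
        # and this whole dict is currently returned to mlflow.pyfunc.scoring_server.
        predictions = self.runner.run(self.pipeline_ml.inference, self.catalog)
        return predictions
```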

Possible Implementation

  • Introduce an output_name attribute in PipelineML and perform validation checks (pipeline resolution, JSON serializability, ...)

  • Unpack the predictions from the kedro runner in KedroPipelineModel by replacing run_outputs = runner.run(pipeline, catalog) with run_outputs = runner.run(pipeline, catalog)[pipeline_ml.output_name] (see the sketch below)
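
A sketch of what the second point would change in the (simplified) predict method above, assuming the new output_name attribute exists on PipelineML:

```python
    def predict(self, context, model_input):
        self.catalog.save(self.pipeline_ml.input_name, model_input)
        run_outputs = self.runner.run(self.pipeline_ml.inference, self.catalog)
        # Proposed change: return only the dataset declared as `output_name`
        # instead of the whole {dataset_name: data} dict of leaf outputs.
        return run_outputs[self.pipeline_ml.output_name]
```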

@Galileo-Galilei (Owner) commented Oct 18, 2020

Yes, this is really something that I have in mind and it would be a great improvement. Introducing an output_name attribute to PipelineML is easy, and unpacking would undoubtedly offer a better experience.

However, it is not easy to perform checks at logging time on the output schema, because the mlflow logging is triggered by the training pipeline, and consequently the "predictions" object does not exist (you need to run the inference pipeline first). A potential solution would be to automatically trigger the inference pipeline after running the training pipeline to ensure consistency, but this is dangerous:

  • it can trigger lots of side effects (interfere with parameter versioning, for instance) and be hard to understand for users
  • it may significantly increase the training pipeline running time

If we choose this way to perform schema validation, we'll need to design it carefully and very likely add a CLI option to turn it off at run time:

kedro run --pipeline=training --no-output-schema

This is very tricky while kedro-org/kedro#382 is still open. I am not adding this issue to the next milestone, because we need to discuss its advantages and drawbacks in depth first (and make sure the implementation will be user-friendly).

@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label Oct 18, 2020
@takikadiri (Collaborator, Author) commented Oct 18, 2020

It's not really about schema validation (although that is an important thing to manage), it's more about object type validation, but it's true that we are not aware of these objects at training time.

It's clear that we have two DataSets in our PipelineML that are in direct contact with the mlflow scoring server: the dataset given as input_name and the dataset given as output_name. One way to manage these is to create an MlflowInputDataSet that decorates a pandas DataSet (since the mlflow scoring server always gives us pandas as input), and an MlflowOutputDataSet that decorates a range of accepted DataSets matching the mlflow scoring server's expectations. That way we statically control the inputs and output of the model API; a rough sketch of the idea is given below.
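
A rough sketch of what such a wrapper could look like (MlflowOutputDataSet is hypothetical and does not exist in kedro_mlflow; the JSON check is only illustrative of the "accepted types" idea):

```python
import json

import pandas as pd
from kedro.io import AbstractDataSet


class MlflowOutputDataSet(AbstractDataSet):
    """Hypothetical wrapper: delegates I/O to an inner dataset but checks at save
    time that the data is something the mlflow scoring server can return."""

    def __init__(self, wrapped_dataset):
        self._wrapped = wrapped_dataset

    def _save(self, data):
        # Fail fast inside kedro instead of letting scoring_server fail much later.
        if not isinstance(data, (pd.DataFrame, pd.Series)):
            try:
                json.dumps(data)
            except TypeError as err:
                raise TypeError(
                    f"Output of type {type(data)} cannot be served by the mlflow scoring server"
                ) from err
        self._wrapped.save(data)

    def _load(self):
        return self._wrapped.load()

    def _describe(self):
        return {"wrapped": self._wrapped._describe()}
```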

We can leave this validation and checking for a later release, but what is urgent for now is to unpack the predictions from KedroPipelineModel when output_name is explicitly given in PipelineML.

@Galileo-Galilei (Owner) commented
Yes, I agree. I'll add this while creating a PR for #70.
