Managing KedroPipelineModel predictions outputs #93
Comments
Yes, this is really something that I have in mind and it would be a great improvement. Introducing an output_name attribute would help. However, it is not easy to perform checks at logging time on the output schema, because the mlflow logging is triggered by the training pipeline, and consequently the "predictions" object does not exist yet (you need to run the inference pipeline first). A potential solution would be to automatically trigger the inference pipeline after running the training pipeline to ensure consistency, but this is dangerous.
If we choose this way to perform schema validation, we will need to design it carefully and very likely add a CLI option to turn it off at run time: kedro run --pipeline=training --no-output-schema. This is very tricky while kedro-org/kedro#382 is still open. I am not adding this issue to the next milestone, because we need to discuss its advantages and drawbacks in depth first (and make sure the implementation will be user-friendly).
It's not really about schema validation (although that is an important thing to manage), it's more about object type validation, but it's true that we are not aware of these objects at training time. It's clear that we have two DataSets in our PipelineML that are in direct contact with the mlflow scoring server: the dataset given as input_name and the predictions output. We can leave that validation and checking for a later release, but what is urgent for now is to unpack predictions from the kedro runner output.
Yes, I agree. I'll add this while creating a PR for #70.
Description
A PipelineML is a composition of a training pipeline and an inference pipeline.
To separate inference inputs from model artifact inputs, we ask the developer to declare his inference input dataset via input_name.
PipelineML then performs a series of validation checks to ensure the coherence of these attributes.
However, we don't have any control over the outputs: we let them flow through the layers Pipeline --> PipelineML --> KedroPipelineModel --> mlflow.pyfunc.scoring_server without any check.
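For reference, here is a minimal sketch of how such a pipeline is currently declared (assuming the pipeline_ml_factory helper; training_pipeline, inference_pipeline and the "instances" dataset name are illustrative only):

```python
from kedro_mlflow.pipeline import pipeline_ml_factory

# `training_pipeline` and `inference_pipeline` are assumed to be regular kedro
# Pipelines built elsewhere; "instances" is assumed to be the free input of the
# inference pipeline that the mlflow scoring server will feed at serving time.
pipeline_ml = pipeline_ml_factory(
    training=training_pipeline,
    inference=inference_pipeline,
    input_name="instances",
)
```

Nothing equivalent exists today for the outputs: whatever the inference pipeline produces is passed on as-is.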
Context
In KedroPipelineModel, we collect all the outputs of the leaf nodes (nodes without children) in a dictionary indexed by the nodes' output dataset names. Let's call this dictionary "predictions".
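As a standalone illustration of that shape (a minimal sketch with made-up node functions and dataset names, independent of kedro_mlflow, using the kedro API names of the 0.16.x era), the kedro runner already returns the free leaf outputs keyed by dataset name:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

def predict(data):
    # dummy "model": double every value
    return [x * 2 for x in data]

def debug_report(data):
    # an extra artifact a user might output for logging purposes
    return {"raw": data}

inference = Pipeline([
    node(predict, inputs="instances", outputs="predictions"),
    node(debug_report, inputs="instances", outputs="report"),
])

catalog = DataCatalog({"instances": MemoryDataSet([1, 2, 3])})

# Both leaf outputs come back in a single dict keyed by dataset name:
# {"predictions": [2, 4, 6], "report": {"raw": [1, 2, 3]}}
run_outputs = SequentialRunner().run(inference, catalog)
print(run_outputs)
```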
On the mlflow.pyfunc.scoring_server side, those "predictions" are retrieved, and the scoring server then tries to convert this dictionary and dump it to a JSON string. Without any control on our side, the "predictions" dictionary has a great chance of raising a non-JSON-serializable-object exception, as a user can output a pickle, image, ... dataset for logging purposes in the inference pipeline. The error is unfortunately raised very late, in the scoring server, and flows silently between kedro and kedro_mlflow.
Moreover, having multiple datasets as outputs is confusing, as the user has no control over his output schema. On the other hand, in KedroPipelineModel, having the results indexed by dataset name is useless and breaks the dataset schema. Remember that this output (the runner output) is what will be served by the mlflow scoring server.
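A minimal illustration of the failure mode (made-up dataset names, with plain json.dumps standing in for the scoring server's own serialization logic):

```python
import json

# Hypothetical runner output: the actual predictions are serializable, but an extra
# leaf dataset kept only for logging purposes (an image, a pickle, ...) is not.
run_outputs = {
    "predictions": [0.1, 0.9, 0.4],
    "debug_plot": object(),  # stand-in for a non-serializable artifact
}

try:
    json.dumps(run_outputs)
except TypeError as err:
    # raised only at serving time, far away from the kedro run that produced it
    print(err)  # Object of type object is not JSON serializable
```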
Possible Implementation
- Introduce an output_name attribute in PipelineML and perform validation checks (pipeline resolution, JSON serializability, ...).
- Unpack predictions from the kedro runner in KedroPipelineModel by replacing `run_outputs = runner.run(pipeline, catalog)` with `run_outputs = runner.run(pipeline, catalog)[pipeline_ml.output_name]`.
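A minimal sketch of both proposals combined, written as a standalone helper rather than the actual KedroPipelineModel code (the function name and error message are illustrative only):

```python
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.runner import AbstractRunner

def run_and_unpack(
    runner: AbstractRunner,
    inference: Pipeline,
    catalog: DataCatalog,
    output_name: str,
):
    # validation check: the declared output must be produced by the inference pipeline
    if output_name not in inference.outputs():
        raise ValueError(
            f"output_name='{output_name}' is not an output of the inference pipeline"
        )
    # unpack: return only the declared prediction object instead of the whole dict,
    # so the scoring server receives a single, well-defined output
    return runner.run(inference, catalog)[output_name]
```

The JSON-serializability check would still have to happen at prediction time (or be documented as the user's responsibility), since the predictions object does not exist when the model is logged.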