
Enable pipeline_ml to share inputs between inference and training pipelines #71

Closed
Galileo-Galilei opened this issue Sep 29, 2020 · 3 comments · Fixed by #101

@Galileo-Galilei (Owner)

Notations:

  • training is the kedro.pipeline.Pipeline object passed as argument "training" of the pipeline_ml function
  • inference is the kedro.pipeline.Pipeline object passed as argument "inference" of the pipeline_ml function

Currently, inference.inputs() is forced to be a subset of training.all_outputs(). However, in some situations the two pipelines may share some inputs too. For instance, for some NLP models, a preprocessing step takes a list of stopwords to remove, and these stopwords are inputs (the same ones) for both training and inference.
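As a minimal, plain-Python sketch of the constraint described above (the set names mirror the pipeline methods but are illustrative, not the actual kedro-mlflow internals):

```python
# Sketch of the current validation rule: every free input of the inference
# pipeline must be an output of the training pipeline.
training_all_outputs = {"model", "vectorizer"}           # training.all_outputs()
inference_inputs = {"model", "vectorizer", "stopwords"}  # inference.inputs()

# "stopwords" is a genuinely shared input (used by both pipelines) but is
# not produced by training, so the current check rejects this pipeline_ml.
missing = inference_inputs - training_all_outputs
print(missing)  # {'stopwords'}
```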

Important points:

  • parameters are not persisted; if parameters are shared as inputs, they must be persisted when packaging the model (as a PickleDataSet, for instance).
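A hedged sketch of what persisting such a shared parameter amounts to, using only the standard library (this mimics what a PickleDataSet does under the hood; the file path and parameter values are arbitrary):

```python
import os
import pickle
import tempfile

# A shared parameter used by both training and inference, e.g. NLP stopwords.
stopwords = ["the", "a", "an", "of"]

# Persist it the way a PickleDataSet would, so that it can be packaged
# alongside the model artifacts.
path = os.path.join(tempfile.mkdtemp(), "stopwords.pkl")
with open(path, "wb") as f:
    pickle.dump(stopwords, f)

# At inference time, reload it from the packaged artifact.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
```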
@Galileo-Galilei Galileo-Galilei self-assigned this Sep 29, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Sep 29, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.4.0 milestone Sep 29, 2020
@takikadiri (Collaborator)

takikadiri commented Oct 4, 2020

The concepts of function, input, and output are clearly not sufficient to express an ML pipeline.
An ML pipeline naturally introduces the concept of a model, which is an input and an output at the same time. In your situation here, the parameters are "models" too, because they will be fitted to the data.
An ML pipeline also introduces the concept of, say, an ml_processor, which contains the fit and predict logic in the same node unit.

Actually, the post-construction of the pipeline_ml from regular kedro pipelines works well for advanced users, even if it requires considerable cognitive effort.

We should consider building a pipeline_ml API in the future (probably not 0.4.0) that helps users build their pipeline_ml using ML computing concepts (fit, predict, transform, ...); kedro-mlflow would translate that in the backend into regular kedro pipelines.
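A rough sketch of what such an ml_processor-style unit could look like (entirely hypothetical; this is not an existing kedro-mlflow interface, and the toy fit/predict logic is only there to show the two methods living in one unit):

```python
class MlProcessor:
    """Hypothetical unit holding fit and predict logic together; at pipeline
    build time this would be translated into regular kedro nodes."""

    def fit(self, texts, stopwords):
        # Toy "training": remember the vocabulary, minus the shared stopwords.
        self.vocabulary_ = {w for t in texts for w in t.split()} - set(stopwords)
        return self

    def predict(self, text):
        # Toy "inference": keep only the words seen at fit time.
        return [w for w in text.split() if w in self.vocabulary_]

proc = MlProcessor().fit(["the cat sat", "a dog ran"], stopwords=["the", "a"])
print(proc.predict("the cat ran fast"))  # ['cat', 'ran']
```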

But for now, I agree: we can just add the possibility of adding parameters to inference inputs; kedro-mlflow will pickle them and package them inside the model artifacts at training time.

@Galileo-Galilei (Owner, Author)

I suggest that we don't persist parameters under the hood, to avoid side effects: if a user wants to use a shared parameter as an input for both training and inference, they must persist it voluntarily (either as the input or the output of the shared node).

@takikadiri (Collaborator)

We can drop the raising of KedroMlflowPipelineMLDatasetsError exclusively for "parameters" and "params:xxx" entries, and persist them under the hood.
It's counterintuitive for users to have to persist params. Moreover, there are use cases where shared (training + inference) nodes have params as inputs; the user cannot easily provide a PickleDataSet in these cases.
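The special-casing discussed here could look something like this (a sketch only; the helper name is made up, not the actual implementation):

```python
def is_kedro_parameter(dataset_name: str) -> bool:
    """Sketch of the special case discussed above: treat Kedro's "parameters"
    catalog entry and any "params:<name>" entry as parameters that
    kedro-mlflow may pickle under the hood instead of raising
    KedroMlflowPipelineMLDatasetsError. (Illustrative helper, not the
    actual implementation.)"""
    return dataset_name == "parameters" or dataset_name.startswith("params:")

print(is_kedro_parameter("params:stopwords"))  # True
print(is_kedro_parameter("parameters"))        # True
print(is_kedro_parameter("stopwords"))         # False
```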
