Description
I sometimes want to retrieve artifacts from an existing mlflow run, but there is no built-in functionality in kedro-mlflow to do so. I have to do it manually: create an mlflow client, feed in credentials and configuration, download the artifact, and pass the new path to the underlying dataset. This is error-prone and often breaks the kedro principles, because such IO operations end up being performed inside a node.
Context
It is a common workflow to retrieve artifacts from a previous mlflow run to feed a new run (e.g. retrieve embeddings that were trained in an ML pipeline and stored in mlflow, then use them in another training pipeline for an NLP task such as classification).
Possible Implementation
The idea would be to add a load_args argument at the MlflowArtifactDataSet level (declared in catalog.yml) to handle this.
In kedro_mlflow/io/mlflow_dataset.py, add a _load method which, if a run_id is provided, creates an mlflow client, downloads the file to self._filepath and then loads the dataset.
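The loading behaviour described above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the kedro-mlflow implementation: the class name `ArtifactLoaderSketch`, the `load_underlying` callable and the optional `client` parameter are hypothetical, and the artifact is assumed to be stored at the root of the run's artifact store under the same file name as the local filepath. The only mlflow call used is `MlflowClient.download_artifacts(run_id, path, dst_path)`.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class ArtifactLoaderSketch:
    """Hypothetical stand-in for the loading half of MlflowArtifactDataSet."""

    def __init__(
        self,
        filepath: str,
        load_underlying: Callable[[Path], Any],
        load_args: Optional[dict] = None,
        client: Any = None,
    ) -> None:
        self._filepath = Path(filepath)
        # Assumption: the wrapped dataset is reduced to a "load from path" callable.
        self._load_underlying = load_underlying
        # Assumption: the run to load from is passed as load_args["run_id"].
        self._run_id = (load_args or {}).get("run_id")
        self._client = client

    def load(self) -> Any:
        if self._run_id is None:
            # No run_id: keep the current behaviour and read the local file.
            logger.info("Loading '%s' from the local filepath", self._filepath)
            return self._load_underlying(self._filepath)

        client = self._client
        if client is None:
            # Imported lazily so the sketch can be exercised without mlflow.
            from mlflow.tracking import MlflowClient

            client = MlflowClient()

        # The destination directory must exist before downloading.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        local_path = client.download_artifacts(
            self._run_id, self._filepath.name, str(self._filepath.parent)
        )
        logger.info(
            "Loading '%s' from mlflow run '%s'", self._filepath.name, self._run_id
        )
        return self._load_underlying(Path(local_path))
```

Accepting the client as a parameter is a design choice made here so that the download step can be replaced by a fake in tests; the issue itself only asks for the client to be created inside _load.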
Some important notes:
The existing run_id attribute must be deprecated and moved to a save_args argument.
Credentials (i.e. environment variables) must be properly managed before and after downloading, to avoid interfering with the global session.
We should probably add a logger to report where the dataset is retrieved from (mlflow or filepath).
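One possible way to handle the credentials note is to set the environment variables only for the duration of the download and restore the previous values afterwards. The sketch below uses only the standard library; the helper name `temporary_env` is hypothetical and not part of kedro-mlflow or mlflow.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Mapping


@contextmanager
def temporary_env(overrides: Mapping[str, str]) -> Iterator[None]:
    """Set environment variables inside the block, then restore the old state."""
    # Remember the previous value of each key (None means "was not set").
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                # The variable did not exist before: remove it again.
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```

The download would then be wrapped in something like `with temporary_env({"MLFLOW_TRACKING_USERNAME": user, "MLFLOW_TRACKING_PASSWORD": password}):`, so that the rest of the session sees the original environment again even if the download raises.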
Possible Alternatives
It may be more user-friendly to create two different datasets (MlflowArtifactLoggerDataSet and MlflowArtifactLoaderDataSet) to separate these two behaviours, but I think a single dataset which performs both operations is still understandable.
Galileo-Galilei changed the title from "EnableMlflowArtifactDataSet to load from a run_id" to "Enable MlflowArtifactDataSet to load from a run_id" on Oct 18, 2020