Enable MlflowArtifactDataSet to load from a run_id #95

Galileo-Galilei · 2020-10-18T08:18:06Z

Description

I sometimes want to retrieve artifacts from an existing mlflow run, but there is no built-in functionality in kedro-mlflow to do so. I have to do it manually (create an mlflow client, feed in credentials and configuration, download the artifact, feed the new path to the underlying dataset). This is error-prone and often breaks the kedro principles by performing such IO operations inside a node.

Context

It is a common workflow to retrieve artifacts from an mlflow run to feed a new run (e.g. retrieve embeddings trained in an ML pipeline and stored in mlflow in a previous run to feed another training pipeline for an NLP task like classification).

Possible Implementation

The idea would be to add a load_args argument at the MlflowArtifactDataSet level to handle this.

catalog.yml

my_artifact_dataset:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    load_args:
        run_id: 123456789
        credentials: mlflow_connect
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

kedro_mlflow/io/mlflow_dataset.py

Add a _load method which creates an mlflow client, downloads the file to self._filepath and loads the dataset if a run_id is provided.
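To make the idea concrete, here is a rough sketch of the loading logic as a standalone function, not the actual kedro-mlflow implementation. The function name `load_artifact` and its parameters are hypothetical. The `client` argument is assumed to expose `download_artifacts(run_id, path, dst_path)` the way `mlflow.tracking.MlflowClient` does, and the artifact is assumed to be stored at the root of the run under its file name.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def load_artifact(
    filepath: str,
    load_func: Callable[[str], Any],
    run_id: Optional[str] = None,
    client: Any = None,
) -> Any:
    """Load a dataset from disk, downloading it from an mlflow run first
    when a run_id is given (hypothetical helper, for illustration only)."""
    if run_id is None:
        # No run_id: behave like the underlying dataset and read the local file.
        return load_func(filepath)

    local_path = Path(filepath)
    local_path.parent.mkdir(parents=True, exist_ok=True)

    # Assumption: the artifact sits at the root of the run, named like the file.
    # download_artifacts returns the local path of the downloaded artifact.
    downloaded = client.download_artifacts(
        run_id=run_id,
        path=local_path.name,
        dst_path=str(local_path.parent),
    )
    return load_func(downloaded)
```

In a real dataset, `load_func` would be the `load` of the wrapped dataset (e.g. the pandas.CSVDataSet from the catalog entry above) and `client` would be created from the project's mlflow configuration.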

Some important notes:

the existing run_id attribute must be deprecated and moved to a save_args argument
credentials (i.e. environment variables) must be properly managed before and after downloading to avoid interfering with the global session.
We should likely add a logger to have information on where the dataset is retrieved from (mlflow or filepath)
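For the credentials point, one possible approach (a sketch only, not what kedro-mlflow does) is a small context manager that sets the environment variables for the duration of the download and restores the previous values afterwards. The name `temporary_environ` is hypothetical. `MLFLOW_TRACKING_USERNAME` in the usage example is a standard mlflow environment variable, and its value there is a placeholder.

```python
import logging
import os
from contextlib import contextmanager
from typing import Dict, Iterator

logger = logging.getLogger(__name__)


@contextmanager
def temporary_environ(credentials: Dict[str, str]) -> Iterator[None]:
    """Set environment variables inside the block, then restore the old state."""
    # Remember the previous value of each variable (None if it was unset).
    previous = {key: os.environ.get(key) for key in credentials}
    os.environ.update(credentials)
    try:
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old_value  # restore the original value


# Usage: the variable is only visible while the artifact is being downloaded.
with temporary_environ({"MLFLOW_TRACKING_USERNAME": "placeholder-user"}):
    logger.info("Loading dataset from mlflow")
    # ... download the artifact here ...
```

Restoring in a `finally` block means the global session is left untouched even if the download raises.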

Possible Alternatives

It may be more user-friendly to create two different datasets (MlflowArtifactLoggerDataSet and MlflowArtifactLoaderDataSet) to separate these two behaviours, but I think it is still understandable with one dataset that performs both operations.

Galileo-Galilei changed the title ~~EnableMlflowArtifactDataSet to load from a run_id~~ Enable MlflowArtifactDataSet to load from a run_id Oct 18, 2020

Galileo-Galilei self-assigned this Oct 18, 2020

Galileo-Galilei added the enhancement New feature or request label Oct 18, 2020

Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020

Galileo-Galilei modified the milestones: Release 0.5.0, Release 0.4.1 Nov 25, 2020

Galileo-Galilei modified the milestones: Release 0.5.0, Release 0.6.0 Dec 2, 2020

Galileo-Galilei modified the milestones: Release 0.5.0, Release 0.6.1 Feb 21, 2021

Galileo-Galilei modified the milestones: Release 0.7.1, 0.7.2 Apr 10, 2021

Galileo-Galilei added a commit that referenced this issue Aug 2, 2021

✨ Enable to load data from an exisiting run_id with MlflowArtifactDat…

130e96e

…aset (#95)

Galileo-Galilei mentioned this issue Aug 2, 2021

Enable to load data from an existing run_id with MlflowArtifactDataset (#95) #222

Merged

6 tasks

Galileo-Galilei added a commit that referenced this issue Aug 16, 2021

✨ Enable to load data from an existing run_id with MlflowArtifactData…

967a635

…set (#95)

Galileo-Galilei closed this as completed in #222 Aug 16, 2021

Galileo-Galilei added a commit that referenced this issue Aug 16, 2021

✨ Enable to load data from an existing run_id with MlflowArtifactData…

4f42d89

…set (#95)

Galileo-Galilei moved this to ✅ Done in kedro-mlflow roadmap Oct 29, 2024

Galileo-Galilei added this to kedro-mlflow roadmap Oct 29, 2024
