Enable MlflowArtifactDataSet to load from a run_id #95

Closed
Galileo-Galilei opened this issue Oct 18, 2020 · 0 comments · Fixed by #222
Labels: enhancement (New feature or request)

Description

I sometimes want to retrieve artifacts from an existing mlflow run, but kedro-mlflow has no built-in way to do so. I have to do it manually (create an mlflow client, pass in credentials and configuration, download the artifact, pass the new path to the underlying dataset). This is error-prone and often breaks kedro principles, because these IO operations end up inside a node.

Context

It is a common workflow to retrieve artifacts from a previous mlflow run to feed a new run (e.g. retrieve embeddings trained in an ML pipeline and stored in mlflow, then feed them to another training pipeline for an NLP task like classification).

Possible Implementation


The idea would be to add a load_args argument at the MlflowArtifactDataSet level to handle this.

catalog.yml

my_artifact_dataset:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    load_args:
        run_id: 123456789
        credentials: mlflow_connect
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

kedro_mlflow/io/mlflow_dataset.py

Add a _load method which, if a run_id is provided, creates an mlflow client and downloads the file to self._filepath, then loads the underlying dataset from that path.
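A minimal sketch of that loading logic, assuming the client exposes a download_artifacts(run_id, path, dst_path) method as mlflow's MlflowClient does. The class name ArtifactLoaderSketch and the load_local callable are hypothetical stand-ins for illustration, not the kedro-mlflow implementation, and the sketch assumes the artifact was logged under the file's own name.

```python
from pathlib import Path
from typing import Any, Callable, Optional


class ArtifactLoaderSketch:
    """Hypothetical illustration of the proposed _load behaviour."""

    def __init__(
        self,
        filepath: str,
        load_local: Callable[[Path], Any],
        client: Optional[Any] = None,
        run_id: Optional[str] = None,
    ) -> None:
        self._filepath = Path(filepath)
        # Stand-in for the underlying dataset's own load (e.g. a CSV reader).
        self._load_local = load_local
        # Assumed to behave like mlflow.tracking.MlflowClient.
        self._client = client
        self._run_id = run_id

    def load(self) -> Any:
        if self._run_id is None:
            # No run_id: behave like the plain underlying dataset.
            return self._load_local(self._filepath)

        # Download into the parent folder so the file lands at self._filepath
        # (assumption: the artifact path in the run is just the file name).
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        local_path = self._client.download_artifacts(
            self._run_id,
            self._filepath.name,
            dst_path=str(self._filepath.parent),
        )
        return self._load_local(Path(local_path))
```

Passing the client in, rather than creating it inside load, keeps the sketch testable without a tracking server; a real dataset would more likely build the client from the project configuration.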

Some important notes:

  • the existing run_id attribute must be deprecated and moved to a save_args argument
  • credentials (i.e. environment variables) must be set before downloading and restored afterwards, to avoid interfering with the global session
  • we should probably add a logger to report where the dataset is retrieved from (mlflow or filepath)
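One way to handle the credentials note is a context manager that sets the environment variables only for the duration of the download and then restores the previous state. This is a sketch under that assumption; the helper name temporary_environ is hypothetical and not part of kedro-mlflow or mlflow.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def temporary_environ(credentials: Dict[str, str]) -> Iterator[None]:
    """Set environment variables for the block, then restore the old values."""
    # Remember the previous value of each key (None means "was not set").
    previous = {key: os.environ.get(key) for key in credentials}
    os.environ.update(credentials)
    try:
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                # The variable did not exist before: remove it again.
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```

The download would then run inside a block such as `with temporary_environ({"MLFLOW_TRACKING_URI": uri}):`, so the global session sees its original variables again even if the download raises.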

Possible Alternatives

It may be more user-friendly to create two different datasets (MlflowArtifactLoggerDataSet and MlflowArtifactLoaderDataSet) to separate these two behaviours, but I think a single dataset which performs both operations is still understandable.

@Galileo-Galilei Galileo-Galilei changed the title from "EnableMlflowArtifactDataSet to load from a run_id" to "Enable MlflowArtifactDataSet to load from a run_id" Oct 18, 2020
@Galileo-Galilei Galileo-Galilei self-assigned this Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020
@Galileo-Galilei Galileo-Galilei modified the milestones: Release 0.7.1, 0.7.2 Apr 10, 2021