
Refactor mlflow configs #77

Closed
takikadiri opened this issue Oct 4, 2020 · 3 comments · Fixed by #267

Labels: enhancement (New feature or request), need-design-decision (Several ways of implementation are possible and one must be chosen)

@takikadiri (Collaborator)

As you know, the communication flow between the kedro app and mlflow can be represented as follows

[Figure: kedro_mlflow architecture diagram showing the communication flows between the Kedro app, the mlflow tracking server, and the artifact store]

In addition to what we already have in mlflow.yml, we can add some configuration entries that let users configure their connection to the mlflow tracking server and the artifact store (the red flows in the figure). These inputs may differ depending on the user's installation.

In this example we suppose that the user has an mlflow tracking server with Basic authentication and an artifact store on S3.

mlflow.yml

mlflow_access:

  mlflow_tracking: # Put here your non-sensitive environment variables relating to your mlflow tracking server connection. See the list here: https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    MLFLOW_TRACKING_URI: https://yourtrackingserver

  mlflow_artifact_store: # Put here your non-sensitive environment variables relating to your artifact store connection. See the list here according to your setup: https://www.mlflow.org/docs/latest/tracking.html#artifact-store
    MLFLOW_S3_ENDPOINT_URL: https://yours3endpoint
    MLFLOW_S3_UPLOAD_EXTRA_ARGS: {}

mlflow_params:

  experiment:
    name: kedro_mlflow
    create: True

  run:
    id: null # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True

  ui:
    port: null  # the port to use for the ui. Find a free port if null.
    host: null

credentials.yml

kedro_mlflow:

  mlflow_tracking: # Put here your sensitive environment variables relating to your mlflow tracking server connection. See the list here https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    
    MLFLOW_TRACKING_USERNAME: user
    MLFLOW_TRACKING_PASSWORD: pass

  mlflow_artifact_store: # Put here your sensitive environment variables relating to your artifact store connection. See the list here according to your setup https://www.mlflow.org/docs/latest/tracking.html#artifact-stores

    AWS_ACCESS_KEY_ID: xxxxxxx
    AWS_SECRET_ACCESS_KEY: xxxxx

That way we leverage the multi-environment configuration mechanisms that Kedro offers, and at the same time we make the use of mlflow easier and more fluid for our users.

kedro-mlflow hooks can easily access those configs and export them as environment variables.
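
A minimal sketch of how such a hook could do this export (illustrative only, not the actual kedro-mlflow API; the merged `mlflow_access` dict passed to the hook is an assumption):

import os
from kedro.framework.hooks import hook_impl


class MlflowAccessHook:
    """Sketch: export mlflow connection settings as environment variables."""

    def __init__(self, mlflow_access: dict):
        # `mlflow_access` is assumed to be the merged content of the
        # `mlflow_access` section of mlflow.yml and the `kedro_mlflow`
        # entry of credentials.yml for the active Kedro environment.
        self.mlflow_access = mlflow_access

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for section in ("mlflow_tracking", "mlflow_artifact_store"):
            for key, value in self.mlflow_access.get(section, {}).items():
                # mlflow and boto3 read these variables (MLFLOW_TRACKING_URI,
                # AWS_ACCESS_KEY_ID, ...) directly from the environment
                os.environ[key] = str(value)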

Fix #31 and #15

Let me know what you think

@Galileo-Galilei (Owner) commented Oct 17, 2020

Yes, I am totally considering doing this. However, credentials management can also be done at the DataSet level. We had use cases where we wanted, inside a run (in an mlflow development database), to retrieve a model or an artifact from a different mlflow instance (a production one), for instance to combine or compare two models. For this, we need to enable credentials management at the DataSet level, something like:

In credentials.yml

# credentials.yml

kedro_mlflow:
    <your idea here>

mlflow_creds1:
    AWS_ACCESS_KEY_ID: <another password>

In catalog.yml

# catalog.yml

dataset_to_retrieve:
    type: MlflowArtifactDataSet
    load_args:
        run_id: 123456798
        credentials: my_mlflow_creds1
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

This needs to be thoroughly designed before we freeze the way to perform such an operation.
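
One possible direction (purely a sketch under the assumption that the dataset resolves the named entry from credentials.yml itself, not a frozen design) is to export those credentials as environment variables only for the duration of the load, for instance with a small context manager:

import os
from contextlib import contextmanager


@contextmanager
def temporary_environ(credentials: dict):
    """Temporarily export credentials as environment variables, then restore the previous state."""
    saved = {key: os.environ.get(key) for key in credentials}
    os.environ.update({key: str(value) for key, value in credentials.items()})
    try:
        yield
    finally:
        for key, previous in saved.items():
            if previous is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = previous

The dataset's load could then wrap its call to the mlflow client in this context, so that credentials for a different mlflow instance never leak into the rest of the run.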

@takikadiri (Collaborator, Author)

We are facing two use cases of mlflow:

1 - mlflow as a tracking engine

In this use case, mlflow.yml configures the engine behavior and points to a potential credentials entry. Here is an enhanced example:

mlflow_config:

  mlflow_tracking: # Put here your non-sensitive environment variables relating to your mlflow tracking server connection. See the list here: https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    MLFLOW_TRACKING_URI: https://yourtrackingserver

  mlflow_artifact_store: # Put here your non-sensitive environment variables relating to your artifact store connection. See the list here according to your setup: https://www.mlflow.org/docs/latest/tracking.html#artifact-store
    MLFLOW_S3_ENDPOINT_URL: https://yours3endpoint
    MLFLOW_S3_UPLOAD_EXTRA_ARGS: {}

  experiment:
    name: your_experiment_name
    create: True

  run:
    id: null # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True

  credentials: credential_entry # points to an entry in your conf/<env>/credentials.yml

hook_config:

  pipeline:
    tracking_pipelines: [data_engineering, data_science, __default__] # If no pipeline is given, kedro_mlflow will track all your pipelines.

  node:
    flatten_dict_params: True # if True, parameters which are dictionaries will be split into multiple parameters when logged in mlflow, one for each key
    recursive: True # Should the dictionary flattening be applied recursively (i.e. for nested dictionaries)? Not used if `flatten_dict_params` is False.
    sep: "-"
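
For clarity, here is a sketch of the dictionary flattening behaviour that flatten_dict_params, recursive and sep describe (illustrative only; the actual kedro-mlflow helper may differ):

def flatten_dict(params: dict, recursive: bool = True, sep: str = "-") -> dict:
    """Flatten a (possibly nested) dictionary of parameters into a single level."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            # flatten one level, or all levels when `recursive` is True
            children = flatten_dict(value, recursive, sep) if recursive else value
            for child_key, child_value in children.items():
                flat[f"{key}{sep}{child_key}"] = child_value
        else:
            flat[key] = value
    return flat


# flatten_dict({"model": {"max_depth": 5, "grid": {"lr": 0.1}}})
# -> {"model-max_depth": 5, "model-grid-lr": 0.1}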

2 - mlflow as a database

Here we just leverage Kedro's catalog and DataSet mechanisms. In a regular catalog.yml we can have (what you suggested):

# catalog.yml

dataset_to_retrieve:
    type: MlflowArtifactDataSet
    load_args:
        run_id: 123456798
        credentials: my_mlflow_creds1
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

We just have to avoid starting an mlflow run when using it as a database.

So from a configuration perspective, your use case is natively possible with Kedro. All the remaining questions concern the MlflowArtifactDataSet implementation itself.

In both cases, credentials management is done at the context level; we just leverage Kedro's credentials management mechanisms.

@takikadiri takikadiri changed the title Managing mlflow tracking server and artifact store configurations and credentials Refactor mlflow configs Oct 17, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 17, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label Oct 18, 2020
@Galileo-Galilei Galileo-Galilei removed this from the Release 0.5.0 milestone Jan 25, 2021
@Galileo-Galilei (Owner)

I have refactored the KedroMlflowConfig to use pydantic instead of dicts to store the different options. The advantages are:

  • increased validation of the provided keys, with informative error messages
  • easier maintenance, by removing a lot of extra validation code
  • autocomplete for easier development

Regarding the refactoring, I'd prefer explicit references to mlflow objects for better comprehension, something like:

server: 
    mlflow_tracking_uri: xxx
    mlflow_model_registry: xxx
    mlflow_artifact_store: xxx
    credentials: xxx

entities:
    experiment:
        name: your_experiment_name
        create: True
    run:
        id: null # if `id` is None, a new run will be created
        name: null # if `name` is None, pipeline name will be used for the run name
        nested: True
    
tracking:
    params:
        dict_params: # extra level not needed, but it will simplify refactoring in the future?
            flatten: True
            recursive: True
            sep: "-"
    metrics: # maybe one day, for some autologging?
    tags: # maybe one day if we find a convenient API?
    models: # maybe one day for autologging?

After all, it does not make sense to make a reference to hooks since users are not aware of what they do. The "functional" part is always related to mlflow, not to Kedro.
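
For illustration, a minimal sketch (assumed names and defaults, not the actual KedroMlflowConfig) of how this layout could map onto pydantic models:

from typing import Optional

from pydantic import BaseModel


class ServerConfig(BaseModel):
    mlflow_tracking_uri: Optional[str] = None
    mlflow_model_registry: Optional[str] = None
    mlflow_artifact_store: Optional[str] = None
    credentials: Optional[str] = None  # name of an entry in credentials.yml


class ExperimentConfig(BaseModel):
    name: str = "Default"
    create: bool = True


class RunConfig(BaseModel):
    id: Optional[str] = None  # if None, a new run will be created
    name: Optional[str] = None  # if None, the pipeline name is used
    nested: bool = True


class EntitiesConfig(BaseModel):
    experiment: ExperimentConfig = ExperimentConfig()
    run: RunConfig = RunConfig()


class DictParamsConfig(BaseModel):
    flatten: bool = True
    recursive: bool = True
    sep: str = "-"


class ParamsConfig(BaseModel):
    dict_params: DictParamsConfig = DictParamsConfig()


class TrackingConfig(BaseModel):
    params: ParamsConfig = ParamsConfig()


class MlflowConfigSketch(BaseModel):
    server: ServerConfig = ServerConfig()
    entities: EntitiesConfig = EntitiesConfig()
    tracking: TrackingConfig = TrackingConfig()


# Feeding the parsed mlflow.yml into MlflowConfigSketch validates the values
# and raises a pydantic ValidationError with an informative message on bad input.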
