
Refactor mlflow configs #77

Closed
takikadiri opened this issue Oct 4, 2020 · 3 comments · Fixed by #267

Labels: enhancement (New feature or request), need-design-decision (Several ways of implementation are possible and one must be chosen)

@takikadiri (Collaborator)

As you know, the communication flow between the kedro app and mlflow can be represented as follows

[Figure: kedro_mlflow architecture diagram showing the communication flows between the Kedro app, the mlflow tracking server, and the artifact store]

In addition to what we already have in mlflow.yml, we can add some configuration entries that let users configure their connection to the mlflow tracking server and the artifact store (the red flows in the figure). These inputs may differ depending on the user's installation.

In this example we suppose that the user has an mlflow tracking server with Basic authentication and an artifact store on S3.

mlflow.yml

mlflow_access:

  mlflow_tracking: # Put here your non-sensitive environment variables relating to your mlflow tracking server connection. See the list here: https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    MLFLOW_TRACKING_URI: https://yourtrackingserver

  mlflow_artifact_store: # Put here your non-sensitive environment variables relating to your artifact store connection. See the list here according to your setup: https://www.mlflow.org/docs/latest/tracking.html#artifact-store
    MLFLOW_S3_ENDPOINT_URL: https://yours3endpoint
    MLFLOW_S3_UPLOAD_EXTRA_ARGS: {}

mlflow_params:

  experiment:
    name: kedro_mlflow
    create: True

  run:
    id: null # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True

  ui:
    port: null  # the port to use for the ui. Find a free port if null.
    host: null

credentials.yml

kedro_mlflow:

  mlflow_tracking: # Put here your sensitive environment variables relating to your mlflow tracking server connection. See the list here https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    
    MLFLOW_TRACKING_USERNAME: user
    MLFLOW_TRACKING_PASSWORD: pass

  mlflow_artifact_store: # Put here your sensitive environment variables relating to your artifact store connection. See the list here according to your setup https://www.mlflow.org/docs/latest/tracking.html#artifact-stores

    AWS_ACCESS_KEY_ID: xxxxxxx
    AWS_SECRET_ACCESS_KEY: xxxxx

That way we leverage the multi-environment configuration mechanisms that Kedro offers, and at the same time we make the use of mlflow easier and more fluid for our users.

kedro-mlflow hooks can easily access those configs and export them as environment variables.
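
A minimal sketch of how such a hook could do this export (illustrative only, not the actual kedro-mlflow API; the merged `mlflow_access` dict passed to the hook is an assumption):

import os
from kedro.framework.hooks import hook_impl


class MlflowAccessHook:
    """Sketch: export mlflow connection settings as environment variables."""

    def __init__(self, mlflow_access: dict):
        # `mlflow_access` is assumed to be the merged content of the
        # `mlflow_access` section of mlflow.yml and the `kedro_mlflow`
        # entry of credentials.yml for the active Kedro environment.
        self.mlflow_access = mlflow_access

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for section in ("mlflow_tracking", "mlflow_artifact_store"):
            for key, value in self.mlflow_access.get(section, {}).items():
                # mlflow and boto3 read these variables (MLFLOW_TRACKING_URI,
                # AWS_ACCESS_KEY_ID, ...) directly from the environment
                os.environ[key] = str(value)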

Fix #31 and #15

Let me know what you think

@Galileo-Galilei (Owner) commented Oct 17, 2020

Yes, I am totally considering doing this. However, credentials management can also be done at the DataSet level. We had use cases where we wanted, inside a run (in an mlflow development database), to retrieve a model or an artifact from a different mlflow instance (a production one), for instance to combine or compare two models. For this, we need to enable credentials management at the DataSet level, something like:

In credentials.yml

# credentials.yml

kedro_mlflow:
    <your idea here>

mlflow_creds1:
    AWS_ACCESS_KEY_ID: <another password>

In catalog.yml

# catalog.yml

dataset_to_retrieve:
    type: MlflowArtifactDataSet
    load_args:
        run_id: 123456798
        credentials: my_mlflow_creds1
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

This needs to be thoroughly designed before we freeze the way to perform such an operation.
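
One possible direction (purely a sketch under the assumption that the dataset resolves the named entry from credentials.yml itself, not a frozen design) is to export those credentials as environment variables only for the duration of the load, for instance with a small context manager:

import os
from contextlib import contextmanager


@contextmanager
def temporary_environ(credentials: dict):
    """Temporarily export credentials as environment variables, then restore the previous state."""
    saved = {key: os.environ.get(key) for key in credentials}
    os.environ.update({key: str(value) for key, value in credentials.items()})
    try:
        yield
    finally:
        for key, previous in saved.items():
            if previous is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = previous

The dataset's load could then wrap its call to the mlflow client in this context, so that credentials for a different mlflow instance never leak into the rest of the run.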

@takikadiri (Collaborator, Author)

We are facing two use cases of mlflow:

1 - mlflow as a tracking engine

In this use case, mlflow.yml configures the engine behavior and points to a potential credentials entry. Here is an enhanced example:

mlflow_config:

  mlflow_tracking: # Put here your non-sensitive environment variables relating to your mlflow tracking server connection. See the list here: https://www.mlflow.org/docs/latest/tracking.html#logging-to-a-tracking-server
    MLFLOW_TRACKING_URI: https://yourtrackingserver

  mlflow_artifact_store: # Put here your non-sensitive environment variables relating to your artifact store connection. See the list here according to your setup: https://www.mlflow.org/docs/latest/tracking.html#artifact-store
    MLFLOW_S3_ENDPOINT_URL: https://yours3endpoint
    MLFLOW_S3_UPLOAD_EXTRA_ARGS: {}

  experiment:
    name: your_experiment_name
    create: True

  run:
    id: null # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True

  credentials: credential_entry # points to an entry in your conf/<env>/credentials.yml

hook_config:

  pipeline:
    tracking_pipelines: [data_engineering, data_science, __default__] # If no pipeline is given, kedro_mlflow will track all your pipelines.

  node:
    flatten_dict_params: True # if True, parameters which are dictionaries will be split into multiple parameters when logged in mlflow, one for each key
    recursive: True # Should the dictionary flattening be applied recursively (i.e. for nested dictionaries)? Not used if `flatten_dict_params` is False.
    sep: "-"
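
For clarity, here is a sketch of the dictionary flattening behaviour that flatten_dict_params, recursive and sep describe (illustrative only; the actual kedro-mlflow helper may differ):

def flatten_dict(params: dict, recursive: bool = True, sep: str = "-") -> dict:
    """Flatten a (possibly nested) dictionary of parameters into a single level."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            # flatten one level, or all levels when `recursive` is True
            children = flatten_dict(value, recursive, sep) if recursive else value
            for child_key, child_value in children.items():
                flat[f"{key}{sep}{child_key}"] = child_value
        else:
            flat[key] = value
    return flat


# flatten_dict({"model": {"max_depth": 5, "grid": {"lr": 0.1}}})
# -> {"model-max_depth": 5, "model-grid-lr": 0.1}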

2 - mlflow as a database

Here we just leverage Kedro's catalog and DataSet mechanisms. In a regular catalog.yml we can have (what you suggested):

# catalog.yml

dataset_to_retrieve:
    type: MlflowArtifactDataSet
    load_args:
        run_id: 123456798
        credentials: my_mlflow_creds1
    data_set:
        type: pandas.CSVDataSet
        load_args:
            sep: ";"

We just have to avoid starting an mlflow run when using it as a database.

So from a configuration perspective, your use case is natively possible with Kedro. All the remaining questions concern the MlflowArtifactDataSet implementation itself.

In both cases, credentials management is done at the context level; we just leverage Kedro's credentials management mechanisms.

@takikadiri takikadiri changed the title Managing mlflow tracking server and artifact store configurations and credentials Refactor mlflow configs Oct 17, 2020
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 17, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.5.0 milestone Oct 18, 2020
@Galileo-Galilei Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label Oct 18, 2020
@Galileo-Galilei Galileo-Galilei removed this from the Release 0.5.0 milestone Jan 25, 2021
@Galileo-Galilei (Owner)

I have refactored the KedroMlflowConfig to use pydantic instead of dicts to store the different options. The advantages are:

  • increased validation of the provided keys, with informative error messages
  • easier maintenance, by removing a lot of extra validation code
  • autocomplete for easier development

Regarding the refactoring, I'd prefer explicit references to mlflow objects for better comprehension, something like:

server: 
    mlflow_tracking_uri: xxx
    mlflow_model_registry: xxx
    mlflow_artifact_store: xxx
    credentials: xxx

entities:
    experiment:
        name: your_experiment_name
        create: True
    run:
        id: null # if `id` is None, a new run will be created
        name: null # if `name` is None, pipeline name will be used for the run name
        nested: True
    
tracking:
    params:
        dict_params: # extra level not needed, but it will simplify refactoring in the future?
            flatten: True
            recursive: True
            sep: "-"
    metrics: # maybe one day, for some autologging?
    tags: # maybe one day if we find a convenient API?
    models: # maybe one day for autologging?

After all, it does not make sense to make a reference to hooks since users are not aware of what they do. The "functional" part is always related to mlflow, not to Kedro.
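
For illustration, a minimal sketch (assumed names and defaults, not the actual KedroMlflowConfig) of how this layout could map onto pydantic models:

from typing import Optional

from pydantic import BaseModel


class ServerConfig(BaseModel):
    mlflow_tracking_uri: Optional[str] = None
    mlflow_model_registry: Optional[str] = None
    mlflow_artifact_store: Optional[str] = None
    credentials: Optional[str] = None  # name of an entry in credentials.yml


class ExperimentConfig(BaseModel):
    name: str = "Default"
    create: bool = True


class RunConfig(BaseModel):
    id: Optional[str] = None  # if None, a new run will be created
    name: Optional[str] = None  # if None, the pipeline name is used
    nested: bool = True


class EntitiesConfig(BaseModel):
    experiment: ExperimentConfig = ExperimentConfig()
    run: RunConfig = RunConfig()


class DictParamsConfig(BaseModel):
    flatten: bool = True
    recursive: bool = True
    sep: str = "-"


class ParamsConfig(BaseModel):
    dict_params: DictParamsConfig = DictParamsConfig()


class TrackingConfig(BaseModel):
    params: ParamsConfig = ParamsConfig()


class MlflowConfigSketch(BaseModel):
    server: ServerConfig = ServerConfig()
    entities: EntitiesConfig = EntitiesConfig()
    tracking: TrackingConfig = TrackingConfig()


# Feeding the parsed mlflow.yml into MlflowConfigSketch validates the values
# and raises a pydantic ValidationError with an informative message on bad input.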
