
Store artifacts on s3 #15

Closed
akruszewski opened this issue Jul 2, 2020 · 11 comments
Labels
documentation (Improvements or additions to documentation)

Comments

@akruszewski

I'm just testing this plugin and trying to send artifacts to an S3 bucket. After reading the code of this project, I figured out that in its current state it's not possible. I just want to make sure that this is the case before I implement this functionality. @Galileo-Galilei, could you confirm that this is the case?

@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 2, 2020

TL;DR:
1) Create the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and MLFLOW_S3_ENDPOINT_URL environment variables.
2) Replace your dataset entries from

my_dataset_to_version:
    type: pandas.CSVDataSet  # or any valid kedro DataSet
    filepath: s3://path/to/file.csv

to

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv  # must be local!

Hello @akruszewski, nice to hear that you're trying the plugin out. In order to give you a more complete and accurate answer, I will need a bit more detail on how you set up your mlflow, but the answer is yes: you can store the artifacts basically anywhere with the plugin, and it should be completely straightforward.

A bit of context: mlflow under the hood

Warning: I don't want to be pedantic; if you are perfectly aware of how to configure mlflow, skip this section.
This section describes how mlflow manages the artifacts (and this is completely plugin-independent).

Mlflow separates HOW you store the artifacts from WHERE you store them.
A basic mlflow configuration is set up in two steps:

  1. declare WHERE your mlflow items will be stored. You can declare 2 locations:
    • one for the artifacts (in your case, an S3 storage). The reason why they are managed differently from other items is that they can be very big, and the storage size increases very quickly (each run copies the artifacts)
    • one for all other items (metrics, tags, parameters), which are very lightweight and straightforward to store
  2. define HOW your mlflow items will be stored. For artifacts, this is also a 2-step process (see the sketch after this list):
    • persist the artifact locally (this is mandatory)
    • call log_artifact to upload your local file. It is automatically uploaded WHERE you declared the artifact location in the first step
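
For illustration, here is a minimal sketch of these two steps with plain mlflow (no kedro involved), assuming the artifact store is already configured as described in the next section:

import mlflow
import pandas as pd

# 1) persist the artifact locally (a toy dataframe here)
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv("/tmp/file.csv", index=False)

# 2) upload the local file WHERE the artifact store points (S3 or other)
mlflow.log_artifact("/tmp/file.csv")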

How to set up WHERE artifacts are recorded (your S3 bucket)

With this setup in mind, it should be clear that WHERE the artifacts are recorded does not depend on HOW you log them. It is a configuration you must do once, outside the logging part.

Here I have to make some hypotheses about how your mlflow is configured, but I imagine that you either:

  • have created an mlflow server. In this case, you should specify the argument --default-artifact-root s3://my-mlflow-bucket/
  • are logging your runs locally and want to switch the artifact store (and only it) to an S3 bucket. In this case, mlflow reads the configuration from your environment variables (if you do not provide it, it defaults to the declared tracking uri, which is by default an mlruns folder at the root of your project). Basically, it looks in your environment variables for your credentials AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and the endpoint MLFLOW_S3_ENDPOINT_URL, as sketched below. This is described in detail in this section of the mlflow documentation.
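
A minimal sketch of the second case (all values are placeholders; you can equivalently export these variables in your shell):

import os

# set these before any mlflow call is made
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://<your-s3-endpoint>"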

How to use with the plugin

Setup WHERE mlflow records artifacts with the plugin

The mlflow.yml file is a configuration file. In particular, it overwrites the mlflow_tracking_uri from your environment variables with the entry mlflow_tracking_uri: /your/mlflow/tracking/uri/, which is your local mlruns folder by default (a minimal sketch of this file follows the list below).

  • If this tracking uri is an mlflow server (for which you specified the argument --default-artifact-root s3://my-mlflow-bucket/ as stated above), it should run just fine.
  • In the second case, you have to declare the 3 environment variables stated above manually for mlflow to be properly set up. I acknowledge that it is not very convenient. I would love your thoughts on this: would you find it easier to have entries in the mlflow.yml file to configure these environment variables?
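
For reference, a minimal sketch of the relevant mlflow.yml entry (the exact layout may vary across plugin versions; the value is a placeholder):

# conf/base/mlflow.yml
mlflow_tracking_uri: http://localhost:5000  # an mlflow server URI, or a local folder like mlruns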

HOW to log with the plugin

The plugin object for HOW to store artifacts is the MlflowDataSet, documented (quite sparsely, I admit) here.
Basically, it performs the two logging actions in a row: 1) persist the file locally, 2) call log_artifact to upload it to mlflow. Let's assume you have an entry in your catalog that looks like this:

my_dataset_to_version:
    type: pandas.CSVDataSet  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv

you just have to replace it with:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv

When the file is saved (at the end of a node), it is automatically uploaded WHERE your mlflow is configured to send it, whatever that is (S3 or other). If you manage to log your dataset to your S3 bucket with the log_artifact function, it should be logged in the same place with this dataset call. If that is not the case, it may be a bug; do not hesitate to tell me for further investigation.

@akruszewski
Author

Hi @Galileo-Galilei. First of all, sorry for the lame introduction. I'm working with @kaemo and I'm planning to help with the development of this plugin. Unfortunately, he is off for the next two weeks, so you will hear mostly from me.

Thanks for your detailed answer, which is really insightful.

I wasn't really clear with my question, so let me add some context.

In more detail: we are developing an example pipeline (using the Titanic dataset) and trying to create an idiomatic kedro pipeline
with S3 storage, mlflow and Great Expectations. Our initial assumption was to use the kedro catalog to save and obtain versioned data for the pipeline, validate it with Great Expectations, log artifacts and models with mlflow, and use its deployment, serving and inspection capabilities.

I was hoping that I would be able to log artifacts stored in S3 with the help of the data catalog, something like:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: s3://path/to/a/destination/file.csv

That was my misconception, as you explained.

I would love your thoughts on this: would you find it easier to have entries in the mlflow.yml file to configure these environment variables?

In my opinion, it would be more convenient to use mlflow.yml as a single source of truth for all settings related to mlflow (credentials could be stored in conf/local/mlflow.yml).

Right now I'm playing with pipeline_ml; when I finish that part, I will go back to the S3 integration. If I have any questions, I will reach out to you here, if you don't mind.

I'm also happy to contribute to this project; if you have any suggestions, let me know.

@Galileo-Galilei added the documentation and question labels on Jul 6, 2020
@Galileo-Galilei
Owner

Nice to hear that you want to get involved!

Regarding the different points you address:

  • I'll let you come back to me once you have tried the above solution and tell me whether it works (it should, but I confess that I have never stored data on S3. I did on Azure blob storage though and it worked, so it really should be straightforward).
  • adding more configuration details in the mlflow.yml file is a good idea, but the credentials must be kept in credentials.yml (to avoid pushing them to github for instance, and to make the configuration more "portable" from a dev environment to a production environment). I also have to take the priorities into account (between the environment variables of your session and potential other configuration files, for AWS for instance). My best guess on the fly is that I could create a section in the mlflow.yml which takes valid mlflow environment variables and exports them when needed.
  • mlflow does not let you push your file directly to an S3 storage (or anywhere else), to avoid messing up the database. This is not a bug but a design choice. I found it confusing at first too.
  • If you want to contribute, you can pick up one of the issues in the todo column of the dashboard and try to fill the gap. The explanations may be sparse; do not hesitate to ask if they are not clear. Creating a nice end-to-end example of how the plugin works is a very good way to help, and you seem to be on the way to doing it!

@Galileo-Galilei
Owner

Galileo-Galilei commented Feb 21, 2021

I am closing this issue since detailed documentation is now available on readthedocs.

Feel free to reopen if needed.

The above answer is still valid, but many improvements have been made since:

  • it is now possible to specify the environment variables described above in kedro's credentials.yml file, as described in the documentation (see the sketch below)
  • the MlflowDataSet is called MlflowArtifactDataSet in the most recent versions of the plugin
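
For instance, something along these lines (the key name my_mlflow_credentials is illustrative; see the linked documentation for the exact layout):

# conf/local/credentials.yml
my_mlflow_credentials:
    AWS_ACCESS_KEY_ID: <your-access-key-id>
    AWS_SECRET_ACCESS_KEY: <your-secret-access-key>
    MLFLOW_S3_ENDPOINT_URL: <your-s3-endpoint-url>

The mlflow.yml then references this key (e.g. credentials: my_mlflow_credentials), and the plugin exports the entries as environment variables at runtime.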

@foxale

foxale commented Jul 7, 2022

What a great thread! From what I understand, in your first comment you laid out how kedro-mlflow handles Scenario 4. I'm currently exploring ways to use kedro-mlflow, but for Scenario 5. From the client's perspective, the main difference is that all artifact paths should start with the mlflow-artifacts:/ prefix, which is later translated into the actual storage URI (on the mlflow server side). Does kedro-mlflow handle this case as well?

Edit: Nevermind, I've just tested Scenario 5 and it turns out kedro-mlflow handles this beautifully!

@Galileo-Galilei pinned this issue on Jul 14, 2022
@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 14, 2022

Hi @foxale, sorry for the delayed response.

There is no reason kedro-mlflow would not work for saving artifacts (it just calls log_artifact under the hood), while we may have issues loading artifacts without the prefix.

Actually, the MlflowArtifactDataSet does not load from the server but from the local path, except if you specify the run id explicitly, which is a very uncommon way to use kedro-mlflow (usually you just let the plugin open a new run_id for each kedro run); a sketch of that case follows. I'd be glad to get feedback if you have any issues with this modern way to set up mlflow, which I have never tried myself!
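
For completeness, a hedged sketch of the explicit-run-id case (the run_id value is a placeholder; check the plugin documentation for the exact syntax):

my_artifact_to_reload:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.CSVDataSet
        filepath: /path/to/a/local/destination/file.csv
    run_id: <an-existing-mlflow-run-id>  # reload the artifact logged in this run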

@OlegInsait

Hi @Galileo-Galilei,
We recently transitioned our storage from the host filesystem to an S3 bucket and encountered an error when attempting to save files stored in S3 as MLflow artifacts. It appears that the system expects files to be stored locally, as indicated by the error message.
From a brief look at the code, it seems like there might be a limitation or missing functionality in handling remote files directly from S3. I noticed that @akruszewski raised a similar issue but hasn't responded yet.
Could you clarify if there's a way to log artifacts in MLflow directly from S3 using kedro-mlflow? Here is how our catalog.yml is configured:

predictions:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
    dataset:
        type: pandas.ParquetDataset
        filepath: s3://my-bucket/data/09_results/predictions.parquet
        credentials: s3_creds

Any guidance or workaround to handle this scenario would be greatly appreciated!

@Galileo-Galilei
Owner

Hi @OlegInsait, as described in the thread above, limiting artifact logging to local filepaths and not S3 filepaths is a limitation of mlflow itself, not kedro-mlflow. You need to configure your mlflow server to have an S3 backend; then all calls to log_artifact will log to this remote storage.

Can you elaborate on your setup if you can't make it work, so I can help?

@OlegInsait

OlegInsait commented Jun 26, 2024

@Galileo-Galilei

limiting artifact logging to local filepaths and not S3 filepaths is a limitation of mlflow itself, not kedro-mlflow

This is exactly the issue. The source of the files to be logged is an S3 bucket. Setting up an S3 backend will only allow logging into S3.
In the mlflow issue (mlflow/mlflow#7547 (comment)) it was suggested to use mlflow.artifacts.download_artifacts() followed by mlflow.log_artifact().
I thought this could be implemented within a kedro-mlflow hook, using the catalog to download the file when its protocol is not the local one. Something like the sketch below.
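
A rough sketch of the idea (the function name and the use of fsspec are illustrative assumptions):

import os
import tempfile

import fsspec
import mlflow

def log_remote_file_as_artifact(remote_path, artifact_path=None):
    # mlflow.log_artifact only accepts local paths, so download the
    # remote (e.g. s3://...) file to a temp folder first, then log it
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        fs, _, _ = fsspec.get_fs_token_paths(remote_path)
        fs.get(remote_path, local_path)
        mlflow.log_artifact(local_path, artifact_path)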

@Galileo-Galilei
Owner

Galileo-Galilei commented Jun 30, 2024

You can perfectly do that, and you can implement your own custom kedro dataset to do it, but I won't support it because:

  • I don't want to maintain "extra" mlflow behaviour. If mlflow ends up implementing something like this on their end, I'll support it, but I can't afford the maintenance burden of adding many extra functionalities to what mlflow natively does.
  • it is likely not the desirable behaviour, at least not by default: it adds a big performance penalty for little reason
  • when you log data to S3 with kedro, you have access to the data in RAM at that very moment: you can log simultaneously to mlflow and S3, and avoid saving to S3, re-downloading, and saving again to mlflow (likely uploading to another S3!)

EDIT:

The key idea would be to modify this section of the code:

if self._logging_activated:
    if self.run_id:
        # if a run id is specified, we have to use mlflow client
        # to avoid potential conflicts with an already active run
        mlflow_client = MlflowClient()
        mlflow_client.log_artifact(
            run_id=self.run_id,
            local_path=local_path,
            artifact_path=self.artifact_path,
        )
    else:
        mlflow.log_artifact(local_path, self.artifact_path)

so that it stores the data in a temp folder and then logs it to mlflow, using the underlying dataset by copying it and modifying its path location in place (see the sketch below).
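
A rough, illustrative sketch of that idea (the attribute name _filepath mimics common kedro datasets but is an assumption, not the actual plugin internals):

import tempfile
from copy import deepcopy
from pathlib import Path, PurePosixPath

import mlflow

def log_dataset_via_tempfolder(dataset, data, artifact_path=None):
    # `dataset` is any kedro dataset exposing a `_filepath` attribute
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_dataset = deepcopy(dataset)
        local_path = Path(tmp_dir) / PurePosixPath(dataset._filepath).name
        local_dataset._filepath = local_path  # redirect the copy to local disk
        local_dataset.save(data)              # persist locally first (mlflow requirement)
        mlflow.log_artifact(str(local_path), artifact_path)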

@OlegInsait

Thank you @Galileo-Galilei!
I decided to use local "duplicated" datasets (wrapped with MlflowArtifactDataset) for all the data I want to log, in conjunction with an after_node_run hook that checks for duplicates of the outputs and saves the data when needed (roughly as sketched below).
It is a bit ugly, but it works for me.
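
A hypothetical reconstruction of such a hook (the "_local" naming convention and the hook body are illustrative assumptions, not the actual code):

from kedro.framework.hooks import hook_impl

class LogDuplicatedOutputsHook:
    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        # for each node output, save a copy to its local duplicate
        # (wrapped with MlflowArtifactDataset in the catalog) if one exists
        for name, data in outputs.items():
            local_name = f"{name}_local"
            if local_name in catalog.list():
                catalog.save(local_name, data)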
