
Store artifacts on s3 #15

Closed
akruszewski opened this issue Jul 2, 2020 · 11 comments
Labels
documentation (Improvements or additions to documentation)

Comments

@akruszewski

I'm just testing this plugin and trying to send artifacts to an S3 bucket. After reading the code of this project, I figured out that in its current state it's not possible. I just want to make sure that this is the case before I implement this functionality. @Galileo-Galilei, could you confirm that this is the case?

@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 2, 2020

TL;DR:
1) Create the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and MLFLOW_S3_ENDPOINT_URL environment variables.
2) Replace your dataset entries from

my_dataset_to_version:
    type: pandas.CSVDataSet  # or any valid kedro DataSet
    filepath: s3://path/to/file.csv

to

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv  # must be local!

Hello @akruszewski, nice to hear that you're trying the plugin out. In order to give you a more complete and accurate answer, I will need a bit more detail on how you set up your mlflow, but the answer is yes: you can store the artifacts basically anywhere with the plugin, and it should be completely straightforward.

A bit of context: mlflow under the hood

Warning: I don't want to be pedantic; if you are perfectly aware of how to configure mlflow, skip this section.
This section describes how mlflow manages the artifacts (and this is completely plugin-independent).

Mlflow separates HOW you store the artifacts from WHERE you store them.
A basic mlflow configuration is set up in two steps:

  1. declare WHERE your mlflow items will be stored. You can declare 2 locations:
    • one for the artifacts (in your case, an S3 storage). The reason why they are managed differently from other items is that they can be very big, and the storage size increases very quickly (each run copies the artifacts)
    • one for all other items (metrics, tags, parameters), which are very lightweight and straightforward to store
  2. define HOW your mlflow items will be stored. For artifacts, this is also a 2-step process (see the sketch after this list):
    • persist the artifact locally (this is mandatory)
    • call log_artifact to upload your local file. It is automatically uploaded WHERE you declared the artifact location in the first step
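
For illustration, here is a minimal sketch of these two steps with plain mlflow (no kedro involved), assuming the artifact store is already configured as described in the next section:

import mlflow
import pandas as pd

# 1) persist the artifact locally (a toy dataframe here)
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv("/tmp/file.csv", index=False)

# 2) upload the local file WHERE the artifact store points (S3 or other)
mlflow.log_artifact("/tmp/file.csv")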

How to set up WHERE artifacts are recorded (your S3 bucket)

With this setup in mind, it should be clear that WHERE the artifacts are recorded does not depend on HOW you log them. It is a configuration you must do once, outside the logging part.

Here I have to make some hypotheses about how your mlflow is configured, but I imagine that you either:

  • have created an mlflow server. In this case, you should specify the argument --default-artifact-root s3://my-mlflow-bucket/
  • are logging your runs locally and want to switch the artifact store (and only it) to an S3 bucket. In this case, mlflow reads the configuration from your environment variables (if you do not provide it, it defaults to the declared tracking uri, which is by default an mlruns folder at the root of your project). Basically, it looks in your environment variables for your credentials AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and the endpoint MLFLOW_S3_ENDPOINT_URL, as sketched below. This is described in detail in this section of the mlflow documentation.
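
A minimal sketch of the second case (all values are placeholders; you can equivalently export these variables in your shell):

import os

# set these before any mlflow call is made
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://<your-s3-endpoint>"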

How to use with the plugin

Setup WHERE mlflow records artifacts with the plugin

The mlflow.yml file is a configuration file. In particular, it overwrites the mlflow_tracking_uri from your environment variables with the entry mlflow_tracking_uri: /your/mlflow/tracking/uri/, which is your local mlruns folder by default (a minimal sketch of this file follows the list below).

  • If this tracking uri is an mlflow server (for which you specified the argument --default-artifact-root s3://my-mlflow-bucket/ as stated above), it should run just fine.
  • In the second case, you have to declare the 3 environment variables stated above manually for mlflow to be properly set up. I acknowledge that it is not very convenient. I would love your thoughts on this: would you find it easier to have entries in the mlflow.yml file to configure these environment variables?
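
For reference, a minimal sketch of the relevant mlflow.yml entry (the exact layout may vary across plugin versions; the value is a placeholder):

# conf/base/mlflow.yml
mlflow_tracking_uri: http://localhost:5000  # an mlflow server URI, or a local folder like mlruns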

HOW to log with the plugin

The plugin object for HOW to store artifacts is the MlflowDataSet, documented (quite sparsely, I admit) here.
Basically, it performs the two logging actions in a row: 1) persist the file locally, 2) call log_artifact to upload it to mlflow. Let's assume you have an entry in your catalog that looks like this:

my_dataset_to_version:
    type: pandas.CSVDataSet  # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv

you just have to replace it with:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv

When the file is saved (at the end of a node), it is automatically uploaded WHERE your mlflow is configured to send it, whatever that is (S3 or other). If you manage to log your dataset to your S3 bucket with the log_artifact function, it should be logged in the same place with this dataset call. If that is not the case, it may be a bug; do not hesitate to tell me for further investigation.

@akruszewski
Author

Hi @Galileo-Galilei. First of all, sorry for the lame introduction. I'm working with @kaemo and I'm planning to help with the development of this plugin. Unfortunately, he is off for the next two weeks, so you will hear mostly from me.

Thanks for your detailed answer, which is really insightful.

I wasn't really clear with my question, so let me add some context.

In more detail: we are developing an example pipeline (using the Titanic dataset) and trying to create an idiomatic kedro pipeline
with S3 storage, mlflow and Great Expectations. Our initial assumption was to use the kedro catalog to save and obtain versioned data for the pipeline, validate it with Great Expectations, log artifacts and models with mlflow, and use its deployment, serving and inspection capabilities.

I was hoping that I would be able to log artifacts stored in S3 with the help of the data catalog, something like:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: s3://path/to/a/destination/file.csv

That was my misconception, as you explained.

I would love your thoughts on this: would you find it easier to have entries in the mlflow.yml file to configure these environment variables?

In my opinion, it would be more convenient to use mlflow.yml as a single source of truth for all settings related to mlflow (credentials could be stored in conf/local/mlflow.yml).

Right now I'm playing with pipeline_ml; when I finish that part, I will go back to the S3 integration. If I have any questions, I will reach out to you here, if you don't mind.

I'm also happy to contribute to this project; if you have any suggestions, let me know.

@Galileo-Galilei added the documentation and question labels on Jul 6, 2020
@Galileo-Galilei
Owner

Nice to hear that you want to get involved!

Regarding the different points you address:

  • I'll let you come back to me once you have tried the above solution and tell me whether it works (it should, but I confess that I have never stored data on S3. I did on Azure blob storage though and it worked, so it really should be straightforward).
  • adding more configuration details in the mlflow.yml file is a good idea, but the credentials must be kept in credentials.yml (to avoid pushing them to github for instance, and to make the configuration more "portable" from a dev environment to a production environment). I also have to take the priorities into account (between the environment variables of your session and potential other configuration files, for AWS for instance). My best guess on the fly is that I could create a section in the mlflow.yml which takes valid mlflow environment variables and exports them when needed.
  • mlflow does not let you push your file directly to an S3 storage (or anywhere else), to avoid messing up the database. This is not a bug but a design choice. I found it confusing at first too.
  • If you want to contribute, you can pick up one of the issues in the todo column of the dashboard and try to fill the gap. The explanations may be sparse; do not hesitate to ask if they are not clear. Creating a nice end-to-end example of how the plugin works is a very good way to help, and you seem to be on the way to doing it!

@Galileo-Galilei
Owner

Galileo-Galilei commented Feb 21, 2021

I am closing this issue since detailed documentation is now available on readthedocs.

Feel free to reopen if needed.

The above answer is still valid, but many improvements have been made since:

  • it is now possible to specify the environment variables described above in kedro's credentials.yml file, as described in the documentation (see the sketch below)
  • the MlflowDataSet is called MlflowArtifactDataSet in the most recent versions of the plugin
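
For instance, something along these lines (the key name my_mlflow_credentials is illustrative; see the linked documentation for the exact layout):

# conf/local/credentials.yml
my_mlflow_credentials:
    AWS_ACCESS_KEY_ID: <your-access-key-id>
    AWS_SECRET_ACCESS_KEY: <your-secret-access-key>
    MLFLOW_S3_ENDPOINT_URL: <your-s3-endpoint-url>

The mlflow.yml then references this key (e.g. credentials: my_mlflow_credentials), and the plugin exports the entries as environment variables at runtime.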

@foxale

foxale commented Jul 7, 2022

What a great thread! From what I understand, in your first comment you laid out how kedro-mlflow handles Scenario 4. I'm currently exploring ways to use kedro-mlflow, but for Scenario 5. From the client's perspective, the main difference is that all artifact paths should start with the mlflow-artifacts:/ prefix, which is later translated into the actual storage URI (on the mlflow server side). Does kedro-mlflow handle this case as well?

Edit: Nevermind, I've just tested Scenario 5 and it turns out kedro-mlflow handles this beautifully!

@Galileo-Galilei pinned this issue on Jul 14, 2022
@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 14, 2022

Hi @foxale, sorry for the delayed response.

There is no reason kedro-mlflow would not work for saving artifacts (it just calls log_artifact under the hood), while we may have issues loading artifacts without the prefix.

Actually, the MlflowArtifactDataSet does not load from the server but from the local path, except if you specify the run id explicitly, which is a very uncommon way to use kedro-mlflow (usually you just let the plugin open a new run_id for each kedro run); a sketch of that case follows. I'd be glad to get feedback if you have any issues with this modern way to set up mlflow, which I have never tried myself!
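
For completeness, a hedged sketch of the explicit-run-id case (the run_id value is a placeholder; check the plugin documentation for the exact syntax):

my_artifact_to_reload:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.CSVDataSet
        filepath: /path/to/a/local/destination/file.csv
    run_id: <an-existing-mlflow-run-id>  # reload the artifact logged in this run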

@OlegInsait

Hi @Galileo-Galilei,
We recently transitioned our storage from the host filesystem to an S3 bucket and encountered an error when attempting to save files stored in S3 as MLflow artifacts. It appears that the system expects files to be stored locally, as indicated by the error message.
From a brief look at the code, it seems like there might be a limitation or missing functionality in handling remote files directly from S3. I noticed that @akruszewski raised a similar issue but hasn't responded yet.
Could you clarify if there's a way to log artifacts in MLflow directly from S3 using kedro-mlflow? Here is how our catalog.yml is configured:

predictions:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
    dataset:
        type: pandas.ParquetDataset
        filepath: s3://my-bucket/data/09_results/predictions.parquet
        credentials: s3_creds

Any guidance or workaround to handle this scenario would be greatly appreciated!

@Galileo-Galilei
Owner

Hi @OlegInsait, as described in the thread above, limiting artifact logging to local filepaths and not S3 filepaths is a limitation of mlflow itself, not kedro-mlflow. You need to configure your mlflow server to have an S3 backend; then all calls to log_artifact will log to this remote storage.

Can you elaborate on your setup if you can't make it work, so I can help?

@OlegInsait

OlegInsait commented Jun 26, 2024

@Galileo-Galilei

limiting artifact logging to local filepaths and not S3 filepaths is a limitation of mlflow itself, not kedro-mlflow

This is exactly the issue. The source of the files to be logged is an S3 bucket. Setting up an S3 backend will only allow logging into S3.
In the mlflow issue (mlflow/mlflow#7547 (comment)) it was suggested to use mlflow.artifacts.download_artifacts() followed by mlflow.log_artifact().
I thought this could be implemented within a kedro-mlflow hook, using the catalog to download the file when its protocol is not the local one. Something like the sketch below.
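
A rough sketch of the idea (the function name and the use of fsspec are illustrative assumptions):

import os
import tempfile

import fsspec
import mlflow

def log_remote_file_as_artifact(remote_path, artifact_path=None):
    # mlflow.log_artifact only accepts local paths, so download the
    # remote (e.g. s3://...) file to a temp folder first, then log it
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        fs, _, _ = fsspec.get_fs_token_paths(remote_path)
        fs.get(remote_path, local_path)
        mlflow.log_artifact(local_path, artifact_path)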

@Galileo-Galilei
Owner

Galileo-Galilei commented Jun 30, 2024

You can perfectly do that, and you can implement your own custom kedro dataset to do it, but I won't support it because:

  • I don't want to maintain "extra" mlflow behaviour. If mlflow ends up implementing something like this on their end, I'll support it, but I can't afford the maintenance burden of adding many extra functionalities to what mlflow natively does.
  • it is likely not the desirable behaviour, at least not by default: it adds a big performance penalty for little reason
  • when you log data to S3 with kedro, you have access to the data in RAM at that very moment: you can log simultaneously to mlflow and S3, and avoid saving to S3, re-downloading, and saving again to mlflow (likely uploading to another S3!)

EDIT:

The key idea would be to modify this section of the code:

if self._logging_activated:
    if self.run_id:
        # if a run id is specified, we have to use mlflow client
        # to avoid potential conflicts with an already active run
        mlflow_client = MlflowClient()
        mlflow_client.log_artifact(
            run_id=self.run_id,
            local_path=local_path,
            artifact_path=self.artifact_path,
        )
    else:
        mlflow.log_artifact(local_path, self.artifact_path)

so that it stores the data in a temp folder and then logs it to mlflow, using the underlying dataset by copying it and modifying its path location in place (see the sketch below).
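
A rough, illustrative sketch of that idea (the attribute name _filepath mimics common kedro datasets but is an assumption, not the actual plugin internals):

import tempfile
from copy import deepcopy
from pathlib import Path, PurePosixPath

import mlflow

def log_dataset_via_tempfolder(dataset, data, artifact_path=None):
    # `dataset` is any kedro dataset exposing a `_filepath` attribute
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_dataset = deepcopy(dataset)
        local_path = Path(tmp_dir) / PurePosixPath(dataset._filepath).name
        local_dataset._filepath = local_path  # redirect the copy to local disk
        local_dataset.save(data)              # persist locally first (mlflow requirement)
        mlflow.log_artifact(str(local_path), artifact_path)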

@OlegInsait

Thank you @Galileo-Galilei!
I decided to use local "duplicated" datasets (wrapped with MlflowArtifactDataset) for all the data I want to log, in conjunction with an after_node_run hook that checks for duplicates of the outputs and saves the data when needed (roughly as sketched below).
It is a bit ugly, but it works for me.
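
A hypothetical reconstruction of such a hook (the "_local" naming convention and the hook body are illustrative assumptions, not the actual code):

from kedro.framework.hooks import hook_impl

class LogDuplicatedOutputsHook:
    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        # for each node output, save a copy to its local duplicate
        # (wrapped with MlflowArtifactDataset in the catalog) if one exists
        for name, data in outputs.items():
            local_name = f"{name}_local"
            if local_name in catalog.list():
                catalog.save(local_name, data)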
