mlops-sustainability-oss 🌱

Disclaimer: This repository focuses on realizing adequate processes/pipelines in CPD. From an ML perspective, the trained model itself is of minor importance and will not receive the time investment it would need.

A self-sustaining end-to-end CP4D MLOps workflow for a flood forecasting model with automated data versioning and rollback

Welcome 👋 to this MLOps for sustainability repository.

The Mission: With climate change becoming an increasingly pressing issue, flood risks are becoming more prevalent. We want to demonstrate how we can leverage data from the Copernicus Climate Data Store and build an MLOps pipeline that is self-sustaining in nature. When a model is trained only once - or re-trained only in terms of hyperparameter tuning - its accuracy will decay, leaving a more or less unusable model after a given amount of time. 📉 Once the days for which we predicted flood risks have passed and the actual data is available, we retrieve the newest data, retrain the model, and benchmark it against its predecessor to determine which model to keep and deploy.
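To make the benchmark-and-decide step more concrete, the comparison boils down to something like the following sketch. Function and metric choices here are illustrative assumptions, not the exact code from the pipeline notebooks:

```python
from sklearn.metrics import mean_squared_error

def pick_model_to_deploy(champion, challenger, X_holdout, y_holdout):
    """Hypothetical champion/challenger check: keep whichever model
    performs better on the most recent, now-labelled holdout data."""
    champion_mse = mean_squared_error(y_holdout, champion.predict(X_holdout))
    challenger_mse = mean_squared_error(y_holdout, challenger.predict(X_holdout))
    # Deploy the retrained model only if it beats its predecessor;
    # otherwise keep (i.e. roll back to) the current champion.
    return challenger if challenger_mse < champion_mse else champion
```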

Consider this repository a modified version (an extended subset) of our MLOps-CPD repository, which is our simplest approach with the most rigorous documentation.

The main differences are:

  • Using climate data from Copernicus instead of the German Credit Risk Dataset
  • Self-sustaining approach through a scheduled pipeline system that retrieves the most recent weather data
  • Use of Open Source Software (OSS) for Model and Dataset Versioning via Data Version Control (DVC)

This MLOps accelerator uses IBM Cloud Object Storage (COS) as the remote for DVC (via the S3 API), an SKLearn model for training, and Watson Machine Learning (WML) for deployment. Our notebook repertoire is easily modified to leverage a different data store, custom ML models, and other providers for model deployment.
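For orientation, the training component could be as simple as the following scikit-learn sketch; the feature file, column names, and model choice are placeholders rather than the exact setup used in our notebooks:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature frame built from the ERA5 variables, with the
# river discharge variable (dis24) as the prediction target.
df = pd.read_pickle("data/climate_features.pkl")  # path is illustrative
X = df.drop(columns=["dis24"])
y = df["dis24"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```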

🎥 IBM Employees only: Check out the recording of our TechFest23 session for this project

⚠️ Disclaimer: This model is currently nowhere near academic-grade quality. We focused primarily on quickly constructing an MLOps workflow that works with sustainability data and DVC.

We reserve the right to continuously fix, improve, and progress this repository. See todos for information on upcoming features.

Overview

This section describes our data source, datasets, sub-modules, requirements, etc. in more detail.

Data Source and Datasets

We use the Copernicus Climate Data Store to retrieve historical and current climate data. We collect the following datasets and variables:

Since we want to predict flooding risks, we want to predict the time and place where extremely high river discharge occurs. Therefore, all variables gathered from ERA5 are used to predict our predictand dis24. All data coming from the aforementioned datasets comes in either NetCDF or NetCDF4 format, which is easily handled in our notebooks. We make use of Copernicus' "sub-region extraction" feature to retrieve data not for the whole globe, but for a specified sub-region delimited by coordinates for the N, W, S, E boundaries. This allows us to easily build different pipelines (and subsequently deploy different models for different regions). In our example, we set Europe as the specified region in our pipeline parameters.
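For example, a sub-region request through the cdsapi client looks roughly like the sketch below. The dataset name, variable, and dates are placeholders; the part being illustrated is the area parameter with its [North, West, South, East] boundaries:

```python
import cdsapi

# The client reads your personal CDS credentials from ~/.cdsapirc
c = cdsapi.Client()

c.retrieve(
    "reanalysis-era5-single-levels",          # dataset name is illustrative
    {
        "product_type": "reanalysis",
        "variable": ["total_precipitation"],  # variable is illustrative
        "year": "2023",
        "month": "01",
        "day": "01",
        "time": "12:00",
        "format": "netcdf",
        # Sub-region extraction: [North, West, South, East] in degrees,
        # here roughly covering Europe as in our pipeline parameters.
        "area": [72, -25, 34, 45],
    },
    "era5_subregion.nc",
)
```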

All of the aforementioned tasks are handled in our notebooks. To recreate the proposed MLOps lifecycle, you will need to create your own Copernicus account to retrieve personalized credentials, since we are not sharing / hard-coding ours for obvious reasons. Retrieve your cdsapi credentials and learn how to use the API here.

Prerequisites on IBM Cloud

In order to use the above asset, you need access to an IBM Cloud environment with authentication. Your IBM Cloud account should have access to the following services:

  • IBM Watson Studio
  • IBM Watson Machine Learning (unless you are deploying with a different provider)
  • IBM Cloud Object Storage (unless you are using a different data store)

Please make sure you have appropriate access to all of these services.

The runs are also governed by the amount of capacity unit hours (CUH) you have access to. If you are running on the free plan, please refer to the following links:

Instructions for Project Set-up

General Set-up

For a general tutorial, please refer to the regularly maintained instructions in our main MLOps repository. We cannot afford to update each repository after every change or new feature in Watson Studio etc., which is why we refer you to the main documentation.

There you will find instructions on Watson Studio Project and Deployment Spaces, Cloud Object Storage set-up, Pipeline set-up and more.

Where this project deviates

Addendum: Cloud Object Storage Set-up

When creating Cloud Object Storage credentials, you will need to enable HMAC.

Service Credentials > New Credential (Advanced > USE HMAC KEYS ("HMAC": true) and WRITER privilege). We need the access_key_id and secret_access_key from the generated credential.

Add both keys to your credentials.py file if you are running locally, or make sure both keys are available to the pipeline, e.g.:

  • add them to MLOPS_COS_CREDENTIALS together with all other Cloud Object Storage secrets (see the sketch below).
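A minimal, hypothetical credentials.py entry could look like this; the exact structure of MLOPS_COS_CREDENTIALS is an assumption for illustration:

```python
# credentials.py -- for local runs only; never commit real secrets to Git.
# NOTE: the exact shape of MLOPS_COS_CREDENTIALS is an assumption here; the
# point is that both HMAC keys travel alongside the other COS secrets.
MLOPS_COS_CREDENTIALS = {
    # ... your other Cloud Object Storage secrets (API key, endpoint, bucket, ...)
    "access_key_id": "<HMAC access_key_id from the service credential>",
    "secret_access_key": "<HMAC secret_access_key from the service credential>",
}
```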

Addendum: Pipeline Set-up

Sample Pipeline Layout

DVC Set-up

DVC is a Git-like way to manage large data across systems, and it connects easily with IBM COS to store and distribute versioned data. This section assumes some familiarity with creating resources through the cloud.ibm.com dashboard, as covered in MLOps-CPD.

NOT covered by Notebooks

In order for DVC to be able to actively track datasets or models, you will need to initialize a new, empty Git repository.

Add the information for your repository to credentials.py under the GIT_REPOSITORY environment variable. It has to have the following format:
https://USERNAME:GIT-TOKEN@github.com/USERNAME/REPOSITORY-NAME

Note: DVC will not store your dataset and model in the Git repository, only small placeholder files that track your data files and directories. Additionally, the repository will contain your DVC configuration file, which in turn contains your remote settings (URL, endpoint, unhashed access secrets).

Covered by Notebook

Within the notebook a1_init_dvc_and_track_data.ipynb:

  • the previously created Git repository is cloned into the temporary filesystem of the CPDaaS Jupyter runtime.
  • DVC is initialized (dvc init).
  • A Cloud Object Storage instance is added to DVC as a remote (using the credentials passed via WS Pipeline Parameters), and the resulting configuration is committed to the DVC Git repository via git commit and dvc push.
  • A folder structure is created, e.g. /data, /model.
  • The full pickle binary of the dataset is added and pushed via dvc add, git commit, and dvc push:
    • the metadata is pushed to the Git repository
    • the binary is uploaded to the COS bucket

Note: All steps will be repeated by the pipeline as a consequence of having it run on a schedule. That is of no concern, since redundant cells will be skipped.
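For orientation, the steps covered by the notebook roughly correspond to the command sequence sketched below (driven here via subprocess; the notebook itself uses its own cells). Bucket name, endpoint URL, and file paths are placeholders:

```python
import subprocess

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

# Clone the tracking repository and initialize DVC inside it.
run("git clone https://USERNAME:GIT-TOKEN@github.com/USERNAME/REPOSITORY-NAME repo")
run("cd repo && dvc init")

# Register the COS bucket as a DVC remote through the S3-compatible API.
run("cd repo && dvc remote add -d cos-remote s3://<bucket-name>/dvc-store")
run("cd repo && dvc remote modify cos-remote endpointurl https://s3.<region>.cloud-object-storage.appdomain.cloud")
run("cd repo && dvc remote modify --local cos-remote access_key_id <HMAC access_key_id>")
run("cd repo && dvc remote modify --local cos-remote secret_access_key <HMAC secret_access_key>")

# Track the dataset: the small .dvc metadata file goes to Git,
# the pickle binary itself goes to the COS bucket.
run("cd repo && mkdir -p data model && cp ../dataset.pkl data/")
run("cd repo && dvc add data/dataset.pkl")
run("cd repo && git add -A && git commit -m 'Track dataset with DVC'")
run("cd repo && git push && dvc push")
```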
