Microsoft Ignite - Getting started on your health-tech journey using responsible AI


languages: python
products: Azure Machine Learning Service
description: Responsible AI 2020

App Overview

Responsible AI 2020 - Detect if a patient needs treatment based on heart-disease data

Introduction

This repository shows how, using the latest machine learning features in Azure Machine Learning and Microsoft's open-source toolkits, we can put the responsible AI principles into practice. These tools empower data scientists and developers to understand ML models, protect people and their data, and control the end-to-end ML process.

To do this, we will develop a solution that detects whether a person is suitable for receiving treatment for heart disease. We will use a dataset that helps us separate patients who have heart disease from those who don't. Using this example, we will show how to ensure ethical, transparent and accountable use of AI in a medical scenario.

This example illustrates how to put the responsible AI principles into practice throughout the different stages of the machine learning pipeline (preprocessing, training/evaluation, model registration).

Within this repository, you will find all the resources needed to create and simulate a medical scenario using Azure Machine Learning Service with Responsible AI techniques such as:

  1. Differential Privacy
  2. FairLearn
  3. InterpretML
  4. DataDrift

The goal of this project is to detect if a person is suitable for receiving a treatment for heart disease.

Objectives

  • Understand how Responsible AI techniques work.
  • Use Azure Machine Learning Service to create a machine learning pipeline with these Responsible AI techniques.
  • Prevent data exposure with differential privacy.
  • Mitigate model unfairness.
  • Interpret and explain models.

Why use Azure Machine Learning Service?

Azure Machine Learning Service gives us the ability to apply MLOps techniques; it empowers data scientists and app developers to bring ML models to production.

The MLOps functionality in Azure Machine Learning enables you to track, version, audit, certify, and re-use every asset in your ML lifecycle, and it provides orchestration services to streamline managing that lifecycle.

mlops

What are the key challenges we wish to solve with MLOps?

mlops_flow

Model reproducibility & versioning

  • Track, snapshot & manage assets used to create the model
  • Enable collaboration and sharing of ML pipelines

Model packaging & validation

  • Support model portability across a variety of platforms
  • Certify model performance meets functional and latency requirements

Model auditability & explainability

  • Maintain asset integrity & persist access control logs
  • Certify model behavior meets regulatory & adversarial standards

Model deployment & monitoring

  • Release models with confidence
  • Monitor & know when to retrain by analyzing signals such as data drift

If you want to know more about how we have implemented the machine learning workflow using Azure Machine Learning Studio, please see the following file.

Understanding (InterpretML)

InterpretML

As ML integrates deeply into our day-to-day business processes, transparency is critical. Azure Machine Learning helps not only to understand model behavior, but also to assess and mitigate bias.

Model interpretation and explanation in Azure Machine Learning are based on the InterpretML toolset. It helps developers and data scientists understand a model's behavior and provides explanations for the decisions made during inference, which brings transparency to customers and business stakeholders.

Use model interpretation capability to:

  1. Create accurate ML models.

  2. Understand the behavior of a wide variety of models, including deep neural networks, during the learning and inference phases.

InterpretML

  3. Perform what-if (conditional) analysis to determine the impact on model predictions when feature values are changed.

InterpretML
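
As an illustration of the three points above, here is a minimal, hedged sketch using the interpret-community package from this repository's requirements. The model, X_train, X_test and feature_names variables are assumptions standing in for objects created in the training notebook:

```python
# A minimal sketch of explaining a trained classifier with interpret-community.
# model, X_train, X_test and feature_names are assumed to exist already.
from interpret_community import TabularExplainer

explainer = TabularExplainer(
    model,                                  # any fitted scikit-learn-style estimator
    X_train,                                # background (initialization) examples
    features=feature_names,
    classes=["no treatment", "treatment"],  # illustrative class names
)

# Global explanation: which features drive predictions overall?
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())

# Local explanation: why did the model flag these specific patients?
local_explanation = explainer.explain_local(X_test[:5])
```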

Evaluation and mitigation of model bias (FairLearn)

FairLearn

Today, a challenge in building artificial intelligence systems is the difficulty of prioritizing fairness. By using Fairlearn with Azure Machine Learning, developers and data scientists can leverage specialized algorithms to ensure fairer outcomes for everyone.

Use fairness capabilities to:

  1. Assess model fairness during training and deployment.

  2. Mitigate unfairness while optimizing model performance.

  3. Use interactive visualizations to compare a set of recommended models that mitigate unfairness.

FairLearn

FairLearn
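
A hedged sketch of this assess-and-mitigate loop using Fairlearn's MetricFrame API; the X_train, X_test, y_train, y_test variables and the sensitive 'sex' column follow this repository's dataset, and the snippet is illustrative rather than a copy of fairlearn.ipynb:

```python
# Assess fairness across groups, then mitigate with a reduction algorithm.
from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare accuracy and selection rate per group of the sensitive feature.
metrics = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=X_test["sex"],
)
print(metrics.by_group)

# Mitigation: search for a classifier that satisfies demographic parity.
mitigator = ExponentiatedGradient(
    LogisticRegression(max_iter=1000), constraints=DemographicParity()
)
mitigator.fit(X_train, y_train, sensitive_features=X_train["sex"])
y_pred_mitigated = mitigator.predict(X_test)
```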

Protection (Differential Privacy)

Grid_1 Grid_2
Grid_3 Grid_4
Grid_5 Grid_6

ML is increasingly used in scenarios that involve sensitive information, such as census data and patient medical records. Current practices, such as redacting or masking data, can limit what ML can achieve. To address this issue, confidential machine learning and differential privacy techniques can be used to help organizations build solutions while maintaining data privacy and confidentiality.

Avoid data exposure with differential privacy

By using the new Differential Privacy Toolkit with Azure Machine Learning, data science teams can build ML solutions that preserve privacy and help prevent the re-identification of a person's data. These differential privacy techniques were developed in collaboration with researchers from Harvard's Institute for Quantitative Social Science (IQSS) and School of Engineering.

Differential privacy protects sensitive information by:

  1. Injecting statistical noise into the data to help prevent the disclosure of private information, without a significant loss of accuracy (see the sketch after this list).

  2. Managing exposure risk by tracking the privacy budget spent by individual queries and limiting further queries as appropriate.
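
To make the noise-injection idea concrete, here is a small NumPy-only sketch of the Laplace mechanism applied to a bounded mean query. It is a conceptual illustration only; the repository itself uses the opendp-whitenoise toolkit:

```python
# Conceptual Laplace mechanism for a differentially private mean (NumPy only).
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Return a differentially private mean of a bounded numeric column."""
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean of n values bounded in [lower, upper].
    sensitivity = (upper - lower) / len(values)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

ages = np.random.randint(29, 78, size=300)  # stand-in for the 'age' column

total_budget = 1.0   # overall privacy budget (epsilon)
spent = 0.5          # this single query spends half of it
print("true mean:   ", ages.mean())
print("private mean:", dp_mean(ages, 29, 77, epsilon=spent))
# Budget accounting: refuse further queries once total_budget is exhausted.
```

The smaller the epsilon spent per query, the stronger the privacy guarantee and the noisier the answer; the budget monitoring described above is exactly this bookkeeping.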

Data protection with confidential machine learning

In addition to data privacy, organizations seek to ensure the security and confidentiality of all ML resources.

To enable secure model training and deployment, Azure Machine Learning provides a robust set of network and data protection capabilities. These include support for Azure virtual networks, private links to connect to machine learning workspaces, dedicated compute hosts, and customer-managed keys for encryption in transit and at rest.

On this secure basis, Azure Machine Learning also enables Microsoft data science teams to build models with sensitive data in a secure environment, without the ability to view data. The confidentiality of all machine learning resources is preserved during this process. This approach is fully compatible with open source machine learning frameworks and a wide range of hardware options. We are pleased to offer these confidential machine learning capabilities to all developers and data scientists later this year.

Control (Data Drift)

Differential Privacy

To build responsibly, the ML development process must be repeatable, reliable, and responsible to stakeholders. Azure Machine Learning enables decision makers, auditors, and all ML lifecycle members to support a responsible process.

Tracking ML resources through audit logs

Azure Machine Learning provides capabilities to automatically track lineage and maintain an audit trail of ML resources. Details such as execution history, learning environment, and explanations of data and models are captured in a central log, allowing organizations to meet various auditing requirements.
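
For instance, metrics logged against a run and tags attached to a registered model all land in this central history. A hedged sketch (names, paths and metric values are illustrative):

```python
# Log metrics to a run and register a tagged model for later auditing.
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
experiment = Experiment(ws, "responsible-ai-audit")

run = experiment.start_logging()
run.log("accuracy", 0.91)             # illustrative value
run.log("fairness_difference", 0.04)  # illustrative value
model = run.register_model(
    model_name="heart-disease-classifier",
    model_path="outputs/model.pkl",   # assumed artifact path inside the run
    tags={"dataset": "complete_patients_dataset.csv", "stage": "candidate"},
)
run.complete()
```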

Differential Privacy

Brief explanation of the Azure Machine Learning services used

| Technology | Description |
| --- | --- |
| Azure Machine Learning Service | Cloud service to train, deploy and manage machine learning models |
| AutoML | Automates the time-consuming, iterative tasks of machine learning model development |
| Differential Privacy | Protects personal information and user identity |
| InterpretML | Interprets a model by using an explainer that quantifies the amount of influence each feature contributes to the predicted label |
| Fairlearn | Python package that empowers developers of AI systems to assess their system's fairness and mitigate any observed unfairness issues |
| DataDrift | Detects data drift, the change in model input data that leads to model performance degradation |

Final result: Azure Machine Learning pipelines

Initial Azure Machine Learning Pipeline

pipeline

Re-train Azure Machine Learning Pipeline with new model metrics validation

pipeline_retrain

Understanding the Heart-Disease dataset

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient; it is integer-valued, from 0 (no presence) to 1.

Download the original dataset from: http://archive.ics.uci.edu/ml/datasets/Heart+Disease or https://www.kaggle.com/ronitf/heart-disease-uci

Original Columns Dataset:

  • age: age in years
  • sex:
    • 0: female
    • 1: male
  • chest_pain_type: chest pain type
    • 1: typical angina
    • 2: atypical angina
    • 3: non-anginal pain
    • 4: asymptomatic
  • resting_blood_pressure: resting blood pressure (in mm Hg on admission to the hospital)
  • cholesterol: serum cholesterol in mg/dl
  • fasting_blood_sugar: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • rest_ecg: resting electrocardiographic results
    • 0: normal
    • 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • max_heart_rate_achieved: maximum heart rate achieved
  • exercise_induced_angina: exercise induced angina (1 = yes; 0 = no)
  • st_depression: ST depression induced by exercise relative to rest
  • st_slope: the slope of the peak exercise ST segment
    • 1: upsloping
    • 2: flat
    • 3: downsloping
  • num_major_vessels: number of major vessels (0-3) colored by fluoroscopy
  • thalassemia:
    • 3: normal
    • 6: fixed defect
    • 7: reversible defect
  • target: diagnosis of heart disease (angiographic disease status)
    • 0: < 50% diameter narrowing
    • 1: > 50% diameter narrowing
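
A quick, hedged way to sanity-check the 14-column dataset that ships in ./dataset (column names as listed above):

```python
# Load the UCI-derived dataset and inspect the target distribution.
import pandas as pd

df = pd.read_csv("dataset/uci_dataset.csv")
print(df.shape)                             # expect 14 columns
print(df["target"].value_counts())          # 0: < 50% narrowing, 1: > 50% narrowing
print(df.groupby("sex")["target"].mean())   # disease rate by sex
```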

Attribution:

The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Responsible AI Dataset

The dataset we use in this repository is a customized one. In the Getting Started section, we explain how it is generated and which transformations were applied to create the new dataset. It is based on the UCI Heart Disease data. The original UCI dataset has 76 columns by default, but Kaggle provides a version containing 14 columns; we used the Kaggle one. In this notebook, we explore the heart disease dataset.

As part of the exploratory analysis and preprocessing of our data, we have applied several techniques that help us understand the data. The insights discovered (visualizations) were uploaded into an Azure ML Experiment. As part of the study, we took our target variable, and we analyzed it and checked its interaction with other variables.

Apart from sex and age, the original dataset doesn't contain any personal or sensitive data that could be used to identify a person or to assess fairness. Therefore, for the purposes of this project, and to demonstrate the differential privacy and fairness-detection techniques, we have created a notebook that generates a custom dataset with the following new columns and schema:

Custom columns based on the original dataset:

Original Columns Dataset:

  • age (Original)
  • sex (Original)
  • chest_pain_type: chest pain type (Original)
  • resting_blood_pressure (Original)
  • cholesterol (Original)
  • fasting_blood_sugar (Original)
  • rest_ecg (Original)
  • max_heart_rate_achieved (Original)
  • exercise_induced_angina (Original)
  • st_depression (Original)
  • st_slope (Original)
  • num_major_vessels (Original)
  • thalassemia (Original)
  • target (Original)

Custom Columns Dataset:

  • state (custom)
  • city (custom)
  • address (custom)
  • postal code (custom)
  • ssn (social security number) (custom)
  • diabetic (custom)
    • 0: not diabetic
    • 1: diabetic
  • pregnant (custom)
    • 0: not pregnant
    • 1: pregnant
  • asthmatic (custom)
    • 0: not asthmatic
    • 1: asthmatic
  • smoker (custom)
    • 0: non-smoker
    • 1: smoker
  • observations (custom)

The information related to state, address, city and postal code was downloaded from https://github.com/EthanRBrown/rrad. That repository provides a list of real, random addresses that geocode successfully (tested against Google's Geocoding API service). The address data comes from the OpenAddresses project, and all the addresses are in the public domain. The addresses are deliberately not linked to people or businesses; the only guarantee is that they are real addresses that geocode successfully.
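
A hedged sketch of this generation step (the real logic lives in src/dataset-generator/dataset-generator.ipynb; the inline addresses sample stands in for the full rrad data, and all names are assumptions):

```python
# Enrich the UCI data with synthetic personal columns for the privacy/fairness demos.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.read_csv("dataset/uci_dataset.csv")
n = len(df)

# Stand-in for the rrad address data; the notebook loads the full JSON files.
addresses = pd.DataFrame({
    "state": ["OH", "CA"],
    "city": ["Cleveland", "Long Beach"],
    "address": ["1600 Example Ave", "42 Sample St"],
    "postal code": ["44101", "90802"],
})
picks = addresses.sample(n, replace=True, random_state=42).reset_index(drop=True)
for col in ["state", "city", "address", "postal code"]:
    df[col] = picks[col]

# Synthetic sensitive attributes.
df["ssn"] = [
    f"{rng.integers(100, 900):03d}-{rng.integers(10, 99):02d}-{rng.integers(1000, 9999):04d}"
    for _ in range(n)
]
for col in ["diabetic", "pregnant", "asthmatic", "smoker"]:
    df[col] = rng.integers(0, 2, size=n)
df["observations"] = ""

df.to_csv("dataset/complete_patients_dataset.csv", index=False)
```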

This custom dataset is the one the Azure ML steps will use. We named it "complete_patients_dataset.csv"; you can find it in ./dataset.

Getting started

The solution has this structure:

.
├── dataset
│   ├── complete_patients_dataset.csv
│   ├── heart_disease_preprocessed_inference.csv
│   ├── heart_disease_preprocessed_train.csv
│   ├── uci_dataset.csv
│   └── uci_dataset.yml
├── docs (documentation images)
├── infrastructure
│   ├── Scripts
│   │    └── Deploy-ARM.ps1
│   ├── deployment.json
│   ├── DeployUnified.ps1
│   └── README.md
├── src
│   ├── automated-ml (notebook to run AutoML)
│   ├── dataset-generator (notebook to generate dataset)
│   ├── deployment
│   ├── detect-fairness
│   ├── differential-privacy
│   ├── installation-libraries (notebook to install dependencies)
│   ├── mlops-pipeline
│   ├── monitoring
│   ├── notebooks-settings
│   ├── preprocessing
│   ├── retrain
│   └── utils
├── .gitignore
└── README.md

To run this sample, follow these steps:

  1. Create infrastructure
  2. Run notebook to install dependencies
  3. Run Dataset Generator Notebook
  4. Publish the pipeline
  5. Submit pipeline using API
  6. Activate Data Drift Detector
  7. Activate re-train Step

1. Create infrastructure

We need infrastructure in Azure to run the experiment. You can read more about the necessary infrastructure here.

To facilitate the task of creating the infrastructure, you can run the infrastructure/DeployUnified.ps1 script. You have to provide the Resource Group, the Location and the Subscription Id.

The final view of the Azure resource group will be like the following image:

resource-group

Note: The services marked with a red line in the image will be created in the next steps. Don't worry about them!

2. Install Project dependencies

Install Responsible AI Requirements

Each notebook has an associated environment.yml file listing all the Python libraries required to execute it. We recommend using a conda environment.

Here is the basic recipe for using conda to manage a project-specific software stack:

```shell
(base) $ cd project-dir
(base) $ conda env create --prefix ./env --file environment.yml
(base) $ conda activate ./env               # activate the environment
(/path/to/env) $ conda deactivate           # done working on the project (for now!)
```

There are more details below on creating your conda environment.

Libraries Required

The following libraries are required:

  • pylint
  • numpy
  • pandas
  • ipykernel
  • joblib
  • sklearn
  • azureml-sdk
  • azureml-sdk[automl]
  • azureml-widgets
  • azureml-interpret
  • azureml-contrib-interpret
  • interpret-community
  • azureml-monitoring
  • opendp-whitenoise
  • opendp-whitenoise-core
  • matplotlib
  • seaborn
  • pandas-profiling
  • fairlearn
  • azureml-contrib-fairness
  • azureml-datadrift
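
For reference, here is a hedged example of what one notebook's environment.yml could look like when assembled from this list; pin versions to match the Azure ML SDK release you actually use:

```yaml
# Illustrative environment.yml assembled from the library list above.
name: responsible-ai
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pylint
  - numpy
  - pandas
  - ipykernel
  - joblib
  - scikit-learn
  - matplotlib
  - seaborn
  - pip
  - pip:
      - azureml-sdk[automl]
      - azureml-widgets
      - azureml-interpret
      - azureml-contrib-interpret
      - interpret-community
      - azureml-monitoring
      - opendp-whitenoise
      - opendp-whitenoise-core
      - pandas-profiling
      - fairlearn
      - azureml-contrib-fairness
      - azureml-datadrift
```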

Using Conda for Environments

The Notebook will automatically find all Jupyter kernels installed on the connected compute instance. To add a kernel to the compute instance:

Select Open terminal in the Notebook toolbar.

Use the Visual Studio Code terminal window to create a new environment. For example:

  • Create a local environment from environment.yml: conda env create -f environment.yml
  • Activate the newly created environment: conda activate <environment_name>
  • Register the environment as a Jupyter kernel: python -m ipykernel install --user --name <environment_name> --display-name "Python (<environment_name>)"

Adding New Kernels (Optional)

  • Install pip and the ipykernel package in the new environment and create a kernel for that conda env:

    conda install pip
    conda install ipykernel
    python -m ipykernel install --user --name newenv --display-name "Python (newenv)"

Any of the available Jupyter kernels can be installed: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels

Installation of Python Libraries (Optional)

Use the following to install the libraries: pip --disable-pip-version-check --no-cache-dir install pylint

Or inline within a Jupyter notebook use: !pip install numpy

You can now execute the notebooks successfully.

Virtual environments to execute Azure Machine Learning notebooks using Visual Studio Codespaces.(Optional)

This repository contains labs to help you get started with creating and deploying an Azure Machine Learning module.

Open in Visual Studio Online

Manually creating a VS Online Container (Optional)

To complete the labs, you'll need the following:

  • A Microsoft Azure subscription. If you don't already have one, you can sign up for a free trial at https://azure.microsoft.com or a Student Subscription at https://aka.ms/azureforstudents.

  • A Visual Studio Codespaces environment. This provides a hosted instance of Visual Studio Code, in which you'll be able to run the notebooks for the lab exercises. To set up this environment:

    1. Browse to https://online.visualstudio.com
    2. Click Get Started.
    3. Sign in using the Microsoft account associated with your Azure subscription.
    4. Click Create environment. If you don't already have a Visual Studio Online plan, create one. This is used to track resource utilization by your Visual Studio Online environments. Then create an environment with the following settings:
      • Environment Name: A name for your environment - for example, MSLearn-create-deploy-azure-ml-module.
      • Git Repository: leestott/create-deploy-azure-ml-module
      • Instance Type: Standard (Linux) 4 cores, 8GB RAM
      • Suspend idle environment after: 120 minutes
    5. Wait for the environment to be created, and then click Connect to connect to it. This will open a browser-based instance of Visual Studio Code.

The current Visual Studio Codespaces environment is based on Debian 10, and there are currently some limitations to the Azure ML SDK on Linux. Errors may occur in some notebooks; make sure the correct libraries and versions are installed using !pip install, and please check library dependencies.

Using Azure Machine learning Notebooks (Optional)

  • Simply download the folder structure and upload the entire content to Azure Machine Learning Notebooks.

aml notebook

  • You now need to create a new compute instance for your notebook environment

aml compute

  • You now need to install the AML prerequisites on the notebook compute host. To do this, simply open a notebook and then select Open terminal.

aml compute terminal

  • Select the terminal and install all the requirements using pip install.

Jupyter Notebooks

Inside the src folder, this project has several directories with Jupyter notebooks that you have to execute to complete the objective of this repository. The src folder contains:

  1. automated-ml: automated-ml.ipynb and environment.yml
  2. dataset-generator: dataset-generator.ipynb and environment.yml
  3. detect-fairness: fairlearn.ipynb and environment.yml
  4. differential-privacy: differential-privacy.ipynb and environment.yml
  5. mlops-pipeline:
    1. explain_automl_model_local.ipynb
    2. mlops-publish-pipeline.ipynb
    3. mlops-submit-pipeline.ipynb
    4. environment.yml
  6. monitoring: datadrift-pipeline.ipynb and environment.yml
  7. preprocessing: exploratory_data_analysis.ipynb and environment.yml

We recommend dedicated conda environments for each of the notebooks due to library and version dependencies. If you are running this on a local machine rather than a devcontainer, you will need to create the conda environments (for example with Anaconda Navigator) or run the conda installation before doing anything inside these notebooks.

3. Run Dataset Generator

Run src/dataset-generator/dataset-generator.ipynb to create the project dataset, derived from the UCI Heart-Disease dataset specifically for the Responsible AI steps. The generated dataset is stored in the ./dataset folder.

4. Publish the pipeline

Run src/mlops-pipeline/mlops-publish-pipeline.ipynb to create a machine learning service pipeline with Responsible AI steps and MLOps techniques that runs jobs unattended on different compute clusters.
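
A condensed, hedged sketch of what the publish notebook does (the step, script and cluster names here are illustrative assumptions, not the notebook's exact contents):

```python
# Build a one-step pipeline and publish it so it can be invoked later.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

preprocess_step = PythonScriptStep(
    name="preprocessing",
    script_name="preprocessing.py",        # assumed script name
    source_directory="./src/preprocessing",
    compute_target="cpu-cluster",          # assumed cluster name
)

pipeline = Pipeline(workspace=ws, steps=[preprocess_step])
published = pipeline.publish(
    name="responsible-ai-pipeline",
    description="Responsible AI steps: preprocessing, training, registration",
    version="1.0",
)
print("REST endpoint:", published.endpoint)
```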

You can see the run in the Pipelines section of the Azure Machine Learning portal.

Pipelines in portal

5. Submit the pipeline using the REST API

Run src/mlops-pipeline/mlops-submit-pipeline.ipynb to invoke the published pipeline via its REST endpoint.
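
In essence, the notebook authenticates and posts to the pipeline's REST endpoint. A hedged sketch, assuming the published object from the sketch in step 4 and an illustrative experiment name:

```python
# Invoke a published pipeline through its REST endpoint.
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

auth = InteractiveLoginAuthentication()
headers = auth.get_authentication_header()

rest_endpoint = published.endpoint  # PublishedPipeline from step 4
response = requests.post(
    rest_endpoint,
    headers=headers,
    json={"ExperimentName": "responsible-ai-run"},
)
response.raise_for_status()
print("Submitted pipeline run id:", response.json().get("Id"))
```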

You can see the run in the Pipelines section of the Azure Machine Learning portal.

Pipelines in portal

6. Activate Data Drift Detector

Run src/monitoring/datadrift-pipeline.ipynb to create and execute the data drift detector. At the end of this notebook, you will be able to make a request with new data in order to detect drift.
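
A hedged sketch of what the notebook sets up with azureml-datadrift (the dataset names, compute name and threshold are illustrative assumptions):

```python
# Create a drift monitor comparing scoring data against the training baseline.
from datetime import datetime, timedelta
from azureml.core import Dataset, Workspace
from azureml.datadrift import DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "heart-disease-train")      # training data
target = Dataset.get_by_name(ws, "heart-disease-inference")    # timestamped scoring data

monitor = DataDriftDetector.create_from_datasets(
    ws, "heart-disease-drift", baseline, target,
    compute_target="cpu-cluster",
    frequency="Week",
    drift_threshold=0.3,
    latency=24,
)

# Analyze recent data once, then keep running on a schedule.
monitor.backfill(datetime.today() - timedelta(weeks=4), datetime.today())
monitor.enable_schedule()
```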

Go to the Models section of the Azure Machine Learning portal. In the details tab you can now see a new section showing the Data Drift detector's status and configuration.

Data Drift in portal

7. Execute the pipeline with the retrain configuration

If the data drift coefficient is greater than the configured threshold, an alert is sent to the end user. At that point, the user can execute the retrain pipeline to improve the model's performance, taking the newly collected data into account.

Go to the Pipelines section of the Azure ML portal and click on the latest pipeline version. Then click the Submit button. You should now see something like the following image:

Retrain pipeline in portal

First, select an existing experiment or create a new one for this pipeline execution. Finally, in the same view, set the following parameters so that the retraining process runs correctly:

  1. use_datadrift = False
  2. retrain_status_differential_privacy_step = True
  3. retrain_status_preprocessing_step = True
  4. update_deployment_deploy_step = True

Once the parameters are set, we have everything ready to execute the retraining process!

Retrain parameters in portal
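
Equivalently, the retrain run can be submitted from the SDK. A hedged sketch, with the pipeline id and experiment name as placeholders (pipeline parameters are passed as strings):

```python
# Submit the published pipeline with the retrain parameter overrides.
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()
pipeline = PublishedPipeline.get(ws, id="<published-pipeline-id>")

run = pipeline.submit(
    ws,
    experiment_name="heart-disease-retrain",
    pipeline_parameters={
        "use_datadrift": "False",
        "retrain_status_differential_privacy_step": "True",
        "retrain_status_preprocessing_step": "True",
        "update_deployment_deploy_step": "True",
    },
)
run.wait_for_completion(show_output=True)
```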

References

Tags: Azure Machine Learning Service, Machine Learning, Differential-Privacy, Fairlearn, MLOPs, Data-Drift, InterpretML
