Project Title

McKinsey Analytics Online Hackathon - Healthcare Analytics: https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-online-hackathon/ (Apr 14th 2018)

Getting Started

This repository consists of the set of codes used in the above datathon.

Problem Statement

Here, the original problem statement is copied from "https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-online-hackathon/":

"In this case, your client wants to have study around one of the critical disease "Stroke". Stroke is a disease that affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or bursts (or ruptures). When that happens, part of the brain cannot get the blood (and oxygen) it needs, so it and brain cells die.

Over the last few years, the Client has captured several health, demographic and lifestyle details about its patients. This includes details such as age and gender, along with several health parameters (e.g. hypertension, body mass index) and lifestyle related variables (e.g. smoking status, occupation type).

The Client wants you to predict the probability of stroke happening to their patients. This will help doctors take proactive health measures for these patients."

Hence, the goal of this hackathon is to attain a model with a high predictability of stroke for unseen test dataset of patients' information. Note that in this datathon, as a criterion of the evaluation metric for predictability, the organizer specified AUC ROC score (https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/)

Data

The variables in the datasets give the following information:

Variables: Definition

id: Patient ID

gender: Gender of Patient

age: Age of Patient

hypertension: 0 - no hypertension, 1 - suffering from hypertension

heart_disease: 0 - no heart disease, 1 - suffering from heart disease

ever_married: Yes/No

work_type: Type of occupation

Residence_type: Area type of residence (Urban/ Rural)

avg_glucose_level: Average Glucose level (measured after meal)

bmi: Body mass index

smoking_status: patient's smoking status

stroke: 0 - no stroke, 1 - suffered stroke

Prerequisites

I used neural network in matlab to predict AUC and ROC; hence, the "netlab" package needs to be installed, which is available at http://www.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/downloads/

Main flow

The main flow is the data cleaning and model fit. Regarding the data cleaning, the missng data are dealt with two ways, complete case analysis and Kth nearest imputation analysis. The majority of the code for this part is based on the book "practical-machine-learning-with-python" and its author's github "https://github.com/dipanjanS/practical-machine-learning-with-python.git".

After such data cleaning, the logistic model and neural network are explored.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
BayesianRegularizationBPNN_K3thimputation.m		BayesianRegularizationBPNN_K3thimputation.m
BayesianRegularizationBPNN_ignoremissing.m		BayesianRegularizationBPNN_ignoremissing.m
Data_cleaning_final.ipynb		Data_cleaning_final.ipynb
README.md		README.md
test_v2akXPA.csv		test_v2akXPA.csv
train_ajEneEa.csv		train_ajEneEa.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Title

Getting Started

Problem Statement

Data

Prerequisites

Main flow

About

Releases

Packages

Languages

Nozomi-Takemura/24-hour-McKinsey-Analytics-Online-Hackathon-Healthcare-Analytic

Folders and files

Latest commit

History

Repository files navigation

Project Title

Getting Started

Problem Statement

Data

Prerequisites

Main flow

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages