This synthetic dataset contains sleep metrics, cardiovascular metrics, and lifestyle factors for close to 400 fictitious people.
The workspace is set up with one CSV file, data.csv, with the following columns:
- Person ID
- Gender
- Age
- Occupation
- Sleep Duration: Average number of hours of sleep per day
- Quality of Sleep: A subjective rating on a 1-10 scale
- Physical Activity Level: Average number of minutes the person engages in physical activity daily
- Stress Level: A subjective rating on a 1-10 scale
- BMI Category
- Blood Pressure: Indicated as systolic pressure over diastolic pressure
- Heart Rate: In beats per minute
- Daily Steps
- Sleep Disorder: One of None, Insomnia, or Sleep Apnea
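As a rough illustration of how a file with these columns might be loaded, here is a minimal sketch. The helper name `load_sleep_data`, the derived `Systolic` and `Diastolic` columns, and the three sample rows are assumptions made up for this example; they are not taken from the project. The `fillna` step is there because some pandas versions treat the literal text `None` as a missing value when reading a CSV.

```python
import io

import pandas as pd


def load_sleep_data(source):
    """Read the CSV and tidy the target and blood-pressure columns."""
    df = pd.read_csv(source)
    # Depending on the pandas version, the text "None" may be parsed as a
    # missing value, so restore it as an explicit category.
    df["Sleep Disorder"] = df["Sleep Disorder"].fillna("None")
    # "Blood Pressure" is text such as "126/83": split it into two integers.
    bp = df["Blood Pressure"].str.split("/", expand=True).astype(int)
    df["Systolic"] = bp[0]
    df["Diastolic"] = bp[1]
    return df


# Three invented rows in the documented column layout (illustration only).
sample = io.StringIO(
    "Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,"
    "Physical Activity Level,Stress Level,BMI Category,Blood Pressure,"
    "Heart Rate,Daily Steps,Sleep Disorder\n"
    "1,Male,27,Engineer,6.1,6,42,6,Overweight,126/83,77,4200,None\n"
    "2,Female,45,Teacher,6.5,7,45,4,Overweight,135/90,65,6000,Insomnia\n"
    "3,Female,52,Nurse,6.0,6,90,8,Overweight,140/95,75,10000,Sleep Apnea\n"
)
df = load_sleep_data(sample)
```

With the real file, the call would presumably be `load_sleep_data("data.csv")`.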
Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client should pay.
Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.
Methods Used: Exploratory Data Analysis, Inferential Statistics, Data Visualization, Machine Learning, Predictive Modeling.
Type of Problem: Multi-class Classification Task.
Languages, libraries, and technologies used: Python, pandas, Matplotlib, Seaborn, NumPy, SciPy, scikit-learn, joblib
To start this project, I checked that the data was clean and matched the description in the data dictionary. I cleaned up the entries that did not, then validated the full dataset again.
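The exact checks used in the project are not listed, so the following is only a sketch of what validation against the data dictionary could look like. The function name `find_problems` and the specific rules (unique IDs, 1-10 scales, `systolic/diastolic` text, allowed disorder labels) are assumptions derived from the column descriptions.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "Person ID", "Gender", "Age", "Occupation", "Sleep Duration",
    "Quality of Sleep", "Physical Activity Level", "Stress Level",
    "BMI Category", "Blood Pressure", "Heart Rate", "Daily Steps",
    "Sleep Disorder",
]
ALLOWED_DISORDERS = {"None", "Insomnia", "Sleep Apnea"}


def find_problems(df):
    """Return a list of human-readable issues; an empty list means no issue found."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]
    if df["Person ID"].duplicated().any():
        problems.append("duplicate Person ID values")
    # Missing values are checked everywhere except the target column.
    if df[EXPECTED_COLUMNS].drop(columns="Sleep Disorder").isna().any().any():
        problems.append("missing values in feature columns")
    # Both subjective ratings are documented as 1-10 scales.
    for col in ("Quality of Sleep", "Stress Level"):
        if not df[col].between(1, 10).all():
            problems.append(f"{col} outside the 1-10 scale")
    # Blood pressure is documented as systolic over diastolic, e.g. "126/83".
    bad_bp = ~df["Blood Pressure"].astype(str).str.fullmatch(r"\d+/\d+")
    if bad_bp.any():
        problems.append("Blood Pressure not in systolic/diastolic form")
    unknown = set(df["Sleep Disorder"].dropna()) - ALLOWED_DISORDERS
    if unknown:
        problems.append(f"unexpected Sleep Disorder labels: {sorted(unknown)}")
    return problems
```

Running `find_problems` before and after cleaning gives a simple way to confirm that the cleaning step actually resolved what it was meant to.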
Once my data was clean, I carried out an exploratory data analysis, followed by statistical tests, which revealed that:
- People whose occupation is Accountant, Doctor, Engineer, or Lawyer are less likely to have a sleep disorder; Nurses have a high chance of sleep apnea; and Salespersons and Teachers are more likely to have insomnia.
- Overweight people have a high chance of suffering from a sleep disorder, and people with ideal or normal blood pressure are less likely to have one.
- People between the ages of 50 and 60 have low stress levels and a sleep quality of around 9, but are susceptible to sleep apnea.
- Men and women aged between 42 and 45 are very likely to have insomnia, and women aged 50 or above 55 have a very high chance of having sleep apnea.
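The write-up does not name the statistical tests behind these findings. One common choice for checking whether a categorical column such as Occupation is associated with Sleep Disorder is a chi-square test of independence, sketched below with `scipy.stats.chi2_contingency`. The helper name `association_test` is an assumption for this example.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def association_test(df, feature, target="Sleep Disorder"):
    """Chi-square test of independence between a categorical feature and the target."""
    # Contingency table: one row per feature value, one column per target class.
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return table, chi2, p_value, dof
```

A small p-value suggests the feature and the target are not independent; it does not say which categories drive the association, so the contingency table is returned as well for inspection.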
After that, I preprocessed my data and created two models: a baseline LogisticRegression and a comparison DecisionTree. I fitted and evaluated both. With an accuracy of 89%, the baseline model performs better.
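The preprocessing steps and hyperparameters are not described, so the sketch below is only one plausible way to set up such a comparison with scikit-learn. One-hot encoding, standard scaling, the 25% test split, `max_iter=1000`, and the helper names `build_pipeline` and `compare_models` are all assumptions; the accuracy it produces on the real data may differ from the figure reported above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(estimator, categorical, numeric):
    """Wrap an estimator with one-hot encoding and scaling (assumed preprocessing)."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])
    return Pipeline([("preprocess", preprocess), ("model", estimator)])


def compare_models(X: pd.DataFrame, y: pd.Series, categorical, numeric, seed=42):
    """Fit both models on the same stratified split and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, estimator in candidates.items():
        pipeline = build_pipeline(estimator, categorical, numeric)
        pipeline.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores
```

Keeping the preprocessing inside the pipeline means the encoder and scaler are fitted on the training split only, which avoids leaking information from the test rows into the evaluation.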
I plotted the importance of each variable to see which variables contributed the most to the model's predictions. I saved the model as a pickle file using joblib.
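How the importances were computed is not stated. As a model-agnostic stand-in, the sketch below uses scikit-learn's `permutation_importance` and then round-trips a model through `joblib.dump` and `joblib.load`. The helper names, the file name `model.joblib`, and the toy data (where `signal` determines the label and `noise` does not) are invented for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier


def rank_features(model, X, y, seed=42):
    """Mean drop in score when each column is shuffled, largest first."""
    result = permutation_importance(model, X, y, n_repeats=10, random_state=seed)
    ranking = pd.Series(result.importances_mean, index=X.columns)
    return ranking.sort_values(ascending=False)


def save_and_reload(model, path):
    """Persist a fitted model with joblib and load it back."""
    joblib.dump(model, path)
    return joblib.load(path)


# Toy data, invented for illustration only.
X_toy = pd.DataFrame({"signal": [0, 1] * 20, "noise": [i % 5 for i in range(40)]})
y_toy = X_toy["signal"].map({0: "None", 1: "Insomnia"})
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
ranking = rank_features(model, X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(model, os.path.join(tmp, "model.joblib"))
```

The returned Series can be drawn as a horizontal bar chart with `ranking.plot.barh()` when Matplotlib is installed. Files written by joblib should only be loaded from trusted sources, since loading them can execute arbitrary code.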
Dataset Source: Kaggle