This is an introductory project on Machine Learning and Data Science concepts, built around the problem of heart disease classification.
It is intended to be an end-to-end example of what a Data Science and Machine Learning proof of concept might look like.
- Exploratory data analysis (EDA) - the process of reviewing a dataset and finding out more about it.
- Model training - creating one or more models that learn to predict a target variable based on other variables.
- Model evaluation - evaluating a model's predictions using problem-specific evaluation metrics.
- Model comparison - comparing several different models to find the best one.
- Model fine-tuning - once we've found a good model, how can we improve it?
- Feature importance - since we're predicting the presence of heart disease, are some features more important than others for prediction?
- Cross validation - if we build a good model, can we be sure it will work on unseen data?
- Reporting what we've found - if we had to present our work, what would we show someone?
To work through these topics, we'll use Pandas, Matplotlib, and NumPy for data analysis, as well as Scikit-Learn for machine learning and modeling tasks.
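As a rough idea of how the training, evaluation, comparison, and cross-validation steps can fit together in Scikit-Learn, here is a minimal sketch. It runs on synthetic stand-in data from `make_classification`, not the heart disease dataset, and the two models, the 80/20 split, and the `fit_and_score` helper are illustrative assumptions rather than the choices made in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (NOT the heart disease dataset): 300 rows, 13 features.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows so models are scored on data they have not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models to compare (an illustrative choice, not a recommendation).
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
}


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its accuracy on the held-out test set."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


scores = fit_and_score(models, X_train, X_test, y_train, y_test)
print(scores)

# 5-fold cross-validated accuracy gives a less split-dependent estimate.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)
print(cv_scores.mean())
```

Accuracy is used here only because it is the default classifier score; a real evaluation would pick metrics suited to the problem.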
We've downloaded the dataset in a formatted way from Kaggle: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
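Once the CSV is downloaded, loading it with Pandas might look like the sketch below. The `load_heart_data` helper and the file name `heart.csv` are hypothetical, and the two rows in the example are made up purely so the snippet runs on its own; the column names are taken from the data dictionary in this document.

```python
import io

import pandas as pd


def load_heart_data(source):
    """Read the CSV into a DataFrame and check that the target column exists."""
    df = pd.read_csv(source)
    if "target" not in df.columns:
        raise ValueError("expected a 'target' column in the heart disease data")
    return df


# Two made-up rows for illustration only (not real patient records).
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "50,1,0,120,200,0,1,150,0,1.0,1,0,2,1\n"
    "60,0,2,140,250,1,0,130,1,2.5,2,1,3,0\n"
)

df = load_heart_data(sample_csv)
print(df.shape)
print(df["target"].value_counts())

# With the real download, the call would be something like:
# df = load_heart_data("heart.csv")  # hypothetical local path
```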
Heart Disease Data Dictionary

The following are the features we'll use to predict our target variable (heart disease or no heart disease).
- age - age in years
- sex - (1 = male; 0 = female)
- cp - chest pain type
  - 0: Typical angina: chest pain related to decreased blood supply to the heart
  - 1: Atypical angina: chest pain not related to the heart
  - 2: Non-anginal pain: typically esophageal spasms (not heart related)
  - 3: Asymptomatic: chest pain not showing signs of disease
- trestbps - resting blood pressure (in mm Hg on admission to the hospital)
  - anything above 130-140 is typically cause for concern
- chol - serum cholesterol in mg/dl
  - serum = LDL + HDL + 0.2 * triglycerides
  - above 200 is cause for concern
- fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  - a value above 126 mg/dl signals diabetes
- restecg - resting electrocardiographic results
  - 0: Nothing to note
  - 1: ST-T wave abnormality
    - can range from mild symptoms to severe problems
    - signals an abnormal heartbeat
  - 2: Possible or definite left ventricular hypertrophy
    - enlarged main pumping chamber of the heart
- thalach - maximum heart rate achieved
- exang - exercise induced angina (1 = yes; 0 = no)
- oldpeak - ST depression induced by exercise relative to rest
  - looks at the stress on the heart during exercise
  - an unhealthy heart will stress more
- slope - the slope of the peak exercise ST segment
  - 0: Upsloping: better heart rate with exercise (uncommon)
  - 1: Flatsloping: minimal change (typical healthy heart)
  - 2: Downsloping: signs of an unhealthy heart
- ca - number of major vessels (0-3) colored by fluoroscopy
  - a colored vessel means the doctor can see the blood passing through
  - the more blood movement the better (no clots)
- thal - thallium stress result
  - 1,3: normal
  - 6: fixed defect: used to be a defect but OK now
  - 7: reversible defect: no proper blood movement when exercising
- target - have disease or not (1 = yes, 0 = no) (= the predicted attribute)

Note: No personally identifiable information (PII) can be found in the dataset.
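The coded values and rule-of-thumb thresholds in the data dictionary above can be made easier to read during exploration. The sketch below is one possible way to do that with Pandas: the helper names (`add_readable_columns`, `serum_cholesterol`) are hypothetical, the blood pressure cut-off of 130 is an assumption taken from the lower end of the 130-140 range, and the two example rows are made up.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary.
SEX_LABELS = {1: "male", 0: "female"}
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}


def add_readable_columns(df, bp_threshold=130, chol_threshold=200):
    """Return a copy of df with label columns and simple threshold flags.

    bp_threshold=130 is an assumption (lower end of the 130-140 range);
    chol_threshold=200 follows the "above 200" note in the dictionary.
    """
    out = df.copy()
    out["sex_label"] = out["sex"].map(SEX_LABELS)
    out["cp_label"] = out["cp"].map(CP_LABELS)
    out["high_trestbps"] = out["trestbps"] > bp_threshold
    out["high_chol"] = out["chol"] > chol_threshold
    return out


def serum_cholesterol(ldl, hdl, triglycerides):
    """Serum cholesterol as given above: LDL + HDL + 0.2 * triglycerides."""
    return ldl + hdl + 0.2 * triglycerides


# Made-up rows for illustration only (not real patient records).
example = pd.DataFrame(
    {"sex": [1, 0], "cp": [0, 3], "trestbps": [120, 150], "chol": [180, 260]}
)
readable = add_readable_columns(example)
print(readable[["sex_label", "cp_label", "high_trestbps", "high_chol"]])
print(serum_cholesterol(ldl=100, hdl=50, triglycerides=150))
```

The flags are only exploratory aids; the models themselves would still be trained on the original numeric columns.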