Skip to content

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated17.9 million lives each year, which accounts for 31. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure. Most cardiovascular diseases can be prevented by addressing b…

Notifications You must be signed in to change notification settings

ozlemkorpe/Heart-Failure-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 

Repository files navigation

Heart-Failure-Prediction

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated17.9 million lives each year, which accounts for 31. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure. Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies. People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyper lipidaemia or al-ready established disease) need early detection and management where in a machine learning model can be of great help

Dataset

Dataset is obtained by Kaggle, Davide Chicco, Giuseppe Jurman: Machine learning canpredict survival of patients with heart failure from serum creatinine and ejection fractionalone. BMC Medical Informatics and Decision Making 20, 16 (2020).

Datafields

  • age: Age of the patient
  • anaemia: Decrease of red blood cells or hemoglobin (boolean) creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
  • diabetes: If the patient has diabetes (boolean)
  • ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)
  • high_blood_pressure: If the patient has hypertension (boolean)
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • sex: Woman or man (binary)
  • smoking: If the patient smokes or not (boolean)
  • time: Follow-up period (days)
  • DEATH_EVENT: If the patient deceased during the follow-up period (boolean)

Methodology

Preprocessing

By using following line I checked for the number of missing values in each column and verified that there is no missing values in the whole dataset

missing_rows = sum(ismissing(data)); %No missing data

All features have different ranges and if I use them like that they will affect the computer with different ratios. For solving this problem, I used two different scaling method which are standardization and normalization. Later on I saw that normalization method produces slightly better output than standardization. Performed normalization on each variable in training and test sets and saved the results in normalized datatables.

stand_age = (data.age - mean(data.age)) / std(data.age);
data.age = stand_age

Partitioning and Classification

Partitioned the data into the training and the test sets. Created a variable named TestPartition at the beginning of the script to easily change the percentage of the test set and set it to 0.1 initially which means 10% of the dataset will be used for testing and 90% of it is used for training.

cv=cvpartition(classification_model.NumObservations,'HoldOut',TestPartition);
cross_validated_model = crossval(classification_model, 'cvpartition', cv);

After performing partition I performed the prediction by using the model I created. I used confusion matrix to see how the results are distributed.

classification_model = fitctree(data, 'DEATH_EVENT~platelets+serum_creatinine+time+serum_sodium+sex+smoking+age+anaemia+creatinine_phosphokinase+diabetes+ejection_fraction+high_blood_pressure','MinParentSize',DecisionAttr ); 

Experiments and Result

Used confusion matrix to calculate accuracy and analyze score.

    Results = confusionmat(cross_validated_model.Y(test(cv)),Predictions); 
    right_results = Results(1,1) + Results(2,2);
    wrong_results = Results(1,2) + Results(2,1);

    truth_score = right_results /(right_results + wrong_results);
    general_accuracy = general_accuracy + truth_score;

In first experiments I used all of the attributes provided in the dataset by using different customizations of decision tree algorithm like setting maximum number of splits or minimum size of the parent. During these experiments I obtained the best results by decreasing the number maximum splits or increasing the size of the minimum parents. When I analyzed the decision tree plot, I saw that in these customization algorithm decides only by checking at the follow-up period time. Even If provides the best resulted predictions approximately 82%, I wanted to analyze the effect of other attributes excluding the time. With the time attribute, accuracy of trained model is up to 0.83034%, in the other hand if the model is trained without time attribute has up to 0.74586%. If the time attribute is included into calculations, it has the most effect on the result. If excluded serum creatinine has the most effect as can be seen on the following diagram.

Data

Citation

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). (link) ###License CC BY 4.0

About

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated17.9 million lives each year, which accounts for 31. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure. Most cardiovascular diseases can be prevented by addressing b…

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages