Predictive Modeling on Health Insurance Cost

This project builds predictive models to estimate health insurance charges from personal attributes. The dataset, sourced from Kaggle, includes age, sex, BMI, smoking status, region, and number of children. Several regression models were evaluated to find the most accurate model and the most important predictors of insurance costs.

Introduction

The rising cost of healthcare in the United States has created a significant burden on families, businesses, and taxpayers. This project explores the factors contributing to these high costs and aims to develop predictive models to estimate health insurance charges.

Objectives

To find the regression model with the lowest prediction error
To predict health insurance charges based on personal data

Data Exploration

Data Source: The dataset was obtained from Kaggle and covers the United States, with 1,338 observations and 7 variables.
Data Pre-processing: No missing data points were identified.
Exploratory Data Analysis: Summary statistics and data visualizations were generated to understand the relationships between variables.
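The missing-value check and summary statistics described above could be sketched as follows. This is a minimal Python illustration using only the standard library, not the project's own code. The column names and the four sample rows are assumptions made for the example and are not taken from the real dataset.

```python
import csv
import io
import statistics

# Hypothetical rows in an assumed column layout; not real observations.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northeast,3200.0
40,male,30.1,2,yes,southeast,38000.0
58,female,27.8,1,no,southwest,12500.0
33,male,35.4,3,no,northwest,6100.0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def count_missing(rows):
    """Count empty or 'NA' cells per column."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() in ("", "NA"):
                counts[name] += 1
    return counts

def summarize(rows, column):
    """Return min, mean, median, and max of a numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }

rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows)
charges_summary = summarize(rows, "charges")
print(missing)
print(charges_summary)
```

To run this against a file instead of the inline sample, the text could be read with `open(path).read()` and passed to `load_rows`.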

Modeling

Multiple Linear Regression

Initial models violated the assumptions of linearity, constant variance, and normality. The final model was the candidate with the lowest Akaike Information Criterion (AIC) value.
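To illustrate how AIC ranks candidate models, the sketch below compares an intercept-only model with a one-predictor model. It assumes Gaussian errors and uses the form n * ln(RSS / n) + 2k, which equals the full AIC up to a constant that is identical for all models fitted to the same data. The data values and function names are hypothetical, and this is not the project's code.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def gaussian_aic(residuals, n_coefficients):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * k."""
    n = len(residuals)
    rss = sum(r ** 2 for r in residuals)
    return n * math.log(rss / n) + 2 * n_coefficients

# Hypothetical data: age versus charges.
age = [20, 30, 40, 50, 60]
charges = [2500.0, 4100.0, 6300.0, 8200.0, 9900.0]

# Candidate 1: intercept only (one coefficient).
mean_charges = sum(charges) / len(charges)
resid_null = [c - mean_charges for c in charges]

# Candidate 2: intercept plus age (two coefficients).
b0, b1 = fit_simple_ols(age, charges)
resid_age = [c - (b0 + b1 * a) for a, c in zip(age, charges)]

aic_null = gaussian_aic(resid_null, 1)
aic_age = gaussian_aic(resid_age, 2)
print(aic_null, aic_age)
```

The candidate with the lower value is preferred; the 2k term penalizes each extra coefficient, so a predictor only helps if it reduces the residual sum of squares enough to offset the penalty.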

LASSO Regression

L1 regularization was used to reduce overfitting; the penalty can shrink some coefficients exactly to zero. The important variables identified were smoker, age, and BMI.
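The effect of the L1 penalty can be seen in a small coordinate-descent sketch. It assumes centered data with no intercept and minimizes (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1. The data, the `lam` values, and the function names are hypothetical; this illustrates the technique and is not the project's implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Hypothetical centered data: y is driven by column 0; column 1 is nearly unrelated.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.5], [1.0, -0.5], [2.0, 0.0]]
y = [-6.1, -2.9, 0.2, 2.8, 6.0]

beta_small = lasso_coordinate_descent(X, y, lam=0.01)
beta_large = lasso_coordinate_descent(X, y, lam=1.0)
print(beta_small, beta_large)
```

A larger `lam` shrinks the coefficient on the informative column and keeps the weak column at exactly zero, which is how LASSO performs variable selection.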

Decision Tree

Built using recursive binary splits, each chosen to minimize a cost function. Important variables were smoker, age, and BMI.
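A single binary split can be sketched as a search over candidate thresholds for the one that minimizes the cost. The example below assumes the cost is the sum of squared errors around each side's mean, a common choice for regression trees; the source does not state which cost function was used. The data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, cost) for the split x <= threshold with the lowest SSE."""
    best_threshold, best_cost = None, float("inf")
    unique = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Hypothetical data: smoker coded 0/1, with much higher charges for smokers.
smoker = [0, 0, 0, 1, 1, 0]
charges = [3000.0, 4500.0, 5200.0, 31000.0, 36000.0, 4100.0]
threshold, cost = best_split(smoker, charges)
print(threshold, cost)
```

A full tree repeats this search on each resulting subset, over every predictor, until a stopping rule is met.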

Random Forest

An ensemble of decision trees whose predictions are aggregated. Important variables identified were smoker, age, and BMI.
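The idea can be sketched with one-split trees (stumps): each tree is trained on a bootstrap sample and may only consider a random subset of the predictors, and the forest averages the tree predictions. This is a simplified Python illustration with hypothetical data and function names; real random forests grow deeper trees and resample features at every split.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_stump(X, y, features):
    """Fit a one-split tree using only the given feature indices."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            lm, rm = mean(left), mean(right)
            cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or cost < best[0]:
                best = (cost, j, t, lm, rm)
    if best is None:  # no valid split in this sample: predict the overall mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def fit_forest(X, y, n_trees, n_features, rng):
    """Each tree sees a bootstrap sample and a random subset of features."""
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        features = rng.sample(range(p), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_stump(s, row) for s in forest])

# Hypothetical rows: [smoker (0/1), age, bmi] and charges.
X = [[0, 25, 22.0], [1, 30, 28.5], [0, 45, 31.0], [1, 50, 26.0],
     [0, 60, 24.5], [1, 35, 33.0], [0, 38, 27.0], [1, 62, 30.0]]
y = [3000.0, 33000.0, 8000.0, 40000.0, 13000.0, 36000.0, 6500.0, 45000.0]

forest = fit_forest(X, y, n_trees=50, n_features=2, rng=random.Random(0))
pred_smoker = predict_forest(forest, [1, 40, 28.0])
pred_nonsmoker = predict_forest(forest, [0, 40, 28.0])
print(pred_smoker, pred_nonsmoker)
```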

Bagging

Built multiple decision trees on bootstrap samples of the data and aggregated their predictions. Important variables identified were age, BMI, and smoker.
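Bootstrap aggregation can be sketched with stumps on a single predictor: each stump is fitted to a sample drawn with replacement, and the final prediction is the average across stumps. The data and function names below are hypothetical, and the sketch is a simplified Python illustration rather than the project's code.

```python
import random

def fit_stump(x, y):
    """One-split regression tree on a single predictor."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:  # bootstrap sample had a single distinct x value
        m = sum(y) / len(y)
        return (float("inf"), m, m)
    return best[1:]

def bagged_stumps(x, y, n_trees, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    n = len(x)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return models

def predict(models, xi):
    """Average the stump predictions."""
    preds = [lm if xi <= t else rm for t, lm, rm in models]
    return sum(preds) / len(preds)

# Hypothetical data: charges rising with age.
age = [19, 23, 31, 36, 44, 52, 58, 63]
charges = [2100.0, 2600.0, 4200.0, 5100.0, 7400.0, 9800.0, 11900.0, 13500.0]
models = bagged_stumps(age, charges, n_trees=100, rng=random.Random(1))
print(predict(models, 25), predict(models, 60))
```

Because the stumps place their thresholds at different ages, the averaged prediction changes more gradually with age than any single stump does.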

Gradient Boosting

An ensemble method that builds decision trees sequentially, each one fitted to reduce the loss left by the trees before it. Important variables identified were smoker, age, and BMI.
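For squared-error loss, the negative gradient is simply the residual, so each round fits a small tree to the current residuals and adds a scaled copy of it to the prediction. The sketch below assumes squared-error loss, stumps as base learners, and a learning rate of 0.1; these choices, the data, and the function names are illustrative assumptions, not details from the project.

```python
def fit_stump(x, targets):
    """Least-squares one-split tree; returns (threshold, left_mean, right_mean).

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [v for xi, v in zip(x, targets) if xi <= t]
        right = [v for xi, v in zip(x, targets) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with stumps as base learners."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    history = [mse(y, preds)]  # training MSE before any tree is added
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        preds = [pi + learning_rate * (lm if xi <= t else rm)
                 for xi, pi in zip(x, preds)]
        history.append(mse(y, preds))
    return base, stumps, history

# Hypothetical data: charges rising with age.
age = [19, 23, 31, 36, 44, 52, 58, 63]
charges = [2100.0, 2600.0, 4200.0, 5100.0, 7400.0, 9800.0, 11900.0, 13500.0]
base, stumps, history = boost(age, charges)
print(history[0], history[-1])
```

The training error in `history` falls as rounds are added; in practice the number of rounds and the learning rate are tuned on held-out data to avoid overfitting.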

Model Evaluations

Models were evaluated using Root Mean Squared Error (RMSE).
The model with the lowest RMSE, about 4480.7, was identified as the best performer.
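RMSE is the square root of the mean squared difference between actual and predicted charges, so it is expressed in the same units as the charges. A minimal sketch of the comparison follows; the model names and all numbers are hypothetical and are not results from the project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical held-out charges and predictions from two placeholder models.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predictions = {
    "model_a": [1100.0, 1900.0, 3200.0, 3900.0],
    "model_b": [1500.0, 2500.0, 2500.0, 4500.0],
}

scores = {name: rmse(actual, preds) for name, preds in predictions.items()}
best = min(scores, key=scores.get)
print(scores, best)
```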

Conclusion

Findings: Age, BMI, and smoking status are crucial predictors of health insurance charges.
Limitations: The data is limited to the United States and may be biased because of unbalanced factor levels.
Future Ideas: Consider additional variables and collect more data for improved predictions.

pratikshagadhe23/Predictive_Modeling-on_Health_Insurance_Cost