This project aims to build predictive models to estimate health insurance charges based on various personal attributes. The dataset, sourced from Kaggle, includes variables such as age, sex, BMI, smoking status, region, and number of children. Several regression models were evaluated to identify the best predictors for insurance costs.
- Introduction
- Objectives
- Data Exploration
- Modeling
- Model Evaluations
- Conclusion
- References
- Code Appendix
The rising cost of healthcare in the United States has created a significant burden on families, businesses, and taxpayers. This project explores the factors contributing to these high costs and aims to develop predictive models to estimate health insurance charges.
- To find the best regression model
- To predict health insurance charges based on personal data
- Data Source: The dataset was obtained from Kaggle, covers the United States, and contains 1,338 observations and 7 variables.
- Data Pre-processing: No missing data points were identified.
- Exploratory Data Analysis: Summary statistics and data visualizations were generated to understand the relationships between variables.
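The pre-processing and summary steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's own code: the column names and the four sample rows are hypothetical placeholders, and the real dataset has 1,338 observations.

```python
import csv
import io
import statistics

# Hypothetical placeholder rows; column names are assumed, not taken from the project.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
20,female,25.0,0,no,southwest,2000.0
30,male,30.0,1,no,southeast,4000.0
40,female,28.0,2,yes,northwest,25000.0
50,male,32.0,3,no,northeast,9000.0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))

def count_missing(rows):
    """Count empty or absent cells per column."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name in counts:
            value = row.get(name)
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

def summarize(rows, column):
    """Return basic summary statistics for a numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }

rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows)
age_summary = summarize(rows, "age")
```

On the real file, the same functions would be applied after reading the CSV from disk, with `summarize` called once per numeric column.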
The initial linear regression models did not satisfy the assumptions of linearity, constant variance, and normality. The final model was selected as the one with the lowest Akaike Information Criterion (AIC) value.
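For reference, the AIC is conventionally defined as below. The symbols are notation introduced here rather than taken from the project: $k$ is the number of estimated parameters, $\hat{L}$ the maximized likelihood, $n$ the number of observations, and $\mathrm{RSS}$ the residual sum of squares. The second line is the form it takes for a linear model with Gaussian errors, where $C$ is a constant that does not depend on the model being compared.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{AIC}_{\text{Gaussian}} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + C
\end{align}
```

Lower values indicate a better trade-off between fit and model size, which is why the model with the lowest AIC is preferred.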
Lasso regression used L1 regularization to avoid overfitting. The important variables identified were smoker, age, and BMI.
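As a reminder of what L1 regularization means, a common way to write the penalized least-squares objective is shown below. The notation is assumed for illustration: $y_i$ is the charge for observation $i$, $x_i$ its vector of $p$ predictors, and $\lambda \ge 0$ the penalty strength, which is typically chosen by cross-validation.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\left\{ \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Because the penalty is on absolute values, some coefficients can shrink exactly to zero, so the fit also acts as a variable selector.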
A decision tree was built using binary splits chosen to minimize a cost function. The important variables were smoker, age, and BMI.
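To make the idea of a cost-minimizing binary split concrete, here is a minimal sketch that searches one numeric feature for the threshold with the lowest total sum of squared errors. The choice of squared error as the cost, the function names, and the toy data are all assumptions made for illustration, not the project's implementation.

```python
def sse(values):
    """Sum of squared errors around the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimizes total SSE.

    Returns (threshold, cost); rows with x <= threshold go to the left child.
    Returns (None, inf) if the feature has fewer than two distinct values.
    """
    best = (None, float("inf"))
    candidates = sorted(set(xs))
    # Midpoints between consecutive distinct values are the candidate cuts.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical toy data: the target jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
threshold, cost = best_split(xs, ys)
```

A full tree repeats this search over every feature and then recurses into each child until a stopping rule is met.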
An ensemble of decision trees whose predictions were aggregated. The important variables identified were smoker, age, and BMI.
Multiple decision trees were built on bootstrap samples of the data. The important variables identified were age, BMI, and smoker.
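The bootstrapping idea can be sketched as follows: each tree is trained on a sample drawn with replacement from the data, and the ensemble prediction is the average over trees. This is a simplified illustration under stated assumptions: one-split trees (stumps) stand in for full decision trees, there is a single feature, and the data and function names are hypothetical.

```python
import random

def fit_stump(data):
    """Fit a one-split regression tree on (x, y) pairs.

    Falls back to a constant prediction if all x values are equal.
    """
    cuts = sorted({x for x, _ in data})
    best = None  # (cost, threshold, left_mean, right_mean)
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:
        overall = sum(y for _, y in data) / len(data)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_model(data, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        # Bootstrap: draw len(data) rows with replacement.
        sample = [rng.choice(data) for _ in data]
        stumps.append(fit_stump(sample))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

# Hypothetical toy data with a low group and a high group.
data = [(1, 5.0), (2, 6.0), (3, 5.5), (10, 20.0), (11, 21.0), (12, 19.0)]
predict = bagged_model(data)
low, high = predict(2), predict(11)
```

Averaging over resampled fits mainly reduces variance, which is why it helps most with unstable learners such as deep trees.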
A boosting ensemble that builds decision trees sequentially, each one fitted to reduce the loss function. The important variables identified were smoker, age, and BMI.
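A minimal sketch of this loss-reducing procedure, assuming squared-error loss: each round fits a small tree to the current residuals (the negative gradient of that loss) and adds a shrunken copy of it to the ensemble. The one-split trees, the learning rate, the number of rounds, and the toy data are illustrative assumptions, not the project's settings.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree; return a function x -> prediction."""
    pairs = list(zip(xs, targets))
    cuts = sorted(set(xs))
    best = None  # (cost, threshold, left_mean, right_mean)
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in pairs if x <= t]
        right = [r for x, r in pairs if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:  # all x equal: constant prediction
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error using stumps as weak learners."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical toy data.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
predict = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(x) for x in xs])
```

The learning rate trades training speed for stability: smaller values need more rounds but usually generalize better.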
- Models were evaluated using Root Mean Squared Error (RMSE).
- The model with the lowest RMSE, about 4480.7, was identified as the best performer.
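RMSE is the square root of the mean squared difference between actual and predicted charges, so it is expressed in the same units as the charges. A minimal sketch of the calculation follows; the function name and the three sample values are hypothetical and unrelated to the project's results.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
score = rmse(actual, predicted)
```

To compare models fairly, the same held-out observations should be passed as `actual` for every model.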
- Findings: Age, BMI, and smoking status are crucial predictors of health insurance charges.
- Limitations: The data is limited to the United States and may be biased because some factor levels are unbalanced.
- Future Ideas: Consider additional variables and collect more data for improved predictions.
- Pratiksha Gadhe
- Poornima Yedidi
- Yashi Agarwal