
An analysis done to determine the effect of some categorical variables on expenses.


Folasade-Ojo/Expenses-prediction


Research Requirement

I am interested in the effect that six variables (age, sex, BMI, children, smoker, and region) have on expenses, with a particular focus on smoking.

To achieve this, I will fit a simple linear regression model and a multiple linear regression model. These are needed to establish whether a linear relationship exists between the dependent variable (expenses) and the independent variables.

I will also run a t-test on the dataset to check whether the true mean of expenses is $10,000.

Analysis

Histogram

image

  • The histogram shows that the data is positively skewed (it has a long right tail).
  • Consequently, most of the observations are concentrated to the left of the mean.

Although the t-test and linear regression rely on normality assumptions (of the data and of the residuals, respectively), I will go on with the analyses.
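The skew described above can also be checked numerically. The following is a minimal sketch in base R; it assumes nothing about the real dataset and uses synthetic log-normal values as a stand-in for the expenses column.

```r
# Synthetic, right-skewed data standing in for the expenses column
# (assumption: this is NOT the real dataset).
set.seed(42)
expenses <- rlnorm(1000, meanlog = 9, sdlog = 0.9)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- mean((expenses - mean(expenses))^3) / sd(expenses)^3

# In a right-skewed sample the mean is pulled above the median,
# so more than half of the observations lie below the mean.
share_below_mean <- mean(expenses < mean(expenses))

# Compute the histogram bins without drawing; call plot(h) to display it.
h <- hist(expenses, breaks = 30, plot = FALSE)
```

A positive `skewness` together with `share_below_mean` above 0.5 corresponds to the two observations made about the histogram.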

T Test

To test whether the true population mean of expenses equals 10,000, a one-sample two-tailed t-test will be conducted, since the population standard deviation is unknown and we are checking for equality. The t-test assumes normality of the data; however, as stated earlier, we will go on with the test even though the distribution is skewed.

Hypothesis Statement

  • Null Hypothesis, H0: Average expenses, µ = 10,000
  • Alternative Hypothesis, H1: Average expenses, µ ≠ 10,000

Significance level, α = 0.05

Test and Output:

The test was conducted in RStudio against the hypothesized population mean (10,000) and produced the following output:

  • Degrees of freedom (df) = 1337 (i.e., n-1)
  • The p-value < 2.2e-16, which is far smaller than the significance level of 0.05. The p-value is the probability of observing a sample mean at least this far from 10,000 if the null hypothesis were true; the risk I am willing to take is the significance level, α.
  • Sample mean = 13270.42 (which is not equal to 10,000)
  • 95% confidence interval = (12620.95, 13919.89). This implies that, with 95% confidence, the interval between 12620.95 and 13919.89 contains the true mean. The hypothesized value of 10,000 lies outside this interval.

Conclusion: At a significance level of 0.05, we reject the null hypothesis that the average expenses are 10,000. There is sufficient evidence to indicate that the mean of expenses is not 10,000.
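As a cross-check, the t statistic can be reconstructed from the summary values reported above. This is a minimal sketch, not the original script: it backs out the standard error from the reported confidence interval, and the data frame name `insurance` in the final comment is hypothetical.

```r
# Summary values reported in the output above.
sample_mean <- 13270.42
mu0         <- 10000      # hypothesized mean under H0
df          <- 1337       # n - 1, so n = 1338
ci_lower    <- 12620.95
ci_upper    <- 13919.89

# The 95% CI is mean +/- t_crit * SE, so the standard error can be backed out.
t_crit  <- qt(0.975, df)
std_err <- (ci_upper - ci_lower) / (2 * t_crit)

# One-sample t statistic and two-tailed p-value.
t_stat  <- (sample_mean - mu0) / std_err
p_value <- 2 * pt(-abs(t_stat), df)

# With the raw data, the same test would be a single call, e.g.
# t.test(insurance$expenses, mu = 10000)   # `insurance` is a hypothetical name
```

The resulting `p_value` falls below 2.2e-16, which is consistent with the reported output.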

Simple Linear Regression

Hypothesis Statement

  • Null hypothesis, H0: β1 = 0. There is no linear relationship between smoker and expenses.
  • Alternative hypothesis, H1: β1 ≠ 0. There is a linear relationship between smoker and expenses.

The Regression Model

image

y: expenses, x: smoker

image

Model Interpretation and Evaluation

  • Smoker is a categorical variable: yes, no.
  • Because smoker is coded as a dummy variable (no = 0, yes = 1), a one-unit increase means going from non-smoker to smoker: smokers' expenses are, on average, about 23,616 dollars higher than non-smokers'.
  • According to the R2, the model (or the predictor, smoker) explains about 62% of the variability in the y variable, expenses.
  • The correlation coefficient = 0.79, which indicates a strong positive relationship between smoker and expenses. I obtained this by converting smoker to a dummy variable.
  • The p-value (for both the t-test and the F-test) is < 2.2e-16, which is less than the significance level of 0.05.
  • At the 0.05 significance level, we can reject the null hypothesis that the coefficient of the predictor, smoker, is zero. There is strong evidence of a relationship between smoker and expenses, which supports the alternative hypothesis.
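The interpretation of the slope and of the correlation coefficient can be illustrated with a small sketch. It uses synthetic data (an assumption, not the real dataset) to show that, with a single yes/no predictor, the fitted slope equals the difference between the two group means and R2 equals the squared correlation.

```r
# Synthetic data (assumption: NOT the real dataset); 160 non-smokers, 40 smokers.
set.seed(1)
smoker   <- factor(rep(c("no", "yes"), times = c(160, 40)), levels = c("no", "yes"))
expenses <- 8000 + 23000 * (smoker == "yes") + rnorm(200, sd = 6000)

fit <- lm(expenses ~ smoker)

# R encodes the factor as a 0/1 dummy named "smokeryes" ("no" is the baseline).
slope     <- unname(coef(fit)["smokeryes"])
mean_diff <- mean(expenses[smoker == "yes"]) - mean(expenses[smoker == "no"])

# With a single predictor, R-squared is the squared correlation coefficient.
r         <- cor(as.numeric(smoker == "yes"), expenses)
r_squared <- summary(fit)$r.squared
```

The same identity ties the reported figures together: the square root of the reported R2 of 0.62 is roughly 0.79, the reported correlation coefficient.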

Multiple Linear Regression

Hypothesis Statement

  • H0: β1 = β2 = β3 = β4 = β5 = β6 = 0. None of the independent variables has a linear relationship with expenses.
  • H1: βi ≠ 0 for at least one i, where i = 1, 2, ..., 6. At least one of the independent variables influences expenses.

Multiple Linear Regression Model

image

image

image

  • According to the adjusted R2, the model explains about 75% of the variability in the y variable, expenses.
  • The p-value of the model (< 2.2e-16) is less than 0.05. It seems like a good model, but we will still consider the individual independent variables.
  • Weight: smoker = yes has the largest positive coefficient for the dependent variable, expenses.
  • P-values
    • Age, bmi, children, and smoker (yes) are significant at the 95% confidence level (i.e., α = 0.05)
    • Region (southeast) and region (southwest) are significant at the 90% confidence level (i.e., α = 0.10)
    • Sex and region (northwest) are not significant, as their p-values are greater than 0.05
  • At the 0.05 significance level, we reject the null hypothesis that none of the independent variables has a relationship with expenses. At least one coefficient is statistically significant.
  • Also, at the significance level of 0.05, it is best to use age, bmi, children, and smoker (yes) to predict expenses. At α = 0.10, region (southeast) and region (southwest) can be used as well.
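The way the per-variable p-values are read off and grouped by significance level can be sketched as follows. The data is synthetic and the column names are assumptions, so the printed p-values will not match those in the output above; only the procedure carries over.

```r
# Synthetic data (assumption: NOT the real dataset); column names are assumed.
set.seed(7)
n <- 300
d <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
d$expenses <- 250 * d$age + 300 * d$bmi + 23000 * (d$smoker == "yes") +
  rnorm(n, sd = 6000)

fit        <- lm(expenses ~ age + sex + bmi + children + smoker + region, data = d)
coef_table <- coef(summary(fit))
p_values   <- coef_table[, "Pr(>|t|)"]

# Label each term by the strictest level at which it is significant.
level <- cut(p_values, breaks = c(0, 0.05, 0.10, 1),
             labels = c("alpha = 0.05", "alpha = 0.10", "not significant"),
             include.lowest = TRUE)
print(data.frame(term = names(p_values), p_value = p_values, level = level))
```

Note that R expands each categorical variable into dummy columns, so region contributes three coefficients (northeast is the baseline) and each one gets its own p-value, which is why region (southeast), region (southwest), and region (northwest) are assessed separately.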

Conclusion

  • Both models have the same reported overall p-value (< 2.2e-16).
  • Overall, both models capture a statistically significant relationship with expenses, since the null hypothesis has been rejected in each case. However, the multiple linear regression model is a better model for predicting expenses because it contains more variables that can be relied on at a 5% or 10% significance level. Also, only the significant variables should be used in the model.
  • Finally, according to the adjusted R2, the multiple linear regression model (75%) can explain a higher percentage of variability in expenses than the simple linear model (62%).
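The comparison above sets the adjusted R2 of the multiple model against the R2 of the simple model. A quick sketch of the standard adjustment formula suggests this makes little difference here; it assumes the regression used the same n = 1338 observations implied by the t-test's degrees of freedom.

```r
n         <- 1338   # df + 1 from the t-test output (assumed to apply here too)
r_squared <- 0.62   # reported R-squared of the simple model
p         <- 1      # one predictor (smoker)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

With one predictor and a sample this large, the adjusted value stays at roughly 0.62, so the gap to 75% reflects the additional variables rather than the adjustment.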
