---
title: "SDS 358 Group5 RP#3"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## I. Group Members
Xiaohan Wu (xw4822)  
Ningxin Liu (nl8385)  
Binglin Zhang (bz3856)
## II. Loading Data
We keep only the five predictors and the response we need from the raw data.
```{r install, echo=FALSE, results='hide', message=FALSE}
# install.packages("tidyverse")
# install.packages("psych")
# install.packages("aod")
# install.packages("ggplot2")
```
```{r load-lib-data, echo=FALSE, results='hide', message=FALSE}
library(tidyverse)
library(car)
library(psych)
library(aod)
library(ggplot2)
library(leaps)
library(readr)
raw_data <- read.csv("Pokemon.csv")
data <- raw_data[c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")]
```
## III. Correlation Matrix Plot
```{r cor matrix}
pairs.panels(data, method = "pearson")
```
## IV. Interaction Analysis
```{r interaction1}
# Calculate the interaction term:
data <- data %>%
  mutate(def.spdf = Defense * Sp..Def)
# Fit the regression model with all five predictors and the interaction effect
reg2 <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + def.spdf, data)
# Display the summary table for the regression model
summary(reg2)
```
```{r interaction2}
# Calculate the interaction term:
data <- data %>%
  mutate(spat.spdf = Sp..Atk * Sp..Def)
# Fit the regression model with all five predictors and the interaction effect
reg3 <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + spat.spdf, data)
# Display the summary table for the regression model
summary(reg3)
```
```{r interaction3}
# Calculate the interaction term:
data <- data %>%
  mutate(atk.df = Attack * Defense)
# Fit the regression model with all five predictors and the interaction effect
reg4 <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + atk.df, data)
# Display the summary table for the regression model
summary(reg4)
```
```{r interaction4}
# Calculate the interaction term:
data <- data %>%
  mutate(spat.speed = Sp..Atk * Speed)
# Fit the regression model with all five predictors and the interaction effect
reg5 <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + spat.speed, data)
# Display the summary table for the regression model
summary(reg5)
```
The significant interactions are Attack\*Defense and Defense\*SpecialDefense.
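As a quick sketch, the interaction p-values can also be pulled out programmatically from the four fitted models above; since each interaction term is added last, it occupies the final row of each coefficient table.

```{r interaction-pvalues}
# Extract the p-value of the interaction term (last coefficient row) from each model
interactions <- list(def.spdf = reg2, spat.spdf = reg3, atk.df = reg4, spat.speed = reg5)
sapply(interactions, function(m) {
  ctab <- coef(summary(m))       # coefficient table of the fitted model
  ctab[nrow(ctab), "Pr(>|t|)"]   # the interaction term is the last row
})
```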
## V. Model Selection
```{r stepwise forward}
# Stepwise regression: forward selection
data <- data %>%
  mutate(atk.df = Attack * Defense, def.spdf = Defense * Sp..Def)
FitStart <- lm(HP ~ 1, data)
FitAll <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + def.spdf + atk.df, data)
step(FitStart, direction = "forward", scope = formula(FitAll))
```
```{r stepwise backward}
# Stepwise regression: backward selection
# (no scope needed: backward selection starts from the full model and only drops terms)
step(FitAll, direction = "backward")
```
The results indicate that we should include all predictors except Defense in the model. However, by the hierarchy principle, since we include the interaction term Attack\*Defense, we must also include Defense itself in our model.
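Following the hierarchy principle, one way to sketch this in code is to save the backward-selection result and add Defense back with `update()` (the object names `reg_step` and `reg_hier` are our own):

```{r hierarchy-refit}
# Re-run backward selection quietly, then restore Defense per the hierarchy principle
reg_step <- step(FitAll, direction = "backward", trace = FALSE)
reg_hier <- update(reg_step, . ~ . + Defense)
summary(reg_hier)
```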
```{r best subset}
# Best subset regression
models <- regsubsets(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + def.spdf + atk.df,
                     data, nvmax = 7)
models.sum <- summary(models)
models.sum
par(mfrow = c(2, 2))
# SSE
plot(models.sum$rss, xlab = "Number of predictors", ylab = "SSE", type = "l")
# Adjusted R^2
plot(models.sum$adjr2, xlab = "Number of predictors", ylab = "Adjusted RSq", type = "l")
# Mallows' Cp
plot(models.sum$cp, xlab = "Number of predictors", ylab = "Cp", type = "l")
# BIC
plot(models.sum$bic, xlab = "Number of predictors", ylab = "BIC", type = "l")
```
The best subset selection results indicate that we should use the model with seven predictors, since it has the highest adjusted $R^2$ and the lowest SSE and $C_p$, even though its BIC is slightly higher than that of some smaller models.
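The model size preferred by each criterion can also be read off programmatically rather than from the plots; a quick sketch using the `models.sum` object above:

```{r criterion-optima}
# Model size favored by each selection criterion
which.max(models.sum$adjr2)  # size with the highest adjusted R^2
which.min(models.sum$cp)     # size with the lowest Mallows' Cp
which.min(models.sum$bic)    # size with the lowest BIC
```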
## VI. Diagnostics
```{r model with interactions}
reg <- lm(HP ~ Attack + Defense + Sp..Atk + Sp..Def + Speed + def.spdf + atk.df, data)
summary(reg)
```
We choose the model $\widehat{HP} = 21.687 + 0.141\,Attack + 0.054\,Defense + 0.084\,SpecialAttack + 0.597\,SpecialDefense - 0.098\,Speed - 0.0036\,Defense \times SpecialDefense + 0.002\,Attack \times Defense$.
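For illustration only, the fitted model can be used to predict HP for a hypothetical set of stats. The stat values below are invented, and the interaction columns must be computed to match the model's terms:

```{r example-prediction}
# Hypothetical Pokemon stats, invented purely for illustration
new_obs <- data.frame(Attack = 80, Defense = 70, Sp..Atk = 90,
                      Sp..Def = 75, Speed = 100)
new_obs$def.spdf <- new_obs$Defense * new_obs$Sp..Def
new_obs$atk.df <- new_obs$Attack * new_obs$Defense
# Point prediction with a 95% prediction interval
predict(reg, newdata = new_obs, interval = "prediction")
```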
### Residual Plot
```{r residuals}
data$resids <- residuals(reg)
data$predicted <- predict(reg)
```
```{r residual plot}
ggplot(data, aes(x=Defense, y=resids)) +
geom_point() + geom_hline(color='blue', yintercept = 0)
```
The plot of residuals against Defense suggests that the linearity condition is not violated. However, the condition of equal variance is not met, and there are some outliers.
```{r normal probability plot}
ggplot(data, aes(sample=resids)) +
stat_qq() + stat_qq_line()
```
From the normal probability plot, the normality assumption for the errors appears approximately met, except for some outliers in the right tail. To address this, we consider a log transformation of the response.
### Log Transformation on Response
```{r transformation}
data <- data %>%
mutate(lnHP = log(HP))
```
```{r reg}
reg_ln_old <- lm(lnHP ~ Attack+Defense+Sp..Atk+Sp..Def+Speed+def.spdf+atk.df, data)
summary(reg_ln_old)
reg_ln <- lm(lnHP ~ Attack+Defense+Sp..Atk+Sp..Def+Speed+def.spdf, data)
summary(reg_ln)
data$resids2 <- residuals(reg_ln)
data$predicted2 <- predict(reg_ln)
```
After the log transformation of the response, atk.df has a p-value greater than 0.05, so we omit it from the model and repeat the diagnostic analysis.
```{r residual plot2}
ggplot(data, aes(x=predicted2, y=resids2)) +
geom_point() + geom_hline(color='blue', yintercept = 0)
```
The residual plot indicates that the conditions of linearity and constant error variance are met, apart from a few outliers.
```{r normal probability plot2}
ggplot(data, aes(sample=resids2)) +
stat_qq() + stat_qq_line()
```
The normality assumption for the errors appears approximately met, and the fit is better than in the previous model.
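One practical note, sketched below: predictions from `reg_ln` are on the log scale, so they should be exponentiated before being interpreted in HP units.

```{r back-transform}
# Convert log-scale fitted values back to the HP scale
data$predicted_HP <- exp(data$predicted2)
head(data[c("HP", "predicted_HP")])
```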
## VII. References
[1] Barradas, A. (2016, August 29). *Pokemon with stats*. Kaggle. Retrieved October 13, 2020, from https://www.kaggle.com/abcsds/pokemon