-
Notifications
You must be signed in to change notification settings - Fork 0
/
RP4.Rmd
304 lines (242 loc) · 11.1 KB
/
RP4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
title: "RP 4"
author: "Xiaohan Wu"
date: "12/4/2020"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## I. Group Member
Xiaohan Wu (xw4822)
Ningxin Liu (nl8385)
Binglin Zhang (bz3856)
## II. Research Topic
This research focuses on investigating the effect of Attack, Defense, Special Attack, Special Defense, and Speed on the HP and understanding the way of designers balancing the HP of each kind of Pokemon within other characteristics.
## III. Variables
| Predictor | Description |
|:---------: |:------------------------------------------------------------------------------------:|
| Attack | The base modifier for normal attacks (eg. Scratch, Punch) |
| Defense | The base damage resistance against normal attacks |
| Sp..Atk | Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam) |
| Sp..Def | The base damage resistance against special attacks |
| Speed | Determines which Pokemon attacks first each round |
| Response | Description |
|:--------: |:--------------------------------------------------------------------------------------: |
| HP | hit points, or health, defines how much damage a Pokemon can withstand before fainting |
## IV. Methods of collecting the data
The original data set comes from Kaggle: [pokemon dataset](https://www.kaggle.com/abcsds/pokemon)[1]
It contains 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. The data is about the pokemon game and comes from[1]:
* pokemon.com
* pokemondb
* bulbapedia
## V. Predictions of the Relationship
The research aims to run a multiple linear regression analysis on predictors and the response. We are expecting to see a negative association between each predictor and the response.
## VI. Snippet of data
We only keep the data of five predictors and the response we need.
```{r install, echo=FALSE, results='hide', message=FALSE}
install.packages("tidyverse")
install.packages("psych")
```
```{r load lib&data, echo=FALSE, results='hide', message=FALSE}
library(tidyverse)
library(psych)
raw_data <- read.csv("Pokemon.csv")
data <- raw_data[c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")]
```
## VII. Univariate Analysis
### Summary of Data
```{r data head}
head(data, n=5)
```
```{r Mean and SD of Attack, Defense, Special Attack, Special Defense, Speed}
summary.extended <- data %>%
select(Attack, Defense, Sp..Atk, Sp..Def, Speed) %>%
psych::describe(fast = TRUE) %>%
as_tibble(rownames="rowname") %>%
print(summary.extended)
```
### Histograms
```{r histogram1}
# histogram for Attack
ggplot(raw_data, aes(Attack))+geom_histogram(fill="lightblue")+labs(title = "histogram for Attack", x="Attack", y="relative frequency")+theme_classic()
```
The distribution of the predictor Attack is approximately normal distribution and skewed to the right.
```{r histogram2}
# histogram for Defense
ggplot(raw_data, aes(Defense))+geom_histogram(fill="lightgreen")+labs(title = "histogram for Defense", x="Defense", y="relative frequency")+theme_classic()
```
The distribution of the predictor Attack is approximately normal distribution and skewed to the right with some possible outliers.
```{r histogram3}
# histogram for Special Attack
ggplot(raw_data, aes(Sp..Atk))+geom_histogram(fill="red")+labs(title = "Special Attack", x="Special Attack", y="relative frequency")+theme_classic()
```
The distribution of the predictor Attack is approximately normal distribution and skewed to the right with some possible outliers.
```{r histogram4}
# histogram for Special Defense
ggplot(raw_data, aes(Sp..Def))+geom_histogram(fill="green")+labs(title = "Special Defense", x="Special Defense", y="relative frequency")+theme_classic()
```
The distribution of the predictor Attack is approximately normal distribution and skewed to the right with some possible outliers.
## VIII. Regression Analysis
```{r regression funciton all predictiors}
reg_function <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed,raw_data)
coefficients(reg_function)
```
The regression function is $HP = 31.483 + 0.296Attack - 0.0840Defense + 0.110SpecialAttack + 0.264SpecialDefense - 0.094Speed$.
* For every one unit increase in the Attack, we expect an average $0.296$ increase in the HP.
* For every one unit increase in the Defense, we expect an average $0.0840$ decrease in the HP.
* For every one unit increase in the Special Attack, we expect an average $0.110$ increase in the HP.
* For every one unit increase in the Special Defense, we expect an average $0.264$ increase in the HP.
* For every one unit increase in the Speed, we expect an average $0.296$ decrease in the HP.
'
```{r SSE, MSE AND SQUARE ROOT OF MSE}
# Save fitted (or predicted) values
raw_data$predicted <- predict(reg_function) # Save residuals
raw_data$resids <- residuals(reg_function) # Square the residuals
raw_data$resid2 <- raw_data$resids^2
# Sum the squared residuals
SSE <- sum(raw_data$resid2)
# Divide by the degree of freedom to find the MSE
MSE <- SSE/(22)
# Square root the MSE to find an estimate of \sigma res_sd_error <- sqrt(MSE)
param <- matrix(c(SSE, MSE, sqrt(MSE)),ncol=3,byrow=TRUE)
colnames(param) <- c("SSE", "MSE", "RMSE")
param
```
## IX. Planning for Analysis
```{r install, echo=FALSE, results='hide', message=FALSE}
# install.packages("tidyverse")
# install.packages("psych")
# install.packages("aod")
# install.packages("ggplot2")
```
```{r load lib&data, echo=FALSE, results='hide', message=FALSE}
library(tidyverse)
library(car)
library(psych)
library(aod)
library(ggplot2)
library(leaps)
library(readr)
raw_data <- read.csv("Pokemon.csv")
data <- raw_data[c("HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed")]
```
## Correlation Matrix Plot
```{r cor matrix}
pairs.panels(data, method = "pearson")
```
## Interaction Analysis
```{r interaction1}
# Calculate the interaction term:
data <- data %>%
mutate(def.spdf = Defense*Sp..Def)
# Fit the regression model with 2 predictors and the interaction effect
reg2 <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + def.spdf, data)
# Display the summary table for the regression model
summary(reg2)
```
```{r interaction2}
# Calculate the interaction term:
data <- data %>%
mutate(spat.spdf = Sp..Atk*Sp..Def)
# Fit the regression model with 2 predictors and the interaction effect
reg3 <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + spat.spdf, data)
# Display the summary table for the regression model
summary(reg3)
```
```{r interaction3}
# Calculate the interaction term:
data <- data %>%
mutate(atk.df = Attack*Defense)
# Fit the regression model with 2 predictors and the interaction effect
reg4 <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + atk.df, data)
# Display the summary table for the regression model
summary(reg4)
```
```{r interaction4}
# Calculate the interaction term:
data <- data %>%
mutate(spat.speed = Sp..Atk*Speed)
# Fit the regression model with 2 predictors and the interaction effect
reg5 <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + spat.speed, data)
# Display the summary table for the regression model
summary(reg5)
```
The significant interactions are Attack\*Defense and Defense\*SpecialAttack.
## Model Selection
```{r stepwise forward}
#stepwise regression forward selection
data <- data %>%
mutate(atk.df = Attack*Defense, def.spdf = Defense*Sp..Def)
FitStart <-lm(HP ~1, data)
FitAll <- lm(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + def.spdf +atk.df, data)
step(FitStart,direction="forward", scope =formula(FitAll))
```
```{r stepwise backward}
#stepwise regression backward selection
step(FitAll,direction="backward", scope =formula(FitStart))
```
The results indicate that we should include all predictors expcept the Defense in the model. However, by the hierarchy principle, since we include the interaction term Defense*Attack, we have to include the Defense in our model.
```{r best subset}
#best subset regression
models <- regsubsets(HP~Attack+Defense+Sp..Atk+Sp..Def+Speed + def.spdf +atk.df, data, nvmax =7)
summary(models)
models.sum <-summary(models)
par(mfrow =c(2,2))
# SSE
plot(models.sum$rss, xlab ="Number of predictors", ylab ="SSE", type ="l")
# R2
plot(models.sum$adjr2, xlab ="Number of predictors", ylab ="Adjusted RSq", type ="l")
# Mallow's Cp
plot(models.sum$cp, xlab ="Number of predictors", ylab ="Cp", type ="l")
# BIC
plot(models.sum$bic, xlab ="Number of predictors", ylab ="BIC", type ="l")
```
The best subset selection method indicates that we shall use the model with seven predictors since the model has highest Adjusted $R^2$ and lowest SSE and $C_p$, although the BIC of the model is a little bit higher.
## Diagnostic
```{r model with interactions}
reg <- lm(HP~ Attack+Defense+Sp..Atk+Sp..Def+Speed+def.spdf+atk.df, data)
summary(reg)
```
We decide to choose the model $HP = 21.687 + 0.141Attack + 0.054Defense + 0.084SpecialAttack + 0.597SpecilDefense - 0.098Speed - 0.0036Defense*SpecialDefense + 0.002Attack*Defense$
### Residual Plot
```{r residuals}
data$resids <- residuals(reg)
data$predicted <- predict(reg)
```
```{r residual plot}
ggplot(data, aes(x=Defense, y=resids)) +
geom_point() + geom_hline(color='blue', yintercept = 0)
```
The residual plot shows that the model does not violate the linearity condition. However the condition of equal variance is not met. Moreover, there are some outliers.
```{r normal probability plot}
ggplot(data, aes(sample=resids)) +
stat_qq() + stat_qq_line()
```
From the normal probability plot, we can see that the normal assumption of the normality of the errors seems to be approximately met except some outliers on the right tail. Therefore, we shall consider a log transformation on the response.
### Log Transformation on Response
```{r transformation}
data <- data %>%
mutate(lnHP = log(HP))
```
```{r reg}
reg_ln_old <- lm(lnHP ~ Attack+Defense+Sp..Atk+Sp..Def+Speed+def.spdf+atk.df, data)
summary(reg_ln_old)
reg_ln <- lm(lnHP ~ Attack+Defense+Sp..Atk+Sp..Def+Speed+def.spdf, data)
summary(reg_ln)
data$resids2 <- residuals(reg_ln)
data$predicted2 <- predict(reg_ln)
```
After the log transformation for y, the atk.df has p-value larger than 0.05, so we omitted it from the model and did diagnostic analysis.
```{r residual plot2}
ggplot(data, aes(x=predicted2, y=resids2)) +
geom_point() + geom_hline(color='blue', yintercept = 0)
```
The residual plot indicates that the conditions of constant variance for error terms and linearity are met, except some outliers.
```{r normal probability plot2}
ggplot(data, aes(sample=resids2)) +
stat_qq() + stat_qq_line()
```
The assumption of the normality of the errors seems to be approximately met and is better than the previous model.
## Reference
[1]Barradas, A. (2016, August 29). Pokemon with stats. Retrieved October 13, 2020, from https://www.kaggle.com/abcsds/pokemon