# Model
```{r}
# Candidate predictor variables identified in the previous sections.
# genres_variables (defined in an earlier section) is appended as-is.
list_features <- c("rating",
                   "movie_nRating_log", "movie_z", "movie_mean_rating", "movie_sd_rating",
                   "user_nRating_log", "user_z", "user_mean_rating", "user_sd_rating",
                   "movie_year_out",
                   "time_since_out", "time_movie_first_log", "time_user_first_log",
                   genres_variables)
```
The previous sections identified the following variables, collected in `list_features`, as potentially relevant predictors:
```{r}
list_features
```
In this section we work with both the reduced extract and the full dataset. However, every attempt to train on the full dataset crashed RStudio after memory use exceeded 32 GB.
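The extracts `edx_extract` and `validation_extract` are prepared in an earlier section; purely as an illustration, a 20% extract of this kind could be drawn as follows (a sketch, assuming `edx_full` and `validation_full` are in memory):
```{r eval=FALSE}
# Illustrative sketch only: draw a 20% random sample of each dataset.
set.seed(42, sample.kind = "Rounding")
edx_extract        <- edx_full %>% sample_frac(0.20)
validation_extract <- validation_full %>% sample_frac(0.20)
```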
```{r echo=FALSE}
set.seed(42, sample.kind = "Rounding")
USE_MODEL_EXTRACT <- TRUE
if (USE_MODEL_EXTRACT) {
  edx_training <- edx_extract
  edx_test <- validation_extract
} else {
  # Load the datasets which were saved to disk after running the course source code.
  # 3.3 GB
  if (!exists("edx_full")) edx_full <- readRDS("datasets/edx_full.rds")
  # 383 MB
  if (!exists("validation_full")) validation_full <- readRDS("datasets/validation_full.rds")
  edx_training <- edx_full
  edx_test <- validation_full
}
```
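Given the memory problems noted above, it helps to check the size of the training object before attempting a fit. A minimal check using base R's `object.size`:
```{r eval=FALSE}
# Report the in-memory size of the training set before fitting.
format(object.size(edx_training), units = "auto")
```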
```{r echo=TRUE}
# Datasets used for training.
# edx_training is either an extract or the full dataset. See source code.
x <- edx_training %>% select(one_of(list_features)) %>% as.matrix()  # 2.1 GB on the full set
y <- edx_training %>% select(rating) %>% as.matrix()  # response, kept for reference
```
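Before fitting, a quick sanity check on the design matrix can catch problems early (a suggested check, not part of the original pipeline):
```{r eval=FALSE}
# Confirm the expected number of rows and columns, and check for missing values.
dim(x)
anyNA(x)
```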
The following helper functions:
+ make predictions given a fitted model and return the validation dataset with the squared error of each prediction;
+ append the validation RMSE (defined just below) to a table that will collect the RMSEs of the three models.
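The RMSE computed by `add_rmse` is the usual root mean squared error over the $N$ test ratings $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$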
```{r echo=TRUE}
# Predict on the test set and return it with the squared error of each
# prediction, worst predictions first.
square_fit <- function(fit_model){
  predictions <- fit_model %>% predict(edx_test)
  edx_test %>%
    cbind(predictions) %>%
    mutate(square_error = (predictions - rating)^2) %>%
    arrange(desc(square_error))
}

# Table collecting the RMSE of each model, seeded with the target RMSE.
RMSEs <- tibble(Model = "Target", RMSE = 0.8649)

# Compute the RMSE from the squared errors and append it to the RMSEs table.
add_rmse <- function(name, fit) {
  rm <- sqrt(sum(fit$square_error) / nrow(fit))
  rw <- tibble(Model = name, RMSE = rm)
  RMSEs %>% rbind(rw)
}
```
## Linear regression
The following runs a linear regression on the training data using the predictor variables listed above.
```{r m_lm,echo=TRUE,eval=FALSE}
set.seed(42, sample.kind = "Rounding")
start_time <- Sys.time()
fit_lm <- train(rating ~ .,
data = x,
method = "lm")
# Make predictions
square_lm <- square_fit(fit_lm)
RMSEs <- add_rmse("lm", square_lm)
worst_lm <- square_lm %>% filter(square_error >= 1.5^2)
end_time <- Sys.time()
print(end_time - start_time)
# Results
# reduced dataset = 0.8946755
# full dataset = CRASH
```
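To see which predictors carry the most weight in the fitted model, caret's `varImp` can be consulted (a quick sketch, assuming `fit_lm` from the chunk above is in memory; for `lm` fits, caret ranks predictors by the absolute value of their $t$-statistic):
```{r eval=FALSE}
# Rank predictors by importance (absolute t-statistic for a linear model).
varImp(fit_lm)
```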
## Generalised linear regression
The following runs a generalised linear regression on the training data using the predictor variables listed above.
```{r m_glm,echo=TRUE,eval=FALSE}
set.seed(42, sample.kind = "Rounding")
start_time <- Sys.time()
fit_glm <- train(rating ~ .,
data = x,
method = "glm")
# Make predictions
square_glm <- square_fit(fit_glm)
RMSEs <- add_rmse("glm", square_glm)
worst_glm <- square_glm %>% filter(square_error >= 1.5^2)
end_time <- Sys.time()
print(end_time - start_time)
# Results
# reduced dataset = 0.9486
# full dataset = CRASH
```
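caret's `"glm"` method calls `stats::glm`, whose default family for a numeric response is Gaussian with an identity link. The family actually used can be read off the fitted object (a sketch, assuming `fit_glm` is in memory):
```{r eval=FALSE}
# Confirm which family/link the generalised linear model was fitted with.
fit_glm$finalModel$family
```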
## LASSO regression
The following runs a regularised linear regression on the training data using the predictor variables listed above.
LASSO stands for Least Absolute Shrinkage and Selection Operator. The regularisation operates in two ways:
+ The sum of the absolute values of the coefficients is penalised, shrinking them towards zero.
+ Coefficients shrunk below a certain threshold become exactly zero, effectively removing those predictors.
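Formally, for a tuning parameter $\lambda \ge 0$, the LASSO estimate minimises the penalised residual sum of squares:

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{arg\,min}} \left\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

The larger $\lambda$, the stronger the shrinkage; the cross-validation below selects $\lambda$ from a grid.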
```{r m_lasso,echo=TRUE,eval=FALSE}
# save(fit_lasso, square_lasso, worst_lasso, file = "datasets/model_lasso.rda")
# load("datasets/model_lasso.rda")
set.seed(42, sample.kind = "Rounding")
start_time <- Sys.time()
# Tune the L1 penalty over a coarse logarithmic grid with 10-fold cross-validation.
lambda <- 10^seq(-3, 3, length = 10)
fit_lasso <- train(
  rating ~ .,
  data = x,
  method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)  # alpha = 1 selects the LASSO penalty
)
# Model coefficients at the best cross-validated lambda
coef(fit_lasso$finalModel, fit_lasso$bestTune$lambda)
# Make predictions
square_lasso <- square_fit(fit_lasso)
RMSEs <- add_rmse("lasso", square_lasso)
worst_lasso <- square_lasso %>% filter(square_error >= 1.5^2)
end_time <- Sys.time()
print(end_time - start_time)
# Results
# reduced dataset = 0.94837
# full dataset = CRASH
```
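The cross-validated tuning results are worth inspecting to check that the grid covered a sensible range of penalties (a sketch, assuming `fit_lasso` is in memory):
```{r eval=FALSE}
# Lambda selected by 10-fold cross-validation, and the RMSE profile over the grid.
fit_lasso$bestTune
ggplot(fit_lasso)
```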
## Conclusion
These models, although initially promising, fail to meet our expectations:
+ They reach a reasonable RMSE, but not one below the target threshold of 0.8649. The linear regression model performed best, with an RMSE of 0.8946.
+ More importantly, training and validation could only be carried out on a small sample of the datasets (20%). The computational resources required to train on more data, or to fit more sophisticated models, were out of reach (RStudio crashed numerous times in the process).