This repository has been archived by the owner on Aug 16, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path5_Stochastic_descent.rmd
executable file
·536 lines (364 loc) · 21.3 KB
/
5_Stochastic_descent.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
# Stochastic Gradient Descent
The previous models were based on the expectation that our intuitions, confirmed by visual inspection of the dataset, would lead to better performing models. This section shows this is incorrect. We here present a more _"brute-force"_ model: a *low-rank matrix factorisation* with is approximated by a *stochastic gradient descent*.
This model proves to be very efficient:
+ Before any training, the validation set RMSE is 0.88516 thanks to a non-naive (i.e. not random) initialisation;
+ After very little training, using the initial 3 features (explained below), the RMSE became 0.8304 which improves on the targetted RMSE.
+ A few hours of training brings the RMSE down to *0.7996* with 11 features ^[This is a white lie since many more hours went into exploring the various parameters.]. Visually, the RMSE improvements suggest that additional features may help.
## Latent factor model
The approach we follow ia a Latent Factor Model. This section partly draws on part 9 of the Stanford Machine Learning course taught by Andrew Ng (which we previously completed), and a blog post by Sonya Sawtelle ^[See https://sdsawtelle.github.io/blog/output/week9-recommender-andrew-ng-machine-learning-with-python.html].
In essence, this is a dimension reduction model. But two differences reduce the computational workload:
+ Users and movies are coalesced into groups of similar users and similar movies. This is purely based on the triplets user / movie / rescaled rating. Information about dates, genres is ignored.
+ The model is trained by Stochastic Gradient Descent (SGD). Gradient descent methods are a class of optimisation algorithms that minimise a cost function following downward gradients. SGD is a stochastic version of it where random subsets of the training set are used to converge on very large datasets.
## Formal description
### Low-rank factorisation
#### Singular Value Decomposition
As noted in the course material (section 34.1.1), singular value decomposition (SVD) is widely used in machine learning. Wikipedia provides a general description which includes a geometric intuition ^[https://en.wikipedia.org/wiki/Singular_value_decomposition].
However, this approach is not feasible: the dimensions are too large, the dataset is extremely sparse.
#### Low-rank matrix factorisation (LRMF)
At a high level, the purpose of this assignment is to estimate a matrix $R$ of $N_{users}$ rows by $N_{movies}$ columns, where each value contains the rating given by a user to a movie. This is to be estimated from a sample of values from the training set.
The intuitive and geometric intuition of LRMF is as follows:
+ Work in a low dimensional space ($k$ dimensions).
+ In that space, give each user $u$ and movie $m$ coordinates in that space ($u = (u_1, ..., u_k)$ and $m = (m_1, ..., m_k)$).
+ Note that the cross-product of two points in that space will be zero or close to zero if the points are in perpendicular directions. Conversely, points in a close zone in that space will have a cross-product away from zero. In that sense, movies and users can be grouped together in that space: similar movies would be in the same zone of space, different movies would be in perpendicular positions. Because movies ans users both have coordinates in that space, then can all be mixed and grouped: one can measure the similarities between movies, user or between movie and user. The dimensions are commonly called _features_.
In practice, LRMF is represented by two matrices each with $k$ columns: $P$ of $N_{users}$ rows, and $Q$ of $N_{users}$ rows. The $k$ columns give the $k$ coordinates of each user and movie in the feature space. Choosing a user and a movie, the cross-product of the corresponding rows in $P$ and $Q$ gives $u \times m = \sum_{i = 1}^{k}{u_i m_i}$ and should produce a rating.
The purpose of the algorithm is then to estimate $P$ and $Q$ so that the cross-products match that of the training sets. That is, n matrix notation, $R$ is estimated by $P Q^{\top}$.
![Low-Rank Factorisation](images/matrix_factorization.png)^[source: https://towardsdatascience.com/large-scale-jobs-recommendation-engine-using-implicit-data-in-pyspark-ccf8df5d910e]
It is important to note that the only information used is the rating. Knowledge about the genres of the movies, timestamp of a rating, year a movie premiered is ignored.
### Gradient Descent
SVD is not useful in our context because (1) the size of the matrices involved is too large, and (2) more importantly requires a fully populated matrix (filling out missing values is a difficult issue).
Instead, we will iteratively estimate the $P$ and $Q$ matrices' coefficients by gradient descent. The cost function used represents prediction error with an additional regularisation cost over those coefficients.
#### Cost function
Let's use the following terms:
+ $\Omega$ is the set of all $(user, movie)$ pairs in the training set;
+ For each $(u,m)$ in $\Omega$, $r_{u,m}$ is the rating in the training set.
+ $P$ is written as $p_{i,k}$, $Q$ is written as $q_{j,k}$ with $i \in [1,...,N_{users}]$, $j \in [1,...,N_{movies}]$ and $k \in [1,...,N_{features}]$.
+ $\lambda$ is the regularisation parameter.
Our regularised cost function is written:
$$
J_{P,Q} = \sum_{(i, j) \in \Omega} {\left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right )^2} + \frac{\lambda}{2} \left ( \sum_{i,k}{p_{i,k}^2} + \sum_{j,k}{q_{j,k}^2} \right )
$$
The gradient descent algorithm seeks to minimise the $J_{P,Q}$ cost function by step-wise update of each model parameter $x$ as follows:
$$
x_{t+1} \leftarrow x_{t} - \alpha \frac{\partial J_{P,Q}}{\partial x}
$$
The parameters are the matrix coefficients $p_{i,k}$ $q_{j,k}$. $\alpha$ is the learning parameter that needs to be adjusted.
#### Cost function partial derivatives
The partial derivatives of the cost function is:
$$
\frac{\partial J_{P,Q}}{\partial x} = \frac{\partial}{\partial x} \left ( \sum_{(i, j) \in \Omega} {\left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right )^2} + \frac{\lambda}{2} \left ( \sum_{i,k}{p_{i,k}^2} + \sum_{j,k}{q_{j,k}^2} \right ) \right )
$$
$$
\frac{\partial J_{P,Q}}{\partial x} = \sum_{(i, j) \in \Omega} { 2 \frac{\partial r_{i,j} - \sum_{k=1}^{N_{features}} {p_{i,k} q_{j,k}}} {\partial x} \left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right ) } + \frac{\lambda}{2} \left ( \sum_{i,k}{2 \frac{\partial p_{i,k}}{\partial x} p_{i,k}} + \sum_{j,k}{2 \frac{\partial p_{i,k}}{\partial x} q_{j,k}} \right )
$$
We note that $r_{i,j}$ are constants
$$
\frac{\partial J_{P,Q}}{\partial x} = 2 \sum_{(i, j) \in \Omega} \sum_{k=1}^{N_{features}} \left ( { \frac{\partial - {p_{i,k} q_{j,k}}}{\partial x} \left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right ) } \right ) + \lambda \left ( \sum_{i,k}{\frac{\partial p_{i,k}}{\partial x} p_{i,k}} + \sum_{j,k}{\frac{\partial p_{i,k}}{\partial x} q_{j,k}} \right )
$$
If $x$ is a coefficient of $P$ (resp. $Q$), say $p_{a,b}$ (resp. $q_{a,b}$), all partial derivatives will be nil unless for $(i,j) = (a,b)$.
Therefore:
$$
\frac{\partial J_{P,Q}}{\partial p_{a,b}} = -2 \sum_{(i, j) \in \Omega} { q_{j,b} \left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right )} + \lambda p_{a,b}
$$
and,
$$
\frac{\partial J_{P,Q}}{\partial q_{a,b}} = -2 \sum_{(i, j) \in \Omega} { p_{i,b} \left ( r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}} \right ) } + \lambda q_{a,b}
$$
Since $\epsilon{i, j} = r_{i,j} - \sum_{k=1}^{N_{features}}{p_{i,k} q_{j,k}}$ is the rating prediction error, this becomes:
$$
\frac{\partial J_{P,Q}}{\partial p_{a,b}} = -2 \sum_{(i, j) \in \Omega} { q_{j,b} \epsilon_{i,j}} + \lambda p_{a,b}
$$
and,
$$
\frac{\partial J_{P,Q}}{\partial q_{a,b}} = -2 \sum_{(i, j) \in \Omega} { p_{i,b} \epsilon_{i,j} } + \lambda q_{a,b}
$$
### Stochastic Gradient Descent (SGD)
The size of the datasets is prohibitive to do those calculations across the entire training set.
Instead, we will repeatedly update the model parameters on small random samples of the training set.
Chapter 14 of [@shalev2014understanding] gives an extensive introduction to various SGD algorithms.
We implemented a simple version of the algorithm and present the code in more detail.
## SGD Code walk
The algorithm is implemented from scratch and relies on nothing but the `Tidyverse` libraries.
```{r s_function_definitions,eval=FALSE,echo=TRUE}
library(tidyverse)
```
The quality of the training and predictions is measured by the _root mean squared error_ (RMSE), for which we define a few helper functions (the global variables are defined later):
```{r eval=FALSE,echo=TRUE}
rmse_training <- function(){
prediction_Z <- rowSums(Matrices$P[tri_train$userN,] *
Matrices$Q[tri_train$movieN,])
prediction <- prediction_Z * r_sd + r_m
sqrt( sum((tri_train$rating - prediction)^2 / nSamples) )
}
rmse_validation <- function(){
prediction_Z <- rowSums(Matrices$P[tri_test$userN,] *
Matrices$Q[tri_test$movieN,])
prediction <- prediction_Z * r_sd + r_m
sqrt( sum((tri_test$rating - prediction)^2) / nTest )
}
sum_square <- function(v){
return (sqrt(sum(v^2) / nrow(v)))
}
```
The key function updates the model coefficients. Its inputs are:
+ a list that contains the $P$ an $Q$ matrices, the training RMSE of those matrices, and a logical value indicating whether this RMSE is worse than what it was before the update (i.e. did the update diverge).
+ a `batch_size` that defines the number of samples to be drawn from the training set. A normal gradient descent would use the full training set; by default we only use 10,000 samples out of 10 million (one tenth of a percent).
+ The cost regularisation `lambda` and gradient descent learning parameter `alpha`.
+ A number of `times` to run the descent before recalculating the RMSE and exiting the function (calculating the RMSE is computationally expensive).
The training set used is less rich than the original set. As discussed, it only uses the rating (more exactly on the z_score of the rating). Genres, timestamps,... are discarded.
```{r s_stochastic_gardient,eval=FALSE,echo=TRUE}
# Iterate gradient descent
stochastic_grad_descent <- function(model, times = 1,
batch_size = 10000, lambda = 0.1, alpha = 0.01,
verbose = TRUE) {
# Run the descent `times` times.
for(i in 1:times) {
# Extract a sample of size `batch_size` from the training set.
spl <- sample(1:nSamples, size = batch_size, replace = FALSE)
spl_training_values <- tri_train[spl,]
# Take a subset of `P` and `Q` matching the users and
# movies in the training sample.
spl_P <- model$P[spl_training_values$userN,]
spl_Q <- model$Q[spl_training_values$movieN,]
# rowSums returns the cross-product for a given user and movie.
# err is the term inside brackets in the partial derivatives
# calculation above.
err <- spl_training_values$rating_z - rowSums(spl_P * spl_Q)
# Partial derivatives wrt p and q
delta_P <- -err * spl_Q + lambda * spl_P
delta_Q <- -err * spl_P + lambda * spl_Q
model$P[spl_training_values$userN,] <- spl_P - alpha * delta_P
model$Q[spl_training_values$movieN,] <- spl_Q - alpha * delta_Q
}
# RMSE against the training set
error <- sqrt(sum(
(tri_train$rating_z - rowSums(model$P[tri_train$userN,] *
model$Q[tri_train$movieN,]))^2)
/ nSamples )
# Compares to RMSE before update
model$WORSE_RMSE <- (model$RMSE < error)
model$RMSE <- error
# Print some information to keep track of success
if (verbose) {
cat(" # features=", ncol(model$P),
" J=", nSamples * error ^2 +
lambda/2 * (sum(model$P^2) + sum(model$Q^2)),
" Z-scores RMSE=", model$RMSE,
"\n")
flush.console()
}
return(model)
}
```
Now that the functions are defined, we prepare the data sets.
+ First load the original data if not already available.
```{r s_load_datasets,eval=FALSE,echo=TRUE}
# Load the datasets which were saved on disk after using the course source code.
if(!exists("edx")) edx <- readRDS("datasets/edx.rds")
if(!exists("validation")) validation <- readRDS("datasets/validation.rds")
```
+ Calculate the z-score of all ratings.
```{r s_create_reduced_datasets,eval=FALSE,echo=TRUE}
# Creates a movie index from 1 to nMovies
r_m <- mean(edx$rating)
r_sd <- sd(edx$rating)
training_set <- edx %>%
select(userId, movieId, rating) %>%
mutate(rating_z = (rating - r_m) / r_sd)
test_set <- validation %>%
select(userId, movieId, rating) %>%
mutate(rating_z = (rating - r_m) / r_sd)
```
+ We do not know if there are any gaps in the userId's and movieId's in the datasets. They cannot be used as the row numbers of the $P$ and $Q$ matrices. Therefore we count how many distinct users and movies there are and create an index to link a movieId (resp. userId) to its $Q$ (resp. $P$) -matrix row number.
```{r eval=FALSE,echo=TRUE}
movieIndex <-
training_set %>%
distinct(movieId) %>%
arrange(movieId) %>%
mutate(movieN = row_number())
userIndex <-
training_set %>%
distinct(userId) %>%
arrange(userId) %>%
mutate(userN = row_number())
```
+ For each movie and user, we calculate its mean rating z-score.
```{r eval=FALSE,echo=TRUE}
movieMean <-
training_set %>%
group_by(movieId) %>%
summarise(m = mean(rating_z))
userMean <-
training_set %>%
group_by(userId) %>%
summarise(m = mean(rating_z))
```
+ We can now create the training and validation sets contining the movie index (instead of the movieId), user index and ratings (original and z-score).
```{r eval=FALSE,echo=TRUE}
# Training triplets with z_score
tri_train <- training_set %>%
left_join(userIndex, by = "userId") %>%
left_join(movieIndex, by = "movieId") %>%
select(-userId, -movieId)
tri_test <- test_set %>%
select(userId, movieId, rating) %>%
left_join(userIndex, by = "userId") %>%
left_join(movieIndex, by = "movieId") %>%
select(-userId, -movieId) %>%
mutate(rating_z = (rating - r_m)/r_sd,
error = 0)
```
```{r eval=FALSE,echo=TRUE}
nSamples <- nrow(tri_train)
nTest <- nrow(tri_test)
nUsers <- tri_train %>% select(userN) %>% n_distinct()
nMovies <- tri_train %>% select(movieN) %>% n_distinct()
```
+ The $P$ and $Q$ matrices are defined with 3 latent factors to start with.
```{r s_create_feature_matrices,eval=FALSE,echo=TRUE}
# number of initial latent factors
nLF <- 3
LF_Model <- list( P = matrix(0, nrow = nUsers, ncol = nLF),
Q = matrix(0, nrow = nMovies, ncol = nLF),
RMSE = 1000.0,
WORSE_RMSE = FALSE)
```
+ To speed up the training, the matrices are initialised so that the cross product is the sum of the movie average z-rating ($m_{movieN}$) and user z-rating ($u_{userN}$).
$$
P \times Q^{\top} =
\begin{bmatrix}
1 & u_{1} & 0 \\
1 & u_{2} & 0 \\
\vdots & \vdots & \vdots \\
1 & u_{i} & 0 \\
\vdots & \vdots & \vdots \\
1 & u_{nUser} & 0 \\
\end{bmatrix}
\times
\begin{bmatrix}
m_{1} & m_{2} & \cdots & m_{j} & \cdots & m_{nMovies} \\
1 & 1 & \cdots & 1 & \cdots & 1 \\
0 & 0 & \cdots & 0 & \cdots & 0 \\
\end{bmatrix}
=
\begin{bmatrix}
& \vdots & \\
\cdots & u_{i} + m_{j} & \cdots \\
& \vdots & \\
\end{bmatrix}
$$
```{r eval=FALSE,echo=TRUE}
# Features matrices are initialised with:
# Users: 1st column is 1, 2nd is the mean rating (centered), rest is noise
# Movies: 1st column is the mean rating (centered), 2nd is 1, rest is noise
#
# That way, the matrix multiplication will start by giving reasonable value
LF_Model$P[,1] <- matrix(1, nrow = nUsers, ncol = 1)
LF_Model$P[,2] <- as.matrix(userIndex %>%
left_join(userMean, by ="userId") %>% select(m))
LF_Model$Q[,1] <- as.matrix(movieIndex %>%
left_join(movieMean, by ="movieId") %>% select(m))
LF_Model$Q[,2] <- matrix(1, nrow = nMovies, ncol = 1)
```
+ Random noise is added to all model parameters, otherwise the gradient descent has nowhere to start (zeros wipe everything in the matrix multiplications).
```{r eval=FALSE,echo=TRUE}
# Add random noise
set.seed(42, sample.kind = "Rounding")
LF_Model$P <- LF_Model$P + matrix(rnorm(nUsers * nLF,
mean = 0,
sd = 0.01),
nrow = nUsers,
ncol = nLF)
LF_Model$Q <- LF_Model$Q + matrix(rnorm(nMovies * nLF,
mean = 0,
sd = 0.01),
nrow = nMovies,
ncol = nLF)
```
+ We also have a list that keeps track of all the training steps and values.
```{r eval=FALSE,echo=TRUE}
rm(list_results)
list_results <- tibble("alpha" = numeric(),
"lambda" = numeric(),
"nFeatures" = numeric(),
"rmse_training_z_score" = numeric(),
"rmse_training" = numeric(),
"rmse_validation" = numeric())
```
The main training loop runs as follows:
+ We start with 3 features.
+ The model is updated in batches of 100 updates. This is done up to 250 times. At each time, if the model starts diverging, the learning parameter ($\alpha$) is reduced.
+ Once the 250 times have passed, or if $\alpha$ has become incredibly small, or if the RMSE doesn't really improve anymoe (by less than 1 millionth), we add another features and start again.
```{r s_training_loop,eval=FALSE,echo=TRUE}
initial_alpha <- 0.1
for(n in 1:100){
# Current number of features
number_features <- ncol(LF_Model$P)
# lambda = 0.01 for 25 features, i.e. for about 2,000,000 parameters.
# We keep lambda proportional to the number of features
lambda <- 0.1 * (nUsers + nMovies) * number_features / 2000000
alpha <- initial_alpha
cat("CURRENT FEATURES: ", number_features,
"---- Pre-training validation RMSE = ", rmse_validation(), "\n")
list_results <- list_results %>% add_row(alpha = alpha,
lambda = lambda,
nFeatures = number_features,
rmse_training_z_score = LF_Model$RMSE,
rmse_training = rmse_training(),
rmse_validation = rmse_validation())
for (i in 1:250){
pre_RMSE <- LF_Model$RMSE
LF_Model <- stochastic_grad_descent(model = LF_Model,
times = 100,
batch_size = 1000 * number_features,
alpha = alpha,
lambda = lambda)
list_results <- list_results %>% add_row(alpha = alpha,
lambda = lambda,
nFeatures = number_features,
rmse_training_z_score = LF_Model$RMSE,
rmse_training = rmse_training(),
rmse_validation = rmse_validation())
if (LF_Model$WORSE_RMSE) {
alpha <- alpha / 2
cat("Decreasing gradient parameter to: ", alpha, "\n")
}
if (initial_alpha / alpha > 1000 |
abs( (LF_Model$RMSE - pre_RMSE) / pre_RMSE) < 1e-6) {
break()
}
}
# RMSE against validation set:
rmse_validation_post <- rmse_validation()
cat("CURRENT FEATURES: ", number_features,
"---- POST-training validation RMSE = ", rmse_validation_post, "\n")
# if (number_features == 12){
# break()
# }
# Add k features
k_features <- 1
LF_Model$P <- cbind(LF_Model$P,
matrix(rnorm(nrow(LF_Model$P) * k_features,
mean = 0,
sd = sd(LF_Model$P)/100),
nrow = nrow(LF_Model$P),
ncol = k_features))
LF_Model$Q <- cbind(LF_Model$Q,
matrix(rnorm(nrow(LF_Model$Q) * k_features,
mean = 0,
sd = sd(LF_Model$Q)/100),
nrow = nrow(LF_Model$Q),
ncol = k_features))
}
```
The following table shows the RMSE on the validation set that is obtained for a given number of features.
```{r}
if(!exists("list_results")) list_results <- readRDS("datasets/LRMF_results.rds")
options(digits = 6)
list_results %>% group_by(nFeatures) %>% summarise(best_RMSE = min(rmse_validation))
options(digits = 6)
```
This plot shows the progress of the RMSE on the validation set. It shows an overall improvement with the number of features, with little worsening spikes each time a feature seeded with random values is added.
```{r fig.cap="Plot of the RmSE on the validation test"}
list_results %>% ggplot(aes(n, rmse_validation, color = nFeatures)) + geom_point()
```
We also developed a re-implementation in Julia (availabe on Gihub) that we used to cross-check the R implementation. It gave similar results (in much less time):