This code trains a matrix factorization model that learns to predict the rating (as stars from 1-5) given by a user to a particular movie.
The model implemented is a very basic matrix factorization model that has the following form:

rating(u, m) = mu + b_u + b_m + p_u · q_m

Essentially, the model predicts the rating given by a user 'u' to a movie 'm' by learning two vectors, one for 'u' and one for 'm', and taking their dot product (along with adding a couple of bias terms: a global mean mu, a user bias b_u and a movie bias b_m). Hence the name movievecs: vectors are learnt for movies (as well as users).
This model is a very simplified version of the matrix factorization model discussed in Koren (2009).
- Python, as well as Numpy (and tqdm, to display nice progress bars):
$ pip install numpy tqdm
- Bash shell (on Windows you can use Git Bash).
- Download the Netflix Prize data. This dataset consists of real ratings given on Netflix by nearly half a million users to 17770 movies. The whole dataset is about 2 GB in size. Verify that it contains the following files:
combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt movie_titles.csv probe.txt qualifying.txt README
- Create a directory called netflix and move all these files to that directory. (Actually, only the first five files are required.)
- Run the script extract.sh to extract the data for movies of your choice from this Netflix dataset. For example:
$ bash extract.sh 'apollo 13'
ID,Year,Title
7745,1995,Apollo 13
Confirm ID 7745? [y/n] y
Ratings saved to apollo_13_7745.txt
- Create a directory called dataset and move all the .txt files containing the extracted data to that directory.
- Finally, run train.py to train the model. A sample output of this script is shown below:
$ python3 train.py
Loading rating data ...
files: 100%|███████████████████████████████████| 36/36 [00:37<00:00, 1.04s/it]
3742159 total.
Split into training data with 2805813 ratings and test data with 936346 ratings.
Performing stochastic gradient descent (with adagrad update) ...
iterations: 100%|█████████████████| 2805813/2805813 [01:46<00:00, 26392.53it/s]
Calculating training RMSE ...
iterations: 100%|████████████████| 2805813/2805813 [00:16<00:00, 168617.65it/s]
Training RMSE: 0.793794
Calculating test RMSE ...
iterations: 100%|██████████████████| 936346/936346 [00:05<00:00, 174716.14it/s]
Test RMSE: 0.919558
Trained parameters saved to params.pkl
- You can load the trained parameters using Python's pickle module:
import pickle

with open('params.pkl', 'rb') as f:
    params = pickle.load(f)
params is a Python dictionary containing the parameters of the model (mu, user_biases, movie_biases, user_vecs, movie_vecs) as well as lists of all the movie names (movie_names) and user IDs (user_ids).
>>> params.keys()
dict_keys(['mu', 'user_biases', 'movie_biases', 'user_vecs', 'movie_vecs', 'user_ids', 'movie_names'])
To predict the rating for a user given by ID user and a movie given by its name movie, you can do:
import numpy as np
u, m = params['user_ids'].index(user), params['movie_names'].index(movie)
rating = (
params['mu'] + params['user_biases'][u] + params['movie_biases'][m] +
np.dot(params['user_vecs'][u], params['movie_vecs'][m])
)
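For convenience, you can wrap this lookup and dot product in a small helper. The sketch below is my own addition, not part of train.py; in particular, the predict_rating name and the optional clipping of predictions to the 1-5 star range are assumptions made for illustration:

import numpy as np

def predict_rating(params, user, movie, clip=True):
    """Predict the star rating a given user would give a given movie (illustrative helper)."""
    u = params['user_ids'].index(user)        # position of the user's bias and vector
    m = params['movie_names'].index(movie)    # position of the movie's bias and vector
    rating = (
        params['mu'] + params['user_biases'][u] + params['movie_biases'][m] +
        np.dot(params['user_vecs'][u], params['movie_vecs'][m])
    )
    # The raw prediction is unbounded; clipping to 1-5 stars is an optional convenience.
    return min(max(rating, 1.0), 5.0) if clip else rating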
Here, I discuss a clustering experiment on the movie vectors, which yields very interesting results.
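If you want to poke at the movie vectors yourself before any clustering, a quick check is to look at cosine similarities between them. This is only a sketch under my own assumptions (chiefly that movie_vecs can be stacked into a 2-D NumPy array and that 'Apollo 13' is among the extracted movies); it is not the clustering experiment itself:

import pickle
import numpy as np

with open('params.pkl', 'rb') as f:
    params = pickle.load(f)

# Stack the movie vectors into one (num_movies, dim) array and normalize each row.
vecs = np.array(params['movie_vecs'], dtype=float)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Cosine similarity of every movie to a chosen one, most similar first.
m = params['movie_names'].index('Apollo 13')   # assumes this exact title is present
sims = vecs @ vecs[m]
for i in np.argsort(-sims)[:5]:
    print(params['movie_names'][i], round(float(sims[i]), 3))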
More details on both the scripts (extract.sh and train.py) are given in the details section below.
Note that the Netflix Prize data was published in 2006, so you will not find the most recent films in that dataset.
A note on timing: the train.py script is quite fast. On a PC with 8 GB RAM and a 4-core CPU, given the extracted data of 36 movies (about 3.7 million ratings in total), this script takes about 3 minutes to run.
extract.sh takes a search phrase as argument and looks through the file movie_titles.csv for matches. Ensure you use proper quoting when the phrase involves multiple words, for example 'apollo 13'. When there is only one match, it simply prompts the user for a confirmation of the movie's ID:
$ bash extract.sh 'apollo 13'
ID,Year,Title
7745,1995,Apollo 13
Confirm ID 7745? [y/n] y
Ratings saved to apollo_13_7745.txt
but when there is more than one match, it prompts the user to manually enter the correct ID:
$ bash extract.sh 'armageddon'
ID,Year,Title
69,2003,WWE: Armageddon 2003
621,1997,Armageddon
6972,1998,Armageddon
8180,1993,Warlock: The Armageddon
9322,1979,Doctor Who: The Armageddon Factor
13429,1998,Getter Robo Armageddon: Vol. 1: Resurrection
Manually enter ID: 6972
Ratings saved to armageddon_6972.txt
The name of the file to which the ratings are saved is derived from the search phrase and the ID selected.
If no matches are found for the search phrase:
$ bash extract.sh 'blah blah'
blah blah not found.
train.py trains the matrix factorization model using stochastic gradient descent with AdaGrad updates. As you can see from a sample output
$ python3 train.py
Loading rating data ...
files: 100%|███████████████████████████████████| 36/36 [00:37<00:00, 1.04s/it]
3742159 total.
Split into training data with 2805813 ratings and test data with 936346 ratings.
Performing stochastic gradient descent (with adagrad update) ...
iterations: 100%|█████████████████| 2805813/2805813 [01:46<00:00, 26392.53it/s]
Calculating training RMSE ...
iterations: 100%|████████████████| 2805813/2805813 [00:16<00:00, 168617.65it/s]
Training RMSE: 0.793794
Calculating test RMSE ...
iterations: 100%|██████████████████| 936346/936346 [00:05<00:00, 174716.14it/s]
Test RMSE: 0.919558
Trained parameters saved to params.pkl
this script performs the following:
- it loads the extracted rating data (in my case I had extracted the data of 36 movies, which amounted to about 3.7 million ratings),
- splits the data into training and test sets (by default it is a 75-25 split),
- carries out stochastic gradient descent (one epoch by default),
- calculates the RMSE (root-mean-squared error) of the trained model over the training and test sets,
- and saves the trained parameters to disk.
The user and movie vectors learnt have a dimensionality of 20 by default. The learning rate of all the biases and vectors is 0.1 by default and the L2-regularization penalty is 0.01.
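To make the training step concrete, here is a minimal sketch of a single stochastic update with AdaGrad for one (user, movie, rating) triple, written against the defaults above (20-dimensional vectors, learning rate 0.1, L2 penalty 0.01). The variable names and exact formulation are my own and need not match train.py:

import numpy as np

dim, eta, lam, eps = 20, 0.1, 0.01, 1e-8
rng = np.random.default_rng(0)

# Parameters for a single user u and movie m (toy initialization).
mu = 3.6                                  # global mean rating (example value)
b_u, b_m = 0.0, 0.0                       # user and movie biases
p_u = rng.normal(scale=0.1, size=dim)     # user vector
q_m = rng.normal(scale=0.1, size=dim)     # movie vector

# AdaGrad accumulators: running sums of squared gradients, one per parameter.
g2_bu, g2_bm = 0.0, 0.0
g2_pu, g2_qm = np.zeros(dim), np.zeros(dim)

# --- one stochastic update for an observed rating r ---
r = 4.0                                   # e.g. the sampled user rated the movie 4 stars
pred = mu + b_u + b_m + p_u @ q_m
err = r - pred

# Gradients of the L2-regularized squared error 0.5*err**2 + 0.5*lam*||theta||**2.
grad_bu = -err + lam * b_u
grad_bm = -err + lam * b_m
grad_pu = -err * q_m + lam * p_u
grad_qm = -err * p_u + lam * q_m

# AdaGrad: scale the step for each parameter by the root of its accumulated squared gradient.
g2_bu += grad_bu ** 2;  b_u -= eta * grad_bu / (np.sqrt(g2_bu) + eps)
g2_bm += grad_bm ** 2;  b_m -= eta * grad_bm / (np.sqrt(g2_bm) + eps)
g2_pu += grad_pu ** 2;  p_u -= eta * grad_pu / (np.sqrt(g2_pu) + eps)
g2_qm += grad_qm ** 2;  q_m -= eta * grad_qm / (np.sqrt(g2_qm) + eps)

train.py applies one such update per rating in the training set, more or fewer depending on --epochs.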
You can change all of these default values, of course:
$ python3 train.py --help
usage: train.py [-h] [-d DIM] [-e EPOCHS] [-s TEST_SPLIT]
[--track-loss TRACK_LOSS] [--eta-bu ETA_BU] [--eta-bm ETA_BM]
[--eta-vu ETA_VU] [--eta-vm ETA_VM] [--lambda-bu LAMBDA_BU]
[--lambda-bm LAMBDA_BM] [--lambda-vu LAMBDA_VU]
[--lambda-vm LAMBDA_VM]
A script to train vectors and biases for users and movies.
optional arguments:
-h, --help show this help message and exit
-d DIM, --dim DIM dimensionality of learnt vectors
-e EPOCHS, --epochs EPOCHS
training epochs (a fraction is also allowed)
-s TEST_SPLIT, --test-split TEST_SPLIT
fraction of instances in test split
--track-loss TRACK_LOSS
track stochastic loss after every given number of
iterations
--eta-bu ETA_BU learning rate for user biases
--eta-bm ETA_BM learning rate for movie biases
--eta-vu ETA_VU learning rate for user vectors
--eta-vm ETA_VM learning rate for movie vectors
--lambda-bu LAMBDA_BU
regularization penalty for user biases
--lambda-bm LAMBDA_BM
regularization penalty for movie biases
--lambda-vu LAMBDA_VU
regularization penalty for user vectors
--lambda-vm LAMBDA_VM
regularization penalty for movie vectors
A few of these options probably require more explanation and are described below:
- The number of epochs can be a fraction, i.e. 0.5 will train for half an epoch and 1.5 will train for one-and-a-half epochs.
- To change the train-test split, you have to provide a fraction for the test split. For example a test split of 0.3 will result in a 70-30 train-test split.
- You can track the stochastic loss after every few updates. For example, specifying the number 10000 will report the stochastic loss at the 10000th update, 20000th update, 30000th update and so on. This report is written to a log file train.log:
$ head train.log
INFO:root:epoch 1 iteration 10000: loss 0.429710
INFO:root:epoch 1 iteration 20000: loss 0.166349
INFO:root:epoch 1 iteration 30000: loss 0.590381
INFO:root:epoch 1 iteration 40000: loss 0.587494
INFO:root:epoch 1 iteration 50000: loss 0.322644
INFO:root:epoch 1 iteration 60000: loss 1.863725
INFO:root:epoch 1 iteration 70000: loss 0.588115
INFO:root:epoch 1 iteration 80000: loss 0.681265
INFO:root:epoch 1 iteration 90000: loss 1.252593
INFO:root:epoch 1 iteration 100000: loss 0.169029
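The format of these lines is what Python's logging module produces with its default settings, so the loss tracking can be reproduced with something like the following sketch (the variable names are mine, not necessarily those used in train.py):

import logging

logging.basicConfig(filename='train.log', level=logging.INFO)

track_loss = 10000            # value passed via --track-loss
epoch, iteration = 1, 10000   # example values from inside the training loop
loss = 0.429710               # stochastic loss on the current example

if track_loss and iteration % track_loss == 0:
    # Default format gives lines like "INFO:root:epoch 1 iteration 10000: loss 0.429710".
    logging.info('epoch %d iteration %d: loss %f', epoch, iteration, loss)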
The rest of the options are self-explanatory.