Code Repository: https://github.com/sumanthvrao/MovieBuddy
Report: https://github.com/sumanthvrao/MovieBuddy/Report.pdf
How amazing would it be to watch your favorite movie with someone who shares your interests? We compared different recommendation system models (content-based filtering, collaborative filtering, Restricted Boltzmann Machine) to find common movie interests among a group of people.
Dataset Link (movielens-100k-dataset.zip)
The full MovieLens dataset offers about 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users; this project uses the smaller 100k release.
Our aim is to bring together users with similar movie interests. To do this, we make use of users' movie ratings and their demographic information. We account for a variety of factors (location, interests, and age, to name a few) before suggesting a MovieBuddy to you.
Our data set contains:
- 943 users, 1,682 movies, and 100,000 ratings.
- Each user has rated at least 20 movies.
- Simple demographic information about the users.
Content-based filtering, also referred to as cognitive filtering, recommends items by comparing the content of the items themselves, so the recommendations for a given movie are the same for every user. It avoids the cold-start problem that hampers other recommendation techniques, because the system considers only the content of the movies when making recommendations.
Content-based recommendations rely on the characteristics of the item itself. The major challenge is identifying which characteristics of the item to consider. The original MovieLens dataset contains limited information about each movie: the movie title, year of release, movie id, IMDb URL, and list of genres. This data alone was insufficient to produce valuable recommendations for a movie. We used the TMDb (The Movie Database) API to extract more details for each movie, such as the names of the protagonists and the director. We then created a hybrid feature for each movie, consisting of the name of the movie, year of release, list of genres, name of the director, name of the primary actor, and name of the secondary actor.
The CountVectorizer module identified 9105 distinct features across the dataset, where each feature is a word extracted from the hybrid features of all the movies. We then calculated the cosine similarity of the resulting count matrix with itself to compare each movie with every other movie in the dataset. Based on this similarity matrix, we recommend 15 movies for any given movie.
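The pipeline above can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's code: the three movies and their hybrid feature strings are made-up sample values, and the hand-written `bag_of_words` and `cosine` helpers only stand in for scikit-learn's `CountVectorizer` and `cosine_similarity` (whose default tokenization differs slightly).

```python
import math
from collections import Counter

# Illustrative hybrid features (assumed sample data): title, year, genres,
# director, primary actor, secondary actor, all joined into one string.
movies = {
    "Toy Story": "toy story 1995 animation comedy john lasseter tom hanks tim allen",
    "Apollo 13": "apollo 13 1995 drama ron howard tom hanks bill paxton",
    "Heat": "heat 1995 action crime michael mann al pacino robert de niro",
}


def bag_of_words(text):
    """Count each whitespace-separated token (a basic count vectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = {title: bag_of_words(text) for title, text in movies.items()}


def recommend(title, top_n=15):
    """Rank every other movie by similarity to `title`, most similar first."""
    scores = [
        (other, cosine(vectors[title], vec))
        for other, vec in vectors.items()
        if other != title
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]


print(recommend("Toy Story", top_n=2))
```

Because the scores depend only on the movies' features, `recommend("Toy Story")` returns the same list no matter which user asks, which is the property described above.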
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences from many users. The collaborative filtering model attempts to recommend movies, and to predict how much a user will like each movie, by considering either user-user similarity or movie-movie similarity.
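As a rough illustration of the user-user variant, the sketch below predicts a missing rating from the ratings of similar users. It is a simplified stand-in, not the algorithm the project ran: the `ratings` dictionary is invented sample data, and `pearson`, `mean_rating`, and `predict` are hypothetical helper names. The prediction is the user's own mean rating plus a similarity-weighted average of how far each neighbor's rating deviates from that neighbor's mean.

```python
import math

# Invented sample data: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def mean_rating(user):
    """Average of all ratings given by `user`."""
    values = ratings[user].values()
    return sum(values) / len(values)


def pearson(u, v):
    """Pearson correlation over the movies both users have rated."""
    common = sorted(ratings[u].keys() & ratings[v].keys())
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][m] for m in common) / len(common)
    mv = sum(ratings[v][m] for m in common) / len(common)
    du = [ratings[u][m] - mu for m in common]
    dv = [ratings[v][m] - mv for m in common]
    denom = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(y * y for y in dv))
    return sum(x * y for x, y in zip(du, dv)) / denom if denom else 0.0


def predict(user, movie):
    """Mean-centered, similarity-weighted prediction of `user`'s rating."""
    base = mean_rating(user)
    num = den = 0.0
    for other in ratings:
        if other == user or movie not in ratings[other]:
            continue
        sim = pearson(user, other)
        num += sim * (ratings[other][movie] - mean_rating(other))
        den += abs(sim)
    # Fall back to the user's own mean when no neighbor rated the movie.
    return base + num / den if den else base


print(round(predict("alice", "D"), 2))
```

Here alice agrees with bob and disagrees with carol, so bob's high rating and carol's low rating for movie D both push alice's predicted rating above her own average.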
The Surprise (Simple Python RecommendatIon System Engine) library was used for collaborative filtering. The results of running different collaborative filtering algorithms are documented in the table below.
Algorithm | Mean RMSE | Mean MAE | Mean fit time | Mean test time |
---|---|---|---|---|
SVD | 0.9358 | 0.7375 | 11.38 | 0.45 |
KNN Basic (pearson baseline) | 1.0005 | 0.7917 | 5.65 | 10.91 |
KNN Basic (MSD) | 0.979 | 0.7731 | 1.23 | 8.47 |
KNN Basic (cosine) | 1.0174 | 0.8045 | 4.41 | 9.26 |
KNN with means (pearson baseline) | 0.9382 | 0.731 | 4.5 | 8.74 |
KNN with means (MSD) | 0.9502 | 0.7486 | 1.34 | 9.71 |
KNN with means (cosine) | 0.9556 | 0.7546 | 4 | 8.55 |
We chose SVD as our collaborative filtering algorithm because, across the 5 folds, it had the lowest mean RMSE and by far the shortest test time. Its mean MAE was second only to KNN with means (pearson baseline).
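For reference, the two error metrics in the table can be computed as in the sketch below. The fold data is invented purely to show the arithmetic (two folds instead of five, for brevity), and the `rmse` and `mae` helpers are written by hand here rather than taken from Surprise.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Invented held-out ratings and model predictions, one pair per fold.
folds = [
    ([4, 3, 5, 2], [3.5, 3.0, 4.0, 3.0]),
    ([1, 5, 4, 4], [2.0, 4.5, 4.0, 3.0]),
]

# "Mean RMSE" and "Mean MAE" are the per-fold metrics averaged over folds.
mean_rmse = sum(rmse(a, p) for a, p in folds) / len(folds)
mean_mae = sum(mae(a, p) for a, p in folds) / len(folds)

print(mean_rmse, mean_mae)
```

Since RMSE weights large errors more than MAE does, two algorithms can rank differently on the two metrics, which is why the table reports both.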
The fundamental idea here is to use a separate RBM for each user, with weights shared across users who rate the same movies. Every RBM has the same number of hidden units, but each RBM has active softmax visible units only for the items rated by its user. If two users have rated the same movie, their two RBMs must use the same weights between the softmax unit for that movie and the hidden units. To obtain binary inputs, each movie a user has rated is represented in that user's RBM by k nodes, one for each rating value from 1 to k. Each node is switched on or off depending on whether the user's rating matches the value it represents. RBMs have been shown to slightly outperform carefully tuned SVD models. In our case, the RBM was a two-layer undirected neural network.
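The binary encoding of ratings described above might look like the following sketch. It covers only the input encoding, not RBM training, and it assumes a 1 to 5 rating scale (k = 5); `encode_user` is a hypothetical helper name, not a function from the project.

```python
K = 5  # assumed rating scale: 1..K


def encode_user(user_ratings, k=K):
    """Map {movie_id: rating} to {movie_id: one-hot list of k visible units}.

    Movies the user has not rated get no entry at all, mirroring the idea
    that a user's RBM only has visible units for the movies that user rated.
    """
    visible = {}
    for movie_id, rating in user_ratings.items():
        if not 1 <= rating <= k:
            raise ValueError(f"rating {rating} is outside 1..{k}")
        units = [0] * k
        units[rating - 1] = 1  # switch on the node for this rating value
        visible[movie_id] = units
    return visible


# Example: a user who rated movie 10 with 4 stars and movie 42 with 1 star.
print(encode_user({10: 4, 42: 1}))
```

Each rated movie contributes exactly one active node out of k, which is what lets the visible layer be treated as a group of softmax units.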