A subreddit recommender system based on user and subreddit similarity, using both implicit and explicit signals.
Abhishek Das, Janvi Palan, Nikhil Bhat, Sukanto Guha
Reddit is one of the biggest, most popular sites in the world, and we frequently use Reddit to stay up to date on subjects that interest us. With 540 million users on Reddit, we feel there is a need for a robust recommender system tailor-made for Reddit, so that users can discover new content and better achieve their browsing goals, which may differ from user to user.
Dataset

Our dataset comprises user comments on Reddit from January 2015: 57 million comments from Reddit users. One interesting property of the dataset is that we have more users than items (subreddits), which is unusual for information retrieval datasets and is also why algorithms built for other datasets cannot be applied directly to Reddit.
Pre-processing
We settled on three approaches and created one dataset for each:

Dataset 1
- Removed user-subreddit interactions with fewer than 5 comments, and comments shorter than 30 characters
- Removed users that were bots and comments that were [deleted]
- Final dataset size: 735,834 users and 14,842 subreddits

Dataset 2
- Removed user-subreddit interactions with fewer than 3 comments, and comments shorter than 10 characters
- Final dataset size: 29 million comments over the same users and subreddits

Dataset 3
- Same filtering as Dataset 2: removed interactions with fewer than 3 comments, and comments shorter than 10 characters
- Final dataset size: 29 million comments over the same users and subreddits
For all three datasets, we cleaned the comments by removing stopwords, stripping punctuation, and converting text to lower case.
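A minimal sketch of this cleaning step, assuming NLTK's English stopword list (the exact tokenizer we used is not recorded here):

```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of NLTK's stopword list
STOPWORDS = set(stopwords.words("english"))
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_comment(text):
    # Lowercase, strip punctuation, then drop stopwords.
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean_comment("This is GREAT -- the best subreddit I have found!"))
```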
Approaches

Subreddit recommendation is an important unsolved problem in the broad field of recommender systems; we tried several methods and finally an ensemble approach to tackle it.
Baseline: binary implicit feedback

This approach uses Dataset 1. We do not consider the actual words in a comment; the fact that a user has commented on a subreddit at all is taken as a signal that they like it, regardless of how many times they have commented. The advantages of this model are that it is simple to implement, gives us a good baseline, and is easily scalable. Its major drawback is that it falls behind on our evaluation metrics.
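One plausible minimal sketch of such a baseline (the report does not pin down the exact recommender built on the binary signal; here we use item-item cosine similarity over a binary user-subreddit matrix):

```python
import numpy as np

# Toy binary interactions: rows = users, cols = subreddits (1 = has commented).
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity from the binary matrix.
norms = np.linalg.norm(R, axis=0, keepdims=True)
item_vecs = R / np.maximum(norms, 1e-12)   # column-normalize each subreddit vector
sim = item_vecs.T @ item_vecs

# Score unseen subreddits for user 0 by summed similarity to their subreddits.
scores = R[0] @ sim
scores[R[0] > 0] = -np.inf                 # mask subreddits the user already has
print(np.argsort(-scores)[:2])             # indices of the top-2 recommendations
```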
ALS matrix factorization (ALS-MF)

This approach uses Dataset 1. In ALS-MF, we theorize that the number of comments a user has on a subreddit is a strong indicator, not just the fact that they have commented: a user with 50 comments on a subreddit presumably finds it more relevant than someone with 5 comments.
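A minimal sketch using the implicit library's ALS implementation, treating comment counts as confidence weights. This assumes implicit >= 0.5, where fit takes a user-item matrix; the hyperparameters are illustrative, not tuned values:

```python
import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Toy user-subreddit comment counts: rows = users, cols = subreddits.
counts = csr_matrix(np.array([
    [50, 5, 0, 0],
    [ 0, 3, 7, 0],
    [ 2, 0, 0, 9],
], dtype=np.float64))

model = AlternatingLeastSquares(factors=16, regularization=0.01, iterations=15)
model.fit(counts)  # comment counts act as confidence in the implicit-feedback model

# Top-2 subreddits for user 0, excluding ones they already comment in.
ids, scores = model.recommend(0, counts[0], N=2, filter_already_liked_items=True)
print(ids, scores)
```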
Bayesian Personalized Ranking (BPR)

This approach involves BPR, which works on (user, subreddit, subreddit) triples. If User1 has commented on Subreddit1 but not on Subreddit2, then the triple (User1, Subreddit1, Subreddit2) is a positive example: the model should rank Subreddit1 above Subreddit2 for User1. We build such triples over all user and subreddit pairs to train the recommender.
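A minimal sketch of one BPR stochastic gradient step on such a triple, assuming matrix-factorization scores x_ui = p_u . q_i and illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 40, 8
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # subreddit factors
lr, reg = 0.05, 0.01

def bpr_step(u, i, j):
    # Triple (u, i, j): user u commented on subreddit i but not on j,
    # so push score(u, i) above score(u, j).
    pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
    x_uij = pu @ (qi - qj)
    g = 1.0 / (1.0 + np.exp(x_uij))            # sigmoid(-x_uij)
    P[u] += lr * (g * (qi - qj) - reg * pu)    # gradient ascent on ln sigmoid(x_uij)
    Q[i] += lr * (g * pu - reg * qi)
    Q[j] += lr * (-g * pu - reg * qj)

bpr_step(u=0, i=3, j=7)
```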
t-BPR (textual BPR)

This approach uses Datasets 2 and 3 and adapts the idea of Visual BPR. In that paper, visual embeddings are found for each item in the Amazon dataset. We adapt this by creating textual embeddings for both users and subreddits, concatenating all the comments made on a subreddit and all the comments made by a user, respectively. Each list of comments is labelled and embedded via gensim. These embeddings are then used to find the k most similar subreddits to recommend to the user.
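A minimal sketch of the embedding step with gensim's Doc2Vec, assuming gensim 4.x (model.dv replaced model.docvecs from 3.x); the 'user/' and 'sub/' tag scheme and all hyperparameters are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One tagged document per subreddit and per user: all of their comments,
# concatenated and tokenized (toy, pre-cleaned data here).
docs = [
    TaggedDocument(words=["cats", "pics", "cute"],        tags=["sub/aww"]),
    TaggedDocument(words=["gpu", "benchmark", "drivers"], tags=["sub/hardware"]),
    TaggedDocument(words=["cute", "cats", "gpu"],         tags=["user/alice"]),
]

model = Doc2Vec(vector_size=32, min_count=1, epochs=50)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# k most similar embeddings to the user's; in practice, keep only 'sub/' tags.
print(model.dv.most_similar("user/alice", topn=2))
```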
t-BPR with learned embeddings

This approach also uses Datasets 2 and 3. The difference from vanilla t-BPR is that the user embeddings are trained by our model from the data instead of taken from gensim. This model is also based on the Visual BPR paper, and uses a deep CNN to train the lower-dimensional embeddings for each user.
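The report does not record the framework or the CNN details, so as one illustration of the general idea, learning an embedding jointly with the BPR objective, here is a VBPR-style sketch in PyTorch (an assumed framework) that swaps the deep CNN for a single learned linear projection of fixed text features:

```python
import torch
import torch.nn.functional as F

n_users, n_items, feat_dim, k = 100, 40, 32, 8
f_items = torch.randn(n_items, feat_dim)          # fixed text features (toy stand-in)

P = torch.randn(n_users, k, requires_grad=True)   # user factors, learned
E = torch.randn(feat_dim, k, requires_grad=True)  # learned projection (CNN stand-in)
opt = torch.optim.Adam([P, E], lr=0.01)

def bpr_loss(u, i, j):
    # Score = user factor . projected text embedding of the subreddit.
    qi, qj = f_items[i] @ E, f_items[j] @ E
    x_uij = (P[u] * (qi - qj)).sum()
    return -F.logsigmoid(x_uij)

loss = bpr_loss(u=0, i=3, j=7)
opt.zero_grad(); loss.backward(); opt.step()
```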
Ensemble

In our project, we realized that combining different models, e.g. ALS with t-BPR, may give better results than any single model, since recommendation should ideally take into account serendipity, novelty, and diversity for the user. Choosing the best combination of models is work in progress, and needs further insight into what users' goals are when browsing Reddit.
Evaluation

For evaluation, we split our data into two sets: training and test. We start with the list of subreddits each user subscribes to, take out 10% of the subreddits associated with the user, and add them to our test set. Once training is complete, we test how many of the held-out subreddits are present in our recommendations.
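A minimal sketch of this per-user holdout split (the 10% figure is from above; the names and details are illustrative):

```python
import random

def split_user_subreddits(user_subs, test_frac=0.1, seed=42):
    # Hold out ~10% of each user's subreddits as that user's test set.
    rng = random.Random(seed)
    train, test = {}, {}
    for user, subs in user_subs.items():
        subs = list(subs)
        rng.shuffle(subs)
        k = max(1, int(len(subs) * test_frac))
        test[user], train[user] = subs[:k], subs[k:]
    return train, test

train, test = split_user_subreddits({"alice": ["aww", "hardware", "python"]})
print(train, test)
```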
To score our models we used AUC. This metric gives the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Because our models are comparison based, AUC works well: its definition pertains directly to the number of pairwise comparisons we perform correctly.
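A minimal sketch of this pairwise AUC for one user, assuming per-subreddit model scores; held-out subreddits are the positives and unobserved ones the negatives:

```python
def user_auc(scores, positives, negatives):
    # Fraction of (positive, negative) pairs the model ranks correctly.
    pairs = [(i, j) for i in positives for j in negatives]
    correct = sum(scores[i] > scores[j] for i, j in pairs)
    return correct / len(pairs)

# Toy example: the held-out subreddit 'aww' should outscore both negatives.
scores = {"aww": 0.9, "hardware": 0.2, "python": 0.6}
print(user_auc(scores, positives=["aww"], negatives=["hardware", "python"]))
# The overall score averages user_auc over all users in the test set.
```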
Tools and libraries used
- Python 3
- Jupyter Notebook
- gensim
- Google Colab
- implicit
- tqdm
- scipy
- nltk