This repository contains the implementation of a Recommender System in Neo4j.
The data used for recommendation come from some of the tables of the MovieLens 25M Dataset, specifically ratings.csv, movies.csv, tags.csv, genome-scores.csv and genome-tags.csv. You need to place them in a data folder.
The script populate_db.py populates a pre-existing Neo4j graph with data from these tables. An example instance of the graph is shown in the figure.
The script generates some pickle files in the data folder (serialized dictionaries that map original dataset ids to the UUIDs used in the Neo4j database).
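As a rough illustration of what such a mapping file could look like, here is a minimal sketch. It is not taken from populate_db.py: the helper names, the file name movie_ids.pickle, and the use of uuid4 are all assumptions made for the example.

```python
import pickle
import uuid
from pathlib import Path


def build_id_map(original_ids):
    """Map each original dataset id to a freshly generated UUID string."""
    # Assumption: random (version 4) UUIDs; the real script may differ.
    return {original_id: str(uuid.uuid4()) for original_id in original_ids}


def save_id_map(id_map, path):
    """Serialize the mapping to a pickle file, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(id_map, handle)


def load_id_map(path):
    """Load a previously saved mapping back into a dictionary."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For example, `save_id_map(build_id_map([1, 2, 3]), "data/movie_ids.pickle")` writes a mapping that `load_id_map` can later read back to translate a MovieLens id into the corresponding graph id.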
NB: you need a Neo4j database running on your machine (the script connects to localhost). The script asks whether you want to delete the existing data from the current database, because running it twice would otherwise duplicate all data.
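The confirm-then-wipe step could be sketched as follows. This is an assumption-laden illustration, not the code in populate_db.py: the function names, the prompt text, the bolt://localhost:7687 URI and the credentials are placeholders, and the official neo4j Python driver is assumed to be installed.

```python
def wants_wipe(answer):
    """Interpret a yes/no answer; anything other than y/yes keeps the data."""
    return answer.strip().lower() in {"y", "yes"}


def wipe_database(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Delete every node and relationship in the database at `uri`."""
    # Imported here so the prompt helpers work without the driver installed.
    from neo4j import GraphDatabase

    # Placeholder credentials: replace with those of your local instance.
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            session.run("MATCH (n) DETACH DELETE n").consume()
    finally:
        driver.close()


def maybe_wipe(ask=input, wipe=wipe_database):
    """Ask the user for confirmation and wipe the database only on 'yes'."""
    if wants_wipe(ask("Delete all existing data first? [y/N] ")):
        wipe()
        return True
    return False
```

Defaulting to "no" means that pressing Enter by accident never deletes data; the price is that a second run without wiping duplicates the graph, as noted above.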
The file datasetanalysis.ipynb contains some statistics on the dataset that help understand performance.
The file queries.ipynb contains execution and performance measures of the queries implied by the following workflow.
- Given a User, find their top k Genres
- Given a User, find their top k Categories
- Given a Genre, find its top k Users
- Given a Category, find its top k Users
- Given a User, find similar users
- Given a User, recommend Movies based on similar Users
- Given a Movie, find similar Movies
- Given a User, recommend Movies based on similarity with the ones they have rated
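To make the "similar users" and "recommend from similar users" steps of this workflow concrete, here is a small pure-Python sketch on invented toy data. It uses Jaccard similarity over sets of liked movies as an assumed metric; the actual queries in queries.ipynb run in Neo4j and may use a different similarity measure and scoring.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_users(target, liked, k=2):
    """Return the k users most similar to `target`, as (user, score) pairs."""
    scores = [
        (other, jaccard(liked[target], movies))
        for other, movies in liked.items()
        if other != target
    ]
    # Drop users with nothing in common, then sort by score (ties by name).
    scores = [pair for pair in scores if pair[1] > 0]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:k]


def recommend(target, liked, k=2, n=3):
    """Recommend up to n movies liked by similar users but not by target."""
    votes = {}
    for other, score in similar_users(target, liked, k):
        for movie in liked[other] - liked[target]:
            # Each similar user votes with a weight equal to their similarity.
            votes[movie] = votes.get(movie, 0.0) + score
    return sorted(votes, key=lambda movie: (-votes[movie], movie))[:n]


# Toy data, invented for this example only.
liked = {
    "alice": {"Heat", "Alien", "Fargo"},
    "bob": {"Heat", "Alien", "Brazil"},
    "carol": {"Fargo", "Brazil", "Tron"},
    "dave": {"Up"},
}

print(similar_users("alice", liked))  # [('bob', 0.5), ('carol', 0.2)]
print(recommend("alice", liked))      # ['Brazil', 'Tron']
```

The item-based steps (similar Movies, recommendations from rated Movies) follow the same pattern with the roles of users and movies swapped.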
The file gds_recommendation.py contains functions used for recommendation, essentially wrappers around functions of the Neo4j Graph Data Science (GDS) library.
Relazione.pdf and Neo4j Recommender System.pdf contain, respectively, a deeper discussion of the project (in Italian) and a summary presentation of it (in English).
To run all the code in the repository, you can create a virtual environment and run the following commands.
virtualenv venv
source ./venv/bin/activate
pip install -r requirements.txt
Non-enterprise editions of Neo4j do not allow more than one active database at a time: if you don't want to use the default database neo4j, you can create a new one and activate it following this procedure.
NB: it is advisable to execute the script populate_db.py on a machine with at least 8 GB of RAM.