RecZilla

RecZilla is a framework for performing metalearning for algorithm selection on recommender systems datasets. It uses a meta-learner model to predict the best algorithm and hyperparameters for new, unseen datasets.

See our NeurIPS 2022 paper at https://arxiv.org/abs/2206.11886.

Overview

The figure below shows an overview of the end-to-end RecZilla framework pipeline.

This repository is based on the public repository RecSys2019_DeepLearning_Evaluation. We use several core functions of this codebase for training and evaluating algorithms, and for reading and splitting datasets. This repo extends the original in several ways:

  • Data_manager/: added several datasets and a global timestamp splitting function
  • Experiment_handler/: added classes and scripts for training and evaluating recsys algorithms on datasets
  • Metafeatures/: added classes and scripts for calculating metafeatures of recsys datasets
  • ParameterTuning/: added a ParameterSpace class and RandomSearch class for random hyperparameter search
  • ReczillaClassifier/: added classes and scripts for training and using a recsys meta-model
  • algorithm_handler.py: added a function for accessing all implemented algorithms and their hyperparameter spaces
  • dataset_handler.py: added a function for accessing all implemented datasets
  • removed several large dataset files from the repo
  • made several small changes and bug fixes to support our experiments

NOTE: unless specified otherwise, all code should be run from the directory reczilla/RecSys2019_DeepLearning_Evaluation/

Table of contents

  1. Installation
  2. Datasets
    1. Loading An Implemented Dataset
    2. Loading All Implemented Datasets
    3. Loading New Datasets
  3. Evaluating Recsys Algorithms
  4. Meta-Learning
    1. Main script overview
    2. Training a new meta-model
    3. Using a Trained Meta-Model for Inference

Installation

You need Python 3.6 to use this repository.

Start by creating a new environment using conda or your preferred method.

# using conda
conda create -n DLevaluation python=3.6 anaconda
conda activate DLevaluation

Once the environment is ready, install all dependencies from the requirements.txt file:

pip install -r requirements.txt

Next, you need to compile all the Cython algorithms. For that you will need gcc and python3-dev, which you can install on Linux as follows:

sudo apt install gcc 
sudo apt-get install python3-dev

Once installed, you can compile all the Cython algorithms by running the following command in the RecSys2019_DeepLearning_Evaluation directory:

python run_compile_all_cython.py

And you're all set!

Datasets

Each recsys dataset is managed using an instance of the DataReader class in DataReader.py. All datasets in our paper are implemented as custom subclasses of DataReader; this object handles downloading, splitting, and I/O. In the current implementation, datasets must be read using a DataReader object.

Before using any recsys dataset for training, testing, or meta-learning tasks, you need to load the dataset by calling the load_data() function of its DataReader object. This function writes a version of the dataset locally.

Loading An Implemented Dataset

Each dataset used in our experiments has a custom DataReader class; a list of these classes can be found in Data_manager.dataset_handler.DATASET_READER_LIST. For example, the following python code downloads the Movielens100K dataset to a local folder; the code block after it then creates a global-timestamp split and saves the split in a different folder:

from Data_manager.Movielens.Movielens100KReader import Movielens100KReader

# Folder where dataset will be loaded from. The dataset will be downloaded if it's not found here.
data_folder = "/home/datasets"

# load the dataset
data_reader = Movielens100KReader(folder=data_folder)
loaded_dataset = data_reader.load_data()
Expected output:
Movielens100K: reload_from_original_data is 'as-needed', will only reload original data if it cannot be found.
Movielens100K: Preloaded data not found, reading from original files...
Movielens100K: Loading original data
Movielens100K: Unable to fild data zip file. Downloading...
Downloading: http://files.grouplens.org/datasets/movielens/ml-100k.zip
In folder: /code/reczilla/RecSys2019_DeepLearning_Evaluation/Data_manager/../Data_manager_split_datasets/Movielens100K/ml-100k.zip
DataReader: Downloaded 100.00%, 4.70 MB, 922 KB/s, 5 seconds passed
Movielens100K: cleaning temporary files
Movielens100K: loading complete
Movielens100K: Verifying data consistency...
Movielens100K: Verifying data consistency... Passed!
Movielens100K: Found already existing folder '/home/datasets'
Movielens100K: Saving complete!

Now, the dataset Movielens100K has been downloaded to the folder /home/datasets. The following python code creates a global timestamp split for this dataset:

from Data_manager.DataSplitter_global_timestamp import DataSplitter_global_timestamp

# Folder where dataset splits will be written
split_folder = "/home/splits/MovieLens100K"

# split the dataset, and write it to file
data_splitter = DataSplitter_global_timestamp(data_reader)
data_splitter.load_data(save_folder_path=split_folder)
Expected output:
DataSplitter_global_timestamp: Cold users not allowed
DataSplitter_global_timestamp: Preloaded data not found, reading from original files...
Movielens100K: Verifying data consistency...
Movielens100K: Verifying data consistency... Passed!
split_data_on_global_timestamp: 192 cold users of total 943 users skipped
DataSplitter_global_timestamp: Split complete
DataSplitter_global_timestamp: Verifying data consistency...
DataSplitter_global_timestamp: Verifying data consistency... Passed!
DataSplitter_global_timestamp: Preloaded data not found, reading from original files... Done
DataSplitter_global_timestamp: DataReader: Movielens100K
	Num items: 1682
	Num users: 751
	Train 		interactions 79999, 	density 6.33E-02
	Validation 	interactions 1535, 	density 1.22E-03
	Test 		interactions 1418, 	density 1.12E-03
DataSplitter_global_timestamp: 
DataSplitter_global_timestamp: Done.

Now, the global timestamp split of Movielens100K has been written to /home/splits/MovieLens100K.
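
To use the split in your own code, you can read the train/validation/test matrices back from disk. The following is a minimal sketch, assuming DataSplitter_global_timestamp exposes the same get_holdout_split() accessor as the other data splitters in this codebase:

from Data_manager.Movielens.Movielens100KReader import Movielens100KReader
from Data_manager.DataSplitter_global_timestamp import DataSplitter_global_timestamp

data_reader = Movielens100KReader(folder="/home/datasets")
data_splitter = DataSplitter_global_timestamp(data_reader)

# reloads the existing split from disk rather than re-splitting
data_splitter.load_data(save_folder_path="/home/splits/MovieLens100K")

# sparse train/validation/test interaction matrices (assumed accessor)
URM_train, URM_validation, URM_test = data_splitter.get_holdout_split()
print(URM_train.shape, URM_train.nnz)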

Loading All Implemented Datasets

The script Data_manager.create_all_data_splits runs the above procedure on all datasets used in our experiments:

usage: create_all_data_splits.py [-h] --data-dir DATA_DIR --splits-dir
                                 SPLITS_DIR

arguments:
  --data-dir DATA_DIR   Directory where the downloaded datasets have been
                        stored. If a dataset is not already downloaded, it
                        will be downloaded.
  --splits-dir SPLITS_DIR
                        Directory where the splits will be saved.
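
For example, the following call (the directory paths are just examples) downloads any missing datasets and writes all splits:

python -m Data_manager.create_all_data_splits \
    --data-dir /home/datasets \
    --splits-dir /home/splits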

Loading New Datasets

To load a recsys dataset that is not currently implemented, you need to create a subclass of Data_manager.DataReader, which specifies the loading procedure for the dataset. Once you create a DataReader for your dataset, you can use the same splitting and loading process from above.

If the dataset is in CSV format with columns user_id, item_id, rating, timestamp, then it is simple to create a class based on the example class ExampleCSVDatasetReader, which loads a dataset from a sample CSV included in this repository.

This class reads a CSV from a fixed path, and loads it using shared functions:

#### from Data_manager/ExampleCSVDataset/ExampleCSVDatasetReader.py
...

URM_path = "../examples/random_rating_list.csv"

(
    URM_all,
    URM_timestamp,
    item_original_ID_to_index,
    user_original_ID_to_index,
) = load_CSV_into_SparseBuilder(
    URM_path, separator=",", header=True, remove_duplicates=True, timestamp=True
)

loaded_URM_dict = {"URM_all": URM_all, "URM_timestamp": URM_timestamp}

loaded_dataset = Dataset(
    dataset_name=self._get_dataset_name(),
    URM_dictionary=loaded_URM_dict,
    ICM_dictionary=None,
    ICM_feature_mapper_dictionary=None,
    UCM_dictionary=None,
    UCM_feature_mapper_dictionary=None,
    user_original_ID_to_index=user_original_ID_to_index,
    item_original_ID_to_index=item_original_ID_to_index,
    is_implicit=self.IS_IMPLICIT,
)

...
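
Putting this together, a reader for a new CSV dataset is a small DataReader subclass. The sketch below follows the pattern of the example reader; the class attributes and the _load_from_original_file() hook are assumptions based on the other readers in Data_manager, and all names not shown in the snippet above are hypothetical:

from Data_manager.DataReader import DataReader
from Data_manager.Dataset import Dataset
from Data_manager.DataReader_utils import load_CSV_into_SparseBuilder  # assumed location

class MyCSVDatasetReader(DataReader):
    DATASET_SUBFOLDER = "MyCSVDataset/"  # assumed class attribute
    IS_IMPLICIT = False

    def _load_from_original_file(self):
        # CSV with columns user_id, item_id, rating, timestamp
        URM_path = "/path/to/my_ratings.csv"

        URM_all, URM_timestamp, item_original_ID_to_index, user_original_ID_to_index = \
            load_CSV_into_SparseBuilder(
                URM_path, separator=",", header=True,
                remove_duplicates=True, timestamp=True,
            )

        return Dataset(
            dataset_name=self._get_dataset_name(),
            URM_dictionary={"URM_all": URM_all, "URM_timestamp": URM_timestamp},
            ICM_dictionary=None,
            ICM_feature_mapper_dictionary=None,
            UCM_dictionary=None,
            UCM_feature_mapper_dictionary=None,
            user_original_ID_to_index=user_original_ID_to_index,
            item_original_ID_to_index=item_original_ID_to_index,
            is_implicit=self.IS_IMPLICIT,
        )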

Evaluating Recsys Algorithms

The main results from our paper are based on a "meta-dataset", which consists of performance metrics for a large number of parameterized recsys algorithms on all recsys datasets implemented in this codebase.

To generate results for each algorithm-dataset pair, we use the script Experiment_handler.run_experiment, which takes several positional arguments:

usage: run_experiment.py [-h]
                         time_limit dataset_name split_type alg_name split_dir
                         alg_seed param_seed num_samples result_dir
                         experiment_name original_split_path

positional arguments:
  time_limit           time limit in seconds
  dataset_name         name of dataset. we use this to find the dataset and
                       split.
  split_type           name of datasplitter to use. we use this to find the
                       split directory.
  alg_name             name of the algorithm to use.
  split_dir            directory containing split data files.
  alg_seed             random seed passed to the recommender algorithm. only
                       for random algorithms.
  param_seed           random seed for generating random hyperparameters.
  num_samples          number of hyperparameter samples.
  result_dir           directory where result dir structure will be written.
                       this directory should exist.
  experiment_name      name of the result directory that will be created.
  original_split_path  full path to the split data. only used for bookkeeping.

For example, the following call trains and evaluates 5 hyperparameter samples for algorithm P3alphaRecommender, using the split created in the previous section. The results of this experiment will be written to ./example-results.


# first, create a directory to write results in
mkdir ./example-results

python -m Experiment_handler.run_experiment \
    7200 \
    Movielens100K \
    DataSplitter_global_timestamp \
    P3alphaRecommender \
    /home/splits/MovieLens100K \
    0 \
    0 \
    5 \
    ./example-results \
    example-experiment \
    original-split-path
Expected output:
[2022-06-21 12:06:57,142] [Experiment.py:__init__] : initializing Experiment: base_directory=/code/reczilla/RecSys2019_DeepLearning_Evaluation/example-results, result_directory=/code/reczilla/RecSys2019_DeepLearning_Evaluation/example-results/example-experiment, data_directory=None
[2022-06-21 12:06:57,143] [Experiment.py:__init__] : found result directory: /code/reczilla/RecSys2019_DeepLearning_Evaluation/example-results/example-experiment
[2022-06-21 12:06:57,143] [Experiment.py:prepare_dataset] : initialized dataset in Movielens100K
[2022-06-21 12:06:57,254] [Experiment.py:prepare_split] : found a split in directory /home/splits/MovieLens100K_splits
[2022-06-21 12:06:57,254] [Experiment.py:prepare_split] : initialized split Movielens100K/DataSplitter_global_timestamp
[2022-06-21 12:06:57,254] [Experiment.py:run_experiment] : WARNING: URM_validation not found in URM_dict for split Movielens100K/DataSplitter_global_timestamp
EvaluatorHoldout: Ignoring 81 (89.2%) Users that have less than 1 test interactions
EvaluatorHoldout: Ignoring 69 (90.8%) Users that have less than 1 test interactions
[2022-06-21 12:06:57,257] [Experiment.py:run_experiment] : starting experiment, writing results to example-results
[2022-06-21 12:06:57,292] [RandomSearch.py:_log_info] : RandomSearch: Starting parameter set

P3alphaRecommender: URM Detected 66 (3.92 %) cold items.
EvaluatorHoldout: Processed 81 (100.0%) in 0.34 sec. Users per second: 240
EvaluatorHoldout: Processed 69 (100.0%) in 0.32 sec. Users per second: 213
DataIO: Json dumps supports only 'str' as dictionary keys. Transforming keys to string, note that this will alter the mapper content.
[2022-06-21 12:06:58,182] [RandomSearch.py:_log_info] : RandomSearch: Starting parameter set 1 of 5

P3alphaRecommender: URM Detected 66 (3.92 %) cold items.
EvaluatorHoldout: Processed 81 (100.0%) in 0.33 sec. Users per second: 243
EvaluatorHoldout: Processed 69 (100.0%) in 0.30 sec. Users per second: 227
[2022-06-21 12:07:00,094] [RandomSearch.py:_log_info] : RandomSearch: Starting parameter set 2 of 5

P3alphaRecommender: URM Detected 66 (3.92 %) cold items.
EvaluatorHoldout: Processed 81 (100.0%) in 0.32 sec. Users per second: 250
EvaluatorHoldout: Processed 69 (100.0%) in 0.31 sec. Users per second: 221
[2022-06-21 12:07:01,058] [RandomSearch.py:_log_info] : RandomSearch: Starting parameter set 3 of 5

P3alphaRecommender: URM Detected 66 (3.92 %) cold items.
EvaluatorHoldout: Processed 81 (100.0%) in 0.38 sec. Users per second: 215
EvaluatorHoldout: Processed 69 (100.0%) in 0.31 sec. Users per second: 220
[2022-06-21 12:07:02,465] [RandomSearch.py:_log_info] : RandomSearch: Starting parameter set 4 of 5

P3alphaRecommender: URM Detected 66 (3.92 %) cold items.
EvaluatorHoldout: Processed 81 (100.0%) in 0.33 sec. Users per second: 248
EvaluatorHoldout: Processed 69 (100.0%) in 0.27 sec. Users per second: 257
[2022-06-21 12:07:04,678] [RandomSearch.py:_log_info] : RandomSearch: Search complete. Output written to: example-results/

[2022-06-21 12:07:04,684] [Experiment.py:run_experiment] : results written to file: example-results/result_20220621_120657_metadata.zip
initial result file: example-results/result_20220621_120657_metadata.zip
renaming to: example-results/result.zip

There are two files of interest created by this experiment script, both written to the results folder provided (example-results):

  • a log file with naming convention result_yyyymmdd_hhmmss_RandomSearch.txt
  • the hyperparameters and evaluation metrics, stored in a zip archive named result.zip
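
The result.zip archive can be inspected programmatically. Below is a minimal sketch, assuming the DataIO helper seen in the log above (Base/DataIO.py) can read the archives written by the experiment script:

from Base.DataIO import DataIO

# DataIO reads the zipped result dictionaries written by run_experiment
data_io = DataIO(folder_path="./example-results/")
results = data_io.load_data(file_name="result.zip")

# inspect the stored hyperparameters and evaluation metrics
print(results.keys())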

Meta-Learning

Main script overview

The main script for meta-learning is run_reczilla.py, which must be run from the folder RecSys2019_DeepLearning_Evaluation. This script has two functions: (1) to train a new meta-model, and (2) to use a pre-trained meta-model to train a new recommender on a dataset; both of these tasks can be performed in the same call.

The script takes in these arguments:

> python -m ReczillaClassifier.run_reczilla -h
usage: run_reczilla.py [-h] [--train_meta] --metamodel_filepath
                       METAMODEL_FILEPATH
                       [--dataset_split_path DATASET_SPLIT_PATH]
                       [--rec_model_save_path REC_MODEL_SAVE_PATH]
                       [--metadataset_name METADATASET_NAME]
                       [--metamodel_name {xgboost,knn,linear,svm-poly}]
                       [--target_metric TARGET_METRIC]
                       [--num_algorithms NUM_ALGORITHMS]
                       [--num_metafeatures NUM_METAFEATURES]

Run Reczilla on a new dataset.

optional arguments:
  -h, --help            show this help message and exit
  --train_meta          Use to train a new metalearner Reczilla model (instead
                        of loading).
  --metamodel_filepath METAMODEL_FILEPATH
                        Filepath of Reczilla model (to save or load).
  --dataset_split_path DATASET_SPLIT_PATH
                        Path of dataset split to perform inference on. Only
                        required if performing inference
  --rec_model_save_path REC_MODEL_SAVE_PATH
                        Destination path for recommender model trained on
                        dataset on dataset_split_path.
  --metadataset_name METADATASET_NAME
                        Name of metadataset (required if training metamodel).
  --metamodel_name {xgboost,knn,linear,svm-poly}
                        Name of metalearner to use (required if training
                        metamodel).
  --target_metric TARGET_METRIC
                        Target metric to optimize.
  --num_algorithms NUM_ALGORITHMS
                        Number of algorithms to use in Reczilla (required if
                        training metamodel).
  --num_metafeatures NUM_METAFEATURES
                        Number of metafeatures to select for metalearner.

Training a new meta-model

The following files are required for training a new metamodel. Both of these files can be downloaded from a public Google Drive folder, here:

  1. Metafeatures.csv: the dataset metafeatures. Note: this file must be placed in the local directory reczilla/RecSys2019_DeepLearning_Evaluation/Metafeatures/
  2. metadata-v2.pkl: the performance metadataset, containing performance metrics of algorithms on each recsys dataset. Note: this file must be placed in the local directory reczilla/RecSys2019_DeepLearning_Evaluation/metadatasets/
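
For example, assuming both files were downloaded to ~/Downloads (the download location is an assumption), they can be put in place as follows:

# run from reczilla/RecSys2019_DeepLearning_Evaluation/
mkdir -p Metafeatures metadatasets
cp ~/Downloads/Metafeatures.csv Metafeatures/
cp ~/Downloads/metadata-v2.pkl metadatasets/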

The script train_reczilla_models.sh shows sample commands for training metalearners for different metrics. The script does the following:

  1. Creates a directory ReczillaModels to save new meta-models
  2. Trains a metamodel for precision @ 10 and saves it to ReczillaModels/prec_10.pickle
  3. Trains a metamodel for training time and saves it to ReczillaModels/time_on_train.pickle
  4. Trains a metamodel for MRR @ 10 and saves it to ReczillaModels/mrr_10.pickle
  5. Trains a metamodel for item hit coverage @ 10 and saves it to ReczillaModels/item_hit_cov.pickle

For this script, the expected output should be similar to the following:

python -m ReczillaClassifier.run_reczilla \
    --train_meta \
    --metamodel_filepath="../ReczillaModels/prec_10.pickle" \
    --target_metric="PRECISION_cut_10" \
    --num_algorithms=10 \
    --num_metafeatures=10
selecting algs and features..
done selecting algs in :  0:00:21.533609
Computing correlations...
done selecting features in :  0:00:02.300044
Metamodel saved to ../ReczillaModels/prec_10.pickle

python -m ReczillaClassifier.run_reczilla \
    --train_meta \
    --metamodel_filepath="../ReczillaModels/time_on_train.pickle" \
    --target_metric="time_on_train" \
    --num_algorithms=10 \
    --num_metafeatures=10
selecting algs and features..
done selecting algs in :  0:00:25.295785
Computing correlations...
done selecting features in :  0:00:02.587050
Metamodel saved to ../ReczillaModels/time_on_train.pickle

python -m ReczillaClassifier.run_reczilla \
    --train_meta \
    --metamodel_filepath="../ReczillaModels/mrr_10.pickle" \
    --target_metric="MRR_cut_10" \
    --num_algorithms=10 \
    --num_metafeatures=10
selecting algs and features..
done selecting algs in :  0:00:15.817387
Computing correlations...
done selecting features in :  0:00:01.631595
Metamodel saved to ../ReczillaModels/mrr_10.pickle

python -m ReczillaClassifier.run_reczilla \
    --train_meta \
    --metamodel_filepath="../ReczillaModels/item_hit_cov.pickle" \
    --target_metric="COVERAGE_ITEM_HIT_cut_10" \
    --num_algorithms=10 \
    --num_metafeatures=10
selecting algs and features..
done selecting algs in :  0:00:20.211772
Computing correlations...
done selecting features in :  0:00:03.447911
Metamodel saved to ../ReczillaModels/item_hit_cov.pickle

Using a Trained Meta-Model for Inference

A sample script to perform inference on a new dataset is provided in run_reczilla_inference.sh. It uses pre-trained Reczilla models (located in the folder ReczillaModels) to select and train a recommender on a dataset at a specified path. This script can be modified to run inference on new datasets.

The only files required for execution are a pre-trained metamodel and a dataset to perform inference on. In the case of run_reczilla_inference.sh, these correspond to:

  1. ReczillaModels/prec_10.pickle (metamodel)
  2. ReczillaModels/time_on_train.pickle (metamodel)
  3. all_data/splits-v5/AmazonGiftCards/DataSplitter_leave_k_out_last (folder with dataset split to perform inference on)

The script does the following:

  1. Uses the pre-trained precision @ 10 meta-model to select an algorithm to train on the dataset under all_data/splits-v5/AmazonGiftCards/DataSplitter_leave_k_out_last, and saves the recommender to a zip file with the prefix prec_10_.
  2. Uses the pre-trained time-on-train meta-model to select an algorithm to train on the same dataset, and saves the recommender to a zip file with the prefix train_time_.

For example, the following command:

  • reads the Movielens100K dataset split created earlier in this README
  • reads the meta-model ReczillaModels/prec_10.pickle created in the example above
  • estimates the best parameterized recsys algorithm for the Movielens100K training split, using the prec_10.pickle metamodel
  • trains this parameterized recsys algorithm on the Movielens100K training split, and saves the trained model to file ../prec_10_{model name}.zip.
python -m ReczillaClassifier.run_reczilla \
        --dataset_split_path="/home/splits/MovieLens100K" \
        --metamodel_filepath="../ReczillaModels/prec_10.pickle" \
        --rec_model_save_path="../prec_10_"

Expected output:
Loading metamodel from ../ReczillaModels/prec_10.pickle
DataSplitter_global_timestamp: Cold users not allowed
DataSplitter_global_timestamp: Verifying data consistency...
DataSplitter_global_timestamp: Verifying data consistency... Passed!
DataSplitter_global_timestamp: DataReader: Movielens100K
        Num items: 1682
        Num users: 751
        Train           interactions 79999,     density 6.33E-02
        Validation      interactions 1535,      density 1.22E-03
        Test            interactions 1418,      density 1.12E-03

DataSplitter_global_timestamp: 

DataSplitter_global_timestamp: Done.
EvaluatorHoldout: Processed 100 (100.0%) in 0.05 sec. Users per second: 1966
Similarity column 100 (100.0%), 68255.56 column/sec. Elapsed time 0.00 sec
EvaluatorHoldout: Processed 100 (100.0%) in 0.05 sec. Users per second: 2009
DataSplitter_global_timestamp: Cold users not allowed
DataSplitter_global_timestamp: Verifying data consistency...
DataSplitter_global_timestamp: Verifying data consistency... Passed!
DataSplitter_global_timestamp: DataReader: Movielens100K
        Num items: 1682
        Num users: 751
        Train           interactions 79999,     density 6.33E-02
        Validation      interactions 1535,      density 1.22E-03
        Test            interactions 1418,      density 1.12E-03

DataSplitter_global_timestamp: 

DataSplitter_global_timestamp: Done.
Chose IALSRecommender:random_25 for PRECISION_cut_10 with predicted value 0.015277831815183163
IALSRecommender: URM Detected 66 (3.92 %) cold items.
IALSRecommender: Epoch 1 of 300. Elapsed time 0.28 sec
IALSRecommender: Epoch 2 of 300. Elapsed time 0.51 sec
IALSRecommender: Epoch 3 of 300. Elapsed time 0.75 sec
...
IALSRecommender: Epoch 299 of 300. Elapsed time 1.33 min
IALSRecommender: Epoch 300 of 300. Elapsed time 1.34 min
IALSRecommender: Terminating at epoch 300. Elapsed time 1.34 min
EvaluatorHoldout: Ignoring 69 (90.8%) Users that have less than 1 test interactions
EvaluatorHoldout: Processed 69 (100.0%) in 0.04 sec. Users per second: 1653

**************************************************
Done training recommender. Summary:
Metric to optimize: PRECISION_cut_10
Chosen algorithm: IALSRecommender:random_25
Predicted performance: 0.015277831815183163
Actual performance: 0.013043478260869566
**************************************************

IALSRecommender: Saving model in file '../prec_10_IALSRecommender'
IALSRecommender: Saving complete
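
Once saved, the trained recommender can be reloaded and queried outside of run_reczilla. The following is a hedged sketch, assuming the load_model() and recommend() interface shared by recommenders in this codebase, and reusing the split and model paths from the example above; the IALSRecommender import path is an assumption based on the codebase layout:

from MatrixFactorization.IALSRecommender import IALSRecommender
from Data_manager.Movielens.Movielens100KReader import Movielens100KReader
from Data_manager.DataSplitter_global_timestamp import DataSplitter_global_timestamp

# rebuild the training URM that the model was fit on (assumed accessor)
data_reader = Movielens100KReader(folder="/home/datasets")
data_splitter = DataSplitter_global_timestamp(data_reader)
data_splitter.load_data(save_folder_path="/home/splits/MovieLens100K")
URM_train, URM_validation, URM_test = data_splitter.get_holdout_split()

# reload the saved model and recommend 10 items for user 0
recommender = IALSRecommender(URM_train)
recommender.load_model("../", file_name="prec_10_IALSRecommender")
print(recommender.recommend(0, cutoff=10))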

Citation

Please cite our work if you use code from this repo:

@inproceedings{reczilla-2022,
  title={On the Generalizability and Predictability of Recommender Systems}, 
  author={McElfresh, Duncan and Khandagale, Sujay and Valverde, Jonathan and Dickerson, John P. and White, Colin}, 
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}, 
}