This code accompanies the following paper [pdf]:
Collaborative Large Language Model for Recommender Systems
Yaochen Zhu, Liang Wu, Qi Guo, Liangjie Hong, Jundong Li.
The ACM Web Conference (WWW) 2024.
which is joint research between the University of Virginia VAST LAB and LinkedIn.
The proposed CLLM4Rec is the first recommender system that tightly combines the ID-based and LLM-based paradigms, leveraging the advantages of both worlds.
Through mutually-regularized pretraining with a soft+hard prompting strategy, language modeling can be effectively conducted on recommendation-oriented corpora with heterogeneous user/item tokens.
We also propose a recommendation-oriented finetuning strategy, such that recommendations of multiple items, with the whole item space as the candidate set, can be generated effectively and without hallucination.
We implement the following main classes based on the Hugging Face 🤗 Transformers library.
TokenizerWithUserItemIDTokens breaks the word sequence down into tokens, with new user/item ID tokens introduced. Specifically, if the vocabulary size of the original tokenizer is vocab_size, then user_i is tokenized to the new ID vocab_size + i and item_j to vocab_size + num_users + j (e.g., GPT-2's vocabulary size is 50257, so user_1 maps to 50258 in the demo below).
Demo:
-----Show the encoding process:-----
Hello, user_1! Have you seen item_2?
['Hello', ',', 'user_1', '!', 'ĠHave', 'Ġyou', 'Ġseen', 'item_2', '?']
[15496, 11, 50258, 0, 8192, 345, 1775, 50269, 30]
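For illustration, the snippet below shows how such a tokenizer could be built on top of the standard Hugging Face add_tokens API. The toy user/item counts are assumptions for the example, and the actual TokenizerWithUserItemIDTokens may handle surrounding whitespace slightly differently.

```python
# Minimal sketch (assumptions: toy num_users/num_items; whitespace handling
# may differ slightly from the repo's TokenizerWithUserItemIDTokens).
from transformers import GPT2Tokenizer

num_users, num_items = 10, 5
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # original vocab size: 50257

# user_i -> 50257 + i, item_j -> 50257 + num_users + j
tokenizer.add_tokens([f"user_{i}" for i in range(num_users)])
tokenizer.add_tokens([f"item_{j}" for j in range(num_items)])

text = "Hello, user_1! Have you seen item_2?"
print(tokenizer.tokenize(text))
print(tokenizer.encode(text))
```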
GPT4RecommendationBaseModel is the base class of the collaborative GPT for recommender systems.
This class extends the vocabulary of the original GPT-2 with the user/item ID tokens. In our implementation, the user/item ID embeddings are randomly initialized. During training, we freeze the token embeddings of the original vocabulary as well as the transformer weights, so that only the user/item ID embeddings are updated.
Demo:
input_ids:
tensor([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
-----Calculated Masks-----
vocab_mask:
tensor([[1, 1, 1],
[0, 0, 0],
[0, 0, 0]])
user_mask:
tensor([[0, 0, 0],
[1, 1, 1],
[0, 0, 0]])
item_mask:
tensor([[0, 0, 0],
[0, 0, 0],
[1, 1, 1]])
-----Embed Vocabulary Tokens-----
vocab_ids:
tensor([[0, 1, 2],
[0, 0, 0],
[0, 0, 0]])
vocab_embeddings:
tensor([[[ 1.4444, 0.0186],
[-0.3905, 1.5463],
[-0.2093, -1.3653]],
[[ 0.0000, 0.0000],
[ 0.0000, 0.0000],
[ 0.0000, 0.0000]],
[[ 0.0000, 0.0000],
[ 0.0000, 0.0000],
[ 0.0000, 0.0000]]], grad_fn=<MulBackward0>)
-----Embed User Tokens-----
user_ids:
tensor([[0, 0, 0],
[0, 1, 2],
[0, 0, 0]])
user_embeds:
tensor([[[-0.0000, 0.0000],
[-0.0000, 0.0000],
[-0.0000, 0.0000]],
[[-0.1392, 1.1265],
[-0.7857, 1.4319],
[ 0.4087, -0.0928]],
[[-0.0000, 0.0000],
[-0.0000, 0.0000],
[-0.0000, 0.0000]]], grad_fn=<MulBackward0>)
-----Embed Item Tokens-----
item_ids:
tensor([[0, 0, 0],
[0, 0, 0],
[0, 1, 2]])
item_embeds:
tensor([[[-0.0000, 0.0000],
[-0.0000, 0.0000],
[-0.0000, 0.0000]],
[[-0.0000, 0.0000],
[-0.0000, 0.0000],
[-0.0000, 0.0000]],
[[-0.3141, 0.6641],
[-1.4622, -0.5424],
[ 0.6969, -0.6390]]], grad_fn=<MulBackward0>)
-----The Whole Embeddings-----
input_embeddings:
tensor([[[ 1.4444, 0.0186],
[-0.3905, 1.5463],
[-0.2093, -1.3653]],
[[-0.1392, 1.1265],
[-0.7857, 1.4319],
[ 0.4087, -0.0928]],
[[-0.3141, 0.6641],
[-1.4622, -0.5424],
[ 0.6969, -0.6390]]], grad_fn=<AddBackward0>)
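The demo above corresponds to the following kind of masked embedding lookup: three tables (frozen vocabulary embeddings, trainable user ID embeddings, trainable item ID embeddings) are combined via masks derived from the token ID ranges. This is a minimal sketch with assumed names and the demo's toy sizes, not the exact code of GPT4RecommendationBaseModel.

```python
# Minimal sketch (assumed names, toy sizes matching the demo) of the masked
# embedding lookup: one table per token type, combined via the three masks.
import torch
import torch.nn as nn

vocab_size, num_users, num_items, dim = 3, 3, 3, 2

wte = nn.Embedding(vocab_size, dim)        # frozen original token embeddings
user_emb = nn.Embedding(num_users, dim)    # trainable user ID embeddings
item_emb = nn.Embedding(num_items, dim)    # trainable item ID embeddings

def embed(input_ids):
    vocab_mask = (input_ids < vocab_size).long()
    user_mask = ((input_ids >= vocab_size) &
                 (input_ids < vocab_size + num_users)).long()
    item_mask = (input_ids >= vocab_size + num_users).long()

    # Shift IDs into each table's range; masked-out positions are zeroed.
    vocab_ids = input_ids * vocab_mask
    user_ids = (input_ids - vocab_size) * user_mask
    item_ids = (input_ids - vocab_size - num_users) * item_mask

    return (wte(vocab_ids) * vocab_mask.unsqueeze(-1)
            + user_emb(user_ids) * user_mask.unsqueeze(-1)
            + item_emb(item_ids) * item_mask.unsqueeze(-1))

print(embed(torch.tensor([[0, 1, 2], [3, 4, 5], [6, 7, 8]])))
```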
CollaborativeGPTwithItemLMHeadBatch defines the collaborative GPT, which takes prompts of the form "user_i has interacted with" and performs language modeling (i.e., next-token prediction) on the corresponding interacted item sequence, e.g., "item_j item_k item_z".
In this case, when predicting the next token, the softmax only needs to be computed over the item space.
Demo:
Prompt ids: tensor([[50257, 468, 49236, 351],
[50258, 468, 49236, 351],
[50259, 468, 49236, 351],
[50260, 468, 49236, 351],
[50261, 468, 49236, 351],
[50262, 468, 49236, 351],
[50263, 468, 49236, 351],
[50264, 468, 49236, 351],
[50265, 468, 49236, 351],
[50266, 468, 49236, 351],
[50267, 468, 49236, 351],
[50268, 468, 49236, 351],
[50269, 468, 49236, 351],
[50270, 468, 49236, 351],
[50271, 468, 49236, 351],
[50272, 468, 49236, 351]])
Main ids: tensor([[51602, 51603, 51604, 51605, 51607, 51608, 51609, 51610, 51613, 51614,
51615, 51616, 51617, 51618, 51619, 51621, 51622, 51624, 51625, 51626,
51628, 51630, 51632, 51633, 51634, 51635, 51636, 51637, 0, 0,
0, 0],
[51638, 51640, 51641, 51642, 51643, 51645, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51647, 51648, 51649, 51650, 51652, 51653, 51654, 51655, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51605, 51623, 51656, 51657, 51659, 51660, 51662, 51663, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51664, 51665, 51666, 51667, 51668, 51670, 51672, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51673, 51674, 51676, 51677, 51678, 51679, 51680, 51681, 51682, 51683,
51684, 51685, 51686, 51687, 51691, 51695, 51696, 51698, 51699, 51700,
51701, 51702, 51703, 51704, 51705, 51706, 51707, 51708, 51709, 51710,
51711, 51712],
[51713, 51714, 51716, 51717, 51718, 51719, 51720, 51721, 51722, 51723,
51724, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51604, 51611, 51612, 51616, 51666, 51727, 51728, 51729, 51731, 51732,
51733, 51734, 51735, 51737, 51738, 51740, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51741, 51742, 51743, 51744, 51747, 51748, 51749, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51619, 51625, 51732, 51750, 51751, 51752, 51753, 51754, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51621, 51640, 51645, 51672, 51741, 51756, 51758, 51759, 51760, 51761,
51763, 51765, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51618, 51763, 51767, 51768, 51769, 51770, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51625, 51769, 51771, 51772, 51773, 51775, 51776, 51777, 51778, 51780,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51673, 51674, 51675, 51676, 51677, 51679, 51681, 51694, 51699, 51701,
51781, 51782, 51783, 51785, 51786, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51660, 51737, 51758, 51787, 51788, 51789, 51790, 51792, 51793, 51794,
51795, 51796, 51798, 51799, 51800, 51801, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
[51661, 51760, 51793, 51804, 51805, 51806, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0]])
Calculated loss: 14.4347
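To make the restricted softmax concrete, here is a minimal sketch of an item-space next-token loss. The function and argument names are assumptions for illustration (the projection onto the item space can, for instance, be weight-tied to the item ID embeddings); it is not the exact implementation of CollaborativeGPTwithItemLMHeadBatch.

```python
# Minimal sketch (assumed names/shapes): compute the next-token softmax over
# the item sub-vocabulary only, masking out padded positions.
import torch
import torch.nn.functional as F

def item_lm_loss(hidden_states, item_head_weight, target_item_ids, target_mask):
    # hidden_states:    (batch, seq_len, dim) transformer outputs
    # item_head_weight: (num_items, dim) projection onto the item space
    # target_item_ids:  (batch, seq_len) next-item targets in [0, num_items)
    # target_mask:      (batch, seq_len) 1 for real targets, 0 for padding
    logits = hidden_states @ item_head_weight.T        # (batch, seq_len, num_items)
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target_item_ids.unsqueeze(-1)).squeeze(-1)
    return (nll * target_mask).sum() / target_mask.sum()
```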
ContentGPTForUserItemWithLMHeadBatch defines the content GPT that conducts language modeling on user/item content.
Taking Amazon review data as an example, it treats "user_i writes the following review for item_j" as the prompt and conducts language modeling (i.e., next-token prediction) on the main review text.
In this case, when predicting the next token, the softmax only needs to be computed over the vocabulary space.
Demo:
Prompt ids: tensor([[50257, 2630, 262, 1708, 2423, 329, 51602, 25],
[50257, 2630, 262, 1708, 2423, 329, 51603, 25],
[50257, 2630, 262, 1708, 2423, 329, 51604, 25],
[50257, 2630, 262, 1708, 2423, 329, 51605, 25],
[50257, 2630, 262, 1708, 2423, 329, 51607, 25],
[50257, 2630, 262, 1708, 2423, 329, 51608, 25],
[50257, 2630, 262, 1708, 2423, 329, 51609, 25],
[50257, 2630, 262, 1708, 2423, 329, 51610, 25],
[50257, 2630, 262, 1708, 2423, 329, 51613, 25],
[50257, 2630, 262, 1708, 2423, 329, 51614, 25],
[50257, 2630, 262, 1708, 2423, 329, 51615, 25],
[50257, 2630, 262, 1708, 2423, 329, 51616, 25],
[50257, 2630, 262, 1708, 2423, 329, 51617, 25],
[50257, 2630, 262, 1708, 2423, 329, 51618, 25],
[50257, 2630, 262, 1708, 2423, 329, 51619, 25],
[50257, 2630, 262, 1708, 2423, 329, 51621, 25]])
Main ids: tensor([[ 40, 716, 281, ..., 0, 0, 0],
[ 1544, 1381, 510, ..., 428, 5156, 0],
[ 3666, 4957, 10408, ..., 0, 0, 0],
...,
[ 35, 563, 911, ..., 0, 0, 0],
[23044, 1049, 351, ..., 0, 0, 0],
[26392, 2499, 880, ..., 0, 0, 0]])
Calculated loss: 3.9180
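In Hugging Face terms this is ordinary causal language modeling in which the prompt tokens contribute no loss. One common way to implement it, sketched below with assumed names (not necessarily the exact code of ContentGPTForUserItemWithLMHeadBatch), is to set the prompt and padding positions of the labels to -100 so that the loss is only computed over the review tokens.

```python
# Minimal sketch (assumed names): standard causal LM loss on the review text
# only; prompt and padding positions are excluded from the loss via -100.
import torch

def build_labels(prompt_ids, main_ids, pad_id=0):
    # prompt_ids: (batch, prompt_len); main_ids: (batch, main_len)
    # pad_id=0 mirrors the zero-padding in the demo above.
    input_ids = torch.cat([prompt_ids, main_ids], dim=1)
    labels = input_ids.clone()
    labels[:, :prompt_ids.size(1)] = -100        # no loss on the prompt
    labels[labels == pad_id] = -100              # no loss on padding
    return input_ids, labels

# input_ids, labels = build_labels(prompt_ids, main_ids)
# loss = model(input_ids=input_ids, labels=labels).loss  # softmax over the vocabulary
```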
CollaborativeGPTwithItemRecommendHead defines the recommendation GPT, where we randomly mask out items in each user's interaction history and predict the held-out items with a multinomial likelihood.
Demo:
num_users: 10553
num_items: 6086
Prompt ids: tensor([[50257, 468, 49236, 351, 60819, 60812, 60818, 60811, 60816, 60810,
60822, 60823, 60820, 60817, 11, 50257, 481, 9427, 351, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50258, 468, 49236, 351, 60828, 60824, 60829, 60825, 11, 50258,
481, 9427, 351, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50259, 468, 49236, 351, 60833, 60835, 11, 50259, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50260, 468, 49236, 351, 60840, 60838, 11, 50260, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50261, 468, 49236, 351, 60845, 60842, 60847, 60841, 11, 50261,
481, 9427, 351, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50262, 468, 49236, 351, 60852, 60853, 60848, 11, 50262, 481,
9427, 351, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50263, 468, 49236, 351, 60855, 60854, 11, 50263, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50264, 468, 49236, 351, 60893, 60899, 60894, 60869, 60859, 60875,
60862, 60888, 60877, 60891, 60876, 60890, 60873, 60889, 60874, 60864,
60860, 60878, 60898, 60867, 60900, 60883, 60892, 60882, 60884, 60881,
60863, 60871, 60902, 11, 50264, 481, 9427, 351],
[50265, 468, 49236, 351, 60907, 60905, 11, 50265, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50266, 468, 49236, 351, 60909, 60914, 60915, 60917, 60907, 60910,
60908, 60912, 11, 50266, 481, 9427, 351, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50267, 468, 49236, 351, 60922, 60920, 11, 50267, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50268, 468, 49236, 351, 60924, 60927, 11, 50268, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50269, 468, 49236, 351, 60929, 60930, 11, 50269, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50270, 468, 49236, 351, 60933, 60934, 11, 50270, 481, 9427,
351, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50271, 468, 49236, 351, 60940, 60943, 60939, 11, 50271, 481,
9427, 351, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[50272, 468, 49236, 351, 60949, 60945, 60944, 11, 50272, 481,
9427, 351, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]])
Main ids: tensor([[1., 1., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Calculated loss: 124.3801
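The loss printed above is a multinomial likelihood over the full item space. The sketch below, with assumed names and shapes, shows the basic form of such a loss; it is illustrative rather than the exact implementation of CollaborativeGPTwithItemRecommendHead.

```python
# Minimal sketch (assumed names/shapes): multinomial likelihood over the whole
# item space, with multi-hot targets marking each user's held-out items.
import torch
import torch.nn.functional as F

def multinomial_rec_loss(rec_hidden, item_head_weight, target_multi_hot):
    # rec_hidden:       (batch, dim) hidden state at the recommendation position
    # item_head_weight: (num_items, dim) projection onto the item space
    # target_multi_hot: (batch, num_items) 1 for held-out items, else 0
    logits = rec_hidden @ item_head_weight.T            # (batch, num_items)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_multi_hot * log_probs).sum(dim=-1).mean()
```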
The code is organized into three files that define the pretraining, finetuning, and prediction stages of the proposed CLLM4Rec model:
- src/pretraining.py: defines the pretraining stage of CLLM4Rec.
- src/finetuning.py: defines the finetuning stage of CLLM4Rec.
- src/predict.py: evaluates the trained model and saves the results.
The code is written in Python 3.9 with the following dependencies:
- fsspec
- numpy==1.23.0
- scipy==1.8.0
- torch==2.0.0
- transformers==4.24.0
- wget==3.2
- accelerate
See src/requirements.txt for details.
The datasets used in this paper can be accessed [here].
The pretrained weights of GPT-2 used in this paper can be downloaded [here], along with the original tokenizer that we modified by introducing the user/item tokens.
See scripts/run.sh.
If you find this work helpful to your research, please consider citing our paper:
@inproceedings{zhu2024collaborative,
title={Collaborative large language model for recommender systems},
author={Zhu, Yaochen and Wu, Liang and Guo, Qi and Hong, Liangjie and Li, Jundong},
booktitle={Proceedings of the ACM Web Conference},
pages={3162--3172},
year={2024}
}
Thanks for your interest in our work!