We introduce RoleEval, a bilingual benchmark designed to assess how well large language models memorize, utilize, and reason over role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions about 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. The questions cover both basic knowledge and multi-hop reasoning, and systematically probe aspects of each character such as personal information, relationships, abilities, and experiences. To maintain high standards, we apply a hybrid quality check that combines automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative.
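Because the questions are multiple-choice and grouped by domain, model outputs can be scored as per-domain accuracy. The following is a minimal sketch of that idea, not the authors' evaluation code: the record format `(domain, gold_choice, predicted_choice)`, the sample records, and the function name `accuracy_by_domain` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical record format: (domain, gold_choice, predicted_choice).
# RoleEval's actual data and prediction formats may differ.
records = [
    ("Celebrities", "A", "A"),
    ("Celebrities", "C", "B"),
    ("Games", "D", "D"),
    ("Games", "B", "B"),
    ("Fiction", "A", "C"),
]


def accuracy_by_domain(records):
    """Return {domain: accuracy in percent} for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, gold, pred in records:
        total[domain] += 1
        # A prediction counts as correct only if it matches the gold choice.
        correct[domain] += int(gold == pred)
    return {d: 100.0 * correct[d] / total[d] for d in total}


scores = accuracy_by_domain(records)
for domain, acc in scores.items():
    print(f"{domain}: {acc:.2f}")
```

Grouping by domain before dividing keeps each domain's score independent of how many questions the other domains contain.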
NOTE: The English version of RoleEval is under internal review and will be released later.
To submit your model's predictions to our leaderboard, please contact us at thshen@tju.edu.cn for details.
NOTE: * indicates results calculated from submitted predictions.
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
Qwen-72B | 70.00 | 59.75 | 66.00 | 61.25 | 74.00 | 66.20 |
Baichuan-NPC-Turbo* | 66.25 | 61.00 | 71.50 | 54.25 | 76.25 | 65.85 |
Yi-34B | 65.50 | 54.50 | 70.00 | 56.00 | 77.00 | 64.60 |
GPT-4-1106 | 62.50 | 63.25 | 63.00 | 62.00 | 63.00 | 62.75 |
GPT-4-0613 | 57.75 | 60.25 | 57.75 | 60.00 | 58.00 | 58.75 |
Yi-6B | 59.25 | 46.00 | 61.50 | 47.75 | 62.00 | 55.30 |
Baichuan-NPC-Lite* | 56.00 | 51.75 | 56.75 | 47.50 | 62.00 | 54.80 |
MiniMax | 54.00 | 55.00 | 52.75 | 57.50 | 54.00 | 54.65 |
Qwen-14B | 56.25 | 45.50 | 54.75 | 51.50 | 56.75 | 52.95 |
Baichuan2-13B | 54.75 | 47.75 | 54.00 | 47.50 | 60.00 | 52.80 |
Skywork-13B | 55.25 | 45.75 | 56.00 | 48.50 | 57.50 | 52.60 |
Baichuan2-7B | 52.25 | 43.75 | 49.00 | 47.25 | 55.00 | 49.45 |
ChatGLM3-6B | 50.00 | 44.50 | 48.00 | 44.25 | 58.00 | 48.95 |
Qwen-7B | 49.00 | 42.00 | 47.50 | 44.75 | 51.25 | 46.90 |
GPT-3.5-1106 | 47.50 | 46.75 | 41.75 | 44.75 | 38.75 | 43.90 |
GPT-3.5-0613 | 42.25 | 43.50 | 39.75 | 43.75 | 39.00 | 41.65 |
Chinese-LLaMA-2-13B | 36.50 | 36.50 | 34.00 | 34.00 | 40.50 | 36.30 |
LLaMA-2-70B | 36.00 | 38.00 | 36.25 | 36.25 | 34.75 | 36.25 |
Chinese-LLaMA-2-7B | 34.50 | 29.00 | 33.00 | 30.25 | 36.25 | 32.60 |
Mistral-7B | 32.50 | 37.50 | 26.25 | 33.25 | 31.50 | 32.20 |
Falcon-40B | 28.25 | 33.00 | 30.25 | 29.25 | 38.50 | 31.85 |
LLaMA-65B | 30.00 | 32.25 | 29.00 | 35.50 | 29.00 | 31.15 |
LLaMA-2-7B | 25.75 | 28.00 | 33.75 | 29.75 | 34.50 | 30.35 |
LLaMA-30B | 30.00 | 28.75 | 26.00 | 31.75 | 28.00 | 28.90 |
LLaMA-2-13B | 28.75 | 30.50 | 25.25 | 29.75 | 28.25 | 28.50 |
Falcon-7B | 24.75 | 30.50 | 31.50 | 29.75 | 25.25 | 28.35 |
LLaMA-13B | 27.25 | 29.75 | 27.25 | 26.00 | 29.00 | 27.85 |
LLaMA-7B | 28.50 | 24.75 | 20.50 | 27.75 | 29.00 | 26.10 |
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-1106 | 74.75 | 73.62 | 74.38 | 72.50 | 71.62 | 73.38 |
GPT-4-0613 | 73.38 | 72.12 | 74.25 | 72.25 | 69.62 | 72.32 |
Qwen-72B | 72.88 | 63.88 | 70.38 | 56.75 | 73.50 | 67.47 |
Baichuan-NPC-Turbo* | 72.25 | 65.25 | 64.62 | 55.50 | 72.75 | 66.07 |
Yi-34B | 72.38 | 60.62 | 69.75 | 53.25 | 73.12 | 65.83 |
Baichuan-NPC-Lite* | 60.62 | 56.62 | 51.88 | 48.25 | 62.12 | 55.90 |
MiniMax | 51.75 | 54.50 | 62.62 | 56.75 | 52.75 | 55.67 |
Qwen-14B | 62.50 | 52.38 | 55.00 | 45.50 | 58.00 | 54.67 |
Yi-6B | 61.88 | 51.38 | 52.38 | 45.38 | 60.75 | 54.35 |
Baichuan2-13B | 60.25 | 52.38 | 51.00 | 46.88 | 60.75 | 54.25 |
Skywork-13B | 59.13 | 51.75 | 51.88 | 44.50 | 58.75 | 53.20 |
GPT-3.5-1106 | 48.75 | 51.88 | 51.25 | 49.88 | 48.38 | 50.02 |
ChatGLM3-6B | 56.50 | 47.62 | 48.38 | 41.88 | 54.50 | 49.78 |
Baichuan2-7B | 56.00 | 49.62 | 45.50 | 40.50 | 52.38 | 48.80 |
GPT-3.5-0613 | 46.62 | 48.38 | 51.75 | 49.50 | 47.38 | 48.73 |
Qwen-7B | 54.75 | 44.38 | 44.62 | 42.75 | 53.00 | 47.90 |
LLaMA-2-70B | 53.50 | 43.25 | 39.25 | 40.25 | 47.25 | 44.70 |
Chinese-LLaMA-2-13B | 45.38 | 38.25 | 39.88 | 31.87 | 42.12 | 39.50 |
Falcon-40B | 39.62 | 32.25 | 32.38 | 30.00 | 45.00 | 35.85 |
Chinese-LLaMA-2-7B | 35.62 | 36.75 | 35.62 | 35.38 | 34.38 | 35.55 |
LLaMA-2-7B | 37.00 | 29.88 | 28.75 | 34.50 | 38.25 | 33.67 |
LLaMA-2-13B | 36.50 | 34.00 | 33.00 | 31.87 | 31.75 | 33.42 |
Mistral-7B | 36.12 | 33.50 | 32.00 | 30.25 | 35.00 | 33.38 |
LLaMA-65B | 32.12 | 31.87 | 32.75 | 31.00 | 34.88 | 32.52 |
LLaMA-30B | 24.88 | 31.13 | 30.25 | 27.75 | 28.62 | 28.52 |
LLaMA-13B | 28.50 | 28.50 | 28.25 | 26.50 | 27.75 | 27.90 |
LLaMA-7B | 25.50 | 31.87 | 25.87 | 26.00 | 28.88 | 27.62 |
Falcon-7B | 23.88 | 28.12 | 24.50 | 28.00 | 28.12 | 26.52 |
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-0613 | 54.25 | 61.75 | 63.00 | 63.00 | 63.00 | 61.00 |
GPT-4-1106 | 57.50 | 63.50 | 60.00 | 62.50 | 58.00 | 60.30 |
Yi-34B | 56.00 | 52.00 | 47.50 | 55.00 | 57.00 | 53.50 |
Qwen-72B | 52.75 | 47.50 | 46.50 | 54.25 | 50.50 | 50.30 |
GPT-3.5-0613 | 42.00 | 47.75 | 42.50 | 42.25 | 45.50 | 44.00 |
GPT-3.5-1106 | 38.25 | 45.50 | 44.00 | 44.50 | 46.00 | 43.65 |
LLaMA-2-70B | 43.25 | 41.50 | 40.25 | 47.50 | 43.50 | 43.20 |
Yi-6B | 42.25 | 38.50 | 41.50 | 44.25 | 45.00 | 42.30 |
Qwen-14B | 41.00 | 38.75 | 38.25 | 43.25 | 41.00 | 40.45 |
LLaMA-65B | 41.50 | 38.50 | 33.50 | 43.25 | 37.50 | 38.85 |
ChatGLM3-6B | 36.25 | 36.25 | 35.25 | 42.25 | 43.50 | 38.70 |
Skywork-13B | 39.25 | 34.50 | 38.25 | 41.75 | 38.50 | 38.45 |
MiniMax | 34.00 | 39.50 | 40.75 | 38.25 | 39.00 | 38.30 |
Qwen-7B | 36.25 | 36.00 | 36.25 | 42.25 | 40.00 | 38.15 |
Baichuan2-7B | 37.25 | 35.75 | 33.00 | 40.25 | 37.00 | 36.65 |
Mistral-7B | 35.75 | 42.00 | 30.00 | 41.75 | 31.50 | 36.20 |
Baichuan2-13B | 35.50 | 36.50 | 31.25 | 42.25 | 34.75 | 36.05 |
Falcon-40B | 34.00 | 38.25 | 30.75 | 38.75 | 35.25 | 35.40 |
LLaMA-30B | 34.75 | 35.75 | 30.75 | 40.00 | 35.00 | 35.25 |
Chinese-LLaMA-2-13B | 34.00 | 38.50 | 27.75 | 37.50 | 34.00 | 34.35 |
LLaMA-2-13B | 30.50 | 36.50 | 33.25 | 36.50 | 33.25 | 34.00 |
LLaMA-13B | 32.75 | 31.75 | 30.75 | 38.50 | 32.00 | 33.15 |
LLaMA-2-7B | 28.75 | 29.25 | 32.75 | 37.50 | 32.25 | 32.10 |
Chinese-LLaMA-2-7B | 30.50 | 27.75 | 33.00 | 30.50 | 27.75 | 29.90 |
LLaMA-7B | 24.00 | 27.50 | 29.75 | 33.00 | 29.25 | 28.70 |
Falcon-7B | 27.25 | 27.75 | 27.75 | 29.75 | 28.50 | 28.20 |
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-0613 | 77.62 | 79.50 | 73.12 | 74.88 | 75.00 | 76.02 |
GPT-4-1106 | 75.12 | 78.75 | 75.00 | 76.12 | 75.00 | 76.00 |
Yi-34B | 73.12 | 61.75 | 67.88 | 57.12 | 67.25 | 65.42 |
Qwen-72B | 70.12 | 62.00 | 69.00 | 55.75 | 69.50 | 65.27 |
LLaMA-2-70B | 63.25 | 57.38 | 59.00 | 50.00 | 63.25 | 58.58 |
GPT-3.5-0613 | 57.38 | 59.62 | 58.13 | 59.50 | 57.50 | 58.43 |
GPT-3.5-1106 | 58.75 | 56.62 | 55.75 | 58.00 | 55.00 | 56.82 |
MiniMax | 54.87 | 56.38 | 53.50 | 54.12 | 51.38 | 54.05 |
Yi-6B | 59.25 | 52.00 | 54.12 | 47.50 | 56.25 | 53.82 |
Qwen-14B | 61.12 | 49.00 | 53.87 | 45.38 | 56.12 | 53.10 |
LLaMA-65B | 58.13 | 50.50 | 54.37 | 47.62 | 54.50 | 53.02 |
Baichuan2-13B | 56.12 | 47.50 | 51.50 | 45.62 | 54.00 | 50.95 |
Skywork-13B | 56.25 | 46.75 | 51.62 | 44.38 | 53.62 | 50.52 |
Mistral-7B | 54.87 | 46.75 | 49.62 | 44.25 | 52.25 | 49.55 |
ChatGLM3-6B | 55.12 | 46.62 | 49.25 | 43.25 | 52.62 | 49.37 |
LLaMA-30B | 51.62 | 46.88 | 48.62 | 43.12 | 52.62 | 48.57 |
Qwen-7B | 53.87 | 46.12 | 48.12 | 40.00 | 51.12 | 47.85 |
Baichuan2-7B | 51.00 | 45.12 | 49.00 | 42.12 | 50.00 | 47.45 |
Falcon-40B | 47.38 | 45.00 | 49.62 | 43.12 | 50.00 | 47.02 |
Chinese-LLaMA-2-13B | 47.75 | 46.00 | 46.88 | 45.00 | 48.38 | 46.80 |
LLaMA-2-13B | 49.38 | 43.50 | 46.50 | 44.25 | 48.25 | 46.38 |
LLaMA-13B | 39.38 | 40.25 | 39.88 | 40.62 | 43.00 | 40.63 |
LLaMA-2-7B | 38.88 | 37.00 | 37.50 | 41.62 | 42.38 | 39.48 |
Chinese-LLaMA-2-7B | 36.50 | 30.75 | 31.75 | 36.25 | 39.50 | 34.95 |
LLaMA-7B | 29.38 | 30.50 | 29.25 | 33.50 | 28.50 | 30.23 |
Falcon-7B | 26.25 | 27.75 | 28.50 | 29.38 | 31.00 | 28.58 |
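The Avg. column appears to be the unweighted mean of the five domain scores. This is an assumption inferred from the numbers rather than something the text states; the short check below reproduces the reported averages for the first three rows of the first leaderboard table. The names `rows` and `mean` are illustrative only.

```python
# Domain scores (Celebrities, Anime and Comics, Movies and TV Series,
# Games, Fiction) and the reported Avg., copied from the first table.
rows = {
    "Qwen-72B": ([70.00, 59.75, 66.00, 61.25, 74.00], 66.20),
    "Yi-34B": ([65.50, 54.50, 70.00, 56.00, 77.00], 64.60),
    "GPT-4-1106": ([62.50, 63.25, 63.00, 62.00, 63.00], 62.75),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Assumption being checked: Avg. equals the unweighted mean of the domains.
results = {}
for model, (domain_scores, reported_avg) in rows.items():
    computed = mean(domain_scores)
    results[model] = abs(computed - reported_avg) < 0.005
    print(f"{model}: computed {computed:.2f}, reported {reported_avg:.2f}")
```

An unweighted mean is equivalent to overall accuracy only if every domain has the same number of questions, which the text does not confirm.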
If you find our work useful, please cite our paper:
@article{shen2023roleeval,
title={RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models},
author={Tianhao Shen and Sun Li and Deyi Xiong},
year={2023},
eprint={2312.16132},
archivePrefix={arXiv},
primaryClass={cs.CL}
}