Evaluation of LLMs on NLG benchmarks.
Natural Language Generation (NLG) is the task of producing natural language text to meet specified communicative goals. Generated texts may range from a single phrase given in answer to a question, through multi-sentence remarks and questions within a dialogue, to full-page explanations.
In the tables below, each Input/Output cell reports the maximum; minimum; mean text length of that split (a sketch of how such statistics can be computed is given after the tables).

Summarization

| Language | Dataset | Test Input | Test Output | Train Input | Train Output | Data |
| --- | --- | --- | --- | --- | --- | --- |
| English | CNN/DailyMail | 3628; 175; 1088.84 | 1101; 12; 84.47 | | | Paper & Data |
| English | XSum (Extreme Summarization) | 18962; 95; 661.01 | 107; 4; 31.63 | 23397; 95; 656.37 | 120; 2; 31.68 | Paper & Data |
| Chinese | LCSTS | 786; 191; 329.88 | 72; 6; 32.20 | 2736; 171; 329.37 | 107; 4; 32.23 | Original data & processed data (extraction code: duba) |
Dialogue generation

| Language | Dataset | Test Input | Test Output | Train Input | Train Output | Data |
| --- | --- | --- | --- | --- | --- | --- |
| English | DailyDialog | 663; 129; 246.57 | 246; 3; 16.55 | 1099; 128; 247.60 | 313; 2; 16.19 | Paper & Data |
| English | PersonaChat | 529; 273; 372.28 | 32; 5; 13.53 | | | Paper & Data |
| English | Empathetic Dialogues | 380; 148; 201.64 | 110; 2; 17.85 | 400; 146; 193.13 | 135; 2; 16.80 | Paper & Data |
| Chinese | LCCC (Tsinghua) | 848; 214; 264.73 | 347; 3; 28.27 | 1666; 213; 278.59 | 314; 2; 26.98 | Paper & Data |
Story generation

| Language | Dataset | Test Input | Test Output | Train Input | Train Output | Data |
| --- | --- | --- | --- | --- | --- | --- |
| English | ROCStories | 203; 115; 140.44 | 34; 4; 12.76 | 182; 115; 140.56 | 29; 4; 12.74 | Paper & Data |
| English | WritingPrompts | 179; 95; 125.64 | 2912; 132; 793.14 | 208; 95; 126.23 | 8308; 115; 789.08 | Paper & Data |
| Chinese | LOT | 329; 215; 254.73 | 568; 112; 275.07 | 324; 207; 253.90 | 624; 113; 273.10 | Paper & Data |
Data-to-text generation

| Language | Dataset | Test Input | Test Output | Train Input | Train Output | Data |
| --- | --- | --- | --- | --- | --- | --- |
| English | WebNLG | 324; 156; 201.68 | 158; 7; 38.55 | | | Paper & Data |
| English | Rotowire | 813; 177; 434.31 | 998; 191; 460.73 | | | Paper & Data |
| Chinese | Advertisement | 453; 222; 288 | 364; 93; 193 | | | Paper & Data |
Text style transfer and rewriting

| Language | Dataset | Data |
| --- | --- | --- |
| English | Yelp | Paper & Data |
| Chinese (Formal-Informal) | CLSD | Paper & Data |
| English (Paraphrasing) | ParaNMT | Paper (ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations) & Data |
| English (Sentence Simplification) | WikiLarge | Paper (Sentence Simplification with Deep Reinforcement Learning) & Data |
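The length statistics in the tables above can be reproduced with a few lines of code. Below is a minimal sketch, assuming each split is stored as a JSON-lines file with "input" and "output" fields and that lengths are counted in whitespace-separated tokens; the actual file layout and tokenization behind the tables may differ.

```python
# Minimal sketch for computing (max; min; mean) length statistics as in the tables above.
# The file name, the "input"/"output" field names, and whitespace tokenization are assumptions.
import json
from statistics import mean

def length_stats(texts):
    """Return (max, min, mean) length of the given texts, counted in whitespace tokens."""
    lengths = [len(t.split()) for t in texts]
    return max(lengths), min(lengths), round(mean(lengths), 2)

# Hypothetical JSON-lines file, one {"input": ..., "output": ...} example per line.
with open("cnn_dailymail_test.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print("Test Input :", length_stats([ex["input"] for ex in examples]))
print("Test Output:", length_stats([ex["output"] for ex in examples]))
```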
Evaluated models:

- ChatGPT
- ChatGLM2-6B
- Flan-T5-XXL
- LLaMA2-7b-chat
- LLaMA2-13b-chat
- Llama2-Chinese-13b-Chat
- Vicuna-13B-v1.5-16k
- Chinese-Alpaca-2-13b
- Qwen-7b-chat
- Baichuan2-13b-chat
- Oasst-Pythia-12B
Experimental setup: chat models.

Experimental goals:

- Compare Chinese and English models (llama2-7b-chat vs. baichuan2 vs. Qwen)
- Compare model scale (llama2-7b-chat vs. llama2-13b-chat)
- Compare model architectures (chatglm2-6b vs. llama2-7b-chat; llama2-13b-chat vs. oasst-pythia-12b; flan-t5-xxl)
- Compare fine-tuning data (Qwen, Baichuan2, chinese-alpaca2, vicuna)
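For concreteness, here is a minimal sketch of a single evaluation run under this setup: prompt one of the chat models on a summarization example and score the output against the reference with ROUGE. The model name, prompt template, generation settings, and the use of the rouge_score package are illustrative assumptions, not the exact protocol behind the results.

```python
# Minimal sketch of one evaluation run: generate a summary with a chat model and score it with ROUGE.
# Prompt template and generation settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rouge_score import rouge_scorer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any causal chat model from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

document = "..."   # one test input, e.g. a CNN/DailyMail article
reference = "..."  # the corresponding reference summary

prompt = f"Summarize the following article in a few sentences.\n\n{document}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens, keep only the newly generated summary.
summary = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, summary))
```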