Resources for the paper "Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation", published in TACL.
Authors: Cyril Chhun, Fabian Suchanek and Chloé Clavel.
[Note: resources for the paper "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation", accepted at COLING 2022, can be accessed in the `coling` branch of this repository.]
2024/05/13 - Update for TACL Paper
2022/08/24 - Initial commit
We release in this repository HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. HANNA contains annotations for 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by 3 raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a grand total of 19,008 annotations.
Additionally, we release the scores of those 1,056 stories evaluated by 72 automatic metrics and annotated by 4 different Large Language Models (Beluga-13B, Llama-13B, Mistral-7B, ChatGPT).
`hanna_stories_annotations.csv` contains the raw annotations from our experiment:
- `Story ID` is the ID of the story (from 0 to 1,055). Stories are grouped by model (0 to 95 are the Human stories, 96 to 191 are the BertGeneration stories, etc.)
- `Prompt` is the prompt
- `Human` is the corresponding human story
- `Story` is the generated story
- `Model` is the model used to generate the story
- `Relevance` is the Relevance (RE) score
- `Coherence` is the Coherence (CH) score
- `Empathy` is the Empathy (EM) score
- `Surprise` is the Surprise (SU) score
- `Engagement` is the Engagement (EG) score
- `Complexity` is the Complexity (CX) score
- `Worker ID` is the ID of the mTurk worker
- `Assignment ID` is the ID of the mTurk assignment
- `Work time in seconds` is the time the worker spent on the assignment, in seconds
- `Name` is the name entered by the worker for the first mentioned character in the story
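Since stories are stored in blocks of 96 per model, the model block of a story can be recovered from its `Story ID` alone. A minimal sketch (the function name is ours; only the first two block labels, Human and BertGeneration, are stated above):

```python
def model_block(story_id: int) -> int:
    """Return the 0-based model block of a story (96 stories per model)."""
    if not 0 <= story_id <= 1055:
        raise ValueError("Story ID must be in [0, 1055]")
    return story_id // 96

# Block 0 holds the Human stories, block 1 the BertGeneration stories, etc.
print(model_block(99))  # story 99 falls in the BertGeneration block
```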
`hanna_metrics_scores_llm.csv` contains the average human annotations, the average LLM annotations, and the scores of the automatic metrics per story, per system. For instance, on row 2 you will find the scores of the stories generated by the BertGeneration model: each list in that row contains the scores of stories 96 to 191 for one metric.
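If you load this file with pandas, list-valued cells are read back as strings. A hedged sketch of parsing them, assuming the lists are serialized in Python literal syntax (the function name is ours):

```python
import ast

def parse_score_list(cell: str) -> list:
    """Parse a list-valued CSV cell such as '[3.0, 2.5, 4.0]' into a Python list."""
    return ast.literal_eval(cell)

# Example: recover the per-story scores from one cell and average them.
scores = parse_score_list("[3.0, 2.5, 4.0]")
print(sum(scores) / len(scores))
```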
We also release:
- the complete answers of the LLMs (in the `llm_answers` folder)
- the ~1,500 annotations from our user study on the LLM explanations (`user.study.csv`)
- 576 stories generated by Llama-7B, Mistral-7B, Beluga-13B, OrcaPlatypus-13B, Llama-30B and Platypus2-70B (`hanna_llm_stories.csv`).
Story ID | Prompt | Human | Story | Model | RE | CH | EM | SU | EG | CX |
---|---|---|---|---|---|---|---|---|---|---|
99 | Write a story about an elderly wizard and his young female apprentice. | His body was failing. He had taken care of it very well, but 205 years were a long time. Not a drop of alcohol all those long and lonely tavern nights, not a crumb of tobacco for the old pipe. [...] | “Follow me,” his mentor said. “I must stop this wizard.” At that, Tawthorn drew his dagger and leaned towards the woman. “If you were correct, my professor was right. You could change the world, and save the kingdom.” [...] | BertGeneration | 3 | 2 | 2 | 2 | 2 | 3 |
519 | You are a immortal during the zombie apocalypse, During the apocalypse the zombies ignore you and you try to live a normal life during the outbreak. | 50 years is a long time, enough time to go crazy and return sane. I remember before it happened, the CDC joked that they would have a cure “within a week” [...] | After a few weeks of running, you see something inside a tube/pulse generator. I woke up groggy. The day was Monday, it was Tuesday. How was my day going so fast? [...] | GPT-2 | 5 | 5 | 3 | 4 | 4 | 4 |
862 | When a new president is elected, they are given a special security briefing. In reality, this is an old tradition where various directors, military officers and current ministers present fake evidence and compete to see who can convince the president of the most ridiculous things. [...] | “Mr President I want you to know I am telling you this in full confidence .” Said the head of the Secret Service. The President looked at him. “Yes go ahead .” [...] | “Mr. President, you can see this! You know what the problem is. You see, President Obama, in the US, has been working on the latest model of the President 's campaign for over two years! [...] | Fusion | 2 | 1 | 1 | 1 | 1 | 1 |
We provide the Jupyter notebook `data_visualization.ipynb`, which contains the code we used to generate our results and allows for easier visualisation of the data from the `csv` files.
The code was tested with Python 3.9.7. You can install the required packages with
pip install -r requirements.txt
You will also need the `williams.py` file from the nlp-williams repository for the Williams section of the notebook. We cannot include it in this repository for licensing reasons.
If you do not plan to run the cells of this section, simply comment the corresponding import in the first cell.
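An alternative to commenting out the import is to guard it, so the rest of the notebook still runs when `williams.py` is absent; a sketch (the flag name is ours, and the module name assumes the file is on the Python path):

```python
# Guarded import: the Williams section is skipped if williams.py is missing.
try:
    import williams  # from the nlp-williams repository; not bundled here
    HAS_WILLIAMS = True
except ImportError:
    williams = None
    HAS_WILLIAMS = False
    print("williams.py not found: skipping the Williams section")
```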
@article{chhun2024do,
author = {Chhun, Cyril and Suchanek, Fabian M. and Clavel, Chlo{\'e}},
doi = {10.1162/tacl_a_00689},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00689/2470807/tacl\_a\_00689.pdf},
issn = {2307-387X},
journal = {Transactions of the Association for Computational Linguistics},
pages = {1122--1142},
publisher = {MIT Press},
title = {Do Language Models Enjoy Their Own Stories? {P}rompting Large Language Models for Automatic Story Evaluation},
url = {https://doi.org/10.1162/tacl\_a\_00689},
volume = {12},
year = {2024}
}
This work was performed using HPC resources from GENCI-IDRIS (Grant 2022-AD011013105R1) and was partially funded by the grant ANR-20-CHIA-0012-01 ("NoRDF"). We would also like to convey our appreciation to TACL Action Editor Ehud Reiter, as well as to our anonymous reviewers, for their valuable feedback.
Cyril, Fabian and Chloé are members of the NoRDF project.
- WritingPrompts (Fan et al., 2018)
- BertGeneration (Rothe et al., 2020)
- CTRL (Keskar et al., 2019)
- GPT (Radford et al., 2018)
- GPT-2 (Radford et al., 2019)
- RoBERTa (Liu et al., 2019)
- XLNet (Yang et al., 2019)
- Fusion (Fan et al., 2018)
- HINT (Guan et al., 2021)
- TD-VAE (Wilmot et al., 2021)
- nlp-williams (Moon et al., 2019)
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!