Resources for the paper "Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation", published in TACL.
Authors: Cyril Chhun, Fabian Suchanek and Chloé Clavel.
[Note: resources for the paper "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation", accepted at COLING 2022, can be accessed in the `coling` branch of this repository.]
2024/05/13 - Update for TACL Paper
2022/08/24 - Initial commit
We release in this repository HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. HANNA contains annotations for 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by 3 raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a grand total of 19,008 annotations.
Additionally, we release the scores of those 1,056 stories evaluated by 72 automatic metrics and annotated by 4 different Large Language Models (Beluga-13B, Llama-13B, Mistral-7B, ChatGPT).
`hanna_stories_annotations.csv` contains the raw annotations from our experiment:
- `Story ID` is the ID of the story (from 0 to 1,055). Stories are grouped by model (0 to 95 are the Human stories, 96 to 191 are the BertGeneration stories, etc.)
- `Prompt` is the prompt
- `Human` is the corresponding human story
- `Story` is the generated story
- `Model` is the model used to generate the story
- `Relevance` is the Relevance (RE) score
- `Coherence` is the Coherence (CH) score
- `Empathy` is the Empathy (EM) score
- `Surprise` is the Surprise (SU) score
- `Engagement` is the Engagement (EG) score
- `Complexity` is the Complexity (CX) score
- `Worker ID` is the ID of the mTurk worker
- `Assignment ID` is the ID of the mTurk assignment
- `Work time in seconds` is the time the worker spent on the assignment, in seconds
- `Name` is the name entered by the worker for the first mentioned character in the story
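Since stories are stored in blocks of 96 per model, the model block of a story can be recovered from its `Story ID` alone. A minimal sketch (the function name is ours; only the first two block labels, Human and BertGeneration, are stated above):

```python
def model_block(story_id: int) -> int:
    """Return the 0-based model block of a story (96 stories per model)."""
    if not 0 <= story_id <= 1055:
        raise ValueError("Story ID must be in [0, 1055]")
    return story_id // 96

# Block 0 holds the Human stories, block 1 the BertGeneration stories, etc.
print(model_block(99))  # story 99 falls in the BertGeneration block
```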
`hanna_metrics_scores_llm.csv` contains the average human annotations, the average LLM annotations, and the scores of the automatic metrics per story, per system. For instance, on row 2 you will find the scores of the stories generated by the BertGeneration model: each list in that row contains the scores of stories 96 to 191 for one metric.
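If you load this file with pandas, list-valued cells are read back as strings. A hedged sketch of parsing them, assuming the lists are serialized in Python literal syntax (the function name is ours):

```python
import ast

def parse_score_list(cell: str) -> list:
    """Parse a list-valued CSV cell such as '[3.0, 2.5, 4.0]' into a Python list."""
    return ast.literal_eval(cell)

# Example: recover the per-story scores from one cell and average them.
scores = parse_score_list("[3.0, 2.5, 4.0]")
print(sum(scores) / len(scores))
```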
We also release:
- the complete answers of the LLMs (in the `llm_answers` folder)
- the ~1,500 annotations from our user study on the LLM explanations (`user.study.csv`)
- 576 stories generated by Llama-7B, Mistral-7B, Beluga-13B, OrcaPlatypus-13B, Llama-30B and Platypus2-70B (`hanna_llm_stories.csv`).
Story ID | Prompt | Human | Story | Model | RE | CH | EM | SU | EG | CX |
---|---|---|---|---|---|---|---|---|---|---|
99 | Write a story about an elderly wizard and his young female apprentice. | His body was failing. He had taken care of it very well, but 205 years were a long time. Not a drop of alcohol all those long and lonely tavern nights, not a crumb of tobacco for the old pipe. [...] | “Follow me,” his mentor said. “I must stop this wizard.” At that, Tawthorn drew his dagger and leaned towards the woman. “If you were correct, my professor was right. You could change the world, and save the kingdom.” [...] | BertGeneration | 3 | 2 | 2 | 2 | 2 | 3 |
519 | You are a immortal during the zombie apocalypse, During the apocalypse the zombies ignore you and you try to live a normal life during the outbreak. | 50 years is a long time, enough time to go crazy and return sane. I remember before it happened, the CDC joked that they would have a cure “within a week” [...] | After a few weeks of running, you see something inside a tube/pulse generator. I woke up groggy. The day was Monday, it was Tuesday. How was my day going so fast? [...] | GPT-2 | 5 | 5 | 3 | 4 | 4 | 4 |
862 | When a new president is elected, they are given a special security briefing. In reality, this is an old tradition where various directors, military officers and current ministers present fake evidence and compete to see who can convince the president of the most ridiculous things. [...] | “Mr President I want you to know I am telling you this in full confidence .” Said the head of the Secret Service. The President looked at him. “Yes go ahead .” [...] | “Mr. President, you can see this! You know what the problem is. You see, President Obama, in the US, has been working on the latest model of the President 's campaign for over two years! [...] | Fusion | 2 | 1 | 1 | 1 | 1 | 1 |
We provide the Jupyter notebook `data_visualization.ipynb`, which contains the code we used to generate our results and allows for easier visualisation of the data from the `csv` files.
The code was tested with Python 3.9.7. You can install the required packages with
pip install -r requirements.txt
You will also need the `williams.py` file from the nlp-williams repository for the Williams section of the notebook. We cannot include it in this repository for licensing reasons.
If you do not plan to run the cells of this section, simply comment the corresponding import in the first cell.
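An alternative to commenting out the import is to guard it, so the rest of the notebook still runs when `williams.py` is absent; a sketch (the flag name is ours, and the module name assumes the file is on the Python path):

```python
# Guarded import: the Williams section is skipped if williams.py is missing.
try:
    import williams  # from the nlp-williams repository; not bundled here
    HAS_WILLIAMS = True
except ImportError:
    williams = None
    HAS_WILLIAMS = False
    print("williams.py not found: skipping the Williams section")
```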
@article{chhun2024do,
author = {Chhun, Cyril and Suchanek, Fabian M. and Clavel, Chlo{\'e}},
doi = {10.1162/tacl_a_00689},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00689/2470807/tacl\_a\_00689.pdf},
issn = {2307-387X},
journal = {Transactions of the Association for Computational Linguistics},
pages = {1122--1142},
publisher = {MIT Press},
title = {Do Language Models Enjoy Their Own Stories? {P}rompting Large Language Models for Automatic Story Evaluation},
url = {https://doi.org/10.1162/tacl\_a\_00689},
volume = {12},
year = {2024}
}
This work was performed using HPC resources from GENCI-IDRIS (Grant 2022-AD011013105R1) and was partially funded by the grant ANR-20-CHIA-0012-01 ("NoRDF"). We would also like to convey our appreciation to TACL Action Editor Ehud Reiter, as well as to our anonymous reviewers, for their valuable feedback.
Cyril, Fabian and Chloé are members of the NoRDF project.
- WritingPrompts (Fan et al., 2018)
- BertGeneration (Rothe et al., 2020)
- CTRL (Keskar et al., 2019)
- GPT (Radford et al., 2018)
- GPT-2 (Radford et al., 2019)
- RoBERTa (Liu et al., 2019)
- XLNet (Yang et al., 2019)
- Fusion (Fan et al., 2018)
- HINT (Guan et al., 2021)
- TD-VAE (Wilmot et al., 2021)
- nlp-williams (Moon et al., 2019)
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!