For more information about this project, see the related paper:
TODO: Add citation once paper is published
Use the provided Makefile to install this project by running the following from the project root directory (the same directory as this README). Ensure that the `python` in your `PATH` is version 3.11 before running this command:

```shell
make install
```
DeepSpeed must be installed manually. See the "Installation Details - DeepSpeed" section for instructions on how to do so.
Note that the installation command will attempt to download all of the models used from the Hugging Face Hub. To do this, you will need to create a Hugging Face account and request access on the pages for the following models:
Once your request has been approved, authenticate on your local machine with a user access token, following the official "User access tokens" documentation as a guide.
If the installation process fails, is interrupted, or for any reason needs to be restarted, run `git clean -xdf` to reset the repository's state.
We have collected a dataset of 10 openly licensed summaries of movies and television episodes. These summaries were found on Wikipedia, with "List of American films of 2023" and "Category:2023 works" serving as the main starting points for finding materials. We only used works published in July 2023 or later to avoid materials that might have been used to train state-of-the-art LLMs.
Once we had collected the summaries, we wrote 5 questions for each one. For each question, we then wrote 4 example answers: one for each of the 3 different constraints and one without any constraints. This resulted in 20 unique (question, constraint, answer) tuples for each summary.
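The counts above work out as follows (a small sketch using only the numbers stated in this section):

```python
summaries = 10
questions_per_summary = 5
# 4 example answers per question: one for each of the 3 constraints,
# plus one answer without any constraints.
answers_per_question = 3 + 1

tuples_per_summary = questions_per_summary * answers_per_question
total_tuples = summaries * tuples_per_summary

print(tuples_per_summary)  # 20 (question, constraint, answer) tuples per summary
print(total_tuples)        # 200 tuples across the whole dataset
```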
The dataset is stored in the following directories/files:
- `rcqa_data/`: Directory of data files used in experiments. Most of these files are also in the paper's supplemental materials.
  - `datasets/`: Directory of JSON Lines files containing the output of `convert_json_to_prompts.py` using the files in `summaries/` as input.
  - `prompts/`: Directory of JSON Lines files used as input data for `run_paper_experiments.sh`.
  - `prompts.md`: File containing the prompts in an easier-to-read Markdown format.
  - `RedactedContextualQuestionAnsweringAnnotation.xlsx`: Model output with annotations of correctness, along with various relevant calculations and visualizations.
  - `summaries/`: Directory of individual JSON files for each summary. Each file contains a single object with the following fields:
    - `title`: Title of the television episode or movie.
    - `source`: Permalink to the Wikipedia page version the summary was copied from.
    - `summary`: Markdown-formatted summary of the episode or movie, copied from Wikipedia.
    - `questions`: Array of questions about the summary, with each question being an object with the following fields:
      - `question`: Question about the episode or movie that can be answered using the provided summary.
      - `answers`: Array of answers given specific constraints, with each answer being an object with the following fields:
        - `constraints`: Array of constraints to follow when answering the question.
        - `answer`: Example complete sentence that correctly answers the question and follows the constraints. If no answer is possible, then the value is `null` instead.
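For illustration, a summary file following this schema might look like the sketch below. All titles, URLs, questions, and answers here are invented placeholders, not actual dataset contents, and the real files contain more questions and answers per object:

```python
import json

# Hypothetical example of one file in summaries/ (all values invented).
example_summary = {
    "title": "Example Movie",
    "source": "https://en.wikipedia.org/w/index.php?title=Example_Movie&oldid=0",
    "summary": "An **example** plot summary copied from Wikipedia.",
    "questions": [
        {
            "question": "Where does the story take place?",
            "answers": [
                {
                    "constraints": ["Answer in a single sentence of at most ten words."],
                    "answer": "The story takes place in a small coastal town.",
                },
                {
                    "constraints": [],  # the unconstrained example answer
                    "answer": "The story takes place in a small coastal town by the sea.",
                },
                {
                    "constraints": ["Answer only with information stated in the summary."],
                    "answer": None,  # serialized as null when no valid answer exists
                },
            ],
        }
    ],
}

print(json.dumps(example_summary, indent=2))
```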
The summaries (and therefore the dataset) are licensed under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license.
All data for the experiments can be found in `rcqa_data/`. See the "Dataset" section above for a complete description.
Run the following command to perform training, inference, and evaluation for the paper:

```shell
bash scripts/run_paper_experiments.sh
```
You will likely need to make changes to the codebase to run in your specific environment.
This project uses various code quality tooling, all of which is automatically installed with the rest of the development requirements.
All checks can be run with `make check`, and some additional automatic fixes can be applied with `make fix`.
To test GitHub Actions workflows locally, install `act` and run it with `act`.
- The dataset is under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license, matching the license of the Wikipedia content from which it is derived.
- The code is under the MIT license.
- The paper (`paper`) is under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which is used for all publications in the ACL Anthology.
- `src/run_clm.py` is originally under the Apache 2.0 license, with all changes from the original file being under the MIT license.