Benchmark to evaluate the capability of language models to consolidate and recall information from multiple training documents in order to answer questions during inference.
- python >= 3.10.0
- accelerate >= 0.24.1
- pandas >= 2.1.1
- protobuf >= 4.24.4
- pytorch >= 2.1.0
- sentencepiece >= 0.1.99
- transformers >= 4.35.1
- matplotlib (Optional)
- wandb >= 0.15.12 (Optional)
git clone https://github.com/chandar-lab/EpiK-Eval.git
cd EpiK-Eval
make
pip install --no-deps .
To benchmark a model on unsegmented stories:
bash configs/<model>/run_unsegmented.sh
or on segmented stories:
bash configs/<model>/run_segmented.sh
Our dataset is provided in an encrypted format and is decrypted only during the installation process. To avoid this data leaking into training corpuses, we urge all users to exercise caution and ensure they do not inadvertently publish or push the unencrypted dataset online. As a precautionary measure, the unencrypted dataset is included in the .gitignore
file. Your cooperation in maintaining the privacy of our dataset is greatly appreciated.
Our dataset is automatically decrypted during installation. If for ever reason, one wants to manually decrypt the dataset, for example, without installing our repo, one can run:
make
For more information on the format of our dataset, see data/README.md
We provide the following three additional scripts. Each script is meant to be ran in the scripts/
directory, e.g., python generate_paper_plots.py
.
To generate a new dataset, one can run:
python generate_dataset.py
The default parameters generate a dataset with the same format as the one we provide. Run python scripts/generate_dataset.py --help
for details.
Once a model has been benchmarked, the various metrics shown in our paper can be computed via:
python compute_paper_metrics.py --model_answer_log=logs/example_log.csv
To recreate the plots shown in our paper, one can run:
python generate_paper_plots.py
We already provide these figures in figures/