This repository contains the code, model outputs and evaluation results for the EMNLP 2023 paper "BLESS: Benchmarking Large Language Models on Sentence Simplification".
The following commands can be used to recreate the environment used to run our experiments.
# create a directory for source code
mkdir installs
# create directories for large items (e.g. data, models, outputs, etc.)
mkdir resources
# or alternatively, create a symlink
ln -s /path/to/storage/ resources
# then create the relevant subfolders
mkdir -p resources/data resources/outputs resources/models
# if working on slurm cluster, load relevant modules, e.g. run:
# ml multigpu anaconda3
# create a clean conda environment
# NOTE: we recommend using CUDA 11.6, but others may work too
conda create -n bless -c conda-forge python=3.9 cudatoolkit=11.6 cudatoolkit-dev=11.6 -y
conda activate bless
# install transformers from source
git clone https://github.com/huggingface/transformers.git installs/transformers
cd installs/transformers
pip install -e .
cd ../..
# for efficient inference, we use 8bit quantization with bitsandbytes.
# NOTE: This requires Turing or Ampere GPUs (RTX 20s, RTX 30s, A40-A100, T4+)
# NOTE: The version used was 0.37.0, which we had to install from source. More recent versions can simply be installed with pip install bitsandbytes, but these have not been tested for compatibility.
git clone https://github.com/TimDettmers/bitsandbytes.git installs/bitsandbytes
cd installs/bitsandbytes
git reset --hard 0f5c394 # v0.37.0
CUDA_VERSION=116 make cuda11x
python setup.py install
cd ../..
# install other deps
pip install -r requirements.txt
# check the install and CUDA dependencies
python -m bitsandbytes
# For evaluation purposes, we also require the following packages
git clone https://github.com/feralvam/easse.git installs/easse
cd installs/easse
git reset --hard a1108d2 # v0.2.4
pip install -e .
cd ../..
git clone https://github.com/Yao-Dou/LENS.git installs/LENS
cd installs/LENS/lens
git reset --hard 7a601f3 # v0.1.0
pip install -e .
cd ../../..
Note: bitsandbytes is only required for running inference with 8-bit quantization. If you have any problems installing this library and don't intend to run inference locally, you can skip this dependency.
At the moment, we support the following models:
- Bloom / Bloomz
- Llama
- OPT (up to 66B) / OPT-IML
- GPT-J / GPT-NeoX
- T5 / T0 / Flan-T5
- UL2 / Flan-UL2
- OpenAI models (via API)
- Cohere models (via API)
To get the relevant datasets for reproducing experiments, use the script scripts/fetch_datasets.sh. This script downloads the raw data for the publicly available datasets and writes the files to the resources/data/ directory.
bash scripts/fetch_datasets.sh
Once you have downloaded the raw datasets, you can prepare them for inference using the relevant prepare_*.py script.
For example, to prepare ASSET, run
python -m scripts.prepare_asset
See the readme for more details on the format.
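As a quick sanity check, you can peek at the first prepared example (the path below matches the ASSET test file used elsewhere in this README; the exact field names are described in the data readme):
head -n 1 resources/data/asset/dataset/asset.test.jsonl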
To facilitate running inference with different datasets, models, random seeds and prompts, use ./run.sh.
It accepts the following arguments:
- --input_file: the test set
- --examples: the validation set from which we sample few-shot examples
- --model_configs: a space-delimited list of exp_configs (i.e. models to run, see ./exp_configs/)
- --seeds: a space-delimited list of seeds to use (for consistency, use 489 287 723)
- --prompt_ids: a space-delimited list of predefined prompts (see ./prompts/)
For example, running inference on Med-EASI with 3 models, 2 seeds and 1 prompt would look like this:
nohup bash run.sh \
--input_file "resources/data/med-easi/med-easi.test.jsonl" \
--examples "resources/data/med-easi/med-easi.validation.jsonl" \
--model_configs "exp_configs/cluster/bloom-560m.json exp_configs/cluster/bloom-1b1.json exp_configs/cluster/bloom-3b.json" \
--seeds "489 723" \
--prompt_ids "p1" logs/medeasi.jobs 2>&1 &
Similarly, running inference with the API-based OpenAI models can be done as follows:
nohup bash run.sh \
--input_file "resources/data/med-easi/med-easi.test.jsonl" \
--examples "resources/data/med-easi/med-easi.validation.jsonl" \
--model_configs "exp_configs/rtx/openai-gpt-3.5-turbo.json exp_configs/rtx/openai-text-babbage-001.json exp_configs/rtx/openai-text-davinci-002.json exp_configs/rtx/openai-text-ada-001.json exp_configs/rtx/openai-text-curie-001.json exp_configs/rtx/openai-text-davinci-003.json" \
--seeds "287 489 723" \
--prompt_ids "p0 p1 p2" logs/all_medeasi_openai.jobs 2>&1 &
Note: evaluation runs immediately after inference, so it pays to do this on a GPU server to efficiently compute model-based metrics!
Things got a bit out of hand due to the number of experiments to run and the goal of supporting different hardware setups. The hierarchy of scripts is currently:
./run.sh
└── ./run.py
├── ./slurm_scripts/run_inference_on_*.sh
│ └── ./inference.py
└── ./slurm_scripts/run_evaluation.sh
└── ./evaluation/simplification_evaluation.py
Below we provide more information on the lower-level scripts.
./run.py executes a single inference run followed by evaluation of the model outputs. For example, to run inference on ASSET with bloom-560m and prompt p0, you can run:
python -m run \
--use_slurm True \
--ntasks 1 \
--cpus_per_task 1 \
--gres gpu:T4:1 \
--mem 20GB \
--time 00:30:00 \
--batch_size 8 \
--seed 489 \
--model_name_or_path bigscience/bloom-560m \
--examples resources/data/asset/dataset/asset.valid.jsonl \
--input_file resources/data/asset/dataset/asset.test.jsonl \
--prompt_json prompts/p0.json \
--n_refs 1 --few_shot_n 3 \
--dry_run False # set to True to inspect the command calls without actually executing anything
Alternatively, you can pass a json file from ./exp_configs/ as the first positional argument, with some or all of the arguments predefined.
For example, on a server with RTX 3090 (24GB) GPUs, you could use the following:
python -m run exp_configs/rtx/bloom-560m.json \
--seed 489 \
--examples resources/data/asset/dataset/asset.valid.jsonl \
--input_file resources/data/asset/dataset/asset.test.jsonl \
--prompt_json prompts/p0.json \
--n_refs 1 --few_shot_n 3 \
--dry_run False # set to True to inspect the command calls without actually executing anything
This script produces the following files:
- <output_file>.jsonl: the model's predictions on the input file
- <output_file>.json: the command-line arguments used for the inference run
- <output_file>.log: the log file of the inference run
- <output_file>.eval: the log file of the automatic evaluation, with results
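For example, after the exp_configs/rtx/bloom-560m.json run above you should find four such files under resources/outputs. The per-model sub-directory and the file-name stem below are assumptions, following the naming used in model_outputs_and_evals/ (dataset, few-shot source, prompt, sampling strategy, few-shot n, n_refs, seed):
ls resources/outputs/bloom-560m/asset-test_asset-valid_p0_random_fs3_nr1_s489.*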
./inference.py is the script that performs inference with an LLM. For example, to use bigscience/bloom-1b1 for local inference, you could execute the following:
python -m inference \
--model_name_or_path "bigscience/bloom-1b1" \
--max_new_tokens 100 \
--max_memory 0.65 \
--batch_size 8 \
--num_beams 1 \
--num_return_sequences 1 \
--do_sample True \
--top_p 0.9 \
--input_file "resources/data/asset/dataset/asset.test.jsonl" \
--examples "resources/data/asset/dataset/asset.valid.jsonl" \
--n_refs 1 \
--few_shot_n 3 \
--output_dir "resources/outputs" \
--prompt_json "prompts/p0.json"
where:
- --input_file is either a .txt file with one input sentence per line, or a JSONL file produced by scripts/prepare_*.py. For consistency, we recommend the latter.
- --examples is a JSONL file produced by scripts/prepare_*.py, containing validation set examples that may be selected as few-shot examples.
- NB I: by default, an additional JSON file is generated which persists the inference parameters used for generation.
- NB II: specify --output_dir '' to print your outputs to stdout (useful for debugging/development); see the example below.
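For instance, a quick debugging run that prints generations to stdout could look like the following (a sketch reusing the flags from the bloom-1b1 example above, with only --output_dir changed and the remaining generation settings left at their defaults):
python -m inference \
--model_name_or_path "bigscience/bloom-1b1" \
--max_new_tokens 100 \
--batch_size 8 \
--input_file "resources/data/asset/dataset/asset.test.jsonl" \
--examples "resources/data/asset/dataset/asset.valid.jsonl" \
--n_refs 1 \
--few_shot_n 3 \
--output_dir '' \
--prompt_json "prompts/p0.json"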
The same script can be used for inference with closed-source models behind APIs.
For example:
python -m inference \
--model_name_or_path cohere-command-xlarge-nightly \
--input_file "resources/data/asset/dataset/asset.test.jsonl" \
--examples "resources/data/asset/dataset/asset.valid.jsonl" \
--n_refs 1 \
--few_shot_n 3 \
--output_dir "resources/outputs" \
--prompt_json "prompts/p0.json"
Note: for this to work you will have to add your keys to ./api_secrets.py so that COHERE_API_KEY and OPENAI_API_KEY are exposed to the library.
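A minimal sketch of ./api_secrets.py (the exact layout is an assumption; the only stated requirement is that the two keys are exposed):
# api_secrets.py -- assumed layout: expose the API keys as module-level constants
COHERE_API_KEY = "your-cohere-api-key"
OPENAI_API_KEY = "your-openai-api-key"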
Be aware that API models cost money!
The table below provides the approximate cost of running the OpenAI models on the ASSET test set (359 examples) with a single seed/prompt (using 3 few-shot examples with the pre-defined prompts). Note that the prompts differ in length and therefore in overall cost.
Model | p0 | p1 | p2 | Pricing |
---|---|---|---|---|
openai-gpt-3.5-turbo | $0.172 | $0.189 | $0.22 | $0.002 / 1k tokens |
openai-text-ada-001 | $0.035 | $0.038 | $0.046 | $0.0004 / 1k tokens |
openai-text-babbage-001 | $0.044 | $0.047 | $0.057 | $0.0005 / 1k tokens |
openai-text-curie-001 | $0.175 | $0.19 | $0.23 | $0.002 / 1k tokens |
openai-text-davinci-002 | $1.75 | $1.85 | $2.26 | $0.02 / 1k tokens |
openai-text-davinci-003 | $1.75 | $1.85 | $2.26 | $0.02 / 1k tokens |
approx. # tokens processed | ~86k | ~93k | ~113k | |
For each prompt, we run inference with 3 seeds, totalling 9 inference runs per dataset. Therefore, the maximal cost of running all experiments with OpenAI's two most expensive models (text-davinci-002 and text-davinci-003) is approximately $54 (2 models × 3 prompts × 3 seeds × ≈$3 per run).
./evaluation/simplification_evaluation.py computes automatic metrics using EASSE and other libraries.
We recommend running evaluation with a GPU in order to compute model-based metrics (e.g. LENS, BERTScore, PPL).
We also compute all automatic metrics on the ground truth simplifications to provide a reference point for reference-free metrics such as FKGL and QE statistics.
To prepare the ground truth texts as model outputs and evaluate, run:
python scripts/prepare_ground_truth_as_outputs.py \
resources/data/asset/dataset/asset.test.jsonl \
resources/outputs/ground_truth/asset.test.jsonl
python -m evaluation.simplification_evaluation \
resources/outputs/ground_truth/asset.test.jsonl \
--out_file resources/outputs/ground_truth/asset.test.eval \
--use_cuda
To construct prompts flexibly, we use LangChain.
A valid prompt may look something like the following:
I want you to replace my complex sentence with simple sentence(s). Keep the meaning same, but make them simpler.
Complex: The Hubble Space Telescope observed Fortuna in 1993.
Simple: 0: The Hubble Space Telescope spotted Fortuna in 1993.
Complex: Order # 56 / CMLN of 20 October 1973 prescribed the coat of arms of the Republic of Mali.
Simple: 0: In 1973, order #56/CMLN described the coat of arms for the Republic of Mali.
Complex: One side of the armed conflicts is composed mainly of the Sudanese military and the Janjaweed, a Sudanese militia group recruited mostly from the Afro-Arab Abbala tribes of the northern Rizeigat region in Sudan.
Simple:
This example corresponds to the T1 prompt described in Feng et al., 2023.
Prompts can be defined on-the-fly at inference time by passing the relevant arguments. To do this for the example prompt above, pass the following arguments:
--prompt_prefix "I want you to replace my complex sentence with simple sentence(s). Keep the meaning same, but make them simpler."
--prompt_suffix "Complex: {input}\nSimple:"
--prompt_template "Complex: {complex}\nSimple: {simple}"
--example_separator "\n\n"
--prompt_format "prefix_initial"
However, for reproducibility, we recommend using the pre-defined prompts. These contain the relevant fields and can easily be used for inference by passing them with the --prompt_json argument.
The directory prompts contains a set of pre-defined prompts in JSON format. See the readme for more details.
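For illustration, a prompt file covering the example above might look roughly like this (the field names mirror the command-line arguments listed earlier; the exact schema of the files in prompts/ may differ, so treat this as a sketch):
{
  "prompt_prefix": "I want you to replace my complex sentence with simple sentence(s). Keep the meaning same, but make them simpler.",
  "prompt_template": "Complex: {complex}\nSimple: {simple}",
  "prompt_suffix": "Complex: {input}\nSimple:",
  "example_separator": "\n\n",
  "prompt_format": "prefix_initial"
}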
Generated outputs and evaluation results can be found in model_outputs_and_evals. Please see the corresponding README for more details.
We have aggregated the main results in a summarised format and full format. Unaggregated results can be found in the raw format.
We include a checklist with the list of experiments.
To visualise the results, take a look at the plots.
To quickly inspect model generated outputs at random, you can use this script.
For example, run:
python -m scripts.inspect_outputs model_outputs_and_evals/flan-t5-large/asset-test_asset-valid_p0_random_fs3_nr1_s489.jsonl
Whenever new outputs are available, results can be updated using the get_results.py script:
python -m scripts.get_results
- LLMs (especially non-instruction-tuned models) don't know when to stop. Thus, they typically generate sequences up to the specified max_new_tokens. The function postprocess_model_outputs() is used to extract the single relevant model output from a long generation sequence and is currently pretty rough.
- Setting --n_refs > 1 allows a few-shot prompt example to have multiple possible targets (e.g. sampled from multiple validation set reference sentences). The current method of handling these is to enumerate them starting at 0 (as in the example prompt above), but this doesn't seem very elegant or intuitive.
@misc{kew2023bless,
title={BLESS: Benchmarking Large Language Models on Sentence Simplification},
author={Tannon Kew and Alison Chi and Laura Vásquez-Rodríguez and Sweta Agrawal and Dennis Aumiller and Fernando Alva-Manchego and Matthew Shardlow},
year={2023},
eprint={2310.15773},
archivePrefix={arXiv},
primaryClass={cs.CL}
}