NLPBench is a novel benchmark for Natural Language Processing (NLP) problems, consisting of 378 questions sourced from NLP course final exams at Yale University.
Our example questions are shown in the figure below.

Our dataset is licensed under CC BY-ND. You can download it through this link.
You can create our conda environment from `environment.yml`:

```bash
conda env create -f environment.yml
```

then activate it with:

```bash
conda activate NLPBench
```
Our evaluations cover both online LLMs (GPT-3.5, GPT-4, and PaLM 2) and open-source LLMs (LLAMA 2, Falcon, Bloom, etc.).
Online LLMs require an API key for access. To use the OpenAI models, add `OPENAI_API_KEY` to your system environment:

```bash
export OPENAI_API_KEY="YOUR OPENAI API KEY"
```

For PaLM 2, add `PALM_API_KEY` to your system environment:

```bash
export PALM_API_KEY="YOUR PALM API KEY"
```
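You can optionally confirm that the keys are visible before launching an evaluation. A minimal sketch (the variable names follow the commands above):

```python
import os

# Optional sanity check: confirm the API keys exported above are visible
# to the current process before starting an evaluation run.
for key in ("OPENAI_API_KEY", "PALM_API_KEY"):
    status = "set" if os.environ.get(key) else "NOT set"
    print(f"{key} is {status}")
```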
We use vLLM to start an OpenAI-compatible endpoint for evaluating open-source models. All configurations are summarized in `./utils/utils.py:oai_llm_config`. Check this list for the open-source models supported by vLLM. To evaluate other open-source models, add your model's configuration to `oai_llm_config` in the following format:
"HUGGINGFACE REPO": {
"model": "HUGGINGFACE REPO",
"api_key": "empty",
"api_base": "YOUR ENDPOINT HOST, DEFAULT: http://127.0.0.1:8000/v1",
}
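For instance, a hypothetical entry for a locally served LLAMA 2 chat model could look like the sketch below; the repository name is only an illustration, and `api_base` must match the host and port you pass to `scripts/serving.sh`:

```python
# Hypothetical example entry for oai_llm_config in ./utils/utils.py.
# "meta-llama/Llama-2-7b-chat-hf" is a placeholder model; point api_base
# at wherever your vLLM endpoint is actually serving.
oai_llm_config = {
    "meta-llama/Llama-2-7b-chat-hf": {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "api_key": "empty",
        "api_base": "http://127.0.0.1:8000/v1",
    },
}
```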
Then start the endpoint with the following script:

```bash
bash scripts/serving.sh [-m HUGGINGFACE REPO] [-n NUMBER OF GPUs] [-a HOST ADDRESS, DEFAULT: 127.0.0.1] [-p PORT, DEFAULT: 8000]
```
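Once the server is up, you can verify that the endpoint responds (a sketch assuming the OpenAI Python client v1 interface and the default host and port above):

```python
from openai import OpenAI

# Optional sanity check: list the models served by the vLLM endpoint.
# base_url must match the -a/-p options passed to scripts/serving.sh.
client = OpenAI(api_key="empty", base_url="http://127.0.0.1:8000/v1")
print([m.id for m in client.models.list().data])
```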
We have two steps for evaluation: (1) solving the problems and (2) calculating the accuracy.
We adopt sacred to manage our configurations. All configs can be found in `./configs`; you can also add your own config by creating a new `yaml` file. As an example, the following command lets GPT-3.5 answer the questions with zero-shot prompting and without context:

```bash
python run.py with configs/zero-shot.yaml model_name='gpt-3.5-turbo' ctx=False
```
The answers will be saved to `./res/{SEED}/no_ctx/zero-shot_gpt-3.5-turbo.json`.
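If you want to inspect the raw answers, you can load the saved JSON directly. A minimal sketch (the seed value below is a placeholder for whatever seed sacred assigned to your run):

```python
import json

# Load one answer file produced by run.py; replace 42 with your run's seed.
path = "./res/42/no_ctx/zero-shot_gpt-3.5-turbo.json"  # hypothetical seed value
with open(path) as f:
    answers = json.load(f)
print(type(answers), len(answers) if hasattr(answers, "__len__") else "")
```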
You can evaluate the above results by running:

```bash
python evaluate.py
```

The evaluation results will then be saved in `./res/{SEED}/`.
All the prompts used in our evaluation can be found in `./prompts`, including the question-answering prompts (`qa_prompt.py`), the system prompts (`sys_prompt.py`), and the tree-of-thought prompts (`tot_prompt.py`). You can customize the prompts by modifying these three files.
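For example, a custom question-answering prompt could be added as a plain string constant, as in the hypothetical sketch below (the variable name and template format are illustrative only and must match how the prompts in this repo are actually loaded):

```python
# Hypothetical addition to ./prompts/qa_prompt.py; the variable name and format
# are placeholders and may differ from the repo's actual prompt structure.
CUSTOM_QA_PROMPT = (
    "You are taking an NLP final exam. Read the question carefully and give a "
    "concise, well-justified answer.\n\nQuestion: {question}\nAnswer:"
)
```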
If you find our repository and results useful, please cite our paper:
```bibtex
@misc{song2023nlpbench,
      title={NLPBench: Evaluating Large Language Models on Solving NLP Problems},
      author={Linxin Song and Jieyu Zhang and Lechao Cheng and Pengyuan Zhou and Tianyi Zhou and Irene Li},
      year={2023},
      eprint={2309.15630},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```