This is a tool to evaluate large language models on NLP tasks such as text classification and summarization. It implements a common API for traditional encoder-decoder and prompt-based large language models, as well as APIs such as OpenAI and Cohere.
The following features are currently available:
- Prompting and truncation logic
- Support for vanilla LLMs (OPT, LLaMA) and instruction-tuned models (T0, Alpaca)
- Evaluation based on 🤗 Datasets or CSV files
- Memoization: inference outputs are cached on disk
- Parallelized computation of metrics
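As a rough illustration of what prompting and truncation logic involves, the sketch below fills a prompt template and trims the source document so the whole prompt fits a token budget. It is not this tool's implementation: `build_prompt`, the template text, and the whitespace-based token count are all assumptions made for the example, and a real model would count tokens with its own tokenizer.

```python
def build_prompt(article, template, max_tokens):
    """Fill `template` with `article`, truncating the article to fit `max_tokens`.

    Hypothetical helper: tokens are approximated by whitespace-separated
    words, which is only a stand-in for a real tokenizer.
    """
    # Tokens consumed by the instruction text alone (template with an empty article).
    overhead = len(template.format(article="").split())
    # Whatever is left of the budget goes to the article.
    budget = max(max_tokens - overhead, 0)
    truncated = " ".join(article.split()[:budget])
    return template.format(article=truncated)


# Illustrative template, not taken from the tool.
template = "Summarize the following article.\n\nArticle: {article}\n\nSummary:"
prompt = build_prompt("word " * 50, template, max_tokens=16)
```

Truncating the article rather than the full prompt keeps the instruction and the trailing "Summary:" cue intact, which matters for instruction-tuned models.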
Installation:
git clone https://github.com/thefonseca/llms.git
cd llms && pip install -e .
Evaluating Llama 2 7B Chat (float16) on the IMDB test split:
llm-classify \
--model_name llama-2-7b-chat \
--model_checkpoint_path path_to_llama2_checkpoint \
--model_dtype float16 \
--dataset_name imdb \
--split test \
--source_key text \
--target_key label \
--model_labels "{'Positive':1,'Negative':0}" \
--max_samples 1000
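The value passed to --model_labels above is a Python-style dict literal that maps label names to the dataset's integer labels. The sketch below shows one way such a mapping could be parsed and used to score model outputs. The helper names `parse_labels` and `to_label_id`, the prefix-matching rule, and the sample outputs are assumptions for illustration; the tool's actual matching logic may differ.

```python
import ast


def parse_labels(spec):
    """Parse a --model_labels style string into a {name: id} dict (hypothetical helper)."""
    labels = ast.literal_eval(spec)  # safe parsing of a Python literal
    if not isinstance(labels, dict):
        raise ValueError("expected a dict literal such as \"{'Positive':1}\"")
    return labels


def to_label_id(output, labels):
    """Return the id of the first label name the output starts with, else None."""
    text = output.strip().lower()
    for name, label_id in labels.items():
        if text.startswith(name.lower()):
            return label_id
    return None  # the output did not match any known label


labels = parse_labels("{'Positive':1,'Negative':0}")
outputs = ["Positive", " negative.", "Positive!", "I am not sure"]  # made-up model outputs
targets = [1, 0, 0, 1]  # made-up gold labels
predictions = [to_label_id(o, labels) for o in outputs]
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

Outputs that match no label are mapped to `None` and therefore always count as errors in this sketch.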
Evaluating BigBird on the PubMed validation split and saving the results to the output folder:
llm-summarize \
--dataset_name scientific_papers \
--dataset_config pubmed \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name google/bigbird-pegasus-large-pubmed \
--output_dir output
where --model_name is a Hugging Face model identifier.
Evaluating Alpaca (float16) on a single arXiv paper, given by --arxiv_id:
llm-summarize \
--arxiv_id https://arxiv.org/abs/2304.15004v1 \
--model_name alpaca-7b \
--model_checkpoint_path path_to_alpaca_checkpoint \
--budget 7 \
--budget_unit sentences \
--model_dtype float16 \
--output_dir output
Notes:
- --budget controls the length of summaries from instruction-tuned models (by default, in sentences).
- --model_checkpoint_path allows changing the checkpoint folder while keeping the cache key (--model_name) constant.
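To make the note on the cache key concrete, here is a minimal sketch of a cache whose key is derived from the model name, the prompt, and the generation arguments, but not from the checkpoint location. Everything in it is an assumption for illustration: `cache_key`, `CachedModel`, and the example paths are hypothetical, and a plain dict stands in for the on-disk cache.

```python
import hashlib
import json


def cache_key(model_name, prompt, generation_kwargs):
    """Build a key from the model name and inputs only (checkpoint path excluded)."""
    payload = json.dumps(
        {"model": model_name, "prompt": prompt, "kwargs": generation_kwargs},
        sort_keys=True,  # stable key regardless of keyword order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedModel:
    """Hypothetical wrapper; `cache` is a dict standing in for a disk cache."""

    def __init__(self, model_name, checkpoint_path, cache):
        self.model_name = model_name
        self.checkpoint_path = checkpoint_path  # where weights live; not part of the key
        self.cache = cache
        self.calls = 0  # number of times inference actually ran

    def generate(self, prompt, **kwargs):
        key = cache_key(self.model_name, prompt, kwargs)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = f"summary({prompt})"  # placeholder for real inference
        return self.cache[key]


shared_cache = {}
old = CachedModel("alpaca-7b", "/old/checkpoints", shared_cache)  # hypothetical paths
new = CachedModel("alpaca-7b", "/new/checkpoints", shared_cache)
first = old.generate("some article", budget=7)
second = new.generate("some article", budget=7)  # served from the cache
```

Under this design, moving a checkpoint to another folder does not invalidate earlier results, while changing the model name or any generation argument produces a new key.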
Evaluating the ChatGPT API on the arXiv validation split:
export OPENAI_API_KEY=<your_api_key>
llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--max_samples 1000 \
--model_name gpt-3.5-turbo \
--output_dir output
Evaluating summary predictions from a CSV file:
llm-summarize \
--dataset_name scientific_papers \
--dataset_config arxiv \
--split validation \
--source_key article \
--target_key abstract \
--prediction_path path_to_csv_file \
--prediction_key prediction \
--max_samples 1000 \
--output_dir output
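For readers who want to sanity-check a predictions file before running the command above, the sketch below reads a CSV with a `prediction` column (matching --prediction_key) and scores each row against a reference. It makes several assumptions: rows are taken to be in the same order as the references, the sample data is invented, and `unigram_f1` is a crude token-overlap score used here in place of the tool's real metrics.

```python
import csv
import io
from collections import Counter


def unigram_f1(prediction, reference):
    """Token-overlap F1 between two strings (a simple stand-in, not ROUGE itself)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def load_predictions(csv_file, prediction_key="prediction"):
    """Read one prediction per row from the column named `prediction_key`."""
    return [row[prediction_key] for row in csv.DictReader(csv_file)]


# Invented data; a real run would open the CSV file from disk instead.
csv_text = "prediction\nthe cat sat\na dog barked loudly\n"
references = ["the cat sat down", "birds sing"]
predictions = load_predictions(io.StringIO(csv_text))
scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
```

A missing column raises a `KeyError` in `load_predictions`, which is a quick way to catch a mismatch between the file header and --prediction_key.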