- Awesome LLMs Evaluation Papers: Evaluating Large Language Models: A Comprehensive Survey git [Oct 2023]
- Artificial Analysis LLM Performance Leaderboard: Performance benchmarks & pricing across API providers of LLMs
- LLMPerf Leaderboard: Evaluates the performance of LLM APIs. [Dec 2023]
- MMLU (Massive Multi-task Language Understanding): LLM performance across 57 tasks including elementary mathematics, US history, computer science, law, and more. [7 Sep 2020]
- HumanEval: Hand-written evaluation set for code generation benchmarking; 164 human-written programming problems. ref / git [7 Jul 2021]
- BIG-bench: Consists of 204 evaluations, contributed by over 450 authors, that span a range of topics from science to social reasoning. The bottom-up approach; anyone can submit an evaluation task. git [9 Jun 2022]
- HELM: Evaluation scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. The top-down approach; experts curate and decide what tasks to evaluate models on. git [16 Nov 2022]
- Evaluation Papers for ChatGPT [28 Feb 2023]
- Evaluation of Large Language Models: A Survey on Evaluation of Large Language Models: [cnt] [6 Jul 2023]
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models: [cnt]: We utilize the FEEDBACK COLLECTION, a novel dataset, to train PROMETHEUS, an open-source large language model with 13 billion parameters, designed specifically for evaluation tasks. [12 Oct 2023]
- ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?: Open-Source LLMs vs. ChatGPT; Benchmarks and Performance of LLMs [28 Nov 2023]
- LightEval: a lightweight LLM evaluation suite that Hugging Face has been using internally [Jan 2024]
- LLM Model Evals vs LLM Task Evals: Model evals are really for people who are building or fine-tuning an LLM, whereas the best LLM application builders use task evals, a tool to help builders build. [Feb 2024]
- LLM-as-a-Judge:💡LLM-as-a-Judge offers a quick, cost-effective way to develop models aligned with human preferences and is easy to implement with just a prompt, but should be complemented by human evaluation to address biases. [Jul 2024]
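A minimal sketch of the LLM-as-a-Judge pattern described above, assuming the OpenAI Python SDK; the rubric, scoring scale, and model name are illustrative, any chat-completion API can be substituted, and judge scores should still be spot-checked by humans.

```python
# Minimal LLM-as-a-Judge sketch; rubric, scale, and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent) for correctness and relevance.
Respond only with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # Assumes the judge follows the JSON-only instruction; add parsing guards in practice.
    return json.loads(response.choices[0].message.content)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```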
- Can Large Language Models Be an Alternative to Human Evaluations? [3 May 2023]
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge):💡Key considerations and Use cases when using LLM-evaluators [Aug 2024]
- OpenAI MLE-bench: A benchmark for measuring the performance of AI agents on ML tasks using Kaggle. git [9 Oct 2024] > Agent framework used in MLE-bench: GPT-4o (AIDE) achieves more medals on average than both MLAB and OpenHands (8.7% vs. 0.8% and 4.4%, respectively). x-ref
- Korean SAT LLM Leaderboard: Benchmarking 10 years of Korean CSAT (College Scholastic Ability Test) exams [Oct 2024]
- OpenAI SimpleQA Benchmark: SimpleQA, a factuality benchmark for short fact-seeking queries, narrows its scope to simplify factuality measurement. git [30 Oct 2024]
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [14 Nov 2024]
- MMLU (Massive Multitask Language Understanding): Over 15,000 questions across 57 diverse tasks. [Published in 2021]
- TruthfulQA: Truthfulness. [Published in 2022]
- BigBench: 204 tasks, aimed at probing capabilities and predicting future potential. [Published in 2023]
- GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation) and its more challenging successor SuperGLUE; natural language understanding benchmarks.
- HumanEval: Challenges coding skills. [Published in 2021]
- CodeXGLUE: Programming tasks.
- SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub.
- MBPP: Mostly Basic Python Programming. [Published in 2021]
- Chatbot Arena: Human-voted, Elo-based ranking.
- MT Bench: Multi-turn open-ended questions - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [9 Jun 2023]
- HellaSwag: Commonsense reasoning. [Published in 2019]
- ARC (AI2 Reasoning Challenge): Measures general fluid intelligence.
- DROP: Evaluates discrete reasoning.
- LogiQA: Evaluates logical reasoning skills.
- WMT: Evaluates translation skills.
- Automated evaluation of LLMs
- n-gram based metrics: Evaluate the model using n-gram statistics and the F1 score. ROUGE, BLEU, and METEOR are used for summarization and translation tasks.
- Probabilistic model evaluation metrics: Evaluate the model using the predictive performance of probability models, e.g. perplexity.
- Embedding based metrics: Evaluate the model using the semantic similarity of embeddings. Ada Similarity and BERTScore are used.
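To make the embedding-based family concrete, a short cosine-similarity sketch over two embedding vectors; the toy vectors stand in for embeddings you would obtain from an embedding model (e.g. an Ada-style embedding API), which is not shown here.

```python
# Cosine similarity between a reference and a candidate embedding.
# Obtaining the embeddings (e.g. via an embedding API) is out of scope here.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

reference_emb = np.array([0.12, 0.84, 0.33])   # toy 3-dim vectors for illustration
candidate_emb = np.array([0.10, 0.80, 0.40])
print(round(cosine_similarity(reference_emb, candidate_emb), 3))  # close to 1.0 = similar
```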
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares an automatically produced summary or translation against one or more human-produced reference summaries or translations. It includes several measures (a short ROUGE-1 sketch follows the list), such as:
- ROUGE-N: Overlap of n-grams between the system and reference summaries.
- ROUGE-L: Longest Common Subsequence (LCS) based statistics.
- ROUGE-W: Weighted LCS-based statistics that favor consecutive LCSes.
- ROUGE-S: Skip-bigram based co-occurrence statistics.
- ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
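A hand-rolled ROUGE-1 (unigram overlap) sketch to make the recall/precision/F1 idea concrete; in practice a maintained package such as `rouge-score` would be used, and the sentences below are toy inputs.

```python
# ROUGE-N (here ROUGE-1, unigram overlap) computed by hand for illustration.
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())              # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_n("the cat sat on the mat", "the cat is on the mat"))
```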
- n-gram: An n-gram is a contiguous sequence of n items from a given sample of text or speech. For example, in the sentence “I love AI”, the unigrams (1-grams) are “I”, “love”, “AI”; the bigrams (2-grams) are “I love”, “love AI”; and the trigram (3-gram) is “I love AI”.
- BLEU: An algorithm for evaluating the quality of machine-translated text. Its output is always a number between 0 and 1; the closer a machine translation is to a professional human translation, the higher the score.
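A short example using NLTK's sentence-level BLEU, assuming `nltk` is installed; smoothing is applied because sentence-level BLEU on short texts otherwise often collapses to zero.

```python
# Sentence-level BLEU with NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "on", "the", "mat"]   # tokenized human reference
candidate = ["the", "cat", "sat", "on", "the", "mat"]  # tokenized machine translation
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))  # 1.0 would mean identical to the reference
```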
- BERTScore: A metric that leverages pre-trained contextual embeddings from BERT for text generation tasks. It combines precision and recall values.
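A brief example using the `bert-score` package (assuming it is installed via `pip install bert-score`); it returns per-pair precision, recall, and F1 tensors and downloads a pretrained model on first use.

```python
# BERTScore with the `bert-score` package (assumes `pip install bert-score`).
from bert_score import score

candidates = ["The model answered the question accurately."]
references = ["The model gave an accurate answer to the question."]
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"precision={P.item():.3f} recall={R.item():.3f} f1={F1.item():.3f}")
```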
- Perplexity: A measure of a model's predictive performance, with lower values indicating better prediction.
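A minimal illustration that perplexity is the exponential of the average negative log-likelihood over the evaluated tokens; the per-token log-probabilities below are made up for illustration.

```python
# Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
import math

token_logprobs = [-0.21, -1.35, -0.02, -2.10, -0.48]  # toy per-token natural-log probabilities
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(round(perplexity, 2))  # lower is better (the model is less "surprised")
```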
- METEOR: An n-gram based metric for machine translation, considering precision, recall, and semantic similarity.
- Human evaluation of LLMs (possibly automated by LLM-based metrics): Evaluates the model's performance on NLU and NLG tasks, including relevance, fluency, coherence, and groundedness.
- Built-in evaluation methods in Prompt flow: ref [Aug 2023] / ref
- LLMOps Database: A curated knowledge base of real-world LLMOps implementations.
- Language Model Evaluation Harness:💡Over 60 standard academic benchmarks for LLMs. A framework for few-shot evaluation. Hugging Face uses this for the Open LLM Leaderboard. [Aug 2020]
- TruLens: Instrumentation and evaluation tools for large language model (LLM) based applications. [Nov 2020]
- Giskard: The testing framework for ML models, from tabular to LLMs [Mar 2022]
- OpenAI Evals: A framework for evaluating large language models (LLMs) [Mar 2023]
- promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. [Apr 2023]
- Ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) [May 2023]
- Pezzo: Open-source, developer-first LLMOps platform [May 2023]
- Langfuse: git LLMOps platform that helps teams to collaboratively monitor, evaluate and debug AI applications. [May 2023]
- PromptTools: Open-source tools for prompt testing [Jun 2023]
- 30 requirements for an MLOps environment: Kirk Borne twitter [15 Jul 2023]
- DeepEval: LLM evaluation framework, similar to Pytest but specialized for unit testing LLM outputs. [Aug 2023]
- traceloop openllmetry: Quality monitoring for your LLM applications. [Sep 2023]
- Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ref [Apr 2024]
- Azure ML Prompt flow: A set of LLMOps tools designed to facilitate the creation of LLM-based AI applications [Sep 2023] > How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service [14 Aug 2024]
- Machine Learning Operations (MLOps) For Beginners: DVC (Data Version Control), MLflow, Evidently AI (Monitor a model). Insurance Cross Sell Prediction git [29 Aug 2024]
- Opik: an open-source platform for evaluating, testing and monitoring LLM applications. Built by Comet. [2 Sep 2024]
- Economics of Hosting Open Source LLMs: Comparison of cloud vendors such as AWS, Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam, using metrics like processing time, cold start latency, and costs associated with CPU, memory, and GPU usage. git [13 Nov 2024]
- Pretraining on the Test Set Is All You Need: [cnt]
- On that note, in the satirical Pretraining on the Test Set Is All You Need paper, the author trains a small 1M parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model. This is achieved by training the model on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated" intentionally or unintentionally (due to data contamination). cite [13 Sep 2023]
- Challenges in evaluating AI systems: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. doc [4 Oct 2023]
- Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]