
Evaluating Large Language Models & LLMOps

Evaluating Large Language Models

LLM Evaluation Benchmarks


Language Understanding and QA

  1. MMLU (Massive Multitask Language Understanding): Over 15,000 multiple-choice questions across 57 diverse tasks, typically scored by choice log-likelihood (see the scoring sketch after this list). [Published in 2021] GitHub Repo stars
  2. TruthfulQA: Measures whether a model avoids generating false answers learned from imitating common human misconceptions. [Published in 2022]
  3. BIG-bench: 204 collaboratively contributed tasks intended to probe capabilities beyond current models and extrapolate their future potential. [Published in 2023] GitHub Repo stars
  4. GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation) and its harder successor SuperGLUE; multi-task suites of natural language understanding benchmarks.
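
Many of the benchmarks above (and the multiple-choice reasoning benchmarks later in this list) are commonly scored by asking the model for the log-likelihood of each answer choice and picking the highest. The snippet below is a minimal sketch of that scoring loop with Hugging Face transformers; the model name and the toy question are illustrative placeholders, not part of any benchmark.

```python
# Minimal sketch: score a multiple-choice question by total log-likelihood of
# each candidate continuation, as done for MMLU/HellaSwag-style benchmarks.
# The model name and the toy question are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs of each token given the preceding tokens
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the tokens belonging to the answer choice
    # (assumes prompt and choice tokenize independently, a simplification)
    n_choice = full_ids.shape[1] - prompt_ids.shape[1]
    return token_lp[0, -n_choice:].sum().item()

prompt = "Question: What is the boiling point of water at sea level?\nAnswer:"
choices = [" 100 degrees Celsius", " 50 degrees Celsius", " 0 degrees Celsius"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[max(range(len(choices)), key=scores.__getitem__)])
```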

Coding

  1. HumanEval: 164 hand-written programming problems measuring functional correctness of generated code, reported as pass@k (see the sketch after this list). [Published in 2021] GitHub Repo stars
  2. CodeXGLUE: A benchmark suite covering code understanding and generation tasks. GitHub Repo stars
  3. SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub.
  4. MBPP (Mostly Basic Python Problems): Crowd-sourced, entry-level Python programming problems with accompanying test cases. [Published in 2021]
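
Code benchmarks such as HumanEval and MBPP are usually reported as pass@k. Below is a sketch of the unbiased pass@k estimator described in the HumanEval paper, where n samples are drawn per problem and c of them pass the unit tests.

```python
# Unbiased pass@k estimator from the HumanEval (Codex) paper:
# pass@k = 1 - C(n - c, k) / C(n, k), where n samples were drawn per problem
# and c of them passed the unit tests. Computed in a numerically stable form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 13 of which pass the tests
print(round(pass_at_k(200, 13, 1), 3))   # pass@1 = c/n = 0.065
print(round(pass_at_k(200, 13, 10), 3))  # pass@10
```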

Chatbot Assistance

  1. Chatbot Arena: A leaderboard built from human pairwise preference votes, ranked with Elo-style ratings (update sketch below).
  2. MT Bench: Multi-turn open-ended questions - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [9 Jun 2023]
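
As a note on how such pairwise-vote leaderboards work, below is a minimal sketch of the classic online Elo update, assuming a fixed K-factor and starting rating; Chatbot Arena's production pipeline uses its own fitting procedure, so treat the parameters here as illustrative.

```python
# Classic online Elo update for pairwise "model A vs. model B" votes,
# the idea behind Chatbot Arena-style leaderboards. K and the initial
# rating of 1000 are illustrative choices, not Arena's exact parameters.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """score_a is 1 if A wins, 0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# one human vote: model_a preferred
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], 1.0)
print(ratings)
```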

Reasoning

  1. HellaSwag: Commonsense reasoning. [Published in 2019] GitHub Repo stars
  2. ARC (AI2 Reasoning Challenge): Grade-school science questions that require reasoning beyond simple retrieval. GitHub Repo stars
  3. DROP (Discrete Reasoning Over Paragraphs): Reading comprehension that requires discrete reasoning such as counting, sorting, and arithmetic.
  4. LogiQA: Evaluates logical reasoning skills. GitHub Repo stars

Translation

  1. WMT (Workshop on Machine Translation): Annual shared tasks that evaluate machine translation quality.

Math

  1. MATH: 12,500 competition mathematics problems with step-by-step solutions. [Published in 2021] GitHub Repo stars
  2. GSM8K: 8.5K grade-school math word problems requiring multi-step arithmetic reasoning, usually scored by exact match on the final answer (scoring sketch below). [Published in 2021] GitHub Repo stars
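
GSM8K reference solutions end with a final line of the form "#### <answer>", so scoring is typically exact match on the extracted final number. The sketch below illustrates that; the regex and normalization are simplifications rather than the official evaluation code.

```python
# Minimal GSM8K-style scoring sketch: pull the final number out of a model's
# answer and compare it to the gold answer after the "#### " marker.
# The regex and normalization below are simplifications for illustration.
import re

def extract_gold(reference: str) -> str:
    return reference.split("####")[-1].strip().replace(",", "")

def extract_prediction(generation: str) -> str | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None  # take the last number mentioned

def is_correct(generation: str, reference: str) -> bool:
    pred = extract_prediction(generation)
    return pred is not None and float(pred) == float(extract_gold(reference))

reference = "She has 3 + 4 = 7 apples.\n#### 7"
print(is_correct("So the answer is 7 apples.", reference))  # True
```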

Evaluation metrics

  1. Automated evaluation of LLMs
  • n-gram based metrics: Evaluates the model using n-gram statistics and F1 score. ROUGE, BLEU, and METEOR are used for summarization and translation tasks (a from-scratch sketch appears after this list).

  • Probabilistic model evaluation metrics: Evaluates the model using the predictive performance of probability models; perplexity is the standard example.

  • Embedding based metrics: Evaluates the model using semantic similarity of embeddings. Ada Similarity and BERTScore are used.

    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares an automatically produced summary or translation against one or more human-produced references. It includes several measures:

      1. ROUGE-N: Overlap of n-grams between the system and reference summaries.
      2. ROUGE-L: Longest Common Subsequence (LCS) based statistics.
      3. ROUGE-W: Weighted LCS-based statistics that favor consecutive LCSes.
      4. ROUGE-S: Skip-bigram based co-occurrence statistics.
      5. ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
    • n-gram: An n-gram is a contiguous sequence of n items from a given sample of text or speech. For example, in the sentence “I love AI”, the unigrams (1-gram) are “I”, “love”, “AI”; the bigrams (2-gram) are “I love”, “love AI”; and the trigram (3-gram) is “I love AI”.

    • BLEU (Bilingual Evaluation Understudy): An algorithm for evaluating the quality of machine-translated text. Its output is always a number between 0 and 1; the closer the machine translation is to a professional human translation, the higher the score.

    • BERTScore: A metric that leverages pre-trained contextual embeddings from BERT for text generation tasks. It combines precision and recall values.

    • Perplexity: A measure of a model's predictive performance, with lower values indicating better prediction.

    • METEOR: An n-gram based metric for machine translation, considering precision, recall, and semantic similarity.

  2. Human evaluation of LLMs (possibly automated with LLM-based metrics, i.e., LLM-as-a-judge): Evaluates the model's performance on NLU and NLG tasks, including relevance, fluency, coherence, and groundedness.

  3. Built-in evaluation methods in Prompt flow: ref [Aug 2023] / ref
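
To make the n-gram and probabilistic metrics above concrete, the sketch below computes ROUGE-1 precision/recall/F1 from scratch and perplexity as the exponential of the average negative log-likelihood. In practice, library implementations such as rouge-score or Hugging Face evaluate are preferable.

```python
# From-scratch sketches of two metrics described above (library implementations
# such as `rouge-score` or Hugging Face `evaluate` should be preferred in practice).
import math
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N: n-gram overlap between a candidate and a reference summary."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood of the tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))
print(perplexity([-0.1, -2.3, -0.7, -1.2]))  # lower is better
```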

LLMOps: Large Language Model Operations

  • LLMOps Database: A curated knowledge base of real-world LLMOps implementations.
  • Language Model Evaluation Harness:💡Over 60 standard academic benchmarks for LLMs. A framework for few-shot evaluation; Hugging Face uses it for the Open LLM Leaderboard (usage sketch at the end of this list). [Aug 2020] GitHub Repo stars
  • TruLens: Instrumentation and evaluation tools for large language model (LLM) based applications. [Nov 2020] GitHub Repo stars
  • Giskard: The testing framework for ML models, from tabular to LLMs [Mar 2022] GitHub Repo stars
  • OpenAI Evals: A framework for evaluating large language models (LLMs) [Mar 2023] GitHub Repo stars
  • promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality. [Apr 2023] GitHub Repo stars
  • Ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) [May 2023] GitHub Repo stars
  • Pezzo: Open-source, developer-first LLMOps platform [May 2023] GitHub Repo stars
  • Langfuse: git LLMOps platform that helps teams to collaboratively monitor, evaluate and debug AI applications. [May 2023] GitHub Repo stars
  • PromptTools: Open-source tools for prompt testing [Jun 2023] GitHub Repo stars
  • 30 requirements for an MLOps environment: Kirk Borne twitter [15 Jul 2023]
  • DeepEval: LLM evaluation framework, similar to Pytest but specialized for unit testing LLM outputs. [Aug 2023] GitHub Repo stars
  • traceloop openllmetry: Quality monitoring for your LLM applications. [Sep 2023] GitHub Repo stars
  • Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ref [Apr 2024]
  • Azure ML Prompt flow: A set of LLMOps tools designed to facilitate the creation of LLM-based AI applications [Sep 2023] > How to Evaluate & Upgrade Model Versions in the Azure OpenAI Service [14 Aug 2024]
  • Machine Learning Operations (MLOps) For Beginners: DVC (Data Version Control), MLflow, Evidently AI (Monitor a model). Insurance Cross Sell Prediction git [29 Aug 2024] GitHub Repo stars
  • Opik: an open-source platform for evaluating, testing and monitoring LLM applications. Built by Comet. [2 Sep 2024] GitHub Repo stars
  • Economics of Hosting Open Source LLMs: Comparison of cloud vendors such as AWS, Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam, using metrics like processing time, cold start latency, and costs associated with CPU, memory, and GPU usage. git [13 Nov 2024]
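
For the Language Model Evaluation Harness entry above, here is a minimal usage sketch via its Python API. The call below reflects recent versions of lm-evaluation-harness (lm_eval.simple_evaluate); exact arguments and result keys vary by version, and the model and task names are illustrative.

```python
# Sketch of running a benchmark with EleutherAI's lm-evaluation-harness.
# NOTE: lm_eval.simple_evaluate reflects recent releases and may differ in
# older versions; the model checkpoint and task names are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                     # Hugging Face backend
    model_args="pretrained=gpt2",   # any causal LM checkpoint
    tasks=["hellaswag"],            # one of the bundled academic benchmarks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])           # per-task accuracy and related metrics
```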

Challenges in evaluating AI systems

  1. Pretraining on the Test Set Is All You Need: [cnt]
    • On that note, in the satirical Pretraining on the Test Set Is All You Need paper, the author trains a small 1M parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model. This is achieved by training the model on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated" intentionally or unintentionally (due to data contamination). cite [13 Sep 2023]
  2. Challenges in evaluating AI systems: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. doc [4 Oct 2023]
  3. Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]