Awesome-LM-Science-Bench

License: MIT

An open benchmark list covering LLM reasoning benchmarks for science problems. We focus on LLM evaluation datasets in the natural sciences.

Hi 👋, if you find this repo helpful, please give it a star ⭐️!

As new benchmarks keep being released, we will update this repo frequently, and we welcome contributions from the community! 🏠


Massive Multitask Language Understanding (MMLU; only partially covers the natural sciences)

  • Description: Measures general knowledge across 57 different subjects, ranging from STEM to social sciences.
  • Purpose: To assess the LLM's understanding and reasoning in a wide range of subject areas.
  • Relevance: Ideal for multifaceted AI systems that require extensive world knowledge and problem-solving ability.
  • Source: Measuring Massive Multitask Language Understanding

AI2 Reasoning Challenge (ARC)

General Language Understanding Evaluation (GLUE)

SciQ (multiple-choice questions in physics, chemistry, and biology)

  • Description: Consists of multiple-choice questions mainly in natural sciences like physics, chemistry, and biology.
  • Purpose: To test the ability to answer science-based questions, often with additional supporting text (a minimal multiple-choice scoring sketch follows this entry).
  • Relevance: Useful for educational tools, especially in science education and knowledge testing platforms.
  • Source: Crowdsourcing Multiple Choice Science Questions
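
Benchmarks like SciQ and the MMLU natural-science subjects are scored as plain multiple-choice accuracy. The sketch below is a minimal, hypothetical evaluation loop; the `ask_model` function and the item schema are placeholders, not the official harness of either benchmark.

```python
# Minimal multiple-choice accuracy scoring, in the style of SciQ / MMLU.
# `ask_model` and the dataset schema are hypothetical placeholders.

def ask_model(question: str, options: list[str]) -> str:
    """Placeholder: return the model's chosen option letter, e.g. 'B'."""
    raise NotImplementedError

def accuracy(dataset: list[dict]) -> float:
    """Each item: {'question': str, 'options': [str, ...], 'answer': 'A'|'B'|'C'|'D'}."""
    correct = 0
    for item in dataset:
        prediction = ask_model(item["question"], item["options"])
        if prediction.strip().upper() == item["answer"]:
            correct += 1
    return correct / len(dataset)
```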

ChemQA: Chemistry Question-Answering Dataset

  • Description: A multimodal question-and-answering dataset on chemistry reasoning with 5 QA tasks.
  • Purpose: To evaluate LLMs' abilities in chemistry-related tasks such as counting atoms, calculating molecular weights, and retrosynthesis planning (see the small reference example after this entry).
  • Relevance: Essential for AI applications in chemistry education and research.
  • Source: GitHub - materials-data-facility/matchem-llm
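
To give a sense of the atom-counting and molecular-weight tasks in ChemQA, the snippet below computes reference values with RDKit. RDKit is not part of ChemQA; this is only an illustrative ground-truth check for one molecule (ethanol).

```python
# Reference computation for ChemQA-style tasks (atom counts, molecular weight),
# using RDKit. Purely illustrative; not taken from the benchmark itself.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCO")   # ethanol
mol_h = Chem.AddHs(mol)           # add explicit hydrogens

print("Heavy atoms:", mol.GetNumAtoms())                  # 3 (2 C + 1 O)
print("All atoms:  ", mol_h.GetNumAtoms())                # 9 (adds 6 H)
print("Mol. weight:", round(Descriptors.MolWt(mol), 2))   # ~46.07 g/mol
```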

ChemBench - Lamalab

  • Description: A benchmark with over 7000 questions curated for various chemical topics.
  • Purpose: To evaluate LLMs on chemistry knowledge and reasoning abilities.
  • Relevance: Important for AI systems in chemistry education and research.
  • Source: GitHub - materials-data-facility/matchem-llm

ChemSafetyBench: LLM Safety in Chemistry

  • Description: A benchmark designed to evaluate the safety of LLMs in the field of chemistry.
  • Purpose: To assess the safety and reliability of LLMs in chemistry-related applications.
  • Relevance: Crucial for ensuring the safe use of LLMs in chemistry.
  • Source: ChemSafetyBench: Benchmarking LLM Safety on Chemistry

BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Biology

Large Language Model Benchmarks in Medical Tasks

  • Description: This comprehensive survey presents the benchmark datasets used for medical LLM tasks, spanning text, image, and multimodal benchmarks. It covers multiple aspects of medical knowledge, including electronic health records (EHRs), doctor-patient dialogues, medical question answering, and medical image captioning.
  • Purpose: To evaluate the performance of LLMs in medical tasks and contribute to the evolving field of medical artificial intelligence.
  • Relevance: Vital for advancing multimodal medical intelligence and improving healthcare delivery through AI.
  • Source: arXiv

NPHardEval: Dynamic Benchmark on Reasoning Ability of LLMs

  • Description: A new benchmark containing a broad spectrum of 900 algorithmic questions spanning complexity classes up to NP-Hard.
  • Purpose: To evaluate the reasoning ability of LLMs on complex algorithmic questions.
  • Relevance: Important for assessing LLMs' capabilities in solving complex problems in natural sciences.
  • Source: NPHardEval: Dynamic Benchmark on Reasoning Ability of LLMs

ChemLLMBench: A comprehensive benchmark on eight chemistry tasks

  • Description: ChemLLMBench covers a range of chemistry tasks, providing a thorough evaluation of LLMs in the chemistry domain.
  • Purpose: To assess LLMs' capabilities in various chemistry-related tasks.
  • Relevance: Useful for advancing chemistry research and education with AI.
  • Source: GitHub - materials-data-facility/matchem-llm

SMolInstruct: Instruction tuning dataset for chemistry

  • Description: SMolInstruct focuses on small molecules and includes 14 tasks and over 3M samples, covering name conversion, property prediction, molecule description, and chemical reaction prediction.
  • Purpose: To enhance LLMs with chemistry-specific instructions and improve their performance on chemistry tasks.
  • Relevance: Important for developing LLMs that can assist in chemical research and development.
  • Source: GitHub - materials-data-facility/matchem-llm

ChemBench4k: Chemistry competency evaluation benchmark

  • Description: ChemBench4k includes nine core chemistry tasks and 4,100 high-quality single-choice questions and answers.
  • Purpose: To evaluate the chemistry knowledge and reasoning abilities of LLMs.
  • Relevance: Crucial for applications in chemistry education and knowledge assessment.
  • Source: GitHub - materials-data-facility/matchem-llm

Chem-RnD and ChemEDU CLAIRify: Chemistry protocols and instructions

  • Description: Chem-RnD and ChemEDU CLAIRify provide detailed chemistry protocols for synthesizing organic compounds and everyday educational chemistry instructions.
  • Purpose: To assess LLMs' ability to understand and generate instructions for chemical processes.
  • Relevance: Useful for training LLMs in chemical synthesis and education.
  • Source: GitHub - materials-data-facility/matchem-llm

LLM4Mat-Bench: Benchmarking Large Language Models for Materials

  • Description: LLM4Mat-Bench is the largest benchmark for evaluating LLMs in predicting properties of crystalline materials.
  • Purpose: To assess LLMs' capabilities in materials science and property prediction (a sketch of typical regression metrics follows this entry).
  • Relevance: Essential for materials research and development using AI.
  • Source: arXiv
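
Property-prediction benchmarks of this kind are usually scored with regression metrics against reference values. The sketch below shows MAE and RMSE on made-up band-gap numbers; it is not the official LLM4Mat-Bench evaluation code.

```python
# Generic regression metrics for property prediction (illustrative values only).
import math

def mae(pred: list[float], true: list[float]) -> float:
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred: list[float], true: list[float]) -> float:
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

predicted_band_gaps = [1.1, 0.0, 2.3]  # eV, hypothetical model outputs
reference_band_gaps = [1.3, 0.1, 2.0]  # eV, hypothetical ground truth

print("MAE :", round(mae(predicted_band_gaps, reference_band_gaps), 3))
print("RMSE:", round(rmse(predicted_band_gaps, reference_band_gaps), 3))
```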

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

  • Description: SciBench is an expansive benchmark suite designed to systematically examine the reasoning capabilities required for solving complex scientific problems at the collegiate level. It contains a curated dataset featuring scientific problems from the mathematics, chemistry, and physics domains.
  • Purpose: To evaluate the performance of LLMs on collegiate-level scientific problem-solving and to identify areas for improvement in reasoning abilities (a simple answer-grading sketch follows this entry).
  • Relevance: Crucial for advancing the scientific research and discovery capabilities of LLMs.
  • Results: Current LLMs show unsatisfactory performance with the best overall score of only 43.22%, indicating significant room for improvement.
  • Error Analysis: Errors made by LLMs are categorized into ten problem-solving abilities; the analysis also finds that no single prompting strategy significantly outperforms the others.
  • Source: arXiv:2307.10635
  • Cite as: arXiv:2307.10635 [cs.CL] (or arXiv:2307.10635v3 [cs.CL] for this version)
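
SciBench answers are largely numeric, so grading typically reduces to comparing a model's final number against the reference within a tolerance. The check below is a generic sketch under that assumption, not the paper's actual scoring script.

```python
# Generic tolerance-based check for numeric final answers (illustrative only).
import math

def is_correct(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Accept the prediction if it is within 5% relative error of the reference."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

assert is_correct(9.79, 9.81)      # within 5% of g = 9.81 m/s^2
assert not is_correct(12.0, 9.81)  # off by well over 5%
```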

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

  • Description: LAB-Bench is a broad dataset of over 2,400 multiple-choice questions designed to evaluate AI systems on practical biology research capabilities, including literature recall, figure interpretation, database navigation, and DNA/protein sequence manipulation (a tiny sequence-manipulation example follows this entry).
  • Purpose: To measure the performance of LLMs on tasks required for scientific research and to develop automated research systems.
  • Relevance: Essential for accelerating scientific discovery across disciplines by augmenting LLMs.
  • Human Expert Comparison: Performance of several LLMs is measured and compared against human expert biology researchers.
  • Availability: A public subset of LAB-Bench is available for use.
  • Source: arXiv:2407.10362
  • Cite as: arXiv:2407.10362 [cs.AI] (or arXiv:2407.10362v3 [cs.AI] for this version)
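
As an illustration of the sequence-manipulation skills LAB-Bench probes, the snippet below computes the reverse complement of a DNA strand. It is a generic example, not an item from the benchmark.

```python
# Reverse complement of a DNA sequence (generic illustration).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand (5'->3')."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCGTTA"))  # TAACGCAT
```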

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

  • Description: SciEx is a benchmark consisting of university computer science exam questions designed to evaluate the ability of LLMs to solve scientific tasks. It is multilingual (containing both English and German exams), multi-modal (including questions with images), and features various types of free-form questions at different difficulty levels.
  • Purpose: To assess the performance of LLMs on scientific tasks typically encountered in university examinations, including writing algorithms, querying databases, and providing mathematical proofs.
  • Relevance: Essential for evaluating the capabilities of LLMs in scientific domains and their potential as assistants in academic and research settings.
  • Performance: The best-performing LLM achieves an average exam grade of 59.4%, indicating that free-form exams in SciEx remain challenging for current LLMs.
  • Human Expert Grading: Human expert grading of LLM outputs on SciEx is provided to evaluate performance, showcasing the difficulty in assessing free-form responses.
  • LLM-as-a-Judge: The study proposes using LLMs as judges to grade answers on SciEx. Experiments show a 0.948 Pearson correlation with expert grading (see the sketch after this entry), suggesting LLMs can serve as graders even though they are far from perfect at solving the exams themselves.
  • Source: SciEx Benchmark arXiv
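
Judge-versus-expert agreement numbers like the 0.948 above are plain Pearson correlations over per-answer grades. A minimal sketch with made-up grade values:

```python
# Agreement between an LLM judge and human experts via Pearson correlation.
# The grade values below are invented for illustration.
from scipy.stats import pearsonr

expert_grades = [0.90, 0.40, 0.75, 0.20, 0.60]  # hypothetical expert scores
judge_grades  = [0.85, 0.50, 0.70, 0.25, 0.55]  # hypothetical LLM-judge scores

r, p_value = pearsonr(expert_grades, judge_grades)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```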

MatSci-NLP

  • Overview: MatSci-NLP is a comprehensive benchmark designed to evaluate NLP models specifically in materials science. It covers a wide range of tasks such as predicting material properties and extracting information from scientific literature.
  • Key Features: The dataset is structured to encourage generalization across tasks, making it a cornerstone for assessing LLM capabilities in this field.
  • Source: MatSci-NLP

SciKnowEval: Evaluating Multi-level Scientific Knowledge

  • Description: SciKnowEval introduces a systematic framework to assess LLMs across five progressive levels of scientific knowledge, including memory, comprehension, and reasoning in fields like chemistry and physics.
  • Purpose: To establish a standard for benchmarking scientific knowledge in LLMs with a dataset comprising 70,000 scientific problems.
  • Relevance: Essential for comprehensive evaluation of LLMs' capabilities in scientific domains.
  • Source: arXiv:2406.09098

Fine-tuning Large Language Models for Chemical Text Mining

  • Description: This study explores the effectiveness of fine-tuning LLMs on complex chemical text mining tasks, such as compound entity recognition and reaction role labeling.
  • Purpose: To demonstrate that fine-tuned models significantly outperform prompt-only versions, showcasing their potential in extracting knowledge from intricate chemical texts.
  • Relevance: Useful for chemical research and knowledge extraction.
  • Source: Chem. Sci., 2024

Materials Science and Chemistry LLM Resources

  • Description: This GitHub repository compiles various benchmarks, datasets, and evaluations related to machine learning applications in materials science and chemistry.
  • Purpose: To provide a comprehensive collection for researchers interested in LLM applications in chemistry and materials science, including resources like ChemQA and ChemLLMBench.
  • Relevance: Important for advancing LLM applications in chemistry and materials science.
  • Source: GitHub: materials-data-facility/matchem-llm

ARB: Advanced Reasoning Benchmark for Large Language Models

  • Description: The Advanced Reasoning Benchmark (ARB) focuses on evaluating LLMs through advanced reasoning problems across multiple disciplines, including physics and chemistry.
  • Purpose: To assess models' capabilities in logical deduction and problem-solving, contributing to a better understanding of their inferential abilities.
  • Relevance: Vital for evaluating LLMs' reasoning capabilities in scientific domains.
  • Source: OpenReview: ARB

Physics GRE: Testing an LLM’s performance on the Physics GRE (for education)

  • Description: This study evaluates the performance of a popular LLM-based conversational service on the standardized Physics GRE examination, which covers undergraduate physics topics including mechanics, electricity and magnetism, thermodynamics and statistical mechanics, and quantum physics.
  • Purpose: To understand the risks and limitations of LLMs in the field of physics education.
  • Relevance: Important for evaluating LLMs as personalized assistants for physics students.
  • Source: arXiv

MaterialBENCH: Evaluating College-Level Materials Science Knowledge

  • Description: MaterialBENCH is a college-level benchmark dataset for LLMs in the materials science field, consisting of problems that assess knowledge equivalent to that of a materials science undergraduate.
  • Purpose: To evaluate LLMs' understanding of materials science concepts and their ability to solve problems in this domain.
  • Relevance: Useful for assessing LLMs' capabilities in materials science education and research.
  • Source: arXiv

Leveraging Large Language Models for Explaining Material Synthesis Mechanisms

  • Description: Large language models (LLMs) have shown potential in advancing materials discovery, particularly in automating and providing insights into experimental design and result interpretation. This study addresses the gap in evaluating LLMs' understanding of physicochemical principles and their reasoning capabilities regarding material synthesis mechanisms.
  • Purpose: To develop a benchmark for evaluating LLMs' ability to reason about synthesis mechanisms, focusing on gold nanoparticles (AuNPs) synthesis, and to create an AI assistant for explaining these mechanisms.
  • Dataset: The study introduces a benchmark consisting of 775 semi-manually created multiple-choice questions in the field of AuNPs synthesis.
  • Evaluation Metric: A confidence-based score (c-score) is derived from the model's output logits to quantify the selection probability of the correct answer (see the sketch after this entry).
  • Results: An AI assistant using retrieval-augmented generation (RAG) was developed, achieving a 10% improvement in accuracy over the leading model, Claude.
  • Relevance: This research underscores the potential of LLMs in recognizing scientific mechanisms and provides a valuable tool for aiding the exploration of synthesis methods. It also lays the foundation for developing highly efficient models utilizing material synthesis mechanisms.
  • Resources: Code and dataset are available at the GitHub repository: llm_for_mechanisms.
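
The paper defines the c-score precisely; the generic ingredient is turning the per-option logits into a probability for the correct option. Below is a minimal softmax sketch of that step, with invented logit values.

```python
# Turn per-option logits into selection probabilities (generic softmax step;
# the exact c-score definition is in the paper, and the logits here are invented).
import math

def option_probabilities(logits: list[float]) -> list[float]:
    """Numerically stable softmax over the logits of the candidate options."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.1, 0.3, -0.5, 0.8]          # hypothetical logits for options A-D
probs = option_probabilities(logits)
print("P(correct option A) =", round(probs[0], 3))
```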
