llms-benchmarking

Here are 42 public repositories matching this topic...

ChemFoundationModels / ChemLLMBench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

nlp benchmark chemistry ai4science large-language-models llm llms-benchmarking

Updated Jul 26, 2024
Jupyter Notebook

lerogo / MMGenBench

Official repository of MMGenBench

mllm llms-benchmarking mmgenbench

Updated Nov 22, 2024
Python

bboylyg / BackdoorLLM

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

backdoor llms llms-benchmarking

Updated Sep 3, 2024
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Sep 12, 2024
Python

lamalab-org / chem-bench

How good are LLMs at chemistry?

benchmark machine-learning chemistry safety materials-science llm llms llms-benchmarking

Updated Dec 19, 2024
Jupyter Notebook

FSoft-AI4Code / XMainframe

Language Model for Mainframe Modernization

migration cobol mainframe code-summarization codellm llms-benchmarking

Updated Aug 23, 2024
Python

RaptorMai / CompBench

CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.

benchmark reasoning vision-and-language multimodal-deep-learning human-annotation foundation-models large-language-models llms vision-language-model multimodal-large-language-models evaluation-llms llms-benchmarking

Updated Aug 6, 2024
Jupyter Notebook

epfl-dlab / cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

ai competitive-programming agents competitive-programming-contests competitive-coding llms llms-reasoning llms-benchmarking aiflows

Updated Feb 12, 2024
Python

amazon-science / llm-code-preference

Training and Benchmarking LLMs for Code Preference.

code-generation llm-training llm-evaluation llms-benchmarking

Updated Nov 15, 2024
Python

declare-lab / resta

Restore safety in fine-tuned language models through task arithmetic

alignment safety alignment-algorithm llm llms llm-safety llms-benchmarking llm-safety-benchmark

Updated Mar 28, 2024
Python

Laoyu84 / 4onebench

A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.

agents large-language-models llms-benchmarking

Updated Nov 28, 2024
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Dec 3, 2024
Svelte

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

Paulescu / text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

machine-learning embeddings llms llms-benchmarking

Updated Apr 19, 2024
Python

logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

leaderboard llm chain-of-thought gen-ai llms-reasoning llms-benchmarking

Updated Oct 6, 2024
Jupyter Notebook

nachoDRT / MERIT-Dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area we are actively working on. This repository is actively maintained, and new features are continuously being added.

biases synthetic-dataset-generation layoutlm synthetic-dataset layoutxlm token-classification layoutlmv3 layoutlmv2 llms-benchmarking

Updated Sep 6, 2024
Python

SuperBruceJia / Awesome-Mixture-of-Experts

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Updated Sep 25, 2024

lechmazur / nyt-connections

Benchmark that evaluates LLMs using 436 NYT Connections puzzles

testing benchmark evaluation puzzles reasoning llm llms-benchmarking gpt-4o

Updated Nov 5, 2024
Python

cosmaadrian / romath

Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"

mathematics dataset romanian llms-benchmarking

Updated Sep 23, 2024
Python

microsoft / private-benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

platform benchmarking inference secure private mpc contamination trusted-execution-environment confidential-computing large-language-models llms-benchmarking private-benchmarking ezpc

Updated Sep 16, 2024
Python
