# Benchmarking Models

In this benchmarking study, we evaluate the performance of CACTUS across a diverse set of state-of-the-art language models. By testing CACTUS with these models, we aim to assess its robustness, versatility, and effectiveness in solving chemistry-related tasks.

## Models Tested

For this application we are benchmarking the following models:

| Model | `model_name` |
|---|---|
| llama2-7b | meta-llama/Llama-2-7b-hf |
| llama3-8b | meta-llama/Meta-Llama-3-8B |
| mistral-7b | mistralai/Mistral-7B-v0.1 |
| gemma-7b | google/gemma-7b-it |
| falcon-7b | tiiuae/falcon-7b |
| MPT-7b | mosaicml/mpt-7b |
| Phi-2 | microsoft/phi-2 |
| Phi-3 | microsoft/Phi-3-mini-4k-instruct |
| OLMo-1b | allenai/OLMo-1B |

These models were selected based on their strong performance in natural language tasks and their potential for adaptation to domain-specific applications.
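As a quick sanity check independent of CACTUS, any `model_name` from the table above can be loaded directly with vLLM (the backend used by the benchmark script below). The snippet is a minimal sketch, assuming vLLM is installed and the model weights are accessible via the Hugging Face Hub; it is not part of the benchmark pipeline itself:

```python
# Minimal sketch: verify that a benchmark model loads and generates with vLLM.
# Assumes vLLM is installed and you have access to the model weights.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-7b-it")  # any model_name from the table above
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the molecular weight of caffeine?"], params)
print(outputs[0].outputs[0].text)
```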

## Benchmark Dataset

To evaluate the performance of CACTUS with each model, we created a comprehensive benchmark dataset covering a wide range of chemistry-related questions. This dataset can be generated with the `benchmark_creation.py` script provided in this folder. For the preprint, we used the following benchmark datasets, which are included here for reproducibility:

| File | Description |
|---|---|
| `SingleStepQuestionList_Qualitative.csv` | The 500 qualitative questions used in the manuscript |
| `SingleStepQuestionList_Quantitative.csv` | The 500 quantitative questions used in the manuscript |
| `SingleStepQuestionList_Combined.csv` | A concatenation of the qualitative and quantitative questions (1,000 questions in total) |
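The question lists are plain CSV files and can be inspected directly, for example with pandas. This is a minimal sketch; the exact column names are not guaranteed here, so check `df.columns` against the actual file:

```python
# Minimal sketch: inspect the combined benchmark questions with pandas.
# Column names are not listed here; check df.columns for the actual schema.
import pandas as pd

df = pd.read_csv("SingleStepQuestionList_Combined.csv")
print(df.shape)    # expect 1000 rows (500 qualitative + 500 quantitative)
print(df.columns)  # inspect the actual column names
print(df.head())
```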

The other files contained in this folder are:

| File | Description |
|---|---|
| `benchmark_creation.py` | The script used to generate the above benchmark question lists |
| `compound_list.csv` | A collection of 1000 molecules from PubChem used for benchmark creation |
| `run_benchmark.py` | A script to run a benchmark using CACTUS |
| `plot_creation.py` | A series of methods for generating the plots used in the manuscript |
| `plot_runtime.py` | A dedicated Python script to generate the accuracy-vs-time plot in the manuscript |
| `benchmark_files/` | A directory of all the resulting benchmark files used in the manuscript |
| `benchmark_files/Data_Analysis.ipynb` | A notebook detailing the dataset and how to calculate the expected answers from it |
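The authoritative description of the output schema and expected-answer calculation is in `benchmark_files/Data_Analysis.ipynb`. As a rough illustration only, a per-model output CSV could be summarized along these lines; the column names `question_type` and `correct` below are hypothetical placeholders, not the actual schema:

```python
# Rough illustration only: summarize a benchmark output CSV by question type.
# "question_type" and "correct" are hypothetical column names; confirm the real
# schema in benchmark_files/Data_Analysis.ipynb before relying on this.
import pandas as pd

results = pd.read_csv("output.csv")
accuracy_by_type = results.groupby("question_type")["correct"].mean()
print(accuracy_by_type)
```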

To run the benchmark using the `run_benchmark.py` script, you can do the following:

```bash
python run_benchmark.py --model_name "google/gemma-7b" --model-type "vllm" --input-csv "SingleStepQuestionList_Combined.csv" --output-csv "output.csv"
```

Optionally, you can include `--cache-dir` and `--log-file`; these default to `None` and `my_log.txt`, respectively.
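To benchmark several models in sequence, the same command can be driven from a small Python wrapper. This is a minimal sketch that simply shells out to `run_benchmark.py` with the flags shown above; verify the flag spellings against `python run_benchmark.py --help` on your installation:

```python
# Minimal sketch: sweep the benchmark over several models by invoking
# run_benchmark.py with the flags documented above.
import subprocess

models = {
    "gemma-7b": "google/gemma-7b-it",
    "mistral-7b": "mistralai/Mistral-7B-v0.1",
}

for label, model_name in models.items():
    subprocess.run(
        [
            "python", "run_benchmark.py",
            "--model_name", model_name,
            "--model-type", "vllm",
            "--input-csv", "SingleStepQuestionList_Combined.csv",
            "--output-csv", f"output_{label}.csv",
            "--log-file", f"benchmark_{label}.log",
        ],
        check=True,  # stop the sweep if any run fails
    )
```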

## Benchmarking Results

Comparison of model performance among 7B parameter models using minimal and domain-specific prompts. The Gemma-7b and Mistral-7b models demonstrate strong performance and adaptability across prompting strategies, highlighting their potential for widespread applicability in various computational settings, from high-performance clusters to more modest research setups.
