In this benchmarking study, we evaluate the performance of CACTUS across a diverse set of state-of-the-art language models. By testing CACTUS with these models, we aim to assess its robustness, versatility, and effectiveness in solving chemistry-related tasks.
For this application we are benchmarking the following models:
| Model | `model_name` |
|---|---|
| llama2-7b | `meta-llama/Llama-2-7b-hf` |
| llama3-8b | `meta-llama/Meta-Llama-3-8B` |
| mistral-7b | `mistralai/Mistral-7B-v0.1` |
| gemma-7b | `google/gemma-7b-it` |
| falcon-7b | `tiiuae/falcon-7b` |
| MPT-7b | `mosaicml/mpt-7b` |
| Phi-2 | `microsoft/phi-2` |
| Phi-3 | `microsoft/Phi-3-mini-4k-instruct` |
| OLMo-1b | `allenai/OLMo-1B` |
These models were selected based on their strong performance in natural language tasks and their potential for adaptation to domain-specific applications.
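The benchmark drives these models through vLLM (see the run command below). As a quick sanity check that a given model loads and generates outside of CACTUS, a minimal sketch along the following lines can be used; the prompt and sampling settings are illustrative only and are not part of the CACTUS pipeline:

```python
# Minimal sketch: load one of the benchmarked models with vLLM and generate a
# single completion. Assumes the `vllm` package is installed and a GPU is available.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # any model_name from the table above
params = SamplingParams(temperature=0.0, max_tokens=128)

# Illustrative prompt only; CACTUS adds its own tool-calling prompt scaffolding.
outputs = llm.generate(["What is the canonical SMILES for caffeine?"], params)
print(outputs[0].outputs[0].text)
```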
To evaluate the performance of CACTUS with each model, we created a comprehensive benchmark dataset covering a wide range of chemistry-related questions. This dataset can be generated using the `benchmark_creation.py` script provided in this folder. For the preprint we use the following benchmark datasets, which are included here for reproducibility:
| File | Description |
|---|---|
| `SingleStepQuestionList_Qualitative.csv` | The 500 Qualitative questions used in the manuscript |
| `SingleStepQuestionList_Quantitative.csv` | The 500 Quantitative questions used in the manuscript |
| `SingleStepQuestionList_Combined.csv` | A concatenation of the Qualitative and Quantitative questions (1000 questions in total) |
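Each of these files is a plain CSV, so it is easy to inspect before running a benchmark. A minimal sketch with pandas (the exact column names are not assumed here; check the header row of the file you load):

```python
# Quick inspection of a benchmark question list; adjust the filename as needed.
import pandas as pd

questions = pd.read_csv("SingleStepQuestionList_Combined.csv")
print(questions.shape)    # the combined file should contain 1000 rows
print(questions.columns)  # check the actual column names before further processing
print(questions.head())
```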
The other files contained in this folder are:
| File | Description |
|---|---|
| `benchmark_creation.py` | The script used to generate the above benchmark question lists |
| `compound_list.csv` | A collection of 1000 molecules found on PubChem for benchmark creation |
| `run_benchmark.py` | A script to run a benchmark using CACTUS |
| `plot_creation.py` | A series of methods for generating the plots used in the manuscript |
| `plot_runtime.py` | A dedicated Python script to generate the accuracy vs. time plot in the manuscript |
| `benchmark_files/` | A directory of all the resulting benchmark files used in the manuscript |
| `benchmark_files/Data_Analysis.ipynb` | A notebook detailing the dataset used, as well as how to calculate the expected answers from it |
To run the benchmark using the `run_benchmark.py` script, you can do the following:

```bash
python run_benchmark.py --model_name "google/gemma-7b" --model-type "vllm" --input-csv "SingleStepQuestionList_Combined.csv" --output-csv "output.csv"
```
Optionally, you can include a `--cache-dir` and a `--log-file`, but these default to `None` and `my_log.txt`, respectively.
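For example, to cache model weights in a local directory and write to a custom log file (the directory and file names below are purely illustrative):

```bash
python run_benchmark.py --model_name "google/gemma-7b" --model-type "vllm" --input-csv "SingleStepQuestionList_Combined.csv" --output-csv "output.csv" --cache-dir "./model_cache" --log-file "gemma_benchmark.log"
```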
*Figure: Comparison of model performance among 7B-parameter models using minimal and domain-specific prompts. The Gemma-7b and Mistral-7b models demonstrate strong performance and adaptability across prompting strategies, highlighting their potential for widespread applicability in various computational settings, from high-performance clusters to more modest research setups.*