Trilingual medical QA benchmark for English, Japanese, and Chinese.
This benchmark set is designed to evaluate Llama3-ELAINE-medLLM-8B and Llama3-ELAINE-medLLM-instruct-8B, lightweight English-Japanese-Chinese trilingual large language models for the biomedical domain, against various baseline LLMs. It contains QA benchmarks in the biomedical domain for the three languages and provides shell scripts and Python source files for evaluation. The evaluation method uses a consistent input format that is benchmark-independent but language-dependent. Any LLM supported by vLLM can be evaluated. We have checked models hosted on the Hugging Face Hub, but a locally stored checkpoint can also be used for evaluation.
Please abide by the original license of each benchmark: MedQA (MIT), MMLU (MIT), MedMCQA (MIT), PubMedQA (MIT), CMExam (Apache-2.0), JJSIMQA (CC-BY-NC-SA-4.0).
If the original QA benchmark contains training, validation, and testing splits, we used only the testing split.
- en (English)
- ja (Japanese)
- IgakuQA (./data/ja/IgakuQA/igakuqa.jsonl)
  - We concatenated the original exam data from 2018 to 2022 into a single JSONL file.
- JJSIMQA (./data/ja/JJSIMQA/jjsimqa.jsonl)
- DenQA (./data/ja/DenQA/denqa.jsonl)
  - It contains exam problems and their answers from the Japan National Dentistry Examination for the past two years (2023 and 2024), extracted from the official website of the Ministry of Health, Labour and Welfare in Japan (https://www.mhlw.go.jp/stf/english/index.html).
- zh (Chinese)
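The Japanese data files above are JSONL files (one JSON record per line); to take a quick look at the format of a single item, print the first record of a file:

head -n 1 ./data/ja/IgakuQA/igakuqa.jsonl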
Usage: sh scripts/experiment.sh json_test_file model_name_or_path output_dir language few_shots
json_test_file: file path to the input JSON file
model_name_or_path: model name or path
output_dir: root dir for results
language: the language of the benchmark (en, ja, or zh)
few_shots: number of shots for in-context learning
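For example, the following runs a 3-shot evaluation on IgakuQA with an instruction-tuned model from the Hugging Face Hub (the model name and the ./results output directory are illustrative placeholders; any vLLM-supported model or local checkpoint path can be substituted):

sh scripts/experiment.sh ./data/ja/IgakuQA/igakuqa.jsonl meta-llama/Meta-Llama-3-8B-Instruct ./results ja 3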
Usage: sh scripts/evaluate.sh pred_file answer_file
pred_file: file path to the prediction JSON file generated under output_dir
answer_file: file path to the original benchmark JSON file containing the gold answers
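For example, assuming the experiment above wrote its predictions under ./results (the exact file name produced under output_dir may differ):

sh scripts/evaluate.sh ./results/IgakuQA/predictions.json ./data/ja/IgakuQA/igakuqa.jsonl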
To change the LLM models used for the experiments, manually edit the script scripts/experiment_batch.sh.
The following snippet shows how to add an LLM model named 'xxxx/yyyy' to the experiments. To remove an LLM model from evaluation, comment out its declaration.
hf_model_path=xxxx/yyyy
model_path+=($hf_model_path)
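For example, to add an instruction-tuned Llama 3 model from the Hugging Face Hub (shown here only as an illustration; any vLLM-supported model works the same way):

hf_model_path=meta-llama/Meta-Llama-3-8B-Instruct
model_path+=($hf_model_path)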
The following script runs the experiments for all models on each of the benchmark datasets.
Usage: sh scripts/experiment_batch.sh output_dir few_shots
output_dir: root dir for results
few_shots: number of shots for in-context learning
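For example, to run 3-shot experiments for all configured models and store the results under ./results (an illustrative directory name):

sh scripts/experiment_batch.sh ./results 3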
Usage: sh scripts/evaluate_batch.sh output_dir
output_dir: root dir for results
Using the following commands, you can convert the evaluation log to a compact CSV file.
sh scripts/evaluate_batch.sh output_dir > log
cat log | grep @@ > results.csv
If you use this medical QA benchmark for evaluation, please cite the following paper.
@article{published_papers/48577159,
title = {ELAINE-medLLM: Lightweight English Japanese Chinese Trilingual Large Language Model for Bio-medical Domain (To appear)},
author = {Ken Yano and Zheheng Luo and Jimin Huang and Qianqian Xie and Masaki Asada and Chenhan Yuan and Kailai Yang and Makoto Miwa and Sophia Ananiadou and Jun'ichi Tsujii},
journal = {The 31st International Conference on Computational Linguistics (COLING 2025)},
month = {1},
year = {2025}
}