Trilingual medical QA benchmark for English, Japanese, and Chinese.
This benchmark set is designed to evaluate Llama3-ELAINE-medLLM-8B and Llama3-ELAINE-medLLM-instruct-8B, lightweight English-Japanese-Chinese trilingual large language models for the biomedical domain, against various baseline LLMs. It contains QA benchmarks in the biomedical domain for the three languages and provides shell scripts and Python source files for evaluation. The evaluation method uses a consistent input format that is benchmark-independent but language-dependent. Any LLM supported by vLLM can be evaluated. We have checked models hosted on the Hugging Face Hub, but a locally stored checkpoint can also be used for evaluation.
Please abide by the original license of each benchmark: MedQA (MIT), MMLU (MIT), MedMCQA (MIT), PubMedQA (MIT), CMExam (Apache-2.0), JJSIMQA (CC-BY-NC-SA-4.0).
If the original QA benchmark contains training, validation, and testing splits, we used only the testing split.
- en (English)
- ja (Japanese)
- IgakuQA (./data/ja/IgakuQA/igakuqa.jsonl)
  - We concatenated the original exam data from 2018 to 2022 into a single JSONL file.
- JJSIMQA (./data/ja/JJSIMQA/jjsimqa.jsonl)
- DenQA (./data/ja/DenQA/denqa.jsonl)
  - It contains exam problems and their answers from the Japan National Dentistry Examination for the past two years (2023 and 2024), extracted from the official website of the Ministry of Health, Labour and Welfare in Japan (https://www.mhlw.go.jp/stf/english/index.html).
- zh (Chinese)
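The Japanese data files above are JSONL files (one JSON record per line); to take a quick look at the format of a single item, print the first record of a file:

head -n 1 ./data/ja/IgakuQA/igakuqa.jsonl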
Usage: sh scripts/experiment.sh json_test_file model_name_or_path output_dir language few_shots
json_test_file: file path to the input JSON file
model_name_or_path: model name or path
output_dir: root dir for results
language: the language of the benchmark (en, ja, or zh)
few_shots: number of shots for in-context learning
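For example, the following runs a 3-shot evaluation on IgakuQA with an instruction-tuned model from the Hugging Face Hub (the model name and the ./results output directory are illustrative placeholders; any vLLM-supported model or local checkpoint path can be substituted):

sh scripts/experiment.sh ./data/ja/IgakuQA/igakuqa.jsonl meta-llama/Meta-Llama-3-8B-Instruct ./results ja 3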
Usage: sh scripts/evaluate.sh pred_file answer_file
pred_file: file path to the prediction JSON file generated under output_dir
answer_file: file path to the original benchmark JSON file containing the gold answers
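For example, assuming the experiment above wrote its predictions under ./results (the exact file name produced under output_dir may differ):

sh scripts/evaluate.sh ./results/IgakuQA/predictions.json ./data/ja/IgakuQA/igakuqa.jsonl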
To change the LLM models used for the experiments, manually edit the script scripts/experiment_batch.sh.
The following snippet shows how to add an LLM model named 'xxxx/yyyy' to the experiments. To remove an LLM model from evaluation, comment out its declaration.
hf_model_path=xxxx/yyyy
model_path+=($hf_model_path)
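For example, to add an instruction-tuned Llama 3 model from the Hugging Face Hub (shown here only as an illustration; any vLLM-supported model works the same way):

hf_model_path=meta-llama/Meta-Llama-3-8B-Instruct
model_path+=($hf_model_path)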
The following script runs the experiments for all models on each of the benchmark datasets.
Usage: sh scripts/experiment_batch.sh output_dir few_shots
output_dir: root dir for results
few_shots: number of shots for in-context learning
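For example, to run 3-shot experiments for all configured models and store the results under ./results (an illustrative directory name):

sh scripts/experiment_batch.sh ./results 3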
Usage: sh scripts/evaluate_batch.sh output_dir
output_dir: root dir for results
Using the following commands, you can convert the evaluation log to a compact CSV file.
sh scripts/evaluate_batch.sh output_dir > log
cat log | grep @@ > results.csv
If you use this medical QA benchmark for evaluation, please cite the following paper.
@article{published_papers/48577159,
title = {ELAINE-medLLM: Lightweight English Japanese Chinese Trilingual Large Language Model for Bio-medical Domain (To appear)},
author = {Ken Yano and Zheheng Luo and Jimin Huang and Qianqian Xie and Masaki Asada and Chenhan Yuan and Kailai Yang and Makoto Miwa and Sophia Ananiadou and Jun'ichi Tsujii},
journal = {The 31st International Conference on Computational Linguistics (COLING 2025)},
month = {1},
year = {2025}
}