M3-VQA is a novel pipeline for multilingual and multimodal biomedical VQA. It leverages translation for multilingual inputs, retrieval-augmented generation (RAG) for knowledge grounding, and in-context learning (ICL) with Chain-of-Thought prompting for accurate reasoning.
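At a high level, the pipeline translates the question, retrieves supporting knowledge, and prompts the model with in-context examples and Chain-of-Thought instructions. The sketch below illustrates these stages with hypothetical stub functions; it is not the repository's actual API.

```python
# Illustrative sketch of the M3-VQA stages; the function names and bodies are
# placeholder stubs, not the code used in this repository.

def translate_to_english(question: str) -> str:
    # Stage 1: translate a multilingual question into English (e.g., via Google Translate).
    return question  # stub

def retrieve_context(question: str, k: int = 3) -> list[str]:
    # Stage 2: RAG - fetch the k most relevant knowledge snippets (e.g., from a FAISS index).
    return []  # stub

def build_cot_prompt(question: str, context: list[str], examples: list[str]) -> str:
    # Stage 3: ICL + Chain-of-Thought - prepend in-context examples and retrieved
    # knowledge, then ask the model to reason step by step before answering.
    return "\n".join(examples + context + [question, "Let's think step by step."])

if __name__ == "__main__":
    q = translate_to_english("¿Qué anomalía muestra la radiografía?")
    prompt = build_cot_prompt(q, retrieve_context(q), examples=[])
    print(prompt)
```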
- Get a free API key for Google Translate and configure it locally; for details, refer to https://cloud.google.com/translate/docs/reference/rest/
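Once the service-account JSON is in place, you can optionally sanity-check the credentials with the google-cloud-translate client. This is only an assumed verification snippet (it requires the google-cloud-translate package and GOOGLE_APPLICATION_CREDENTIALS to be set); the pipeline's own translation code may differ.

```python
# Optional: verify Google Cloud Translation credentials.
# Assumes `pip install google-cloud-translate` and that
# GOOGLE_APPLICATION_CREDENTIALS points at your service-account JSON.
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("¿Dónde está la lesión?", target_language="en")
print(result["translatedText"])
```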
- Clone the repo
git clone https://github.com/AmuroEita/M3-VQA.git && cd M3-VQA
- Use Git LFS to pull the FAISS index files
git lfs pull
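The pulled .index files can be opened directly with FAISS. Below is a minimal sketch of loading an index and running a top-k search; the file name and the random query vector are placeholders, not the repository's actual retrieval code.

```python
# Load a FAISS index and run a top-k nearest-neighbour search.
# "example.index" is a placeholder file name.
import faiss
import numpy as np

index = faiss.read_index("example.index")
query = np.random.rand(1, index.d).astype("float32")  # index.d = stored embedding dimension
distances, ids = index.search(query, 3)               # top-3 neighbours
print(ids[0], distances[0])
```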
- Install required Python packages
pip install -r requirements.txt
- Enter your GPT API key in utils/GPT-API.txt
echo "${Your GPT API Key}" > utils/GPT-API.txt
- Prepare the datasets
cd data && python download_data.py
- Download the model via Hugging Face
huggingface-cli login
huggingface-cli download --resume-download unsloth/Llama-3.2-11B-Vision-Instruct --local-dir Llama-3.2-11B-Vision-Instruct
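Once downloaded, the checkpoint can be loaded from the local directory with transformers. A minimal sketch, assuming a recent transformers release with Mllama support and accelerate installed for device_map; this is not necessarily how the repository's inference code loads the model.

```python
# Load the locally downloaded Llama-3.2-11B-Vision-Instruct checkpoint.
# Requires a transformers version with Mllama support and accelerate for device_map.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_dir = "Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
print(model.config.model_type)  # expect "mllama"
```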
Use this mode to provide a specific question for Med-VQA to answer. The following example shows how to test the 11th question in the israel_local_processed.tsv dataset; the intermediate steps and the result are printed directly to the command line.
export GOOGLE_APPLICATION_CREDENTIALS="/your_path_to/google_translate.json" && python3 demo.py --dataset data/israel_local_processed.tsv --question_idx 11
Run inference on the entire dataset to compute accuracy. Results will be saved in the results folder for further analysis.
export GOOGLE_APPLICATION_CREDENTIALS="/your_path_to/google_translate.json" && python3 inference.py