Dataset and evaluation code for the K-QA benchmark.
This repository provides the dataset and evaluation code for K-QA, a comprehensive Question and Answer dataset in real-world medicine. You can find detailed information on the dataset curation and the computation of the evaluation metrics in our full paper.
To explore the results of 7 state-of-the-art models, check out this space.
K-QA consists of two portions: a medium-scale corpus of diverse, real-world medical inquiries written by patients on an online platform, and a subset of questions with rigorous and granular answers, annotated by a team of in-house medical experts.
The dataset comprises 201 questions and answers, incorporating more than 1,589 ground-truth statements. Additionally, we provide 1,212 authentic patient questions.
To evaluate models against the K-QA benchmark, we propose a Natural Language Inference (NLI) framework: we treat a predicted answer as a premise and each gold statement derived from an annotated answer as a hypothesis. Intuitively, a correctly predicted answer should entail every gold statement. This formulation quantifies the extent to which the model's answer captures the semantic meaning of the gold answer, abstracting over the wording chosen by a particular expert annotator.
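As a quick illustration of this premise/hypothesis framing, here is a minimal sketch using an off-the-shelf NLI model from Hugging Face. The checkpoint name and the example texts are placeholders, and note that the evaluation script in this repository uses GPT-4 as the entailment judge rather than this model.

```python
# Minimal sketch of the premise/hypothesis framing (illustrative only; the
# repository's evaluation uses GPT-4 as the judge, not this model).
from transformers import pipeline

# Any off-the-shelf NLI checkpoint works for this sketch; this one is an example.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

# Premise: the model's generated answer.
predicted_answer = "Lexapro is an SSRI commonly prescribed for depression and anxiety."
# Hypothesis: one gold statement from the annotated answer.
gold_statement = "Lexapro belongs to the SSRI class of medications."

# A correct answer should be classified as ENTAILMENT for every gold statement.
result = nli({"text": predicted_answer, "text_pair": gold_statement})
print(result)  # expected label: ENTAILMENT (exact output format depends on the transformers version)
```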
We report two evaluation metrics:
- Hall (Hallucination rate) - measures how many of the gold statements are contradicted by the model's answer.
- Comp (Comprehensiveness) - measures how many of the clinically crucial statements are entailed by the predicted answer.
The figure below provides an example illustrating the complete process of evaluating a generated answer and deriving these metrics.
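For a single question, the two metrics can be derived from per-statement NLI labels roughly as sketched below. The field names (label, is_must_have) and the aggregation details are illustrative; the official computation is the one implemented in run_eval.py and described in the paper.

```python
# Rough sketch of deriving Hall and Comp for one question from NLI labels.
# Field names are illustrative, not the repository's actual schema.

def hall_and_comp(labeled_statements):
    """labeled_statements: dicts with an NLI 'label' ('Entailment' / 'Neutral' /
    'Contradiction') and an 'is_must_have' flag marking clinically crucial statements."""
    # Hall: number of gold statements contradicted by the predicted answer.
    hall = sum(s["label"] == "Contradiction" for s in labeled_statements)

    # Comp: share of clinically crucial statements entailed by the predicted answer.
    must_have = [s for s in labeled_statements if s["is_must_have"]]
    comp = (
        sum(s["label"] == "Entailment" for s in must_have) / len(must_have)
        if must_have
        else 1.0
    )
    return hall, comp

example = [
    {"label": "Entailment", "is_must_have": True},
    {"label": "Neutral", "is_must_have": False},
    {"label": "Contradiction", "is_must_have": True},
]
print(hall_and_comp(example))  # (1, 0.5)
```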
Before running the evaluation script, ensure that your results are stored in a JSON file with the keys `Question` and `result`. Here's an example:
[
  {
    "Question": "Alright so I dont know much about Lexapro would you tell me more about it?",
    "result": "Lexapro is a medication that belongs to a class of drugs\ncalled selective serotonin reuptake inhibitors (SSRIs)"
  },
  {
    "Question": "Also what is the oral option to get rid of scabies?",
    "result": "The oral option to treat scabies is ivermectin, which is a prescription medication that is taken by mouth."
  }
]
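One way to produce such a file is sketched below; the answer_question helper and the output path results.json are placeholders for your own model and file naming.

```python
# Sketch: write model answers in the JSON format expected by the evaluation script.
import json

def answer_question(question: str) -> str:
    """Placeholder: call your model here and return its generated answer."""
    return "..."

questions = [
    "Alright so I dont know much about Lexapro would you tell me more about it?",
    "Also what is the oral option to get rid of scabies?",
]

results = [{"Question": q, "result": answer_question(q)} for q in questions]

# "results.json" is an arbitrary name; pass it later via --result_file.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```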
Because automatic entailment judgments can vary across language models, we also provide a dataset containing almost 400 statements drawn from answers generated by various LLMs. Three different physicians labeled each statement as "Entailment," "Neutral," or "Contradiction." This dataset can be used to test different evaluation models and identify the one most closely aligned with the physicians' majority vote.
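For example, agreement with the physicians' majority vote could be computed roughly as follows; the field names physician_labels and model_label are illustrative and should be adapted to the released file's actual schema.

```python
# Sketch: compare an automatic entailment judge against the physicians' majority vote.
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the physician annotations."""
    return Counter(labels).most_common(1)[0][0]

# Illustrative rows; the released dataset contains almost 400 labeled statements.
annotated = [
    {"physician_labels": ["Entailment", "Entailment", "Neutral"], "model_label": "Entailment"},
    {"physician_labels": ["Contradiction", "Contradiction", "Neutral"], "model_label": "Neutral"},
]

agreement = sum(
    row["model_label"] == majority_vote(row["physician_labels"]) for row in annotated
) / len(annotated)
print(f"Agreement with physician majority vote: {agreement:.2%}")
```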
Clone the repository, then run the following to install the dependencies and activate the virtual environment:
poetry install
poetry shell
Set keys for GPT-4, either for OpenAI or for Azure (the original paper uses models hosted on Azure).
export OPENAI_API_KEY=""
export OPENAI_API_BASE=""
For Azure, also set the following keys:
export OPENAI_API_VERSION=""
export OPENAI_TYPE=""
Then, run the evaluation script as follows:
python run_eval.py \
  --result_file <path_to_result_file> \
  --version <your_version_name> \
  --on_openai  # include this flag only when using the OpenAI API; omit it for Azure
If you find K-QA useful in your research, please cite our paper:

@misc{manes2024kqa,
      title={K-QA: A Real-World Medical Q&A Benchmark},
      author={Itay Manes and Naama Ronn and David Cohen and Ran Ilan Ber and Zehavi Horowitz-Kugler and Gabriel Stanovsky},
      year={2024},
      eprint={2401.14493},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}