K-QA Benchmark

Dataset and evaluation code for K-QA benchmark.

This repository provides the dataset and evaluation code for K-QA, a comprehensive question-and-answer dataset for the real-world medical domain. You can find detailed information on the dataset curation and evaluation metric computation in our full paper.

To explore the results of 7 state-of-the-art models, check out this space.

The Dataset

K-QA consists of two portions - a medium-scale corpus of diverse, real-world medical inquiries written by patients on an online platform, and a subset with rigorous and granular answers, annotated by a team of in-house medical experts.

The dataset comprises 201 questions and answers, incorporating more than 1,589 ground-truth statements. Additionally, we provide 1,212 authentic patient questions.
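If you want to inspect the annotated portion programmatically, a minimal sketch is shown below. It assumes the annotated split ships as a JSON file of records with a Question field plus lists of must-have and nice-to-have statements; the file name and field names are illustrative assumptions, so check the files in this repository for the exact schema.

# Minimal sketch for browsing the annotated portion of K-QA.
# NOTE: the file name and field names below ("dataset/questions_w_answers.json",
# "Question", "Must_have", "Nice_to_have") are illustrative assumptions --
# consult the files shipped in this repository for the exact schema.
import json

with open("dataset/questions_w_answers.json", encoding="utf-8") as f:
    records = json.load(f)

print(len(records), "annotated questions")
for record in records[:3]:
    print(record["Question"])
    print("  must-have statements:", len(record.get("Must_have", [])))
    print("  nice-to-have statements:", len(record.get("Nice_to_have", [])))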


Evaluation Framework

To evaluate models against the K-QA benchmark, we propose a Natural Language Inference (NLI) framework. We consider a predicted answer as a premise and each gold statement derived from an annotated answer as a hypothesis. Intuitively, a correctly predicted answer should entail every gold statement. This formulation aims to quantify the extent to which the model's answer captures the semantic meaning of the gold answer, abstracting over the wording chosen by a particular expert annotator.
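As a concrete illustration of this formulation, the sketch below labels a single (predicted answer, gold statement) pair with an off-the-shelf NLI model from Hugging Face. This is only a stand-in for the GPT-4-based judgment used in the evaluation script; the checkpoint name is an assumption, not the repository's configuration.

# Illustrative only: an off-the-shelf MNLI model standing in for the GPT-4-based
# judge used by the evaluation script. The checkpoint name is an assumption.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

premise = "Lexapro is an SSRI used to treat depression and anxiety."   # predicted answer
hypothesis = "Lexapro belongs to the SSRI class of antidepressants."   # gold statement

# The pipeline accepts a premise/hypothesis pair as text / text_pair.
prediction = nli({"text": premise, "text_pair": hypothesis})[0]
print(prediction["label"])  # ENTAILMENT, NEUTRAL, or CONTRADICTION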

We define two evaluation metrics:

  • Hall (Hallucination rate) - measures how many of the gold statements contradict the model's answer.
  • Comp (Comprehensiveness) - measures how many of the clinically crucial claims are entailed by the predicted answer (see the sketch after this list).
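To make the two metrics concrete, here is a minimal sketch of how they can be aggregated from per-statement NLI labels. It assumes each gold statement has already been labeled Entailment / Neutral / Contradiction against the predicted answer and carries a flag marking it as clinically crucial ("must-have"); this mirrors the description above rather than the exact implementation in run_eval.py.

# Minimal sketch of aggregating per-statement NLI labels into Hall and Comp.
# Assumptions: each gold statement was labeled against the predicted answer, and
# "must_have" marks clinically crucial statements. This mirrors the description
# above, not the exact implementation in run_eval.py.
from typing import Dict, List

def hallucination_rate(statements: List[Dict]) -> int:
    """Hall: number of gold statements contradicted by the predicted answer."""
    return sum(1 for s in statements if s["label"] == "Contradiction")

def comprehensiveness(statements: List[Dict]) -> float:
    """Comp: fraction of must-have statements entailed by the predicted answer."""
    must_have = [s for s in statements if s["must_have"]]
    if not must_have:
        return 1.0
    return sum(1 for s in must_have if s["label"] == "Entailment") / len(must_have)

example = [
    {"label": "Entailment", "must_have": True},
    {"label": "Neutral", "must_have": True},
    {"label": "Contradiction", "must_have": False},
]
print(hallucination_rate(example))  # 1
print(comprehensiveness(example))   # 0.5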

The figure below provides an example illustrating the complete process of evaluating a generated answer and deriving these metrics.

[Figure: end-to-end evaluation of a generated answer and derivation of Hall and Comp]

How to Evaluate New Results

Organize Results in a Formatted Way

Before running the evaluation script, ensure that your results are stored in a JSON file with keys Question and result. Here's an example:

[
  {
    "Question": "Alright so I dont know much about Lexapro would you tell me more about it?",
    "result": "Lexapro is a medication that belongs to a class of drugs\ncalled selective serotonin reuptake inhibitors (SSRIs)"
  },
  {
    "Question": "Also what is the oral option to get rid of scabies?",
    "result": "The oral option to treat scabies is ivermectin, which is a prescription medication that is taken by mouth."
  }
]
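If your model's predictions are produced in Python, a small helper like the one below can serialize them into this format. The answer_question function is a hypothetical stand-in for your own inference call.

# Illustrative sketch: dump model predictions into the expected results format.
# answer_question is a hypothetical placeholder for your model's inference call.
import json

def answer_question(question: str) -> str:
    # Replace this stub with a call to your model.
    return "model answer for: " + question

questions = [
    "Alright so I dont know much about Lexapro would you tell me more about it?",
    "Also what is the oral option to get rid of scabies?",
]

results = [{"Question": q, "result": answer_question(q)} for q in questions]

with open("my_model_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)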

Align Your Evaluation Model with Physicians (optional)

Due to the variability of language models, we provide a dataset containing almost 400 statements paired with answers generated by various LLMs. Three different physicians labeled each statement with "Entailment," "Neutral," or "Contradiction." This dataset can be used to test different evaluation models, with the aim of finding the one most closely aligned with the physicians' majority vote.
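One straightforward way to use this set is to compare a candidate evaluation model's labels against the physicians' majority vote, for example via plain agreement. The sketch below assumes illustrative field names (physician_labels, model_label) that may not match the released file.

# Illustrative sketch: score a candidate evaluation model against the
# physicians' majority vote. The field names ("physician_labels",
# "model_label") are assumptions, not the released file's actual schema.
from collections import Counter
from typing import Dict, List

def majority_vote(labels: List[str]) -> str:
    return Counter(labels).most_common(1)[0][0]

def agreement(examples: List[Dict]) -> float:
    """Fraction of statements where the model matches the physicians' majority."""
    hits = sum(
        1 for ex in examples
        if ex["model_label"] == majority_vote(ex["physician_labels"])
    )
    return hits / len(examples)

example = [
    {"physician_labels": ["Entailment", "Entailment", "Neutral"], "model_label": "Entailment"},
    {"physician_labels": ["Contradiction", "Contradiction", "Neutral"], "model_label": "Neutral"},
]
print(agreement(example))  # 0.5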

Install Requirements

Clone the repository and run the following to install the dependencies and activate the virtual environment:

poetry install
poetry shell

Set the API keys for GPT-4, either for OpenAI or Azure (the original paper uses models hosted on Azure):

export OPENAI_API_KEY=""
export OPENAI_API_BASE=""

For Azure, also set the following keys:

export OPENAI_API_VERSION=""
export OPENAI_TYPE=""

Then, run the evaluation script as follows:

python run_eval.py \
    --result_file <path_to_result_file> \
    --version <your_version_name> \
    --on_openai  # Include this flag if using OpenAI

Cite Us

@misc{manes2024kqa,
      title={K-QA: A Real-World Medical Q&A Benchmark}, 
      author={Itay Manes and Naama Ronn and David Cohen and Ran Ilan Ber and Zehavi Horowitz-Kugler and Gabriel Stanovsky},
      year={2024},
      eprint={2401.14493},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
