Ragaaf (RAG assessment annotation free) (#157)
* minimized required fields/columns in user data

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* add bench-target as the prefix of output folder (#133)

Signed-off-by: Yingchun Guo <yingchun.guo@intel.com>
Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* remove examples. (#135)

Co-authored-by: root <root@idc708073.jf.intel.com>
Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* minor naming correction to maintain consistency

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* Add hyperlinks and paths validation. (#132)

Signed-off-by: ZePan110 <ze.pan@intel.com>
Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* added support for older version of ragas

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* testing automatic validation of ragas metrics

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* removing summarization_score metric

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* upgrading ragas from 0.1.16 to 0.1.19

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* adding annotation free RAG assessment

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* improved README

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* adding jsonlines to requirements

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fulfilled feature request - allow unit test case for RAGAAF

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removing extra inputs from evaluation

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* correcting class name for unit test

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

* test needs local ID

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>

---------

Signed-off-by: aasavari <aasavari.dhananjay.kakne@intel.com>
Signed-off-by: Yingchun Guo <yingchun.guo@intel.com>
Signed-off-by: ZePan110 <ze.pan@intel.com>
Co-authored-by: Ying Chun Guo <yingchun.guo@intel.com>
Co-authored-by: lkk <33276950+lkk12014402@users.noreply.github.com>
Co-authored-by: root <root@idc708073.jf.intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZePan110 <ze.pan@intel.com>
Co-authored-by: lvliang-intel <liang1.lv@intel.com>
7 people authored Oct 15, 2024
1 parent c3d55b3 commit 2413e70
Showing 17 changed files with 693 additions and 0 deletions.
66 changes: 66 additions & 0 deletions evals/metrics/ragaaf/README.md
@@ -0,0 +1,66 @@
# RAGAAF (RAG assessment - Annotation Free)

We introduce RAGAAF, Intel's easy-to-use, flexible, open-source and annotation-free RAG evaluation tool. It uses LLM-as-a-judge and benefits from Intel's Gaudi2 AI accelerator chips.

## Overview
### Data
RAGAAF is best suited for Long Form Question Answering (LFQA) datasets where you want to gauge the quality and factualness of the answer using an LLM as the judge. You can use benchmarking datasets or bring your own custom datasets. Make sure to set `field_map` so that AutoEval fields such as "question" map to the corresponding field in your dataset, such as "query".
> Note: To use benchmarking datasets, set the argument `data_mode=benchmarking`. Similarly, to use custom datasets, set `data_mode=local`.
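
As a rough illustration of what `field_map` does, the sketch below writes a tiny custom dataset as JSON Lines and renames its columns to the fields the evaluator expects. The record contents, the column names `query`, `response` and `passages`, and the helper `apply_field_map` are assumptions made for this example; the helper only mirrors the renaming and list-joining behaviour of `load_example` in `rag_dataset.py` and is not part of the package.

```python
import json
import os
import tempfile

# Hypothetical custom records; the column names are assumptions for this sketch.
records = [
    {
        "query": "What is RAG?",
        "response": "Retrieval-augmented generation.",
        "passages": ["RAG combines retrieval", "with generation."],
    },
]

# Keys are the fields the evaluator expects; values are your dataset's column names.
field_map = {"question": "query", "answer": "response", "context": "passages"}


def apply_field_map(record, field_map):
    """Rename fields per field_map; list values are joined with newlines."""
    example = {}
    for out_field, in_field in field_map.items():
        value = record[in_field]
        example[out_field] = "\n".join(value) if isinstance(value, list) else value
    return example


# Write one JSON object per line (JSON Lines), the layout read in local mode.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back and apply the mapping to every record.
with open(path, encoding="utf-8") as f:
    mapped = [apply_field_map(json.loads(line), field_map) for line in f]

print(mapped[0])
```

With a file like this, you would pass its path as `dataset`, set `data_mode="local"`, and reuse the same `field_map`.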
### Model
AutoEval can run in 3 evaluation modes -
1. `evaluation_mode="endpoint"` uses HuggingFace endpoint.
- We recommend launching a HuggingFace endpoint on Gaudi AI accelerator machines to ensure maximum usage and performance.
- To launch HF endpoint on Gaudi2, please follow the 2-step instructions here - [tgi-gaudi](https://github.com/huggingface/tgi-gaudi).
- Pass your endpoint url as `model_name` argument.
2. `evaluation_mode="openai"` uses openai backend.
- Please set your `openai_key` and your choice of model as `model_name` argument.
3. `evaluation_mode="local"` uses your local hardware.
- Set `hf_token` argument and set your favourite open-source model in `model_name` argument.
- GPU usage is prioritized after checking its availability. If no GPU is available, the model runs on CPU.
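
The arguments each mode needs can be summarized in a small lookup table. This is only a sketch of the list above: `REQUIRED_ARGS` and `missing_args` are hypothetical names invented for illustration, and the evaluator itself may validate its inputs differently.

```python
# Arguments each evaluation mode needs, taken from the list above.
# REQUIRED_ARGS and missing_args are hypothetical helpers, not part of the package.
REQUIRED_ARGS = {
    "endpoint": ["model_name"],             # model_name holds the endpoint URL
    "openai": ["model_name", "openai_key"],
    "local": ["model_name", "hf_token"],
}


def missing_args(evaluation_mode, **kwargs):
    """Return the names of required arguments that are absent or empty."""
    if evaluation_mode not in REQUIRED_ARGS:
        raise ValueError("Unknown evaluation_mode: {}".format(evaluation_mode))
    return [name for name in REQUIRED_ARGS[evaluation_mode] if not kwargs.get(name)]


print(missing_args("local", model_name="meta-llama/Llama-3.2-1B-Instruct"))
```

Running a check like this before constructing the evaluator makes a forgotten key or token visible early, rather than after a model download has started.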
## Metrics
AutoEval provides 4 metrics: factualness, correctness, relevance and readability. You can also bring your own metrics and grading scales. Remember to add your metric to the `evaluation_metrics` argument.
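
A custom metric can follow the same shape as the built-in ones in `prompt_templates`: a class with `name`, `required_columns` and a `template` that holds the definition and a 1 to 5 rubric. The sketch below is an assumption-laden illustration: the `Conciseness` metric and its rubric are invented, and the local `NAME2METRIC` dict stands in for the registry of the same name in `prompt_templates/__init__.py`. Whether registering there is all the package needs is not confirmed by this README.

```python
class Conciseness:
    """Hypothetical custom metric, shaped like the built-in metric classes."""

    name = "conciseness"
    required_columns = ["answer", "question"]
    template = """- Conciseness: Conciseness measures whether the answer resolves the question without unnecessary text.
- Score 1: The answer is empty or consists almost entirely of filler.
- Score 2: Most of the answer is repetition or padding.
- Score 3: About half of the answer is needed to resolve the question.
- Score 4: The answer is mostly to the point, with minor padding.
- Score 5: Every sentence in the answer contributes to resolving the question."""


# Stand-in for the package's registry, which maps metric names to metric classes.
NAME2METRIC = {}
NAME2METRIC[Conciseness.name] = Conciseness

# The metric name is then listed next to the built-in metrics.
evaluation_metrics = ["factualness", "relevance", "conciseness"]

print(NAME2METRIC["conciseness"].template.splitlines()[0])
```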
## Generation configuration
We provide recommended generation parameters after experimenting with different LLMs. If you'd like to edit them to your requirement, please set generation parameters in `GENERATION_CONFIG` in `run_eval.py`.

## Run using HF endpoint
```python3
# step 1 : choose your dataset -- local or benchmarking
dataset = "explodinggradients/ragas-wikiqa"
data_mode = "benchmarking"
field_map = {"question": "question", "answer": "generated_with_rag", "context": "context"}

# step 2 - choose your favourite LLM and hardware

# evaluation_mode = "openai"
# model_name = "gpt-4o"
# openai_key = "<add your openai key>"

# evaluation_mode = "endpoint"
# model_name = f"http://{host_ip}:{port}"

evaluation_mode = "local"
model_name = "meta-llama/Llama-3.2-1B-Instruct"
hf_token = "<add your HF token>"

# step 3 - choose metrics of your choice, you can also add custom metrics
evaluation_metrics = ["factualness", "relevance", "correctness", "readability"]

# step 4 - run evaluation
# import path assumed from this package's location (evals/metrics/ragaaf)
from evals.metrics.ragaaf import AnnotationFreeEvaluate

evaluator = AnnotationFreeEvaluate(
    dataset=dataset,
    data_mode=data_mode,
    field_map=field_map,
    evaluation_mode=evaluation_mode,
    model_name=model_name,
    evaluation_metrics=evaluation_metrics,
    # openai_key=openai_key,
    hf_token=hf_token,
    debug_mode=True,
)

responses = evaluator.measure()

for response in responses:
    print(response)
```
That's it! For troubleshooting, please submit an issue and we will get right on it.
10 changes: 10 additions & 0 deletions evals/metrics/ragaaf/__init__.py
@@ -0,0 +1,10 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#

from .run_eval import AnnotationFreeEvaluate

# __all__ must contain names as strings, not the objects themselves
__all__ = ["AnnotationFreeEvaluate"]
77 changes: 77 additions & 0 deletions evals/metrics/ragaaf/prompt_engineering.py
@@ -0,0 +1,77 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

from jinja2 import Template

from .prompt_templates import *
from .prompt_templates import NAME2METRIC


class Prompt:
    """Class to customize prompt template using user-defined list of metrics."""

    def __init__(self, metrics, input_fields):
        self.metrics = metrics
        self.input_fields = input_fields
        self.template = self.load_prompt_template()

    def create_grading_format(self):
        grading_format = (
            "You must ALWAYS provide every single one of the scores and reasonings in the following JSON format:"
        )
        grading_format += "\n" + "{" + "\n"
        content = []
        reasoning_prompt = "Reasoning for {}: [your one line step by step reasoning about the {} of the answer]"
        scoring_prompt = "Score for {}: [your score number for the {} of the answer]"
        for metric in self.metrics:
            reasoning = reasoning_prompt.format(metric, metric)
            score = scoring_prompt.format(metric, metric)
            content += (reasoning + "\n" + score,)
        grading_format += "\n\n".join(content)
        grading_format += "\n" + "}"
        return grading_format

    def create_closing_prompt(self):
        closing_prompt = ["Let's begin!"]
        for f in self.input_fields:
            closing_prompt += ("Provided {}:".format(f) + "\n" + "{{" + f + "}}",)
        return "\n\n".join(closing_prompt)

    def load_prompt_template(self):
        content = []
        for metric_name in ["opening_prompt"] + self.metrics:
            metric_instance = NAME2METRIC[metric_name]
            content += (metric_instance.template,)
        content += (self.create_grading_format(),)
        content += (self.create_closing_prompt(),)
        return Template("\n\n".join(content))

    def render_prompt(self, **kwargs) -> str:
        text = self.template.render(**kwargs)
        return text


if __name__ == "__main__":

    """Here, we test implementation of Prompt class."""

    # step 0 - user input
    metrics = ["factualness", "relevance", "correctness", "readability"]
    input_fields = ["question", "answer", "context"]

    # step 1 - load prompt using Prompt class
    prompt = Prompt(metrics=metrics, input_fields=input_fields)

    example = {
        "question": "Who is the wife of Barack Obama",
        "context": "Michelle Obama, wife of Barack Obama (former President of the United States of America) is an attorney. Barack and Michelle Obama have 2 daughters - Malia and Sasha",
        "answer": "Michelle Obama",
        "ground_truth": "Wife of Barack Obama is Michelle Obama",
    }

    # step 2 - render prompt with given inputs
    rendered_prompt = prompt.render_prompt(
        question=example["question"], answer=example["answer"], context=example["context"]
    )

    print(rendered_prompt)
21 changes: 21 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/__init__.py
@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

from .opening_prompt import OpeningPrompt

from .correctness import Correctness
from .factualness import Factualness
from .relevance import Relevance
from .readability import Readability

__all__ = ["opening_prompt", "correctness", "factualness", "relevance", "readability"]

NAME2METRIC = {}


def snake2camel(s):
    return "".join(x.capitalize() or "_" for x in s.split("_"))


for name in __all__:
    NAME2METRIC[name] = eval(snake2camel(name))
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/correctness.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Correctness:
    name = "correctness"
    required_columns = ["answer", "context", "question"]
    template = """- Correctness: correctness measures how accurately and comprehensively the answer resolves the problem posed in the question.
- Score 1: If the answer is empty string or something like "I do not know the answer", the correctness score is 1.
- Score 2: If the answer only addresses a small part of the question correctly or it is missing many critical steps/aspects of the answer or the answer is too short to fully answer the question or is missing many steps causing the answer to not fully address the problem described in the question, then the correctness score is 2.
- Score 3: The answer mostly addresses the question but one critical aspect/step is missing or is incorrect.
- Score 4: the answer mostly answers the question and covers all critical/main aspects of the question, but it’s missing important/necessary details about one or more aspects.
- Score 5: the answer correctly and completely addresses the query. It also covers important details about each step."""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/factualness.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Factualness:
    name = "factualness"
    required_columns = ["answer", "context"]
    template = """- Factualness: Factualness assesses how much of the provided answer is contained within the provided context. A higher score indicates that a higher proportion of claims present in the answer are present in or can be derived from the provided context.
- Score 1: the answer is completely hallucinated i.e. not contained in the context at all or there is no answer.
- Score 2: only a small part of the answer is contained in the context but most of it is imaginary/hallucinated or the meaning is completely changed from what is represented in the context.
- Score 3: Only about half of the answer is contained in the context. The rest of the answer is hallucinated or imaginary.
- Score 4: Most of the claims in the answer can be inferred from the provided context with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the answer are directly supported by the provided context, demonstrating high faithfulness to the provided context."""
21 changes: 21 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/opening_prompt.py
@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class OpeningPrompt:
    name = "opening_prompt"
    required_columns = []

    template = """Consider yourself as a helpful, truthful and impartial judge.
Your task:
You will be given an input consisting of a question, an answer and a context. Your task is to act as an impartial judge and provide a numerical score between 1 and 5 for each of the following metrics for the given answer.
Important rules for you while completing this task:
1. You MUST ALWAYS provide a score for every metric mentioned below.
2. Make sure to understand the definition of every metric fully before completing your task. Every metric is provided with a grading scale and rubric. You MUST use this grading scale and rubric to determine your score.
3. Ensure that your scores and reasoning for every metric are independent of each other e.g., score for factualness should not impact score for correctness and vice versa.
4. Base your grading decision only on the given inputs and do not speculate or hallucinate.
5. You must also provide reasoning for your score in a single sentence.
Your metric definitions along with grading scale and rubric:"""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/readability.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Readability:
    name = "readability"
    required_columns = ["answer"]
    template = """- Readability: Readability measures clarity and lucidity of the answer. Readability is measured solely based on the answer and it does not consider the question or the context.
- Score 1: if the answer is empty or "I do not know the answer" or completely unreadable or no meaningful information can be extracted from the answer, then the score is 1.
- Score 2: the answer is slightly readable, there are irrelevant symbols or HTML tags or repeated words, but it can roughly form a meaningful sentence that can cover some aspects of the answer.
- Score 3: Answer can be read but there are grammatical mistakes in the answer.
- Score 4: the answer is readable, but the readability and style can be improved to better appeal to the reader.
- Score 5: the answer is reader friendly and well written."""
13 changes: 13 additions & 0 deletions evals/metrics/ragaaf/prompt_templates/relevance.py
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


class Relevance:
    name = "relevance"
    required_columns = ["question", "answer"]
    template = """- Relevance: Relevance measures how well the answer relates to the question.
- Score 1: The answer doesn't mention anything about the question or is completely irrelevant to the question.
- Score 2: The answer only identifies the domain (e.g. cnvrg) mentioned in the question and provides information from the correct domain. But, the answer does not address the question itself and the point of the question is completely missed by it.
- Score 3: The answer correctly identifies the domain and essence of the question but the details in the answer are not relevant to the focus of the question.
- Score 4: The answer correctly identifies the domain mentioned in the question and the essence of the question, and stays consistent with both of them. But there is some part of the answer that is not relevant to the question or its topic or its essence. This irrelevant part is damaging the overall relevance of the answer.
- Score 5: The answer is completely relevant to the question and the details do not deviate from the essence of the question. There are no parts of the answer that are irrelevant or unnecessary for the given question."""
66 changes: 66 additions & 0 deletions evals/metrics/ragaaf/rag_dataset.py
@@ -0,0 +1,66 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

import jsonlines
from datasets import Dataset, load_dataset


class RAGDataset:
    """Dataset class to store data in HF datasets API format."""

    def __init__(self, dataset, field_map, mode, examples):
        self.dataset = dataset
        self.field_map = field_map
        assert mode in ["unit", "local", "benchmarking"], "mode can be either unit or local or benchmarking"
        self.mode = mode
        self.data = self.load_data(examples)
        self.validate_dataset()

    def load_example(self, obj):
        # rename fields per field_map; list values are joined into one string
        ex = {}
        for out_field, in_field in self.field_map.items():
            if isinstance(obj[in_field], list):
                ex[out_field] = "\n".join(obj[in_field])
            else:
                ex[out_field] = obj[in_field]
        return ex

    def load_local_data(self):
        assert os.path.exists(self.dataset), "There is no such file - {}".format(self.dataset)
        with jsonlines.open(self.dataset) as reader:
            data = [self.load_example(obj) for obj in reader]
        return Dataset.from_list(data)

    def load_unit_data(self, examples):
        assert len(examples) >= 1, "Please provide at least one example"
        data = [self.load_example(obj) for obj in examples]
        return Dataset.from_list(data)

    def load_benchmarking_data(self):
        dataset = load_dataset(self.dataset)["train"]
        data = [self.load_example(obj) for obj in dataset]
        return Dataset.from_list(data)

    def load_data(self, examples):
        if self.mode == "local":
            return self.load_local_data()
        elif self.mode == "unit":
            return self.load_unit_data(examples)
        else:
            return self.load_benchmarking_data()

    def validate_dataset(self):
        for i, example in enumerate(self.data):
            for out_field in self.field_map:
                assert out_field in example, "Example {} does not have {} field".format(i + 1, out_field)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)