RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Introduction

RAGEval is a novel framework designed for automatically generating evaluation datasets to assess the knowledge usage ability of different Large Language Models (LLMs) in various Retrieval-Augmented Generation (RAG) scenarios. Unlike existing RAG benchmarks that focus on general knowledge, RAGEval enables the creation of domain-specific factual queries, allowing for a more nuanced evaluation of RAG systems across different vertical domains.

News

[2024/8/31] We have released our evaluation method in the rageval/evaluation folder.
[2024/8/25] We have released our DRAGONBall dataset in the dragonball_dataset folder. The RAGEval pipeline is coming soon!

Key Features

🏗️ Flexible Schema Generation: Summarizes a schema from seed documents to capture domain-specific knowledge structures.
🔄 Diverse Document Generation: Uses the schema to generate varied configurations and subsequently diverse documents across multiple domains.
❓ Comprehensive QA Pair Creation: Constructs question-answering pairs based on generated documents and configurations.
📊 Novel Evaluation Metrics: Introduces three new metrics (Completeness, Hallucination, and Irrelevance) for a more thorough assessment of RAG model responses.
🌐 Multi-Domain Support: Covers various domains, including the finance, legal, and medical sectors, in both Chinese and English.
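
The three metrics named above can be illustrated with a small sketch. It assumes a setup that this README does not spell out: each ground-truth answer is broken into key points, and a judge has labeled each key point as covered, contradicted, or ignored by the model's response. Under that assumption, Completeness, Hallucination, and Irrelevance are simply the fractions of key points in each class. The function name and the label strings are hypothetical and are not the repository's API.

```python
from collections import Counter

# Hypothetical label names (an assumption for this sketch);
# the released evaluation code may use different ones.
LABELS = ("covered", "contradicted", "ignored")


def keypoint_metrics(labels):
    """Return the fraction of key points that fall into each class.

    labels: one label per ground-truth key point.
    """
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {
        "completeness": counts["covered"] / total,
        "hallucination": counts["contradicted"] / total,
        "irrelevance": counts["ignored"] / total,
    }


# Four key points: two covered, one contradicted, one ignored.
scores = keypoint_metrics(["covered", "covered", "contradicted", "ignored"])
print(scores)
# → {'completeness': 0.5, 'hallucination': 0.25, 'irrelevance': 0.25}
```

Because every key point receives exactly one label in this sketch, the three scores always sum to 1.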

Components

Schema Summary: Extracts domain-specific knowledge structures from seed documents.
Document Generation: Creates diverse, factually rich documents based on the schema.
QRA (Question-Reference-Answer) Generation: Produces comprehensive evaluation triples.
DRAGONBall Dataset: A diverse RAG benchmark covering multiple domains and languages.
Evaluation Metrics: Novel metrics for assessing RAG system performance.
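
To make the Question-Reference-Answer triple concrete, here is a minimal sketch of how one such record could be represented and serialized. All field names, the default values, and the example text are hypothetical placeholders chosen for illustration; they are not taken from the DRAGONBall dataset files.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QRATriple:
    """One evaluation item. All field names are hypothetical."""
    question: str
    references: list[str]  # passages of the generated document that support the answer
    answer: str
    domain: str = "finance"  # placeholder default
    language: str = "en"     # placeholder default


def to_jsonl(triples):
    """Serialize triples as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(t), ensure_ascii=False) for t in triples)


# Placeholder content, invented purely for this example.
example = QRATriple(
    question="What was the reported revenue?",
    references=["The company reported revenue of 10 million."],
    answer="10 million.",
)
print(to_jsonl([example]))
```

Keeping the references next to the question and answer lets an evaluator check both whether retrieval found the supporting passages and whether the generated answer agrees with them.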

Usage

Usage instructions and the remaining code are coming soon!

Experiments

RAGEval has been used to benchmark various LLMs and RAG configurations:

Compared performance of 9 popular open/closed-source generation models
Evaluated different retrieval models (BM25, GTE-Large, BGE-Large, BGE-M3)
Analyzed impact of hyperparameters like TopK retrieval and chunk size
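
The two hyperparameters mentioned above, chunk size and TopK, can be illustrated with a small self-contained sketch: it splits a document into fixed-size word chunks and ranks them against a query with Okapi BM25. This is not the experiment code from the repository. The function names, the whitespace tokenizer (which would not work for Chinese text), and the defaults k1=1.5 and b=0.75 are assumptions made for the example.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bm25_top_k(query, chunks, k, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25; return the k best chunks."""
    if not chunks:
        return []
    tokenized = [c.lower().split() for c in chunks]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for chunk, tokens in zip(chunks, tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF with +1 inside the log, a common variant that stays non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))
    # Python's sort is stable, so ties keep their original chunk order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


chunks = [
    "the court ruled for the plaintiff",
    "the company reported higher revenue",
    "the patient was discharged",
]
print(bm25_top_k("company revenue", chunks, k=1))
# → ['the company reported higher revenue']
```

A larger chunk size gives each retrieved passage more context but fewer, coarser candidates, while a larger k passes more passages to the generator at the cost of more potentially irrelevant text.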

Results

GPT-4o showed the best overall performance, but open-source models like Llama3-8B-Instruct demonstrated competitive results.
Language-specific optimization in retrieval models proved crucial for performance.
Hyperparameter tuning revealed important trade-offs between retrieval accuracy and generation quality.

Conclusion

RAGEval provides a comprehensive framework for evaluating RAG systems in domain-specific scenarios, offering more nuanced insights than existing benchmarks. It highlights the potential for significant improvements in open-source models for RAG tasks.

Citation

Please cite the following paper if you find RAGEval helpful!

@misc{zhu2024ragevalscenariospecificrag,
      title={RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework}, 
      author={Kunlun Zhu and Yifan Luo and Dingling Xu and Ruobing Wang and Shi Yu and Shuo Wang and Yukun Yan and Zhenghao Liu and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2408.01262},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.01262}, 
}
