OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain


🔧 Installation

conda env create -f environment.yml
conda activate finrag
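
An optional sanity check that the environment is active; this assumes the environment ships Python 3 and vllm (used by the inference and evaluation steps below), so skip it if your setup differs:

# verify the core inference dependency imports correctly (assumption: vllm is in environment.yml)
python -c "import vllm; print(vllm.__version__)"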

🚀 Quick-Start

Notes:

  1. Run all commands from the OmniEval directory.
  2. We provide the following datasets:
    1. An auto-generated evaluation dataset.
    2. A constructed knowledge corpus.

1. Build the Retrieval Corpus

  1. If you use our provided knowledge corpus, do the following steps:

    1. download the provided knowledge corpus
    2. move "gen_datas_all-2048-256_nodes" and "gen_datas_all-2048-256_docs" into OmniEval/corpus/nodes_dir
  2. If you need to build your own knowledge corpus, do the following steps (a quick check of the expected output follows the script):

    # cd OmniEval
    # set the following parameters inside the bash file `corpus_builder/build_corpus.sh`:
    # DATA_ROOT="corpus" # the root to save all retrieval data information
    # DATA_DIR="few_shot" # the dirname of the source documents. You should first put your documents in the $DATA_ROOT/$DATA_DIR.
    # SAVE_NAME="few_shot_test" # the dirname to save the built knowledge corpus.
    # CHUNK_SIZE=2048
    # CHUNK_OVERLAP=256
    
    sh corpus_builder/build_corpus.sh
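
    If the build succeeds, the chunked corpus should appear under the corpus root with the ${SAVE_NAME}-${CHUNK_SIZE}-${CHUNK_OVERLAP} naming used in later steps. A hedged check (the "_nodes"/"_docs" suffixes are inferred from the provided corpus in option 1 and may differ in your build):

    # expected with the defaults above (assumption):
    #   few_shot_test-2048-256_nodes/  few_shot_test-2048-256_docs/
    ls corpus/nodes_dir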
    

2. Generate Evaluation Data Samples

  1. Generate evaluation instances
    # cd OmniEval
    # set the following parameters inside the bash file `data_generator/generate_data.sh`:
    # NODE_ROOT="corpus/nodes_dir" # the root to save the built knowledge corpus
    # SAVE_NAME="few_shot_test" # the save dir of your built document corpus, the same as in the above file.
    # CHUNK_SIZE=2048 # same as in the above file.
    # CHUNK_OVERLAP=256 # same as in the above file.
    # NODE_NAME="${SAVE_NAME}-${CHUNK_SIZE}-${CHUNK_OVERLAP}" # the save name of your built knowledge corpus (same as the one in build_corpus.sh)
    # API_KEY="${your_own_api_key}" # your own api key
    # DATA_SUFFIX="test" # a suffix identifying the current generation run.
    
    sh data_generator/generate_data.sh
    
  2. Filter (quality inspection) evaluation instances
    # cd OmniEval
    # set the following parameters inside the bash file `data_generator/generate_data_filter.sh`:
    # NODE_ROOT="corpus/nodes_dir" # the root to save the built knowledge corpus
    # SAVE_NAME="few_shot_test" 
    # CHUNK_SIZE=2048
    # CHUNK_OVERLAP=256
    # NODE_NAME="${SAVE_NAME}-${CHUNK_SIZE}-${CHUNK_OVERLAP}" # the save name of your built knowledge corpus (same as the one in build_corpus.sh)
    # API_KEY="" # your own api key
    # DATA_SUFFIX="test"
    # GEN_TYPE="filter" # when you need to do data quality inspection, set this parameter as "filter"
    
    sh data_generator/generate_data_filter.sh
    
    Note that after data quality inspection, there will be three more folders containing the filtered or cleaned generated data, namely "gen_datas_${DATA_SUFFIX}_clean", "gen_datas_${DATA_SUFFIX}_filter", and "gen_datas_${DATA_SUFFIX}_final". Only "gen_datas_${DATA_SUFFIX}_final" serves as the final set of generated evaluation data samples; the other two are intermediate outputs. If your run is interrupted before finishing, these two folders are also useful for resuming without regenerating everything (a quick check follows below).
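
    As a quick check after filtering, the three folders named above should all exist. A minimal sketch with DATA_SUFFIX="test" (their location inside the repository is not specified here, so we simply search for them):

    DATA_SUFFIX="test"
    for stage in clean filter final; do
        # only gen_datas_${DATA_SUFFIX}_final holds the final evaluation samples
        find . -type d -name "gen_datas_${DATA_SUFFIX}_${stage}"
    done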

3. Inference Your Models

  • Configure the API endpoints in utils/api_config.py if you need to use an API for inference (a quick reachability check follows the snippet below).
    api_config = {
      "llama3-70b-instruct": "http://localhost:8080",
      "llama3-8b-instruct": "http://localhost:8080",
    }
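
    Assuming the addresses above point to OpenAI-compatible servers (consistent with RAG_FRAMEWORK="vllm-openai" below), a quick, optional reachability check:

    # hedged check: OpenAI-compatible servers such as vLLM expose GET /v1/models
    curl http://localhost:8080/v1/models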
# cd OmniEval
# set the following parameters inside the bash file `evaluator/inference/rag_inference.sh`:
# NODE_ROOT="corpus/nodes_dir" # the root to save the built knowledge corpus
# SAVE_NAME="few_shot_test" 
# CHUNK_SIZE=2048
# CHUNK_OVERLAP=256
# NODE_NAME="${SAVE_NAME}-${CHUNK_SIZE}-${CHUNK_OVERLAP}" # the save name of your built knowledge corpus (same as the one in build_corpus.sh)
# INDEX_ROOT="corpus/flash_rag_index" # the root to save the built index of the knowledge corpus
# PRED_RESULT_ROOT="evaluator/pred_results" # the root to save the inference results
# RAG_FRAMEWORK="vllm-openai" # the framework of your RAG model
# INFER_BATCH=4 # the batch size of your inference
# PER_GPU_EMBEDDER_BATCH_SIZE=32 # the batch size of your embedder

generators=("llama3-70b-instruct" "qwen2-72b" "yi15-34b" "deepseek-v2-chat") # set your target evaluation generation model names.
retrievers=("bge-m3" "bge-large-zh" "e5-mistral-7b" "gte-qwen2-1.5b" "jina-zh") # set your target evaluation retriever model names.

sh evaluator/inference/rag_inference.sh 
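
After inference finishes, each evaluated retriever-generator pair should have produced outputs under PRED_RESULT_ROOT (the folder naming convention is detailed in step 4); a quick look:

# list the inference result folders
ls -R evaluator/pred_results/ | head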

4. Evaluate Your Models

(a) Model-based Evaluation

We propose five model-based metrics: accuracy, completeness, utilization, numerical_accuracy, and hallucination. We have trained two evaluators from Qwen2.5-7B with the LoRA strategy and human-annotated labels to implement model-based evaluation.

Note that the evaluator for hallucination is different from the other four. The model checkpoints can be loaded from the following Hugging Face links:

  1. The evaluator for the hallucination metric: OmniEval-HallucinationEvaluator.
  2. The evaluator for the other metrics: OmniEval-ModelEvaluator.

To implement model-based evaluation, you can first set up two vLLM servers (one per evaluator) with the following commands:

BASE_MODEL_PATH=${your_local_root_of_Qwen2.5-7B-Instruct}
SERVER_NAME="Qwen2.5-7B-Instruct"
LORA_NAME=Qwen2.5-7B-Evaluator # Qwen2.5-7B-Evaluator or Qwen2.5-7B-HalluEvaluator
LORA_PATH=${your_local_root_of_OmniEval-ModelEvaluator} # or ${your_local_root_of_OmniEval-HallucinationEvaluator}

python3 -m vllm.entrypoints.openai.api_server \
    --model ${BASE_MODEL_PATH} \
    --served-model-name $SERVER_NAME \
    --max-model-len=32768 \
    --dtype=bfloat16 \
    --tensor-parallel-size=1 \
    --max-lora-rank 32 \
    --enable-lora \
    --lora-modules $LORA_NAME=$LORA_PATH > vllm.server.log 2>&1 &
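
Once a server is up, the LoRA evaluator should be listed next to the base model. A minimal check, assuming the default vLLM port 8000 (the same address used for MODEL_EVAL_API/HALLU_EVAL_API below):

# both "Qwen2.5-7B-Instruct" and the LoRA module name should appear in the response
curl http://localhost:8000/v1/models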

Then conduct the model-based evaluation with the following commands (change the parameters inside the bash file).

# set the following parameters inside the bash file `evaluator/judgement/judger.sh`:
# NODE_ROOT="corpus/nodes_dir" # the root to save the built knowledge corpus

# SAVE_NAME="few_shot_test" 
# CHUNK_SIZE=2048
# CHUNK_OVERLAP=256
# NODE_NAME="${SAVE_NAME}-${CHUNK_SIZE}-${CHUNK_OVERLAP}" # the save name of your built knowledge corpus (same as the one in build_corpus.sh)
# EVAL_GENDATA_NAME="gen_datas_${your_suffix}" # folder name of generated evaluation dataset (that is same as the folder name of its inference results)

# retriever="bge-large-zh,gte-qwen2-1.5b" # set your target evaluation retriever model names, separated by ','. The corresponding name-to-path map should be set in configs/model2path.json (see the sketch after this block).
# generator_model="deepseek-v2-chat,yi15-34b" # set your target evaluation generation model names, separated by ','. The corresponding name-to-path map should be set in configs/model2path.json.
# retrieval_topk=5
# judge_type="model" # model or rule
# hallu_eval=0 # the hallucination evaluator is different from the other evaluators. 1 = do hallucination evaluation, 0 = do the other model-based metric evaluations. This parameter is only valid when judge_type="model"

# MODEL_EVAL_API=http://localhost:8000 # the access address of your vllm server of model-evaluator
# HALLU_EVAL_API=http://localhost:8000 # the access address of your vllm server of hallucination-evaluator

sh evaluator/judgement/judger.sh 
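
The retriever and generator names must map to local model paths in configs/model2path.json. Its exact schema is not shown here, so the following is only a sketch assuming a flat name-to-path mapping; adapt the paths to your own checkpoints before saving the file:

# print a hypothetical configs/model2path.json (paths are placeholders)
cat <<'EOF'
{
  "bge-large-zh": "/models/bge-large-zh",
  "gte-qwen2-1.5b": "/models/gte-Qwen2-1.5B-instruct",
  "deepseek-v2-chat": "/models/DeepSeek-V2-Chat",
  "yi15-34b": "/models/Yi-1.5-34B-Chat"
}
EOF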

(b) Rule-based Evaluation

# cd OmniEval
# Just set the value of ${judge_type} to "rule" in `evaluator/judgement/judger.sh`:
# judge_type="rule" # model or rule
sh evaluator/judgement/judger.sh

Note that when evaluating closed-book LLMs, you should set "--close_book" in "evaluator/judgement/judger.sh".

  1. Merge evaluation results. After running the rule-based and model-based (hallucination and others) evaluations, there will be three files inside the save dir of each evaluated model (evaluator/pred_results/gen_datas_${your_suffix}/${retriever}_TOP${k}-${generator}), i.e.:
    1. evaluation_result_rule.jsonl
    2. evaluation_result_model_qwen-eval-hallucination.jsonl
    3. evaluation_result_model_qwen-eval.jsonl
    Merge the latter two to obtain the final results of model-based evaluation with the following command:
    python evaluator/judgement/merge_pred_results.py # In the code, set root = "evaluator/pred_results/gen_datas_${your_suffix}"
    
    This command generates a fourth evaluation file, named "evaluation_result_model_qwen-eval-merge.jsonl", which contains the final model-based evaluation results.
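
    A quick check that all four result files exist for a given model combination; set the placeholder variables to match your own run:

    # placeholders follow the folder naming convention described above
    your_suffix="test"; retriever="bge-large-zh"; k=5; generator="deepseek-v2-chat"
    ls "evaluator/pred_results/gen_datas_${your_suffix}/${retriever}_TOP${k}-${generator}"/evaluation_result_*.jsonl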

🔖 License

OmniEval is licensed under the MIT License.

🌟 Citation

@misc{wang2024omnievalomnidirectionalautomaticrag,
      title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain}, 
      author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
      year={2024},
      eprint={2412.13018},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13018}, 
}
