| 📑 Paper | 🤗 HuggingFace Repo | 🐱 GitHub Repo |
Fanqi Wan†, Xinting Huang‡, Leyang Cui‡, Xiaojun Quan†, Wei Bi‡, Shuming Shi‡
† Sun Yat-sen University, ‡ Tencent AI Lab
- Jan 19, 2024: 🔥 We're excited to announce that the KCA datasets for open-book tuning, discard tuning, and refusal tuning are now available on 🤗 HuggingFace Datasets, and the fine-tuned models are available on 🤗 HuggingFace Models. Happy exploring!
- Overview
- Data Release
- Model Release
- Knowledge Inconsistency Detection
- Knowledge Inconsistency Calibration
- Evaluation
- Citation
In this study, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs.
Specifically, we propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge and thereby probe the knowledge boundaries of foundation LLMs. To calibrate the knowledge inconsistencies detected in the alignment data, KCA applies one of several dedicated strategies to the inconsistent instances: (i) open-book tuning, (ii) discard tuning, and (iii) refusal tuning.
We release the KCA datasets for open-book tuning, discard tuning, and refusal tuning in ./data/processed_results. Please note that each dataset corresponds to a specific tuning method and foundation LLM. Each dataset is a structured JSON file consisting of a list of dictionaries, where each dictionary contains multiple fields. Below is an example:
{
"id": "...", # Data index.
"conversations": [
{
"from": "human",
"value": "..." # Human instruction.
},
{
"from": "gpt",
"value": "..." # LLM response.
}
],
"class": "...", # Three categories: "no_need_fact" (the instruction does not require knowledge), "need_and_have_fact" (the instruction requires knowledge and the foundation LLM understands the generated knowledge), "need_and_have_no_fact" (the instruction requires knowledge but the foundation LLM does not understand the generated knowledge).
"analysis": "...", # Analysis for whether the instruction requires knowledge.
"knowledge": "..." # Generated knowledge.
}
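The released files can be inspected with a few lines of Python, for example to count the three categories and collect the inconsistent instances. This is a minimal sketch; the file name below is illustrative and follows the naming pattern used in the training commands later in this README.
import json
from collections import Counter

# Load one released KCA dataset (illustrative file name following the
# ${MODEL_NAME}_shot-5_${DATA_NAME}.json pattern used below).
with open("./data/processed_results/llama-2-7b_shot-5_wizardlm_trainset_sorry.json", "r") as f:
    dataset = json.load(f)

# Count the three knowledge-requirement categories.
print(Counter(example["class"] for example in dataset))

# Collect the inconsistent instances, i.e., the instruction requires knowledge
# that the foundation LLM does not understand.
inconsistent = [ex for ex in dataset if ex["class"] == "need_and_have_no_fact"]
print(f"{len(inconsistent)} inconsistent instances")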
We show the percentage (%) of the consistent subset (the instruction requires knowledge and the foundation LLM understands the generated knowledge) and the inconsistent subset (the instruction requires knowledge but the foundation LLM does not understand the generated knowledge) across various foundation LLMs on different training and evaluation datasets as follows:
We release the KCA models fine-tuned with different tuning methods on 🤗 HuggingFace Models. Please note that each model corresponds to a specific tuning method and a foundation LLM.
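For a quick try of a released checkpoint, it can be loaded with the standard transformers API. This is a minimal sketch: the model path is a placeholder for the corresponding 🤗 Hub repository or a local ./training_results checkpoint, and since the models are fine-tuned with the Vicuna conversation template (see the training command below), wrapping the prompt in that template generally gives better results.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: replace with the 🤗 Hub repository name of a released KCA model
# or a local checkpoint directory such as ./training_results/...
model_path = "path/to/kca-model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "What causes the northern lights?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))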
To facilitate a comprehensive evaluation, we conduct both LLM-based judgment and metric-based judgment. For LLM-based judgment, we measure the hallucination rate on the LIMAEval, VicunaEval, WizardLMEval, and TruthfulQA benchmarks with GPT-4. For metric-based judgment, we compute ROUGE-1, ROUGE-2, and ROUGE-L scores on the MS MARCO and ACI-Bench benchmarks.
The evaluation results of hallucination rate (%) on four public benchmarks for general instruction-following and truthful question answering with GPT-4 judgment are shown as follows, with a lower rate indicating better performance:
The evaluation results of ROUGE-1, ROUGE-2, and ROUGE-L on two public benchmarks for search-and-retrieval and clinical report generation are shown as follows, with a higher score indicating better performance:
The evaluation results of the helpful score on four public benchmarks for general instruction-following and truthful question answering with GPT-4 judgment are shown as follows, where the helpful score ranges from one (worst) to ten (best):
To detect the inconsistency between external knowledge within the instruction-tuning (alignment) data and intrinsic knowledge embedded in the foundation LLMs obtained from pretraining, we propose a four-stage approach: (i) knowledge requirement classification, (ii) reference knowledge generation, (iii) examination formulation, and (iv) examination completion.
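At a high level, the four stages chain roughly as follows. This is a hypothetical outline with illustrative callables and a hypothetical pass criterion, not the repository's actual API; the real pipeline is driven by the per_instance_query.py and post_process.py commands below.
from typing import Callable, List

PASS_THRESHOLD = 0.75  # hypothetical pass criterion; the examination scripts below compute the actual metric

def detect_inconsistency(
    instruction: str,
    response: str,
    requires_knowledge: Callable[[str], bool],      # stage (i), asked of the well-aligned LLM
    generate_knowledge: Callable[[str, str], str],  # stage (ii), generated by the well-aligned LLM
    formulate_exam: Callable[[str], List[str]],     # stage (iii), exam questions derived from the knowledge
    take_exam: Callable[[List[str]], float],        # stage (iv), answered by the foundation LLM, returns a score
) -> str:
    """Label one alignment instance with the categories used in the released data."""
    # (i) Knowledge requirement classification.
    if not requires_knowledge(instruction):
        return "no_need_fact"
    # (ii) Reference knowledge generation.
    knowledge = generate_knowledge(instruction, response)
    # (iii) Examination formulation.
    exam = formulate_exam(knowledge)
    # (iv) Examination completion: the foundation LLM's exam score decides consistency.
    return "need_and_have_fact" if take_exam(exam) >= PASS_THRESHOLD else "need_and_have_no_fact"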
The results of knowledge inconsistency detection are provided in ./data/generated_results and ./data/examination. You can download them and place them in the corresponding folders. To reproduce the results, follow the commands below step by step:
# (i) Knowledge requirement classification
cd ./data_generation
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
split=train # train / test / test_truth
data_name=wizardlm_alpaca_single_turn # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
input_dir=../data/source/${split}
input_filename=${data_name}.jsonl
res_dir=../data/generation_results/${split}/fact_enhance_classify
res_filename=${data_name}_classify.jsonl
mode=fact_enhance_classify_en
batch_size=10
python3 per_instance_query.py \
--data_dir ${input_dir} \
--input ${input_filename} \
--file_extension jsonl \
--out_dir ${res_dir} \
--output ${res_filename} \
--prompt_mode ${mode} \
--request_batch_size ${batch_size}
python3 post_process.py \
--split ${split} \
--stage fact_enhance_classify
# (ii) Reference knowledge generation
cd ./data_generation
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
split=train # train / test / test_truth
data_name=wizardlm_alpaca_single_turn # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
input_dir=../data/generation_results/${split}/fact_enhance_classify
input_filename=${data_name}_classify_parse_res_select_need.jsonl
res_dir=../data/generation_results/${split}/fact_generation
res_filename=${data_name}_classify_parse_res_select_need_knowledge_gen.jsonl
mode=fact_generation_en
batch_size=10
python3 per_instance_query.py \
--data_dir ${input_dir} \
--input ${input_filename} \
--file_extension jsonl \
--out_dir ${res_dir} \
--output ${res_filename} \
--prompt_mode ${mode} \
--request_batch_size ${batch_size}
python3 post_process.py \
--split ${split} \
--stage fact_generation
# (iii) Examination formulation
cd ./data_generation
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
split=train # train / test / test_truth
data_name=wizardlm_alpaca_single_turn # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
input_dir=../data/generation_results/${split}/fact_generation
input_filename=${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res.jsonl
res_dir=../data/generation_results/${split}/test_generation
res_filename=${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen.jsonl
mode=fact_to_tests_en
batch_size=10
python3 per_instance_query.py \
--data_dir ${input_dir} \
--input ${input_filename} \
--file_extension jsonl \
--out_dir ${res_dir} \
--output ${res_filename} \
--prompt_mode ${mode} \
--request_batch_size ${batch_size}
python3 post_process.py \
--split ${split} \
--stage test_generation
# (iv) Examination completion
cd ./
split=train # train / test / test_truth
data_name=wizardlm_alpaca_single_turn # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
mv ./data/generation_results/${split}/test_generation/${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize.jsonl ./data/examination/input/hallucination/${split}/${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_test.jsonl
export CUDA_VISIBLE_DEVICES=0
test_dataset=hallucination
eval_batch_size=1 # must be set to 1
shot=5
model_name=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
output_dir=./data/examination/output/${test_dataset}/${split}/${model_name}/${shot}-shot
data_dir=./data/examination/input/${test_dataset}/${split}
python3 ./examination/${test_dataset}/run_eval.py \
--ntrain ${shot} \
--data_dir ${data_dir} \
--save_dir ${output_dir} \
--model_name_or_path ${model_name} \
--tokenizer_name_or_path ${model_name} \
--eval_batch_size ${eval_batch_size} \
--use_slow_tokenizer
python3 ./examination/${test_dataset}/get_metric.py
Since knowledge inconsistency could mislead foundation LLMs during alignment and lead to hallucinations, we propose three strategies to handle the instances in the inconsistent subset D_inc: (i) open-book tuning, which appends the generated knowledge snippets to the instructions; (ii) discard tuning, which discards both the instructions and responses; and (iii) refusal tuning, which rewrites the responses into a refusal format.
The results of knowledge inconsistency calibration are provided in ./data/processed_results. You can download them and place them in the corresponding folder. To reproduce the results, follow the commands below step by step:
First, we construct training data for these tuning methods:
cd ./
python3 ./data_generation/inconsistency_processing.py
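Conceptually, this step applies one of the three calibration strategies to every inconsistent instance. Below is a minimal, illustrative sketch of the transformations, not the actual implementation in inconsistency_processing.py; the strategy names mirror the dataset suffixes (_openbook, _drop, _sorry), and the exact wording of the appended knowledge and the refusal response is assumed.
import copy

REFUSAL = "I'm sorry, but I don't have enough reliable knowledge to answer this question."  # illustrative wording

def calibrate(example, strategy):
    """Transform one inconsistent instance; `example` follows the released JSON schema."""
    example = copy.deepcopy(example)
    human, gpt = example["conversations"][0], example["conversations"][1]
    if strategy == "openbook":
        # Open-book tuning: append the generated knowledge snippet to the instruction.
        human["value"] += "\n\nReference knowledge:\n" + example["knowledge"]
    elif strategy == "drop":
        # Discard tuning: drop both the instruction and the response.
        return None
    elif strategy == "sorry":
        # Refusal tuning: keep the instruction but rewrite the response into a refusal.
        gpt["value"] = REFUSAL
    return example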
Then, we fine-tune the foundation LLMs using these tuning methods:
cd ./
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MODEL_NAME=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
DATA_NAME=wizardlm_trainset_sorry # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discard tuning) / wizardlm_trainset_sorry (kca refusal tuning)
DATA_PATH=./data/processed_results/${MODEL_NAME}_shot-5_${DATA_NAME}.json # ./data/processed_results/${DATA_NAME}.json (baseline) / ./data/processed_results/${MODEL_NAME}_shot-5_${DATA_NAME}.json (kca)
CONV_TEMP=vicuna
OUTPUT_DIR=./training_results/${MODEL_NAME}_shot-5_${DATA_NAME} # ./training_results/baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ./training_results/${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
LOG_FILE=./training_loggings/${MODEL_NAME}_shot-5_${DATA_NAME}.log # ./training_loggings/baseline_${MODEL_NAME}_${DATA_NAME}.log (baseline) / ./training_loggings/${MODEL_NAME}_shot-5_${DATA_NAME}.log (kca)
torchrun --nproc_per_node=8 --master_port=20001 ./train/train.py \
--model_name_or_path ${MODEL_NAME} \
--data_path ${DATA_PATH} \
--bf16 True \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap "LlamaDecoderLayer" \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--conv_temp ${CONV_TEMP} \
--lazy_preprocess True \
--flash_attn_transformers True 2>&1 | tee ${LOG_FILE}
We evaluate both the hallucination rate and helpfulness score of the fine-tuned LLMs. For hallucination evaluation, we conduct both LLM-based judgment and metric-based judgment. For helpfulness evaluation, we conduct LLM-based judgment.
Below are the scripts for hallucination evaluation.
# ========== LLM-Based Judgment (LIMAEval, VicunaEval, WizardLMEval, TruthfulQA) ==========
# Generate model answers
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
NUM_GPUS=8
MODEL_NAME=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
DATA_NAME=wizardlm_trainset_sorry # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discard tuning) / wizardlm_trainset_sorry (kca refusal tuning)
MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME} # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
MODEL_PATH=./training_results/${MODEL_ID}
QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
QUESTION_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json # do not use _openbook or _drop
ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
python3 ./eval/gpt_judge/gen_answer.py \
--model-path ${MODEL_PATH} \
--model-id ${MODEL_ID} \
--conv-temp vicuna \
--question-file ${QUESTION_FILE} \
--answer-file ${ANSWER_FILE} \
--num-gpus ${NUM_GPUS}
# GPT-4 judgment
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
MODEL_NAME=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
DATA_NAME=wizardlm_trainset_sorry # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discard tuning) / wizardlm_trainset_sorry (kca refusal tuning)
MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME} # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
JUDGE_TYPE=hallucination_judge
ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
TESTSET_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json # do not use _openbook or _drop
REVIEW_FILE=./evaluation_results/review_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_${JUDGE_TYPE}_greedy.jsonl
PROMPT_FILE=./eval/gpt_judge/gpt_judge_prompt.jsonl
BATCH_SIZE=3
python3 ./eval/gpt_judge/gpt_judge.py \
--answer_file ${ANSWER_FILE} \
--testset_file ${TESTSET_FILE} \
--review_file ${REVIEW_FILE} \
--prompt_file ${PROMPT_FILE} \
--prompt_type ${JUDGE_TYPE} \
--review_model gpt-4 \
--batch_size ${BATCH_SIZE} \
--use_demo \
--no_sorry # only when "DATA_NAME=wizardlm_trainset_sorry"
python3 ./eval/gpt_judge/show_results.py
# ======================= Metric-Based Judgment (MS-MARCO, ACI-Bench) ======================
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
NUM_GPUS=8
MODEL_NAME=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
DATA_NAME=wizardlm_trainset_sorry # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discard tuning) / wizardlm_trainset_sorry (kca refusal tuning)
MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME} # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
MODEL_PATH=./training_results/${MODEL_ID}
QUESTION_NAME=msmacro # msmacro / acibench
QUESTION_FILE=./data/metric_based_evaluation/${QUESTION_NAME}_testset.jsonl
ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
python3 ./eval/gpt_judge/gen_summary.py \
--model-path ${MODEL_PATH} \
--model-id ${MODEL_ID} \
--conv-temp vicuna \
--question-file ${QUESTION_FILE} \
--answer-file ${ANSWER_FILE} \
--num-gpus ${NUM_GPUS} \
--no-sorry # only when "DATA_NAME=wizardlm_trainset_sorry"
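The ROUGE-1, ROUGE-2, and ROUGE-L scores reported above can then be computed offline from the generated answer file, for instance with the rouge-score package. This is a minimal sketch; the JSONL field names in the commented usage are assumptions, so check the actual schema of the answer and test set files.
import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def average_rouge(pairs):
    """pairs: iterable of (reference, prediction) string tuples; returns mean F1 per ROUGE type."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    count = 0
    for reference, prediction in pairs:
        scores = scorer.score(reference, prediction)
        for key in totals:
            totals[key] += scores[key].fmeasure
        count += 1
    return {key: value / max(count, 1) for key, value in totals.items()}

# Example usage (the "reference" and "text" fields are assumed, not verified):
# with open(ANSWER_FILE) as f:
#     records = [json.loads(line) for line in f]
# print(average_rouge((r["reference"], r["text"]) for r in records))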
Below are the scripts for helpfulness evaluation.
# ========== LLM-Based Judgment (LIMAEval, VicunaEval, WizardLMEval, TruthfulQA) ==========
# GPT-4 judgment
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
MODEL_NAME=llama-2-7b # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
DATA_NAME=wizardlm_trainset_sorry # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discard tuning) / wizardlm_trainset_sorry (kca refusal tuning)
MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME} # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
JUDGE_TYPE=effectiveness_judge
ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
TESTSET_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json # do not use _openbook or _drop
REVIEW_FILE=./evaluation_results/review_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_${JUDGE_TYPE}_greedy.jsonl
PROMPT_FILE=./eval/gpt_judge/gpt_judge_prompt.jsonl
BATCH_SIZE=3
python3 ./eval/gpt_judge/gpt_judge.py \
--answer_file ${ANSWER_FILE} \
--testset_file ${TESTSET_FILE} \
--review_file ${REVIEW_FILE} \
--prompt_file ${PROMPT_FILE} \
--prompt_type ${JUDGE_TYPE} \
--review_model gpt-4 \
--batch_size ${BATCH_SIZE} \
--use_demo
python3 ./eval/gpt_judge/show_results.py
If you find this work relevant to your research or applications, please feel free to cite it!
@article{wan2024knowledge,
title={Knowledge Verification to Nip Hallucination in the Bud},
author={Wan, Fanqi and Huang, Xinting and Cui, Leyang and Quan, Xiaojun and Bi, Wei and Shi, Shuming},
journal={arXiv preprint arXiv:2401.10768},
year={2024}
}