In the CFEVER paper, we evaluate the following baselines on the CFEVER test set:
- BEVERS (DeHaven and Scott, 2023)
- Stammbach (Stammbach, 2021)
- Our simple baseline
- ChatGPT (GPT-3.5 and GPT-4)
For the first two baselines, please refer to their source code repositories.
Please go to the CFEVER-data repository to download our dataset.
git clone https://github.com/IKMLab/CFEVER-baselines.git
cd CFEVER-baselines
CFEVER is a Chinese Fact Extraction and VERification dataset. Similar to FEVER (Thorne et al., 2018), the CFEVER task is to verify the veracity of a given claim while providing evidence from the Chinese Wikipedia (we provide our processed version in the CFEVER-data repository). Therefore, the task is split into three sub-tasks:
- Document Retrieval: Retrieve relevant documents from the Chinese Wikipedia.
- Sentence Retrieval: Select relevant sentences from the retrieved documents.
- Claim Verification: Determine whether the claim is “Supports”, “Refutes”, or “Not Enough Info”. Generally, in this stage, a model verifies the claim based on the claim text provided in the dataset and the sentences selected in stage 2 (sentence retrieval). (See the data-loading sketch below.)
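Each claim in the dataset is stored as one JSON object per line. The snippet below is a minimal loading sketch; the field names (`claim`, `label`, `evidence`) are assumed to follow the FEVER convention and may differ slightly in the released CFEVER files.

```python
import json

def load_jsonl(path):
    """Read a jsonl file into a list of dicts (one claim per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Field names ("claim", "label", "evidence") are assumed to follow the
# FEVER convention and may differ slightly in the released CFEVER files.
claims = load_jsonl("simple_baseline/data/dev.jsonl")
for example in claims[:3]:
    print(example.get("claim"), "->", example.get("label"))
```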
pip install -r requirements.txt
Please refer to the simple_baseline folder and check its README.md for more details.
To evaluate document retrieval, pass two paths to the script `eval_doc_retrieval.py`:
- `$GOLD_FILE`: the path to the file with gold answers in the `jsonl` format.
- `$DOC_PRED_FILE`: the path to the file with predicted documents in the `jsonl` format.
python eval_doc_retrieval.py \
--source_file $GOLD_FILE \
--doc_pred_file $DOC_PRED_FILE
The example command is shown below:
python eval_doc_retrieval.py \
--source_file simple_baseline/data/dev.jsonl \
--doc_pred_file simple_baseline/data/bm25/dev_doc10.jsonl
Note that our evaluation of document retrieval aligns with that of BEVERS; see the BEVERS code.
- You can also evaluate with only the first k predicted pages by setting the `--top_k` parameter. For example, `--top_k 10` will evaluate the first 10 predicted pages (a recall-style sketch is shown below).
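For intuition, here is one way a document-level recall check over the top-k predicted pages could look. It is an illustrative sketch only: it assumes FEVER-style `evidence` annotations (each evidence item ends with a page title and a line index) and a hypothetical `predicted_pages` field in the prediction file; the actual logic in `eval_doc_retrieval.py` (following BEVERS) may differ.

```python
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def doc_recall(gold_file, pred_file, top_k=5):
    """Fraction of verifiable claims whose gold pages are all covered
    by the top-k predicted pages (illustrative metric only)."""
    gold = load_jsonl(gold_file)
    # "predicted_pages" is a hypothetical field name for this sketch.
    preds = {p["id"]: p for p in load_jsonl(pred_file)}
    hit, total = 0, 0
    for ex in gold:
        if ex.get("label") == "NOT ENOUGH INFO":  # label string assumed (FEVER convention)
            continue  # NEI claims carry no gold evidence pages
        total += 1
        predicted = set(preds[ex["id"]].get("predicted_pages", [])[:top_k])
        # FEVER-style evidence: a list of evidence groups, each a list of
        # [annotation_id, evidence_id, page, line] items.
        for group in ex.get("evidence", []):
            if {item[2] for item in group} <= predicted:
                hit += 1
                break
    return hit / total if total else 0.0
```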
We follow the same evaluation procedure as fever-scorer and add some parameters for running the script:
python eval_sent_retrieval_rte.py \
--gt_file $GOLD_FILE \
--submission_file $PRED_FILE
where `$GOLD_FILE` is the path to the file with gold answers in the `jsonl` format and `$PRED_FILE` is the path to the file with predicted answers in the `jsonl` format. The example command is shown below:
python eval_sent_retrieval_rte.py \
--gt_file simple_baseline/data/dev.jsonl \
--submission_file simple_baseline/data/dumb_dev_pred.jsonl
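Since our script follows fever-scorer, its core computation can also be reproduced with the `fever_score` function from the official fever-scorer package. The snippet below is a hedged sketch: the field names (`predicted_label`, `predicted_evidence`, `label`, `evidence`) follow the FEVER submission convention and may differ from our script's exact inputs, and it assumes the gold and prediction files are aligned line by line.

```python
import json
from fever.scorer import fever_score  # provided by the fever-scorer package

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

gold = load_jsonl("simple_baseline/data/dev.jsonl")
preds = load_jsonl("simple_baseline/data/dumb_dev_pred.jsonl")

# Merge gold annotations into the prediction records, since fever_score
# reads "label"/"evidence" from each instance when no gold set is passed.
instances = [
    {
        "label": g["label"],
        "evidence": g["evidence"],
        "predicted_label": p["predicted_label"],
        "predicted_evidence": p["predicted_evidence"],  # list of [page, line] pairs
    }
    for g, p in zip(gold, preds)
]

strict_score, label_accuracy, precision, recall, f1 = fever_score(instances)
print(f"Strict accuracy (FEVER Score): {strict_score:.4f}")
print(f"Label accuracy: {label_accuracy:.4f}")
print(f"Evidence precision/recall/F1: {precision:.4f}/{recall:.4f}/{f1:.4f}")
```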
The script will output the scores of sentence retrieval:
- Precision
- Recall
- F1-score
and the scores of claim verification:
- Accuracy (printed as `Label accuracy`)
- FEVER Score (printed as `Strict accuracy`)
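For reference, the FEVER Score (strict accuracy) counts a claim as correct only when the predicted label matches the gold label and, for verifiable claims, the predicted evidence (capped at 5 sentences in the original FEVER setting) fully covers at least one gold evidence group. A minimal sketch of that check, assuming FEVER-style fields, is:

```python
def is_strictly_correct(instance, max_evidence=5):
    """Illustrative FEVER-style strict check for a single claim.
    Field names ("label", "evidence", "predicted_label",
    "predicted_evidence") are assumed to follow the FEVER convention."""
    if instance["predicted_label"] != instance["label"]:
        return False
    if instance["label"] == "NOT ENOUGH INFO":  # label string assumed
        return True  # no evidence is required for NEI claims
    predicted = {tuple(e) for e in instance["predicted_evidence"][:max_evidence]}
    # Correct if every (page, line) of at least one gold group is predicted.
    return any(
        all((item[2], item[3]) in predicted for item in group)
        for group in instance["evidence"]
    )
```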
If you find our work useful, please cite our paper:
@article{Lin_Lin_Yeh_Li_Hu_Hsu_Lee_Kao_2024,
title = {CFEVER: A Chinese Fact Extraction and VERification Dataset},
author = {Lin, Ying-Jia and Lin, Chun-Yi and Yeh, Chia-Jen and Li, Yi-Ting and Hu, Yun-Yu and Hsu, Chih-Hao and Lee, Mei-Feng and Kao, Hung-Yu},
doi = {10.1609/aaai.v38i17.29825},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
month = {Mar.},
number = {17},
pages = {18626-18634},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/29825},
volume = {38},
year = {2024},
}