This README describes how to use Doc-BERTScore, an extension of the BERTScore metric that can be used for document-level evaluation.
This codebase is built upon the original BERTScore code. For a detailed presentation of the BERTScore metric, including usage examples and instructions, see the original documentation.
To run Doc-BERTScore you will need to install it locally:

```bash
git clone https://github.com/amazon-science/doc-mt-metrics.git
cd doc-mt-metrics/bert_score
pip install .
```
Get some sample data to score (here, the first 20 lines of the WMT21 en-de test set):

```bash
sacrebleu -t wmt21 -l en-de --echo ref | head -n 20 > ref.de
sacrebleu -t wmt21 -l en-de --echo ref | head -n 20 > hyp.de  # put your system output here
```
To evaluate at the document level we need to know where the document boundaries are in the test set, so that only valid context is used. These boundaries are passed in as a file with one document ID per line, aligned with the sentences in the reference and hypothesis files.
For WMT test sets this can be obtained via sacreBLEU:

```bash
sacrebleu -t wmt21 -l en-de --echo docid | head -n 20 > docids.ende
```
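If your test set is not packaged with sacreBLEU, you can create this file yourself: it needs one document ID per line, in the same order as the sentences in `ref.de` and `hyp.de`. A minimal sketch (the IDs and segments below are purely illustrative):

```python
# Illustrative only: one (doc_id, sentence) pair per test-set segment, in corpus order.
segments = [
    ("doc0", "Erster Satz von Dokument 0."),
    ("doc0", "Zweiter Satz von Dokument 0."),
    ("doc1", "Erster Satz von Dokument 1."),
]

# Write one document ID per line, aligned with ref.de / hyp.de.
with open("docids.ende", "w") as f:
    for doc_id, _ in segments:
        f.write(doc_id + "\n")
```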
To score using the document-level BERTScore, simply add the `--doc` flag:

```bash
bert-score -r ref.de -c hyp.de --lang de --doc docids.ende
```
In the paper we use `roberta-large` for X->En pairs and `bert-base-multilingual-cased` for En->X pairs (the defaults at the time), but you can select another model with the `-m MODEL_TYPE` flag. See the spreadsheet provided by the authors of BERTScore for a full list of supported models.
The BERTScore framework provides two APIs for using the BERTScore metric from Python: an object-oriented API that caches the model and is recommended for multiple evaluations, and a functional API for one-off evaluations. For more details see the demo provided by the authors.
To use Doc-BERTScore with the object-oriented API, simply add `doc=True` when calling the `score` method:
```python
from bert_score import BERTScorer
from add_context import add_context

with open("hyp.de") as f:
    cands = [line.strip() for line in f]

with open("ref.de") as f:
    refs = [line.strip() for line in f]

with open("docids.ende") as f:
    doc_ids = [line.strip() for line in f]

scorer = BERTScorer(lang="de")

# add context to reference and hypothesis texts
cands = add_context(orig_txt=cands, context=refs, doc_ids=doc_ids, sep_token=scorer._tokenizer.sep_token)
refs = add_context(orig_txt=refs, context=refs, doc_ids=doc_ids, sep_token=scorer._tokenizer.sep_token)

# set doc=True to evaluate at the document level
P, R, F1 = scorer.score(cands, refs, doc=True)
```
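For intuition, the sketch below shows the kind of transformation a helper like `add_context` performs: it prepends the preceding reference sentences from the same document to each sentence, separated by the model's separator token. The window size and exact formatting here are assumptions for illustration; `add_context.py` in this repository is the actual implementation.

```python
def add_context_sketch(orig_txt, context, doc_ids, sep_token, window=2):
    """Illustrative sketch: prepend up to `window` preceding context sentences
    from the same document to each sentence, separated by `sep_token`."""
    augmented = []
    for i, sent in enumerate(orig_txt):
        # keep only preceding sentences that belong to the same document
        ctx = [
            context[j]
            for j in range(max(0, i - window), i)
            if doc_ids[j] == doc_ids[i]
        ]
        prefix = " ".join(ctx)
        augmented.append(f"{prefix} {sep_token} {sent}" if prefix else sent)
    return augmented
```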
Similarly, to use Doc-BERTScore with the functional API, add `doc=True` when calling the `score` function:
```python
from bert_score import score
from add_context import add_context

with open("hyp.de") as f:
    cands = [line.strip() for line in f]

with open("ref.de") as f:
    refs = [line.strip() for line in f]

with open("docids.ende") as f:
    doc_ids = [line.strip() for line in f]

# add context to reference and hypothesis texts
cands = add_context(orig_txt=cands, context=refs, doc_ids=doc_ids, sep_token="[SEP]")
refs = add_context(orig_txt=refs, context=refs, doc_ids=doc_ids, sep_token="[SEP]")

# set doc=True to evaluate at the document level
P, R, F1 = score(cands, refs, lang="de", verbose=True, doc=True)
```
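As with the original BERTScore API, `P`, `R`, and `F1` contain one score per candidate sentence, so a corpus-level score can be obtained by averaging, for example:

```python
# Average the per-sentence F1 scores into a single system-level score.
print(f"Doc-BERTScore F1: {F1.mean():.4f}")
```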
To use another model, set `model_type=MODEL_TYPE` when calling the `score` function.
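For example, to score En->X output with the multilingual BERT model mentioned above (a sketch, assuming `cands` and `refs` have already been augmented with context as in the snippet above):

```python
P, R, F1 = score(
    cands,
    refs,
    model_type="bert-base-multilingual-cased",  # model used for En->X pairs in the paper
    doc=True,
)
```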
To reproduce the Doc-BERTScore results from the paper, run the `score_doc-metrics.py` script with the flags `--model bertscore` and `--doc`.
If you use the code in your work, please cite Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric:
```bibtex
@inproceedings{easy_doc_mt,
    title = "Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric",
    author = "Vernikos, Giorgos and Thompson, Brian and Mathur, Prashant and Federico, Marcello",
    booktitle = "Proceedings of the Seventh Conference on Machine Translation",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://statmt.org/wmt22/pdf/2022.wmt-1.6.pdf",
}
```