You can use WMD to evaluate a single caption or multiple diverse captions per image. Given an image, its human annotation(s), and the generated caption(s), we first tokenize the human annotation(s) into one token dictionary, then tokenize the generated caption(s) into another, and finally compute the Word Mover's Distance between the two dictionaries.
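For reference, this pairwise computation can be reproduced with gensim's wmdistance. The following is a minimal sketch, assuming the GoogleNews vectors have been downloaded; the file path and example sentences are illustrative, not from this repo.

```python
from gensim.models import KeyedVectors

# Load a trained word2vec model. Here we assume the GoogleNews vectors
# (the default in this repo); swap in the MSCOCO-trained model to better
# match the captioning corpus.
model = KeyedVectors.load_word2vec_format(
    "trained_models/word2vec/GoogleNews-vectors-negative300.bin", binary=True
)

# Tokenize one human annotation and one generated caption.
reference = "a man is riding a horse on the beach".lower().split()
candidate = "a person rides a horse along the shore".lower().split()

# Word Mover's Distance between the two token lists (lower = more similar).
# Note: wmdistance requires an EMD solver package (pyemd or POT,
# depending on the gensim version).
print(model.wmdistance(reference, candidate))
```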
- Download the trained word2vec models and put them in the trained_models/word2vec folder.
- WMD scores can depend strongly on the corpus the word2vec model was trained on. We provide a word2vec model trained on MSCOCO captions (Download here). Alternatively, you can use other trained word2vec models, such as the GoogleNews model (the default setting in our code is the GoogleNews model).
- Download the tokenized MSCOCO dataset and put it in the data/files folder.
- You can download it here (download all 3 files).
- Use your method to generate a caption (or captions) for each image and save them as a JSON file. The format must match results/results_bs3.json (one caption per image) or results/merge_results10.json (10 captions per image); see the sketch after the command below for the assumed layout.
- Run the following commands:

```bash
cd ./evaluation
python accuracy_WMD.py --results_file ../results/merge_results10.json --score_file ../results/merge_results10_score.json --num_captions 10 --exp 1
```
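The exact schema of the results files is not shown here. Below is a hedged sketch of writing the results file, assuming the standard MSCOCO results layout of {"image_id", "caption"} records; the IDs, captions, and output path are made up, so check results/results_bs3.json for the authoritative layout.

```python
import json

# Hypothetical example of saving generated captions in the assumed
# MSCOCO-style results format (one record per generated caption).
results = [
    {"image_id": 391895, "caption": "a man riding a motorcycle down a dirt road"},
    {"image_id": 522418, "caption": "a woman is cutting a large cake"},
]

with open("results/my_results.json", "w") as f:
    json.dump(results, f)
```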
- Matt J. Kusner et al., From Word Embeddings To Document Distances. ICML, 2015.
- Mert Kilickaya et al., Re-evaluating Automatic Metrics for Image Captioning. EACL, 2017.