Official Python implementation of ECCV Caption | Paper
Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang, Seong Joon Oh
ECCV Caption contains x8.47 positive images and x3.58 positive captions compared to the original COCO Caption. The positives are verified by machines (five state-of-the-art image-text matching models) and humans. This library provides an unified interface to measure various COCO Caption retrieval metrics, such as COCO 1k Recall@K, COCO 5k Recall@K, CxC Recall@K, PMRP, and ECCV Caption Recall@K, R-Precision and mAP@R.
For more details, please read our paper:
Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models. However, existing ITM benchmarks have a significant limitation. They have many missing correspondences, originating from the data construction process itself. For example, a caption is only matched with one image although the caption can be matched with other similar images and vice versa. To correct the massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties for our annotation process. Our dataset provides x3.6 positive image-to-caption associations and x8.5 caption-to-image associations compared to the original MS-COCO. We also propose to use an informative ranking-based metric mAP@R, rather than the popular Recall@K (R@K). We re-evaluate the existing 25 VL models on existing and proposed benchmarks. Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, CxC R@1 are highly correlated with each other, while the rankings change when we shift to the ECCV mAP@R. Lastly, we delve into the effect of the bias introduced by the choice of machine annotator. Source code and dataset are available at https://github.com/naver-ai/eccv-caption
ECCV Caption dataset is an extended version of the COCO Caption test split by karpathy/neuraltalk2. We annotate positives the subset of the COCO Caption test set (1,332 query images, 1,261 query captions). We show the number of positive items for the subset of the COCO Caption test split.
Dataset | # positive images | # positive captions |
---|---|---|
Original MS-COCO Caption | 1,332 | 6,305 (=1,261×5) |
CxC | 1,895 (×1.42) | 8,906 (×1.41) |
Human-verified positives | 10,814 (×8.12) | 16,990 (×2.69) |
ECCV Caption | 11,279 (×8.47) | 22,550 (×3.58) |
- 8 Apr, 2022: Initial upload.
pip3 install eccv_caption
numpy
ujson
tqdm
ujson
and tqdm
are not neccessary.
from eccv_caption import Metrics
metric = Metrics()
# Get i2t, t2i retrived items from your own model
# i2t = {query_image_id: [sorted_caption_id_by_similarity]}
# t2i = {query_caption_id: [sorted_image_id_by_similarity]}
# See the example code for how to compute i2t / t2i from your model
scores = metric.compute_all_metrics(
i2t, t2i,
target_metrics=('eccv_r1', 'eccv_map_at_r', 'eccv_rprecision',
'coco_1k_recalls', 'coco_5k_recalls', 'cxc_recalls'),
Ks=(1, 5, 10),
verbose=False
)
print(scores)
It will return a score map, for example:
{
'coco_1k_r1': {'i2t': 0.7425999999999999, 't2i': 0.5544},
'coco_1k_r5': {'i2t': 0.9282, 't2i': 0.82324},
'coco_1k_r10': {'i2t': 0.9658, 't2i': 0.90152},
'coco_5k_r1': {'i2t': 0.5636, 't2i': 0.3658},
'coco_5k_r5': {'i2t': 0.7928, 't2i': 0.61048},
'coco_5k_r10': {'i2t': 0.867, 't2i': 0.71148},
'cxc_r1': {'i2t': 0.5804, 't2i': 0.38314912702226495},
'cxc_r5': {'i2t': 0.8142, 't2i': 0.6385151369533878},
'cxc_r10': {'i2t': 0.8862, 't2i': 0.7425116130065673},
'eccv_map_at_r': {'i2t': 0.23988490345859761, 't2i': 0.3195708452398554},
'eccv_rprecision': {'i2t': 0.3381634023972558, 't2i': 0.4177097641300428},
'eccv_r1': {'i2t': 0.7129262490087233, 't2i': 0.7297297297297297}
}
cxc
denotes CrissCrossed Captions. You can find the details of the CxC dataset from google-research-datasets/Crisscrossed-Captions
You can find an example code from here
eccv_caption.Metrics
extra_file_dir
: str (default:None
): directory wherepm_image_to_caption.json
andpm_caption_to_image.json
are downloaded. You don't need to specifyextra_file_dir
if you are not going to measurepmrp
.verbose
: bool (default: False),tqdm
verbose option
All metrics can be computed by the following method:
eccv_caption.Metrics.compute_all_metrics
The details of the method arguments are as the follows:
def compute_all_metrics(self, i2t_retrieved_items: Dict[int, List[int]] = None,
t2i_retrieved_items: Dict[int, List[int]] = None,
target_metrics: List[str] = ('coco_1k_r1', 'coco_5k_r1', 'cxc_r1', 'eccv_r1', 'eccv_map_at_r'),
Ks: List[int] = (1, 5, 10),
verbose: bool = None):
""" Compute various evaluation metrics for the given image-to-text & text-to-image retrieved items.
Parameters
----------
i2t_retrived_items: dict (key: image id (int), value: retrived caption ids sorted by similarity),
(default: `self.i2t_retrived_items` set by `self.set_i2t_retrieved_items`)
dictionary of target retrived caption ids to evaluate.
t2i_retrived_items: dict (key: caption id (int), value: retrived image ids sorted by similarity),
(default: `self.t2i_retrived_items` set by `self.set_t2i_retrieved_items`)
dictionary of target retrived image ids to evaluate.
target_metrics: list, the list of target metrics.
(default: ['coco_1k_r1', 'coco_5k_r1', 'cxc_r1', 'eccv_r1', 'eccv_map_at_r'])
Valid metrics: {'coco_1k_r1', 'coco_5k_r1', 'cxc_r1',
'coco_1k_recalls', 'coco_5k_recalls', 'cxc_recalls',
'eccv_r1', 'eccv_rprecision', 'eccv_map_at_r',
'eccv_recalls', 'pmrp'}
Note1: `{datasetname}_recalls` returns R@K, where K is given by `Ks`.
Note2: to use pmrp, you have to download PM GTs first. Please refer the official document for the details.
Ks: list, the list of target Ks for R@K.
(default: [1, 5, 10])
`target_metrics` should contain {datasetname}_recalls
verbose: bool (default: self.verbose), tqdm verbose option
Returns
-------
scores (dict): key: the name of metrics, value: {'i2t': score (float), 't2i': score (float)}
You can find the full data from the following gdrive link, including pm_image_to_caption.json
and pm_caption_to_image.json
for computing pmrp
.
https://drive.google.com/drive/folders/1Sam8_Hpm4uWKB_Ehk9C_JcGNpYH7jZD2?usp=sharing
The gdrive link also includes the full data, such as the raw human annotated data (mturk_parsed.csv
) and filtered items (wrong_captions.txt
, invalid_images.txt
, invalid_captions.txt
, verified_iids.npy
, verified_cids.npy
). We describe the details of how the dataset is built in the next section.
Due to the nature of the dataset annotation process, widely-used Image-Text aligned datasets, such as MS-COCO, have many false negatives. We extend the MS-COCO Caption test
split by using machine and human annotators.
Our annotation is built upon five state-of-the-art image-text matching models, such as VSRN (ICCV'19), PVSE (CVPR'19), ViLT (ICML'21), CLIP (ICML'21), PCME (CVPR'21).
We first collect top-25 retrived items for the subset of MS-COCO Caption by the five SOTA models. We verify whether the top-25 items are positive or negative by Amazon Mechanical Turk (MTurk). The full MTurk results can be found in gdrive mturk_parsed.csv
.
We measure precision and recall of the existing benchmarks. Precision (Prec) and Recall of previous datasets measured by human verified positive pairs are shown in the followin table. A low Prec means that many positives are actually negatives, and a low Recall means that there exist many missing positives. The table shows that the existing benchmarks, such as MS-COCO Caption have many missing positive correspondences.
Dataset | I2T Prec | I2T Recall | T2I Prec | T2I Recall |
---|---|---|---|---|
Original MS-COCO Caption | 96.9 | 21.1 | 96.4 | 13.8 |
CxC | 95.5 | 23.0 | 93.2 | 16.1 |
Plausible Match | 65.3 | 56.5 | 56.6 | 61.8 |
We combine our MTurk results and CxC results to build the final ECCV Caption dataset (by treating CxC results as the sixth machine annotator).
For more details, please read our paper.
Copyright (c) 2022-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
@inproceedings{chun2022eccv_caption,
title={ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO},
author={Chun, Sanghyuk and Kim, Wonjae and Park, Song and Chang, Minsuk Chang and Oh, Seong Joon},
year={2022},
booktitle={European Conference on Computer Vision (ECCV)},
}
I would like to suggest citing PCME and PCME++, too.
@inproceedings{chun2021pcme,
title={Probabilistic Embeddings for Cross-Modal Retrieval},
author={Chun, Sanghyuk and Oh, Seong Joon and De Rezende, Rafael Sampaio and Kalantidis, Yannis and Larlus, Diane},
year={2021},
booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
}
@inproceedings{chun2024pcmepp,
title={Improved Probabilistic Image-Text Representations},
author={Chun, Sanghyuk},
year={2024},
booktitle={International Conference on Learning Representations (ICLR)},
}