Aldenhovel/bleu-rouge-meteor-cider-spice-eval4imagecaption

Evaluation tools for image captioning, including BLEU, ROUGE-L, CIDEr, METEOR and SPICE scores.

About Image Captioning Metrics

  1. BLEU

    Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 311–318).

    BLEU has frequently been reported as correlating well with human judgement, and it remains a benchmark for assessing any new evaluation metric. A number of criticisms have, however, been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. It has also been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score indicates improved translation quality.

  2. ROUGE-L

    Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (pp. 74–81).

    ROUGE-L: Longest Common Subsequence (LCS) based statistics. The longest common subsequence naturally takes sentence-level structure similarity into account and automatically identifies the longest in-sequence co-occurring n-grams (see the LCS-based sketch after this list).

  3. METEOR

    Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72).

    The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision (see the F-mean sketch after this list). It also has several features not found in other metrics, such as stemming and synonym matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level; this differs from BLEU, which seeks correlation at the corpus level.

  4. CIDEr

    Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4566–4575).

    CIDEr measures the consensus between a candidate caption and the set of human reference captions using TF-IDF-weighted n-gram similarity, so n-grams that are common across the whole dataset contribute less to the score.

  5. SPICE

    Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision (pp. 382–398).

    SPICE parses candidate and reference captions into scene graphs of objects, attributes and relations, and scores a candidate by the F-score over the semantic propositions it shares with the references.
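
Two small illustrative sketches follow. First, ROUGE-L: a minimal, self-contained Python sketch of the LCS-based F-measure for a single candidate/reference pair. It is only an illustration of the idea; the scores reported by this repo come from pycocoevalcap's Rouge scorer, which also handles multiple references per image. The helper names are ours, and beta = 1.2 follows the value used in the COCO caption evaluation code.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall higher than precision."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)

print(rouge_l("a man is sitting", "a man is sitting in the street"))  # ~0.69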
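
Second, METEOR's recall-weighted harmonic mean of unigram precision and recall (the original paper uses alpha = 0.9, i.e. Fmean = 10PR / (R + 9P)). The full metric additionally applies stemming, synonym matching and a fragmentation penalty, all handled by the Java implementation that pycocoevalcap wraps; the function name below is just for illustration.

def meteor_fmean(precision, recall, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall (recall dominates)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

# Recall matters much more: swapping the same two values gives very different scores.
print(meteor_fmean(precision=0.9, recall=0.5))  # ~0.52
print(meteor_fmean(precision=0.5, recall=0.9))  # ~0.83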

Installation

Please check salaniz/pycocoevalcap for the installation of pycocotools and pycocoevalcap.

Microsoft COCO Caption Evaluation

Evaluation code for MS COCO caption generation.

Description

This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset.

The code is derived from the original repository, which supports Python 2.7: https://github.com/tylin/coco-caption. Caption evaluation depends on the COCO API, which natively supports Python 3.

Requirements

  • Java 1.8.0
  • Python 3.6

Installation

To install pycocoevalcap and the pycocotools dependency (https://github.com/cocodataset/cocoapi), run:

pip install pycocoevalcap

Setup

  • SPICE requires the download of Stanford CoreNLP 3.6.0 code and models. This will be done automatically the first time the SPICE evaluation is performed.
  • Note: SPICE will try to create a cache of parsed sentences in ./spice/cache/. This dramatically speeds up repeated evaluations. The cache directory can be moved by setting 'CACHE_DIR' in ./spice. In the same file, caching can be turned off by removing the '-cache' argument to 'spice_cmd'.
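
Because METEOR and SPICE shell out to a Java runtime (see Requirements above), it can save time to confirm that java is available before starting a long evaluation. This small check is an addition for convenience, not part of the repository:

# Optional sanity check: METEOR and SPICE call an external Java process,
# so confirm a Java runtime is on PATH before starting a long evaluation.
import shutil
import subprocess

if shutil.which("java") is None:
    raise RuntimeError("Java not found on PATH; the METEOR and SPICE scorers will fail.")

# `java -version` reports its version string on stderr, not stdout.
result = subprocess.run(["java", "-version"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stderr.decode().strip())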

How To Use

This repo is mainly based on the code from pycocotools and pycocoevalcap, which is designed for evaluating MS COCO caption generation. The API has been simplified here, so the same evaluation tool can be applied to other caption datasets such as Flickr8k, Flickr30k, or any other dataset.

Two JSON files holding the references and the candidate captions are required in example/. example/main.py reads these two JSON files, computes the scores automatically, and prints them.

Example references.json and captions.json (candidate captions) are provided in example/. To generate these files, follow the demo below:

# Collect all references from the dataset as references: dict
# Collect all captions generated by the model as captions: dict

references = {
    "1": ["this is a tree", "this is an apple", ...],
    "2": ["a man is sitting", "a man in the street", ...],
    # ...
}

captions = {
    "1": ["this is a big tree"],
    "2": ["a man is sitting"],
    # ...
}

# Save them as JSON files in the format expected by the evaluation code
import json

# Candidate captions: one entry per image, keeping only the first generated caption
new_cap = []
for k, v in captions.items():
    new_cap.append({'image_id': k, 'caption': v[0]})

# References: an "images" list plus one "annotations" entry per reference caption
new_ref = {'images': [], 'annotations': []}
for k, refs in references.items():
    new_ref['images'].append({'id': k})
    for ref in refs:
        new_ref['annotations'].append({'image_id': k, 'id': k, 'caption': ref})

with open('references.json', 'w') as fgts:
    json.dump(new_ref, fgts)
with open('captions.json', 'w') as fres:
    json.dump(new_cap, fres)

Then check that the saved references.json and captions.json have the same format as the demo references_example.json and captions_example.json:

  • references.json
{
    "images": [
        {"id": "0"}, 
        {"id": "1"},
        ......
    ], 
    "annotations": [
        {
            "image_id": "0", 
            "id": "0", 
            "caption": "A man with a red helmet on a small moped on a dirt road. "
        }, 
        {
            "image_id": "0", 
            "id": "0", 
            "caption": "Man riding a motor bike on a dirt road on the countryside."
        }, 
        {
            "image_id": "0", 
            "id": "0",
            "caption": "A man riding on the back of a motorcycle."
        }, 
        {
            "image_id": "0", 
            "id": "0", 
            "caption": "A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. "
        }, 
        {
            "image_id": "0", 
            "id": "0", 
            "caption": "A man in a red shirt and a red hat is on a motorcycle on a hill side."
        }, 
        {
            "image_id": "1", 
            "id": "1", 
            "caption": "A woman wearing a net on her head cutting a cake. "
        }, 
        {
            "image_id": "1", 
            "id": "1", 
            "caption": "A woman cutting a large white sheet cake."
        }, 
        {
            "image_id": "1", 
            "id": "1", 
            "caption": "A woman wearing a hair net cutting a large sheet cake."
        }, 
        {
            "image_id": "1", 
            "id": "1", 
            "caption": "there is a woman that is cutting a white cake"
        }, 
        {
            "image_id": "1", 
            "id": "1",
            "caption": "A woman marking a cake with the back of a chef's knife. "
        },
        ......
    ]
}
  • captions.json
[
    {
        "image_id": "0", 
        "caption": "a man standing on the side of a road ."
    }, 
    {
        "image_id": "1", 
        "caption": "a person standing in front of a mirror ."
    },
    ......
]

Then run main.py from the example/ directory:

python main.py

Terminal output:

>>
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 72388 tokens at 846674.96 tokens per second.
PTBTokenizer tokenized 12514 tokens at 290819.68 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 10476, 'reflen': 10274, 'guess': [10476, 9476, 8476, 7476], 'correct': [7043, 3379, 1518, 669]}
ratio: 1.0196612809031516
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
computing METEOR score...
METEOR: 0.201
computing Rouge score...
ROUGE_L: 0.472
computing CIDEr score...
CIDEr: 0.457
computing SPICE score...
Parsing reference captions
Initiating Stanford parsing pipeline
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.8 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.3 sec].
Threads( StanfordCoreNLP ) [01:03.436 minutes]
Parsing test captions
Threads( StanfordCoreNLP ) [3.322 seconds]
SPICE evaluation took: 1.182 min
SPICE: 0.137
Bleu_1: 0.672
Bleu_2: 0.490
Bleu_3: 0.350
Bleu_4: 0.249
METEOR: 0.201
ROUGE_L: 0.472
CIDEr: 0.457
SPICE: 0.137
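
If you prefer to call the evaluation from your own script instead of running example/main.py, the snippet below is a minimal sketch of how the same scores can be computed with the standard pycocotools / pycocoevalcap API. It approximates what main.py does, judging from the output above; the actual script in this repo may differ in its details.

# Evaluate captions.json against references.json programmatically.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("references.json")            # ground-truth reference captions
coco_res = coco.loadRes("captions.json")  # model-generated candidate captions

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # evaluate all images in the result file
coco_eval.evaluate()

# coco_eval.eval maps metric names (Bleu_1..4, METEOR, ROUGE_L, CIDEr, SPICE) to scores.
for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")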
