Dataset and code associated with the paper Grounded Textual Entailment [1]. A BibTeX entry for the paper:
@InProceedings{vu2018grounded,
title={Grounded Textual Entailment},
author={Vu, Hoa Trong and Greco, Claudio and Erofeeva, Aliia and Jafaritazehjan, Somayeh and Linders, Guido and Tanti, Marc and Testoni, Alberto and Bernardi, Raffaella and Gatt, Albert},
booktitle={Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
year={2018}
}
Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant "world" or "situation"). We use a multimodal version of the SNLI dataset [2] and we compare "blind" and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing "grounding" in an optimal fashion.
The dataset is available here.
The pre-trained models are available here.
Requirements:
- TensorFlow
- Flickr30k and Keras with its pretrained VGG16 model, or download the image names and image features
- GloVe embeddings (a loading sketch is given below)
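The GloVe vectors ship as a plain-text file with one token and its vector per line. The snippet below is a minimal, hypothetical sketch of loading them into a dictionary; the file name in the usage comment is only an example, use whichever GloVe file you downloaded.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text file into a {word: np.ndarray} dict.

    Assumes the simple space-separated format: word v1 v2 ... vN per line.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# Hypothetical usage -- point this at the GloVe file you downloaded:
# vectors = load_glove("glove.840B.300d.txt")
```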
The .config file contains all the settings and hyperparameters for training. To run:
- Obtain the image features either by extracting them with Keras and the Flickr30k dataset or by downloading the features file (git-lfs)
- To extract the image features yourself, specify the location of the Flickr30k dataset in image_utils.py and run (a sketch of this extraction step is given after these instructions):

  ```
  python image_utils.py
  ```
- Specify the location of the embeddings in the config file, then run:

  ```
  python main.py --config_file=file_config_name_here.config
  ```
Trained models are saved in the models directory. To run decoding only, set decoding_only to true in the config file.
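The repository extracts image features with image_utils.py; its internals are not reproduced here. Below is a minimal sketch of one common way to obtain per-image features with Keras's pretrained VGG16, taking the activations of the fc2 layer. The layer choice, image size, and example path are assumptions and may differ from what image_utils.py actually does.

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model
from keras.preprocessing import image

# Load the full pretrained VGG16 and keep the penultimate fully-connected
# layer (fc2, 4096-d) as the image representation.
base = VGG16(weights="imagenet", include_top=True)
feature_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(img_path):
    """Return a 4096-d VGG16 fc2 feature vector for one image."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_model.predict(x)[0]

# Hypothetical usage -- point this at an image in your Flickr30k directory:
# vec = extract_features("flickr30k-images/36979.jpg")
```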
[1] Hoa Trong Vu, Claudio Greco, Aliia Erofeeva, Somayeh Jafaritazehjan, Guido Linders, Marc Tanti, Alberto Testoni, Raffaella Bernardi, Albert Gatt. 2018. Grounded Textual Entailment. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).
[2] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).