In this project we created a new method for probing the embeddings for multimodal-BERT models. Furthermore, we provide a mapping to create the novel Scene Tree structure over the image regions. It is created by using the dependency tree of a caption and imposing it on the attached regions.
Probing • Scene Tree • Repository Layout • Install • How to Run • Citation
This repository is built as a modular probing project.
By running the main file, you get access to all the possible datasets and probes.
While the project is called Multimodal-Probes, it is still possible to
train and use probes on uni-modal data or uni-modal models.
Simply run python main.py --help to get a list of options.
Initially, this is a general set of options. Once you start setting some options (like --task), the help output lists more specific options, so you can see whether a probe works with the current task.
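The option-dependent help described above can be approximated with a two-stage argparse setup. The following is only a minimal sketch, not the code in main.py: the TASK_PROBES registry and the build_parser function are hypothetical names, and the single task/probe pairing is borrowed from the example run in this README.

```python
import argparse

# Hypothetical registry of which probes go with which task. The only entry
# reuses the task/probe pair from the example run in this README.
TASK_PROBES = {
    "DepthTask": ["OneWordPSDProbe"],
}


def build_parser(argv):
    """Build a parser whose options depend on an already supplied --task."""
    # Stage 1: a minimal parser that only looks for --task and ignores
    # everything else (including --help, since add_help is disabled).
    base = argparse.ArgumentParser(add_help=False)
    base.add_argument("--task", choices=sorted(TASK_PROBES))
    known, _unused = base.parse_known_args(argv)

    # Stage 2: the full parser inherits --task and, once a task is known,
    # adds the options that only make sense for that task.
    parser = argparse.ArgumentParser(parents=[base])
    if known.task is not None:
        parser.add_argument("--probe", choices=TASK_PROBES[known.task])
    return parser


if __name__ == "__main__":
    import sys

    args = build_parser(sys.argv[1:]).parse_args()
    print(args)
```

With this layout, asking for help without a task shows only the general options, while asking for help with a task set also lists the probes registered for that task.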
The scene tree is a novel structure that imposes a hierarchy over the regions in an image. For the construction, we assume images with captions in which words or phrases are aligned with regions in the image (e.g., Flickr30K Entities).
A dependency structure is extracted for the caption using the spaCy parser, and we map this structure on top of the connected regions in the image. This results in the scene tree.
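To make the mapping concrete, here is a minimal sketch of one way to impose a dependency tree on aligned regions. It is an illustration under assumptions, not the implementation in probing_project/scene_tree.py: the function names are hypothetical, the dependency heads are written by hand instead of coming from spaCy, and the rule used (a region attaches to the region of its nearest aligned ancestor token) is a simplification that may differ from the rules used in the paper.

```python
def token_depth(heads, tok):
    """Number of edges between a token and the root of the dependency tree."""
    depth = 0
    while heads[tok] != -1:
        tok = heads[tok]
        depth += 1
    return depth


def build_scene_tree(heads, token_region):
    """Map a dependency tree over tokens onto the regions aligned with them.

    heads[i] is the index of token i's head, or -1 for the root token.
    token_region maps a token index to a region id; unaligned tokens are absent.
    Returns {region: parent region, or None if the region has no aligned ancestor}.
    """
    # Per region, pick the aligned token that is closest to the dependency root.
    head_token = {}
    for tok, region in token_region.items():
        best = head_token.get(region)
        if best is None or token_depth(heads, tok) < token_depth(heads, best):
            head_token[region] = tok

    parents = {}
    for region, tok in head_token.items():
        cur = heads[tok]
        # Walk upwards, skipping unaligned tokens and tokens of the same region.
        while cur != -1 and token_region.get(cur, region) == region:
            cur = heads[cur]
        parents[region] = token_region[cur] if cur != -1 else None
    return parents


# Caption: "A man in a red hat rides a horse" (hand-written parse, for illustration)
#           0 1   2  3 4   5   6     7 8
heads = [1, 6, 1, 5, 5, 2, -1, 8, 6]
token_region = {0: "man", 1: "man", 3: "hat", 4: "hat", 5: "hat", 7: "horse", 8: "horse"}
tree = build_scene_tree(heads, token_region)
print(tree)
```

In this example the hat region ends up as a child of the man region, because "hat" depends on "man" through the unaligned token "in". The man and horse regions have no aligned ancestor, so they would attach directly to whatever root node is used for the whole image.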
Most of the methods needed for creating the scene tree are contained in the probing_project/scene_tree.py file.
The remaining methods are imported from probing_project/utils.py and probing_project/tasks/depth_task.py.
- main.py: handles the entire probing program
- data:
  - raw: downloads of all needed datasets
  - intermediate: all processed data not needed for the main run
  - processed: all finished data needed for the main run
  - README.md: describes the data sources used in the paper and how to prepare the image region features
- scripts:
  - extract_flickr_30k_images.py: script for extracting image region features
- probing_project:
  - data:
    - datasets: dataset-specific preprocessing and loading methods
    - modules: pytorch-lightning DataModule-type classes for preprocessing and loading the data
    - probing_dataset.py: the project's PyTorch Dataset-type class
    - utils.py: utilities needed for the dataset class
  - embedding_mappings: processing of the embeddings before probing (DiskMapping does nothing)
  - losses: additional losses for some tasks
  - probes: the torch.nn.Module classes for the probes
  - reporters: the classes for computing metrics and results
  - tasks: the possible tasks to probe on; each task should have one or more accompanying probes
  - constants.py: general information needed, e.g. volta configs and optional settings for the help message
  - model.py: the main pytorch-lightning LightningModule-type class
  - scene_tree.py: the methods for generating the scene tree
  - utils.py: extra utility functions
- data:
- volta: CREATE MANUALLY a clone of the volta repository, needed for running with the multimodal-BERT models
First, clone the project and install its dependencies:
# clone project
git clone https://github.com/VSJMilewski/multimodal-probes
# install project
cd multimodal-probes
Manually install PyTorch following its Get Started page. We used version 1.10.1.
Next, install the other requirements.
pip install -r requirements.txt
If you want to use the multimodal models as used in the paper (currently the only setup supported by the code), clone the Volta Library into the root directory (or install it somewhere else and use a symbolic link).
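If Volta lives elsewhere on disk, the symbolic link can also be created from Python. This is a small sketch under assumptions: link_volta is a hypothetical helper that is not part of this repository, and the link name volta is taken from the repository layout above.

```python
from pathlib import Path


def link_volta(project_root, volta_checkout, name="volta"):
    """Create project_root/<name> as a symlink to an existing Volta checkout."""
    target = Path(volta_checkout).resolve()
    link = Path(project_root) / name
    if not target.is_dir():
        raise FileNotFoundError(f"no Volta checkout at {target}")
    if link.is_symlink() or link.exists():
        # Leave an existing clone or link untouched.
        return link
    link.symlink_to(target, target_is_directory=True)
    return link
```

For example, link_volta(".", "/path/to/volta") run from the root of this repository would create the volta entry next to main.py, where /path/to/volta stands in for wherever you cloned Volta.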
Simply run the main file with python main.py and set the needed options.
The minimum required options are:
- --task
- --probe
- --dataset
- --embeddings_model
An example run:
# run module
python main.py --task DepthTask --dataset Flickr30k --probe OneWordPSDProbe --embeddings_model ViLBERT
Use the --help flag to see the full set of options:
python main.py --help
Depending on which required arguments you have already set, the help output changes to show the options available with those settings.
Initial code for probes, evaluations, losses, and some of the data processing was taken from the Structural-Probes Project by Hewitt and Manning (2019).
If you use this repository, please cite:
@inproceedings{milewski-etal-2022-finding,
title = "Finding Structural Knowledge in Multimodal-{BERT}",
author = "Milewski, Victor and
de Lhoneux, Miryam and
Moens, Marie-Francine",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.388",
pages = "5658--5671",
abstract = "In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image and by the dependencies between the object regions in the image, respectively. We call this explicit visual structure the scene tree, that is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees.",
}