Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]
By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu
arxiv: https://arxiv.org/pdf/2104.03135.pdf
This is the official implementation of the paper. In this paper, we propose SOHO to "See Out of tHe bOx", i.e., to take a whole image as input and learn vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.
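To make the idea concrete, below is a minimal, simplified sketch of this pipeline, not the code in this repository: a trainable CNN backbone encodes the whole image into a grid of features, a visual dictionary assigns each grid feature to its nearest entry, and the resulting visual tokens are fed into a transformer together with the text tokens. Layer sizes, the dictionary size, the vocabulary size, and the straight-through gradient trick are illustrative assumptions.
# sketch_soho.py -- a minimal, simplified sketch of the end-to-end idea (NOT the repository code)
import torch
import torch.nn as nn
import torchvision

class SOHOSketch(nn.Module):
    def __init__(self, dict_size=2048, hidden=768, vocab_size=30522, num_layers=12):
        super().__init__()
        # trainable CNN backbone: the whole image goes in, no region proposals or boxes
        resnet = torchvision.models.resnet18(pretrained=False)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # (B, 512, H/32, W/32)
        self.proj = nn.Linear(512, hidden)
        # visual dictionary: each grid feature is assigned to its nearest dictionary entry
        self.visual_dict = nn.Embedding(dict_size, hidden)
        self.text_embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images, text_ids):
        feat = self.backbone(images).flatten(2).transpose(1, 2)        # (B, h*w, 512)
        feat = self.proj(feat)                                         # (B, h*w, hidden)
        # nearest-neighbour assignment to the visual dictionary
        dist = (feat.pow(2).sum(-1, keepdim=True)
                - 2 * feat @ self.visual_dict.weight.t()
                + self.visual_dict.weight.pow(2).sum(-1))
        idx = dist.argmin(-1)                                          # discrete visual token ids
        # straight-through estimator so gradients still reach the backbone (illustrative)
        vis_tokens = feat + (self.visual_dict(idx) - feat).detach()
        seq = torch.cat([self.text_embed(text_ids), vis_tokens], 1)    # [text; visual] tokens
        return self.encoder(seq.transpose(0, 1)).transpose(0, 1)       # (B, T + h*w, hidden)

model = SOHOSketch()
out = model(torch.randn(2, 3, 384, 384), torch.randint(0, 30522, (2, 16)))
print(out.shape)   # torch.Size([2, 160, 768]) for 384x384 inputs (16 text + 12*12 visual tokens)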
- VQA Codebase
- Pre-training Codebase
conda create -n soho python=3.7
conda activate soho
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext
cd ../ && rm -rf apex
git clone https://github.com/researchmm/soho.git
cd $SOHO_ROOT  # $SOHO_ROOT denotes the root of the cloned soho repository
python setup.py develop
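As a quick sanity check (this snippet is not part of the repository), you can confirm that PyTorch sees the GPUs and that apex was built correctly:
# sanity check for the installation above (not part of the repository)
import torch
from apex import amp  # raises ImportError if apex did not build correctly

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| visible GPUs:", torch.cuda.device_count())
print("apex.amp imported:", amp is not None)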
- Download the training, validation and test data
# download the pre-training dataset
mkdir -p $SOHO_ROOT/data/vg_coco_pre
cd $SOHO_ROOT/data/vg_coco_pre
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
# download the Visual Genome (VG) images
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
unzip images.zip
unzip images2.zip
rm -rf images.zip images2.zip
mv VG_100K_2/*.jpg VG_100K/
cd VG_100K
zip -r images.zip .
mv images.zip ../
cd ..
rm -rf VG_100K*
# download the pre-training annotations
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_train_pre.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_val_pre.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/vg_cap_pre.json
# download the VQA dataset
mkdir -p $SOHO_ROOT/data/coco
cd $SOHO_ROOT/data/coco
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/train_data_vqa.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/val_data_vqa.json
wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/test_data_vqa.json
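After the downloads finish, a short check such as the one below (not part of the repository; it does not assume any particular JSON schema) confirms that the annotation files are in place and parse correctly:
# verify the downloaded annotation files (not part of the repository)
import json, os

SOHO_ROOT = os.environ.get("SOHO_ROOT", ".")
annotation_files = [
    "data/vg_coco_pre/coco_cap_train_pre.json",
    "data/vg_coco_pre/coco_cap_val_pre.json",
    "data/vg_coco_pre/vg_cap_pre.json",
    "data/coco/train_data_vqa.json",
    "data/coco/val_data_vqa.json",
    "data/coco/test_data_vqa.json",
]
for rel in annotation_files:
    with open(os.path.join(SOHO_ROOT, rel)) as f:
        data = json.load(f)
    print(f"{rel}: OK ({type(data).__name__} with {len(data)} entries)")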
- Train the pre-training models
cd $SOHO_ROOT
# train the pre-training model with 8 GPUs
bash tools/dist_train.sh configs/Pretrain/soho_res18_pre.py 8

# alternatively, download the released pre-trained weights
mkdir -p $SOHO_ROOT/work_dirs/pretrained
cd $SOHO_ROOT/work_dirs/pretrained
wget https://sohose.s3.ap-southeast-1.amazonaws.com/checkpoint/soho_res18_fp16_40-9441cdd3.pth
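If you use the released weights, the checkpoint can be inspected with plain PyTorch before fine-tuning (this snippet is not part of the repository; the mmcv-style 'state_dict'/'meta' layout is an assumption):
# peek at the downloaded pre-training checkpoint (not part of the repository)
import torch

ckpt = torch.load("work_dirs/pretrained/soho_res18_fp16_40-9441cdd3.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    # mmcv-style checkpoints usually nest the weights under 'state_dict'
    state = ckpt.get("state_dict", ckpt)
    print("parameter tensors:", len(state))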
- Train a VQA model
cd $SOHO_ROOT
# use 8 GPUs to train the VQA model
bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8
- Evaluate a VQA model
# evaluate the epoch-18 checkpoint with 8 GPUs
bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8
If you find this repo useful in your research, please consider citing the following papers:
@inproceedings{huang2021seeing,
title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2021}
}
@article{huang2020pixel,
title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
journal={arXiv preprint arXiv:2004.00849},
year={2020}
}
We would like to thank the authors of mmcv and mmdetection. Our commons library is built on top of mmcv.