qijimrc/VLM_Papers

Papers about Vision Language Models

This repo lists recent advances in vision-language models (VLMs), mainly contributed by Weihan Wang and Ji Qi.

Papers

| Model | Vision Enc. | Textual Enc. | Dec. | Multimodal Fusion | Pretraining Objectives | Pretraining Dataset | Published Year |
|---|---|---|---|---|---|---|---|
| ViLBERT | OD->Xformer | Xformer | / | Co-attn | MLM+ITM+MIM | CC3M | 2019 (NIPS) |
| LXMERT | OD+Xformer | Xformer | / | Co-attn | MLM+ITM+MIM+VQA | COCO+VG+VQA | 2019 (Arxiv) |
| VisualBERT | OD | Emb. | / | Merged-attn | MLM+ITM | COCO | 2019 (Arxiv) |
| UNITER | OD | Emb. | / | Merged-attn | MLM+ITM+MIM+WRA | COCO+VG+CC3M+SBU | 2020 (ECCV) |
| VL-BERT | OD | Emb. | / | Merged-attn | MLM+ITM | CC3M | 2020 (ICLR) |
| OSCAR | OD | Emb. | / | Merged-attn | MLM+ITM | 4.1M | 2020 (ECCV) |
| PixelBERT | CNN | Xformer | / | Merged-attn | MLM+ITM | COCO+VG | 2020 (Arxiv) |
| VILLA | OD | Emb. | / | Merged-attn | Adversarial Training+MLM+MIM+ITM | COCO+VG+CC3M+SBU | 2020 (NIPS) |
| ViLBERT-12in1 | OD->Xformer | Xformer | / | Co-attn | Multi Tasks | Multi Datasets | 2020 (CVPR) |
| CLIP | CNN/Xformer | Xformer | / | / | ITC | 400M | 2021 (ICML) |
| ALIGN | CNN | Xformer | / | / | ITC | 1800M | 2021 (ICML) |
| VinVL | OD | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+OI+OBJ365 | 2021 (CVPR) |
| MDETR | CNN | Xformer | | Merged-attn | OD+Token Prediction+Contrastive Alignment | COCO+VG+Flickr | 2021 (ICCV) |
| VL-T5 | OD | Emb. | | Merged-attn | MLM+ITM+VQA+Grounding+Captioning | COCO+VG | 2021 (ICML) |
| CLIP-VIL | CNN | Emb. | / | Merged-attn | MLM+ITM+VQA | COCO+VG+VQA | 2021 (Arxiv) |
| SOHO | CNN | Emb. | / | Merged-attn | MLM+ITM+MIM | COCO+VG | 2021 (CVPR) |
| VILT | Patch Emb. | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2021 (ICCV) |
| ALBEF | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC12M+SBU | 2021 (NIPS) |
| VLMO | Xformer | Xformer | / | Multiway-attn | MLM+ITM+ITC | 4M/1000M | 2021 (Arxiv) |
| Florence | Xformer | Xformer | / | / | ITC | 900M | 2021 (Arxiv) |
| OFA | CNN | Emb. | | Co-attn | Multi Tasks | 20M | 2022 (ICML) |
| METER | Xformer | Xformer | / | Co-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2022 (CVPR) |
| GLIP | Xformer | Xformer | / | Co-attn | OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (CVPR) |
| GLIP-v2 | Xformer | Xformer | / | Co-attn | MLM+OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (NIPS) |
| SimVLM | CNN | Emb. | / | Merged-attn | PrefixLM | 1800M | 2022 (ICLR) |
| Flamingo | Xformer | Xformer | | Co-attn | ITC+Captioning+.. | 1.8B+LTIP+VTP | 2022 (Arxiv) |
| PALI | Xformer | Xformer | | Co-attn | Multi Tasks | 10b image+12b text+29b image-ocr | 2022 (Arxiv) |
| FIBER | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC3M+SBU | 2022 (Arxiv) |
| COCA | Xformer | Xformer | | Co-attn | ITC+Captioning | JFT-3B+Align | 2022 (Arxiv) |
| BEIT-3 | Xformer | Xformer | / | Co-attn | MLM | COCO+VG+CC3M+CC12M+SBU | 2022 (Arxiv) |

Notes:

  • I : image inputs
  • T : text inputs
  • OD : object detector
  • Xformer : transformer
  • Emb. : embedding
  • MLM : masked language modeling
  • MIM : masked image modeling
  • ITM : image-text matching
  • WRA : word-region alignment
  • ITC : image-text contrastive learning
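As a rough illustration of the ITC objective listed above, the sketch below computes a symmetric contrastive loss over a small batch of paired image and text embeddings. It is only a minimal example under stated assumptions, not the formulation of any particular paper in the table: the function name `itc_loss`, the default `temperature=0.07`, and the use of cosine similarity are all choices made here for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text: image i should score its own text i highest.
    i2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image: text j should score its own image j highest.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each image is most similar to its own text and grows when the pairs are mismatched, which is the behavior the ITC entry in the table refers to.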

Multimodal Adversarial Dataset

1. CARETS: A Consistency And Robustness Evaluative Test Suite for VQA (ACL 2022) [paper]

2. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (ACL 2022) [paper]
