Research Trends in LLM-guided Multimodal Learning
- Modalities:
  - text, vision (image and video), audio, ...
- Large Language Model (LLM) Backbones:
  - LLaMA, Alpaca, Vicuna, Bloom, GLM, OPT, ...
  - The LLM should be open-source and research-friendly.
  - Relatively small backbones (e.g., BART and T5) are also acceptable.
- Learning Techniques:
  - full fine-tuning, parameter-efficient tuning (Adapter, LoRA, ...); see the sketch after this list
  - in-context learning, instruction tuning
  - ...
- Examples of LLM-guided Multimodal Models:
  - OpenFlamingo, MiniGPT-4, Otter, InstructBLIP, BLIVA, ...
- Examples of Evaluation on Multimodal LLMs:
  - MultiInstruct, POPE, AttackVLM, ...
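To make the parameter-efficient tuning item above concrete, here is a minimal LoRA sketch, assuming the HuggingFace `transformers` and `peft` libraries; the model name and target module names are illustrative placeholders and depend on the chosen backbone.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder: any open-source LLM backbone from the list above could be used here.
model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA injects small trainable low-rank matrices into selected weight matrices
# while the original backbone weights stay frozen.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```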
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. arXiv:2308.09936.
  Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu. [Paper] [Code]
  Backbone: Vicuna-7B and Flan-T5-XXL (11B).
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. arXiv:2306.00890.
  Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao. [Paper] [Code]
  Backbone: based on LLaVA (with Vicuna-13B).
- Ziya-Visual.
  Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, Jianheng Zhuo, Qi Yang, Yongfeng Huang, Xiayu Li, Yanghan Wu, Junyu Lu, Xinyu Zhu, Weifeng Chen, Ting Han, Kunhao Pan, Rui Wang, Hao Wang, Xiaojun Wu, Zhongshen Zeng, Chongpei Chen. [Paper] [Code]
  Backbone: based on Ziya-LLaMA-13B-v1.
- Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding. arXiv:2306.02858.
  Hang Zhang, Xin Li, Lidong Bing. [Paper] [Code]
  Backbone: Vicuna-7B and Vicuna-13B.
- Transfer Visual Prompt Generator across LLMs. arXiv:2305.01278.
  Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua. [Paper] [Code]
  Backbone: OPT (125M, 350M, 1.3B, and 2.7B) and Flan-T5 (base, large, and XL).
- LMEye: An Interactive Perception Network for Large Language Models. arXiv:2305.03701.
  Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Min Zhang. [Paper] [Code]
  Backbone: LLaMA-7B, LLaMA-13B and Bloomz-7B.
- Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv:2305.03726.
  Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu. [Paper] [Code]
  Backbone: based on OpenFlamingo-9B.
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. arXiv:2305.04160.
  Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu. [Paper] [Code]
  Backbone: ChatGLM.
- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arXiv:2305.04790.
  Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen. [Paper] [Code]
  Backbone: based on OpenFlamingo.
- VideoChat: Chat-Centric Video Understanding. arXiv:2305.06355.
  KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao. [Paper] [Code]
  Backbone: based on MiniGPT-4, MOSS and StableLM.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
  Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. [Paper] [Code]
  Backbone: Flan-T5-XL (3B), Flan-T5-XXL (11B), Vicuna-7B and Vicuna-13B.
- ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4. arXiv:2305.07490.
  Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, Kun Wang. [Paper] [Code]
  Backbone: based on MiniGPT-4.
- Evaluating Object Hallucination in Large Vision-Language Models. arXiv:2305.10355.
  Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, Ji-Rong Wen. [Paper] [Code]
  Evaluation.
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. arXiv:2305.15021.
  Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo. [Paper] [Code]
  Backbone: LLaMA-7B.
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models. arXiv:2305.15023.
  Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji. [Paper] [Code]
  Backbone: LLaMA-7B and LLaMA-13B.
- On Evaluating Adversarial Robustness of Large Vision-Language Models. arXiv:2305.16934.
  Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin. [Paper] [Code]
  Evaluation.
- VisualGLM-6B.
  ChatGLM Team. [Paper] [Code]
  Backbone: ChatGLM-6B.
- Generating Images with Multimodal Language Models. arXiv:2305.17216.
  Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. [Paper] [Code]
  Backbone: OPT-6.7B.
- Visual Instruction Tuning. arXiv:2304.08485.
  Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. [Paper] [Code]
  Backbone: Vicuna-13B.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592.
  Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. [Paper] [Code]
  Backbone: Vicuna-7B.
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178.
  Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang. [Paper] [Code]
  Backbone: LLaMA-7B.
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv:2304.15010.
  Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao. [Paper] [Code]
  Backbone: LLaMA-7B.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597.
  Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. [Paper] [Code]
  Backbone: OPT-2.7B, OPT-6.7B, FLAN-T5-XL and FLAN-T5-XXL.
- Grounding Language Models to Images for Multimodal Inputs and Outputs. arXiv:2301.13823. ICML 2023.
  Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried. [Paper] [Code]
  Backbone: OPT-6.7B.
- Multimodal Chain-of-Thought Reasoning in Language Models. arXiv:2302.00923.
  Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola. [Paper] [Code]
  Backbone: T5 and FLAN-T5.
- Language Is Not All You Need: Aligning Perception with Language Models. arXiv:2302.14045.
  Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei. [Paper] [Code]
  Backbone: MAGNETO.
- PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378.
  Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence. [Paper] [Code]
  Backbone: PaLM-8B, PaLM-62B and PaLM-540B.
- OpenFlamingo.
  Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt. [Paper] [Code]
  Backbone: LLaMA-7B.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. arXiv:2303.16199.
  Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, Yu Qiao. [Paper] [Code]
  Backbone: LLaMA-7B.
- VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks. arXiv:2112.06825. CVPR 2022.
  Yi-Lin Sung, Jaemin Cho, Mohit Bansal. [Paper] [Code]
  Backbone: BART and T5.
- HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks. arXiv:2203.03878.
  Zhengkun Zhang, Wenya Guo, Xiaojun Meng, Yasheng Wang, Yadao Wang, Xin Jiang, Qun Liu, Zhenglu Yang. [Paper] [Code]
  Backbone: T5.
- Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198. NeurIPS 2022.
  Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan. [Paper] [Code]
  Backbone: Chinchilla-70B.
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. arXiv:2206.06522. NeurIPS 2022.
  Yi-Lin Sung, Jaemin Cho, Mohit Bansal. [Paper] [Code]
  Backbone: T5.
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. arXiv:2206.08155. NeurIPS 2022.
  Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid. [Paper] [Code]
  Backbone: DeBERTa-V2-XLarge.
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. arXiv:2212.10773.
  Zhiyang Xu, Ying Shen, Lifu Huang. [Paper] [Code]
  Evaluation.
- Unifying Vision-and-Language Tasks via Text Generation. arXiv:2102.02779. ICML 2021.
  Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal. [Paper] [Code]
  Backbone: BART and T5.
- Multimodal Few-Shot Learning with Frozen Language Models. arXiv:2106.13884. NeurIPS 2021.
  Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill. [Paper] [Code]
  Backbone: Transformer-7B.
Currently, most multimodal LLMs are vision-and-language models:
Vision-and-Language LLM = LLM backbone + Vision backbone (see the sketch below).
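A schematic PyTorch sketch of this composition follows. It is an illustration, not any specific paper's implementation: the class and argument names are made up, the linear projector is a simplification (several models instead use a Q-Former or cross-attention), and an HF-style `inputs_embeds` interface on the LLM is assumed.

```python
import torch
import torch.nn as nn

class VisionLanguageLLM(nn.Module):
    """Schematic vision-and-language LLM: frozen vision backbone + projector + frozen LLM."""

    def __init__(self, vision_backbone: nn.Module, llm_backbone: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_backbone = vision_backbone  # e.g., a ViT returning patch features
        self.llm_backbone = llm_backbone        # e.g., a decoder-only LLM accepting input embeddings
        # A lightweight projector maps visual features into the LLM token-embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Both backbones are typically kept frozen; only the projector (plus adapters, if any) is trained.
        for p in self.vision_backbone.parameters():
            p.requires_grad = False
        for p in self.llm_backbone.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        visual_feats = self.vision_backbone(images)    # (B, num_patches, vision_dim)
        visual_tokens = self.projector(visual_feats)   # (B, num_patches, llm_dim)
        # Prepend the projected visual tokens to the text embeddings and let the LLM decode.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm_backbone(inputs_embeds=fused)
```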
Here are some useful links for your reference:
- LLM backbone list (especially open-source LLMs): https://github.com/eugeneyan/open-llms
- Vision backbone list (e.g., ViT-22B and ViT-L): https://github.com/bethgelab/model-vs-human
- LLM & Vision backbone list: https://github.com/zhengzangw/awesome-huge-models
- HuggingFace LLM leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- Multimodal LLM learning & tools (e.g., Visual ChatGPT and HuggingGPT) & dataset list: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
- Research trends in AI with cutting-edge papers.
Feel free to open a pull request! The only restriction is that the topic must be LLM-guided multimodal learning.
Please add the paper information in the following format:
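(A template inferred from the existing entries; bracketed fields are placeholders.)

```
- [Title]. arXiv:[ID]. [Venue, if any].
  [Author 1], [Author 2], ..., [Author N]. [Paper] [Code]
  Backbone: [LLM backbone(s)].
```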
For any interesting news about LLM-guided multimodal learning on Twitter, you can also mention @Zi_Yuan_Hu, and we will follow up and update it in our Awesome-Multimodal-LLM GitHub repo.
Hope everyone enjoys the LLM-guided future :)