We present SOLO, a single Transformer architecture for unified vision-language modeling. SOLO accepts both raw image patches (in pixels) and text as input, without relying on a separate pre-trained vision encoder.
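For intuition, the sketch below shows one way raw pixel patches and text tokens can share a single Transformer: each patch is flattened and linearly projected into the same embedding space as the text tokens, and the concatenated sequence is processed by one backbone. This is an illustrative sketch only; the layer names, dimensions, and patch size are assumptions and not SOLO's actual configuration (see the paper for the real architecture).

```python
import torch
import torch.nn as nn

class UnifiedVLSketch(nn.Module):
    """Illustrative sketch of a single-Transformer VLM: raw image patches are
    linearly projected into the text-token embedding space (no vision encoder).
    Names and sizes are assumptions, not SOLO's actual configuration; a real
    model would also add positional information, causal masking, and an LM head."""
    def __init__(self, vocab_size=32000, d_model=256, patch_dim=3 * 16 * 16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # raw pixels -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in depth

    def forward(self, patches, token_ids):
        # patches: (B, num_patches, patch_dim) raw pixels; token_ids: (B, seq_len)
        x = torch.cat([self.patch_proj(patches), self.token_emb(token_ids)], dim=1)
        return self.backbone(x)

# Example: 4 image patches of 16x16 RGB pixels plus 8 text tokens in one sequence.
model = UnifiedVLSketch()
patches = torch.rand(1, 4, 3 * 16 * 16)
token_ids = torch.randint(0, 32000, (1, 8))
hidden = model(patches, token_ids)  # shape: (1, 12, 256)
```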
- ✅ Release the instruction tuning data mixture
- ✅ Release the code for instruction tuning
- ✅ Release the pre-training code
- ✅ Release the SOLO model: 🤗 Model (SOLO-7B)
- ✅ Paper on arXiv: 📃 Paper
```bash
git clone https://github.com/Yangyi-Chen/SOLO
cd SOLO
git submodule update --init --recursive
```
```bash
conda env create -f environment.yml
conda activate solo
```

or simply:

```bash
pip install -r requirements.txt
```
Check `scripts/notebook/demo.ipynb` for an example of performing inference with the model.
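For a rough picture of what the notebook does, here is a minimal inference sketch using the standard Hugging Face `transformers` API. The repository id `YangyiYY/SOLO-7B`, the use of `AutoModelForCausalLM`/`AutoTokenizer` with `trust_remote_code`, and the text-only prompt are assumptions for illustration; follow the notebook for the exact loading code and the image-patch preprocessing SOLO expects.

```python
# Minimal inference sketch. The model id and loading interface below are
# assumptions for illustration; see scripts/notebook/demo.ipynb for the
# authoritative loading and image-patch preprocessing code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YangyiYY/SOLO-7B"  # assumed repo id; use the 🤗 Model link above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt for brevity; images are fed to SOLO as raw pixel patches,
# and the notebook shows how they are packed into the input sequence.
inputs = tokenizer("Describe a sunset over the ocean.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```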
Please refer to `PRETRAIN_GUIDE.md` for details on how to perform pre-training. The following table documents the data statistics used in pre-training:
Please refer to `SFT_GUIDE.md` for details on how to perform instruction fine-tuning. The following table documents the data statistics used in instruction fine-tuning:
If you use or extend our work, please consider citing our paper:

```bibtex
@article{chen2024single,
  title={A Single Transformer for Scalable Vision-Language Modeling},
  author={Chen, Yangyi and Wang, Xingyao and Peng, Hao and Ji, Heng},
  journal={arXiv preprint arXiv:2407.06438},
  year={2024}
}
```