- [2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
- [2024/11/18] Please watch 👀 this repository for the latest updates.
LLaVA-o1 is the first vision-language model capable of spontaneous, systematic reasoning, similar to GPT-o1!
Our 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.
LLaVA-o1 begins by outlining the problem, interprets the relevant information in the image, reasons step by step, and ultimately reaches a well-supported conclusion.
Stay tuned! Our code, dataset, and pretrained weights are coming soon.
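Until the code and weights are released, the snippet below is only a minimal sketch of how the staged reasoning described above could be consumed: a response is assumed to be organized into a problem summary, an image caption, a reasoning trace, and a conclusion, each wrapped in a tag such as `<SUMMARY>...</SUMMARY>`. The tag names, the `split_stages` helper, and the mock response are illustrative assumptions, not the released interface.

```python
import re

# Assumed stage tags for illustration; the released model may use a different format.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def split_stages(response: str) -> dict:
    """Split a tagged response like "<SUMMARY>...</SUMMARY><CAPTION>..." into a dict.

    Stages missing from the response are simply omitted.
    """
    stages = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            stages[tag.lower()] = match.group(1).strip()
    return stages

# Mock response (not actual model output), used only to exercise the parser.
mock = (
    "<SUMMARY>The question asks how many apples are on the table.</SUMMARY>"
    "<CAPTION>The image shows a wooden table with three red apples.</CAPTION>"
    "<REASONING>Counting the visible apples from left to right gives three.</REASONING>"
    "<CONCLUSION>There are three apples on the table.</CONCLUSION>"
)
print(split_stages(mock)["conclusion"])  # -> There are three apples on the table.
```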
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@misc{xu2024llavao1letvisionlanguage,
      title={LLaVA-o1: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Li Hao and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
- The template is modified from Chat-UniVi and LLaVA.