Robin: Multimodal (Visual-Language) Models. - CERC-AAI Lab - Robin v1.0 #184
Labels: llm (Large Language Models), llm-experiments (experiments with large language models), llm-function-calling (Function Calling with Large Language Models), Models (LLM and ML model repos and links), multimodal-llm (LLMs that combine modes such as text and image recognition)
CERC-AAI Lab - Robin v1.0
The Robin team is proud to present Robin, a suite of Multimodal (Visual-Language) Models.
These models outperform, or perform on par with, state-of-the-art models of similar scale.
In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual-language models.
As part of this first milestone, we release this LLaVA fork, which enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral, and OpenHermes 2.5) with pretrained vision encoders (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.
The models detailed below are available here: https://huggingface.co/agi-collective
The code used is available here: https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0
Also, some related work by our team on aligning multimodal models: https://arxiv.org/abs/2304.13765
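To make the vision side of this pipeline concrete, here is a minimal sketch of extracting patch features from a pretrained CLIP (or SigLIP) encoder with HuggingFace transformers. The checkpoint name, image file, and preprocessing details are illustrative assumptions and may not match Robin's actual training setup.

```python
# Illustrative sketch only: a pretrained CLIP or SigLIP vision encoder
# (via HuggingFace transformers) turns an image into patch features.
# The checkpoint name below is an assumption for the example and may
# differ from what Robin actually uses.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

VISION_ENCODER = "openai/clip-vit-large-patch14-336"  # or a SigLIP checkpoint

processor = AutoImageProcessor.from_pretrained(VISION_ENCODER)
vision_tower = AutoModel.from_pretrained(VISION_ENCODER).vision_model.eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Patch-level hidden states: (batch, num_patches [+ cls], hidden_dim)
    features = vision_tower(pixel_values).last_hidden_state

print(features.shape)  # these features are what a projector maps into the LLM
```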
LLaVA Architecture Overview
The LLaVA architecture, an acronym for Large Language and Vision Assistant, is a multimodal visual-language model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, Vicuna served as the language foundation, while OpenAI's CLIP ViT-Large served as the vision encoder.
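As a rough illustration of this wiring (not Robin's exact implementation), the sketch below shows a projection layer mapping vision-encoder patch features into the language model's embedding space, where the projected "visual tokens" are prepended to the text token embeddings. The dimensions and the two-layer MLP are assumptions for the example.

```python
# Minimal sketch of the LLaVA-style wiring described above. Dimensions and
# the 2-layer MLP projector are illustrative assumptions, not Robin's exact
# configuration.
import torch
import torch.nn as nn

class VisionToLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # A small MLP; a single linear layer is another common choice.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_features)

# Toy shapes: 576 image patches, 1024-dim vision features, 4096-dim LM embeddings.
projector = VisionToLMProjector()
patch_features = torch.randn(1, 576, 1024)   # from the vision encoder
text_embeds = torch.randn(1, 32, 4096)       # from the LM's embedding table
visual_tokens = projector(patch_features)
lm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LM
print(lm_inputs.shape)  # torch.Size([1, 608, 4096])
```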
Building upon this foundation, as part of this first milestone we study how the choice of language model, the choice of vision encoder, and finetuning the vision encoder each affect the performance of our multimodal model. In particular, we experiment with combining various versions of the Mistral AI LLM with DeepMind's SigLIP vision encoder.
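To make the space of variations explicit, here is a hedged sketch of the experiment grid this implies: each variant picks a language model, a vision encoder, and whether the vision encoder is finetuned. The identifiers below are illustrative placeholders, not the released model names.

```python
# Hedged sketch of the experiment grid implied above; identifiers are
# placeholders, not the exact names of the released checkpoints.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RobinVariant:
    language_model: str          # e.g. Vicuna, Mistral-7B, OpenHermes-2.5
    vision_encoder: str          # e.g. CLIP ViT-L or SigLIP
    finetune_vision_encoder: bool

LANGUAGE_MODELS = ["vicuna-7b", "mistral-7b", "openhermes-2.5-mistral-7b"]
VISION_ENCODERS = ["clip-vit-large", "siglip"]

variants = [
    RobinVariant(lm, ve, ft)
    for lm, ve, ft in product(LANGUAGE_MODELS, VISION_ENCODERS, (False, True))
]

for v in variants:
    print(v)
```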
Architecture Variations
The table below summarizes our model variations: the combinations of language model, vision encoder, and vision-encoder fine-tuning strategy.