Robin: Multimodal (Visual-Language) Models. - CERC-AAI Lab - Robin v1.0 #184
Labels: llm (Large Language Models), llm-experiments (experiments with large language models), llm-function-calling (Function Calling with Large Language Models), Models (LLM and ML model repos and links), multimodal-llm (LLMs that combine modes such as text and image recognition)
CERC-AAI Lab - Robin v1.0
The Robin team is proud to present Robin, a suite of Multimodal (Visual-Language) Models.
These models outperform, or perform on par with, state-of-the-art models of similar scale.
In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual-language models.
As part of this first milestone, we release this LLaVA fork, which enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral, and OpenHermes 2.5) with pretrained vision encoders (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.
The models detailed below are available here: https://huggingface.co/agi-collective
The code used is available here: https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0
Also, some related work by our team on aligning multimodal models: https://arxiv.org/abs/2304.13765
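To make the vision side of this pipeline concrete, here is a minimal sketch of extracting patch features from a pretrained CLIP (or SigLIP) encoder with HuggingFace transformers. The checkpoint name, image file, and preprocessing details are illustrative assumptions and may not match Robin's actual training setup.

```python
# Illustrative sketch only: a pretrained CLIP or SigLIP vision encoder
# (via HuggingFace transformers) turns an image into patch features.
# The checkpoint name below is an assumption for the example and may
# differ from what Robin actually uses.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

VISION_ENCODER = "openai/clip-vit-large-patch14-336"  # or a SigLIP checkpoint

processor = AutoImageProcessor.from_pretrained(VISION_ENCODER)
vision_tower = AutoModel.from_pretrained(VISION_ENCODER).vision_model.eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Patch-level hidden states: (batch, num_patches [+ cls], hidden_dim)
    features = vision_tower(pixel_values).last_hidden_state

print(features.shape)  # these features are what a projector maps into the LLM
```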
LLaVA Architecture Overview
The LLaVA architecture, an acronym for Large Language and Vision Assistant, is a multimodal visual-language model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, Vicuna served as the language foundation, while OpenAI's CLIP ViT-Large served as the vision encoder.
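As a rough illustration of this wiring (not Robin's exact implementation), the sketch below shows a projection layer mapping vision-encoder patch features into the language model's embedding space, where the projected "visual tokens" are prepended to the text token embeddings. The dimensions and the two-layer MLP are assumptions for the example.

```python
# Minimal sketch of the LLaVA-style wiring described above. Dimensions and
# the 2-layer MLP projector are illustrative assumptions, not Robin's exact
# configuration.
import torch
import torch.nn as nn

class VisionToLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # A small MLP; a single linear layer is another common choice.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_features)

# Toy shapes: 576 image patches, 1024-dim vision features, 4096-dim LM embeddings.
projector = VisionToLMProjector()
patch_features = torch.randn(1, 576, 1024)   # from the vision encoder
text_embeds = torch.randn(1, 32, 4096)       # from the LM's embedding table
visual_tokens = projector(patch_features)
lm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LM
print(lm_inputs.shape)  # torch.Size([1, 608, 4096])
```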
Building upon this foundation, as part of this first milestone we study how the choice of language model, the choice of vision encoder, and finetuning the vision encoder each affect the performance of our multimodal model. In particular, we experiment with combining various versions of the Mistral AI LLM with DeepMind's SigLIP vision encoder.
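To make the space of variations explicit, here is a hedged sketch of the experiment grid this implies: each variant picks a language model, a vision encoder, and whether the vision encoder is finetuned. The identifiers below are illustrative placeholders, not the released model names.

```python
# Hedged sketch of the experiment grid implied above; identifiers are
# placeholders, not the exact names of the released checkpoints.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RobinVariant:
    language_model: str          # e.g. Vicuna, Mistral-7B, OpenHermes-2.5
    vision_encoder: str          # e.g. CLIP ViT-L or SigLIP
    finetune_vision_encoder: bool

LANGUAGE_MODELS = ["vicuna-7b", "mistral-7b", "openhermes-2.5-mistral-7b"]
VISION_ENCODERS = ["clip-vit-large", "siglip"]

variants = [
    RobinVariant(lm, ve, ft)
    for lm, ve, ft in product(LANGUAGE_MODELS, VISION_ENCODERS, (False, True))
]

for v in variants:
    print(v)
```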
Architecture Variations
The table below summarizes our model variations: the combinations of language model, vision encoder, and vision-encoder fine-tuning strategy.