Showing 6 changed files with 94 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
--- | ||
title: Large Multimodal Model Support: GPT-4V and LLaVA Integration | ||
authors: beibinli | ||
tags: [LMM, multimodal] | ||
--- | ||
|
||
![Multimodal Model Architecture](img/teaser_lmm.png) | ||
|
||
**In Brief:** | ||
* Introducing the **Multimodal Conversable Agent** and the **LLaVA Agent** to enhance LMM functionalities. | ||
* Users can input text and images simultaneously using the `<img img_path>` tag to specify image loading. | ||
* Demonstrated through the [LLaVA notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb). | ||
|
||
## Introduction | ||
Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data. | ||
|
||
This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. | ||
Future AutoGen updates will introduce additional multimodal capabilities such as image generation with DALL-E models, audio processing, and video comprehension. | ||
|
||
Here, we emphasize the **Multimodal Conversable Agent** and the **LLaVA Agent** due to their growing popularity. | ||
GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from Llama-2. | ||
|
||
## Installation | ||
Incorporate the `lmm` feature during AutoGen installation: | ||
|
||
```bash | ||
pip install "pyautogen[lmm]<0.2" | ||
``` | ||
|
||
Subsequently, import the **Multimodal Conversable Agent** or **LLaVA Agent** from AutoGen: | ||
|
||
```python | ||
from autogen.agentchat import MultimodalConversableAgent # for GPT-4V | ||
from autogen.agentchat.contrib.llava_agent import LLaVAAgent # for LLaVA | ||
``` | ||
|
||
## Usage | ||
|
||
A simple syntax lets you combine text and images within a single string: wrap each image path or URL in an `<img ...>` tag. | ||
|
||
Example of an in-context learning prompt: | ||
|
||
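```python | ||
from autogen import UserProxyAgent | ||
from autogen.agentchat import MultimodalConversableAgent | ||
|
||
prompt = """You are now an image classifier for facial expressions. Here are | ||
some examples. | ||
<img happy.jpg> depicts a happy expression. | ||
<img http://some_location.com/sad.jpg> represents a sad expression. | ||
<img obama.jpg> portrays a neutral expression. | ||
Now, identify the facial expression of this individual: <img unknown.png> | ||
""" | ||
|
||
# Every agent needs a name; the names below are illustrative. | ||
# In practice, also pass an llm_config that points at a GPT-4V endpoint. | ||
agent = MultimodalConversableAgent(name="image_classifier") | ||
user = UserProxyAgent(name="user") | ||
user.initiate_chat(agent, message=prompt) | ||
``` | ||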
|
||
The `MultimodalConversableAgent` interprets the input prompt, extracting images from local or internet sources. | ||
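
To make the tag syntax concrete, here is a minimal sketch of how a prompt string could be split into ordered text and image parts. It is not AutoGen's actual parser: the helper name `split_prompt`, the regular expression, and the layout of the returned dictionaries are assumptions made for illustration, and the sketch only records where each image lives without loading it.

```python
import re
from typing import Dict, List

# Matches tags of the form <img path_or_url>; the location may not contain '>'.
IMG_TAG = re.compile(r"<img\s+([^>]+)>")


def split_prompt(prompt: str) -> List[Dict[str, str]]:
    """Split a prompt into ordered text and image parts (hypothetical helper)."""
    parts: List[Dict[str, str]] = []
    cursor = 0
    for match in IMG_TAG.finditer(prompt):
        # Keep any text that sits between the previous tag and this one.
        text = prompt[cursor:match.start()]
        if text:
            parts.append({"type": "text", "text": text})
        location = match.group(1).strip()
        # Classify the image as remote or local based on its prefix.
        is_url = location.startswith(("http://", "https://"))
        parts.append({
            "type": "image",
            "location": location,
            "source": "url" if is_url else "local",
        })
        cursor = match.end()
    # Keep any text that follows the last tag.
    tail = prompt[cursor:]
    if tail:
        parts.append({"type": "text", "text": tail})
    return parts


if __name__ == "__main__":
    demo = "Compare <img happy.jpg> with <img http://some_location.com/sad.jpg>."
    for part in split_prompt(demo):
        print(part)
```

Keeping the parts in their original order matters for in-context prompts like the one above, where each image is followed by the label that describes it.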
|
||
## Advanced Usage | ||
Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface. | ||
|
||
For example, the `FigureCreator` in our [notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent). | ||
The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs. | ||
With `human_input_mode="ALWAYS"`, you can also contribute suggestions for better visualizations. | ||
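
The coder and critic interplay described above can be pictured as a simple refinement loop. The sketch below uses plain Python callables as stand-ins for the two agents. The names `refine`, `toy_coder`, and `toy_critic` are hypothetical, and this is not how the notebook's `FigureCreator` is implemented; it only illustrates the control flow of drafting, critiquing, and revising.

```python
from typing import Callable, Optional


def refine(
    task: str,
    coder: Callable[[str, Optional[str]], str],
    critic: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Alternate between a coder and a critic until the critic has no feedback."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = coder(task, feedback)  # coder writes or revises the draft
        feedback = critic(draft)       # critic returns None when satisfied
        if feedback is None:
            break
    return draft


def toy_coder(task: str, feedback: Optional[str]) -> str:
    """Stand-in coder: appends a marker once it has received feedback."""
    code = f"plot({task!r})"
    return code + "  # with axis labels" if feedback else code


def toy_critic(draft: str) -> Optional[str]:
    """Stand-in critic: asks for axis labels until the draft mentions them."""
    return None if "axis labels" in draft else "Add axis labels."


if __name__ == "__main__":
    print(refine("sales", toy_coder, toy_critic))
```

The `max_rounds` cap is a design choice of this sketch: it keeps the loop from running indefinitely if the critic is never satisfied.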
|
||
## Reference | ||
- [GPT-4V System Card](https://openai.com/research/gpt-4v-system-card) | ||
- [LLaVA GitHub](https://github.com/haotian-liu/LLaVA) | ||
|
||
## Future Enhancements | ||
|
||
For further inquiries or suggestions, please open an issue in the [AutoGen repository](https://github.com/microsoft/autogen/) or contact me directly at beibin.li@microsoft.com. | ||
|
||
AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALL-E model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments. |