Commit 70128bb
Add blog for LMM
BeibinLi committed Nov 2, 2023
1 parent f484518 commit 70128bb
Showing 6 changed files with 94 additions and 3 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/lmm-test.yml
@@ -9,7 +9,7 @@ on:
paths:
- 'autogen/**'
- 'test/agentchat/**'
- 'test/agentchat/contrib/**'
- 'test/agentchat/contrib/llava_agent.py'
- '.github/workflows/lmm-test.yml'
- 'setup.py'

@@ -42,7 +42,7 @@ jobs:
pip install qdrant_client[fastembed]
- name: Install packages and dependencies for LMM
run: |
pip install -e .[llava]
pip install -e .[lmm]
pip uninstall -y openai
- name: Test LMM and LLaVA
run: |
2 changes: 1 addition & 1 deletion setup.py
@@ -59,7 +59,7 @@
"mathchat": ["sympy", "pydantic==1.10.9", "wolframalpha"],
"retrievechat": ["chromadb", "tiktoken", "sentence_transformers", "pypdf", "ipython"],
"teachable": ["chromadb"],
"llava": ["replicate", "pillow"],
"lmm": ["replicate", "pillow"],
},
classifiers=[
"Programming Language :: Python :: 3",
Binary file added website/blog/2023-11-06-LMM/img/teaser_lmm.png
76 changes: 76 additions & 0 deletions website/blog/2023-11-06-LMM/index.mdx
@@ -0,0 +1,76 @@
---
title: "Large Multimodal Model Support: GPT-4V and LLaVA Integration"
authors: beibinli
tags: [LMM, multimodal]
---

![Multimodal Model Architecture](img/teaser_lmm.png)

**In Brief:**
* Introducing the **Multimodal Conversable Agent** and the **LLaVA Agent** to enhance LMM functionalities.
* Users can combine text and images in a single message, using the `<img img_path>` tag to reference an image by local path or URL.
* Demonstrated through the [LLaVA notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb).

## Introduction
Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs.
Future AutoGen updates will introduce additional multimodal capabilities such as image generation with DALLE models, audio processing, and video comprehension.

Here, we emphasize the **Multimodal Conversable Agent** and the **LLaVA Agent** due to their growing popularity.
GPT-4V represents the forefront of image comprehension, while LLaVA is an efficient model fine-tuned from LLaMA-2.

## Installation
Incorporate the `lmm` feature during AutoGen installation:

```bash
pip install "pyautogen[lmm]<0.2"
```

Subsequently, import the **Multimodal Conversable Agent** or **LLaVA Agent** from AutoGen:

```python
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
from autogen.agentchat.contrib.llava_agent import LLaVAAgent # for LLaVA
```

## Usage

A simple syntax has been defined to incorporate both messages and images within a single string.

Example of an in-context learning prompt:

```python
from autogen import UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

prompt = """You are now an image classifier for facial expressions. Here are
some examples.
<img happy.jpg> depicts a happy expression.
<img http://some_location.com/sad.jpg> represents a sad expression.
<img obama.jpg> portrays a neutral expression.
Now, identify the facial expression of this individual: <img unknown.png>
"""

# Agents require a name; `llm_config` is a placeholder for your GPT-4V configuration.
agent = MultimodalConversableAgent(name="image-classifier", llm_config=llm_config)
user = UserProxyAgent(name="user", human_input_mode="NEVER", code_execution_config=False)
user.initiate_chat(agent, message=prompt)
```

The `MultimodalConversableAgent` interprets the input prompt, extracting images from local or internet sources.
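
The `LLaVAAgent` follows the same conversational pattern. Below is a minimal sketch, assuming you have a LLaVA endpoint (a locally hosted server or a Replicate deployment) described by a `llava_config_list` placeholder that you define yourself:

```python
from autogen.agentchat.contrib.llava_agent import LLaVAAgent

# `llava_config_list` is a placeholder: point it at your own LLaVA endpoint
# (e.g., a local server or a Replicate deployment) before running.
llava_agent = LLaVAAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": llava_config_list, "temperature": 0.5},
)

# Reusing the `user` proxy from the snippet above:
user.initiate_chat(
    llava_agent,
    message="Describe this image: <img https://example.com/photo.png>",
)
```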

## Advanced Usage
Like other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and orchestration via the GroupChat interface.

For example, the `FigureCreator` in our [notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) integrates two agents: a coder (an `AssistantAgent`) and a critic (a multimodal agent).
The coder drafts Python code for visualizations, while the critic suggests improvements. Together, the agents iteratively refine the visual output.
With `human_input_mode="ALWAYS"`, you can also contribute your own suggestions for better visualizations.
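
As an illustration of this coder-critic pattern (a sketch, not the exact `FigureCreator` implementation), here is a minimal group chat; `llm_config` and `gpt4v_config` are placeholder configurations you would define:

```python
from autogen import AssistantAgent, GroupChat, GroupChatManager, UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Placeholder configs: supply your own text-model config (`llm_config`)
# and GPT-4V config (`gpt4v_config`).
coder = AssistantAgent(
    name="coder",
    system_message="Write Python code that saves the requested figure to result.png.",
    llm_config=llm_config,
)
critic = MultimodalConversableAgent(
    name="critic",
    system_message="Review the saved figure <img result.png> and suggest improvements.",
    llm_config=gpt4v_config,
)
user = UserProxyAgent(
    name="user",
    human_input_mode="ALWAYS",  # step in with your own suggestions each round
    code_execution_config={"work_dir": "figures"},
)

groupchat = GroupChat(agents=[user, coder, critic], messages=[], max_round=10)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user.initiate_chat(manager, message="Plot the sine and cosine curves on one figure.")
```

Here the user proxy executes the coder's scripts, and the multimodal critic reviews the rendered figure on each round.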

## Reference
- [GPT-4V System Card](https://openai.com/research/gpt-4v-system-card)
- [LLaVA GitHub](https://github.com/haotian-liu/LLaVA)

## Future Enhancements

AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALLE model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments.

For further inquiries or suggestions, please open an issue in the [AutoGen repository](https://github.com/microsoft/autogen/) or contact me directly at beibin.li@microsoft.com.
7 changes: 7 additions & 0 deletions website/blog/authors.yml
@@ -33,3 +33,10 @@ rickyloynd-microsoft:
title: Senior Research Engineer at Microsoft
url: https://github.com/rickyloynd-microsoft
image_url: https://github.com/rickyloynd-microsoft.png


beibinli:
name: Beibin Li
title: Senior Research Engineer at Microsoft
url: https://github.com/BeibinLi
image_url: https://github.com/beibinli.png
8 changes: 8 additions & 0 deletions website/docs/Installation.md
@@ -108,3 +108,11 @@ pip install "pyautogen[mathchat]<0.2"

Example notebooks:
[Using MathChat to Solve Math Problems](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_MathChat.ipynb)

* Large Multimodal Models

We now support both GPT-4V and LLaVA. See [this notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) for an example of our LLaVA agent.

```bash
pip install "pyautogen[lmm]<0.2"
```
