Add blog for LMM

microsoft · Nov 2, 2023 · 70128bb
1 parent f484518

Showing 6 changed files with 94 additions and 3 deletions.
diff --git a/.github/workflows/lmm-test.yml b/.github/workflows/lmm-test.yml
@@ -9,7 +9,7 @@ on:
     paths:
       - 'autogen/**'
       - 'test/agentchat/**'
-      - 'test/agentchat/contrib/**'
+      - 'test/agentchat/contrib/llava_agent.py'
       - '.github/workflows/lmm-test.yml'
       - 'setup.py'
 
@@ -42,7 +42,7 @@ jobs:
           pip install qdrant_client[fastembed]
       - name: Install packages and dependencies for LMM
         run: |
-          pip install -e .[llava]
+          pip install -e .[lmm]
           pip uninstall -y openai
       - name: Test LMM and LLaVA
         run: |

diff --git a/setup.py b/setup.py
@@ -59,7 +59,7 @@
         "mathchat": ["sympy", "pydantic==1.10.9", "wolframalpha"],
         "retrievechat": ["chromadb", "tiktoken", "sentence_transformers", "pypdf", "ipython"],
         "teachable": ["chromadb"],
-        "llava": ["replicate", "pillow"],
+        "lmm": ["replicate", "pillow"],
     },
     classifiers=[
         "Programming Language :: Python :: 3",

diff --git a/website/blog/2023-11-06-LMM/img/teaser_lmm.png b/website/blog/2023-11-06-LMM/img/teaser_lmm.png
diff --git a/website/blog/2023-11-06-LMM/index.mdx b/website/blog/2023-11-06-LMM/index.mdx
@@ -0,0 +1,76 @@
+---
+title: "Large Multimodal Model Support: GPT-4V and LLaVA Integration"
+authors: beibinli
+tags: [LMM, multimodal]
+---
+
+![Multimodal Model Architecture](img/teaser_lmm.png)
+
+**In Brief:**
+* Introducing the **Multimodal Conversable Agent** and the **LLaVA Agent** to enhance LMM functionalities.
+* Users can input text and images simultaneously using the `<img img_path>` tag to specify image loading.
+* Demonstrated through the [LLaVA notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb).
+
+## Introduction
+Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.
+
+This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs.
+Future AutoGen updates will introduce additional multimodal capabilities such as image generation with DALLE models, audio processing, and video comprehension.
+
+Here, we emphasize the **Multimodal Conversable Agent** and the **LLaVA Agent** due to their growing popularity.
+GPT-4V represents the forefront of image comprehension, while LLaVA is an efficient model fine-tuned from Llama-2.
+
+## Installation
+Incorporate the `lmm` feature during AutoGen installation:
+
+```bash
+pip install "pyautogen[lmm]<0.2"
+```
+
+Subsequently, import the **Multimodal Conversable Agent** or **LLaVA Agent** from AutoGen:
+
+```python
+from autogen.agentchat import MultimodalConversableAgent # for GPT-4V
+from autogen.agentchat.contrib.llava_agent import LLaVAAgent # for LLaVA
+```
+
+## Usage
+
+A simple syntax has been defined to incorporate both messages and images within a single string.
+
+Example of an in-context learning prompt:
+
+```python
+prompt = """You are now an image classifier for facial expressions. Here are
+some examples.
+
+<img happy.jpg> depicts a happy expression.
+<img http://some_location.com/sad.jpg> represents a sad expression.
+<img obama.jpg> portrays a neutral expression.
+
+Now, identify the facial expression of this individual: <img unknown.png>
+"""
+
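+from autogen import UserProxyAgent
+
+# Both agents need a name; the names below are placeholders.
+agent = MultimodalConversableAgent(name="image-explainer")
+user = UserProxyAgent(name="user")
+user.initiate_chat(agent, message=prompt)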
+```
+
+The `MultimodalConversableAgent` interprets the input prompt, extracting images from local or internet sources.
+
+## Advanced Usage
+Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface.
+
+For example, the `FigureCreator` in our [notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent).
+The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs.
+With `human_input_mode="ALWAYS"`, you can also contribute suggestions for better visualizations.
+
+## Reference
+- [GPT-4V System Card](https://openai.com/research/gpt-4v-system-card)
+- [LLaVA GitHub](https://github.com/haotian-liu/LLaVA)
+
+## Future Enhancements
+
+For further inquiries or suggestions, please open an issue in the [AutoGen repository](https://github.com/microsoft/autogen/) or contact me directly at beibin.li@microsoft.com.
+
+AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALLE model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments.
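
The `<img img_path>` syntax described in the blog post's Usage section can be illustrated with a small parser. This is a minimal sketch, not AutoGen's actual implementation: `IMG_TAG` and `split_prompt` are hypothetical names, and the rule that a location starting with `http://` or `https://` is a URL while anything else is a local file is an assumption made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical pattern: "<img", whitespace, then everything up to the closing ">".
IMG_TAG = re.compile(r"<img\s+([^>]+)>")


def split_prompt(prompt: str) -> List[Dict[str, str]]:
    """Split a prompt into ordered text and image parts (illustrative only)."""
    parts: List[Dict[str, str]] = []
    cursor = 0
    for match in IMG_TAG.finditer(prompt):
        # Keep any text that sits between the previous tag and this one.
        text = prompt[cursor:match.start()]
        if text:
            parts.append({"type": "text", "value": text})
        location = match.group(1).strip()
        # Assumption: only http(s) locations are treated as remote images.
        source = "url" if location.startswith(("http://", "https://")) else "file"
        parts.append({"type": "image", "value": location, "source": source})
        cursor = match.end()
    # Keep whatever text follows the last tag.
    tail = prompt[cursor:]
    if tail:
        parts.append({"type": "text", "value": tail})
    return parts


parts = split_prompt(
    "Compare <img happy.jpg> with <img http://some_location.com/sad.jpg> please."
)
for part in parts:
    print(part)
```

The sketch only separates text from image locations and preserves their order. Loading the image bytes and formatting them for a particular model are left out.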
diff --git a/website/blog/authors.yml b/website/blog/authors.yml
@@ -33,3 +33,10 @@ rickyloynd-microsoft:
   title: Senior Research Engineer at Microsoft
   url: https://github.com/rickyloynd-microsoft
   image_url: https://github.com/rickyloynd-microsoft.png
+
+
+beibinli:
+  name: Beibin Li
+  title: Senior Research Engineer at Microsoft
+  url: https://github.com/BeibinLi
+  image_url: https://github.com/beibinli.png
diff --git a/website/docs/Installation.md b/website/docs/Installation.md
@@ -108,3 +108,11 @@ pip install "pyautogen[mathchat]<0.2"
 
 Example notebooks:
 [Using MathChat to Solve Math Problems](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_MathChat.ipynb)
+
+* Large Multimodal Models
+
+We now support both GPT-4V and LLaVA. See [this notebook](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb) for an example of our LLaVA agent.
+
+```bash
+pip install "pyautogen[lmm]<0.2"
+```
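
After installing the `lmm` extra, a quick check can confirm that its two dependencies from `setup.py` (`replicate` and `pillow`) are importable. This is a minimal sketch using only the standard library: `missing_modules` and `LMM_MODULES` are hypothetical names, not part of AutoGen.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the packages listed under the `lmm` extra in setup.py.
# The `pillow` distribution is imported as `PIL`.
LMM_MODULES = ("replicate", "PIL")


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules(LMM_MODULES)
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All optional dependencies for the lmm extra are importable.")
```

The check only looks modules up and does not import them, so it does not verify that the installed versions are compatible.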