
[Feature] Adds Image Generation Capability #1874

Closed

Conversation

WaelKarkoub
Contributor

@WaelKarkoub WaelKarkoub commented Mar 5, 2024

Why are these changes needed?

Proof of concept for using agent capabilities as a way of enabling multimodal communication. I found it difficult to extend agent capabilities to include multimodal interactions without undertaking extensive refactoring.

I went with a modular approach, treating different modalities as distinct agent capabilities. This strategy streamlines the integration of multimodal functions and enhances the versatility of "simple" agents with minimal adjustments to the existing architecture.

For this PR, I experimented with image generation, since I've seen quite a bit of great work already done by @BeibinLi. The idea is that users can add the ability to generate images to any of their existing agents. I architected this by creating an abstract class called ImageGenerator, which users can implement for their favorite API provider (there's an example for DALL-E, DalleImageGenerator). All the user has to do is pass their chosen generator to ImageGeneration (the agent capability for generating images) and add that capability to the agent.
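The interface described above can be sketched roughly like this. The class and method names follow the PR's description, but `FakeImageGenerator` and the string return value are placeholders standing in for a real DALL-E client:

```python
from abc import ABC, abstractmethod


class ImageGenerator(ABC):
    """Abstract base class for pluggable image-generation backends."""

    @abstractmethod
    def generate_image(self, prompt: str):
        """Return an image for the given text prompt."""
        ...


class FakeImageGenerator(ImageGenerator):
    """Placeholder backend; a real implementation (e.g. DalleImageGenerator)
    would call the provider's API here and return an actual image."""

    def generate_image(self, prompt: str) -> str:
        return f"<image for: {prompt}>"


# The capability (ImageGeneration in the PR) would receive an instance:
generator = FakeImageGenerator()
print(generator.generate_image("a watercolor fox"))
```

The point of the abstract class is that the capability never needs to know which provider is behind it; swapping DALL-E for another API is just a different subclass.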

ImageGeneration works by adding a custom reply function that checks: "Did I receive a message asking me to generate an image? If so, what is the prompt?" and generates the image accordingly.
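In spirit, that reply hook looks something like the sketch below. A crude keyword heuristic stands in for the PR's actual TextAnalyzerAgent call, and all names here are illustrative:

```python
import re
from typing import Optional, Tuple

IMAGE_REQUEST = re.compile(
    r"\b(draw|generate|create)\b.*\b(image|picture|photo)\b", re.IGNORECASE
)


def image_gen_reply(last_message: str) -> Tuple[bool, Optional[str]]:
    """Custom reply hook: (True, reply) claims the turn, (False, None) passes.

    The PR answers "is this an image request, and what is the prompt?" with
    an LLM via TextAnalyzerAgent; this regex check is only a stand-in."""
    if IMAGE_REQUEST.search(last_message) is None:
        return False, None
    # Strip the request phrasing to recover the bare prompt.
    prompt = re.sub(
        r"^\W*(please\s+)?(draw|generate|create)\s+(an?\s+)?(image|picture|photo)\s+(of\s+)?",
        "",
        last_message,
        flags=re.IGNORECASE,
    ).strip()
    return True, f"<generated image: {prompt}>"


print(image_gen_reply("Please generate an image of a cat on a skateboard"))
print(image_gen_reply("What's the weather like?"))
```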

The design laid out by this image-generation capability allows for future extensions, such as:

  • sound generation,
  • image description,
  • video generation, etc.

Design-wise, it has some drawbacks:

  • I'm using TextAnalyzerAgent, which means more LLM calls and more token usage.
  • If an image was generated, we treat it as the final reply. Should it be? Or should we pass it along the reply chain?
  • The current implementation assumes only one image should be generated, even though most APIs can generate more.

I wrote a quick script, test_image_generation.py, to test out the functionality (I plan to remove it and add a notebook instead).

image_gen_poc.mp4


@WaelKarkoub WaelKarkoub added the labels enhancement, multimodal (language + vision, speech, etc.), and models (pertains to using alternate, non-GPT models, e.g., local models, llama, etc.) on Mar 5, 2024
@codecov-commenter

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 0%, with 79 lines in your changes missing coverage. Please review.

Project coverage is 46.28%. Comparing base (676e8e1) to head (8f7aeff).

Files                                                   Patch %   Lines
.../agentchat/contrib/capabilities/generate_images.py    0.00%    79 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1874       +/-   ##
===========================================
+ Coverage   36.10%   46.28%   +10.17%     
===========================================
  Files          63       64        +1     
  Lines        6658     6737       +79     
  Branches     1470     1601      +131     
===========================================
+ Hits         2404     3118      +714     
+ Misses       4056     3365      -691     
- Partials      198      254       +56     
Flag Coverage Δ
unittests 46.20% <0.00%> (+10.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@ekzhu
Collaborator

ekzhu commented Mar 6, 2024

@rickyloynd-microsoft
Contributor

@WaelKarkoub Thanks for the contribution! Can you add a few lines of high-level explanation at the top of this PR? Something that describes the use cases of this implementation and its current limitations.

@WaelKarkoub
Contributor Author

Hi @rickyloynd-microsoft, I've updated the PR with more details, let me know what you think!

@rickyloynd-microsoft rickyloynd-microsoft self-assigned this Mar 6, 2024
@rickyloynd-microsoft
Contributor

Awesome!

Responding to a couple of your questions:

  • I'm using TextAnalyzerAgent, which means more LLM calls and more token usage.
    • Nice to see TextAnalyzerAgent being used more.
  • If an image was generated, we treat it as a final reply. Should it be the final reply? Or should we pass it along the reply chain?
    • I think all of the existing reply functions do the same, and I think it's very appropriate to treat a generated image as the final reply.

Looking forward to @BeibinLi's feedback.

@BeibinLi
Collaborator

BeibinLi commented Mar 6, 2024

@WaelKarkoub Fantastic PR! Thanks so much for your work.

@rickyloynd-microsoft One thing I want to ask you: should DALL-E be a "skill" or a "capability"? From an implementation perspective, I really like the current "capability" style. However, it may incur additional cost (i.e., for the text_analyzer) to process every message. The "skill", on the other hand, though cheaper, would be less accurate (especially for small language models).

assert self._agent is not None

self._text_analyzer.reset()
self._agent.send(
Collaborator

@rickyloynd-microsoft I know splitting the message into two messages comes from "teachability". It can make the LLM's answer more accurate (i.e., follow the format).

However, some smaller models won't accept two consecutive messages from the same agent. Let's find other ways in the future. For now, I'm good with it.

Contributor

The LLM doesn't see two messages from the same agent. TextAnalyzerAgent combines them into one before sending it to the LLM.
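The combining step can be illustrated with a small stand-alone sketch (this is not the actual autogen code, just the general idea of merging consecutive same-role messages before the LLM call):

```python
from typing import Dict, List


def merge_consecutive(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse consecutive messages from the same role into one message."""
    merged: List[Dict[str, str]] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append(dict(msg))
    return merged


history = [
    {"role": "user", "content": "Here is the text to analyze."},
    {"role": "user", "content": "Does it ask for an image?"},
]
print(merge_consecutive(history))
```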

return False, None

if self._should_generate_image(last_message):
    prompt = self._analyze_text(
Collaborator

One feature I've always wanted is to have the LLM determine the resolution automatically, instead of defining it manually in the init function. Do you think this feature would be easy to add?

If your answer is no, that's still fine. It's an optional nice-to-have feature.

Contributor Author

I thought about this specific feature, as I initially wanted to use function calling instead of a capability to generate images. There are two ways to implement it:

  1. Use the text analyzer to figure out what the resolution should be from the prompt (assuming other agents provided one).
  2. Use regex to extract the resolution details.

Approach 1 is probably more robust. However, there's a failure mode where the LLM can provide invalid resolutions (for example, DALL-E 2 accepts 512×512, but the same request will fail for DALL-E 3). I believe keeping it simple for the first iteration of the PR is better and probably more robust. But I'll definitely open another PR to improve this (another example would be the number of images to generate), as I think it would be cool.
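Approach 2 could look roughly like the sketch below. The allowed-size sets are assumptions based on the DALL-E documentation and should be verified, and the function name is illustrative:

```python
import re

# Assumed allowed sizes per model; verify against the provider's docs.
ALLOWED_SIZES = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}


def extract_resolution(text: str, model: str, default: str = "1024x1024") -> str:
    """Pull a WxH resolution from the prompt, falling back to a default when
    it is absent or invalid for the chosen model (the DALL-E 2 vs 3 failure
    mode discussed above)."""
    match = re.search(r"\b(\d{3,4})\s*[x×]\s*(\d{3,4})\b", text, re.IGNORECASE)
    if match is None:
        return default
    size = f"{match.group(1)}x{match.group(2)}"
    return size if size in ALLOWED_SIZES.get(model, set()) else default


print(extract_resolution("a 512x512 pixel-art fox", "dall-e-2"))
print(extract_resolution("a 512x512 pixel-art fox", "dall-e-3"))
```

Validating against a per-model whitelist is what keeps an LLM- or regex-supplied size from producing a failing API request.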

@@ -0,0 +1,129 @@
from typing import Dict, Optional
Collaborator

This file contains a lot:

  1. Some are utility functions (which can be included in img_utils.py or generate_images.py).
  2. Some are for demonstration purposes (which should be in the DALL-E notebook). Feel free to change the DALL-E notebook, and I will make comments directly there.
  3. We will also need a test for the "test/agentchat/contrib/capabilities" folder. I can create that if you need help.

Collaborator

@BeibinLi BeibinLi left a comment

Thanks so much again for this awesome feature!

@rickyloynd-microsoft
Contributor

> @WaelKarkoub Fantastic PR! Thanks so much for your work.
>
> @rickyloynd-microsoft One thing I want to ask you: should DALL-E be a "skill" or a "capability"? From an implementation perspective, I really like the current "capability" style. However, it may incur additional cost (i.e., for the text_analyzer) to process every message. The "skill", on the other hand, though cheaper, would be less accurate (especially for small language models).

How are you defining 'skill' here? As a function call?


def _image_gen_reply(
    self,
    reciepient: ConversableAgent,
Collaborator

For future reference, we need to check the recipient for multi-modality capability. @BeibinLi

5 participants