
[Feature] Adds Image Generation Capability #1874

Closed

Conversation

WaelKarkoub
Contributor

@WaelKarkoub WaelKarkoub commented Mar 5, 2024

Why are these changes needed?

Proof of concept for using agent capabilities as a way of enabling multimodal communication. I found it difficult to extend agent capabilities to include multimodal interactions without undertaking extensive refactoring.

I went with a modular approach, treating different modalities as distinct agent capabilities. This strategy streamlines the integration of multimodal functions and enhances the versatility of "simple" agents with minimal adjustments to the existing architecture.

For this PR, I experimented with image generation, since I've seen quite a bit of great work already done by @BeibinLi. The idea is that users can add the ability to generate images to any of their existing agents. I architected this by creating an abstract class called ImageGenerator, which users can implement for their favorite API provider (there's an example for DALL-E, DalleImageGenerator). All the user has to do is pass their chosen generator to ImageGeneration (the agent capability for generating images) and add that capability to the agent.
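The interface described above can be sketched roughly like this. The class and method names follow the PR's description, but `FakeImageGenerator` and the string return value are placeholders standing in for a real DALL-E client:

```python
from abc import ABC, abstractmethod


class ImageGenerator(ABC):
    """Abstract base class for pluggable image-generation backends."""

    @abstractmethod
    def generate_image(self, prompt: str):
        """Return an image for the given text prompt."""
        ...


class FakeImageGenerator(ImageGenerator):
    """Placeholder backend; a real implementation (e.g. DalleImageGenerator)
    would call the provider's API here and return an actual image."""

    def generate_image(self, prompt: str) -> str:
        return f"<image for: {prompt}>"


# The capability (ImageGeneration in the PR) would receive an instance:
generator = FakeImageGenerator()
print(generator.generate_image("a watercolor fox"))
```

The point of the abstract class is that the capability never needs to know which provider is behind it; swapping DALL-E for another API is just a different subclass.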

ImageGeneration works by adding a custom reply function that checks: "Did I receive a message asking me to generate an image? If so, what is the prompt?" and generates the image accordingly.
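In spirit, that reply hook looks something like the sketch below. A crude keyword heuristic stands in for the PR's actual TextAnalyzerAgent call, and all names here are illustrative:

```python
import re
from typing import Optional, Tuple

IMAGE_REQUEST = re.compile(
    r"\b(draw|generate|create)\b.*\b(image|picture|photo)\b", re.IGNORECASE
)


def image_gen_reply(last_message: str) -> Tuple[bool, Optional[str]]:
    """Custom reply hook: (True, reply) claims the turn, (False, None) passes.

    The PR answers "is this an image request, and what is the prompt?" with
    an LLM via TextAnalyzerAgent; this regex check is only a stand-in."""
    if IMAGE_REQUEST.search(last_message) is None:
        return False, None
    # Strip the request phrasing to recover the bare prompt.
    prompt = re.sub(
        r"^\W*(please\s+)?(draw|generate|create)\s+(an?\s+)?(image|picture|photo)\s+(of\s+)?",
        "",
        last_message,
        flags=re.IGNORECASE,
    ).strip()
    return True, f"<generated image: {prompt}>"


print(image_gen_reply("Please generate an image of a cat on a skateboard"))
print(image_gen_reply("What's the weather like?"))
```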

The design laid out by this image-generation capability allows for future extensions, such as:

  • sound generation,
  • image description,
  • video generation, etc.

Design-wise, it has some drawbacks:

  • I'm using TextAnalyzerAgent, which means more LLM calls and more token usage.
  • If an image was generated, we treat it as the final reply. Should it be? Or should we pass it along the reply chain?
  • The current implementation assumes only one image should be generated, even though most APIs can generate more.

I wrote a quick script, test_image_generation.py, to test out the functionality (I plan to remove it and add a notebook instead).

image_gen_poc.mp4


@WaelKarkoub WaelKarkoub added the labels enhancement, multimodal (language + vision, speech, etc.), and models (pertains to using alternate, non-GPT models, e.g., local models, llama, etc.) on Mar 5, 2024
@codecov-commenter

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 0%, with 79 lines in your changes missing coverage. Please review.

Project coverage is 46.28%. Comparing base (676e8e1) to head (8f7aeff).

Files                                                   Patch %   Lines
.../agentchat/contrib/capabilities/generate_images.py    0.00%    79 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1874       +/-   ##
===========================================
+ Coverage   36.10%   46.28%   +10.17%     
===========================================
  Files          63       64        +1     
  Lines        6658     6737       +79     
  Branches     1470     1601      +131     
===========================================
+ Hits         2404     3118      +714     
+ Misses       4056     3365      -691     
- Partials      198      254       +56     
Flag Coverage Δ
unittests 46.20% <0.00%> (+10.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@ekzhu
Collaborator

ekzhu commented Mar 6, 2024

@rickyloynd-microsoft
Contributor

@WaelKarkoub Thanks for the contribution! Can you add a few lines of high-level explanation at the top of this PR? Something that describes the use cases of this implementation and its current limitations.

@WaelKarkoub
Contributor Author

Hi @rickyloynd-microsoft, I've updated the PR with more details, let me know what you think!

@rickyloynd-microsoft rickyloynd-microsoft self-assigned this Mar 6, 2024
@rickyloynd-microsoft
Contributor

Awesome!

Responding to a couple of your questions:

  • I'm using TextAnalyzerAgent, which means more LLM calls and more token usage.
    • Nice to see TextAnalyzerAgent being used more.
  • If an image was generated, we treat it as a final reply. Should it be the final reply? Or should we pass it along the reply chain?
    • I think all of the existing reply functions do the same, and I think it's very appropriate to treat a generated image as the final reply.

Looking forward to @BeibinLi's feedback.

@BeibinLi
Collaborator

BeibinLi commented Mar 6, 2024

@WaelKarkoub Fantastic PR! Thanks so much for your work.

@rickyloynd-microsoft One thing I want to ask you: should DALL-E be a "skill" or a "capability"? From an implementation perspective, I really like the current "capability" style. However, it may incur additional cost (i.e., for the text_analyzer) to process every message. The "skill", on the other hand, though cheaper, would be less accurate (especially for small language models).

assert self._agent is not None

self._text_analyzer.reset()
self._agent.send(
Collaborator

@rickyloynd-microsoft I know splitting the message into two messages comes from "teachability". It can make the LLM's answer more accurate (i.e., follow the format).

However, some smaller models won't accept two consecutive messages from the same agent. Let's find other ways in the future. For now, I'm good with it.

Contributor

The LLM doesn't see two messages from the same agent. TextAnalyzerAgent combines them into one before sending it to the LLM.
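The combining step can be illustrated with a small stand-alone sketch (this is not the actual autogen code, just the general idea of merging consecutive same-role messages before the LLM call):

```python
from typing import Dict, List


def merge_consecutive(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse consecutive messages from the same role into one message."""
    merged: List[Dict[str, str]] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append(dict(msg))
    return merged


history = [
    {"role": "user", "content": "Here is the text to analyze."},
    {"role": "user", "content": "Does it ask for an image?"},
]
print(merge_consecutive(history))
```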

return False, None

if self._should_generate_image(last_message):
    prompt = self._analyze_text(
Collaborator

One feature I've always wanted is to have the LLM determine the resolution automatically, instead of defining it manually in the init function. Do you think this feature would be easy to add?

If your answer is no, that's still fine. It's an optional nice-to-have feature.

Contributor Author

I thought about this specific feature, as I initially wanted to use function calling instead of a capability to generate images. There are two ways to implement it:

  1. Use the text analyzer to figure out what the resolution should be from the prompt (assuming other agents provided one).
  2. Use regex to extract the resolution details.

Approach 1 is probably more robust. However, there's a failure mode where the LLM can provide invalid resolutions (for example, DALL-E 2 accepts 512×512, but the same request will fail for DALL-E 3). I believe keeping it simple for the first iteration of the PR is better and probably more robust. But I'll definitely open another PR to improve this (another example would be the number of images to generate), as I think it would be cool.
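Approach 2 could look roughly like the sketch below. The allowed-size sets are assumptions based on the DALL-E documentation and should be verified, and the function name is illustrative:

```python
import re

# Assumed allowed sizes per model; verify against the provider's docs.
ALLOWED_SIZES = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}


def extract_resolution(text: str, model: str, default: str = "1024x1024") -> str:
    """Pull a WxH resolution from the prompt, falling back to a default when
    it is absent or invalid for the chosen model (the DALL-E 2 vs 3 failure
    mode discussed above)."""
    match = re.search(r"\b(\d{3,4})\s*[x×]\s*(\d{3,4})\b", text, re.IGNORECASE)
    if match is None:
        return default
    size = f"{match.group(1)}x{match.group(2)}"
    return size if size in ALLOWED_SIZES.get(model, set()) else default


print(extract_resolution("a 512x512 pixel-art fox", "dall-e-2"))
print(extract_resolution("a 512x512 pixel-art fox", "dall-e-3"))
```

Validating against a per-model whitelist is what keeps an LLM- or regex-supplied size from producing a failing API request.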

@@ -0,0 +1,129 @@
from typing import Dict, Optional
Collaborator

This file contains a lot:

  1. Some are utility functions (which can be included in img_utils.py or generate_images.py).
  2. Some are for demonstration purposes (which should be in the DALL-E notebook). Feel free to change the DALL-E notebook, and I will make comments directly there.
  3. We will also need a test for the "test/agentchat/contrib/capabilities" folder. I can create that if you need help.

Collaborator

@BeibinLi BeibinLi left a comment

Thanks so much again for this awesome feature!

@rickyloynd-microsoft
Contributor

> @WaelKarkoub Fantastic PR! Thanks so much for your work.
>
> @rickyloynd-microsoft One thing I want to ask you: should DALL-E be a "skill" or a "capability"? From an implementation perspective, I really like the current "capability" style. However, it may incur additional cost (i.e., for the text_analyzer) to process every message. The "skill", on the other hand, though cheaper, would be less accurate (especially for small language models).

How are you defining 'skill' here? As a function call?


def _image_gen_reply(
    self,
    reciepient: ConversableAgent,
Collaborator

For future reference, we need to check the recipient for multi-modality capability. @BeibinLi

5 participants