[Feature] Adds Image Generation Capability #1874
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1874 +/- ##
===========================================
+ Coverage 36.10% 46.28% +10.17%
===========================================
Files 63 64 +1
Lines 6658 6737 +79
Branches 1470 1601 +131
===========================================
+ Hits 2404 3118 +714
+ Misses 4056 3365 -691
- Partials 198 254 +56
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@WaelKarkoub Thanks for the contribution! Can you add a few lines of high-level explanation at the top of this PR? Something that describes the use cases of this implementation, and its current limitations.
Hi @rickyloynd-microsoft, I've updated the PR with more details, let me know what you think!
Awesome! Responding to a couple of your questions:
Looking forward to @BeibinLi's feedback.
@WaelKarkoub Fantastic PR! Thanks so much for your work. @rickyloynd-microsoft One thing I want to ask you is: should Dalle be a "skill" or a "capability"? From the implementation perspective, I really like the current "capability" style. However, it may incur additional cost (e.g., for the text_analyzer) to process every message. The "skill", on the other hand, even though cheaper, would be less accurate (especially for small language models).
assert self._agent is not None

self._text_analyzer.reset()
self._agent.send(
@rickyloynd-microsoft I know splitting the message into two messages is from the "teachability". It can make the LLM's answer more accurate (e.g., following the format).
However, some smaller models would not accept two consecutive messages from the same agent. Let's find out some other ways in the future. For here, I am good with it.
The LLM doesn't see two messages from the same agent. TextAnalyzerAgent combines them into one before sending it to the LLM.
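As a standalone illustration of that combining step (the real logic lives inside autogen's `TextAnalyzerAgent`; the helper name and message format below are assumptions for this sketch), consecutive same-role messages can be merged before the LLM call so models that reject back-to-back same-role messages still work:

```python
from typing import Dict, List


def merge_consecutive_messages(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge consecutive messages that share the same role into a single
    message, joining their contents with a newline."""
    merged: List[Dict[str, str]] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: fold the content in.
            merged[-1] = {
                "role": msg["role"],
                "content": merged[-1]["content"] + "\n" + msg["content"],
            }
        else:
            merged.append(dict(msg))
    return merged


# Two consecutive "user" messages (text to analyze + instructions), as in
# the teachability-style split described above.
history = [
    {"role": "user", "content": "Here is the text to analyze: a cat on a beach."},
    {"role": "user", "content": "Does this message ask for an image? Answer yes or no."},
]
print(merge_consecutive_messages(history))  # a single merged "user" message
```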
return False, None

if self._should_generate_image(last_message):
    prompt = self._analyze_text(
One feature I always want to have is to determine the resolution automatically from the LLM, instead of having it manually defined in the init function. Do you think it is easy to add this feature?
If your answer is NO, it is still fine. It is an optional nice-to-have feature.
I thought about this specific feature, as I initially wanted to use function calling instead of a capability to generate images. Two ways to implement this:
- Use the text analyzer to figure out what the resolution should be from the prompt (assuming other agents provided one).
- Use some regex commands to extract the resolution details.
Approach 1 is probably more robust. However, there's a failure mode where the LLM can provide invalid resolutions (for example, Dalle 2 can accept 512×512, but the request will fail for Dalle 3). I believe keeping it simple for the first iteration of the PR will be better and probably more robust. But I'll definitely open another PR to improve this (another example would be the number of images to generate), as I think it would be cool.
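Approach 2 could look something like the sketch below. The helper name and the per-model allow-lists are assumptions for illustration (check the provider's docs for the currently supported sizes); falling back to a default guards against the invalid-resolution failure mode mentioned above:

```python
import re
from typing import Set, Dict

# Assumed allow-lists for illustration; verify against the provider's docs.
ALLOWED_SIZES: Dict[str, Set[str]] = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}


def extract_resolution(prompt: str, model: str, default: str = "1024x1024") -> str:
    """Pull a 'WIDTHxHEIGHT' resolution out of a free-form prompt.

    Falls back to `default` when no resolution is found or the requested
    size is not valid for the chosen model."""
    match = re.search(r"\b(\d{3,4})\s*[x×]\s*(\d{3,4})\b", prompt)
    if match:
        size = f"{match.group(1)}x{match.group(2)}"
        if size in ALLOWED_SIZES.get(model, set()):
            return size
    return default


print(extract_resolution("Generate a 512x512 image of a cat", "dall-e-2"))  # 512x512
print(extract_resolution("Generate a 512x512 image of a cat", "dall-e-3"))  # 1024x1024 (fallback)
```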
@@ -0,0 +1,129 @@
from typing import Dict, Optional
This file contains lots of information:
- some are utility functions (which can be included in `img_utils.py` or `generate_images.py`)
- some are for demonstration purposes (which should be in the dalle notebook). Feel free to change the dalle notebook, and I will make comments directly there.
- we will also need a test for the "test/agentchat/contrib/capabilities" folder. I can create that if you need help.
Thanks so much again for this awesome feature!
How are you defining 'skill' here? As a function call?
def _image_gen_reply(
    self,
    recipient: ConversableAgent,
For future reference, we need to check the recipient for multi-modality capability. @BeibinLi
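A minimal sketch of such a guard. The attribute and class names here are hypothetical, purely to illustrate the check (autogen's multimodal agents expose this differently):

```python
from typing import Any


def recipient_accepts_images(recipient: Any) -> bool:
    """Hypothetical guard: only return image content to agents that can
    consume multimodal messages; text-only agents get a text fallback."""
    return getattr(recipient, "accepts_image_content", False)


class TextOnlyAgent:
    accepts_image_content = False


class MultimodalAgent:
    accepts_image_content = True


print(recipient_accepts_images(MultimodalAgent()))  # True
print(recipient_accepts_images(TextOnlyAgent()))    # False
```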
Why are these changes needed?
Proof of concept to use agent capabilities as a way of multimodal communication. I found it difficult to extend agent capabilities to include multimodal interactions without undertaking extensive refactoring.
I went with a modular approach, treating different modalities as distinct agent capabilities. This strategy streamlines the integration of multimodal functions and enhances the versatility of "simple" agents with minimal adjustments to the existing architecture.
For this PR, I experimented with image generation, since I've seen quite a bit of great work already done by @BeibinLi. The idea is that the user can add the ability to generate images to any of their existing agents. I architected this code by creating an abstract class called `ImageGenerator`, where the user can implement their image generator from their favorite API provider (there's an example for Dalle: `DalleImageGenerator`). All the user has to do now is pass the generator they like to `ImageGeneration` (the agent's ability to generate images) and add the ability to the agent.

The way `ImageGeneration` works is by adding a custom reply function, which checks "Did I receive a message asking me to generate an image? If so, what is the prompt?" and generates the image accordingly.

The design laid out by this image generation capability will allow for future extensions, such as:

Design-wise it does have some downfalls:
- It uses `TextAnalyzerAgent`, so more LLM calls and more token usage.

I wrote a quick script to test out the functionality, `test_image_generation.py` (planning to remove it and add a notebook instead).

image_gen_poc.mp4
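The generator/capability split described above can be sketched in isolation. The class and method names below (`generate_image`, the `(handled, reply)` tuple convention) are assumptions for illustration only; the real implementation lives in autogen's contrib capabilities module and uses `TextAnalyzerAgent` to extract the prompt:

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class ImageGenerator(ABC):
    """Provider-agnostic interface: implement this for your favorite API."""

    @abstractmethod
    def generate_image(self, prompt: str) -> bytes:
        """Return generated image data for the given text prompt."""


class FakeImageGenerator(ImageGenerator):
    """Stand-in implementation; a real one would call e.g. the Dalle API."""

    def generate_image(self, prompt: str) -> bytes:
        return f"<image for: {prompt}>".encode()


class ImageGeneration:
    """Capability sketch: wraps a generator behind a custom reply function
    that fires when an incoming message asks for an image."""

    def __init__(self, generator: ImageGenerator):
        self._generator = generator

    def image_gen_reply(self, last_message: str) -> Tuple[bool, Optional[bytes]]:
        # "Did I receive a message asking me to generate an image?"
        # (The real capability asks TextAnalyzerAgent; this is a toy check.)
        if "generate an image" in last_message.lower():
            return True, self._generator.generate_image(last_message)
        return False, None


capability = ImageGeneration(FakeImageGenerator())
handled, image = capability.image_gen_reply("Please generate an image of a sunset")
print(handled)  # True
```

Swapping providers then only means implementing `ImageGenerator` once and passing the new instance in, which is the modularity argument made above.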
Related issue number
Checks