
[Feature] Adds Image Generation Capability 2.0 #1907

Merged: 55 commits into main, Mar 15, 2024
Conversation

WaelKarkoub (Contributor)

@BeibinLi @rickyloynd-microsoft @ekzhu I created this PR because the other PR's (#1874) branch was based on my fork, which doesn't allow me to run the OpenAI tests. Closing #1874.

Why are these changes needed?

Proof of concept for using agent capabilities as a vehicle for multimodal communication. I found it difficult to extend agent capabilities to include multimodal interactions without undertaking extensive refactoring.

I went with a modular approach, treating different modalities as distinct agent capabilities. This strategy streamlines the integration of multimodal functions and enhances the versatility of "simple" agents with minimal adjustments to the existing architecture.

For this PR, I experimented with image generation, since I've seen quite a bit of great work already done by @BeibinLi. The idea is that users can add image generation to any of their existing agents. I architected the code around an abstract class called ImageGenerator, which users can implement for their favorite API provider (there's a DALL-E example, DalleImageGenerator). The user then passes their chosen generator to ImageGeneration (the agent's capability to generate images) and adds that capability to the agent.
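The pattern above can be sketched without the autogen wiring. The names ImageGenerator, ImageGeneration, and add_to_agent mirror the PR, but everything else here (DummyImageGenerator, SimpleAgent, the "draw:" request check) is a hypothetical stand-in for the real agent framework, not the merged implementation:

```python
from abc import ABC, abstractmethod

class ImageGenerator(ABC):
    """Abstract interface a user implements for their preferred image API."""
    @abstractmethod
    def generate_image(self, prompt: str) -> bytes:
        ...

class DummyImageGenerator(ImageGenerator):
    """Hypothetical stand-in for e.g. a DALL-E-backed generator."""
    def generate_image(self, prompt: str) -> bytes:
        return f"<image for: {prompt}>".encode()

class ImageGeneration:
    """Capability: wraps a generator and hooks a reply function onto an agent."""
    def __init__(self, image_generator: ImageGenerator):
        self._generator = image_generator

    def add_to_agent(self, agent) -> None:
        # Register our reply function so it runs before the agent's defaults.
        agent.register_reply(self._image_gen_reply)

    def _image_gen_reply(self, messages):
        last = messages[-1]
        if last.lower().startswith("draw:"):  # crude image-request check
            prompt = last.split(":", 1)[1].strip()
            return True, self._generator.generate_image(prompt)
        return False, None  # not an image request; let other replies handle it

class SimpleAgent:
    """Minimal agent stub with a reply-function registry."""
    def __init__(self):
        self._reply_funcs = []
    def register_reply(self, fn):
        self._reply_funcs.insert(0, fn)
    def reply(self, messages):
        for fn in self._reply_funcs:
            handled, out = fn(messages)
            if handled:
                return out
        return "no image requested"

agent = SimpleAgent()
ImageGeneration(DummyImageGenerator()).add_to_agent(agent)
print(agent.reply(["draw: a cat on a skateboard"]))
```

Swapping DummyImageGenerator for a provider-backed implementation is the only change a user would make; the capability and agent stay untouched, which is the modularity the PR is after.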

ImageGeneration works by adding a custom reply function that checks: "Did I receive a message asking me to generate an image? If so, what is the prompt?" and generates the image accordingly.
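The check described above is answered in the PR by a TextAnalyzerAgent, i.e. by LLM calls. A cheap keyword heuristic can stand in for it in a sketch; the regex and function name below are illustrative, not the PR's actual logic:

```python
import re

def extract_image_prompt(message: str):
    """Return the image prompt if the message asks for an image, else None.

    Hypothetical heuristic standing in for the PR's TextAnalyzerAgent, which
    uses an LLM to answer: "Was an image requested? If so, what is the prompt?"
    """
    pattern = r"(?:generate|draw|create)\s+(?:an?\s+)?(?:image|picture)\s+of\s+(.+)"
    m = re.search(pattern, message, flags=re.IGNORECASE)
    return m.group(1).strip() if m else None

print(extract_image_prompt("Please generate an image of a red fox"))  # prints: a red fox
print(extract_image_prompt("What's the weather?"))                    # prints: None
```

The trade-off the PR notes applies here in reverse: a heuristic costs no tokens but misses paraphrased requests, which is exactly why the real implementation pays for LLM-based analysis.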

The design laid out by this image generation capability will allow for future extensions, such as:

  • sound generation,
  • image description,
  • video generation, etc.

Design-wise, it does have some drawbacks:

  • I'm using TextAnalyzerAgent, so there are more LLM calls and more token usage.
  • If an image was generated, we treat it as the final reply. Should it be the final reply? Or should we pass it along the reply chain?
  • The current implementation assumes only one image is to be generated, even though most APIs can generate more.
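The single-image assumption in the last point could be relaxed by threading an `n` parameter through the generator interface, as most provider APIs already accept a count. This is a hypothetical extension, not part of the PR; MultiImageGenerator and EchoGenerator are invented names for illustration:

```python
from abc import ABC, abstractmethod
from typing import List

class MultiImageGenerator(ABC):
    """Hypothetical variant of the generator interface returning n images."""
    @abstractmethod
    def generate_images(self, prompt: str, n: int = 1) -> List[bytes]:
        ...

class EchoGenerator(MultiImageGenerator):
    """Toy implementation that fabricates n placeholder images."""
    def generate_images(self, prompt: str, n: int = 1) -> List[bytes]:
        return [f"<image {i}: {prompt}>".encode() for i in range(n)]

images = EchoGenerator().generate_images("sunset over water", n=3)
print(len(images))  # prints: 3
```

Defaulting `n=1` keeps the change backward compatible with the single-image behavior described above.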

I wrote a quick script, test_image_generation.py, to test the functionality (I'm planning to remove it and add a notebook instead).

(Video demo: image_gen_poc.mp4)


@codecov-commenter commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 78.00000%, with 22 lines in your changes missing coverage. Please review.

Project coverage is 60.87%. Comparing base (ea2c1b2) to head (a250c3d).

Files Patch % Lines
.../agentchat/contrib/capabilities/generate_images.py 78.00% 17 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1907       +/-   ##
===========================================
+ Coverage   37.53%   60.87%   +23.33%     
===========================================
  Files          65       66        +1     
  Lines        6913     7013      +100     
  Branches     1521     1660      +139     
===========================================
+ Hits         2595     4269     +1674     
+ Misses       4092     2357     -1735     
- Partials      226      387      +161     
Flag Coverage Δ
unittests 60.58% <78.00%> (+23.04%) ⬆️

Flags with carried forward coverage won't be shown.


@sonichi sonichi added this pull request to the merge queue Mar 15, 2024
Merged via the queue into main with commit c5536ee Mar 15, 2024
64 of 68 checks passed
@WaelKarkoub WaelKarkoub deleted the describe-image-capability branch March 16, 2024 00:52
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024
* adds image generation capability
* add todo
* readded cache
* wip
* fix content str bugs
* removed todo: delete imshow
* wip
* fix circular imports
* add notebook
* improve prompt
* improved text analyzer + notebook
* notebook update
* improve notebook
* smaller notebook size
* made changes to the wrong branch :(
* resolve comments + 1
* adds doc strings
* adds cache doc string
* adds doc string to add_to_agent
* adds doc string to ImageGeneration
* instructions are not configurable
* removed unnecessary imports
* changed doc string location
* more doc strings
* improves testability
* adds tests
* adds cache test
* added test to github workflow
* compatible llm config format
* configurable reply function position
* skip_openai + better comments
* fix test
* fix test?
* please fix test?
* last fix test?
* remove type hint
* skip cache test
* adds mock api key
* dalle-2 test
* fix dalle config
* use apu key function

---------

Co-authored-by: Chi Wang <wang.chi@microsoft.com>
Labels
alt-models Pertains to using alternate, non-GPT, models (e.g., local models, llama, etc.) enhancement New feature or request multimodal language + vision, speech etc.

7 participants