feat/cody: Brings image modality for BYOK users #6354

Open
wants to merge 15 commits into
base: main

Conversation

PriNova
Collaborator

@PriNova PriNova commented Dec 14, 2024

This PR brings image modality for BYOK users via the Google LLM provider.

The PR is behind the cody.dev.models experimental feature flag. You need to configure it in the settings.json like this:

"cody.dev.models": [
    {
        "provider": "google",
        "model": "gemini-2.0-flash-exp",
        "inputTokens": 1048576,
        "outputTokens": 8192,
        "apiKey": "your_key_goes_here",
        "options": {
            "temperature": 0.0
        }
    }
]
Image_Modality.mp4

Model Selection overhauled:

Screenshot 2024-12-16 105822

Test plan

Build Cody based on this PR

Manual Testing Steps

  1. Model Selection:

    • Open Cody chat
    • Click on model selector
    • Verify Gemini Flash 2.0 model shows as Vision model
  2. Image Upload Flow:

    • Select a Gemini Flash 2.0 model
    • Verify image selection button is visible in toolbar
    • Click image button
    • Select an image file
  3. Chat Interaction:

    • With uploaded image, send a message
    • Check response includes image context
  4. Edge Cases:

    • Switch between models and verify image selection button visibility
    • Test with various image formats (jpeg, png, webp)
  5. Drag 'n' Drop:

Notes

  • Feature only available for Gemini Flash 2.0 model

Changelog

Added

  • Add Gemini Flash 2.0 experimental vision model support via cody dev models flag pull/6354

- Adds a new toolbar button to the chat interface to allow users to upload images when using the Google model
- The button is conditionally rendered based on the current model being a Google model (identified by the `ModelTag.BYOK` tag and the model ID containing 'gemini-2.0-flash')
- The onClick handler for the button is currently commented out, as the implementation for the actual image upload feature is not included in the provided diff
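The visibility rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: the `ModelTag` enum values and the `ModelLike` shape are stand-ins assumed for the example, and only the BYOK tag check and the `'gemini-2.0-flash'` substring check come from the commit notes.

```typescript
// Stand-in for Cody's ModelTag; the string values here are assumptions.
enum ModelTag {
    BYOK = 'byok',
    Vision = 'vision',
}

// Hypothetical, reduced model shape used only for this sketch.
interface ModelLike {
    id: string
    tags: ModelTag[]
}

// True when the model carries the BYOK tag and its ID names Gemini 2.0 Flash.
function isGeminiFlash2Model(model: ModelLike): boolean {
    const isByok = model.tags.some(tag => tag === ModelTag.BYOK)
    return isByok && model.id.indexOf('gemini-2.0-flash') !== -1
}

// The toolbar would render the image button only when this returns true.
function shouldShowImageButton(model: ModelLike): boolean {
    return isGeminiFlash2Model(model)
}
```

Keeping the check in one helper means the toolbar and the model selector can share the same rule.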
@PriNova PriNova changed the title from "WIP(Image:Modality): Brings image modality for BYOK users" to "WIP(Image_Modality): Brings image modality for BYOK users" on Dec 14, 2024
- Implements the functionality to select an image file and add it to the `ChatBuilder` instance
- Adds the necessary handlers in `ChatController` to process the 'chat/upload-image' message and call the `ChatBuilder.addImages()` method
- Adds a new message type in `protocol.ts` to handle the 'chat/upload-image' command
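As a rough sketch of how such a message could flow from the webview to the builder: only the `'chat/upload-image'` command name and the `addImages()` method name come from the commit notes. The payload field, the placeholder second message, and the `ImageCollector` class are hypothetical and exist only to make the example self-contained.

```typescript
// Hypothetical message union; the real protocol.ts has many more members.
type UploadSketchMessage =
    | { command: 'chat/upload-image'; image: string } // assumed payload: base64 image data
    | { command: 'sketch/other'; text: string } // placeholder for all other messages

// Stand-in for the part of ChatBuilder that stores images.
class ImageCollector {
    private images: string[] = []

    addImages(images: string[]): void {
        this.images = this.images.concat(images)
    }

    count(): number {
        return this.images.length
    }
}

// Dispatches on the command name, as a controller's message handler might.
function handleSketchMessage(collector: ImageCollector, message: UploadSketchMessage): void {
    switch (message.command) {
        case 'chat/upload-image':
            collector.addImages([message.image])
            break
        case 'sketch/other':
            // Not relevant to image upload; ignored in this sketch.
            break
    }
}
```

A discriminated union keeps the payload type tied to the command name, so the handler cannot read `image` from an unrelated message.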
Implement image handling capabilities for the Google LLM provider:
- Add types for image data and MIME type validation
- Enhance ChatBuilder with image processing and MIME detection
- Enable image support in completion parameters
- Add inline image data support to chat messages
- Add visual indicators for models supporting image uploads
- Improve image handling in Google chat client
- Extract Gemini model detection into separate utility
- Update model selection field to show image upload capability
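For readers unfamiliar with inline image data, the sketch below shows one way a user turn with images could be assembled for a Gemini-style request. The `inlineData` part with `mimeType` and `data` fields follows the shape used by Google's JavaScript SDK as far as I know; treat it as an assumption and check the current API reference. `UploadedImage` and `buildUserContent` are hypothetical names, not taken from the diff.

```typescript
// Hypothetical image record: base64 payload plus its MIME type.
interface UploadedImage {
    data: string
    mimeType: string
}

// Assumed part shapes for a Gemini-style request body.
type GeminiPartSketch = { text: string } | { inlineData: { mimeType: string; data: string } }

interface GeminiContentSketch {
    role: 'user' | 'model'
    parts: GeminiPartSketch[]
}

// Places image parts before the text part of a single user turn.
function buildUserContent(text: string, images: UploadedImage[]): GeminiContentSketch {
    const imageParts = images.map(
        (image): GeminiPartSketch => ({
            inlineData: { mimeType: image.mimeType, data: image.data },
        })
    )
    return { role: 'user', parts: [...imageParts, { text }] }
}
```

Putting the images ahead of the prompt text is a choice made for this example; the commit notes do not say which order the PR uses.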
@PriNova PriNova marked this pull request as ready for review December 15, 2024 15:31
@PriNova PriNova requested a review from ykdojo December 15, 2024 15:31
- Replace filesystem URI handling with direct base64 encoding for images
- Enhance image upload UI with preview and removal capabilities
- Update MIME type detection to work with base64 strings
- Simplify image upload protocol between webview and extension
- Add Vision tag for Gemini Flash 2.0 model configuration
- Implement image upload handling in chat editor
- Update model selection UI to display vision capabilities
- Add dedicated Vision model group in model selector
- Refactor image processing logic for better maintainability

Related: Vision AI integration
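One common way to detect a MIME type directly from a base64 string is to compare its leading characters with the base64 encoding of each format's magic bytes. The sketch below illustrates that technique for the three formats listed in the test plan; it is an assumption about the approach, not the code from this commit, and `detectImageMimeType` is a hypothetical name.

```typescript
type SupportedImageMimeType = 'image/jpeg' | 'image/png' | 'image/webp'

// Maps the start of a raw base64 payload (no "data:" prefix) to a MIME type.
function detectImageMimeType(base64: string): SupportedImageMimeType | undefined {
    // JPEG files start with bytes FF D8 FF, which encode to "/9j/".
    if (base64.slice(0, 4) === '/9j/') {
        return 'image/jpeg'
    }
    // The 8-byte PNG signature followed by a zero byte encodes to "iVBORw0KGgo".
    if (base64.slice(0, 11) === 'iVBORw0KGgo') {
        return 'image/png'
    }
    // WebP is a RIFF container: "RIFF" at byte 0 and "WEBP" at byte 8.
    // In base64 these land at characters 0-4 ("UklGR") and 11-15 ("XRUJQ").
    if (base64.slice(0, 5) === 'UklGR' && base64.slice(11, 16) === 'XRUJQ') {
        return 'image/webp'
    }
    return undefined
}
```

Returning `undefined` for anything else lets the caller reject unsupported uploads before they reach the model.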
@PriNova PriNova requested a review from abeatrix December 16, 2024 11:09
@PriNova PriNova changed the title from "WIP(Image_Modality): Brings image modality for BYOK users" to "feat/cody: Brings image modality for BYOK users" on Dec 16, 2024
PriNova and others added 2 commits December 16, 2024 14:03
- Add support for drag and drop image uploads in the human message cell
- Implement handlers for drag enter, drag leave, and drop events
- Update the HumanMessageEditor component to handle the uploaded image file
- Add a new state variable to track the current image file

Related: Vision AI integration
@@ -94,7 +95,8 @@ export class ChatBuilder {
         public readonly sessionID: string = new Date(Date.now()).toUTCString(),
         private messages: ChatMessage[] = [],
-        private customChatTitle?: string
+        private customChatTitle?: string,
+        private images: ImageData[] = []
Contributor

Suggested change
private images: ImageData[] = []
private images: ImageData[] = []

I did it this way to get the prototype demo-ready as my hackathon project, but I don't think this is the best approach (my bad!).
Instead of passing it to ChatBuilder, could we add a new ContextItem type for media data so the images are preserved in chat history?

Collaborator Author

I have also thought about this and find this a great idea. Not only would the user have visual feedback, but there would also be multiple media blobs available in the future.
In its current state, however, this would collide with the other non-visual models, which would mean adding an additional context filter later on (totally feasible).

Additionally, I'm just not sure if a large number of images in the chat history would hurt performance. Depending on the specification of the computer, a slowdown in chat history management is observed: https://linear.app/sourcegraph/issue/CODY-4516/vscode-cody-extension-lags-with-large-chat-history-40-items
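To make the "additional context filter" idea concrete, here is a minimal sketch of what it might look like if images became a context item type. Everything in it is hypothetical: the `MediaContextSketch` union, the `'media'` type name, and `filterContextForModel` are illustrations of the proposal in this thread, not existing Cody types.

```typescript
// Hypothetical context item union with a media variant added.
type MediaContextSketch =
    | { type: 'file'; uri: string }
    | { type: 'media'; mimeType: string; data: string }

// Drops media items when the selected model cannot accept images,
// so non-vision models never receive image payloads.
function filterContextForModel(
    items: MediaContextSketch[],
    modelSupportsVision: boolean
): MediaContextSketch[] {
    if (modelSupportsVision) {
        return items
    }
    return items.filter(item => item.type !== 'media')
}
```

The items themselves stay in chat history either way; only the payload sent to a non-vision model is filtered.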

PriNova and others added 5 commits December 16, 2024 19:24
- Add drag counter to properly handle nested drag events
- Restructure HumanMessageCell component hierarchy for better state management
- Enhance image upload cleanup on removal
- Fix drag state reset on drag end
- Improve component organization for better maintainability

This change provides a more reliable drag-and-drop experience and prevents
UI state inconsistencies when handling image uploads in the chat interface.
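The drag counter technique mentioned in this commit is a common fix for nested drag events: `dragenter` and `dragleave` fire for every child element, so a plain boolean flag flickers as the pointer crosses children. A minimal, framework-free sketch follows; the `DragCounter` class and its method names are assumptions made for the example, not code from the diff.

```typescript
// Tracks how many nested elements the drag is currently inside.
class DragCounter {
    private depth = 0

    // Call from a dragenter handler. Returns true when the drag first enters.
    enter(): boolean {
        this.depth += 1
        return this.depth === 1
    }

    // Call from a dragleave handler. Returns true when the drag has fully left.
    leave(): boolean {
        this.depth = Math.max(0, this.depth - 1)
        return this.depth === 0
    }

    // Call from drop and dragend handlers so the state cannot get stuck.
    reset(): void {
        this.depth = 0
    }

    isDragging(): boolean {
        return this.depth > 0
    }
}
```

In a React component the counter would typically live in a ref, with the visible "drop here" state updated only when `enter()` or `leave()` returns true.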
@PriNova PriNova requested a review from abeatrix December 18, 2024 17:47
@ykdojo
Contributor

ykdojo commented Dec 20, 2024

Tried it again, and it looks great! Copying and pasting is still not working for me, though

- Rename model check function for clarity (isGeminiFlash2Model)
- Add smart title formatting for model names
- Standardize model title presentation across components
3 participants