
OpenAI incompatible image handling in server multimodal #4771

Closed

gelim opened this issue Jan 4, 2024 · 10 comments

Comments

@gelim
Contributor

gelim commented Jan 4, 2024

Hello, while testing LLaVA-13B with the server implementation I got a 500 error because the message content was a list of dicts rather than a simple string.

  • what works (text-only, unrelated to LLaVA):
$ curl -H "Content-Type: application/json" -X POST -s $SERVER/v1/chat/completions -d '{"messages": [{"role": "user", "content": "hello"}]}'

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hi there! How can I help you today?","role":"assistant"}}],[...]

  • what yields a 500 with [json.exception.type_error.302] type must be string, but is array:
$ curl -H "Content-Type: application/json" -X POST -s $SERVER/v1/chat/completions -d '{"messages": [{"role": "user", "content": [{"type":"text","text":"hello"}]}]}'

This demonstrates the issue hit by an OpenAI-REST-aware frontend that pushes text plus a picture inside the content key, like this (a client-side sketch follows the JSON):

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "describe the picture"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/webp;base64,AAAAAA==",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "model": "llava-13b",
  "frequency_penalty": 0,
  "max_tokens": 4000,
  "presence_penalty": 0,
  "temperature": 0.1,
  "top_p": 1,
  "user": "foobar"
}
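
For illustration, here is a minimal sketch of the kind of client call that produces such a payload, assuming the openai Python package (v1+); the base URL, API key, file name, and model name are placeholders:

import base64
from openai import OpenAI

# All values below are placeholders: base_url points at the llama.cpp server's
# OpenAI-compatible endpoint, the key is unused, and the file/model names are examples.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("picture.webp", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# The SDK serializes "content" as a list of typed parts, which is exactly the
# shape the server rejects with json.exception.type_error.302.
response = client.chat.completions.create(
    model="llava-13b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "describe the picture"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/webp;base64,{b64}", "detail": "auto"}},
        ],
    }],
    max_tokens=4000,
    temperature=0.1,
)
print(response.choices[0].message.content)
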
@gelim
Contributor Author

gelim commented Jan 4, 2024

If I understand correctly, that is more of a feature that is not implemented within server.cpp than a bug in itself.

Here is the OpenAI API documentation for reference: https://platform.openai.com/docs/api-reference/chat/create
(Screenshot of the linked OpenAI API documentation.)

@gelim
Contributor Author

gelim commented Jan 4, 2024

OK, after digging a bit, I see that the code in examples/server/server.cpp and examples/server/public/index.html is definitely not OpenAI REST API compatible.

Format info from README.md
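
For reference, the README-documented native /completion format carries images in a top-level image_data array of {"data": <base64>, "id": N} entries and references them from the prompt with [img-N] tags; the forwarded request later in this thread shows a complete example.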

@gelim
Contributor Author

gelim commented Jan 4, 2024

I monkey-patched api_like_OAI.py.
This is highly untested and does not handle multiple pictures being sent during a chat session.

The main idea is to catch messages whose 'content' is typed as a list, extract the 'image_url' base64 data, convert it to JPEG (forcing that, since my frontend sends WebP), and create the root key 'image_data' with the data plus an id.
The user message in the prompt is then updated with a reference to that image id.

To be done: add multi-image support.

@gelim gelim changed the title from "HTTP/500 on multicontent in server (for llava)" to "OpenAI incompatible image handling in server multimodal" on Jan 5, 2024
@kevkid

kevkid commented Jan 22, 2024

I am experiencing the same thing. I tried using this code and could not get it to work:

import base64
import requests

CONTEXT = "You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.### Human: Hi!### Assistant: Hi there! How can I help you today?\n"

with open('image.jpg', 'rb') as f:
    img_str = base64.b64encode(f.read()).decode('utf-8')
    data = {
        "messages": [
            {
                "role": "user",
                "image_url": f"data:image/jpeg;base64,{img_str}"
            },
            {
                "role": "user",
                "content": "what is in this image?"
            }
        ]
    }
response = requests.post('http://<addr>:<port>/v1/chat/completions', json=data)

@gelim
Contributor Author

gelim commented Jan 22, 2024

Yes you need to do the json adaptation yourself. I can put my crappy code later for people to improve it.

@kevkid

kevkid commented Jan 22, 2024

> Yes you need to do the json adaptation yourself. I can put my crappy code later for people to improve it.

Would you be kind enough to drop your code in a gist or give an example? Thank you.

@gelim
Contributor Author

gelim commented Feb 3, 2024

diff --git a/examples/server/api_like_OAI.py b/examples/server/api_like_OAI.py
index 607fe49..6638081 100755
--- a/examples/server/api_like_OAI.py
+++ b/examples/server/api_like_OAI.py
@@ -39,20 +39,51 @@ def convert_chat(messages):
     user_n = args.user_name
     ai_n = args.ai_name
     stop = args.stop
-
+    multimodal = str()
     prompt = "" + args.chat_prompt + stop

     for line in messages:
         if (line["role"] == "system"):
             prompt += f"{system_n}{line['content']}{stop}"
         if (line["role"] == "user"):
-            prompt += f"{user_n}{line['content']}{stop}"
+            # multimodal heuristic
+            if isinstance(line['content'], list):
+                for cont in line['content']:
+                    multimodal="[img-10]"
+                    if cont['type'] == 'text':
+                        prompt += f"{user_n}{multimodal}{cont['text']}{stop}"
+            else: prompt += f"{user_n}{multimodal}{line['content']}{stop}"
         if (line["role"] == "assistant"):
             prompt += f"{ai_n}{line['content']}{stop}"
     prompt += ai_n.rstrip()

     return prompt

+# from any image format in base64 to JPEG in base64
+# using Pillow lib
+def multimodal_convert_pic(image_b64):
+    from base64 import b64decode,b64encode
+    from io import BytesIO
+    from PIL import Image
+
+    webp_bytes = b64decode(image_b64)
+    im = Image.open(BytesIO(webp_bytes))
+    if im.mode != 'RGB': im = im.convert('RGB')
+    jpg_data = BytesIO()
+    im.save(jpg_data, 'JPEG')
+    jpg_data.seek(0)
+    return b64encode(jpg_data.read()).decode()
+
+def multimodal_extract_image(body):
+    for line in body['messages']:
+        if not line['role'] == 'user': continue
+        for cont in line['content']:
+            if cont['type'] == 'image_url':
+                url = cont['image_url']['url']
+                start = url.find(',') + 1
+                return multimodal_convert_pic(url[start:])
+    return False
+
 def make_postData(body, chat=False, stream=False):
     postData = {}
     if (chat):
@@ -81,6 +112,9 @@ def make_postData(body, chat=False, stream=False):
     postData["stream"] = stream
     postData["cache_prompt"] = True
     postData["slot_id"] = slot_id
+    # multimodal detection
+    pic_data = multimodal_extract_image(body)
+    if pic_data: postData["image_data"] = [{"data": pic_data, "id": 10}]
     return postData

 def make_resData(data, chat=False, promptToken=[]):
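
Not part of the patch, but a quick way to sanity-check the new helper: with Pillow installed and the two helper functions from the patch pasted into a Python session, something like this should return base64 JPEG data from an OpenAI-style body:

import base64
from io import BytesIO
from PIL import Image

# Build a tiny in-memory PNG and wrap it in an OpenAI-style message body.
buf = BytesIO()
Image.new("RGB", (4, 4), (200, 30, 30)).save(buf, "PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

body = {"messages": [{"role": "user", "content": [
    {"type": "text", "text": "describe the picture"},
    {"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}},
]}]}

jpeg_b64 = multimodal_extract_image(body)  # helper defined in the patch above
if jpeg_b64:
    print(len(base64.b64decode(jpeg_b64)), "bytes of JPEG extracted")
else:
    print("no image found in body")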

Launching the proxy with:
./api_like_OAI.py --llama-api http://llamacpp_listening_ip:llamacpp_port --host proxy_listening_ip --port proxy_port

Forwarded message to [llamacpp_listening_ip:llamacpp_port] will look like this:

POST /completion HTTP/1.1
Host: 172.17.1.1:8480
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 5381

{"prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.</s>USER: [img-10]describe this picture</s>ASSISTANT:", "temperature": 1, "top_p": 1, "n_predict": 4000, "presence_penalty": 0, "frequency_penalty": 0, "stop": ["</s>"], "n_keep": -1, "stream": true, "cache_prompt": true, "slot_id": -1, "image_data": [{"data": "/9j/4AA[***STRIPPED BASE64 JPEG****]RQB//2Q==", "id": 10}]}

Then point your OpenAI-protocol-speaking frontend at baseUrl = http://proxy_listening_ip:proxy_port/v1.
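With the proxy in place, an OpenAI SDK client such as the sketch earlier in this thread should work unchanged; only the base_url needs to point at the proxy instead of the llama.cpp server.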

@gelim
Contributor Author

gelim commented Feb 3, 2024

This is now getting more interesting with LLaVA 1.6 released; results on their demo are much more usable than with 1.5...
Waiting for llama.cpp to be updated (#5267), since for now loading the 1.6 GGUFs gives the same quality as 1.5.

Contributor

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024