OpenAI incompatible image handling in server multimodal #4771
If I understand correctly, that is more of a feature that is not implemented within server.cpp than a bug in itself. Here is the OpenAI API documentation for reference: https://platform.openai.com/docs/api-reference/chat/create
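For context, a vision-style request in that API sends the user message content as a list of parts rather than a plain string. A minimal sketch of such a request body (model name and values purely illustrative) looks roughly like:

{
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "describe this picture"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AA..."}}
      ]
    }
  ]
}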
Ok, after digging a bit, I found the relevant code and the format info in the server README.md.
I monkey patched api_like_OAI.py. The main idea is to catch messages whose 'content' is typed as a list, extract the 'image_url' base64 data, convert it to JPEG (forcing that because my frontend sends WebP), and create the root key 'image_data' with data + id. To be done: add multi-image support.
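For reference, the payload shape the patched proxy has to produce for the server's /completion endpoint (the image_data field and the [img-N] prompt tag are the ones described in the server README; the prompt text and id here are illustrative) is roughly:

{
  "prompt": "USER: [img-10]describe this picture\nASSISTANT:",
  "image_data": [{"data": "<base64 JPEG>", "id": 10}]
}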
I am experiencing the same thing. I tried using this code and could not get it to work.
Yes, you need to do the JSON adaptation yourself. I can post my crappy code later for people to improve it.
Would you be kind enough to drop your code in a gist or give an example? Thank you.
diff --git a/examples/server/api_like_OAI.py b/examples/server/api_like_OAI.py
index 607fe49..6638081 100755
--- a/examples/server/api_like_OAI.py
+++ b/examples/server/api_like_OAI.py
@@ -39,20 +39,51 @@ def convert_chat(messages):
     user_n = args.user_name
     ai_n = args.ai_name
     stop = args.stop
-
+    multimodal = str()
     prompt = "" + args.chat_prompt + stop
     for line in messages:
         if (line["role"] == "system"):
             prompt += f"{system_n}{line['content']}{stop}"
         if (line["role"] == "user"):
-            prompt += f"{user_n}{line['content']}{stop}"
+            # multimodal heuristic
+            if isinstance(line['content'], list):
+                for cont in line['content']:
+                    multimodal = "[img-10]"
+                    if cont['type'] == 'text':
+                        prompt += f"{user_n}{multimodal}{cont['text']}{stop}"
+            else: prompt += f"{user_n}{multimodal}{line['content']}{stop}"
         if (line["role"] == "assistant"):
             prompt += f"{ai_n}{line['content']}{stop}"
     prompt += ai_n.rstrip()
     return prompt

+# from any image format in base64 to JPEG in base64
+# using Pillow lib
+def multimodal_convert_pic(image_b64):
+    from base64 import b64decode, b64encode
+    from io import BytesIO
+    from PIL import Image
+
+    webp_bytes = b64decode(image_b64)
+    im = Image.open(BytesIO(webp_bytes))
+    if im.mode != 'RGB': im = im.convert('RGB')
+    jpg_data = BytesIO()
+    im.save(jpg_data, 'JPEG')
+    jpg_data.seek(0)
+    return b64encode(jpg_data.read()).decode()
+
+def multimodal_extract_image(body):
+    for line in body['messages']:
+        if not line['role'] == 'user': continue
+        for cont in line['content']:
+            if cont['type'] == 'image_url':
+                url = cont['image_url']['url']
+                start = url.find(',') + 1
+                return multimodal_convert_pic(url[start:])
+    return False
+
 def make_postData(body, chat=False, stream=False):
     postData = {}
     if (chat):
@@ -81,6 +112,9 @@
     postData["stream"] = stream
     postData["cache_prompt"] = True
     postData["slot_id"] = slot_id
+    # multimodal detection
+    pic_data = multimodal_extract_image(body)
+    if pic_data: postData["image_data"] = [{"data": pic_data, "id": 10}]
     return postData

 def make_resData(data, chat=False, promptToken=[]):

After launching the proxy, the message forwarded to [llamacpp_listening_ip:llamacpp_port] will look like this:

POST /completion HTTP/1.1
Host: 172.17.1.1:8480
User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 5381
{"prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.</s>USER: [img-10]describe this picture</s>ASSISTANT:", "temperature": 1, "top_p": 1, "n_predict": 4000, "presence_penalty": 0, "frequency_penalty": 0, "stop": ["</s>"], "n_keep": -1, "stream": true, "cache_prompt": true, "slot_id": -1, "image_data": [{"data": "/9j/4AA[***STRIPPED BASE64 JPEG****]RQB//2Q==", "id": 10}]} and you will point your OpenAI protocol speaking frontend to baseUrl = http://proxy_listening_ip:proxy_port/v1 |
This is now getting more interesting with Llava 1.6 being released and results much more usable than 1.5 on their demo...
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Hello, while testing Llava-13B with the server implementation I got a 500 error related to the content key being a list of dicts and not a simple string.

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Hi there! How can I help you today?","role":"assistant"}}],[...]

[json.exception.type_error.302] type must be string, but is array

This is to demonstrate the issue when using an OpenAI-REST-aware frontend that pushes text with a picture inside the content key, i.e. content sent as a list of text and image_url parts instead of a plain string.