[Model] Molmo vLLM Integration #9016
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thank you for making the contribution! I'll take a look tonight!
Thank you! Let us know if there are any issues.
Please note that this integration is only for non-MoE models; we will create another PR for the MoE model. Thank you!
Thank you again for the contribution! I left a first round of review; please take a look!
examples/offline_inference_molmo.py
Outdated
It would be great if you could move this example to examples/offline_inference_vision_language.py.
vllm/model_executor/models/molmo.py
Outdated
```python
embedding_weight = dict()
projector_weight = dict()
for name, loaded_weight in weights:
    log.info(f"Original name: {name}")
```
log.info(f"Original name: {name}") |
We can remove this once you verify that all weights load properly.
vllm/model_executor/models/molmo.py
Outdated
if "ln_f.weight" in name: | ||
name = "model.norm.weight" | ||
|
||
if "transformer.blocks" in name: | ||
name = name.replace("transformer.blocks", "layers") | ||
|
||
if "attn_out" in name: | ||
name = name.replace("attn_out", "self_attn.o_proj") | ||
|
||
if "att_proj" in name: | ||
name = name.replace("att_proj", "self_attn.qkv_proj") | ||
|
||
if 'q_norm' in name: | ||
name = name.replace("q_norm", "self_attn.q_norm") | ||
|
||
if 'k_norm' in name: | ||
name = name.replace("k_norm", "self_attn.k_norm") |
Can you use a mapping for this?
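For illustration, the chain of if-statements above could be collapsed into a lookup table along these lines (a sketch derived from the snippet under review, not the code that eventually landed):

```python
# Sketch: checkpoint-name substrings mapped to their vLLM counterparts,
# mirroring the if/replace chain in the reviewed snippet.
WEIGHT_NAME_MAPPING = {
    "transformer.blocks": "layers",
    "attn_out": "self_attn.o_proj",
    "att_proj": "self_attn.qkv_proj",
    "q_norm": "self_attn.q_norm",
    "k_norm": "self_attn.k_norm",
}

def remap_weight_name(name: str) -> str:
    # ln_f is an exact rename rather than a substring replacement.
    if "ln_f.weight" in name:
        return "model.norm.weight"
    for old, new in WEIGHT_NAME_MAPPING.items():
        if old in name:
            name = name.replace(old, new)
    return name
```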
I've been wanting to try Molmo 72B, and in the process of trying to run it in the OpenAI server I found a couple of issues. Details of how I fixed them are in the inline comments.
Thanks a lot for the great work!
Found a problem with how you load the preprocessor: you don't take into account the revision requested by the user - see details in comments
All the changes I'm suggesting are integrated on this branch: https://github.com/graphcore/vllm-fork/tree/molmo-online
Tested by doing:

```
python3 -m vllm.entrypoints.openai.api_server \
    --model allenai/Molmo-7B-D-0924 --revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --code-revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code
```

(The SHA is for the pull request on the HF repo with the fix for the preprocessor.)
vllm/model_executor/models/molmo.py
Outdated
```python
def get_max_molmo_image_tokens(ctx: InputContext) -> int:
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
This call does not respect the revision of the model requested by the user; you need to apply the following suggestion:

```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```

Similarly in the four other locations where this function is called.
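For illustration, one way to avoid repeating the `revision` argument at every call site would be a small wrapper like the one below (a hypothetical helper, not part of this PR; it reuses the `InputContext` and `cached_get_processor` names already imported in molmo.py):

```python
def get_molmo_processor(ctx: InputContext):
    # Hypothetical convenience wrapper: fetch the HF processor while honoring
    # the code revision requested by the user (e.g. via --code-revision).
    return cached_get_processor(
        ctx.model_config.model,
        trust_remote_code=True,
        revision=ctx.model_config.code_revision,
    )
```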
vllm/model_executor/models/molmo.py
Outdated
```python
def dummy_data_for_molmo(
    ctx: InputContext, seq_len: int, mm_counts: Mapping[str, int]
):
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```
vllm/model_executor/models/molmo.py
Outdated
```python
prompt = llm_inputs["prompt"]
multi_modal_data = llm_inputs.get("multi_modal_data")
image = multi_modal_data.get("image")
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```
Hello! @mrsalehi and @sangho-vision. Friendly ping to see when we can get an update on this PR. If you need any help, please let us know!
I am not sure this is the right place to report this, but here it goes: I tested this PR by building it from source. Offline inference works. I also tested the OpenAI-style deployed server, which is the more interesting use case for me.
The server starts up without issues, but if I send a request, I get an Internal Server Error.
Apologies, we were busy with other stuff. I'll update the code today.
@ywang96 I cleaned up the code and pushed it. Could you please review the code by today and release it if everything looks good? Thank you.
@sangho-vision Sorry for the delay - I will review this PR either tonight or tomorrow!
Just a heads up - I did some testing and the results look fine to me, but there's still quite a bit of cleanup work needed for this PR, so I'm just going to do that for you if you don't mind :)
Thank you @mrsalehi @sangho-vision again for contributing this model to vLLM! We really appreciate model support coming directly from the model vendor!
I spent some time myself modifying this PR (and hope you don't mind me doing so), in particular:

- Clean up code formatting. Please consider running the `format.sh` script we provide in the repository for easy code format checking in the future.
- Clean up the example and update documentation and `NOTE` comments.
- I tried to add image embeddings as input for this model, but it looks like the preprocessor for this model is very tied to the assumption that `image` is either a `Pillow.Image.Image` or `ndarray` to preprocess the token sequence. Therefore I'm leaving that out for now.

I have tested both the online and offline interfaces, as well as both models (7B on TP=1, 72B on TP=4), so I'm giving this PR a green light!
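For anyone who wants to reproduce the offline test, a minimal sketch could look like the following (assumptions: the multi-modal `LLM.generate` input format used by other vLLM vision examples, and a placeholder prompt string; see examples/offline_inference_vision_language.py for the canonical Molmo example):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Sketch only: the prompt string is a placeholder; refer to the merged
# examples/offline_inference_vision_language.py for the exact Molmo prompt format.
llm = LLM(model="allenai/Molmo-7B-D-0924", trust_remote_code=True)
image = Image.open("some.png").convert("RGB")

outputs = llm.generate(
    {
        "prompt": "Describe this image.",
        "multi_modal_data": {"image": image},  # PIL image, as the preprocessor expects
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```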
I am also getting this error when using pre-downloaded model weights:

```
$ vllm serve --trust-remote-code --served-model-name=allenai/Molmo-7B-D-0924 ./models/Molmo-7B-D-0924
```
Probably better to open an issue. You might have an error in your encoding method? When I was testing changes to this PR I got it to work with:

```python
with open("some.png", "rb") as file:  # read bytes so base64.b64encode works
    image = base64.b64encode(file.read()).decode("utf-8")
url = f"data:image/png;base64,{image}"
```

But I've not tried it since it landed on main.
Hi! I get the following error when trying to use the OpenAI endpoint while only supplying text:

```
INFO: 127.0.0.1:44610 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 10-16 09:21:19 engine.py:310] Aborted request chat-6fef9d002fc941f79979a8c649078157.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app()  # type: ignore[func-returns-value]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 275, in create_chat_completion
    return await self.chat_completion_full_generator()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 658, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 683, in _process_request
    raise request_output
AttributeError: 'NoneType' object has no attribute 'get'
```
@GSSimLtd @SinanAkkoyun I'm travelling this week so probably won't be able to take a look at this issue deeply, but could you both run the examples again? It's possible that some other PRs caused this issue since I verified both online and offline examples worked when I approved this PR - please also open a separate issue so we can track properly. Thanks!
Edit: It looks like @SinanAkkoyun your issue has been fixed in #9397, please try the main branch!
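The `AttributeError: 'NoneType' object has no attribute 'get'` is consistent with the `multi_modal_data.get("image")` call in the input-processor snippet reviewed above being reached with no image supplied; a guard along these lines is the kind of change such a fix would involve (illustrative only; the actual fix is in #9397):

```python
multi_modal_data = llm_inputs.get("multi_modal_data")
if multi_modal_data is None or "image" not in multi_modal_data:
    # Text-only request: nothing to preprocess on the vision side.
    return llm_inputs
image = multi_modal_data["image"]
```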
Co-authored-by: sanghol <sanghol@allenai.org> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>
[Model] Molmo vLLM Integration
FIX #8808
FIX #8940
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` for changes on the vLLM frontend (e.g., OpenAI API server, `LLM` class, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`, `AsyncLLMEngine`, `Scheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix (e.g., `[Hardware][AMD]`).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use `format.sh` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels
Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

- Tensors require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
- Use `torch.library.opcheck()` to test the function registration and meta-function for any registered ops. See `tests/kernels` for examples.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- An `action-required` label is added to the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!