[Model] Molmo vLLM Integration #9016
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thank you for making the contribution! I'll take a look tonight!
Thank you! Let us know if there are any issues.
Please note that this integration is only for non-MoE models; we will create another PR for the MoE model. Thank you!
Thank you again for the contribution! I left a first round of review; please take a look!
examples/offline_inference_molmo.py
Outdated
It would be great if you could move this example to examples/offline_inference_vision_language.py.
vllm/model_executor/models/molmo.py
Outdated
```python
embedding_weight = dict()
projector_weight = dict()
for name, loaded_weight in weights:
    log.info(f"Original name: {name}")
```
log.info(f"Original name: {name}") |
We can remove this once you verify that all weights load properly.
vllm/model_executor/models/molmo.py
Outdated
if "ln_f.weight" in name: | ||
name = "model.norm.weight" | ||
|
||
if "transformer.blocks" in name: | ||
name = name.replace("transformer.blocks", "layers") | ||
|
||
if "attn_out" in name: | ||
name = name.replace("attn_out", "self_attn.o_proj") | ||
|
||
if "att_proj" in name: | ||
name = name.replace("att_proj", "self_attn.qkv_proj") | ||
|
||
if 'q_norm' in name: | ||
name = name.replace("q_norm", "self_attn.q_norm") | ||
|
||
if 'k_norm' in name: | ||
name = name.replace("k_norm", "self_attn.k_norm") |
Can you use a mapping for this?
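For illustration, the chain of if-statements above could be collapsed into a lookup table along these lines (a sketch derived from the snippet under review, not the code that eventually landed):

```python
# Sketch: checkpoint-name substrings mapped to their vLLM counterparts,
# mirroring the if/replace chain in the reviewed snippet.
WEIGHT_NAME_MAPPING = {
    "transformer.blocks": "layers",
    "attn_out": "self_attn.o_proj",
    "att_proj": "self_attn.qkv_proj",
    "q_norm": "self_attn.q_norm",
    "k_norm": "self_attn.k_norm",
}

def remap_weight_name(name: str) -> str:
    # ln_f is an exact rename rather than a substring replacement.
    if "ln_f.weight" in name:
        return "model.norm.weight"
    for old, new in WEIGHT_NAME_MAPPING.items():
        if old in name:
            name = name.replace(old, new)
    return name
```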
I've been wanting to try Molmo 72B, and in the process of trying to run it in the OpenAI server I found a couple of issues. Details of how I fixed them are in the inline comments.
Thanks a lot for the great work!
Found a problem with how you load the preprocessor: you don't take into account the revision requested by the user - see details in comments
All the changes I'm suggesting are integrated on this branch: https://github.com/graphcore/vllm-fork/tree/molmo-online
Tested by doing:

```
python3 -m vllm.entrypoints.openai.api_server \
    --model allenai/Molmo-7B-D-0924 --revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --code-revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code
```

(The SHA is for the pull request on the HF repo with the fix for the preprocessor.)
vllm/model_executor/models/molmo.py
Outdated
```python
def get_max_molmo_image_tokens(ctx: InputContext) -> int:
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
This call does not respect the revision of the model requested by the user; you need to apply the following suggestion:

```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```

Similarly in the four other locations where this function is called.
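For illustration, one way to avoid repeating the `revision` argument at every call site would be a small wrapper like the one below (a hypothetical helper, not part of this PR; it reuses the `InputContext` and `cached_get_processor` names already imported in molmo.py):

```python
def get_molmo_processor(ctx: InputContext):
    # Hypothetical convenience wrapper: fetch the HF processor while honoring
    # the code revision requested by the user (e.g. via --code-revision).
    return cached_get_processor(
        ctx.model_config.model,
        trust_remote_code=True,
        revision=ctx.model_config.code_revision,
    )
```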
vllm/model_executor/models/molmo.py
Outdated
```python
def dummy_data_for_molmo(
    ctx: InputContext, seq_len: int, mm_counts: Mapping[str, int]
):
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```
vllm/model_executor/models/molmo.py
Outdated
```python
prompt = llm_inputs["prompt"]
multi_modal_data = llm_inputs.get("multi_modal_data")
image = multi_modal_data.get("image")
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
```
```diff
- processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
+ processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)
```
Hello! @mrsalehi and @sangho-vision. Friendly ping to see when we can get an update on this PR. If you need any help, please let us know!
I am not sure this is the right place to report this, but here it goes: I tested this PR by building it from source. Offline inference works. I also tested the OpenAI-style deployed server, which is the more interesting use case for me.
The server starts up without issues, but if I send a request, I get an Internal Server Error.
Apologies, we were busy with other stuff. I'll update the code today.
@ywang96 I cleaned up the code and pushed it. Could you please review the code by today and release it if everything looks good? Thank you.
@sangho-vision Sorry for the delay - I will review this PR either tonight or tomorrow!
Just a heads up - I did some testing and the results look fine to me, but there's still quite a bit of cleanup work needed for this PR, so I'm just going to do that for you if you don't mind :)
Thank you @mrsalehi @sangho-vision again for contributing this model to vLLM! We really appreciate model support coming directly from the model vendor!
I spent some time myself modifying this PR (and hope you don't mind me doing so), in particular:

- Clean up code formatting. Please consider running the `format.sh` script we provide in the repository for easy code format checking in the future.
- Clean up the example and update documentation and `NOTE` comments.
- I tried to add image embeddings as input for this model, but it looks like the preprocessor for this model is very tied to the assumption that `image` is either a `Pillow.Image.Image` or `ndarray` to preprocess the token sequence. Therefore I'm leaving that out for now.

I have tested both the online and offline interfaces, as well as both models (7B on TP=1, 72B on TP=4), so I'm giving this PR a green light!
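For anyone who wants to reproduce the offline test, a minimal sketch could look like the following (assumptions: the multi-modal `LLM.generate` input format used by other vLLM vision examples, and a placeholder prompt string; see examples/offline_inference_vision_language.py for the canonical Molmo example):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Sketch only: the prompt string is a placeholder; refer to the merged
# examples/offline_inference_vision_language.py for the exact Molmo prompt format.
llm = LLM(model="allenai/Molmo-7B-D-0924", trust_remote_code=True)
image = Image.open("some.png").convert("RGB")

outputs = llm.generate(
    {
        "prompt": "Describe this image.",
        "multi_modal_data": {"image": image},  # PIL image, as the preprocessor expects
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```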
I am also getting this error when using pre-downloaded model weights:

```
$ vllm serve --trust-remote-code --served-model-name=allenai/Molmo-7B-D-0924 ./models/Molmo-7B-D-0924
```
Probably better to open an issue. You might have an error in your encoding method? When I was testing changes to this PR I got it to work with:

```python
with open("some.png", "rb") as file:  # read bytes so base64.b64encode works
    image = base64.b64encode(file.read()).decode("utf-8")
url = f"data:image/png;base64,{image}"
```

But I've not tried it since it landed on main.
Hi! I get the following error when trying to use the OpenAI endpoint while only supplying text:

```
INFO: 127.0.0.1:44610 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 10-16 09:21:19 engine.py:310] Aborted request chat-6fef9d002fc941f79979a8c649078157.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app()  # type: ignore[func-returns-value]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 275, in create_chat_completion
    return await self.chat_completion_full_generator()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 658, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 683, in _process_request
    raise request_output
AttributeError: 'NoneType' object has no attribute 'get'
```
@GSSimLtd @SinanAkkoyun I'm travelling this week so probably won't be able to take a look at this issue deeply, but could you both run the examples again? It's possible that some other PRs caused this issue since I verified both online and offline examples worked when I approved this PR - please also open a separate issue so we can track properly. Thanks!
Edit: It looks like @SinanAkkoyun your issue has been fixed in #9397, please try the main branch!
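The `AttributeError: 'NoneType' object has no attribute 'get'` is consistent with the `multi_modal_data.get("image")` call in the input-processor snippet reviewed above being reached with no image supplied; a guard along these lines is the kind of change such a fix would involve (illustrative only; the actual fix is in #9397):

```python
multi_modal_data = llm_inputs.get("multi_modal_data")
if multi_modal_data is None or "image" not in multi_modal_data:
    # Text-only request: nothing to preprocess on the vision side.
    return llm_inputs
image = multi_modal_data["image"]
```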
Co-authored-by: sanghol <sanghol@allenai.org> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>
[Model] Molmo vLLM Integration
FIX #8808
FIX #8940
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` for changes on the vLLM frontend (e.g., OpenAI API server, `LLM` class, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`, `AsyncLLMEngine`, `Scheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix (e.g., `[Hardware][AMD]`).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use `format.sh` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels
Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

- Tensors require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
- Use `torch.library.opcheck()` to test the function registration and meta-function for any registered ops. See `tests/kernels` for examples.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- An `action-required` label is added to the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!