
[Model] Molmo vLLM Integration #9016

Merged: 12 commits into vllm-project:main on Oct 14, 2024

Conversation

mrsalehi
Contributor

@mrsalehi mrsalehi commented Oct 2, 2024

[Model] Molmo vLLM Integration

FIX #8808
FIX #8940


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand the code.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


github-actions bot commented Oct 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ywang96
Member

ywang96 commented Oct 2, 2024

Thank you for making the contribution! I’ll take a look tonight!

@ywang96 ywang96 self-assigned this Oct 2, 2024
@mrsalehi
Contributor Author

mrsalehi commented Oct 2, 2024

Thank you for making the contribution! I’ll take a look tonight!

Thank you! Let us know if there are any issues.

@sangho-vision
Contributor

sangho-vision commented Oct 2, 2024

Please note that this integration is only for the non-MoE models, that is:
https://huggingface.co/allenai/Molmo-72B-0924
https://huggingface.co/allenai/Molmo-7B-D-0924
https://huggingface.co/allenai/Molmo-7B-O-0924

We will create another PR for the MoE model:
https://huggingface.co/allenai/MolmoE-1B-0924

Thank you!

Member

@ywang96 ywang96 left a comment

Thank you again for the contribution! I left a first round of review comments; please take a look!

Member

It would be great if you could move this example to examples/offline_inference_vision_language.py.
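For reference, a minimal sketch of what such an offline example could look like (an illustration only, not the code moved in this PR: it assumes vLLM's LLM entrypoint, the allenai/Molmo-7B-D-0924 checkpoint, and the "User: ... Assistant:" prompt style seen later in this thread):

# Hedged sketch of offline vision-language inference with Molmo in vLLM.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="allenai/Molmo-7B-D-0924", trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
prompt = "User: Describe this image. Assistant:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)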

embedding_weight = dict()
projector_weight = dict()
for name, loaded_weight in weights:
    log.info(f"Original name: {name}")
Member

Suggested change
log.info(f"Original name: {name}")

We can remove this once you verify that all weights load properly.

Comment on lines 1040 to 1056
if "ln_f.weight" in name:
name = "model.norm.weight"

if "transformer.blocks" in name:
name = name.replace("transformer.blocks", "layers")

if "attn_out" in name:
name = name.replace("attn_out", "self_attn.o_proj")

if "att_proj" in name:
name = name.replace("att_proj", "self_attn.qkv_proj")

if 'q_norm' in name:
name = name.replace("q_norm", "self_attn.q_norm")

if 'k_norm' in name:
name = name.replace("k_norm", "self_attn.k_norm")
Member

Can you use a mapping for this?
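For illustration, the chain of if-statements above could be collapsed into a substring mapping along these lines (a sketch only, using the names from the snippet under review; the refactor that actually landed may differ):

# Illustrative rename table replacing the if-chain above; replacements applied in order.
_MOLMO_NAME_MAPPING = {
    "transformer.blocks": "layers",
    "attn_out": "self_attn.o_proj",
    "att_proj": "self_attn.qkv_proj",
    "q_norm": "self_attn.q_norm",
    "k_norm": "self_attn.k_norm",
}

def remap_weight_name(name: str) -> str:
    if "ln_f.weight" in name:
        return "model.norm.weight"
    for old, new in _MOLMO_NAME_MAPPING.items():
        if old in name:
            name = name.replace(old, new)
    return name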

Contributor

@payoto payoto left a comment

I've been wanting to try Molmo 72B, and while trying to run it in the OpenAI server I found a couple of issues. Details of how I fixed them are in the inline comments.

Thanks a lot for the great work!

Contributor

@payoto payoto left a comment

Found a problem with how you load the preprocessor: it doesn't take into account the revision requested by the user; see details in the comments.

All the changes I'm suggesting are integrated on this branch: https://github.com/graphcore/vllm-fork/tree/molmo-online

Tested by doing:

python3 -m vllm.entrypoints.openai.api_server \
    --model allenai/Molmo-7B-D-0924 --revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --code-revision bbf3f0508a1b818f29e54e54e8177723a7d72aae \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code

(The SHA is for the pull request on the HF repo with the fix for the preprocessor.)



def get_max_molmo_image_tokens(ctx: InputContext) -> int:
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
Contributor

This call does not respect the revision of the model requested by the user; you need to apply the following suggestion:

Suggested change
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)

The same applies in the 4 other locations where this function is called.
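Purely as an illustration, one way to avoid repeating that fix across all call sites would be a small helper like the following (the name _get_processor is hypothetical and not part of this PR; it just centralizes the revision-aware call):

# Hypothetical wrapper so every call site picks up the user-requested code revision.
def _get_processor(ctx: InputContext):
    return cached_get_processor(
        ctx.model_config.model,
        trust_remote_code=True,
        revision=ctx.model_config.code_revision,
    )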

def dummy_data_for_molmo(
    ctx: InputContext, seq_len: int, mm_counts: Mapping[str, int]
):
    processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
Contributor

Suggested change
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)

prompt = llm_inputs["prompt"]
multi_modal_data = llm_inputs.get("multi_modal_data")
image = multi_modal_data.get("image")
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
Contributor

Suggested change
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True)
processor = cached_get_processor(ctx.model_config.model, trust_remote_code=True, revision=ctx.model_config.code_revision)

@ywang96
Member

ywang96 commented Oct 6, 2024

Hello! @mrsalehi and @sangho-vision. Friendly ping to see when we can get an update on this PR. If you need any help please let us know!

@ayylemao
Contributor

ayylemao commented Oct 7, 2024

I am not sure this is the right place to report this, but here it goes:

I tested this PR by building it from source. Offline inference via LLM(model="allenai/Molmo-7B-D-0924") works without problems.

I also tested the OpenAI-style deployed server, which is the more interesting use case for me, via:

vllm serve allenai/Molmo-7B-D-0924 --dtype auto --tensor-parallel-size 2 --port 8082 --trust-remote-code

The server starts up without issues, but if I send a request like:

base64_image = encode_image(image_path)

headers = {
    "Content-Type": "application/json",
}
payload = {
    "model": "allenai/Molmo-7B-D-0924",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": user
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0
}
response = requests.post("http://localhost:8082/v1/chat/completions", headers=headers, json=payload)

I get the following Internal server error:

INFO 10-07 14:27:26 logger.py:36] Received request chat-5ce55afc00c944a5bf54d9e285e295ae: prompt: 'User: Point to max 1 element a human web analyst would most likely click. Focus on the most prominent element and start from the center. Assistant:', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4065, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [1474, 25, 5126, 311, 1932, 220, 16, 2392, 264, 3738, 3482, 18237, 1035, 1429, 4363, 4205, 13, 25806, 389, 279, 1429, 20469, 2392, 323, 1191, 504, 279, 4126, 13, 21388, 25], lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:60290 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 10-07 14:27:26 engine.py:306] Aborted request chat-5ce55afc00c944a5bf54d9e285e295ae.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/home/jbxgpu01/projects/molmo/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/jbxgpu01/projects/molmo/vllm/vllm/entrypoints/openai/api_server.py", line 313, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion(
  File "/home/jbxgpu01/projects/molmo/vllm/vllm/entrypoints/openai/serving_chat.py", line 255, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/home/jbxgpu01/projects/molmo/vllm/vllm/entrypoints/openai/serving_chat.py", line 625, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/jbxgpu01/projects/molmo/vllm/vllm/utils.py", line 455, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/jbxgpu01/projects/molmo/vllm/vllm/engine/multiprocessing/client.py", line 581, in _process_request
    raise request_output
TypeError: expected string or bytes-like object

Maybe this is of interest to you.

@mrsalehi
Contributor Author

mrsalehi commented Oct 7, 2024

Hello! @mrsalehi and @sangho-vision. Friendly ping to see when we can get an update on this PR. If you need any help please let us know!

Apologies, we were busy with other stuff. I'll update the code today.

@sangho-vision
Contributor

@ywang96 I cleaned up the code and pushed it. Could you please review the code by today and release it if everything looks good? Thank you.

@ywang96
Member

ywang96 commented Oct 13, 2024

@sangho-vision Sorry for the delay - I will review this PR either tonight or tomorrow!

@ywang96
Member

ywang96 commented Oct 14, 2024

Just a heads up - I did some testing and the results look fine to me, but there's still quite a bit of cleanup work needed for this PR, so I'm just going to do that for you if you don't mind :)

Member

@ywang96 ywang96 left a comment

Thank you @mrsalehi @sangho-vision again for contributing this model to vLLM! We really appreciate model support coming directly from the model vendor!

I spent some time myself modifying this PR (and hope you don't mind me doing so), in particular:

  • Cleaned up code formatting. Please consider running the format.sh we provide in the repository for easy code format checking in the future.
  • Cleaned up the example and updated the documentation and NOTE comments.
  • I tried to add image embeddings as an input for this model, but it looks like the preprocessor for this model is tightly tied to the assumption that the image is either a PIL.Image.Image or an ndarray when preprocessing the token sequence. Therefore I'm leaving that out for now.

I have tested both the online and offline interfaces, as well as both models (7B on TP=1, 72B on TP=4), so I'm giving this PR a green light!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 14, 2024
@ywang96 ywang96 merged commit dfe43a2 into vllm-project:main Oct 14, 2024
68 checks passed
@GSSimLtd


I am also getting the error @ayylemao reported above when using pre-downloaded model weights:

$ vllm serve --trust-remote-code --served-model-name=allenai/Molmo-7B-D-0924 ./models/Molmo-7B-D-0924

@payoto
Contributor

payoto commented Oct 15, 2024

It's probably better to open an issue. You might have an error in your encoding method. When I was testing changes to this PR, I got it to work with:

with open("some.png") as file:
    image = base64.b64encode(file.read()).decode("utf-8")
url = f"data:image/png;base64,{image}"

But I've not tried it since it landed on main
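For completeness, a minimal client-side sketch of that flow using the openai Python client (assuming a vLLM OpenAI-compatible server at http://localhost:8000/v1; the file name and prompt are placeholders):

# Sketch: read an image as bytes, base64-encode it, and send it as a data URL.
import base64
from openai import OpenAI

with open("some.png", "rb") as file:  # binary mode so b64encode receives bytes
    image = base64.b64encode(file.read()).decode("utf-8")

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="allenai/Molmo-7B-D-0924",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image}"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)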

@SinanAkkoyun

SinanAkkoyun commented Oct 16, 2024

Hi! I get the following error when trying to use the OpenAI endpoint while supplying only text:

INFO:     127.0.0.1:44610 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 10-16 09:21:19 engine.py:310] Aborted request chat-6fef9d002fc941f79979a8c649078157.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app()  # type: ignore[func-returns-value]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 275, in create_chat_completion
    return await self.chat_completion_full_generator()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 658, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 683, in _process_request
    raise request_output
AttributeError: 'NoneType' object has no attribute 'get'

@ywang96
Member

ywang96 commented Oct 16, 2024

@GSSimLtd @SinanAkkoyun I'm travelling this week so probably won't be able to take a look at this issue deeply, but could you both run examples/openai_api_client_for_multimodal.py and verify if the example code works for you?

It's possible that some other PRs caused this issue, since I verified both the online and offline examples worked when I approved this PR - please also open a separate issue so we can track it properly. Thanks!

Edit: It looks like @SinanAkkoyun your issue has been fixed in #9397, please try the main branch!

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
Co-authored-by: sanghol <sanghol@allenai.org>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • [New Model]: Molmo support
  • [New Model]: allenai/Molmo-7B-0-0924 VisionLM
9 participants