
[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation #2619

Open

SuperMasterBlasterLaser opened this issue Dec 27, 2024 · 5 comments

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.
  3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  5. Please use English, otherwise it will be closed.

Describe the bug

I rented an RTX 6000 Ada GPU with 48 GB of VRAM via vast.ai.

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4

Then I installed flashinfer with this command:

pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

Then I installed sglang with this command:

pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Then I downloaded Llama-3.2-11B-Vision-Instruct and launched it like this:

python -m sglang.launch_server --model-path /root/Llama-3.2-11B-Vision-Instruct --port 8080 --host 0.0.0.0

Then I used this simple script to run inference on an image:

import sglang as sgl


base_url = "url.to.my.server"

@sgl.function
def caption_image(s, image_file):
    s += sgl.user(sgl.image(image_file) + "What is the overall style of this image?")
    s += sgl.assistant(sgl.gen("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon"]))
    s += sgl.user("Overall description of this image:")
    s += sgl.assistant(sgl.gen("description", max_tokens=255))


sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./example.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])
print(state["description"])
print(state.text())

However, when I run this code on a simple image, it just hangs; I receive no response and not even an error message.
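
As a basic sanity check (an illustrative sketch, not part of my original script), a direct request to the server's /get_model_info endpoint is a quick way to confirm that the server itself answers HTTP; replace the placeholder URL with the real host and port:

import requests

# Hypothetical connectivity check against the same endpoint that appears in the server logs.
resp = requests.get("http://url.to.my.server:8080/get_model_info", timeout=10)
print(resp.status_code, resp.text)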

Logs:

[2024-12-27 17:04:20 TP0] Overlap scheduler is disabled for multimodal models.
[2024-12-27 17:04:20 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2024-12-27 17:04:20 TP0] Init torch distributed begin.
[2024-12-27 17:04:21 TP0] Load weight begin. avail mem=46.99 GB
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00,  4.62it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.63it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.22it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.15it/s]

[2024-12-27 17:04:26 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=26.84 GB
[2024-12-27 17:04:26 TP0] Memory pool end. avail mem=6.62 GB
[2024-12-27 17:04:26 TP0] Capture cuda graph begin. This can take up to several minutes.
[00:11<00:00,  2.00it/s]
[2024-12-27 17:04:38 TP0] Capture cuda graph end. Time elapsed: 11.53 s
[2024-12-27 17:04:38 TP0] max_total_num_tokens=125417, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-12-27 17:04:39] INFO:     Started server process [1126]
[2024-12-27 17:04:39] INFO:     Waiting for application startup.
[2024-12-27 17:04:39] INFO:     Application startup complete.
[2024-12-27 17:04:39] INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
[2024-12-27 17:04:40] INFO:     127.0.0.1:34840 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:04:40 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:04:40] INFO:     127.0.0.1:34854 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:04:40] The server is fired up and ready to roll!
[2024-12-27 17:04:47] INFO:     91.198.101.42:57416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-27 17:05:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6425, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:18] INFO:     91.198.101.42:14407 - "POST /generate HTTP/1.1" 200 OK
[2024-12-27 17:05:20 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 49.95%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:21 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 66.61%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:23 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6423, cache hit rate: 74.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 79.94%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-12-27 17:05:24 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6423, cache hit rate: 83.27%, token usage: 0.05, #running-req: 0, #queue-req: 0

I don't understand why this is happening.

Reproduction

The reproduction steps are in the description above.

Environment

Specs:

  1. Ubuntu 22.04
  2. PyTorch 2.4.1
  3. CUDA 12.4
  4. RTX 6000 Ada

@SuperMasterBlasterLaser (Author)

I found out that when I use select, or gen with choices, it hangs, but a plain gen without any constraints returns generated results.
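
For example, a gen-only variant along these lines (an illustrative sketch, not my exact script) comes back with a description, while the choices-constrained call above never returns:

import sglang as sgl

# Same kind of image prompt, but plain gen with no choices/select constraint.
@sgl.function
def caption_image_unconstrained(s, image_file):
    s += sgl.user(sgl.image(image_file) + "Overall description of this image:")
    s += sgl.assistant(sgl.gen("description", max_tokens=255))

sgl.set_default_backend(sgl.RuntimeEndpoint("url.to.my.server"))
state = caption_image_unconstrained.run(image_file="./example.png")
print(state["description"])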

@bluenevus

I had to roll back to v0.4.0 for 11B Vision to work again. It errors out on 0.4.1 for me.

@SuperMasterBlasterLaser (Author)

@bluenevus does gen with choices or the select method work on v0.4.0?

@SuperMasterBlasterLaser (Author)

OK. I thought the hanging issue was connected to a lack of GPU VRAM, so I rented an H100 with 80 GB of VRAM to launch Llama-3.2-11B-Vision-Instruct and ran this simple script:

import sglang as sgl

base_url = "url.to.my.server"

@sgl.function
def caption_image(s, image_file):
    s += "You are very smart image captioning service"
    s += "Given this image: " + sgl.image(image_file)
    s += "Overall style of this image is: " + sgl.select("global_style", choices=["cinematic", "animated", "anime", "3d", "cartoon", "digital art"])

sgl.set_default_backend(sgl.RuntimeEndpoint(base_url))

image_path = "./examples/image.png"

state = caption_image.run(image_file=image_path)

print(state["global_style"])

And it still hangs with these logs:

[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 88.85%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 89.96%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:41 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 90.87%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 6426, cache hit rate: 91.63%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 6426, cache hit rate: 92.27%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-01 15:27:42 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 6426, cache hit rate: 92.82%, token usage: 0.02, #running-req: 0, #queue-req: 0

Then I changed --max-prefill-tokens to 8291 and it still did not work. Then I switched the model to LLaVA, and it also hangs on choice generation with the same logs.

I think the choices and select methods for multimodal models are broken or do not work at all.
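
A quick way to isolate whether the constraint machinery itself works on this endpoint (an illustrative sketch; the function and prompt below are made up for the test) is to run the same select call with no image in the prompt:

import sglang as sgl

# Hypothetical text-only probe: the same select constraint, but without sgl.image in the prompt,
# to check whether constrained decoding hangs only when an image is present.
@sgl.function
def style_text_only(s):
    s += "Overall style of a typical movie poster is: " + sgl.select(
        "global_style",
        choices=["cinematic", "animated", "anime", "3d", "cartoon", "digital art"],
    )

sgl.set_default_backend(sgl.RuntimeEndpoint("url.to.my.server"))
print(style_text_only.run()["global_style"])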

@bluenevus

@bluenevus does gen with choices or select methods work on v0.4.0?

Not sure what that means, but you can see the compose configuration here:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['1', '3']
          capabilities: [gpu]
shm_size: '32gb'
ipc: host
ports:
  - "8011:30000"
volumes:
  - ~/.cache/huggingface:/root/.cache/huggingface
command: >
  python3 -m sglang.launch_server
  --model-path alpindale/Llama-3.2-11B-Vision-Instruct
  --host 0.0.0.0
  --port 30000
  --device cuda
  --kv-cache-dtype auto
  --dtype float16
  --tp-size 2
  --context-length 32768
  --max-running-requests 12
  --attention-backend flashinfer
  --sampling-backend flashinfer
  --trust-remote-code
  --mem-fraction-static 0.95
  --disable-cuda-graph
  --enable-torch-compile
  --chat-template llama_3_vision
  --grammar-backend xgrammar
