
[Bug] requires "python-multipart" to be installed with docker image #949

Closed
kerthcet opened this issue Aug 6, 2024 · 18 comments

@kerthcet

kerthcet commented Aug 6, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I tested sglang on Kubernetes with the minimal configuration below:

  containers:
  - args:
    - --model-path
    - /workspace/models/models--facebook--opt-125m
    - --served-model-name
    - opt-125m
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    command:
    - python3
    - -m
    - sglang.launch_server
    image: lmsysorg/sglang:v0.2.9-cu121

However, it emits an error like:

Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:

pip install python-multipart

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 5, in <module>
    from sglang.srt.server import launch_server
  File "/sgl-workspace/sglang/python/sglang/srt/server.py", line 162, in <module>
    async def openai_v1_files(file: UploadFile = File(...), purpose: str = Form("batch")):
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 944, in decorator
    self.add_api_route(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 883, in add_api_route
    route = route_class(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 519, in __init__
    self.body_field = get_body_field(dependant=self.dependant, name=self.unique_id)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 817, in get_body_field
    check_file_field(final_field)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 100, in check_file_field
    raise RuntimeError(multipart_not_installed_error) from None
RuntimeError: Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:

pip install python-multipart

Based on my understanding, this dependency should be included in the Docker image. Is there anything I missed here?
Thanks!
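
As a stop-gap on the older tag, a quick way to confirm from inside the container whether the dependency is present (a minimal sketch; it checks the python-multipart distribution by name before launching sglang.launch_server):

# Minimal check, run inside the container, for the optional FastAPI
# form-data dependency that the openai_v1_files endpoint in sglang.srt.server needs.
from importlib import metadata

try:
    print("python-multipart", metadata.version("python-multipart"))
except metadata.PackageNotFoundError:
    raise SystemExit('Missing dependency: run "pip install python-multipart" in the image.')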

Reproduction

The relevant part of the YAML manifest is shown above.

Environment

Running with the Docker image `lmsysorg/sglang:v0.2.9-cu121`.
@zhyncs
Member

zhyncs commented Aug 6, 2024

Please use lmsysorg/sglang:latest.

@zhyncs zhyncs closed this as completed Aug 6, 2024
@zhyncs
Member

zhyncs commented Aug 6, 2024

fixed #895

@kerthcet
Author

kerthcet commented Aug 6, 2024

Thanks for the response. When will a new version be released?

@zhyncs
Member

zhyncs commented Aug 6, 2024

Hi @kerthcet, the new version has in fact already been released. See:
https://pypi.org/project/sglang/
https://hub.docker.com/r/lmsysorg/sglang/tags

@kerthcet
Author

kerthcet commented Aug 6, 2024

Sounds great, let me give v0.2.10-cu124 a try.

@kerthcet
Author

kerthcet commented Aug 6, 2024

I just hit another error with v0.2.10-cu124:

[gpu=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=13.40 GB
[gpu=0] Memory pool end. avail mem=1.06 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/controller_single.py", line 150, in start_controller_process
    controller = ControllerSingle(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/controller_single.py", line 84, in __init__
    self.tp_server = ModelTpServer(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 92, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 140, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 341, in init_cuda_graphs
    self.cuda_graph_runner.capture(batch_size_list)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 133, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 196, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 193, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 287, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 255, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 207, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 156, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 182, in forward
    return self.decode_forward(q, k, v, input_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 166, in decode_forward_flashinfer
    o = input_metadata.flashinfer_decode_wrapper.forward(
  File "/usr/local/lib/python3.8/dist-packages/flashinfer/decode.py", line 629, in forward
    out = self._wrapper.forward(
ValueError: FlashInfer Internal Error: Invalid configuration : num_frags_x=1 num_frags_y=4 num_frags_z=1 num_warps_x=1 num_warps_z=4 please create an issue (https://github.com/flashinfer-ai/flashinfer/issues) and report the issue to the developers.

@zhyncs zhyncs reopened this Aug 6, 2024
@zhyncs
Member

zhyncs commented Aug 6, 2024

Hi @kerthcet, thank you for reporting this issue; we will take a look. cc @yzh119

@kerthcet
Author

kerthcet commented Aug 6, 2024

Here is the configuration I used with v0.2.10-cu124:

containers:
  - args:
    - --model-path
    - /workspace/models/models--Qwen--Qwen2-0.5B-Instruct
    - --served-model-name
    - qwen2-05b
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    command:
    - python3
    - -m
    - sglang.launch_server

@zhyncs
Member

zhyncs commented Aug 6, 2024

By the way, why did you start with cu121 and then switch to cu124 later?

@kerthcet
Author

kerthcet commented Aug 6, 2024

Well, I don't care much about the CUDA version; I just want to get it running on Kubernetes. Does that matter?

@zhyncs
Member

zhyncs commented Aug 6, 2024

If the CUDA version of your host machine is 12.1, then I still recommend you use the cu121 image.
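
A quick way to double-check the pairing (a minimal sketch; run inside the container, and assumes the image's bundled PyTorch):

# Print the CUDA runtime that the bundled PyTorch wheel was built against, so it
# can be matched to the image tag (cu121 vs cu124) and to the driver version that
# nvidia-smi reports on the host.
import torch

print("torch CUDA build:", torch.version.cuda)       # e.g. "12.1" for a cu121 image
print("CUDA available:", torch.cuda.is_available())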

@kerthcet
Author

kerthcet commented Aug 6, 2024

Makes sense, thanks for the input. I'll give it another try.

@yzh119
Collaborator

yzh119 commented Aug 7, 2024

@kerthcet what's your GPU architecture? I suppose you might be using sm_86, which has a smaller shared memory size. I'll fix FlashInfer's behavior for sm_86 in v0.1.4.
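
In case it helps, a minimal sketch (assumes PyTorch with CUDA access, e.g. run inside the sglang container on the node) to report the compute capability:

# Print the GPU name and compute capability; a T4 reports sm_75 and an A10 sm_86,
# which determines the shared memory per SM available to FlashInfer's kernels.
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: sm_{props.major}{props.minor}")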

@yzh119
Collaborator

yzh119 commented Aug 7, 2024

I can't reproduce the error under sm_86's shared memory limit (96 KB/SM). It only happens when I limit the shared memory size to 64 KB/SM (Turing/sm_75's spec).

@kerthcet
Author

kerthcet commented Aug 7, 2024

Hi, I may be using a T4 or an A10; I can't remember. I can test later. Can you suggest which GPU architecture I should test with?

@yzh119
Collaborator

yzh119 commented Aug 9, 2024

T4 is sm_75, which is fairly old; I'll fix the behavior for it later. An A10 (sm_86) should work.

@kerthcet
Author

kerthcet commented Aug 16, 2024

Tested with an A10 and it works. I can test with a T4 once the fix for that arch lands. Thanks for the feedback.

@zhyncs
Member

zhyncs commented Aug 16, 2024

Thanks for verification!

@zhyncs zhyncs closed this as completed Aug 16, 2024