[Bug] libcudart.so.12: cannot open shared object file: No such file or directory #2584

Open
5 tasks
githust66 opened this issue Dec 26, 2024 · 8 comments

Comments

@githust66

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

The ROCm environment has no CUDA libraries installed. SGLang 0.4.1 nevertheless tries to load libcudart.so.12 and fails with the error below, while 0.4.0 and earlier versions do not.

error info:
ImportError: [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
2024-12-26 10:14:09,664 xinference.api.restful_api 4247 ERROR [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1041, in launch_builtin_model
await _launch_model()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1005, in _launch_model
await _launch_one_model(rep_model_uid)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 984, in _launch_one_model
await worker_ref.launch_builtin_model(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/utils.py", line 90, in wrapped
ret = await func(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/worker.py", line 897, in launch_builtin_model
await model_ref.load()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/model.py", line 414, in load
self._model.load()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/model/llm/sglang/core.py", line 135, in load
self._engine = sgl.Runtime(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/api.py", line 39, in Runtime
from sglang.srt.server import Runtime
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/server.py", line 47, in <module>
from sglang.srt.managers.data_parallel_controller import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/data_parallel_controller.py", line 25, in <module>
from sglang.srt.managers.io_struct import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/io_struct.py", line 24, in <module>
from sglang.srt.managers.schedule_batch import BaseFinishReason
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/schedule_batch.py", line 40, in <module>
from sglang.srt.configs.model_config import ModelConfig
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/configs/model_config.py", line 24, in <module>
from sglang.srt.layers.quantization import QUANTIZATION_METHODS
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/quantization/__init__.py", line 25, in <module>
from sglang.srt.layers.quantization.fp8 import Fp8Config
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/quantization/fp8.py", line 31, in <module>
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import padding_size
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/__init__.py", line 4, in <module>
import sglang.srt.layers.moe.fused_moe_triton.fused_moe # noqa
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 14, in <module>
from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sgl_kernel/__init__.py", line 1, in <module>
from .ops import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sgl_kernel/ops/__init__.py", line 1, in <module>
from .custom_reduce_cuda import all_reduce as _all_reduce
ImportError: [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
2024-12-26 10:14:09,665 uvicorn.access 4247 INFO 127.0.0.1:47452 - "POST /v1/models HTTP/1.1" 500
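
The last frame of the traceback fails because the dynamic linker cannot open libcudart.so.12. A quick way to confirm which runtime libraries are actually loadable on a machine is sketched below. It uses only the standard library. The helper name `can_load_shared_library` is hypothetical, and `libamdhip64.so` is assumed to be the HIP runtime library name on a ROCm install.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic linker can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be found.
        return False
    return True


if __name__ == "__main__":
    # Library names are assumptions for illustration: the CUDA 12 runtime
    # named in the error, and the HIP runtime expected on ROCm.
    for lib in ("libcudart.so.12", "libamdhip64.so"):
        print(lib, "loadable:", can_load_shared_library(lib))
```

On a pure ROCm machine one would expect the first check to report that the library is not loadable, which matches the ImportError above.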

Reproduction

qwen2.5-instruct-7B

Environment

(xinf) root@DESKTOP-ESRGKIB:~# python -m sglang.check_env
2024-12-26 10:26:55.241465: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2024-12-26 10:26:56.971386: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-26 10:26:57.864187: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: DNN
WARNING 12-26 10:27:06 rocm.py:31] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
ROCM available: True
GPU 0: AMD Radeon RX 7900 XT
GPU 0 Compute Capability: 11.0
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version:
PyTorch: 2.4.0+rocm6.3.0
sglang: 0.4.1
flashinfer: Module Not Found
triton: 3.0.0+rocm6.3.0_75cc27c26a
transformers: 4.46.2
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.4
hf_transfer: 0.1.8
huggingface_hub: 0.26.5
interegular: 0.3.3
modelscope: 1.19.2
orjson: 3.10.11
packaging: 24.1
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.12
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.6.dev44+gc2d1b075.d20241221
openai: 1.54.1
anthropic: 0.39.0
decord: 0.6.0
AMD Topology:

Hypervisor vendor: Microsoft
ulimit soft: 1024

@zhyncs
Member

zhyncs commented Dec 26, 2024

Hi @githust66 Could you help verify this #2590

@githust66
Author

Hi @githust66 Could you help verify this #2590

OK. Do I need to pull the latest code and build from source?

@zhyncs
Member

zhyncs commented Dec 26, 2024

Nope. You only need to change the Python code.
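
For readers wondering what a Python-only change could look like in general: one common pattern is to guard the import of an optional compiled extension so that a missing shared library does not abort the import of the whole package. The sketch below illustrates that pattern only. It is not the content of #2590 or #2601, which are not shown in this thread, and `optional_import` is a hypothetical helper.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr, or None if the module cannot be imported.

    A missing shared library such as libcudart.so.12 surfaces as an
    ImportError, so it is caught here together with a missing package.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Hypothetical usage mirroring the failing import in the traceback.
sgl_moe_align_block_size = optional_import("sgl_kernel", "moe_align_block_size")
if sgl_moe_align_block_size is None:
    print("sgl_kernel is unavailable; a fallback code path would be needed here")
```

The trade-off is that callers must handle the `None` case explicitly, for example by selecting a non-CUDA implementation on ROCm.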

@githust66
Author

Nope. You only need to change the Python code.

OK, I'll give it a try.

@githust66
Author

Nope. You only need to change the Python code.

I modified the following two files, but I still get the same error. The first file cannot be found in my environment.
[screenshot: the two modified files]

@zhyncs
Member

zhyncs commented Dec 26, 2024

python3 -c "import sgl_kernel; print(sgl_kernel.__path__)"
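
If `import sgl_kernel` itself raises the ImportError from the traceback, the one-liner above cannot print anything. A minimal sketch for locating the installed package without executing it, using only the standard library, follows. The function name `package_search_paths` is hypothetical.

```python
import importlib.util
from typing import List


def package_search_paths(name: str) -> List[str]:
    """Return the directories of a top-level package without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.submodule_search_locations is None:
        # Not installed, or a plain module rather than a package.
        return []
    return list(spec.submodule_search_locations)


if __name__ == "__main__":
    print(package_search_paths("sgl_kernel"))
```

For a top-level name, `find_spec` only consults the import finders and does not run the package's `__init__.py`, so the CUDA-dependent import is never triggered.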

@githust66
Author

githust66 commented Dec 26, 2024

python3 -c "import sgl_kernel; print(sgl_kernel.__path__)"

[two screenshots]

These two __init__.py files have been changed.

@zhyncs
Member

zhyncs commented Dec 26, 2024

fixed with #2601
