
[Bug]: vllm-0.5.3.post1 serving Qwen2-72b-instruct-awq works fine at first, but throws errors under high concurrency #6734

Open
xinzaifeixiang1992 opened this issue Jul 24, 2024 · 6 comments
Labels
bug (Something isn't working), unstale

Comments

@xinzaifeixiang1992

Your current environment

cuda-12.2
torch-2.3.1
vllm-0.5.3.post1

🐛 Describe the bug

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

/data/anaconda3/envs/qwen/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
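As the log itself notes, CUDA kernel errors are reported asynchronously, so the frames above may not point at the real culprit. A minimal sketch of how to reproduce with synchronous kernel launches so the illegal memory access is attributed to the failing kernel; the launch command is whatever the deployment already uses, and the exported variables are standard CUDA/NCCL environment variables, not vLLM-specific:

# Expect a large slowdown; use this only while reproducing the crash.
export CUDA_LAUNCH_BLOCKING=1
# Optional: log NCCL communicator activity to help localize the failing rank.
export NCCL_DEBUG=INFO
# ...then start the server with the existing launch script.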

xinzaifeixiang1992 added the bug (Something isn't working) label on Jul 24, 2024
@xinzaifeixiang1992
Author

Machine configuration: L20 GPUs, 48 GB per card. The vllm launch script specifies --tensor-parallel 2 --quantization awq.
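For reference, the launch command described above was presumably along these lines; this is only a sketch, assuming the OpenAI-compatible server entrypoint and a placeholder model path (the canonical flag name is --tensor-parallel-size):

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen2-72B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq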

@keakon-pureglobal

This is usually caused by insufficient GPU memory. Cap the maximum memory, the context length, and the concurrency, load-test until it runs cleanly, then raise them gradually.
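In vLLM terms, those limits map onto engine flags; a hedged sketch with illustrative values only (not recommendations), using the same assumed entrypoint and placeholder model path as above:

# --gpu-memory-utilization: fraction of each GPU that vLLM may reserve
# --max-model-len: a shorter context means a smaller KV cache
# --max-num-seqs: cap on concurrently scheduled sequences
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen2-72B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --max-num-seqs 16

If the crash no longer reproduces at the lower limits, raise --max-num-seqs and --max-model-len step by step while watching memory during the load test, as suggested above.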

@xinzaifeixiang1992
Author

OK, thanks for the guidance.


github-actions bot commented Nov 1, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Nov 1, 2024
@LugerW-A

LugerW-A commented Nov 8, 2024

Hi, when deploying on L20 GPUs the GPU memory usage seems to keep fluctuating, and then the service crashes. Have you ever run into this?

github-actions bot added the unstale label and removed the stale label on Nov 10, 2024
@xinzaifeixiang1992
Author

Hi, when deploying on L20 GPUs the GPU memory usage seems to keep fluctuating, and then the service crashes. Have you ever run into this?

Yes, I have. On a single machine with 8×48 GB cards (only about 45 GB actually usable per card), we deploy one qwen2.5-72b-awq model per pair of cards, which gives four services. When running batch tests at a concurrency of 2, after roughly 10,000 requests the service inexplicably hangs: the process is still alive, but every request gets aborted.
It is still troubling us a lot.
