
[Bug]: vllm-0.5.3.post1 serving Qwen2-72b-instruct-awq works fine at first, but throws errors under high concurrency #6734

Open
xinzaifeixiang1992 opened this issue Jul 24, 2024 · 6 comments
Labels
bug (Something isn't working), unstale

Comments

@xinzaifeixiang1992

Your current environment

cuda-12.2
torch-2.3.1
vllm-0.5.3.post1

🐛 Describe the bug

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

/data/anaconda3/envs/qwen/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
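As the log itself notes, CUDA kernel errors are reported asynchronously, so the frames above may not point at the real culprit. A minimal sketch of how to reproduce with synchronous kernel launches so the illegal memory access is attributed to the failing kernel; the launch command is whatever the deployment already uses, and the exported variables are standard CUDA/NCCL environment variables, not vLLM-specific:

# Expect a large slowdown; use this only while reproducing the crash.
export CUDA_LAUNCH_BLOCKING=1
# Optional: log NCCL communicator activity to help localize the failing rank.
export NCCL_DEBUG=INFO
# ...then start the server with the existing launch script.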

xinzaifeixiang1992 added the bug (Something isn't working) label on Jul 24, 2024
@xinzaifeixiang1992
Author

Machine configuration: L20 GPUs, 48 GB per card. The vllm launch script specifies --tensor-parallel 2 --quantization awq.
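For reference, the launch command described above was presumably along these lines; this is only a sketch, assuming the OpenAI-compatible server entrypoint and a placeholder model path (the canonical flag name is --tensor-parallel-size):

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen2-72B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq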

@keakon-pureglobal

This is usually caused by insufficient GPU memory. Cap the maximum memory, the context length, and the concurrency, load-test until it runs cleanly, then raise them gradually.
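In vLLM terms, those limits map onto engine flags; a hedged sketch with illustrative values only (not recommendations), using the same assumed entrypoint and placeholder model path as above:

# --gpu-memory-utilization: fraction of each GPU that vLLM may reserve
# --max-model-len: a shorter context means a smaller KV cache
# --max-num-seqs: cap on concurrently scheduled sequences
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen2-72B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --max-num-seqs 16

If the crash no longer reproduces at the lower limits, raise --max-num-seqs and --max-model-len step by step while watching memory during the load test, as suggested above.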

@xinzaifeixiang1992
Author

OK, thanks for the guidance.


github-actions bot commented Nov 1, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Nov 1, 2024
@LugerW-A

LugerW-A commented Nov 8, 2024

Hi, when deploying on L20 GPUs the GPU memory usage seems to keep fluctuating, and then the service crashes. Have you ever run into this?

github-actions bot added the unstale label and removed the stale label on Nov 10, 2024
@xinzaifeixiang1992
Author

Hi, when deploying on L20 GPUs the GPU memory usage seems to keep fluctuating, and then the service crashes. Have you ever run into this?

Yes, I have. On a single machine with 8×48 GB cards (only about 45 GB actually usable per card), we deploy one qwen2.5-72b-awq model per pair of cards, which gives four services. When running batch tests at a concurrency of 2, after roughly 10,000 requests the service inexplicably hangs: the process is still alive, but every request gets aborted.
It is still troubling us a lot.
