RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! #77

wotulong · 2025-02-03T14:36:53Z

env: A40 * 4
running script: accelerate launch --num_processes 3 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/run_r1_grpo.py --config receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml
config:

# GRPO specific parameters

beta: 0.001 # 0.04 as in the deepseek math paper 0.001 from https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05
max_prompt_length: 256
max_completion_length: 1024
num_generations: 8
use_vllm: true
vllm_device: "cuda:3"
vllm_gpu_memory_utilization: 0.5

error msg:
...
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2551, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
[rank0]:[W203 14:26:02.255236202 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

a101269 · 2025-02-10T10:03:18Z

same error

philschmid · 2025-02-12T13:08:58Z

Can you try removing the vllm_device: "cuda:3" line from the config completely?
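
For anyone who wants to test this without hand-editing the recipe, here is a minimal sketch of the change, assuming the config is a flat YAML file with one `key: value` pair per line. `drop_yaml_key` is a hypothetical helper written for this illustration; it is not part of the repo, TRL, or accelerate, and whether removing the key fixes the device mismatch is not confirmed in this thread.

```python
def drop_yaml_key(text: str, key: str) -> str:
    """Return `text` without any line that sets `key`.

    Hypothetical helper for illustration only. It does plain line
    filtering, so it only suits flat `key: value` configs like the
    excerpt in this issue, not nested or multi-line YAML values.
    """
    kept = []
    for line in text.splitlines():
        # Skip lines of the form "key: ..." (leading spaces ignored).
        if line.lstrip().startswith(f"{key}:"):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# The vLLM-related lines from the config posted above.
config = """\
use_vllm: true
vllm_device: "cuda:3"
vllm_gpu_memory_utilization: 0.5
"""

cleaned = drop_yaml_key(config, "vllm_device")
print(cleaned)
```

With the key gone, the trainer falls back to whatever default device selection it implements, rather than the explicit cuda:3 pin.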
