Cuda failure 'peer access is not supported between these two devices' #406
Comments
I ran into the same issue on a G5.12xlarge instance with WizardLM-30b-fp16.
I get this error even when trying the workaround mentioned by @mspronesti, interestingly on the same G5.12xlarge instance type.
@nivibilla I tried the above workaround in a notebook on a g5.12xlarge instance in SageMaker and it worked for me. I also tried reinstalling vllm from source after adding this step:
os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1'
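For anyone landing here, a minimal sketch of how that workaround is typically applied (assuming that setting the variable in the driver process before the engine starts is sufficient; the model and parallelism values are simply the ones from this issue):

```python
import os

# Workaround discussed in this thread: tell NCCL to ignore the lack of
# peer-to-peer support between the A10G GPUs. Set it before vLLM starts
# its workers.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

# Same setup as the reproduction below: 4-way tensor parallelism.
llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)
```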
Thanks for reporting the issue! This should be fixed by #397, which should be merged soon. You can retry with that fix.
Should be fixed by #397. Please re-open if you run into any new issues.
@zhuohan123 Sorry for the delayed response, I will test this out, thanks!
@mspronesti I tried adding the step to the source and doing the pip install; however, I am getting the following error: AttributeError: 'NoneType' object has no attribute 'fs'. Is there any other way to solve this issue?
I've tried installing from the Git repo with pip, and it works for me.
@nivibilla, so basically you cloned the Git repo, made the change, and used the pip install command to build it, correct? Could you share your existing versions and the worker.py file? I tried the same, but it doesn't work.
I didn't make any changes to the repo. Just do:
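The exact command was dropped from the quote above; presumably it was an install straight from the GitHub repository, something along the lines of:

```
pip install git+https://github.com/vllm-project/vllm.git
```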
This issue was fixed in #397, but the changes from that PR have not been released yet. However, if you install directly from the repository as shown above, you get the fix.
That looks like pyarrow is missing to me. |
Thanks @nivibilla and @mspronesti - it works perfectly now!
Usage stats collection is enabled. To disable this, run the following command:
ray disable-usage-stats
before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-08 23:11:34,236 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 23:11:35 llm_engine.py:60] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 07-08 23:11:35 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(Worker pid=4225) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=4225, ip=172.31.68.176, actor_id=5dc662848f950df8d330eb8a01000000, repr=<vllm.worker.worker.Worker object at 0x7f4e9ea814e0>)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
(Worker pid=4225) _init_distributed_environment(parallel_config, rank,
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
(Worker pid=4225) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=4225) return func(*args, **kwargs)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=4225) work = default_pg.allreduce([tensor], opts)
(Worker pid=4225) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
(Worker pid=4225) ncclInternalError: Internal check failed.
(Worker pid=4225) Last error:
(Worker pid=4225) Cuda failure 'peer access is not supported between these two devices'
Code:
from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)
Env:
A single EC2 G5.12xlarge instance with 4 A10G GPUs.
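For context on the error itself: the A10G GPUs in a G5.12xlarge are not connected over NVLink and do not report CUDA peer access to each other, which is what NCCL is complaining about. A quick way to confirm this on the instance, using standard PyTorch APIs (a diagnostic sketch, not part of the original report):

```python
import torch

# Check whether each pair of visible GPUs supports direct peer access.
# On a g5.12xlarge (4x A10G) every pair is expected to report False.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```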