Cuda failure 'peer access is not supported between these two devices' #406

Closed
colorzhang opened this issue Jul 8, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@colorzhang

Usage stats collection is enabled. To disable this, run the following command: ray disable-usage-stats before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-08 23:11:34,236 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 23:11:35 llm_engine.py:60] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 07-08 23:11:35 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(Worker pid=4225) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.init() (pid=4225, ip=172.31.68.176, actor_id=5dc662848f950df8d330eb8a01000000, repr=<vllm.worker.worker.Worker object at 0x7f4e9ea814e0>)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in init
(Worker pid=4225) _init_distributed_environment(parallel_config, rank,
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
(Worker pid=4225) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=4225) return func(*args, **kwargs)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=4225) work = default_pg.allreduce([tensor], opts)
(Worker pid=4225) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
(Worker pid=4225) ncclInternalError: Internal check failed.
(Worker pid=4225) Last error:
(Worker pid=4225) Cuda failure 'peer access is not supported between these two devices'

Code:
from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)

Env:
A single EC2 g5.12xlarge instance with 4 NVIDIA A10G GPUs
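
For context, the error string itself says that peer-to-peer (P2P) memory access is not available between these GPUs, so NCCL fails while setting up its communicators. A quick standalone check of what the instance reports (just a sketch using PyTorch's CUDA API, not part of vLLM):

import torch

# Print the pairwise peer-access matrix for all visible GPUs.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'supported' if ok else 'NOT supported'}")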

@ai8hyf

ai8hyf commented Jul 9, 2023

I ran into the same issue on a g5.12xlarge instance with WizardLM-30b-fp16.

@mspronesti
Contributor

mspronesti commented Jul 9, 2023

export NCCL_IGNORE_DISABLED_P2P=1 did the trick for me.
I wanted to open a PR to set this by default, as I believe it would fix this issue without affecting distributed inference when P2P is enabled. Any thoughts?
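
For reference, the same workaround can also be applied from Python before vLLM is imported; a minimal sketch, assuming the variable is read when NCCL creates its communicators (it may not always propagate to the Ray worker processes, which could explain the report below):

import os

# Tell NCCL not to fail when GPU peer-to-peer access is disabled on this host;
# must be set before the first collective creates the NCCL communicators.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)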

@nivibilla

nivibilla commented Jul 10, 2023

I get this error even after trying the workaround mentioned by @mspronesti, interestingly on the same g5.12xlarge instance type.

@mspronesti
Contributor

mspronesti commented Jul 10, 2023

@nivibilla I tried the above workaround in a SageMaker notebook on a g5.12xlarge instance and it worked for me. I also tried reinstalling vllm from source after adding os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1' to the codebase just before this line, and it worked again. I guess you tried on an EC2 VM. Can you try the second approach?

Steps:

  • Clone the project.
  • Add the following just before this line (and import os if it is not already imported):
    os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1'
  • Run pip install .
  • Run your distributed inference again.

@zhuohan123
Member

Thanks for reporting the issue! This should be fixed by #397. It should be merged soon. You can retry with this fix.

zhuohan123 added the bug (Something isn't working) label on Jul 18, 2023
@zhuohan123
Member

Should be fixed by #397. Please re-open if you run into any new issues.

@nivibilla

@zhuohan123 Sorry for the delayed response, I will test this out thanks!

@Tarun3679

@mspronesti I tried adding the step in the source and doing the pip install; however, I am getting the following error: AttributeError: 'NoneType' object has no attribute 'fs'. Is there another way to solve this issue?

@nivibilla

I've tried installing from git with pip, and it works for me.

@Tarun3679

Tarun3679 commented Jul 28, 2023

@nivibilla, so basically you cloned the git repo, made the change, and used the pip install command to build it, correct? Could you share your existing versions and the worker.py file? I tried the same, but it doesn't work.

@nivibilla

I didn't make any changes to the repo

Just do

pip install git+https://github.com/vllm-project/vllm.git

@mspronesti
Contributor

mspronesti commented Jul 28, 2023

This issue was fixed in #397, but the changes from that PR have not been released yet. However, if you install vllm as suggested by @nivibilla, you will get the most up-to-date version of the code :)
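
A quick sanity check (not from the thread) to confirm which build ended up installed after the git install:

import importlib.metadata

# Version string of the currently installed vllm package; a git/source
# install will generally differ from the last published PyPI release.
print(importlib.metadata.version("vllm"))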

@Tarun3679

[screenshot of an import error]
I am actually getting this error when I try to import the module. Any idea what is going wrong?

@nivibilla

That looks like pyarrow is missing to me.

@Tarun3679

Thanks @nivibilla and @mspronesti - It works perfectly now!
