Support Multi-GPU inference on CUDA devices #101

Merged (6 commits) on Jan 14, 2025

Conversation

guoqingbao (Collaborator)

Run multi-GPU inference with the NCCL feature enabled:

cargo run --release --features cuda,nccl -- --port 2000 --device-ids "0,1" --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --temperature 0. --penalty 1.0

If you encounter problems under multi-GPU settings, you may set one or more of the following (a combined example follows the list):

export NCCL_P2P_LEVEL=LOC # use local devices (multiple cards within one server, PCIe, etc.)
export NCCL_P2P_DISABLE=1 # disable P2P, since it can cause illegal memory access in certain environments
export NCCL_IB_DISABLE=1 # disable InfiniBand (optional)
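
For example, to disable P2P and InfiniBand for a single run, combine the workarounds above with the launch command (which flags you actually need depends on your hardware topology; these are illustrative):

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
cargo run --release --features cuda,nccl -- --port 2000 --device-ids "0,1" --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --temperature 0. --penalty 1.0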

Note: quantized models are not yet supported in the multi-GPU setting.
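
Once the server is up, it can be queried over HTTP; a minimal sketch, assuming the launched server exposes an OpenAI-compatible /v1/chat/completions endpoint on the --port passed above (the endpoint path and model name are assumptions, not confirmed by this PR):

# assumed OpenAI-compatible endpoint; adjust the path and model name to your server
curl http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'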

@guoqingbao merged commit 8fc4c00 into EricLBuehler:master on Jan 14, 2025
6 checks passed