HCCL error when fine-tuning on NPU with two nodes and multiple cards #6646

AlbertWang001 · 2025-01-15T01:22:48Z

Reminder

I have read the above rules and searched the existing issues.

System Info

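The image and container were built using the docker-npu setup. Fine-tuning qwen2.5-7B across two machines with 16 cards in total keeps failing with an HCCL error. The command executed inside the container is: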
torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=10.0.1.30 src/train.py \
--stage sft \
--model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
--do_train \
--dataset alpaca_zh_demo \
--template qwen \
--finetuning_type lora \
--output_dir saves/qwen-7b/lora/sft \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--save_steps 500 \
--learning_rate 1e-4 \
--num_train_epochs 100.0 \
--plot_loss
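
The command above is the one for node_rank=0. With torchrun, every node runs the same command with the same master address and port, and only --node_rank differs. The following is a minimal sketch of generating both launch lines; the helper name `torchrun_command` and its defaults are hypothetical and only mirror the values in this post, and the training arguments are left to the caller.

```python
import shlex


def torchrun_command(
    node_rank,
    master_addr="10.0.1.30",   # value taken from the post
    master_port=6001,          # value taken from the post
    nnodes=2,
    nproc_per_node=8,
    script="src/train.py",
    script_args=(),
):
    """Build the torchrun launch line for one node (hypothetical helper)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    argv = [
        "torchrun",
        f"--master_port={master_port}",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        script,
        *script_args,
    ]
    # shlex.join quotes arguments so the line can be pasted into a shell.
    return shlex.join(argv)


if __name__ == "__main__":
    # Print one launch line per node; run each on its own machine.
    for rank in range(2):
        print(torchrun_command(rank, script_args=["--stage", "sft"]))
```

If the second machine is launched with --node_rank=0 as well, or with a different master address or port, the rendezvous cannot complete, so this is worth ruling out before looking at HCCL itself.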

Reproduction

The detailed error log is shown in the screenshot below:

Others

No response

codemayq · 2025-01-15T06:06:00Z

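The error does not state an explicit cause. I suggest first checking whether single-machine training works, whether it works without the container, and running a simple deepspeed program to see whether something is still wrong with the basic environment.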
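
One way to act on this suggestion is a small collective-communication smoke test that takes the training code out of the picture. The sketch below uses plain torch.distributed rather than deepspeed, and assumes that torch_npu is installed in the container and that it provides the "hccl" backend and the "npu" device; the file name and the `pick_backend` helper are hypothetical. Launch it with the same torchrun options as the training job, first on one machine and then on both.

```python
# allreduce_check.py (hypothetical file name)
import importlib.util
import os


def pick_backend():
    """Return "hccl" when torch_npu is installed, otherwise fall back to "gloo"."""
    return "hccl" if importlib.util.find_spec("torch_npu") else "gloo"


def main():
    import torch
    import torch.distributed as dist

    backend = pick_backend()
    if backend == "hccl":
        import torch_npu  # noqa: F401  (assumed to register the NPU device and HCCL backend)

        # torchrun exports LOCAL_RANK for each worker process.
        device = torch.device("npu", int(os.environ["LOCAL_RANK"]))
        torch.npu.set_device(device)
    else:
        device = torch.device("cpu")

    # Rank, world size and master address come from the torchrun environment.
    dist.init_process_group(backend=backend)
    x = torch.ones(1, device=device)
    dist.all_reduce(x)  # default reduction is SUM, so the result equals the world size
    print(f"rank {dist.get_rank()}: sum = {x.item()}, world size = {dist.get_world_size()}")
    dist.destroy_process_group()


# torchrun sets RANK; do nothing when the file is run without a launcher.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

If this passes on a single machine but hangs or fails across two, the problem is more likely in inter-node networking or the container setup than in the fine-tuning configuration.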

chuangzhidan · 2025-01-16T03:50:19Z

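The error does not state an explicit cause. I suggest first checking whether single-machine training works, whether it works without the container, and running a simple deepspeed program to see whether something is still wrong with the basic environment.

Why is there no LoRA rank parameter in the YAML file for LoRA fine-tuning? That is surprising.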

AlbertWang001 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels Jan 15, 2025

github-actions bot added the npu (This problem is related to NPU devices) label Jan 15, 2025
