HCCL error when fine-tuning on NPUs across two nodes with multiple cards #6646

Open
1 task done
AlbertWang001 opened this issue Jan 15, 2025 · 2 comments
Labels
bug (Something isn't working); npu (This problem is related to NPU devices); pending (This problem is yet to be addressed)

Comments

AlbertWang001 commented Jan 15, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

The image and container were built with docker-npu. Fine-tuning Qwen2.5-7B on two nodes with 16 NPUs in total consistently fails with an HCCL error. The command run inside the container is:
torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=10.0.1.30 src/train.py \
--stage sft \
--model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
--do_train \
--dataset alpaca_zh_demo \
--template qwen \
--finetuning_type lora \
--output_dir saves/qwen-7b/lora/sft \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--save_steps 500 \
--learning_rate 1e-4 \
--num_train_epochs 100.0 \
--plot_loss
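
With torchrun's static rendezvous (`--master_addr`/`--master_port`), every node runs the same command and only `--node_rank` differs: 0 on the master node, 1 on the second node. The report shows only the rank-0 command, so the following is a sketch that assumes the second node uses the same paths and arguments. `NODE_RANK` and `RUN` are hypothetical helper variables, not part of the original report.

```shell
#!/bin/sh
# Sketch: run this same script on both machines, changing only NODE_RANK.
# NODE_RANK and RUN are hypothetical helper variables (not from the issue).
NODE_RANK="${NODE_RANK:-0}"   # 0 on the master node (10.0.1.30), 1 on the other node

# Build the argument list once so both nodes launch with identical settings.
set -- torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --node_rank="$NODE_RANK" \
  --master_addr=10.0.1.30 --master_port=6001 \
  src/train.py \
  --stage sft \
  --model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
  --do_train \
  --dataset alpaca_zh_demo \
  --template qwen \
  --finetuning_type lora \
  --output_dir saves/qwen-7b/lora/sft \
  --overwrite_cache \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --save_steps 500 \
  --learning_rate 1e-4 \
  --num_train_epochs 100.0 \
  --plot_loss

# Print the command for inspection; execute it only when RUN=1 is set.
echo "$@"
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

On the second node this would be invoked as `NODE_RANK=1 RUN=1 sh launch.sh` (the script name is illustrative). Both nodes must be able to reach 10.0.1.30 on port 6001.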

Reproduction

The detailed error log is shown in the screenshots below:
image
image

Others

No response

@AlbertWang001 AlbertWang001 added bug Something isn't working pending This problem is yet to be addressed labels Jan 15, 2025
@github-actions github-actions bot added the npu This problem is related to NPU devices label Jan 15, 2025
@codemayq
Collaborator

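The error does not state the cause explicitly. I suggest first checking whether a single-node run works, whether it works without the container, and running a simple DeepSpeed program to see whether there are still problems with the basic environment.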
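
The single-node check suggested above could look like the following sketch. It reuses the training arguments from the report and replaces only the launcher flags: `--standalone` tells torchrun to set up a local rendezvous, so no `--master_addr` or `--node_rank` is needed. `RUN` is a hypothetical helper variable, and the paths are assumed to be the same as in the report.

```shell
#!/bin/sh
# Sketch: single-node sanity check before attempting the two-node run.
# RUN is a hypothetical helper variable (not from the issue).

# --standalone sets up a local rendezvous, so all 8 processes stay on this
# machine and no cross-node communication is involved.
set -- torchrun --standalone --nproc_per_node=8 \
  src/train.py \
  --stage sft \
  --model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
  --do_train \
  --dataset alpaca_zh_demo \
  --template qwen \
  --finetuning_type lora \
  --output_dir saves/qwen-7b/lora/sft \
  --overwrite_cache \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --save_steps 500 \
  --learning_rate 1e-4 \
  --num_train_epochs 100.0 \
  --plot_loss

# Print the command for inspection; execute it only when RUN=1 is set.
echo "$@"
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

If this run succeeds on each machine separately, the failure is more likely related to cross-node communication or container networking than to the training setup itself.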

@chuangzhidan

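> The error does not state the cause explicitly. I suggest first checking whether a single-node run works, whether it works without the container, and running a simple DeepSpeed program to see whether there are still problems with the basic environment.

Why is there no LoRA rank parameter in the YAML file for LoRA fine-tuning? That is surprising.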
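
The command in the report does not set a LoRA rank either, so the run would rely on whatever default the training script applies. The sketch below assumes the project is LLaMA-Factory (the thread does not name it), where the rank is exposed as the `lora_rank` argument. The value 8 and the `LORA_RANK` variable are illustrative only.

```shell
#!/bin/sh
# Sketch: set the LoRA rank explicitly instead of relying on the default.
# Assumption: the training script accepts --lora_rank; 8 is an illustrative value.
LORA_RANK="${LORA_RANK:-8}"

# These flags would be appended to the training arguments from the report.
set -- --finetuning_type lora --lora_rank "$LORA_RANK"
echo "$@"
```

In a YAML config the same setting would presumably appear as a `lora_rank` key alongside `finetuning_type`.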

3 participants