The error message does not state an explicit cause. I'd suggest first checking whether a single-machine run works, whether it works without the container, and then running a simple DeepSpeed program to see whether something in the base environment is still broken.
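As a concrete version of that environment check, a plain torch.distributed all-reduce over the hccl backend (launched with the same torchrun arguments on both nodes) can isolate HCCL problems before involving DeepSpeed or LLaMA-Factory. This is only a sketch, assuming torch_npu is installed inside the docker-npu container; the script name and output format are made up:

# hccl_check.py -- minimal sketch of a two-node HCCL sanity check (assumes torch_npu is installed)
import os
import torch
import torch_npu  # registers the NPU device type and the hccl backend
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each of the 8 processes on a node
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl")
    # every rank contributes its rank id; the reduced value must equal 0+1+...+(world_size-1)
    t = torch.tensor([float(dist.get_rank())]).npu()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this already hangs or raises the same HCCL error, the problem is in the base environment (driver/CANN versions, container networking, interface selection) rather than in the training code.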
Why is the LoRA rank parameter not visible in the LoRA fine-tuning yaml file? That is surprising.
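If it helps, the rank can usually be written into the yaml explicitly; a hypothetical snippet, assuming the framework reads a lora_rank field (the field names and default below are assumptions, not taken from this issue):

finetuning_type: lora
lora_rank: 8     # assumed field name; a framework default is used when it is omitted
lora_alpha: 16   # assumed field name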
Reminder
System Info
The image and container were built with docker-npu. When fine-tuning qwen2.5-7B across two machines with 16 NPUs, an HCCL error is raised every time. The command executed inside the container is:
torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.1.30 src/train.py \
    --stage sft \
    --model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
    --do_train \
    --dataset alpaca_zh_demo \
    --template qwen \
    --finetuning_type lora \
    --output_dir saves/qwen-7b/lora/sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 500 \
    --learning_rate 1e-4 \
    --num_train_epochs 100.0 \
    --plot_loss
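For the two-node run to start, a matching command also has to be launched on the second machine. A sketch of the node-1 invocation, assuming identical paths and network settings on both hosts (only --node_rank changes; --master_addr still points at the first node):

torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.1.30 src/train.py \
    --stage sft \
    (remaining arguments identical to the node-0 command above)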
Reproduction
The detailed error log is shown in the attached screenshot:
Others
No response