
Error when pretraining ChatGLM on a single machine with multiple GPUs #4

Closed
3 tasks
zzzhaoguziji opened this issue Jun 9, 2023 · 8 comments
Labels
question Further information is requested

Comments

@zzzhaoguziji

Describe the Question

Please provide a clear and concise description of what the question is.
Training works on a single GPU, but single-machine multi-GPU training fails.
The training command is:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 pretraining.py \
    --model_type chatglm \
    --model_name_or_path ../chatglm \
    --train_file_dir ../data/pretrain \
    --validation_file_dir ../data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --fp16 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed deepspeed_config.json

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
    (Screenshot attached: 微信截图_20230609104145)
zzzhaoguziji added the question (Further information is requested) label on Jun 9, 2023
@shibing624
Owner

The arguments need to be: CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 pretraining.py. In torchrun mode, every GPU loads the full set of model parameters and training is data-parallel; if GPU memory is insufficient, you can enable cpu_offload.
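
For reference, CPU offload is normally turned on inside the DeepSpeed config that the command above already passes via --deepspeed deepspeed_config.json, not as a separate command-line flag. A minimal sketch, assuming ZeRO stage 2 and the HF Trainer's "auto" placeholders (values are not taken from this repo's config):

import json

# Minimal DeepSpeed config with the optimizer state offloaded to CPU RAM.
# Stage, buffer flags and "auto" placeholders are assumptions -- tune for your setup.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                      # shard optimizer state and gradients
        "offload_optimizer": {           # this is the "cpu_offload" part
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True
    }
}

# Write it next to pretraining.py, then launch as before with
#   --deepspeed deepspeed_config.json
with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)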

@shibing624
Owner

Alternatively: CUDA_VISIBLE_DEVICES=0,1 python pretraining.py. With device_map="auto", the model is automatically split across multiple GPUs when it is loaded.
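
For reference, a minimal sketch of the device_map="auto" loading path (the model path is taken from the command above; the use of AutoModel and torch_dtype are assumptions, and ChatGLM needs trust_remote_code):

from transformers import AutoModel, AutoTokenizer

model_path = "../chatglm"  # checkpoint path from the issue; adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,   # ChatGLM ships custom modeling code
    torch_dtype="auto",       # assumption: let transformers pick the checkpoint dtype
    device_map="auto",        # split layers across the GPUs in CUDA_VISIBLE_DEVICES
)
print(model.hf_device_map)    # inspect which layers landed on which device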

@zzzhaoguziji
Author

Thanks a lot, I'll give it another try.

@boxter007

Alternatively: CUDA_VISIBLE_DEVICES=0,1 python pretraining.py. With device_map="auto", the model is automatically split across multiple GPUs when it is loaded.

I tried that with glm6b2 and it still doesn't work; it reports the same error. glm6b works fine, though.

@Alfer-Feng

The glm and glm2 models have different parameters, so the conversion code needs to be adapted. I'm working on this now as well; I'll comment again once I have results. Watching this thread.

@archerbj

The arguments need to be: CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 pretraining.py. In torchrun mode, every GPU loads the full set of model parameters and training is data-parallel; if GPU memory is insufficient, you can enable cpu_offload.

How do I enable cpu_offload?

@chloefresh

Was this solved in the end? Could you share how you fixed it? @zzzhaoguziji @boxter007

@chloefresh

The glm and glm2 models have different parameters, so the conversion code needs to be adapted. I'm working on this now as well; I'll comment again once I have results. Watching this thread.

Any results yet?
