The adapter_model.bin saved during LoRA training is very small, only 443 bytes #711
Comments
Found a workaround: the transformers code needs to be modified to set load_module_strict=False. One more question: after training with ZeRO-3 for a while, I wanted to switch to DeepSpeed ZeRO-2 for LoRA training, but resume_from_checkpoint fails when reading the model saved under global_stepxxxx. Are ZeRO-2 and ZeRO-3 checkpoints incompatible?
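For reference, a sketch of the edit described above, based on the deepspeed_load_checkpoint frame shown in the traceback at the end of this thread (transformers 4.31.0.dev0); the exact file location may differ in other versions:

```python
# Inside deepspeed_load_checkpoint() in transformers/deepspeed.py (the frame
# visible in the traceback below). Passing load_module_strict=False asks
# DeepSpeed to skip the strict key check when restoring the module weights.
load_path, _ = deepspeed_engine.load_checkpoint(
    checkpoint_path,
    load_module_strict=False,   # added: tolerate the missing/renamed PEFT keys
    load_optimizer_states=True,
    load_lr_scheduler_states=True,
)
```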
I hit the same problem training 33B with ZeRO-3: the adapter_model.bin in the checkpoint is only 67.31 MB, and resuming from the checkpoint also raises the "Missing key(s) in state_dict:" error.
You probably cannot continue training with ZeRO-2 directly on top of a ZeRO-3 checkpoint.
ZeRO-3 and ZeRO-2 are two different strategies, so in principle you cannot resume across them; you can continue training from the merged model instead. As for the size of the model saved under ZeRO-3, we have not tried ZeRO-3 and cannot offer useful advice.
Did you also use ZeRO-2 for 33B training? Could you share the hardware setup you used for 33B?
Thanks for the reply. I had assumed that resuming only restored the model and was independent of the ZeRO-2/ZeRO-3 strategy. After checking the docs: ZeRO-2 partitions and saves optimizer states and gradients, while ZeRO-3 partitions and saves optimizer states, gradients, and parameters. The different partitioning granularities are why they cannot resume from each other. Using the model trained with ZeRO-3 as the initial model, without resuming, training does start under ZeRO-2. Reporting back.
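A sketch of producing that merged model from a ZeRO-3 checkpoint directory with DeepSpeed's bundled conversion helpers (the path is a placeholder; the same logic also ships as a standalone zero_to_fp32.py script inside each checkpoint directory, and the signatures are those of the DeepSpeed releases current at the time of this thread):

```python
from deepspeed.utils.zero_to_fp32 import (
    convert_zero_checkpoint_to_fp32_state_dict,
    get_fp32_state_dict_from_zero_checkpoint,
)

ckpt_dir = "output/checkpoint-2640"  # placeholder: a Trainer checkpoint dir containing global_stepXXXX/

# Option A: write a consolidated fp32 checkpoint file that can serve as the
# initial model for a fresh (non-resumed) ZeRO-2 run.
convert_zero_checkpoint_to_fp32_state_dict(ckpt_dir, f"{ckpt_dir}/pytorch_model.bin")

# Option B: gather the full fp32 state_dict in memory and load it manually.
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
# model.load_state_dict(state_dict, strict=False)  # strict=False because of the PEFT wrapper's key names
```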
Where exactly is the load_module_strict parameter? I couldn't find it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration. |
This might be resolved by referring to huggingface/peft#286.
I couldn't find it either. I looked at huggingface/peft#286 and it isn't there either.
Just comment out this block of code in the run_clm_sft_with_peft.py script.
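For context, the block being referred to is most likely the state_dict override from huggingface/peft#286; below is a sketch, assuming run_clm_sft_with_peft.py follows that pattern (the exact lines in the script may differ). The override makes model.state_dict() return only the PEFT weights via get_peft_model_state_dict, which does not play well with ZeRO-3 keeping parameters partitioned at save time, hence the near-empty adapter_model.bin; commenting it out lets the Trainer's DeepSpeed-aware saving path handle the gathering.

```python
from peft import get_peft_model_state_dict

# The lines to comment out (sketch of the peft#286 pattern): this rebinds
# model.state_dict so that saving collects only the PEFT (LoRA + modules_to_save)
# weights instead of the full module state.
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))
```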
That solved the problem perfectly, but I don't understand what this code is actually for.
Closing the issue, since no updates were observed. Feel free to re-open if you need any further assistance.
Let me share my solution. My understanding is that this problem really comes down to how ZeRO-3 saves models. I also ended up with a 443-byte adapter_model.bin after LoRA training with ZeRO-3. I was using another project's training script, which does not manipulate state_dict directly, so commenting out the relevant code was not an option for me. I eventually came across the stage3_gather_16bit_weights_on_model_save parameter in the transformers model-saving code, and after searching I finally found the official documentation on saving models under ZeRO-3. As I understand it, there are two modes:
https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading
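A sketch of the first of those modes, enabling stage3_gather_16bit_weights_on_model_save in the DeepSpeed config so that a consolidated 16-bit model is gathered whenever the Trainer saves (the surrounding values are placeholders; the second mode is the offline zero_to_fp32.py conversion sketched earlier in the thread):

```python
# Fragment of a ZeRO-3 DeepSpeed config expressed as a Python dict, e.g. for
# TrainingArguments(deepspeed=ds_config). Only the gather flag is the point
# here; the other values are placeholders.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},  # placeholder offload settings
        "offload_param": {"device": "cpu"},
        # Gather full 16-bit weights on every model save so the saved files
        # contain real tensors instead of ZeRO-3 placeholders.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```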
The following items must be checked before submitting
Issue type
Model training and fine-tuning
Base model
LLaMA-7B
Operating system
Linux
Detailed description of the problem
Due to limited GPU memory, I pre-trained the 7B model with ZeRO-3 plus offload, setting modules_to_save="embed_tokens,lm_head". The generated adapter_model.bin is only 443 bytes, and loading the model to continue training raised the following error (a minimal sketch of the corresponding PEFT setup follows the error excerpt):
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict: "base_model.model.model.embed_tokens.original_module.weight", "base_model.model.model.embed_tokens.modules_to_save.default.weight",
"base_model.model.model.layers.0.self_attn.q_proj.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
"base_model.model.model.layers.0.self_attn.k_proj.weight",
...
Unexpected key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight",
"base_model.model.model.layers.0.self_attn.k_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_B.weight",
...
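For reference, a minimal sketch of the PEFT setup implied by the description and by the key names in the error above (the LoRA rank, alpha, and dropout are assumed placeholders; modules_to_save and the target projections come from the issue itself):

```python
from peft import LoraConfig, TaskType, get_peft_model

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # assumed placeholder
    lora_alpha=32,      # assumed placeholder
    lora_dropout=0.05,  # assumed placeholder
    # Mirrors the lora_A/lora_B key names that appear in the error:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    # Quoted from the description above:
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, peft_config)  # `model` is the base LLaMA-7B CausalLM
```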
Dependencies (must be provided for code-related issues)
functorch 1.13.1
lion-pytorch 0.0.8
open-clip-torch 2.16.0
peft 0.4.0.dev0
pytorch-lightning 1.7.7
sentence-transformers 2.2.2
torch 2.0.1
torch-fidelity 0.3.0
torchaudio 2.0.2
torchdiffeq 0.2.3
torchgeometry 0.1.2
torchmetrics 0.11.4
torchsde 0.2.5
torchtext 0.12.0
torchtyping 0.1.4
torchvision 0.15.2
transformers 4.31.0.dev0
Run logs or screenshots
[2023-07-03 20:09:33,040] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /media/zzg/GJ_disk01/pretrained_model/Chinese-LLaMA-Alpaca/LLAMA_JA/checkpoint-2640/global_step2640/zero_pp_rank_0_mp_rank_00_model_states.pt.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zzg/workspace/pycharm/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py:653 in │
│ <module>                                                                                           │
│ │
│ 650 │
│ 651 │
│ 652 if __name__ == "__main__": │
│ ❱ 653 │ main() │
│ 654 │
│ │
│ /home/zzg/workspace/pycharm/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py:621 in │
│ main │
│ │
│ 618 │ │ │ checkpoint = training_args.resume_from_checkpoint │
│ 619 │ │ elif last_checkpoint is not None: │
│ 620 │ │ │ checkpoint = last_checkpoint │
│ ❱ 621 │ │ train_result = trainer.train(resume_from_checkpoint=checkpoint) │
│ 622 │ │ │
│ 623 │ │ metrics = train_result.metrics │
│ 624 │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/transformers/trainer.py:1645 │
│ in train │
│ │
│ 1642 │ │ inner_training_loop = find_executable_batch_size( │
│ 1643 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1644 │ │ ) │
│ ❱ 1645 │ │ return inner_training_loop( │
│ 1646 │ │ │ args=args, │
│ 1647 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1648 │ │ │ trial=trial, │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/transformers/trainer.py:1776 │
│ in _inner_training_loop │
│ │
│ 1773 │ │ │
│ 1774 │ │ # deepspeed ckpt loading │
│ 1775 │ │ if resume_from_checkpoint is not None and self.is_deepspeed_enabled: │
│ ❱ 1776 │ │ │ deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint) │
│ 1777 │ │ │
│ 1778 │ │ # Check if saved optimizer or scheduler states exist │
│ 1779 │ │ self._load_optimizer_and_scheduler(resume_from_checkpoint) │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/transformers/deepspeed.py:38 │
│ 3 in deepspeed_load_checkpoint │
│ │
│ 380 │ if len(deepspeed_checkpoint_dirs) > 0: │
│ 381 │ │ logger.info(f"Attempting to resume from {checkpoint_path}") │
│ 382 │ │ # this magically updates self.optimizer and self.lr_scheduler │
│ ❱ 383 │ │ load_path, _ = deepspeed_engine.load_checkpoint( │
│ 384 │ │ │ checkpoint_path, load_optimizer_states=True, load_lr_scheduler_states=True │
│ 385 │ │ ) │
│ 386 │ │ if load_path is None: │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/deepspeed/runtime/engine.py: │
│ 2605 in load_checkpoint │
│ │
│ 2602 │ │ │ # Prepare for checkpoint load by ensuring all parameters are partitioned │
│ 2603 │ │ │ self.optimizer.checkpoint_event_prologue() │
│ 2604 │ │ │
│ ❱ 2605 │ │ load_path, client_states = self._load_checkpoint(load_dir, │
│ 2606 │ │ │ │ │ │ │ │ │ │ │ │ │ │ tag, │
│ 2607 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_module_strict=load_module │
│ 2608 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_optimizer_states=load_opti │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/deepspeed/runtime/engine.py: │
│ 2664 in _load_checkpoint │
│ │
│ 2661 │ │ │ │ │ │ │ │ │ │ │ │ num_experts=self.num_experts, │
│ 2662 │ │ │ │ │ │ │ │ │ │ │ │ checkpoint_engine=self.checkpoint_engine │
│ 2663 │ │ if not self.load_universal_checkpoint(): │
│ ❱ 2664 │ │ │ self.load_module_state_dict(checkpoint=checkpoint, │
│ 2665 │ │ │ │ │ │ │ │ │ │ strict=load_module_strict, │
│ 2666 │ │ │ │ │ │ │ │ │ │ custom_load_fn=custom_load_fn) │
│ 2667 │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/deepspeed/runtime/engine.py: │
│ 2468 in load_module_state_dict │
│ │
│ 2465 │ │ if custom_load_fn: │
│ 2466 │ │ │ custom_load_fn(src=module_state_dict, dst=self.module) │
│ 2467 │ │ else: │
│ ❱ 2468 │ │ │ self.module.load_state_dict( │
│ 2469 │ │ │ │ module_state_dict, # TODO │
│ 2470 │ │ │ │ strict=strict) │
│ 2471 │
│ │
│ /home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/torch/nn/modules/module.py:2 │
│ 041 in load_state_dict │
│ │
│ 2038 │ │ │ │ │ │ ', '.join('"{}"'.format(k) for k in missing_keys))) │
│ 2039 │ │ │
│ 2040 │ │ if len(error_msgs) > 0: │
│ ❱ 2041 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( │
│ 2042 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(error_msgs))) │
│ 2043 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 2044 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict: "base_model.model.model.embed_tokens.original_module.weight", "base_model.model.model.embed_tokens.modules_to_save.default.weight",
"base_model.model.model.layers.0.self_attn.q_proj.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
"base_model.model.model.layers.0.self_attn.k_proj.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight",
"base_model.model.model.layers.0.self_attn.v_proj.weight", "base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight", "base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight",
"base_model.model.model.layers.0.self_attn.o_proj.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight",
"base_model.model.model.layers.0.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.0.mlp.gate_proj.weight", "base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight",
"base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight", "base_model.model.model.layers.0.mlp.down_proj.weight", "base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight",
"base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight", "base_model.model.model.layers.0.mlp.up_proj.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight",
"base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight", "base_model.model.model.layers.0.input_layernorm.weight", "base_model.model.model.layers.0.post_attention_layernorm.weight",
"base_model.model.model.layers.1.self_attn.q_proj.weight", "base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight", "base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight",
"base_model.model.model.layers.1.self_attn.k_proj.weight", "base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight", "base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight",
"base_model.model.model.layers.1.self_attn.v_proj.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight",
"base_model.model.model.layers.1.self_attn.o_proj.weight", "base_model.model.model.layers.1.self_attn.o_proj.lora_A.default.weight", "base_model.model.model.layers.1.self_attn.o_proj.lora_B.default.weight",
"base_model.model.model.layers.1.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.1.mlp.gate_proj.weight", "base_model.model.model.layers.1.mlp.gate_proj.lora_A.default.weight",
"base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight", "base_model.model.model.layers.1.mlp.down_proj.weight", "base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight",
"base_model.model.model.layers.1.mlp.down_proj.lora_B.default.weight", "base_model.model.model.layers.1.mlp.up_proj.weight", "base_model.model.model.layers.1.mlp.up_proj.lora_A.default.weight",
"base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight", "base_model.model.model.layers.1.input_layernorm.weight", "base_model.model.model.layers.1.post_attention_layernorm.weight",
...
"base_model.model.model.layers.31.self_attn.q_proj.weight", "base_model.model.model.layers.31.self_attn.q_proj.lora_A.default.weight", "base_model.model.model.layers.31.self_attn.q_proj.lora_B.default.weight",
"base_model.model.model.layers.31.self_attn.k_proj.weight", "base_model.model.model.layers.31.self_attn.k_proj.lora_A.default.weight", "base_model.model.model.layers.31.self_attn.k_proj.lora_B.default.weight",
"base_model.model.model.layers.31.self_attn.v_proj.weight", "base_model.model.model.layers.31.self_attn.v_proj.lora_A.default.weight", "base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight",
"base_model.model.model.layers.31.self_attn.o_proj.weight", "base_model.model.model.layers.31.self_attn.o_proj.lora_A.default.weight", "base_model.model.model.layers.31.self_attn.o_proj.lora_B.default.weight",
"base_model.model.model.layers.31.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.31.mlp.gate_proj.weight", "base_model.model.model.layers.31.mlp.gate_proj.lora_A.default.weight",
"base_model.model.model.layers.31.mlp.gate_proj.lora_B.default.weight", "base_model.model.model.layers.31.mlp.down_proj.weight", "base_model.model.model.layers.31.mlp.down_proj.lora_A.default.weight",
"base_model.model.model.layers.31.mlp.down_proj.lora_B.default.weight", "base_model.model.model.layers.31.mlp.up_proj.weight", "base_model.model.model.layers.31.mlp.up_proj.lora_A.default.weight",
"base_model.model.model.layers.31.mlp.up_proj.lora_B.default.weight", "base_model.model.model.layers.31.input_layernorm.weight", "base_model.model.model.layers.31.post_attention_layernorm.weight",
"base_model.model.model.norm.weight", "base_model.model.lm_head.original_module.weight", "base_model.model.lm_head.modules_to_save.default.weight".
Unexpected key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight",
"base_model.model.model.layers.0.self_attn.k_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight",
"base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_B.weight",
"base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.down_proj.lora_A.weight",
"base_model.model.model.layers.0.mlp.down_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_B.weight",
"base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.k_proj.lora_A.weight",
"base_model.model.model.layers.1.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_B.weight",
"base_model.model.model.layers.1.self_attn.o_proj.lora_A.weight", "base_model.model.model.layers.1.self_attn.o_proj.lora_B.weight", "base_model.model.model.layers.1.mlp.gate_proj.lora_A.weight",
"base_model.model.model.layers.1.mlp.gate_proj.lora_B.weight", "base_model.model.model.layers.1.mlp.down_proj.lora_A.weight", "base_model.model.model.layers.1.mlp.down_proj.lora_B.weight",
"base_model.model.model.layers.1.mlp.up_proj.lora_A.weight", "base_model.model.model.layers.1.mlp.up_proj.lora_B.weight",
...
"base_model.model.model.layers.31.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.31.self_attn.q_proj.lora_B.weight", "base_model.model.model.layers.31.self_attn.k_proj.lora_A.weight",
"base_model.model.model.layers.31.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.31.self_attn.v_proj.lora_A.weight", "base_model.model.model.layers.31.self_attn.v_proj.lora_B.weight",
"base_model.model.model.layers.31.self_attn.o_proj.lora_A.weight", "base_model.model.model.layers.31.self_attn.o_proj.lora_B.weight", "base_model.model.model.layers.31.mlp.gate_proj.lora_A.weight",
"base_model.model.model.layers.31.mlp.gate_proj.lora_B.weight", "base_model.model.model.layers.31.mlp.down_proj.lora_A.weight", "base_model.model.model.layers.31.mlp.down_proj.lora_B.weight",
"base_model.model.model.layers.31.mlp.up_proj.lora_A.weight", "base_model.model.model.layers.31.mlp.up_proj.lora_B.weight", "base_model.model.lm_head.weight".