The 7B model produces a 13 GB pytorch_model.bin after running run_pt.py #379
Comments
Please check what is under checkpoint-100.
I checked at the time; the file there was also 13 GB.
Are you using DeepSpeed ZeRO-3? Not sure whether that strategy is related to this.
Saving the full weights instead of only the LoRA adapter may be related to ZeRO-3.
Not sure whether this is caused by the peft version, but try saving with the following callback:

import os

from transformers import TrainerCallback, TrainerControl, TrainerState, TrainingArguments
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        # Save only the LoRA adapter into <checkpoint-dir>/adapter_model ...
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)
        # ... and remove the full-weight pytorch_model.bin written by the Trainer.
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control
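A minimal sketch of wiring this callback into the training script; the `trainer` object below is assumed to be the Trainer built in run_pt.py, not something defined in this issue:

# Register the callback so every checkpoint keeps only the adapter weights.
trainer.add_callback(SavePeftModelCallback())
trainer.train()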
I ran into the same problem. How can I extract the LoRA weights from the checkpoint?
Replace trainer.save_model() with model.save_pretrained("tmp_output"). Another approach is to modify the callback that saves the model, e.g. the SavePeftModelCallback shown above.
Thanks, I'll give it a try. My checkpoint was generated by the trainer, so I probably need to reload it with peft and then call save_pretrained.
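A rough sketch of that recovery path, assuming the 13 GB pytorch_model.bin holds the full PeftModel state dict; every path and LoraConfig value here is a placeholder and must match what run_pt.py actually used:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Rebuild the same base model + LoRA structure that was used during training.
base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)

# Load the full checkpoint and copy over the weights whose keys match.
state_dict = torch.load("output/checkpoint-100/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)

# Write only adapter_config.json + adapter_model.bin (a small fraction of 13 GB).
model.save_pretrained("lora_adapter_only")

Note that strict=False silently skips keys that do not line up (for example if the checkpoint was consolidated by DeepSpeed ZeRO-3 into a different layout), so it is worth verifying that the saved adapter weights actually differ from a freshly initialized one.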
Is something misconfigured on my side?
run_pt.sh
ds_zero3_offload.json