[Unified checkpoint] update optimizer async save signal
DesmonDay committed Aug 21, 2024
1 parent d505a97 commit 82f62c4
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion paddlenlp/trainer/trainer.py
@@ -2305,7 +2305,12 @@ def _save_checkpoint(self, model, metrics=None):
                             self._save_ckpt_func(state_dict, save_path)
                             with open(saved_signal_path, mode="w+") as f:
                                 f.write("1")
-
+            else:
+                if self.args.unified_checkpoint and "async_save" in self.args.unified_checkpoint_config:
+                    global_rank = paddle.distributed.get_rank() if paddle.distributed.get_world_size() > 1 else -1
+                    paddle.save(global_rank, os.path.join(output_dir, f".optimizer_weight.done.{global_rank}"))
+                    if "skip_save_model_weight" not in self.args.unified_checkpoint_config:
+                        paddle.save(global_rank, os.path.join(output_dir, f".master_weight.done.{global_rank}"))
[Codecov / codecov/patch, paddlenlp/trainer/trainer.py#L2309-L2313: added lines #L2309 - #L2313 were not covered by tests]
 
         if self.args.should_save or self.args.use_expert_parallel:
             if not self.args.use_hybrid_parallel:
                 logger.info("Saving optimizer files.")
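
Note on the change: the added `else` branch makes ranks that skip the synchronous optimizer save still emit the async-save completion signals, by writing a per-rank `.optimizer_weight.done.{global_rank}` sentinel file (and, unless `skip_save_model_weight` is configured, a `.master_weight.done.{global_rank}` file) into the checkpoint directory. For illustration only, below is a minimal sketch of how a coordinator might consume these sentinels; the function `wait_for_done_signals` and its parameters are hypothetical and not part of PaddleNLP.

    # Hypothetical sketch, not part of this commit: block until every rank
    # has written its ".optimizer_weight.done.{rank}" sentinel file, so the
    # async optimizer save can be treated as complete.
    import os
    import time

    def wait_for_done_signals(output_dir, world_size, prefix=".optimizer_weight.done",
                              timeout=600.0, poll=0.5):
        """Wait until all ranks' done-files exist, or raise TimeoutError."""
        expected = [os.path.join(output_dir, f"{prefix}.{rank}") for rank in range(world_size)]
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if all(os.path.exists(path) for path in expected):
                return
            time.sleep(poll)
        missing = [path for path in expected if not os.path.exists(path)]
        raise TimeoutError(f"async save incomplete, missing signals: {missing}")

Writing one sentinel per rank (with `global_rank == -1` in the single-process case) lets a checker like this identify which worker has not finished, rather than relying on a single shared flag file.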
