Checkpoint connector bugfixes #10647

Open
wants to merge 1 commit into base: main from jstjohn/nemo_checkpoint_connector_fixes
Conversation

jstjohn (Collaborator) commented on Sep 26, 2024

What does this PR do?

Get the checkpoint connector working for BioNeMo (see https://github.com/NVIDIA/bionemo-fw-ea/pull/180).

Changelog

  • Use the new nemo2-standard /weights and /context subdirectory scheme so that checkpoint loaders work properly with checkpoints created by this method (see the layout sketch after this list).
  • Other changes necessary to allow checkpoint transformation outside of training resumption.
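
For reference, a minimal sketch of the /weights and /context layout described above (the helper function and its name are illustrative, not part of this PR):

from pathlib import Path


def make_nemo2_checkpoint_dirs(ckpt_dir: str) -> tuple[Path, Path]:
    """Hypothetical helper: lay out a checkpoint directory in the nemo2 scheme.

    Model weights live under <ckpt>/weights and the io_dump'd TrainerContext
    under <ckpt>/context, so standard checkpoint loaders can resolve both.
    """
    root = Path(ckpt_dir)
    weights_dir = root / "weights"
    context_dir = root / "context"
    weights_dir.mkdir(parents=True, exist_ok=True)
    context_dir.mkdir(parents=True, exist_ok=True)
    return weights_dir, context_dir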

jstjohn marked this pull request as draft on September 26, 2024 23:00
jstjohn self-assigned this on Sep 26, 2024
jstjohn marked this pull request as ready for review on September 26, 2024 23:31
jstjohn force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch 3 times, most recently from eb4bca5 to 6069101 on September 27, 2024 22:40
Signed-off-by: John St John <jstjohn@nvidia.com>
jstjohn force-pushed the jstjohn/nemo_checkpoint_connector_fixes branch from 6069101 to a4ad2d7 on September 27, 2024 22:42
@@ -170,16 +171,20 @@
trainer (pl.Trainer): The trainer with the strategy to save the model.
dump_io (bool): If True, the IO configuration will be saved to the output path.
"""
# Import here to avoid circular import
from nemo.lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint

Check notice (Code scanning / CodeQL): Cyclic import
Import of module nemo.lightning.pytorch.callbacks.model_checkpoint begins an import cycle.
Collaborator: @jstjohn any way to avoid this?
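
One way to silence the warning is to avoid importing NeMo's model_checkpoint module here at all. A sketch of two alternatives (the helper functions are illustrative, not from this PR; Option 2 assumes NeMo's ModelCheckpoint subclasses Lightning's, which the super() call in the second hunk suggests):

from pytorch_lightning.callbacks import ModelCheckpoint as PLModelCheckpoint


def find_model_checkpoint_by_name(trainer):
    # Option 1: duck-type on the class name, so no NeMo-internal import is needed.
    for cb in trainer.callbacks:
        if type(cb).__name__ == "ModelCheckpoint":
            return cb
    return None


def find_model_checkpoint_by_base(trainer):
    # Option 2: match against Lightning's base ModelCheckpoint; the NeMo subclass
    # also satisfies this isinstance check.
    return next((cb for cb in trainer.callbacks if isinstance(cb, PLModelCheckpoint)), None)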

)

_trainer.state.fn = TrainerFn.FITTING # needed for proper save.
Collaborator: @jstjohn please clarify what you mean here: what would be missing if fn were not set to FITTING?
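
For context, a minimal sketch of the pattern (export_checkpoint is a hypothetical wrapper; only the TrainerFn assignment comes from the diff above). The assumption is that some save paths check trainer.state.fn and skip parts of the training state outside the fitting phase:

from pytorch_lightning.trainer.states import TrainerFn


def export_checkpoint(trainer, output_path: str) -> None:
    # Hypothetical standalone export: outside of fit(), trainer.state.fn is not
    # FITTING, and save paths that check it may skip parts of the training state.
    # Forcing FITTING (as the diff above does) makes a standalone export produce
    # the same on-disk layout as a checkpoint written during training.
    trainer.state.fn = TrainerFn.FITTING
    trainer.save_checkpoint(output_path)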

@@ -277,7 +278,7 @@ def on_train_end(self, trainer, pl_module):
else:
super()._save_last_checkpoint(trainer, monitor_candidates)
if self.save_context_on_train_end and not self.always_save_context and is_global_rank_zero():
TrainerContext.from_trainer(trainer).io_dump(ckpt_to_dir(self.last_model_path) / "context")
TrainerContext.from_trainer(trainer).io_dump(ckpt_to_dir(self.last_model_path) / self.CONTEXT_PATH)
Collaborator: @jstjohn ideally I'd like to avoid changing the checkpoint structure, but if we have to do it, let's add a comment giving an example of the use case and cherry-pick this PR so it makes it into the 24.09 release.
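
One way to document the use case the reviewer asks for is to keep the subdirectory names in class-level constants so the training-time callback and offline checkpoint transformations write to the same locations loaders expect. The class below is an illustrative sketch, not code from this PR; only the "weights" and "context" names come from the changelog:

from pathlib import Path


class CheckpointLayout:
    # Single source of truth for the nemo2 checkpoint layout, so on_train_end
    # context dumps and standalone checkpoint transformations stay in sync.
    WEIGHTS_PATH = "weights"
    CONTEXT_PATH = "context"

    @classmethod
    def context_dir(cls, ckpt_dir) -> Path:
        return Path(ckpt_dir) / cls.CONTEXT_PATH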
