
update checkpoint docs #1016

Merged
merged 14 commits into from
Mar 3, 2020

Conversation

@Borda (Member) commented Mar 3, 2020

What does this PR do?

Fixes #775. Revisit checkpoint documentation...

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added the bug (Something isn't working) and docs (Documentation related) labels Mar 3, 2020
@Borda Borda requested a review from a team March 3, 2020 00:19
@pep8speaks commented Mar 3, 2020

Hello @Borda! Thanks for updating this PR.

Line 24:101: E501 line too long (122 > 100 characters)
Line 25:1: W293 blank line contains whitespace
Line 98:101: E501 line too long (105 > 100 characters)
Line 173:101: E501 line too long (110 > 100 characters)
Line 199:101: E501 line too long (102 > 100 characters)

Line 136:101: E501 line too long (104 > 100 characters)
Line 348:101: E501 line too long (104 > 100 characters)

Comment last updated at 2020-03-03 20:15:02 UTC
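
The E501/W293 violations above can be reproduced with a minimal checker; this is a hedged sketch (the `pep8_report` helper is made up here, not pep8speaks' actual code), using the project's 100-character limit:

```python
def pep8_report(text, max_len=100):
    """Flag E501 (line too long) and W293 (blank line contains
    whitespace), mimicking the format of pep8speaks' report."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            issues.append(f"Line {lineno}:{max_len + 1}: E501 line too long "
                          f"({len(line)} > {max_len} characters)")
        if line != "" and line.strip() == "":
            issues.append(f"Line {lineno}:1: W293 blank line contains whitespace")
    return issues
```

Running it over a file's contents yields lines in the same shape as the report above.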

@Borda Borda marked this pull request as ready for review March 3, 2020 00:21
@jeremyjordan (Contributor) left a comment

looks like there are some lingering references to checkpoint_callback.filepath in:

  • tests/test_restore_models.py
  • tests/models/utils.py
  • tests/trainer/test_trainer.py
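
Such lingering references can be located with a quick scan; this is a throwaway sketch (the `find_references` helper is hypothetical, not part of the PR):

```python
from pathlib import Path

def find_references(root, needle="checkpoint_callback.filepath"):
    """Return (path, line number, line) for every occurrence of the
    old attribute name in Python files under ``root``."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```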

@williamFalcon (Contributor) left a comment


@Borda mind implementing the rich text for file names?

@williamFalcon williamFalcon added this to the 0.7.0 milestone Mar 3, 2020
@Borda Borda requested a review from jeremyjordan March 3, 2020 15:39
@Borda (Member, Author) commented Mar 3, 2020

shall we rather make it model-name_epoch=02_val_loss=0.36.ckpt, and if there is already something with model-name, create model-name-v1_epoch=02_val_loss=0.36.ckpt? That would make more sense... but we can move it to the next release and just keep this one for now @williamFalcon
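
The proposed naming scheme could be sketched roughly like this; it is an illustration only, not the actual ModelCheckpoint implementation, and collision detection is simplified to a prefix check on the directory listing:

```python
import os

def unique_ckpt_name(dirpath, base, metrics):
    """Build 'model-name_epoch=02_val_loss=0.36.ckpt'; if a checkpoint
    for the same base name already exists, bump the name to
    'model-name-v1_...', 'model-name-v2_...', and so on."""
    suffix = "_".join(f"{k}={v}" for k, v in metrics.items())
    existing = os.listdir(dirpath) if os.path.isdir(dirpath) else []
    name, version = base, 0
    # a file like 'model-name_...' already present -> try 'model-name-v1'
    while any(f.startswith(name + "_") for f in existing):
        version += 1
        name = f"{base}-v{version}"
    return f"{name}_{suffix}.ckpt"
```

For example, with model-name_epoch=01_val_loss=0.40.ckpt already on disk, the next checkpoint for the same model gets the -v1 prefix variant.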

@williamFalcon (Contributor) commented:

________________________________________________________________________________ test_resume_from_checkpoint_epoch_restored ________________________________________________________________________________

tmpdir = local('/tmp/pytest-of-waf251/pytest-48/test_resume_from_checkpoint_ep0')

    def test_resume_from_checkpoint_epoch_restored(tmpdir):
        """Verify resuming from checkpoint runs the right number of epochs"""
        import types

        tutils.reset_seed()

        hparams = tutils.get_hparams()

        def _new_model():
            # Create a model that tracks epochs and batches seen
            model = LightningTestModel(hparams)
            model.num_epochs_seen = 0
            model.num_batches_seen = 0

            def increment_epoch(self):
                self.num_epochs_seen += 1

            def increment_batch(self, _):
                self.num_batches_seen += 1

            # Bind the increment_epoch function on_epoch_end so that the
            # model keeps track of the number of epochs it has seen.
            model.on_epoch_end = types.MethodType(increment_epoch, model)
            model.on_batch_start = types.MethodType(increment_batch, model)
            return model

        model = _new_model()

        trainer_options = dict(
            show_progress_bar=False,
            max_epochs=2,
            train_percent_check=0.65,
            val_percent_check=1,
            checkpoint_callback=ModelCheckpoint(tmpdir, save_top_k=-1),
            logger=False,
            default_save_path=tmpdir,
            early_stop_callback=False,
            val_check_interval=0.5,
        )

        # fit model
        trainer = Trainer(**trainer_options)
        trainer.fit(model)

        training_batches = trainer.num_training_batches

        assert model.num_epochs_seen == 2
        assert model.num_batches_seen == training_batches * 2

        # Other checkpoints can be uncommented if/when resuming mid-epoch is supported
        checkpoints = sorted(glob.glob(os.path.join(trainer.checkpoint_callback.dirpath, '*.ckpt')))

        for check in checkpoints[::2]:
            next_model = _new_model()
            state = torch.load(check)

            # Resume training
            trainer_options['max_epochs'] = 4
            new_trainer = Trainer(**trainer_options, resume_from_checkpoint=check)
            new_trainer.fit(next_model)
>           assert state['global_step'] + next_model.num_batches_seen == training_batches * 4
E           assert 140 == 160
E             -140
E             +160
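
The failing assertion is plain batch accounting: the global_step stored in the checkpoint plus the batches the resumed run processes should add up to training_batches * max_epochs. A minimal sketch of that invariant (illustrative numbers, not this test run's actual values):

```python
def batches_after_resume(training_batches, max_epochs, global_step):
    """Batches a resumed run must still process so that
    checkpointed steps + resumed steps == total steps."""
    return training_batches * max_epochs - global_step

# e.g. with 40 batches per epoch, max_epochs=4 and a checkpoint saved
# at global_step 80, the resumed run should see another 80 batches.
```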

@Borda (Member, Author) commented Mar 3, 2020

it is hard to replicate locally; it seems the names on the server are different...

@Borda Borda requested review from williamFalcon and a team and removed request for jeremyjordan March 3, 2020 17:41
@Borda (Member, Author) commented Mar 3, 2020

it should be fixed now, @williamFalcon ^^

@williamFalcon williamFalcon merged commit 64de57b into Lightning-AI:master Mar 3, 2020
@Borda Borda deleted the checkpoint branch March 3, 2020 20:46
@Borda Borda linked an issue Mar 30, 2020 that may be closed by this pull request
@Borda Borda mentioned this pull request Mar 30, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* update checkpoint docs

* fix tests

* fix tests

* formatting

* typing

* filename

* fix tests

* fixing tests

* fixing tests

* fixing tests

* unique name

* fixing

* fixing

* Update model_checkpoint.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
Labels
bug (Something isn't working), docs (Documentation related)
Projects
None yet
Development
Successfully merging this pull request may close these issues:

  • Checkpointing Names
  • Checkpoint naming broken

4 participants