
update checkpoint docs #1016

Merged
merged 14 commits into from
Mar 3, 2020

Conversation

@Borda (Member) commented Mar 3, 2020

What does this PR do?

Fixes #775. Revisit checkpoint documentation...

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added the bug (Something isn't working) and docs (Documentation related) labels Mar 3, 2020
@Borda Borda requested a review from a team March 3, 2020 00:19
@pep8speaks commented Mar 3, 2020

Hello @Borda! Thanks for updating this PR.

Line 24:101: E501 line too long (122 > 100 characters)
Line 25:1: W293 blank line contains whitespace
Line 98:101: E501 line too long (105 > 100 characters)
Line 173:101: E501 line too long (110 > 100 characters)
Line 199:101: E501 line too long (102 > 100 characters)

Line 136:101: E501 line too long (104 > 100 characters)
Line 348:101: E501 line too long (104 > 100 characters)

Comment last updated at 2020-03-03 20:15:02 UTC
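
The E501/W293 violations above can be reproduced with a minimal checker; this is a hedged sketch (the `pep8_report` helper is made up here, not pep8speaks' actual code), using the project's 100-character limit:

```python
def pep8_report(text, max_len=100):
    """Flag E501 (line too long) and W293 (blank line contains
    whitespace), mimicking the format of pep8speaks' report."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            issues.append(f"Line {lineno}:{max_len + 1}: E501 line too long "
                          f"({len(line)} > {max_len} characters)")
        if line != "" and line.strip() == "":
            issues.append(f"Line {lineno}:1: W293 blank line contains whitespace")
    return issues
```

Running it over a file's contents yields lines in the same shape as the report above.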

@Borda Borda marked this pull request as ready for review March 3, 2020 00:21
@jeremyjordan (Contributor) left a comment

looks like there are some lingering references to checkpoint_callback.filepath in:

  • tests/test_restore_models.py
  • tests/models/utils.py
  • tests/trainer/test_trainer.py
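
Such lingering references can be located with a quick scan; this is a throwaway sketch (the `find_references` helper is hypothetical, not part of the PR):

```python
from pathlib import Path

def find_references(root, needle="checkpoint_callback.filepath"):
    """Return (path, line number, line) for every occurrence of the
    old attribute name in Python files under ``root``."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```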

@williamFalcon (Contributor) left a comment


@Borda mind implementing the rich text for file names?

@williamFalcon williamFalcon added this to the 0.7.0 milestone Mar 3, 2020
@Borda Borda requested a review from jeremyjordan March 3, 2020 15:39
@Borda (Member, Author) commented Mar 3, 2020

shall we rather make it model-name_epoch=02_val_loss=0.36.ckpt, and if there is already something with model-name, create model-name-v1_epoch=02_val_loss=0.36.ckpt? That would make more sense... but we can move it to the next release and just keep this one for now @williamFalcon
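
The proposed naming scheme could be sketched roughly like this; it is an illustration only, not the actual ModelCheckpoint implementation, and collision detection is simplified to a prefix check on the directory listing:

```python
import os

def unique_ckpt_name(dirpath, base, metrics):
    """Build 'model-name_epoch=02_val_loss=0.36.ckpt'; if a checkpoint
    for the same base name already exists, bump the name to
    'model-name-v1_...', 'model-name-v2_...', and so on."""
    suffix = "_".join(f"{k}={v}" for k, v in metrics.items())
    existing = os.listdir(dirpath) if os.path.isdir(dirpath) else []
    name, version = base, 0
    # a file like 'model-name_...' already present -> try 'model-name-v1'
    while any(f.startswith(name + "_") for f in existing):
        version += 1
        name = f"{base}-v{version}"
    return f"{name}_{suffix}.ckpt"
```

For example, with model-name_epoch=01_val_loss=0.40.ckpt already on disk, the next checkpoint for the same model gets the -v1 prefix variant.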

@williamFalcon (Contributor) commented:

________________________________________________________________________________ test_resume_from_checkpoint_epoch_restored ________________________________________________________________________________

tmpdir = local('/tmp/pytest-of-waf251/pytest-48/test_resume_from_checkpoint_ep0')

    def test_resume_from_checkpoint_epoch_restored(tmpdir):
        """Verify resuming from checkpoint runs the right number of epochs"""
        import types

        tutils.reset_seed()

        hparams = tutils.get_hparams()

        def _new_model():
            # Create a model that tracks epochs and batches seen
            model = LightningTestModel(hparams)
            model.num_epochs_seen = 0
            model.num_batches_seen = 0

            def increment_epoch(self):
                self.num_epochs_seen += 1

            def increment_batch(self, _):
                self.num_batches_seen += 1

            # Bind the increment_epoch function on_epoch_end so that the
            # model keeps track of the number of epochs it has seen.
            model.on_epoch_end = types.MethodType(increment_epoch, model)
            model.on_batch_start = types.MethodType(increment_batch, model)
            return model

        model = _new_model()

        trainer_options = dict(
            show_progress_bar=False,
            max_epochs=2,
            train_percent_check=0.65,
            val_percent_check=1,
            checkpoint_callback=ModelCheckpoint(tmpdir, save_top_k=-1),
            logger=False,
            default_save_path=tmpdir,
            early_stop_callback=False,
            val_check_interval=0.5,
        )

        # fit model
        trainer = Trainer(**trainer_options)
        trainer.fit(model)

        training_batches = trainer.num_training_batches

        assert model.num_epochs_seen == 2
        assert model.num_batches_seen == training_batches * 2

        # Other checkpoints can be uncommented if/when resuming mid-epoch is supported
        checkpoints = sorted(glob.glob(os.path.join(trainer.checkpoint_callback.dirpath, '*.ckpt')))

        for check in checkpoints[::2]:
            next_model = _new_model()
            state = torch.load(check)

            # Resume training
            trainer_options['max_epochs'] = 4
            new_trainer = Trainer(**trainer_options, resume_from_checkpoint=check)
            new_trainer.fit(next_model)
>           assert state['global_step'] + next_model.num_batches_seen == training_batches * 4
E           assert 140 == 160
E             -140
E             +160
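
The failing assertion is plain batch accounting: the global_step stored in the checkpoint plus the batches the resumed run processes should add up to training_batches * max_epochs. A minimal sketch of that invariant (illustrative numbers, not this test run's actual values):

```python
def batches_after_resume(training_batches, max_epochs, global_step):
    """Batches a resumed run must still process so that
    checkpointed steps + resumed steps == total steps."""
    return training_batches * max_epochs - global_step

# e.g. with 40 batches per epoch, max_epochs=4 and a checkpoint saved
# at global_step 80, the resumed run should see another 80 batches.
```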

@Borda (Member, Author) commented Mar 3, 2020

it is hard to replicate locally; it seems the names on the server are different...

@Borda Borda requested review from williamFalcon and a team and removed request for jeremyjordan March 3, 2020 17:41
@Borda (Member, Author) commented Mar 3, 2020

it should be fixed now, @williamFalcon ^^

@williamFalcon williamFalcon merged commit 64de57b into Lightning-AI:master Mar 3, 2020
@Borda Borda deleted the checkpoint branch March 3, 2020 20:46
@Borda Borda linked an issue Mar 30, 2020 that may be closed by this pull request
@Borda Borda mentioned this pull request Mar 30, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* update checkpoint docs

* fix tests

* fix tests

* formatting

* typing

* filename

* fix tests

* fixing tests

* fixing tests

* fixing tests

* unique name

* fixing

* fixing

* Update model_checkpoint.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
Labels
bug (Something isn't working), docs (Documentation related)
Projects
None yet
Development
Successfully merging this pull request may close these issues:

  • Checkpointing Names
  • Checkpoint naming broken

4 participants