
Duplicate epochs when calling .fit() twice #5007

Closed
carmocca opened this issue Dec 7, 2020 · 11 comments · Fixed by #8578
Labels: breaking change, bug, help wanted, priority: 0

carmocca (Contributor) commented Dec 7, 2020

🐛 Bug

To Reproduce

from pytorch_lightning import Trainer
# BoringModel is Lightning's minimal test model; its import path has moved
# between releases (tests.helpers.boring_model on master at the time).
from tests.helpers.boring_model import BoringModel


def test_bug(tmpdir):
    epochs = []

    class TestModel(BoringModel):
        def on_epoch_end(self):
            # Record every epoch index the trainer reports.
            epochs.append(self.current_epoch)

    trainer = Trainer(
        max_epochs=2,
        limit_train_batches=1,
        limit_val_batches=1,
        default_root_dir=tmpdir,
        checkpoint_callback=False,
        logger=False,
        weights_summary=None,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(TestModel())

    # Raise the epoch budget and fit again: training should continue at epoch 2.
    trainer.max_epochs = 4
    trainer.fit(TestModel())

    assert epochs == list(range(4))
    # AssertionError: [0, 1, 1, 2, 3] != [0, 1, 2, 3]

Expected behavior

Assertion does not fail

Environment

Current master

cc @tchaton @Borda

carmocca added the bug and help wanted labels Dec 7, 2020
carmocca (Contributor, Author) commented Dec 8, 2020

The epoch number is generated here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/239347435029c0a02b305201ebbfa39d62746ca8/pytorch_lightning/trainer/trainer.py#L511

which assumes that the epoch in self.current_epoch has not run yet. That assumption is wrong when fit is run twice, because the epoch counter is not incremented after training ends.

When using Trainer(resume_from_checkpoint=...) this issue does not appear, thanks to this piece:

https://github.com/PyTorchLightning/pytorch-lightning/blob/239347435029c0a02b305201ebbfa39d62746ca8/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L273-L279

So the solution would be to increment the epoch counter at the end of training_loop.on_train_end()? (This would break backwards compatibility.)
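
Roughly, a sketch of that idea (untested; it assumes the counter lives on the trainer and is bumped from the training loop's on_train_end hook):

class TrainLoop:
    def on_train_end(self):
        ...  # existing teardown logic
        # current_epoch still points at the last epoch that ran; advance it
        # so a subsequent fit() starts at the next epoch instead of
        # repeating this one.
        self.trainer.current_epoch += 1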

ananthsub (Contributor) commented

@carmocca or should we reset these parameters in the trainer teardown?

carmocca (Contributor, Author) commented

I'd say it's more natural to do it in on_train_end: it has nothing to do with testing, and teardown runs for both fit and test.

pierresegonne commented

Any opinion on what the recommended way of resetting current_epoch would be, in order to call fit twice?

Is something along these lines

trainer.fit(model, datamodule)
model.trainer.current_epoch = 0
trainer.fit(model, datamodule)

safe?

carmocca (Contributor, Author) commented
Yes, it should be safe.
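
For the continuing-training case in the reproduction above, the same trick should also work by pointing the counter past the epochs that already ran (a sketch, untested; model and datamodule are whatever was passed to the first fit):

trainer.fit(model, datamodule)  # with max_epochs=2, runs epochs 0 and 1

# The counter was left at the last epoch that ran (1). Point it at the
# first epoch that has NOT run yet, then raise the budget:
model.trainer.current_epoch = 2
trainer.max_epochs = 4
trainer.fit(model, datamodule)  # should now run epochs 2 and 3, no duplicate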

edenlightning (Contributor) commented
@carmocca what is left TODO here?

carmocca (Contributor, Author) commented Feb 17, 2021

Everything, the bug is not fixed 🙂

There is a reproduction test at the top. We just need to make up our minds about the best solution. Context here: #5007 (comment)

carmocca (Contributor, Author) commented Jul 6, 2021

Status update: WIP; tackling other related issues first. This is needed for fault tolerance.

edenlightning modified the milestones: v1.3.x, v1.4.x Jul 6, 2021
carmocca (Contributor, Author) commented
Status update: Blocked on merging #8477 and on enabling restoration of the checkpoint progress-tracking state by default.

Borda removed the with code label Aug 19, 2021
awaelchli modified the milestones: v1.5, 1.5.x Nov 4, 2021
tchaton added the priority: 0 label and removed the priority: 1 label Nov 29, 2021
fschiffers commented

> Any opinion on what the recommended way of resetting current_epoch would be, in order to call fit twice?
>
> Is something along these lines
>
> trainer.fit(model, datamodule)
> model.trainer.current_epoch = 0
> trainer.fit(model, datamodule)
>
> safe?

I think current_epoch can no longer be set on the trainer; it must be set on the fit loop itself.

carmocca (Contributor, Author) commented
Correct. You'll need to do trainer.fit_loop.current_epoch = 0
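
So on recent versions, the workaround from earlier in the thread becomes (sketch; same caveats as before):

trainer.fit(model, datamodule)

# The epoch counter now lives on the loop objects, so reset it on the
# fit loop rather than on the trainer before fitting again:
trainer.fit_loop.current_epoch = 0
trainer.fit(model, datamodule)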
