outputs in training_epoch_end contain only outputs from last batch repeated #8603

Closed
stas-sl opened this issue Jul 28, 2021 · 0 comments · Fixed by #8613
stas-sl commented Jul 28, 2021

🐛 Bug

outputs in training_epoch_end contains only the output of the last batch, repeated once per batch. I believe this broke in 1.4.0; in 1.3.x it worked as expected.

To Reproduce

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        print(f'training_step, {batch_idx=}: {loss=}')
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def training_epoch_end(self, outputs):
        # `outputs` should contain one entry per training batch
        print('training_epoch_end:', outputs)


dl = DataLoader(RandomDataset(32, 100), batch_size=10)  # 100 samples / batch_size 10 -> 10 batches per epoch

model = BoringModel()
trainer = Trainer(max_epochs=1, progress_bar_refresh_rate=0)
trainer.fit(model, dl)

This prints the same loss, equal to the last batch's loss, repeated 10 times in training_epoch_end:

training_step, batch_idx=0: loss=tensor(0.6952, grad_fn=<SumBackward0>)
training_step, batch_idx=1: loss=tensor(-18.9661, grad_fn=<SumBackward0>)
training_step, batch_idx=2: loss=tensor(-27.7834, grad_fn=<SumBackward0>)
training_step, batch_idx=3: loss=tensor(-84.3158, grad_fn=<SumBackward0>)
training_step, batch_idx=4: loss=tensor(-119.3664, grad_fn=<SumBackward0>)
training_step, batch_idx=5: loss=tensor(-138.1930, grad_fn=<SumBackward0>)
training_step, batch_idx=6: loss=tensor(-126.4004, grad_fn=<SumBackward0>)
training_step, batch_idx=7: loss=tensor(-143.7022, grad_fn=<SumBackward0>)
training_step, batch_idx=8: loss=tensor(-175.9583, grad_fn=<SumBackward0>)
training_step, batch_idx=9: loss=tensor(-161.6977, grad_fn=<SumBackward0>)

training_epoch_end: [{'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}, {'loss': tensor(-161.6977)}]

Expected behavior

Outputs from all steps/batches are available in training_epoch_end, not only the output of the last batch repeated.
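
For contrast, with the per-batch losses printed above, outputs would be expected to hold one entry per batch, along the lines of (values taken from the log above, detached as in the actual output):

training_epoch_end: [{'loss': tensor(0.6952)}, {'loss': tensor(-18.9661)}, {'loss': tensor(-27.7834)}, {'loss': tensor(-84.3158)}, {'loss': tensor(-119.3664)}, {'loss': tensor(-138.1930)}, {'loss': tensor(-126.4004)}, {'loss': tensor(-143.7022)}, {'loss': tensor(-175.9583)}, {'loss': tensor(-161.6977)}]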

Environment

* CUDA:
	- GPU:
	- available:         False
	- version:           None
* Packages:
	- numpy:             1.18.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.8.0
	- pytorch-lightning: 1.4.0
	- tqdm:              4.47.0
* System:
	- OS:                Darwin
	- architecture:
		- 64bit
		-
	- processor:         i386
	- python:            3.8.3
	- version:           Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64
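
Until the fix in #8613 lands, a possible interim workaround (not part of the original report, just a sketch that reuses the imports, RandomDataset and dl from the reproduction above; the attribute name manual_outputs is purely illustrative) is to accumulate detached per-step values on the module yourself and ignore the outputs argument:

class WorkaroundModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.manual_outputs = []  # hand-collected, one entry per training_step

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.manual_outputs.append(loss.detach())  # detach so the graph can be freed
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def training_epoch_end(self, outputs):
        # read the hand-collected values instead of the (currently broken) `outputs`
        print('manually collected:', self.manual_outputs)
        self.manual_outputs.clear()  # reset for the next epoch

Fitting WorkaroundModel() with the same Trainer settings and dl should then show ten distinct per-batch losses in training_epoch_end.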
stas-sl added bug Something isn't working help wanted Open to be worked on labels Jul 28, 2021
awaelchli added priority: 0 High priority task with code labels Jul 29, 2021
awaelchli added this to the v1.4.x milestone Jul 29, 2021
awaelchli self-assigned this Jul 29, 2021