
Logging non-tensor scalar with result breaks subsequent epoch aggregation #3276

Closed
s-rog opened this issue Aug 31, 2020 · 11 comments · Fixed by #4116
Labels
bug (Something isn't working) · good first issue (Good for newcomers) · help wanted (Open to be worked on)
Milestone

Comments

@s-rog
Contributor

s-rog commented Aug 31, 2020

🐛 Bug

Logging non-tensor scalar with result breaks subsequent epoch/tbptt aggregation
(on both 0.9 and master)

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_spawn_backend.py", line 165, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 543, in run_training_epoch
    self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 672, in run_training_epoch_end
    epoch_log_metrics, epoch_progress_bar_metrics = self.__auto_reduce_results_on_epoch_end(epoch_output)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 696, in __auto_reduce_results_on_epoch_end
    tbptt_outs = tbptt_outs[0].__class__.reduce_across_time(tbptt_outs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/step_result.py", line 392, in reduce_across_time
    result[k] = tbptt_reduce_fx(value)
TypeError: mean(): argument 'input' (position 1) must be Tensor, not list

To Reproduce

    def training_step(self, batch, batch_idx):
        x, y = batch[0], batch[1]
        x = self.forward(x)
        loss = self.loss(x, y)
        result = pl.TrainResult(loss)
        result.log("non tensor scalar", 1.0)
        result.log("loss", loss, on_step=False, on_epoch=True)
        return result

To Fix

result.log("non tensor scalar", torch.tensor(1.0))

Expected behavior

In log() of result objects, value is typed as Any, so non-tensor values should be accepted without breaking the logging or aggregation of the other metrics.

Additional context

log() could be changed to accept only tensors, or to convert values to tensors internally; I will update as I investigate further.
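
In the meantime, a caller-side workaround is to coerce values to tensors before logging. A minimal sketch, assuming only torch; to_loggable is a hypothetical helper, not part of the Lightning API:

import numbers

import torch


def to_loggable(value):
    # Wrap bare Python numbers in a 0-d tensor so that epoch/tbptt reduction
    # (torch.mean by default) always receives tensors instead of floats.
    if isinstance(value, numbers.Number) and not isinstance(value, bool):
        return torch.tensor(float(value))
    return value


# usage inside training_step:
# result.log("non tensor scalar", to_loggable(1.0))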

@s-rog s-rog added bug Something isn't working help wanted Open to be worked on labels Aug 31, 2020
@edenlightning edenlightning added this to the 0.9.x milestone Sep 1, 2020
@edenlightning
Contributor

@williamFalcon

@Borda Borda added the Result label Sep 4, 2020
@Borda Borda added the good first issue Good for newcomers label Sep 15, 2020
@Borda
Member

Borda commented Sep 15, 2020

@williamFalcon is it still there? @s-rog mind checking whether it is still on master? Or better, add a test for this case...

@s-rog
Contributor Author

s-rog commented Sep 16, 2020

I'll take another look when I get a chance; I haven't probed much due to the refactors... are they mostly done?

@s-rog
Contributor Author

s-rog commented Sep 16, 2020

@Borda the bug is still here; the following template tests for this issue as well as #3278.

For this problem the easiest fix would be to force the values to tensors, though that's probably just a band-aid solution. Thoughts?

Test template for reference
#!/opt/conda/bin/python
"""
Runs a model on a single node across multiple gpus.
"""
import os
from argparse import ArgumentParser

import torch.nn.functional as F

import pytorch_lightning as pl
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning import Trainer, seed_everything

seed_everything(234)

class custom_template(LightningTemplateModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    
    def on_epoch_start(self):
        print("on_epoch_start")

    def on_fit_start(self):
        print("on_fit_start")
        
    def training_step(self, batch, batch_idx):
        """
        Lightning calls this inside the training loop with the data from the training dataloader
        passed in as `batch`.
        """
        # forward pass
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.TrainResult(loss)
        result.log("non tensor scalar", 1.0)
        result.log("loss", loss, on_step=False, on_epoch=True)
        return result
        

def main(args):
    """ Main training routine specific for this project. """
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = custom_template(**vars(args))

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = Trainer.from_argparse_args(args)

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


def run_cli():
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments
    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)

    # each LightningModule defines arguments relevant to it
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    parser = Trainer.add_argparse_args(parser)
    parser.set_defaults(gpus=1, distributed_backend=None)
    args = parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(args)


if __name__ == '__main__':
    run_cli()

@s-rog
Contributor Author

s-rog commented Sep 17, 2020

I looked into it a bit: reduce_across_time is called on all metrics whenever any metric in the result is logged with on_epoch=True. Since the default tbptt_reduce_fx is torch.mean, that makes on_epoch=True compatible only with tensor scalars.

This is probably not intended behavior?
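
A minimal illustration of the failure mode, outside Lightning and assuming only torch: values logged as tensors can be stacked and reduced across steps, whereas plain floats accumulate into a Python list that torch.mean rejects:

import torch

# per-step values when the metric was logged as a tensor
tensor_steps = [torch.tensor(1.0), torch.tensor(1.0)]
print(torch.mean(torch.stack(tensor_steps)))  # tensor(1.)

# per-step values when a plain Python float was logged
float_steps = [1.0, 1.0]
try:
    torch.mean(float_steps)  # what the default tbptt_reduce_fx ends up doing
except TypeError as err:
    print(err)  # mean(): argument 'input' (position 1) must be Tensor, not list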

@edenlightning
Contributor

@justusschock @SkafteNicki

@justusschock
Member

@edenlightning this is not related to metrics.

@s-rog I guess a simple type check that converts scalars to scalar tensors should do the trick? If so, could you open a PR with this fix?
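
A rough sketch of such a check, assuming it would run on the value near the top of log(); the helper name and placement are illustrative, not the actual implementation:

import numbers

import torch


def _coerce_scalar(value):
    # Illustrative type check: convert bare Python numbers (int/float) to
    # 0-d float tensors so later reductions always operate on tensors.
    if isinstance(value, numbers.Number):
        return torch.tensor(value, dtype=torch.float)
    return value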

@Borda
Member

Borda commented Oct 5, 2020

fixed by #3855

@Borda Borda closed this as completed Oct 5, 2020
@s-rog
Contributor Author

s-rog commented Oct 13, 2020

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 152, in ddp_train
    results = self.train_or_test()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 53, in train_or_test
    results = self.trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 598, in run_training_epoch
    self.num_optimizers
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 339, in log_train_epoch_end_metrics
    epoch_log_metrics, epoch_progress_bar_metrics = self.__auto_reduce_results_on_epoch_end(epoch_output)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 449, in __auto_reduce_results_on_epoch_end
    tbptt_outs = tbptt_outs[0].__class__.reduce_across_time(tbptt_outs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/step_result.py", line 483, in reduce_across_time
    result[k] = tbptt_reduce_fx(value.float())
AttributeError: 'list' object has no attribute 'float'

@Borda I don't think the issue was fixed completely; this is on rc5 (using self.log).

@Borda
Member

Borda commented Oct 13, 2020

@s-rog mind adding a test for this case?

@Borda Borda reopened this Oct 13, 2020
williamFalcon added a commit that referenced this issue Oct 13, 2020
williamFalcon added a commit that referenced this issue Oct 13, 2020
@s-rog
Contributor Author

s-rog commented Oct 13, 2020

william beat me to it :]
