
'NoneType' object has no attribute 'test_step' when DDP #577

Closed
jgsch opened this issue Dec 3, 2019 · 9 comments · Fixed by #1017
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

jgsch commented Dec 3, 2019

Describe the bug

When I activate DDP, the test_step function is replaced by None. There is no problem when I run on a single GPU.

To Reproduce

import torch
import torch.nn as nn
import torch.optim as optim
import pytorch_lightning as pl
import torch.utils.data as data
from pytorch_lightning import Trainer
import torchvision
import torchvision.transforms as transforms

num_workers = 0
classes = (
    'plane', 'car', 'bird', 'cat', 'deer', 'dog',
    'frog', 'horse', 'ship', 'truck'
)
n_classes = len(classes)
ddp = True


class PlModule(pl.LightningModule):
    def __init__(self):
        super(PlModule, self).__init__()
        model = torchvision.models.squeezenet1_1(True)
        model.n_classes = n_classes

        final_conv = nn.Conv2d(512, n_classes, kernel_size=1)
        model.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            final_conv,
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )

        self.model = model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_nb):
        inputs, targets = batch
        outputs = self.forward(inputs)
        loss = self.criterion(outputs, targets)
        return {"loss": loss}

    def test_step(self, batch, batch_nb):
        inputs, targets = batch
        outputs = self.forward(inputs)
        loss = self.criterion(outputs, targets)
        return {'test_loss': loss}

    def test_end(self, outputs):
        metric = [o["test_loss"] for o in outputs]
        # average on-device with torch; np.sum cannot handle CUDA tensors
        val_loss = torch.stack(metric).mean()
        tqdm_dict = {"test_loss": val_loss}
        return {
            'test_loss': tqdm_dict["test_loss"],
            'progress_bar': tqdm_dict,
        }

    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=0.001)

    @pl.data_loader
    def train_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        trainset = torchvision.datasets.CIFAR10(
            root='./data', train=True,
            download=True, transform=transform)

        t_sampler = None
        if ddp:
            t_sampler = data.distributed.DistributedSampler(trainset)
        return torch.utils.data.DataLoader(
            trainset,
            sampler=t_sampler,
            batch_size=400,
            num_workers=num_workers,
            shuffle=False,
            pin_memory=True,
        )

    @pl.data_loader
    def test_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        testset = torchvision.datasets.CIFAR10(
            root='./data', train=False,
            download=True, transform=transform)

        test_sampler = None
        if ddp:
            test_sampler = data.distributed.DistributedSampler(testset)

        return torch.utils.data.DataLoader(
            testset,
            sampler=test_sampler,
            batch_size=400,
            num_workers=num_workers,
            shuffle=False,
            pin_memory=True,
        )


if __name__ == "__main__":

    distributed = {
        "gpus": 2 if ddp else 1,
        "distributed_backend": 'ddp' if ddp else None
    }

    trainer = Trainer(
        logger=False,
        checkpoint_callback=False,
        early_stop_callback=False,
        max_nb_epochs=1,
        nb_sanity_val_steps=1,
        **distributed
    )
    model = PlModule()

    trainer.fit(model)
    trainer.test()

This gives the following error:

Traceback (most recent call last):
  File "test.py", line 128, in <module>
    trainer.test()
  File "/home/j/miniconda3/envs/alp36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 478, in test
    self.run_evaluation(test=True)
  File "/home/j/miniconda3/envs/alp36/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 88, in run_evaluation
    can_run_test_step = self.is_overriden('test_step') and self.is_overriden('test_end')
  File "/home/j/miniconda3/envs/alp36/lib/python3.6/site-packages/pytorch_lightning/trainer/model_hooks_mixin.py", line 16, in is_overriden
    is_overriden = getattr(model, f_name).__code__ is not getattr(super_object, f_name).__code__
AttributeError: 'NoneType' object has no attribute 'test_step'

Changing ddp = True to ddp = False makes the error disappear.
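
For context, the failing check in the traceback reduces to an attribute lookup on the trainer's model reference, which apparently is still None in the main process after DDP training. A minimal sketch of that last frame (not the actual Lightning internals):

model = None  # what the trainer seems to hold in the parent process after the DDP spawn
getattr(model, 'test_step')
# AttributeError: 'NoneType' object has no attribute 'test_step'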

Versions:

  • pytorch-lightning: 0.5.3.2
  • pytorch: 1.3.1
  • torchvision: 0.4.2
jgsch added the bug (Something isn't working) label Dec 3, 2019
jgsch changed the title from "NoneType' object has no attribute 'test_step' when DDP" to "'NoneType' object has no attribute 'test_step' when DDP" Dec 3, 2019
williamFalcon (Contributor)

Good catch. Mind submitting a PR?

jgsch (Author) commented Dec 6, 2019

@williamFalcon I investigated, and I think the problem comes from this line: mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))

After this line returns, self.model is still None. However, inside self.ddp_train the model is well defined. Something must go wrong when the spawned processes are joined.
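
To illustrate the suspected mechanism, here is a minimal sketch (not Lightning code; Holder and worker are hypothetical names): state assigned inside processes started by mp.spawn lives in the children's memory, so the parent's attribute is unchanged after join.

import torch.multiprocessing as mp

class Holder:
    # stand-in for the trainer: created in the parent with model = None
    model = None

def worker(rank, holder):
    # this assignment happens in the spawned child process only
    holder.model = f"model-on-rank-{rank}"

if __name__ == "__main__":
    holder = Holder()
    mp.spawn(worker, nprocs=2, args=(holder,))
    print(holder.model)  # prints None: the parent never sees the assignment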

sneiman (Contributor) commented Jan 21, 2020

Is there a workaround available in the meantime?

williamFalcon (Contributor) commented Jan 21, 2020

Option A:
call fit(model)?

Option B:
submit a PR to track self.model before doing the ddp call?
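
A minimal sketch of the workaround shape discussed in this thread: keep your own reference to the model and pass it to test() explicitly instead of relying on trainer.model (later comments report this can still fail under ddp, so treat it as an attempt, not a fix).

model = PlModule()
trainer = Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(model)
# pass the local reference; trainer.model is None in the parent after the spawn
trainer.test(model)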

sneiman (Contributor) commented Jan 21, 2020

Sorry - I assumed that this had already led to a PR covering the fact that both trainer.test() and trainer.test(model) fail with various errors when using ddp. I'll do some work to narrow it down and submit a PR.

Borda added the help wanted (Open to be worked on) label Jan 24, 2020
pableeto commented Mar 2, 2020

Hi, I am using the latest version (pip install https://github.com/PytorchLightning/pytorch-lightning/archive/master.zip --upgrade) and I have hit the same error when calling trainer.test(model).
Is there any update or workaround for this? Thank you very much.

Borda (Member) commented Mar 2, 2020

We are working on it now... does this appear only for multi-GPU, or have you observed it elsewhere?
Could you also check #979?

williamFalcon (Contributor)

@pableeto can you post your code here?

sneiman (Contributor) commented Mar 2, 2020 via email
