Very slow training on colab with TPU #2148

Closed · ipyffor opened this issue Jun 11, 2020 · 8 comments
Labels: accelerator: tpu (Tensor Processing Unit), help wanted (Open to be worked on)

Comments

ipyffor commented Jun 11, 2020

https://colab.research.google.com/drive/1OxoEcbNVCF5aj_9o0axTnKAh8p5I4Ikw?usp=sharing

ipyffor added the help wanted (Open to be worked on) label on Jun 11, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

Borda added the information needed and accelerator: tpu (Tensor Processing Unit) labels on Jun 11, 2020
Borda (Member) commented Jun 11, 2020

@ipyffor mind sharing your PL version and a sample notebook?
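
(For anyone hitting this later, one standard way to report the installed version from a Colab cell, nothing specific to this thread:

import pytorch_lightning
print(pytorch_lightning.__version__)
)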

ipyffor (Author) commented Jun 11, 2020

[screenshot attachment, failed to upload]
After running, the progress has not changed.

ipyffor (Author) commented Jun 11, 2020

Sorry, it seemed that my picture could not be uploaded.

lezwon (Contributor) commented Jun 23, 2020

@ipyffor I can't access the Colab file anymore. Are you still facing the issue?

rahulvigneswaran commented Jul 13, 2020

@ipyffor @lezwon Not just on TPU. Even on GPU, it makes the entire browser unresponsive. It doesn't look like it is code-specific.

@Borda The pytorch-lightning version: 0.8.5

I run into this issue only when I run the code inline. If I instead put the code in a separate file, say train.py, and launch it with !python train.py, the problem is non-existent (see the sketch after the snippet below).

!pip install pytorch-lightning
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from pytorch_lightning.core.lightning import LightningModule
import os, sys

class LitModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
        return loader
    
    def training_epoch_end(self, outputs):
        # average the training losses returned by training_step
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'avg_train_loss': avg_loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        dataset = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        return loader

model1 = LitModel()

checkpoint_callback = ModelCheckpoint(filepath='model1/{epoch}', save_last=True, save_top_k=-1)

trainer = Trainer(max_epochs=100, gpus=1, fast_dev_run=False, checkpoint_callback=checkpoint_callback)
trainer.fit(model1)
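
A minimal sketch of the workaround described above, assuming the snippet is what goes into the script (the file name train.py is just illustrative):

%%writefile train.py
# Cell 1: paste everything from the snippet above except the `!pip install`
# line (shell commands only work in notebook cells, not in a .py file).

!python train.py
# Cell 2: training now happens in a separate process, so the notebook
# kernel and the browser tab stay responsive.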

iliemihai commented:

I am facing the same issue. Even when I run the code on an 8-core TPU, one iteration takes 35 s, the same as on a 1-core TPU.
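
(For context, a minimal sketch of how multi-core TPU training is requested in PyTorch Lightning 0.8.x; LitModel is the module from the snippet above, and a Colab TPU runtime with torch_xla installed is assumed:

from pytorch_lightning import Trainer

model = LitModel()
# tpu_cores=8 spawns one process per TPU core; tpu_cores=1 uses a single core
trainer = Trainer(tpu_cores=8, max_epochs=1)
trainer.fit(model)
)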

Borda (Member) commented Jul 29, 2020

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... It shall be fixed now by #2632. Mind trying actual master? Also, mind sending a PR with some parity speed testing?
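
(A rough sketch of the kind of parity speed test being requested here, hypothetical rather than an existing PL benchmark: time one epoch per core count and compare wall times.

import time
from pytorch_lightning import Trainer

for cores in (1, 8):
    model = LitModel()  # the module from the snippet above
    trainer = Trainer(tpu_cores=cores, max_epochs=1)
    start = time.time()
    trainer.fit(model)
    print(f"{cores} TPU core(s): {time.time() - start:.1f}s for one epoch")
)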

Borda closed this as completed on Aug 4, 2020