Very slow training on colab with TPU #2148

Closed · ipyffor opened this issue Jun 11, 2020 · 8 comments
Labels: accelerator: tpu (Tensor Processing Unit), help wanted (Open to be worked on)

Comments

ipyffor commented Jun 11, 2020

https://colab.research.google.com/drive/1OxoEcbNVCF5aj_9o0axTnKAh8p5I4Ikw?usp=sharing

ipyffor added the help wanted (Open to be worked on) label on Jun 11, 2020
@github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

Borda added the information needed and accelerator: tpu (Tensor Processing Unit) labels on Jun 11, 2020
Borda (Member) commented Jun 11, 2020

@ipyffor mind sharing your PL version and a sample notebook?
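
(For anyone hitting this later, one standard way to report the installed version from a Colab cell, nothing specific to this thread:

import pytorch_lightning
print(pytorch_lightning.__version__)
)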

ipyffor (Author) commented Jun 11, 2020

[screenshot attachment, failed to upload]
After running, the progress has not changed.

ipyffor (Author) commented Jun 11, 2020

Sorry, it seemed that my picture could not be uploaded.

lezwon (Contributor) commented Jun 23, 2020

@ipyffor I can't access the Colab file anymore. Are you still facing the issue?

rahulvigneswaran commented Jul 13, 2020

@ipyffor @lezwon Not just on TPU. Even on GPU, it makes the entire browser unresponsive. It doesn't look like it is code-specific.

@Borda The pytorch-lightning version: 0.8.5

I run into this issue only when I run the code inline. If I instead put the code in a separate file, say train.py, and launch it with !python train.py, the problem is non-existent (see the sketch after the snippet below).

!pip install pytorch-lightning
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from pytorch_lightning.core.lightning import LightningModule
import os, sys

class LitModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
        return loader
    
    def training_epoch_end(self, outputs):
        # average the training losses returned by training_step
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'avg_train_loss': avg_loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        dataset = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        return loader

model1 = LitModel()

checkpoint_callback = ModelCheckpoint(filepath='model1/{epoch}', save_last=True, save_top_k=-1)

trainer = Trainer(max_epochs=100, gpus=1, fast_dev_run=False, checkpoint_callback=checkpoint_callback)
trainer.fit(model1)
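
A minimal sketch of the workaround described above, assuming the snippet is what goes into the script (the file name train.py is just illustrative):

%%writefile train.py
# Cell 1: paste everything from the snippet above except the `!pip install`
# line (shell commands only work in notebook cells, not in a .py file).

!python train.py
# Cell 2: training now happens in a separate process, so the notebook
# kernel and the browser tab stay responsive.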

iliemihai commented:

I am facing the same issue. Even when I run the code on an 8-core TPU, one iteration takes 35 s, the same as on a 1-core TPU.
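
(For context, a minimal sketch of how multi-core TPU training is requested in PyTorch Lightning 0.8.x; LitModel is the module from the snippet above, and a Colab TPU runtime with torch_xla installed is assumed:

from pytorch_lightning import Trainer

model = LitModel()
# tpu_cores=8 spawns one process per TPU core; tpu_cores=1 uses a single core
trainer = Trainer(tpu_cores=8, max_epochs=1)
trainer.fit(model)
)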

Borda (Member) commented Jul 29, 2020

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... It shall be fixed now by #2632. Mind trying actual master? Also, mind sending a PR with some parity speed testing?
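
(A rough sketch of the kind of parity speed test being requested here, hypothetical rather than an existing PL benchmark: time one epoch per core count and compare wall times.

import time
from pytorch_lightning import Trainer

for cores in (1, 8):
    model = LitModel()  # the module from the snippet above
    trainer = Trainer(tpu_cores=cores, max_epochs=1)
    start = time.time()
    trainer.fit(model)
    print(f"{cores} TPU core(s): {time.time() - start:.1f}s for one epoch")
)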

Borda closed this as completed on Aug 4, 2020