
Trainer on colab TPU error: process X terminated with signal SIGSEGV #1956

Closed
VictorCallejas opened this issue May 26, 2020 · 6 comments · Fixed by #2632

@VictorCallejas

I am trying to train an image encoder with pytorch-lightning on Colab using a TPU (8 cores).

I am following this demo notebook: https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=dEeUzX_5aLrX

Library versions:
torch: 1.5.0
torchvision: 0.6.0
pytorch-lightning: 0.7.5
pytorch-xla: 1.6

I have also tried nightly and older versions, but I get the same error.

When running:

import pytorch_lightning as pl

trainer = pl.Trainer(num_tpu_cores=8, progress_bar_refresh_rate=10, max_epochs=10)

# Run lr finder on the LightningModule defined earlier in the notebook
lr_finder = trainer.lr_find(model)

Error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-25-8e434d4d50a5> in <module>()
      2 
      3 # Run lr finder
----> 4 lr_finder = trainer.lr_find(model)
      5 
      6 fig = lr_finder.plot(suggest=True)

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, min_lr, max_lr, num_training, mode, num_accumulation_steps)
    151 
    152         # Fit, lr & loss logged in callback
--> 153         self.fit(model, train_dataloader=train_dataloader)
    154 
    155         # Prompt if we stopped early

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    775 
    776             # train
--> 777             xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
    778 
    779             # load weights if not interrupted

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    106                 raise Exception(
    107                     "process %d terminated with signal %s" %
--> 108                     (error_index, name)
    109                 )
    110             else:

Exception: process 2 terminated with signal SIGSEGV

My notebook on Gist here: https://colab.research.google.com/gist/VictorCallejas/10e4c39fc25051012ae28a2a7261f814/untitled.ipynb

It seems the exception is raised because the other processes are not joining, but I have no clue why. I have set everything up exactly as in the demo notebook.
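To narrow this down, here is a bare torch_xla spawn test I plan to try next (my own rough sketch, not part of the demo notebook); if this also dies with SIGSEGV, the problem would be in the torch / torch_xla setup rather than in the Trainer:

# Minimal torch_xla multiprocessing check (hypothetical sketch, not from the demo notebook).
# If this also terminates with SIGSEGV, the issue is in the torch / torch_xla install,
# not in pytorch-lightning's Trainer.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each of the 8 processes gets its own TPU core as an XLA device.
    device = xm.xla_device()
    x = torch.randn(2, 2).to(device)
    print('process', index, 'on', str(device), (x @ x).sum().item())

# start_method='fork' is an assumption here; it matches what Lightning appears to
# pick in the notebook environment (see the traceback above).
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')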

Thanks.

@VictorCallejas VictorCallejas added the help wanted Open to be worked on label May 26, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@Borda Borda added accelerator: tpu Tensor Processing Unit bug Something isn't working labels Jun 16, 2020
@williamFalcon
Contributor

The problem here is that the learning rate finder doesn't work with anything multi-process... @SkafteNicki.
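A possible workaround in the meantime (a rough sketch against the 0.7.x API, untested on TPU): run the learning rate finder with a single-process Trainer first, then reuse the suggested rate for the multi-core TPU run:

# Rough workaround sketch (untested): find the lr single-process, reuse it on TPU.
import pytorch_lightning as pl

# 1) Run the lr finder on CPU (or a single GPU), where nothing is spawned.
cpu_trainer = pl.Trainer(max_epochs=1)
lr_finder = cpu_trainer.lr_find(model)
new_lr = lr_finder.suggestion()

# 2) Use the suggested rate in the 8-core TPU run.
model.hparams.lr = new_lr  # assumes the model reads its learning rate from hparams
tpu_trainer = pl.Trainer(num_tpu_cores=8, progress_bar_refresh_rate=10, max_epochs=10)
tpu_trainer.fit(model)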

@williamFalcon williamFalcon added this to the 0.9.0 milestone Jun 26, 2020
@williamFalcon williamFalcon added feature Is an improvement or enhancement and removed bug Something isn't working labels Jun 26, 2020
@SkafteNicki
Member

SkafteNicki commented Jun 26, 2020

This is weird. It is correct that the learning rate finder does not have multi-process support, but that is because the state of the search is destroyed when self.fit() finishes. However, here the problem seems to occur earlier, during the fit itself.

@VictorCallejas does trainer.fit() work normally for you?

@VictorCallejas
Author

VictorCallejas commented Jun 28, 2020

The error occurs when running trainer.fit(model), whether it is called through trainer.lr_find(model) or on its own.

Yes, it works on CPU and GPU.

@SkafteNicki
Member

Then I guess the problem is unrelated to the learning rate finder, since the standard trainer.fit(model) fails as well.

@Borda
Member

Borda commented Jul 27, 2020

That is a spawn issue; it should be fixed by #2632.
