
Trainer on colab TPU error: process X terminated with signal SIGSEGV #1956

Closed
VictorCallejas opened this issue May 26, 2020 · 6 comments · Fixed by #2632

@VictorCallejas

I am trying to train an image encoder with pytorch-lightning on Colab using a TPU (8 cores).

I am following this demo notebook: https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=dEeUzX_5aLrX

Library versions:
torch: 1.5.0
torchvision: 0.6.0
pytorch-lightning: 0.7.5
pytorch-xla: 1.6

I have also tried nightly and older versions, but I get the same error.

When running:

import pytorch_lightning as pl

trainer = pl.Trainer(num_tpu_cores=8, progress_bar_refresh_rate=10, max_epochs=10)

# Run lr finder on the LightningModule defined earlier in the notebook
lr_finder = trainer.lr_find(model)

Error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-25-8e434d4d50a5> in <module>()
      2 
      3 # Run lr finder
----> 4 lr_finder = trainer.lr_find(model)
      5 
      6 fig = lr_finder.plot(suggest=True)

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, min_lr, max_lr, num_training, mode, num_accumulation_steps)
    151 
    152         # Fit, lr & loss logged in callback
--> 153         self.fit(model, train_dataloader=train_dataloader)
    154 
    155         # Prompt if we stopped early

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    775 
    776             # train
--> 777             xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
    778 
    779             # load weights if not interrupted

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    106                 raise Exception(
    107                     "process %d terminated with signal %s" %
--> 108                     (error_index, name)
    109                 )
    110             else:

Exception: process 2 terminated with signal SIGSEGV

My notebook on Gist here: https://colab.research.google.com/gist/VictorCallejas/10e4c39fc25051012ae28a2a7261f814/untitled.ipynb

It seems the exception is raised because the other processes are not joining, but I have no clue why. I have set everything up exactly as in the demo notebook.
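To narrow this down, here is a bare torch_xla spawn test I plan to try next (my own rough sketch, not part of the demo notebook); if this also dies with SIGSEGV, the problem would be in the torch / torch_xla setup rather than in the Trainer:

# Minimal torch_xla multiprocessing check (hypothetical sketch, not from the demo notebook).
# If this also terminates with SIGSEGV, the issue is in the torch / torch_xla install,
# not in pytorch-lightning's Trainer.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each of the 8 processes gets its own TPU core as an XLA device.
    device = xm.xla_device()
    x = torch.randn(2, 2).to(device)
    print('process', index, 'on', str(device), (x @ x).sum().item())

# start_method='fork' is an assumption here; it matches what Lightning appears to
# pick in the notebook environment (see the traceback above).
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')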

Thanks.

@VictorCallejas VictorCallejas added the help wanted Open to be worked on label May 26, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@Borda Borda added accelerator: tpu Tensor Processing Unit bug Something isn't working labels Jun 16, 2020
@williamFalcon
Contributor

The problem here is that the learning rate finder doesn't work with anything multi-process... @SkafteNicki.
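A possible workaround in the meantime (a rough sketch against the 0.7.x API, untested on TPU): run the learning rate finder with a single-process Trainer first, then reuse the suggested rate for the multi-core TPU run:

# Rough workaround sketch (untested): find the lr single-process, reuse it on TPU.
import pytorch_lightning as pl

# 1) Run the lr finder on CPU (or a single GPU), where nothing is spawned.
cpu_trainer = pl.Trainer(max_epochs=1)
lr_finder = cpu_trainer.lr_find(model)
new_lr = lr_finder.suggestion()

# 2) Use the suggested rate in the 8-core TPU run.
model.hparams.lr = new_lr  # assumes the model reads its learning rate from hparams
tpu_trainer = pl.Trainer(num_tpu_cores=8, progress_bar_refresh_rate=10, max_epochs=10)
tpu_trainer.fit(model)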

@williamFalcon williamFalcon added this to the 0.9.0 milestone Jun 26, 2020
@williamFalcon williamFalcon added feature Is an improvement or enhancement and removed bug Something isn't working labels Jun 26, 2020
@SkafteNicki
Member

SkafteNicki commented Jun 26, 2020

This is weird. It is correct that the learning rate finder does not have multi-process support, but that is because the state of the search is destroyed when self.fit() finishes. However, here the problem seems to occur earlier, during the fit itself.

@VictorCallejas does trainer.fit() work normally for you?

@VictorCallejas
Author

VictorCallejas commented Jun 28, 2020

The error occurs when running trainer.fit(model), whether it is called through trainer.lr_find(model) or on its own.

Yes, it works on CPU and GPU.

@SkafteNicki
Member

Then I guess the problem is unrelated to the learning rate finder, since the standard trainer.fit(model) fails as well.

@Borda
Member

Borda commented Jul 27, 2020

That is a spawn issue; it should be fixed by #2632.
