handle keyboard interrupt for ddp .test() #1019

Merged
merged 26 commits into from
Mar 3, 2020

@williamFalcon (Contributor) commented Mar 3, 2020

When a keyboard interrupt stops training, the model doesn't get to save and load its state to exit the spawned processes.

In .test() we now check whether a spawn checkpoint was saved.
This solves the DDP problem but not the TPU one.

It also doesn't solve the problem of returning the weights to the user once training is done without calling .test() (only a problem on Colab).
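
A rough sketch of the idea, for illustration only. This is not the code from this PR: the file name `spawn_end.ckpt`, the helpers `train_worker`, `save_spawn_checkpoint` and `load_spawn_checkpoint_if_present`, and the use of a plain dict and JSON in place of real model weights are all assumptions made to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical checkpoint file name, chosen for this sketch only.
SPAWN_CKPT_NAME = "spawn_end.ckpt"


def save_spawn_checkpoint(state, ckpt_dir):
    """Write the worker's state where the parent process can find it."""
    path = os.path.join(ckpt_dir, SPAWN_CKPT_NAME)
    with open(path, "w") as f:
        json.dump(state, f)
    return path


def train_worker(state, ckpt_dir, steps, interrupt_at=None):
    """Stand-in for the training function run inside a spawned process."""
    try:
        for step in range(steps):
            if step == interrupt_at:
                raise KeyboardInterrupt  # simulate the user pressing Ctrl+C
            state["step"] = step + 1
    except KeyboardInterrupt:
        pass  # swallow the interrupt so the state below is still saved
    save_spawn_checkpoint(state, ckpt_dir)


def load_spawn_checkpoint_if_present(state, ckpt_dir):
    """What a test entry point could do first: restore state if a worker left any."""
    path = os.path.join(ckpt_dir, SPAWN_CKPT_NAME)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        state.update(json.load(f))
    os.remove(path)  # the temporary checkpoint is no longer needed
    return True


with tempfile.TemporaryDirectory() as ckpt_dir:
    # The worker is interrupted partway through training.
    train_worker({"step": 0}, ckpt_dir, steps=10, interrupt_at=3)

    # The parent starts from a fresh state and picks up what the worker saved.
    parent_state = {"step": 0}
    restored = load_spawn_checkpoint_if_present(parent_state, ckpt_dir)
```

The worker here runs in the same process for simplicity; with a real spawn the parent and worker do not share memory, which is why the state goes through a file.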

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 116, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 109, in _setup_replication
    xm.set_replication(str(device), [str(device)])
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 199, in set_replication
    replication_devices = xla_replication_devices(devices)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 186, in xla_replication_devices
    .format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8

@dlibenzi any thoughts?

@williamFalcon williamFalcon changed the title from "Keyboard" to "handle keyboard interrupt for ddp .test()" Mar 3, 2020
@williamFalcon williamFalcon merged commit 1789165 into master Mar 3, 2020
@Borda Borda added the bug label Mar 3, 2020
@Borda Borda deleted the keyboard branch March 3, 2020 10:18
@Borda Borda added this to the 0.7.0 milestone Mar 7, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* updated checkpoint docs (repeated 26 times)