Replaces ddp .spawn with subprocess #2029

williamFalcon · 2020-05-31T16:23:35Z

Problem

Many of the reported DDP issues on a single node are caused by the .spawn call used in the implementation:

.test() not working properly
.ckpt artifacts
inability to use num_workers > 0 in dataloader
wrong visible cuda devices
all sorts of other issues

More importantly, this removes the need to pickle things!

Approach

Here we remove .spawn and instead use subprocess to spin off independent calls of the same script, treating the original script as the master process.
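For readers following along, here is a minimal sketch of the idea, not the merged code. The helper name `launch_children` and its parameters are hypothetical; the only details taken from this PR are that the command comes from `sys.argv` and that each process gets its own `LOCAL_RANK` environment variable. Prepending `sys.executable` is an assumption about how the script is re-invoked.

```python
import os
import subprocess
import sys


def launch_children(num_procs, command=None, env=None):
    """Start one child process per extra local rank (hypothetical helper).

    The calling process is assumed to act as local rank 0, so children
    are only started for ranks 1 .. num_procs - 1.
    """
    # Assumption: re-run the same script with the same CLI arguments.
    if command is None:
        command = [sys.executable] + sys.argv
    base_env = dict(os.environ if env is None else env)

    procs = []
    for local_rank in range(1, num_procs):
        child_env = dict(base_env)
        # Each child learns its rank from the environment, so nothing
        # has to be pickled and sent across process boundaries.
        child_env["LOCAL_RANK"] = str(local_rank)
        procs.append(subprocess.Popen(list(command), env=child_env))
    return procs
```

In a sketch like this, each child would have to check `LOCAL_RANK` on startup and skip launching further processes; otherwise every child would spawn its own children.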

Fixes

(Issue authors, please verify if it does)
Fixes #2028
Fixes #1981
Fixes #1972
Fixes #1970
Fixes #1943
Fixes #1942
Fixes #1890
Fixes #1834
Fixes #1831
Fixes #1774
Fixes #1710
Fixes #1704
Fixes #1694
Fixes #1682
Fixes #1479
Fixes #1461
Fixes #1376
Fixes #981
Fixes #965
Fixes #958

Maybe fixes (authors, please check if it does)
#1761
#1714
#1550
#1542
#1846

Additional checks

dp
ddp on interactive notebook or shell
ddp on SLURM cluster (single node)
ddp on SLURM cluster (multi node)
ddp_cpu (@neggert)

I verified these cases, but @tullie @justusschock @ethanwharris @jeremyjordan, could you check on your own to make sure we have no issues?

Review

Sorry to ping multiple people here lol, I just want a good number of sanity checks since this is so critical.
@justusschock I had to comment out the signal thing... not sure what it was solving, but it was preventing ctrl+c from killing the processes. I did add a fix to the keyboard interrupt handling, which now registers ctrl+c only once and avoids leaving hanging procs.
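As a rough illustration of the ctrl+c behaviour described above (this is a sketch under assumptions, not the fix in this PR), the master can catch `KeyboardInterrupt` once and then terminate any children that are still running. The function name `run_with_cleanup` and the `work` callable are hypothetical.

```python
import subprocess
import sys


def run_with_cleanup(procs, work):
    """Run `work()` and make sure no child in `procs` is left hanging.

    Returns True if the run was interrupted with ctrl+c, else False.
    """
    interrupted = False
    try:
        work()
    except KeyboardInterrupt:
        # Handle the interrupt a single time; cleanup happens below.
        interrupted = True
    finally:
        for proc in procs:
            if proc.poll() is None:  # child is still running
                proc.terminate()
        for proc in procs:
            proc.wait()  # reap the children so none are left behind
    return interrupted
```

Here `procs` would be the list of `subprocess.Popen` objects started by the master, and `work` stands in for the training loop.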

Borda

I do not think that lowering max_epochs is a solution; it just hides the true issue :{

Borda · 2020-06-01T15:10:21Z

pytorch_lightning/trainer/distrib_data_parallel.py

+        os.environ['LOCAL_RANK'] = '0'
+
+        # pull out the commands used to run the script and resolve the abs file path
+        command = sys.argv


This assumes that it can only be called from the command line with Trainer-like arguments...
But what about just a script with params loaded from a file...
cc @PyTorchLightning/core-contributors

Can you give an example?
These tests were done with CLI flags and also Trainer args. This local rank comes from the gpus flag in the Trainer.

This also assumes that it is called as (python) path/to/some/script.py and not e.g. from a console entry point

In fact, this part of the code is not tested...
https://codecov.io/gh/PyTorchLightning/pytorch-lightning/commit/82a20296e308a67c8d9202e4cbdf92a44b90b077
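To make the concern in this thread concrete, here is a small sketch of what "pull out the command and resolve the abs file path" could look like. The helper name `resolve_command` is hypothetical, and prepending `sys.executable` is an assumption; the PR diff shown above only includes `command = sys.argv`.

```python
import os
import sys


def resolve_command(argv):
    """Build a re-launch command from an argv list (hypothetical helper)."""
    command = list(argv)
    # Assumes argv[0] is a path to a Python script, as in
    # `python path/to/some/script.py --gpus 2`.
    command[0] = os.path.abspath(command[0])
    # Assumption: run the script with the current interpreter.
    return [sys.executable] + command
```

When the program is started from a console entry point, `argv[0]` is the generated launcher rather than the user's script, so a command built this way may not be something the interpreter can run. That is the case raised above.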

pytorch_lightning/trainer/distrib_data_parallel.py

Borda · 2020-06-01T15:14:23Z

pytorch_lightning/trainer/training_loop.py

-        def _signal_kill_handler(*args):
-            return TrainerTrainLoopMixin.run_training_teardown(self)
-
-        orig_signal_handlers = {}


cc: @justusschock

As mentioned, @justusschock, I disabled this since it was a blocker.
TBH, I don't know how this was allowed on master with the exceptions being thrown every time.

I'm fine with temporarily disabling this. In our CI these exceptions did not appear

tests/base/utils.py

tests/trainer/test_dataloaders.py

s-rog · 2020-06-01T16:17:40Z

ddp on interactive notebook or shell

Does this merge make ddp work in jupyter notebooks now?

awaelchli · 2020-06-01T20:11:06Z

pytorch_lightning/trainer/trainer.py

-            weights_summary: Optional[str] = 'full',
+            weights_summary: Optional[str] = 'top',


This change is not reflected in the docs

Mind sending a fix PR?

Yes, unless it was not intentional. @williamFalcon, yes or no?

Sorry, I forgot to update the docs on this.
I propose we make it 'top' going forward.

Added it in #2021; hope it's fine to include it there.

williamFalcon added 30 commits May 31, 2020 10:23

replace ddp spawn with subprocess · 7ec6aea
replace ddp spawn with subprocess · 38f3d8e
replace ddp spawn with subprocess · 689f53b
replace ddp spawn with subprocess · 7368f4d
replace ddp spawn with subprocess · 241dd59
replace ddp spawn with subprocess · c5667c7
replace ddp spawn with subprocess · 42def65
replace ddp spawn with subprocess · 3498ca7
replace ddp spawn with subprocess · 9249cfd
replace ddp spawn with subprocess · be3420a
replace ddp spawn with subprocess · d5f4aa9
replace ddp spawn with subprocess · 9e698e0
replace ddp spawn with subprocess · 3b6fb44
replace ddp spawn with subprocess · 824323f
replace ddp spawn with subprocess · 74bc48e
replace ddp spawn with subprocess · 6ca43ff
replace ddp spawn with subprocess · df1e9bb
replace ddp spawn with subprocess · e97f6ca
replace ddp spawn with subprocess · 4e00774
replace ddp spawn with subprocess · eb701da
replace ddp spawn with subprocess · 98def93
replace ddp spawn with subprocess · fac94b0
replace ddp spawn with subprocess · ce52526
replace ddp spawn with subprocess · 1267279
replace ddp spawn with subprocess · f3675c2
replace ddp spawn with subprocess · 39fa7f5
replace ddp spawn with subprocess · cd10531
replace ddp spawn with subprocess · bdba97c
replace ddp spawn with subprocess · 3c3694e
replace ddp spawn with subprocess · 0d60400

williamFalcon added 2 commits June 1, 2020 10:02

hot fix · 4d0e8a3
hot fix · a36e451

Borda added the Important label Jun 1, 2020

williamFalcon added 4 commits June 1, 2020 10:24

hot fix · 332b0db
hot fix · f424040
hot fix · 705e805
hot fix · 07777d0

williamFalcon merged commit 82a2029 into master Jun 1, 2020

Borda deleted the ddp branch June 1, 2020 15:06

Borda reviewed Jun 1, 2020

mergify bot requested a review from a team June 1, 2020 15:19

Borda mentioned this pull request Jun 1, 2020

increase acc #2039

Merged

awaelchli reviewed Jun 1, 2020

mergify bot requested a review from a team June 1, 2020 20:11

This was referenced Jun 2, 2020

tests drop macOS py38 #2054

Merged

Update/merge multi-gpu docs #2021

Merged

reactivetype mentioned this pull request Jun 6, 2020

Return the evaluation result of Trainer.test #1694

Closed

LoicGrobol added a commit to LoicGrobol/zeldarose that referenced this pull request Jun 7, 2020

pin lightning to <0.8.0 because of Lightning-AI/pytorch-lightning#2029

55938b3

acxz mentioned this pull request Jun 10, 2020

Trainer.from_argparse_args with additional kwargs causes model to not be saved #1714

Closed

rjanovski mentioned this pull request Aug 9, 2020

When the parameter gpus of Trainer> 1, _pickle.PicklingError: #2883

Closed

PetrochukM mentioned this pull request Jan 27, 2021

Multi GPU training (ddp) gets very slow when using list of tensors in Dataset #1925

Closed

edenlightning mentioned this pull request Feb 3, 2021

benchmark subprocess vs spawn #5772

Closed

PetrochukM mentioned this pull request Feb 4, 2021

Is torch.multiprocessing.spawn compatible with DataLoader? pytorch/pytorch#51688

Closed

hocop mentioned this pull request Mar 25, 2021

validation_epoch_end behavior with DDP #1479

Closed

turian mentioned this pull request Apr 15, 2021

[Grid] You must call wandb.init() before wandb.log() #7028

Closed

JiamingSuen mentioned this pull request Aug 8, 2021

How to store dataset in shared memory for ddp? #1981

Closed
