
[WIP] Fix Trainer.test in ddp before running Trainer.fit #2790

Closed
wants to merge 198 commits

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Aug 1, 2020

What does this PR do?

Fixes #2683
Fixes #2765
Fixes #2807
Fixes #2537
(maybe also #2901)

trainer = Trainer(distributed_backend="ddp", gpus=2)
# no fit before
trainer.test(model)
# this hangs because master ports are different on rank 0 and rank 1

Core issues discovered here

Fixing this revealed a multitude of interconnected issues:

  • Selecting a random port does not work, for two main reasons: (a) determinism in the subprocesses is hard to achieve, and (b) if you run fit many times, you eventually run out of ports
  • In ddp, every subprocess except rank 0 has to init ddp with the same master port. When we run fit and then test, this has to happen twice, because between fit and test the subprocesses (except rank 0) must be killed and restarted. Unfortunately we cannot reuse the connection on rank 0, because the other, new processes would see the port as still in use, so we have to create a new connection on a new port

Solution:

  1. Find a free port (not a random one) on rank 0
  2. Broadcast the port number through torch.distributed and let the subprocesses (and rank 0 as well) init their connection with it
  3. Finally, make sure we properly destroy the connections so that we never run out of ports (rough sketch below)
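A rough sketch of steps 1 and 3 (not the exact PR code; step 2 additionally needs a channel, such as an already-initialized process group, to share the chosen port with the other ranks):

import socket

import torch.distributed as dist


def find_free_port() -> int:
    # Step 1: ask the OS for a port that is actually free instead of
    # drawing a random number and hoping it is unused.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def teardown_connection() -> None:
    # Step 3: release the process group so that repeated fit/test calls
    # do not leak ports.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()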

TODO:

  • See what's going on in pytest / CI
  • Ask someone nicely if they could try it on SLURM

@mergify mergify bot requested a review from a team August 1, 2020 13:35
@awaelchli
Contributor Author

@williamFalcon This fixes the reported issue, but I am afraid it causes other problems. In which cases do we need to select the port randomly (by force)? And why does it only do that in test and not in fit?

@awaelchli awaelchli changed the title Fix Trainer.test in ddp before running Trainer.fit [WIP] Fix Trainer.test in ddp before running Trainer.fit Aug 1, 2020
@awaelchli awaelchli added the bug (Something isn't working) and priority: 0 (High priority task) labels and removed the priority: 0 (High priority task) label Aug 1, 2020
@williamFalcon
Contributor

The port needs to be chosen randomly when running on a single node, but test and fit should use the same port.

In the common case:
fit (sets port)
test (should not set the port again)

In this case:
no fit
test (port not set, so set it)
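A tiny sketch of that behaviour (hypothetical helper; it assumes the port is carried in the MASTER_PORT environment variable):

import os
import random


def set_master_port_if_unset() -> None:
    # fit() sets the port; a later test() sees it and leaves it alone.
    # A bare test() call finds it unset and sets it itself.
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = str(random.randint(10000, 19999))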

@awaelchli
Contributor Author

I just don't understand the force part. Different ranks will get different master ports if we set force=True.

@awaelchli
Contributor Author

I didn't find a way to write a test case, since this requires ddp for testing. Any ideas?

@awaelchli awaelchli marked this pull request as ready for review August 2, 2020 21:57
@awaelchli
Contributor Author

@williamFalcon found this in the code

import os
import numpy as np

PID = os.getpid()
RNG1 = np.random.RandomState(PID)
RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)

In DDP, the PID is different for each GPU process, so this yields a different master port on each rank when force=True. It looks like this is the root of the issue.
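A quick illustration of that (with made-up PIDs standing in for the real process IDs of the ranks):

import numpy as np

# Each rank runs in its own process, so os.getpid() differs per rank and the
# seeded RNG hands every rank a different "random" master port.
for fake_pid in (1234, 1235):  # stand-ins for rank 0 and rank 1
    rng = np.random.RandomState(fake_pid)
    ports = rng.randint(10000, 19999, 1000)
    print(fake_pid, ports[0])  # the first drawn port differs between the two ranks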

@pep8speaks

pep8speaks commented Aug 3, 2020

Hello @awaelchli! Thanks for updating this PR.

Line 270:13: E265 block comment should start with '# '
Line 275:13: E265 block comment should start with '# '
Line 275:13: E303 too many blank lines (2)

Line 957:120: E501 line too long (138 > 119 characters)

Comment last updated at 2020-08-15 17:20:22 UTC

@mergify
Contributor

mergify bot commented Aug 3, 2020

This pull request is now in conflict... :(

@mergify
Contributor

mergify bot commented Aug 11, 2020

This pull request is now in conflict... :(

@asrafulashiq
Contributor

Hi, any progress on this issue?

Adrian Wälchli added 2 commits August 15, 2020 19:15
@awaelchli
Contributor Author

After countless hours of trying to fix this problem, I have come to the realization that this "ddp" mode is inherently limited to a single trainer.fit or trainer.test call. Because the main process launches the program itself N-1 times for a single trainer.fit, there is a conflict when we make subsequent calls

trainer.fit()
...
trainer.fit()
trainer.test()

(for example in a loop), or when we initialize Trainer multiple times in a script. In the example above, the first fit launches the same script again in subprocesses. These train in parallel until fitting is completed and then get killed. The main process reaches the second fit and once again launches the same script in subprocesses. Here is the problem: the subprocess now executes the first fit again, because it simply executes the script from the beginning. There is no way to reroute it to the right place, as no state is maintained across processes.
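Roughly what the launching looks like under the hood (simplified sketch, not the exact Lightning code), which is why every freshly spawned worker starts from the top of the script:

import os
import subprocess
import sys


def spawn_ddp_workers(num_gpus: int) -> None:
    # Rank 0 is the current process; every other rank is the same script
    # launched again from the very beginning.
    for local_rank in range(1, num_gpus):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)
        # The child has no way of knowing that it should skip the first fit()
        # and jump straight to the second fit()/test() call.
        subprocess.Popen([sys.executable] + sys.argv, env=env)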

I will close this PR and propose to add a note or even warning to the docs that ddp backend has this limitation.

@awaelchli awaelchli closed this Aug 15, 2020
@williamFalcon
Contributor

we can maintain state using environment variables
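One hypothetical way to do that (illustrative only; PL_DDP_CALL_INDEX is a made-up name, not an actual Lightning variable): let the main process count its fit/test calls in an environment variable that spawned workers inherit.

import os


def next_call_index() -> int:
    # The main process bumps the counter before each fit()/test(); workers
    # spawned afterwards inherit the environment and can tell which call
    # they belong to.
    idx = int(os.environ.get("PL_DDP_CALL_INDEX", "0"))
    os.environ["PL_DDP_CALL_INDEX"] = str(idx + 1)
    return idx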

@awaelchli awaelchli deleted the bugfix/test-before-fit branch August 16, 2020 16:04
@Borda Borda modified the milestones: 1.0.0, 0.9.0 Aug 20, 2020
Labels
bug Something isn't working
6 participants