
[WIP] Fix Trainer.test in ddp before running Trainer.fit #2790

Closed
wants to merge 198 commits

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Aug 1, 2020

What does this PR do?

Fixes #2683
Fixes #2765
Fixes #2807
Fixes #2537
(maybe also #2901)

trainer = Trainer(distributed_backend="ddp", gpus=2)
# no fit before
trainer.test(model)
# this hangs because master ports are different on rank 0 and rank 1

Core issues discovered here

Fixing this revealed a multitude of interconnected issues:

  • Selecting a random port does not work, for two main reasons: (a) determinism in the subprocesses is hard to achieve, and (b) if you run fit many times, you eventually run out of ports
  • In ddp, every subprocess except rank 0 has to init ddp with the same master port. When we run fit and then test, this has to happen twice, because between fit and test the subprocesses (except rank 0) must be killed and restarted. Unfortunately we cannot reuse the connection on rank 0, because the other, new processes would see the port as still in use, so we have to create a new connection on a new port

Solution:

  1. Find a free port (not a random one) on rank 0
  2. Broadcast the port number through torch.distributed and let the subprocesses (and rank 0 as well) init their connection with it
  3. Finally, make sure we properly destroy the connections so that we never run out of ports (rough sketch below)
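A rough sketch of steps 1 and 3 (not the exact PR code; step 2 additionally needs a channel, such as an already-initialized process group, to share the chosen port with the other ranks):

import socket

import torch.distributed as dist


def find_free_port() -> int:
    # Step 1: ask the OS for a port that is actually free instead of
    # drawing a random number and hoping it is unused.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def teardown_connection() -> None:
    # Step 3: release the process group so that repeated fit/test calls
    # do not leak ports.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()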

TODO:

  • See what's going on in pytest / CI
  • Ask someone nicely if they could try it on SLURM

@mergify mergify bot requested a review from a team August 1, 2020 13:35
@awaelchli
Contributor Author

@williamFalcon This fixes the reported issue, but I am afraid it causes other problems. In which cases do we need to select the port randomly (by force)? And why does it only do that in test and not in fit?

@awaelchli awaelchli changed the title Fix Trainer.test in ddp before running Trainer.fit [WIP] Fix Trainer.test in ddp before running Trainer.fit Aug 1, 2020
@awaelchli awaelchli added the bug (Something isn't working) and priority: 0 (High priority task) labels and removed the priority: 0 (High priority task) label Aug 1, 2020
@williamFalcon
Contributor

The port needs to be chosen randomly when running on a single node, but test and fit should use the same port.

In the common case:
fit (sets port)
test (should not set the port again)

In this case:
no fit
test (port not set, so set it)
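A tiny sketch of that behaviour (hypothetical helper; it assumes the port is carried in the MASTER_PORT environment variable):

import os
import random


def set_master_port_if_unset() -> None:
    # fit() sets the port; a later test() sees it and leaves it alone.
    # A bare test() call finds it unset and sets it itself.
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = str(random.randint(10000, 19999))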

@awaelchli
Contributor Author

I just don't understand the force part. Different ranks will get different master ports if we set force=True.

@awaelchli
Contributor Author

I didn't find a way to write a test case, since this requires ddp for testing. Any ideas?

@awaelchli awaelchli marked this pull request as ready for review August 2, 2020 21:57
@awaelchli
Contributor Author

@williamFalcon found this in the code

import os
import numpy as np

PID = os.getpid()
RNG1 = np.random.RandomState(PID)
RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)

In DDP, the PID is different for each GPU process, so this yields a different master port on each rank when force=True. It looks like this is the root of the issue.
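A quick illustration of that (with made-up PIDs standing in for the real process IDs of the ranks):

import numpy as np

# Each rank runs in its own process, so os.getpid() differs per rank and the
# seeded RNG hands every rank a different "random" master port.
for fake_pid in (1234, 1235):  # stand-ins for rank 0 and rank 1
    rng = np.random.RandomState(fake_pid)
    ports = rng.randint(10000, 19999, 1000)
    print(fake_pid, ports[0])  # the first drawn port differs between the two ranks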

@pep8speaks

pep8speaks commented Aug 3, 2020

Hello @awaelchli! Thanks for updating this PR.

Line 270:13: E265 block comment should start with '# '
Line 275:13: E265 block comment should start with '# '
Line 275:13: E303 too many blank lines (2)

Line 957:120: E501 line too long (138 > 119 characters)

Comment last updated at 2020-08-15 17:20:22 UTC

@mergify
Contributor

mergify bot commented Aug 3, 2020

This pull request is now in conflict... :(

@mergify
Contributor

mergify bot commented Aug 11, 2020

This pull request is now in conflict... :(

@asrafulashiq
Contributor

Hi, any progress on this issue?

Adrian Wälchli added 2 commits August 15, 2020 19:15
@awaelchli
Contributor Author

After countless hours of trying to fix this problem, I have come to the realization that this "ddp" mode is inherently limited to a single trainer.fit or trainer.test call. Because the main process launches the program itself N-1 times for a single trainer.fit, there is a conflict when we make subsequent calls

trainer.fit()
...
trainer.fit()
trainer.test()

(for example in a loop), or when we initialize Trainer multiple times in a script. In the example above, the first fit launches the same script again in subprocesses. These train in parallel until fitting is completed and then get killed. The main process reaches the second fit and once again launches the same script in subprocesses. Here is the problem: the subprocess now executes the first fit again, because it simply executes the script from the beginning. There is no way to reroute it to the right place, as no state is maintained across processes.
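Roughly what the launching looks like under the hood (simplified sketch, not the exact Lightning code), which is why every freshly spawned worker starts from the top of the script:

import os
import subprocess
import sys


def spawn_ddp_workers(num_gpus: int) -> None:
    # Rank 0 is the current process; every other rank is the same script
    # launched again from the very beginning.
    for local_rank in range(1, num_gpus):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)
        # The child has no way of knowing that it should skip the first fit()
        # and jump straight to the second fit()/test() call.
        subprocess.Popen([sys.executable] + sys.argv, env=env)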

I will close this PR and propose to add a note or even warning to the docs that ddp backend has this limitation.

@awaelchli awaelchli closed this Aug 15, 2020
@williamFalcon
Contributor

we can maintain state using environment variables
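One hypothetical way to do that (illustrative only; PL_DDP_CALL_INDEX is a made-up name, not an actual Lightning variable): let the main process count its fit/test calls in an environment variable that spawned workers inherit.

import os


def next_call_index() -> int:
    # The main process bumps the counter before each fit()/test(); workers
    # spawned afterwards inherit the environment and can tell which call
    # they belong to.
    idx = int(os.environ.get("PL_DDP_CALL_INDEX", "0"))
    os.environ["PL_DDP_CALL_INDEX"] = str(idx + 1)
    return idx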

@awaelchli awaelchli deleted the bugfix/test-before-fit branch August 16, 2020 16:04
@Borda Borda modified the milestones: 1.0.0, 0.9.0 Aug 20, 2020
Labels
bug Something isn't working
6 participants