[WIP] Fix Trainer.test in ddp before running Trainer.fit #2790
Conversation
@williamFalcon This fixes the reported issue, but I am afraid it causes other problems. In which cases do we need to select the port randomly (by force)? And why is it only done in test and not in fit?
The port needs to be randomly chosen when running on a single node, but it should use the same port for test and fit in the common case, as in this one.
I just don't understand the force part. Different ranks will get a different master port if we set force=True.
I didn't find a way to write a test case, since this requires ddp for testing. Any ideas?
@williamFalcon I found this in the code:

```python
PID = os.getpid()
RNG1 = np.random.RandomState(PID)
RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)
```

In DDP, the PID is different for each GPU, so this yields different master ports when force=True. It looks like this is the root of the issue.
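To illustrate, here is a minimal sketch (not the Lightning code path itself, and the PIDs are made up) of why seeding the RNG with the process PID gives every spawned rank its own port sequence:

```python
import numpy as np

# Hypothetical PIDs of two spawned DDP ranks; in reality os.getpid() differs per process.
for fake_pid in (1234, 5678):
    rng = np.random.RandomState(fake_pid)
    random_ports = rng.randint(10000, 19999, 1000)
    # Different seeds yield different port sequences, so each rank would end up
    # with a different MASTER_PORT when the port is re-drawn by force.
    print(fake_pid, random_ports[:3])
```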
Hello @awaelchli! Thanks for updating this PR.
Comment last updated at 2020-08-15 17:20:22 UTC
This pull request is now in conflict... :(
Hi, any progress on this issue?
After countless hours of trying to fix this problem, I have come to the realization that this "ddp" mode is inherently limited to a single trainer.fit or trainer.test call. Because the main process launches the program itself N-1 times for a single trainer.fit, there is a conflict when we make subsequent calls to trainer.fit() (for example in a loop), or when we initialize Trainer multiple times in a script.

In the example above, the first fit launches the same script again in subprocesses. These train in parallel until fitting is completed and then they get killed. The main process then reaches the second fit and launches the same script in subprocesses once more. Here is the problem: the subprocess executes the first fit again, because it simply runs the script from the beginning. There is no way to reroute it to the right place, as no state is maintained across processes.

I will close this PR and propose adding a note, or even a warning, to the docs that the ddp backend has this limitation.
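For illustration, this is roughly the kind of script that hits the limitation. It is only a sketch: ToyModel is a throwaway placeholder, and the exact Trainer/LightningModule API (e.g. distributed_backend, gpus) varies across versions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):  # minimal placeholder module for the sketch
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(data, batch_size=8)

if __name__ == "__main__":
    # With the ddp backend, each fit() below re-launches this very script in
    # N-1 subprocesses, and every child starts over at the first fit() call,
    # so the second iteration never runs as intended.
    for fold in range(2):
        trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=1)
        trainer.fit(ToyModel())
```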
We can maintain state using environment variables.
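A rough sketch of that idea (PL_CALL_INDEX and launch_child_for_call are made-up names for illustration, not anything Lightning defines): the parent records which fit/test call it is on before spawning, and the child reads it back so it can skip ahead instead of re-running the script from the first call.

```python
import os
import subprocess
import sys

# Which fit/test call this process should resume at; 0 means "the first one".
CALL_INDEX = int(os.environ.get("PL_CALL_INDEX", "0"))

def launch_child_for_call(index: int) -> subprocess.Popen:
    """Re-launch this script for another rank, telling it which call to resume at."""
    env = dict(os.environ, PL_CALL_INDEX=str(index))
    return subprocess.Popen([sys.executable] + sys.argv, env=env)
```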
What does this PR do?
Fixes #2683
Fixes #2765
Fixes #2807
Fixes #2537
(maybe also #2901)
Core issues discovered here
Fixing this revealed a multitude of interconnected issues:
Solution:
TODO: