'bad value(s) in fds_to_keep' error in DDP mode #1550
Comments
Attached: lightning_template.txt (rename .txt to .py to verify the bug).
Check #538. I don't know exactly what is going wrong in your case, but the relevant solution is copied here for your convenience (from #538): This is NOT a ptl bug. It is the result of a naive assignment of a parameter to another attribute:
self.class_p is known to pytorch as a parameter, and so is copied to the correct gpu. The reference to it in self.class_p_t is not known to pytorch as a parameter, so that reference is not updated when the model is copied. To fix this, do a deep copy instead of the naive assignment. self.class_p_t is still not moved to the gpu, but it is now within the process space of each ddp model.
Hope this helps.
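For illustration, a minimal sketch of the pattern described above (the module and the class_p / class_p_t names follow the #538 comment; the tensor size is made up):

import copy
import torch
import torch.nn as nn

class ExampleModule(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        # Registered as a parameter, so pytorch/ddp moves it to the correct gpu.
        self.class_p = nn.Parameter(torch.ones(10))

        # Naive assignment: only a second reference to the same object. Pytorch
        # does not track it separately, and the reference is not updated when
        # the model is copied to each gpu.
        # self.class_p_t = self.class_p

        # Fix suggested in #538: a deep copy instead of the naive assignment, so
        # each ddp process holds its own copy inside its own process space.
        self.class_p_t = copy.deepcopy(self.class_p)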
Thanks for your reply. spectral_norm is a standard module in pytorch, and I can run it in a pure pytorch implementation, but if I use pytorch_lightning it reports the error above, so I think this may be a bug in pytorch_lightning.
A very frustrating situation for you, I am sure. I am a little suspicious that this is actually a problem with ...
Out of curiosity, what happens if you don't assign the linear layer to the module before calling spectral norm (i.e. wrap it directly, as in the sketch below)? I ran into a similar issue, and removing the assignment or doing a manual clone should fix it. It's possibly not a pytorch lightning bug: lightning uses torch's mp.spawn (see the traceback below), and the error is raised from multiprocessing while it spawns the ddp processes.
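A hedged sketch of the two patterns being contrasted (the module and layer sizes are made up for illustration):

import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNExample(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        # Pattern reported to trigger the error under ddp spawn: the plain layer
        # is assigned to the module first, then wrapped and re-assigned.
        # self.fc = nn.Linear(128, 10)
        # self.fc = spectral_norm(self.fc)

        # Suggested workaround: wrap the layer before it is ever assigned to the
        # module, so only the spectral-norm-wrapped module is registered.
        self.fc = spectral_norm(nn.Linear(128, 10))

    def forward(self, x):
        return self.fc(x)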
🐛 Bug
To Reproduce
If I put spectral_norm in the model, it outputs the error message "bad value(s) in fds_to_keep".
Even the example provided by pytorch-lightning has this issue.
Steps to reproduce the behavior:
Change the example model lightning_template.py so that one of its layers uses spectral_norm (a sketch of the kind of change is below).
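A minimal sketch of the kind of change meant here, assuming a linear layer of the template model is wrapped with spectral_norm (the class, layer names, and sizes are illustrative, not the template's actual ones):

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class TemplateLikeModel(nn.Module):  # hypothetical stand-in for the template model
    def __init__(self, in_features=28 * 28, hidden_dim=1000, out_features=10):
        super().__init__()
        # The reported change: wrap one of the linear layers with spectral_norm;
        # per the report, this is what triggers the fds_to_keep error under ddp.
        self.c_d1 = spectral_norm(nn.Linear(in_features, hidden_dim))
        self.c_d2 = nn.Linear(hidden_dim, out_features)

    def forward(self, x):
        return self.c_d2(torch.relu(self.c_d1(x)))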
Run the example with
python3 gpu_template.py --gpus 2 --distributed_backend ddp
and we get the following error message:
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 692, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/usr/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep
Environment