Timed out initializing process group in store based.... #15593
Replies: 3 comments 3 replies
-
Hey EvanZ, I hit the same error in single-node multi-GPU training (DDP). In DDP, the trainer won't start until all workers have joined the process group. Sometimes process group initialization takes a long time and exceeds the default timeout of 30 minutes. The solution is to increase the timeout:

    import datetime
    import pytorch_lightning as pl

    trainer = pl.Trainer(
        # set the timeout to 1 hour; the default is 1800 seconds (30 minutes)
        strategy=pl.strategies.DDPStrategy(timeout=datetime.timedelta(seconds=3600)),
        ...
    )
    trainer.fit(...)

Please let me know if this fixes your issue.
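If you would rather not hard-code the value, here is a minimal sketch of reading the timeout from the environment so it can be raised per run. The helper name `ddp_timeout` and the environment variable `DDP_TIMEOUT_SECONDS` are hypothetical, made up for illustration; they are not part of Lightning or PyTorch.

```python
import datetime
import os

# 30 minutes, the default timeout mentioned above
DEFAULT_TIMEOUT_SECONDS = 1800


def ddp_timeout(env_var="DDP_TIMEOUT_SECONDS"):
    """Return the process-group timeout as a timedelta.

    Reads the hypothetical environment variable `env_var` and falls back
    to the 30-minute default when it is unset or empty.
    """
    raw = os.environ.get(env_var, "").strip()
    seconds = int(raw) if raw else DEFAULT_TIMEOUT_SECONDS
    if seconds <= 0:
        raise ValueError(f"{env_var} must be a positive number of seconds")
    return datetime.timedelta(seconds=seconds)
```

The result can then be passed as `timeout=ddp_timeout()` to `pl.strategies.DDPStrategy`, in place of the fixed `datetime.timedelta(seconds=3600)`.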
-
I hit this problem in a multi-node, multi-GPU setup. Does anyone have a solution?
-
Does anyone know what this error means?