Did you set a different NODE_RANK on each node?
I currently run multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any environment variables (they may be set by the cluster system).
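If you are launching the script by hand on each node rather than through a scheduler, a quick sanity check along these lines might help rule out a rank mix-up. This is only a sketch: it assumes the generic environment-variable setup (MASTER_ADDR, MASTER_PORT, NODE_RANK), and `check_node_env` is a hypothetical helper, not part of Lightning. It cannot tell whether two nodes share the same rank, so compare the printed values across nodes yourself.

```python
import os

# Variables assumed to be needed for a manually launched multi-node run.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")


def check_node_env(env, num_nodes):
    """Hypothetical helper: return a list of problems found in `env`."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    rank = env.get("NODE_RANK")
    if rank:
        try:
            value = int(rank)
        except ValueError:
            problems.append(f"NODE_RANK={rank!r} is not an integer")
        else:
            # Ranks are expected to run from 0 to num_nodes - 1.
            if not 0 <= value < num_nodes:
                problems.append(
                    f"NODE_RANK={value} is outside the range 0..{num_nodes - 1}"
                )
    return problems


if __name__ == "__main__":
    # Run this on every node before starting training (2 nodes assumed here).
    print("NODE_RANK =", os.environ.get("NODE_RANK"))
    for problem in check_node_env(os.environ, num_nodes=2):
        print("problem:", problem)
```

Run it on every node with the same environment you use for training; each node should report a distinct NODE_RANK and the same MASTER_ADDR and MASTER_PORT.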
Bug description
I am trying to run a very simple training script for 2 nodes and I always get this error:
Output:
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response