The WORLD_SIZE environment variable for elastic policy is not getting set correctly #1947
Comments
Can we close this? @kuizhiqing
Yes, I think so. @deepanker13 Can you confirm with the fix?
@kuizhiqing I am getting the same error: the world size is set to 2, not 4. I deleted and redeployed the training operator on my Kubernetes cluster.
If the [...] I suggest verifying it in either of the following ways. First of all, pull the latest code with my fix, then [...]
Finally, [...] restart your job.
Yes, it is working as expected. Thanks @kuizhiqing, we can close this issue.
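For anyone who wants to repeat the confirmation above, here is a minimal sketch of a check that can be dropped into a training script. It only reads the WORLD_SIZE environment variable named in the issue title. The helper name check_world_size and its env parameter are hypothetical and not part of the training operator or PyTorch.

```python
import os


def check_world_size(expected, env=None):
    """Compare the WORLD_SIZE environment variable with an expected value.

    Hypothetical helper for illustration only. `env` defaults to os.environ
    and can be replaced with a plain dict for testing.
    Returns a (ok, message) tuple.
    """
    env = os.environ if env is None else env
    raw = env.get("WORLD_SIZE")
    if raw is None:
        return False, "WORLD_SIZE is not set"
    actual = int(raw)
    if actual != expected:
        return False, f"WORLD_SIZE is {actual}, expected {expected}"
    return True, f"WORLD_SIZE is {actual}"


if __name__ == "__main__":
    # 4 is the value expected in this thread; adjust it for your own job.
    ok, message = check_world_size(4)
    print(message)
```

Before the fix, a check like this would have reported a value of 2 where 4 was expected.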
According to this PR, the nProcPerNode field in the elastic policy is deprecated, and it is suggested to use nprocPerNode in the spec instead.
However, after trying this, the world size in elastic mode is set to the number of replicas, not to replicas * nprocs_per_node.
The following YAML does not work; it works only when the commented-out line is uncommented.
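To make the expected arithmetic concrete, here is a small sketch. The function name expected_world_size is hypothetical, and the values 2 and 2 are an assumption inferred from the "2 and not 4" report in the comments, not taken from the job spec.

```python
def expected_world_size(replicas, nproc_per_node):
    """World size the issue expects: one rank per process on every replica.

    Hypothetical helper for illustration; not the operator's actual code.
    """
    if replicas < 1 or nproc_per_node < 1:
        raise ValueError("replicas and nproc_per_node must be positive")
    return replicas * nproc_per_node


# Assumed values: 2 replicas with 2 processes per node.
replicas, nproc_per_node = 2, 2
print("expected:", expected_world_size(replicas, nproc_per_node))
print("reported (number of replicas only):", replicas)
```

With these assumed values the expected world size is 4, while the behaviour described above yields 2.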