PytorchJob DDP training will stop if I delete a worker pod #1478
Comments
Are you sure that your training script is fault tolerant?
No. The training code is almost the same as the official PyTorch mnist demo. By the way, I notice that …
I do not think the official PyTorch demo supports fault tolerance. You may need to dive into the demo itself.
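For context on what "fault tolerant" would mean for a script like the mnist demo, here is a minimal sketch of a DDP training loop that checkpoints and resumes, so a restarted worker does not lose its progress. This is not the actual demo; `CKPT_PATH` and the function signature are illustrative assumptions.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative path; the official mnist demo does not checkpoint like this.
CKPT_PATH = "/tmp/mnist_ckpt.pt"

def train(model, optimizer, loader, epochs):
    # Standard DDP setup, driven by the env vars the operator injects
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
    dist.init_process_group(backend="gloo")
    model = DDP(model)

    # Resume from the latest checkpoint if one exists, so a restarted
    # worker does not start over from epoch 0.
    start_epoch = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.module.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, epochs):
        for data, target in loader:
            optimizer.zero_grad()
            loss = F.nll_loss(model(data), target)
            loss.backward()
            optimizer.step()

        # Only rank 0 writes the checkpoint; the barrier keeps ranks in step.
        if dist.get_rank() == 0:
            torch.save(
                {"model": model.module.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "epoch": epoch},
                CKPT_PATH,
            )
        dist.barrier()
```

Note that checkpointing alone does not make a restarted pod rejoin an already-running process group; that re-rendezvous is what elasticPolicy / torch elastic addresses, as the next reply points out.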
1.4.0 is already released, but elasticPolicy was only introduced on the master branch, so you cannot use it in 1.4.0.
OK. Many thanks.
Hi, everyone.
I want to test the fault tolerance of PytorchJob.
I started a PytorchJob with 1 master and 3 workers.
It trains fine.
Then I deleted a worker.
As I set `restartPolicy: OnFailure`, the pod restarts quickly with the same name `mnist-ddp-worker-1`. But sadly, I can't see this newborn worker join the DDP training.
Thanks.
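A hedged sketch of why the recreated worker cannot simply rejoin: in a non-elastic script like the mnist demo, the rendezvous in `init_process_group` happens exactly once at startup with a fixed world size. The surviving ranks are already past that point, so the new `mnist-ddp-worker-1` process typically blocks (or fails) during setup while the collectives on the other ranks stall, which matches the behaviour described above. The names below are illustrative:

```python
import torch.distributed as dist

def setup():
    # One-shot rendezvous using the env vars the operator sets
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The world size is
    # fixed for the lifetime of the job.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined")

# When worker-1 is deleted and recreated, its new process calls setup()
# again, but the other ranks are already inside the training loop and
# never re-enter this rendezvous, so the new process has nothing to join
# and the job stalls instead of resuming with all four ranks.
```

Elastic training (the elasticPolicy mentioned in the comments, which wraps torch elastic / torchrun) exists precisely to re-run this rendezvous when membership changes.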