
PytorchJob DDP training will stop if I delete a worker pod #1478

Closed
Shuai-Xie opened this issue Nov 22, 2021 · 4 comments

@Shuai-Xie

Hi, everyone.

I want to test the fault tolerance of PytorchJob.

I started a PytorchJob with 1 master and 3 workers.

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   10.10.10.1   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   10.10.10.2   11.71.1.161
mnist-ddp-worker-1   1/1     Running   0          2m55s   10.10.10.3   11.71.1.161
mnist-ddp-worker-2   1/1     Running   0          2m55s   10.10.10.4   11.71.1.162

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Since I set restartPolicy: OnFailure, the pod restarts quickly with the same name, mnist-ddp-worker-1.
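
For reference, a minimal sketch of the kind of manifest this setup implies (1 Master, 3 Workers, restartPolicy: OnFailure). The job name, image, and command are assumptions; only the replica counts and the restart policy come from this issue:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp                # assumed name, matching the pod prefix above
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch      # default container name expected by the operator
              image: my-registry/mnist-ddp:latest       # hypothetical image
              command: ["python", "/workspace/mnist.py"]  # hypothetical entrypoint
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mnist-ddp:latest       # hypothetical image
              command: ["python", "/workspace/mnist.py"]  # hypothetical entrypoint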

But sadly, the restarted worker never rejoins the DDP training.

Thanks.

@gaocegege
Member

Are you sure that your training script is fault tolerant?

@Shuai-Xie
Author

No. The training code is almost the same as the official PyTorch mnist demo.
I just want to test if the restarted worker can join the DDP training automatically.

By the way, I noticed that the elasticPolicy feature has been integrated into PytorchJob.
Does that mean we can use the elastic feature in Kubeflow 1.4.0 with training-operator 1.3.0 now?

@gaocegege
Member

> I just want to test if the restarted worker can join the DDP training automatically.

I do not think the official PyTorch demo supports fault tolerance. Maybe you need to dive into the demo.

> Does that mean we can use the elastic feature in Kubeflow 1.4.0 with training-operator 1.3.0 now?

1.4.0 has already been released, but elasticPolicy was introduced on the master branch, so you cannot use it in 1.4.0.
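
For context, a hedged sketch of roughly what an elasticPolicy block looks like in the master-branch API (the field names reflect my understanding of the new schema and the values are assumptions; check the training-operator examples for the exact spec). Note that elastic training also requires the script itself to be elastic-aware, i.e. launched via torch.distributed.run and able to resume from checkpoints, which the plain mnist demo does not do:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-elastic            # hypothetical name
spec:
  elasticPolicy:
    rdzvBackend: c10d            # rendezvous backend used by torch.distributed.run
    minReplicas: 1               # assumed lower bound on workers
    maxReplicas: 3               # assumed upper bound on workers
    maxRestarts: 10              # assumed restart budget
  pytorchReplicaSpecs:
    Worker:                      # elastic jobs typically define only Worker replicas
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mnist-elastic:latest   # hypothetical image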

@Shuai-Xie
Author

OK. Many thanks.
